mlx.MlxEngine
Runs batched MLX inference on a dedicated worker thread.
Usage
mlx.MlxEngine()The constructor starts a daemon worker thread that loads the model from fused_dir, applies MLXPatches, and precomputes a prompt cache for prefix_messages. Await ensure_loaded to surface load failures before generating.
Parameters
fused_dir: Path-
Directory of the MLX model to load, such as the path returned by AdapterFuser.ensure_fused.
logits_processor_factory: Callable[[Any], Callable[…, Any]]-
Builds the logits processor from the loaded tokenizer; the processor is applied to every generation call.
prefix_messages: list[dict[str, str]]-
Chat messages shared by every conversation; their prompt cache is computed once and reused across batches. Each conversation passed to generate must start with these messages.
batch_size: int-
Number of conversations generated per batch.
worker_name: str = "mlx"- Name of the worker thread.
Raises
RuntimeError- When not running on macOS with Apple Silicon.
Methods
| Name | Description |
|---|---|
| close() | Signal the worker thread to exit after completing already-queued jobs. |
| ensure_loaded() | Wait for the worker thread to finish loading, re-raising the error if loading failed. |
| generate() | Generate one single-token completion per conversation. |
| peak_memory_gb() | Return the process’s peak resident set size in GiB. |
| submit() |
Run fn(*args) on the worker thread and await its result.
|
close()
Signal the worker thread to exit after completing already-queued jobs.
Usage
close()ensure_loaded()
Wait for the worker thread to finish loading, re-raising the error if loading failed.
Usage
ensure_loaded()generate()
Generate one single-token completion per conversation.
Usage
generate(message_lists, on_progress)Conversations are processed in chunks of batch_size, ordered by the length of each conversation’s final message so similar-length prompts batch together. Every conversation must start with the engine’s prefix_messages; the shared prefix is served from the precomputed prompt cache.
Parameters
message_lists: list[list[dict[str, str]]]-
Conversations, each a list of chat messages with
roleandcontentkeys. on_progress: Callable[[int], None]- Called after each chunk with the number of conversations completed in that chunk.
Returns
list[str]-
Generated texts, in the same order as
message_lists.
peak_memory_gb()
Return the process’s peak resident set size in GiB.
Usage
peak_memory_gb()submit()
Run fn(*args) on the worker thread and await its result.
Usage
submit(fn, *args)Parameters
fn: Callable[…, R]-
Callable to execute on the worker thread.
*args: Any-
Positional arguments passed to
fn.
Returns
R-
The value returned by
fn; an exception raised byfnpropagates to the awaiter.