# CLI Reference
## infer-check
infer-check: correctness and reliability testing for LLM inference engines.
Usage:
Options:
| Name | Type | Description | Default |
|---|---|---|---|
| `--version` | boolean | Show the version and exit. | `False` |
| `--max-tokens` | integer | Default max tokens for generation (applies to all prompts unless they specify their own). | `1024` |
| `--num-prompts` | integer range (1 and above) | Limit the number of prompts to use from a suite; if omitted, all prompts are used. | `None` |
| `--help` | boolean | Show this message and exit. | `False` |
## compare
Compare two quantizations of the same model.
MODEL_A and MODEL_B are model specs -- HuggingFace repos, Ollama tags, or local GGUF paths. The backend is auto-detected from the identifier, or you can use an explicit prefix (`ollama:`, `mlx:`, `gguf:`, `vllm-mlx:`).
Examples:

```
# Two MLX quants
infer-check compare \
  mlx-community/Llama-3.1-8B-Instruct-4bit \
  mlx-community/Llama-3.1-8B-Instruct-8bit

# MLX native vs Ollama GGUF
infer-check compare \
  mlx-community/Llama-3.1-8B-Instruct-4bit \
  ollama:llama3.1:8b-instruct-q4_K_M

# Bartowski GGUF vs Unsloth GGUF (both via Ollama)
infer-check compare \
  ollama:bartowski/Llama-3.1-8B-Instruct-GGUF \
  ollama:unsloth/Llama-3.1-8B-Instruct-GGUF
```
Usage:
Options:
| Name | Type | Description | Default |
|---|---|---|---|
| `--prompts` | text | Bundled suite name (e.g. 'reasoning') or path to a .jsonl file. | `adversarial-numerics` |
| `--output` | path | Output directory. | `./results/compare/` |
| `--base-url` | text | Base URL override for HTTP backends. Applied to both models unless they resolve to mlx-lm. | `None` |
| `--label-a` | text | Custom label for model A (defaults to auto-derived short name). | `None` |
| `--label-b` | text | Custom label for model B (defaults to auto-derived short name). | `None` |
| `--report` / `--no-report` | boolean | Generate an HTML comparison report after the run. | `True` |
| `--max-tokens` | integer range (1 and above) | Override default max tokens for generation. | `None` |
| `--num-prompts` | integer range (1 and above) | Limit number of prompts to use. | `None` |
| `--disable-thinking` / `--enable-thinking` | boolean | Suppress reasoning/thinking mode on models that support it (Qwen3, DeepSeek-R1, Ollama think, vLLM chat_template_kwargs, OpenAI/OpenRouter reasoning). Models without a thinking mode are unaffected. Defaults to disabled so outputs are directly comparable across runs; pass --enable-thinking to restore it. | `True` |
| `--chat` / `--no-chat` | boolean | Use /v1/chat/completions for HTTP backends (applies chat template server-side). Pass --no-chat to use raw /v1/completions instead. Ignored for mlx-lm. | `True` |
| `--help` | boolean | Show this message and exit. | `False` |
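Every command's `--prompts` option also accepts a path to a custom `.jsonl` suite. A minimal sketch of what such a file could look like -- the `id`, `prompt`, and per-prompt `max_tokens` field names here are assumptions, not a documented schema, so inspect a bundled suite for the real fields:

```shell
# Hypothetical custom suite: one JSON object per line.
# Field names are assumed, not confirmed by this reference.
cat > my-suite.jsonl <<'EOF'
{"id": "arith-1", "prompt": "What is 17 * 23? Reply with the number only."}
{"id": "arith-2", "prompt": "Sum the list [0.1, 0.2, 0.3] and show your work.", "max_tokens": 256}
EOF
```

It would then be passed as `--prompts ./my-suite.jsonl`.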
## determinism
Test whether a backend produces identical outputs across repeated runs at temperature=0.
Usage:
Options:
| Name | Type | Description | Default |
|---|---|---|---|
| `--model` | text | Model ID or HuggingFace path. | _required_ |
| `--backend` | text | Backend type (auto-detected if omitted). | `None` |
| `--prompts` | text | Bundled suite name (e.g. 'reasoning') or path to a .jsonl file. | _required_ |
| `--output` | path | Output directory. | `./results/determinism/` |
| `--runs` | integer | Number of runs per prompt. | `100` |
| `--base-url` | text | Base URL for HTTP backends. | `None` |
| `--max-tokens` | integer range (1 and above) | Override default max tokens for generation. | `None` |
| `--num-prompts` | integer range (1 and above) | Limit number of prompts to use. | `None` |
| `--disable-thinking` / `--enable-thinking` | boolean | Suppress reasoning/thinking mode on models that support it (Qwen3, DeepSeek-R1, Ollama think, vLLM chat_template_kwargs, OpenAI/OpenRouter reasoning). Models without a thinking mode are unaffected. Defaults to disabled so outputs are directly comparable across runs; pass --enable-thinking to restore it. | `True` |
| `--chat` / `--no-chat` | boolean | Use /v1/chat/completions for HTTP backends (applies chat template server-side). Pass --no-chat to use raw /v1/completions instead. Ignored for mlx-lm. | `True` |
| `--help` | boolean | Show this message and exit. | `False` |
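Conceptually the check is: same prompt, greedy decoding, repeated runs, byte-identical output. A toy sketch of that comparison, with a hypothetical `fake_generate` function standing in for a real inference call:

```shell
# Stand-in for an inference call at temperature=0; a real run would
# hit the backend instead of returning a fixed string.
fake_generate() { printf 'The answer is 42.\n'; }

out1=$(fake_generate)
out2=$(fake_generate)
if [ "$out1" = "$out2" ]; then
  echo "identical"
else
  echo "divergent"
fi
```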
## diff
Compare outputs across different backends for the same model and prompts.
Usage:
Options:
| Name | Type | Description | Default |
|---|---|---|---|
| `--model` | text | Model ID or HuggingFace path. | _required_ |
| `--backends` | text | Comma-separated backend names, e.g. 'mlx-lm,llama-cpp'. First is baseline. | _required_ |
| `--prompts` | text | Bundled suite name (e.g. 'reasoning') or path to a .jsonl file. | _required_ |
| `--output` | path | Output directory. | `./results/diff/` |
| `--quant` | text | Quantization level applied to all backends. | `None` |
| `--base-urls` | text | Comma-separated base URLs for HTTP backends (positionally matched to --backends). | `None` |
| `--max-tokens` | integer range (1 and above) | Override default max tokens for generation. | `None` |
| `--num-prompts` | integer range (1 and above) | Limit number of prompts to use. | `None` |
| `--disable-thinking` / `--enable-thinking` | boolean | Suppress reasoning/thinking mode on models that support it (Qwen3, DeepSeek-R1, Ollama think, vLLM chat_template_kwargs, OpenAI/OpenRouter reasoning). Models without a thinking mode are unaffected. Defaults to disabled so outputs are directly comparable across runs; pass --enable-thinking to restore it. | `True` |
| `--chat` / `--no-chat` | boolean | Use /v1/chat/completions for HTTP backends (applies chat template server-side). Pass --no-chat to use raw /v1/completions instead. Ignored for mlx-lm. | `True` |
| `--help` | boolean | Show this message and exit. | `False` |
## report
Generate a report from previously saved result JSON files.
Usage:
Options:
| Name | Type | Description | Default |
|---|---|---|---|
| `--format` | choice (`html` \| `json`) | Output format. | `html` |
| `--output` | path | Output file path (defaults to …). | `None` |
| `--help` | boolean | Show this message and exit. | `False` |
## stress
Stress-test a backend with varying concurrency levels.
Usage:
Options:
| Name | Type | Description | Default |
|---|---|---|---|
| `--model` | text | Model ID or HuggingFace path. | _required_ |
| `--backend` | text | Backend type (auto-detected if omitted). | `None` |
| `--prompts` | text | Bundled suite name (e.g. 'reasoning') or path to a .jsonl file. | _required_ |
| `--output` | path | Output directory. | `./results/stress/` |
| `--concurrency` | text | Comma-separated concurrency levels. | `1,2,4,8,16` |
| `--base-url` | text | Base URL for HTTP backends. | `None` |
| `--max-tokens` | integer range (1 and above) | Override default max tokens for generation. | `None` |
| `--num-prompts` | integer range (1 and above) | Limit number of prompts to use. | `None` |
| `--disable-thinking` / `--enable-thinking` | boolean | Suppress reasoning/thinking mode on models that support it (Qwen3, DeepSeek-R1, Ollama think, vLLM chat_template_kwargs, OpenAI/OpenRouter reasoning). Models without a thinking mode are unaffected. Defaults to disabled so outputs are directly comparable across runs; pass --enable-thinking to restore it. | `True` |
| `--chat` / `--no-chat` | boolean | Use /v1/chat/completions for HTTP backends (applies chat template server-side). Pass --no-chat to use raw /v1/completions instead. Ignored for mlx-lm. | `True` |
| `--help` | boolean | Show this message and exit. | `False` |
## sweep
Run a quantization sweep: compare pre-quantized models against a baseline.
Each model is a separate HuggingFace repo or local path. The first model (or --baseline) is the reference; all others are compared against it.
Example:

```
infer-check sweep \
  --models "bf16=mlx-community/Llama-3.1-8B-Instruct-bf16,\
4bit=mlx-community/Llama-3.1-8B-Instruct-4bit,\
3bit=mlx-community/Llama-3.1-8B-Instruct-3bit" \
  --prompts reasoning
```
Usage:
Options:
| Name | Type | Description | Default |
|---|---|---|---|
| `--models` | text | Comma-separated label=model_path pairs. Example: 'bf16=mlx-community/Llama-3.1-8B-Instruct-bf16,4bit=mlx-community/Llama-3.1-8B-Instruct-4bit' | _required_ |
| `--backend` | text | Backend type (auto-detected if omitted). | `None` |
| `--prompts` | text | Bundled suite name (e.g. 'reasoning') or path to a .jsonl file. | _required_ |
| `--output` | path | Output directory. | `./results/sweep/` |
| `--baseline` | text | Baseline label (defaults to first in --models). | `None` |
| `--base-url` | text | Base URL for HTTP backends. | `None` |
| `--max-tokens` | integer range (1 and above) | Override default max tokens for generation. | `None` |
| `--num-prompts` | integer range (1 and above) | Limit number of prompts to use. | `None` |
| `--disable-thinking` / `--enable-thinking` | boolean | Suppress reasoning/thinking mode on models that support it (Qwen3, DeepSeek-R1, Ollama think, vLLM chat_template_kwargs, OpenAI/OpenRouter reasoning). Models without a thinking mode are unaffected. Defaults to disabled so outputs are directly comparable across runs; pass --enable-thinking to restore it. | `True` |
| `--chat` / `--no-chat` | boolean | Use /v1/chat/completions for HTTP backends (applies chat template server-side). Pass --no-chat to use raw /v1/completions instead. Ignored for mlx-lm. | `True` |
| `--help` | boolean | Show this message and exit. | `False` |
### How it works
- Parse models -- splits the `--models` string into label/path pairs and creates a backend for each.
- Baseline self-check -- runs the baseline model twice on all prompts and compares the results. If the baseline isn't perfectly deterministic (50/50 identical), you'll see a warning. This tells you whether your comparison data is reliable.
- Test comparisons -- runs every other quantization against the baseline and computes per-prompt metrics (text similarity, severity, KL divergence).
- Checkpoint saves -- results are saved incrementally after each quantization level completes, so partial results survive interruptions.
- Summary table -- displays a table grouped by quantization level with severity breakdowns.
### Model format
The `--models` option accepts comma-separated entries. Each entry can be:
- Labeled: `label=model_path` (e.g., `bf16=mlx-community/Meta-Llama-3.1-8B-Instruct-bf16`)
- Unlabeled: just the model path (the last path component becomes the label)
You need at least 2 models (one baseline + one test).
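The parsing rules above can be sketched in a few lines of POSIX shell. This is an illustration of the documented rules, not the tool's actual implementation:

```shell
# Split on commas, then on the first '='; unlabeled entries fall back
# to the last path component as their label.
models="bf16=mlx-community/Meta-Llama-3.1-8B-Instruct-bf16,mlx-community/Meta-Llama-3.1-8B-Instruct-4bit"

old_ifs=$IFS; IFS=','
for entry in $models; do
  case $entry in
    *=*) label=${entry%%=*}; path=${entry#*=} ;;   # labeled entry
    *)   path=$entry; label=${path##*/} ;;         # unlabeled: derive label from path
  esac
  printf '%s -> %s\n' "$label" "$path"
done
IFS=$old_ifs
```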
### Example
Full sweep across three quantization levels:
```
infer-check --max-tokens 512 --num-prompts 10 sweep \
  --models "bf16=mlx-community/Meta-Llama-3.1-8B-Instruct-bf16,\
8bit=mlx-community/Meta-Llama-3.1-8B-Instruct-8bit,\
4bit=mlx-community/Meta-Llama-3.1-8B-Instruct-4bit" \
  --backend mlx-lm \
  --prompts reasoning \
  --output ./results/sweep/
```
Output:

```
Sweep Summary
┏━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━━━━━━━━┓
┃ quant_level         ┃ identical ┃ minor ┃ moderate ┃ severe ┃ mean_similarity ┃
┡━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━╇━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━━━━━━━━┩
│ bf16 (self-check)   │ 50/50     │ 0/50  │ 0/50     │ 0/50   │ 1.0000          │
│ 8bit                │ 20/50     │ 9/50  │ 12/50    │ 9/50   │ 0.8067          │
│ 4bit                │ 1/50      │ 3/50  │ 11/50    │ 35/50  │ 0.3837          │
└─────────────────────┴───────────┴───────┴──────────┴────────┴─────────────────┘
```
The self-check row confirms the baseline is deterministic. The 4-bit row shows 35/50 severe divergences -- a clear signal of quantization-induced correctness degradation.
### Output format
Results are saved as a JSON file containing a `SweepResult` with:

- `model_id` -- the baseline model identifier
- `backend_name` -- the backend used
- `quantization_levels` -- list of quantization labels
- `comparisons` -- all per-prompt `ComparisonResult` objects
- `timestamp` -- when the sweep completed
- `summary` -- aggregate statistics (mean KL, failure counts)
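Because the result is plain JSON, any JSON tool can pull fields out of it for downstream scripting. A sketch against a hand-written stand-in file -- the file name and exact nesting here are assumptions, not the tool's guaranteed layout:

```shell
# Stand-in SweepResult with a subset of the fields listed above; a real
# file would also contain 'comparisons', 'timestamp', and 'summary'.
cat > sweep_result.json <<'EOF'
{"model_id": "mlx-community/Meta-Llama-3.1-8B-Instruct-bf16",
 "backend_name": "mlx-lm",
 "quantization_levels": ["bf16", "8bit", "4bit"]}
EOF

# Extract the baseline model id and backend with stdlib Python.
python3 -c 'import json; d = json.load(open("sweep_result.json")); print(d["model_id"], d["backend_name"])'
```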