diff
Compare outputs across different backends for the same model and prompts. Catches serving-layer bugs by holding the model and quantization constant while varying the inference path.
CLI Reference
infer-check
infer-check: correctness and reliability testing for LLM inference engines.
Usage:
Options:
| Name | Type | Description | Default |
|---|---|---|---|
| --version | boolean | Show the version and exit. | False |
| --max-tokens | integer | Default max tokens for generation (applies to all prompts unless they specify their own). | 1024 |
| --num-prompts | integer range (1 and above) | Limit the number of prompts to use from a suite; if omitted, all prompts are used. | None |
| --help | boolean | Show this message and exit. | False |
compare
Compare two quantizations of the same model.
MODEL_A and MODEL_B are model specs — HuggingFace repos, Ollama tags, or local GGUF paths. The backend is auto-detected from the identifier, or you can use an explicit prefix (ollama:, mlx:, gguf:, vllm-mlx:).
Examples:
# Two MLX quants
infer-check compare \
mlx-community/Llama-3.1-8B-Instruct-4bit \
mlx-community/Llama-3.1-8B-Instruct-8bit
# MLX native vs Ollama GGUF
infer-check compare \
mlx-community/Llama-3.1-8B-Instruct-4bit \
ollama:llama3.1:8b-instruct-q4_K_M
# Bartowski GGUF vs Unsloth GGUF (both via Ollama)
infer-check compare \
ollama:bartowski/Llama-3.1-8B-Instruct-GGUF \
ollama:unsloth/Llama-3.1-8B-Instruct-GGUF
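The prefix rules above can be sketched as a small helper. The explicit prefixes come from this page; the fallbacks (treating a `.gguf` path as local GGUF and any other bare identifier as a HuggingFace repo) are illustrative assumptions, not infer-check's actual detection code:

```python
def detect_backend(spec: str) -> tuple[str, str]:
    """Split a model spec into (backend, identifier).

    Explicit prefixes are taken from the docs above; the fallback
    heuristics are assumptions, not infer-check's real logic.
    """
    prefixes = {"ollama", "mlx", "gguf", "vllm-mlx"}
    head, sep, rest = spec.partition(":")
    if sep and head in prefixes:
        return head, rest            # explicit prefix wins
    if spec.endswith(".gguf"):
        return "gguf", spec          # local GGUF file (assumed rule)
    return "mlx", spec               # assume a HuggingFace repo id
```

Note that an Ollama tag like `llama3.1:8b-instruct-q4_K_M` still works with the `ollama:` prefix because only the first colon is split off.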
Usage:
Options:
| Name | Type | Description | Default |
|---|---|---|---|
| --prompts | text | Bundled suite name (e.g. 'reasoning') or path to a .jsonl file. | adversarial-numerics |
| --output | path | Output directory. | ./results/compare/ |
| --base-url | text | Base URL override for HTTP backends. Applied to both models unless they resolve to mlx-lm. | None |
| --label-a | text | Custom label for model A (defaults to auto-derived short name). | None |
| --label-b | text | Custom label for model B (defaults to auto-derived short name). | None |
| --report / --no-report | boolean | Generate an HTML comparison report after the run. | True |
| --max-tokens | integer range (1 and above) | Override default max tokens for generation. | None |
| --num-prompts | integer range (1 and above) | Limit number of prompts to use. | None |
| --disable-thinking / --enable-thinking | boolean | Suppress reasoning/thinking mode on models that support it (Qwen3, DeepSeek-R1, Ollama think, vLLM chat_template_kwargs, OpenAI/OpenRouter reasoning). Models without a thinking mode are unaffected. Defaults to disabled so outputs are directly comparable across runs; pass --enable-thinking to restore it. | True |
| --chat / --no-chat | boolean | Use /v1/chat/completions for HTTP backends (applies chat template server-side). Pass --no-chat to use raw /v1/completions instead. Ignored for mlx-lm. | True |
| --help | boolean | Show this message and exit. | False |
determinism
Test whether a backend produces identical outputs across repeated runs at temperature=0.
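Determinism over repeated runs can be quantified as how often the modal output recurs. A minimal sketch of such a metric (my own illustration, not necessarily the exact score infer-check reports):

```python
from collections import Counter

def determinism_score(outputs: list[str]) -> float:
    """Fraction of runs that produced the most common output.

    1.0 means every run at temperature=0 was byte-identical.
    Illustrative metric, not infer-check's exact one.
    """
    counts = Counter(outputs)
    return counts.most_common(1)[0][1] / len(outputs)
```

With `--runs 100`, a score below 1.0 for any prompt indicates nondeterminism in the serving path.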
Usage:
Options:
| Name | Type | Description | Default |
|---|---|---|---|
| --model | text | Model ID or HuggingFace path. | required |
| --backend | text | Backend type (auto-detected if omitted). | None |
| --prompts | text | Bundled suite name (e.g. 'reasoning') or path to a .jsonl file. | required |
| --output | path | Output directory. | ./results/determinism/ |
| --runs | integer | Number of runs per prompt. | 100 |
| --base-url | text | Base URL for HTTP backends. | None |
| --max-tokens | integer range (1 and above) | Override default max tokens for generation. | None |
| --num-prompts | integer range (1 and above) | Limit number of prompts to use. | None |
| --disable-thinking / --enable-thinking | boolean | Suppress reasoning/thinking mode on models that support it (Qwen3, DeepSeek-R1, Ollama think, vLLM chat_template_kwargs, OpenAI/OpenRouter reasoning). Models without a thinking mode are unaffected. Defaults to disabled so outputs are directly comparable across runs; pass --enable-thinking to restore it. | True |
| --chat / --no-chat | boolean | Use /v1/chat/completions for HTTP backends (applies chat template server-side). Pass --no-chat to use raw /v1/completions instead. Ignored for mlx-lm. | True |
| --help | boolean | Show this message and exit. | False |
diff
Compare outputs across different backends for the same model and prompts.
Usage:
Options:
| Name | Type | Description | Default |
|---|---|---|---|
| --model | text | Model ID or HuggingFace path. | required |
| --backends | text | Comma-separated backend names, e.g. 'mlx-lm,llama-cpp'. First is baseline. | required |
| --prompts | text | Bundled suite name (e.g. 'reasoning') or path to a .jsonl file. | required |
| --output | path | Output directory. | ./results/diff/ |
| --quant | text | Quantization level applied to all backends. | None |
| --base-urls | text | Comma-separated base URLs for HTTP backends (positionally matched to --backends). | None |
| --max-tokens | integer range (1 and above) | Override default max tokens for generation. | None |
| --num-prompts | integer range (1 and above) | Limit number of prompts to use. | None |
| --disable-thinking / --enable-thinking | boolean | Suppress reasoning/thinking mode on models that support it (Qwen3, DeepSeek-R1, Ollama think, vLLM chat_template_kwargs, OpenAI/OpenRouter reasoning). Models without a thinking mode are unaffected. Defaults to disabled so outputs are directly comparable across runs; pass --enable-thinking to restore it. | True |
| --chat / --no-chat | boolean | Use /v1/chat/completions for HTTP backends (applies chat template server-side). Pass --no-chat to use raw /v1/completions instead. Ignored for mlx-lm. | True |
| --help | boolean | Show this message and exit. | False |
report
Generate a report from previously saved result JSON files.
Usage:
Options:
| Name | Type | Description | Default |
|---|---|---|---|
| --format | choice (html \| json) | Output format. | html |
| --output | path | Output file path (defaults to | None |
| --help | boolean | Show this message and exit. | False |
stress
Stress-test a backend with varying concurrency levels.
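One step of such a concurrency sweep can be sketched with a semaphore bounding the number of in-flight requests. This is an illustration of the idea, not infer-check's actual scheduler:

```python
import asyncio

async def stress_level(requests, level: int):
    """Run async request callables with at most `level` in flight.

    Sketch of a single concurrency level in the sweep; infer-check's
    real scheduling and timing collection may differ.
    """
    sem = asyncio.Semaphore(level)

    async def bounded(req):
        async with sem:
            return await req()

    # gather preserves input order regardless of completion order
    return await asyncio.gather(*(bounded(r) for r in requests))
```

Repeating this for each level in `--concurrency` (e.g. 1, 2, 4, 8, 16) and comparing outputs surfaces batching-dependent divergence.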
Usage:
Options:
| Name | Type | Description | Default |
|---|---|---|---|
| --model | text | Model ID or HuggingFace path. | required |
| --backend | text | Backend type (auto-detected if omitted). | None |
| --prompts | text | Bundled suite name (e.g. 'reasoning') or path to a .jsonl file. | required |
| --output | path | Output directory. | ./results/stress/ |
| --concurrency | text | Comma-separated concurrency levels. | 1,2,4,8,16 |
| --base-url | text | Base URL for HTTP backends. | None |
| --max-tokens | integer range (1 and above) | Override default max tokens for generation. | None |
| --num-prompts | integer range (1 and above) | Limit number of prompts to use. | None |
| --disable-thinking / --enable-thinking | boolean | Suppress reasoning/thinking mode on models that support it (Qwen3, DeepSeek-R1, Ollama think, vLLM chat_template_kwargs, OpenAI/OpenRouter reasoning). Models without a thinking mode are unaffected. Defaults to disabled so outputs are directly comparable across runs; pass --enable-thinking to restore it. | True |
| --chat / --no-chat | boolean | Use /v1/chat/completions for HTTP backends (applies chat template server-side). Pass --no-chat to use raw /v1/completions instead. Ignored for mlx-lm. | True |
| --help | boolean | Show this message and exit. | False |
sweep
Run a quantization sweep: compare pre-quantized models against a baseline.
Each model is a separate HuggingFace repo or local path. The first model (or --baseline) is the reference; all others are compared against it.
Example:
infer-check sweep \
--models "bf16=mlx-community/Llama-3.1-8B-Instruct-bf16,
4bit=mlx-community/Llama-3.1-8B-Instruct-4bit,
3bit=mlx-community/Llama-3.1-8B-Instruct-3bit" \
--prompts reasoning
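The `label=model_path` format above can be parsed with a few lines; stripping whitespace around entries (so the multi-line shell example works) is an assumption about infer-check's parsing:

```python
def parse_models(spec: str) -> dict[str, str]:
    """Parse a --models string of comma-separated 'label=path' pairs.

    Sketch of the format shown above; whitespace handling is an
    assumption, not confirmed infer-check behaviour.
    """
    pairs = {}
    for entry in spec.split(","):
        label, _, path = entry.strip().partition("=")
        pairs[label] = path
    return pairs
```

The first label in the resulting dict (or the one named by `--baseline`) serves as the reference quantization.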
Usage:
Options:
| Name | Type | Description | Default |
|---|---|---|---|
| --models | text | Comma-separated label=model_path pairs. Example: 'bf16=mlx-community/Llama-3.1-8B-Instruct-bf16,4bit=mlx-community/Llama-3.1-8B-Instruct-4bit' | required |
| --backend | text | Backend type (auto-detected if omitted). | None |
| --prompts | text | Bundled suite name (e.g. 'reasoning') or path to a .jsonl file. | required |
| --output | path | Output directory. | ./results/sweep/ |
| --baseline | text | Baseline label (defaults to first in --models). | None |
| --base-url | text | Base URL for HTTP backends. | None |
| --max-tokens | integer range (1 and above) | Override default max tokens for generation. | None |
| --num-prompts | integer range (1 and above) | Limit number of prompts to use. | None |
| --disable-thinking / --enable-thinking | boolean | Suppress reasoning/thinking mode on models that support it (Qwen3, DeepSeek-R1, Ollama think, vLLM chat_template_kwargs, OpenAI/OpenRouter reasoning). Models without a thinking mode are unaffected. Defaults to disabled so outputs are directly comparable across runs; pass --enable-thinking to restore it. | True |
| --chat / --no-chat | boolean | Use /v1/chat/completions for HTTP backends (applies chat template server-side). Pass --no-chat to use raw /v1/completions instead. Ignored for mlx-lm. | True |
| --help | boolean | Show this message and exit. | False |
How it works
- Build backends -- creates a backend instance for each entry in --backends, using the shared --model and optional --quant.
- Baseline pass -- generates outputs for all prompts using the first backend.
- Test passes -- generates outputs for all prompts using each remaining backend.
- Compare -- each test backend's outputs are compared against the baseline, producing per-prompt ComparisonResult objects with severity, text similarity, and flip metadata.
- Summary table -- groups results by test backend and displays failure rate, flip rate, and mean similarity.
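The passes above can be sketched in a few lines; `generate(backend, prompt)` and `compare(baseline_out, test_out)` are assumed stand-in callables, not infer-check APIs:

```python
def run_diff(backends, prompts, generate, compare):
    """Baseline pass, then one test pass per remaining backend.

    `generate` and `compare` stand in for infer-check internals;
    this sketches the control flow only.
    """
    baseline, *tests = backends
    base_out = {p: generate(baseline, p) for p in prompts}   # baseline pass
    return {
        b: [compare(base_out[p], generate(b, p)) for p in prompts]  # test passes
        for b in tests
    }
```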
Base URL matching
The --base-urls option is positionally matched to --backends. Use an empty entry for backends that don't need a URL (e.g., mlx-lm):
--backends "mlx-lm,openai-compat" \
--base-urls ",http://127.0.0.1:8000"
This gives mlx-lm no URL (local inference) and openai-compat the vllm-mlx server URL.
Examples
mlx-lm vs vllm-mlx serving layer:
# Start vllm-mlx in another terminal:
# vllm-mlx serve mlx-community/Meta-Llama-3.1-8B-Instruct-4bit --port 8000
infer-check diff \
--model mlx-community/Meta-Llama-3.1-8B-Instruct-4bit \
--backends "mlx-lm,openai-compat" \
--base-urls ",http://127.0.0.1:8000" \
--prompts reasoning \
--output ./results/diff/
With raw completions endpoint (no chat template):
infer-check diff \
--model mlx-community/Meta-Llama-3.1-8B-Instruct-4bit \
--backends "mlx-lm,openai-compat" \
--base-urls ",http://127.0.0.1:8000" \
--prompts reasoning \
--no-chat
Output
Diff Summary
┏━━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┓
┃ test_backend ┃ failures ┃ failure_rate ┃ flip_rate ┃ mean_similarity ┃
┡━━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━┩
│ openai-compat │ 0 │ 0.00% │ 0.0% │ 1.0000 │
└───────────────┴──────────┴──────────────┴───────────┴─────────────────┘
A 100% similarity with 0% flip rate means the serving layer introduces zero divergence -- any output differences in production come from quantization, not the backend itself.
Output format
Results are saved as a JSON array of ComparisonResult objects, each containing:
- baseline / test -- the InferenceResult from each backend
- kl_divergence -- KL(baseline || test) if logprobs are available
- token_divergence_index -- first token where the outputs differ
- text_similarity -- 0-1 similarity score
- is_failure -- true if similarity < 0.5
- metadata -- includes severity, flip status, answers, extraction strategy
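Given those fields, the summary table can be recomputed from a saved results file. The field names follow the list above; reading the flip flag from `metadata["flip"]` is an assumption about the JSON layout:

```python
import statistics

def summarize(results: list[dict]) -> dict:
    """Recompute the diff summary columns from ComparisonResult dicts.

    Field names follow the output format above; metadata["flip"]
    is an assumed key for the flip status.
    """
    n = len(results)
    failures = sum(1 for r in results if r["is_failure"])
    flips = sum(1 for r in results if r.get("metadata", {}).get("flip"))
    return {
        "failures": failures,
        "failure_rate": failures / n,
        "flip_rate": flips / n,
        "mean_similarity": statistics.mean(r["text_similarity"] for r in results),
    }
```

Feeding this a `json.load` of a results file reproduces the failure rate, flip rate, and mean similarity columns from the Diff Summary table.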
baseline/test-- theInferenceResultfrom each backendkl_divergence-- KL(baseline || test) if logprobs are availabletoken_divergence_index-- first token where the outputs differtext_similarity-- 0-1 similarity scoreis_failure-- true if similarity < 0.5metadata-- includes severity, flip status, answers, extraction strategy