# determinism
Test whether a backend produces identical outputs across repeated runs at temperature=0. A correctly implemented inference engine should produce bit-identical output for the same prompt and parameters every time.
## CLI Reference

### infer-check
infer-check: correctness and reliability testing for LLM inference engines.
Usage: `infer-check [OPTIONS] COMMAND [ARGS]...`
Options:
| Name | Type | Description | Default |
|---|---|---|---|
| --version | boolean | Show the version and exit. | False |
| --max-tokens | integer | Default max tokens for generation (applies to all prompts unless they specify their own). | 1024 |
| --num-prompts | integer range (1 and above) | Limit the number of prompts to use from a suite; if omitted, all prompts are used. | None |
| --help | boolean | Show this message and exit. | False |
### compare
Compare two quantizations of the same model.
MODEL_A and MODEL_B are model specs — HuggingFace repos, Ollama tags, or local GGUF paths. The backend is auto-detected from the identifier, or you can use an explicit prefix (ollama:, mlx:, gguf:, vllm-mlx:).
Examples:

```
# Two MLX quants
infer-check compare \
  mlx-community/Llama-3.1-8B-Instruct-4bit \
  mlx-community/Llama-3.1-8B-Instruct-8bit

# MLX native vs Ollama GGUF
infer-check compare \
  mlx-community/Llama-3.1-8B-Instruct-4bit \
  ollama:llama3.1:8b-instruct-q4_K_M

# Bartowski GGUF vs Unsloth GGUF (both via Ollama)
infer-check compare \
  ollama:bartowski/Llama-3.1-8B-Instruct-GGUF \
  ollama:unsloth/Llama-3.1-8B-Instruct-GGUF
```
Usage: `infer-check compare [OPTIONS] MODEL_A MODEL_B`
Options:
| Name | Type | Description | Default |
|---|---|---|---|
| --prompts | text | Bundled suite name (e.g. 'reasoning') or path to a .jsonl file. | adversarial-numerics |
| --output | path | Output directory. | ./results/compare/ |
| --base-url | text | Base URL override for HTTP backends. Applied to both models unless they resolve to mlx-lm. | None |
| --label-a | text | Custom label for model A (defaults to an auto-derived short name). | None |
| --label-b | text | Custom label for model B (defaults to an auto-derived short name). | None |
| --report / --no-report | boolean | Generate an HTML comparison report after the run. | True |
| --max-tokens | integer range (1 and above) | Override default max tokens for generation. | None |
| --num-prompts | integer range (1 and above) | Limit number of prompts to use. | None |
| --disable-thinking / --enable-thinking | boolean | Suppress reasoning/thinking mode on models that support it (Qwen3, DeepSeek-R1, Ollama think, vLLM chat_template_kwargs, OpenAI/OpenRouter reasoning). Models without a thinking mode are unaffected. Defaults to disabled so outputs are directly comparable across runs; pass --enable-thinking to restore it. | True |
| --chat / --no-chat | boolean | Use /v1/chat/completions for HTTP backends (applies chat template server-side). Pass --no-chat to use raw /v1/completions instead. Ignored for mlx-lm. | True |
| --help | boolean | Show this message and exit. | False |
### determinism
Test whether a backend produces identical outputs across repeated runs at temperature=0.
Usage: `infer-check determinism [OPTIONS]`
Options:
| Name | Type | Description | Default |
|---|---|---|---|
| --model | text | Model ID or HuggingFace path. | required |
| --backend | text | Backend type (auto-detected if omitted). | None |
| --prompts | text | Bundled suite name (e.g. 'reasoning') or path to a .jsonl file. | required |
| --output | path | Output directory. | ./results/determinism/ |
| --runs | integer | Number of runs per prompt. | 100 |
| --base-url | text | Base URL for HTTP backends. | None |
| --max-tokens | integer range (1 and above) | Override default max tokens for generation. | None |
| --num-prompts | integer range (1 and above) | Limit number of prompts to use. | None |
| --disable-thinking / --enable-thinking | boolean | Suppress reasoning/thinking mode on models that support it (Qwen3, DeepSeek-R1, Ollama think, vLLM chat_template_kwargs, OpenAI/OpenRouter reasoning). Models without a thinking mode are unaffected. Defaults to disabled so outputs are directly comparable across runs; pass --enable-thinking to restore it. | True |
| --chat / --no-chat | boolean | Use /v1/chat/completions for HTTP backends (applies chat template server-side). Pass --no-chat to use raw /v1/completions instead. Ignored for mlx-lm. | True |
| --help | boolean | Show this message and exit. | False |
### diff
Compare outputs across different backends for the same model and prompts.
Usage: `infer-check diff [OPTIONS]`
Options:
| Name | Type | Description | Default |
|---|---|---|---|
| --model | text | Model ID or HuggingFace path. | required |
| --backends | text | Comma-separated backend names, e.g. 'mlx-lm,llama-cpp'. First is baseline. | required |
| --prompts | text | Bundled suite name (e.g. 'reasoning') or path to a .jsonl file. | required |
| --output | path | Output directory. | ./results/diff/ |
| --quant | text | Quantization level applied to all backends. | None |
| --base-urls | text | Comma-separated base URLs for HTTP backends (positionally matched to --backends). | None |
| --max-tokens | integer range (1 and above) | Override default max tokens for generation. | None |
| --num-prompts | integer range (1 and above) | Limit number of prompts to use. | None |
| --disable-thinking / --enable-thinking | boolean | Suppress reasoning/thinking mode on models that support it (Qwen3, DeepSeek-R1, Ollama think, vLLM chat_template_kwargs, OpenAI/OpenRouter reasoning). Models without a thinking mode are unaffected. Defaults to disabled so outputs are directly comparable across runs; pass --enable-thinking to restore it. | True |
| --chat / --no-chat | boolean | Use /v1/chat/completions for HTTP backends (applies chat template server-side). Pass --no-chat to use raw /v1/completions instead. Ignored for mlx-lm. | True |
| --help | boolean | Show this message and exit. | False |
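A hypothetical invocation, composed only of flags documented above (the model path is illustrative):

```
infer-check diff \
  --model mlx-community/Llama-3.1-8B-Instruct-4bit \
  --backends mlx-lm,llama-cpp \
  --prompts reasoning
```

Per the --backends description, the first backend listed (here mlx-lm) serves as the baseline the others are diffed against.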
### report
Generate a report from previously saved result JSON files.
Usage:
Options:
| Name | Type | Description | Default |
|---|---|---|---|
| --format | choice (html \| json) | Output format. | html |
| --output | path | Output file path (defaults to …). | None |
| --help | boolean | Show this message and exit. | False |
### stress
Stress-test a backend with varying concurrency levels.
Usage: `infer-check stress [OPTIONS]`
Options:
| Name | Type | Description | Default |
|---|---|---|---|
| --model | text | Model ID or HuggingFace path. | required |
| --backend | text | Backend type (auto-detected if omitted). | None |
| --prompts | text | Bundled suite name (e.g. 'reasoning') or path to a .jsonl file. | required |
| --output | path | Output directory. | ./results/stress/ |
| --concurrency | text | Comma-separated concurrency levels. | 1,2,4,8,16 |
| --base-url | text | Base URL for HTTP backends. | None |
| --max-tokens | integer range (1 and above) | Override default max tokens for generation. | None |
| --num-prompts | integer range (1 and above) | Limit number of prompts to use. | None |
| --disable-thinking / --enable-thinking | boolean | Suppress reasoning/thinking mode on models that support it (Qwen3, DeepSeek-R1, Ollama think, vLLM chat_template_kwargs, OpenAI/OpenRouter reasoning). Models without a thinking mode are unaffected. Defaults to disabled so outputs are directly comparable across runs; pass --enable-thinking to restore it. | True |
| --chat / --no-chat | boolean | Use /v1/chat/completions for HTTP backends (applies chat template server-side). Pass --no-chat to use raw /v1/completions instead. Ignored for mlx-lm. | True |
| --help | boolean | Show this message and exit. | False |
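A hypothetical invocation, using the comma-separated form shown in the --concurrency default (the model path is illustrative):

```
infer-check stress \
  --model mlx-community/Llama-3.1-8B-Instruct-4bit \
  --concurrency 1,2,4,8 \
  --prompts reasoning
```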
### sweep
Run a quantization sweep: compare pre-quantized models against a baseline.
Each model is a separate HuggingFace repo or local path. The first model (or --baseline) is the reference; all others are compared against it.
Example:

```
infer-check sweep \
  --models "bf16=mlx-community/Llama-3.1-8B-Instruct-bf16,
            4bit=mlx-community/Llama-3.1-8B-Instruct-4bit,
            3bit=mlx-community/Llama-3.1-8B-Instruct-3bit" \
  --prompts reasoning
```
Usage: `infer-check sweep [OPTIONS]`
Options:
| Name | Type | Description | Default |
|---|---|---|---|
| --models | text | Comma-separated label=model_path pairs. Example: 'bf16=mlx-community/Llama-3.1-8B-Instruct-bf16,4bit=mlx-community/Llama-3.1-8B-Instruct-4bit' | required |
| --backend | text | Backend type (auto-detected if omitted). | None |
| --prompts | text | Bundled suite name (e.g. 'reasoning') or path to a .jsonl file. | required |
| --output | path | Output directory. | ./results/sweep/ |
| --baseline | text | Baseline label (defaults to first in --models). | None |
| --base-url | text | Base URL for HTTP backends. | None |
| --max-tokens | integer range (1 and above) | Override default max tokens for generation. | None |
| --num-prompts | integer range (1 and above) | Limit number of prompts to use. | None |
| --disable-thinking / --enable-thinking | boolean | Suppress reasoning/thinking mode on models that support it (Qwen3, DeepSeek-R1, Ollama think, vLLM chat_template_kwargs, OpenAI/OpenRouter reasoning). Models without a thinking mode are unaffected. Defaults to disabled so outputs are directly comparable across runs; pass --enable-thinking to restore it. | True |
| --chat / --no-chat | boolean | Use /v1/chat/completions for HTTP backends (applies chat template server-side). Pass --no-chat to use raw /v1/completions instead. Ignored for mlx-lm. | True |
| --help | boolean | Show this message and exit. | False |
## How it works

- **Force temperature=0** -- all prompts are run at temperature=0 to ensure deterministic sampling.
- **Repeat runs** -- each prompt is sent to the backend N times (default 100).
- **Count identical outputs** -- count how many runs produced exactly the same text as the most common output.
- **Find divergence positions** -- for each pair of non-identical outputs, identify the first token position where they diverge.
- **Compute the score** -- determinism score = identical_count / num_runs (1.0 = perfect). A minimal sketch of this scoring logic appears after the list.
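The counting and divergence steps reduce to a few lines. Here is an illustrative sketch, not the tool's actual code: it scores one prompt's repeated outputs and approximates token positions by whitespace splitting rather than the backend's real tokenizer.

```python
from collections import Counter
from itertools import combinations

def determinism_score(outputs: list[str]) -> tuple[float, list[int]]:
    """Fraction of runs matching the modal output, plus the first-divergence
    position for every non-identical pair of runs."""
    # Most common output text and how many runs produced it exactly.
    _, identical_count = Counter(outputs).most_common(1)[0]
    divergence_positions = []
    for a, b in combinations(outputs, 2):
        if a == b:
            continue
        # Whitespace tokens stand in for the backend's real tokenizer.
        ta, tb = a.split(), b.split()
        pos = next((i for i, (x, y) in enumerate(zip(ta, tb)) if x != y),
                   min(len(ta), len(tb)))  # one output is a prefix of the other
        divergence_positions.append(pos)
    return identical_count / len(outputs), divergence_positions
```

With 20 runs that all match, this returns `(1.0, [])`, which the summary table renders as 100.00%.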
## Example

```
infer-check determinism \
  --model mlx-community/Meta-Llama-3.1-8B-Instruct-4bit \
  --backend mlx-lm \
  --prompts determinism \
  --runs 20 \
  --output ./results/determinism/
```
Output:

```
Determinism Summary
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┓
┃ prompt_id                            ┃ runs ┃ identical ┃ determinism_score ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━┩
│ d1a2b3c4-...                         │ 20   │ 20        │ 100.00%           │
│ e5f6a7b8-...                         │ 20   │ 20        │ 100.00%           │
│ ...                                  │ 20   │ 20        │ 100.00%           │
└──────────────────────────────────────┴──────┴───────────┴───────────────────┘

Overall determinism score: 100.00%
```
## What non-determinism means
A determinism score below 100% indicates that the backend is not producing consistent output at temperature=0. Common causes:
- Floating-point non-determinism in GPU kernels (different thread scheduling leads to different rounding)
- KV cache bugs that accumulate errors across requests
- Batching interference where concurrent requests affect each other's outputs
- Buggy sampling implementations that don't properly handle temperature=0
> **Warning:** Non-determinism at temperature=0 is always a bug in the inference engine, not a property of the model. A correct implementation must produce identical output for identical inputs.
## Output format

Results are saved as a JSON array of `DeterminismResult` objects, each containing:

- `prompt_id` -- reference to the prompt
- `num_runs` -- total number of runs
- `identical_count` -- how many runs matched the most common output
- `divergence_positions` -- token indices where any pair of runs diverged
- `determinism_score` -- identical_count / num_runs
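Because each record carries these fields, the array can be loaded directly for downstream analysis. A minimal sketch, assuming only the field names above (the file name is illustrative, not a guaranteed path):

```python
import json

# Path is illustrative; use whatever file the run wrote under --output.
with open("results/determinism/results.json") as f:
    results = json.load(f)

# Print the per-prompt score exactly as the summary table defines it.
for r in results:
    print(f"{r['prompt_id']}: {r['determinism_score']:.2%} "
          f"({r['identical_count']}/{r['num_runs']} identical)")
```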