
determinism

Test whether a backend produces identical outputs across repeated runs at temperature=0. A correctly implemented inference engine should produce bit-identical output for the same prompt and parameters every time.

CLI Reference

infer-check

infer-check: correctness and reliability testing for LLM inference engines.

Usage:

infer-check [OPTIONS] COMMAND [ARGS]...

Options:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| --version | boolean | Show the version and exit. | False |
| --max-tokens | integer | Default max tokens for generation (applies to all prompts unless they specify their own). | 1024 |
| --num-prompts | integer range (1 and above) | Limit the number of prompts to use from a suite; if omitted, all prompts are used. | None |
| --help | boolean | Show this message and exit. | False |

compare

Compare two quantizations of the same model.

MODEL_A and MODEL_B are model specs — HuggingFace repos, Ollama tags, or local GGUF paths. The backend is auto-detected from the identifier, or you can use an explicit prefix (ollama:, mlx:, gguf:, vllm-mlx:).

Examples:

# Two MLX quants
infer-check compare \
  mlx-community/Llama-3.1-8B-Instruct-4bit \
  mlx-community/Llama-3.1-8B-Instruct-8bit

# MLX native vs Ollama GGUF
infer-check compare \
  mlx-community/Llama-3.1-8B-Instruct-4bit \
  ollama:llama3.1:8b-instruct-q4_K_M

# Bartowski GGUF vs Unsloth GGUF (both via Ollama)
infer-check compare \
  ollama:bartowski/Llama-3.1-8B-Instruct-GGUF \
  ollama:unsloth/Llama-3.1-8B-Instruct-GGUF

Usage:

infer-check compare [OPTIONS] MODEL_A MODEL_B

Options:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| --prompts | text | Bundled suite name (e.g. 'reasoning') or path to a .jsonl file. | adversarial-numerics |
| --output | path | Output directory. | ./results/compare/ |
| --base-url | text | Base URL override for HTTP backends. Applied to both models unless they resolve to mlx-lm. | None |
| --label-a | text | Custom label for model A (defaults to auto-derived short name). | None |
| --label-b | text | Custom label for model B (defaults to auto-derived short name). | None |
| --report / --no-report | boolean | Generate an HTML comparison report after the run. | True |
| --max-tokens | integer range (1 and above) | Override default max tokens for generation. | None |
| --num-prompts | integer range (1 and above) | Limit number of prompts to use. | None |
| --disable-thinking / --enable-thinking | boolean | Suppress reasoning/thinking mode on models that support it (Qwen3, DeepSeek-R1, Ollama think, vLLM chat_template_kwargs, OpenAI/OpenRouter reasoning). Models without a thinking mode are unaffected. Defaults to disabled so outputs are directly comparable across runs; pass --enable-thinking to restore it. | True |
| --chat / --no-chat | boolean | Use /v1/chat/completions for HTTP backends (applies chat template server-side). Pass --no-chat to use raw /v1/completions instead. Ignored for mlx-lm. | True |
| --help | boolean | Show this message and exit. | False |

determinism

Test whether a backend produces identical outputs across repeated runs at temperature=0.

Usage:

infer-check determinism [OPTIONS]

Options:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| --model | text | Model ID or HuggingFace path. | required |
| --backend | text | Backend type (auto-detected if omitted). | None |
| --prompts | text | Bundled suite name (e.g. 'reasoning') or path to a .jsonl file. | required |
| --output | path | Output directory. | ./results/determinism/ |
| --runs | integer | Number of runs per prompt. | 100 |
| --base-url | text | Base URL for HTTP backends. | None |
| --max-tokens | integer range (1 and above) | Override default max tokens for generation. | None |
| --num-prompts | integer range (1 and above) | Limit number of prompts to use. | None |
| --disable-thinking / --enable-thinking | boolean | Suppress reasoning/thinking mode on models that support it (Qwen3, DeepSeek-R1, Ollama think, vLLM chat_template_kwargs, OpenAI/OpenRouter reasoning). Models without a thinking mode are unaffected. Defaults to disabled so outputs are directly comparable across runs; pass --enable-thinking to restore it. | True |
| --chat / --no-chat | boolean | Use /v1/chat/completions for HTTP backends (applies chat template server-side). Pass --no-chat to use raw /v1/completions instead. Ignored for mlx-lm. | True |
| --help | boolean | Show this message and exit. | False |

diff

Compare outputs across different backends for the same model and prompts.

Usage:

infer-check diff [OPTIONS]

Options:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| --model | text | Model ID or HuggingFace path. | required |
| --backends | text | Comma-separated backend names, e.g. 'mlx-lm,llama-cpp'. First is baseline. | required |
| --prompts | text | Bundled suite name (e.g. 'reasoning') or path to a .jsonl file. | required |
| --output | path | Output directory. | ./results/diff/ |
| --quant | text | Quantization level applied to all backends. | None |
| --base-urls | text | Comma-separated base URLs for HTTP backends (positionally matched to --backends). | None |
| --max-tokens | integer range (1 and above) | Override default max tokens for generation. | None |
| --num-prompts | integer range (1 and above) | Limit number of prompts to use. | None |
| --disable-thinking / --enable-thinking | boolean | Suppress reasoning/thinking mode on models that support it (Qwen3, DeepSeek-R1, Ollama think, vLLM chat_template_kwargs, OpenAI/OpenRouter reasoning). Models without a thinking mode are unaffected. Defaults to disabled so outputs are directly comparable across runs; pass --enable-thinking to restore it. | True |
| --chat / --no-chat | boolean | Use /v1/chat/completions for HTTP backends (applies chat template server-side). Pass --no-chat to use raw /v1/completions instead. Ignored for mlx-lm. | True |
| --help | boolean | Show this message and exit. | False |
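Example: a minimal invocation, assuming mlx-lm as the baseline backend and llama-cpp as the comparison target (the pairing named in the --backends description above); the model path is illustrative, and whether a backend also needs a --base-urls entry depends on how it is served:

infer-check diff \
  --model mlx-community/Llama-3.1-8B-Instruct-4bit \
  --backends mlx-lm,llama-cpp \
  --prompts reasoning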

report

Generate a report from previously saved result JSON files.

Usage:

infer-check report [OPTIONS] RESULTS_DIR

Options:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| --format | choice (html &#124; json) | Output format. | html |
| --output | path | Output file path (defaults to RESULTS_DIR/report.html or report.json). | None |
| --help | boolean | Show this message and exit. | False |
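Example: regenerate an HTML report from the JSON files of an earlier determinism run (the path matches that command's default --output directory):

infer-check report ./results/determinism/ --format html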

stress

Stress-test a backend with varying concurrency levels.

Usage:

infer-check stress [OPTIONS]

Options:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| --model | text | Model ID or HuggingFace path. | required |
| --backend | text | Backend type (auto-detected if omitted). | None |
| --prompts | text | Bundled suite name (e.g. 'reasoning') or path to a .jsonl file. | required |
| --output | path | Output directory. | ./results/stress/ |
| --concurrency | text | Comma-separated concurrency levels. | 1,2,4,8,16 |
| --base-url | text | Base URL for HTTP backends. | None |
| --max-tokens | integer range (1 and above) | Override default max tokens for generation. | None |
| --num-prompts | integer range (1 and above) | Limit number of prompts to use. | None |
| --disable-thinking / --enable-thinking | boolean | Suppress reasoning/thinking mode on models that support it (Qwen3, DeepSeek-R1, Ollama think, vLLM chat_template_kwargs, OpenAI/OpenRouter reasoning). Models without a thinking mode are unaffected. Defaults to disabled so outputs are directly comparable across runs; pass --enable-thinking to restore it. | True |
| --chat / --no-chat | boolean | Use /v1/chat/completions for HTTP backends (applies chat template server-side). Pass --no-chat to use raw /v1/completions instead. Ignored for mlx-lm. | True |
| --help | boolean | Show this message and exit. | False |
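Example: an illustrative run against a local mlx-lm backend with a reduced set of concurrency levels; the model path is a placeholder:

infer-check stress \
  --model mlx-community/Meta-Llama-3.1-8B-Instruct-4bit \
  --backend mlx-lm \
  --concurrency 1,4,16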

sweep

Run a quantization sweep: compare pre-quantized models against a baseline.

Each model is a separate HuggingFace repo or local path. The first model (or --baseline) is the reference; all others are compared against it.

Example:

infer-check sweep \
  --models "bf16=mlx-community/Llama-3.1-8B-Instruct-bf16,
            4bit=mlx-community/Llama-3.1-8B-Instruct-4bit,
            3bit=mlx-community/Llama-3.1-8B-Instruct-3bit" \
  --prompts reasoning

Usage:

infer-check sweep [OPTIONS]

Options:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| --models | text | Comma-separated label=model_path pairs. Example: 'bf16=mlx-community/Llama-3.1-8B-Instruct-bf16,4bit=mlx-community/Llama-3.1-8B-Instruct-4bit' | required |
| --backend | text | Backend type (auto-detected if omitted). | None |
| --prompts | text | Bundled suite name (e.g. 'reasoning') or path to a .jsonl file. | required |
| --output | path | Output directory. | ./results/sweep/ |
| --baseline | text | Baseline label (defaults to first in --models). | None |
| --base-url | text | Base URL for HTTP backends. | None |
| --max-tokens | integer range (1 and above) | Override default max tokens for generation. | None |
| --num-prompts | integer range (1 and above) | Limit number of prompts to use. | None |
| --disable-thinking / --enable-thinking | boolean | Suppress reasoning/thinking mode on models that support it (Qwen3, DeepSeek-R1, Ollama think, vLLM chat_template_kwargs, OpenAI/OpenRouter reasoning). Models without a thinking mode are unaffected. Defaults to disabled so outputs are directly comparable across runs; pass --enable-thinking to restore it. | True |
| --chat / --no-chat | boolean | Use /v1/chat/completions for HTTP backends (applies chat template server-side). Pass --no-chat to use raw /v1/completions instead. Ignored for mlx-lm. | True |
| --help | boolean | Show this message and exit. | False |

How it works

  1. Force temperature=0 -- all prompts are run at temperature=0 to ensure deterministic (greedy) sampling.
  2. Repeat runs -- each prompt is sent to the backend N times (default 100).
  3. Count identical outputs -- count how many runs produced exactly the same text as the most common output.
  4. Find divergence positions -- for each pair of non-identical outputs, identify the first token position where they diverge.
  5. Compute the score -- determinism score = identical_count / num_runs (1.0 = perfect); steps 3-5 are sketched in code below.
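
The scoring in steps 3-5 reduces to a few lines. A minimal Python sketch (not the actual implementation; the function names are illustrative):

from collections import Counter

def determinism_score(outputs: list[str]) -> float:
    # Fraction of runs that match the most common output (1.0 = perfect).
    identical_count = Counter(outputs).most_common(1)[0][1]
    return identical_count / len(outputs)

def first_divergence(tokens_a: list[str], tokens_b: list[str]) -> int | None:
    # First token index where two tokenized outputs differ; None if identical.
    for i, (a, b) in enumerate(zip(tokens_a, tokens_b)):
        if a != b:
            return i
    if len(tokens_a) != len(tokens_b):
        # One output is a strict prefix of the other: they diverge where it ends.
        return min(len(tokens_a), len(tokens_b))
    return None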

Example

infer-check determinism \
  --model mlx-community/Meta-Llama-3.1-8B-Instruct-4bit \
  --backend mlx-lm \
  --prompts determinism \
  --runs 20 \
  --output ./results/determinism/

Output:

                          Determinism Summary
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┓
┃ prompt_id                            ┃ runs ┃ identical ┃ determinism_score ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━┩
│ d1a2b3c4-...                         │   20 │        20 │           100.00% │
│ e5f6a7b8-...                         │   20 │        20 │           100.00% │
│ ...                                  │   20 │        20 │           100.00% │
└──────────────────────────────────────┴──────┴───────────┴───────────────────┘

Overall determinism score: 100.00%

What non-determinism means

A determinism score below 100% indicates that the backend is not producing consistent output at temperature=0. Common causes:

  • Floating-point non-determinism in GPU kernels (different thread scheduling leads to different rounding; see the snippet after this list)
  • KV cache bugs that accumulate errors across requests
  • Batching interference where concurrent requests affect each other's outputs
  • Buggy sampling implementations that don't properly handle temperature=0
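
The first cause is easy to demonstrate from first principles: floating-point addition is not associative, so a parallel reduction that sums the same partial results in a different order can round to a slightly different logit, and at temperature=0 a one-bit logit difference can flip an argmax and send the rest of the generation down a different path. In plain Python:

a, b, c = 0.1, 0.2, 0.3
print((a + b) + c)                   # 0.6000000000000001
print(a + (b + c))                   # 0.6
print((a + b) + c == a + (b + c))    # False: same inputs, different sums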

Warning

Non-determinism at temperature=0 is always a bug in the inference engine, not a property of the model. A correct implementation must produce identical output for identical inputs.

Output format

Results are saved as a JSON array of DeterminismResult objects, each containing:

  • prompt_id -- reference to the prompt
  • num_runs -- total number of runs
  • identical_count -- how many runs matched the most common output
  • divergence_positions -- token indices where any pair of runs diverged
  • determinism_score -- identical_count / num_runs
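
For illustration, a fully deterministic single-prompt run might serialize like this (field names as listed above; the values are invented):

[
  {
    "prompt_id": "d1a2b3c4-...",
    "num_runs": 20,
    "identical_count": 20,
    "divergence_positions": [],
    "determinism_score": 1.0
  }
]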