
determinism

Test whether a backend produces identical outputs across repeated runs at temperature=0. A correctly implemented inference engine should produce bit-identical output for the same prompt and parameters every time.

CLI Reference

infer-check

infer-check: correctness and reliability testing for LLM inference engines.

Usage:

infer-check [OPTIONS] COMMAND [ARGS]...

Options:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| --version | boolean | Show the version and exit. | False |
| --max-tokens | integer | Default max tokens for generation (applies to all prompts unless they specify their own). | 1024 |
| --num-prompts | integer range (1 and above) | Limit the number of prompts to use from a suite; if omitted, all prompts are used. | None |
| --help | boolean | Show this message and exit. | False |

compare

Compare two quantizations of the same model.

MODEL_A and MODEL_B are model specs — HuggingFace repos, Ollama tags, or local GGUF paths. The backend is auto-detected from the identifier, or you can use an explicit prefix (ollama:, mlx:, gguf:, vllm-mlx:).

Examples:

# Two MLX quants
infer-check compare \
  mlx-community/Llama-3.1-8B-Instruct-4bit \
  mlx-community/Llama-3.1-8B-Instruct-8bit

# MLX native vs Ollama GGUF
infer-check compare \
  mlx-community/Llama-3.1-8B-Instruct-4bit \
  ollama:llama3.1:8b-instruct-q4_K_M

# Bartowski GGUF vs Unsloth GGUF (both via Ollama)
infer-check compare \
  ollama:bartowski/Llama-3.1-8B-Instruct-GGUF \
  ollama:unsloth/Llama-3.1-8B-Instruct-GGUF

Usage:

infer-check compare [OPTIONS] MODEL_A MODEL_B

Options:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| --prompts | text | Bundled suite name (e.g. 'reasoning') or path to a .jsonl file. | adversarial-numerics |
| --output | path | Output directory. | ./results/compare/ |
| --base-url | text | Base URL override for HTTP backends. Applied to both models unless they resolve to mlx-lm. | None |
| --label-a | text | Custom label for model A (defaults to auto-derived short name). | None |
| --label-b | text | Custom label for model B (defaults to auto-derived short name). | None |
| --report / --no-report | boolean | Generate an HTML comparison report after the run. | True |
| --max-tokens | integer range (1 and above) | Override default max tokens for generation. | None |
| --num-prompts | integer range (1 and above) | Limit number of prompts to use. | None |
| --disable-thinking / --enable-thinking | boolean | Suppress reasoning/thinking mode on models that support it (Qwen3, DeepSeek-R1, Ollama think, vLLM chat_template_kwargs, OpenAI/OpenRouter reasoning). Models without a thinking mode are unaffected. Defaults to disabled so outputs are directly comparable across runs; pass --enable-thinking to restore it. | True |
| --chat / --no-chat | boolean | Use /v1/chat/completions for HTTP backends (applies chat template server-side). Pass --no-chat to use raw /v1/completions instead. Ignored for mlx-lm. | True |
| --help | boolean | Show this message and exit. | False |

determinism

Test whether a backend produces identical outputs across repeated runs at temperature=0.

Usage:

infer-check determinism [OPTIONS]

Options:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| --model | text | Model ID or HuggingFace path. | required |
| --backend | text | Backend type (auto-detected if omitted). | None |
| --prompts | text | Bundled suite name (e.g. 'reasoning') or path to a .jsonl file. | required |
| --output | path | Output directory. | ./results/determinism/ |
| --runs | integer | Number of runs per prompt. | 100 |
| --base-url | text | Base URL for HTTP backends. | None |
| --max-tokens | integer range (1 and above) | Override default max tokens for generation. | None |
| --num-prompts | integer range (1 and above) | Limit number of prompts to use. | None |
| --disable-thinking / --enable-thinking | boolean | Suppress reasoning/thinking mode on models that support it (Qwen3, DeepSeek-R1, Ollama think, vLLM chat_template_kwargs, OpenAI/OpenRouter reasoning). Models without a thinking mode are unaffected. Defaults to disabled so outputs are directly comparable across runs; pass --enable-thinking to restore it. | True |
| --chat / --no-chat | boolean | Use /v1/chat/completions for HTTP backends (applies chat template server-side). Pass --no-chat to use raw /v1/completions instead. Ignored for mlx-lm. | True |
| --help | boolean | Show this message and exit. | False |

diff

Compare outputs across different backends for the same model and prompts.

Usage:

infer-check diff [OPTIONS]

Options:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| --model | text | Model ID or HuggingFace path. | required |
| --backends | text | Comma-separated backend names, e.g. 'mlx-lm,llama-cpp'. First is baseline. | required |
| --prompts | text | Bundled suite name (e.g. 'reasoning') or path to a .jsonl file. | required |
| --output | path | Output directory. | ./results/diff/ |
| --quant | text | Quantization level applied to all backends. | None |
| --base-urls | text | Comma-separated base URLs for HTTP backends (positionally matched to --backends). | None |
| --max-tokens | integer range (1 and above) | Override default max tokens for generation. | None |
| --num-prompts | integer range (1 and above) | Limit number of prompts to use. | None |
| --disable-thinking / --enable-thinking | boolean | Suppress reasoning/thinking mode on models that support it (Qwen3, DeepSeek-R1, Ollama think, vLLM chat_template_kwargs, OpenAI/OpenRouter reasoning). Models without a thinking mode are unaffected. Defaults to disabled so outputs are directly comparable across runs; pass --enable-thinking to restore it. | True |
| --chat / --no-chat | boolean | Use /v1/chat/completions for HTTP backends (applies chat template server-side). Pass --no-chat to use raw /v1/completions instead. Ignored for mlx-lm. | True |
| --help | boolean | Show this message and exit. | False |
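Example: a minimal invocation, assuming mlx-lm as the baseline backend and llama-cpp as the comparison target (the pairing named in the --backends description above); the model path is illustrative, and whether a backend also needs a --base-urls entry depends on how it is served:

infer-check diff \
  --model mlx-community/Llama-3.1-8B-Instruct-4bit \
  --backends mlx-lm,llama-cpp \
  --prompts reasoning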

report

Generate a report from previously saved result JSON files.

Usage:

infer-check report [OPTIONS] RESULTS_DIR

Options:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| --format | choice (html &#124; json) | Output format. | html |
| --output | path | Output file path (defaults to RESULTS_DIR/report.html or report.json). | None |
| --help | boolean | Show this message and exit. | False |
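Example: regenerate an HTML report from the JSON files of an earlier determinism run (the path matches that command's default --output directory):

infer-check report ./results/determinism/ --format html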

stress

Stress-test a backend with varying concurrency levels.

Usage:

infer-check stress [OPTIONS]

Options:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| --model | text | Model ID or HuggingFace path. | required |
| --backend | text | Backend type (auto-detected if omitted). | None |
| --prompts | text | Bundled suite name (e.g. 'reasoning') or path to a .jsonl file. | required |
| --output | path | Output directory. | ./results/stress/ |
| --concurrency | text | Comma-separated concurrency levels. | 1,2,4,8,16 |
| --base-url | text | Base URL for HTTP backends. | None |
| --max-tokens | integer range (1 and above) | Override default max tokens for generation. | None |
| --num-prompts | integer range (1 and above) | Limit number of prompts to use. | None |
| --disable-thinking / --enable-thinking | boolean | Suppress reasoning/thinking mode on models that support it (Qwen3, DeepSeek-R1, Ollama think, vLLM chat_template_kwargs, OpenAI/OpenRouter reasoning). Models without a thinking mode are unaffected. Defaults to disabled so outputs are directly comparable across runs; pass --enable-thinking to restore it. | True |
| --chat / --no-chat | boolean | Use /v1/chat/completions for HTTP backends (applies chat template server-side). Pass --no-chat to use raw /v1/completions instead. Ignored for mlx-lm. | True |
| --help | boolean | Show this message and exit. | False |
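Example: an illustrative run against a local mlx-lm backend with a reduced set of concurrency levels; the model path is a placeholder:

infer-check stress \
  --model mlx-community/Meta-Llama-3.1-8B-Instruct-4bit \
  --backend mlx-lm \
  --concurrency 1,4,16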

sweep

Run a quantization sweep: compare pre-quantized models against a baseline.

Each model is a separate HuggingFace repo or local path. The first model (or --baseline) is the reference; all others are compared against it.

Example:

infer-check sweep \
  --models "bf16=mlx-community/Llama-3.1-8B-Instruct-bf16,
            4bit=mlx-community/Llama-3.1-8B-Instruct-4bit,
            3bit=mlx-community/Llama-3.1-8B-Instruct-3bit" \
  --prompts reasoning

Usage:

infer-check sweep [OPTIONS]

Options:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| --models | text | Comma-separated label=model_path pairs. Example: 'bf16=mlx-community/Llama-3.1-8B-Instruct-bf16,4bit=mlx-community/Llama-3.1-8B-Instruct-4bit' | required |
| --backend | text | Backend type (auto-detected if omitted). | None |
| --prompts | text | Bundled suite name (e.g. 'reasoning') or path to a .jsonl file. | required |
| --output | path | Output directory. | ./results/sweep/ |
| --baseline | text | Baseline label (defaults to first in --models). | None |
| --base-url | text | Base URL for HTTP backends. | None |
| --max-tokens | integer range (1 and above) | Override default max tokens for generation. | None |
| --num-prompts | integer range (1 and above) | Limit number of prompts to use. | None |
| --disable-thinking / --enable-thinking | boolean | Suppress reasoning/thinking mode on models that support it (Qwen3, DeepSeek-R1, Ollama think, vLLM chat_template_kwargs, OpenAI/OpenRouter reasoning). Models without a thinking mode are unaffected. Defaults to disabled so outputs are directly comparable across runs; pass --enable-thinking to restore it. | True |
| --chat / --no-chat | boolean | Use /v1/chat/completions for HTTP backends (applies chat template server-side). Pass --no-chat to use raw /v1/completions instead. Ignored for mlx-lm. | True |
| --help | boolean | Show this message and exit. | False |

How it works

  1. Force temperature=0 -- all prompts are run at temperature=0 to ensure deterministic (greedy) sampling.
  2. Repeat runs -- each prompt is sent to the backend N times (default 100).
  3. Count identical outputs -- count how many runs produced exactly the same text as the most common output.
  4. Find divergence positions -- for each pair of non-identical outputs, identify the first token position where they diverge.
  5. Compute the score -- determinism score = identical_count / num_runs (1.0 = perfect); steps 3-5 are sketched in code below.
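
The scoring in steps 3-5 reduces to a few lines. A minimal Python sketch (not the actual implementation; the function names are illustrative):

from collections import Counter

def determinism_score(outputs: list[str]) -> float:
    # Fraction of runs that match the most common output (1.0 = perfect).
    identical_count = Counter(outputs).most_common(1)[0][1]
    return identical_count / len(outputs)

def first_divergence(tokens_a: list[str], tokens_b: list[str]) -> int | None:
    # First token index where two tokenized outputs differ; None if identical.
    for i, (a, b) in enumerate(zip(tokens_a, tokens_b)):
        if a != b:
            return i
    if len(tokens_a) != len(tokens_b):
        # One output is a strict prefix of the other: they diverge where it ends.
        return min(len(tokens_a), len(tokens_b))
    return None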

Example

infer-check determinism \
  --model mlx-community/Meta-Llama-3.1-8B-Instruct-4bit \
  --backend mlx-lm \
  --prompts determinism \
  --runs 20 \
  --output ./results/determinism/

Output:

                          Determinism Summary
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┓
┃ prompt_id                            ┃ runs ┃ identical ┃ determinism_score ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━┩
│ d1a2b3c4-...                         │   20 │        20 │           100.00% │
│ e5f6a7b8-...                         │   20 │        20 │           100.00% │
│ ...                                  │   20 │        20 │           100.00% │
└──────────────────────────────────────┴──────┴───────────┴───────────────────┘

Overall determinism score: 100.00%

What non-determinism means

A determinism score below 100% indicates that the backend is not producing consistent output at temperature=0. Common causes:

  • Floating-point non-determinism in GPU kernels (different thread scheduling leads to different rounding; see the snippet after this list)
  • KV cache bugs that accumulate errors across requests
  • Batching interference where concurrent requests affect each other's outputs
  • Buggy sampling implementations that don't properly handle temperature=0
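
The first cause is easy to demonstrate from first principles: floating-point addition is not associative, so a parallel reduction that sums the same partial results in a different order can round to a slightly different logit, and at temperature=0 a one-bit logit difference can flip an argmax and send the rest of the generation down a different path. In plain Python:

a, b, c = 0.1, 0.2, 0.3
print((a + b) + c)                   # 0.6000000000000001
print(a + (b + c))                   # 0.6
print((a + b) + c == a + (b + c))    # False: same inputs, different sums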

Warning

Non-determinism at temperature=0 is always a bug in the inference engine, not a property of the model. A correct implementation must produce identical output for identical inputs.

Output format

Results are saved as a JSON array of DeterminismResult objects, each containing:

  • prompt_id -- reference to the prompt
  • num_runs -- total number of runs
  • identical_count -- how many runs matched the most common output
  • divergence_positions -- token indices where any pair of runs diverged
  • determinism_score -- identical_count / num_runs
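
For illustration, a fully deterministic single-prompt run might serialize like this (field names as listed above; the values are invented):

[
  {
    "prompt_id": "d1a2b3c4-...",
    "num_runs": 20,
    "identical_count": 20,
    "divergence_positions": [],
    "determinism_score": 1.0
  }
]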