
diff

Compare outputs across different backends for the same model and prompts. Catches serving-layer bugs by holding the model and quantization constant while varying the inference path.

CLI Reference

infer-check

infer-check: correctness and reliability testing for LLM inference engines.

Usage:

infer-check [OPTIONS] COMMAND [ARGS]...

Options:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| --version | boolean | Show the version and exit. | False |
| --max-tokens | integer | Default max tokens for generation (applies to all prompts unless they specify their own). | 1024 |
| --num-prompts | integer range (1 and above) | Limit the number of prompts to use from a suite; if omitted, all prompts are used. | None |
| --help | boolean | Show this message and exit. | False |

compare

Compare two quantizations of the same model.

MODEL_A and MODEL_B are model specs — HuggingFace repos, Ollama tags, or local GGUF paths. The backend is auto-detected from the identifier, or you can use an explicit prefix (ollama:, mlx:, gguf:, vllm-mlx:).
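Prefix resolution can be pictured as follows. This is a simplified illustration, not infer-check's actual detection code; the fallback heuristics (GGUF file extension, defaulting bare HuggingFace repos to MLX) are assumptions.

```python
# Sketch of model-spec resolution: an explicit "backend:" prefix wins,
# otherwise fall back on simple heuristics (assumed, not the real logic).
KNOWN_PREFIXES = ("ollama", "mlx", "gguf", "vllm-mlx")

def resolve_backend(spec: str) -> tuple[str, str]:
    """Return (backend, model_id) for a model spec."""
    prefix, sep, rest = spec.partition(":")
    if sep and prefix in KNOWN_PREFIXES:
        return prefix, rest
    if spec.endswith(".gguf"):       # local GGUF path (assumed heuristic)
        return "gguf", spec
    return "mlx", spec               # default: treat as a HF repo (assumed)

print(resolve_backend("ollama:llama3.1:8b-instruct-q4_K_M"))
# ('ollama', 'llama3.1:8b-instruct-q4_K_M')
```

Note that only the first colon separates the backend prefix, so Ollama tags containing colons pass through intact.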

Examples:

# Two MLX quants
infer-check compare \
  mlx-community/Llama-3.1-8B-Instruct-4bit \
  mlx-community/Llama-3.1-8B-Instruct-8bit

# MLX native vs Ollama GGUF
infer-check compare \
  mlx-community/Llama-3.1-8B-Instruct-4bit \
  ollama:llama3.1:8b-instruct-q4_K_M

# Bartowski GGUF vs Unsloth GGUF (both via Ollama)
infer-check compare \
  ollama:bartowski/Llama-3.1-8B-Instruct-GGUF \
  ollama:unsloth/Llama-3.1-8B-Instruct-GGUF

Usage:

infer-check compare [OPTIONS] MODEL_A MODEL_B

Options:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| --prompts | text | Bundled suite name (e.g. 'reasoning') or path to a .jsonl file. | adversarial-numerics |
| --output | path | Output directory. | ./results/compare/ |
| --base-url | text | Base URL override for HTTP backends. Applied to both models unless they resolve to mlx-lm. | None |
| --label-a | text | Custom label for model A (defaults to auto-derived short name). | None |
| --label-b | text | Custom label for model B (defaults to auto-derived short name). | None |
| --report / --no-report | boolean | Generate an HTML comparison report after the run. | True |
| --max-tokens | integer range (1 and above) | Override default max tokens for generation. | None |
| --num-prompts | integer range (1 and above) | Limit number of prompts to use. | None |
| --disable-thinking / --enable-thinking | boolean | Suppress reasoning/thinking mode on models that support it (Qwen3, DeepSeek-R1, Ollama think, vLLM chat_template_kwargs, OpenAI/OpenRouter reasoning). Models without a thinking mode are unaffected. Defaults to disabled so outputs are directly comparable across runs; pass --enable-thinking to restore it. | True |
| --chat / --no-chat | boolean | Use /v1/chat/completions for HTTP backends (applies chat template server-side). Pass --no-chat to use raw /v1/completions instead. Ignored for mlx-lm. | True |
| --help | boolean | Show this message and exit. | False |
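The schema of a custom .jsonl suite is not documented on this page. Purely as an illustration, a suite with one JSON object per line might look like the following; the field names here are hypothetical, not infer-check's actual format:

```json
{"id": "add-1", "prompt": "What is 17 * 23?", "max_tokens": 64}
{"id": "add-2", "prompt": "Compute 0.1 + 0.2 and explain the result."}
```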

determinism

Test whether a backend produces identical outputs across repeated runs at temperature=0.

Usage:

infer-check determinism [OPTIONS]

Options:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| --model | text | Model ID or HuggingFace path. | required |
| --backend | text | Backend type (auto-detected if omitted). | None |
| --prompts | text | Bundled suite name (e.g. 'reasoning') or path to a .jsonl file. | required |
| --output | path | Output directory. | ./results/determinism/ |
| --runs | integer | Number of runs per prompt. | 100 |
| --base-url | text | Base URL for HTTP backends. | None |
| --max-tokens | integer range (1 and above) | Override default max tokens for generation. | None |
| --num-prompts | integer range (1 and above) | Limit number of prompts to use. | None |
| --disable-thinking / --enable-thinking | boolean | Suppress reasoning/thinking mode on models that support it (Qwen3, DeepSeek-R1, Ollama think, vLLM chat_template_kwargs, OpenAI/OpenRouter reasoning). Models without a thinking mode are unaffected. Defaults to disabled so outputs are directly comparable across runs; pass --enable-thinking to restore it. | True |
| --chat / --no-chat | boolean | Use /v1/chat/completions for HTTP backends (applies chat template server-side). Pass --no-chat to use raw /v1/completions instead. Ignored for mlx-lm. | True |
| --help | boolean | Show this message and exit. | False |
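Conceptually, the determinism check reduces to counting distinct outputs across repeated greedy runs. A minimal sketch of that reduction (not the tool's actual scoring):

```python
from collections import Counter

def determinism_rate(outputs: list[str]) -> float:
    """Fraction of runs matching the most common output.
    1.0 means every run produced byte-identical text."""
    if not outputs:
        raise ValueError("no runs")
    (_, most_common_count), = Counter(outputs).most_common(1)
    return most_common_count / len(outputs)

# 100 runs at temperature=0, 97 byte-identical
runs = ["answer A"] * 97 + ["answer B"] * 3
print(determinism_rate(runs))  # 0.97
```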

diff

Compare outputs across different backends for the same model and prompts.

Usage:

infer-check diff [OPTIONS]

Options:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| --model | text | Model ID or HuggingFace path. | required |
| --backends | text | Comma-separated backend names, e.g. 'mlx-lm,llama-cpp'. First is baseline. | required |
| --prompts | text | Bundled suite name (e.g. 'reasoning') or path to a .jsonl file. | required |
| --output | path | Output directory. | ./results/diff/ |
| --quant | text | Quantization level applied to all backends. | None |
| --base-urls | text | Comma-separated base URLs for HTTP backends (positionally matched to --backends). | None |
| --max-tokens | integer range (1 and above) | Override default max tokens for generation. | None |
| --num-prompts | integer range (1 and above) | Limit number of prompts to use. | None |
| --disable-thinking / --enable-thinking | boolean | Suppress reasoning/thinking mode on models that support it (Qwen3, DeepSeek-R1, Ollama think, vLLM chat_template_kwargs, OpenAI/OpenRouter reasoning). Models without a thinking mode are unaffected. Defaults to disabled so outputs are directly comparable across runs; pass --enable-thinking to restore it. | True |
| --chat / --no-chat | boolean | Use /v1/chat/completions for HTTP backends (applies chat template server-side). Pass --no-chat to use raw /v1/completions instead. Ignored for mlx-lm. | True |
| --help | boolean | Show this message and exit. | False |

report

Generate a report from previously saved result JSON files.

Usage:

infer-check report [OPTIONS] RESULTS_DIR

Options:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| --format | choice (html \| json) | Output format. | html |
| --output | path | Output file path (defaults to RESULTS_DIR/report.html or RESULTS_DIR/report.json). | None |
| --help | boolean | Show this message and exit. | False |

stress

Stress-test a backend with varying concurrency levels.

Usage:

infer-check stress [OPTIONS]

Options:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| --model | text | Model ID or HuggingFace path. | required |
| --backend | text | Backend type (auto-detected if omitted). | None |
| --prompts | text | Bundled suite name (e.g. 'reasoning') or path to a .jsonl file. | required |
| --output | path | Output directory. | ./results/stress/ |
| --concurrency | text | Comma-separated concurrency levels. | 1,2,4,8,16 |
| --base-url | text | Base URL for HTTP backends. | None |
| --max-tokens | integer range (1 and above) | Override default max tokens for generation. | None |
| --num-prompts | integer range (1 and above) | Limit number of prompts to use. | None |
| --disable-thinking / --enable-thinking | boolean | Suppress reasoning/thinking mode on models that support it (Qwen3, DeepSeek-R1, Ollama think, vLLM chat_template_kwargs, OpenAI/OpenRouter reasoning). Models without a thinking mode are unaffected. Defaults to disabled so outputs are directly comparable across runs; pass --enable-thinking to restore it. | True |
| --chat / --no-chat | boolean | Use /v1/chat/completions for HTTP backends (applies chat template server-side). Pass --no-chat to use raw /v1/completions instead. Ignored for mlx-lm. | True |
| --help | boolean | Show this message and exit. | False |
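The --concurrency schedule is a comma-separated list of worker counts; at each level the tool fans the prompts out with that many requests in flight and records timings. A rough sketch of the fan-out using a thread pool (illustrative only, not infer-check's internals):

```python
from concurrent.futures import ThreadPoolExecutor

def run_level(prompts, generate, concurrency):
    """Run all prompts with at most `concurrency` requests in flight."""
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        return list(pool.map(generate, prompts))  # preserves prompt order

# Parse the schedule, then sweep through the levels with a dummy backend.
levels = [int(x) for x in "1,2,4,8,16".split(",")]
for level in levels:
    outputs = run_level(["p1", "p2", "p3"], str.upper, level)
```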

sweep

Run a quantization sweep: compare pre-quantized models against a baseline.

Each model is a separate HuggingFace repo or local path. The first model (or --baseline) is the reference; all others are compared against it.

Example:

infer-check sweep \
  --models "bf16=mlx-community/Llama-3.1-8B-Instruct-bf16,
            4bit=mlx-community/Llama-3.1-8B-Instruct-4bit,
            3bit=mlx-community/Llama-3.1-8B-Instruct-3bit" \
  --prompts reasoning

Usage:

infer-check sweep [OPTIONS]

Options:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| --models | text | Comma-separated label=model_path pairs. Example: 'bf16=mlx-community/Llama-3.1-8B-Instruct-bf16,4bit=mlx-community/Llama-3.1-8B-Instruct-4bit' | required |
| --backend | text | Backend type (auto-detected if omitted). | None |
| --prompts | text | Bundled suite name (e.g. 'reasoning') or path to a .jsonl file. | required |
| --output | path | Output directory. | ./results/sweep/ |
| --baseline | text | Baseline label (defaults to first in --models). | None |
| --base-url | text | Base URL for HTTP backends. | None |
| --max-tokens | integer range (1 and above) | Override default max tokens for generation. | None |
| --num-prompts | integer range (1 and above) | Limit number of prompts to use. | None |
| --disable-thinking / --enable-thinking | boolean | Suppress reasoning/thinking mode on models that support it (Qwen3, DeepSeek-R1, Ollama think, vLLM chat_template_kwargs, OpenAI/OpenRouter reasoning). Models without a thinking mode are unaffected. Defaults to disabled so outputs are directly comparable across runs; pass --enable-thinking to restore it. | True |
| --chat / --no-chat | boolean | Use /v1/chat/completions for HTTP backends (applies chat template server-side). Pass --no-chat to use raw /v1/completions instead. Ignored for mlx-lm. | True |
| --help | boolean | Show this message and exit. | False |
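The --models value is a comma-separated list of label=model_path pairs, and the first label is the default baseline. A sketch of how such a spec could be parsed; the tool's real parsing may differ (e.g. in how it handles whitespace around entries):

```python
def parse_models(spec: str) -> dict[str, str]:
    """Parse 'label=path,label=path' into an ordered label -> path mapping."""
    models = {}
    for entry in spec.split(","):
        label, sep, path = entry.strip().partition("=")
        if not sep or not label or not path:
            raise ValueError(f"bad --models entry: {entry!r}")
        models[label] = path
    return models

spec = ("bf16=mlx-community/Llama-3.1-8B-Instruct-bf16,"
        "4bit=mlx-community/Llama-3.1-8B-Instruct-4bit")
models = parse_models(spec)
baseline = next(iter(models))  # first label is the default baseline -> 'bf16'
```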

How it works

  1. Build backends -- creates a backend instance for each entry in --backends, using the shared --model and optional --quant.
  2. Baseline pass -- generates outputs for all prompts using the first backend.
  3. Test passes -- generates outputs for all prompts using each remaining backend.
  4. Compare -- each test backend's outputs are compared against the baseline, producing per-prompt ComparisonResult objects with severity, text similarity, and flip metadata.
  5. Summary table -- groups results by test backend and displays failure rate, flip rate, and mean similarity.
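The steps above can be sketched end-to-end. Here difflib's ratio stands in for the text-similarity metric, and 0.5 for the failure threshold (the threshold matches the "Output format" section below; the metric itself and the callable-based backend interface are assumptions for illustration):

```python
from difflib import SequenceMatcher

def compare(baseline: str, test: str) -> dict:
    """Per-prompt comparison: similarity in [0, 1], failure below 0.5."""
    similarity = SequenceMatcher(None, baseline, test).ratio()
    return {"text_similarity": similarity, "is_failure": similarity < 0.5}

def diff(prompts, backends):
    """backends: mapping name -> generate(prompt) callable; first is baseline."""
    names = list(backends)
    base_out = [backends[names[0]](p) for p in prompts]   # baseline pass
    results = {}
    for name in names[1:]:                                # test passes
        test_out = [backends[name](p) for p in prompts]
        results[name] = [compare(b, t) for b, t in zip(base_out, test_out)]
    return results
```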

Base URL matching

The --base-urls option is positionally matched to --backends. Use an empty entry for backends that don't need a URL (e.g., mlx-lm):

--backends "mlx-lm,openai-compat" \
--base-urls ",http://127.0.0.1:8000"

This gives mlx-lm no URL (local inference) and openai-compat the vllm-mlx server URL.

Examples

mlx-lm vs vllm-mlx serving layer:

# Start vllm-mlx in another terminal:
# vllm-mlx serve mlx-community/Meta-Llama-3.1-8B-Instruct-4bit --port 8000

infer-check diff \
  --model mlx-community/Meta-Llama-3.1-8B-Instruct-4bit \
  --backends "mlx-lm,openai-compat" \
  --base-urls ",http://127.0.0.1:8000" \
  --prompts reasoning \
  --output ./results/diff/

With raw completions endpoint (no chat template):

infer-check diff \
  --model mlx-community/Meta-Llama-3.1-8B-Instruct-4bit \
  --backends "mlx-lm,openai-compat" \
  --base-urls ",http://127.0.0.1:8000" \
  --prompts reasoning \
  --no-chat

Output

                              Diff Summary
┏━━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┓
┃ test_backend  ┃ failures ┃ failure_rate ┃ flip_rate ┃ mean_similarity ┃
┡━━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━┩
│ openai-compat │        0 │        0.00% │      0.0% │          1.0000 │
└───────────────┴──────────┴──────────────┴───────────┴─────────────────┘

A mean similarity of 1.0 with a 0% flip rate means the serving layer introduces zero divergence -- any output differences in production come from quantization, not the backend itself.

Output format

Results are saved as a JSON array of ComparisonResult objects, each containing:

  • baseline / test -- the InferenceResult from each backend
  • kl_divergence -- KL(baseline || test) if logprobs are available
  • token_divergence_index -- first token where the outputs differ
  • text_similarity -- 0-1 similarity score
  • is_failure -- true if similarity < 0.5
  • metadata -- includes severity, flip status, answers, extraction strategy
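Given that schema, a downstream script can recompute the summary figures from a saved results file. A sketch, assuming the file is a plain JSON array carrying the fields listed above:

```python
import json

def summarize(path: str) -> dict:
    """Recompute failure count/rate and mean similarity from saved results."""
    with open(path) as f:
        results = json.load(f)  # array of ComparisonResult objects
    n = len(results)
    failures = sum(r["is_failure"] for r in results)
    return {
        "failures": failures,
        "failure_rate": failures / n,
        "mean_similarity": sum(r["text_similarity"] for r in results) / n,
    }
```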