report

Generate a report from previously saved result JSON files. Supports HTML and JSON output formats.

CLI Reference

infer-check

infer-check: correctness and reliability testing for LLM inference engines.

Usage:

infer-check [OPTIONS] COMMAND [ARGS]...

Options:

Name	Type	Description	Default
`--version`	boolean	Show the version and exit.	`False`
`--max-tokens`	integer	Default max tokens for generation (applies to all prompts unless they specify their own).	`1024`
`--num-prompts`	integer range (`1` and above)	Limit the number of prompts to use from a suite; if omitted, all prompts are used.	None
`--help`	boolean	Show this message and exit.	`False`

compare

Compare two quantizations of the same model.

MODEL_A and MODEL_B are model specs — HuggingFace repos, Ollama tags, or local GGUF paths. The backend is auto-detected from the identifier, or you can use an explicit prefix (ollama:, mlx:, gguf:, vllm-mlx:).

Examples:

# Two MLX quants
infer-check compare \
  mlx-community/Llama-3.1-8B-Instruct-4bit \
  mlx-community/Llama-3.1-8B-Instruct-8bit

# MLX native vs Ollama GGUF
infer-check compare \
  mlx-community/Llama-3.1-8B-Instruct-4bit \
  ollama:llama3.1:8b-instruct-q4_K_M

# Bartowski GGUF vs Unsloth GGUF (both via Ollama)
infer-check compare \
  ollama:bartowski/Llama-3.1-8B-Instruct-GGUF \
  ollama:unsloth/Llama-3.1-8B-Instruct-GGUF

Usage:

infer-check compare [OPTIONS] MODEL_A MODEL_B

Options:

Name	Type	Description	Default
`--prompts`	text	Bundled suite name (e.g. 'reasoning') or path to a .jsonl file.	`adversarial-numerics`
`--output`	path	Output directory.	`./results/compare/`
`--base-url`	text	Base URL override for HTTP backends. Applied to both models unless they resolve to mlx-lm.	None
`--label-a`	text	Custom label for model A (defaults to auto-derived short name).	None
`--label-b`	text	Custom label for model B (defaults to auto-derived short name).	None
`--report` / `--no-report`	boolean	Generate an HTML comparison report after the run.	`True`
`--max-tokens`	integer range (`1` and above)	Override default max tokens for generation.	None
`--num-prompts`	integer range (`1` and above)	Limit number of prompts to use.	None
`--disable-thinking` / `--enable-thinking`	boolean	Suppress reasoning/thinking mode on models that support it (Qwen3, DeepSeek-R1, Ollama think, vLLM chat_template_kwargs, OpenAI/OpenRouter reasoning). Models without a thinking mode are unaffected. Defaults to disabled so outputs are directly comparable across runs; pass --enable-thinking to restore it.	`True`
`--chat` / `--no-chat`	boolean	Use /v1/chat/completions for HTTP backends (applies chat template server-side). Pass --no-chat to use raw /v1/completions instead. Ignored for mlx-lm.	`True`
`--help`	boolean	Show this message and exit.	`False`
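
The explicit-prefix convention for model specs (ollama:, mlx:, gguf:, vllm-mlx:) can be illustrated with a small sketch. This is not infer-check's own resolver: the function name `split_model_spec` and the tuple it returns are hypothetical, and real auto-detection for unprefixed identifiers is left out. The point it shows is that only the first colon is treated as a separator, so Ollama tags such as llama3.1:8b-instruct-q4_K_M keep their own colon.

```python
from typing import Optional, Tuple

# Prefixes listed in the compare description; anything else is left for auto-detection.
KNOWN_PREFIXES = ("ollama", "mlx", "gguf", "vllm-mlx")


def split_model_spec(spec: str) -> Tuple[Optional[str], str]:
    """Return (explicit_backend, identifier).

    Hypothetical helper: the backend is None when the spec carries no known
    prefix, which is where auto-detection would take over.
    """
    head, sep, rest = spec.partition(":")  # split on the first colon only
    if sep and head in KNOWN_PREFIXES and rest:
        return head, rest
    return None, spec


if __name__ == "__main__":
    for spec in (
        "mlx-community/Llama-3.1-8B-Instruct-4bit",
        "ollama:llama3.1:8b-instruct-q4_K_M",
        "ollama:bartowski/Llama-3.1-8B-Instruct-GGUF",
    ):
        print(spec, "->", split_model_spec(spec))
```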

determinism

Test whether a backend produces identical outputs across repeated runs at temperature=0.

Usage:

infer-check determinism [OPTIONS]

Options:

Name	Type	Description	Default
`--model`	text	Model ID or HuggingFace path.	`Sentinel.UNSET`
`--backend`	text	Backend type (auto-detected if omitted).	None
`--prompts`	text	Bundled suite name (e.g. 'reasoning') or path to a .jsonl file.	`Sentinel.UNSET`
`--output`	path	Output directory.	`./results/determinism/`
`--runs`	integer	Number of runs per prompt.	`100`
`--base-url`	text	Base URL for HTTP backends.	None
`--max-tokens`	integer range (`1` and above)	Override default max tokens for generation.	None
`--num-prompts`	integer range (`1` and above)	Limit number of prompts to use.	None
`--disable-thinking` / `--enable-thinking`	boolean	Suppress reasoning/thinking mode on models that support it (Qwen3, DeepSeek-R1, Ollama think, vLLM chat_template_kwargs, OpenAI/OpenRouter reasoning). Models without a thinking mode are unaffected. Defaults to disabled so outputs are directly comparable across runs; pass --enable-thinking to restore it.	`True`
`--chat` / `--no-chat`	boolean	Use /v1/chat/completions for HTTP backends (applies chat template server-side). Pass --no-chat to use raw /v1/completions instead. Ignored for mlx-lm.	`True`
`--help`	boolean	Show this message and exit.	`False`
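
To make the idea of a determinism check concrete, here is a minimal sketch that summarizes the outputs of repeated runs for one prompt. It assumes the outputs have already been collected as plain strings; the function name `summarize_runs` and the fields it reports are hypothetical and do not reflect the metrics infer-check actually records.

```python
from collections import Counter
from typing import Dict, List, Union


def summarize_runs(outputs: List[str]) -> Dict[str, Union[int, float, bool]]:
    """Summarize repeated generations for a single prompt (hypothetical helper)."""
    if not outputs:
        raise ValueError("at least one output is required")
    counts = Counter(outputs)
    # Size of the largest group of byte-identical outputs.
    _, top_count = counts.most_common(1)[0]
    return {
        "runs": len(outputs),
        "distinct": len(counts),
        "deterministic": len(counts) == 1,
        "agreement": top_count / len(outputs),
    }


if __name__ == "__main__":
    print(summarize_runs(["4", "4", "4"]))
    print(summarize_runs(["4", "4", "four", "4"]))
```

A backend that is deterministic at temperature=0 would give a single distinct output for every prompt; anything else points at a source of nondeterminism worth investigating.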

diff

Compare outputs across different backends for the same model and prompts.

Usage:

infer-check diff [OPTIONS]

Options:

Name	Type	Description	Default
`--model`	text	Model ID or HuggingFace path.	`Sentinel.UNSET`
`--backends`	text	Comma-separated backend names, e.g. 'mlx-lm,llama-cpp'. First is baseline.	`Sentinel.UNSET`
`--prompts`	text	Bundled suite name (e.g. 'reasoning') or path to a .jsonl file.	`Sentinel.UNSET`
`--output`	path	Output directory.	`./results/diff/`
`--quant`	text	Quantization level applied to all backends.	None
`--base-urls`	text	Comma-separated base URLs for HTTP backends (positionally matched to --backends).	None
`--max-tokens`	integer range (`1` and above)	Override default max tokens for generation.	None
`--num-prompts`	integer range (`1` and above)	Limit number of prompts to use.	None
`--disable-thinking` / `--enable-thinking`	boolean	Suppress reasoning/thinking mode on models that support it (Qwen3, DeepSeek-R1, Ollama think, vLLM chat_template_kwargs, OpenAI/OpenRouter reasoning). Models without a thinking mode are unaffected. Defaults to disabled so outputs are directly comparable across runs; pass --enable-thinking to restore it.	`True`
`--chat` / `--no-chat`	boolean	Use /v1/chat/completions for HTTP backends (applies chat template server-side). Pass --no-chat to use raw /v1/completions instead. Ignored for mlx-lm.	`True`
`--help`	boolean	Show this message and exit.	`False`
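
The positional matching between --backends and --base-urls can be sketched as follows. This is an illustration only: the helper `pair_backends` is hypothetical, and the handling of empty or missing URL entries (treated here as "no override") is an assumption rather than documented behavior.

```python
from typing import List, Optional, Tuple


def pair_backends(backends: str, base_urls: Optional[str] = None) -> List[Tuple[str, Optional[str]]]:
    """Pair each backend name with the base URL at the same position (hypothetical helper)."""
    names = [b.strip() for b in backends.split(",") if b.strip()]
    # Assumption: an empty entry means "no base URL for this backend".
    urls: List[Optional[str]] = (
        [u.strip() or None for u in base_urls.split(",")] if base_urls else []
    )
    if len(urls) > len(names):
        raise ValueError("more base URLs than backends")
    # Assumption: trailing backends without a URL get None.
    urls += [None] * (len(names) - len(urls))
    return list(zip(names, urls))


if __name__ == "__main__":
    # The first backend in the list is the baseline.
    print(pair_backends("mlx-lm,llama-cpp", ",http://localhost:8080"))
```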

report

Generate a report from previously saved result JSON files.

Usage:

infer-check report [OPTIONS] RESULTS_DIR

Options:

Name	Type	Description	Default
`--format`	choice (`html` \| `json`)	Output format.	`html`
`--output`	path	Output file path (defaults to /report.html or report.json).	None
`--help`	boolean	Show this message and exit.	`False`

stress

Stress-test a backend with varying concurrency levels.

Usage:

infer-check stress [OPTIONS]

Options:

Name	Type	Description	Default
`--model`	text	Model ID or HuggingFace path.	`Sentinel.UNSET`
`--backend`	text	Backend type (auto-detected if omitted).	None
`--prompts`	text	Bundled suite name (e.g. 'reasoning') or path to a .jsonl file.	`Sentinel.UNSET`
`--output`	path	Output directory.	`./results/stress/`
`--concurrency`	text	Comma-separated concurrency levels.	`1,2,4,8,16`
`--base-url`	text	Base URL for HTTP backends.	None
`--max-tokens`	integer range (`1` and above)	Override default max tokens for generation.	None
`--num-prompts`	integer range (`1` and above)	Limit number of prompts to use.	None
`--disable-thinking` / `--enable-thinking`	boolean	Suppress reasoning/thinking mode on models that support it (Qwen3, DeepSeek-R1, Ollama think, vLLM chat_template_kwargs, OpenAI/OpenRouter reasoning). Models without a thinking mode are unaffected. Defaults to disabled so outputs are directly comparable across runs; pass --enable-thinking to restore it.	`True`
`--chat` / `--no-chat`	boolean	Use /v1/chat/completions for HTTP backends (applies chat template server-side). Pass --no-chat to use raw /v1/completions instead. Ignored for mlx-lm.	`True`
`--help`	boolean	Show this message and exit.	`False`
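
A rough sketch of what running the same prompts at several concurrency levels might look like is given below. It is not infer-check's implementation: `parse_levels`, `run_level`, `stress`, and the stand-in `fake_generate` coroutine are all hypothetical, and a real run would call a backend instead of upper-casing the prompt.

```python
import asyncio
from typing import Awaitable, Callable, Dict, List


def parse_levels(text: str) -> List[int]:
    """Parse a comma-separated list such as '1,2,4,8,16' into positive integers."""
    levels = [int(part) for part in text.split(",") if part.strip()]
    if not levels or any(n < 1 for n in levels):
        raise ValueError("concurrency levels must be positive integers")
    return levels


async def run_level(
    prompts: List[str],
    concurrency: int,
    generate: Callable[[str], Awaitable[str]],
) -> List[str]:
    """Run all prompts with at most `concurrency` requests in flight."""
    sem = asyncio.Semaphore(concurrency)

    async def one(prompt: str) -> str:
        async with sem:
            return await generate(prompt)

    # gather preserves the order of the prompts in its result list.
    return await asyncio.gather(*(one(p) for p in prompts))


async def fake_generate(prompt: str) -> str:
    """Stand-in for a backend call (assumption, for illustration only)."""
    await asyncio.sleep(0)
    return prompt.upper()


def stress(prompts: List[str], levels_text: str) -> Dict[int, List[str]]:
    """Collect outputs per concurrency level so they can be compared afterwards."""
    return {
        n: asyncio.run(run_level(prompts, n, fake_generate))
        for n in parse_levels(levels_text)
    }


if __name__ == "__main__":
    print(stress(["hello", "world"], "1,2,4"))
```

Comparing the per-level outputs against the concurrency-1 results is one way to spot outputs that change only under load.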

sweep

Run a quantization sweep: compare pre-quantized models against a baseline.

Each model is a separate HuggingFace repo or local path. The first model (or --baseline) is the reference; all others are compared against it.

Example:

infer-check sweep \
  --models "bf16=mlx-community/Llama-3.1-8B-Instruct-bf16,
            4bit=mlx-community/Llama-3.1-8B-Instruct-4bit,
            3bit=mlx-community/Llama-3.1-8B-Instruct-3bit" \
  --prompts reasoning

Usage:

infer-check sweep [OPTIONS]

Options:

Name	Type	Description	Default
`--models`	text	Comma-separated label=model_path pairs. Example: 'bf16=mlx-community/Llama-3.1-8B-Instruct-bf16,4bit=mlx-community/Llama-3.1-8B-Instruct-4bit'	`Sentinel.UNSET`
`--backend`	text	Backend type (auto-detected if omitted).	None
`--prompts`	text	Bundled suite name (e.g. 'reasoning') or path to a .jsonl file.	`Sentinel.UNSET`
`--output`	path	Output directory.	`./results/sweep/`
`--baseline`	text	Baseline label (defaults to first in --models).	None
`--base-url`	text	Base URL for HTTP backends.	None
`--max-tokens`	integer range (`1` and above)	Override default max tokens for generation.	None
`--num-prompts`	integer range (`1` and above)	Limit number of prompts to use.	None
`--disable-thinking` / `--enable-thinking`	boolean	Suppress reasoning/thinking mode on models that support it (Qwen3, DeepSeek-R1, Ollama think, vLLM chat_template_kwargs, OpenAI/OpenRouter reasoning). Models without a thinking mode are unaffected. Defaults to disabled so outputs are directly comparable across runs; pass --enable-thinking to restore it.	`True`
`--chat` / `--no-chat`	boolean	Use /v1/chat/completions for HTTP backends (applies chat template server-side). Pass --no-chat to use raw /v1/completions instead. Ignored for mlx-lm.	`True`
`--help`	boolean	Show this message and exit.	`False`
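
The label=model_path format accepted by --models, together with the baseline rule, can be sketched like this. The helper `parse_models` is hypothetical and its error handling is an assumption. Whitespace around each pair is stripped, which is what lets the multi-line quoted value in the example above work.

```python
from typing import Dict, List, Optional, Tuple


def parse_models(spec: str, baseline: Optional[str] = None) -> Tuple[Dict[str, str], str, List[str]]:
    """Return (label -> path, baseline label, labels compared against the baseline)."""
    pairs: Dict[str, str] = {}
    for item in spec.split(","):
        item = item.strip()  # tolerate newlines and indentation between pairs
        if not item:
            continue
        label, sep, path = item.partition("=")
        label, path = label.strip(), path.strip()
        if not sep or not label or not path:
            raise ValueError(f"expected label=model_path, got {item!r}")
        pairs[label] = path
    if not pairs:
        raise ValueError("no models given")
    # The first label is the baseline unless one is named explicitly.
    base = baseline if baseline is not None else next(iter(pairs))
    if base not in pairs:
        raise ValueError(f"unknown baseline label {base!r}")
    others = [label for label in pairs if label != base]
    return pairs, base, others


if __name__ == "__main__":
    models = """bf16=mlx-community/Llama-3.1-8B-Instruct-bf16,
                4bit=mlx-community/Llama-3.1-8B-Instruct-4bit,
                3bit=mlx-community/Llama-3.1-8B-Instruct-3bit"""
    print(parse_models(models))
```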

How it works

Scan -- recursively finds all .json files in the results directory.
Load -- reads each file and collects all result objects. Handles both single objects and arrays. Skips files that fail to parse.
Generate -- delegates to the format-specific exporter (HTML or JSON).
Open -- for HTML reports, automatically opens the report in your default browser.
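
The Scan and Load steps above can be approximated in a few lines of Python. This is a sketch under stated assumptions, not the tool's source: the function name `load_results` is hypothetical, and "skips files that fail to parse" is interpreted here as ignoring read and JSON decoding errors.

```python
import json
from pathlib import Path
from typing import Any, List, Union


def load_results(results_dir: Union[str, Path]) -> List[Any]:
    """Collect result objects from every .json file under results_dir."""
    results: List[Any] = []
    # Scan: walk the directory tree recursively.
    for path in sorted(Path(results_dir).rglob("*.json")):
        # Load: skip files that cannot be read or parsed.
        try:
            data = json.loads(path.read_text(encoding="utf-8"))
        except (OSError, ValueError):
            continue
        # A file may hold one result object or an array of them.
        if isinstance(data, list):
            results.extend(data)
        else:
            results.append(data)
    return results


if __name__ == "__main__":
    print(len(load_results("./results")), "result objects loaded")
```

The collected list would then be handed to whichever exporter matches --format.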

Examples

Generate an HTML report from all results:

infer-check report ./results/ --format html

Generate a JSON report to a specific file:

infer-check report ./results/ --format json --output ./summary.json

Report from a specific command's results:

infer-check report ./results/compare/ --format html

Notes

The report command does not have --max-tokens or --num-prompts options since it operates on previously generated results.
Result files from any command (sweep, compare, diff, stress, determinism) can be mixed in the same directory. The report handles heterogeneous result types.
If the HTML reporting module is not available, a minimal HTML page with raw JSON data is generated as a fallback.
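
As an illustration of the fallback described in the last note, a minimal page that embeds raw JSON might be produced as below. The function name `fallback_html` and the page layout are assumptions; the actual fallback output may differ.

```python
import html
import json
from typing import Any, List


def fallback_html(results: List[Any]) -> str:
    """Render results as escaped, pretty-printed JSON inside a bare HTML page."""
    # Escape the JSON so model outputs containing markup are shown, not rendered.
    payload = html.escape(json.dumps(results, indent=2, ensure_ascii=False))
    return (
        "<!DOCTYPE html>\n"
        '<html><head><meta charset="utf-8"><title>infer-check report</title></head>\n'
        f"<body><pre>{payload}</pre></body></html>\n"
    )


if __name__ == "__main__":
    print(fallback_html([{"prompt": "2+2?", "output": "4"}]))
```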