report
Generate a report from previously saved result JSON files. Supports HTML and JSON output formats.
CLI Reference
infer-check
infer-check: correctness and reliability testing for LLM inference engines.
Usage:
Options:
| Name | Type | Description | Default |
|---|---|---|---|
| --version | boolean | Show the version and exit. | False |
| --max-tokens | integer | Default max tokens for generation (applies to all prompts unless they specify their own). | 1024 |
| --num-prompts | integer range (1 and above) | Limit the number of prompts to use from a suite; if omitted, all prompts are used. | None |
| --help | boolean | Show this message and exit. | False |
compare
Compare two quantizations of the same model.
MODEL_A and MODEL_B are model specs — HuggingFace repos, Ollama tags, or local GGUF paths. The backend is auto-detected from the identifier, or you can use an explicit prefix (ollama:, mlx:, gguf:, vllm-mlx:).
Examples:

```shell
# Two MLX quants
infer-check compare \
  mlx-community/Llama-3.1-8B-Instruct-4bit \
  mlx-community/Llama-3.1-8B-Instruct-8bit

# MLX native vs Ollama GGUF
infer-check compare \
  mlx-community/Llama-3.1-8B-Instruct-4bit \
  ollama:llama3.1:8b-instruct-q4_K_M

# Bartowski GGUF vs Unsloth GGUF (both via Ollama)
infer-check compare \
  ollama:bartowski/Llama-3.1-8B-Instruct-GGUF \
  ollama:unsloth/Llama-3.1-8B-Instruct-GGUF
```
Usage:
Options:
| Name | Type | Description | Default |
|---|---|---|---|
| --prompts | text | Bundled suite name (e.g. 'reasoning') or path to a .jsonl file. | adversarial-numerics |
| --output | path | Output directory. | ./results/compare/ |
| --base-url | text | Base URL override for HTTP backends. Applied to both models unless they resolve to mlx-lm. | None |
| --label-a | text | Custom label for model A (defaults to auto-derived short name). | None |
| --label-b | text | Custom label for model B (defaults to auto-derived short name). | None |
| --report / --no-report | boolean | Generate an HTML comparison report after the run. | True |
| --max-tokens | integer range (1 and above) | Override default max tokens for generation. | None |
| --num-prompts | integer range (1 and above) | Limit number of prompts to use. | None |
| --disable-thinking / --enable-thinking | boolean | Suppress reasoning/thinking mode on models that support it (Qwen3, DeepSeek-R1, Ollama think, vLLM chat_template_kwargs, OpenAI/OpenRouter reasoning). Models without a thinking mode are unaffected. Defaults to disabled so outputs are directly comparable across runs; pass --enable-thinking to restore it. | True |
| --chat / --no-chat | boolean | Use /v1/chat/completions for HTTP backends (applies chat template server-side). Pass --no-chat to use raw /v1/completions instead. Ignored for mlx-lm. | True |
| --help | boolean | Show this message and exit. | False |
determinism
Test whether a backend produces identical outputs across repeated runs at temperature=0.
Usage:
Options:
| Name | Type | Description | Default |
|---|---|---|---|
| --model | text | Model ID or HuggingFace path. | required |
| --backend | text | Backend type (auto-detected if omitted). | None |
| --prompts | text | Bundled suite name (e.g. 'reasoning') or path to a .jsonl file. | required |
| --output | path | Output directory. | ./results/determinism/ |
| --runs | integer | Number of runs per prompt. | 100 |
| --base-url | text | Base URL for HTTP backends. | None |
| --max-tokens | integer range (1 and above) | Override default max tokens for generation. | None |
| --num-prompts | integer range (1 and above) | Limit number of prompts to use. | None |
| --disable-thinking / --enable-thinking | boolean | Suppress reasoning/thinking mode on models that support it (Qwen3, DeepSeek-R1, Ollama think, vLLM chat_template_kwargs, OpenAI/OpenRouter reasoning). Models without a thinking mode are unaffected. Defaults to disabled so outputs are directly comparable across runs; pass --enable-thinking to restore it. | True |
| --chat / --no-chat | boolean | Use /v1/chat/completions for HTTP backends (applies chat template server-side). Pass --no-chat to use raw /v1/completions instead. Ignored for mlx-lm. | True |
| --help | boolean | Show this message and exit. | False |
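A minimal invocation might look like the following; the model spec, suite name, and run count are illustrative, and only options documented above are used:

```shell
# Check output stability over 20 greedy runs per prompt (default is 100)
infer-check determinism \
  --model mlx-community/Llama-3.1-8B-Instruct-4bit \
  --prompts reasoning \
  --runs 20
```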
diff
Compare outputs across different backends for the same model and prompts.
Usage:
Options:
| Name | Type | Description | Default |
|---|---|---|---|
| --model | text | Model ID or HuggingFace path. | required |
| --backends | text | Comma-separated backend names, e.g. 'mlx-lm,llama-cpp'. First is baseline. | required |
| --prompts | text | Bundled suite name (e.g. 'reasoning') or path to a .jsonl file. | required |
| --output | path | Output directory. | ./results/diff/ |
| --quant | text | Quantization level applied to all backends. | None |
| --base-urls | text | Comma-separated base URLs for HTTP backends (positionally matched to --backends). | None |
| --max-tokens | integer range (1 and above) | Override default max tokens for generation. | None |
| --num-prompts | integer range (1 and above) | Limit number of prompts to use. | None |
| --disable-thinking / --enable-thinking | boolean | Suppress reasoning/thinking mode on models that support it (Qwen3, DeepSeek-R1, Ollama think, vLLM chat_template_kwargs, OpenAI/OpenRouter reasoning). Models without a thinking mode are unaffected. Defaults to disabled so outputs are directly comparable across runs; pass --enable-thinking to restore it. | True |
| --chat / --no-chat | boolean | Use /v1/chat/completions for HTTP backends (applies chat template server-side). Pass --no-chat to use raw /v1/completions instead. Ignored for mlx-lm. | True |
| --help | boolean | Show this message and exit. | False |
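A sketch of a typical invocation, using the backend pair from the --backends description above (the model spec is illustrative):

```shell
# Compare mlx-lm (baseline) against llama-cpp on the same model and prompts
infer-check diff \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --backends mlx-lm,llama-cpp \
  --prompts reasoning
```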
report
Generate a report from previously saved result JSON files.
Usage:
Options:
| Name | Type | Description | Default |
|---|---|---|---|
| --format | choice (html \| json) | Output format. | html |
| --output | path | Output file path (defaults to …). | None |
| --help | boolean | Show this message and exit. | False |
stress
Stress-test a backend with varying concurrency levels.
Usage:
Options:
| Name | Type | Description | Default |
|---|---|---|---|
| --model | text | Model ID or HuggingFace path. | required |
| --backend | text | Backend type (auto-detected if omitted). | None |
| --prompts | text | Bundled suite name (e.g. 'reasoning') or path to a .jsonl file. | required |
| --output | path | Output directory. | ./results/stress/ |
| --concurrency | text | Comma-separated concurrency levels. | 1,2,4,8,16 |
| --base-url | text | Base URL for HTTP backends. | None |
| --max-tokens | integer range (1 and above) | Override default max tokens for generation. | None |
| --num-prompts | integer range (1 and above) | Limit number of prompts to use. | None |
| --disable-thinking / --enable-thinking | boolean | Suppress reasoning/thinking mode on models that support it (Qwen3, DeepSeek-R1, Ollama think, vLLM chat_template_kwargs, OpenAI/OpenRouter reasoning). Models without a thinking mode are unaffected. Defaults to disabled so outputs are directly comparable across runs; pass --enable-thinking to restore it. | True |
| --chat / --no-chat | boolean | Use /v1/chat/completions for HTTP backends (applies chat template server-side). Pass --no-chat to use raw /v1/completions instead. Ignored for mlx-lm. | True |
| --help | boolean | Show this message and exit. | False |
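A possible invocation, using only options documented above; the model spec and server URL are illustrative:

```shell
# Ramp concurrency against a local OpenAI-compatible server
infer-check stress \
  --model mlx-community/Llama-3.1-8B-Instruct-4bit \
  --prompts reasoning \
  --concurrency 1,4,16 \
  --base-url http://localhost:8080
```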
sweep
Run a quantization sweep: compare pre-quantized models against a baseline.
Each model is a separate HuggingFace repo or local path. The first model (or --baseline) is the reference; all others are compared against it.
Example:
```shell
infer-check sweep \
  --models "bf16=mlx-community/Llama-3.1-8B-Instruct-bf16,4bit=mlx-community/Llama-3.1-8B-Instruct-4bit,3bit=mlx-community/Llama-3.1-8B-Instruct-3bit" \
  --prompts reasoning
```
Usage:
Options:
| Name | Type | Description | Default |
|---|---|---|---|
| --models | text | Comma-separated label=model_path pairs. Example: 'bf16=mlx-community/Llama-3.1-8B-Instruct-bf16,4bit=mlx-community/Llama-3.1-8B-Instruct-4bit' | required |
| --backend | text | Backend type (auto-detected if omitted). | None |
| --prompts | text | Bundled suite name (e.g. 'reasoning') or path to a .jsonl file. | required |
| --output | path | Output directory. | ./results/sweep/ |
| --baseline | text | Baseline label (defaults to first in --models). | None |
| --base-url | text | Base URL for HTTP backends. | None |
| --max-tokens | integer range (1 and above) | Override default max tokens for generation. | None |
| --num-prompts | integer range (1 and above) | Limit number of prompts to use. | None |
| --disable-thinking / --enable-thinking | boolean | Suppress reasoning/thinking mode on models that support it (Qwen3, DeepSeek-R1, Ollama think, vLLM chat_template_kwargs, OpenAI/OpenRouter reasoning). Models without a thinking mode are unaffected. Defaults to disabled so outputs are directly comparable across runs; pass --enable-thinking to restore it. | True |
| --chat / --no-chat | boolean | Use /v1/chat/completions for HTTP backends (applies chat template server-side). Pass --no-chat to use raw /v1/completions instead. Ignored for mlx-lm. | True |
| --help | boolean | Show this message and exit. | False |
How it works
- Scan -- recursively finds all `.json` files in the results directory.
- Load -- reads each file and collects all result objects. Handles both single objects and arrays. Skips files that fail to parse.
- Generate -- delegates to the format-specific exporter (HTML or JSON).
- Open -- for HTML reports, automatically opens the report in your default browser.
Examples
Generate an HTML report from all results:
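A plausible invocation, assuming the command reads from the default ./results/ directory (the Usage synopsis above is blank, so the exact arguments are not confirmed here):

```shell
infer-check report
```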
Generate a JSON report to a specific file:
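For example, using the --format and --output options documented above (the output filename here is illustrative):

```shell
infer-check report --format json --output results/summary.json
```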
Report from a specific command's results:
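Assuming the command accepts a results directory argument (not confirmed by the blank Usage synopsis above), this might look like:

```shell
# Report only on results produced by `infer-check compare`
infer-check report ./results/compare/
```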
Notes
- The report command does not have `--max-tokens` or `--num-prompts` options since it operates on previously generated results.
- Result files from any command (sweep, compare, diff, stress, determinism) can be mixed in the same directory. The report handles heterogeneous result types.
- If the HTML reporting module is not available, a minimal HTML page with raw JSON data is generated as a fallback.