# Getting Started
## Requirements
- Python >= 3.11
- macOS with Apple Silicon (for the mlx-lm backend) or Linux
- At least one supported backend
## Installation
For Apple Silicon users who want local inference via MLX:
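A minimal sketch, assuming the package is published on PyPI under `infer-check` with an `mlx` extra for the MLX backend (the package name and extra are assumptions here):

```bash
# Assumed package name and extra; installs mlx-lm support alongside the CLI.
pip install "infer-check[mlx]"
```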
To verify the installation:
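Since `--version` is a documented global option (see below), the quickest check is:

```bash
# Prints the installed version and exits.
infer-check --version
```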
## Global options

These options apply to all commands:

| Option | Default | Description |
|---|---|---|
| `--max-tokens` | `1024` | Default max tokens for generation. Applies to all prompts unless they specify their own. |
| `--num-prompts` | all | Limit the number of prompts to use from a suite. |
| `--version` |  | Show version and exit. |
Global options go before the subcommand:
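For example, capping generation at 512 tokens and limiting the run to five prompts (this reuses the `compare` invocation introduced in the next section):

```bash
infer-check --max-tokens 512 --num-prompts 5 compare \
  mlx-community/Llama-3.1-8B-Instruct-4bit \
  mlx-community/Llama-3.1-8B-Instruct-8bit \
  --prompts adversarial-numerics
```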
## Your first test
The simplest way to start is with the compare command. It takes two model specs and runs them against a prompt suite:
```bash
infer-check compare \
  mlx-community/Llama-3.1-8B-Instruct-4bit \
  mlx-community/Llama-3.1-8B-Instruct-8bit \
  --prompts adversarial-numerics
```
This will:

- Auto-detect the backend (mlx-lm for `mlx-community/` repos)
- Load 30 adversarial-numerics prompts
- Run each prompt through both models
- Compare outputs and compute metrics (flip rate, KL divergence, text similarity)
- Display a summary table and save JSON results to `./results/compare/`
## Prompt suites

infer-check ships with curated prompt suites targeting known quantization failure modes:

| Suite | Count | Purpose |
|---|---|---|
| `reasoning` | 50 | Multi-step math and logic |
| `code` | 49 | Python, JSON, SQL generation |
| `adversarial-numerics` | 30 | IEEE 754 edge cases, overflow, precision |
| `long-context` | 10 | Tables and transcripts with recall questions |
| `quant-sensitive` | 20 | Multi-digit arithmetic, long CoT, precise syntax |
| `determinism` | 50 | High-entropy continuations for determinism testing |
All suites ship with the package -- no need to clone the repo. Use them by name:
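For example, swapping the suite in the earlier `compare` run:

```bash
infer-check compare \
  mlx-community/Llama-3.1-8B-Instruct-4bit \
  mlx-community/Llama-3.1-8B-Instruct-8bit \
  --prompts reasoning
```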
## Custom prompt suites
Create a .jsonl file with one JSON object per line:
{"id": "custom-001", "text": "What is 2^31 - 1?", "category": "math", "max_tokens": 512}
{"id": "custom-002", "text": "Write a Python function to sort a list.", "category": "code"}
| Field | Required | Default | Description |
|---|---|---|---|
| `id` | no | auto-generated UUID | Unique identifier |
| `text` | yes |  | The prompt text |
| `category` | no | `"general"` | Task category (used for per-category breakdowns) |
| `max_tokens` | no | `1024` | Max generation tokens for this prompt |
Then pass the path:
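A sketch, assuming `--prompts` accepts a file path as well as a suite name (the path is a placeholder):

```bash
infer-check compare \
  mlx-community/Llama-3.1-8B-Instruct-4bit \
  mlx-community/Llama-3.1-8B-Instruct-8bit \
  --prompts ./my-prompts.jsonl
```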
## Model resolution
The compare command auto-detects the backend from the model spec. You can also use explicit prefixes:
| Prefix | Backend | Example |
|---|---|---|
| `mlx:` | mlx-lm | `mlx:mlx-community/Llama-3.1-8B-Instruct-4bit` |
| `ollama:` | openai-compat (Ollama) | `ollama:llama3.1:8b-instruct-q4_K_M` |
| `gguf:` | llama-cpp | `gguf:/path/to/model.gguf` |
| `vllm-mlx:` | vllm-mlx | `vllm-mlx:mlx-community/Llama-3.1-8B-Instruct-4bit` |
Without a prefix, resolution follows these rules:
- Path ends in `.gguf` --> llama-cpp
- Repo starts with `mlx-community/` or contains `-mlx` --> mlx-lm
- Repo contains `gguf`, `bartowski`, `maziyarpanahi`, `mradermacher` --> llama-cpp
- Contains `:` but no `/` (Ollama tag style) --> openai-compat
- Fallback --> mlx-lm
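Since `compare` takes two model specs, a run can presumably mix auto-detection with an explicit prefix; as a sketch (the GGUF path is a placeholder):

```bash
infer-check compare \
  mlx-community/Llama-3.1-8B-Instruct-4bit \
  gguf:/path/to/Llama-3.1-8B-Instruct-Q4_K_M.gguf \
  --prompts quant-sensitive
```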
## Output and results
All commands save JSON results to their `--output` directory (defaults to `./results/<command>/`). Result files include timestamps in their filenames to avoid overwrites.
Generate an HTML report from any results directory:
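A sketch, assuming the HTML report subcommand is named `report` (the Commands reference has the authoritative name and flags), pointed at the default compare output directory:

```bash
# Assumed subcommand name; the input path is the default from the compare run above.
infer-check report ./results/compare/
```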
See Interpreting Results for details on what the metrics mean.
## Next steps
- Commands reference -- full details on every command
- Backends -- supported backends and configuration
- Interpreting Results -- understanding metrics and severity levels