# Getting Started
## Requirements
- Python >= 3.11
- macOS with Apple Silicon (for the mlx-lm backend) or Linux
- At least one supported backend
## Installation
For Apple Silicon users who want local inference via MLX:
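A minimal sketch, assuming the package is published on PyPI under `infer-check` with an `mlx` extra for the MLX backend (the package name and extra are assumptions here):

```bash
# Assumed package name and extra; installs mlx-lm support alongside the CLI.
pip install "infer-check[mlx]"
```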
To verify the installation:
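Since `--version` is a documented global option (see below), the quickest check is:

```bash
# Prints the installed version and exits.
infer-check --version
```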
## Global options

These options apply to all commands:

| Option | Default | Description |
|---|---|---|
| `--max-tokens` | `1024` | Default max tokens for generation. Applies to all prompts unless they specify their own. |
| `--num-prompts` | all | Limit the number of prompts to use from a suite. |
| `--version` |  | Show version and exit. |
Global options go before the subcommand:
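For example, capping generation at 512 tokens and limiting the run to five prompts (this reuses the `compare` invocation introduced in the next section):

```bash
infer-check --max-tokens 512 --num-prompts 5 compare \
  mlx-community/Llama-3.1-8B-Instruct-4bit \
  mlx-community/Llama-3.1-8B-Instruct-8bit \
  --prompts adversarial-numerics
```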
## Your first test
The simplest way to start is with the compare command. It takes two model specs and runs them against a prompt suite:
```bash
infer-check compare \
  mlx-community/Llama-3.1-8B-Instruct-4bit \
  mlx-community/Llama-3.1-8B-Instruct-8bit \
  --prompts adversarial-numerics
```
This will:

- Auto-detect the backend (mlx-lm for `mlx-community/` repos)
- Load 30 adversarial-numerics prompts
- Run each prompt through both models
- Compare outputs and compute metrics (flip rate, KL divergence, text similarity)
- Display a summary table and save JSON results to `./results/compare/`
## Prompt suites

infer-check ships with curated prompt suites targeting known quantization failure modes:

| Suite | Count | Purpose |
|---|---|---|
| `reasoning` | 50 | Multi-step math and logic |
| `code` | 49 | Python, JSON, SQL generation |
| `adversarial-numerics` | 30 | IEEE 754 edge cases, overflow, precision |
| `long-context` | 10 | Tables and transcripts with recall questions |
| `quant-sensitive` | 20 | Multi-digit arithmetic, long CoT, precise syntax |
| `determinism` | 50 | High-entropy continuations for determinism testing |
All suites ship with the package -- no need to clone the repo. Use them by name:
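For example, swapping the suite in the earlier `compare` run:

```bash
infer-check compare \
  mlx-community/Llama-3.1-8B-Instruct-4bit \
  mlx-community/Llama-3.1-8B-Instruct-8bit \
  --prompts reasoning
```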
## Custom prompt suites
Create a .jsonl file with one JSON object per line:
{"id": "custom-001", "text": "What is 2^31 - 1?", "category": "math", "max_tokens": 512}
{"id": "custom-002", "text": "Write a Python function to sort a list.", "category": "code"}
| Field | Required | Default | Description |
|---|---|---|---|
| `id` | no | auto-generated UUID | Unique identifier |
| `text` | yes |  | The prompt text |
| `category` | no | `"general"` | Task category (used for per-category breakdowns) |
| `max_tokens` | no | `1024` | Max generation tokens for this prompt |
Then pass the path:
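A sketch, assuming `--prompts` accepts a file path as well as a suite name (the path is a placeholder):

```bash
infer-check compare \
  mlx-community/Llama-3.1-8B-Instruct-4bit \
  mlx-community/Llama-3.1-8B-Instruct-8bit \
  --prompts ./my-prompts.jsonl
```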
## Model resolution
The compare command auto-detects the backend from the model spec. You can also use explicit prefixes:
| Prefix | Backend | Example |
|---|---|---|
| `mlx:` | mlx-lm | `mlx:mlx-community/Llama-3.1-8B-Instruct-4bit` |
| `ollama:` | openai-compat (Ollama) | `ollama:llama3.1:8b-instruct-q4_K_M` |
| `gguf:` | llama-cpp | `gguf:/path/to/model.gguf` |
| `vllm-mlx:` | vllm-mlx | `vllm-mlx:mlx-community/Llama-3.1-8B-Instruct-4bit` |
Without a prefix, resolution follows these rules:
- Path ends in `.gguf` --> llama-cpp
- Repo starts with `mlx-community/` or contains `-mlx` --> mlx-lm
- Repo contains `gguf`, `bartowski`, `maziyarpanahi`, `mradermacher` --> llama-cpp
- Contains `:` but no `/` (Ollama tag style) --> openai-compat
- Fallback --> mlx-lm
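Since `compare` takes two model specs, a run can presumably mix auto-detection with an explicit prefix; as a sketch (the GGUF path is a placeholder):

```bash
infer-check compare \
  mlx-community/Llama-3.1-8B-Instruct-4bit \
  gguf:/path/to/Llama-3.1-8B-Instruct-Q4_K_M.gguf \
  --prompts quant-sensitive
```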
## Output and results
All commands save JSON results to their `--output` directory (defaults to `./results/<command>/`). Result files include timestamps in their filenames to avoid overwrites.
Generate an HTML report from any results directory:
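A sketch, assuming the HTML report subcommand is named `report` (the Commands reference has the authoritative name and flags), pointed at the default compare output directory:

```bash
# Assumed subcommand name; the input path is the default from the compare run above.
infer-check report ./results/compare/
```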
See Interpreting Results for details on what the metrics mean.
## Next steps
- Commands reference -- full details on every command
- Backends -- supported backends and configuration
- Interpreting Results -- understanding metrics and severity levels