# infer-check
Correctness and reliability testing for LLM inference engines.
Quantization silently breaks arithmetic. Serving layers silently alter output. KV caches silently corrupt under load. Benchmarks like lm-evaluation-harness test whether models are smart -- infer-check tests whether engines are correct.
## The problem
Every LLM inference engine has correctness bugs that benchmarks don't catch:
- KV cache NaN pollution in vLLM-Ascend permanently corrupts all subsequent requests
- FP8 KV quantization in vLLM causes repeated garbage output
- 32.5% element mismatches in SGLang's FP8 DeepGEMM kernels on Blackwell GPUs
- Batch-size-dependent output where tokens change depending on concurrent request count
These aren't model quality problems -- they're engine correctness failures. infer-check runs differential tests across backends, quantization levels, and concurrency conditions to surface them automatically.
## What it does
infer-check provides six commands for testing inference correctness:
| Command | Purpose |
|---|---|
| `sweep` | Compare pre-quantized models against a baseline |
| `compare` | Head-to-head comparison of two models or quantizations |
| `diff` | Compare outputs across different backends for the same model |
| `determinism` | Test output reproducibility at temperature=0 |
| `stress` | Test correctness under concurrent load |
| `report` | Generate HTML/JSON reports from saved results |
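Conceptually, the comparison commands boil down to generating the same prompts under two configurations and checking whether the outputs diverge. A minimal sketch of that loop, where `generate_baseline` and `generate_candidate` are hypothetical stand-ins for the two configurations, not infer-check APIs:

```python
# Hypothetical sketch of the differential idea behind `compare` / `diff`:
# `generate_baseline` and `generate_candidate` stand in for two backends or
# quantization levels under test; they are NOT infer-check APIs.
from typing import Callable, List, Tuple

def differential_compare(
    prompts: List[str],
    generate_baseline: Callable[[str], str],
    generate_candidate: Callable[[str], str],
) -> dict:
    """Generate each prompt under both configurations and count exact matches."""
    identical = 0
    divergent: List[Tuple[str, str, str]] = []
    for prompt in prompts:
        out_a = generate_baseline(prompt)
        out_b = generate_candidate(prompt)
        if out_a == out_b:
            identical += 1
        else:
            divergent.append((prompt, out_a, out_b))
    return {"identical": identical, "total": len(prompts), "divergent": divergent}
```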
## Example results
The results below come from running infer-check on Llama-3.1-8B-Instruct and Qwen3.5-4B on Apple Silicon, using mlx-lm and vllm-mlx as backends.
### Quantization sweep
4-bit quantization on Llama-3.1-8B showed clear task-dependent degradation:
```
                  Llama-3.1-8B: bf16 vs 4bit
┏━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━━━━━━━━┓
┃ prompt suite         ┃ identical ┃ severe ┃ mean_similarity ┃
┡━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━━━━━━━━┩
│ adversarial-numerics │ 0/30      │ 23/30  │ 0.311           │
│ reasoning            │ 1/50      │ 35/50  │ 0.384           │
│ code                 │ 0/49      │ 30/49  │ 0.452           │
└──────────────────────┴───────────┴────────┴─────────────────┘
```
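As a rough illustration of how per-suite summaries like the one above could be aggregated (the sequence-ratio similarity metric and the 0.5 severity threshold below are assumptions, not infer-check's actual definitions):

```python
# Illustrative aggregation only: the SequenceMatcher metric and the 0.5
# "severe" threshold are assumptions, not infer-check's actual definitions.
from difflib import SequenceMatcher
from statistics import mean

def summarize(pairs, severe_below: float = 0.5) -> dict:
    """pairs: list of (baseline_output, candidate_output) strings for one suite."""
    pairs = list(pairs)
    scores = [SequenceMatcher(None, a, b).ratio() for a, b in pairs]
    return {
        "identical": sum(1 for a, b in pairs if a == b),
        "severe": sum(1 for s in scores if s < severe_below),
        "mean_similarity": round(mean(scores), 3),
    }
```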
### Cross-backend diff
mlx-lm vs vllm-mlx at temperature=0 on Llama-3.1-8B-4bit: 50/50 identical (reasoning) and 30/30 identical (numerics). The vllm-mlx serving layer introduced zero divergence.
### Determinism
Llama-3.1-8B-4bit and Qwen3.5-4B both produced identical output on 50/50 prompts across 20 runs per prompt, using single-request mlx-lm inference at temperature=0.
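A determinism check of this shape amounts to repeated greedy generation with an exact-match comparison. A minimal sketch, where `generate_fn` is a hypothetical stand-in for a temperature=0 engine call rather than an infer-check or mlx-lm API:

```python
# Hypothetical determinism probe: `generate_fn` stands in for a temperature=0
# engine call; it is not an infer-check or mlx-lm API.
from typing import Callable

def is_deterministic(prompt: str, generate_fn: Callable[[str], str], runs: int = 20) -> bool:
    """True if every run reproduces the first run's output exactly."""
    reference = generate_fn(prompt)
    return all(generate_fn(prompt) == reference for _ in range(runs - 1))
```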
### Stress test
vllm-mlx at concurrency 1/2/4/8: zero errors, 100% output consistency at all levels. No KV cache corruption or batch-dependent divergence detected.
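A stress test along these lines collects single-request baselines first, replays the same prompts under concurrency, and checks that the answers still match. The sketch below assumes an OpenAI-compatible /v1/completions endpoint; the URL, model name, and request parameters are placeholders, not infer-check configuration:

```python
# Hypothetical concurrency probe. Assumes an OpenAI-compatible /v1/completions
# endpoint; the URL, model name, and request parameters are placeholders.
import asyncio
import httpx

BASE_URL = "http://localhost:8000/v1/completions"   # placeholder
MODEL = "mlx-community/Llama-3.1-8B-Instruct-4bit"   # placeholder

async def complete(client: httpx.AsyncClient, prompt: str) -> str:
    resp = await client.post(BASE_URL, json={
        "model": MODEL, "prompt": prompt, "max_tokens": 128, "temperature": 0,
    })
    resp.raise_for_status()
    return resp.json()["choices"][0]["text"]

async def stress(prompts: list[str], concurrency: int) -> int:
    """Count prompts whose concurrent output matches their single-request baseline."""
    async with httpx.AsyncClient(timeout=120) as client:
        baselines = [await complete(client, p) for p in prompts]  # one at a time
        sem = asyncio.Semaphore(concurrency)

        async def limited(p: str) -> str:
            async with sem:
                return await complete(client, p)

        concurrent = await asyncio.gather(*(limited(p) for p in prompts))
    return sum(1 for a, b in zip(baselines, concurrent) if a == b)

# e.g. asyncio.run(stress(["12*34=", "Name three prime numbers."], concurrency=8))
```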
## Quick start
Install infer-check, then run your first comparison:
```bash
infer-check compare \
  mlx-community/Llama-3.1-8B-Instruct-4bit \
  mlx-community/Llama-3.1-8B-Instruct-8bit \
  --prompts adversarial-numerics
```
See the Getting Started guide for a full walkthrough.