# Interpreting Results
This guide explains the metrics infer-check computes and how to interpret them.
## Severity tiers
Every per-prompt comparison is classified into one of four severity tiers based on text similarity:
| Severity | Similarity range | Meaning |
|---|---|---|
| identical | 1.0 | Outputs are character-for-character the same. |
| minor | >= 0.8 | Small wording differences. The answer is functionally the same. |
| moderate | >= 0.5 | Significant differences, but some overlap. May or may not affect correctness. |
| severe | < 0.5 | Outputs are fundamentally different. Likely a correctness failure. |
Text similarity is computed with Python's `difflib.SequenceMatcher`, which measures the ratio of matching characters between the two outputs.
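For intuition, here is a minimal sketch of the computation and the tier mapping (the helper name is illustrative, not part of infer-check's API):

```python
from difflib import SequenceMatcher

def classify_severity(baseline: str, test: str) -> tuple[float, str]:
    """Compute text similarity and map it to a severity tier (thresholds from the table above)."""
    ratio = SequenceMatcher(None, baseline, test).ratio()
    if ratio == 1.0:
        return ratio, "identical"
    if ratio >= 0.8:
        return ratio, "minor"
    if ratio >= 0.5:
        return ratio, "moderate"
    return ratio, "severe"

print(classify_severity("The answer is 42.", "The answer is 43."))  # -> (0.94..., 'minor')
```

Note that a one-character change in a long response barely moves the ratio, which is exactly why flip rate (below) exists as a complementary metric.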
## Flip rate
The flip rate is the fraction of prompts where the functional answer changed between the two models or backends. Unlike text similarity (which measures surface-level text overlap), flip rate uses answer extraction to determine whether the actual answer is different.
### Answer extraction strategies
infer-check extracts the functional answer from each response based on the prompt's category:
| Category | Strategy | What it extracts |
|---|---|---|
| Numeric prompts | `numeric` | Last number in the response (integers, decimals, scientific notation) |
| Boolean prompts | `boolean` | Yes/no with negation detection |
| Code prompts | `code` | Fenced code blocks with whitespace normalization |
| JSON prompts | `json` | Parsed and canonicalized JSON |
| Everything else | `raw` | Full lowercased text |
A "flip" occurs when the extracted answers from two models don't match. For example:
- Model A answers "42", Model B answers "43" --> flipped (numeric)
- Model A answers "Yes", Model B answers "No" --> flipped (boolean)
- Model A and B give the same code but different commentary --> not flipped (code blocks match)
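As a hedged sketch of the numeric strategy (the regex and helper names are illustrative, not infer-check's actual extractor), extraction can grab the last number in each response and compare:

```python
import re

# Integers, decimals, and scientific notation, e.g. -3, 4.2, 1.5e-8.
NUMBER_RE = re.compile(r"-?\d+(?:\.\d+)?(?:[eE][+-]?\d+)?")

def extract_numeric(response: str) -> str | None:
    """Return the last number mentioned in the response, or None."""
    matches = NUMBER_RE.findall(response)
    return matches[-1] if matches else None

def is_flip(response_a: str, response_b: str) -> bool:
    """A flip means the extracted answers differ (a real extractor would also
    normalize, e.g. treat '42' and '42.0' as equal)."""
    return extract_numeric(response_a) != extract_numeric(response_b)

print(is_flip("6 * 7 is 42", "6 * 7 equals 43"))  # True -- the answers flipped
```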
### Flip rate vs severity
These metrics capture different things:
- Severity measures how different the full text outputs are
- Flip rate measures whether the functional answer changed
A response can have "severe" text divergence but no flip (e.g., different reasoning paths reaching the same answer), or "minor" text divergence with a flip (e.g., nearly identical text except the final number is wrong).
Flip rate is generally the more actionable metric for assessing correctness.
## KL divergence
KL divergence (Kullback-Leibler divergence) measures how different the token probability distributions are between two backends. It's computed as KL(baseline || test) -- how much information is lost when using the test distribution to approximate the baseline.
| KL divergence | Interpretation |
|---|---|
| 0.0 | Identical distributions |
| < 0.01 | Very similar -- negligible difference |
| 0.01 - 0.1 | Small differences in token probabilities |
| 0.1 - 1.0 | Moderate divergence -- different confidence levels |
| > 1.0 | Large divergence -- fundamentally different predictions |
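For intuition, here is a minimal sketch of KL(baseline || test) at a single token position, assuming both backends expose per-token probabilities as dicts (the data layout here is an assumption, not infer-check's format):

```python
import math

def kl_divergence(baseline: dict[str, float], test: dict[str, float]) -> float:
    """KL(P || Q) = sum over tokens t of P(t) * log(P(t) / Q(t))."""
    eps = 1e-12  # floor for tokens the test backend assigns ~zero probability
    return sum(
        p * math.log(p / max(test.get(token, 0.0), eps))
        for token, p in baseline.items()
        if p > 0.0
    )

baseline = {"yes": 0.9, "no": 0.1}
test = {"yes": 0.6, "no": 0.4}
print(kl_divergence(baseline, test))  # ≈ 0.226 -- moderate divergence per the table
```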
> **Note:** KL divergence is only available when both backends provide logprobs or token probability distributions. Not all backends support this. When unavailable, the field is `null` in the output.
## Text similarity
A 0-1 score from `difflib.SequenceMatcher` measuring character-level overlap. Used to classify severity tiers.
| Score | Interpretation |
|---|---|
| 1.0 | Identical output |
| 0.9+ | Very similar -- minor rewording |
| 0.7-0.9 | Moderately similar -- different phrasing, same general content |
| 0.5-0.7 | Partially similar -- some shared content |
| < 0.5 | Mostly different -- classified as "severe" |
## Token divergence index
The index of the first token where the baseline and test outputs diverge. A low index (e.g., 0-5) means the outputs diverge early and are likely completely different. A high index means the outputs share a common prefix before diverging.
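A sketch of how such an index can be computed, assuming outputs are available as lists of token strings:

```python
from itertools import zip_longest

def token_divergence_index(baseline: list[str], test: list[str]) -> int | None:
    """Index of the first differing token, or None if the outputs are identical."""
    for i, (a, b) in enumerate(zip_longest(baseline, test)):
        if a != b:
            return i
    return None

print(token_divergence_index(["The", " answer", " is", " 42"],
                             ["The", " answer", " is", " 43"]))  # 3 -- shared prefix of 3 tokens
```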
## Determinism score
For determinism tests, the score is:
- 1.0 (100%) -- all runs produced identical output. The backend is deterministic.
- < 1.0 -- some runs produced different output at temperature=0. This is a bug.
The `divergence_positions` field lists the token indices where pairs of runs first diverged, helping locate where non-determinism creeps in.
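A minimal sketch of one plausible way to compute both values, assuming each run's output is a list of tokens (this compares each run to the first run rather than all pairs -- a simplification):

```python
def determinism_score(runs: list[list[str]]) -> tuple[float, list[int]]:
    """Fraction of runs identical to the first run, plus first-divergence indices."""
    baseline = runs[0]
    identical = sum(run == baseline for run in runs)
    positions = []
    for run in runs[1:]:
        if run != baseline:
            # First index where this run departs from the baseline run.
            common = min(len(run), len(baseline))
            idx = next((i for i in range(common) if run[i] != baseline[i]), common)
            positions.append(idx)
    return identical / len(runs), sorted(set(positions))

runs = [["A", "B", "C"], ["A", "B", "C"], ["A", "B", "D"]]
print(determinism_score(runs))  # (0.666..., [2])
```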
## Output consistency (stress tests)
For stress tests, output consistency is the fraction of outputs at a given concurrency level that exactly match the baseline, where the baseline is the output from concurrency=1. This measures whether increasing concurrency changes the outputs.
- 100% -- concurrent requests don't affect output. The backend is correct under load.
- < 100% -- some outputs changed under concurrency. Investigate KV cache correctness and batch-dependent computation.
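As a worked example: if 17 of 20 prompts at concurrency=8 reproduce their concurrency=1 output exactly, consistency at that level is 17/20 = 85%.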
## Per-category stats
The compare command breaks down results by prompt category (as defined in the prompt suite). Each category gets:
| Stat | Description |
|---|---|
| `count` | Number of prompts in this category |
| `flip_rate` | Fraction of prompts with answer flips |
| `mean_similarity` | Average text similarity |
This helps identify which task types are most affected by quantization or backend differences. Numerical tasks typically show the highest degradation.
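For illustration, a sketch of this aggregation, assuming one result record per prompt with `category`, `flipped`, and `similarity` fields (the record shape is an assumption, not infer-check's schema):

```python
from collections import defaultdict

def per_category_stats(records: list[dict]) -> dict[str, dict[str, float]]:
    """Aggregate count, flip_rate, and mean_similarity for each prompt category."""
    groups: dict[str, list[dict]] = defaultdict(list)
    for record in records:
        groups[record["category"]].append(record)
    return {
        category: {
            "count": len(rows),
            "flip_rate": sum(r["flipped"] for r in rows) / len(rows),
            "mean_similarity": sum(r["similarity"] for r in rows) / len(rows),
        }
        for category, rows in groups.items()
    }
```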
## Reading the summary tables
### Sweep table
- Self-check row: The baseline compared against itself. Should be 100% identical. If not, your baseline isn't deterministic and all other comparisons are unreliable.
- Test rows: Each quantization level compared against the baseline. More severe divergences = more correctness degradation.
### Compare table
Look at flip rate first -- it's the most direct measure of correctness. Then check severity tiers for the distribution of divergence. The flipped prompts detail table shows exactly which prompts broke and what the answers changed to.
### Diff table
Failures indicate the backend returned errors. Flip rate and mean similarity show whether the serving layer changes outputs. Ideally, a diff test shows 0 failures, 0% flip rate, and 1.0 similarity.
### Stress table
Look for errors and consistency drops at higher concurrency levels. A sudden drop at a specific concurrency level often indicates a buffer overflow or cache corruption bug in the backend.