Skip to content

Interpreting Results

This guide explains the metrics infer-check computes and how to interpret them.

Severity tiers

Every per-prompt comparison is classified into one of four severity tiers based on text similarity:

Severity Similarity range Meaning
identical 1.0 Outputs are character-for-character the same.
minor >= 0.8 Small wording differences. The answer is functionally the same.
moderate >= 0.5 Significant differences, but some overlap. May or may not affect correctness.
severe < 0.5 Outputs are fundamentally different. Likely a correctness failure.

Text similarity is computed using difflib.SequenceMatcher, which measures the ratio of matching characters between the two outputs.

Flip rate

The flip rate is the fraction of prompts where the functional answer changed between the two models or backends. Unlike text similarity (which measures surface-level text overlap), flip rate uses answer extraction to determine whether the actual answer is different.

Answer extraction strategies

infer-check extracts the functional answer from each response based on the prompt's category:

Category Strategy What it extracts
Numeric prompts numeric Last number in the response (integers, decimals, scientific notation)
Boolean prompts boolean Yes/no with negation detection
Code prompts code Fenced code blocks with whitespace normalization
JSON prompts json Parsed and canonicalized JSON
Everything else raw Full lowercased text

A "flip" occurs when the extracted answers from two models don't match. For example:

  • Model A answers "42", Model B answers "43" --> flipped (numeric)
  • Model A answers "Yes", Model B answers "No" --> flipped (boolean)
  • Model A and B give the same code but different commentary --> not flipped (code blocks match)

Flip rate vs severity

These metrics capture different things:

  • Severity measures how different the full text outputs are
  • Flip rate measures whether the functional answer changed

A response can have "severe" text divergence but no flip (e.g., different reasoning paths reaching the same answer), or "minor" text divergence with a flip (e.g., nearly identical text except the final number is wrong).

Flip rate is generally the more actionable metric for assessing correctness.

KL divergence

KL divergence (Kullback-Leibler divergence) measures how different the token probability distributions are between two backends. It's computed as KL(baseline || test) -- how much information is lost when using the test distribution to approximate the baseline.

KL divergence Interpretation
0.0 Identical distributions
< 0.01 Very similar -- negligible difference
0.01 - 0.1 Small differences in token probabilities
0.1 - 1.0 Moderate divergence -- different confidence levels
> 1.0 Large divergence -- fundamentally different predictions

Note

KL divergence is only available when both backends provide logprobs or token probability distributions. Not all backends support this. When unavailable, the field is null in the output.

Text similarity

A 0-1 score from difflib.SequenceMatcher measuring character-level overlap. Used to classify severity tiers.

Score Interpretation
1.0 Identical output
0.9+ Very similar -- minor rewording
0.7-0.9 Moderately similar -- different phrasing, same general content
0.5-0.7 Partially similar -- some shared content
< 0.5 Mostly different -- classified as "severe"

Token divergence index

The index of the first token where the baseline and test outputs diverge. A low index (e.g., 0-5) means the outputs diverge early and are likely completely different. A high index means the outputs share a common prefix before diverging.

Determinism score

For determinism tests, the score is:

determinism_score = identical_count / num_runs
  • 1.0 (100%) -- all runs produced identical output. The backend is deterministic.
  • < 1.0 -- some runs produced different output at temperature=0. This is a bug.

The divergence_positions field lists the token indices where pairs of runs first diverged, helping locate where non-determinism creeps in.

Output consistency (stress tests)

For stress tests, output consistency is:

output_consistency = identical_to_baseline / total_compared

Where the baseline is the output from concurrency=1. This measures whether increasing concurrency changes the outputs.

  • 100% -- concurrent requests don't affect output. The backend is correct under load.
  • < 100% -- some outputs changed under concurrency. Investigate KV cache correctness and batch-dependent computation.

Per-category stats

The compare command breaks down results by prompt category (as defined in the prompt suite). Each category gets:

Stat Description
count Number of prompts in this category
flip_rate Fraction of prompts with answer flips
mean_similarity Average text similarity

This helps identify which task types are most affected by quantization or backend differences. Numerical tasks typically show the highest degradation.

Reading the summary tables

Sweep table

┃ quant_level       ┃ identical ┃ minor ┃ moderate ┃ severe ┃ mean_similarity ┃
  • Self-check row: The baseline compared against itself. Should be 100% identical. If not, your baseline isn't deterministic and all other comparisons are unreliable.
  • Test rows: Each quantization level compared against the baseline. More severe divergences = more correctness degradation.

Compare table

┃ metric                              ┃ value ┃

Look at flip rate first -- it's the most direct measure of correctness. Then check severity tiers for the distribution of divergence. The flipped prompts detail table shows exactly which prompts broke and what the answers changed to.

Diff table

┃ test_backend ┃ failures ┃ failure_rate ┃ flip_rate ┃ mean_similarity ┃

Failures indicate the backend returned errors. Flip rate and mean similarity show whether the serving layer changes outputs. Ideally, a diff test shows 0 failures, 0% flip rate, and 1.0 similarity.

Stress table

┃ concurrency ┃ errors ┃ output_consistency ┃

Look for errors and consistency drops at higher concurrency levels. A sudden drop at a specific concurrency level often indicates a buffer overflow or cache corruption bug in the backend.