
Backends

infer-check supports four inference backends. Each backend implements a common protocol for generation, health checks, and cleanup.
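In Python terms, that protocol might look like the sketch below. The names GenerationResult, generate, health_check, and close are illustrative assumptions, not infer-check's actual interface:

from dataclasses import dataclass
from typing import Optional, Protocol

@dataclass
class GenerationResult:
    text: str
    logprobs: Optional[list[float]]  # None when the backend can't provide them

class InferenceBackend(Protocol):
    def generate(self, prompt: str, max_tokens: int) -> GenerationResult:
        """Run generation, returning text and per-token logprobs if available."""

    def health_check(self) -> bool:
        """Return True when the backend (or its server) is ready to serve."""

    def close(self) -> None:
        """Release resources: HTTP sessions, loaded models, and so on."""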

Overview

Backend         Type          Default URL               Use case
mlx-lm          In-process    (local)                   Local Apple Silicon inference with logprobs
llama-cpp       HTTP          http://127.0.0.1:8080     llama-server via /completion endpoint
vllm-mlx        HTTP          http://127.0.0.1:8000     Continuous batching on Apple Silicon
openai-compat   HTTP          http://127.0.0.1:11434    Any OpenAI-compatible server (vLLM, SGLang, Ollama)

mlx-lm

In-process inference using Apple's MLX framework. Runs directly on Apple Silicon with no server required.

Install: pip install "infer-check[mlx]" (requires mlx and mlx-lm packages)

Features:

  • Generates per-token logprobs when available via generate_step()
  • Falls back to simple generation if logprobs aren't supported
  • Lazy model loading -- the model is downloaded and loaded on first use, not at import time
  • Single-threaded sequential inference

When to use: Local testing on Mac. Best baseline for quantization sweeps since it runs natively with no serving layer overhead.

Example:

infer-check compare \
  mlx-community/Llama-3.1-8B-Instruct-bf16 \
  mlx-community/Llama-3.1-8B-Instruct-4bit
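
Under the hood, the per-token logprobs come from mlx-lm's generate_step() generator. A minimal sketch, assuming a version of mlx-lm whose generate_step yields (token, logprobs) pairs (the exact shape of this API has changed across releases):

import mlx.core as mx
from mlx_lm import load
from mlx_lm.utils import generate_step

model, tokenizer = load("mlx-community/Llama-3.1-8B-Instruct-4bit")
prompt = mx.array(tokenizer.encode("The capital of France is"))

picked = []
for (token, logprobs), _ in zip(generate_step(prompt, model), range(8)):
    # token may be an int or a 0-d array depending on the mlx-lm version;
    # logprobs is the full log-probability vector over the vocabulary.
    tok = int(token)
    picked.append((tok, float(logprobs[tok])))

print(tokenizer.decode([t for t, _ in picked]))
print(picked)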

llama.cpp

HTTP backend targeting llama-server, the HTTP server that ships with llama.cpp.

Setup: Start llama-server separately:

llama-server -m /path/to/model.gguf --port 8080

Features:

  • Uses the /completion endpoint for text generation
  • Requests top-10 token probabilities and converts them to logprobs
  • Aligns token distributions using token-ID metadata for cross-backend comparison
  • 120-second request timeout

When to use: Testing GGUF models served via llama.cpp. Good for comparing GGUF quantization formats against each other or against MLX native quantization.

Example:

infer-check determinism \
  --model my-model \
  --backend llama-cpp \
  --base-url http://127.0.0.1:8080 \
  --prompts determinism \
  --runs 20
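
Under the hood, each request asks /completion for the top-10 token probabilities and converts them to logprobs. A rough sketch, assuming llama.cpp's documented n_probs parameter and the completion_probabilities response field (names vary between llama.cpp builds, so verify against yours):

import math
import requests

resp = requests.post(
    "http://127.0.0.1:8080/completion",
    json={"prompt": "The capital of France is", "n_predict": 8, "n_probs": 10},
    timeout=120,  # mirrors the backend's 120-second request timeout
)
resp.raise_for_status()

# Each step carries the top-10 candidate tokens with probabilities;
# take log() to get comparable logprobs.
for step in resp.json().get("completion_probabilities", []):
    top = {p["tok_str"]: math.log(p["prob"]) for p in step["probs"] if p["prob"] > 0}
    print(top)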

vllm-mlx

HTTP backend for vllm-mlx, a continuous-batching inference server for Apple Silicon. Extends the OpenAI-compatible backend with model-aware health checks.

Setup: Start vllm-mlx separately:

vllm-mlx serve mlx-community/Meta-Llama-3.1-8B-Instruct-4bit --port 8000

Features:

  • Inherits all capabilities from the openai-compat backend
  • Model-aware health check verifies the expected model is loaded
  • Supports both /v1/chat/completions and /v1/completions endpoints

When to use: Testing the correctness of a continuous-batching serving layer. Ideal for the diff and stress commands, which verify that the serving layer doesn't introduce divergence.

Example:

infer-check diff \
  --model mlx-community/Meta-Llama-3.1-8B-Instruct-4bit \
  --backends "mlx-lm,vllm-mlx" \
  --base-urls ",http://127.0.0.1:8000" \
  --prompts reasoning
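
The model-aware health check can be approximated with the standard OpenAI-style /v1/models endpoint, as in this sketch (check_model_loaded is a hypothetical helper, not infer-check's API):

import requests

def check_model_loaded(base_url: str, expected_model: str) -> bool:
    # Ask the server which models it has loaded and compare against
    # the model we expect to be testing.
    resp = requests.get(f"{base_url}/v1/models", timeout=10)
    resp.raise_for_status()
    loaded = {m["id"] for m in resp.json().get("data", [])}
    return expected_model in loaded

print(check_model_loaded("http://127.0.0.1:8000",
                         "mlx-community/Meta-Llama-3.1-8B-Instruct-4bit"))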

openai-compat

Generic backend for any server that implements the OpenAI API format. Works with vLLM, SGLang, Ollama, and others.

Features:

  • Supports both /v1/chat/completions and /v1/completions endpoints
  • Requests logprobs with graceful fallback if unsupported
  • 120-second request timeout
  • Detailed error messages for connection, timeout, and HTTP errors

When to use: Any OpenAI-compatible server. This is the most flexible backend and the default for Ollama-style model tags.
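
A sketch of the logprobs request with its graceful fallback (chat_with_logprobs is a hypothetical helper, and retrying on HTTP 400 is one plausible fallback strategy, not necessarily the one infer-check uses):

import requests

def chat_with_logprobs(base_url: str, model: str, prompt: str):
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 32,
        "logprobs": True,
        "top_logprobs": 10,
    }
    resp = requests.post(f"{base_url}/v1/chat/completions", json=payload, timeout=120)
    if resp.status_code == 400:
        # Server rejected the logprobs parameters: retry without them.
        payload.pop("logprobs")
        payload.pop("top_logprobs")
        resp = requests.post(f"{base_url}/v1/chat/completions", json=payload, timeout=120)
    resp.raise_for_status()
    choice = resp.json()["choices"][0]
    # Even on success, logprobs may be absent; treat them as optional.
    return choice["message"]["content"], choice.get("logprobs")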

Default URLs by resolution:

Model source                       Default URL
Ollama tags (e.g., llama3.1:8b)    http://127.0.0.1:11434
Custom server                      Use --base-url

Example with Ollama:

infer-check compare \
  ollama:llama3.1:8b-instruct-q4_K_M \
  ollama:llama3.1:8b-instruct-q8_0

Example with custom server:

infer-check stress \
  --model my-model \
  --backend openai-compat \
  --base-url http://my-server:8000/v1 \
  --prompts reasoning \
  --concurrency 1,2,4,8

Chat vs completions

HTTP backends support two endpoint modes:

  • Chat (--chat, default) -- uses /v1/chat/completions. The server applies its chat template. Use this when the server is configured with the correct chat template for your model.
  • Completions (--no-chat) -- uses /v1/completions. Sends raw text with no template. Use this for raw text generation or when you want to control the prompt format yourself.

The --chat / --no-chat flag applies to the diff command. The compare command always uses completions mode to avoid template differences between backends.
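
In practice the difference is just the request shape, sketched here with a hypothetical model name:

# Chat mode (--chat): the server applies its own chat template.
chat_payload = {
    "model": "my-model",
    "messages": [{"role": "user", "content": "Explain KV caches."}],
}

# Completions mode (--no-chat): raw text is sent verbatim, no template.
completions_payload = {
    "model": "my-model",
    "prompt": "Explain KV caches.",
}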

Backend selection

Backends are selected in different ways depending on the command:

Command       How backend is chosen
compare       Auto-detected from each model spec
sweep         --backend flag (shared across all models) or auto-detected
diff          --backends flag (explicit list)
stress        --backend flag or auto-detected from model
determinism   --backend flag or auto-detected from model
report        N/A (operates on saved results)
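
Auto-detection follows the defaults documented in the tables above. A hedged sketch (resolve_backend is hypothetical; infer-check's real resolver may apply different rules):

def resolve_backend(spec: str) -> tuple[str, str | None]:
    if spec.startswith("ollama:"):
        # Ollama-style tags default to openai-compat on port 11434.
        return "openai-compat", "http://127.0.0.1:11434"
    if "/" in spec:
        # Assumption: hub-style paths (e.g. mlx-community/...) run in-process.
        return "mlx-lm", None
    # Otherwise fall back to an explicit --backend / --base-url.
    return "openai-compat", None

print(resolve_backend("ollama:llama3.1:8b-instruct-q4_K_M"))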