Own your inference

hybrid

Answer the easy majority on a small model you run — escalate only the genuinely hard queries to the frontier.

github → build note → MIT the substrate inference

01What it is

Most of what you ask an LLM is easy. Facts, rewrites, plain Q&A — a small model running on your own machine nails those, free and private. The rare hard query — a proof, real code, multi-step reasoning — is where you actually want frontier quality. Paying the frontier rate for the whole stream, easy and hard alike, is the part nobody questions.

hybrid is the router that splits the stream. Every query passes a deterministic first pass — category rules escalate the domains a small model is known to fail and keep open-ended tasks local — and everything else runs a local self-consistency check: the small model answers a few times, and unanimity keeps it local while disagreement escalates. The frontier key only ever leaves your machine on a query that earns it.

hybrid · route

# every query takes one of three paths
query → router ─┬─ rule: hard category   ─▶ ESCALATE
                ├─ rule: open-ended      ─▶ LOCAL
                └─ verify: self-consistency
                     unanimous ─▶ LOCAL
                     else      ─▶ ESCALATE

Fig. 1 — the routing decision, from the repo README.

02What it does

Answers the easy majority locally

Facts, definitions, simple arithmetic, and short Q&A run on a small local model through Ollama — qwen2.5:3b by default. These are the queries a 3B model gets right, so they never leave the box: free, private, and answered without a frontier call.

Escalates the known-hard domains by rule

Categories a small model is known to fail — code, proofs, puzzles, powers and roots and factorials — escalate to the frontier on a deterministic rule, with no point spent trying locally first. Open-ended tasks like rewrite and summarize stay local, because there's no single right answer for the verifier to check against.

Verifies with self-consistency

For everything else, the local model answers a few times at temperature; unanimous agreement means confident and stays local, while disagreement means uncertain and escalates. It catches genuine uncertainty cheaply, without a frontier call to decide whether a frontier call is needed.

Keeps the limit visible

Self-consistency catches uncertainty but cannot catch confident wrongness — a 3B can agree with itself 3/3 on a wrong answer. hybrid is honest about that boundary rather than papering over it; --demo includes the trap that exposes it, and the build note works through why router quality is the open problem.

Drops in front of any OpenAI client

server.py exposes an OpenAI-compatible endpoint, so Cursor, Cline, or any script gets local-first routing transparently. Each reply carries an x_hybrid field — route, reason, backend, latency — so you can see which tier answered. ~160 lines of dependency-free, stdlib-only Python, MIT-licensed.

03Where it sits

Part of The Substrate.

hybrid owns the inference decision. The Substrate is the layer every model call flows through — a router you own, local-first inference, and a cache so you don't pay for the same answer twice.

dario

Local LLM router — run on a plan, not metered tokens.

deja

A three-tier LLM cache — exact, single-flight, semantic.

Own your inference, not your invoice.

hybrid is open source and MIT-licensed. Read the code, read the build note, run it on your own box.

View hybrid on GitHub →