Own Your Stack/The Substrate/hybrid
Own your inference
hybrid
Answer the easy majority on a small model you run — escalate only the genuinely hard queries to the frontier.
Most of what you ask an LLM is easy. Facts, rewrites, plain Q&A — a small model running on your own machine nails those, free and private. The rare hard query — a proof, real code, multi-step reasoning — is where you actually want frontier quality. Paying the frontier rate for the whole stream, easy and hard alike, is the part nobody questions.
hybrid is the router that splits the stream. Every query passes a deterministic first pass — category rules escalate the domains a small model is known to fail and keep open-ended tasks local — and everything else runs a local self-consistency check: the small model answers a few times, and unanimity keeps it local while disagreement escalates. The frontier key only ever leaves your machine on a query that earns it.
# every query takes one of three paths query → router ─┬─ rule: hard category ─▶ ESCALATE ├─ rule: open-ended ─▶ LOCAL └─ verify: self-consistency unanimous ─▶ LOCAL else ─▶ ESCALATE
Answers the easy majority locally
Facts, definitions, simple arithmetic, and short Q&A run on a small local model through Ollama — qwen2.5:3b by default. These are the queries a 3B model gets right, so they never leave the box: free, private, and answered without a frontier call.
Escalates the known-hard domains by rule
Categories a small model is known to fail — code, proofs, puzzles, powers and roots and factorials — escalate to the frontier on a deterministic rule, with no point spent trying locally first. Open-ended tasks like rewrite and summarize stay local, because there's no single right answer for the verifier to check against.
Verifies with self-consistency
For everything else, the local model answers a few times at temperature; unanimous agreement means confident and stays local, while disagreement means uncertain and escalates. It catches genuine uncertainty cheaply, without a frontier call to decide whether a frontier call is needed.
Keeps the limit visible
Self-consistency catches uncertainty but cannot catch confident wrongness — a 3B can agree with itself 3/3 on a wrong answer. hybrid is honest about that boundary rather than papering over it; --demo includes the trap that exposes it, and the build note works through why router quality is the open problem.
Drops in front of any OpenAI client
server.py exposes an OpenAI-compatible endpoint, so Cursor, Cline, or any script gets local-first routing transparently. Each reply carries an x_hybrid field — route, reason, backend, latency — so you can see which tier answered. ~160 lines of dependency-free, stdlib-only Python, MIT-licensed.
Part of The Substrate.
hybrid owns the inference decision. The Substrate is the layer every model call flows through — a router you own, local-first inference, and a cache so you don't pay for the same answer twice.
Own your inference, not your invoice.
hybrid is open source and MIT-licensed. Read the code, read the build note, run it on your own box.
View hybrid on GitHub →