Own Your Stack/The Substrate/deja
Own your LLM cache
deja
A three-tier cache — exact, single-flight, semantic — so you don't buy the same answer twice.
You pay the provider every time, even when the answer never changes. The same question, asked a hundred ways by a hundred users, bills a hundred times — and a burst of identical concurrent calls pays for the same work over and over while it's still in flight.
deja sits in front of the model and answers from your own cache when it already knows the reply. It checks three tiers in order: an exact match on the request, a single-flight coalesce that collapses duplicate in-flight calls into one upstream round-trip, and a semantic near-match for queries that mean the same thing in different words. It speaks the OpenAI and Anthropic wire APIs, so any client adopts it by changing one base URL — no SDK, no rewrite. Errors are never cached, and a miss falls straight through to the provider.
client ──▶ deja ──(miss)──▶ OpenAI / Anthropic
│
├─ L1 exact-match µs, zero risk
├─ SF single-flight coalesce dupes
└─ L2 semantic behind a guard
Exact match, in microseconds
The L1 tier keys the full request and serves a byte-identical repeat from memory or Redis — no model call, no risk of a different answer. Embeddings are cached the same way, since they're deterministic and repeat constantly. A response header (X-Cache: HIT-L1) tells the caller exactly where the answer came from.
Single-flight de-duplication
When several identical requests arrive at once, deja sends one upstream and fans the result back to every waiter. A thundering herd of the same call pays for it once — which protects your throughput and your rate limit, not just your bill.
Semantic near-match, behind a guard
The L2 tier embeds the query and serves a cached answer for a paraphrase that means the same thing. It's deliberately conservative: matches are gated by a confidence band and a verification step, and volatile queries (today, latest, price) are skipped. A shadow harness labels real traffic so you set the threshold from data, not a guess.
One base URL, any client
deja speaks the OpenAI and Anthropic wire APIs verbatim. Point an existing client at it by changing the base URL — the caller's own Authorization / x-api-key is forwarded to the provider untouched, and the provider is detected by route, so one endpoint serves both.
Per-tenant isolation and an ops view
Keys and the semantic partition are scoped per tenant — by default derived from the API key, so different callers never share a cache. A purge and TTL admin API, cost accounting, and a single-file dashboard (hit rate, dollars saved, traffic mix) make it operable. Deterministic core, fails open on a miss, MIT-licensed.
Part of The Substrate.
deja caches the calls. The Substrate is the layer every model call flows through — a router you own, local-first inference, and a cache so you don't pay for the same answer twice.
Don't buy the same answer twice.
deja is open source and MIT-licensed. Read the code, point a client at it, run it on your own box.
View deja on GitHub →