Own your LLM cache

deja

A three-tier cache — exact, single-flight, semantic — so you don't buy the same answer twice.

github → MIT the substrate cache

01What it is

You pay the provider every time, even when the answer never changes. The same question, asked a hundred ways by a hundred users, bills a hundred times — and a burst of identical concurrent calls pays for the same work over and over while it's still in flight.

deja sits in front of the model and answers from your own cache when it already knows the reply. It checks three tiers in order: an exact match on the request, a single-flight coalesce that collapses duplicate in-flight calls into one upstream round-trip, and a semantic near-match for queries that mean the same thing in different words. It speaks the OpenAI and Anthropic wire APIs, so any client adopts it by changing one base URL — no SDK, no rewrite. Errors are never cached, and a miss falls straight through to the provider.

deja · tiers

client ──▶ deja ──(miss)──▶ OpenAI / Anthropic
            │
            ├─ L1  exact-match     µs, zero risk
            ├─ SF  single-flight   coalesce dupes
            └─ L2  semantic        behind a guard

Fig. 1 — three tiers, checked in order; a miss falls through.

02What it does

Exact match, in microseconds

The L1 tier keys the full request and serves a byte-identical repeat from memory or Redis — no model call, no risk of a different answer. Embeddings are cached the same way, since they're deterministic and repeat constantly. A response header (X-Cache: HIT-L1) tells the caller exactly where the answer came from.

Single-flight de-duplication

When several identical requests arrive at once, deja sends one upstream and fans the result back to every waiter. A thundering herd of the same call pays for it once — which protects your throughput and your rate limit, not just your bill.

Semantic near-match, behind a guard

The L2 tier embeds the query and serves a cached answer for a paraphrase that means the same thing. It's deliberately conservative: matches are gated by a confidence band and a verification step, and volatile queries (today, latest, price) are skipped. A shadow harness labels real traffic so you set the threshold from data, not a guess.

One base URL, any client

deja speaks the OpenAI and Anthropic wire APIs verbatim. Point an existing client at it by changing the base URL — the caller's own Authorization / x-api-key is forwarded to the provider untouched, and the provider is detected by route, so one endpoint serves both.

Per-tenant isolation and an ops view

Keys and the semantic partition are scoped per tenant — by default derived from the API key, so different callers never share a cache. A purge and TTL admin API, cost accounting, and a single-file dashboard (hit rate, dollars saved, traffic mix) make it operable. Deterministic core, fails open on a miss, MIT-licensed.

03Where it sits

Part of The Substrate.

deja caches the calls. The Substrate is the layer every model call flows through — a router you own, local-first inference, and a cache so you don't pay for the same answer twice.

dario

Local LLM router — run on a plan, not metered tokens.

hybrid

Local-first inference — escalate only the hard queries.

Don't buy the same answer twice.

deja is open source and MIT-licensed. Read the code, point a client at it, run it on your own box.

View deja on GitHub →