Low-Latency Semantic Answer Cache for Voice

Cut voice AI latency and cost with a semantic answer cache — serve a stored answer when a caller's question means the same as one you've answered before, even if worded differently, with a vector lookup in SQL. For any STT/TTS stack (Vapi, LiveKit, Twilio, Pipecat).

Objective

On a voice call, every second of latency is dead air, and most callers ask the same handful of questions in slightly different words. An exact-match cache misses all the paraphrases. A semantic cache hits when the meaning matches: if you've already answered "what time do you close?", you can reuse that answer for "how late are you open?" with no LLM round trip and no token cost. Here you'll build a semantic answer cache with a vector lookup and a hit threshold. Your STT/TTS and telephony live outside the database; the cache itself is just a table you query. See "Use it from your agent" below for wiring it into any voice stack.

Step 1: Create the semantic answer cache

Each row is a previously answered question, its answer, an embedding, and a hit counter.

CREATE TABLE IF NOT EXISTS recipe_voice_cache (
  cache_id   INTEGER PRIMARY KEY,
  question   TEXT,
  answer     TEXT,
  hits       INTEGER DEFAULT 0,
  embedding  VECTOR(384)
);

Step 2: Warm the cache with common Q&A

The questions a phone line answers all day — store the canonical answer once.

INSERT INTO recipe_voice_cache (cache_id, question, answer) VALUES
 (1,'What are your hours?','We''re open 9am to 8pm Monday through Saturday, closed Sundays.'),
 (2,'Where can I park?','There''s free customer parking in the lot behind the building.'),
 (3,'What is your return policy?','You can return items within 14 days with a receipt for a full refund.'),
 (4,'Do you offer delivery?','Yes, same-day delivery before 2pm within 10 miles.'),
 (5,'What payment methods do you take?','We accept all major cards, mobile wallets, and cash.');

Step 3: Embed the cached questions

The embedding model runs in-database; the resulting vectors are what every cache lookup is compared against.

UPDATE recipe_voice_cache SET embedding = EMBED(question);

Step 4: Look up a paraphrased question in the cache

The caller asks "how late are you open tonight?"; find the closest cached question by meaning.

SELECT cache_id, question, answer,
       COSINE_SIMILARITY(embedding, EMBED('how late are you open tonight?')) AS similarity
FROM recipe_voice_cache
ORDER BY similarity DESC
LIMIT 1;
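
To make the ranking in Step 4 concrete, here is a minimal Python sketch of what a cosine-similarity top-1 lookup computes. It assumes COSINE_SIMILARITY is the standard cosine of the angle between two vectors; the 3-dimensional vectors are hypothetical stand-ins for the 384-dimensional embeddings, not real model output.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)

# Hypothetical 3-dimensional stand-ins for the stored embeddings.
cached = {
    "What are your hours?": [0.9, 0.1, 0.0],
    "Where can I park?": [0.0, 0.2, 0.9],
}
# Hypothetical stand-in for EMBED('how late are you open tonight?').
query = [0.8, 0.3, 0.1]

# Equivalent of ORDER BY similarity DESC LIMIT 1.
best_question, best_score = max(
    ((question, cosine_similarity(vector, query)) for question, vector in cached.items()),
    key=lambda pair: pair[1],
)
print(best_question, round(best_score, 3))
```

The score depends only on the direction of the vectors, not their length, which is why two differently worded questions can still land close together.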

Step 5: Decide hit vs. miss with a threshold

Serve the cached answer only on a strong semantic match; otherwise fall through to the live LLM path.

SELECT answer, similarity,
       CASE WHEN similarity >= 0.55 THEN 'CACHE_HIT' ELSE 'MISS_CALL_LLM' END AS decision
FROM (
  SELECT answer,
         COSINE_SIMILARITY(embedding, EMBED('what time do you shut tonight?')) AS similarity
  FROM recipe_voice_cache
  ORDER BY similarity DESC
  LIMIT 1
);
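
The 0.55 cutoff is a tuning knob, and the right value depends on your embedding model and your questions. One way to choose it, sketched below in Python, is to label a handful of real lookups as "should have hit" or "should have missed" and count the errors at each candidate threshold. The labelled scores here are hypothetical, and the helper names are illustrative only.

```python
HIT_THRESHOLD = 0.55  # same cutoff as the CASE expression in Step 5

def decide(similarity, threshold=HIT_THRESHOLD):
    """Mirror of the SQL CASE: hit at or above the threshold, else miss."""
    return "CACHE_HIT" if similarity >= threshold else "MISS_CALL_LLM"

def sweep(labelled, thresholds):
    """For each threshold, count (false hits, false misses).

    A false hit serves a wrong cached answer; a false miss pays for an
    LLM call the cache could have avoided.
    """
    results = {}
    for t in thresholds:
        false_hits = sum(
            1 for score, should_hit in labelled
            if decide(score, t) == "CACHE_HIT" and not should_hit
        )
        false_misses = sum(
            1 for score, should_hit in labelled
            if decide(score, t) == "MISS_CALL_LLM" and should_hit
        )
        results[t] = (false_hits, false_misses)
    return results

# Hypothetical (similarity, should_hit) pairs from reviewed lookups.
labelled = [(0.82, True), (0.61, True), (0.58, False), (0.52, True), (0.40, False)]

for t, (fh, fm) in sweep(labelled, [0.50, 0.55, 0.60]).items():
    print(f"threshold={t:.2f} false_hits={fh} false_misses={fm}")
```

On a phone line a false hit (a confident wrong answer) is usually worse than a false miss (a slower right answer), so it is reasonable to lean toward the higher threshold when the counts are close.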

Step 6: Count the hit (so you can measure cache value)

On a hit, increment the counter for the matched entry so you can track how much latency and cost the cache saves. First clear and refill a one-row table with the best match (projecting the similarity we order by), then increment that entry's counter only if its similarity clears the Step 5 threshold. Clearing the table first keeps the step re-runnable, since the pinned row always uses id 1.

CREATE TABLE IF NOT EXISTS recipe_voice_cache_hit (id INTEGER PRIMARY KEY, cache_id INTEGER, similarity DOUBLE);
DELETE FROM recipe_voice_cache_hit;
INSERT INTO recipe_voice_cache_hit (id, cache_id, similarity)
SELECT 1, cache_id,
       COSINE_SIMILARITY(embedding, EMBED('what time do you shut tonight?')) AS similarity
FROM recipe_voice_cache
ORDER BY similarity DESC
LIMIT 1;
UPDATE recipe_voice_cache
SET hits = hits + 1
WHERE cache_id = (SELECT cache_id FROM recipe_voice_cache_hit WHERE similarity >= 0.55 LIMIT 1);
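
Once the hits column is filling up, turning it into a savings estimate is simple arithmetic. The Python sketch below assumes you have read cache_id and hits out of recipe_voice_cache into a dictionary; the per-call latency and cost figures are hypothetical placeholders you would replace with your own measurements.

```python
def estimate_savings(hit_counts, llm_latency_s, llm_cost_per_call):
    """Estimate gross savings from cache hits.

    hit_counts maps cache_id -> hits, as read from recipe_voice_cache.
    llm_latency_s and llm_cost_per_call are your own measured averages.
    """
    total_hits = sum(hit_counts.values())
    return {
        "total_hits": total_hits,
        "latency_saved_s": total_hits * llm_latency_s,
        "cost_saved": total_hits * llm_cost_per_call,
    }

# Hypothetical values for illustration only.
hit_counts = {1: 42, 2: 7, 3: 0}
savings = estimate_savings(hit_counts, llm_latency_s=1.5, llm_cost_per_call=0.002)
print(savings)
```

This is a gross figure: each hit still pays for one embedding and one lookup, and a hit rate would also need a count of misses, which this recipe does not record.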

Step 7: Fall through and populate the cache on a miss

A genuinely new question misses the threshold; generate the answer once, then store it so the next paraphrase hits.

INSERT INTO recipe_voice_cache (cache_id, question, answer, embedding)
SELECT 6, 'Do you price match?',
       GENERATE('Answer in one short spoken sentence: Do you price match competitors? Assume yes, within 30 days with proof.'),
       EMBED('Do you price match?');
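
Steps 5 through 7 together form one loop: look up, serve on a hit, generate and store on a miss. The Python sketch below models that loop with an in-memory list standing in for recipe_voice_cache. The class, the toy keyword-count embedding, and the fake generator are all hypothetical stand-ins for EMBED and GENERATE, chosen only so the example runs without a database.

```python
import math

def _cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

class SemanticCache:
    """In-memory stand-in for the recipe_voice_cache table."""

    def __init__(self, embed, threshold=0.55):
        self.embed = embed          # stand-in for EMBED()
        self.threshold = threshold  # same cutoff as Step 5
        self.rows = []

    def lookup(self, question):
        """Return the best row if it clears the threshold, else None."""
        query_vec = self.embed(question)
        scored = [(_cosine(row["embedding"], query_vec), row) for row in self.rows]
        if not scored:
            return None
        score, row = max(scored, key=lambda pair: pair[0])
        if score < self.threshold:
            return None
        row["hits"] += 1  # Step 6: count the hit
        return row

    def store(self, question, answer):
        self.rows.append({"question": question, "answer": answer,
                          "embedding": self.embed(question), "hits": 0})

def answer_question(question, cache, generate):
    row = cache.lookup(question)
    if row is not None:
        return row["answer"], "CACHE_HIT"
    fresh = generate(question)    # stand-in for GENERATE() or your LLM call
    cache.store(question, fresh)  # Step 7: populate on a miss
    return fresh, "MISS_CALL_LLM"

# Toy embedding: keyword counts over a tiny hypothetical vocabulary.
VOCAB = ["price", "match", "hours", "park"]

def toy_embed(text):
    lowered = text.lower()
    return [lowered.count(word) for word in VOCAB]

llm_calls = []

def fake_generate(question):
    llm_calls.append(question)
    return "Yes, we price match within 30 days with proof."

cache = SemanticCache(toy_embed)
first = answer_question("Do you price match?", cache, fake_generate)
second = answer_question("Will you match a competitor's price?", cache, fake_generate)
print(first[1], second[1], len(llm_calls))
```

The first call misses and stores the generated answer; the reworded second call is served from the cache, so the generator runs only once.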

Cleanup (Optional)

DROP TABLE IF EXISTS recipe_voice_cache;
DROP TABLE IF EXISTS recipe_voice_cache_hit;

Expected Outcomes

  • Step 4 matches "how late are you open tonight?" to the cached hours question — a paraphrase an exact cache would miss.
  • Step 5 returns CACHE_HIT with the stored answer for "what time do you shut tonight?" — instant, no LLM call.
  • Step 6 increments the hit counter so you can quantify the saved latency and cost.
  • Step 7 answers a brand-new question once, then caches it so the next phrasing of it hits.

You now have a semantic answer cache that turns repeated, reworded questions into instant replies — less dead air on calls, fewer LLM tokens.

Use it from your agent (framework-agnostic — the DB is the brain, the voice stack is swappable)

A semantic cache is just a Q&A table + a lookup-before-LLM check, called from any voice runtime:

  • REST / SDK: POST /v1/query/execute (any language), or client.executeQuery(...) from @synapcores/sdk. On each question, run the Step 5 lookup first; on CACHE_HIT send the stored answer straight to TTS, and on a miss call your LLM and write the result back (Step 7). Works with Vapi, LiveKit Agents, Pipecat, Twilio, or Retell.
  • MCP (native, on by default): point your voice runtime's MCP client at ws://<your-instance>/mcp?token=<jwt>, where the JWT is the access_token returned by one POST /v1/auth/login. The query tool checks the cache; the execute tool records hits and inserts new answers, so the cache becomes tool calls in the turn loop.
  • Any framework: the same cache fronts a phone agent, a voice widget, or a chatbot; the latency win is biggest on voice, but the pattern is universal. The database is the brain; the framework (and the voice stack) is swappable.
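
As a rough sketch of the lookup-before-LLM check from the agent side, the Python below builds the Step 5 query for a transcribed question and routes on the result. Everything about transport is an assumption: run_query is a hypothetical callable you would implement over POST /v1/query/execute or the SDK, and the result shape (a list of dict rows keyed by column name) is assumed, not taken from the SynapCores documentation.

```python
def sql_literal(text):
    """Quote a string for SQL by doubling single quotes, as the Step 2 INSERT does."""
    return "'" + text.replace("'", "''") + "'"

def build_lookup_sql(question, threshold=0.55):
    """Build the Step 5 lookup for one transcribed question."""
    q = sql_literal(question)
    return (
        "SELECT answer, similarity, "
        f"CASE WHEN similarity >= {threshold} THEN 'CACHE_HIT' "
        "ELSE 'MISS_CALL_LLM' END AS decision "
        "FROM (SELECT answer, "
        f"COSINE_SIMILARITY(embedding, EMBED({q})) AS similarity "
        "FROM recipe_voice_cache ORDER BY similarity DESC LIMIT 1)"
    )

def handle_turn(question, run_query, call_llm):
    """Return the text to speak: cached answer on a hit, LLM answer on a miss.

    run_query: hypothetical callable, SQL string -> list of dict rows.
    call_llm:  your own LLM call, question -> answer.
    """
    rows = run_query(build_lookup_sql(question))
    if rows and rows[0]["decision"] == "CACHE_HIT":
        return rows[0]["answer"]
    return call_llm(question)

# Fake transport for illustration: always reports a hit.
def fake_run_query(sql):
    return [{"answer": "We're open 9am to 8pm.", "similarity": 0.8, "decision": "CACHE_HIT"}]

print(handle_turn("what's your closing time?", fake_run_query, lambda q: "LLM answer"))
```

Speech transcripts are full of apostrophes ("what's", "you're"), so the quoting step matters; if your client offers bound parameters, those are a safer choice than string interpolation. The write-back on a miss would follow the Step 7 INSERT in the same way.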

Key Concepts Learned

  • A semantic cache hits on meaning, so paraphrases reuse a stored answer — unlike an exact-match cache.
  • A similarity threshold separates safe cache hits from genuine misses that need the live LLM.
  • Populating the cache on every miss warms it up automatically, so the hit rate climbs as more callers ask reworded versions of the same questions.
  • Because it's plain data ops (SQL / REST / MCP), the semantic cache works with any STT/TTS stack — the database-as-the-brain pattern the voice cluster builds on.

Tags

voice-agent, semantic-cache, latency, vector, embeddings, cost-reduction, mcp
