Objective

A voice agent that forgets what the caller said two turns ago feels broken, and a long call quickly overruns the context the model can hold. Once speech is transcribed it is just text, so the same store, recall, summarize memory pattern applies: store every transcribed turn, recall older ones by meaning, and maintain a rolling summary so the agent keeps the thread even on a 10-minute call. Your STT/TTS and telephony (Vapi, LiveKit, Twilio, Pipecat, Retell) live outside the database; the memory is the database. See "Use it from your agent" below for wiring it into any voice stack.

Step 1: Create the voice turn store

One row per transcribed utterance, tagged by call and speaker, embedded for recall.

CREATE TABLE IF NOT EXISTS recipe_voice_turns (
  turn_id    INTEGER PRIMARY KEY,
  call_id    TEXT,
  speaker    TEXT,                                    -- 'caller' or 'agent'
  transcript TEXT,
  created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
  embedding  VECTOR(384)
);
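Every read in this recipe filters on call_id, so once many calls accumulate, an index on (call_id, turn_id) keeps per-call lookups and chronological scans cheap. This is a minimal sketch that assumes the database accepts standard CREATE INDEX IF NOT EXISTS syntax; the index name is hypothetical.

```sql
-- Hypothetical index name; assumes standard CREATE INDEX IF NOT EXISTS support.
-- call_id first serves the per-call filters; turn_id keeps each call's turns in spoken order.
CREATE INDEX IF NOT EXISTS idx_recipe_voice_turns_call
  ON recipe_voice_turns (call_id, turn_id);
```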

Step 2: Create the rolling call summary store

One compact summary per call, so older turns can be dropped from the live context.

CREATE TABLE IF NOT EXISTS recipe_voice_summary (
  call_id    TEXT PRIMARY KEY,
  summary    TEXT,
  updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

Step 3: Record a transcribed conversation

What STT would hand you, turn by turn, during a support call.

INSERT INTO recipe_voice_turns (turn_id, call_id, speaker, transcript) VALUES
 (1,'call-42','caller','Hi, my internet has been dropping every evening this week.'),
 (2,'call-42','agent','I''m sorry to hear that. Is it all devices or just one?'),
 (3,'call-42','caller','All of them, and it''s worst around 8pm.'),
 (4,'call-42','agent','Got it. Have you restarted the router recently?'),
 (5,'call-42','caller','Yesterday, but it didn''t help. Also I work from home so this is urgent.'),
 (6,'call-42','agent','Understood, I''ll prioritize this. What''s your account number?');
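On a live call, rows arrive one at a time from the STT callback rather than as a batch. The sketch below appends the caller's next turn. It assumes a single writer per table, because it derives the next turn_id as the current maximum plus one; an auto-generated key would avoid that race. The transcript is illustrative sample text. The new row has no embedding until it is embedded, which Step 4 does for the whole table.

```sql
-- Append the caller's next utterance to call-42 as it comes out of STT.
-- turn_id is the table-wide primary key, so the next id is the maximum across ALL calls plus one
-- (safe only with a single writer; an auto-generated key would avoid the race).
-- The transcript below is illustrative sample text, not real call data.
INSERT INTO recipe_voice_turns (turn_id, call_id, speaker, transcript)
SELECT COALESCE(MAX(turn_id), 0) + 1, 'call-42', 'caller', 'Sure, my account number is [redacted].'
FROM recipe_voice_turns;
```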

Step 4: Embed the turns

The embedding model runs in-database, so older turns are recallable by meaning, not just position.

UPDATE recipe_voice_turns SET embedding = EMBED(transcript);
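Step 4 re-embeds every row each time it runs. During a live call, a variant that touches only rows still missing a vector keeps the per-turn cost constant. This is a sketch that assumes the VECTOR column can be tested with IS NULL like any other column.

```sql
-- Embed only turns that arrived since the last pass (embedding still NULL),
-- so each new utterance costs one EMBED call instead of re-embedding the whole table.
UPDATE recipe_voice_turns
SET embedding = EMBED(transcript)
WHERE embedding IS NULL;
```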

Step 5: Recall the relevant earlier turns mid-call

The caller circles back to "when does it happen?"; pull the turns that answer it without replaying the whole call.

SELECT turn_id, speaker, transcript,
       COSINE_SIMILARITY(embedding, EMBED('when and how often does the connection drop?')) AS relevance
FROM recipe_voice_turns
WHERE call_id = 'call-42'
ORDER BY relevance DESC
LIMIT 3;
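The most recent turns are usually already in the agent's live prompt, so recalling them again wastes tokens. The variant below searches only older turns. It assumes the last two turns are carried verbatim; the cutoff of two is an arbitrary illustration, not a tuned value. It also assumes the dialect supports a derived table inside a scalar subquery.

```sql
-- Same semantic recall as Step 5, restricted to turns the agent no longer holds verbatim.
SELECT turn_id, speaker, transcript,
       COSINE_SIMILARITY(embedding, EMBED('when and how often does the connection drop?')) AS relevance
FROM recipe_voice_turns
WHERE call_id = 'call-42'
  AND embedding IS NOT NULL
  -- Exclude the two most recent turns of this call; they are already in the live prompt.
  AND turn_id < (SELECT MIN(turn_id)
                 FROM (SELECT turn_id FROM recipe_voice_turns
                       WHERE call_id = 'call-42'
                       ORDER BY turn_id DESC LIMIT 2) AS recent)
ORDER BY relevance DESC
LIMIT 3;
```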

Step 6: Maintain a rolling summary of the call

Compress the call so far into two sentences the agent can carry cheaply on every turn.

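-- call_id is the primary key, so clear the previous summary before writing the refreshed one.
DELETE FROM recipe_voice_summary WHERE call_id = 'call-42';
INSERT INTO recipe_voice_summary (call_id, summary)
SELECT 'call-42',
       GENERATE('Summarize this support call in 2 sentences (problem, key details, urgency): ' ||
                (SELECT GROUP_CONCAT(speaker || ': ' || transcript, '\n') FROM recipe_voice_turns WHERE call_id='call-42'));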
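Regenerating the summary costs an LLM call, so on a live call you would refresh it only when turns have arrived since the last refresh. The sketch below performs that check by comparing the summary's updated_at with the newest turn's created_at. It assumes both columns are filled by their CURRENT_TIMESTAMP defaults.

```sql
-- Returns 1 when call-42 has turns newer than its stored summary (or no summary yet),
-- i.e. when a GENERATE call for a refresh is worth spending.
SELECT CASE
         WHEN s.updated_at IS NULL THEN 1
         WHEN MAX(t.created_at) > s.updated_at THEN 1
         ELSE 0
       END AS needs_refresh
FROM recipe_voice_turns AS t
LEFT JOIN recipe_voice_summary AS s ON s.call_id = t.call_id
WHERE t.call_id = 'call-42'
GROUP BY s.updated_at;
```

If timestamps are recorded at coarse resolution, a turn written in the same instant as the refresh compares equal and is missed. Tracking the last summarized turn_id would avoid that, at the cost of a schema change.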

Step 7: Generate the next spoken reply from compact memory

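Ground the agent's next utterance in the compact summary plus the latest turn instead of the full transcript. The prompt stays short, which keeps generation latency low without losing the thread of the call.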

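-- COALESCE keeps the prompt non-NULL if Step 6 hasn't run; the subquery adds the latest stored turn.
SELECT GENERATE(
  'You are a phone support agent; reply in one short spoken sentence. Call context: ' ||
  COALESCE((SELECT summary FROM recipe_voice_summary WHERE call_id='call-42'), '(no summary yet)') ||
  ' Last turn: ' ||
  (SELECT speaker || ': ' || transcript FROM recipe_voice_turns
    WHERE call_id='call-42' ORDER BY turn_id DESC LIMIT 1) ||
  ' The caller just gave their account number. Acknowledge and state the next step.') AS spoken_reply;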
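To keep memory complete, store the reply the agent speaks as its own turn, so later recall and summary refreshes see both sides of the call. If you let the database generate the reply, you can store it in the same statement. This is a sketch that assumes GENERATE is allowed inside INSERT ... SELECT, and it takes the next turn_id as the table-wide maximum plus one (single writer assumed). In a pipeline that uses its own LLM, you would instead insert the exact text sent to TTS. Either way, the new row needs embedding (Step 4) before it is recallable.

```sql
-- Generate the agent's next line and append it as an 'agent' turn in one statement.
-- turn_id is the table-wide primary key, so the next id is MAX(turn_id) + 1 across all calls.
INSERT INTO recipe_voice_turns (turn_id, call_id, speaker, transcript)
SELECT COALESCE(MAX(turn_id), 0) + 1,
       'call-42',
       'agent',
       GENERATE('You are a phone support agent; reply in one short spoken sentence. Call context: ' ||
                COALESCE((SELECT summary FROM recipe_voice_summary WHERE call_id = 'call-42'), '(no summary yet)') ||
                ' The caller just gave their account number. Acknowledge and state the next step.')
FROM recipe_voice_turns;
```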

Cleanup (Optional)

DROP TABLE IF EXISTS recipe_voice_turns;
DROP TABLE IF EXISTS recipe_voice_summary;

Expected Outcomes

Step 5 should surface the timing turns, such as turn 3 ("All of them, and it's worst around 8pm"), for a timing question: recall by meaning, with no replay of the call.
Step 6 writes a 2-sentence rolling summary capturing the problem, the timing, and the urgency.
Step 7 produces a short, spoken-style reply grounded in the summary — natural on a call, cheap on tokens.

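You now have the memory layer for a voice agent. It remembers within a call, and because the live prompt carries only the summary and the latest turn, the prompt stays bounded however long the call runs. Cross-call memory follows the same pattern once turns are also keyed by caller, which the schema above does not yet do.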
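The turn table is keyed only by call_id, so recalling across calls needs a caller identifier as well. The sketch below works under that assumption. The caller_id column, the 'caller-7' value, and the 'call-43' id for the current call are all hypothetical additions that are not part of the Step 1 schema. It also assumes ALTER TABLE ... ADD COLUMN is supported.

```sql
-- Hypothetical: tag each turn with a stable caller identifier (for example, a hashed phone number).
ALTER TABLE recipe_voice_turns ADD COLUMN caller_id TEXT;
UPDATE recipe_voice_turns SET caller_id = 'caller-7' WHERE call_id = 'call-42';

-- On a later call from the same caller, recall relevant turns from that caller's earlier calls.
SELECT call_id, turn_id, speaker, transcript,
       COSINE_SIMILARITY(embedding, EMBED('has this caller reported connection drops before?')) AS relevance
FROM recipe_voice_turns
WHERE caller_id = 'caller-7'
  AND call_id <> 'call-43'          -- hypothetical id of the call in progress
  AND embedding IS NOT NULL
ORDER BY relevance DESC
LIMIT 3;
```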

Use it from your agent (framework-agnostic — the DB is the brain, the voice stack is swappable)

Voice memory is just two tables plus the data operations above, called from wherever your audio pipeline runs:

REST / SDK: POST /v1/query/execute (any language), or @synapcores/sdk client.executeQuery(...). In your STT callback, append the transcribed turn (Step 3) and embed it (Step 4). Before generating the reply, recall relevant history (Step 5) and assemble the summary plus the last turn (Step 7) as the LLM context, then hand the reply to TTS. Refresh the summary (Step 6) as the call grows. Works with Vapi, LiveKit Agents, Pipecat, Twilio, or Retell: they handle audio, the database handles memory.
MCP (native, on by default): point your voice runtime's MCP client at ws://<your-instance>/mcp?token=<jwt>, using the access_token returned by a single POST /v1/auth/login as the JWT. The execute tool appends turns and updates the summary; the query tool recalls relevant history, so voice memory becomes tool calls inside the turn loop.
Any framework: the same store backs a phone agent, a browser voice widget, or a text chatbot; only the transport changes.
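If your voice runtime calls its own LLM rather than GENERATE, it still needs the summary and the latest turns as context, ideally in one round trip to keep latency down. The sketch below returns both as ordered rows. It assumes the dialect supports UNION ALL over a derived table, and keeping the two most recent turns is an arbitrary choice.

```sql
-- One round trip: the rolling summary (ord 0) followed by the two most recent turns in spoken order.
SELECT 0 AS ord, 'summary' AS speaker, summary AS content
FROM recipe_voice_summary
WHERE call_id = 'call-42'
UNION ALL
SELECT ord, speaker, content
FROM (SELECT turn_id AS ord, speaker, transcript AS content
      FROM recipe_voice_turns
      WHERE call_id = 'call-42'
      ORDER BY turn_id DESC
      LIMIT 2) AS recent
ORDER BY ord;
```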

Key Concepts Learned

Transcribed voice is just text — the store-recall-summarize memory pattern applies unchanged.
A rolling summary + the last turn keeps long calls inside the context window with low latency.
Keying turns by call_id gives in-call memory; adding a caller key (not part of this recipe's schema) extends the same pattern to cross-call memory.
Because it's plain data ops (SQL / REST / MCP), voice memory works with any STT/TTS stack — the database-as-the-brain pattern the voice cluster builds on.

Conversation Memory for a Voice Agent