Objective
A voice agent that forgets what the caller said two turns ago feels broken — and a long call quickly overruns the context the model can hold. Voice is just text once it's transcribed, so the same memory pattern applies: store every transcribed turn, recall older ones by meaning, and maintain a rolling summary so the agent keeps the thread on a 10-minute call. Your STT/TTS and telephony (Vapi, LiveKit, Twilio, Pipecat, Retell) live outside the database; the memory is the database. See Use it from your agent for wiring it into any voice stack.
Step 1: Create the voice turn store
One row per transcribed utterance, tagged by call and speaker, embedded for recall.
CREATE TABLE IF NOT EXISTS recipe_voice_turns (
turn_id INTEGER PRIMARY KEY,
call_id TEXT,
speaker TEXT, -- 'caller' or 'agent'
transcript TEXT,
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
embedding VECTOR(384)
);
Step 2: Create the rolling call summary store
One compact summary per call, so older turns can be dropped from the live context.
CREATE TABLE IF NOT EXISTS recipe_voice_summary (
call_id TEXT PRIMARY KEY,
summary TEXT,
updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
Step 3: Record a transcribed conversation
What STT would hand you, turn by turn, during a support call.
INSERT INTO recipe_voice_turns (turn_id, call_id, speaker, transcript) VALUES
(1,'call-42','caller','Hi, my internet has been dropping every evening this week.'),
(2,'call-42','agent','I''m sorry to hear that. Is it all devices or just one?'),
(3,'call-42','caller','All of them, and it''s worst around 8pm.'),
(4,'call-42','agent','Got it. Have you restarted the router recently?'),
(5,'call-42','caller','Yesterday, but it didn''t help. Also I work from home so this is urgent.'),
(6,'call-42','agent','Understood, I''ll prioritize this. What''s your account number?');
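In a live pipeline the transcripts arrive as arbitrary strings, so the apostrophes that Step 3 doubles by hand (I''m, it''s) have to be escaped in code. The following is a minimal Python sketch, assuming you build the statement as text; `sql_quote` and `build_turn_insert` are hypothetical helper names, not part of any SDK, and the sample transcript is invented. If your client supports parameterized queries, prefer those.

```python
def sql_quote(text: str) -> str:
    """Wrap text in single quotes, doubling embedded quotes (standard SQL escaping)."""
    return "'" + text.replace("'", "''") + "'"


def build_turn_insert(turn_id: int, call_id: str, speaker: str, transcript: str) -> str:
    """Build the INSERT for one transcribed utterance (hypothetical helper)."""
    if speaker not in ("caller", "agent"):
        raise ValueError(f"unexpected speaker: {speaker!r}")
    return (
        "INSERT INTO recipe_voice_turns (turn_id, call_id, speaker, transcript) VALUES "
        f"({int(turn_id)}, {sql_quote(call_id)}, {sql_quote(speaker)}, {sql_quote(transcript)});"
    )


# Illustrative turn; the apostrophe in "It's" is doubled in the generated SQL.
statement = build_turn_insert(7, "call-42", "caller", "It's on my bill, one second.")
print(statement)
```

Validating the speaker value keeps the column limited to the two labels the schema comment names.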
Step 4: Embed the turns
The embedding model runs in-database, so older turns are recallable by meaning, not just position.
UPDATE recipe_voice_turns SET embedding = EMBED(transcript) WHERE embedding IS NULL;
Step 5: Recall the relevant earlier turns mid-call
The caller circles back to "when does it happen?"; pull the turns that answer it without replaying the whole call.
SELECT turn_id, speaker, transcript,
COSINE_SIMILARITY(embedding, EMBED('when and how often does the connection drop?')) AS relevance
FROM recipe_voice_turns
WHERE call_id = 'call-42'
ORDER BY relevance DESC
LIMIT 3;
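To make the ranking concrete, here is a small Python sketch of what a cosine-similarity ORDER BY with a LIMIT computes. It is an illustration only, not the database's implementation: the function names are hypothetical and the 3-dimensional vectors are invented stand-ins for the 384-dimensional embeddings in the table.

```python
import math


def cosine_similarity(a, b):
    """Dot product divided by the product of the two vector norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as unrelated to everything
    return dot / (norm_a * norm_b)


def top_k(query_vec, rows, k=3):
    """rows: list of (turn_id, embedding). Returns the k most similar turn_ids, best first."""
    scored = [(cosine_similarity(query_vec, emb), turn_id) for turn_id, emb in rows]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [turn_id for _, turn_id in scored[:k]]


# Toy vectors, invented for illustration.
rows = [
    (1, [1.0, 0.0, 0.0]),
    (2, [0.0, 1.0, 0.0]),
    (3, [0.9, 0.1, 0.0]),
    (4, [0.0, 0.0, 1.0]),
]
print(top_k([1.0, 0.0, 0.0], rows, k=2))  # [1, 3]
```

Turn 3 ranks second because its vector points almost the same way as the query, even though it is not identical; that is the sense in which recall works by meaning and not by position.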
Step 6: Maintain a rolling summary of the call
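Compress the call so far into two sentences the agent can carry cheaply on every turn. Because call_id is the primary key, the DELETE clears any earlier summary so the step can be re-run as the call grows.
DELETE FROM recipe_voice_summary WHERE call_id = 'call-42';
INSERT INTO recipe_voice_summary (call_id, summary)
SELECT 'call-42',
GENERATE('Summarize this support call in 2 sentences (problem, key details, urgency): ' ||
(SELECT GROUP_CONCAT(speaker || ': ' || transcript, '\n') FROM recipe_voice_turns WHERE call_id='call-42'));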
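If you would rather assemble the summary prompt in application code, the sketch below shows one way to do it in Python. It sorts turns by turn_id so the transcript reads in spoken order, and it adds a refresh policy (every N turns) that is purely an assumption of this example; the source does not prescribe when to re-summarize. All function names are hypothetical.

```python
SUMMARY_INSTRUCTION = (
    "Summarize this support call in 2 sentences (problem, key details, urgency): "
)


def build_summary_prompt(turns):
    """turns: list of (turn_id, speaker, transcript).

    Sorts by turn_id, then joins one 'speaker: transcript' line per turn."""
    ordered = sorted(turns, key=lambda t: t[0])
    body = "\n".join(f"{speaker}: {transcript}" for _, speaker, transcript in ordered)
    return SUMMARY_INSTRUCTION + body


def should_refresh_summary(turn_count, every_n_turns=4):
    """Hypothetical policy: refresh after every N turns, not on each one."""
    return turn_count > 0 and turn_count % every_n_turns == 0


# Turns deliberately out of order to show the sort.
turns = [
    (2, "agent", "Is it all devices or just one?"),
    (1, "caller", "My internet has been dropping every evening."),
]
prompt = build_summary_prompt(turns)
print(prompt)
```

Refreshing on a fixed cadence trades a slightly stale summary for fewer generation calls; tune the interval to your latency budget.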
Step 7: Generate the next spoken reply from compact memory
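Ground the agent's next utterance in the summary plus the latest turn, which keeps the prompt short and the latency low. The latest turn is paraphrased inline here; in a live call, substitute the newest transcript.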
SELECT GENERATE(
'You are a phone support agent; reply in one short spoken sentence. Call context: ' ||
(SELECT summary FROM recipe_voice_summary WHERE call_id='call-42') ||
' The caller just gave their account number. Acknowledge and state the next step.') AS spoken_reply;
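The prompt in Step 7 can also be assembled programmatically from the stored pieces. The Python sketch below is one possible shape, assuming you have already fetched the summary and the newest caller transcript; `build_reply_prompt` is a hypothetical helper, and the summary string is an invented example, not real model output.

```python
def build_reply_prompt(summary, last_turn, recalled=()):
    """Assemble compact context: rolling summary, optional recalled turns, newest caller turn."""
    parts = [
        "You are a phone support agent; reply in one short spoken sentence.",
        f"Call context: {summary}",
    ]
    if recalled:
        # Recalled turns (for example from Step 5) are optional extra grounding.
        parts.append("Relevant earlier turns: " + " | ".join(recalled))
    parts.append(f"The caller just said: {last_turn}")
    parts.append("Acknowledge and state the next step.")
    return " ".join(parts)


prompt = build_reply_prompt(
    summary=(
        "Caller's internet drops every evening on all devices, worst around 8pm; "
        "it is urgent because they work from home."
    ),
    last_turn="Sure, let me read you my account number.",
)
print(prompt)
```

The prompt length stays roughly constant as the call gets longer, because only the summary and a handful of turns are ever included.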
Cleanup (Optional)
DROP TABLE IF EXISTS recipe_voice_turns;
DROP TABLE IF EXISTS recipe_voice_summary;
Expected Outcomes
- Step 5 surfaces the "worst around 8pm / all devices" turns for a timing question — recall by meaning, no replay.
- Step 6 writes a 2-sentence rolling summary capturing the problem, the timing, and the urgency.
- Step 7 produces a short, spoken-style reply grounded in the summary — natural on a call, cheap on tokens.
You now have a voice agent that remembers within a call without overrunning the context window. The schema above keys turns by call_id only; for memory across calls, also store a caller identifier with each turn.
Use it from your agent (framework-agnostic — the DB is the brain, the voice stack is swappable)
Voice memory is just two tables + the same data ops, called from wherever your audio pipeline runs:
- REST / SDK: POST /v1/query/execute (any language), or @synapcores/sdk client.executeQuery(...). In your STT callback, append the transcribed turn (Step 3); before TTS, recall (Step 5) and assemble summary + last turn (Step 7) as the LLM context. Works with Vapi, LiveKit Agents, Pipecat, Twilio, or Retell: they handle audio, the database handles memory.
- MCP (native, on by default): point your voice runtime's MCP client at ws://<your-instance>/mcp?token=<jwt> (JWT from one POST /v1/auth/login → access_token). The execute tool appends turns and updates the summary; the query tool recalls relevant history, so voice memory becomes tool calls inside the turn loop.
- Any framework: the same store backs a phone agent, a browser voice widget, or a text chatbot; only the transport changes. The database is the brain; the framework (and the voice stack) is swappable.
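The wiring described above can be sketched as a thin Python wrapper around whichever transport you choose. Everything here is an assumption for illustration: `VoiceMemory` is a hypothetical class, and `run_sql` stands in for your own function that sends a statement over REST, the SDK, or MCP. The request and response formats of those transports are not shown, because the source does not specify them.

```python
from typing import Any, Callable


def _q(text: str) -> str:
    """Quote a string for SQL by doubling embedded single quotes."""
    return "'" + text.replace("'", "''") + "'"


class VoiceMemory:
    """Hypothetical wrapper; run_sql is any callable that executes one SQL statement."""

    def __init__(self, run_sql: Callable[[str], Any], call_id: str):
        self.run_sql = run_sql
        self.call_id = call_id

    def on_transcript(self, turn_id: int, speaker: str, transcript: str) -> None:
        """Call from the STT callback: store the turn, then embed it."""
        self.run_sql(
            "INSERT INTO recipe_voice_turns (turn_id, call_id, speaker, transcript) "
            f"VALUES ({int(turn_id)}, {_q(self.call_id)}, {_q(speaker)}, {_q(transcript)});"
        )
        self.run_sql(
            "UPDATE recipe_voice_turns SET embedding = EMBED(transcript) "
            f"WHERE turn_id = {int(turn_id)};"
        )

    def recall(self, question: str, k: int = 3) -> Any:
        """Call before generating a reply: fetch the k most relevant earlier turns."""
        return self.run_sql(
            "SELECT turn_id, speaker, transcript, "
            f"COSINE_SIMILARITY(embedding, EMBED({_q(question)})) AS relevance "
            f"FROM recipe_voice_turns WHERE call_id = {_q(self.call_id)} "
            f"ORDER BY relevance DESC LIMIT {int(k)};"
        )


# Demo with a recorder in place of a real connection: statements are collected, not executed.
sent = []
memory = VoiceMemory(run_sql=sent.append, call_id="call-42")
memory.on_transcript(7, "caller", "It's dropping again.")
memory.recall("when does it drop?")
print(len(sent))  # 3
```

Injecting `run_sql` keeps the memory logic independent of the transport, which matches the point that the voice stack and framework are swappable.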
Key Concepts Learned
- Transcribed voice is just text — the store-recall-summarize memory pattern applies unchanged.
- A rolling summary + the last turn keeps long calls inside the context window with low latency.
- Keying turns by call_id (and the caller) gives both in-call and cross-call memory.
- Because it's plain data ops (SQL / REST / MCP), voice memory works with any STT/TTS stack; this is the database-as-the-brain pattern the voice cluster builds on.