
Every chatbot demo looks great for about eight turns. Then the wheels come off.
You've seen it. The bot that introduced itself five minutes ago suddenly forgets your name. The "travel assistant" that asked about your dietary restrictions in turn three happily recommends a steakhouse in turn twenty. The support agent that loses the thread of the ticket it's supposed to be resolving.
The usual diagnosis is "the model needs more context." So people do the obvious thing: re-send the entire transcript on every single call. It works — right up until it doesn't. You blow the context window, your latency creeps up, and you're paying for thousands of tokens of stale conversation on every turn just to remind the model what it already said. That's not memory. That's a tax.
I want to walk through how I actually solve this, because the fix isn't a bigger context window or a cleverer prompt. It's a database pattern. Two tables and a handful of SQL operations.
The core idea: two layers, not one
Human memory isn't a transcript. You don't replay every word of a conversation to recall what someone wants. You keep a compressed sense of the whole thing — vegetarian, on a budget, hates flying — plus a sharp recollection of the last couple of exchanges. Everything older gets summarized down to what matters.
That's exactly the architecture a chatbot needs:
- Store the full history so you can recall any past turn by meaning, not by position.
- Maintain a rolling summary of older turns so the model carries durable context cheaply.
- Feed the model the summary plus the last few verbatim turns — never the whole log.
Full memory, tiny context. Let me show you the whole thing in SQL.
Step 1 & 2: Two tables
One table holds every message, embedded for semantic recall. The other holds one rolling summary per session — the compressed memory of everything that came before the recent window.
CREATE TABLE IF NOT EXISTS recipe_chat_turns (
turn_id INTEGER PRIMARY KEY,
session_id TEXT,
role TEXT, -- 'user' or 'assistant'
content TEXT,
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
embedding VECTOR(384)
);
CREATE TABLE IF NOT EXISTS recipe_chat_summary (
session_id TEXT PRIMARY KEY,
summary TEXT,
updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
Notice the VECTOR(384) column sitting right next to your text. No separate vector database, no sync job, no second system to operate. The embedding lives in the same row as the message. That detail matters more than it looks — I'll come back to it.
Step 3 & 4: Record and embed
Drop in a realistic conversation — a 10-day Japan trip, $3000 budget, vegetarian, prefers trains, gets motion sick — the kind of session that's already too long to resend in full every turn. Then embed every message in a single statement:
UPDATE recipe_chat_turns SET embedding = EMBED(content);
EMBED() is a function in the query. The constraints the user mentioned ten turns ago are now recallable by meaning.
Step 5: Recall what matters, not everything
Here's where the "re-send the whole transcript" crowd loses. When the user asks something new, you don't need the entire log. You need the three turns that are actually relevant:
SELECT turn_id, role, content,
COSINE_SIMILARITY(
embedding,
EMBED('what are the user''s dietary and travel constraints?')
) AS relevance
FROM recipe_chat_turns
WHERE session_id = 's1'
ORDER BY relevance DESC
LIMIT 3;
That query surfaces the vegetarian / rail / motion-sickness turns and ignores the small talk. Relevant history, none of the noise.
Step 6: Compress the rest into a rolling summary
Flatten the transcript, then generate a durable summary from it — again, right in the query:
INSERT INTO recipe_chat_summary (session_id, summary)
SELECT session_id,
GENERATE('Summarize this chat as durable user context in 2-3 sentences (preferences, constraints, goal): ' || transcript)
FROM recipe_chat_transcript;
GENERATE() calls the model from inside SQL. Two or three sentences capture the budget, the cities, the preferences. You refresh this every N turns instead of on every call.
Step 7 & 8: Assemble cheap context, then answer
Now combine the rolling summary with only the last two verbatim turns — and generate the reply grounded in that:
SELECT GENERATE(
'You are a travel assistant. Context: ' || summary ||
' The user now asks: Suggest a vegetarian dinner spot in Kyoto. Answer in one sentence respecting their constraints.') AS reply
FROM recipe_chat_summary WHERE session_id = 's1';
The model answers with a vegetarian Kyoto suggestion that respects every constraint — and it did it on a fraction of the tokens a full transcript would have cost.
Why this is a database pattern, not a framework feature
Here's the part I care about most, and the reason I keep pushing this point: the database is the brain; the framework is swappable.
Conversation memory here is just two tables and the same data operations — append a turn, recall by meaning, refresh the summary, assemble the context. That's it. So it doesn't matter what's driving your bot. REST, an SDK call, an MCP client like Claude Code or a custom loop, OpenClaw, LangChain, a voice agent — they all read and write the same tables. No memory abstraction baked into one framework you'll regret choosing in six months.
I've watched too many teams couple their entire memory layer to whatever orchestration library was hot that quarter, then spend the next year unwinding it. Don't. Keep the memory in the database where it belongs, and treat the framework as a detail you can change your mind about.
That's also why I built this with embeddings, SQL, and GENERATE() living in one engine. Every time you split semantic recall into a separate vector store and the text into a separate database, you've signed up for a sync problem and a second thing to operate at 3 a.m. One row, one place, one query language.
Run it yourself
Reading SQL is one thing. Watching it run is another. This whole pattern is a free recipe you can paste into a running instance and execute end to end in about fourteen minutes — turn buffer, semantic recall, and auto-summary, all in one SQL database. You can install the Community Edition free and have it running in about thirty seconds.
👉 Conversation Memory + Rolling Summary for a Chatbot
If this clarified how chatbot memory should work — or saved you from re-sending a transcript on every call — do me a favor and hold the clap button (you can leave more than one 👏). And subscribe if you want more hands-on tutorials like this; I publish practical, build-it-yourself walkthroughs on AI-native data infrastructure regularly, and the next ones go straight to your feed.
What's the worst context-window blowup you've hit in production? Tell me in the responses — I read them.