Objective
At 3am the on-call's first 10 minutes are spent gathering context: which service, what changed, who's affected, recent deploys, runbook steps. Same handful of queries every incident, high cost-of-mistake. This recipe wires an incident-triage agent that runs all of them in one call and posts a structured triage message.
Step 1: Create alerts, deploys, customers, runbooks
CREATE TABLE IF NOT EXISTS recipe_op_alerts (
alert_id TEXT PRIMARY KEY,
service TEXT,
severity TEXT, -- 'sev1' | 'sev2' | 'sev3'
symptom TEXT,
fired_at TIMESTAMP
);
CREATE TABLE IF NOT EXISTS recipe_op_deploys (
deploy_id INTEGER PRIMARY KEY,
service TEXT,
commit_sha TEXT,
author TEXT,
deployed_at TIMESTAMP
);
CREATE TABLE IF NOT EXISTS recipe_op_customers_affected (
ca_id INTEGER PRIMARY KEY,
alert_id TEXT,
customer_id INTEGER,
customer_tier TEXT,
observed_at TIMESTAMP
);
CREATE TABLE IF NOT EXISTS recipe_op_runbooks (
runbook_id INTEGER PRIMARY KEY,
title TEXT,
body TEXT,
embedding VECTOR(384)
);
Step 2: Seed
INSERT INTO recipe_op_alerts VALUES
('PD-9281','checkout','sev1','5xx error rate spike from 0.1% to 14% on /checkout/submit','2026-05-27 03:01:00'),
('PD-9282','recs', 'sev2','p99 latency jumped from 80ms to 1.2s on /recommend','2026-05-26 21:14:00'),
('PD-9283','auth', 'sev3','elevated 401 rate on /v1/auth/login (4.2% vs baseline 0.6%)','2026-05-25 11:30:00');
INSERT INTO recipe_op_deploys VALUES
(1,'checkout','b7d3a8c1','sarah@yourco','2026-05-27 02:38:00'),
(2,'recs', 'f9c2e441','luis@yourco', '2026-05-26 20:55:00'),
(3,'payments','3a8d11ee','priya@yourco','2026-05-27 02:39:00'),
(4,'auth', '6e7912b0','alex@yourco', '2026-05-23 15:00:00');
INSERT INTO recipe_op_customers_affected VALUES
(1,'PD-9281', 5,'enterprise','2026-05-27 03:03:00'),
(2,'PD-9281',12,'starter', '2026-05-27 03:04:00'),
(3,'PD-9281',47,'pro', '2026-05-27 03:05:00'),
(4,'PD-9282',77,'enterprise','2026-05-26 21:20:00');
INSERT INTO recipe_op_runbooks (runbook_id, title, body) VALUES
(1,'checkout 5xx playbook','When checkout error rate exceeds 5%, the first action is to check whether a deploy went out in the last 2 hours. Roll back if so. If no recent deploy, check the payments service health and the database connection pool. Page payments-oncall if the database is healthy but payments is degraded.'),
(2,'recs latency playbook','Recommendation latency above 500ms p99 usually means the vector index is rebuilding or a downstream feature store is slow. Check the most recent deploy. If the deploy added new features, check the feature-store hit rate. If hit rate is below 90%, traffic-shift to the previous version.'),
(3,'auth 401 playbook','Elevated 401s on /v1/auth/login often indicate password-stuffing attacks. Check the source-IP distribution. If concentrated, enable the rate-limit rule. If distributed, check whether a recent deploy changed the JWT validation logic.'),
(4,'rollback procedure','To roll back a service: identify the previous green deploy SHA from the deploy log, run "kubectl rollout undo deploy/<service>", monitor error rate for 5 minutes, then post in #incidents.'),
(5,'communication SLA','Sev1 incidents require an external status update within 15 minutes of detection. Sev2 within 30. Sev3 by next business day. Updates go on status.yourco.com.');
Step 3: Embed
UPDATE recipe_op_runbooks SET embedding = EMBED(body);
Step 4: Ground truth — recent deploy correlation
SELECT a.alert_id, a.service, a.fired_at,
d.commit_sha, d.deployed_at,
(julianday(a.fired_at) - julianday(d.deployed_at)) * 24 * 60 AS mins_since_deploy
FROM recipe_op_alerts a
JOIN recipe_op_deploys d ON d.service = a.service
WHERE a.alert_id = 'PD-9281'
ORDER BY d.deployed_at DESC LIMIT 1;
Expected: deploy b7d3a8c1 fired 23 minutes before the alert — high suspicion.
Step 5: Ground truth — runbook retrieval
SELECT title, COSINE_SIMILARITY(embedding, EMBED('checkout 5xx spike')) AS sim
FROM recipe_op_runbooks ORDER BY sim DESC LIMIT 2;
Expected: checkout 5xx playbook first.
Step 6: Triage the alert
SELECT AGENT_RUN(
'aidb-assistant',
'You are an SRE on-call agent. Alert PD-9281 just fired. Use execute_query to look up the alert details, recent deploys on the affected service, and the count and tier of customers impacted. Use rag_search on recipe_op_runbooks to find the relevant playbook. Output: a structured incident summary with (a) what fired, (b) suspected root cause based on the deploy timeline, (c) customer impact (count + tiers), (d) the runbook step you recommend right now. Be specific — cite SHAs, timestamps, and runbook titles.'
) AS triage_summary;
Cleanup
DROP TABLE IF EXISTS recipe_op_alerts;
DROP TABLE IF EXISTS recipe_op_deploys;
DROP TABLE IF EXISTS recipe_op_customers_affected;
DROP TABLE IF EXISTS recipe_op_runbooks;
Use it from your agent
- REST/SDK: wire PagerDuty/Opsgenie webhooks to call Step 6 — every page arrives with an agent-generated triage attached.
- MCP: the SRE bot in your incident channel calls
queryover MCP. - Why in-DB: the alert, deploy log, and runbook all need to be queryable in one transaction. Stitched stacks add 30–60s of refresh latency — exactly the wrong time to add latency.
Key Concepts Learned
- Mean time to first hypothesis (MTTFH) is the metric this pattern optimizes.
- The agent's value isn't "answering questions" — it's eliminating the same 4 lookups every incident.
AGENT_RUNcan be triggered by a webhook → write the triage to a channel → page only on confirmed sev1.