Objective

At 3am the on-call's first 10 minutes are spent gathering context: which service, what changed, who's affected, recent deploys, runbook steps. Same handful of queries every incident, high cost-of-mistake. This recipe wires an incident-triage agent that runs all of them in one call and posts a structured triage message.

Step 1: Create alerts, deploys, customers, runbooks

CREATE TABLE IF NOT EXISTS recipe_op_alerts (
  alert_id      TEXT PRIMARY KEY,
  service       TEXT,
  severity      TEXT,            -- 'sev1' | 'sev2' | 'sev3'
  symptom       TEXT,
  fired_at      TIMESTAMP
);

CREATE TABLE IF NOT EXISTS recipe_op_deploys (
  deploy_id    INTEGER PRIMARY KEY,
  service      TEXT,
  commit_sha   TEXT,
  author       TEXT,
  deployed_at  TIMESTAMP
);

CREATE TABLE IF NOT EXISTS recipe_op_customers_affected (
  ca_id         INTEGER PRIMARY KEY,
  alert_id      TEXT,
  customer_id   INTEGER,
  customer_tier TEXT,
  observed_at   TIMESTAMP
);

CREATE TABLE IF NOT EXISTS recipe_op_runbooks (
  runbook_id  INTEGER PRIMARY KEY,
  title       TEXT,
  body        TEXT,
  embedding   VECTOR(384)
);

Step 2: Seed

INSERT INTO recipe_op_alerts VALUES
 ('PD-9281','checkout','sev1','5xx error rate spike from 0.1% to 14% on /checkout/submit','2026-05-27 03:01:00'),
 ('PD-9282','recs',    'sev2','p99 latency jumped from 80ms to 1.2s on /recommend','2026-05-26 21:14:00'),
 ('PD-9283','auth',    'sev3','elevated 401 rate on /v1/auth/login (4.2% vs baseline 0.6%)','2026-05-25 11:30:00');

INSERT INTO recipe_op_deploys VALUES
 (1,'checkout','b7d3a8c1','sarah@yourco','2026-05-27 02:38:00'),
 (2,'recs',    'f9c2e441','luis@yourco', '2026-05-26 20:55:00'),
 (3,'payments','3a8d11ee','priya@yourco','2026-05-27 02:39:00'),
 (4,'auth',    '6e7912b0','alex@yourco', '2026-05-23 15:00:00');

INSERT INTO recipe_op_customers_affected VALUES
 (1,'PD-9281', 5,'enterprise','2026-05-27 03:03:00'),
 (2,'PD-9281',12,'starter',   '2026-05-27 03:04:00'),
 (3,'PD-9281',47,'pro',       '2026-05-27 03:05:00'),
 (4,'PD-9282',77,'enterprise','2026-05-26 21:20:00');

INSERT INTO recipe_op_runbooks (runbook_id, title, body) VALUES
 (1,'checkout 5xx playbook','When checkout error rate exceeds 5%, the first action is to check whether a deploy went out in the last 2 hours. Roll back if so. If no recent deploy, check the payments service health and the database connection pool. Page payments-oncall if the database is healthy but payments is degraded.'),
 (2,'recs latency playbook','Recommendation latency above 500ms p99 usually means the vector index is rebuilding or a downstream feature store is slow. Check the most recent deploy. If the deploy added new features, check the feature-store hit rate. If hit rate is below 90%, traffic-shift to the previous version.'),
 (3,'auth 401 playbook','Elevated 401s on /v1/auth/login often indicate password-stuffing attacks. Check the source-IP distribution. If concentrated, enable the rate-limit rule. If distributed, check whether a recent deploy changed the JWT validation logic.'),
 (4,'rollback procedure','To roll back a service: identify the previous green deploy SHA from the deploy log, run "kubectl rollout undo deploy/<service>", monitor error rate for 5 minutes, then post in #incidents.'),
 (5,'communication SLA','Sev1 incidents require an external status update within 15 minutes of detection. Sev2 within 30. Sev3 by next business day. Updates go on status.yourco.com.');

Step 3: Embed

UPDATE recipe_op_runbooks SET embedding = EMBED(body);

Step 4: Ground truth — recent deploy correlation

SELECT a.alert_id, a.service, a.fired_at,
       d.commit_sha, d.deployed_at,
       (julianday(a.fired_at) - julianday(d.deployed_at)) * 24 * 60 AS mins_since_deploy
FROM recipe_op_alerts a
JOIN recipe_op_deploys d ON d.service = a.service
WHERE a.alert_id = 'PD-9281'
ORDER BY d.deployed_at DESC LIMIT 1;

Expected: deploy b7d3a8c1 fired 23 minutes before the alert — high suspicion.

Step 5: Ground truth — runbook retrieval

SELECT title, COSINE_SIMILARITY(embedding, EMBED('checkout 5xx spike')) AS sim
FROM recipe_op_runbooks ORDER BY sim DESC LIMIT 2;

Expected: checkout 5xx playbook first.

Step 6: Triage the alert

SELECT AGENT_RUN(
  'aidb-assistant',
  'You are an SRE on-call agent. Alert PD-9281 just fired. Use execute_query to look up the alert details, recent deploys on the affected service, and the count and tier of customers impacted. Use rag_search on recipe_op_runbooks to find the relevant playbook. Output: a structured incident summary with (a) what fired, (b) suspected root cause based on the deploy timeline, (c) customer impact (count + tiers), (d) the runbook step you recommend right now. Be specific — cite SHAs, timestamps, and runbook titles.'
) AS triage_summary;

Cleanup

DROP TABLE IF EXISTS recipe_op_alerts;
DROP TABLE IF EXISTS recipe_op_deploys;
DROP TABLE IF EXISTS recipe_op_customers_affected;
DROP TABLE IF EXISTS recipe_op_runbooks;

Use it from your agent

REST/SDK: wire PagerDuty/Opsgenie webhooks to call Step 6 — every page arrives with an agent-generated triage attached.
MCP: the SRE bot in your incident channel calls query over MCP.
Why in-DB: the alert, deploy log, and runbook all need to be queryable in one transaction. Stitched stacks add 30–60s of refresh latency — exactly the wrong time to add latency.

Key Concepts Learned

Mean time to first hypothesis (MTTFH) is the metric this pattern optimizes.
The agent's value isn't "answering questions" — it's eliminating the same 4 lookups every incident.
AGENT_RUN can be triggered by a webhook → write the triage to a channel → page only on confirmed sev1.

Agentic DevOps Incident Triage