API Latency Spike Detection (3-sigma)
Tested against SynapCores CE v1.7.0.1-ce (the currently-shipped release on Docker Hub:
synapcores/community:v1.7.0.1-ce).
Objective
Flag latency samples that sit more than 3 standard deviations from the baseline mean — the canonical "this is a real incident, not noise" threshold for SRE pages.
Why this matters: most APM tools fire alerts on absolute thresholds ("p99 > 1000ms"), which produce constant false alarms in low-traffic periods and miss slow-burn regressions. A 3-sigma rule is self-calibrating: it adapts to whatever "normal" looks like for this service, this week.
Step 1 — Schema + 1-minute samples
120 one-minute samples: 110 baseline (~120ms ± 18ms) + 10 spikes (900–2400ms).
DROP TABLE IF EXISTS api_lat;
CREATE TABLE api_lat (
    id INTEGER PRIMARY KEY,
    latency_ms DOUBLE
);
INSERT INTO api_lat VALUES
(1,101.1),
(2,132.2),
(3,127.3),
(4,111.4),
(5,120.7),
(6,124.3),
(7,121.8),
(8,121.6),
(9,132.0),
(10,138.0),
(11,147.6),
(12,118.2),
(13,106.8),
(14,132.7),
(15,114.2),
(16,105.1),
(17,150.9),
(18,135.4),
(19,113.5),
(20,114.1),
(21,127.3),
(22,123.4),
(23,128.3),
(24,117.8),
(25,86.4),
(26,86.3),
(27,137.3),
(28,123.9),
(29,80.3),
(30,125.7),
(31,125.2),
(32,132.9),
(33,120.1),
(34,125.7),
(35,104.7),
(36,90.6),
(37,115.4),
(38,121.9),
(39,113.3),
(40,91.3),
(41,110.8),
(42,106.8),
(43,132.3),
(44,97.8),
(45,101.0),
(46,108.0),
(47,122.1),
(48,139.7),
(49,127.5),
(50,93.7),
(51,101.1),
(52,147.4),
(53,122.9),
(54,115.5),
(55,127.0),
(56,130.7),
(57,150.0),
(58,122.8),
(59,106.4),
(60,102.0),
(61,136.9),
(62,119.4),
(63,122.8),
(64,128.3),
(65,101.6),
(66,149.0),
(67,115.1),
(68,116.1),
(69,114.8),
(70,122.3),
(71,113.4),
(72,113.6),
(73,131.1),
(74,143.3),
(75,132.8),
(76,98.5),
(77,113.0),
(78,144.9),
(79,119.7),
(80,122.1),
(81,115.9),
(82,147.2),
(83,111.8),
(84,111.6),
(85,138.3),
(86,91.4),
(87,111.6),
(88,165.1),
(89,119.5),
(90,126.6),
(91,84.1),
(92,117.2),
(93,101.7),
(94,106.7),
(95,105.3),
(96,121.1),
(97,114.4),
(98,137.4),
(99,78.8),
(100,107.4),
(101,132.3),
(102,126.1),
(103,144.7),
(104,110.8),
(105,137.3),
(106,128.8),
(107,96.9),
(108,87.8),
(109,153.5),
(110,139.6),
(111,1952.0),
(112,1740.1),
(113,1009.5),
(114,1307.6),
(115,1518.7),
(116,1109.3),
(117,2315.8),
(118,1275.1),
(119,1247.6),
(120,1795.8)
;
SELECT COUNT(*) FROM api_lat;
-- → 120
Step 2 — Compute baseline μ and σ
SELECT AVG(latency_ms) AS mu, STDDEV(latency_ms) AS sigma FROM api_lat;
-- → μ ≈ 236.78ms, σ ≈ 407.38ms (σ is wide because spikes are baked in)
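To see how much the spikes inflate σ, one option is to compute the same statistics over a window known to be healthy. The sketch below assumes rows 1–110 are the incident-free baseline, which holds for this sample data set only; in practice the filter would be a real "known good" time range.

```sql
-- Assumption: ids 1-110 are the incident-free baseline in this sample data.
SELECT
    AVG(latency_ms)    AS mu_baseline,
    STDDEV(latency_ms) AS sigma_baseline
FROM api_lat
WHERE id <= 110;
```

The result should land near the ~120ms ± 18ms described in Step 1. Scoring against a clean baseline like this gives a much tighter threshold than the full-table μ/σ, at the cost of having to define what "clean" means.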
Step 3 — Score every sample
SELECT
    id,
    latency_ms,
    ABS(latency_ms - (SELECT AVG(latency_ms) FROM api_lat))
        / (SELECT STDDEV(latency_ms) FROM api_lat) AS z
FROM api_lat
ORDER BY z DESC
LIMIT 10;
-- → top row: latency_ms=2315.8, z=5.10 (clear incident)
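As a sanity check, the top row's score can be reproduced by hand from the μ and σ reported in Step 2:

```latex
z = \frac{|x - \mu|}{\sigma}
  = \frac{|2315.8 - 236.78|}{407.38}
  = \frac{2079.02}{407.38}
  \approx 5.10
```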
Step 4 — Page on z > 3
SELECT id, latency_ms,
    ABS(latency_ms - (SELECT AVG(latency_ms) FROM api_lat))
        / (SELECT STDDEV(latency_ms) FROM api_lat) AS z
FROM api_lat
WHERE ABS(latency_ms - (SELECT AVG(latency_ms) FROM api_lat))
    / (SELECT STDDEV(latency_ms) FROM api_lat) > 3
ORDER BY z DESC;
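-- → 5 rows above 3σ → 5 incidents (only 5 of the 10 injected spikes; the rest fall below μ + 3σ ≈ 1458.9ms because the spikes themselves inflate σ)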
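Steps 3 and 4 evaluate the same two subqueries several times. A possibly tidier form computes μ and σ once and reuses them. It assumes the engine supports common table expressions (WITH) and CROSS JOIN, which has not been verified against SynapCores CE here. The names stats and scored are illustrative.

```sql
-- Sketch: compute mu/sigma once, then score and filter.
-- Assumes CTE (WITH) and CROSS JOIN support; not verified on SynapCores CE.
WITH stats AS (
    SELECT AVG(latency_ms) AS mu, STDDEV(latency_ms) AS sigma
    FROM api_lat
),
scored AS (
    SELECT a.id,
           a.latency_ms,
           ABS(a.latency_ms - s.mu) / s.sigma AS z
    FROM api_lat a
    CROSS JOIN stats s
)
SELECT id, latency_ms, z
FROM scored
WHERE z > 3
ORDER BY z DESC;
```

If the engine accepts it, this should return the same rows as the Step 4 query, with the threshold expressed in a single place.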
Productionizing
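Stream APM samples into AIDB via the REST ingest. Compute μ/σ over a
rolling 60-minute window, or re-run the μ/σ subqueries hourly as a
simpler approximation, so the detector adapts to traffic patterns. For
multi-service setups, compute μ/σ per service: GROUP BY service_name in
the baseline query, or use window functions with PARTITION BY
service_name.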
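A per-service detector over the last 60 minutes might look like the sketch below. Everything beyond the original api_lat columns is an assumption: the table api_lat_stream and its service_name and ts columns are hypothetical, and the interval syntax follows standard SQL, which may need adjusting for the engine's dialect.

```sql
-- Sketch only. Hypothetical table: api_lat_stream(service_name, ts, latency_ms).
-- Assumes CTEs and standard-SQL interval arithmetic; not verified on SynapCores CE.
WITH recent AS (
    -- Keep only the last 60 minutes of samples.
    SELECT service_name, ts, latency_ms
    FROM api_lat_stream
    WHERE ts >= CURRENT_TIMESTAMP - INTERVAL '60' MINUTE
),
stats AS (
    -- One mu/sigma pair per service.
    SELECT service_name,
           AVG(latency_ms)    AS mu,
           STDDEV(latency_ms) AS sigma
    FROM recent
    GROUP BY service_name
)
SELECT r.service_name, r.ts, r.latency_ms,
       ABS(r.latency_ms - s.mu) / NULLIF(s.sigma, 0) AS z
FROM recent r
JOIN stats s ON s.service_name = r.service_name
-- NULLIF avoids division by zero when a service has constant latency;
-- the comparison with NULL is not true, so those rows are dropped.
WHERE ABS(r.latency_ms - s.mu) / NULLIF(s.sigma, 0) > 3
ORDER BY z DESC;
```

Grouping in a separate stats step keeps each service's baseline independent, so a noisy service does not widen the threshold for a quiet one.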