API Latency Spike Detection (3-sigma)

Tested against SynapCores CE v1.7.0.1-ce (the currently-shipped release on Docker Hub: synapcores/community:v1.7.0.1-ce).

Objective

Flag latency samples that sit more than 3 standard deviations from the baseline mean — the canonical "this is a real incident, not noise" threshold for SRE pages.

Why this matters: typical APM tools fire alerts on absolute thresholds ("p99 > 1000ms"), which produce constant false alarms in low-traffic periods and miss slow-burn regressions. A 3-sigma rule is self-calibrating: it adapts to whatever "normal" looks like for this service, this week.

Step 1 — Schema + 1-minute samples

120 one-minute samples: 110 baseline (~120ms ± 18ms) + 10 spikes (900–2400ms).

DROP TABLE IF EXISTS api_lat;
CREATE TABLE api_lat (
    id          INTEGER PRIMARY KEY,
    latency_ms  DOUBLE
);

INSERT INTO api_lat VALUES
(1,101.1),
(2,132.2),
(3,127.3),
(4,111.4),
(5,120.7),
(6,124.3),
(7,121.8),
(8,121.6),
(9,132.0),
(10,138.0),
(11,147.6),
(12,118.2),
(13,106.8),
(14,132.7),
(15,114.2),
(16,105.1),
(17,150.9),
(18,135.4),
(19,113.5),
(20,114.1),
(21,127.3),
(22,123.4),
(23,128.3),
(24,117.8),
(25,86.4),
(26,86.3),
(27,137.3),
(28,123.9),
(29,80.3),
(30,125.7),
(31,125.2),
(32,132.9),
(33,120.1),
(34,125.7),
(35,104.7),
(36,90.6),
(37,115.4),
(38,121.9),
(39,113.3),
(40,91.3),
(41,110.8),
(42,106.8),
(43,132.3),
(44,97.8),
(45,101.0),
(46,108.0),
(47,122.1),
(48,139.7),
(49,127.5),
(50,93.7),
(51,101.1),
(52,147.4),
(53,122.9),
(54,115.5),
(55,127.0),
(56,130.7),
(57,150.0),
(58,122.8),
(59,106.4),
(60,102.0),
(61,136.9),
(62,119.4),
(63,122.8),
(64,128.3),
(65,101.6),
(66,149.0),
(67,115.1),
(68,116.1),
(69,114.8),
(70,122.3),
(71,113.4),
(72,113.6),
(73,131.1),
(74,143.3),
(75,132.8),
(76,98.5),
(77,113.0),
(78,144.9),
(79,119.7),
(80,122.1),
(81,115.9),
(82,147.2),
(83,111.8),
(84,111.6),
(85,138.3),
(86,91.4),
(87,111.6),
(88,165.1),
(89,119.5),
(90,126.6),
(91,84.1),
(92,117.2),
(93,101.7),
(94,106.7),
(95,105.3),
(96,121.1),
(97,114.4),
(98,137.4),
(99,78.8),
(100,107.4),
(101,132.3),
(102,126.1),
(103,144.7),
(104,110.8),
(105,137.3),
(106,128.8),
(107,96.9),
(108,87.8),
(109,153.5),
(110,139.6),
(111,1952.0),
(112,1740.1),
(113,1009.5),
(114,1307.6),
(115,1518.7),
(116,1109.3),
(117,2315.8),
(118,1275.1),
(119,1247.6),
(120,1795.8)
;

SELECT COUNT(*) FROM api_lat;
-- → 120

Step 2 — Compute baseline μ and σ

SELECT AVG(latency_ms) AS mu, STDDEV(latency_ms) AS sigma FROM api_lat;
-- → μ ≈ 236.78ms, σ ≈ 407.38ms   (σ is wide because spikes are baked in)
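
Because the spikes are included in the aggregate, σ here describes the incident as much as the normal traffic. One way to see the difference is to compute the same statistics over rows assumed to be incident-free. The sketch below treats ids 1 through 110 as that clean window, which is an assumption that holds only for this sample data set; the column aliases mu_base and sigma_base are illustrative names.

```sql
-- Sketch: baseline statistics from a window assumed to be incident-free.
-- Assumption: in this sample data, rows with id <= 110 are the clean baseline.
SELECT AVG(latency_ms)    AS mu_base,
       STDDEV(latency_ms) AS sigma_base
FROM api_lat
WHERE id <= 110;
```

In a live system the clean window would have to come from somewhere else, such as a period with no open incidents.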

Step 3 — Score every sample

SELECT
    id,
    latency_ms,
    ABS(latency_ms - (SELECT AVG(latency_ms) FROM api_lat))
      / (SELECT STDDEV(latency_ms) FROM api_lat) AS z
FROM api_lat
ORDER BY z DESC
LIMIT 10;
-- → top row: latency_ms=2315.8, z=5.10  (clear incident)

Step 4 — Page on z > 3

SELECT id, latency_ms,
       ABS(latency_ms - (SELECT AVG(latency_ms) FROM api_lat))
         / (SELECT STDDEV(latency_ms) FROM api_lat) AS z
FROM api_lat
WHERE ABS(latency_ms - (SELECT AVG(latency_ms) FROM api_lat))
        / (SELECT STDDEV(latency_ms) FROM api_lat) > 3
ORDER BY z DESC;
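-- → 5 rows above 3σ  → 5 incidents (ids 111, 112, 115, 117, 120); the other 5 injected spikes score below 3 because the spikes themselves inflate σ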
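
The cutoff implied by Step 4 can be made explicit. With the μ and σ reported in Step 2 it is roughly 236.78 + 3 × 407.38 ≈ 1459ms, so any spike below that value is not flagged. The sketch below first prints the cutoff, then repeats the Step 4 query against a baseline-only reference. It assumes ids 1 through 110 are the clean window (true only for this sample data) and uses z_base as an illustrative alias.

```sql
-- Upper cutoff used implicitly by Step 4 (about 1459ms with the Step 2 values).
SELECT AVG(latency_ms) + 3 * STDDEV(latency_ms) AS upper_cutoff
FROM api_lat;

-- Sketch: same 3-sigma test, but mu and sigma come from the baseline rows only.
-- Assumption: id <= 110 marks the incident-free window in this data set.
SELECT id, latency_ms,
       ABS(latency_ms - (SELECT AVG(latency_ms) FROM api_lat WHERE id <= 110))
         / (SELECT STDDEV(latency_ms) FROM api_lat WHERE id <= 110) AS z_base
FROM api_lat
WHERE ABS(latency_ms - (SELECT AVG(latency_ms) FROM api_lat WHERE id <= 110))
        / (SELECT STDDEV(latency_ms) FROM api_lat WHERE id <= 110) > 3
ORDER BY z_base DESC;
```

Since every injected spike is above 1000ms and the baseline samples all lie between roughly 79ms and 165ms, each spike sits far beyond the baseline-only cutoff.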

Productionizing

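Stream APM samples into AIDB via the REST ingest. Compute μ/σ over a rolling 60-minute window (restrict the μ/σ subqueries to the most recent 60 minutes and re-run them hourly) so the detector adapts to traffic patterns. For multi-service setups, compute μ and σ per service rather than globally, for example with a GROUP BY service_name subquery or, where the engine supports window functions, AVG and STDDEV with OVER (PARTITION BY service_name).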
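
A minimal sketch of both ideas follows. The first query approximates a 60-minute window by using id as a minute counter, which assumes one row per minute with consecutive ids, as in the sample data. The second query scores each sample against its own service's statistics; the table api_lat_svc and its service_name column are hypothetical, and the query assumes the engine accepts a join against a grouped subquery and the standard NULLIF function. Neither was part of the tested steps above.

```sql
-- Rolling-window sketch: statistics over the most recent 60 one-minute samples.
-- Assumption: id increases by exactly 1 per minute.
SELECT AVG(latency_ms)    AS mu_60,
       STDDEV(latency_ms) AS sigma_60
FROM api_lat
WHERE id > (SELECT MAX(id) FROM api_lat) - 60;

-- Per-service sketch against a hypothetical table:
--   api_lat_svc(id, service_name, latency_ms)
SELECT s.service_name, s.id, s.latency_ms,
       -- NULLIF turns a zero sigma into NULL so the row is skipped
       -- instead of raising a division-by-zero error.
       ABS(s.latency_ms - b.mu) / NULLIF(b.sigma, 0) AS z
FROM api_lat_svc s
JOIN (
    SELECT service_name,
           AVG(latency_ms)    AS mu,
           STDDEV(latency_ms) AS sigma
    FROM api_lat_svc
    GROUP BY service_name
) b ON b.service_name = s.service_name
WHERE ABS(s.latency_ms - b.mu) / NULLIF(b.sigma, 0) > 3
ORDER BY z DESC;
```

A production table would more likely carry a timestamp column than a minute counter; the id-based filter is used here only because the sample schema has no timestamp.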

