Field notes · SRE confessional · 4,820 words

The dashboard we stopped looking at.

For nineteen months the wall in the war room ran a forty-two-tile Grafana board nobody believed anymore. This is a record of how the noble dashboard died, why our PagerDuty queue ate four hours of sleep a week, and the six rules we wrote after a typo in a YAML file paged the whole platform team at 04:11 on a Sunday.

Author: Iben Vlk-Marston, Staff SRE, Platform Reliability
Edited by: Pia Halvorsen, Engineering Editor
Dateline: Lisbon · Tallinn · remote, 12 June
Issue: 014 / 028 in the operations sequence

42 tiles · Grafana panels at peak
1,408 pages · PagerDuty / 12 months
63% · Acknowledged then ignored
4h 12m · Mean sleep lost / on-call wk
6 rules · Survived dashboard reform
04:11 · The Sunday the YAML paged

¶ Contents — seven sections, one retrospective, no apologies

§01 · A room with forty-two tiles · Essay · 820w
§02 · The Sunday the YAML paged · Incident · 690w
§03 · An anatomy of fatigue · Data · table
§04 · The blameless retro, edited · Timeline · phases
§05 · Three boards that survived · Field guide · cards
§06 · Six rules for a watchable dashboard · Manifesto
§07 · What you'll ask, what we'll answer · FAQ · 7

§01 · Essay
Reading 7 min

A room with forty-two tiles and no one in it.

The board was built by Mihail in the autumn of 2022, when he was still excited about us. It had forty-two tiles arranged in seven rows of six, and every tile had a thin amber border that pulsed when the underlying metric crossed a threshold. It looked, from the doorway of the war room in our second-floor office on Rua das Janelas Verdes, like the cockpit of something serious. We were proud of it for about eleven weeks.

By February it had become wallpaper. By June it had become a kind of indoor weather — pulsing slightly all the time, ignored by people walking past, occasionally turning the room a faint amber when the p99 latency on the order-ingest service drifted past 480 ms for no reason anyone could explain and then drifted back. We stopped pointing at it during standups. The intern stopped asking what the tiles meant. The CTO, on his rare visits from the Tallinn office, would glance at it and say "good, good" and we would all nod.

The truth was that nobody was watching it. Not really. We were watching PagerDuty, which was watching the same metrics through a less ambitious lens, and which had the rude virtue of waking us up. The dashboard was decorative; the pager was operational. We were building the wrong instrument for the wrong audience and we did not notice because the building of it had felt productive — a thing engineers do instead of asking what alerting is actually for.

A dashboard nobody trusts is worse than no dashboard. It absorbs attention without paying it back. — Pia Halvorsen, retro notes, 17 January

I want to be specific about the failure modes, because "alert fatigue" is one of those phrases that has been worn so smooth by conference talks it barely catches anymore. Our fatigue had a shape. It was: a Slack channel called #alerts-prod with 2,300 messages a week, of which roughly 87% were the same six warnings firing on the same six hosts. It was Marko, our newest hire, asking on his second Monday whether p99_orders_ms_high was something he should care about, and three of us saying "no, that one's always like that" without irony.

It was the dashboard tile labeled queue_depth_orders showing 14,000 at 09:32 on a Tuesday, which would have been alarming in 2022 and was now, in our institutional memory, "just how Tuesdays look after the Polish locale runs its overnight reconciliations." We had stopped asking why. We had stopped looking. The dashboard was a museum of metrics we no longer interrogated, and the alerts were a doorbell rung by ghosts.

The thing about a dashboard is that it is a statement of what the team believes matters. Ours, in retrospect, was a statement that we were not sure, that we wanted everything in front of us in case any of it turned out to be the important thing. It was, to use a word we have since adopted from Pia, cowardly. We had built a screen that absolved us of the harder work of editing.

What follows is the story of the page that made us edit it. It is not a triumphant story. It involves an engineer named Sami crying quietly in a hotel room in Porto, a Slack thread of forty-one replies arguing about whether to roll back a config change, and a single line in a Helm values file with a transposed digit that should not have been able to do what it did.

§02 · Incident report
Severity 2
Duration 47 minutes

The Sunday the YAML paged everyone at 04:11.

It began at 04:11 UTC on Sunday, 23 April, with a PagerDuty notification titled OrderIngestLatencyHigh — multi-region. Sami, who was in Porto for a wedding, acknowledged it from his phone in 47 seconds, which is good, and then spent the next eleven minutes trying to find his laptop charger in a hotel room he did not entirely remember checking into, which is human.

By 04:23 the page had escalated. Three more engineers were on, plus me, plus eventually our manager Helga, who had her phone on her bedside table in Tallinn and answered with the particular flat voice of someone who has been here before. We had eleven Datadog dashboards open between us. None of them showed what was happening, because what was happening was not a latency problem. What was happening was that the staging instance of our order-ingest service had been deployed with thirty replicas instead of three, due to a transposed digit in a Helm values.yaml file Marko had touched the previous Friday afternoon and forgotten to revert.

The thirty replicas had started up cleanly. They had then, all of them, opened persistent connections to the production Kafka cluster — because staging and production shared brokers, a decision made in 2021 we kept meaning to reverse — and begun consuming from the live orders.v3 topic, which they had no business reading. Production consumers slowed. Latency on the customer-facing API rose. The dashboard tile for p99 latency went amber, then red, then nobody looked at it because we were arguing in a Zoom call about whether to roll back the last production deploy, which had nothing to do with anything.

We rolled back the wrong service. We rolled it back twice. At 04:42 Helga, with a clarity that I think about often, said "nobody touch anything for sixty seconds and someone read the goddamn changes." Marko, who had not slept either, opened the Helm chart, read the transposed digit, and said the word "oh" in a way I will not forget. By 04:58 the staging replicas were scaled down. By 05:04 latency was back to baseline. By 05:11 Sami was crying, quietly, in his hotel room, and I was on a call with him pretending I could not hear it.

The dashboard, throughout, had done nothing useful. It had pulsed. It had shown forty-two tiles, of which precisely zero pointed at the actual problem, which was a config-management failure that no panel had ever been designed to detect. We had instrumented for performance and forgotten to instrument for identity — for which workload, deployed by whom, against which broker, was doing the consuming. The board was a map of where we had looked before. The incident was somewhere we had not.

§03 · Data
Twelve-month window
Anonymised host names

An anatomy of fatigue, in numbers we should have read sooner.

We pulled the PagerDuty export covering 1 May 2023 to 30 April 2024 and joined it against the Datadog monitor catalogue. What follows is the unflattering audit we should have done in the first quarter. Every monitor below, except the last row, which was added after the incident, sent at least 40 pages over the period; the median monitor sent 7.
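The join-and-count behind an audit like this can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the format PagerDuty actually exports: the `Page` record, its `ack_seconds` and `useful` fields, and the `audit_pages` function are hypothetical names. "Slow ack" is taken here to mean an acknowledgement later than five minutes, and `useful` stands for the responder's after-the-fact judgement.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Page:
    monitor: str      # monitor name, e.g. "p99_orders_ms_high"
    ack_seconds: int  # time from page to acknowledgement (hypothetical field)
    useful: bool      # responder's post-hoc judgement (hypothetical field)


def audit_pages(pages, slow_ack_seconds=300):
    """Return (monitor, pages, slow-ack %, useful %) rows, busiest monitor first."""
    by_monitor = defaultdict(list)
    for page in pages:
        by_monitor[page.monitor].append(page)

    rows = []
    for monitor, group in by_monitor.items():
        total = len(group)
        slow = sum(1 for p in group if p.ack_seconds > slow_ack_seconds)
        useful = sum(1 for p in group if p.useful)
        rows.append(
            (monitor, total, round(100 * slow / total), round(100 * useful / total))
        )
    # Rank by page volume, highest first.
    return sorted(rows, key=lambda row: row[1], reverse=True)


# Illustrative records only; these are not figures from the real export.
sample = [
    Page("p99_orders_ms_high", 420, False),
    Page("p99_orders_ms_high", 600, False),
    Page("p99_orders_ms_high", 45, True),
    Page("p99_orders_ms_high", 900, False),
    Page("checkout_error_rate", 60, True),
]
for row in audit_pages(sample):
    print(row)
```

Ranking by page count puts the loudest monitors at the top, which is where a cull usually starts.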

Top alerting monitors, 12 mo · ranked by page volume · n = 1,408 pages

Monitor	Role in the system	Pages	Ack > 5m	Useful	Action taken
p99_orders_ms_high	customer latency canary	312	71%	12%	Silence rule
kafka_lag_orders_v3	consumer health	241	29%	58%	Kept
node_disk_pressure_*	infra capacity	198	82%	4%	Deleted
cert_expiry_warn_30d	tls hygiene	142	94%	19%	Demoted to ticket
checkout_error_rate	user-facing SLO	118	18%	81%	Kept · tightened
queue_depth_orders	throughput health	96	66%	28%	Rewritten
memory_oom_kube_*	runtime safety	81	22%	74%	Kept
synthetic_login_pt	user journey	67	31%	62%	Kept
grafana_render_slow	meta · ironic	53	88%	2%	Deleted
helm_release_drift	deploy identity	0	—	—	Added · post-23 April

"Useful" here is the post-hoc judgement of the engineer who handled the page, recorded in the PagerDuty post-incident notes. The bottom row — helm_release_drift — is the monitor that did not exist before the 23 April incident and that has, in the six weeks since, fired exactly twice and prevented two repeats.

§04 · Timeline
The blameless retro
5 phases · 8 weeks

How the retro actually went, edited for length and dignity.

The retrospective opened on 26 April in a room nobody had used since the office reorganisation. We brought coffee from Hello Kristof on Rua Poiais de São Bento and a printout of every page from the previous ninety days. We did not, at first, talk about the YAML.

01 · 26 Apr

The intake — read every page, aloud, in order.

Pia chaired. The rule was: we read each PagerDuty incident title aloud, the ack time, and the resolution note, and we said either "real" or "not real" before moving on. It took two hours and forty minutes. By 13:20 we had said "not real" 184 times out of 296.

Observation: Saying the words aloud was uncomfortable in a way that reading them silently in the morning queue was not. Marko later said this was the first time he understood the shape of the problem rather than the volume of it.

02 · 30 Apr

The catalogue — every monitor, mapped to a person.

We exported the Datadog monitor list (147 active monitors) into a shared Notion table and assigned each one an owning engineer. Twenty-eight monitors had no owner. Nineteen had been written by people who had left the company. Six referenced services we had decommissioned in 2022.

Observation: Ownership is the cheapest possible filter. A monitor without an owner is, almost by definition, a monitor nobody will tune. We deleted the six dead ones in the meeting.

03 · 07 May

The cull — three categories, hard decisions.

We bucketed every remaining monitor into page, ticket, or delete. The argument about cert_expiry_warn_30d lasted forty-one minutes. The argument about node_disk_pressure lasted nine seconds; we deleted it. Net change: 147 monitors became 58, of which 31 are pageable.

Observation: The hard category is "ticket", the page that is real but not urgent. Without that middle bucket every monitor becomes a doorbell. We bound tickets to a Friday triage at 14:00 Lisbon time.

04 · 17 May

The dashboard reform — the forty-two tiles became seven.

Mihail, with surprising grace, agreed to tear down his board. The replacement is one Grafana dashboard — internally we call it The Wall — with seven panels, four of them golden-signal SLOs, two of them deploy-identity panels (added after April), and one rolling page-volume counter. The amber-border pulse was retired with what felt like a small funeral.

Observation: The number seven was not magic. We landed on it because Pia argued, persuasively, that a dashboard with more panels than a person can hold in working memory is not a dashboard, it is a wall.

05 · 14 Jun

The follow-through — page volume down 71%, sleep up.

Six weeks after the cull: 41 pages in the period vs. 142 in the equivalent window before. Marko's sleep tracker — he volunteered the data — shows a recovery of roughly 38 minutes per on-call night. The board is on the war-room TV and people, occasionally, walk past and actually stop to look at it.

Observation: The dashboard works because it is small enough to be wrong about. When something on it changes, we notice. The forty-two-tile board was unfalsifiable; the seven-tile board is testable.

§05 · Field guide
What survived the cull
Three reference boards

Three boards that survived, and why we kept them.

Not every dashboard is a candidate for the wall. Some are for a specific audience at a specific moment — the on-call engineer at 03:00, the deploy operator at 14:00 Friday, the executive who skims on Monday. We kept three boards in addition to the wall. Each has one job.

BOARD 01 · On-call · 03:00

The pager-side panel.

A single dark Grafana page, eight tiles, optimised for being read on a phone in a hotel room. Latency, error rate, queue depth, and deploy identity: our four golden signals for each of the two services most likely to wake you. No history beyond 30 minutes.

Hit rate · opened in 94% of pages since 17 May. Median load time on a 4G connection in Porto: 1.3 s.

BOARD 02 · Deploy · 14:00 Fri

The release-eve panel.

Owned by the deploy operator of the week. Shows the diff of in-flight Helm releases, drift between desired and applied state across staging and production, and a kill switch for ArgoCD auto-sync. Lives on Pia's second monitor every Friday afternoon.

Prevented · two staging-to-production bleeds since April. Cost of the dashboard: one afternoon of Marko's time.

BOARD 03 · Executive · Monday

The weekly health page.

One page, four numbers, a small sparkline each. Reads in eleven seconds. Designed for Helga to forward to the CTO without comment. We argued for a week about whether to include the page-volume counter. We included it. He has, twice, replied with the word "good".

Cadence · Monday 09:30 Lisbon. Auto-rendered. No human assembles it on Sunday night anymore.

A dashboard is a statement of what your team believes matters. Ours, for a while, was a statement that we were not sure.

— Quiet Pager · Issue 014 · §05

§06 · Manifesto
Six rules
Pinned in #platform-ops

Six rules for a dashboard humans will keep watching.

These are pinned in #platform-ops as a Slack canvas. They are not principles. They are concrete enough that a junior engineer can be told "you violated rule three" without ambiguity. They are also, mercifully, short.

I. Fewer panels than fingers.

Seven tiles, or fewer. Always fewer.

If you cannot summarise the system in seven panels, the system is too unfamiliar to monitor, not too complicated. Add panels only by removing one. The constraint is the point.

Violated · 3 times since May · all reverted by Pia within a day.

II. Every monitor has an owner.

A name, not a team. A person.

Team ownership is no ownership. Each pageable monitor has one engineer responsible for tuning and retiring it, and that name is in the monitor's description in plain English. Hand-offs require a pull request.

Current owners · five engineers · 31 pageable monitors · audited quarterly.

III. If it pages, it is real or it is gone.

No "informational" pages. No "FYI" pages.

The page is a contract: wake me up, this matters. A monitor that has paged three times without action in ninety days is automatically demoted to a ticket. The on-call engineer can demote on the spot, with a one-line justification.

Auto-demotions · 11 monitors · since 17 May · zero appeals.
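
Rule III's auto-demotion is mechanical enough to sketch. The version below is one reading of the rule, assuming "three times without action" means three no-action pages anywhere inside the ninety-day window. `PageRecord`, `action_taken`, and `should_demote` are hypothetical names, and the dates are illustrative.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class PageRecord:
    fired_on: date
    action_taken: bool  # hypothetical flag: did the responder have to do anything?


def should_demote(history, today, window_days=90, limit=3):
    """True if at least `limit` pages inside the window ended with no action."""
    cutoff = today - timedelta(days=window_days)
    idle = [p for p in history if p.fired_on >= cutoff and not p.action_taken]
    return len(idle) >= limit


today = date(2024, 6, 12)  # illustrative date
noisy = [
    PageRecord(date(2024, 4, 2), False),
    PageRecord(date(2024, 5, 9), False),
    PageRecord(date(2024, 6, 1), False),
]
mostly_real = [
    PageRecord(date(2024, 4, 2), True),
    PageRecord(date(2024, 5, 9), False),
    PageRecord(date(2024, 1, 3), False),  # outside the ninety-day window
]
print(should_demote(noisy, today))        # True
print(should_demote(mostly_real, today))  # False
```

A stricter reading, three consecutive no-action pages, would need the history sorted by date and a run counter instead of a simple count.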

IV. Instrument identity, not just performance.

Which workload, deployed by whom, against which broker.

Before April we measured how fast our system ran. We did not measure what our system was at any given moment. The deploy-identity panel — Helm release SHA, ArgoCD sync status, broker target — is now non-negotiable on every service board.

Added · post 23 April · 14 services instrumented · two near-misses caught.

V. Read your pages out loud, once a quarter.

The intake exercise from §04 is now a ritual.

Last Friday of each quarter, 14:00 Lisbon, the on-call rotation sits in the war room and reads the previous ninety days of PagerDuty incidents aloud. "Real" or "not real". Coffee from Hello Kristof. Two hours, give or take.

Held · Q2 done · next session 27 September · all engineers required.
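
The cadence can be computed instead of remembered. A small sketch using only Python's standard library; the function name is hypothetical, and the 2024 example year is an assumption based on the export window in §03.

```python
from datetime import date, timedelta


def last_friday_of_quarter(year, quarter):
    """Return the date of the last Friday in the given quarter (1 to 4)."""
    end_month = quarter * 3
    # Find the first day of the month after the quarter, then step back one day.
    if end_month == 12:
        first_of_next = date(year + 1, 1, 1)
    else:
        first_of_next = date(year, end_month + 1, 1)
    day = first_of_next - timedelta(days=1)
    # date.weekday() counts Monday as 0, so Friday is 4.
    while day.weekday() != 4:
        day -= timedelta(days=1)
    return day


print(last_friday_of_quarter(2024, 3))  # 2024-09-27
```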

VI. The wall is for the team. Other boards are for the moment.

One dashboard everyone watches. Many they don't.

The seven-tile wall is the shared instrument. Individual boards for on-call, for deploy, for executive briefing — those are owned and short-lived. Do not put your service's debugging board on the wall. Build your own; pin it to your monitor; tear it down when it has done its work.

Wall boards · 1 · personal boards · 23 and counting · half are stale, that is fine.

§07 · FAQ
Reader questions
From the operations Slack

What you'll ask, and what we'll answer.

Q · 01 Isn't seven panels just an opinion dressed up as a rule?

It is. The number itself is not load-bearing — we have a service team at the Tallinn office that runs on six, and the platform-data board runs on five. The load-bearing claim is that a dashboard you cannot hold in working memory is one you will not consult under stress. Seven happens to be the upper bound of what Pia could glance at and describe accurately, on the cleanest white board we had, in under twelve seconds. Pick your own bound; defend it.

Q · 02 How did you get Mihail to agree to tear down his dashboard?

We did not, at first. The retro on 26 April was difficult. Mihail was not in the room — he was on annual leave in Slovenia — and we made the decision provisionally without him. When he returned on 5 May, Pia took him to lunch at A Cevicheria and walked him through the data: 312 pages from a monitor he had built, 71% acked-and-ignored, 12% useful. He asked for one week. He came back on 12 May with the seven-panel design we still use. The hardest part of dashboard reform is not technical; it is letting the person who built the old thing be the one who builds the new one.

Q · 03 Why didn't your existing monitors catch the 23 April incident?

Because the symptom looked like a latency problem and we had instrumented exhaustively for latency. We had not instrumented for workload identity — the question of which deployment, with which config, was responsible for the consumer pressure on orders.v3. The 30-replica staging deployment was invisible to our boards because our boards assumed staging and production were separate, which they were not at the Kafka layer. The fix was helm_release_drift, which compares the in-cluster Helm release SHA against a recorded baseline and fires if any workload in the production broker namespace has not been blessed by ArgoCD. It is boring. It works.
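
The comparison at the heart of such a check can be sketched as a pure function. This is an illustration of the idea only, not the monitor itself: it takes plain data instead of querying Helm or ArgoCD, and the `Workload` fields, the `baseline` mapping of workload name to blessed release SHA, and the broker names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Workload:
    name: str
    environment: str  # "staging" or "production"
    broker: str       # Kafka cluster the workload connects to
    release_sha: str  # SHA of the Helm release currently running


def find_drift(workloads, baseline, production_broker):
    """Return names of workloads on the production broker whose release SHA
    differs from the recorded baseline, or that have no baseline entry at all."""
    drifted = []
    for w in workloads:
        if w.broker != production_broker:
            continue  # not touching production traffic, so out of scope
        if baseline.get(w.name) != w.release_sha:
            drifted.append(w.name)
    return drifted


# Hypothetical baseline: workload name -> blessed release SHA.
baseline = {"order-ingest": "a1b2c3"}
workloads = [
    Workload("order-ingest", "production", "kafka-prod", "a1b2c3"),
    Workload("order-ingest-staging", "staging", "kafka-prod", "ffee99"),
    Workload("reports", "staging", "kafka-staging", "0d0d0d"),
]
print(find_drift(workloads, baseline, "kafka-prod"))  # ['order-ingest-staging']
```

The check keys on the broker rather than the environment label, because in an incident like 23 April the staging label is accurate and the broker target is what is wrong.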

Q · 04 Did anyone get blamed?

Marko was, briefly, blamed by Marko. The retro was blameless in the formal sense — Helga set the tone in the opening five minutes by reading the post-incident note aloud and saying "this is a config-management failure, not a Marko failure" — but Marko spent the next ten days quietly miserable anyway. The thing that helped was not absolution, it was the fact that we shipped helm_release_drift with him as the named owner and the monitor description begins with the words "after 23 April." He owns the fix. The fix exists because of him. That is what blameless is for.

Q · 05 How do you keep the wall from drifting back to forty-two tiles?

Two mechanics. First, the rule that you cannot add a panel without removing one — enforced by Pia, who is the named owner of the wall and has refused three change requests since May, all of them politely. Second, the quarterly read-aloud (rule V). If a panel has not earned a "real" verdict in two consecutive read-alouds, it is a candidate for removal. The wall has lost two panels and gained two since 17 May. Net change: zero. The composition has changed; the headcount has not.
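
Both mechanics reduce to small checks. A sketch, assuming panels are tracked by name and each read-aloud records one boolean verdict per panel; `change_allowed`, `removal_candidates`, and the panel names are hypothetical.

```python
def change_allowed(current, added, removed, max_panels=7):
    """One in, one out: the panel count may not grow and may never exceed the budget."""
    before = set(current)
    after = (before - set(removed)) | set(added)
    return len(after) <= min(len(before), max_panels)


def removal_candidates(verdicts, misses=2):
    """verdicts maps panel name -> list of booleans, oldest first (True = "real").
    A panel is a candidate once its last `misses` verdicts are all "not real"."""
    return sorted(
        panel
        for panel, history in verdicts.items()
        if len(history) >= misses and not any(history[-misses:])
    )


# Hypothetical panel names for a seven-tile wall.
wall = ["latency", "errors", "traffic", "saturation",
        "helm_sha", "argocd_sync", "page_volume"]

print(change_allowed(wall, added=["kafka_lag"], removed=[]))               # False
print(change_allowed(wall, added=["kafka_lag"], removed=["page_volume"]))  # True
print(removal_candidates({
    "latency": [True, True],
    "traffic": [True, False],
    "page_volume": [False, False],
}))  # ['page_volume']
```

Neither check decides anything on its own; they only surface the cases that the named owner of the wall then has to argue about.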

Q · 06 What about anomaly detection? AI-driven alerting?

We piloted Datadog's anomaly detection on the order-ingest service for six weeks in late 2023. It fired 47 times. Two were real. The rest were the order-ingest service being its noisy self on Tuesdays and Thursdays, which is a pattern our threshold-based monitors already accommodated via business-hours windows. We turned it off. Anomaly detection on noisy services is, in our experience, a way of moving the tuning problem from "where do I set the threshold" to "where do I set the sensitivity," and the second problem is harder because it is one layer removed from the system you actually understand.

Q · 07 If you could rebuild the whole monitoring stack from scratch, would you?

No. The stack is fine. Prometheus, Datadog for the SaaS side, Grafana for the boards, PagerDuty for the rotation — none of these were the problem. The problem was that we had built instruments without first agreeing on what the instruments were for. We did the engineering before we did the editing. If we could rebuild anything from scratch, it would be the conversation we should have had in autumn 2022, in the war room with Mihail and the wall TV, about which six things we believed mattered. Six, not forty-two. We would have saved ourselves nineteen months and Sami a long night in Porto.