Grafana Alerting is excellent at detecting that a metric crossed a threshold. It is deliberately not in the business of deciding which of those threshold breaches deserve a human's attention. The result, for most teams running Grafana at scale, is alert fatigue: 50+ pages in a single bad hour, the same Postgres replica failing 12 different rules, the same DNS blip firing across every microservice. This guide walks through how to cut that volume by ~80% — first by tuning what Grafana already gives you, then by adding a correlation layer downstream when those native controls run out of gas.
Why Grafana alerts get noisy
- One-rule-per-symptom design. Engineers create a Grafana alert rule for every metric that has ever burned them. Each rule fires independently. A single root cause can trip 30 rules.
- Fan-out across labels. Multi-dimensional metrics (per-service, per-pod, per-region) produce one alert instance per label combination. A bad node can mean 200 instances, all firing.
- Flapping thresholds. A metric that hovers near a threshold opens-and-closes the same alert dozens of times in an hour. Each flap is a new notification.
- No cross-rule correlation. Grafana doesn't know that 'high latency on service-A' and 'increased error rate on service-A' are the same incident. They're separate firings on separate channels.
- send_resolved misconfigured. Without send_resolved: true on receivers, downstream tools never know an alert cleared and keep paging.
Native Grafana / Alertmanager features that help (and where they fall short)
Grafana Alerting inherits Prometheus Alertmanager semantics for grouping, inhibition, and resolution. These are powerful but require careful tuning, and they only operate within Grafana's own context.
group_by — bundle related alerts into one notification
Define group_by on a notification policy so Grafana waits for the group_wait period, collects all alerts sharing the listed labels, and emits one notification per group instead of one per alert. This is the single highest-leverage Grafana-native control:
# grafana provisioning notification-policies.yaml
route:
  group_by: ['alertname', 'service', 'cluster']
  group_wait: 30s        # collect alerts for 30s before sending
  group_interval: 5m     # then add new alerts to that group at most every 5m
  repeat_interval: 4h    # remind on still-firing groups every 4h
Where it falls short: grouping is exact-match on labels. Two alerts that should logically cluster (say, service=checkout and app=checkout-api) will not group because the label keys differ. You also can't correlate across rule names that describe the same outage in different vocabularies (e.g. 'HighErrorRate' and 'P99LatencySLO').
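The exact-match behaviour can be sketched in a few lines of Python. This is an illustrative model of label-based grouping, not Grafana's implementation; the function names and the sample alerts are made up for the example.

```python
from collections import defaultdict

# Mirrors the group_by list in the policy above.
GROUP_BY = ("alertname", "service", "cluster")

def group_key(labels, group_by=GROUP_BY):
    """Build the grouping key: the value of each group_by label, or '' if absent."""
    return tuple(labels.get(name, "") for name in group_by)

def group_alerts(alerts, group_by=GROUP_BY):
    """Bucket alerts (label dicts) by exact match on the group_by labels."""
    groups = defaultdict(list)
    for labels in alerts:
        groups[group_key(labels, group_by)].append(labels)
    return dict(groups)

# Hypothetical alert instances for one outage.
alerts = [
    {"alertname": "HighErrorRate", "service": "checkout", "cluster": "prod", "pod": "checkout-1"},
    {"alertname": "HighErrorRate", "service": "checkout", "cluster": "prod", "pod": "checkout-2"},
    # Same outage, but this rule labels the workload with `app` instead of `service`.
    {"alertname": "HighErrorRate", "app": "checkout-api", "cluster": "prod", "pod": "checkout-3"},
]

groups = group_alerts(alerts)
for key, members in groups.items():
    print(key, len(members))
```

The first two alerts collapse into one group, while the third lands in its own group because it has no service label, so the on-call still receives two notifications for one outage.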
inhibit_rules — suppress symptom alerts when a cause alert is firing
Alertmanager's inhibit_rules let a higher-severity alert silence lower-severity ones with overlapping labels:
inhibit_rules:
  - source_match:
      severity: 'critical'
      alertname: 'NodeDown'
    target_match:
      severity: 'warning'
    equal: ['cluster', 'node']
Where it falls short: you have to know in advance which alert is the cause and which is the symptom. In real outages the causal chain is rarely that clean, and inhibit rules are static: they don't learn from incident patterns.
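To make the matching logic concrete, here is a minimal Python sketch of how a rule like the one above decides whether a warning is suppressed. It is a simplified model of the semantics, not Alertmanager's code; the helper names and sample alerts are assumptions for illustration.

```python
# The rule from the snippet above, expressed as a plain dict.
RULE = {
    "source_match": {"severity": "critical", "alertname": "NodeDown"},
    "target_match": {"severity": "warning"},
    "equal": ["cluster", "node"],
}

def matches(labels, matcher):
    """True if every key/value pair in the matcher is present in the labels."""
    return all(labels.get(key) == value for key, value in matcher.items())

def is_inhibited(target, firing, rule=RULE):
    """True if some firing source alert suppresses `target` under `rule`."""
    if not matches(target, rule["target_match"]):
        return False
    for source in firing:
        if source is target or not matches(source, rule["source_match"]):
            continue
        # All `equal` labels must carry the same value on both alerts.
        if all(source.get(name, "") == target.get(name, "") for name in rule["equal"]):
            return True
    return False

# Hypothetical alerts: one cause and two symptoms on different nodes.
node_down = {"alertname": "NodeDown", "severity": "critical", "cluster": "prod", "node": "n1"}
disk_n1 = {"alertname": "DiskPressure", "severity": "warning", "cluster": "prod", "node": "n1"}
disk_n2 = {"alertname": "DiskPressure", "severity": "warning", "cluster": "prod", "node": "n2"}
firing = [node_down, disk_n1, disk_n2]

print(is_inhibited(disk_n1, firing))  # True: same cluster and node as NodeDown
print(is_inhibited(disk_n2, firing))  # False: different node, so it still notifies
```

The sketch also shows the limitation: suppression only happens for cause/symptom pairs someone wrote a rule for ahead of time.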
send_resolved — auto-close on recovery
Set send_resolved: true on every receiver. Without it, downstream tools (PagerDuty, Slack, Saneops) can't auto-close, and on-call has to manually clear stale incidents — which they won't always remember to do.
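A small Python sketch shows why the resolved message matters to whatever sits downstream. It assumes the webhook payload carries an alerts list whose entries have status and fingerprint fields; the receiver function and sample payloads are hypothetical.

```python
def apply_webhook(open_alerts, payload):
    """Update the set of open alert fingerprints from one webhook payload."""
    for alert in payload.get("alerts", []):
        if alert.get("status") == "resolved":
            open_alerts.discard(alert["fingerprint"])
        else:
            open_alerts.add(alert["fingerprint"])
    return open_alerts

# Hypothetical payloads, trimmed to the fields the sketch uses.
firing = {"status": "firing", "alerts": [
    {"status": "firing", "fingerprint": "a1b2", "labels": {"alertname": "NodeDown"}},
]}
resolved = {"status": "resolved", "alerts": [
    {"status": "resolved", "fingerprint": "a1b2", "labels": {"alertname": "NodeDown"}},
]}

# Resolved messages enabled: the incident closes on its own.
with_resolved = apply_webhook(apply_webhook(set(), firing), resolved)
# Resolved messages disabled: the second payload never arrives.
without_resolved = apply_webhook(set(), firing)

print(with_resolved)     # set()
print(without_resolved)  # {'a1b2'}
```

In the second case the fingerprint stays open until a human clears it, which is the stale-incident problem described above.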
# In your Grafana contact point JSON
# (disableResolveMessage is the setting that matters; keep it false)
{
  "name": "saneops",
  "type": "webhook",
  "settings": {
    "url": "https://app.saneops.in/webhooks/grafana/<your-token>",
    "httpMethod": "POST"
  },
  "disableResolveMessage": false
}
Repeat-interval discipline
The default repeat_interval in some Grafana configs is 1h, which re-sends a notification for a still-firing alert every hour. Most teams should set this to 4-12h for warning and 1h for critical, and let humans acknowledge to silence.
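As a rough back-of-the-envelope model, the number of notifications for one continuously firing group is the initial notification plus one reminder per elapsed repeat_interval. The sketch below assumes exactly that and ignores group_interval alignment and acknowledgements; the function name is made up.

```python
from datetime import timedelta

def notifications_sent(firing_for, repeat_interval):
    """Initial notification plus one reminder per full repeat_interval elapsed."""
    return 1 + firing_for // repeat_interval

outage = timedelta(hours=8)  # hypothetical overnight incident
for hours in (1, 4, 12):
    count = notifications_sent(outage, timedelta(hours=hours))
    print(f"repeat_interval={hours}h -> {count} notifications")
# repeat_interval=1h -> 9 notifications
# repeat_interval=4h -> 3 notifications
# repeat_interval=12h -> 1 notifications
```

Under this simplified model, moving a warning-level group from 1h to 4h cuts its overnight notifications from nine to three without hiding the alert.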
Where Alertmanager-native tuning runs out
Even with perfect group_by + inhibit_rules + send_resolved, four classes of noise still get through:
- Cross-rule correlation. Two firings from different alert rules describing the same outage (e.g. an HTTP 5xx alert and a saturated-CPU alert on the same service) won't group because the rule names differ.
- Cross-source correlation. If you also run Datadog or Prometheus directly, Grafana has no view of those alerts and can't correlate across.
- Semantic deduplication. A flap that re-fires under a slightly different label set (different pod ID, restarted instance) reads as a new alert to Grafana, not a duplicate.
- First-pass triage. Even after grouping, the on-call still wakes up to a one-line summary and starts the investigation from zero. There's no LLM-drafted hypothesis attached.
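The semantic-deduplication gap can be illustrated with a fingerprint that ignores labels expected to change across restarts. This is a sketch of the general idea only; the set of volatile labels and the function name are assumptions, and neither Grafana nor any specific product is claimed to work this way.

```python
import hashlib
import json

# Assumption: labels that change when a pod or instance restarts.
VOLATILE_LABELS = frozenset({"pod", "instance", "container_id"})

def stable_fingerprint(labels, volatile=VOLATILE_LABELS):
    """Hash the labels after dropping volatile ones, in a canonical key order."""
    stable = {k: v for k, v in labels.items() if k not in volatile}
    canonical = json.dumps(stable, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]

# Hypothetical firings: the same alert before and after a pod restart.
before_restart = {"alertname": "HighErrorRate", "service": "checkout", "pod": "checkout-7f9c-abcde"}
after_restart = {"alertname": "HighErrorRate", "service": "checkout", "pod": "checkout-7f9c-xyz12"}
other_service = {"alertname": "HighErrorRate", "service": "search", "pod": "search-5d8b-qwert"}

print(stable_fingerprint(before_restart) == stable_fingerprint(after_restart))  # True
print(stable_fingerprint(before_restart) == stable_fingerprint(other_service))  # False
```

A fingerprint over the full label set would treat the restarted pod as a brand-new alert; dropping the volatile labels first lets the re-fire count as a duplicate.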
The Saneops augmentation pattern
Saneops ingests Grafana Alerting webhooks downstream of your Grafana group_by grouping and adds five things Grafana can't do natively:
- Label-strong + semantic correlation. Saneops clusters firings within a 10-minute window that share strong labels (service, namespace, cluster, deployment, job, app, pod) and applies cosine similarity over alert text as a fallback. So 'HighErrorRate on checkout' and 'P99LatencySLO on checkout-api' end up in one incident.
- Content-hash dedup + flap detection. A fingerprint that re-fires with cosmetic label drift is recognised as the same alert and counted as a dedup hit, not a new page.
- LLM-drafted RCA. When an incident reaches your severity threshold, Saneops asks your tenant's LLM (BYOK across Anthropic / OpenAI / Gemini / Grok / DeepSeek / OpenAI-compatible / Ollama) for a 3-bullet first-pass cause hypothesis. The on-call wakes up with a hypothesis to verify or reject in 30 seconds instead of starting from zero.
- Idle-incident sweep. Saneops auto-closes incidents idle for 24h by default, a safety net for receivers that don't honour send_resolved.
- Severity gating to PagerDuty / Slack / Teams. Saneops decides what reaches your pager based on severity, business hours, on-call capacity, or any CEL expression, so PagerDuty stays the rotation engine but only fires on incidents that actually warrant a human at 3 AM.
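The first item in the list above (strong labels inside a time window, with text similarity as a fallback) can be sketched as follows. This is not Saneops code: the similarity threshold, the bag-of-words tokenizer, the "at least one shared strong label" rule, and the greedy clustering loop are all assumptions chosen to keep the example short.

```python
import math
import re
from collections import Counter

STRONG_LABELS = ("service", "namespace", "cluster", "deployment", "job", "app", "pod")
WINDOW_SECONDS = 10 * 60
SIMILARITY_THRESHOLD = 0.5  # assumption; a real system would tune this

def tokens(text):
    """Bag-of-words token counts over lowercase alphanumeric runs."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def shares_strong_label(a, b):
    """True if both label sets carry the same value for at least one strong label."""
    return any(k in a and a[k] == b.get(k) for k in STRONG_LABELS)

def related(x, y):
    """Same window, and either a shared strong label or similar alert text."""
    if abs(x["ts"] - y["ts"]) > WINDOW_SECONDS:
        return False
    if shares_strong_label(x["labels"], y["labels"]):
        return True
    return cosine(tokens(x["text"]), tokens(y["text"])) >= SIMILARITY_THRESHOLD

def correlate(firings):
    """Greedily attach each firing to the first incident with a related member."""
    incidents = []
    for firing in sorted(firings, key=lambda f: f["ts"]):
        for incident in incidents:
            if any(related(firing, member) for member in incident):
                incident.append(firing)
                break
        else:
            incidents.append([firing])
    return incidents

# Hypothetical firings; ts is seconds since the first alert.
firings = [
    {"ts": 0, "labels": {"service": "checkout"}, "text": "HighErrorRate on checkout"},
    {"ts": 60, "labels": {"app": "checkout-api"}, "text": "P99LatencySLO on checkout-api"},
    {"ts": 120, "labels": {"service": "batch-worker"}, "text": "DiskFull on batch-worker"},
]
incidents = correlate(firings)
print([len(incident) for incident in incidents])  # [2, 1]
```

The two checkout firings share no label key, so only the text fallback joins them; the batch-worker firing stays in its own incident.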
Want to see this on your own alert stream? Saneops is free for the first 1,000 alerts/month — no card, BYOK LLM, Docker self-host or hosted cloud.
Real before/after — example team
A 12-engineer SaaS team running Grafana on a 50-node Kubernetes cluster, ~120 alert rules in Grafana Alerting, ~3,500 alert firings/month before Saneops:
- Before: ~3,500 Grafana alert firings → ~1,800 Slack messages → ~280 PagerDuty pages → 14 P1 night-shift wake-ups in May. Engineers had grown to ignore Slack channel #alerts entirely.
- After (with Grafana group_by tuned + Saneops downstream): ~3,500 Grafana firings → 540 Saneops incidents (correlated + dedup'd) → ~95 incidents above warning threshold → ~38 PagerDuty pages → 2 P1 night-shift wake-ups in June. Same outages, ~86% fewer pages (~280 → ~38).
- Bonus: every PagerDuty page now opens with an LLM-drafted 3-bullet hypothesis at the top of the Saneops incident, and first-call resolution time fell from a median of 22 minutes to 9 minutes.
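For readers who want to check the funnel, the reduction at each stage can be recomputed from the before/after figures quoted above. The inputs are the approximate numbers from this example, so the percentages are only as precise as that rounding; the helper name is made up.

```python
def reduction_pct(before, after):
    """Percentage drop from `before` to `after`, rounded to one decimal place."""
    return round(100 * (before - after) / before, 1)

# (before, after) pairs taken from the example team above.
stages = {
    "firings -> incidents": (3500, 540),
    "PagerDuty pages": (280, 38),
    "P1 night-shift wake-ups": (14, 2),
    "median resolution (minutes)": (22, 9),
}

for name, (before, after) in stages.items():
    print(f"{name}: {reduction_pct(before, after)}% lower")
# firings -> incidents: 84.6% lower
# PagerDuty pages: 86.4% lower
# P1 night-shift wake-ups: 85.7% lower
# median resolution (minutes): 59.1% lower
```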
Teams whose rules share service and cluster labels see the largest gains. Teams whose alert rules emit unique label sets per rule see less.
How to set this up in 10 minutes
If you already have Grafana Alerting wired with a contact point, the migration is non-destructive and reversible:
- Sign up for a Saneops tenant at app.saneops.in/signup — three fields, no credit card.
- Add a Grafana contact point pointing at https://app.saneops.in/webhooks/grafana/<your-token>. Set HTTP Method to POST, leave 'Disable resolved messages' OFF.
- Update one notification policy to route to the Saneops contact point. Keep your existing PagerDuty / Slack contact points for now; run them in parallel for 4 weeks to confirm Saneops is correlating correctly.
- Connect PagerDuty as a Saneops outbound (Events API v2 routing key) and set the severity floor to critical so Saneops only pages PagerDuty when warranted.
- After 4 weeks of parallel run, remove the direct Grafana → PagerDuty contact point. PagerDuty's seat count and rotation logic stay; you just feed it ~80% fewer events.
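The severity floor in the PagerDuty step is a simple ordering check. The Python sketch below shows the idea only; the severity names, their ordering, and the function name are assumptions, not the Saneops configuration schema.

```python
# Assumption: severities ordered from least to most urgent.
SEVERITY_ORDER = ("info", "warning", "critical")

def meets_floor(severity, floor="critical"):
    """True if `severity` is at or above the configured floor."""
    return SEVERITY_ORDER.index(severity) >= SEVERITY_ORDER.index(floor)

# Hypothetical incidents produced by the correlation layer.
incidents = [
    {"id": 1, "severity": "warning"},
    {"id": 2, "severity": "critical"},
    {"id": 3, "severity": "info"},
]

paged = [incident["id"] for incident in incidents if meets_floor(incident["severity"])]
print(paged)  # [2]
```

With the floor at critical, only incident 2 would reach the pager; lowering the floor to warning would also page for incident 1.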
Companion guides
If your stack also runs Datadog or you're considering replacing PagerDuty entirely, the same correlation pattern applies. See the Grafana integration setup for the full webhook reference, the PagerDuty alternative page for migration patterns, and the Saneops vs Keep comparison if you're evaluating an open-source self-hosted route.