Saneops ingests Datadog monitor webhooks, correlates duplicate firings across services, deduplicates flapping monitors, and drafts a first-pass RCA using your own LLM key. Result: a fraction of the pages, each one already enriched with the failing service, common labels across alerts, and a starting hypothesis your on-call can verify or reject.
Why Datadog alone isn't enough
Datadog's monitors are excellent at detecting a metric crossing a threshold. They are deliberately not in the business of correlating — every monitor fires independently, every firing is a separate notification. A single bad Postgres replica can fan out to 30+ Datadog notifications spanning every service that talks to it. Saneops adds the missing layer: cluster the 30 firings into one incident, deduplicate the redundant ones, and page once.
How the Datadog → Saneops pipeline works
- Webhook receiver: Each Saneops tenant has a unique inbound URL: `https://app.saneops.in/webhooks/datadog/<tenant-token>`.
- Payload parsing: The Datadog webhook integration sends a JSON body with monitor name, alert title, transition (`Triggered`/`Recovered`/`No Data`), tags, the scoped host or service, and the event message. Saneops normalises this into the same `NormalizedAlert` shape it uses for every source (see the sketch after this list).
- Severity mapping: Datadog priority (P1–P5) maps cleanly to Saneops severity (critical/high/warning/info/low). You can override per-monitor via tags.
- Correlation by tag: Datadog tags become Saneops labels. Strong-label matches on `service`, `env`, `cluster`, `kube_namespace`, and `host` drive the correlation decision.
- Auto-resolve: Datadog's `Recovered` and `No Data` transitions close the corresponding Saneops incident. Idle incidents auto-close after 24 hours regardless.
- LLM RCA: The drafted RCA references the failing tags, recent transition history, and any related alerts in the same cluster — surfacing a likely cause your on-call can verify in 30 seconds.
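To make the parsing and severity-mapping steps concrete, here is a minimal sketch in Python. The payload field names (`title`, `alert_transition`, `priority`, `tags`) and the `NormalizedAlert` fields are assumptions for illustration, not Saneops' actual internals:

```python
from dataclasses import dataclass, field

# Assumed Datadog priority -> Saneops severity mapping (P1-P5, five levels).
PRIORITY_TO_SEVERITY = {
    "P1": "critical",
    "P2": "high",
    "P3": "warning",
    "P4": "info",
    "P5": "low",
}

@dataclass
class NormalizedAlert:
    title: str
    transition: str          # Triggered / Recovered / No Data
    severity: str
    labels: dict = field(default_factory=dict)

def normalize_datadog(payload: dict) -> NormalizedAlert:
    # Datadog sends tags as a comma-separated string, e.g. "service:api,env:prod";
    # each key:value tag becomes a Saneops label.
    labels = dict(
        tag.split(":", 1) for tag in payload.get("tags", "").split(",") if ":" in tag
    )
    return NormalizedAlert(
        title=payload.get("title", ""),
        transition=payload.get("alert_transition", ""),
        severity=PRIORITY_TO_SEVERITY.get(payload.get("priority", ""), "warning"),
        labels=labels,
    )
```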
Setup
1. Create a Saneops tenant
Sign up at app.saneops.in/signup. Three fields, no card.
2. Add a Datadog webhook
In Datadog: Integrations → Webhooks → New Webhook. Name it saneops, paste your tenant URL, leave the default POST payload (Saneops parses the standard Datadog format).
URL: https://app.saneops.in/webhooks/datadog/<your-token>
Name: saneops
Payload: leave default — Saneops parses Datadog's standard JSON
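If you prefer to script this step, the webhook can also be created through Datadog's Webhooks Integration API. A minimal sketch, assuming your Datadog API and application keys are in the environment; the token in the URL is your tenant token placeholder:

```python
import os
import requests

# Create the "saneops" webhook via Datadog's Webhooks Integration API.
resp = requests.post(
    "https://api.datadoghq.com/api/v1/integration/webhooks/configuration/webhooks",
    headers={
        "DD-API-KEY": os.environ["DD_API_KEY"],
        "DD-APPLICATION-KEY": os.environ["DD_APP_KEY"],
        "Content-Type": "application/json",
    },
    json={
        "name": "saneops",
        "url": "https://app.saneops.in/webhooks/datadog/<your-token>",
        # Omitting "payload" keeps Datadog's default JSON body.
    },
)
resp.raise_for_status()
```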
3. Reference the webhook from monitors
In any monitor's notification message, add `@webhook-saneops` on its own line. Bulk-edit your existing monitors via the Datadog API to add it (see the sketch below) — the migration is non-destructive (you keep your existing PagerDuty/Slack notifications).
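A minimal sketch of that bulk edit against Datadog's v1 monitors API; it assumes the PUT leaves unspecified monitor fields unchanged, and the filtering is deliberately naive:

```python
import os
import requests

BASE = "https://api.datadoghq.com/api/v1"
HEADERS = {
    "DD-API-KEY": os.environ["DD_API_KEY"],
    "DD-APPLICATION-KEY": os.environ["DD_APP_KEY"],
}
MENTION = "@webhook-saneops"

# Append @webhook-saneops to every monitor message that doesn't have it yet.
monitors = requests.get(f"{BASE}/monitor", headers=HEADERS).json()
for monitor in monitors:
    message = monitor.get("message", "")
    if MENTION in message:
        continue  # already migrated
    requests.put(
        f"{BASE}/monitor/{monitor['id']}",
        headers=HEADERS,
        # Appending (not replacing) keeps existing PagerDuty/Slack mentions.
        json={"message": f"{message}\n{MENTION}"},
    ).raise_for_status()
```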
4. Verify
Force a test notification (Test Notifications button). The Saneops Webhook Inspector shows the exact payload received; the corresponding incident appears in the Incidents view.
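If you'd rather script the check, you can POST a synthetic payload straight at your tenant URL. The field names here mirror the normalisation sketch above and are illustrative, not a copy of Datadog's exact default template:

```python
import requests

# Fire a fake Datadog-shaped alert at the tenant URL to verify end to end.
resp = requests.post(
    "https://app.saneops.in/webhooks/datadog/<your-token>",
    json={
        "title": "[Triggered] Postgres replication lag on db-replica-3",
        "alert_transition": "Triggered",
        "priority": "P2",
        "tags": "service:postgres,env:prod,host:db-replica-3",
        "body": "Replication lag above 30s for 5 minutes.",
    },
)
print(resp.status_code)  # expect 2xx; the incident should appear in Incidents
```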
Tag-driven correlation tuning
Saneops correlation reads the same Datadog tags you already use. A few that make a big difference:
- `service:<name>` — tightest correlation signal. Two firings on the same service in the same window cluster together.
- `env:prod` — keeps prod and staging incidents separate even when the service tag matches.
- `kube_cluster_name:<cluster>`, `kube_namespace:<ns>` — multi-tenant Kubernetes clusters cluster correctly.
- `team:<name>` — pairs nicely with Saneops notification rules so the right Slack channel is paged.
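To see why these tags matter, here is a toy grouping pass over normalised alerts. The strong-label list and the 10-minute window are assumptions for illustration, not Saneops' actual correlation algorithm:

```python
from collections import defaultdict

# Alerts sharing the same strong-label values inside a time window land in
# the same incident bucket.
STRONG_LABELS = ("service", "env", "cluster", "kube_namespace", "host")
WINDOW_SECONDS = 600  # assumed 10-minute correlation window

def correlation_key(alert: dict) -> tuple:
    labels = alert["labels"]
    present = tuple(labels[k] for k in STRONG_LABELS if k in labels)
    return present + (alert["timestamp"] // WINDOW_SECONDS,)

alerts = [
    {"labels": {"service": "postgres", "env": "prod"}, "timestamp": 1000},
    {"labels": {"service": "postgres", "env": "prod"}, "timestamp": 1100},
    {"labels": {"service": "postgres", "env": "staging"}, "timestamp": 1050},
]

incidents = defaultdict(list)
for alert in alerts:
    incidents[correlation_key(alert)].append(alert)

print(len(incidents))  # 2: prod firings cluster together, staging stays separate
```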
FAQ
Does Saneops replace Datadog?
No. Datadog is your observability platform — metrics, traces, logs, dashboards. Saneops sits at the alert layer, downstream of Datadog monitors, replacing the 1:1 monitor → page model with a correlated incident model.
Does this work with Datadog Synthetics, APM error tracking, and log monitors?
Yes. Anything that emits a Datadog event with a webhook destination works — synthetic test failures, APM error rate breaches, log search monitors, all routed through the same `@webhook-saneops` mention.
Where does the LLM call go?
To whichever provider you configure — Anthropic, OpenAI, OpenAI-compatible (Together, Mistral, Groq), Gemini, DeepSeek, Grok, or self-hosted Ollama. BYOK; Saneops stores your key encrypted via Fernet and never logs it.