Reducing Alert Fatigue: Bridging Prometheus Severity Labels to Icinga

When every firing alert screams CRITICAL, nothing is truly critical anymore. This is the story of how a subtle gap between Prometheus-style alerting and Icinga created unnecessary noise, and how we fixed it.

The Setup

Our monitoring stack combines several battle-tested tools:

Consul (service registry)
    |
    v
Consul Exporter -> VictoriaMetrics -> VMAlert -> check_prometheus -> Icinga

Consul provides service health and registration data
VictoriaMetrics stores metrics (Prometheus-compatible)
VMAlert evaluates alerting rules and exposes alert state
check_prometheus bridges VMAlert to Icinga via active checks
Icinga displays alerts and handles notifications

This architecture lets us define alerts using PromQL, leverage VictoriaMetrics' performance, and present everything in Icinga's familiar interface.

The Problem: All Alerts Are CRITICAL

We had carefully designed our alerting rules with appropriate severity levels:

# Warning: Some instances unhealthy, but not catastrophic
- alert: ConsulServiceLowHealthyRatio
  expr: consul_service_healthy_ratio < 0.5
  labels:
    severity: warning
  annotations:
    summary: "Service {{ $labels.service_name }} healthy ratio below 50%"

# Critical: Complete service outage
- alert: ConsulServiceNoHealthyInstances
  expr: consul_service_instances_passing == 0
  labels:
    severity: critical
  annotations:
    summary: "Service {{ $labels.service_name }} has no healthy instances"

But in Icinga, both showed up as CRITICAL:

[CRITICAL] [ConsulServiceLowHealthyRatio] is firing - value: 0.45 - {"severity":"warning"}
[CRITICAL] [ConsulServiceNoHealthyInstances] is firing - value: 0 - {"severity":"critical"}

The severity=warning label was right there in the output, but Icinga treated both as CRITICAL. This created two problems:

Alert fatigue: On-call engineers became desensitized to CRITICAL alerts
Incorrect escalations: Warning-level issues triggered critical response procedures

Root Cause: Lost in Translation

The check_prometheus plugin determines Icinga exit codes based solely on alert state:

Alert State	Icinga Status
`firing`	CRITICAL (2)
`pending`	WARNING (1)
`inactive`	OK (0)

The translation gap:

Prometheus world: Severity is a label (severity=warning)
Icinga world: Severity is an exit code (0, 1, 2, 3)
check_prometheus: Only looks at state, ignores severity label

Options We Considered

Option 1: Signalilo (Alertmanager to Icinga bridge)

Signalilo receives webhooks from Alertmanager and creates passive check results in Icinga.

Problems we found (confirmed by reading the source code):

Service names are hashed: alertname_<hash(UUID + labels)>
Display names are just alertname, causing duplicate entries in Icinga UI
No history retention for passive checks
Additional operational overhead

We deployed it briefly and ended up with 276 services with identical display names. Not ideal.
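The naming collision can be illustrated with a small sketch. It is not Signalilo's code: the SHA-256 hash, the 8-byte truncation, and the names serviceName and displayName are assumptions chosen only to mirror the alertname_<hash(UUID + labels)> pattern described above.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"sort"
	"strings"
)

// serviceName (hypothetical) builds a unique object name from the alert name
// plus a hash over an instance UUID and the alert's labels.
func serviceName(alertname, uuid string, labels map[string]string) string {
	// Sort label keys so the same label set always hashes to the same value.
	keys := make([]string, 0, len(labels))
	for k := range labels {
		keys = append(keys, k)
	}
	sort.Strings(keys)

	var b strings.Builder
	b.WriteString(uuid)
	for _, k := range keys {
		b.WriteString(k + "=" + labels[k] + ";")
	}
	sum := sha256.Sum256([]byte(b.String()))
	return alertname + "_" + hex.EncodeToString(sum[:8])
}

// displayName (hypothetical) drops the hash, which is why every instance of
// the same alert looks identical in the UI.
func displayName(alertname string) string {
	return alertname
}
```

Two alerts that differ only in their service_name label get different service names but the same display name, which matches the duplicate entries we saw in the Icinga UI.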

Option 2: Custom wrapper script

Write a script that queries the VMAlert API, parses the severity label, and exits with the matching Icinga exit code.

Problems:

Another tool to maintain
Duplicates functionality already in check_prometheus
More moving parts
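For comparison, the core of such a wrapper might look like the sketch below. It assumes the Prometheus-style response shape for the alerts API (data.alerts[] with state and labels). The type and function names are hypothetical, and the HTTP request itself is left out so the example stays self-contained.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// alertsResponse models only the fields this sketch needs from a
// Prometheus-style alerts API response (assumed shape).
type alertsResponse struct {
	Data struct {
		Alerts []struct {
			State  string            `json:"state"`
			Labels map[string]string `json:"labels"`
		} `json:"alerts"`
	} `json:"data"`
}

// worstExitCode (hypothetical) parses the payload and returns the highest
// Icinga exit code among all alerts: 0 OK, 1 WARNING, 2 CRITICAL.
// Malformed input yields 3 (UNKNOWN) and an error.
func worstExitCode(payload []byte) (int, error) {
	var resp alertsResponse
	if err := json.Unmarshal(payload, &resp); err != nil {
		return 3, fmt.Errorf("decode alerts: %w", err)
	}
	worst := 0
	for _, a := range resp.Data.Alerts {
		code := 0 // states other than firing and pending count as OK here
		switch a.State {
		case "firing":
			code = 2 // default when severity is critical, absent, or unrecognized
			switch strings.ToLower(a.Labels["severity"]) {
			case "warning", "warn":
				code = 1
			case "info", "informational":
				code = 0
			}
		case "pending":
			code = 1
		}
		if code > worst {
			worst = code
		}
	}
	return worst, nil
}
```

Even in this reduced form, the state and severity mapping has to be written and tested separately from check_prometheus, which is the duplication noted above.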

Option 3: Patch check_prometheus

Modify the plugin to honor the severity label when determining exit codes.

Advantages:

Minimal workflow change
Same CLI interface
Fixes the problem where it occurs: the state-to-exit-code translation

We chose Option 3.

The Fix

Two changes to internal/alert/alert.go:

1. Honor severity label for firing alerts

func (a *Rule) GetStatus() (status int) {
    state := a.AlertingRule.State

    switch state {
    case string(v1.AlertStateFiring):
        status = check.Critical
    case string(v1.AlertStatePending):
        status = check.Warning
    case string(v1.AlertStateInactive):
        status = check.OK
    default:
        status = check.Unknown
    }

    // Honor severity label for firing alerts
    if state == string(v1.AlertStateFiring) {
        severity := ""
        // Read the severity label from the alert instance, if one is attached
        if a.Alert != nil {
            if v, ok := a.Alert.Labels["severity"]; ok {
                severity = strings.ToLower(string(v))
            }
        }
        switch severity {
        case "warning", "warn":
            return check.Warning
        case "info", "informational":
            return check.OK
        case "critical":
            return check.Critical
        }
    }

    return status
}

The mapping:

Severity Label	Icinga Exit Code
`critical` (or absent)	CRITICAL (2)
`warning`, `warn`	WARNING (1)
`info`, `informational`	OK (0)
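
The mapping can also be written as a standalone function over plain strings, which makes the edge cases easier to see. This is a sketch, not the plugin's code: exitCode and the constants are illustrative names, and the behavior for unrecognized values follows from the fall-through in the snippet above.

```go
package main

import "strings"

// Icinga plugin exit codes.
const (
	OK       = 0
	Warning  = 1
	Critical = 2
	Unknown  = 3
)

// exitCode (hypothetical) maps an alert state and its severity label to an
// Icinga exit code. Severity only matters while the alert is firing.
func exitCode(state, severity string) int {
	switch state {
	case "firing":
		switch strings.ToLower(severity) {
		case "warning", "warn":
			return Warning
		case "info", "informational":
			return OK
		default:
			// "critical", a missing label, or any unrecognized value.
			return Critical
		}
	case "pending":
		return Warning
	case "inactive":
		return OK
	default:
		return Unknown
	}
}
```

Under these assumptions, exitCode("firing", "Warning") returns 1 because the label is lower-cased first, exitCode("firing", "page") returns 2 because unrecognized values fall back to CRITICAL, and exitCode("pending", "critical") returns 1 because pending alerts are not affected by the label.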

2. Include annotations in output

// Append annotations for context. Guard against a nil Alert, as GetStatus does.
if a.Alert != nil {
    if summary, ok := a.Alert.Annotations["summary"]; ok {
        out.WriteString(fmt.Sprintf(" - summary: %s",
            strings.ReplaceAll(string(summary), "\n", " ")))
    }
    if description, ok := a.Alert.Annotations["description"]; ok {
        out.WriteString(fmt.Sprintf(" - description: %s",
            strings.ReplaceAll(string(description), "\n", " ")))
    }
}

This surfaces the alert context directly in Icinga, so engineers can triage without opening Grafana or Consul.
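
A self-contained version of the same idea is sketched below, using a plain map instead of the client library's label types. The helper name formatAnnotations is hypothetical; the fixed key order and the newline flattening mirror the snippet above.

```go
package main

import (
	"fmt"
	"strings"
)

// formatAnnotations (hypothetical) renders the summary and description
// annotations, if present, as " - key: value" segments.
func formatAnnotations(annotations map[string]string) string {
	var out strings.Builder
	// Fixed key order keeps the check output stable between runs.
	for _, key := range []string{"summary", "description"} {
		value, ok := annotations[key]
		if !ok {
			continue
		}
		// Flatten newlines so multi-line annotation text cannot break
		// the plugin output.
		fmt.Fprintf(&out, " - %s: %s", key, strings.ReplaceAll(value, "\n", " "))
	}
	return out.String()
}
```

Passing a nil or empty map returns an empty string, so alerts without annotations produce the same output as before.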

The Result

Before:

[CRITICAL] - 17 Alerts: 17 Firing - 0 Pending - 0 Inactive
_ [CRITICAL] [ConsulServiceLowHealthyRatio] is firing - value: 0.00 - {"severity":"warning"}

After:

[WARNING] - 17 Alerts: 17 Firing - 0 Pending - 0 Inactive
_ [WARNING] [ConsulServiceLowHealthyRatio] is firing - value: 0.00 - {"severity":"warning"}
  - summary: Service healthy ratio below 50%
  - description: Healthy ratio is 0%

The alert state still says "is firing" (accurate), but Icinga correctly shows WARNING status.

Lessons Learned

Severity is a convention, not a standard: Prometheus doesn't enforce how you use the severity label. Downstream tools may ignore it entirely.
Check your bridges: When connecting monitoring systems, verify that metadata (like severity) survives the translation.
Alert fatigue is real: When everything is CRITICAL, engineers stop responding urgently. Proper severity mapping directly impacts incident response quality.
Read the source: We only understood Signalilo's naming behavior by reading the code. Documentation didn't cover the hash-based service naming.
Context matters: Including annotations in check output can reduce MTTR by removing the need to look up alert details elsewhere.

What's Next

The fix is available in my fork of check_prometheus. The exit-code change is backwards-compatible: alerts without a severity label, or with a value the mapping does not recognize, keep the same exit code as before.

View Implementation | Feature Request Discussion

Teams with similar setups have three options:

Use the patched version from the repository above
Apply the patch to your own build
Use the approach as a reference for your own monitoring bridges

The code changes are minimal, but the impact on alert quality is significant.

Have you encountered similar gaps between monitoring systems? How did you solve them? The monitoring ecosystem is full of these translation challenges, and sharing solutions helps everyone build more reliable systems.