Reducing Alert Fatigue: Bridging Prometheus Severity Labels to Icinga
January 23, 2026 · 10 min read
When every firing alert screams CRITICAL, nothing is truly critical anymore. This is the story of how a subtle gap between Prometheus-style alerting and Icinga created unnecessary noise, and how we fixed it.
The Setup
Our monitoring stack combines several battle-tested tools:
    Consul (service registry)
        |
        v
    Consul Exporter -> VictoriaMetrics -> VMAlert -> check_prometheus -> Icinga

- Consul provides service health and registration data
- VictoriaMetrics stores metrics (Prometheus-compatible)
- VMAlert evaluates alerting rules and exposes alert state
- check_prometheus bridges VMAlert to Icinga via active checks
- Icinga displays alerts and handles notifications
This architecture lets us define alerts using PromQL, leverage VictoriaMetrics' performance, and present everything in Icinga's familiar interface.
The Problem: All Alerts Are CRITICAL
We had carefully designed our alerting rules with appropriate severity levels:
    # Warning: Some instances unhealthy, but not catastrophic
    - alert: ConsulServiceLowHealthyRatio
      expr: consul_service_healthy_ratio < 0.5
      labels:
        severity: warning
      annotations:
        summary: "Service {{ $labels.service_name }} healthy ratio below 50%"

    # Critical: Complete service outage
    - alert: ConsulServiceNoHealthyInstances
      expr: consul_service_instances_passing == 0
      labels:
        severity: critical
      annotations:
        summary: "Service {{ $labels.service_name }} has no healthy instances"

But in Icinga, both showed up as CRITICAL:
    [CRITICAL] [ConsulServiceLowHealthyRatio] is firing - value: 0.45 - {"severity":"warning"}
    [CRITICAL] [ConsulServiceNoHealthyInstances] is firing - value: 0 - {"severity":"critical"}

The severity=warning label was right there in the output, but Icinga treated both as CRITICAL. This created two problems:
- Alert fatigue: On-call engineers became desensitized to CRITICAL alerts
- Incorrect escalations: Warning-level issues triggered critical response procedures
Root Cause: Lost in Translation
The check_prometheus plugin determines Icinga exit codes based solely on alert state:
| Alert State | Icinga Status |
|---|---|
| firing | CRITICAL (2) |
| pending | WARNING (1) |
| inactive | OK (0) |
The translation gap:
- Prometheus world: Severity is a label (severity=warning)
- Icinga world: Severity is an exit code (0, 1, 2, 3)
- check_prometheus: Only looks at state, ignores severity label
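
To make the gap concrete, here is a minimal standalone sketch of a state-only mapping. It is not the plugin's actual code: the function name `statusFromState` and the plain string and map types are assumptions chosen for illustration. The labels are passed in but never consulted, so a firing warning and a firing critical are indistinguishable.

```go
package main

import "fmt"

// Icinga plugin exit codes.
const (
	OK       = 0
	Warning  = 1
	Critical = 2
	Unknown  = 3
)

// statusFromState (hypothetical name) mirrors the state-only mapping from the
// table above. The labels argument is accepted but never read, which is
// exactly the translation gap.
func statusFromState(state string, labels map[string]string) int {
	switch state {
	case "firing":
		return Critical
	case "pending":
		return Warning
	case "inactive":
		return OK
	default:
		return Unknown
	}
}

func main() {
	warn := map[string]string{"severity": "warning"}
	crit := map[string]string{"severity": "critical"}

	// Both calls return the same exit code despite different severity labels.
	fmt.Println(statusFromState("firing", warn))
	fmt.Println(statusFromState("firing", crit))
}
```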
Options We Considered
Option 1: Signalilo (Alertmanager to Icinga bridge)
Signalilo receives webhooks from Alertmanager and creates passive check results in Icinga.
Problems we found (confirmed by reading the source code):
- Service names are hashed: alertname_<hash(UUID + labels)>
- Display names are just alertname, causing duplicate entries in Icinga UI
- No history retention for passive checks
- Additional operational overhead
We deployed it briefly and ended up with 276 services with identical display names. Not ideal.
Option 2: Custom wrapper script
Write a script that queries VMAlert API, parses severity, and outputs appropriate exit codes.
Problems:
- Another tool to maintain
- Duplicates functionality already in check_prometheus
- More moving parts
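
For a sense of what such a wrapper involves, here is a rough sketch of its core. It assumes the alerts endpoint returns the Prometheus-style JSON shape (`data.alerts[]` with `state` and `labels`); the type `alertsResponse`, the function `exitCode`, and the sample payload are all hypothetical. A real wrapper would also need HTTP fetching, timeouts, filtering, and output formatting, which is the functionality check_prometheus already provides.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// alertsResponse models only the fields this sketch needs from a
// Prometheus-style alerts response (assumed shape).
type alertsResponse struct {
	Data struct {
		Alerts []struct {
			State  string            `json:"state"`
			Labels map[string]string `json:"labels"`
		} `json:"alerts"`
	} `json:"data"`
}

// exitCode returns the worst Icinga exit code across all alerts in body:
// 0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN (payload not parseable).
func exitCode(body []byte) int {
	var resp alertsResponse
	if err := json.Unmarshal(body, &resp); err != nil {
		return 3
	}
	worst := 0
	for _, a := range resp.Data.Alerts {
		code := 0
		switch a.State {
		case "firing":
			code = 2
			// Downgrade firing alerts that are labeled as warnings.
			if strings.EqualFold(a.Labels["severity"], "warning") {
				code = 1
			}
		case "pending":
			code = 1
		}
		if code > worst {
			worst = code
		}
	}
	return worst
}

func main() {
	// Hypothetical payload; a real wrapper would fetch this over HTTP.
	body := []byte(`{"status":"success","data":{"alerts":[
		{"state":"firing","labels":{"alertname":"ConsulServiceLowHealthyRatio","severity":"warning"}}
	]}}`)
	fmt.Println(exitCode(body))
}
```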
Option 3: Patch check_prometheus
Modify the plugin to honor the severity label when determining exit codes.
Advantages:
- Minimal workflow change
- Same CLI interface
- Fixes the problem at the right layer
We chose Option 3.
The Fix
Two changes to internal/alert/alert.go:
1. Honor severity label for firing alerts
    func (a *Rule) GetStatus() (status int) {
        state := a.AlertingRule.State

        // Default mapping: derive the status from the alert state alone
        switch state {
        case string(v1.AlertStateFiring):
            status = check.Critical
        case string(v1.AlertStatePending):
            status = check.Warning
        case string(v1.AlertStateInactive):
            status = check.OK
        default:
            status = check.Unknown
        }

        // Honor severity label for firing alerts
        if state == string(v1.AlertStateFiring) {
            severity := ""

            // Read severity from the alert's labels, if an alert is attached
            if a.Alert != nil {
                if v, ok := a.Alert.Labels["severity"]; ok {
                    severity = strings.ToLower(string(v))
                }
            }

            switch severity {
            case "warning", "warn":
                return check.Warning
            case "info", "informational":
                return check.OK
            case "critical":
                return check.Critical
            }
        }

        // No severity label (or an unrecognized value): keep the state-based status
        return status
    }

The mapping:
| Severity Label | Icinga Exit Code |
|---|---|
| critical, absent, or unrecognized | CRITICAL (2) |
| warning, warn | WARNING (1) |
| info, informational | OK (0) |
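
The mapping can be exercised in isolation with a small table-driven sketch. This is a simplified stand-in for the patched method rather than the plugin code itself: `firingStatus` is a hypothetical helper that takes a plain `map[string]string`, and the `page` severity is an invented example of an unrecognized value.

```go
package main

import (
	"fmt"
	"strings"
)

// Icinga plugin exit codes used by this sketch.
const (
	OK       = 0
	Warning  = 1
	Critical = 2
)

// firingStatus (hypothetical helper) maps the severity label of a firing
// alert to an exit code. Missing or unrecognized values stay CRITICAL,
// which keeps the pre-patch behavior for alerts without a severity label.
func firingStatus(labels map[string]string) int {
	switch strings.ToLower(labels["severity"]) {
	case "warning", "warn":
		return Warning
	case "info", "informational":
		return OK
	default:
		return Critical
	}
}

func main() {
	cases := []map[string]string{
		{"severity": "warning"},
		{"severity": "INFO"}, // matching is case-insensitive
		{"severity": "critical"},
		{"severity": "page"}, // unrecognized value
		{},                   // no severity label
	}
	for _, labels := range cases {
		fmt.Printf("%v -> %d\n", labels, firingStatus(labels))
	}
}
```

Defaulting to CRITICAL rather than UNKNOWN is what makes the change safe to roll out: rules that were never labeled keep paging exactly as they did before.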
2. Include annotations in output
    // Append annotations for context
    if summary, ok := a.Alert.Annotations["summary"]; ok {
        out.WriteString(fmt.Sprintf(" - summary: %s",
            strings.ReplaceAll(string(summary), "\n", " ")))
    }

    if description, ok := a.Alert.Annotations["description"]; ok {
        out.WriteString(fmt.Sprintf(" - description: %s",
            strings.ReplaceAll(string(description), "\n", " ")))
    }

This surfaces the alert context directly in Icinga, so engineers can triage without opening Grafana or Consul.
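
The same idea as a standalone sketch, in case you want to try the formatting outside the plugin. The helper `appendAnnotations` and its plain map argument are assumptions for illustration. Note that the fragment above indexes `a.Alert.Annotations` directly, so it presumably runs where `a.Alert` is known to be non-nil; the sketch sidesteps that by taking the map itself.

```go
package main

import (
	"fmt"
	"strings"
)

// appendAnnotations (hypothetical helper) appends the summary and description
// annotations to a single-line check output, flattening embedded newlines so
// each alert stays on one line.
func appendAnnotations(line string, annotations map[string]string) string {
	var out strings.Builder
	out.WriteString(line)
	for _, key := range []string{"summary", "description"} {
		if v, ok := annotations[key]; ok {
			fmt.Fprintf(&out, " - %s: %s", key, strings.ReplaceAll(v, "\n", " "))
		}
	}
	return out.String()
}

func main() {
	line := "[WARNING] [ConsulServiceLowHealthyRatio] is firing"

	fmt.Println(appendAnnotations(line, map[string]string{
		"summary": "Service healthy ratio\nbelow 50%",
	}))

	// Indexing a nil map is safe in Go, so alerts without annotations
	// pass through unchanged.
	fmt.Println(appendAnnotations(line, nil))
}
```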
The Result
Before:

    [CRITICAL] - 17 Alerts: 17 Firing - 0 Pending - 0 Inactive
    _ [CRITICAL] [ConsulServiceLowHealthyRatio] is firing - value: 0.00 - {"severity":"warning"}

After:

    [WARNING] - 17 Alerts: 17 Firing - 0 Pending - 0 Inactive
    _ [WARNING] [ConsulServiceLowHealthyRatio] is firing - value: 0.00 - {"severity":"warning"}
    - summary: Service healthy ratio below 50%
    - description: Healthy ratio is 0%

The alert state still says "is firing" (accurate), but Icinga correctly shows WARNING status.
Lessons Learned
- Severity is a convention, not a standard: Prometheus doesn't enforce how you use the severity label. Downstream tools may ignore it entirely.
- Check your bridges: When connecting monitoring systems, verify that metadata (like severity) survives the translation.
- Alert fatigue is real: When everything is CRITICAL, engineers stop responding urgently. Proper severity mapping directly impacts incident response quality.
- Read the source: We only understood Signalilo's naming behavior by reading the code. Documentation didn't cover the hash-based service naming.
- Context matters: Including annotations in check output can reduce MTTR, because engineers no longer need to look up alert details elsewhere.
What's Next
The fix is available in my fork of check_prometheus. The change is backwards-compatible: alerts without a severity label behave exactly as before.
For teams with similar setups, you can either:
- Use the patched version from the fork mentioned above
- Apply the patch to your own build
- Use the approach as a reference for your own monitoring bridges
The code changes are minimal, but the impact on alert quality is significant.
Have you encountered similar gaps between monitoring systems? How did you solve them? The monitoring ecosystem is full of these translation challenges, and sharing solutions helps everyone build more reliable systems.