DNS-SD Service Discovery for Prometheus Monitoring
Project Overview
Corporate NDA Policy
Due to corporate NDA restrictions, I cannot share the exact implementation code. This page provides an overview of the architecture and approach used to solve the problem.
When your monitoring system can't see your services, you've got a problem. This project implemented DNS SRV-based service discovery to monitor infrastructure that existed outside our Consul-registered world.
The Problem
Our monitoring stack was built on a solid foundation: VictoriaMetrics for storage, vmagent for scraping, VMAlert for alerting, and Consul for service discovery. Everything registered in Consul got monitored automatically.
Then came the streaming nodes—critical media services that weren't registered in Consul. No Consul registration meant no automatic discovery. No discovery meant no monitoring. No monitoring meant flying blind on production services.
The Solution: DNS SRV Records
DNS SRV records are often overlooked, but they're a powerful mechanism for service discovery. Standardized in RFC 2782 (2000), they are used by everything from Active Directory to SIP.
_node_exporter._tcp.prod.example.com. 60 IN SRV 10 50 9100 stream01.prod.example.com.

The key insight: our infrastructure team was already maintaining these SRV records for other purposes. We could piggyback on them for monitoring discovery.
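To make the record format concrete, here is a minimal Python sketch that splits a zone-file style SRV line into its fields (TTL, priority, weight, port, target). The `SrvRecord` class and `parse_srv_line` function are illustrative names, not part of the project, and the parser only handles the single-line form shown above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SrvRecord:
    name: str      # _service._proto.domain
    ttl: int       # seconds a resolver may cache the answer
    priority: int  # lower values are preferred
    weight: int    # relative share among records of equal priority
    port: int      # port the service listens on
    target: str    # fully qualified host name, with trailing dot


def parse_srv_line(line: str) -> SrvRecord:
    """Parse 'name ttl IN SRV priority weight port target'."""
    fields = line.split()
    if len(fields) != 8 or fields[2].upper() != "IN" or fields[3].upper() != "SRV":
        raise ValueError(f"not an SRV record line: {line!r}")
    name, ttl, _, _, priority, weight, port, target = fields
    return SrvRecord(name, int(ttl), int(priority), int(weight), int(port), target)


record = parse_srv_line(
    "_node_exporter._tcp.prod.example.com. 60 IN SRV 10 50 9100 stream01.prod.example.com."
)
print(record.target, record.port)
```

For monitoring discovery, the two fields that matter are the target and the port: together they give the address to probe.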
Architecture
DNS SRV Record (_node_exporter._tcp.prod.example.com)
│
▼
vmagent (dns_sd_configs)
│
▼
Blackbox Exporter (port 9115)
│
▼
VictoriaMetrics
│
▼
VMAlert
│
▼
Alertmanager → Icinga

The magic happens in vmagent's dns_sd_configs. Instead of asking Consul "what services exist?", we ask DNS "what hosts are in this SRV record?" and then probe them through blackbox exporter.
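The flow from one SRV answer to one probe request can be sketched in a few lines of Python. This is an illustration rather than the project's code: it assumes the usual blackbox exporter convention of passing `module` and `target` as query parameters to `/probe`, and the helper names `srv_to_address` and `blackbox_probe_url` are made up for this example.

```python
from urllib.parse import urlencode


def srv_to_address(target: str, port: int) -> str:
    """Build host:port, dropping the trailing dot of the SRV target."""
    return f"{target.rstrip('.')}:{port}"


def blackbox_probe_url(blackbox: str, target: str, module: str = "http_2xx") -> str:
    """URL a scraper would request from blackbox exporter for one target."""
    query = urlencode({"module": module, "target": target})
    return f"http://{blackbox}/probe?{query}"


# One discovered SRV answer becomes one probe request.
address = srv_to_address("stream01.prod.example.com.", 9100)
url = blackbox_probe_url("monitoring.prod.example.com:9115", address)
print(url)
```

The scraper never contacts the streaming node directly. It asks blackbox exporter to probe the node and stores the resulting probe_success metric.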
Configuration Pattern
The prometheus.yml configuration uses relabeling to extract hostnames, filter by node type, and route probes through blackbox exporter:
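- job_name: 'blackbox-streaming-health'
  metrics_path: /probe
  params:
    module: [http_2xx]
  dns_sd_configs:
    - names: ['_node_exporter._tcp.prod.example.com']
      type: SRV
  relabel_configs:
    # Extract the hostname from the SRV target (drops the trailing dot)
    - source_labels: [__meta_dns_srv_record_target]
      regex: '(.+)\.'
      target_label: hostname
      replacement: '$1'
    # Keep only streaming nodes. DNS SD sets no node_type label, and relabel
    # regexes are fully anchored, so match on the hostname prefix
    # (assumes streaming hosts are named like stream01)
    - source_labels: [hostname]
      regex: 'stream.*'
      action: keep
    # Pass the discovered host:port to blackbox exporter as the probe target
    - source_labels: [__address__]
      target_label: __param_target
    # Route through blackbox exporter
    - target_label: __address__
      replacement: 'monitoring.prod.example.com:9115'

Alerting Rules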
With discovery working, we added alerting rules for individual node failures and for fleet-wide degradation:
groups:
  - name: streaming_service_alerts
    rules:
      - alert: StreamingServiceDown
        expr: probe_success{job=~"blackbox-streaming-.*"} == 0
        for: 5m
        labels:
          severity: critical
      - alert: StreamingDegraded
        expr: |
          count(probe_success{job="blackbox-streaming-health"} == 0)
            / count(probe_success{job="blackbox-streaming-health"}) > 0.5
        for: 5m
        labels:
          severity: warning

Key Lessons Learned
DNS SD Doesn't Auto-Discover by Pattern
DNS SRV records must explicitly list every host. The filtering by node type happens in relabeling after discovery, not during.
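The Python sketch below mimics a relabel keep action to show why the filter runs on what DNS has already returned. It relies on one detail of Prometheus-style relabeling: the regex must match the entire label value. The `keep` helper and the hosts stream02 and db01 are hypothetical, added only for illustration.

```python
import re


def keep(targets, source_label, regex):
    """Mimic a relabel 'keep' action: the regex must match the whole value."""
    pattern = re.compile(regex)
    return [t for t in targets if pattern.fullmatch(t.get(source_label, ""))]


# Every host has to be listed in the SRV record to appear here at all.
discovered = [
    {"__meta_dns_srv_record_target": "stream01.prod.example.com."},
    {"__meta_dns_srv_record_target": "stream02.prod.example.com."},
    {"__meta_dns_srv_record_target": "db01.prod.example.com."},
]

# Filtering happens only afterwards, on the discovered targets.
streaming = keep(discovered, "__meta_dns_srv_record_target", r"stream.*")
unanchored = keep(discovered, "__meta_dns_srv_record_target", r"stream")

print(len(streaming), len(unanchored))
```

The second call keeps nothing, because a bare `stream` does not match a full hostname. A keep rule written that way would silently drop every target.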
Debug from Inside the Container
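When DNS discovery shows 0/0 targets, the cause is often DNS resolution inside the container. Test it from there, for example with docker exec vmagent nslookup -type=SRV _node_exporter._tcp.prod.example.com (assuming the container is named vmagent and its image includes nslookup).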
Have a Fallback Plan
We kept file_sd_configs as a backup. If DNS discovery fails, we can drop in a static target file.
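A static target file for file_sd_configs is a JSON (or YAML) list of target groups, each with a "targets" list and optional "labels". As a rough sketch of how such a fallback file could be generated, the Python below writes one group and swaps it into place with a rename so a reader never sees a half-written file. The `write_file_sd` function, the file name, and the node_type label value are assumptions for illustration, not the project's tooling.

```python
import json
import tempfile
from pathlib import Path


def write_file_sd(path: Path, hosts, port=9100, labels=None):
    """Write a file_sd target file: a list of {"targets": [...], "labels": {...}}."""
    groups = [{"targets": [f"{host}:{port}" for host in hosts], "labels": labels or {}}]
    tmp = path.with_suffix(path.suffix + ".tmp")
    tmp.write_text(json.dumps(groups, indent=2))
    tmp.replace(path)  # rename into place so the file is never partially written
    return groups


# Hypothetical location; a real setup would use the path named in file_sd_configs.
target_file = Path(tempfile.mkdtemp()) / "streaming.json"
groups = write_file_sd(
    target_file,
    ["stream01.prod.example.com"],
    labels={"node_type": "stream"},
)
print(target_file.read_text())
```

Because file-based discovery rereads its files when they change, dropping in a new file like this does not require restarting the scraper.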
Results
- Dynamic discovery of streaming nodes without touching Consul
- Per-node health visibility in our monitoring dashboards
- Automatic pickup when new nodes are added to DNS
- Graceful degradation alerts before full outages
- Integration with our existing alerting workflow
The monitoring gap is closed. Services that were invisible are now fully observable.
When to Use DNS SD
Consider DNS-based discovery when:
- Services aren't registered in your primary service registry
- You have existing DNS SRV records you can leverage
- You need to monitor infrastructure outside your control
- You want discovery without adding agent dependencies
It's not a replacement for Consul or Kubernetes service discovery—it's a complement for the edges of your infrastructure that those systems don't reach.