DNS-SD Service Discovery - Inderdeep Singh

When your monitoring system can't see your services, you've got a problem. This project implemented DNS SRV-based service discovery to monitor infrastructure that existed outside our Consul-registered world.

The Problem

Our monitoring stack was built on a solid foundation: VictoriaMetrics for storage, vmagent for scraping, VMAlert for alerting, and Consul for service discovery. Everything registered in Consul got monitored automatically.

Then came the streaming nodes—critical media services that weren't registered in Consul. No Consul registration meant no automatic discovery. No discovery meant no monitoring. No monitoring meant flying blind on production services.

The Solution: DNS SRV Records

DNS SRV records are often overlooked, but they're a powerful mechanism for service discovery. They were standardized in RFC 2782 (2000) and are used by everything from Active Directory to SIP. Each record maps a service name to a priority, a weight, a port, and a target host:

_node_exporter._tcp.prod.example.com. 60 IN SRV 10 50 9100 stream01.prod.example.com.

The key insight: our infrastructure team was already maintaining these SRV records for other purposes. We could piggyback on them for monitoring discovery.
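To make the record layout concrete, here is a minimal Python sketch that splits the SRV line shown above into its fields and builds the host:port address a scraper would end up with. The `SrvRecord` type and `parse_srv_line` helper are hypothetical names for illustration, not part of vmagent or any DNS library.

```python
from typing import NamedTuple


class SrvRecord(NamedTuple):
    name: str
    ttl: int
    priority: int
    weight: int
    port: int
    target: str


def parse_srv_line(line: str) -> SrvRecord:
    """Parse one zone-file style line: name TTL IN SRV priority weight port target."""
    fields = line.split()
    if len(fields) != 8 or fields[2].upper() != "IN" or fields[3].upper() != "SRV":
        raise ValueError(f"not an SRV record line: {line!r}")
    name, ttl, _, _, priority, weight, port, target = fields
    return SrvRecord(name, int(ttl), int(priority), int(weight), int(port), target)


record = parse_srv_line(
    "_node_exporter._tcp.prod.example.com. 60 IN SRV 10 50 9100 stream01.prod.example.com."
)

# The scrape address is the target (minus the trailing root dot) plus the port.
address = f"{record.target.rstrip('.')}:{record.port}"
print(address)  # stream01.prod.example.com:9100
```

The trailing dot on the target is why the scrape configuration needs a relabeling step to clean up the hostname.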

Architecture

DNS SRV Record (_node_exporter._tcp.prod.example.com)
                    │
                    ▼
            vmagent (dns_sd_configs)
                    │
                    ▼
        Blackbox Exporter (port 9115)
                    │
                    ▼
            VictoriaMetrics
                    │
                    ▼
                VMAlert
                    │
                    ▼
           Alertmanager → Icinga

The magic happens in vmagent's dns_sd_configs. Instead of asking Consul "what services exist?", we ask DNS "what hosts are in this SRV record?" and then probe them through blackbox exporter.

Configuration Pattern

vmagent reads the same scrape configuration format as Prometheus, so the job lives in prometheus.yml. It uses relabeling to extract hostnames, filter by node type, and route probes through blackbox exporter:

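- job_name: 'blackbox-streaming-health'
  metrics_path: /probe
  params:
    module: [http_2xx]
  dns_sd_configs:
    - names: ['_node_exporter._tcp.prod.example.com']
      type: SRV
  relabel_configs:
    # Extract the hostname from the SRV target (strips the trailing dot)
    - source_labels: [__meta_dns_srv_record_target]
      regex: '(.+)\.'
      target_label: hostname
      replacement: '$1'

    # Keep only streaming nodes. DNS SD sets no node_type label, and relabel
    # regexes are fully anchored, so match on the hostname prefix instead
    # (assumes streaming hosts are named stream*).
    - source_labels: [hostname]
      regex: 'stream.*'
      action: keep

    # Pass the discovered host:port to blackbox exporter as the probe target
    # (assumes the http_2xx probe should hit the port from the SRV record)
    - source_labels: [__address__]
      target_label: __param_target

    # Route the scrape through blackbox exporter
    - target_label: __address__
      replacement: 'monitoring.prod.example.com:9115'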
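Relabeling is easier to reason about when you trace one target through it. The following Python sketch simulates relabeling steps of this kind on discovered targets; it is not vmagent's implementation. Two things in it are assumptions: streaming nodes are identified by a hostname starting with "stream", and the original host:port is handed to blackbox exporter through `__param_target`. The `db01` host is made up to show a target being dropped.

```python
import re
from typing import Optional

BLACKBOX_ADDRESS = "monitoring.prod.example.com:9115"


def relabel(labels: dict[str, str]) -> Optional[dict[str, str]]:
    """Apply the relabeling steps to one discovered target; None means dropped."""
    out = dict(labels)

    # Step 1: strip the trailing dot from the SRV target. re.fullmatch mirrors
    # the implicit ^...$ anchoring of relabel regexes.
    match = re.fullmatch(r"(.+)\.", out.get("__meta_dns_srv_record_target", ""))
    if match:
        out["hostname"] = match.group(1)

    # Step 2 (assumption): keep only hosts whose name starts with "stream".
    if not re.fullmatch(r"stream.*", out.get("hostname", "")):
        return None

    # Step 3 (assumption): pass the original host:port to blackbox as ?target=
    out["__param_target"] = out["__address__"]

    # Step 4: send the scrape itself to blackbox exporter.
    out["__address__"] = BLACKBOX_ADDRESS
    return out


discovered = [
    {"__address__": "stream01.prod.example.com:9100",
     "__meta_dns_srv_record_target": "stream01.prod.example.com."},
    {"__address__": "db01.prod.example.com:9100",  # hypothetical non-streaming host
     "__meta_dns_srv_record_target": "db01.prod.example.com."},
]

kept = [target for target in map(relabel, discovered) if target is not None]
for target in kept:
    print(target["hostname"], "->", target["__address__"], "probing", target["__param_target"])
```

Because of the anchoring, a bare regex like `stream` would only match a label whose entire value is "stream", which is why the sketch uses `stream.*`.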

Alerting Rules

With discovery working, we built comprehensive alerting:

groups:
  - name: streaming_service_alerts
    rules:
      # Fires per target once its probe has failed for 5 minutes
      - alert: StreamingServiceDown
        expr: probe_success{job=~"blackbox-streaming-.*"} == 0
        for: 5m
        labels:
          severity: critical

      # Fires when more than half of the health probes have failed for 5 minutes
      - alert: StreamingDegraded
        expr: |
          count(probe_success{job="blackbox-streaming-health"} == 0)
          / count(probe_success{job="blackbox-streaming-health"}) > 0.5
        for: 5m
        labels:
          severity: warning
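The StreamingDegraded threshold is a strict "greater than", which matters for small fleets. Here is a small Python sketch of the same arithmetic, using made-up node names and probe values, to show where the boundary falls. It mirrors the ratio in the rule and is not how VMAlert evaluates expressions.

```python
from typing import Optional


def down_fraction(probe_success: dict[str, int]) -> Optional[float]:
    """Fraction of targets whose probe_success is 0; None when there are no series."""
    if not probe_success:
        return None
    down = sum(1 for value in probe_success.values() if value == 0)
    return down / len(probe_success)


def degraded(probe_success: dict[str, int], threshold: float = 0.5) -> bool:
    """True when strictly more than `threshold` of the targets are down."""
    fraction = down_fraction(probe_success)
    return fraction is not None and fraction > threshold


# Hypothetical fleet of four nodes.
half_down = {"stream01": 1, "stream02": 1, "stream03": 0, "stream04": 0}
most_down = {"stream01": 1, "stream02": 0, "stream03": 0, "stream04": 0}

print(down_fraction(half_down), degraded(half_down))  # 0.5 False
print(down_fraction(most_down), degraded(most_down))  # 0.75 True
```

With four nodes, two failures sit exactly on the threshold and do not trigger the alert; it takes a third failure. If exactly half should count as degraded, the comparison would need to be `>=`.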

Key Lessons Learned

DNS SD Doesn't Auto-Discover by Pattern

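DNS SD only returns the hosts listed under the SRV name you query. There is no wildcard or pattern matching, so every host needs its own SRV record. The filtering by node type happens in relabeling after discovery, not during.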

Debug from Inside the Container

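When DNS discovery shows 0/0 targets, the issue is often DNS resolution from within the container. Test from inside it, for example with docker exec vmagent nslookup -type=SRV _node_exporter._tcp.prod.example.com, so the lookup goes through the same resolver vmagent uses.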

Have a Fallback Plan

We kept file_sd_configs as a backup. If DNS discovery fails, we can drop in a static target file.
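A static target file for file_sd_configs is a JSON (or YAML) list of target groups, each with a "targets" list and optional "labels". As a rough sketch, a fallback file could be generated from a known host list like this; the `build_file_sd` helper and the `source` label are hypothetical, and the port is assumed to match the one in the SRV record.

```python
import json


def build_file_sd(hosts: list[str], port: int = 9100) -> str:
    """Render a file_sd target file (JSON) containing one target group."""
    groups = [
        {
            "targets": [f"{host}:{port}" for host in hosts],
            # Hypothetical label to mark series that came from the fallback file.
            "labels": {"source": "static-fallback"},
        }
    ]
    return json.dumps(groups, indent=2)


fallback = build_file_sd(["stream01.prod.example.com"])
print(fallback)
```

The output would be written to whichever path the job's file_sd_configs entry lists under `files`, and the same relabeling rules can then route those targets through blackbox exporter.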

Results

Dynamic discovery of streaming nodes without touching Consul
Per-node health visibility in our monitoring dashboards
Automatic pickup when new nodes are added to DNS
Graceful degradation alerts before full outages
Integration with our existing alerting workflow

The monitoring gap is closed. Services that were invisible are now fully observable.

When to Use DNS SD

Consider DNS-based discovery when:

Services aren't registered in your primary service registry
You have existing DNS SRV records you can leverage
You need to monitor infrastructure outside your control
You want discovery without adding agent dependencies

It's not a replacement for Consul or Kubernetes service discovery—it's a complement for the edges of your infrastructure that those systems don't reach.