DNS Observability with eBPF: Uncovering Client-Side Issues

Audience: SREs, platform engineers, and on-call production operators.

Purpose: turn talk content into actionable production guidance.

Source: https://www.youtube.com/watch?v=YMqSK6uBhxU

Date: 2023-10-26

Speaker(s): Nikola Grachevsky (OpenTelemetry BPF instrumentation project, Grafana Labs); Andras Szabo (Causely)

Tags: video_summary, monitoring, DNS, eBPF, OpenTelemetry, Observability, Kubernetes, Troubleshooting, Performance, Client-Side Issues

Executive Summary

What this session is about, why it matters in production, and what action operators should take.

This talk highlights an often overlooked blind spot: client-side DNS performance, which frequently causes silent service slowdowns. Traditional application instrumentation misses the DNS lookup phase, which makes this latency hard to diagnose. The OpenTelemetry BPF (OB) instrumentation project uses eBPF to auto-instrument DNS queries with low overhead, capturing lookup durations and correlating them with application traces. Operators can then identify the specific services or pods making inefficient DNS requests, such as those generating excessive NXDOMAIN responses from partial domain names, and fix client configurations to reduce lookup latency and DNS server load.

Recommended action: Deploy OpenTelemetry BPF (OB) instrumentation to gain deep visibility into client-side DNS lookup performance and identify applications causing DNS-related latency or overload.

Quick Definitions

Short definitions of terms used in this brief.

  • eBPF: a Linux kernel technology used to observe system/network behavior with low overhead.
  • NXDOMAIN: DNS response meaning the requested domain name does not exist.
  • DaemonSet: Kubernetes workload type that runs a copy of a pod on each eligible node.
  • FQDN: Fully Qualified Domain Name, the complete domain (for example `api.prod.example.com`).

What Failed in Production

Concrete failure symptoms, triggers, and impact observed in real systems.

  • Symptom: Application services are slow or experience high latency during startup or request processing.
  • Trigger: DNS lookups for external or internal services take an unexpectedly long time, often due to inefficient client-side resolution logic.
  • Impact: Degraded application performance, increased user-facing latency, and difficulty in identifying the root cause due to a lack of visibility into the DNS resolution phase. (Ref: 00:01:36, 00:02:31, 00:03:49, 00:03:57, 00:04:06)
  • Symptom: DNS servers (e.g., CoreDNS) experience high load, increased query rates, or elevated error rates for NXDOMAIN responses.
  • Trigger: Application clients (e.g., Kafka bootstrap) attempt to resolve partial domain names, producing a storm of DNS queries and NXDOMAIN (Non-Existent Domain) responses before the Fully Qualified Domain Name (FQDN) finally resolves.
  • Impact: Overload of DNS infrastructure, increased latency for all DNS queries, and prolonged application startup or connection times, even if the final lookup succeeds. (Ref: 00:13:17, 00:13:45, 00:14:04, 00:14:15, 00:14:17)
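The query storm described above can be illustrated with a minimal sketch of resolv.conf-style search-list expansion. It assumes a typical Kubernetes pod configuration (`ndots:5` and a three-entry search list); the function name `candidate_queries` and the `payments` namespace are hypothetical, and the model is a simplification of stub resolver behavior, not the logic of any particular client.

```python
def candidate_queries(name, search, ndots=5):
    """Return the names a stub resolver would try, in order (simplified model)."""
    # A trailing dot marks the name as absolute: no search-list expansion.
    if name.endswith("."):
        return [name]
    expanded = [f"{name}.{domain}." for domain in search]
    as_is = name + "."
    # Names with at least `ndots` dots are tried as-is first, otherwise last.
    if name.count(".") >= ndots:
        return [as_is] + expanded
    return expanded + [as_is]


# Assumed search list for a pod in a namespace called "payments" (illustrative).
SEARCH = ["payments.svc.cluster.local", "svc.cluster.local", "cluster.local"]
```

Under these assumptions, `candidate_queries("kafka.example.com", SEARCH)` yields four names, and the first three are search-list expansions that would be expected to return NXDOMAIN before the real name is tried. Clients that request both A and AAAA records may send each candidate twice.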

Key Reliability Insights

Reusable lessons for SRE teams running Kubernetes in production.

  • DNS problems frequently originate on the client side rather than the server side, often from inefficient client lookup patterns. (Ref: 00:02:17, 00:12:56)
  • Traditional application instrumentation does not cover the DNS lookup phase, creating a blind spot for performance issues that occur before a connection is established. (Ref: 00:02:44, 00:04:33)
  • OpenTelemetry BPF (OB) instrumentation uses eBPF to automatically capture DNS query metrics and traces with low overhead, requiring no code changes or service restarts. (Ref: 00:01:52, 00:05:31, 00:11:36)
  • OB provides `dns.lookup.duration` metrics and trace spans, enriched with Kubernetes workload attribution, to pinpoint which applications or pods are making specific DNS requests. (Ref: 00:06:21, 00:06:41, 00:07:10)
  • A common client-side issue is the use of partial domain names, leading to multiple failed lookups (NXDOMAIN responses) before a successful resolution, significantly increasing latency and DNS server load. (Ref: 00:13:17, 00:14:04)

Architecture Decision

Why this design was chosen, what problem it solves, and where to apply it.

The OpenTelemetry BPF (OB) project uses eBPF to tap into the Linux kernel's TCP and UDP layers, monitoring ports 53 (DNS) and 5353 (mDNS). It correlates DNS requests and responses using the DNS ID, extracts process information, and enriches this data with Kubernetes API metadata in user space. This provides DNS observability without modifying application code or restarting services.

Ref: 00:08:01, 00:08:10, 00:08:24, 00:08:41, 00:09:03, 00:10:01
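The request/response correlation by DNS ID can be sketched in user-space Python. This is an illustration of the idea only, not OB's implementation: the `Correlator` class and the `flow` key (for example a client address and port tuple) are assumptions added here, since a 16-bit DNS ID alone can collide across clients. The header layout follows RFC 1035.

```python
import struct


def parse_header(packet: bytes):
    """Parse the fixed 12-byte DNS header (RFC 1035, section 4.1.1)."""
    if len(packet) < 12:
        raise ValueError("DNS packet shorter than 12-byte header")
    ident, flags, _qd, _an, _ns, _ar = struct.unpack("!6H", packet[:12])
    return {
        "id": ident,
        "is_response": bool(flags & 0x8000),  # QR bit
        "rcode": flags & 0x000F,              # 0 = NOERROR, 3 = NXDOMAIN
    }


class Correlator:
    """Pair responses with pending queries by (flow, DNS ID)."""

    def __init__(self):
        self.pending = {}

    def observe(self, flow, packet, timestamp):
        hdr = parse_header(packet)
        key = (flow, hdr["id"])
        if not hdr["is_response"]:
            self.pending[key] = timestamp  # remember when the query left
            return None
        started = self.pending.pop(key, None)
        if started is None:
            return None  # response without a recorded query
        return {"duration": timestamp - started, "rcode": hdr["rcode"]}
```

Feeding a query and then its response through `observe` returns the lookup duration and response code for that pair; a production implementation would also need to expire queries that never receive a response.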

How to Reproduce / Detect

How to verify the issue exists and detect it before customer impact grows.

  • Deploy the OpenTelemetry BPF (OB) instrumentation as a DaemonSet across your Kubernetes cluster to automatically instrument all workloads for DNS activity. (Ref: 00:11:41, 00:11:44)

Configuration and Code Changes

Concrete implementation changes you can apply in platform or service configuration.

  • Change: Update application configurations to use Fully Qualified Domain Names (FQDNs) for service lookups instead of partial names. This reduces the number of DNS queries and eliminates unnecessary NXDOMAIN responses. (Ref: 00:14:17, 00:14:20)
  • Implementation details: process change, applied in each client's configuration.
  • Example snippet: change `kafka-broker` to `kafka-broker.your-namespace.svc.cluster.local`.
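One caveat worth checking when applying this change: with the `ndots:5` setting that Kubernetes pods commonly get, a name with fewer than five dots is still tried against the search list first, and `kafka-broker.your-namespace.svc.cluster.local` has only four. A trailing dot marks the name as absolute and skips the search list. The helpers below are a minimal sketch with hypothetical names; whether a given client accepts a trailing dot (for example in TLS hostname checks) needs to be verified per client.

```python
def to_absolute_fqdn(service, namespace, cluster_domain="cluster.local"):
    """Build an absolute (trailing-dot) in-cluster Service name."""
    return f"{service}.{namespace}.svc.{cluster_domain}."


def triggers_search_expansion(name, ndots=5):
    """True if a stub resolver would try search-list suffixes before the name itself."""
    return not name.endswith(".") and name.count(".") < ndots
```

Here `triggers_search_expansion("kafka-broker.your-namespace.svc.cluster.local")` is true under the assumed `ndots:5`, while the output of `to_absolute_fqdn("kafka-broker", "your-namespace")` is not expanded.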

Operational Playbook (Prioritized)

Prioritized runbook actions with explicit verification after each step.

  • P1: Deploy OpenTelemetry BPF (OB) instrumentation as a DaemonSet to all nodes in your Kubernetes cluster to enable automatic DNS observability. (Low effort) (Ref: 00:05:31, 00:11:41, 00:11:44)
  • Verification: Confirm OB pods are running and exporting `dns.lookup.duration` metrics and DNS trace spans to your OpenTelemetry collector/backend.
  • P1: Analyze `dns.lookup.duration` metrics and DNS trace spans, using Kubernetes workload attribution (pod, service, namespace labels), to identify applications or pods exhibiting high DNS latency or a large number of NXDOMAIN responses. (Med effort) (Ref: 00:06:21, 00:07:10, 00:10:59, 00:13:03)
  • Verification: Pinpoint specific client applications responsible for slow DNS lookups or excessive NXDOMAIN queries.
  • P2: For identified problematic clients, modify their configurations to use Fully Qualified Domain Names (FQDNs) for all service lookups instead of partial names. (Med effort) (Ref: 00:14:17, 00:14:20)
  • Verification: Monitor `dns.lookup.duration` metrics and NXDOMAIN counts for the affected services to confirm a reduction in latency and query volume.
  • P3: If sensitive DNS query names (QNames) are being collected, configure the OpenTelemetry collector to mask or remove these fields before exporting telemetry. (Low effort) (Ref: 00:09:40, 00:09:53)
  • Verification: Verify that sensitive QName data is no longer present in the exported telemetry in your observability backend.
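The analysis step in the playbook (finding which workloads generate the most NXDOMAIN responses) can be sketched as a simple aggregation. The record shape used here, with `workload` and `rcode` keys, is hypothetical and does not reflect OB's actual span or metric schema; adapt the field names to whatever your backend exports.

```python
from collections import defaultdict


def nxdomain_ratio_by_workload(records):
    """Rank workloads by the share of their DNS lookups that ended in NXDOMAIN."""
    totals = defaultdict(int)
    failures = defaultdict(int)
    for rec in records:
        totals[rec["workload"]] += 1
        if rec["rcode"] == "NXDOMAIN":
            failures[rec["workload"]] += 1
    ratios = {w: failures[w] / totals[w] for w in totals}
    # Highest ratio first, so the worst offenders lead the list.
    return sorted(ratios.items(), key=lambda item: item[1], reverse=True)
```

Workloads at the top of the returned list are the first candidates for the FQDN change in P2. The same aggregation run again after the change supports the P2 verification step.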

Observability and Alerting

What to measure, what should alert, and how to interpret signals in plain operational terms.

  • Metric to track: dns.lookup.duration
  • Component: DNS client (application)
  • Monitors: The time taken for an application to complete a DNS lookup, collected as a histogram.
  • Why it matters: Directly measures the latency introduced by DNS resolution, which is often a blind spot for traditional application monitoring. High durations indicate potential performance bottlenecks.
  • Failure signal: Elevated p99 or p95 latency, or a significant increase in average duration, especially when correlated with specific service names or pods.
  • Threshold hint: Monitor for durations exceeding typical network latency (e.g., >50ms for internal lookups, >200ms for external).
  • Ref: 00:06:41, 00:06:51
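Because `dns.lookup.duration` is collected as a histogram, percentiles such as p95 or p99 are estimated from bucket counts. The sketch below assumes cumulative buckets given as `(upper_bound_ms, cumulative_count)` pairs and uses linear interpolation within a bucket; the function name and input layout are illustrative, and most observability backends provide an equivalent query function.

```python
def estimate_quantile(q, buckets):
    """Estimate a quantile from cumulative histogram buckets.

    buckets: list of (upper_bound, cumulative_count), sorted by upper_bound.
    """
    total = buckets[-1][1]
    if total == 0:
        return None
    rank = q * total
    lower_bound, lower_count = 0.0, 0
    for upper_bound, count in buckets:
        if count >= rank:
            in_bucket = count - lower_count
            if in_bucket == 0:
                return upper_bound
            # Interpolate linearly between the bucket's bounds.
            fraction = (rank - lower_count) / in_bucket
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return buckets[-1][0]
```

An alert rule would then compare the estimate against the threshold hint, for example flagging an internal workload whose estimated p95 exceeds 50 ms. The accuracy of the estimate depends on how finely the bucket boundaries are spaced around the threshold.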

Operational Readiness (Validation and Rollback)

Pre-flight and post-rollout checks, plus risk controls and rollback actions.

Risk and rollback plan:

  1. Risk: Collection of sensitive DNS query names (QNames) that may contain proprietary or confidential information. (Ref: 00:09:40, 00:09:53)
  • Mitigation: Utilize the OpenTelemetry collector's processing capabilities to mask, redact, or remove the QName field from DNS telemetry before it is stored or exposed.
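  • Rollback action: Disable the DNS instrumentation in OpenTelemetry BPF. If the masking configuration itself causes problems, remove it from the OpenTelemetry collector, keeping in mind that QNames are exported unmasked until masking is restored.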
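The intended effect of QName masking can be sketched as a plain transformation. This is not collector configuration; it only illustrates one possible policy (keep the last two labels, replace the rest with a short hash so lookups remain groupable). The function name and the `keep_labels` parameter are hypothetical.

```python
import hashlib


def mask_qname(qname, keep_labels=2):
    """Replace all but the last `keep_labels` labels with a short hash."""
    labels = qname.rstrip(".").split(".")
    if len(labels) <= keep_labels:
        return qname  # nothing sensitive to hide under this policy
    hidden = ".".join(labels[:-keep_labels])
    digest = hashlib.sha256(hidden.encode("utf-8")).hexdigest()[:8]
    return ".".join([digest] + labels[-keep_labels:])
```

An unsalted hash of a short, guessable label can be reversed by brute force, so dropping the field entirely is the safer choice when the hidden labels are truly confidential.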

References

Only references with valid reachable links are included.

  • OpenTelemetry BPF instrumentation project: Mentioned as 'OB' project, a new project in OpenTelemetry using eBPF for application and network observability. (Ref: 00:01:12, 00:05:11)
  • CoreDNS: Mentioned as a Kubernetes DNS server, with ongoing work to improve reliability, scalability, and performance. (Ref: 00:12:32, 00:12:49)
  • Ground Cover: Mentioned as having another eBPF-based DNS implementation; the speakers described OpenTelemetry BPF as the only open-source one. (Ref: 00:11:52, 00:11:58)