Value of DNS Metrics for Prometheus and Grafana

March 25, 2020 | EfficientIP

The DNS service is critical to any IP network and, in fact, to the entire Internet, so it needs to be monitored closely. Although any DNS engine exposes many metrics, only a few need to appear on a global I&O monitoring dashboard, starting with the number of requests and the recursion rate.

As a complete DDI (DNS-DHCP-IPAM) appliance solution, EfficientIP SOLIDserver provides analytics and performance graphs for the DNS service, at either the server level or the SmartArchitecture level. The central DDI management console is a powerful tool, so its use is reserved for network administrators and DDI experts, which makes it unsuitable for global IT supervision. Fortunately, most DNS metrics can be exposed externally through a standard API endpoint and thus integrated into broader supervision and alerting solutions.

Valuable metrics and analytics are important for high-quality monitoring

There are two main ways to gain visibility into how a system is working: analytics and states. Event and log monitoring is a specific case, usually handled by correlation tools such as a SIEM. In practice, monitoring teams tend to combine both kinds of systems, using analytics for trending and events for incident management.

We propose exploring the standard open-source stack for analytics and metrics: Prometheus and Grafana. Prometheus is a widely used monitoring system and time series database, and a Graduated Project of the CNCF (Cloud Native Computing Foundation). It collects numerical metrics and stores them with a timestamp in a dedicated database that can easily perform statistical calculations over time ranges. Grafana is an analytics and alerting solution mainly used to build real-time dashboards from metrics covering most IT infrastructure systems. It makes the metrics easy to visualize on graphical dashboards that can be displayed on a supervision screen. Grafana can use Prometheus as a data source, as in our scenario below.
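To illustrate what a "statistical calculation over a time range" can look like, here is a minimal sketch that derives an average queries-per-second figure from timestamped samples of a cumulative query counter. It is written in the spirit of a rate calculation over counters, but it is a simplification and not Prometheus's own algorithm; the function name `per_second_rate` and the sample values are assumptions made for this example.

```python
from typing import List, Tuple

# One sample = (unix timestamp in seconds, cumulative counter value).
Sample = Tuple[float, float]


def per_second_rate(samples: List[Sample]) -> float:
    """Average per-second increase of a cumulative counter over the window."""
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    increase = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        # A drop in value means the counter was reset (for example after a
        # server restart), so the new value is counted from zero.
        increase += (cur - prev) if cur >= prev else cur
    elapsed = samples[-1][0] - samples[0][0]
    if elapsed <= 0:
        raise ValueError("timestamps must be strictly increasing overall")
    return increase / elapsed


# Hypothetical scrapes of a query counter, taken every 15 seconds.
samples = [(0.0, 1000.0), (15.0, 1600.0), (30.0, 2500.0), (45.0, 3100.0)]
qps = per_second_rate(samples)
print(f"average QPS over the window: {qps:.2f}")
```

Storing cumulative counters and deriving rates at query time is what lets the same data answer questions over any window (last minute, last day) without changing what the DNS server exports.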

DNS is vital for app availability and performance, so monitoring it is essential

DNS metrics are important to monitor because the behavior of the service can significantly affect user experience and application availability. Keep in mind that DNS is the first link between any client and the application it uses: if DNS is down or degraded, the impact is immediate. It is therefore a good idea to include the availability and response time of the DNS service in a top-level dashboard that any supervision team member can see and understand. Correlating these indicators with metrics on application availability, network bandwidth and overall latency helps answer the first basic question in any incident: is it the network or is it the DNS? As everyone knows: “it’s always the DNS”.
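As a rough illustration of the two indicators mentioned above, availability and response time, the following sketch times a single name resolution from the client side. It is an assumption-laden example, not part of the SOLIDserver integration: `probe_dns` is a hypothetical helper, and by default it goes through the operating system resolver, so local caching can make results look faster than the DNS server really is.

```python
import socket
import time
from typing import Callable, Tuple


def probe_dns(
    hostname: str,
    resolve: Callable[[str], object] = socket.gethostbyname,
) -> Tuple[bool, float]:
    """Resolve `hostname` once and return (success, elapsed seconds).

    `resolve` is injectable so the probe can be exercised without a network;
    the default uses the system resolver.
    """
    start = time.perf_counter()
    try:
        resolve(hostname)
        ok = True
    except OSError:
        # socket.gaierror and socket.herror are both subclasses of OSError.
        ok = False
    elapsed = time.perf_counter() - start
    return ok, elapsed
```

Calling `probe_dns("www.example.com")` at a regular interval and recording both values gives a simple availability and latency series that can sit next to application and network indicators on the same dashboard.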

For deeper investigation by operations teams, more detailed dashboards can show, for example, query types, answer sizes or response delays. A typical advanced dashboard built on DNS analytics from the EfficientIP SOLIDserver solution can look like the following:

Grafana DNS dashboard with EfficientIP

Integrating DNS with metrology solutions via API

Integrating SOLIDserver DNS with such a metrics solution is straightforward. It can rely either on the SNMP agent embedded in the solution or on the API endpoint exposed by the DNS engine, whose output can simply be translated into the Prometheus format and added to an existing deployment. During an incident, or simply to dispel a doubt, a more advanced dashboard can be offered directly within Grafana.

To demonstrate this integration with Prometheus and Grafana, we used a specific configuration on the DNS server and a small Python program that converts the DNS data format into the Prometheus format. You can try the Docker solution available on our GitLab repository as an integration example and starting point. The three components (converter, Prometheus and Grafana) are packaged in a Docker environment for an easy proof of concept, which provides the same dashboard as the one shown in the screenshot above. The metrics collected directly and available for use in specific dashboards are:

  • QPS – queries per second, history of queries, trend of the queries
  • Query misses and recursion rate: depending on the role of the supervised DNS server, this indicates the efficiency of the recursive function; the lower the value, the more efficient your caching
  • Internal engine information like memory and worker threads currently in use, cache size and cache content
  • Types of DNS queries received from clients and other DNS servers
  • Types of DNS responses
  • Return codes – it’s important to monitor these in a detailed view, as they are a good indicator of suspicious activity and misconfiguration on the network
  • Size of the queries and the responses, for which trends and patterns matter; graphical analysis may be enough to spot errors
  • Response time, a very important indicator of the impact on the user experience of any client
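The conversion step described above can be sketched as follows. This is not the converter from the repository: the layout of the `stats` dictionary, its key names and the metric names are all assumptions chosen for illustration. Only the output format, the Prometheus text exposition format with `# HELP` and `# TYPE` lines, is standard.

```python
from typing import Any, Dict, List

# Hypothetical snapshot of DNS engine statistics (key names are illustrative).
stats: Dict[str, Any] = {
    "queries_total": 125000,
    "cache_misses_total": 10000,
    "rcodes": {"NOERROR": 118000, "NXDOMAIN": 6500, "SERVFAIL": 500},
}


def to_prometheus(stats: Dict[str, Any]) -> str:
    """Render the snapshot in the Prometheus text exposition format."""
    lines: List[str] = [
        "# HELP dns_queries_total Total DNS queries received.",
        "# TYPE dns_queries_total counter",
        f"dns_queries_total {stats['queries_total']}",
        "# HELP dns_cache_misses_total Queries not answered from cache.",
        "# TYPE dns_cache_misses_total counter",
        f"dns_cache_misses_total {stats['cache_misses_total']}",
        "# HELP dns_responses_total Responses sent, by return code.",
        "# TYPE dns_responses_total counter",
    ]
    # One sample per return code, distinguished by the `rcode` label.
    for rcode, count in sorted(stats["rcodes"].items()):
        lines.append(f'dns_responses_total{{rcode="{rcode}"}} {count}')
    return "\n".join(lines) + "\n"


def cache_miss_ratio(stats: Dict[str, Any]) -> float:
    """Share of queries that missed the cache (lower means better caching)."""
    total = stats["queries_total"]
    return stats["cache_misses_total"] / total if total else 0.0


print(to_prometheus(stats), end="")
print(f"cache miss ratio: {cache_miss_ratio(stats):.2%}")
```

Exposing return codes as one metric with a label, rather than one metric per code, keeps the dashboard queries simple: a single series can be broken down or summed by `rcode`.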

SOLIDserver advanced metrics enable central monitoring of critical DNS service

Of course, if more information is required, the SOLIDserver dashboards embedded in the central management solution provide far more advanced detail. By default, top clients, top requested domains and client/domain combinations are available in the analytics section of each DNS Smart Architecture and, if required, at the DNS server level. For organizations that are integrating the Guardian DNS security solution, metrics are even more advanced thanks to the enhanced cache technology used for performance and for security analysis of client behavior.

The EfficientIP SOLIDserver DNS service can be integrated easily into a wider supervision ecosystem while still providing very advanced metrics in its internal dashboards. This allows the central I&O supervision team to monitor the critical DNS service continuously and to correlate it with any other application events.

