Replacing MRTG with Grafana, InfluxDB and Telegraf

December 21, 2018 @09:28

MRTG... If you have read my previous post about monitoring my ADS-B receiver it probably won't come as a surprise that the impetus for this whole project has been to deprecate MRTG from my environment. MRTG was a fine enough tool when it was basically all we had (though I had rolled a few iterations of a replacement for personal projects over the years) but these days it is woefully dated. The biggest issues lie in the data gathering engine. Even a moderately sized environment is asking for trouble, dropped polls, and stuck perl processes. MRTG also fails to provide any information beyond the aggregated traffic statistics.

Years ago I wrote a small script that renders some web pages to display the switchports on the network linked to their MRTG graphs. Each port is enumerated by operational status and description to make it easy to find what you are looking for. It turns out it also makes it pretty easy to throw MRTG out and switch to something else.

I had already settled on Grafana and InfluxDB for a large part of the new monitoring infrastructure with most of the data being collected via collectd running on all my physical and virtual hosts. I am monitoring containers with cAdvisor which also feeds into InfluxDB, so I very much wanted to keep data going into InfluxDB yet I needed something to bridge the gap to the SNMP monitoring that the switches and UPSes in my network require. Enter Telegraf.

My only complaint is that the configuration for the SNMP input module in Telegraf is garbage. It took a bunch of trial and error to figure out the most efficient way to get everything in and working. I do very much like the results though...

Grafana

Setting up Telegraf as a SNMP agent

There are a number of blog posts kicking around with fragments of information and copy/paste chunks of configuration files but not much in the way of well written documentation. I guess I'll just pile more of the former on.

I deployed Telegraf as a Docker container, though the configuration is largely the same if you deploy directly on a host. I did install all the SNMP MIBs I needed (in Debian, the snmp-mibs-downloader package covered most of them, I added the APC PowerNet MIB for my UPSes and the Synology MIBs for my work NAS) on my Docker host so I could mount them into the container. I pulled the official container and extracted the default configuration file.

docker run --rm telegraf telegraf config > telegraf.conf

With that in hand I set about killing the vast majority of it, leaving only the [agent] section. Since I am only doing SNMP collection the only change I made there was to back the interval off to 120s instead of 10s.

I then configured Telegraf to send metrics to InfluxDB

# Configuration for sending metrics to InfluxDB
[[outputs.influxdb]]
  urls = [ "http://influxdb:8086" ]
  database = "telegraf"
  skip_database_creation = true
  username = "[REDACTED]"
  password = "[REDACTED]"

This just left the SNMP input configuration, which I'll break up and describe a bit inline.

[[inputs.snmp]]
  agents = [ "sw01.internal.ub3rgeek.net" ]
  community = "[REDACTED]"
  version = 2

This is pretty self-explanatory, the basic information to poll the agent. You can pass a list into agents and it will use all the same configuration for all of the targets. You can have multiple inputs.snmp stanzas.

  [[inputs.snmp.field]]
  name = "hostname"
  oid = "SNMPv2-MIB::sysName.0"
  is_tag = true

This collects the value of the SNMPv2-MIB::sysName.0 OID and makes it available as a tag.

  [[inputs.snmp.table]]
  inherit_tags = [ "hostname" ]
  oid = "IF-MIB::ifXTable"

This is the meat, it walks the IF-MIB::ifXTable and collects all the leaf OIDs as metrics. It inherits the hostname tag from above.

    [[inputs.snmp.table.field]]
      name = "ifName"
      oid = "IF-MIB::ifName"
      is_tag = true

    [[inputs.snmp.table.field]]
      name = "ifDescr"
      oid = "IF-MIB::ifDescr"
      is_tag = true

    [[inputs.snmp.table.field]]
      name = "ifAlias"
      oid = "IF-MIB::ifAlias"
      is_tag = true

These specify additional OIDs to use as tags on the metrics. The difference between this and the hostname tag is that these are scoped to the index in the walk of the IF-MIB::ifXTable, so if you are looking at index 0 in IF-MIB::ifXTable, it will fetch IF-MIB::ifName.0 and use that. I put the configuration and a docker-compose file in Puppet and let the agent crank the wheel and was rewarded with a happy stack of containerized monitoring goodness.

Telegraf, InfluxDB and Grafana in Containers

The compose file is below, but I'll leave the configuration management bits up to you, dear reader.

version: '2'
services:
  telegraf:
    image: registry.hub.docker.com/library/telegraf:latest
    environment:
      - MIBDIRS=/usr/share/snmp/mibs:/usr/share/snmp/mibs/iana:/usr/share/snmp/mibs/ietf:/usr/share/snmp/mibs/syno
    networks:
      - grafana_backend
    volumes:
      - /var/local/docker/data/telegraf/telegraf.conf:/etc/telegraf/telegraf.conf:ro
      - /usr/share/snmp/mibs:/usr/share/snmp/mibs:ro
      - /var/lib/snmp/mibs/iana:/usr/share/snmp/mibs/iana
      - /var/lib/snmp/mibs/ietf:/usr/share/snmp/mibs/ietf

networks:
  grafana_backend:
    external:
      name: grafana_backend

Gluing it to Grafana

The last piece was updating the links to the new graphs. Happily if you setup a variable in a dashboard you can pass it in the URL to the dashboard so I was able to simply change the URL in the template file and regenerate the page.

Graph Homepage

In my case the new URL was

https://[REDACTED]/grafana/d/[REDACTED]/switch-statistics?var-host=[SWITCH NAME]&var-port=[PORT NAME]

Hopefully this makes it a little clearer if you are trying to achieve a complex SNMP configuration in Telegraf.

🍻