
Turning Prometheus data into metrics for alerting

As you may have seen in previous blog posts, we have a Warboard in our office which shows the current status of the servers we manage. Most of our customers are using our new alerting stack, but some have their own monitoring solutions which we want to integrate with. One of these is Prometheus. This blog post covers how we transformed raw Prometheus values into percentages which we could display on our Warboard and create alerts against.

Integrating with Prometheus

In order to get a summary of the data from Prometheus to display on the Warboard we first needed to look at what information the Node Exporter provided and how it was tagged. Node Exporter is the software which makes the raw server stats available for Prometheus to collect. Given that our primary concerns are CPU, memory, disk IO and disk space usage we needed to construct queries to calculate them as percentages to be displayed on the Warboard.

Prometheus makes the data accessible through its API on "/api/v1/query?query=<query>". Most of the syntax is fairly logical, with the general rule being that if we take an average or maximum value we need to specify "by (instance)" in order to keep each server separate. Node Exporter mostly returns raw values from the kernel rather than trying to manipulate them. This is nice as it gives you the freedom to decide how to use the data, but it does mean we have to give our queries a little extra consideration:
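
For example, the CPU query covered below can be run against the API with curl (prometheus.example.com is a placeholder for your own Prometheus server, and 9090 is the default port):

# prometheus.example.com is a placeholder; swap in your own Prometheus host
curl -sG 'http://prometheus.example.com:9090/api/v1/query' \
  --data-urlencode 'query=(1 - avg(irate(node_cpu{mode="idle"}[10m])) by (instance)) * 100'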

CPU

(1 - avg(irate(node_cpu{mode="idle"}[10m])) by (instance)) * 100

CPU usage is being reported as an integer that increases over time so we need to calculate the current percentage of usage ourselves. Fortunately Prometheus has the rate and irate functions for us. Since rate is mostly for use in calculating whether or not alert thresholds have been crossed and we are just trying to display the most recent data, irate seems a better fit. We are currently taking data over the last 10 minutes to ensure we get data for all servers, even if they've not reported very recently. As total CPU usage isn't being reported it is easiest to use the idle CPU usage and calculate the total as 100% - idle%, rather than trying to add up all of the other CPU usage metrics. Since we want separate data for each server we need to group by instance.

Memory

((node_memory_MemTotal - node_memory_MemFree) / node_memory_MemTotal) * 100

The memory query is very simple; the only interesting thing to mention is that MemAvailable wasn't added until Linux 3.14, so we are using MemFree to get consistent values from every server.

Disk IO

(max(avg(irate(node_disk_io_time_ms[10m])) by (instance, device)) by (instance))/10

Throughout setting up alerting I feel disk IO has been the "most interesting" metric to calculate. For both Telegraf, which we discuss setting up here, and Node Exporter I found the kernel docs most useful for confirming that disk "io_time" was the correct metric to calculate a disk IO percentage from. Since we need a percentage we have to rule out anything dealing with bytes or blocks, as we don't want to benchmark or assume the speed of every disk. This leaves us with "io_time" and "weighted_io_time". "weighted_io_time" might give the more accurate representation of how heavily disks are being used; it multiplies the time a process has waited by the total number of processes waiting. However we need to use "io_time" in order to calculate a percentage, otherwise we would have to factor in the number of processes running at a given time. If there are multiple disks on a system we display the disk with the greatest IO; as we are trying to spot issues we only need to consider the busiest device. Finally we need to divide by 1000 to convert milliseconds to seconds and multiply by 100 to get a percentage, which is why the query above divides by 10.

Disk Space

max(((node_filesystem_size{fstype=~"ext4|vfat"} - node_filesystem_free{fstype=~"ext4|vfat"}) / node_filesystem_size{fstype=~"ext4|vfat"}) * 100) by (instance)

As Node Exporter is returning 0 filesystem size for nsfs volumes and there are quite a few temporary and container filesystems that we aren’t trying to monitor, we either need to exclude ones we aren’t interested in or just include those that we are. As with disk IO, many servers have multiple devices / mountpoints so we are just displaying the fullest disk, since again we are trying to spot potential issues.
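
If you would rather exclude the filesystem types you aren't interested in, rather than whitelisting the ones you are, a negated regex match works just as well. The list of types below is only an illustration; adjust it to match whatever clutter your own servers report:

max(((node_filesystem_size{fstype!~"nsfs|tmpfs|squashfs"} - node_filesystem_free{fstype!~"nsfs|tmpfs|squashfs"}) / node_filesystem_size{fstype!~"nsfs|tmpfs|squashfs"}) * 100) by (instance)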

If you are looking to set up your own queries I would recommend having a look through the Prometheus functions here and running some ad hoc queries from the "/graph" section of Prometheus in order to see what data you have available.

If you need any help with Prometheus monitoring then please get in touch and we’ll be happy to help.

Replacement Server Monitoring – Part 3: Kapacitor alerts and going live!

So far in this series of blog posts we’ve discussed picking a replacement monitoring solution and getting it up and running. This instalment will cover setting up the actual alerting rules for our customers’ servers, and going live with the new solution.

Kapacitor Alerts

As mentioned in previous posts, the portion of the TICK stack responsible for the actual alerting is Kapacitor. Put simply, Kapacitor takes metrics stored in the InfluxDB database, processes and transforms them, and then sends alerts based on configured thresholds. It can deal with both batches and streams of data, and the difference is fairly clear from the names: batch data takes multiple data points as an input and looks at them as a whole, while streams accept a single point at a time, folding each new point into the mix and re-evaluating thresholds each time.

As we wanted to monitor servers constantly over large time periods, stream data was the obvious choice for our alerts.

We went through many iterations of our alerting scripts, known as TICK scripts, before mostly settling on what we have now. I'll explain one of our "Critical" CPU alert scripts to show how things work (comments inline):

var critLevel = 80 // The CPU percentage we want to alert on
var critTime = 15 // How long the CPU percentage must be at the critLevel (in this case, 80) percentage before we alert
var critResetTime = 15 // How long the CPU percentage must be back below the critLevel (again, 80) before we reset the alert

stream // Tell Kapacitor that this alert is using stream data
    |from()
        .measurement('cpu') // Tell Kapacitor to look at the CPU data
    |where(lambda: ("host" == '$reported_hostname') AND ("cpu" == 'cpu-total')) // Only look at the data for a particular server (more on this below)
    |groupBy('host')
    |eval(lambda: 100.0 - "usage_idle") // Calculate percentage of CPU used...
        .as('cpu_used') // ... and save this value in its own variable
    |stateDuration(lambda: "cpu_used" >= critLevel) // Keep track of how long CPU percentage has been above the alerting threshold
        .unit(1m) // Minutely resolution is enough for us, so we use minutes for our units
        .as('crit_duration') // Store the number calculated above for later use
    |stateDuration(lambda: "cpu_used" < critLevel) // The same as the above 3 lines, but for resetting the alert status
        .unit(1m)
        .as('crit_reset_duration')
    |alert() // Create an alert... 
        .id('CPU - {{ index .Tags "host" }}') // The alert title 
        .message('{{.Level}} - CPU Usage > ' + string(critLevel) + ' on {{ index .Tags "host" }}') // The information contained in the alert
        .details('''
        {{ .ID }}
        {{ .Message }}
        ''')
        .crit(lambda: "crit_duration" >= critTime) // Generate a critical alert when CPU percentage has been above the threshold for the specified amount of time
        .critReset(lambda: "crit_reset_duration" >= critResetTime) // Reset the alert when CPU percentage has come back below the threshold for the right time
        .stateChangesOnly() // Only send out information when an alert changes from normal to critical, or back again
        .log('/var/log/kapacitor/kapacitor_alerts.log') // Record in a log file that this alert was generated / reset
        .email() // Send the alert via email 
    |influxDBOut() // Write the alert data back into InfluxDB for later reference...
        .measurement('kapacitor_alerts') // The name to store the data under
        .tag('kapacitor_alert_level', 'critical') // Information on the alert
        .tag('metric', 'cpu') // The type of alert that was generated

The above TICK script generates a "Critical" level alert when the CPU usage on a given server has been above 80% for 15 minutes or more. Once it has alerted, the alert will not reset until the CPU usage has come back down below 80% for a further 15 minutes. Both the initial notification and the "close" notification are sent via email.

The vast majority of our TICK scripts are very similar to the above, with changes to monitor different metrics (memory, disk space, disk IO etc) with different threshold levels and times etc.

To load this TICK script into Kapacitor, we use the kapacitor command line interface. Here’s what we’d run:

kapacitor define example_server_cpu -type stream -tick cpu.tick -dbrp example_server.autogen
kapacitor enable example_server_cpu

This creates a Kapacitor alert with the name “example_server_cpu”, with the “stream” alert type, against a database and retention policy we specify.
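
Once a task has been defined and enabled, you can check that Kapacitor has picked it up and is processing data using the same CLI:

kapacitor list tasks
kapacitor show example_server_cpu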

In reality, we automate this process with another script. This also replaces the $reported_hostname slug with the actual hostname of the server we’re setting the alert up for.

Getting customer servers reporting

Now that we could actually alert on information coming into InfluxDB, it was time to get each of our customers' servers reporting in. Since we have a large number of customer systems to monitor, installing and configuring Telegraf by hand was simply not an option. We used Ansible to roll the configuration out to the servers that needed it, which involved 12 different operating systems and 4 different configurations.

Here’s a list of the tasks that Ansible carries out for us:

  • On our servers:
    • Create a specific InfluxDB database for the customer's server
    • Create a locked-down, write-only InfluxDB user for the server to send its data in with (both steps are sketched just after this list)
    • Add Grafana data source to link the database to the customer
  • On the customer's server:
    • Set up the Telegraf repo to ensure it stays updated
    • Install Telegraf
    • Configure Telegraf outputs to point to our endpoint with the correct server specific credentials
    • Configure Telegraf inputs with all the metrics we want to capture
    • Restart Telegraf to load the new configuration
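
As a rough sketch, the InfluxDB side of the first two steps boils down to something like this (the endpoint, admin credentials and UUID-style database/user names are the example values used elsewhere in this series):

influx -ssl -host reporting-endpoint.dogsbodytechnology.com -username admin -password 'superstrongpassword'
> CREATE DATABASE "3340ad1c-31ac-11e8-bfaf-5ba54621292f"
> CREATE USER "3340ad1c-31ac-11e8-bfaf-5ba54621292f" WITH PASSWORD 'supersecurepassword'
> GRANT WRITE ON "3340ad1c-31ac-11e8-bfaf-5ba54621292f" TO "3340ad1c-31ac-11e8-bfaf-5ba54621292f"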

The above should be pretty self-explanatory. Whilst every one of the above steps would be carried out for a new server, we wrote the Ansible files to allow for most of them to be run independently of one another. This means that in future we’d be able to, for example, include another input to report metrics on, with relative ease.
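
For instance, if the tasks are tagged appropriately, a hypothetical run that only re-deploys the Telegraf inputs configuration to the customer servers might look like this (the playbook, inventory, group and tag names are made up for illustration):

# Hypothetical invocation: apply only the 'telegraf_inputs' tasks to the 'customer_servers' group
ansible-playbook -i inventory/production monitoring.yml --limit customer_servers --tags telegraf_inputs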

For those of you not familiar with Ansible, here’s an excerpt from one of the files. It places a Telegraf config file into the relevant directory on the server, and sets the file permissions to the values we want:

---
- name: Copy inputs config onto client
  copy:
    src: ../files/telegraf/telegraf_inputs.conf
    dest: /etc/telegraf/telegraf.d/telegraf_inputs.conf
    owner: root
    group: root
    mode: 0644
  become: yes


With more Ansible we incorporated the various tasks into a single repository structure, did lots of testing, and then ran things against our customers' servers. Shortly after, we had all of our customers' servers reporting in. After making sure everything looked right, we created and enabled various alerts for each server. The process for this was to write a Bash script which looped over a list of our customers' servers and the available alert scripts, and combined them so that we had alerts for the key metrics across all servers. The floodgates had been opened!
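
The script itself is nothing fancy. A minimal sketch, assuming a plain text list of hostnames and a directory of TICK script templates containing the $reported_hostname slug, might look something like this:

#!/bin/bash
# Hypothetical sketch: define and enable one Kapacitor task per server, per alert script
while read -r server; do
    for template in tick_scripts/*.tick; do
        metric=$(basename "$template" .tick)
        task="${server}_${metric}"
        # Swap the $reported_hostname slug for the real hostname
        sed "s/\$reported_hostname/${server}/g" "$template" > "/tmp/${task}.tick"
        kapacitor define "$task" -type stream -tick "/tmp/${task}.tick" -dbrp "${server}.autogen"
        kapacitor enable "$task"
    done
done < servers.txt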

Summary

So, at the end of everything covered in the posts in this series, we had ourselves a very respectable New Relic replacement. We ran the two systems side by side for a few weeks and were very happy with the outcome. While what we have described here is a basic guide to setting the system up, we have already started to make improvements that go way beyond the power we used to have. If any of them are exciting enough, there will be more blog posts coming your way, so make sure you come back soon.

We’re also hoping to open source all of our TICK scripts, Ansible configs, and various other snippets used to tie everything together at some point, once they’ve been tidied up and improved a bit more. If you cannot wait that long and need them now, drop us a line and we’ll do our best to help you out.

I hope you’ve enjoyed this series. It was a great project for the whole company to take part in, and one that enabled us to provide an even better experience for our customers. Thanks for reading!

Feature image background by swadley licensed CC BY 2.0.

Replacement Server Monitoring – Part 2: Building the replacement

This is part two of a three part series of blog posts about picking a replacement monitoring solution, getting it running and ready, and finally moving our customers over to it.

In our last post we discussed our need for a replacement monitoring system and our pick for the software stack we were going to build it on. If you haven’t already, you should go and read that before continuing with this blog post.

This post aims to detail the set up and configuration of the different components to work together, along with some additional customisations we made to get the functionality we wanted.

Component Installation

As mentioned in the previous entry in this series, InfluxData, the TICK stack creators, provide package repositories where pre-built and ready to use packages are available. This eliminates the need to configure and compile source code before we can use it, and lets us install and run the software with a few commands and very predictable results, as opposed to the often many commands needed for compilation, with sometimes wildly varying results. Great stuff.

All components are available from the same repository. Here’s how you install them (the example shown is for an Ubuntu 16.04 “Xenial” system):

curl -sL https://repos.influxdata.com/influxdb.key | sudo apt-key add -
source /etc/lsb-release
echo "deb https://repos.influxdata.com/${DISTRIB_ID,,} ${DISTRIB_CODENAME} stable" | sudo tee /etc/apt/sources.list.d/influxdb.list
sudo apt-get update && sudo apt-get install influxdb
sudo systemctl start influxdb

The above steps are also identical for the other components, Telegraf, Chronograf and Kapacitor. You’ll just need to replace “influxdb” with the correct name in lines 4 and 5.
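
For example, once the repository has been added, installing and starting Telegraf and Kapacitor on the same system looks like this:

sudo apt-get update && sudo apt-get install telegraf kapacitor
sudo systemctl start telegraf
sudo systemctl start kapacitor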

Configuring and linking the components

As each of the components is created by the same people, InfluxData, linking them together is fortunately very easy (another reason we went with the TICK stack). I’ll show you what additional configuration was put in place for the components and how we then linked them together. Note that the components are out of order here, as the configuration of some components is a prerequisite to linking them to another.

InfluxDB

The main change that we make to InfluxDB is to have it listen for connections over HTTPS, meaning any data flowing to/from it will be encrypted. (To do this, you will need to have an SSL certificate and key pair to use. Obtaining that cert/key pair is outside the scope of the blog post). We also require authentication for logins, and disable the query log. We then restart InfluxDB for these changes to take effect.

sudo vim /etc/influxdb/influxdb.conf

[http]
    enabled = true
    bind-address = "0.0.0.0:8086"
    auth-enabled = true
    log-enabled = false
    https-enabled = true
    https-certificate = "/etc/influxdb/ssl/reporting-endpoint.dogsbodytechnology.com.pem"

sudo systemctl restart influxdb

Note that the path used for the “https-certificate” parameter will need to exist on your system of course.
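
If your certificate and key arrive as separate files, one way to produce a combined .pem that InfluxDB can read via the https-certificate setting is simply to concatenate them. This is just a sketch; fullchain.pem and privkey.pem are placeholder names for your own cert and key:

sudo mkdir -p /etc/influxdb/ssl
cat fullchain.pem privkey.pem | sudo tee /etc/influxdb/ssl/reporting-endpoint.dogsbodytechnology.com.pem > /dev/null
sudo chown influxdb:influxdb /etc/influxdb/ssl/reporting-endpoint.dogsbodytechnology.com.pem
sudo chmod 600 /etc/influxdb/ssl/reporting-endpoint.dogsbodytechnology.com.pem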

We then need to create an administrative user like so:

influx -ssl -host ivory.dogsbodyhosting.net
> CREATE USER admin WITH PASSWORD 'superstrongpassword' WITH ALL PRIVILEGES
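
With auth-enabled set to true, any further connections need those credentials passing in, so it is worth a quick sanity check that the user exists:

influx -ssl -host ivory.dogsbodyhosting.net -username admin -password 'superstrongpassword' -execute 'SHOW USERS'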

Telegraf

The customisations for Telegraf involve telling it where to report its metrics to, and what metrics to record. We have an automated process, using Ansible, for rolling these customisations out to customer servers, which we’ll cover in the next part of this series. Make sure you check back for that. These are essentially the changes that are made:

sudo vim /etc/telegraf/telegraf.d/outputs.conf

[[outputs.influxdb]]
  urls = ["https://reporting-endpoint.dogsbodytechnology.com:8086"]
  database = "3340ad1c-31ac-11e8-bfaf-5ba54621292f"
  username = "3340ad1c-31ac-11e8-bfaf-5ba54621292f"
  password = "supersecurepassword"
  retention_policy = ""
  write_consistency = "any"
  timeout = "5s"

The above dictates that Telegraf should connect securely over HTTPS and tells it the username, database and password to use for its connection.

We also need to tell Telegraf what metrics it should record. This is configured like so:

[[inputs.cpu]]
  percpu = true
  totalcpu = true
  collect_cpu_time = false
  report_active = true
[[inputs.disk]]
  ignore_fs = ["tmpfs", "devtmpfs", "devfs"]
[[inputs.diskio]]
[[inputs.net]]
[[inputs.kernel]]
[[inputs.mem]]
[[inputs.processes]]
[[inputs.swap]]
[[inputs.system]]
[[inputs.procstat]]
  pattern = "."

The above tells Telegraf what metrics to report, and customises how they are reported a little. For example, we tell it to ignore some pseudo-filesystems in the disk section, as these aren’t important to us.
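
Before restarting Telegraf it's worth confirming that the config parses and the inputs actually collect something. Telegraf's test mode gathers each metric once and prints it to stdout instead of sending it to InfluxDB:

telegraf --test --config /etc/telegraf/telegraf.conf --config-directory /etc/telegraf/telegraf.d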

Kapacitor

The customisations for Kapacitor primarily tell it which InfluxDB instance it should use, and the channels it should use for sending out alerts:

sudo vim /etc/kapacitor/kapacitor.conf
    [http]
    log-enabled = false
    
    [logging]
    level = "WARN"

    [[influxdb]]
    name = "ivory.dogsbodyhosting.net"
    urls = ["https://reporting-endpoint.dogsbodytechnology.com:8086"]
    username = "admin"
    password = "supersecurepassword"

    [pushover]
    enabled = true
    token = "yourpushovertoken"
    user-key = "yourpushoveruserkey"

    [smtp]
    enabled = true
    host = "localhost"
    port = 25
    username = ""
    password = ""
    from = "alerts@dogsbodytechnology.com"
    to = ["sysadmin@dogsbodytechnology.com"]

As you can probably work out, we use Pushover and email to send/receive our alert messages. This is subject to change over time. During the development phase, I used the Slack output.

Chronograf / Grafana

Although the TICK stack offers its own visualisation (and control) tool, Chronograf, we ended up using the very popular Grafana instead. At the time we were building the replacement solution, Chronograf, although very pretty, was somewhat lacking in features, and the features that did exist were sometimes buggy. Please do note that Chronograf was the only component that was still in beta at that point in time. It has since had a full release and another ~5 months of development. You should definitely try it out for yourself before jumping straight to Grafana. We intend to re-evaluate Chronograf ourselves soon, especially as it is able to control the other components in the TICK stack, something which Grafana does not offer at all.

The Grafana install is pretty straightforward, as it also has a package repository:

sudo vim /etc/apt/sources.list.d/grafana.list
    deb https://packagecloud.io/grafana/stable/debian/ jessie main
curl -s https://packagecloud.io/gpg.key | sudo apt-key add -    # trust the packagecloud signing key so apt accepts the repo
sudo apt update
sudo apt install grafana

We then of course make some customisations. The important part here is setting the base URL, which is required because we’ve got Grafana running behind an nginx reverse proxy. (We love nginx and use it wherever we get the chance. We won’t detail those customisations here though, as they’re not strictly related to the monitoring solution, and Grafana works just fine on its own.)

sudo vim /etc/grafana/grafana.ini
    [server]
    domain = display-endpoint.dogsbodytechnology.com
    root_url = %(protocol)s://%(domain)s/grafana
sudo systemctl restart grafana-server

Summary

The steps above left us with a very powerful and customisable monitoring solution, which worked fantastically for us. Be sure to check back for future instalments in this series. We’ll cover setting up alerts with Kapacitor, creating awesome visualisations with Grafana, and getting all of our hundreds of customers’ servers reporting in and alerting.

Feature image background by tomandellystravels licensed CC BY 2.0.

Replacement Server Monitoring – Part 1: Picking a Replacement

As a company primarily dealing with Linux servers and keeping them online constantly, here at Dogsbody we take a huge interest in the current status of any and all servers we’re responsible for. Having accurate and up to date information allows us to move proactively and remedy potential problems before they become service-impacting for our customers.

For many years, and for as long as I have worked at the company, we’d used an offering from New Relic called simply “Servers”. In 2017, New Relic announced that they would be discontinuing their “Servers” offering, with their “Infrastructure” product taking its place. The pricing for New Relic Infrastructure was exorbitant for our use case, and there were a few things we wanted from our monitoring solution that New Relic didn’t offer, so being the tinkerers that we are, we decided to implement our own.

This is a 3 part series of blog posts about picking a replacement monitoring solution, getting it running and ready, and finally moving our customers over to it.

What we needed from our new solution

The phase one objective for this project was rather simple: to replicate the core functionality offered by New Relic. This meant that the following items were considered crucial:

  • Configurable alert policies – All servers are different. Being able to tweak the thresholds for alerts depending on the server was very important to us. Nobody likes false alarms, especially not in the middle of the night!
  • Historical data – Being able to view system metrics at a given timestamp is of huge help when investigating problems that have occurred in the past
  • Easy to install and lightweight server-side software – As we’d be needing to install the monitoring tool on hundreds of servers, some with very low resources, we needed to ensure that this was a breeze to configure and as slim as possible
  • Webhook support for alerts – Our alerting process is built around having alerts from various different monitoring tools report to a single endpoint where we handle the alerting with custom logic. Flexibility in our alerts was a must-have

Solutions we considered

A quick Google for “linux server monitoring” returns a lot of results. The first round of investigations essentially consisted of checking out the ones we’d heard about and reading up on what they had to offer. Anything of note got recorded for later reference, including any solutions that we knew would not be suitable for whatever reason. It didn’t take very long for a short list of “big players” to present themselves. Now, this is not to say that we discounted any solutions on account of them being small, but we did want a solution that was going to be stable and widely supported from the get-go. We wanted to get on with using the software, instead of spending time getting it to install and run.

The big names were Nagios, Zabbix, Prometheus, and Influx (TICK).

After much reading of the available documentation, performing some test installations (some successful, some very much not), and having a general play with each of them, I decided to look further at the TICK stack from InfluxData. I won’t go too much into the negatives of the failed candidates, but the main points across them were:

  • Complex installation and/or management of central server
  • Poor / convoluted documentation
  • Lack of repositories for agent installation

Influx (TICK)

The monitoring solution offered by Influx consists of 4 parts, each of which can be installed independently of one another:

Telegraf – Agent for collecting and reporting system metrics

InfluxDB – Database to store metrics

Chronograf – Management and graphing interface for the rest of the stack

Kapacitor – Data processing and alerting engine


Package repositories existed for all parts of the stack, most importantly for Telegraf which would be going on customer systems. This allowed for easy installation, updating, and removal of any of the components.

One of the biggest advantages for InfluxDB was the very simple installation: add the repo, install the package, start the software. At this point Influx was ready to accept metrics reported from a server running Telegraf (or anything else for that matter; there were many clients that supported reporting to InfluxDB, which was another positive).

In the same vein, the Telegraf installation was also very easy, using the same steps as above, with the additional step of updating the config to tell the software where to report its metrics to. This is a one-line change in the config, followed by a quick restart of the software.

At this point we had basically all of the system information we could ever need, in an easy to access format, only a few seconds after things happen. Awesome.

Although the most important functionality to replicate was the alerting, the next thing we installed and focused on was the visualisation of the data Telegraf was reporting into InfluxDB. We needed to ensure the data we were receiving mirrored what we were seeing in New Relic, and it can be tricky to create test alerts when you have no visibility of the data you’re alerting against, so we needed some graphs (everyone loves pretty graphs as well, of course!).

As mentioned above, Chronograf is the component of the TICK stack responsible for data visualisation, and also allows you to interface with InfluxDB and Kapacitor, to run queries and create alerts, respectively.

In summary, the TICK stack offered us an open source, modular and easy to use system. It felt pleasant to use, the documentation was reasonable, and the system seemed very stable. We had a great base on which to design and build our new server monitoring system. Exciting!

Be sure to check back for the next part in the series which covers designing and building the real thing.

Feature image background by xmodulo licensed CC BY 2.0.

End of Life for New Relic ‘Servers’ – What are your options?

Today (14 Nov 2017) New Relic are making their ‘Alerts’ and ‘Servers’ services end of life (EOL). This will impact anyone who used these services to monitor server resources such as CPU, Memory, Disk Space and Disk IO. All existing alert policies will cease from today.

If you rely on these alerts to monitor your servers then hopefully you have a contingency plan in place already, but if not, below are your options…

If you do nothing

New Relic Servers will go EOL TODAY (14 Nov 2017) and data will stop being collected. You will no longer be able to monitor your system resources, meaning outages that could otherwise have been prevented could sneak up on you. We do not recommend this option. See below for how to remove the `newrelic-sysmond` daemon.

Upgrade to New Relic Infrastructure

“Infrastructure” is their new paid server monitoring offering. Infrastructure pricing is based on your server’s CPU, so prices vary, and it offers added functionality over the legacy New Relic Servers offering. The maximum price per server per month is $7.20, however the minimum monthly charge is $9.90, so it’s not cost-effective if you’re only looking to monitor your main production system. Most of the new functionality is integration with other products (including their own), so it’s up to you whether this additional functionality is useful and worth the cost for your requirements.

Dogsbody Technology Minder

Over the last year we have been developing our own replacement for New Relic Servers using open source solutions. This product was designed with old New Relic Servers customers in mind, giving all the information needed to run and maintain a Linux server. It also has the monitoring and hooks required to alert the relevant people, allowing us to prevent issues before they happen. This is a paid service but it is included as standard with all our maintenance packages, so any customers using New Relic Servers are being upgraded automatically. If you would like further information please do contact us.

Another alternative monitoring solution

There are plenty of other monitoring providers and solutions out there, from in-house, build-your-own open source solutions to paid services. Monitoring your system resources is essential in helping to prevent major outages of your systems. Pick the best one for you and let the service take the hard work out of monitoring your servers. We have experience with a number of implementations including the TICK stack (Telegraf, InfluxDB, Chronograf, Kapacitor) and Prometheus.

Removing the `newrelic-sysmond` daemon

If you were using New Relic Servers then you are running the `newrelic-sysmond` daemon on your systems. While New Relic have turned the service off, we have confirmed with them that the daemon will keep running, using valuable system resources.

We highly recommend that you uninstall the daemon (tidy server, tidy mind) following New Relic’s uninstallation guide. That way it won’t be taking up any of your system’s resources.
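
On most distributions that boils down to stopping the daemon and removing the package, along these lines (check New Relic’s guide for the exact steps on your platform):

sudo /etc/init.d/newrelic-sysmond stop
sudo apt-get remove newrelic-sysmond    # Debian/Ubuntu
sudo yum remove newrelic-sysmond        # RHEL/CentOS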


Happy Server Monitoring

If you need help, further advice, or would like to discuss our monitoring solutions, please do contact us.