Tag Archive for: monitoring


Common warning signs before server outages

Everyone knows that server outages and downtime cost. They directly affect your business in a number of ways, including:

  • Loss of opportunities
  • Damage to your brand
  • Data loss
  • Lost sales
  • Lost trust

It’s essential to stay on top and ahead of any potential downtime.

Here are three areas where you need to be ahead of the curve:

Know your limits / server resources

Physical resource shortages

A common cause of downtime is running out of server resources.

Whether it is RAM, CPU, disk space or something else, when you run out you risk data corruption, crashing programs and severe slowdowns, to say the least. It is essential to monitor your server resources regularly.

One of the most important, yet most overlooked, metrics is disk space. In our opinion, running out of disk space is one of the most preventable issues facing IT systems.

When you run out of disk space, your system can no longer save files, losing data and leading to data corruption.

Often your website might still look like it is up and running; it’s only when a customer interacts with it, perhaps uploading new data or adding an item to a shopping basket, that you find it fails to work.

We see this happen most frequently when there is a “runaway log file” that keeps expanding until everything stops on the server!

CMSs like Magento are particularly prone to this as they often have unchecked application logs.

Internally, we record all server resource metrics every 10 seconds onto our MINDER stack and alerts are raised well in advance of disk space running out. You don’t need to be this ‘advanced’ – you could simply have a script check the current disk space hourly and email you if it is running out, like the sketch below.
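A minimal sketch of such a script is below. The 90% threshold, the excluded filesystem types and the email address are example assumptions, and it relies on outgoing mail already being configured on the server:

#!/bin/bash
# Alert if any mounted filesystem is over 90% full (the threshold is just an example)
THRESHOLD=90
df -P -x tmpfs -x devtmpfs | tail -n +2 | while read -r fs size used avail pcent mount; do
  usage=${pcent%\%}   # strip the trailing % sign
  if [ "$usage" -ge "$THRESHOLD" ]; then
    echo "Disk usage on $mount is at ${pcent} (device: $fs)" \
      | mail -s "Disk space warning on $(hostname)" you@example.com
  fi
done

Drop something like this into cron and you have a basic early warning system for free.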

Configured resource shortages

Another common resource limit is a misconfigured server.

You could have a huge server with more CPU cores, RAM and storage than you could dream of using, but if your software isn’t configured to use it, it won’t matter.

For example, if you were using PHP-FPM and hadn’t configured it correctly, it might only have five processes available to process PHP. In the case of a traffic spike, the first five requests would be served as normal, but anything beyond the fifth request would be queued until the first five had been served, needlessly slowing the site down for visitors (see the example pool configuration below).
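The limits in question live in the PHP-FPM pool configuration. The snippet below is only an illustrative sketch; the path and numbers vary by distribution, PHP version and server size:

; /etc/php/7.2/fpm/pool.d/www.conf (example path, adjust for your PHP version)
pm = dynamic                ; spawn workers on demand rather than a fixed pool
pm.max_children = 20        ; hard cap on concurrent PHP processes
pm.start_servers = 5        ; processes started when the service starts
pm.min_spare_servers = 5    ; keep at least this many idle workers ready
pm.max_spare_servers = 10   ; kill idle workers above this number

Sizing pm.max_children against the RAM each PHP process typically uses is what keeps you clear of the physical limits described above.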

Issues like this are often flagged up in server logs, letting you know when you hit these configured limits, so it is good to keep your eyes on them. These logs can also indicate that your site is getting busier and help you to grow your infrastructure in good time, along with your visitors.

You might be thinking, “why are there these arbitrary limits getting in my way? I don’t need these at all”.

Well, it is good to have these limits so that in the case of an unusual traffic spike, everything will run slowly but, importantly, it will keep working! If they are set too high, or not set at all, you might hit the “physical limits” issue mentioned above, risking data corruption and crashes.

Did you know that, by default, NGINX runs with just one single-threaded worker?
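Raising that is a one-line change in the main NGINX configuration. A minimal sketch of the relevant lines (the values are illustrative):

# /etc/nginx/nginx.conf
worker_processes auto;          # one worker per CPU core instead of the single default worker

events {
    worker_connections 1024;    # connections each worker can handle concurrently
}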

Providers

As a small business, you normally can’t do everything in house – and why would you want to, when you need to focus on your business?

So it is good to step back every once in a while and document your suppliers.

Even if you only own a simple website, suppliers could include:

  • Domain registrar (OpenSRS, Domainbox, …)
  • DNS providers (Route 53, DNS Made Easy, …)
  • Server hosting (Rapidswitch, Linode, AWS EC2, …)
  • Server maintenance (Dogsbody Technology, …)
  • Website software updates (WordPress, Magento, …)
  • Website plugin updates (Akismet, W3 Total Cache, …)
  • Content Delivery Network (Cloudflare, Akamai, …)
  • Third parties (Sagepay, Worldpay, …)

All of these providers need to keep their software and/or services up to date. Some will cause more impact on you than others.

Planned maintenance

Looking at server hosting, all servers need maintenance every now and again, perhaps to load in a recent security update or to migrate you away from ageing hardware.

The most important point here is to be aware of it.

All reputable providers will send notifications about upcoming maintenance windows and depending on the update they will let you reschedule the maintenance for a more convenient time – reducing the effect on your business.

It is also good to have someone (like us) on call in case it doesn’t go to plan. Maintenance work might start in the dead of night, but if no one realises it’s still broken at 09:00, heads might roll!

Unplanned maintenance

Not all downtime is planned. Even giants like Facebook and Amazon have unplanned outages now and again.

This makes it critical to know where to go if your provider is having issues. Most providers have support systems where you can reach their technical team. Our customers can call us up at a moment’s notice.

Another good first port of call is a provider’s status page, where you can see any current (as well as past or future) maintenance or issues. For example, if you use Linode you can see issues on their status page.

Earlier this year we developed Status Pile, a web app which combines provider status information into one place, making it easier for you to see issues at a glance.

Uptime monitoring

This isn’t really a warning sign, but it’s impossible to foresee everything. The above areas are great places to start, but they can’t cover you for the unexpected.

That’s where uptime monitoring comes in. Regardless of the cause, you need to know when your site goes down and you need to know fast.

We recommend monitoring your website at least every minute with a provider like Pingdom or AppBeat.

Proper configuration

Just setting up uptime monitoring is one thing; it is imperative to configure it properly. You can tell someone to “watch the turkey in the oven” and they will do exactly that: watch it burn!

I’ve seen checks which make sure a site returns a web page, but if that page says “error connecting to database” a passing check doesn’t mean much!

Good website monitoring checks that the page returned has the correct status code and the expected site content. If your website only connects to, say, a Docker application for specific actions, then you should test those actions specifically as well.
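As a rough illustration of the idea, a hand-rolled check might look something like this. The URL, the expected string and the email address are placeholders, and it assumes curl and outgoing mail are available:

#!/bin/bash
# Fail unless the page returns HTTP 200 AND contains the expected content
URL="https://www.example.com/"
EXPECTED="Add to basket"            # a string that only appears when the site is truly working

status=$(curl -s -o /tmp/page.html -w "%{http_code}" "$URL")

if [ "$status" != "200" ] || ! grep -q "$EXPECTED" /tmp/page.html; then
  echo "Check failed for $URL (status: $status)" \
    | mail -s "Website check failed" you@example.com
fi

A dedicated monitoring provider will do all of this and more, but the principle is the same: check the status code and the content, not just that “something” came back.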

Are you checking your entire website stack?

Cartoon dog sitting at a table with fire around him; in the next frame he says “This is fine”.

Who is responsible?

A key part of uptime monitoring – and a number of other items I have mentioned – is that it alerts the right people and that they action those alerts.

If your uptime alerts flag an outage but are sent to an accounts team, it’s unlikely they’ll be able to take action. Equally, if an alert comes in late in the evening when no one is around, your site might be down until 09:00 the next morning.

This is where our maintenance service comes in. We have a support team on call 24/7, ready to jump on any issues.

 

Phew, that was a lot! We handle all of this and more. Contact us and see how we can give you peace of mind.

Feature image by Andrew Klimin licensed CC BY 2.0.

Tripwire – How and Why

Open Source Tripwire is a powerful tool to have access to.  Tripwire is used by the MOD to monitor systems.  The tool is based on code contributed by Tripwire – a company that provides security products and solutions.  If you need to ensure the integrity of your filesystem, Tripwire could be the perfect tool for you.

What is Tripwire

Open Source Tripwire is a popular host based intrusion detection system (IDS).  It is used to monitor filesystems and alert when changes occur.  This allows you to detect intrusions or unexpected changes and respond accordingly.  Tripwire has great flexibility over which files and directories you choose to monitor, which you specify in a policy file.

How does it work

Tripwire keeps a database of file and directory metadata.  Tripwire can then be run regularly to report on any changes.

If you install Tripwire from Ubuntu’s repo as per the instructions below, a daily cron will be set up to send you an email report.  The general view with alerting is that no news is good news.  Due to the nature of Tripwire it’s useful to receive the daily email; that way you’ll notice if Tripwire gets disabled.

Before we start

Before setting up Tripwire please check the following:

  • You’ve configured email on your server.  If not, you’ll need to do that first – we’ve got a guide.
  • You’re manually patching your server.  Make sure you don’t have unattended upgrades running (see the manual updates section), as unless you’re co-ordinating Tripwire with your patching process it will be hard to distinguish between expected and unexpected changes.
  • You’re prepared to put some extra time into maintaining this system for the benefit of knowing when your files change.

Installation on Ubuntu

sudo apt-get update
sudo apt-get install tripwire

You’ll be prompted to create your site and local keys – make sure you record them in your password manager.

In your preferred editor open /etc/tripwire/twpol.txt

The changes you make here depend on what you’re looking to monitor; the default config has good coverage of system files but is unlikely to be monitoring your website files, if that’s something you want to do.  For example, I’ve needed to remove /proc and some of the files in /root that didn’t exist on the systems I’ve been monitoring.
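For example, to also watch website files you could add a rule along these lines. The directory and property mask are purely illustrative, and the $(SIG_HI) and $(SEC_CRIT) variables are defined near the top of the default policy file:

(
  rulename = "Website files",
  severity = $(SIG_HI)
)
{
  /var/www -> $(SEC_CRIT) ;
}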

Then create the signed policy file and then the database:

sudo twadmin --create-polfile /etc/tripwire/twpol.txt
sudo tripwire --init

At this point it’s worth running a check. You’ll want to make sure it has no errors.

sudo tripwire --check

Finally I’d manually run the daily cron to check the email comes through to you.

sudo /etc/cron.daily/tripwire

Day to day usage

Changing files

After you make changes to your system you’ll need to run a report to check what Tripwire sees as changed.

sudo tripwire --check

You can then update the signed database.  This will open up the report allowing you to check you’re happy with the changes before exiting.

sudo tripwire --update -r /var/lib/tripwire/report/$HOSTNAME-yyyyMMdd-HHmmss.twr

You’ll need your local key in order to update the database.

Changing policy

If you decide you’d like to monitor or exclude some more files you can update /etc/tripwire/twpol.txt.  If you’re monitoring this file you’ll need to update the database as per the above section.  After that you can update the signed policy file (you’ll need your site and local keys for this).

sudo tripwire --update-policy /etc/tripwire/twpol.txt

 

As you can see, Tripwire can be an amazingly powerful tool in any security arsenal.  We use it as part of our maintenance plans and encourage others to do the same.

 

Feature image by Nathalie licensed CC BY 2.0.

What are Status Pages?

A status page allows a supplier of a service to let their customers know about outages and issues with their service.  They can show planned maintenance and can hook into email or other update methods, but typically they are first and foremost a website.  Status pages are great; they make things easier for everyone and save time.  If you think you’re having an issue related to a provider, you can quickly look at their status page to see if they’re already aware of it before deciding whether or not to contact them.  If they have already acknowledged the problem, it also means you don’t need to spend time working out what has changed at your end.

We have one

Our status page, status.dogsbody.com, has been running for quite a while. We suggest our customers use it as their first port of call when they spot something odd.  If you’ve got your own server(s) that we maintain, we’ll contact you directly if we start seeing issues.  The status page covers the following…

Support – the methods we usually use to communicate with you

  • Email
  • Telephone
  • Slack

Hosting – our shared hosting servers

  • Indigo (our WordPress only hosting)
  • Purple (our general purpose hosting)
  • Violet (our cPanel hosting)

When we schedule maintenance or have issues we’ll update you via the status page.  If you are an Indigo or Purple customer and want to be notified of issues or maintenance go ahead and subscribe.

You want one

Having an (up to date) status page improves your users’ experience.  It gives them a quick way to find out what’s going on.  This means they’ll have a better understanding of, and (usually) be more tolerant of, issues you’re already dealing with.

Having a status page is likely to cut down on the number of similar questions you get if you have an outage.  We’ve been really happy with the self-hosted open source software we’re using – Cachet.  We wanted to make sure our status page doesn’t go down at the same time as our other services, so we’ve used a different server provider to our main infrastructure.  If you want to avoid worrying about that sort of thing, we’ve seen a lot of people using statuspage.io and status.io.

Feature image background by Wolfgang.W. licensed CC BY 2.0.

Turning Prometheus data into metrics for alerting

As you may have seen in previous blog posts, we have a Warboard in our office which shows the current status of the servers we manage. Most of our customers are using our new alerting stack, but some have their own monitoring solutions which we want to integrate with. One of these was Prometheus. This blog post covers how we transformed raw Prometheus values into percentages which we could display on our warboard and create alerts against.

Integrating with Prometheus

In order to get a summary of the data from Prometheus to display on the Warboard we first needed to look at what information the Node Exporter provided and how it was tagged. Node Exporter is the software which makes the raw server stats available for Prometheus to collect. Given that our primary concerns are CPU, memory, disk IO and disk space usage we needed to construct queries to calculate them as percentages to be displayed on the Warboard.

Prometheus makes the data accessible through its API on "/api/v1/query?query=<query>". Most of the syntax is fairly logical with the general rule being that if we take an average or maximum value we need to specify "by (instance)" in order to keep each server separate. Node Exporter mostly returns raw values from the kernel rather than trying to manipulate them.  This is nice as it gives you freedom to decide how to use the data but does mean we have to give our queries a little extra consideration:

CPU

(1 - avg(irate(node_cpu{mode="idle"}[10m])) by (instance)) * 100

CPU usage is being reported as a counter that increases over time so we need to calculate the current percentage of usage ourselves. Fortunately Prometheus has the rate and irate functions for us. Since rate is mostly for use in calculating whether or not alert thresholds have been crossed and we are just trying to display the most recent data, irate seems a better fit. We are currently taking data over the last 10 minutes to ensure we get data for all servers, even if they’ve not reported very recently. As total CPU usage isn’t being reported, it is easiest to use the idle CPU usage and calculate the total as 100% – idle%, rather than trying to add up all of the other CPU usage metrics. Since we want separate data for each server we need to group by instance.
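Pulling that value out of Prometheus is then just an HTTP GET against the API mentioned above. A quick sketch, assuming Prometheus is listening on localhost:9090:

curl -sG 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=(1 - avg(irate(node_cpu{mode="idle"}[10m])) by (instance)) * 100'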

Memory

((node_memory_MemTotal - node_memory_MemFree) / node_memory_MemTotal) * 100

The memory query is very simple, the only interesting thing to mention would be that MemAvailable wasn’t added until Linux 3.14 so we are using MemFree to get consistent values from every server.

Disk IO

(max(avg(irate(node_disk_io_time_ms[10m])) by (instance, device)) by (instance))/10

Throughout setting up alerting I feel disk IO has been the “most interesting” metric to calculate. For both Telegraf, which we discuss setting up here, and Node Exporter I found looking at the kernel docs most useful for confirming that disk “io_time” was the correct metric to calculate disk IO as a percentage from. Since we need a percentage we have to rule out anything dealing with bytes or blocks as we don’t want to benchmark or assume the speed of every disk. This leaves us with “io_time” and “weighted_io_time”. “weighted_io_time” might give the more accurate representation of how heavily disks are being used; it multiplies the time waited by a process, by the total number of processes waiting. However we need to use “io_time” in order to calculate a percentage or we would have to factor in the number of processes running at a given time. If there are multiple disks on a system, we are displaying the disk with the greatest IO as we are trying to spot issues so we only need to consider the busiest device. Finally we need to divide by 1000 to convert to seconds and multiply by 100 to get a percentage.

Disk Space

max(((node_filesystem_size{fstype=~"ext4|vfat"} - node_filesystem_free{fstype=~"ext4|vfat"}) / node_filesystem_size{fstype=~"ext4|vfat"}) * 100) by (instance)

As Node Exporter is returning 0 filesystem size for nsfs volumes and there are quite a few temporary and container filesystems that we aren’t trying to monitor, we either need to exclude ones we aren’t interested in or just include those that we are. As with disk IO, many servers have multiple devices / mountpoints so we are just displaying the fullest disk, since again we are trying to spot potential issues.

It’s worth noting that newer versions of Node Exporter have slightly updated the metric names.  For example, instead of node_cpu you’ll now want node_cpu_seconds_total; you can see some of our other updates to the above queries in this code.
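As a rough sketch, with the newer metric names the CPU and memory queries above become something like the following (the rest of each expression is unchanged):

(1 - avg(irate(node_cpu_seconds_total{mode="idle"}[10m])) by (instance)) * 100

((node_memory_MemTotal_bytes - node_memory_MemFree_bytes) / node_memory_MemTotal_bytes) * 100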

If you are looking to set up your own queries I would recommend having a look through the Prometheus functions here and running some ad hoc queries from the "/graph" section of Prometheus in order to see what data you have available.

If you need any help with Prometheus monitoring then please get in touch and we’ll be happy to help.

Replacement Server Monitoring – Part 3: Kapacitor alerts and going live!

So far in this series of blog posts we’ve discussed picking a replacement monitoring solution and getting it up and running. This instalment will cover setting up the actual alerting rules for our customers’ servers, and going live with the new solution.

Kapacitor Alerts

As mentioned in previous posts, the portion of the TICK stack responsible for the actual alerting is Kapacitor. Put simply, Kapacitor takes metrics stored in the InfluxDB database, processes and transforms them, and then sends alerts based on configured thresholds. It can deal with both batches and streams of data, and the difference is fairly clear from the names: batch data takes multiple data points as an input and looks at them as a whole, while streams accept a single point at a time, folding each new point into the mix and re-evaluating thresholds each time.

As we wanted to monitor servers constantly over large time periods, stream data was the obvious choice for our alerts.

We went through many iterations of our alerting scripts, known as TICK scripts, before mostly settling on what we have now. I’ll explain one of our “Critical” CPU alert scripts to show how things work (comments inline):

var critLevel = 80 // The CPU percentage we want to alert on
var critTime = 15 // How long the CPU percentage must be at the critLevel (in this case, 80) percentage before we alert
var critResetTime = 15 // How long the CPU percentage must be back below the critLevel (again, 80) before we reset the alert

stream // Tell Kapacitor that this alert is using stream data
    |from()
        .measurement('cpu') // Tell Kapacitor to look at the CPU data
    |where(lambda: ("host" == '$reported_hostname') AND ("cpu" == 'cpu-total')) // Only look at the data for a particular server (more on this below)
    |groupBy('host')
    |eval(lambda: 100.0 - "usage_idle") // Calculate percentage of CPU used...
      .as('cpu_used') // ... and save this value in its own variable
    |stateDuration(lambda: "cpu_used" >= critLevel) // Keep track of how long CPU percentage has been above the alerting threshold
        .unit(1m) // Minutely resolution is enough for us, so we use minutes for our units
        .as('crit_duration') // Store the number calculated above for later use
    |stateDuration(lambda: "cpu_used" < critLevel) // The same as the above 3 lines, but for resetting the alert status
        .unit(1m)
        .as('crit_reset_duration')
    |alert() // Create an alert... 
        .id('CPU - {{ index .Tags "host" }}') // The alert title 
        .message('{{.Level}} - CPU Usage > ' + string(critLevel) + ' on {{ index .Tags "host" }}') // The information contained in the alert
        .details('''
        {{ .ID }}
        {{ .Message }}
        ''')
        .crit(lambda: "crit_duration" >= critTime) // Generate a critical alert when CPU percentage has been above the threshold for the specified amount of time
        .critReset(lambda: "crit_reset_duration" >= critResetTime) // Reset the alert when CPU percentage has come back below the threshold for the right time
        .stateChangesOnly() // Only send out information when an alert changes from normal to critical, or back again
        .log('/var/log/kapacitor/kapacitor_alerts.log') // Record in a log file that this alert was generated / reset
        .email() // Send the alert via email 
    |influxDBOut() // Write the alert data back into InfluxDB for later reference...
        .measurement('kapacitor_alerts') // The name to store the data under
        .tag('kapacitor_alert_level', 'critical') // Information on the alert
        .tag('metric', 'cpu') // The type of alert that was generated

The above TICK script generates a “Critical” level alert when the CPU usage on a given server has been above 80% for 15 minutes or more. Once it has alerted, the alert will not reset until the CPU usage has come back down below 80% for a further 15 minutes. Both the initial notification and the “close” notification are sent via email.

The vast majority of our TICK scripts are very similar to the above, with changes to monitor different metrics (memory, disk space, disk IO etc) with different threshold levels and times etc.

To load this TICK script into Kapacitor, we use the kapacitor command line interface. Here’s what we’d run:

kapacitor define example_server_cpu -type stream -tick cpu.tick -dbrp example_server.autogen
kapacitor enable example_server_cpu

This creates a Kapacitor alert with the name “example_server_cpu”, with the “stream” alert type, against a database and retention policy we specify.

In reality, we automate this process with another script. This also replaces the $reported_hostname slug with the actual hostname of the server we’re setting the alert up for.
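A minimal sketch of that automation is below; the template name, hostname and substitution approach are simplified assumptions rather than our exact script:

#!/bin/bash
# Render the CPU TICK script for one server and load it into Kapacitor
SERVER="example_server"            # the InfluxDB database name for this customer server
HOSTNAME="web01.example.com"       # the value to substitute for $reported_hostname

# Replace the placeholder in the template with the real hostname
sed "s/\$reported_hostname/${HOSTNAME}/g" cpu.tick > "${SERVER}_cpu.tick"

# Define and enable the alert, as in the manual commands above
kapacitor define "${SERVER}_cpu" -type stream -tick "${SERVER}_cpu.tick" -dbrp "${SERVER}.autogen"
kapacitor enable "${SERVER}_cpu"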

Getting customer servers reporting

Now that we could actually alert on information coming into InfluxDB, it was time to get each of our customers’ servers reporting in. Since we have a large number of customer systems to monitor, installing and configuring Telegraf by hand was simply not an option. We used Ansible to roll the configuration out to the servers that needed it, which involved 12 different operating systems and 4 different configurations.

Here’s a list of the tasks that Ansible carries out for us:

  • On our servers:
    • Create a specific InfluxDB database for the customer’s server
    • Create a locked down, write-only InfluxDB user for the server to send its data in with
    • Add a Grafana data source to link the database to the customer
  • On the customer’s server:
    • Set up the Telegraf repo to ensure it is kept updated
    • Install Telegraf
    • Configure Telegraf outputs to point to our endpoint with the correct server specific credentials
    • Configure Telegraf inputs with all the metrics we want to capture
    • Restart Telegraf to load the new configuration

The above should be pretty self-explanatory. Whilst every one of the above steps would be carried out for a new server, we wrote the Ansible files to allow for most of them to be run independently of one another. This means that in future we’d be able to, for example, include another input to report metrics on, with relative ease.

For those of you not familiar with Ansible, here’s an excerpt from one of the files. It places a Telegraf config file into the relevant directory on the server, and sets the file permissions to the values we want:

---
- name: Copy inputs config onto client
  copy:
    src: ../files/telegraf/telegraf_inputs.conf
    dest: /etc/telegraf/telegraf.d/telegraf_inputs.conf
    owner: root
    group: root
    mode: 0644
  become: yes

 

With the use of more Ansible we incorporated the various tasks into a single repository structure, did lots of testing, and then ran things against our customers’ servers. Shortly after, we had all of our customers’ servers reporting in. After making sure everything looked right, we created and enabled various alerts for each server. The process for this was to write a Bash script which looped over a list of our customers’ servers and the available alert scripts, and combined them so that we had alerts for the key metrics across all servers. The floodgates had been opened!

Summary

So, at the end of everything covered in the posts in this series, we had ourselves a very respectable New Relic replacement. We ran the two systems side by side for a few weeks and are very happy with the outcome.  While what we have described here is a basic guide to setting the system up, we have already started to make improvements way beyond the power we used to have.  If any of them are exciting enough, there will be more blog posts coming your way, so make sure you come back soon.

We’re also hoping to open source all of our TICK scripts, ansible configs, and various other snippets used to tie everything together at some point, once they’ve been tidied up and improved a bit more. If you cannot wait that long and need them now, drop us a line and we’ll do our best to help you out.

I hope you’ve enjoyed this series. It was a great project that the whole company took part in, and it enabled us to provide an even better experience for our customers. Thanks for reading!

Replacement Server Monitoring

Feature image background by swadley licensed CC BY 2.0.

Replacement Server Monitoring – Part 2: Building the replacement

This is part two of a three part series of blog posts about picking a replacement monitoring solution, getting it running and ready, and finally moving our customers over to it.

In our last post we discussed our need for a replacement monitoring system and our pick for the software stack we were going to build it on. If you haven’t already, you should go and read that before continuing with this blog post.

This post aims to detail the set up and configuration of the different components to work together, along with some additional customisations we made to get the functionality we wanted.

Component Installation

As mentioned in the previous entry in this series, InfluxData, the TICK stack creators, provide package repositories where pre-built and ready to use packages are available. This eliminates the need to configure and compile source code before we can use it, and allows us to install and run the software with a few commands and very predictable results, as opposed to the often many commands needed for compilation, with sometimes wildly varying results. Great stuff.

All components are available from the same repository. Here’s how you install them (the example shown is for an Ubuntu 16.04 “Xenial” system):

curl -sL https://repos.influxdata.com/influxdb.key | sudo apt-key add -
source /etc/lsb-release
echo "deb https://repos.influxdata.com/${DISTRIB_ID,,} ${DISTRIB_CODENAME} stable" | sudo tee /etc/apt/sources.list.d/influxdb.list
sudo apt-get update && sudo apt-get install influxdb
sudo systemctl start influxdb

The above steps are also identical for the other components, Telegraf, Chronograf and Kapacitor. You’ll just need to replace “influxdb” with the correct name in lines 4 and 5.

Configuring and linking the components

As each of the components is created by the same people, InfluxData, linking them together is fortunately very easy (another reason we went with the TICK stack). I’ll show you what additional configuration was put in place for the components and how we then linked them together. Note that the components are out of order here, as the configuration of some components is a prerequisite to linking them to another.

InfluxDB

The main change that we make to InfluxDB is to have it listen for connections over HTTPS, meaning any data flowing to/from it will be encrypted. (To do this, you will need to have an SSL certificate and key pair to use. Obtaining that cert/key pair is outside the scope of the blog post). We also require authentication for logins, and disable the query log. We then restart InfluxDB for these changes to take effect.

sudo vim /etc/influxdb/influxdb.conf

[http]
    enabled = true
    bind-address = "0.0.0.0:8086"
    auth-enabled = true
    log-enabled = false
    https-enabled = true
    https-certificate = "/etc/influxdb/ssl/reporting-endpoint.dogsbodytechnology.com.pem"

sudo systemctl restart influxdb

Note that the path used for the “https-certificate” parameter will need to exist on your system of course.

We then need to create an administrative user like so:

influx -ssl -host ivory.dogsbodyhosting.net
> CREATE USER admin WITH PASSWORD 'superstrongpassword' WITH ALL PRIVILEGES

Telegraf

The customisations for Telegraf involve telling it where to report its metrics to, and what metrics to record. We have an automated process, using Ansible, for rolling these customisations out to customer servers, which we’ll cover in the next part of this series. Make sure you check back for that. These are essentially the changes that are made:

sudo vim /etc/telegraf/telegraf.d/outputs.conf

[[outputs.influxdb]]
  urls = ["https://reporting-endpoint.dogsbodytechnology.com:8086"]
  database = "3340ad1c-31ac-11e8-bfaf-5ba54621292f"
  username = "3340ad1c-31ac-11e8-bfaf-5ba54621292f"
  password = "supersecurepassword"
  retention_policy = ""
  write_consistency = "any"
  timeout = "5s"

The above dictates that Telegraf should connect securely over HTTPS and tells it the username, database and password to use for its connection.

We also need to tell Telegraf what metrics it should record. This is configured like so:

[[inputs.cpu]]
  percpu = true
  totalcpu = true
  collect_cpu_time = false
  report_active = true
[[inputs.disk]]
  ignore_fs = ["tmpfs", "devtmpfs", "devfs"]
[[inputs.diskio]]
[[inputs.net]]
[[inputs.kernel]]
[[inputs.mem]]
[[inputs.processes]]
[[inputs.swap]]
[[inputs.system]]
[[inputs.procstat]]
  pattern = "."

The above tells Telegraf what metrics to report, and customises how they are reported a little. For example, we tell it to ignore some pseudo-filesystems in the disk section, as these aren’t important to us.

Kapacitor

The customisations for Kapacitor primarily tell it which InfluxDB instance it should use, and the channels it should use for sending out alerts:

sudo vim /etc/kapacitor/kapacitor.conf
    [http]
    log-enabled = false
    
    [logging]
    level = "WARN"

    [[influxdb]]
    name = "ivory.dogsbodyhosting.net"
    urls = ["https://reporting-endpoint.dogsbodytechnology.com:8086"]
    username = "admin"
    password = "supersecurepassword"

    [pushover]
    enabled = true
    token = "yourpushovertoken"
    user-key = "yourpushoveruserkey"

    [smtp]
    enabled = true
    host = "localhost"
    port = 25
    username = ""
    password = ""
    from = "alerts@example.com"
    to = ["sysadmin@example.com"]

As you can probably work out, we use Pushover and email to send/receive our alert messages. This is subject to change over time. During the development phase, I used the Slack output.

Grafana (instead of Chronograf)

Although the TICK stack offers its own visualisation (and control) tool, Chronograf, we ended up using the very popular Grafana instead. At the time when we were building the replacement solution, Chronograf, although very pretty, was somewhat lacking in features, and the features that did exist were sometimes buggy. Please do note that Chronograf was the only component that was still in beta at this period in time. It’s now had a full release and another ~5 months of development. You should definitely try it out for yourself before jumping straight to Grafana. We intend to re-evaluate Chronograf ourselves soon, especially as it is able to control the other components in the TICK stack, something which Grafana does not offer at all.

The Grafana install is pretty straightforward, as it also has a package repository:

sudo vim /etc/apt/sources.list.d/grafana.list
    deb https://packagecloud.io/grafana/stable/debian/ jessie main
sudo apt update
sudo apt install grafana

We then of course make some customisations. The important part here is setting the base URL, which is required because we’ve got Grafana running behind an nginx reverse proxy. (We love nginx and use it wherever we get the chance. We won’t detail the customisations here though, as they’re not strictly related to the monitoring solution, and Grafana works just fine on its own.)

sudo vim /etc/grafana/grafana.ini
    [server]
    domain = display-endpoint.dogsbodytechnology
    root_url = %(protocol)s://%(domain)s:/grafana
sudo systemctl restart grafana-server

Summary

The steps above left us with a very powerful and customisable monitoring solution, which worked fantastically for us. Be sure to check back for future instalments in this series. In part 3 we cover setting up alerts with Kapacitor, creating awesome visualisations with Grafana, and getting all of our hundreds of customers’ servers reporting in and alerting.

Part three is here.

Replacement Server Monitoring

Feature image background by tomandellystravels licensed CC BY 2.0.

Replacement Server Monitoring – Part 1: Picking a Replacement

As a company primarily dealing with Linux servers and keeping them online constantly, here at Dogsbody we take a huge interest in the current status of any and all servers we’re responsible for. Having accurate and up to date information allows us to move proactively and remedy potential problems before they become service-impacting for our customers.

For many years, and for as long as I have worked at the company, we’d used an offering from New Relic, called simply “Servers”. In 2017, New Relic announced that they would be discontinuing their “Servers” offering, with their “Infrastructure” product taking its place. The pricing for New Relic Infrastructure was exorbitant for our use case, and there were a few things we wanted from our monitoring solution that New Relic didn’t offer, so being the tinkerers that we are, we decided to implement our own.

This is a 3 part series of blog posts about picking a replacement monitoring solution, getting it running and ready, and finally moving our customers over to it.

What we needed from our new solution

The phase one objective for this project was rather simple: to replicate the core functionality offered by New Relic. This meant that the following items were considered crucial:

  • Configurable alert policies – All servers are different. Being able to tweak the thresholds for alerts depending on the server was very important to us. Nobody likes false alarms, especially not in the middle of the night!
  • Historical data – Being able to view system metrics at a given timestamp is of huge help when investigating problems that have occurred in the past
  • Easy to install and lightweight server-side software – As we’d be needing to install the monitoring tool on hundreds of servers, some with very low resources, we needed to ensure that this was a breeze to configure and as slim as possible
  • Webhook support for alerts – Our alerting process is built around having alerts from various different monitoring tools report to a single endpoint where we handle the alerting with custom logic. Flexibility in our alerts was a must-have

Solutions we considered

A quick Google for “linux server monitoring” returns a lot of results. The first round of investigations essentially consisted of checking out the ones we’d heard about and reading up on what they had to offer. Anything of note got recorded for later reference, including any solutions that we knew would not be suitable for whatever reason. It didn’t take very long for a short list of “big players” to present themselves. Now, this is not to say that we discounted any solutions on account of them being small, but we did want a solution that was going to be stable and widely supported from the get-go. We wanted to get on with using the software, instead of spending time getting it to install and run.

The big names were Nagios, Zabbix, Prometheus, and Influx (TICK).

After much reading of the available documentation, performing some test installations (some successful, some very much not), and having a general play with each of them, I decided to look further at the TICK stack from InfluxData. I won’t go too much into the negatives of the failed candidates, but the main points across them were:

  • Complex installation and/or management of central server
  • Poor / convoluted documentation
  • Lack of repositories for agent installation

Influx (TICK)

The monitoring solution offered by Influx consists of 4 parts, each of which can be installed independently of the others:

Telegraf – Agent for collecting and reporting system metrics

InfluxDB – Database to store metrics

Chronograf – Management and graphing interface for the rest of the stack

Kapacitor – Data processing and alerting engine

 

Package repositories existed for all parts of the stack, most importantly for Telegraf which would be going on customer systems. This allowed for easy installation, updating, and removal of any of the components.

One of the biggest advantages for InfluxDB was the very simple installation: add the repo, install the package, start the software. At this point Influx was ready to accept metrics reported from a server running Telegraf (or anything else for that matter – there were many clients that support reporting to InfluxDB, which was another positive).

In the same vein, the Telegraf installation was also very easy, using the same steps as above, with the additional step of updating the config to tell the software where to report its metrics to. This is a one-line change in the config, followed by a quick restart of the software.

At this point we had basically all of the system information we could ever need, in an easy to access format, only a few seconds after things happen. Awesome.

Although the most important functionality to replicate was the alerting, the next thing we installed and focused on was the visualisation of the data Telegraf was reporting into InfluxDB. We needed to ensure the data we were receiving mirrored what we were seeing in New Relic, and it can also be tricky to create test alerts when you have no visibility of the data you’re alerting against, so we needed some graphs (everyone loves pretty graphs, of course!).

As mentioned above, Chronograf is the component of the TICK stack responsible for data visualisation, and also allows you to interface with InfluxDB and Kapacitor, to run queries and create alerts, respectively.

In summary, the TICK stack offered us an open source, modular and easy to use system. It felt pleasant to use, the documentation was reasonable, and the system seemed very stable. We had a great base on which to design and build our new server monitoring system. Exciting!

Part two is here.

Replacement Server Monitoring

Feature image background by xmodulo licensed CC BY 2.0.