Prometheus 2.0 Alpha.3 with New Rule Format

Today we release the third alpha version of Prometheus 2.0. Aside from a variety of bug fixes in the new storage layer, it contains a few planned breaking changes.

Flag Changes

First, we moved to a new flag library, which uses the more common double-dash -- prefix for flags instead of the single dash Prometheus used so far. Deployments have to be adapted accordingly. Additionally, some flags were removed with this alpha. The full list since Prometheus 1.0.0 is:

  • web.telemetry-path
  • All storage.remote.* flags
  • All storage.local.* flags
  • query.staleness-delta
  • alertmanager.url

Recording Rules changes

Alerting and recording rules are one of the critical features of Prometheus. But they also come with a few design issues and missing features, namely:

  • All rules ran with the same interval. We could have some heavy rules that are better off being run at a 10-minute interval and some rules that could be run at 15-second intervals.

  • All rules were evaluated concurrently, which is actually Prometheus’ oldest open bug. This has a couple of issues, the obvious one being that the load spikes every eval interval if you have a lot of rules. The other being that rules that depend on each other might be fed outdated data. For example:

instance:network_bytes:rate1m = sum by(instance) (rate(network_bytes_total[1m]))

ALERT HighNetworkTraffic
  IF instance:network_bytes:rate1m > 10e6
  FOR 5m

Here we are alerting over instance:network_bytes:rate1m, but instance:network_bytes:rate1m is itself being generated by another rule. We can get expected results only if the alert HighNetworkTraffic is run after the current value for instance:network_bytes:rate1m gets recorded.

  • Rules and alerts required users to learn yet another DSL.

To solve the issues above, grouping of rules has been proposed long back but has only recently been implemented as a part of Prometheus 2.0. As part of this implementation we have also moved the rules to the well-known YAML format, which also makes it easier to generate alerting rules based on common patterns in users’ environments.

Here’s how the new format looks:

groups:
- name: my-group-name
  interval: 30s   # defaults to global interval
  rules:
  - record: instance:errors:rate5m
    expr: rate(errors_total[5m])
  - record: instance:requests:rate5m
    expr: rate(requests_total[5m])
  - alert: HighErrors
    # Expressions remain PromQL as before and can be spread over
    # multiple lines via YAML’s multi-line strings.
    expr: |
      sum without(instance) (instance:errors:rate5m)
      / 
      sum without(instance) (instance:requests:rate5m)
    for: 5m
    labels:
      severity: critical
    annotations:
      description: "stuff's happening with {{ $.labels.service }}"      

The rules in each group are executed sequentially and you can have an evaluation interval per group.

As this change is breaking, we are going to release it with the 2.0 release and have added a command to promtool for the migration: promtool update rules <filenames> The converted files have the .yml suffix appended and the rule_files clause in your Prometheus configuration has to be adapted.

Help us moving towards the Prometheus 2.0 stable release by testing this new alpha version! You can report bugs on our issue tracker and provide general feedback via our community channels.

Interview with L’Atelier Animation

Continuing our series of interviews with users of Prometheus, Philippe Panaite and Barthelemy Stevens from L’Atelier Animation talk about how they switched their animation studio from a mix of Nagios, Graphite and InfluxDB to Prometheus.

Can you tell us about yourself and what L’Atelier Animation does?

L’Atelier Animation is a 3D animation studio based in the beautiful city of Montreal Canada. Our first feature film "Ballerina" (also known as "Leap") was released worldwide in 2017, US release is expected later this year.

We’re currently hard at work on an animated TV series and on our second feature film.   Our infrastructure consists of around 300 render blades, 150 workstations and twenty various servers. With the exception of a couple of Macs, everything runs on Linux (CentOS) and not a single Windows machine.  

 

What was your pre-Prometheus monitoring experience?

  At first we went with a mix of Nagios, Graphite, and InfluxDB. The initial setup was “ok” but nothing special and over complicated (too many moving parts).  

Why did you decide to look at Prometheus?

  When we switched all of our services to CentOS 7, we looked at new monitoring solutions and Prometheus came up for many reasons, but most importantly:

  • Node Exporter: With its customization capabilities, we can fetch any data from clients
  • SNMP support: Removes the need for a 3rd party SNMP service
  • Alerting system: ByeBye Nagios
  • Grafana support

How did you transition?

When we finished our first film we had a bit of a downtime so it was a perfect opportunity for our IT department to make big changes. We decided to flush our whole monitoring system as it was not as good as we wanted.  

One of the most important part is to monitor networking equipment so we started by configuring snmp_exporter to fetch data from one of our switches. The calls to NetSNMP that the exporter makes are different under CentOS so we had to re-compile some of the binaries, we did encounter small hiccups here and there but with the help of Brian Brazil from Robust Perception, we got everything sorted out quickly. Once we got snmp_exporter working, we were able to easily add new devices and fetch SNMP data. We now have our core network monitored in Grafana (including 13 switches, 10 VLANs).

Switch metrics from SNMP data

After that we configured node_exporter as we required analytics on workstations, render blades and servers. In our field, when a CPU is not at 100% it’s a problem, we want to use all the power we can so in the end temperature is more critical. Plus, we need as much uptime as possible so all our stations have email alerts setup via Prometheus’s Alertmanager so we’re aware when anything is down.

Dashboard for one workstation

Our specific needs require us to monitor custom data from clients, it’s made easy through the use of node_exporter’s textfile collector function. A cronjob outputs specific data from any given tool into a pre-formatted text file in a format readable by Prometheus.  

Since all the data is available through the HTTP protocol, we wrote a Python script to fetch data from Prometheus. We store it in a MySQL database accessed via a web application that creates a live floor map. This allows us to know with a simple mouse over which user is seated where with what type of hardware. We also created another page with user’s picture & departement information, it helps new employees know who’s their neighbour. The website is still an ongoing project so please don’t judge the look, we’re sysadmins after all not web designers :-)

Floormap with workstation detail

What improvements have you seen since switching?

It gave us an opportunity to change the way we monitor everything in the studio and inspired us to create a new custom floor map with all the data which has been initially fetched by Prometheus. The setup is a lot simpler with one service to rule them all.

What do you think the future holds for L’Atelier Animation and Prometheus?

We’re currently in the process of integrating software licenses usage with Prometheus. The information will give artists a good idea of whom is using what and where.

We will continue to customize and add new stuff to Prometheus by user demand and since we work with artists, we know there will be plenty :-) With SNMP and the node_exporter’s custom text file inputs, the possibilities are endless...

Interview with iAdvize

Continuing our series of interviews with users of Prometheus, Laurent COMMARIEU from iAdvize talks about how they replaced their legacy Nagios and Centreon monitoring with Prometheus.

Can you tell us about iAdvize does?

I am Laurent COMMARIEU, a system engineer at iAdvize. I work within the 60 person R&D department in a team of 5 system engineers. Our job is mainly to ensure that applications, services and the underlying system are up and running. We are working with developers to ensure the easiest path for their code to production, and provide the necessary feedback at every step. That’s where monitoring is important.

iAdvize is a full stack conversational commerce platform. We provide an easy way for a brand to centrally interact with their customers, no matter the communication channel (chat, call, video, Facebook Pages, Facebook Messenger, Twitter, Instagram, WhatsApp, SMS, etc...). Our customers work in ecommerce, banks, travel, fashion, etc. in 40 countries. We are an international company of 200 employees with offices in France, UK, Germany, Spain and Italy. We raised $16 Million in 2015.

What was your pre-Prometheus monitoring experience?

I joined iAdvize in February 2016. Previously I worked in companies specialized in network and application monitoring. We were working with opensource software like Nagios, Cacti, Centreon, Zabbix, OpenNMS, etc. and some non-free ones like HP NNM, IBM Netcool suite, BMC Patrol, etc.

iAdvize used to delegate monitoring to an external provider. They ensured 24/7 monitoring using Nagios and Centreon. This toolset was working fine with the legacy static architecture (barebone servers, no VMs, no containers). To complete this monitoring stack, we also use Pingdom.

With the moving our monolithic application towards a Microservices architecture (using Docker) and our will to move our current workload to an infrastructure cloud provider we needed to have more control and flexibility on monitoring. At the same time, iAdvize recruited 3 people, which grew the infrastructure team from 2 to 5. With the old system it took at least a few days or a week to add some new metrics into Centreon and had a real cost (time and money).

Why did you decide to look at Prometheus?

We knew Nagios and the like were not a good choice. Prometheus was the rising star at the time and we decided to PoC it. Sensu was also on the list at the beginning but Prometheus seemed more promising for our use cases.

We needed something able to integrate with Consul, our service discovery system. Our micro services already had a /health route; adding a /metrics endpoint was simple. For about every tool we used, an exporter was available (MySQL, Memcached, Redis, nginx, FPM, etc.).

On paper it looked good.

One of iAdvize's Grafana dashboards

How did you transition?

First of all, we had to convince the developers team (40 people) that Prometheus was the right tool for the job and that they had to add an exporter to their apps. So we did a little demo on RabbitMQ, we installed a RabbitMQ exporter and built a simple Grafana dashboard to display usage metrics to developers. A Python script was written to create some queue and publish/consume messages.

They were quite impressed to see queues and the messages appear in real time. Before that, developers didn't have access to any monitoring data. Centreon was restricted by our infrastructure provider. Today, Grafana is available to everyone at iAdvize, using the Google Auth integration to authenticate. There are 78 active accounts on it (from dev teams to the CEO).

After we started monitoring existing services with Consul and cAdvisor, we monitored the actual presence of the containers. They were monitored using Pingdom checks but it wasn't enough.

We developed a few custom exporters in Go to scrape some business metrics from our databases (MySQL and Redis).

Soon enough, we were able to replace all the legacy monitoring by Prometheus.

One of iAdvize's Grafana dashboards

What improvements have you seen since switching?

Business metrics became very popular and during sales periods everyone is connected to Grafana to see if we're gonna beat some record. We monitor the number of simultaneous conversations, routing errors, agents connected, the number of visitors loading the iAdvize tag, calls on our API gateway, etc.

We worked for a month to optimize our MySQL servers with analysis based on the Newrelic exporter and Percona dashboard for grafana. It was a real success, allowing us to discover inefficiencies and perform optimisations that cut database size by 45% and peak latency by 75%.

There are a lot to say. We know if a AMQP queue has no consumer or if it is Filling abnormally. We know when a container restarts.

The visibility is just awesome.

That was just for the legacy platform.

More and more micro services are going to be deployed in the cloud and Prometheus is used to monitor them. We are using Consul to register the services and Prometheus to discover the metrics routes. Everything works like a charm and we are able to build a Grafana dashboard with a lot of critical business, application and system metrics.

We are building a scalable architecture to deploy our services with Nomad. Nomad registers healthy services in Consul and with some tags relabeling we are able to filter those with a tag name "metrics=true". It offers to us a huge gain in time to deploy the monitoring. We have nothing to do ^^.

We also use the EC2 service discovery. It's really useful with auto-scaling groups. We scale and recycle instances and it's already monitored. No more waiting for our external infrastructure provider to notice what happens in production.

We use alertmanager to send some alerts by SMS or in to our Flowdock.

What do you think the future holds for iAdvize and Prometheus?

  • We are waiting for a simple way to add a long term scalable storage for our capacity planning.
  • We have a dream that one day, our auto-scaling will be triggered by Prometheus alerting. We want to build an autonomous system base on response time and business metrics.
  • I used to work with Netuitive, it had a great anomaly detection feature with automatic correlation. It would be great to have some in Prometheus.

Sneak Peak of Prometheus 2.0

In July 2016 Prometheus reached a big milestone with its 1.0 release. Since then, plenty of new features like new service discovery integrations and our experimental remote APIs have been added. We also realized that new developments in the infrastructure space, in particular Kubernetes, allowed monitored environments to become significantly more dynamic. Unsurprisingly, this also brings new challenges to Prometheus and we identified performance bottlenecks in its storage layer.

Over the past few months we have been designing and implementing a new storage concept that addresses those bottlenecks and shows considerable performance improvements overall. It also paves the way to add features such as hot backups.

The changes are so fundamental that it will trigger a new major release: Prometheus 2.0.
Important features and changes beyond the storage are planned before its stable release. However, today we are releasing an early alpha of Prometheus 2.0 to kick off the stabilization process of the new storage.

Release tarballs and Docker containers are now available. If you are interested in the new mechanics of the storage, make sure to read the deep-dive blog post looking under the hood.

This version does not work with old storage data and should not replace existing production deployments. To run it, the data directory must be empty and all existing storage flags except for -storage.local.retention have to be removed.

For example; before:

./prometheus -storage.local.retention=200h -storage.local.memory-chunks=1000000 -storage.local.max-chunks-to-persist=500000 -storage.local.chunk-encoding=2 -config.file=/etc/prometheus.yaml

after:

./prometheus -storage.local.retention=200h -config.file=/etc/prometheus.yaml

This is a very early version and crashes, data corruption, and bugs in general should be expected. Help us move towards a stable release by submitting them to our issue tracker.

The experimental remote storage APIs are disabled in this alpha release. Scraping targets exposing timestamps, such as federated Prometheus servers, does not yet work. The storage format is breaking and will break again between subsequent alpha releases. We plan to document an upgrade path from 1.0 to 2.0 once we are approaching a stable release.

Interview with Europace

Continuing our series of interviews with users of Prometheus, Tobias Gesellchen from Europace talks about how they discovered Prometheus.

Can you tell us about Europace does?

Europace AG develops and operates the web-based EUROPACE financial marketplace, which is Germany’s largest platform for mortgages, building finance products and personal loans. A fully integrated system links about 400 partners – banks, insurers and financial product distributors. Several thousand users execute some 35,000 transactions worth a total of up to €4 billion on EUROPACE every month. Our engineers regularly blog at http://tech.europace.de/ and @EuropaceTech.

What was your pre-Prometheus monitoring experience?

Nagios/Icinga are still in use for other projects, but with the growing number of services and higher demand for flexibility we looked for other solutions. Due to Nagios and Icinga being more centrally maintained, Prometheus matched our aim to have the full DevOps stack in our team and move specific responsibilities from our infrastructure team to the project members.

Why did you decide to look at Prometheus?

Through our activities in the Docker Berlin community we had been in contact with SoundCloud and Julius Volz, who gave us a good overview. The combination of flexible Docker containers with the highly flexible label-based concept convinced us give Prometheus a try. The Prometheus setup was easy enough, and the Alertmanager worked for our needs, so that we didn’t see any reason to try alternatives. Even our little pull requests to improve the integration in a Docker environment and with messaging tools had been merged very quickly. Over time, we added several exporters and Grafana to the stack. We never looked back or searched for alternatives.

Grafana dashboard for Docker Registry

How did you transition?

Our team introduced Prometheus in a new project, so the transition didn’t happen in our team. Other teams started by adding Prometheus side by side to existing solutions and then migrated the metrics collectors step by step. Custom exporters and other temporary services helped during the migration. Grafana existed already, so we didn’t have to consider another dashboard. Some projects still use both Icinga and Prometheus in parallel.

What improvements have you seen since switching?

We had issues using Icinga due to scalability - several teams maintaining a centrally managed solution didn’t work well. Using the Prometheus stack along with the Alertmanager decoupled our teams and projects. The Alertmanager is now able to be deployed in a high availability mode, which is a great improvement to the heart of our monitoring infrastructure.

What do you think the future holds for Europace and Prometheus?

Other teams in our company have gradually adopted Prometheus in their projects. We expect that more projects will introduce Prometheus along with the Alertmanager and slowly replace Icinga. With the inherent flexibility of Prometheus we expect that it will scale with our needs and that we won’t have issues adapting it to future requirements.

Interview with Weaveworks

Continuing our series of interviews with users of Prometheus, Tom Wilkie from Weaveworks talks about how they choose Prometheus and are now building on it.

Can you tell us about Weaveworks?

Weaveworks offers Weave Cloud, a service which "operationalizes" microservices through a combination of open source projects and software as a service.

Weave Cloud consists of:

You can try Weave Cloud free for 60 days. For the latest on our products check out our blog, Twitter, or Slack (invite).

What was your pre-Prometheus monitoring experience?

Weave Cloud was a clean-slate implementation, and as such there was no previous monitoring system. In previous lives the team had used the typical tools such as Munin and Nagios. Weave Cloud started life as a multitenant, hosted version of Scope. Scope includes basic monitoring for things like CPU and memory usage, so I guess you could say we used that. But we needed something to monitor Scope itself...

Why did you decide to look at Prometheus?

We've got a bunch of ex-Google SRE on staff, so there was plenty of experience with Borgmon, and an ex-SoundClouder with experience of Prometheus. We built the service on Kubernetes and were looking for something that would "fit" with its dynamically scheduled nature - so Prometheus was a no-brainer. We've even written a series of blog posts of which why Prometheus and Kubernetes work together so well is the first.

How did you transition?

When we started with Prometheus the Kubernetes service discovery was still just a PR and as such there were few docs. We ran a custom build for a while and kinda just muddled along, working it out for ourselves. Eventually we gave a talk at the London Prometheus meetup on our experience and published a series of blog posts.

We've tried pretty much every different option for running Prometheus. We started off building our own container images with embedded config, running them all together in a single Pod alongside Grafana and Alert Manager. We used ephemeral, in-Pod storage for time series data. We then broke this up into different Pods so we didn't have to restart Prometheus (and lose history) whenever we changed our dashboards. More recently we've moved to using upstream images and storing the config in a Kubernetes config map - which gets updated by our CI system whenever we change it. We use a small sidecar container in the Prometheus Pod to watch the config file and ping Prometheus when it changes. This means we don't have to restart Prometheus very often, can get away without doing anything fancy for storage, and don't lose history.

Still the problem of periodically losing Prometheus history haunted us, and the available solutions such as Kubernetes volumes or periodic S3 backups all had their downsides. Along with our fantastic experience using Prometheus to monitor the Scope service, this motivated us to build a cloud-native, distributed version of Prometheus - one which could be upgraded, shuffled around and survive host failures without losing history. And that’s how Weave Cortex was born.

What improvements have you seen since switching?

Ignoring Cortex for a second, we were particularly excited to see the introduction of the HA Alert Manager; although mainly because it was one of the first non-Weaveworks projects to use Weave Mesh, our gossip and coordination layer.

I was also particularly keen on the version two Kubernetes service discovery changes by Fabian - this solved an acute problem we were having with monitoring our Consul Pods, where we needed to scrape multiple ports on the same Pod.

And I'd be remiss if I didn't mention the remote write feature (something I worked on myself). With this, Prometheus forms a key component of Weave Cortex itself, scraping targets and sending samples to us.

What do you think the future holds for Weaveworks and Prometheus?

For me the immediate future is Weave Cortex, Weaveworks' Prometheus as a Service. We use it extensively internally, and are starting to achieve pretty good query performance out of it. It's running in production with real users right now, and shortly we'll be introducing support for alerting and achieve feature parity with upstream Prometheus. From there we'll enter a beta programme of stabilization before general availability in the middle of the year.

As part of Cortex, we've developed an intelligent Prometheus expression browser, with autocompletion for PromQL and Jupyter-esque notebooks. We're looking forward to getting this in front of more people and eventually open sourcing it.

I've also got a little side project called Loki, which brings Prometheus service discovery and scraping to OpenTracing, and makes distributed tracing easy and robust. I'll be giving a talk about this at KubeCon/CNCFCon Berlin at the end of March.

Interview with Canonical

Continuing our series of interviews with users of Prometheus, Canonical talks about how they are transitioning to Prometheus.

Can you tell us about yourself and what Canonical does?

Canonical is probably best known as the company that sponsors Ubuntu Linux. We also produce or contribute to a number of other open-source projects including MAAS, Juju, and OpenStack, and provide commercial support for these products. Ubuntu powers the majority of OpenStack deployments, with 55% of production clouds and 58% of large cloud deployments.

My group, BootStack, is our fully managed private cloud service. We build and operate OpenStack clouds for Canonical customers.

What was your pre-Prometheus monitoring experience?

We’d used a combination of Nagios, Graphite/statsd, and in-house Django apps. These did not offer us the level of flexibility and reporting that we need in both our internal and customer cloud environments.

Why did you decide to look at Prometheus?

We’d evaluated a few alternatives, including InfluxDB and extending our use of Graphite, but our first experiences with Prometheus proved it to have the combination of simplicity and power that we were looking for. We especially appreciate the convenience of labels, the simple HTTP protocol, and the out of box timeseries alerting. The potential with Prometheus to replace 2 different tools (alerting and trending) with one is particularly appealing.

Also, several of our staff have prior experience with Borgmon from their time at Google which greatly added to our interest!

How did you transition?

We are still in the process of transitioning, we expect this will take some time due to the number of custom checks we currently use in our existing systems that will need to be re-implemented in Prometheus. The most useful resource has been the prometheus.io site documentation.

It took us a while to choose an exporter. We originally went with collectd but ran into limitations with this. We’re working on writing an openstack-exporter now and were a bit surprised to find there is no good, working, example how to write exporter from scratch.

Some challenges we’ve run into are: No downsampling support, no long term storage solution (yet), and we were surprised by the default 2 week retention period. There's currently no tie-in with Juju, but we’re working on it!

What improvements have you seen since switching?

Once we got the hang of exporters, we found they were very easy to write and have given us very useful metrics. For example we are developing an openstack-exporter for our cloud environments. We’ve also seen very quick cross-team adoption from our DevOps and WebOps groups and developers. We don’t yet have alerting in place but expect to see a lot more to come once we get to this phase of the transition.

What do you think the future holds for Canonical and Prometheus?

We expect Prometheus to be a significant part of our monitoring and reporting infrastructure, providing the metrics gathering and storage for numerous current and future systems. We see it potentially replacing Nagios as for alerting.

Interview with JustWatch

Continuing our series of interviews with users of Prometheus, JustWatch talks about how they established their monitoring.

Can you tell us about yourself and what JustWatch does?

For consumers, JustWatch is a streaming search engine that helps to find out where to watch movies and TV shows legally online and in theaters. You can search movie content across all major streaming providers like Netflix, HBO, Amazon Video, iTunes, Google Play, and many others in 17 countries.

For our clients like movie studios or Video on Demand providers, we are an international movie marketing company that collects anonymized data about purchase behavior and movie taste of fans worldwide from our consumer apps. We help studios to advertise their content to the right audience and make digital video advertising a lot more efficient in minimizing waste coverage.

JustWatch logo

Since our launch in 2014 we went from zero to one of the largest 20k websites internationally without spending a single dollar on marketing - becoming the largest streaming search engine worldwide in under two years. Currently, with an engineering team of just 10, we build and operate a fully dockerized stack of about 50 micro- and macro-services, running mostly on Kubernetes.

What was your pre-Prometheus monitoring experience?

At prior companies many of us worked with most of the open-source monitoring products there are. We have quite some experience working with Nagios, Icinga, Zabbix, Monit, Munin, Graphite and a few other systems. At one company I helped build a distributed Nagios setup with Puppet. This setup was nice, since new services automatically showed up in the system, but taking instances out was still painful. As soon as you have some variance in your systems, the host and service based monitoring suites just don’t fit quite well. The label-based approach Prometheus took was something I always wanted to have, but didn’t find before.

Why did you decide to look at Prometheus?

At JustWatch the public Prometheus announcement hit exactly the right time. We mostly had blackbox monitoring for the first few months of the company - CloudWatch for some of the most important internal metrics, combined with a external services like Pingdom for detecting site-wide outages. Also, none of the classical host-based solutions satisfied us. In a world of containers and microservices, host-based tools like Icinga, Thruk or Zabbix felt antiquated and not ready for the job. When we started to investigate whitebox monitoring, some of us luckily attended the Golang Meetup where Julius and Björn announced Prometheus. We quickly set up a Prometheus server and started to instrument our Go services (we use almost only Go for the backend). It was amazing how easy that was - the design felt like being cloud- and service-oriented as a first principle and never got in the way.

How did you transition?

Transitioning wasn't that hard, as timing wise, we were lucky enough to go from no relevant monitoring directly to Prometheus.

The transition to Prometheus was mostly including the Go client into our apps and wrapping the HTTP handlers. We also wrote and deployed several exporters, including the node_exporter and several exporters for cloud provider APIs. In our experience monitoring and alerting is a project that is never finished, but the bulk of the work was done within a few weeks as a side project.

Since the deployment of Prometheus we tend to look into metrics whenever we miss something or when we are designing new services from scratch.

It took some time to fully grasp the elegance of PromQL and labels concept fully, but the effort really paid off.

What improvements have you seen since switching?

Prometheus enlightened us by making it incredibly easy to reap the benefits from whitebox monitoring and label-based canary deployments. The out-of-the-box metrics for many Golang aspects (HTTP Handler, Go Runtime) helped us to get to a return on investment very quickly - goroutine metrics alone saved the day multiple times. The only monitoring component we actually liked before - Grafana - feels like a natural fit for Prometheus and has allowed us to create some very helpful dashboards. We appreciated that Prometheus didn't try to reinvent the wheel but rather fit in perfectly with the best solution out there. Another huge improvement on predecessors was Prometheus's focus on actually getting the math right (percentiles, etc.). In other systems, we were never quite sure if the operations offered made sense. Especially percentiles are such a natural and necessary way of reasoning about microservice performance that it felt great that they get first class treatment.

Database Dashboard

The integrated service discovery makes it super easy to manage the scrape targets. For Kubernetes, everything just works out-of-the-box. For some other systems not running on Kubernetes yet, we use a Consul-based approach. All it takes to get an application monitored by Prometheus is to add the client, expose /metrics and set one simple annotation on the Container/Pod. This low coupling takes out a lot of friction between development and operations - a lot of services are built well orchestrated from the beginning, because it's simple and fun.

The combination of time-series and clever functions make for awesome alerting super-powers. Aggregations that run on the server and treating both time-series, combinations of them and even functions on those combinations as first-class citizens makes alerting a breeze - often times after the fact.

What do you think the future holds for JustWatch and Prometheus?

While we value very much that Prometheus doesn't focus on being shiny but on actually working and delivering value while being reasonably easy to deploy and operate - especially the Alertmanager leaves a lot to be desired yet. Just some simple improvements like simplified interactive alert building and editing in the frontend would go a long way in working with alerts being even simpler.

We are really looking forward to the ongoing improvements in the storage layer, including remote storage. We also hope for some of the approaches taken in Project Prism and Vulcan to be backported to core Prometheus. The most interesting topics for us right now are GCE Service Discovery, easier scaling, and much longer retention periods (even at the cost of colder storage and much longer query times for older events).

We are also looking forward to use Prometheus for more non-technical departments as well. We’d like to cover most of our KPIs with Prometheus to allow everyone to create beautiful dashboards, as well as alerts. We're currently even planning to abuse the awesome alert engine for a new, internal business project as well - stay tuned!

Interview with Compose

Continuing our series of interviews with users of Prometheus, Compose talks about their monitoring journey from Graphite and InfluxDB to Prometheus.

Can you tell us about yourself and what Compose does?

Compose delivers production-ready database clusters as a service to developers around the world. An app developer can come to us and in a few clicks have a multi-host, highly available, automatically backed up and secure database ready in minutes. Those database deployments then autoscale up as demand increases so a developer can spend their time on building their great apps, not on running their database.

We have tens of clusters of hosts across at least two regions in each of AWS, Google Cloud Platform and SoftLayer. Each cluster spans availability zones where supported and is home to around 1000 highly-available database deployments in their own private networks. More regions and providers are in the works.

What was your pre-Prometheus monitoring experience?

Before Prometheus, a number of different metrics systems were tried. The first system we tried was Graphite, which worked pretty well initially, but the sheer volume of different metrics we had to store, combined with the way Whisper files are stored and accessed on disk, quickly overloaded our systems. While we were aware that Graphite could be scaled horizontally relatively easily, it would have been an expensive cluster. InfluxDB looked more promising so we started trying out the early-ish versions of that and it seemed to work well for a good while. Goodbye Graphite.

The earlier versions of InfluxDB had some issues with data corruption occasionally. We semi-regularly had to purge all of our metrics. It wasn’t a devastating loss for us normally, but it was irritating. The continued promises of features that never materialised frankly wore on us.

Why did you decide to look at Prometheus?

It seemed to combine better efficiency with simpler operations than other options.

Pull-based metric gathering puzzled us at first, but we soon realised the benefits. Initially it seemed like it could be far too heavyweight to scale well in our environment where we often have several hundred containers with their own metrics on each host, but by combining it with Telegraf, we can arrange to have each host export metrics for all its containers (as well as its overall resource metrics) via a single Prometheus scrape target.

How did you transition?

We are a Chef shop so we spun up a largish instance with a big EBS volume and then reached right for a community chef cookbook for Prometheus.

With Prometheus up on a host, we wrote a small Ruby script that uses the Chef API to query for all our hosts, and write out a Prometheus target config file. We use this file with a file_sd_config to ensure all hosts are discovered and scraped as soon as they register with Chef. Thanks to Prometheus’ open ecosystem, we were able to use Telegraf out of the box with a simple config to export host-level metrics directly.

We were testing how far a single Prometheus would scale and waiting for it to fall over. It didn’t! In fact it handled the load of host-level metrics scraped every 15 seconds for around 450 hosts across our newer infrastructure with very little resource usage.

We have a lot of containers on each host so we were expecting to have to start to shard Prometheus once we added all memory usage metrics from those too, but Prometheus just kept on going without any drama and still without getting too close to saturating its resources. We currently monitor over 400,000 distinct metrics every 15 seconds for around 40,000 containers on 450 hosts with a single m4.xlarge prometheus instance with 1TB of storage. You can see our host dashboard for this host below. Disk IO on the 1TB gp2 SSD EBS volume will probably be the limiting factor eventually. Our initial guess is well over-provisioned for now, but we are growing fast in both metrics gathered and hosts/containers to monitor.

Prometheus Host Dashboard

At this point the Prometheus server we’d thrown up to test with was vastly more reliable than the InfluxDB cluster we had doing the same job before, so we did some basic work to make it less of a single-point-of-failure. We added another identical node scraping all the same targets, then added a simple failover scheme with keepalived + DNS updates. This was now more highly available than our previous system so we switched our customer-facing graphs to use Prometheus and tore down the old system.

Prometheus-powered memory metrics for PostgresSQL containers in our app

What improvements have you seen since switching?

Our previous monitoring setup was unreliable and difficult to manage. With Prometheus we have a system that’s working well for graphing lots of metrics, and we have team members suddenly excited about new ways to use it rather than wary of touching the metrics system we used before.

The cluster is simpler too, with just two identical nodes. As we grow, we know we’ll have to shard the work across more Prometheus hosts and have considered a few ways to do this.

What do you think the future holds for Compose and Prometheus?

Right now we have only replicated the metrics we already gathered in previous systems - basic memory usage for customer containers as well as host-level resource usage for our own operations. The next logical step is enabling the database teams to push metrics to the local Telegraf instance from inside the DB containers so we can record database-level stats too without increasing number of targets to scrape.

We also have several other systems that we want to get into Prometheus to get better visibility. We run our apps on Mesos and have integrated basic Docker container metrics already, which is better than previously, but we also want to have more of the infrastructure components in the Mesos cluster recording to the central Prometheus so we can have centralised dashboards showing all elements of supporting system health from load balancers right down to app metrics.

Eventually we will need to shard Prometheus. We already split customer deployments among many smaller clusters for a variety of reasons so the one logical option would be to move to a smaller Prometheus server (or a pair for redundancy) per cluster rather than a single global one.

For most reporting needs this is not a big issue as we usually don’t need hosts/containers from different clusters in the same dashboard, but we may keep a small global cluster with much longer retention and just a modest number of down-sampled and aggregated metrics from each cluster’s Prometheus using Recording Rules.

Interview with DigitalOcean

Next in our series of interviews with users of Prometheus, DigitalOcean talks about how they use Prometheus. Carlos Amedee also talked about the social aspects of the rollout at PromCon 2016.

Can you tell us about yourself and what DigitalOcean does?

My name is Ian Hansen and I work on the platform metrics team. DigitalOcean provides simple cloud computing. To date, we’ve created 20 million Droplets (SSD cloud servers) across 13 regions. We also recently released a new Block Storage product.

DigitalOcean logo

What was your pre-Prometheus monitoring experience?

Before Prometheus, we were running Graphite and OpenTSDB. Graphite was used for smaller-scale applications and OpenTSDB was used for collecting metrics from all of our physical servers via Collectd. Nagios would pull these databases to trigger alerts. We do still use Graphite but we no longer run OpenTSDB.

Why did you decide to look at Prometheus?

I was frustrated with OpenTSDB because I was responsible for keeping the cluster online, but found it difficult to guard against metric storms. Sometimes a team would launch a new (very chatty) service that would impact the total capacity of the cluster and hurt my SLAs.

We are able to blacklist/whitelist new metrics coming in to OpenTSDB, but didn’t have a great way to guard against chatty services except for organizational process (which was hard to change/enforce). Other teams were frustrated with the query language and the visualization tools available at the time. I was chatting with Julius Volz about push vs pull metric systems and was sold in wanting to try Prometheus when I saw that I would really be in control of my SLA when I get to determine what I’m pulling and how frequently. Plus, I really really liked the query language.

How did you transition?

We were gathering metrics via Collectd sending to OpenTSDB. Installing the Node Exporter in parallel with our already running Collectd setup allowed us to start experimenting with Prometheus. We also created a custom exporter to expose Droplet metrics. Soon, we had feature parity with our OpenTSDB service and started turning off Collectd and then turned off the OpenTSDB cluster.

People really liked Prometheus and the visualization tools that came with it. Suddenly, my small metrics team had a backlog that we couldn’t get to fast enough to make people happy, and instead of providing and maintaining Prometheus for people’s services, we looked at creating tooling to make it as easy as possible for other teams to run their own Prometheus servers and to also run the common exporters we use at the company.

Some teams have started using Alertmanager, but we still have a concept of pulling Prometheus from our existing monitoring tools.

What improvements have you seen since switching?

We’ve improved our insights on hypervisor machines. The data we could get out of Collectd and Node Exporter is about the same, but it’s much easier for our team of golang developers to create a new custom exporter that exposes data specific to the services we run on each hypervisor.

We’re exposing better application metrics. It’s easier to learn and teach how to create a Prometheus metric that can be aggregated correctly later. With Graphite it’s easy to create a metric that can’t be aggregated in a certain way later because the dot-separated-name wasn’t structured right.

Creating alerts is much quicker and simpler than what we had before, plus in a language that is familiar. This has empowered teams to create better alerting for the services they know and understand because they can iterate quickly.

What do you think the future holds for DigitalOcean and Prometheus?

We’re continuing to look at how to make collecting metrics as easy as possible for teams at DigitalOcean. Right now teams are running their own Prometheus servers for the things they care about, which allowed us to gain observability we otherwise wouldn’t have had as quickly. But, not every team should have to know how to run Prometheus. We’re looking at what we can do to make Prometheus as automatic as possible so that teams can just concentrate on what queries and alerts they want on their services and databases.

We also created Vulcan so that we have long-term data storage, while retaining the Prometheus Query Language that we have built tooling around and trained people how to use.

Interview with ShuttleCloud

Continuing our series of interviews with users of Prometheus, ShuttleCloud talks about how they began using Prometheus. Ignacio from ShuttleCloud also explained how Prometheus Is Good for Your Small Startup at PromCon 2016.

What does ShuttleCloud do?

ShuttleCloud is the world’s most scalable email and contacts data importing system. We help some of the leading email and address book providers, including Google and Comcast, increase user growth and engagement by automating the switching experience through data import.

By integrating our API into their offerings, our customers allow their users to easily migrate their email and contacts from one participating provider to another, reducing the friction users face when switching to a new provider. The 24/7 email providers supported include all major US internet service providers: Comcast, Time Warner Cable, AT&T, Verizon, and more.

By offering end users a simple path for migrating their emails (while keeping complete control over the import tool’s UI), our customers dramatically improve user activation and onboarding.

ShuttleCloud's integration with Gmail ShuttleCloud’s integration with Google’s Gmail Platform. Gmail has imported data for 3 million users with our API.

ShuttleCloud’s technology encrypts all the data required to process an import, in addition to following the most secure standards (SSL, oAuth) to ensure the confidentiality and integrity of API requests. Our technology allows us to guarantee our platform’s high availability, with up to 99.5% uptime assurances.

ShuttleCloud by Numbers

What was your pre-Prometheus monitoring experience?

In the beginning, a proper monitoring system for our infrastructure was not one of our main priorities. We didn’t have as many projects and instances as we currently have, so we worked with other simple systems to alert us if anything was not working properly and get it under control.

  • We had a set of automatic scripts to monitor most of the operational metrics for the machines. These were cron-based and executed, using Ansible from a centralized machine. The alerts were emails sent directly to the entire development team.
  • We trusted Pingdom for external blackbox monitoring and checking that all our frontends were up. They provided an easy interface and alerting system in case any of our external services were not reachable.

Fortunately, big customers arrived, and the SLAs started to be more demanding. Therefore, we needed something else to measure how we were performing and to ensure that we were complying with all SLAs. One of the features we required was to have accurate stats about our performance and business metrics (i.e., how many migrations finished correctly), so reporting was more on our minds than monitoring.

We developed the following system:

Initial Shuttlecloud System

  • The source of all necessary data is a status database in a CouchDB. There, each document represents one status of an operation. This information is processed by the Status Importer and stored in a relational manner in a MySQL database.

  • A component gathers data from that database, with the information aggregated and post-processed into several views.

    • One of the views is the email report, which we needed for reporting purposes. This is sent via email.
    • The other view pushes data to a dashboard, where it can be easily controlled. The dashboard service we used was external. We trusted Ducksboard, not only because the dashboards were easy to set up and looked beautiful, but also because they provided automatic alerts if a threshold was reached.

With all that in place, it didn’t take us long to realize that we would need a proper metrics, monitoring, and alerting system as the number of projects started to increase.

Some drawbacks of the systems we had at that time were:

  • No centralized monitoring system. Each metric type had a different one:
    • System metrics → Scripts run by Ansible.
    • Business metrics → Ducksboard and email reports.
    • Blackbox metrics → Pingdom.
  • No standard alerting system. Each metric type had different alerts (email, push notification, and so on).
  • Some business metrics had no alerts. These were reviewed manually.

Why did you decide to look at Prometheus?

We analyzed several monitoring and alerting systems. We were eager to get our hands dirty and check if the a solution would succeed or fail. The system we decided to put to the test was Prometheus, for the following reasons:

  • First of all, you don’t have to define a fixed metric system to start working with it; metrics can be added or changed in the future. This provides valuable flexibility when you don’t know all of the metrics you want to monitor yet.
  • If you know anything about Prometheus, you know that metrics can have labels that abstract us from the fact that different time series are considered. This, together with its query language, provided even more flexibility and a powerful tool. For example, we can have the same metric defined for different environments or projects and get a specific time series or aggregate certain metrics with the appropriate labels:
    • http_requests_total{job="my_super_app_1",environment="staging"} - the time series corresponding to the staging environment for the app "my_super_app_1".
    • http_requests_total{job="my_super_app_1"} - the time series for all environments for the app "my_super_app_1".
    • http_requests_total{environment="staging"} - the time series for all staging environments for all jobs.
  • Prometheus supports a DNS service for service discovery. We happened to already have an internal DNS service.
  • There is no need to install any external services (unlike Sensu, for example, which needs a data-storage service like Redis and a message bus like RabbitMQ). This might not be a deal breaker, but it definitely makes the test easier to perform, deploy, and maintain.
  • Prometheus is quite easy to install, as you only need to download an executable Go file. The Docker container also works well and it is easy to start.

How do you use Prometheus?

Initially we were only using some metrics provided out of the box by the node_exporter, including:

  • hard drive usage.
  • memory usage.
  • if an instance is up or down.

Our internal DNS service is integrated to be used for service discovery, so every new instance is automatically monitored.

Some of the metrics we used, which were not provided by the node_exporter by default, were exported using the node_exporter textfile collector feature. The first alerts we declared on the Prometheus Alertmanager were mainly related to the operational metrics mentioned above.

We later developed an operation exporter that allowed us to know the status of the system almost in real time. It exposed business metrics, namely the statuses of all operations, the number of incoming migrations, the number of finished migrations, and the number of errors. We could aggregate these on the Prometheus side and let it calculate different rates.

We decided to export and monitor the following metrics:

  • operation_requests_total
  • operation_statuses_total
  • operation_errors_total

Shuttlecloud Prometheus System

We have most of our services duplicated in two Google Cloud Platform availability zones. That includes the monitoring system. It’s straightforward to have more than one operation exporter in two or more different zones, as Prometheus can aggregate the data from all of them and make one metric (i.e., the maximum of all). We currently don’t have Prometheus or the Alertmanager in HA — only a metamonitoring instance — but we are working on it.

For external blackbox monitoring, we use the Prometheus Blackbox Exporter. Apart from checking if our external frontends are up, it is especially useful for having metrics for SSL certificates’ expiration dates. It even checks the whole chain of certificates. Kudos to Robust Perception for explaining it perfectly in their blogpost.

We set up some charts in Grafana for visual monitoring in some dashboards, and the integration with Prometheus was trivial. The query language used to define the charts is the same as in Prometheus, which simplified their creation a lot.

We also integrated Prometheus with Pagerduty and created a schedule of people on-call for the critical alerts. For those alerts that were not considered critical, we only sent an email.

How does Prometheus make things better for you?

We can't compare Prometheus with our previous solution because we didn’t have one, but we can talk about what features of Prometheus are highlights for us:

  • It has very few maintenance requirements.
  • It’s efficient: one machine can handle monitoring the whole cluster.
  • The community is friendly—both dev and users. Moreover, Brian’s blog is a very good resource.
  • It has no third-party requirements; it’s just the server and the exporters. (No RabbitMQ or Redis needs to be maintained.)
  • Deployment of Go applications is a breeze.

What do you think the future holds for ShuttleCloud and Prometheus?

We’re very happy with Prometheus, but new exporters are always welcome (Celery or Spark, for example).

One question that we face every time we add a new alarm is: how do we test that the alarm works as expected? It would be nice to have a way to inject fake metrics in order to raise an alarm, to test it.

PromCon 2016 - It's a wrap!

What happened

Last week, eighty Prometheus users and developers from around the world came together for two days in Berlin for the first-ever conference about the Prometheus monitoring system: PromCon 2016. The goal of this conference was to exchange knowledge, best practices, and experience gained using Prometheus. We also wanted to grow the community and help people build professional connections around service monitoring. Here are some impressions from the first morning:

Pull doesn't scale - or does it?

Let's talk about a particularly persistent myth. Whenever there is a discussion about monitoring systems and Prometheus's pull-based metrics collection approach comes up, someone inevitably chimes in about how a pull-based approach just “fundamentally doesn't scale”. The given reasons are often vague or only apply to systems that are fundamentally different from Prometheus. In fact, having worked with pull-based monitoring at the largest scales, this claim runs counter to our own operational experience.

We already have an FAQ entry about why Prometheus chooses pull over push, but it does not focus specifically on scaling aspects. Let's have a closer look at the usual misconceptions around this claim and analyze whether and how they would apply to Prometheus.

Prometheus is not Nagios

When people think of a monitoring system that actively pulls, they often think of Nagios. Nagios has a reputation of not scaling well, in part due to spawning subprocesses for active checks that can run arbitrary actions on the Nagios host in order to determine the health of a certain host or service. This sort of check architecture indeed does not scale well, as the central Nagios host quickly gets overwhelmed. As a result, people usually configure checks to only be executed every couple of minutes, or they run into more serious problems.

However, Prometheus takes a fundamentally different approach altogether. Instead of executing check scripts, it only collects time series data from a set of instrumented targets over the network. For each target, the Prometheus server simply fetches the current state of all metrics of that target over HTTP (in a highly parallel way, using goroutines) and has no other execution overhead that would be pull-related. This brings us to the next point:

It doesn't matter who initiates the connection

For scaling purposes, it doesn't matter who initiates the TCP connection over which metrics are then transferred. Either way you do it, the effort for establishing a connection is small compared to the metrics payload and other required work.

But a push-based approach could use UDP and avoid connection establishment altogether, you say! True, but the TCP/HTTP overhead in Prometheus is still negligible compared to the other work that the Prometheus server has to do to ingest data (especially persisting time series data on disk). To put some numbers behind this: a single big Prometheus server can easily store millions of time series, with a record of 800,000 incoming samples per second (as measured with real production metrics data at SoundCloud). Given a 10-seconds scrape interval and 700 time series per host, this allows you to monitor over 10,000 machines from a single Prometheus server. The scaling bottleneck here has never been related to pulling metrics, but usually to the speed at which the Prometheus server can ingest the data into memory and then sustainably persist and expire data on disk/SSD.

Also, although networks are pretty reliable these days, using a TCP-based pull approach makes sure that metrics data arrives reliably, or that the monitoring system at least knows immediately when the metrics transfer fails due to a broken network.

Prometheus is not an event-based system

Some monitoring systems are event-based. That is, they report each individual event (an HTTP request, an exception, you name it) to a central monitoring system immediately as it happens. This central system then either aggregates the events into metrics (StatsD is the prime example of this) or stores events individually for later processing (the ELK stack is an example of that). In such a system, pulling would be problematic indeed: the instrumented service would have to buffer events between pulls, and the pulls would have to happen incredibly frequently in order to simulate the same “liveness” of the push-based approach and not overwhelm event buffers.

However, again, Prometheus is not an event-based monitoring system. You do not send raw events to Prometheus, nor can it store them. Prometheus is in the business of collecting aggregated time series data. That means that it's only interested in regularly collecting the current state of a given set of metrics, not the underlying events that led to the generation of those metrics. For example, an instrumented service would not send a message about each HTTP request to Prometheus as it is handled, but would simply count up those requests in memory. This can happen hundreds of thousands of times per second without causing any monitoring traffic. Prometheus then simply asks the service instance every 15 or 30 seconds (or whatever you configure) about the current counter value and stores that value together with the scrape timestamp as a sample. Other metric types, such as gauges, histograms, and summaries, are handled similarly. The resulting monitoring traffic is low, and the pull-based approach also does not create problems in this case.

But now my monitoring needs to know about my service instances!

With a pull-based approach, your monitoring system needs to know which service instances exist and how to connect to them. Some people are worried about the extra configuration this requires on the part of the monitoring system and see this as an operational scalability problem.

We would argue that you cannot escape this configuration effort for serious monitoring setups in any case: if your monitoring system doesn't know what the world should look like and which monitored service instances should be there, how would it be able to tell when an instance just never reports in, is down due to an outage, or really is no longer meant to exist? This is only acceptable if you never care about the health of individual instances at all, like when you only run ephemeral workers where it is sufficient for a large-enough number of them to report in some result. Most environments are not exclusively like that.

If the monitoring system needs to know the desired state of the world anyway, then a push-based approach actually requires more configuration in total. Not only does your monitoring system need to know what service instances should exist, but your service instances now also need to know how to reach your monitoring system. A pull approach not only requires less configuration, it also makes your monitoring setup more flexible. With pull, you can just run a copy of production monitoring on your laptop to experiment with it. It also allows you just fetch metrics with some other tool or inspect metrics endpoints manually. To get high availability, pull allows you to just run two identically configured Prometheus servers in parallel. And lastly, if you have to move the endpoint under which your monitoring is reachable, a pull approach does not require you to reconfigure all of your metrics sources.

On a practical front, Prometheus makes it easy to configure the desired state of the world with its built-in support for a wide variety of service discovery mechanisms for cloud providers and container-scheduling systems: Consul, Marathon, Kubernetes, EC2, DNS-based SD, Azure, Zookeeper Serversets, and more. Prometheus also allows you to plug in your own custom mechanism if needed. In a microservice world or any multi-tiered architecture, it is also fundamentally an advantage if your monitoring system uses the same method to discover targets to monitor as your service instances use to discover their backends. This way you can be sure that you are monitoring the same targets that are serving production traffic and you have only one discovery mechanism to maintain.

Accidentally DDoS-ing your monitoring

Whether you pull or push, any time-series database will fall over if you send it more samples than it can handle. However, in our experience it's slightly more likely for a push-based approach to accidentally bring down your monitoring. If the control over what metrics get ingested from which instances is not centralized (in your monitoring system), then you run into the danger of experimental or rogue jobs suddenly pushing lots of garbage data into your production monitoring and bringing it down. There are still plenty of ways how this can happen with a pull-based approach (which only controls where to pull metrics from, but not the size and nature of the metrics payloads), but the risk is lower. More importantly, such incidents can be mitigated at a central point.

Real-world proof

Besides the fact that Prometheus is already being used to monitor very large setups in the real world (like using it to monitor millions of machines at DigitalOcean), there are other prominent examples of pull-based monitoring being used successfully in the largest possible environments. Prometheus was inspired by Google's Borgmon, which was (and partially still is) used within Google to monitor all its critical production services using a pull-based approach. Any scaling issues we encountered with Borgmon at Google were not due its pull approach either. If a pull-based approach scales to a global environment with many tens of datacenters and millions of machines, you can hardly say that pull doesn't scale.

But there are other problems with pull!

There are indeed setups that are hard to monitor with a pull-based approach. A prominent example is when you have many endpoints scattered around the world which are not directly reachable due to firewalls or complicated networking setups, and where it's infeasible to run a Prometheus server directly in each of the network segments. This is not quite the environment for which Prometheus was built, although workarounds are often possible (via the Pushgateway or restructuring your setup). In any case, these remaining concerns about pull-based monitoring are usually not scaling-related, but due to network operation difficulties around opening TCP connections.

All good then?

This article addresses the most common scalability concerns around a pull-based monitoring approach. With Prometheus and other pull-based systems being used successfully in very large environments and the pull aspect not posing a bottleneck in reality, the result should be clear: the “pull doesn't scale” argument is not a real concern. We hope that future debates will focus on aspects that matter more than this red herring.

Prometheus reaches 1.0

In January, we published a blog post on Prometheus’s first year of public existence, summarizing what has been an amazing journey for us, and hopefully an innovative and useful monitoring solution for you. Since then, Prometheus has also joined the Cloud Native Computing Foundation, where we are in good company, as the second charter project after Kubernetes.

Our recent work has focused on delivering a stable API and user interface, marked by version 1.0 of Prometheus. We’re thrilled to announce that we’ve reached this goal, and Prometheus 1.0 is available today.

What does 1.0 mean for you?

If you have been using Prometheus for a while, you may have noticed that the rate and impact of breaking changes significantly decreased over the past year. In the same spirit, reaching 1.0 means that subsequent 1.x releases will remain API stable. Upgrades won’t break programs built atop the Prometheus API, and updates won’t require storage re-initialization or deployment changes. Custom dashboards and alerts will remain intact across 1.x version updates as well. We’re confident Prometheus 1.0 is a solid monitoring solution. Now that the Prometheus server has reached a stable API state, other modules will follow it to their own stable version 1.0 releases over time.

Fine print

So what does API stability mean? Prometheus has a large surface area and some parts are certainly more mature than others. There are two simple categories, stable and unstable:

Stable as of v1.0 and throughout the 1.x series:

  • The query language and data model
  • Alerting and recording rules
  • The ingestion exposition formats
  • Configuration flag names
  • HTTP API (used by dashboards and UIs)
  • Configuration file format (minus the non-stable service discovery integrations, see below)
  • Alerting integration with Alertmanager 0.1+ for the foreseeable future
  • Console template syntax and semantics

Unstable and may change within 1.x:

  • The remote storage integrations (InfluxDB, OpenTSDB, Graphite) are still experimental and will at some point be removed in favor of a generic, more sophisticated API that allows storing samples in arbitrary storage systems.
  • Several service discovery integrations are new and need to keep up with fast evolving systems. Hence, integrations with Kubernetes, Marathon, Azure, and EC2 remain in beta status and are subject to change. However, changes will be clearly announced.
  • Exact flag meanings may change as necessary. However, changes will never cause the server to not start with previous flag configurations.
  • Go APIs of packages that are part of the server.
  • HTML generated by the web UI.
  • The metrics in the /metrics endpoint of Prometheus itself.
  • Exact on-disk format. Potential changes however, will be forward compatible and transparently handled by Prometheus.

So Prometheus is complete now?

Absolutely not. We have a long roadmap ahead of us, full of great features to implement. Prometheus will not stay in 1.x for years to come. The infrastructure space is evolving rapidly and we fully intend for Prometheus to evolve with it. This means that we will remain willing to question what we did in the past and are open to leave behind things that have lost relevance. There will be new major versions of Prometheus to facilitate future plans like persistent long-term storage, newer iterations of Alertmanager, internal storage improvements, and many things we don’t even know about yet.

Closing thoughts

We want to thank our fantastic community for field testing new versions, filing bug reports, contributing code, helping out other community members, and shaping Prometheus by participating in countless productive discussions. In the end, you are the ones who make Prometheus successful.

Thank you, and keep up the great work!

Prometheus to Join the Cloud Native Computing Foundation

Since the inception of Prometheus, we have been looking for a sustainable governance model for the project that is independent of any single company. Recently, we have been in discussions with the newly formed Cloud Native Computing Foundation (CNCF), which is backed by Google, CoreOS, Docker, Weaveworks, Mesosphere, and other leading infrastructure companies.

Today, we are excited to announce that the CNCF's Technical Oversight Committee voted unanimously to accept Prometheus as a second hosted project after Kubernetes! You can find more information about these plans in the official press release by the CNCF.

By joining the CNCF, we hope to establish a clear and sustainable project governance model, as well as benefit from the resources, infrastructure, and advice that the independent foundation provides to its members.

We think that the CNCF and Prometheus are an ideal thematic match, as both focus on bringing about a modern vision of the cloud.

In the following months, we will be working with the CNCF on finalizing the project governance structure. We will report back when there are more details to announce.

When (not) to use varbit chunks

The embedded time serie database (TSDB) of the Prometheus server organizes the raw sample data of each time series in chunks of constant 1024 bytes size. In addition to the raw sample data, a chunk contains some meta-data, which allows the selection of a different encoding for each chunk. The most fundamental distinction is the encoding version. You select the version for newly created chunks via the command line flag -storage.local.chunk-encoding-version. Up to now, there were only two supported versions: 0 for the original delta encoding, and 1 for the improved double-delta encoding. With release 0.18.0, we added version 2, which is another variety of double-delta encoding. We call it varbit encoding because it involves a variable bit-width per sample within the chunk. While version 1 is superior to version 0 in almost every aspect, there is a real trade-off between version 1 and 2. This blog post will help you to make that decision. Version 1 remains the default encoding, so if you want to try out version 2 after reading this article, you have to select it explicitly via the command line flag. There is no harm in switching back and forth, but note that existing chunks will not change their encoding version once they have been created. However, these chunks will gradually be phased out according to the configured retention time and will thus be replaced by chunks with the encoding specified in the command-line flag.

Interview with ShowMax

This is the second in a series of interviews with users of Prometheus, allowing them to share their experiences of evaluating and using Prometheus.

Can you tell us about yourself and what ShowMax does?

I’m Antonin Kral, and I’m leading research and architecture for ShowMax. Before that, I’ve held architectural and CTO roles for the past 12 years.

ShowMax is a subscription video on demand service that launched in South Africa in 2015. We’ve got an extensive content catalogue with more than 20,000 episodes of TV shows and movies. Our service is currently available in 65 countries worldwide. While better known rivals are skirmishing in America and Europe, ShowMax is battling a more difficult problem: how do you binge-watch in a barely connected village in sub-Saharan Africa? Already 35% of video around the world is streamed, but there are still so many places the revolution has left untouched.

ShowMax logo

We are managing about 50 services running mostly on private clusters built around CoreOS. They are primarily handling API requests from our clients (Android, iOS, AppleTV, JavaScript, Samsung TV, LG TV etc), while some of them are used internally. One of the biggest internal pipelines is video encoding which can occupy 400+ physical servers when handling large ingestion batches.

The majority of our back-end services are written in Ruby, Go or Python. We use EventMachine when writing apps in Ruby (Goliath on MRI, Puma on JRuby). Go is typically used in apps that require large throughput and don’t have so much business logic. We’re very happy with Falcon for services written in Python. Data is stored in PostgreSQL and ElasticSearch clusters. We use etcd and custom tooling for configuring Varnishes for routing requests.

Interview with Life360

This is the first in a series of interviews with users of Prometheus, allowing them to share their experiences of evaluating and using Prometheus. Our first interview is with Daniel from Life360.

Can you tell us about yourself and what Life360 does?

I’m Daniel Ben Yosef, a.k.a, dby, and I’m an Infrastructure Engineer for Life360, and before that, I’ve held systems engineering roles for the past 9 years.

Life360 creates technology that helps families stay connected, we’re the Family Network app for families. We’re quite busy handling these families - at peak we serve 700k requests per minute for 70 million registered families.

We manage around 20 services in production, mostly handling location requests from mobile clients (Android, iOS, and Windows Phone), spanning over 150+ instances at peak. Redundancy and high-availability are our goals and we strive to maintain 100% uptime whenever possible because families trust us to be available.

We hold user data in both our MySQL multi-master cluster and in our 12-node Cassandra ring which holds around 4TB of data at any given time. We have services written in Go, Python, PHP, as well as plans to introduce Java to our stack. We use Consul for service discovery, and of course our Prometheus setup is integrated with it.

Custom Alertmanager Templates

The Alertmanager handles alerts sent by Prometheus servers and sends notifications about them to different receivers based on their labels.

A receiver can be one of many different integrations such as PagerDuty, Slack, email, or a custom integration via the generic webhook interface (for example JIRA).

Templates

The messages sent to receivers are constructed via templates. Alertmanager comes with default templates but also allows defining custom ones.

In this blog post, we will walk through a simple customization of Slack notifications.

We use this simple Alertmanager configuration that sends all alerts to Slack:

global:
  slack_api_url: '<slack_webhook_url>'

route:
  receiver: 'slack-notifications'
  # All alerts in a notification have the same value for these labels.
  group_by: [alertname, datacenter, app]

receivers:
- name: 'slack-notifications'
  slack_configs:
  - channel: '#alerts'

By default, a Slack message sent by Alertmanager looks like this:

It shows us that there is one firing alert, followed by the label values of the alert grouping (alertname, datacenter, app) and further label values the alerts have in common (critical).

One Year of Open Prometheus Development

The beginning

A year ago today, we officially announced Prometheus to the wider world. This is a great opportunity for us to look back and share some of the wonderful things that have happened to the project since then. But first, let's start at the beginning.

Although we had already started Prometheus as an open-source project on GitHub in 2012, we didn't make noise about it at first. We wanted to give the project time to mature and be able to experiment without friction. Prometheus was gradually introduced for production monitoring at SoundCloud in 2013 and then saw more and more usage within the company, as well as some early adoption by our friends at Docker and Boexever in 2014. Over the years, Prometheus was growing more and more mature and although it was already solving people's monitoring problems, it was still unknown to the wider public.

Custom service discovery with etcd

In a previous post we introduced numerous new ways of doing service discovery in Prometheus. Since then a lot has happened. We improved the internal implementation and received fantastic contributions from our community, adding support for service discovery with Kubernetes and Marathon. They will become available with the release of version 0.16.

We also touched on the topic of custom service discovery.

Not every type of service discovery is generic enough to be directly included in Prometheus. Chances are your organisation has a proprietary system in place and you just have to make it work with Prometheus. This does not mean that you cannot enjoy the benefits of automatically discovering new monitoring targets.

In this post we will implement a small utility program that connects a custom service discovery approach based on etcd, the highly consistent distributed key-value store, to Prometheus.

Monitoring DreamHack - the World's Largest Digital Festival

Editor's note: This article is a guest post written by a Prometheus user.

If you are operating the network for 10,000's of demanding gamers, you need to really know what is going on inside your network. Oh, and everything needs to be built from scratch in just five days.

If you have never heard about DreamHack before, here is the pitch: Bring 20,000 people together and have the majority of them bring their own computer. Mix in professional gaming (eSports), programming contests, and live music concerts. The result is the world's largest festival dedicated solely to everything digital.

To make such an event possible, there needs to be a lot of infrastructure in place. Ordinary infrastructures of this size take months to build, but the crew at DreamHack builds everything from scratch in just five days. This of course includes stuff like configuring network switches, but also building the electricity distribution, setting up stores for food and drinks, and even building the actual tables.

The team that builds and operates everything related to the network is officially called the Network team, but we usually refer to ourselves as tech or dhtech. This post is going to focus on the work of dhtech and how we used Prometheus during DreamHack Summer 2015 to try to kick our monitoring up another notch.

Practical Anomaly Detection

In his Open Letter To Monitoring/Metrics/Alerting Companies, John Allspaw asserts that attempting "to detect anomalies perfectly, at the right time, is not possible".

I have seen several attempts by talented engineers to build systems to automatically detect and diagnose problems based on time series data. While it is certainly possible to get a demonstration working, the data always turned out to be too noisy to make this approach work for anything but the simplest of real-world systems.

All hope is not lost though. There are many common anomalies which you can detect and handle with custom-built rules. The Prometheus query language gives you the tools to discover these anomalies while avoiding false positives.

Advanced Service Discovery in Prometheus 0.14.0

This week we released Prometheus v0.14.0 — a version with many long-awaited additions and improvements.

On the user side, Prometheus now supports new service discovery mechanisms. In addition to DNS-SRV records, it now supports Consul out of the box, and a file-based interface allows you to connect your own discovery mechanisms. Over time, we plan to add other common service discovery mechanisms to Prometheus.

Aside from many smaller fixes and improvements, you can now also reload your configuration during runtime by sending a SIGHUP to the Prometheus process. For a full list of changes, check the changelog for this release.

In this blog post, we will take a closer look at the built-in service discovery mechanisms and provide some practical examples. As an additional resource, see Prometheus's configuration documentation.

Prometheus Monitoring Spreads through the Internet

It has been almost three months since we publicly announced Prometheus version 0.10.0, and we're now at version 0.13.1.

SoundCloud's announcement blog post remains the best overview of the key components of Prometheus, but there has been a lot of other online activity around Prometheus. This post will let you catch up on anything you missed.

In the future, we will use this blog to publish more articles and announcements to help you get the most out of Prometheus.