Interview with Canonical

Continuing our series of interviews with users of Prometheus, Canonical talks about how they are transitioning to Prometheus.

Can you tell us about yourself and what Canonical does?

Canonical is probably best known as the company that sponsors Ubuntu Linux. We also produce or contribute to a number of other open-source projects including MAAS, Juju, and OpenStack, and provide commercial support for these products. Ubuntu powers the majority of OpenStack deployments, with 55% of production clouds and 58% of large cloud deployments.

My group, BootStack, is our fully managed private cloud service. We build and operate OpenStack clouds for Canonical customers.

What was your pre-Prometheus monitoring experience?

We’d used a combination of Nagios, Graphite/statsd, and in-house Django apps. These did not offer us the level of flexibility and reporting that we need in both our internal and customer cloud environments.

Why did you decide to look at Prometheus?

We’d evaluated a few alternatives, including InfluxDB and extending our use of Graphite, but our first experiences with Prometheus proved it to have the combination of simplicity and power that we were looking for. We especially appreciate the convenience of labels, the simple HTTP protocol, and the out-of-the-box time series alerting. The potential for Prometheus to replace two different tools (alerting and trending) with one is particularly appealing.

Also, several of our staff have prior experience with Borgmon from their time at Google which greatly added to our interest!

How did you transition?

We are still in the process of transitioning; we expect this will take some time due to the number of custom checks in our existing systems that will need to be re-implemented in Prometheus. The most useful resource has been the prometheus.io site documentation.

It took us a while to choose an exporter. We originally went with collectd but ran into its limitations. We’re now working on writing an openstack-exporter and were a bit surprised to find that there is no good, working example of how to write an exporter from scratch.

Some challenges we’ve run into are: no downsampling support, no long-term storage solution (yet), and the default two-week retention period, which caught us by surprise. There's currently no tie-in with Juju, but we’re working on it!

What improvements have you seen since switching?

Once we got the hang of exporters, we found they were very easy to write and that they give us very useful metrics. For example, we are developing an openstack-exporter for our cloud environments. We’ve also seen very quick cross-team adoption from our DevOps and WebOps groups and developers. We don’t yet have alerting in place, but expect a lot more to come once we reach that phase of the transition.

What do you think the future holds for Canonical and Prometheus?

We expect Prometheus to be a significant part of our monitoring and reporting infrastructure, providing the metrics gathering and storage for numerous current and future systems. We see it potentially replacing Nagios for alerting.

Interview with JustWatch

Continuing our series of interviews with users of Prometheus, JustWatch talks about how they established their monitoring.

Can you tell us about yourself and what JustWatch does?

For consumers, JustWatch is a streaming search engine that helps to find out where to watch movies and TV shows legally online and in theaters. You can search movie content across all major streaming providers like Netflix, HBO, Amazon Video, iTunes, Google Play, and many others in 17 countries.

For our clients, such as movie studios or Video on Demand providers, we are an international movie marketing company that collects anonymized data about the purchase behavior and movie taste of fans worldwide from our consumer apps. We help studios advertise their content to the right audience and make digital video advertising a lot more efficient by minimizing wasted coverage.

JustWatch logo

Since our launch in 2014 we went from zero to one of the top 20k websites internationally without spending a single dollar on marketing - becoming the largest streaming search engine worldwide in under two years. Currently, with an engineering team of just 10, we build and operate a fully dockerized stack of about 50 micro- and macro-services, running mostly on Kubernetes.

What was your pre-Prometheus monitoring experience?

At prior companies, many of us had worked with most of the open-source monitoring products out there. We have quite a bit of experience with Nagios, Icinga, Zabbix, Monit, Munin, Graphite, and a few other systems. At one company I helped build a distributed Nagios setup with Puppet. This setup was nice, since new services automatically showed up in the system, but taking instances out was still painful. As soon as you have some variance in your systems, host- and service-based monitoring suites just don’t fit well. The label-based approach Prometheus took was something I had always wanted, but hadn't found before.

Why did you decide to look at Prometheus?

At JustWatch the public Prometheus announcement hit at exactly the right time. We mostly had blackbox monitoring for the first few months of the company - CloudWatch for some of the most important internal metrics, combined with external services like Pingdom for detecting site-wide outages. Also, none of the classical host-based solutions satisfied us. In a world of containers and microservices, host-based tools like Icinga, Thruk, or Zabbix felt antiquated and not ready for the job. When we started to investigate whitebox monitoring, some of us luckily attended the Golang Meetup where Julius and Björn announced Prometheus. We quickly set up a Prometheus server and started to instrument our Go services (we use almost only Go for the backend). It was amazing how easy that was - the design felt cloud- and service-oriented as a first principle and never got in the way.

How did you transition?

Transitioning wasn't that hard; timing-wise, we were lucky enough to go from no relevant monitoring directly to Prometheus.

The transition to Prometheus mostly consisted of including the Go client in our apps and wrapping the HTTP handlers. We also wrote and deployed several exporters, including the node_exporter and several exporters for cloud provider APIs. In our experience, monitoring and alerting is a project that is never finished, but the bulk of the work was done within a few weeks as a side project.

Since the deployment of Prometheus we tend to look into metrics whenever we miss something or when we are designing new services from scratch.

It took some time to fully grasp the elegance of PromQL and the label concept, but the effort really paid off.

What improvements have you seen since switching?

Prometheus enlightened us by making it incredibly easy to reap the benefits of whitebox monitoring and label-based canary deployments. The out-of-the-box metrics for many Golang aspects (HTTP handlers, Go runtime) helped us get to a return on investment very quickly - goroutine metrics alone saved the day multiple times. The only monitoring component we actually liked before - Grafana - feels like a natural fit for Prometheus and has allowed us to create some very helpful dashboards. We appreciated that Prometheus didn't try to reinvent the wheel but rather fit in perfectly with the best solution out there. Another huge improvement over predecessors was Prometheus's focus on actually getting the math right (percentiles, etc.). In other systems, we were never quite sure whether the operations offered made sense. Percentiles in particular are such a natural and necessary way of reasoning about microservice performance that it felt great to see them get first-class treatment.

Database Dashboard

The integrated service discovery makes it super easy to manage the scrape targets. For Kubernetes, everything just works out-of-the-box. For some other systems not running on Kubernetes yet, we use a Consul-based approach. All it takes to get an application monitored by Prometheus is to add the client, expose /metrics, and set one simple annotation on the Container/Pod. This low coupling takes out a lot of friction between development and operations - a lot of services are built well instrumented from the beginning, because it's simple and fun.
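
To make the annotation part concrete, here is a minimal sketch of what such a Pod definition can look like. The annotation keys used here (prometheus.io/scrape, prometheus.io/port) are a common convention that has to be matched by relabeling rules in the Prometheus Kubernetes scrape configuration; the names and port are placeholders, not JustWatch's actual setup.

apiVersion: v1
kind: Pod
metadata:
  name: my-service
  annotations:
    prometheus.io/scrape: "true"    # matched by a relabel rule in the scrape config
    prometheus.io/port: "8080"      # port on which /metrics is exposed
spec:
  containers:
    - name: my-service
      image: registry.example.com/my-service:latest
      ports:
        - containerPort: 8080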

The combination of time series and clever functions makes for awesome alerting superpowers. Aggregations that run on the server, and the treatment of time series, combinations of them, and even functions on those combinations as first-class citizens, make alerting a breeze - oftentimes even after the fact.

What do you think the future holds for JustWatch and Prometheus?

While we very much value that Prometheus doesn't focus on being shiny but on actually working, delivering value, and being reasonably easy to deploy and operate, the Alertmanager in particular still leaves a lot to be desired. Some simple improvements, such as simplified interactive alert building and editing in the frontend, would go a long way towards making work with alerts even simpler.

We are really looking forward to the ongoing improvements in the storage layer, including remote storage. We also hope for some of the approaches taken in Project Prism and Vulcan to be backported to core Prometheus. The most interesting topics for us right now are GCE Service Discovery, easier scaling, and much longer retention periods (even at the cost of colder storage and much longer query times for older events).

We are also looking forward to using Prometheus in more non-technical departments as well. We’d like to cover most of our KPIs with Prometheus to allow everyone to create beautiful dashboards, as well as alerts. We're currently even planning to abuse the awesome alert engine for a new, internal business project - stay tuned!

Interview with Compose

Continuing our series of interviews with users of Prometheus, Compose talks about their monitoring journey from Graphite and InfluxDB to Prometheus.

Can you tell us about yourself and what Compose does?

Compose delivers production-ready database clusters as a service to developers around the world. An app developer can come to us and in a few clicks have a multi-host, highly available, automatically backed up and secure database ready in minutes. Those database deployments then autoscale up as demand increases so a developer can spend their time on building their great apps, not on running their database.

We have tens of clusters of hosts across at least two regions in each of AWS, Google Cloud Platform and SoftLayer. Each cluster spans availability zones where supported and is home to around 1000 highly-available database deployments in their own private networks. More regions and providers are in the works.

What was your pre-Prometheus monitoring experience?

Before Prometheus, a number of different metrics systems were tried. The first system we tried was Graphite, which worked pretty well initially, but the sheer volume of different metrics we had to store, combined with the way Whisper files are stored and accessed on disk, quickly overloaded our systems. While we were aware that Graphite could be scaled horizontally relatively easily, it would have been an expensive cluster. InfluxDB looked more promising so we started trying out the early-ish versions of that and it seemed to work well for a good while. Goodbye Graphite.

The earlier versions of InfluxDB occasionally had issues with data corruption. We semi-regularly had to purge all of our metrics. It wasn’t usually a devastating loss for us, but it was irritating. The continued promises of features that never materialised frankly wore on us.

Why did you decide to look at Prometheus?

It seemed to combine better efficiency with simpler operations than other options.

Pull-based metric gathering puzzled us at first, but we soon realised the benefits. Initially it seemed like it could be far too heavyweight to scale well in our environment where we often have several hundred containers with their own metrics on each host, but by combining it with Telegraf, we can arrange to have each host export metrics for all its containers (as well as its overall resource metrics) via a single Prometheus scrape target.

How did you transition?

We are a Chef shop so we spun up a largish instance with a big EBS volume and then reached right for a community chef cookbook for Prometheus.

With Prometheus up on a host, we wrote a small Ruby script that uses the Chef API to query for all our hosts, and write out a Prometheus target config file. We use this file with a file_sd_config to ensure all hosts are discovered and scraped as soon as they register with Chef. Thanks to Prometheus’ open ecosystem, we were able to use Telegraf out of the box with a simple config to export host-level metrics directly.
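
As a rough sketch of the Prometheus side of this setup (the job name, file path, and scrape interval are placeholders, not Compose's actual values), the scrape configuration points file_sd_configs at the file that the Chef-querying script keeps up to date:

scrape_configs:
  - job_name: 'chef-hosts'
    scrape_interval: 15s
    file_sd_configs:
      - files:
          - '/etc/prometheus/targets/chef-hosts.json'   # written by the Chef-querying script

Prometheus watches the listed files and picks up changes on its own, so no reload is needed when the script rewrites the target list.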

We were testing how far a single Prometheus would scale and waiting for it to fall over. It didn’t! In fact it handled the load of host-level metrics scraped every 15 seconds for around 450 hosts across our newer infrastructure with very little resource usage.

We have a lot of containers on each host, so we were expecting to have to start sharding Prometheus once we added all the memory usage metrics from those too, but Prometheus just kept on going without any drama and still without getting too close to saturating its resources. We currently monitor over 400,000 distinct metrics every 15 seconds for around 40,000 containers on 450 hosts with a single m4.xlarge Prometheus instance with 1TB of storage. You can see our host dashboard for this host below. Disk IO on the 1TB gp2 SSD EBS volume will probably be the limiting factor eventually. Our initial sizing guess has left us well over-provisioned for now, but we are growing fast in both metrics gathered and hosts/containers to monitor.

Prometheus Host Dashboard

At this point the Prometheus server we’d thrown up to test with was vastly more reliable than the InfluxDB cluster we had doing the same job before, so we did some basic work to make it less of a single-point-of-failure. We added another identical node scraping all the same targets, then added a simple failover scheme with keepalived + DNS updates. This was now more highly available than our previous system so we switched our customer-facing graphs to use Prometheus and tore down the old system.

Prometheus-powered memory metrics for PostgreSQL containers in our app

What improvements have you seen since switching?

Our previous monitoring setup was unreliable and difficult to manage. With Prometheus we have a system that’s working well for graphing lots of metrics, and we have team members suddenly excited about new ways to use it rather than wary of touching the metrics system we used before.

The cluster is simpler too, with just two identical nodes. As we grow, we know we’ll have to shard the work across more Prometheus hosts and have considered a few ways to do this.

What do you think the future holds for Compose and Prometheus?

Right now we have only replicated the metrics we already gathered in previous systems - basic memory usage for customer containers as well as host-level resource usage for our own operations. The next logical step is enabling the database teams to push metrics to the local Telegraf instance from inside the DB containers, so we can record database-level stats too without increasing the number of targets to scrape.

We also have several other systems that we want to get into Prometheus for better visibility. We run our apps on Mesos and have already integrated basic Docker container metrics, which is better than before, but we also want to have more of the infrastructure components in the Mesos cluster recording to the central Prometheus, so we can have centralised dashboards showing all elements of supporting system health, from load balancers right down to app metrics.

Eventually we will need to shard Prometheus. We already split customer deployments among many smaller clusters for a variety of reasons so the one logical option would be to move to a smaller Prometheus server (or a pair for redundancy) per cluster rather than a single global one.

For most reporting needs this is not a big issue as we usually don’t need hosts/containers from different clusters in the same dashboard, but we may keep a small global cluster with much longer retention and just a modest number of down-sampled and aggregated metrics from each cluster’s Prometheus using Recording Rules.

Interview with DigitalOcean

Next in our series of interviews with users of Prometheus, DigitalOcean talks about how they use Prometheus. Carlos Amedee also talked about the social aspects of the rollout at PromCon 2016.

Can you tell us about yourself and what DigitalOcean does?

My name is Ian Hansen and I work on the platform metrics team. DigitalOcean provides simple cloud computing. To date, we’ve created 20 million Droplets (SSD cloud servers) across 13 regions. We also recently released a new Block Storage product.

DigitalOcean logo

What was your pre-Prometheus monitoring experience?

Before Prometheus, we were running Graphite and OpenTSDB. Graphite was used for smaller-scale applications and OpenTSDB was used for collecting metrics from all of our physical servers via Collectd. Nagios would pull these databases to trigger alerts. We do still use Graphite but we no longer run OpenTSDB.

Why did you decide to look at Prometheus?

I was frustrated with OpenTSDB because I was responsible for keeping the cluster online, but found it difficult to guard against metric storms. Sometimes a team would launch a new (very chatty) service that would impact the total capacity of the cluster and hurt my SLAs.

We are able to blacklist/whitelist new metrics coming in to OpenTSDB, but we didn’t have a great way to guard against chatty services except for organizational process (which was hard to change/enforce). Other teams were frustrated with the query language and the visualization tools available at the time. I was chatting with Julius Volz about push vs. pull metric systems and was sold on trying Prometheus when I saw that I would really be in control of my SLA, since I get to determine what I’m pulling and how frequently. Plus, I really, really liked the query language.

How did you transition?

We were gathering metrics via Collectd sending to OpenTSDB. Installing the Node Exporter in parallel with our already running Collectd setup allowed us to start experimenting with Prometheus. We also created a custom exporter to expose Droplet metrics. Soon, we had feature parity with our OpenTSDB service and started turning off Collectd and then turned off the OpenTSDB cluster.

People really liked Prometheus and the visualization tools that came with it. Suddenly, my small metrics team had a backlog that we couldn’t get to fast enough to make people happy, and instead of providing and maintaining Prometheus for people’s services, we looked at creating tooling to make it as easy as possible for other teams to run their own Prometheus servers and to also run the common exporters we use at the company.

Some teams have started using Alertmanager, but we still have a concept of pulling Prometheus from our existing monitoring tools.

What improvements have you seen since switching?

We’ve improved our insights on hypervisor machines. The data we could get out of Collectd and Node Exporter is about the same, but it’s much easier for our team of golang developers to create a new custom exporter that exposes data specific to the services we run on each hypervisor.

We’re exposing better application metrics. It’s easier to learn and teach how to create a Prometheus metric that can be aggregated correctly later. With Graphite, it’s easy to create a metric that can’t be aggregated in a certain way later because the dot-separated name wasn’t structured right.

Creating alerts is much quicker and simpler than what we had before, plus in a language that is familiar. This has empowered teams to create better alerting for the services they know and understand because they can iterate quickly.

What do you think the future holds for DigitalOcean and Prometheus?

We’re continuing to look at how to make collecting metrics as easy as possible for teams at DigitalOcean. Right now teams are running their own Prometheus servers for the things they care about, which allowed us to gain observability we otherwise wouldn’t have had as quickly. But, not every team should have to know how to run Prometheus. We’re looking at what we can do to make Prometheus as automatic as possible so that teams can just concentrate on what queries and alerts they want on their services and databases.

We also created Vulcan so that we have long-term data storage, while retaining the Prometheus Query Language that we have built tooling around and trained people how to use.

Interview with ShuttleCloud

Continuing our series of interviews with users of Prometheus, ShuttleCloud talks about how they began using Prometheus. Ignacio from ShuttleCloud also explained how Prometheus Is Good for Your Small Startup at PromCon 2016.

What does ShuttleCloud do?

ShuttleCloud is the world’s most scalable email and contacts data importing system. We help some of the leading email and address book providers, including Google and Comcast, increase user growth and engagement by automating the switching experience through data import.

By integrating our API into their offerings, our customers allow their users to easily migrate their email and contacts from one participating provider to another, reducing the friction users face when switching to a new provider. The email providers supported 24/7 include all major US internet service providers: Comcast, Time Warner Cable, AT&T, Verizon, and more.

By offering end users a simple path for migrating their emails (while keeping complete control over the import tool’s UI), our customers dramatically improve user activation and onboarding.

ShuttleCloud’s integration with Google’s Gmail Platform. Gmail has imported data for 3 million users with our API.

ShuttleCloud’s technology encrypts all the data required to process an import, in addition to following the most secure standards (SSL, OAuth) to ensure the confidentiality and integrity of API requests. Our technology allows us to guarantee our platform’s high availability, with up to 99.5% uptime assurances.

ShuttleCloud by Numbers

What was your pre-Prometheus monitoring experience?

In the beginning, a proper monitoring system for our infrastructure was not one of our main priorities. We didn’t have as many projects and instances as we currently have, so we worked with other simple systems to alert us if anything was not working properly and get it under control.

  • We had a set of automatic scripts to monitor most of the operational metrics for the machines. These were cron-based and executed using Ansible from a centralized machine. The alerts were emails sent directly to the entire development team.
  • We trusted Pingdom for external blackbox monitoring and checking that all our frontends were up. They provided an easy interface and alerting system in case any of our external services were not reachable.

Fortunately, big customers arrived, and the SLAs started to become more demanding. Therefore, we needed something else to measure how we were performing and to ensure that we were complying with all SLAs. One of the features we required was accurate stats about our performance and business metrics (e.g., how many migrations finished correctly), so reporting was more on our minds than monitoring.

We developed the following system:

Initial Shuttlecloud System

  • The source of all necessary data is a status database in CouchDB. There, each document represents one status of an operation. This information is processed by the Status Importer and stored in a relational manner in a MySQL database.

  • A component gathers data from that database, with the information aggregated and post-processed into several views.

    • One of the views is the email report, which we needed for reporting purposes. This is sent via email.
    • The other view pushes data to a dashboard, where it can be easily controlled. The dashboard service we used was external. We trusted Ducksboard, not only because the dashboards were easy to set up and looked beautiful, but also because they provided automatic alerts if a threshold was reached.

With all that in place, it didn’t take us long to realize that we would need a proper metrics, monitoring, and alerting system as the number of projects started to increase.

Some drawbacks of the systems we had at that time were:

  • No centralized monitoring system. Each metric type had a different one:
    • System metrics → Scripts run by Ansible.
    • Business metrics → Ducksboard and email reports.
    • Blackbox metrics → Pingdom.
  • No standard alerting system. Each metric type had different alerts (email, push notification, and so on).
  • Some business metrics had no alerts. These were reviewed manually.

Why did you decide to look at Prometheus?

We analyzed several monitoring and alerting systems. We were eager to get our hands dirty and check whether a solution would succeed or fail. The system we decided to put to the test was Prometheus, for the following reasons:

  • First of all, you don’t have to define a fixed metric system to start working with it; metrics can be added or changed in the future. This provides valuable flexibility when you don’t know all of the metrics you want to monitor yet.
  • If you know anything about Prometheus, you know that metrics can have labels, which abstract away the fact that different time series are involved. This, together with its query language, provides even more flexibility and a powerful tool. For example, we can have the same metric defined for different environments or projects and get a specific time series or aggregate certain metrics with the appropriate labels (see the aggregation example after this list):
    • http_requests_total{job="my_super_app_1",environment="staging"} - the time series corresponding to the staging environment for the app "my_super_app_1".
    • http_requests_total{job="my_super_app_1"} - the time series for all environments for the app "my_super_app_1".
    • http_requests_total{environment="staging"} - the time series for all staging environments for all jobs.
  • Prometheus supports DNS-based service discovery. We happened to already have an internal DNS service.
  • There is no need to install any external services (unlike Sensu, for example, which needs a data-storage service like Redis and a message bus like RabbitMQ). This might not be a deal breaker, but it definitely makes the test easier to perform, deploy, and maintain.
  • Prometheus is quite easy to install, as you only need to download and run an executable Go binary. The Docker container also works well and is easy to start.
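
To make the aggregation point above concrete, a query along these lines (a sketch; the 5-minute window is an arbitrary choice) sums the per-second request rate across all jobs, broken out by environment:

# requests per second over the last 5 minutes, per environment
sum by (environment) (rate(http_requests_total[5m]))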

How do you use Prometheus?

Initially we were only using some metrics provided out of the box by the node_exporter, including:

  • hard drive usage.
  • memory usage.
  • whether an instance is up or down.

Our internal DNS service is integrated to be used for service discovery, so every new instance is automatically monitored.

Some of the metrics we used, which were not provided by the node_exporter by default, were exported using the node_exporter textfile collector feature. The first alerts we declared on the Prometheus Alertmanager were mainly related to the operational metrics mentioned above.
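
As a rough illustration of the textfile collector mechanism (the metric, value, and path below are made up, not ShuttleCloud's actual metrics): a cron job or deployment script writes a small file in the Prometheus text exposition format into the directory the node_exporter's textfile collector is pointed at, and the metrics appear on the next scrape.

# e.g. /var/lib/node_exporter/textfile/backup.prom (path is an assumption)
# HELP last_backup_timestamp_seconds Unix time of the last successful backup.
# TYPE last_backup_timestamp_seconds gauge
last_backup_timestamp_seconds 1471524000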

We later developed an operation exporter that allowed us to know the status of the system almost in real time. It exposed business metrics, namely the statuses of all operations, the number of incoming migrations, the number of finished migrations, and the number of errors. We could aggregate these on the Prometheus side and let it calculate different rates.

We decided to export and monitor the following metrics:

  • operation_requests_total
  • operation_statuses_total
  • operation_errors_total
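
For illustration, the kinds of rates mentioned above can be derived on the Prometheus side with queries roughly like these (a sketch using the metric names above; the label handling and the 5-minute window are assumptions):

# incoming operations per second
sum(rate(operation_requests_total[5m]))

# fraction of operations ending in error
sum(rate(operation_errors_total[5m])) / sum(rate(operation_requests_total[5m]))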

Shuttlecloud Prometheus System

We have most of our services duplicated in two Google Cloud Platform availability zones. That includes the monitoring system. It’s straightforward to have more than one operation exporter in two or more different zones, as Prometheus can aggregate the data from all of them and make one metric (i.e., the maximum of all). We currently don’t have Prometheus or the Alertmanager in HA — only a metamonitoring instance — but we are working on it.

For external blackbox monitoring, we use the Prometheus Blackbox Exporter. Apart from checking whether our external frontends are up, it is especially useful for having metrics for SSL certificates’ expiration dates. It even checks the whole chain of certificates. Kudos to Robust Perception for explaining it perfectly in their blog post.
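
For example, an expression along the following lines can drive an alert on certificates that are about to expire (a sketch: probe_ssl_earliest_cert_expiry is exposed by the Blackbox Exporter, and the 30-day threshold is an arbitrary choice):

# fires for targets whose earliest certificate in the chain expires within 30 days
probe_ssl_earliest_cert_expiry - time() < 86400 * 30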

We set up some charts in Grafana dashboards for visual monitoring, and the integration with Prometheus was trivial. The query language used to define the charts is the same as in Prometheus, which simplified their creation a lot.

We also integrated Prometheus with PagerDuty and created an on-call schedule for the critical alerts. For alerts that were not considered critical, we only sent an email.

How does Prometheus make things better for you?

We can't compare Prometheus with our previous solution because we didn’t have one, but we can talk about what features of Prometheus are highlights for us:

  • It has very few maintenance requirements.
  • It’s efficient: one machine can handle monitoring the whole cluster.
  • The community is friendly - both developers and users. Moreover, Brian’s blog is a very good resource.
  • It has no third-party requirements; it’s just the server and the exporters. (No RabbitMQ or Redis needs to be maintained.)
  • Deployment of Go applications is a breeze.

What do you think the future holds for ShuttleCloud and Prometheus?

We’re very happy with Prometheus, but new exporters are always welcome (Celery or Spark, for example).

One question that we face every time we add a new alarm is: how do we test that the alarm works as expected? It would be nice to have a way to inject fake metrics in order to raise an alarm, to test it.

PromCon 2016 - It's a wrap!

What happened

Last week, eighty Prometheus users and developers from around the world came together for two days in Berlin for the first-ever conference about the Prometheus monitoring system: PromCon 2016. The goal of this conference was to exchange knowledge, best practices, and experience gained using Prometheus. We also wanted to grow the community and help people build professional connections around service monitoring. Here are some impressions from the first morning:

Pull doesn't scale - or does it?

Let's talk about a particularly persistent myth. Whenever there is a discussion about monitoring systems and Prometheus's pull-based metrics collection approach comes up, someone inevitably chimes in about how a pull-based approach just “fundamentally doesn't scale”. The given reasons are often vague or only apply to systems that are fundamentally different from Prometheus. In fact, having worked with pull-based monitoring at the largest scales, we find that this claim runs counter to our own operational experience.

We already have an FAQ entry about why Prometheus chooses pull over push, but it does not focus specifically on scaling aspects. Let's have a closer look at the usual misconceptions around this claim and analyze whether and how they would apply to Prometheus.

Prometheus is not Nagios

When people think of a monitoring system that actively pulls, they often think of Nagios. Nagios has a reputation of not scaling well, in part due to spawning subprocesses for active checks that can run arbitrary actions on the Nagios host in order to determine the health of a certain host or service. This sort of check architecture indeed does not scale well, as the central Nagios host quickly gets overwhelmed. As a result, people usually configure checks to only be executed every couple of minutes, or they run into more serious problems.

However, Prometheus takes a fundamentally different approach altogether. Instead of executing check scripts, it only collects time series data from a set of instrumented targets over the network. For each target, the Prometheus server simply fetches the current state of all metrics of that target over HTTP (in a highly parallel way, using goroutines) and has no other execution overhead that would be pull-related. This brings us to the next point:

It doesn't matter who initiates the connection

For scaling purposes, it doesn't matter who initiates the TCP connection over which metrics are then transferred. Either way you do it, the effort for establishing a connection is small compared to the metrics payload and other required work.

But a push-based approach could use UDP and avoid connection establishment altogether, you say! True, but the TCP/HTTP overhead in Prometheus is still negligible compared to the other work that the Prometheus server has to do to ingest data (especially persisting time series data on disk). To put some numbers behind this: a single big Prometheus server can easily store millions of time series, with a record of 800,000 incoming samples per second (as measured with real production metrics data at SoundCloud). Given a 10-second scrape interval and 700 time series per host, this allows you to monitor over 10,000 machines from a single Prometheus server. The scaling bottleneck here has never been related to pulling metrics, but usually to the speed at which the Prometheus server can ingest the data into memory and then sustainably persist and expire data on disk/SSD.
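
Spelling out the arithmetic behind that claim:

10,000 hosts × 700 time series per host = 7,000,000 active time series
7,000,000 samples per 10-second scrape interval = 700,000 samples per second

which sits comfortably under the 800,000 samples per second record above, leaving headroom for somewhat more than 10,000 machines.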

Also, although networks are pretty reliable these days, using a TCP-based pull approach makes sure that metrics data arrives reliably, or that the monitoring system at least knows immediately when the metrics transfer fails due to a broken network.

Prometheus is not an event-based system

Some monitoring systems are event-based. That is, they report each individual event (an HTTP request, an exception, you name it) to a central monitoring system immediately as it happens. This central system then either aggregates the events into metrics (StatsD is the prime example of this) or stores events individually for later processing (the ELK stack is an example of that). In such a system, pulling would be problematic indeed: the instrumented service would have to buffer events between pulls, and the pulls would have to happen incredibly frequently in order to simulate the same “liveness” of the push-based approach and not overwhelm event buffers.

However, again, Prometheus is not an event-based monitoring system. You do not send raw events to Prometheus, nor can it store them. Prometheus is in the business of collecting aggregated time series data. That means that it's only interested in regularly collecting the current state of a given set of metrics, not the underlying events that led to the generation of those metrics. For example, an instrumented service would not send a message about each HTTP request to Prometheus as it is handled, but would simply count up those requests in memory. This can happen hundreds of thousands of times per second without causing any monitoring traffic. Prometheus then simply asks the service instance every 15 or 30 seconds (or whatever you configure) about the current counter value and stores that value together with the scrape timestamp as a sample. Other metric types, such as gauges, histograms, and summaries, are handled similarly. The resulting monitoring traffic is low, and the pull-based approach also does not create problems in this case.
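
This scrape-the-counter model is also why, on the query side, you work with rates derived from periodically collected counter samples rather than with individual events; a typical expression (the metric name is just an example) looks like:

# per-second request rate, averaged over the last 5 minutes of scraped samples
rate(http_requests_total[5m])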

But now my monitoring needs to know about my service instances!

With a pull-based approach, your monitoring system needs to know which service instances exist and how to connect to them. Some people are worried about the extra configuration this requires on the part of the monitoring system and see this as an operational scalability problem.

We would argue that you cannot escape this configuration effort for serious monitoring setups in any case: if your monitoring system doesn't know what the world should look like and which monitored service instances should be there, how would it be able to tell when an instance just never reports in, is down due to an outage, or really is no longer meant to exist? This is only acceptable if you never care about the health of individual instances at all, like when you only run ephemeral workers where it is sufficient for a large-enough number of them to report in some result. Most environments are not exclusively like that.

If the monitoring system needs to know the desired state of the world anyway, then a push-based approach actually requires more configuration in total. Not only does your monitoring system need to know what service instances should exist, but your service instances now also need to know how to reach your monitoring system. A pull approach not only requires less configuration, it also makes your monitoring setup more flexible. With pull, you can just run a copy of production monitoring on your laptop to experiment with it. It also allows you to just fetch metrics with some other tool or inspect metrics endpoints manually. To get high availability, pull allows you to just run two identically configured Prometheus servers in parallel. And lastly, if you have to move the endpoint under which your monitoring is reachable, a pull approach does not require you to reconfigure all of your metrics sources.

On a practical front, Prometheus makes it easy to configure the desired state of the world with its built-in support for a wide variety of service discovery mechanisms for cloud providers and container-scheduling systems: Consul, Marathon, Kubernetes, EC2, DNS-based SD, Azure, Zookeeper Serversets, and more. Prometheus also allows you to plug in your own custom mechanism if needed. In a microservice world or any multi-tiered architecture, it is also fundamentally an advantage if your monitoring system uses the same method to discover targets to monitor as your service instances use to discover their backends. This way you can be sure that you are monitoring the same targets that are serving production traffic and you have only one discovery mechanism to maintain.

Accidentally DDoS-ing your monitoring

Whether you pull or push, any time-series database will fall over if you send it more samples than it can handle. However, in our experience it's slightly more likely for a push-based approach to accidentally bring down your monitoring. If the control over what metrics get ingested from which instances is not centralized (in your monitoring system), then you run the danger of experimental or rogue jobs suddenly pushing lots of garbage data into your production monitoring and bringing it down. There are still plenty of ways this can happen with a pull-based approach (which only controls where to pull metrics from, but not the size and nature of the metrics payloads), but the risk is lower. More importantly, such incidents can be mitigated at a central point.

Real-world proof

Besides the fact that Prometheus is already being used to monitor very large setups in the real world (like using it to monitor millions of machines at DigitalOcean), there are other prominent examples of pull-based monitoring being used successfully in the largest possible environments. Prometheus was inspired by Google's Borgmon, which was (and partially still is) used within Google to monitor all its critical production services using a pull-based approach. Any scaling issues we encountered with Borgmon at Google were not due to its pull approach either. If a pull-based approach scales to a global environment with many tens of datacenters and millions of machines, you can hardly say that pull doesn't scale.

But there are other problems with pull!

There are indeed setups that are hard to monitor with a pull-based approach. A prominent example is when you have many endpoints scattered around the world which are not directly reachable due to firewalls or complicated networking setups, and where it's infeasible to run a Prometheus server directly in each of the network segments. This is not quite the environment for which Prometheus was built, although workarounds are often possible (via the Pushgateway or restructuring your setup). In any case, these remaining concerns about pull-based monitoring are usually not scaling-related, but due to network operation difficulties around opening TCP connections.

All good then?

This article addresses the most common scalability concerns around a pull-based monitoring approach. With Prometheus and other pull-based systems being used successfully in very large environments and the pull aspect not posing a bottleneck in reality, the result should be clear: the “pull doesn't scale” argument is not a real concern. We hope that future debates will focus on aspects that matter more than this red herring.

Prometheus reaches 1.0

In January, we published a blog post on Prometheus’s first year of public existence, summarizing what has been an amazing journey for us, and hopefully an innovative and useful monitoring solution for you. Since then, Prometheus has also joined the Cloud Native Computing Foundation, where we are in good company, as the second charter project after Kubernetes.

Our recent work has focused on delivering a stable API and user interface, marked by version 1.0 of Prometheus. We’re thrilled to announce that we’ve reached this goal, and Prometheus 1.0 is available today.

What does 1.0 mean for you?

If you have been using Prometheus for a while, you may have noticed that the rate and impact of breaking changes significantly decreased over the past year. In the same spirit, reaching 1.0 means that subsequent 1.x releases will remain API stable. Upgrades won’t break programs built atop the Prometheus API, and updates won’t require storage re-initialization or deployment changes. Custom dashboards and alerts will remain intact across 1.x version updates as well. We’re confident Prometheus 1.0 is a solid monitoring solution. Now that the Prometheus server has reached a stable API state, other modules will follow it to their own stable version 1.0 releases over time.

Fine print

So what does API stability mean? Prometheus has a large surface area and some parts are certainly more mature than others. There are two simple categories, stable and unstable:

Stable as of v1.0 and throughout the 1.x series:

  • The query language and data model
  • Alerting and recording rules
  • The ingestion exposition formats
  • Configuration flag names
  • HTTP API (used by dashboards and UIs)
  • Configuration file format (minus the non-stable service discovery integrations, see below)
  • Alerting integration with Alertmanager 0.1+ for the foreseeable future
  • Console template syntax and semantics

Unstable and may change within 1.x:

  • The remote storage integrations (InfluxDB, OpenTSDB, Graphite) are still experimental and will at some point be removed in favor of a generic, more sophisticated API that allows storing samples in arbitrary storage systems.
  • Several service discovery integrations are new and need to keep up with fast evolving systems. Hence, integrations with Kubernetes, Marathon, Azure, and EC2 remain in beta status and are subject to change. However, changes will be clearly announced.
  • Exact flag meanings may change as necessary. However, changes will never cause the server to not start with previous flag configurations.
  • Go APIs of packages that are part of the server.
  • HTML generated by the web UI.
  • The metrics in the /metrics endpoint of Prometheus itself.
  • Exact on-disk format. Potential changes, however, will be forward compatible and transparently handled by Prometheus.

So Prometheus is complete now?

Absolutely not. We have a long roadmap ahead of us, full of great features to implement. Prometheus will not stay in 1.x for years to come. The infrastructure space is evolving rapidly and we fully intend for Prometheus to evolve with it. This means that we will remain willing to question what we did in the past and are open to leave behind things that have lost relevance. There will be new major versions of Prometheus to facilitate future plans like persistent long-term storage, newer iterations of Alertmanager, internal storage improvements, and many things we don’t even know about yet.

Closing thoughts

We want to thank our fantastic community for field testing new versions, filing bug reports, contributing code, helping out other community members, and shaping Prometheus by participating in countless productive discussions. In the end, you are the ones who make Prometheus successful.

Thank you, and keep up the great work!

Prometheus to Join the Cloud Native Computing Foundation

Since the inception of Prometheus, we have been looking for a sustainable governance model for the project that is independent of any single company. Recently, we have been in discussions with the newly formed Cloud Native Computing Foundation (CNCF), which is backed by Google, CoreOS, Docker, Weaveworks, Mesosphere, and other leading infrastructure companies.

Today, we are excited to announce that the CNCF's Technical Oversight Committee voted unanimously to accept Prometheus as a second hosted project after Kubernetes! You can find more information about these plans in the official press release by the CNCF.

By joining the CNCF, we hope to establish a clear and sustainable project governance model, as well as benefit from the resources, infrastructure, and advice that the independent foundation provides to its members.

We think that the CNCF and Prometheus are an ideal thematic match, as both focus on bringing about a modern vision of the cloud.

In the following months, we will be working with the CNCF on finalizing the project governance structure. We will report back when there are more details to announce.

When (not) to use varbit chunks

The embedded time series database (TSDB) of the Prometheus server organizes the raw sample data of each time series in chunks of a constant size of 1024 bytes. In addition to the raw sample data, a chunk contains some meta-data, which allows the selection of a different encoding for each chunk. The most fundamental distinction is the encoding version. You select the version for newly created chunks via the command line flag -storage.local.chunk-encoding-version. Up to now, there were only two supported versions: 0 for the original delta encoding, and 1 for the improved double-delta encoding. With release 0.18.0, we added version 2, which is another variety of double-delta encoding. We call it varbit encoding because it involves a variable bit-width per sample within the chunk.

While version 1 is superior to version 0 in almost every aspect, there is a real trade-off between versions 1 and 2. This blog post will help you make that decision. Version 1 remains the default encoding, so if you want to try out version 2 after reading this article, you have to select it explicitly via the command line flag. There is no harm in switching back and forth, but note that existing chunks will not change their encoding version once they have been created. However, these chunks will gradually be phased out according to the configured retention time and will thus be replaced by chunks with the encoding specified in the command-line flag.
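
For concreteness, opting into the new encoding is just a matter of starting the server with the flag mentioned above (a sketch; all other flags omitted):

prometheus -storage.local.chunk-encoding-version=2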

Interview with ShowMax

This is the second in a series of interviews with users of Prometheus, allowing them to share their experiences of evaluating and using Prometheus.

Can you tell us about yourself and what ShowMax does?

I’m Antonin Kral, and I’m leading research and architecture for ShowMax. Before that, I’ve held architectural and CTO roles for the past 12 years.

ShowMax is a subscription video on demand service that launched in South Africa in 2015. We’ve got an extensive content catalogue with more than 20,000 episodes of TV shows and movies. Our service is currently available in 65 countries worldwide. While better known rivals are skirmishing in America and Europe, ShowMax is battling a more difficult problem: how do you binge-watch in a barely connected village in sub-Saharan Africa? Already 35% of video around the world is streamed, but there are still so many places the revolution has left untouched.

ShowMax logo

We are managing about 50 services running mostly on private clusters built around CoreOS. They are primarily handling API requests from our clients (Android, iOS, AppleTV, JavaScript, Samsung TV, LG TV etc), while some of them are used internally. One of the biggest internal pipelines is video encoding which can occupy 400+ physical servers when handling large ingestion batches.

The majority of our back-end services are written in Ruby, Go or Python. We use EventMachine when writing apps in Ruby (Goliath on MRI, Puma on JRuby). Go is typically used in apps that require large throughput and don’t have so much business logic. We’re very happy with Falcon for services written in Python. Data is stored in PostgreSQL and ElasticSearch clusters. We use etcd and custom tooling for configuring Varnishes for routing requests.

Interview with Life360

This is the first in a series of interviews with users of Prometheus, allowing them to share their experiences of evaluating and using Prometheus. Our first interview is with Daniel from Life360.

Can you tell us about yourself and what Life360 does?

I’m Daniel Ben Yosef, a.k.a, dby, and I’m an Infrastructure Engineer for Life360, and before that, I’ve held systems engineering roles for the past 9 years.

Life360 creates technology that helps families stay connected, we’re the Family Network app for families. We’re quite busy handling these families - at peak we serve 700k requests per minute for 70 million registered families.

We manage around 20 services in production, mostly handling location requests from mobile clients (Android, iOS, and Windows Phone), spanning over 150+ instances at peak. Redundancy and high-availability are our goals and we strive to maintain 100% uptime whenever possible because families trust us to be available.

We hold user data in both our MySQL multi-master cluster and in our 12-node Cassandra ring which holds around 4TB of data at any given time. We have services written in Go, Python, PHP, as well as plans to introduce Java to our stack. We use Consul for service discovery, and of course our Prometheus setup is integrated with it.

Custom Alertmanager Templates

The Alertmanager handles alerts sent by Prometheus servers and sends notifications about them to different receivers based on their labels.

A receiver can be one of many different integrations such as PagerDuty, Slack, email, or a custom integration via the generic webhook interface (for example JIRA).

Templates

The messages sent to receivers are constructed via templates. Alertmanager comes with default templates but also allows defining custom ones.

In this blog post, we will walk through a simple customization of Slack notifications.

We use this simple Alertmanager configuration that sends all alerts to Slack:

global:
  slack_api_url: '<slack_webhook_url>'

route:
  receiver: 'slack-notifications'
  # All alerts in a notification have the same value for these labels.
  group_by: [alertname, datacenter, app]

receivers:
- name: 'slack-notifications'
  slack_configs:
  - channel: '#alerts'

By default, a Slack message sent by Alertmanager looks like this:

It shows us that there is one firing alert, followed by the label values of the alert grouping (alertname, datacenter, app) and further label values the alerts have in common (critical).
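
A minimal sketch of the direction such a customization takes (the template name, the rendered content, and the file path are placeholders, not the exact example from this walkthrough): the slack_configs entry references a custom template by name, and a templates section tells Alertmanager where to load template files from.

receivers:
- name: 'slack-notifications'
  slack_configs:
  - channel: '#alerts'
    # render the notification text with a custom template instead of the default
    text: '{{ template "slack.myorg.text" . }}'

templates:
- '/etc/alertmanager/templates/myorg.tmpl'

The referenced template itself would then be defined in that .tmpl file, for example iterating over the alerts in the notification and printing one of their annotations.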

One Year of Open Prometheus Development

The beginning

A year ago today, we officially announced Prometheus to the wider world. This is a great opportunity for us to look back and share some of the wonderful things that have happened to the project since then. But first, let's start at the beginning.

Although we had already started Prometheus as an open-source project on GitHub in 2012, we didn't make noise about it at first. We wanted to give the project time to mature and be able to experiment without friction. Prometheus was gradually introduced for production monitoring at SoundCloud in 2013 and then saw more and more usage within the company, as well as some early adoption by our friends at Docker and Boexever in 2014. Over the years, Prometheus was growing more and more mature and although it was already solving people's monitoring problems, it was still unknown to the wider public.

Custom service discovery with etcd

In a previous post we introduced numerous new ways of doing service discovery in Prometheus. Since then a lot has happened. We improved the internal implementation and received fantastic contributions from our community, adding support for service discovery with Kubernetes and Marathon. They will become available with the release of version 0.16.

We also touched on the topic of custom service discovery.

Not every type of service discovery is generic enough to be directly included in Prometheus. Chances are your organisation has a proprietary system in place and you just have to make it work with Prometheus. This does not mean that you cannot enjoy the benefits of automatically discovering new monitoring targets.

In this post we will implement a small utility program that connects a custom service discovery approach based on etcd, the highly consistent distributed key-value store, to Prometheus.
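
One natural shape for such a bridge (a hedged sketch rather than the post's exact code; the addresses and labels below are placeholders) is a small program that watches etcd and keeps rewriting a target file that Prometheus reads via file_sd_configs. The file contains a list of target groups, for example:

[
  {
    "targets": ["10.240.10.4:8080", "10.240.10.5:8080"],
    "labels": {
      "job": "billing"
    }
  },
  {
    "targets": ["10.240.10.6:9100"],
    "labels": {
      "job": "node"
    }
  }
]

Prometheus is then pointed at this file with a file_sd_configs entry in its scrape configuration and picks up changes to it automatically.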

Monitoring DreamHack - the World's Largest Digital Festival

Editor's note: This article is a guest post written by a Prometheus user.

If you are operating the network for tens of thousands of demanding gamers, you need to really know what is going on inside your network. Oh, and everything needs to be built from scratch in just five days.

If you have never heard about DreamHack before, here is the pitch: Bring 20,000 people together and have the majority of them bring their own computer. Mix in professional gaming (eSports), programming contests, and live music concerts. The result is the world's largest festival dedicated solely to everything digital.

To make such an event possible, there needs to be a lot of infrastructure in place. Ordinary infrastructures of this size take months to build, but the crew at DreamHack builds everything from scratch in just five days. This of course includes stuff like configuring network switches, but also building the electricity distribution, setting up stores for food and drinks, and even building the actual tables.

The team that builds and operates everything related to the network is officially called the Network team, but we usually refer to ourselves as tech or dhtech. This post is going to focus on the work of dhtech and how we used Prometheus during DreamHack Summer 2015 to try to kick our monitoring up another notch.

Practical Anomaly Detection

In his Open Letter To Monitoring/Metrics/Alerting Companies, John Allspaw asserts that attempting "to detect anomalies perfectly, at the right time, is not possible".

I have seen several attempts by talented engineers to build systems to automatically detect and diagnose problems based on time series data. While it is certainly possible to get a demonstration working, the data always turned out to be too noisy to make this approach work for anything but the simplest of real-world systems.

All hope is not lost though. There are many common anomalies which you can detect and handle with custom-built rules. The Prometheus query language gives you the tools to discover these anomalies while avoiding false positives.

Advanced Service Discovery in Prometheus 0.14.0

This week we released Prometheus v0.14.0 — a version with many long-awaited additions and improvements.

On the user side, Prometheus now supports new service discovery mechanisms. In addition to DNS-SRV records, it now supports Consul out of the box, and a file-based interface allows you to connect your own discovery mechanisms. Over time, we plan to add other common service discovery mechanisms to Prometheus.

Aside from many smaller fixes and improvements, you can now also reload your configuration during runtime by sending a SIGHUP to the Prometheus process. For a full list of changes, check the changelog for this release.

In this blog post, we will take a closer look at the built-in service discovery mechanisms and provide some practical examples. As an additional resource, see Prometheus's configuration documentation.
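
As a hedged taste of what these mechanisms look like in the configuration file (the job names, the SRV record, the Consul address, and the service name are placeholders, not the examples from this post):

scrape_configs:
  - job_name: 'api-servers'
    dns_sd_configs:
      - names:
          - 'telemetry.eu-west.api.example.org'   # DNS SRV record to resolve

  - job_name: 'frontends'
    consul_sd_configs:
      - server: 'localhost:8500'
        services: ['frontend']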

Prometheus Monitoring Spreads through the Internet

It has been almost three months since we publicly announced Prometheus version 0.10.0, and we're now at version 0.13.1.

SoundCloud's announcement blog post remains the best overview of the key components of Prometheus, but there has been a lot of other online activity around Prometheus. This post will let you catch up on anything you missed.

In the future, we will use this blog to publish more articles and announcements to help you get the most out of Prometheus.