Frequently Asked Questions

General

What is Prometheus?

Prometheus is an open-source systems monitoring and alerting toolkit with an active ecosystem. See the overview.

How does Prometheus compare against other monitoring systems?

See the comparison page.

What dependencies does Prometheus have?

The main Prometheus server runs standalone and has no external dependencies.

Can Prometheus be made highly available?

Yes, run identical Prometheus servers on two or more separate machines. Identical alerts will be deduplicated by the Alertmanager.

The Alertmanager cannot currently be made highly available, but this is a goal.

I was told Prometheus “doesn't scale”.

There are in fact various ways to scale and federate Prometheus. Read Scaling and Federating Prometheus on the Robust Perception blog to get started.

What language is Prometheus written in?

Most Prometheus components are written in Go. Some are also written in Java, Python, and Ruby.

How stable are Prometheus features, storage formats, and APIs?

Although Prometheus and many of its ecosystem components are already quite stable, we will still allow for occasional breaking changes until the Prometheus server reaches version 1.0.0. These breaking changes will be pointed out in release announcements for components that already have a proper release process (like the Prometheus server) or communicated clearly otherwise. After releasing version 1.0.0, breaking changes will be indicated by increments of the major version. See also the documentation for semantic versioning, which we are following.

Why do you pull rather than push?

Pulling over HTTP offers a number of advantages:

  • You can run your monitoring on your laptop when developing changes.
  • You can more easily tell if a target is down.
  • You can manually go to a target and inspect its health with a web browser.

Overall, we believe that pulling is slightly better than pushing, but it should not be a major deciding factor when choosing a monitoring system.

The Push vs Pull for Monitoring blog post by Brian Brazil goes into more detail.

For cases where you must push, we offer the Pushgateway.

How can I feed logs into Prometheus?

Short answer: Don't! Use something like the ELK stack instead.

Longer answer: Prometheus is a system to collect and process metrics, not an event logging system. The Raintank blog post Logs and Metrics and Graphs, Oh My! provides more details about the differences between logs and metrics.

If you want to extract Prometheus metrics from application logs, Google's mtail might be helpful.

Who wrote Prometheus?

Prometheus was initially started privately by Matt T. Proud and Julius Volz. The majority of its development has been sponsored by SoundCloud.

Other companies making active contributions include Boxever and Docker. A full list can be found in the AUTHORS file in each repository.

What license is Prometheus released under?

Prometheus is released under the Apache 2.0 license.

What is the plural of Prometheus?

After extensive research it has been determined that the correct plural of 'Prometheus' is 'Prometheis'.

Can I reload Prometheus's configuration?

Yes, sending SIGHUP to the Prometheus process will reload and apply the configuration file. The different components attempt to handle failing changes gracefully.
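
As an illustration of the general pattern only (not Prometheus's actual implementation), a minimal Go sketch of reloading configuration on SIGHUP could look like this; reloadConfig and prometheus.yml are hypothetical placeholders:

    package main

    import (
        "log"
        "os"
        "os/signal"
        "syscall"
    )

    // reloadConfig is a hypothetical stand-in for re-reading and applying a
    // configuration file; if it fails, the previous configuration stays active.
    func reloadConfig(path string) error {
        _, err := os.ReadFile(path)
        return err
    }

    func main() {
        hup := make(chan os.Signal, 1)
        signal.Notify(hup, syscall.SIGHUP)
        for range hup {
            if err := reloadConfig("prometheus.yml"); err != nil {
                log.Printf("reload failed, keeping previous configuration: %v", err)
            } else {
                log.Print("configuration reloaded")
            }
        }
    }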

Can I send alerts?

Yes, with the experimental Alertmanager.

A number of external notification systems are currently supported; see the Alertmanager documentation for the full list.

Can I create dashboards?

Yes, we recommend Grafana for production usage. PromDash and Console templates also exist.

Can I change the timezone? Why is everything in UTC?

To avoid any kind of timezone confusion, especially when the so-called daylight saving time is involved, we decided to exclusively use Unix time internally and UTC for display purposes in all components of Prometheus. A carefully implemented timezone selection could be introduced into the UI. Contributions are welcome. See issue #500 for the current state of this effort.

Instrumentation

Which languages have instrumentation libraries?

There are a number of client libraries for instrumenting your services with Prometheus metrics. See the client libraries documentation for details.

If you are interested in contributing a client library for a new language, see the exposition formats.
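
For example, a minimal sketch using the Go client library (the metric name, port, and handler are arbitrary examples) registers a counter and exposes it on /metrics for Prometheus to scrape:

    package main

    import (
        "net/http"

        "github.com/prometheus/client_golang/prometheus"
        "github.com/prometheus/client_golang/prometheus/promhttp"
    )

    // requestsTotal counts handled HTTP requests; the metric name is an
    // arbitrary example.
    var requestsTotal = prometheus.NewCounter(prometheus.CounterOpts{
        Name: "myapp_http_requests_total",
        Help: "Total number of HTTP requests handled.",
    })

    func main() {
        prometheus.MustRegister(requestsTotal)

        http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
            requestsTotal.Inc()
            w.Write([]byte("hello"))
        })

        // Expose the registered metrics for Prometheus to scrape.
        http.Handle("/metrics", promhttp.Handler())
        http.ListenAndServe(":8080", nil)
    }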

Can I monitor machines?

Yes, the Node Exporter exposes an extensive set of machine-level metrics for Linux and other Unix systems, including CPU usage, memory, disk utilization, filesystem fullness, and network bandwidth.

Can I monitor network devices?

Yes, the SNMP Exporter allows monitoring of devices that support SNMP.

Can I monitor batch jobs?

Yes, using the Pushgateway. See also the best practices for monitoring batch jobs.
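
As a sketch, assuming a Pushgateway reachable at pushgateway:9091 and the push package of the Go client library (exact APIs may differ between client versions), a batch job could push the time of its last successful run like this:

    package main

    import (
        "log"

        "github.com/prometheus/client_golang/prometheus"
        "github.com/prometheus/client_golang/prometheus/push"
    )

    func main() {
        // Record when this batch job last completed successfully.
        completionTime := prometheus.NewGauge(prometheus.GaugeOpts{
            Name: "db_backup_last_completion_timestamp_seconds",
            Help: "Unix timestamp of the last successful DB backup.",
        })
        completionTime.SetToCurrentTime()

        // Push the metric to the Pushgateway under the job name "db_backup".
        if err := push.New("http://pushgateway:9091", "db_backup").
            Collector(completionTime).
            Push(); err != nil {
            log.Fatalf("could not push to Pushgateway: %v", err)
        }
    }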

What applications can Prometheus monitor out of the box?

See exporters for third-party systems.

Can I monitor JVM applications via JMX?

Yes, for applications that you cannot instrument directly with the Java client you can use the JMX Exporter either standalone or as a Java Agent.

What is the performance impact of instrumentation?

Performance across client libraries and languages may vary. For Java, benchmarks indicate that incrementing a counter/gauge with the Java client will take 12-17ns, depending on contention. This is negligible for all but the most latency-critical code.
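
Those figures refer to the Java client; as a rough sketch of how you could measure the equivalent cost for the Go client library yourself, a standard Go benchmark might look like this:

    package instrumentation

    import (
        "testing"

        "github.com/prometheus/client_golang/prometheus"
    )

    // BenchmarkCounterInc measures the cost of a single, uncontended
    // counter increment; run it with `go test -bench=.`.
    func BenchmarkCounterInc(b *testing.B) {
        c := prometheus.NewCounter(prometheus.CounterOpts{
            Name: "benchmark_ops_total",
            Help: "Counter used only for this benchmark.",
        })
        b.ResetTimer()
        for i := 0; i < b.N; i++ {
            c.Inc()
        }
    }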

Troubleshooting

My server takes a long time to start up and spams the log with copious information about crash recovery.

You are suffering from an unclean shutdown. Prometheus has to shut down cleanly after a SIGTERM, which might take a while for heavily used servers. If the server crashes or is killed hard (e.g. an OOM kill by the kernel or your runlevel system getting impatient while waiting for Prometheus to shut down), a crash recovery has to be performed, which should take less than a minute under normal circumstances. See crash recovery for details.

I am using ZFS on Linux, and the unit test TestPersistLoadDropChunks fails. If I run Prometheus despite the failing test, the weirdest things happen.

You have run into a bug in ZFS on Linux. See issue #484 for details. Upgrading to ZFS on Linux v0.6.4 should fix the issue.

Implementation

Why are all sample values 64-bit floats? I want integers.

We restricted ourselves to 64-bit floats to simplify the design. The IEEE 754 double-precision binary floating-point format supports integer precision for values up to 2^53. Supporting native 64-bit integers would (only) help if you need integer precision above 2^53 but below 2^63. In principle, support for different sample value types (including some kind of big integer, supporting even more than 64 bits) could be implemented, but it is not a priority right now. Note that a counter, even if incremented one million times per second, will only run into precision issues after over 285 years.
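
To illustrate the 2^53 boundary, here is a small, self-contained Go example:

    package main

    import "fmt"

    func main() {
        var x float64 = 1 << 53 // 9007199254740992, the largest power of two up to
                                // which every integer is exactly representable
        fmt.Println(x-1 == x)   // false: integers below 2^53 are still distinct
        fmt.Println(x+1 == x)   // true: 2^53 + 1 rounds back down to 2^53
    }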

Why does Prometheus use a custom storage backend rather than [some other storage method]? Isn't the "one file per time series" approach killing performance?

Initially, Prometheus ran completely on LevelDB, but to achieve better performance, we had to change the storage for bulk sample data. We evaluated many storage backends that were available at the time, without getting satisfactory results. So we implemented exactly the parts we needed, while keeping LevelDB for indexes and making heavy use of file system capabilities. Obviously, we could not evaluate every single storage backend out there, and storage backends have evolved meanwhile. However, the performance of the solution implemented now is satisfactory for most use cases.

Our most important requirements are an acceptable query speed for common queries and a sustainable ingestion rate of many thousands of samples per second. The latter depends on the compressibility of the sample data and on the number of time series the samples belong to, but to give you an idea, here are some results from benchmarks:

  • On an older 8-core machine with Intel Core i7 CPUs, 8GiB RAM, and two spinning disks (Samsung HD753LJ) in a RAID-1 setup, Prometheus sustained an ingestion rate of 34k samples per second, belonging to 170k time series, scraped from 600 targets.

  • On a modern server with 64GiB RAM, 32 CPU cores, and SSD, Prometheus sustained an ingestion rate of 525k samples per second, belonging to 1.4M time series, scraped from 1650 targets.

In both cases, there were no obvious bottlenecks. Various stages of the processing pipelines reached their limits more or less at the same ingestion rate.

Running out of inodes is highly unlikely in a usual setup. There is one possible downside: if you want to delete Prometheus's storage directory, you will notice that some file systems are very slow at deleting that many files.

Why don't the Prometheus server components support TLS or authentication? Can I add those?

While TLS and authentication are frequently requested features, we have intentionally not implemented them in any of Prometheus's server-side components. There are so many different options and parameters for both (10+ options for TLS alone) that we have decided to focus on building the best monitoring system possible rather than supporting fully generic TLS and authentication solutions in every server component.

If you need TLS or authentication, we recommend putting a reverse proxy in front of Prometheus. See for example Adding Basic Auth to Prometheus with Nginx.

Note that this applies only to inbound connections. Prometheus does support scraping TLS- and auth-enabled targets, and other Prometheus components that create outbound connections have similar support.