High Availability
Alertmanager supports configuration to create a cluster for high availability. This document describes how the HA mechanism works, its design goals, and operational considerations.
Design Goals
The Alertmanager HA implementation is designed around three core principles:
- Single pane view and management - Silences and alerts can be viewed and managed from any cluster member, providing a unified operational experience
- Survive cluster split-brain with "fail open" - During network partitions, Alertmanager prefers to send duplicate notifications rather than miss critical alerts
- At-least-once delivery - The system guarantees that notifications are delivered at least once, in line with the fail-open philosophy
These goals prioritize operational reliability and alert delivery over strict exactly-once semantics.
Architecture Overview
An Alertmanager cluster consists of multiple Alertmanager instances that communicate using a gossip protocol. Each instance:
- Receives alerts independently from Prometheus servers
- Participates in a peer-to-peer gossip mesh
- Replicates state (silences and notification log) to other cluster members
- Processes and sends notifications independently
┌──────────────┐ ┌──────────────┐ ┌──────────────┐
│ Prometheus 1 │ │ Prometheus 2 │ │ Prometheus N │
└──────┬───────┘ └──────┬───────┘ └──────┬───────┘
│ │ │
│ alerts │ alerts │ alerts
│ │ │
▼ ▼ ▼
┌────────────────────────────────────────────┐
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ │
│ │ AM-1 │ │ AM-2 │ │ AM-3 │ │
│ │ (pos: 0) ├──┤ (pos: 1) ├──┤ (pos: 2) │ │
│ └──────────┘ └──────────┘ └──────────┘ │
│ Gossip Protocol (Memberlist) │
└────────────────────────────────────────────┘
│ │ │
▼ ▼ ▼
Receivers Receivers Receivers
Gossip Protocol
Alertmanager uses HashiCorp's memberlist library to implement gossip-based communication. The gossip protocol handles:
Membership Management
- Automatic peer discovery - Instances can be configured with a list of known peers and will automatically discover other cluster members
- Health checking - Regular probes detect failed members (default: every 1 second)
- Failure detection - Failed members are marked and can attempt to rejoin
State Replication
The gossip layer replicates three types of state:
- Silences - Create, update, and delete operations are broadcast to all peers
- Notification log - Records of which notifications were sent to prevent duplicates
- Membership changes - Join, leave, and failure events
State is eventually consistent - all cluster members will converge to the same state given sufficient time and network connectivity.
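For orientation, the snippet below shows the underlying memberlist API directly: creating a member, joining known peers, and listing the resulting membership. It is an illustration of the library, not Alertmanager's own cluster wrapper, and the peer names and addresses are placeholders.
package main

import (
    "fmt"
    "log"

    "github.com/hashicorp/memberlist"
)

func main() {
    // Bind the gossip listener; Alertmanager's equivalent is --cluster.listen-address.
    cfg := memberlist.DefaultLANConfig()
    cfg.Name = "am-1"
    cfg.BindPort = 9094

    ml, err := memberlist.Create(cfg)
    if err != nil {
        log.Fatalf("create memberlist: %v", err)
    }

    // Join known peers; the rest of the cluster is discovered via gossip.
    if _, err := ml.Join([]string{"am-2.example.com:9094", "am-3.example.com:9094"}); err != nil {
        log.Printf("initial join failed: %v", err)
    }

    // Membership converges as gossip spreads join/leave/failure events.
    for _, m := range ml.Members() {
        fmt.Println("member:", m.Name, m.Addr)
    }
}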
Gossip Settling
When an Alertmanager starts or rejoins the cluster, it waits for gossip to "settle" before processing notifications. This prevents sending notifications based on incomplete state.
The settling algorithm waits until:
- The number of peers remains stable for 3 consecutive checks (default interval: push-pull interval)
- Or a timeout occurs (configurable via context)
During this time, the instance still receives and stores alerts but defers notification processing.
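A condensed sketch of that settling loop, assuming the standard context and time packages; the real implementation in cluster/cluster.go also logs progress and signals readiness to the notification pipeline. peerCount here is a stand-in for querying memberlist.
// Sketch of gossip settling: proceed once the observed peer count has been
// stable for three consecutive checks, or once the context times out.
func settle(ctx context.Context, peerCount func() int, interval time.Duration) {
    const stableChecksRequired = 3
    stable, last := 0, -1
    for stable < stableChecksRequired {
        select {
        case <-ctx.Done():
            return // settle timeout reached; proceed with whatever state we have
        case <-time.After(interval):
        }
        if n := peerCount(); n == last {
            stable++
        } else {
            stable, last = 0, n
        }
    }
    // Gossip has settled; notification processing may begin.
}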
Notification Pipeline in HA Mode
The notification pipeline operates differently in a clustered environment to ensure deduplication while maintaining at-least-once delivery:
┌────────────────────────────────────────────────┐
│ DISPATCHER STAGE │
├────────────────────────────────────────────────┤
│ 1. Find matching route(s) │
│ 2. Find/create aggregation group within route │
│ 3. Throttle by group wait or group interval │
└───────────────────┬────────────────────────────┘
│
▼
┌────────────────────────────────────────────────┐
│ NOTIFIER STAGE │
├────────────────────────────────────────────────┤
│ 1. Wait for HA gossip to settle │◄─── Ensures complete state
│ 2. Filter inhibited alerts │
│ 3. Filter non-time-active alerts │
│ 4. Filter time-muted alerts │
│ 5. Filter silenced alerts │◄─── Uses replicated silences
│ 6. Wait according to HA cluster peer index │◄─── Staggered notifications
│ 7. Dedupe by repeat interval/HA state │◄─── Uses notification log
│ 8. Notify & retry intermittent failures │
│ 9. Update notification log │◄─── Replicated to peers
└────────────────────────────────────────────────┘
HA-Specific Stages
1. Gossip Settling Wait
Before the first notification from a group, the instance waits for gossip to settle. This ensures:
- Silences are fully replicated
- The notification log contains recent send records from other instances
- The cluster membership is stable
Implementation: peer.WaitReady(ctx)
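Conceptually, readiness is just a blocking select between a "gossip settled" channel and the caller's context; a simplified sketch:
// Sketch: block until the settling loop closes readyc, or the context is cancelled.
func (p *Peer) WaitReady(ctx context.Context) error {
    select {
    case <-ctx.Done():
        return ctx.Err()
    case <-p.readyc:
        return nil
    }
}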
2. Peer Position-Based Wait
To prevent all cluster members from sending notifications simultaneously, each instance waits based on its position in the sorted peer list:
wait_time = peer_position × peer_timeout
For example, with 3 instances and a 15-second peer timeout:
- Instance `am-1` (position 0): waits 0 seconds
- Instance `am-2` (position 1): waits 15 seconds
- Instance `am-3` (position 2): waits 30 seconds
This staggered timing allows:
- The first instance to send the notification
- Subsequent instances to see the notification log entry
- Deduplication to prevent duplicate sends
Implementation: clusterWait() in cmd/alertmanager/main.go:594
Position is determined by sorting all peer names alphabetically:
func (p *Peer) Position() int {
    all := p.mlist.Members()
    sort.Slice(all, func(i, j int) bool {
        return all[i].Name < all[j].Name
    })
    // Find the position of this instance in the sorted list.
    for i, n := range all {
        if n.Name == p.mlist.LocalNode().Name {
            return i
        }
    }
    return 0
}
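The wait function built on top of Position() is then just the formula from above; a sketch of the wiring:
// Returns a function that computes this instance's stagger delay:
// peer position × peer timeout (e.g. position 2 × 15s = 30s).
func clusterWait(p *cluster.Peer, timeout time.Duration) func() time.Duration {
    return func() time.Duration {
        return time.Duration(p.Position()) * timeout
    }
}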
3. Deduplication via Notification Log
The DedupStage queries the notification log to determine if a notification should be sent:
// Check the notification log for the most recent send for this receiver/group
// (pseudocode; the real stage queries by group key and receiver and handles errors).
entry := nflog.Query(receiver, groupKey)
if entry != nil && !shouldNotify(entry, alerts, repeatInterval) {
    // Skip: an up-to-date notification was already sent, possibly by another peer.
    return nil
}
Deduplication checks:
- Firing alerts changed? If yes, notify
- Resolved alerts changed? If yes and `send_resolved: true`, notify
- Repeat interval elapsed? If yes, notify
- Otherwise: Skip notification (deduplicated)
The notification log is replicated via gossip, so all cluster members share the same send history.
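A simplified, self-contained version of this decision is sketched below. The types and helper names are illustrative; the real DedupStage operates on notification-log protobuf entries and uses subset comparisons rather than plain set equality.
package dedup

import "time"

// logEntry stands in for a notification-log entry: what was last sent, and when.
type logEntry struct {
    FiringAlerts   []uint64 // alert fingerprints included in the last notification
    ResolvedAlerts []uint64
    Timestamp      time.Time
}

// needsNotify applies the checks listed above and reports whether to send.
func needsNotify(e *logEntry, firing, resolved []uint64, sendResolved bool, repeat time.Duration, now time.Time) bool {
    if e == nil {
        return len(firing) > 0 // nothing logged yet: notify if anything is firing
    }
    if !sameSet(e.FiringAlerts, firing) {
        return true // firing alerts changed
    }
    if sendResolved && !sameSet(e.ResolvedAlerts, resolved) {
        return true // resolved alerts changed and the receiver wants them
    }
    // Nothing changed: notify again only once the repeat interval has elapsed.
    return len(firing) > 0 && e.Timestamp.Before(now.Add(-repeat))
}

func sameSet(a, b []uint64) bool {
    if len(a) != len(b) {
        return false
    }
    seen := make(map[uint64]bool, len(a))
    for _, x := range a {
        seen[x] = true
    }
    for _, x := range b {
        if !seen[x] {
            return false
        }
    }
    return true
}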
Split-Brain Handling (Fail Open)
During a network partition, the cluster may split into multiple groups that cannot communicate. Alertmanager's "fail open" design ensures alerts are still delivered:
Scenario: Network Partition
Before partition:
┌────────┬────────┬────────┐
│ AM-1 │ AM-2 │ AM-3 │
└────────┴────────┴────────┘
Unified cluster
After partition:
┌────────┐ │ ┌────────┬────────┐
│ AM-1 │ │ │ AM-2 │ AM-3 │
└────────┘ │ └────────┴────────┘
Partition A │ Partition B
Behavior During Partition
In Partition A (AM-1 alone):
- AM-1 sees itself as position 0
- Waits 0 × timeout = 0 seconds
- Sends notifications (no dedup from AM-2/AM-3)
In Partition B (AM-2, AM-3):
- AM-2 is position 0, AM-3 is position 1
- AM-2 waits 0 seconds, sends notification
- AM-3 sees AM-2's notification log entry, deduplicates
Result: Duplicate notifications sent (one from Partition A, one from Partition B)
This is intentional - Alertmanager prefers duplicate notifications over missed alerts.
After Partition Heals
When the network partition heals:
- Gossip protocol detects all peers again
- Notification logs are merged (a CRDT-like, last-write-wins merge on timestamps; see the sketch after this list)
- Future notifications are deduplicated correctly across all instances
- Silences created in either partition are replicated to all peers
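The merge itself is a simple last-write-wins keyed by timestamp. A generic sketch follows; the real code merges protobuf-encoded silence and notification-log state, but the idea is the same (stateEntry is a stand-in type).
// stateEntry stands in for a silence or notification-log record.
type stateEntry struct {
    UpdatedAt time.Time
    Data      []byte
}

// mergeLWW folds incoming gossip state into local state, keeping whichever
// copy of each entry carries the newest timestamp.
func mergeLWW(local, incoming map[string]stateEntry) {
    for key, in := range incoming {
        cur, ok := local[key]
        if !ok || in.UpdatedAt.After(cur.UpdatedAt) {
            local[key] = in
        }
    }
}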
Silence Management in HA
Silences are first-class replicated state in the cluster.
Silence Creation and Updates
When a silence is created or updated on any instance:
- Local storage - Silence is stored in the local state map
- Broadcast - Silence is serialized (protobuf) and broadcast via gossip
- Merge on receive - Other instances receive and merge the silence:
  // Merge logic: last-write-wins based on UpdatedAt timestamp
  if !exists || incoming.UpdatedAt > existing.UpdatedAt {
      accept_update()
  }
- Indexing - The silence matcher cache is updated for fast alert matching
Silence Expiry
Silences have:
- `StartsAt`, `EndsAt` - The active time range
- `ExpiresAt` - When to garbage collect (`EndsAt` + retention period)
- `UpdatedAt` - For conflict resolution during merge
Each instance independently:
- Evaluates silence state (pending/active/expired) based on the current time (see the sketch after this list)
- Garbage collects expired silences past their retention period
- The GC is local only (no gossip) since all instances converge to the same decision
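Because the state is a pure function of the silence's timestamps and the local clock, every instance reaches the same answer without any coordination. A small, self-contained illustration; the Silence struct is a simplified stand-in for the real protobuf type.
package main

import (
    "fmt"
    "time"
)

// Silence is a simplified stand-in carrying only the fields discussed above.
type Silence struct {
    StartsAt, EndsAt, UpdatedAt time.Time
}

// state classifies a silence relative to "now" - the same check every
// instance performs locally, with no gossip required.
func state(s Silence, now time.Time) string {
    switch {
    case now.Before(s.StartsAt):
        return "pending"
    case now.Before(s.EndsAt):
        return "active"
    default:
        return "expired"
    }
}

func main() {
    s := Silence{
        StartsAt: time.Now().Add(-1 * time.Hour),
        EndsAt:   time.Now().Add(1 * time.Hour),
    }
    fmt.Println(state(s, time.Now())) // active
}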
Single Pane of Glass
Users can interact with any Alertmanager instance in the cluster:
- View silences - All instances have the same silence state (eventually consistent)
- Create/update silences - Changes made on any instance propagate to all peers
- Delete silences - Implemented as "expire immediately" + gossip (sketched below)
This provides a unified operational experience regardless of which instance you access.
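For example, deletion reuses the normal update path: the end time is moved to now, UpdatedAt is bumped, and the change is gossiped like any other update. A sketch reusing the simplified Silence struct from the previous example; broadcast is a hypothetical hook for the gossip layer.
// "Delete" as "expire immediately": end the silence now and gossip the update
// so peers converge via the usual last-write-wins merge.
func expireNow(s *Silence, now time.Time, broadcast func(Silence)) {
    s.EndsAt = now
    s.UpdatedAt = now
    broadcast(*s)
}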
Operational Considerations
Configuration
To configure a cluster, each Alertmanager instance needs:
# alertmanager.yml
global:
# ... other config ...
# No cluster config in YAML - use CLI flags
Command-line flags:
alertmanager \
--cluster.listen-address=0.0.0.0:9094 \
--cluster.peer=am-1.example.com:9094 \
--cluster.peer=am-2.example.com:9094 \
--cluster.peer=am-3.example.com:9094 \
--cluster.advertise-address=$(hostname):9094 \
--cluster.peer-timeout=15s \
--cluster.gossip-interval=200ms \
--cluster.pushpull-interval=60s
Key flags:
- `--cluster.listen-address` - Bind address for cluster communication (default: `0.0.0.0:9094`)
- `--cluster.peer` - List of peer addresses (can be repeated)
- `--cluster.advertise-address` - Address advertised to peers (auto-detected if omitted)
- `--cluster.peer-timeout` - Wait time per peer position for deduplication (default: `15s`)
- `--cluster.gossip-interval` - How often to gossip (default: `200ms`)
- `--cluster.pushpull-interval` - Full state sync interval (default: `60s`)
- `--cluster.probe-interval` - Peer health check interval (default: `1s`)
- `--cluster.settle-timeout` - Max time to wait for gossip settling (default: context timeout)
Prometheus Configuration
Important: Configure Prometheus to send alerts to all Alertmanager instances, not via a load balancer.
# prometheus.yml
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - am-1.example.com:9093
            - am-2.example.com:9093
            - am-3.example.com:9093
This ensures:
- Redundancy - If one Alertmanager is down, others still receive alerts
- Independent processing - Each instance independently evaluates routing, grouping, and deduplication
- No single point of failure - Load balancers introduce a single point of failure
Cluster Size Considerations
Since Alertmanager uses gossip without quorum or voting, any N instances tolerate up to N-1 failures - as long as one instance is alive, notifications will be sent.
However, cluster size involves tradeoffs:
Benefits of more instances:
- Greater resilience to simultaneous failures (hardware, network, datacenter outages)
- Continued operation even during maintenance windows
Costs of more instances:
- More duplicate notifications during network partitions
- More gossip traffic
Typical deployments:
- 2-3 instances - Common for single-datacenter production deployments
- 4-5 instances - Multi-datacenter or highly critical environments
Note: Unlike consensus-based systems (etcd, Raft), odd vs. even cluster sizes make no difference - there is no voting or quorum.
Monitoring Cluster Health
Key metrics to monitor:
# Cluster size
alertmanager_cluster_members
# Peer health
alertmanager_cluster_peer_info
# Peer position (affects notification timing)
alertmanager_peer_position
# Failed peers
alertmanager_cluster_failed_peers
# State replication
alertmanager_nflog_gossip_messages_propagated_total
alertmanager_silences_gossip_messages_propagated_total
Security
By default, cluster communication is unencrypted. For production deployments, especially across WANs, use mutual TLS:
alertmanager \
--cluster.tls-config=/etc/alertmanager/cluster-tls.yml
See Secure Cluster Traffic for details.
Persistence
Each Alertmanager instance persists:
- Silences - Stored in a snapshot file (default: `data/silences`)
- Notification log - Stored in a snapshot file (default: `data/nflog`)
On restart:
- Instance loads silences and notification log from disk
- Joins the cluster and gossips with peers
- Merges state received from peers (newer timestamps win)
- Begins processing notifications after gossip settling
Note: Alerts themselves are not persisted - Prometheus re-sends firing alerts regularly.
Common Pitfalls
- Load balancing Prometheus → Alertmanager
  - ❌ Don't use a load balancer
  - ✅ Configure all instances in Prometheus
- Not waiting for gossip to settle
  - Can lead to missed silences or duplicate notifications on startup
  - The `--cluster.settle-timeout` flag controls this
- Network ACLs blocking cluster port
  - Ensure port 9094 (or your `--cluster.listen-address` port) is open between all instances
  - Both TCP and UDP are used by default (TCP only if using TLS transport)
- Unroutable advertise addresses
  - If `--cluster.advertise-address` is not set, Alertmanager tries to auto-detect
  - For cloud/NAT environments, explicitly set a routable address
- Mismatched cluster configurations
  - All instances should have the same `--cluster.peer-timeout` and gossip settings
  - Mismatches can cause unnecessary duplicates or missed notifications
How It Works: End-to-End Example
Scenario: 3-instance cluster, new alert group
- Alert arrives at all 3 instances from Prometheus
- Dispatcher creates aggregation group, waits `group_wait` (e.g., 30s)
- After `group_wait`:
  - Each instance prepares to notify
- Notifier stage:
  - All instances wait for gossip to settle (if just started)
  - AM-1 (position 0): waits 0s, checks notification log (empty), sends notification, logs to nflog
  - AM-2 (position 1): waits 15s, checks notification log (sees AM-1's entry), skips notification
  - AM-3 (position 2): waits 30s, checks notification log (sees AM-1's entry), skips notification
- Result: Exactly one notification sent (by AM-1)
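The scenario above can be condensed into a runnable toy simulation of the stagger-and-dedupe behavior. It is purely illustrative: real instances run independently and share the notification log via gossip rather than a local map.
package main

import (
    "fmt"
    "sort"
    "time"
)

func main() {
    peers := []string{"am-2", "am-3", "am-1"}
    sort.Strings(peers) // position = index in the alphabetically sorted peer list
    peerTimeout := 15 * time.Second

    // Shared notification log, standing in for the gossip-replicated nflog.
    nflog := map[string]time.Time{}
    groupKey := "team-x/critical"

    for pos, name := range peers {
        wait := time.Duration(pos) * peerTimeout
        fmt.Printf("%s (position %d) waits %v\n", name, pos, wait)
        if _, alreadySent := nflog[groupKey]; alreadySent {
            fmt.Printf("%s: log entry found, deduplicating\n", name)
            continue
        }
        fmt.Printf("%s: no log entry, sending notification\n", name)
        nflog[groupKey] = time.Now() // replicated to the other peers before their turn
    }
}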
Scenario: AM-1 fails
- Alert arrives at AM-2 and AM-3 only
- Dispatcher creates group, waits `group_wait`
- Notifier stage:
  - AM-1 is not in cluster (failed probe)
  - AM-2 is now position 0: waits 0s, sends notification
  - AM-3 is now position 1: waits 15s, sees AM-2's entry, skips
- Result: Notification still sent (fail-open)
Scenario: Network partition during notification
- Alert arrives at all instances
- Network partition splits AM-1 from AM-2/AM-3
- In partition A (AM-1):
  - Position 0, waits 0s, sends notification
- In partition B (AM-2, AM-3):
  - AM-2 is position 0, waits 0s, sends notification
  - AM-3 is position 1, waits 15s, deduplicates
- Result: Two notifications sent (one per partition) - fail-open behavior
Troubleshooting
Check cluster status
# View cluster members via API
curl http://am-1:9093/api/v2/status
# Check metrics
curl http://am-1:9093/metrics | grep cluster
Diagnose split-brain
If you suspect split-brain:
- Check `alertmanager_cluster_members` on each instance
  - Should match total cluster size
- Check `alertmanager_cluster_peer_info{state="alive"}`
  - Should show all peers as alive
- Review network connectivity between instances
Debug duplicate notifications
Duplicate notifications can occur due to:
- Network partitions (expected, fail-open)
- Gossip not settled - Check `--cluster.settle-timeout`
- Clock skew - Ensure NTP is configured on all instances
- Notification log not replicating - Check gossip metrics
Enable debug logging:
alertmanager --log.level=debug
Look for:
"Waiting for gossip to settle...""gossip settled; proceeding"- Deduplication decisions in notification pipeline