Monitoring checklist
Nikolay takes us through a checklist of important things to monitor, while Michael tries to keep up.
Monitoring checklist (dashboard 1):
- TPS and (optional but also desired) QPS
- Latency (query duration) — at least average. Better: histogram, percentiles
- Connections (sessions) — stacked graph of session counts by state (first of all: active and idle-in-transaction; also interesting: idle, others) and how far the sum is from max_connections (+pool size for PgBouncer).
- Longest transactions (max transaction age or top-n transactions by age), excluding autovacuum activity
- Commits vs rollbacks — how many transactions are rolled back
- Transactions left till transaction ID wraparound
- Replication lags / bytes in replication slot / unused replication slots
- Count of WALs waiting to be archived (archiving lag)
- WAL generation rates
- Locks and deadlocks
- Basic query analysis graph (top-n by total_time or by mean_time?)
- Basic wait event analysis (a.k.a. “active session analysis” or “performance insights”)
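A few of the items above (TPS, commits vs rollbacks, transactions left till wraparound, and distance from max_connections) come down to simple arithmetic over values Postgres already exposes. The following is a minimal sketch, not a definitive implementation. The names `DbSample`, `tps`, `rollback_ratio`, `xids_left`, and `connection_headroom` are hypothetical and used only for illustration. It assumes the counters are sampled periodically from `pg_stat_database`, the XID age from `pg_database`, and session counts from `pg_stat_activity`. The queries are included as strings for reference only and are not executed here.

```python
from dataclasses import dataclass
from typing import Mapping

# Reference queries (not executed in this sketch).
COUNTERS_SQL = "SELECT sum(xact_commit), sum(xact_rollback) FROM pg_stat_database"
XID_AGE_SQL = "SELECT max(age(datfrozenxid)) FROM pg_database"
SESSIONS_SQL = "SELECT state, count(*) FROM pg_stat_activity GROUP BY state"

# Simplifying assumption: treat 2^31 as the wraparound horizon. Postgres
# actually stops accepting writes somewhat earlier, so alert well before zero.
XID_WRAPAROUND_LIMIT = 2**31


@dataclass
class DbSample:
    """One sample of cumulative transaction counters (hypothetical type)."""
    taken_at: float      # Unix timestamp of the sample, in seconds
    xact_commit: int     # cumulative committed transactions
    xact_rollback: int   # cumulative rolled-back transactions


def tps(prev: DbSample, curr: DbSample) -> float:
    """Transactions per second (commits + rollbacks) between two samples.

    Does not handle counter resets (e.g. after a statistics reset).
    """
    elapsed = curr.taken_at - prev.taken_at
    if elapsed <= 0:
        raise ValueError("samples must be in increasing time order")
    commits = curr.xact_commit - prev.xact_commit
    rollbacks = curr.xact_rollback - prev.xact_rollback
    return (commits + rollbacks) / elapsed


def rollback_ratio(prev: DbSample, curr: DbSample) -> float:
    """Share of transactions rolled back between two samples (0.0 to 1.0)."""
    commits = curr.xact_commit - prev.xact_commit
    rollbacks = curr.xact_rollback - prev.xact_rollback
    total = commits + rollbacks
    return rollbacks / total if total else 0.0


def xids_left(oldest_xid_age: int) -> int:
    """Approximate transactions left, given max(age(datfrozenxid))."""
    return max(XID_WRAPAROUND_LIMIT - oldest_xid_age, 0)


def connection_headroom(counts_by_state: Mapping[str, int], max_connections: int) -> int:
    """How far the sum of sessions across all states is from max_connections."""
    return max_connections - sum(counts_by_state.values())


if __name__ == "__main__":
    prev = DbSample(taken_at=0.0, xact_commit=1000, xact_rollback=10)
    curr = DbSample(taken_at=10.0, xact_commit=1900, xact_rollback=110)
    print("TPS:", tps(prev, curr))
    print("Rollback ratio:", rollback_ratio(prev, curr))
    print("XIDs left:", xids_left(200_000_000))
    print("Headroom:", connection_headroom({"active": 20, "idle": 50}, 100))
```

In a real dashboard these values would typically be computed by the monitoring tool itself; the sketch only shows what each graph is measuring.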
And links to a few things we mentioned:
- Postgres monitoring review checklist (community document)
- pgstats.dev
- Improving Postgres Connection Scalability: Snapshots (blog post by Andres Freund)
- Transaction ID Wraparound in Postgres (blog post by David Cramer)
- Subtransactions Considered Harmful (blog post by Nikolay)
- datadoghq.com
- pgwatch2 (Postgres.ai Edition)
------------------------
What did you like or not like? What should we discuss next time? Let us know by tweeting us on @samokhvalov and @michristofides.
If you would like to share this episode, here's a good link (and thank you!)
Postgres FM is brought to you by:
- Nikolay Samokhvalov, founder of Postgres.ai
- Michael Christofides, founder of pgMustard
With special thanks to:
- Jessie Draws for the amazing artwork