Monitoring checklist

Nikolay takes us through a checklist of important things to monitor, while Michael tries to keep up.

Michael: Hello, and welcome to Postgres FM,
a weekly show about all things Postgres.

I am Michael, founder of pgMustard, and this
is my co-host Nikolay, founder of Postgres.AI.

Hey Nikolay, what are we talking about today?

Nikolay: Hi, Michael.

Today we are going to talk about monitoring.

And I think we are touching very, very wide areas of Postgres.

And probably we will revisit each one of them,
trying to go into some specific sub-areas.

Right.

But every project, every project should have it.

Michael: Well, do you mean every project?

So let's check that first.

So naturally, for large enterprises, we definitely want monitoring.

We definitely want database-specific things being...

Nikolay: Sometimes a couple of tools.

Sometimes I saw big projects using three
monitoring tools, and all three were monitoring Postgres.

None of them was good enough to be sufficient alone.

Right?

So we needed to combine parts from three places.

It happens.

By the way, I think it's not a problem

if you have a couple of tools in use, if you know that the overhead is low, right?

Michael: Yeah, absolutely

Nikolay: We are talking about Postgres monitoring.

This is a huge topic.

And my main statement is: there is no good product developed.

Michael: No perfect product maybe, or no good product?

Nikolay: Even... okay, okay.

Good is maybe okay.

No great products.

I don't see great products, because I understand
what should be present in any Postgres monitoring.

And it's not only my point of view.

A couple of years ago, we started some community work.

So we started to discuss various tools.

We had several sessions in the Russian community,

several sessions reviewing various tools, both open source
and commercial, and then we created the so-called Postgres monitoring checklist,

collecting ideas of what should be present in
monitoring and what should be a must-have or a nice-to-have, and so on.

And then some people outside of the Russian community joined as well.

And eventually, I think, we got a very good
understanding of what should be present, like a hundred

percent, if you want your monitoring to be very useful.

And I don't see any product, commercial or open source, or maybe even
in-house (sometimes the in-house products I see are very good ones), but none

of them meets these criteria even at 80%; usually it's below 50.

Michael: So, yeah.

So we are talking about things that you consider almost a
must-have for most projects whose uptime people care about.

Making sure they have each of those covered:
is it monitoring, or monitoring and alerts?

Like, do you count those as the same thing?

Nikolay: Well, alerts are a slightly different topic.

I'm talking about troubleshooting sessions, when we open monitoring
and try to understand what's happening, what's wrong, because

we have some complaints from users or from our coworkers.

And we see

something wrong happening with Postgres, and we
need to understand what exactly is wrong.

And I see two big areas which must be present in any monitoring.

The first area: of course we should have physical
metrics like CPU, memory, disk, network.

It should be covered.

And almost always it is covered.

And then, separately, the

Postgres specifics.

And here we can recall Alexey Lesovsky's pgstats.dev, this
interactive picture, which shows you the architecture of Postgres and

which system views and extensions or tools are responsible
for monitoring each particular part of the Postgres architecture.

It's a great tool.

We will provide a link in the description, I'm sure.

So this should be present in monitoring.

And this first part I call dashboard number one.

The goal of dashboard number one: you have like 30
or at most 60 seconds, and you need to perform a very wide

and very shallow overview of all Postgres components, all
Postgres areas, and understand which area looks not well,

so we can go deeper using other dashboards,
probably, or maybe by looking at Postgres manually.

And the second component is query monitoring.

Query monitoring is, like, not a king, it's
the queen of observability, right?

It should always be present.

It's a whole different topic.

It should be present in any monitoring.

And again, I don't see... like, some systems are
improving. For example, Datadog currently is improving very well in
this area particularly, and by the way, they do it for all databases,

not only for Postgres.

I just see how much better they are compared to two
years ago, but they have big mistakes as well.

Not mistakes, but disadvantages, like missing parts,
which don't allow me to work with it. But probably we

should postpone a discussion about query analysis.

It's maybe a different talk, but let's focus on
this first, wide and shallow, dashboard.

Like, I want everything quickly, the health of each component.

Michael: So the use case is: my pager has just gone off, or I've got a
notification, or a user mentioned something that doesn't sound good.

And I want to look at something and see where the problem is.

Like, where do I even start?

Nikolay: Right.

So either some people say there are errors related to Postgres,
or there is some performance degradation, and we want to have ten

or a dozen, like 12, metrics present on a single dashboard,

so we quickly understand: are we okay compared to
one hour ago, or one day ago, or one week ago?

Or are we not okay in particular areas?

So let's dive deeper.

So let's probably quickly have an overview of these metrics, because I
have a list, and we can attach this list to our episode, right?

Michael: Yep.

So what's the top of your list?

Nikolay: At the top of the list, there are two major things which
represent the behavior of any multi-user system: throughput and latency.

Not only multi-user, but any system where a network is also involved.

So throughput, if we talk about databases particularly,
can be measured in several ways, but two ways are most interesting to us.

It's TPS and QPS: transactions per second and queries per second.

And unfortunately Postgres doesn't let you measure the second one really well.

TPS is present in pg_stat_database; I always mix them up: xact_commit and xact_rollback.

We observe these two numbers, and the sum, and how fast it is increasing.

This is our TPS.

Michael: Or decreasing, right?

If there's some problem where it...

Nikolay: It cannot decrease, it's cumulative.

It's just a counter of committed transactions and rolled back transactions.

Michael: Sorry.

I thought you were talking about TPS still.

Nikolay: So, right.

What can decrease is the rate, if we divide by the duration:
we need two snapshots, as always, right?

And we need to understand how many seconds passed between them.

So we divide, and of course this speed can go up and down.
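A minimal sketch of that calculation, assuming direct access to the system views; a monitoring agent would record this value periodically and divide the delta by the elapsed seconds:

    -- Cumulative transaction counters; TPS is the difference between two
    -- samples of this value divided by the number of seconds between them.
    SELECT sum(xact_commit + xact_rollback) AS total_xacts
    FROM pg_stat_database;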

TPS can go up and down, but QPS, unfortunately, Postgres system views don't give us.

And the usual trick is to use pg_stat_statements, but it's not an exact
number, unfortunately, because pg_stat_statements, as I remember, the

pg_stat_statements.max parameter is 5,000 by default.

If I'm not mistaken, it can be tuned to 10,000 or a hundred thousand.

It means that we don't register all queries.

So we probably see only the tip of the iceberg,
and QPS can be wrong if we use this approach,

if we use the calls metric from pg_stat_statements.
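A rough sketch of that approach, with the caveat just mentioned: entries evicted once pg_stat_statements.max is exceeded are not counted, so the result is only an approximation:

    -- Cumulative call counter; QPS is again the delta between two samples
    -- divided by the interval in seconds.
    SELECT sum(calls) AS total_calls
    FROM pg_stat_statements;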

When I have doubts, usually, in most cases we have
PgBouncer, so I check the PgBouncer logs, and it reports QPS, a fair

number for QPS; but this is maybe already going too deep.

Michael: Yeah, let's focus on the things we can measure in Postgres.

Nikolay: Right.

This is throughput. Latency is query duration.

Again, usually it's measured from pg_stat_statements timing.

Why is it important? Because we need to understand,
for example, if QPS and TPS dropped significantly, maybe it's

just a problem with the application, errors on the application side.

Like, our load dropped.

Or maybe it's a problem related to CPU:
all CPUs are at a hundred percent, a saturation situation,

and that's why we process fewer TPS.

And latency, of course, is very important.

And I see that in many cases we don't have it in monitoring.

These two metrics are

the general health of our database. It should always be: we
have a problem, so let's check TPS, QPS, and latency, the average duration of queries.
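A minimal sketch of an average-latency metric from pg_stat_statements (on Postgres 12 and older the column is total_time rather than total_exec_time); a live graph would use deltas between samples rather than lifetime totals:

    -- Average query duration in milliseconds across all recorded statements.
    SELECT sum(total_exec_time) / nullif(sum(calls), 0) AS avg_query_ms
    FROM pg_stat_statements;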

The next thing is very, very important,

and I see that more than 50% of systems are
not good with this: it's connection monitoring.

My ideal is just one single stacked graph of connections:
active, idle-in-transaction, idle, and... I don't remember the fourth.

But these three already tell almost everything.
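A sketch of the breakdown behind such a stacked graph, assuming pg_stat_activity is sampled periodically:

    -- Client connections grouped by state: active, idle, idle in transaction, etc.
    SELECT state, count(*)
    FROM pg_stat_activity
    WHERE backend_type = 'client backend'
    GROUP BY state
    ORDER BY count(*) DESC;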

Right.

So, active:

we are processing something.

If active goes above the number of CPUs, it's already a bad sign.

So if you have 12 cores and we see 20 active, well,
there is probably already too much work to do for this server,

and probably we already have a huge degradation in performance.

It's not always so, but in many cases it is. And idle connections and
idle-in-transaction connections are also very important, especially before Postgres

13 or 14.

There were optimizations in 14.

Right?

Right.

So I saw, sometimes people say: okay, we have
a couple of thousand more idle connections.

Don't worry.

They are idle.

But, you know, the overhead is big, and I think
Andres Freund had a very good couple of posts

explaining the overhead.

Also, we should provide links,

right?

Michael: I think he did a lot of the work in 14.

I think it was.

Nikolay: Optimizing it, yes.

It was related not to memory consumption, as I
remember, but to how the work with snapshots is organized.

Usually people think a couple of thousand
connections give us a lot of memory overhead.

No, it's more about how the work with snapshots is organized.

But what I like about stacked graphs is that
we can quickly understand the total as well.

Right?

And sometimes people split it into multiple charts, multiple graphs.

That's also fine, but I like to see the connection overview in one place.

And, of course, idle-in-transaction

sessions, or connections;

connections, sessions, it's the same, basically.

They are very dangerous sometimes, because if the application started a
transaction and went off to talk to some external API, for example,

or to do some other work, it can be very dangerous.

So we should also understand... like, we should have
some threshold and not allow the system to go above it.

Michael: So, yeah.

So at this point, I guess we're talking about long transactions.

Nikolay: No, well, sometimes these things correlate, but not always, because
we can have, for example, 200 idle-in-transaction sessions at each moment.

Right.

But the duration of each transaction is less than one second.

It might happen:

very brief transactions, but they do something
and keep connections idle. That's also possible.

But the next item is long transactions, actually.

And this is actually a big topic as well, because
long transactions usually carry two dangers.

One danger is to lock something and block others.

That's one point; another point is to disturb autovacuum's work.

But in the latter case we should talk not about long
transactions but about the xmin horizon; it's a different thing.

And I haven't seen it.

I've never seen a monitoring tool that distinguishes these two things yet.

Of course, I haven't seen all existing systems, but over the
last couple of years I've been observing various systems, dozens of them.

Michael: Should we talk about that then?

Straight away.

So the xmin horizon being held back... to avoid transaction ID wraparound, which is,
you know, there have been some famous blog posts that I can include on that.

Nikolay: Right. Transaction ID wraparound is one risk of the xmin horizon
being stuck far in the past; but also accumulating

a lot of dead tuples, which then convert to bloat, is also a problem.

Right.

Michael: See previous episode.

Nikolay: Right, right, exactly.

We discussed it already. But, again, a simple graph:

what is the maximum age of the longest transaction right now?

Excluding vacuum, because a regular vacuum, by the way, it's tricky:

a regular vacuum doesn't hold the xmin horizon.

It doesn't block anyone, except that VACUUM ANALYZE does

hold the xmin horizon.

And if the vacuum is working to prevent
transaction ID wraparound, it can block others.

So in each area

there are interesting ifs, right?

If this, then that; so it's quite difficult to build ideal monitoring, of course.

But anyway, I want to have a graph:

do we understand the maximum age of transactions?

Right.

Usually people say: I don't know.
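A minimal sketch of the two separate graphs being described, one for the longest transaction and one for the xmin horizon (replication slots and prepared transactions can also hold the horizon, which this simple version ignores):

    -- Longest-running transaction among client backends (excludes autovacuum).
    SELECT max(now() - xact_start) AS max_xact_duration
    FROM pg_stat_activity
    WHERE backend_type = 'client backend';

    -- How far back the xmin horizon is held, measured in transaction IDs.
    SELECT max(age(backend_xmin)) AS xmin_horizon_age
    FROM pg_stat_activity;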

Michael: what kind of number are you worried about there?

Nikolay: Right, right.

In OLTP systems.

And by the way, we forgot to mention that we are,
again, discussing OLTP systems and monitoring for them.

Right.

Because for analytical systems, things are different.

For OLTP, I would like to have everything, ideally, under one second.

Again, it's related to human perception, which is
that we can distinguish roughly 200 milliseconds.

So one second feels slow. But of course, sometimes we need
slightly longer transactions, a few seconds; but a transaction of one

minute, if it can block someone else, is already a problem.

We should split it

into batches, but

we discussed that briefly last time.

If you create an index, unfortunately, you hold the xmin
horizon, and of course, if your table is big, you need to spend

sometimes an hour or maybe even more to build the index.

And that's why partitioning is needed, again, to keep this time small.

Michael: So we are about halfway through,
nearly halfway through our dashboard.

Nikolay: Good.

So the next thing is the ratio of errors.

We can measure it.

We already discussed this: from
pg_stat_database, xact_commit and xact_rollback.

How many transactions are committed?

How many transactions are rolled back? And we can have a ratio

to understand the percentage of successful transactions.
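A minimal sketch of that ratio, again from the cumulative pg_stat_database counters (a dashboard would compute it over deltas between samples):

    -- Percentage of committed transactions per database.
    SELECT datname,
           round(100.0 * xact_commit / nullif(xact_commit + xact_rollback, 0), 2)
             AS commit_ratio_pct
    FROM pg_stat_database
    WHERE datname IS NOT NULL;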

And it's interesting: sometimes I see a system produce
only 50%, or even fewer, successful transactions.

But the developers say this is intentional,

this is by design, and it triggers interesting questions:

what negative consequences can there be if the application uses rollbacks too often?

It's like a separate topic.

Michael: But the main reason you want this ratio
is to see if it changes from your baseline.

So if you've normally got a very high ratio, you just want to be able to tell when it drops.

Nikolay: Right.

99%,

and suddenly we have 70%: obviously it means we have a spike of errors.

So we probably need to go to the logs and understand which errors; or, if you use the
extension called logerrors, you can have counters right inside the database.

So you can select them,

aggregated by error type.

So the next thing is transaction ID wraparound, as a
separate metric. Here people always... okay, not always,

but in 90% of cases people forget about multixact ID wraparound.

They just don't check it.

It's more exotic. Transaction ID wraparound itself is exotic,
but when it happens, it's a big disaster; and multixact ID

wraparound is even more rare, very unusual.

Michael: Is this the thing that you mentioned in
the GitLab blog post about subtransactions?

Nikolay: Briefly.

Yes, briefly, there is a connection to it.

Yes, because if we use SELECT FOR SHARE, and sometimes SELECT
FOR UPDATE, we might have increasing multixact IDs and

risks of... well, if we use foreign keys, we have multixact IDs.

That's it.

Michael: Of course.

So are we talking about two graphs here, then? One
for... like, what do we actually need for our dashboard?

Nikolay: It can be one, but the idea is that we should not
forget about this second ID; it also can wrap around.

It has the same risk.

So we should monitor both risks and alert when it's already

above the threshold, and so on.
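A sketch of tracking both wraparound risks per database; the alert threshold is a matter of policy, but it should be well below the hard limit of roughly 2.1 billion:

    -- Transaction ID wraparound risk.
    SELECT datname, age(datfrozenxid) AS xid_age
    FROM pg_database
    ORDER BY 2 DESC;

    -- Multixact ID wraparound risk (the one that is often forgotten).
    SELECT datname, mxid_age(datminmxid) AS multixact_age
    FROM pg_database
    ORDER BY 2 DESC;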

Okay.

Then, the health of replication. Again,
for this dashboard we need only a brief analysis.

Like, do we have at least one standby which is lagging, physical or logical?

We should define our thresholds, and

that's it. If we see a spike of lag, it's also a problem:
you can get an out-of-disk-space event or something else not good.
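A sketch of a brief physical-replication check, run on the primary (logical replication lag would be checked separately, per slot or per subscription):

    -- Lag per standby, in bytes and as replay delay.
    SELECT application_name,
           pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS replay_lag_bytes,
           replay_lag
    FROM pg_stat_replication;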

Then, a few things are left.

By the way, I realized that in this list (we will provide this list, as I've said)

we don't have vacuum behavior, autovacuum behavior,
and we don't have checkpointer or background writer behavior.

We have it in our systems.

We have it on dashboard number two,

but I think we probably should move it to dashboard one as well,
because these are two big components of Postgres: vacuuming and checkpointing.

Right.

So we should have it here.

Michael: Before you move on from this one, I see you also had unused
replication slots, which I thought was really interesting.

I've heard of some issues where, if you've got a
static number of those and it decreases, if something

ends up using one of those replication slots that you weren't
expecting, that could be a sign of a problem as well.

Right?

Nikolay: The problem is that, right,

we don't have lag per se here, but
if a slot is unused, WAL is accumulating,

so it's easy to run out of disk space soon.

So unused slots are dangerous, definitely, but
everything here can be combined on one or a couple of graphs.
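A sketch of such a graph for slots, showing how much WAL each one forces the server to retain; an inactive slot with growing retention is the dangerous case:

    -- Replication slots and the WAL they retain.
    SELECT slot_name, slot_type, active,
           pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) AS retained_wal_bytes
    FROM pg_replication_slots
    ORDER BY retained_wal_bytes DESC NULLS LAST;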

But, as I've said, we want to keep this dashboard
quite small so we can quickly overview everything.

And then, after replication, I'd like
to talk about lags related to archiving.

Very often, people don't have it.

They don't monitor the lag,

like how many WALs are not yet archived.

Why is this important?

If you try to stop Postgres, a restart or just a stop,

and it has a lot of unarchived WALs...

You know, when we shut down Postgres, there is a so-called implicit
shutdown checkpoint, and in the same way Postgres tries to archive them all.

So it runs archive_command for all pending WALs, and that can take
an hour if we accumulated thousands of WALs. It happens.

You wait.

Michael: Well, it's not just if you want to upgrade
or something; it could also be if there's a crash, right?

How long is it going to take you to recover?

Nikolay: Well, recovery is different.

If it crashed, it'll recover quite soon.

It just replays the redo, whatever is needed.

And that's it.

I'm talking about an intentional restart or shutdown,
for example, if Patroni wants to perform a switchover or

failover for some reason: it tries to shut down the primary.

Until some version not long ago, because I told Kukushkin about it,

and it was a surprise, because we had it in production in some system:

Patroni failed to perform failover because a
lot of WALs were pending to be archived.

And Kukushkin fixed it in the
recent versions of Patroni.

It's quite a new thing,

quite a new bug fix.

Well, it's not a bug fix.

It's an improvement: Patroni

doesn't wait and tries to perform failover.

There is a trade-off, because what to do with these WALs?

I don't remember the details there, because if we
don't archive them, it hurts our DR

strategy, the disaster recovery strategy, right?

We need to keep backups in good shape.

So this is an interesting question as well. But anyway, monitoring should
have this number of pending WALs, and also maybe the rate of archiving:

how many WALs we archive per second, for example.
It's a very, very interesting metric as well, to understand our archiving throughput.
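A sketch covering the archiving backlog and the archiving and WAL-generation rates (pg_ls_archive_statusdir() requires Postgres 12 or newer and appropriate privileges):

    -- Archiver health: cumulative counts of archived and failed WAL files.
    SELECT archived_count, failed_count, last_archived_wal, last_failed_wal
    FROM pg_stat_archiver;

    -- WAL files still waiting to be archived (the .ready markers).
    SELECT count(*) AS pending_wal_files
    FROM pg_ls_archive_statusdir()
    WHERE name LIKE '%.ready';

    -- WAL generation rate: sample the current LSN twice, then
    -- bytes_per_second = pg_wal_lsn_diff(later_lsn, earlier_lsn) / seconds.
    SELECT pg_current_wal_lsn();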

Michael: You've got that next on the list.

Nikolay: Ah, okay.

Yeah.

Okay.

WAL generation rates. And, yeah, as I recently learned, if
you have a few WALs per second, probably you won't be able to

use logical replication, because the WAL sender will
saturate a single CPU core. And finally, locks and deadlocks.

Well, maybe they should not go together, because

they're different things.

Right.

And locks: of course, sometimes people draw both
exclusive and share locks there; there are many kinds of locks here.

We are talking about heavy locks, not
lightweight locks or latches. So, heavy locks:

we are locking rows and locking tables.

And here sometimes people try to draw everything, but
we should focus, maybe, only on exclusive locks,

which can be responsible for blocking others, as the first reason.

Well, there are also many problems there.

This chart alone can be difficult to implement properly.

But anyway, we want to have something.
I just saw that sometimes application engineers

say: oh, we had a spike of locks.

You look at it, and they had a spike of share locks.

So it's okay, because there was some deployment; it's normal.

It's not a problem.

So

this chart is also interesting:

what exactly should be there?
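A sketch of a lock chart that focuses on sessions actually blocked on heavyweight locks, which is usually more useful on a dashboard than drawing every lock type:

    -- Sessions currently waiting on heavyweight locks and who blocks them.
    SELECT pid,
           pg_blocking_pids(pid) AS blocked_by,
           wait_event,
           now() - xact_start AS xact_age,
           left(query, 60) AS query_snippet
    FROM pg_stat_activity
    WHERE wait_event_type = 'Lock';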

And deadlocks, of course: we want to understand how many deadlocks we have.
Many other things are missing here, but I'm not a big fan of

combining the physical and the logical; for example, in Datadog,
if you check database monitoring, you will see CPU combined with TPS.

It's a mix of everything.

It's not right.

Host stats

I would like to see separately from database stats.

What do you think?

Michael: Interesting.

Well, I think in the early days I've seen a lot of teams get really far
looking at application stats and then adding some database things into it.

They don't have some of these, like, real scaling issues, I guess.

I think a lot of people implement
monitoring after a big issue, after a big outage.

Then they think: ah, maybe we do need it now.

So probably people put this in, in general, a little bit late.

But yeah, I can see an argument for keeping it where the
application monitoring is, especially from the query performance

point of view at least, especially from the performance monitoring side.

Maybe not so much from the troubleshooting, we've-had-an-outage perspective.

Nikolay: Right.

Well, again, we just didn't discuss query
analysis here; it's another half an hour, at least. It's

a very interesting and very deep topic.

I also forgot to mention that we
took pgwatch2, from Cybertec.

And I remember when we first reviewed
it, it felt strange to us; like strange dashboards.

Like, you explained: oh, why, why not add this?

Why not add that?

But the fact that it's open source, with good components like Grafana and so on,

and that it can use Postgres and Timescale...

So

Vitaliy from our team implemented initially
the set of dashboards following this philosophy.

So dashboard number one is for quick and very shallow troubleshooting,
understanding which areas need further analysis.

Dashboard number two:

more interesting things for DBAs.

Dashboard number three is query analysis, and it's available.

So it's open source, available.

And recently another contributor
helped us to support the Timescale version,

so Postgres and Timescale can be used as storage.

And I'd like to...

I encourage everyone to consider it as a secondary
monitoring; you probably already have something like Datadog,

right,

which is, like, usual, or anything else, which is usually, like...

Michael: APM.

Nikolay: Like, the organization has already adopted some tool

for everything, not for databases, not for Postgres specifically.

And of course it's worth having it everywhere, because it's
a unified solution, but it's missing a lot of things.

And I hope I convinced our listeners today that
there are many things that should be present there.

So pgwatch2 can be added as a secondary monitoring,

very gentle, without a big observer effect,

and it can provide a lot of insights for database experts, and actually
not only experts. This TPS and QPS plus latency, this throughput plus

latency, is alone already something very good, which many systems are

missing.

Michael: Yeah, absolutely.

So yeah, I guess if you don't have some of these things,
maybe look into how much work it would be to add them,

and hopefully it will save you, or make it a lot easier

next time you have a big issue, to spot what's going wrong.

Nikolay: Right.

And if you develop your tool in-house, or you are a developer
of a monitoring product: this list is a product of the community,

like several months of work. We discussed it,
I discussed it on Twitter and during our sessions in the Russian community,

and so on. I think this list is very, very useful as a
checklist for everyone who wants to develop great Postgres monitoring.

So please use it as a reference

for what should be added.

Michael: Awesome.

Nikolay: I think it will be useful, right.

Probably one day we will discuss query monitoring as well.

Michael: Yep.

Nikolay: I have something to say about it too.

Michael: Definitely, part two coming.

Brilliant.

Well, thanks everyone for joining us again.

Uh, we really appreciate it.

Let us know if there's anything else you want to
make sure we covered or anything you think we missed.

It'd be very helpful.

And yeah, check out the show notes; I'll include links to all
the tools we mentioned and various other things as well, including this list.

Nikolay: Yeah.

And thank you again for the feedback.

It's a very important thing to have, on Twitter and other places.

And please subscribe, like, and share, share, share everywhere: in your
social networks and groups where you discuss engineering and databases.

And so on.

Michael: Wonderful.

Well, thank you, Nikolay.

Good to see you, speak to you next week.

Nikolay: Thank you Michael, till next time.

Bye.

Bye.

Michael: Cheers.

Bye.
