Postgres Emergency Room
Nikolay: Hello, hello, this is
Postgres.FM.
My name is Nikolay from Postgres.AI
and as usual, my co-host
is Michael from pgMustard.
Hi, Michael.
Michael: Hello, Nikolay.
Nikolay: And the topic I chose
is related to problems, acute
problems, unpredictable, which
sometimes happen to production
systems including Postgres.
In many cases, the database is at the center of the storm, so let's discuss how we manage this and how to manage it better.
So yeah, how to handle crisis situations with production Postgres.
I called it Postgres urgent care
or emergency room.
I don't know like what's better
name here, but yeah, I guess
this is something I can share from
my past experience.
Yeah.
So let's discuss this.
Michael: Sounds good.
And good point about the database
often being in the center of
things.
I think, I guess, it's often the large companies that we notice on Twitter and places like that, you know, people start tweeting that it's down.
I think recently there was a big
GitHub incident and that was,
I think, one of the first communications
was about it being database
related.
Nikolay: Related to database changes, not only the database itself, but also database changes. When you have a change involving the database, it increases the risk of an incident in most cases, actually.
But it was not about Postgres,
so let's exclude this case.
It's MySQL.
But I truly believe that the database,
as I usually say, is the
heart of any tech system because
it's very, very stateful, right,
because it's data.
And since it's stateful, it's really
hard to scale it and handle
performance and so on.
So indeed, since the database is the heart of our production systems, it's often at the center of the storm.
Michael: Where would you like to
start with this?
Nikolay: I have a plan.
Let's have a two-step discussion.
First, we'll discuss the psychological
part of incident management
related to databases, and second,
purely technical.
How about this?
Michael: Yeah, I like it.
The psychological aspect probably
isn't talked about as much
as it could be.
I think people often in postmortems
focus on the technical things
that could be done differently
to avoid the issue reoccurring.
I mean, that obviously is the most
important thing, but I feel
like so many could also have learned
a lot from how they communicated
or how much they communicated.
Often I see almost no communication
from companies that are having
big outages, very poor levels of
updates.
And it wouldn't completely alleviate
the situation, of course.
If you're down, you're still down.
But I'd be surprised if companies
that communicate better don't
have much better outcomes than
ones that don't.
People looking to cancel services
afterwards, that kind of thing.
Nikolay: Yeah, so indeed, it may not be well discussed, but I think anyone who is dealing with production at a larger scale definitely knows that, first of all, many people don't manage this kind of stress well, and that's fine.
Maybe it's not your kind of thing,
right?
Some people manage stress very
well.
It's not me, by the way.
I manage stress moderately well.
I've learned how to do it and so on,
but in the center of a production
incident, I still find myself with
high emotions.
We should do something very quickly,
right?
It's hard.
Also, tunnel vision.
You see only some things and you
don't have time for anything
else.
It's very hard to relax and capture
the whole picture and so
on.
And that's why tooling is also important here, right? Tooling should be designed for people who are under stress, and the same goes for runbooks and so on.
But this is technical stuff.
We will talk about it slightly
later.
So I think there are some trainings
and so on.
I'm not a good source of advice here, but I know we should
look at people who deal with production
systems, SREs, and so
on, and there are many, many books
written about it, and handbooks,
runbooks, and so on.
So, yeah, there are some good practices for how to deal with such stress and so on.
Michael: What are your favorites
of those?
Nikolay: I like to use materials
from Google, of course.
There is sre.google, this is the hostname.
There are 3 books there and quite
good content.
I also like the handbook from GitLab
for the production SRE team.
Michael: I seem to remember Netflix
having some good stuff as
well.
Nikolay: Yeah, yeah.
Well, we are in the area like it's
not only about databases,
of course, right?
It's the SRE area basically and there
are many good materials for
this.
But specifically for databases, I think in general, if you feel emotional, maybe you should tell your colleagues, or maybe someone else should take the more active role. You need to understand yourself, basically, right?
If you feel… I sometimes feel myself… I remember we had a Black Friday that was very boring because we were very well prepared. We had a lot of stuff; it's a large company, e-commerce, and we had very good preparation, a war room organized, we were prepared for incidents, and the whole Black Friday was so boring. So when we finally found some incident at 9 p.m., I was happy. Finally, we had some material.
And recently, I was helping some customers, and I remember exactly this state: finally, something interesting, you know? This is a great state to be in.
Instead of being stressed and everything
is on your shoulders,
you don't know what to do or maybe
you know, but what if it won't
work?
You have some good materials to check, like some monitoring systems, but you're so stressed, right? Because you have a fear of failure, right? What if you can't bring the database back up for many hours, and then everything is a disaster?
But if you're in the state of: this is interesting, finally we have some work to do in production, let's just... I have a high level of curiosity.
Maybe it's a new case.
This comes with experience.
You saw many cases, but you are looking for new kinds of cases, because it's already too boring to deal with the same kind of corruption again, or, I don't know, some database down again.
You saw it many times.
And you are hunting for new types
of cases and curiosity helps.
So these are 2 very opposite states, I would say.
And I was in both in my life.
So yeah.
Michael: For you, does it depend a bit on the severity, though? Because for me, even if it was 9 p.m. and I'd been hoping for some interesting case to come up, if it was super serious and everything was down, I wouldn't be happy that we finally got...
Nikolay: Well, honestly, I didn't have cases where, for example, people's lives depend on this. I can assume this might happen with some systems, but I was in cases where the cost of downtime was super high.
And now I'm not scared anymore, you know? I already had it, right?
So I know how it feels and so on.
I'm not super scared if it's only
about money.
But life-threatening downtime,
honestly, I didn't have it.
And I think if it happened, I would
be very concerned, right?
Maybe it's this realization that it's only about money, and the worst thing that can happen is somebody loses money and you lose a job. It's not the worst case, actually. Just relax.
But life-threatening, it's another
story.
I'm very curious if someone who
is listening to us has some system
where the state of Postgres can
influence the health or life of people.
This is interesting.
Michael: Yeah, you do hear about
healthcare use cases, but also
military use cases.
Nikolay: Right, right.
Yeah, it might happen.
Might happen.
I didn't have it.
No, me neither.
So, that's why I'm not super scared.
It's like, okay, it's not a big
deal.
I know it's a big deal.
We will be professionally helping,
right?
But let's just do what we can and
that's it.
And that's why I'm curious.
Is it a new case, finally?
Okay, let's work on it.
But again, back to my point, you
need to understand yourself.
This is very important.
If you know you don't react well, even to small problems, maybe it's better to be an assistant.
Michael: Ah, yeah.
I was doing some thinking before
the episode on what kinds of
emergencies there could be.
And a couple that I don't know
if you're 100% thinking of that
would be really scary for me would
be like security-style incidents,
either external or internal.
Yeah, exactly.
Nikolay: Like hackers acting right
now.
Michael: That would be scary in
a different way, potentially.
Or exciting in a different
way.
Nikolay: This is, I think, CEO level already. So definitely, if something like that happens, there's not only a technical aspect here, but also a very high-level organizational aspect: how to handle this situation properly.
Right?
Michael: So how... Oh, I was still talking about it psychologically, though.
Nikolay: Psychologically, yes, but these decisions, how to handle it, are already CEO level.
It happens.
Recently, we received from our insurance, I think, a regular routine notice that, you know, our data was stolen again. And it just happens. Like, you know, we don't know if your record was also stolen; maybe not.
And a couple of days later, I found on GitHub a very good project. Some guy created a database of all SSNs of all Americans and just published it on GitHub. The thing is that there are only about 1,000,000,000 of them; an SSN has just 9 digits. So he just published all numbers up to 1,000,000,000,
Michael: okay.
Nikolay: But some people on Twitter started saying, oh, I found my SSN as well. It became a snowballing joke. Some people went, okay, I'm going to remove my SSN, and created a pull request. It's funny.
So back to this: if you know yourself, it's good. It helps you understand your stress level.
On another note, it's funny that we aim to monitor a database and production systems well, with second-level precision sometimes, but we don't monitor ourselves. Like cortisol level, right? It would be great to understand, but we don't have it. This bothers me a lot, the monitoring of human bodies. I don't understand my own state except for how I feel. So it would be good to see heart rate, for example, right?
Michael: Yeah.
Rings and watches that monitor heart rate, which probably correlates quite well with stress level.
Nikolay: Yeah.
But let's maybe slowly shift to technical stuff. So of course, knowing yourself helps. I wanted to share 1 story. Very long ago, 15 years ago or so, I had a great team, a great startup. I was CTO, I think. Maybe no, I was CEO actually. But yeah, I was combining these 2 roles, and I had a Postgres production system and great Postgres experts on my team.
And I remember 1 guy was a great Postgres expert and I made a mistake. It was my mistake. I was leaving on a trip for a few days, and I said, you will be responsible for production, especially the Postgres state, because he was the best Postgres expert on my team, right? But it was an obvious mistake, because an incident happened and he couldn't handle it properly, and he completely... he lost his shit, pardon my French. Right?
Michael: So you mean technically he would
have been best placed to handle
it in the team.
Nikolay: A technical expert is not
necessarily good in terms of
incident management, right?
And this is my mistake, I didn't
recognize it.
And this led to the end of our
cooperation, unfortunately.
So sometimes good technical experts should be assistants, right? Not feeling the pressure on their shoulders, right?
This is super important to understand.
So my advice is, you know, like
try to understand yourself and
whether you should be responsible for incident
management or just assisting
technically, right?
Michael: Yeah, know yourself, but also know your team, and know who in your team you can call on for different things. Yeah.
Nikolay: Yeah.
Now, let's move to the technical
stuff.
What is helpful?
Very helpful, first of all: many small companies, like the ones we deal with. Our main focus right now is companies that are growing, startups, usually lacking database expertise. Many such companies come to us for help, and almost none of them have good incident management in place.
It's not only about Postgres, right?
We always suggest thinking about
at least a simple process because
they say, we had an incident last
week.
My question is, show us the incident
notes.
Are they logged anyhow, like with
timestamps?
In most cases, they don't have
anything but just words.
They have words.
Okay, we saw the database was slow,
then it was unresponsive, blah,
blah, blah.
What you must have for an incident is a sequence; we must have a document with artifacts: when the first known event happened, some logs, screenshots from monitoring, ideally with links so we can revisit it. But screenshots matter a lot, because sometimes monitoring has a small retention window and the investigation might be long, especially if you involve external consultants like us, right?
So there should be some template
and a plan for documenting incidents.
And when you have it, it also helps
with stress because you know
what to do.
You need to identify the first abnormal event and document it, plus things before it and things after it; there should be some form it takes. And everything you notice should also be documented, with important things highlighted.
It can be a Google Doc or something
with discussion around it.
It's good when it's possible to
discuss some things so people
can ask questions, clarify, add
some additional knowledge and
so on.
It can be anything, actually, right? But it's important to be prepared to document it.
Michael: Yeah, I've seen a lot of people start with whatever app you normally use for chat in the team, or some people have a different app specifically for incidents. But if you're using Slack, for example, start a new channel for the incident; all incident-related stuff goes in there. Screenshots, logs,
Nikolay: chat, everything.
Michael: And then people turn it
into a doc later sometimes.
But I could see an argument for
starting with the doc.
But normally people are panicking
at the beginning, so chat makes
sense.
Nikolay: Yeah, chat is more convenient
for many people.
It's what you use every day, so
chat is good.
It's important to have long-term storage for this, the chat converted into a document. And I can say that of the startups which grew to a terabyte or a couple of terabytes in terms of database size, most don't have a proper incident management workflow developed. They must have it. It's time already.
So yeah, I definitely encourage it. Even if you have a couple of technical teams and technical experts, it's still super important to have an incident management workflow developed. So yeah, detailed, step by step, so we understand what's happening.
And you agree on the format of this document in advance. You can use some other companies as examples. Again, sre.google and the GitLab handbook are useful for this particular area. GitLab, for example, has a template for incident management. Many other companies also share their templates and descriptions of how to document it properly. Super important.
And also, of course, sometimes you feel, okay, I'm documenting and documenting, but who will actually be solving the problem, right? So it's good if you have a few folks who can help each other: one of them is responsible for documenting, another is trying to find a quick solution.
And the document is also important to have because then, in bigger companies, we have a procedure called root cause analysis, RCA, right? To learn from mistakes, fix them, and prevent them in the future, right? That's why it's also important to document.
So this helps, and this is, I think, the fundamental number 1 technical thing you need to do. Oh, it's an organizational thing, sorry. But it includes some technical aspects.
For example, which monitoring do we use when an incident happens, right? Where do we start? This dashboard or that dashboard, right? What technical things must we document there? For example, of course, we care about CPU level and disk I/O, the basics, right?
Host stats.
If the database seems to be slow or
unresponsive, we must document
these things.
We had discussions about monitoring dashboard number 1; we proposed these things. Dashboard number 1 in our pgwatch2 Postgres.AI edition is designed for shallow but very wide analysis, very quick, like up to 1 minute of analysis of various components of Postgres and various properties at a very high level, a 30,000-foot view of the workload, to understand which directions to investigate further, right?
Michael: Yeah, where's
the issue?
Nikolay: Right. Yeah, this is very good to prepare in advance. I know, if something happens, how I will act and where I will start, right? Yeah, so this is important. And you will document it; you'll know how to start.
This is about monitoring and observability
and logs and so on.
Next, there are several particular
cases I can quickly share,
which are important to be prepared
for.
For example, of course, if you already know that the database has a transaction ID wraparound.
Michael: You can see straight away
that there's an error in the
log or something.
Nikolay: Yeah.
So we have cases very well documented from Sentry, MailChimp, and some others. And also we have very, very good work from Google, GCP: Hannu Krosing. He was on PostgresTV presenting his talk about how to handle transaction ID wraparound without single-user mode. He thinks single-user mode is not the right way to do it, but that's the traditional approach: single-user mode and a very long time processing the database, recovering its state. So you can document what to do if it happens someday, but I haven't seen it for so long, because monitoring has it, alerts have it, and so on.
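For reference, a minimal check of how close each database is to wraparound can be a single query like this (the 2-billion figure is the approximate hard limit; real alerting thresholds are usually much lower):

  -- Age of the oldest unfrozen transaction ID per database.
  -- autovacuum_freeze_max_age defaults to 200 million; the hard limit is about 2.1 billion.
  SELECT datname,
         age(datfrozenxid) AS xid_age,
         round(100.0 * age(datfrozenxid) / 2000000000, 2) AS pct_towards_wraparound
  FROM pg_database
  ORDER BY 2 DESC;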
Michael: And also recent versions have improvements in this area. I remember, I think, Peter Geoghegan did some good work around this, probably others too.
Nikolay: Yeah, I just started with a very scary thing. Another of the scariest things is loss of backups, when you cannot perform disaster recovery, right? Also very low-risk these days.
Michael: Yeah, I guess these days, sometimes major issues are things like the whole of us-east-1 being down. This hasn't really happened for a while, but a cloud regional outage. I feel like that could still take down a company's data. If you're using a managed service, or the cloud at all, you're at risk of that.
Obviously there, you can have plans
in place to mitigate that.
Nikolay: Even if it's self-managed,
not many people have multi-region
setup.
Michael: Exactly.
Nikolay: It's very hard, actually.
Michael: So if you don't have off-site
backups, you're sat there
thinking, "We just have to wait".
Nikolay: Yeah, it's a complex thing to have a proper multi-region and well-tested production, failover-tested very well, and so on. Yeah, it's a big topic actually.
So backups and transaction ID wraparound are probably the 2 nightmares of any Postgres DBA, right?
Michael: Are they the scariest
to you?
I think corruption is pretty scary.
Nikolay: Well, it's a good and interesting
topic.
Corruption, we had an episode about
corruption as well, right?
But this is good to put into incident preparation. If corruption happens, what will we do?
Some steps.
And the first step, according to wiki.postgresql.org, is to copy the database, right? Because when you try to fix it, maybe you will break it more, right? So copy. This is the first step. And knowing this helps, because this kind of thing you can know in advance.
By the way, transaction ID wraparound
you can practice as well.
There is a recipe I wrote on how to
simulate it, right?
So you can have it in a lower environment
and then good luck dealing
with it.
Or you can clone your database
and simulate it there.
So corruption is a very broad topic,
many types of corruption,
but some kinds can also be simulated.
There are tools for it.
So it's good to know it.
But in the cases I saw, in most cases there was some path to escape. In some cases, the escape was to just restore from backups, losing some data, and for that project it was acceptable. In some cases it was: okay, we just noticed that only pg_statistic is corrupted, so running ANALYZE fixes it. But long term, we see the database is on NFS and this is a no-no. Don't use NFS for PGDATA, right? In most cases where I saw corruption, it was something silly, actually.
But corruption also happens due to bugs, due to various things, or due to a mistake planning some major change, like switching to a new operating system with a new glibc version. Fortunately, over the last few years such corruption happened in non-production, so we fully prevented it.
Michael: Well, the reason
I find it scary is more that
we could have been returning bad
results.
Nikolay: Like it's just silent
corruption.
Michael: Yeah.
That's the fear to me: how far back does this go? Anyway, but it's a different kind of emergency.
Nikolay: Yeah, we had corruption due to indexes and a glibc change in production with one company last year. And it was our oversight. But fortunately, it happened only on standby nodes, which were not used. So it was a pure matter of luck that this production corruption caused no harm.
Michael: And no failover.
Nikolay: Yeah, yeah.
Other clusters used standby nodes.
This cluster didn't use it.
And we just saw some errors in... It was during an upgrade with logical replication. We saw errors in the logs, quickly reacted, and then realized these standby nodes were not used. Let's pray that a failover won't happen soon. Of course, just imagine... So we quickly mitigated this completely.
Nobody noticed.
But if it happens, yeah, the question
is how, like what's the
propagation here?
But there's also tooling, and learning from other people's mistakes helps, of course, as usual. And knowing tools like amcheck; it should be a very routine tool, used often, right? amcheck, to check B-tree indexes. Hopefully it will finally support other types of indexes soon. I think it's still a work in progress, right?
Michael: Yeah.
I saw some work going on.
Nikolay: Gin and GiST.
Yeah.
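As a sketch, a routine B-tree check could look like this (the index name is hypothetical; the two-argument form is slower but also verifies that every heap tuple is indexed):

  CREATE EXTENSION IF NOT EXISTS amcheck;

  -- Basic structural check of one B-tree index (takes only an AccessShareLock).
  SELECT bt_index_check('my_table_pkey');

  -- Heavier check: also verify all heap tuples are present in the index.
  SELECT bt_index_check('my_table_pkey', true);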
What else?
For example, the database is shutting down too slowly, it takes a lot of time, or starting up takes a lot of time. More than once I saw people being nervous, not understanding that it's normal, not understanding how to check the progress or what to expect.
And this happens when you perform checkpoint tuning, we also had an episode about it, and increase checkpoint_timeout and max_wal_size, which you should do on loaded systems. In this case, a restart, or just stopping or starting the database, might take many, many minutes.
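For context, that kind of checkpoint tuning is usually just a couple of settings; the values below are purely illustrative and workload-dependent:

  -- Spreading checkpoints out reduces write load, but lengthens crash recovery,
  -- which is exactly why startup after a crash or immediate shutdown can take longer.
  ALTER SYSTEM SET checkpoint_timeout = '30min';  -- default 5min
  ALTER SYSTEM SET max_wal_size = '32GB';         -- default 1GB
  SELECT pg_reload_conf();                        -- both settings apply on reload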
And if it's self-managed, I saw people send kill -9, SIGKILL, to Postgres because they are nervous, not understanding: "Oh, Postgres is not starting. What do we do?" And I think now, in fresh versions, there are some log messages telling you that we are in recovery mode and showing some progress, right? I thought about it.
Michael: I think it's very recent.
I think, I can't remember if it...
Nikolay: It should
be so.
I mean, it should be so.
It should be very straightforward.
A DBA should see the progress and
have an understanding of when it
will finish.
For older versions, definitely older than 16 at least, it's unclear. Usually, if it's self-managed, you just run ps to see what the process reports in its title, or top, right? And you see an LSN there; then you check pg_controldata to understand the point of consistency; and then, once you have 2 LSNs, you go to another Postgres and calculate the difference, which is in bytes. So you understand how many bytes, megabytes, gigabytes are left, and then you can monitor it every minute or every second, understand the progress, and have an ETA, an expected time of arrival, right? And this helps.
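On any other Postgres (version 10 or newer), turning those two LSNs into a byte distance is one function call; the LSN values below are just placeholders:

  -- Target LSN (e.g. minimum recovery ending location from pg_controldata)
  -- minus the LSN the startup process currently reports.
  SELECT pg_size_pretty(
           pg_wal_lsn_diff('5D4C/8A000000',    -- placeholder: target LSN
                           '5D4C/11000000')    -- placeholder: current replay LSN
         ) AS remaining;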
And I think it's a good idea to
learn how to do it.
For older versions and newer versions, I have a how-to about it, actually: what to do if Postgres startup or shutdown takes a long time. And yeah, it should just be learned, right? And if you're prepared, it will reduce stress.
And yeah, we had a lot of such cases working on DBLab. Sometimes a clone is not created. Why? It's because the WAL size is huge and so on. It's just recovering, so you just need to wait a little bit more. But then we improved it. So yeah, this might happen. This is a very common situation: long restart time.
Michael: Yeah, I'll definitely
share that episode in the show
notes as well so people can find
it if they weren't listening
back then.
Nikolay: What else? Somebody deleted data and you...
Michael: ...need to recover. We have other episodes, like out of disk. There are other kinds of emergencies. One we haven't covered in much detail, I don't think, was the big one: out of integers. Yeah, in short.
Nikolay: Oh, out of integers is
a big disaster.
Michael: Yeah.
Yeah.
But I guess that's quite common. In terms of other common issues people come to you with, is that up there, or what tends...
Nikolay: Maybe I'm biased here, because I have a feeling it's a very well-known problem and people have already mitigated it, or are mitigating it, without requiring a lot of expertise. Our Postgres checkup tool has a report for it, saying how much of the capacity of an int4, regular integer, primary key is left for a particular table. For me, it's straightforward already, right? And I don't see a big deal.
But if it happens, of course, it's at least partial downtime, because you cannot insert new rows into this table. And it can be scary.
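Without any special tooling, a rough version of that check for an int4 primary key fed by a sequence might be (Postgres 10 or newer; the sequence name is hypothetical):

  -- How much of the int4 range (max 2,147,483,647) the key sequence has consumed.
  SELECT last_value,
         round(100.0 * last_value / 2147483647, 2) AS pct_of_int4_used
  FROM pg_sequences
  WHERE sequencename = 'orders_id_seq';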
Michael: That's true of so many
of these issues though, right?
Like once you're monitoring for
them, once you know about them
and you've got alerts far enough
out, they become not emergencies.
Nikolay: Right. But I'd also like to mention a common problem: the database is slow, or the database is unresponsive. What to do? Very general. Where to start?
What do you think?
Michael: Well, I think that's the
monitoring thing, isn't it?
Like that's to go to the monitoring,
that number 1 dashboard
you talked about.
Nikolay: Yeah.
Michael: I think that's to work out where the problem is. It needs to be the first point, doesn't it?
Nikolay: Yeah, I agree. And the first thing I would start understanding... I think we can talk about methodologies here, starting from USE, right, and others; there are many of them. The USE method: utilization, saturation, errors, from Brendan Gregg, basics from Netflix, right?
It's a very, very trivial approach, I would say. But yeah, if the database is slow or unresponsive, the first question is: are we really putting more workload on it? A very simple question, but sometimes hard to answer. Because often we find out that many more clients are connected, or some background job started bombarding the database with new queries, retrying a lot of connections.
Michael: Ah, like a cascading effect.
Nikolay: As well.
Yeah, yeah, yeah.
Michael: Is it because of elephants, or
I actually don't know that term,
but it's like a...
Nikolay: So the question is whether more load is coming, externally, to the database. And this can, of course, be a reason why it's slow. And if it's not tuned well to handle spikes of load, for example, you keep max_connections high, ignoring advice from Postgres experts to keep it sane. Recently I saw (I'm sharing without names so I can share, right?) 12,000 max_connections. This is, for me, I think, a record.
A new client showed it and they
explained.
I see it's a trend.
Recently when I say you need to
decrease max_connections, I
also say most likely you will not
do it right now because most
people tend not to do it.
They all have reasons why max_connections
should be very high.
And of course, since Postgres,
I think, 14, things have improved
in terms of handling idle connections.
But when an incident happens, these
idle connections become active,
and we have almost 0 chances for
statements to be finished because
the server is overwhelmed with load.
Whereas, for a sane number of max_connections, I would say take your vCPU count and multiply it by some relatively low multiplier, less than 10. That should be max_connections for OLTP workloads. Then you have PgBouncer or something in front.
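As a sketch of that arithmetic, say 64 vCPUs and a multiplier of 5 (purely illustrative numbers):

  -- e.g. 64 vCPUs * 5 = 320 direct connections; everything else queues in the pooler.
  ALTER SYSTEM SET max_connections = 320;  -- takes effect only after a restart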
So if you have this and enormous load is coming, the additional load will receive an "out of connections" error, while existing transactions and sessions still have a chance to finish their current statements, their current query processing, new ones, and so on. So it's much better than trying to please everyone and failing to serve anyone at all, including your old clients, you know.
Michael: It also makes some diagnosis
easier, right?
If the database is still responding
to anything, it's easier
to diagnose issues than if it's
not responding at all.
Nikolay: Exactly.
Michael: It's kind of just moving
the problem, but it's definitely
an improvement.
Yeah.
But yeah, it's a good point.
It could just be overwhelmed, but there are a million other reasons it could be.
Nikolay: Of course.
But the first question, I would say, is: are we receiving more load? Then the reason is already outside of Postgres. Well, technically I just explained an additional factor, high max_connections; part of the problem is inside Postgres, but the main reason, the root cause, is outside. We're just receiving much more than usual. Right. This is the number 1 thing to check.
We don't have time to discuss a full recipe for troubleshooting such cases.
Michael: We've got an episode,
I think, for that.
Nikolay: Maybe, yeah, I already
keep forgetting.
Michael: Probably actually just
monitoring,
Nikolay: yeah.
Yeah, but maybe we should have one, like, how to troubleshoot a slow database, step by step. So, to save time, the second piece of advice, I would say, is to just check wait event analysis. That's the second thing.
If you have a lot of active sessions... actually, sometimes the database is slow without a lot of active sessions, which is interesting. But understanding the number of active sessions is very important. The next thing is to understand the state, what they are doing, right? Are they doing a lot of I/O, or is there contention related to the lock manager, for example, or sub-transactions, or anything like that? So wait event analysis is super important.
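If monitoring is unavailable, a quick manual version of this is a sketch like the following, run a few times in a row:

  -- Non-idle sessions grouped by what they are waiting on right now.
  -- A NULL wait_event for an active session generally means it is running on CPU.
  SELECT state, wait_event_type, wait_event, count(*)
  FROM pg_stat_activity
  WHERE state <> 'idle'
    AND pid <> pg_backend_pid()
  GROUP BY 1, 2, 3
  ORDER BY count(*) DESC;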
And we're discussing right now how to improve dashboard number 1 and... no, no, not that dashboard. Dashboard number 3, which is query analysis, in pgwatch Postgres.AI edition. And I'm almost convinced to put wait event query analysis at the top.
Previously, I was thinking we should have total time from pg_stat_statements and average time; maybe total time should be higher, and we had a long discussion inside the team about what should be higher. But now I'm almost convinced that wait event analysis should be at the very top, because it gives you a very quick understanding: from just 1 chart you quickly understand the number of active sessions and the distribution in terms of what they are doing.
In any analysis, when you have some number, the next step is to segment it properly, right? To divide this number into some segments. And I think wait events are a very good dimension for segmentation, so to say.
Michael: Yeah, it splits it into fewer things, so it's easier to spot if there's a majority. Whereas with query analysis, you could have a real long tail; even the most commonly executed query might only be 1% of your workload. Well, it might be 50%, but more likely it might be 1%.
Nikolay: Yeah, and the timing in pg_stat_statements hides details: it might be actual work the database is doing, and that's why it's spending time, for example sequential scans due to lack of indexes, or contention, or it might be waiting for a lock to be acquired. That also spends time, and with wait events you quickly see it.
So, very good books, as usual, from Brendan Gregg. On troubleshooting, I also remember his talks, a two-part talk about tooling for Linux and so on, and he mentioned: if he needed to choose just 1 Linux tool, you can use only 1 tool and want the biggest outcome in terms of troubleshooting, what is it? Do you remember? No?
Michael: No.
Nikolay: It's iostat. Oh, why? It gives you disk I/O, and it also reports CPU, segmented by user, system, iowait. So it's super good: you see disk I/O and CPU from just 1 tool.
Similar here, we see the number of
active sessions and also we
see wait events segmentation.
It's a very good chart to have
for troubleshooting.
Michael: It feels to me like an
interesting trade-off, like whether
you're looking at monitoring more
often or not even necessarily
more often, but do you optimize
for people in an incident or
do you optimize for people doing
general performance work?
And I think optimizing for the incident people makes some sense, even though it's less often, hopefully.
Nikolay: Yeah, they
have less time.
Michael: Less time,
but also heightened emotions
and not thinking straight like
we started.
So maybe that's a
Nikolay: Path should be shorter.
Yeah, yeah, I agree.
Right.
Yeah.
So there are many other things
that can happen with a database,
of course, right?
But if you know some common things,
it helps a lot.
Yeah.
And tooling should be prepared
and, yeah, observability is important.
Michael: Yeah, 1 last question.
I think there's some arguments
for trying to reduce incidents
down to like nearly 0, like trying
to put everything in place
so that you never have any incidents,
you know, high availability,
everything to try and minimize
the risk.
And then I think as a team, you
can get out of practice dealing
with incidents if you're good at
that kind of thing.
But then when one does happen, it
can really throw you.
Some teams like to deal with super
minor incidents and treat
those as incidents, almost like
as practice.
Do you have any opinions or feelings
around that kind of thing?
Nikolay: Yeah, good point.
So we actually didn't discuss many things, for example, how to categorize incidents, like priority 1, priority 2, and so on. Because when a client comes, and it happened a couple of times over the last month, a client comes and shows me some graphs with spikes of active sessions significantly exceeding the CPU count, I already say: oh, you are having at least, you know, a P3 incident, or maybe P2.
Maybe it's not user-facing, people haven't noticed it, but it's an incident already. It requires investigation. They say the database is slow, but this already needs some reaction and mitigation. So it maybe requires understanding, expertise, and classification rules, which require PostgreSQL understanding, right?
Because sometimes I have a hard time convincing people that if you have, I don't know, 64 cores, but the active session count jumped to 200 or 300, it's already not normal. They say: well, it worked.
Michael: No one complained?
Nikolay: Yeah, yeah, yeah.
Well, it worked.
And part of the problem in Postgres is that we don't have a good metric for average latency, for example, for query processing, because the database's job is, like, we want query processing to not produce errors and to be fast. Fast: we have a definition of fast for the OLTP case. I have an article about it. Definitely it's not 1 second; it should be below that. It should be below 100 milliseconds, and in most cases below 10 milliseconds, because 1 HTTP request usually consists of multiple SQL queries, in many cases. And human perception is 200 milliseconds, so we have some threshold already; let's keep latency low.
But the funny thing is, Postgres doesn't have average latency exposed. It doesn't. pg_stat_database doesn't have it. Nothing has it. Only pg_stat_statements. But it's not precise.
Michael: It's not in core.
Nikolay: It's not in core. And it's not precise. There is pg_stat_statements.max, 5,000 by default. In some cases the workload is complex and there is constant eviction of records from pg_stat_statements and appearance of new ones. So latency is measured from pg_stat_statements; this is what most monitoring systems do, including dashboard number 1 we discussed earlier, from pgwatch2 Postgres.AI edition.
But it feels not fully reliable, right? Still, this is important, because this is how we can say: okay, it's really slow, but how much slower? We had sub-millisecond latency; now we have 5 millisecond latency. Okay, indeed, there's proof of it.
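A rough overall average can be pulled like this (column names assume Postgres 13 or newer, where total_exec_time replaced total_time; it stays approximate for the eviction reasons above):

  -- Approximate average statement latency over everything pg_stat_statements
  -- currently tracks; resets and evicted entries make it imprecise.
  SELECT round((sum(total_exec_time) / nullif(sum(calls), 0))::numeric, 3) AS avg_latency_ms
  FROM pg_stat_statements;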
I like that PgBouncer reports it.
I was
going to
ask, yeah.
It reports it in its logs and in its stats.
This is great.
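From memory, it is exposed on the PgBouncer admin console roughly like this (exact column names vary between PgBouncer versions):

  -- Connect to the special "pgbouncer" admin database, then:
  SHOW STATS;
  -- avg_query_time there is an average query duration, reported in microseconds.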
This is what we should have, honestly,
in Postgres as well.
But, yeah, I actually don't remember
discussions about it.
There should be some discussions.
So, this is maybe our main characteristic of performance. I wish, of course, we had percentiles, not only the average.
Many people monitor it from the client side. Datadog has APM and the ability to monitor it from the client side, but that is not purely database latency, because it includes round trips, RTTs, round trip times, network, right? And that should be excluded if we talk about the database, to understand the behavior of the database, right?
So yeah, pg_stat_statements, this
is how we understand latency.
And yeah, if it's slow, it's slow.
And then we need to, again, apply
segmentation and top-down analysis
and find what exactly is slow.
Everything or just some of it,
right?
So, P2, P3 incidents: I think for smaller companies it's hard in terms of the database. It's possible, but it's too much work, maybe.
Michael: Well, but I also think there could be an argument for making incidents a bit more normal in your team and less stressful, for when you do have a stressful one, or a big one that's a bigger deal.
Nikolay: I see your point.
Unless your team is overwhelmed with P1 incidents, which I also had in my team, actually; I saw it, like every day we had the database down. Unless that's the case, if you don't have database incidents it's a good idea to say: okay, let's look for P2, P3 incidents and start processing them routinely, so we build a muscle for incident management. It's great advice.
Michael: Cool.
Nikolay: Indeed, Indeed.
Yeah.
Yeah, maybe that's it.
Let's let's wrap it up.
Michael: Sounds good.
Thanks so much, Nikolay.
Catch you next week.