
What to log
Michael: Hello and welcome to Postgres.FM, a weekly show about
all things PostgreSQL.
I am Michael, founder of pgMustard, and this is Nik, founder
of Postgres.AI.
Hello Nik, how's it going?
Nikolay: Hello Michael, everything's all right.
Let's discuss logs.
Everyone loves to deal with logs, right?
Michael: What do you mean?
Nikolay: Well, I mean, if you have a huge server, you have a lot of logs, so it's good to deal with them.
Michael: Well, I guess we're blessed in Postgres with a few ways
that we don't always have to check logs, right?
There are some system views where we can look at various information.
Nikolay: For example corruption, right?
There's no such view.
Michael: Yeah, yeah, there are quite a few things for which there's no such view.
Well there's quite a lot of information that's put in the logs
and isn't available elsewhere.
So it is important.
And when I say it's put in the logs, actually quite often it isn't even put in the logs unless you change some settings.
So I thought, you had this good idea to discuss all of the things that most people should change in terms of what to log, so that we can then check the logs for issues when we're doing performance analysis and that kind of thing.
Nikolay: Yeah. So I think, ideally, for everything we should have both system views and detailed logging.
Michael: Oh both?
Nikolay: Yeah, because, like, think about pg_stat_statements: we have normalized queries only; it's not possible to have every query there with all parameters, because the cardinality would be insane.
Michael: But equally, I think it's quite often not possible to
have every query logged to the logs, right?
Like there's also limitations there.
Nikolay: Exactly.
So pros and cons of both approaches.
Cumulative metrics, they are good.
You can visualize, you can see spikes, everything, but they lack
details.
And the same, for example, with temporary files, right?
We can see at a high level, from pg_stat_database, that, okay, we generate a lot of temporary files.
Some queries generate a lot of temporary files.
We can go down to pg_stat_statements and see which query groups
or normalized queries produce those temporary files.
It's already super helpful.
But obviously, the same normalized query, if you take 1 set of
parameters, it can produce gigabytes of temporary files.
Another set of parameters, it can produce 0 temporary files.
Depending on parameters, it can be very different.
So we need examples, and this is exactly when we need to look
at the logs.
But at the same time, if we log everything, the observer effect is not a joke.
And I must confess, several big clients suffered from the observer effect when I decided to log too much.
It caused downtime, and it was not a pleasant experience.
So we always need to think about the observer effect.
Those who say log everything, log all queries, just don't deal with really heavily loaded systems, unfortunately.
Yeah.
Just try to do it in heavily loaded systems which have like 100,000 TPS.
And that's it.
You will bring everything down very quickly, and you will learn it's not the way to go.
So we need to limit the amount of logging, but we cannot avoid
logging because we need examples.
So my approach is to combine both things.
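A rough sketch of the "cumulative metrics" side of this, runnable in psql; the columns are the standard ones in pg_stat_database and pg_stat_statements (the latter requires the extension to be installed), and the LIMIT is arbitrary:

```sql
-- Database-wide temp file counters (cumulative since the last stats reset).
SELECT datname, temp_files, temp_bytes
FROM pg_stat_database
WHERE datname = current_database();

-- Per normalized query: which query groups spill the most (needs pg_stat_statements).
SELECT queryid, calls, temp_blks_written, temp_blks_read
FROM pg_stat_statements
ORDER BY temp_blks_written DESC
LIMIT 10;
```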
Michael: Yeah, it's a balance, isn't it?
We need to record enough information to be able to diagnose issues
when they occur, but not so much that there's excessive overhead.
So it's a balancing act.
Nikolay: Right.
And you need to think, like, it's better to think in advance about what to log, because if some incident happens and you somehow mitigate it, you want to do a root cause analysis, but the logs lack this information.
So you can only guess and you can guess quite well if you have
a lot of experience, but still it's better to have logging
tuned properly in advance.
And I see Postgres, in the last few years, improving defaults, but the defaults are usually still not enough.
We always tune some defaults for our clients.
I mean, we tune some log-related settings to have more logging.
Michael: Yeah, in my research for this I found 3 different parameters that were all either introduced, or had their default changed from off to some version of on, in Postgres 15.
So I was kind of expecting a gradual improvement in this over time, but it seems almost, I don't know if it's a coincidence or not, like 1 major version where quite a few things improved, and I haven't seen a whole lot of movement in other areas, unfortunately.
Nikolay: Yeah, you're probably right. Defaults changed in 15, you mean log_checkpoints, and log_autovacuum_min_duration set to 10 minutes, which is still enormous, like it's a huge value.
And what was the third?
Michael: And there's the new log_startup_progress_interval.
Nikolay: But it's new, it's not a default change.
But the decision was made not to introduce it turned off by default, which is a great decision.
I agree.
And it's quite like 10 seconds, right?
So every 10 seconds.
Yeah, let's discuss everything, everything we have, right?
Let's start from something and move from topic to topic.
But I agree, so 15 was revolutionary in terms of bringing good defaults in logging.
Michael: In some areas, yeah.
Nikolay: Yeah. And also worth mentioning, the Postgres log, it's always a single log.
So you can imagine if we had the ability to separate streams of logging, for example, slow queries could go to 1 place, maybe to faster disks, or maybe to...
There is this idea, a very old idea, that some logging should go not to the local disk but straight to some storage over the network, UDP, and that's it, just send it.
So we could log a lot more stuff.
Although when you ask the folks who support, I don't know, Elastic or something, which is receiving a lot of logging from different systems, they also would like to limit it, you know?
Or Splunk, for example.
Splunk is an old commercial enterprise
system for logs.
They can deal with huge volumes, but they're always still cautious about how much you are going to send them.
Michael: Yeah.
It's a good point, though.
Other databases do have...
I'm pretty sure MySQL has a separate
slow query log, because
I've seen people familiar with
MySQL asking about where's Postgres'
slow query log?
Nikolay: Yeah, it's not a big change, it could be done, I mean it's easy to implement, but I imagine how many infrastructure changes would need to be made in all the systems if the slow query log were separated.
What do you think?
Michael: What, am I missing something, though?
What's the benefit of having it
separated out?
Nikolay: You can put it on different disks, faster disks for example; analysis-wise it's also easier.
Michael: Because the volume is
likely to be so much higher than
the other
Nikolay: Yeah, maybe. Some rotation policies could be different.
And analysis-wise, by nature, I see a bunch of log messages which are related to query optimization.
And I also see a bunch of log messages which are related to startup, autovacuum, checkpointing, like, not about query processing, but everything else, right?
And separating them, to me, sounds reasonable.
So imagine it would be possible, like, what do you think?
Imagine there is a new setting which says everything related to query processing, including log_min_duration_statement-based entries.
Everything that can contain a query text, actually.
Everything.
Like, for example, slow queries exceeding log_min_duration_statement.
Queries that generated temporary files.
Queries which are slow because they are waiting for other queries and get logged after deadlock_timeout, 1 second by default, right?
Imagine all of them go to 1 place, and everything else stays in the old place.
But it's optional.
You can decide to choose it or not.
Michael: Yeah, I haven't thought long and
hard about it, but I do quite
like the simplicity of everything
being in 1 place.
But it can also be helpful, right?
I understand how if you're not
worried about slow queries and
you're looking for other things,
they can be really noisy and
get in the way.
And in a heavily loaded environment,
it would just be impossible
without tools to then get anything
useful out of that.
But like an autovacuum could be
the cause of some slow, like
some issue that you're facing with
query performance.
Or checkpoint could be the cause.
And maybe in a less heavily loaded environment where you're reviewing it manually, seeing that in order might help you debug what's actually happening here.
Nikolay: Yeah, makes sense, makes sense. And also, some errors: what will happen with queries which are erroring out?
Yeah.
But, but still, like, it could be a choice, right?
Because there are people who are interested in errors, corruption-related errors, I hope we will discuss them as well, and maybe checkpointer behavior and so on, and less interested in query processing itself.
They often say, you know, your huge queries are being logged just because they exceed 1 second, and they make our log really hard to read.
Yeah.
Well,
Michael: hard to read manually,
right?
With tools you can filter to the
bits.
Nikolay: Or with DuckDB: you can have CSV logs and talk to them through DuckDB queries, which I think is a great idea.
Quite a new and great idea.
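A hedged sketch of that CSV idea: csvlog requires the logging collector, which needs a restart, and the log path below is only an assumed default layout.

```sql
-- Write logs in CSV format alongside stderr (restart required for logging_collector).
ALTER SYSTEM SET logging_collector = on;
ALTER SYSTEM SET log_destination = 'stderr,csvlog';

-- The resulting .csv files can then be queried from DuckDB, for example:
--   SELECT * FROM read_csv_auto('/var/lib/postgresql/data/log/*.csv') LIMIT 10;
-- (the path is an assumption; adjust it to your data directory and log_directory)
```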
Michael: But I like this almost
as a way of splitting our conversation.
You mentioned the performance related
stuff and everything else.
Where do you want to start with
1 of those?
Nikolay: Yeah, we already talked
about log_checkpoints.
Let's maybe talk about this.
So in 15 it was changed to on and that's it, that's great.
I mean, it's easy.
Like I think everyone needs it.
And finally in 15. How long ago was it?
3 years ago.
Michael: Yeah, nearly.
Nikolay: It was realized that everyone should have it, everyone.
Michael: So our advice to people would be, if you're on a version
older than that, turn it on.
Nikolay: 100% so.
Michael: If it's not already.
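A minimal sketch of that advice for older versions; it takes effect on reload, no restart needed:

```sql
SHOW log_checkpoints;                     -- 'on' by default only since Postgres 15
ALTER SYSTEM SET log_checkpoints = on;
SELECT pg_reload_conf();
```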
Nikolay: Yeah, and another thing happened in 15, related to autovacuum logging.
So log_autovacuum_min_duration, I think before that it was minus 1.
So it means autovacuum is not logged at all.
And it was changed to 10 minutes.
In my opinion, it's not enough.
10 minutes is way too long.
It should be 10 seconds.
In some places we set it to 0 to see everything.
Of course, I can imagine, especially if you have many workers, and you also tuned autovacuum_naptime to a very low value so the workers sleep less between runs, right?
In this case, I imagine the amount of logging from autovacuum workers might be substantial.
But at least a 10-second or a few-seconds threshold is good.
Because if you have a lot of tables, with 10 minutes you will miss a lot of details, interesting details.
Michael: And it's not that frequent.
If we go back to that, you know, recording enough information to diagnose issues without excessive overhead: the overhead of logging all of your autovacuums, unless you have, as you say, a really very low naptime and maybe lots and lots of tables, like maybe you've got thousands of tables, thousands of schemas, like if they've multiplied out, I think I heard somebody that was talking about having millions of objects in their database.
Nikolay: All right, do you remember, if you have, for example, a naptime of 1 second, you have 10 workers, for example, and you have a long-running transaction holding the xmin horizon, so autovacuum cannot delete some dead tuples.
How often will those workers visit this table, which needs processing?
Maybe they will do it in a loop, right?
Michael: Yeah, good point.
I don't know.
Nikolay: And if the table is very small, they quickly process it, sleep 1 second, process it again, sleep 1 second, so it will be a lot of spam, right?
Unless there is additional logic in autovacuum, I don't remember 100%.
It would be good to double-check.
Michael: Worth checking, but also, once per second, when we're talking about logging, when we're talking about potential volume, it's also still not that bad.
Nikolay: But 10 workers, thousands of tables.
I can imagine something like this.
But also, nobody...
Like, the default naptime is 1 minute.
And the default number of workers is 3.
Yeah.
So I...
Anyway, like, I definitely...
We always recommend to our clients to see more of the autovacuum work in the logs.
It's really helpful: especially during some incidents, or after some incidents, we have this useful information to understand what happened in terms of data that cannot be deleted.
It also reports the xmin horizon, by the way.
It reports, like, the delta between the current transaction ID and the xmin horizon, so it's super useful.
Yeah.
Michael: Nice. So just to be clear: if you're on older than 15, make sure you have this set to something so that you're actually logging autovacuum at all.
And even if you're on 15 or above,
consider lowering it substantially
from the default.
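A sketch of that recommendation; the 10-second threshold is illustrative, not a one-size-fits-all value:

```sql
-- Default is -1 (off) before Postgres 15 and 10min on 15+.
ALTER SYSTEM SET log_autovacuum_min_duration = '10s';
SELECT pg_reload_conf();
```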
Nikolay: Yeah, 1 more thing here
is that if we don't have activity
at all, in this case, it will be
producing some log entries which
are probably not super useful.
Michael: If we don't have any autovacuum activity, or any kind of updates or deletes.
Nikolay: Yeah, or maybe it's SELECT-only for a long time.
Michael: But that was the argument
for not having checkpoint
logging on, if there's a silent
database.
Nikolay: There was a battle about
it, yeah.
Michael: Yeah, but we're talking
about very, very few log entries
in the grand scheme of things still?
Nikolay: I just mean if, for example, you build some service, you have lots of databases, or you build a platform for other people.
Michael: Sure, sure.
Nikolay: And you can imagine like
many of them will have databases
which are not super active, but
logging has costs.
Say you have like 10,000 databases and 9,000 of them are not super active, and you will be paying for this disk space for logs, right?
This will make you think differently
and try to save.
But this is exactly, like, a different situation compared to: okay, I have 1 database, it's a hot startup, an AI startup, it's growing.
It's a single monolith, or a few monoliths, or shards, it doesn't matter.
And we need to understand our workload
really well.
In this case, we need to log everything,
right?
Michael: Yeah, I get what you mean.
But going back to, let's say, the platform that has 10,000 databases of which 9,000 are quiet, in that case why not rotate, like, just delete logs more often?
Nikolay: Good point. So yeah, if you, for example, understand that you're going to log something that's probably not super useful in a not-loaded system, but you already know the maximum amount of that, how many lines it will be, and you know it will be rotated, then there is just a constant price for each node that you're going to accept.
And logs also can be compressed
really well.
Yeah I agree, I agree with this
point as well.
I just wanted us not to forget
about different perspectives people
might have.
Michael: Yeah, great.
Nikolay: 1 big pet versus a huge herd of cattle you need to take care of.
So yeah.
Good.
What about autovacuum, that's it, right?
So we need to know the details, especially about larger tables, especially when autovacuum is blocked.
And if you're, for example, a backend developer, I think it's interesting for you as well, because it may sound like autovacuum is just, okay, a garbage collector in Postgres, right?
But if you have a chance to deal with the performance of the system you develop, the garbage collection part is very important, because every time you deleted or updated something, or had an INSERT with a ROLLBACK, you produced dead tuples.
And basically, if you deleted something, did a COMMIT, and returned something to your application code, it's only half of the work done.
autovacuum will do the remaining half of the work, so to speak.
And you definitely want to know when it happened and in which volumes.
At some point you will probably need to process a lot of rows, like delete millions of rows.
We had an episode about this.
And people keep asking, I saw comments yesterday, comments like, okay, deletes, but what about updates?
Well, it's similar: updates also produce a new tuple and mark the old tuple as dead, and it's the same process that will clean it up.
And we need to understand its behavior.
And if we want to move really fast, we need to take care of cleaning, otherwise we will see a lot of huge gaps in storage.
Space won't be reclaimed, it won't be reused very quickly, and this is called bloat, right?
So we don't want this.
So we need to understand garbage collection in Postgres, which is called autovacuum.
Yeah, this is, like, that's why you want autovacuum logging, even if you are not a DBA but just a backend developer.
This was my point.
Michael: Yeah.
And it's just like you mentioned at the beginning, there are system views where we can look at when the last time autovacuum ran on each table was, which is useful.
But you can't see the couple of times before that, or...
Nikolay: You don't see details.
How many...
Yes, exactly.
Was the xmin horizon a problem, actually?
Yeah.
I wonder, by the way, is there some work in progress to extend it?
Michael: I don't know.
Nikolay: Because it's worth extending, I think.
pg_stat_all_tables or pg_stat_user_tables.
It's useful, but it's just a counter and a last timestamp.
Not enough.
Well, 2 counters, 2 last timestamps, for manual and automatic vacuuming.
Okay, good.
In the same release, just to close with Postgres 15, there was this log_startup_progress_interval.
I don't think it's for backend engineers.
It's more for DBAs, because if they tune checkpointer behavior and the distance between checkpoints is huge, significant, in this case startup might take a long time, and it's great that now we have this setting, and as you said, the default is quite good.
So we don't have to think about it at all.
Yeah, what's next?
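For reference, a sketch of the setting just mentioned; 10 seconds is the shipped default on Postgres 15+:

```sql
-- Logs progress of long-running startup operations (e.g. crash recovery) at this interval.
ALTER SYSTEM SET log_startup_progress_interval = '10s';
SELECT pg_reload_conf();
```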
Michael: What about log_statement?
We talked about that in a recent
episode as well.
The default there is none, right?
Nikolay: Yeah, exactly.
And I like to recommend people to set ddl, to put ddl there, just to trace all schema changes: when it was done, who did it.
Of course, the actor may override it.
Unless, yeah, so, but it's good to have some default there.
And I think there's actually an ability to forbid overriding some settings, right?
Michael: So I think there's almost 2 categories here, right?
You're talking about almost the security element of it.
But I think even if we assume it's the general case, it's still extremely useful to see when DDL changes were happening in the context of everything else.
Nikolay: The timestamp, the exact query.
For example, it may save you time in the future if you start dealing with logical replication, doing some experiments, and it's blocked by DDL.
You can trace when exactly it happened and which DDL it was, which is great.
Michael: Or even look back historically over the past 30 days and be like, which days do people run DDL on, and which ones don't they, you know, so that you could plan an upgrade around when it's unlikely to have DDL happening.
Nikolay: It's super helpful in various incident analyses, when you have a deployment and something went wrong.
We discussed it many times.
DDL, if it's not cooked properly, may be a reason for downtime or some issues.
And if you have log_statement equal
DDL, you can trace it.
Yes.
Yeah.
At least after the incident happened.
So, definitely DDL is a good idea.
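A sketch of the DDL-tracing setup discussed here; the "who" and "when" come from log_line_prefix escapes such as %m, %u and %d, which are assumed to be part of your prefix:

```sql
-- Log every CREATE / ALTER / DROP and similar statements.
ALTER SYSTEM SET log_statement = 'ddl';
SELECT pg_reload_conf();
```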
Michael: And the cost is very low,
right?
How much DDL are you running on
your system?
Nikolay: It's not a thousand per second, definitely.
Michael: No.
I hope.
Nikolay: You know, I see systems which
have DDL as a part of their
normal workload coming from users.
It happens.
Well, for example, if it's a multi-tenant
system, which creates
tables for each customer and if
they extend functionality, they
also create something.
Also temporary tables, which I
don't like at all.
Like I avoid temporary tables at
all costs, but people use them
and temporary table creation is
also DDL.
Michael: Yeah, good point.
Nikolay: So I can imagine that
there are cases when it can happen,
like we have a storm of DDL coming to the logs, but for me it's kind of an edge case rather than the norm.
Michael: So, great.
Yeah.
On the security side, I wondered about log_connections and log_disconnections, which are both off by default.
Nikolay: I don't think... this is something I sometimes see people turn on.
It depends on the particular system.
In many cases, it's off.
It's all off by default.
And this information is usually most interesting, it looks most interesting, to DBAs and DBREs, not backend developers.
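For completeness, a minimal sketch of the two settings touched on here, both off by default:

```sql
ALTER SYSTEM SET log_connections = on;
ALTER SYSTEM SET log_disconnections = on;
SELECT pg_reload_conf();
```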
Let's move on closer to workloads
and query optimization maybe.
Michael: Yeah, and I think we did
have an episode either about
pgaudit or about, did we do it
specifically about pgaudit or
about security or something like
that?
So I'll link up that episode because
I think that covers probably
the more interesting parts of the
DBA side of things.
Nikolay: Sounds good.
Yeah, maybe let's talk about log_lock_waits before we focus on workload analysis.
It's also part of workload analysis, just a specific part related to logs.
And we also discussed it recently,
I remember.
And this is definitely what we
recommend turning on because it's
off by default.
Yeah, because it's helpful.
Also during incidents, that's helpful.
If you're developing your app and didn't think properly about multi-session concurrency control and blocking situations, you might have incidents, a spike of active sessions, or maybe even downtime caused by heavy lock contention.
And having logs related to this is really helpful.
Really helpful, because without it, you can only diagnose something at runtime.
Okay, you see there's a spike of
exclusive locks.
So, with log_lock_waits, you will see that some session tried to run some query, but it was blocked by another session.
And it was waiting, our query was waiting, longer than deadlock_timeout, which is 1 second by default.
So if you, for example, try to update 1 row, but there is another transaction which is not committed yet, it may have already updated this row, but it's not committed, and we need to wait until the COMMIT or ROLLBACK of that transaction.
And our update lasts, like, the waiting period lasts 1 second.
In some systems it may be longer or shorter if deadlock_timeout was adjusted.
And this gets logged, and you see, okay, our session, this process ID executing this query, was waiting for long, and it was waiting for those process IDs.
Unfortunately, we don't see details about those sessions.
This is a disadvantage.
But it's already half of the coin visible.
Michael: Yeah, well, imagine trying to debug without this; reproducing the ordering for locking issues just seems to me to be so difficult without this kind of information.
Nikolay: Yeah, also it's helpful
if you have deadlock issues,
and you try to understand your
system better, this kind of logging
is helpful.
I cannot say it will answer all
your questions, unfortunately.
I had cases when deadlock troubleshooting
was extremely challenging,
and we had logging, we had everything,
but it was still challenging.
Sometimes it's really challenging.
You need to go to the application code to understand what exactly it's doing, but it's already something, and it's better to always have it.
Michael: Yeah, and this to me seems
like a no-brainer.
The trickier 1, balance-wise, seems
to be the time, whether to
reduce it from 1 second or increase
it.
And that seems quite loaded because
it's doing 2 things, right?
It's determining at what time lock waits get logged, but also at what time deadlocks get detected.
Nikolay: Deadlock detection happens.
Michael: Yeah, so it's doing 2
things.
Nikolay: I can, I understand the decision from an engineering perspective.
It was like, where should we log it?
Oh, this is where we're checking it.
It's a good time to log it, exactly at that time.
So I can imagine this decision, why it happened.
But I agree with you, there is some lack of flexibility, a little bit.
But it's also protecting us from huge volumes.
Like, let's talk a little bit about the observer effect.
Here I don't see it, actually.
So for each session, there is this 1-second delay, basically, by default, before it gets logged.
If we have a lot of sessions, okay, how many?
100?
100 active sessions.
Michael: Max connections.
Nikolay: Yeah, we might have 100 of these messages per second.
Well, it's a significant volume, but I think on a server with 100 active sessions, and I hope with many more cores, the disks are already quite powerful, so they can handle 100 entries per second for this type of logging.
But it might contribute in some cases, I think.
Again, in some edge or corner cases, it might contribute a lot.
So it's worth understanding.
But personally, I've never seen
the case when we decided to turn
it off.
And I think it should be on by
default, but it's off by default.
So advice from us, turn it on.
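A sketch of that advice; deadlock_timeout is shown only because it doubles as the lock-wait logging threshold, and lowering it is a separate, heavier decision:

```sql
ALTER SYSTEM SET log_lock_waits = on;
SHOW deadlock_timeout;        -- 1s by default; waits longer than this get logged
SELECT pg_reload_conf();
```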
Michael: Yeah, good 1.
Nikolay: What's next?
Michael: Should we move on to performance
things?
Nikolay: Yeah, maybe temporary
files first.
Michael: Yeah, good 1.
Kind of bridge the gap a little
bit.
Nikolay: Yeah, and by default it's disabled, minus 1.
Michael: Yes, so this is, in fact, actually I looked into it, this is logging when the temporary file gets deleted, interestingly.
But what's happening is a query is using some temporary storage, for example if a sort spills to disk, or it doesn't have enough working memory available to do a hash join, or any operation that uses temporary data.
So I had this under performance, but I guess it is also just general system-wide tuning as well, around config parameters like work_mem and, more recently, hash_mem_multiplier.
Nikolay: Yeah, yeah.
Yeah, I can imagine many queries will start producing a lot of logs if they need temporary files.
It's such a multi-sided question to me, because if you try to save on logging of temporary file generation cases, deletion cases, as you said, you already have issues with dealing with disk.
The query processing itself was dealing with disk.
It's better to capture that info and try to optimize it as soon as possible.
But will I put 0 there?
0 means let's log all the cases
even if it's just 8 kilobytes.
Maybe no, but what I would do,
what we usually recommend, I'm
sharing 3 pieces of advice today.
We say minus 1 is definitely not
good.
And of course, in some cases, you don't see temporary files generated right now, looking at pg_stat_activity or pg_stat_statements, because they also have a couple of columns for temporary files written and read, so you can see it there as well.
If you don't see a big problem right now, maybe okay, maybe not, let's not log it.
But why not be prepared for the future, right?
Michael: Yeah, I would say the
opposite.
I'd say if you're not seeing them
right now, you could even afford
to set it to 0 because you're just
not going to get much volume.
Nikolay: What if tomorrow memory is not enough, and we start logging very tiny amounts, tiny numbers, and it will be a lot of cases?
Michael: Well, it depends if you've tuned work_mem.
Even if work_mem is the default 4MB, it would most likely only be logging queries that are using more than 4MB of memory in sorts and things, so how many of those per second are you realistically running?
It seems unlikely to me that it's
going to be super high volume.
Nikolay: First of all, work_mem should be increased.
Yes.
Right.
But if you keep it very low, and suddenly data volumes become bigger and bigger, and for hash operations and ordering we need more and more, at some point we slightly exceed 4 megabytes, and many processes do it, for example, and boom, you need to generate temporary files, and with log_temp_files equal to 0 each of them adds additional text to the log, right?
My approach is different.
If we don't see it right now, but we grow, this is a scary moment for me.
I would prefer seeing very heavily loaded systems already having some temporary files.
And we understand them.
And we go from quite a significant value, like 10 megabytes, and we go down in stages.
Restart is not needed, by the way, for all the settings we discussed today.
So far, restart is not needed, which is perfect for going in iterations gradually, right?
So we gradually descend towards 0, but maybe not to 0, maybe 200 kilobytes or something.
Michael: But what's the difference?
Like it's on.
Are you going to get temporary
files that are 100 kilobytes?
Nikolay: Well, if we put 100 kilobytes into log_temp_files, we will see only the cases which exceed it.
Michael: Yeah, but what I mean is, if work_mem is 4MB and hash_mem_multiplier is 2 by default, we're only realistically going to have temp files that are, very roughly, 4MB to 8MB or more.
Nikolay: I understood you.
Well, obviously we think about
it differently and I'm not sure
who is right.
You think when memory is not enough, everything goes to a temporary file.
I think when memory is not enough, only the delta goes to a temporary file, which is a...
Michael: Good, good point.
I think it's actually different
in 2 different cases.
So, for example, like sorts, I
think it's all or nothing.
But I think, for example, with bitmap scans that have non-exact...
Nikolay: Michael, we have homework
to do.
Yeah.
Michael: It's a good point, though.
You're right.
There probably are some operations.
I think including bitmap scans.
Nikolay: This is an interesting question.
Yeah, I'll talk to some hackers.
It might sound very simple, actually.
Michael: I'm pretty sure with sorts
it's all or nothing.
But it's a good point that going
down in stages makes a load
of sense.
And I was just trying to say I
feel like you can skip it if you've
currently got it.
Nikolay: Yeah.
Why go down or nothing?
I understand this logic, yeah.
Michael: But for me, low is almost as good as 0.
Like, there's no difference realistically.
It's 100 kilobytes, great.
If that makes things much safer in extremely heavily loaded systems
with lots of potential max connections, great, low is still good.
Nikolay: Yeah.
But again, this is a useful parameter because, as we discussed at the beginning already, you can have cases where you see some normalized query, but you need examples to understand how to reproduce it and then decide what to do with it.
Raise work_mem globally or locally, that will be the question.
Michael: Or for 1 user, like an analytics user, that kind of thing.
Nikolay: Right, right, right, yeah.
Yeah, so okay, good.
And memory management might be quite challenging, like it's sometimes
not trivial.
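A sketch of the staged descent for log_temp_files described above; all values are illustrative:

```sql
ALTER SYSTEM SET log_temp_files = '10MB';   -- start high on a busy system
SELECT pg_reload_conf();
-- ...then lower it in iterations: '1MB', '100kB', maybe eventually 0 (log every temp file).

-- The settings that decide when spills happen in the first place:
SHOW work_mem;              -- 4MB by default
SHOW hash_mem_multiplier;   -- 2.0 by default on recent versions
```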
Good. So let's talk about the main part of query analysis.
pg_stat_statements is not enough, right?
So you need to have quite a low log_min_duration_statement,
at least 1 second, preferably lower, like half a second, maybe
100 milliseconds, to capture more cases of slow queries and then
optimize them.
Right?
Michael: Yeah, absolutely.
And I like your point about them being examples.
So it's not necessarily that the average of that query is going
to be running in 100 milliseconds.
We want to catch the outliers.
Like, what are the parameters for that query that mean it does
tend to run over 100 milliseconds?
So we get those examples.
It's perfect.
Nikolay: And
Michael: again, 1 that we can lower in stages?
Nikolay: Yeah, definitely.
So I would start, if it's never been set, like if it's a big unknown, from 5 seconds and go down.
It doesn't make any sense to start with a really large value, because I assume we're talking about OLTP systems, where, for example, 10 seconds is already an absolutely slow query.
I usually see 1 second or half a second as the value being used these days.
And if you've already performed a lot of optimization efforts and you want to improve even further, you can go down.
But maybe, maybe it's worth it, instead of this...
Oh, before we go there, there's actually an idea: you can go down to 0 but switch to sampling.
There is statement-level sampling and transaction-level sampling, which have been supported since a few years ago.
Michael: Yeah, you can have 2 levels,
can't you?
You can set log_min_duration_statement
to, let's say, 100 milliseconds,
and that will log everything above
100.
Then you can have like a second
threshold that is log_min_duration_sample
that's like, you could say
10 milliseconds and then you
separately set a log_statement_sample_rate
to, you could say
like 0.1 and that would do 10%
or 0.01, it would do 1%.
And that would log, let's say,
1% of the statements that were
between 10 and 100 milliseconds.
So, yeah, it would be a way of,
like, capturing some examples.
Nikolay: Yeah, like 1% of everything below, and, like, 100% if it's above the regular log_min_duration_statement threshold.
And this was added in Postgres 13, I think, and adjusted later, in 14, so we can say it's in almost all versions already.
Michael: All supported versions,
yeah.
Nikolay: All supported versions,
yeah.
Actually, 13 is the oldest already,
wow.
Yeah, time flies.
Good, good.
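A sketch of the two-tier setup Michael describes; thresholds and rates are illustrative only:

```sql
ALTER SYSTEM SET log_min_duration_statement = '100ms'; -- everything slower is always logged
ALTER SYSTEM SET log_min_duration_sample    = '10ms';  -- candidates for sampled logging
ALTER SYSTEM SET log_statement_sample_rate  = 0.01;    -- log ~1% of the 10-100ms statements
-- Transaction-level sampling also exists; 0 (the default) disables it:
ALTER SYSTEM SET log_transaction_sample_rate = 0;
SELECT pg_reload_conf();
```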
So we talked about this, but there's also the idea that instead of logging queries, let's log them with plans, and just jump straight to using auto_explain and not use what we just discussed, log_min_duration_statement and log_min_duration_sample, blah, blah, blah.
So auto_explain just brings plans, which is great.
And...
Michael: Yeah, and it also does
log the statement, so you do
still get the query.
But you could, in theory, have a third value, so you might want to log the queries above 100 milliseconds and have a separate threshold for logging auto_explain plans if you wanted to.
Nikolay: It was you who surprised me a lot a couple of years ago when you said auto_explain has supported sampling for ages.
Yeah.
I somehow hadn't noticed.
Yeah, but it's great.
You can log every execution with its plan.
You log the actual execution with the plan, but only a 1% sample of them, right?
Michael: Yeah, so this is slightly different, though, I think, because with the sampling on log_min_duration_statement, I get the impression, well, I'm not sure, maybe this is a question, but I get the impression it's tracking them all but only logging a percentage of them.
So you're then saving yourself
from an overhead around logging.
Because
Nikolay: it's...
Yes, calculate metrics and then log to disk.
Michael: Exactly.
So there's the observer effect
of observing, of tracking timings,
and then there's the second observer
effect of actually logging.
Whereas with auto_explain's sampling, you don't even measure a certain proportion of them, so it is subtly different, I think.
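A sketch of an auto_explain setup along these lines; the values are illustrative, and loading the module via shared_preload_libraries requires a restart (append it to any existing list rather than overwriting it):

```sql
-- postgresql.conf (restart required):  shared_preload_libraries = 'auto_explain'
LOAD 'auto_explain';   -- makes the parameters below recognizable in this session
ALTER SYSTEM SET auto_explain.log_min_duration = '500ms';
ALTER SYSTEM SET auto_explain.log_analyze = on;
ALTER SYSTEM SET auto_explain.log_buffers = on;
ALTER SYSTEM SET auto_explain.log_timing = on;     -- the per-node timings discussed later
ALTER SYSTEM SET auto_explain.sample_rate = 0.01;  -- instrument only ~1% of statements
SELECT pg_reload_conf();
```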
Nikolay: Yeah.
What would you recommend?
Like, we can have so many options here, and it's worth saying it's quite flexible.
We tune it all.
And then you think, okay, this is a big complex system.
We have so much logging.
We log everything.
Not everything, sometimes sampled, sometimes not.
But then you think, actually, you know what?
I still need pg_stat_statements, because I want reliable, higher-level metrics to understand.
Because I identify 10 queries I
need to work on, but where to
start?
What is the most influential in
various terms?
Like, I need pg_stat_statements to realize that.
And I connect them by query ID, and query ID in auto_explain is supported only since version 16, I think.
So this was a big missing piece
before 16.
But once you're on 16 or 17 or
maybe already 18 beta 1, I don't
know, then you connect by query
ID and you see, okay, based on
log analysis, I see these plans
need my work, need attention
or these queries, But I'm choosing
this one because I see it's
like 25% of all time spent based
on pg_stat_statements.
And this is a great way.
So this brings us to the idea that
having logs is not enough.
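A sketch of the workflow just described: rank query groups by their share of total time in pg_stat_statements, then look for auto_explain entries carrying the same query ID in the log (per the discussion, logged on 16+ when compute_query_id is enabled, which is the default 'auto'):

```sql
SELECT queryid,
       calls,
       round(total_exec_time)                                      AS total_ms,
       round(100 * total_exec_time / sum(total_exec_time) OVER ()) AS pct_of_total
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 10;
```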
Michael: Yeah, well, they're kind
of serving slightly different
purposes, aren't they?
Because if a query is taking 25%
of total CPU, like if it's just
taking 25% of our resources, It
would probably be relatively
easy to reproduce that.
If we grab the query, grab an example
that's, you know, relatively
slow, and run EXPLAIN ANALYZE
with all the parameters on that,
we'll probably get a slow query
plan.
But where auto_explain comes into its own is if queries are only, like, if they're slow sometimes but not always.
If there's very specific.
Exactly.
It's cases like that that might
be difficult to diagnose.
Maybe they happen at certain times
or with, you know, in cases
that you're struggling to put your
finger on.
Getting the plan at that moment
is extremely helpful.
And it's not always possible.
Nikolay: Do you think auto_explain
should be in a different log?
Michael: I mean I still...
Nikolay: Let me...
I maybe already mentioned this
recipe in the past, I used it.
So you have a tmpfs disk, a RAM disk.
Yeah.
Like a gigabyte, we have memory, so 1 gigabyte.
And the log goes there.
And you quickly rotate it to ensure it's not filled up.
And send it somewhere to storage.
And it would be great to have a separate log for such very bloaty things like auto_explain, right?
Michael: Well, and that's the big
difference between auto_explain
and just logging the query.
Because even queries can get long,
but query plans can be ridiculously
big.
Like the largest one I think I've
seen was getting close to 100
megabytes of
Nikolay: text.
Just one.
Michael: Just one query plan.
Okay, a lot of that was like, there
was a lot of like whitespace
and a lot of repeated
Nikolay: identifiers.
It should be compressed, immediately compressed, and sent over the network.
But compressed by whom?
The logging collector, or?
Yeah.
That's an interesting question.
Michael: So it is the big downside
in terms of volume.
Nikolay: So we see how much power we have right now for observability in query processing and query optimization, but it still feels like in heavily loaded systems it's hard, like, it's not super hard, but there are opportunities to improve.
100% so.
Michael: Yeah.
And if you've not got issues right now, it's great to set a bunch of these, and you can afford to set things like auto_explain up if you've got headroom; you can set these things up in advance and then change them if you start to get into, you know, 90% of CPU being used, and you can look into things you can do to reduce those.
But I really like a quote: I think somebody at an Oracle conference asked, what's the overhead of all of the performance analysis Oracle's doing?
Oh yeah, of
Nikolay: course, Tanel Poder.
Michael: Ah, nice.
Nikolay: Yeah, he said something like, I can pay like 10% or something in overhead, but save like 90% thanks to my optimization based on that information.
Michael: Yes, I think it's along those lines.
I think if you have this information.
Nikolay: It's like an investment, you know: if you have this but you're not going to use it, you're additionally losing, right?
But if you are definitely going to allocate your resources to optimize, you're ready to pay this penalty, or tax, because you know you will optimize and you will be in a better spot after it.
Michael: Yeah, exactly, it's not a liability, it's an asset at
that point.
Nikolay: And those 10% they will shrink, thanks to my optimization.
Compared to initial latencies, they may become smaller, right?
Michael: Maybe, if they're a percentage, I've got a feeling some
of them are fixed.
Nikolay: But if we are not sure that we will be able to optimize,
I can understand that it can sound risky to pay this additional
price without a reliable outcome.
Who knows how successful we will be.
But again, all of these things, except putting auto_explain into shared_preload_libraries, you can change them on the fly.
So it's so great.
You can temporarily lower all thresholds for a week or 2 while you're optimizing, for example, and then get back to your normal settings, and do it like every quarter, for example, like routine optimization.
So cycles of optimization.
Also, if it's non-production or something, you can be more aggressive in going down with thresholds, to see more, yeah.
Michael: It feels like 1 last thing on auto_explain.
The thing the docs and quite a few experts recommend is turning off the timing parameter.
So you collect the EXPLAIN ANALYZE information but without per-node timings.
It does reduce the overhead, of course.
But it also reduces the, like, fidelity of those plans, and it is quite useful.
Nikolay: Well, yeah, I don't agree with you.
For me, the most valuable parts are the structure and buffers, as usual.
Michael: Of course, I get that they're more useful, but I think quite often, in a lot of environments, collecting the timing information is also worth it.
Nikolay: Well, yeah.
You had a post long ago about this, but we also had a whole episode
about auto_explain.
Do you remember?
Michael: Yeah.
I'll link up the relevant episodes.
It's a
Nikolay: big topic.
auto_explain has many settings to tune, so it's a big topic.
Good, I think maybe that's it, right?
Michael: Yes, definitely.
Thanks so much, Nikolay.
Catch you next week.
Nikolay: Yeah, Bye-bye.