
What to log
Michael: Hello and welcome to Postgres.FM, a weekly show about
all things PostgreSQL.
I am Michael, founder of pgMustard, and this is Nik, founder
of Postgres.AI.
Hello Nik, how's it going?
Nikolay: Hello Michael, everything's all right.
Let's discuss logs.
Everyone loves to deal with logs, right?
Michael: What do you mean?
Nikolay: Well, I mean, if you have a huge server, you have a lot of logs, so it's good to deal with them.
Michael: Well, I guess we're blessed in Postgres with a few ways
that we don't always have to check logs, right?
There are some system views where we can look at various information.
Nikolay: For example corruption, right?
There's no such view.
Michael: Yeah, yeah, there are quite a few things for which there's no such view.
Well there's quite a lot of information that's put in the logs
and isn't available elsewhere.
So it is important.
And when I say it's put in the logs, actually quite often it isn't even put in the logs unless you change some settings.
So I thought, you had this good idea to discuss all of the things that most people should change in terms of what to log, so that we can then check the logs for issues when we're doing performance analysis and that kind of thing.
Nikolay: Yeah. So I think, ideally, for everything we should have both system views and detailed logging.
Michael: Oh both?
Nikolay: Yeah, because, like, think about pg_stat_statements: we have normalized queries only; it's not possible to have every query there with all parameters, because the cardinality would be insane.
Michael: But equally, I think it's quite often not possible to
have every query logged to the logs, right?
Like there's also limitations there.
Nikolay: Exactly.
So pros and cons of both approaches.
Cumulative metrics, they are good.
You can visualize, you can see spikes, everything, but they lack
details.
And the same, for example, with temporary files, right?
We can see at a high level, from pg_stat_database, that, okay, we generate a lot of temporary files.
Some queries generate a lot of temporary files.
We can go down to pg_stat_statements and see which query groups
or normalized queries produce those temporary files.
It's already super helpful.
But obviously, the same normalized query, if you take 1 set of
parameters, it can produce gigabytes of temporary files.
Another set of parameters, it can produce 0 temporary files.
Depending on parameters, it can be very different.
So we need examples, and this is exactly when we need to look
at the logs.
But at the same time, if we log everything, the observer effect is not a joke.
And I must confess, several big clients suffered from the observer effect when I decided to log too much.
It caused downtime, and it was not a pleasant experience.
So we always need to think about the observer effect.
Those who say log everything, log all queries, just don't deal with really heavily loaded systems, unfortunately.
Yeah.
Just try to do it in heavily loaded systems which have like 100,000 TPS.
And that's it.
You will bring everything down very quickly, and you will learn it's not the way to go.
So we need to limit the amount of logging, but we cannot avoid
logging because we need examples.
So my approach is to combine both things.
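A rough sketch of the "cumulative metrics" side of this, runnable in psql; the columns are the standard ones in pg_stat_database and pg_stat_statements (the latter requires the extension to be installed), and the LIMIT is arbitrary:

```sql
-- Database-wide temp file counters (cumulative since the last stats reset).
SELECT datname, temp_files, temp_bytes
FROM pg_stat_database
WHERE datname = current_database();

-- Per normalized query: which query groups spill the most (needs pg_stat_statements).
SELECT queryid, calls, temp_blks_written, temp_blks_read
FROM pg_stat_statements
ORDER BY temp_blks_written DESC
LIMIT 10;
```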
Michael: Yeah, it's a balance, isn't it?
We need to record enough information to be able to diagnose issues
when they occur, but not so much that there's excessive overhead.
So it's a balancing act.
Nikolay: Right.
And you need to think, like, it's better to think in advance about what to log, because if some incident happens and you somehow mitigate it, you want to do a root cause analysis, but the logs lack this information.
So you can only guess and you can guess quite well if you have
a lot of experience, but still it's better to have logging
tuned properly in advance.
And I see Postgres, in the last few years, improving defaults, but the defaults are usually still not enough.
We always tune some defaults for our clients.
I mean, we tune some log-related settings to have more logging.
Michael: Yeah, in my research for this I found 3 different parameters that were all either introduced, or had their default changed from off to some version of on, in Postgres 15.
So I was kind of expecting a gradual improvement in this over time, but it seems almost, I don't know if it's a coincidence or not, like 1 major version where quite a few things improved, and I haven't seen a whole lot of movement in other areas, unfortunately.
Nikolay: Yeah, you're probably right. Defaults changed in 15, you mean log_checkpoints, and log_autovacuum_min_duration set to 10 minutes, which is still enormous, like it's a huge value.
And what was the third?
Michael: And there's the new log_startup_progress_interval.
Nikolay: But it's new, it's not a default change.
But the decision was made not to introduce it turned off by default, which is a great decision.
I agree.
And it's quite like 10 seconds, right?
So every 10 seconds.
Yeah, let's discuss everything, everything we have, right?
Let's start from something and move from topic to topic.
But I agree, so 15 was revolutionary in terms of bringing good defaults in logging.
Michael: In some areas, yeah.
Nikolay: Yeah. And also worth mentioning, the Postgres log, it's always a single log.
So you can imagine if we had the ability to separate streams of logging, for example, slow queries could go to 1 place, maybe to faster disks, or maybe to...
There is this idea, a very old idea, that some logging should go not to the local disk but straight to some storage over the network, UDP, and that's it, just send it.
So we could log a lot more stuff.
Although when you ask the folks who support, I don't know, Elastic or something, which is receiving a lot of logging from different systems, they also would like to limit it, you know?
Or Splunk, for example.
Splunk is an old commercial enterprise
system for logs.
They can deal with huge volumes, but they're always still cautious about how much you are going to send them.
Michael: Yeah.
It's a good point, though.
Other databases do have...
I'm pretty sure MySQL has a separate
slow query log, because
I've seen people familiar with
MySQL asking about where's Postgres'
slow query log?
Nikolay: Yeah, it's not a big change, it could be done, I mean it's easy to implement, but I imagine how many infrastructure changes would need to be made in all the systems if the slow query log were separated.
What do you think?
Michael: What, am I missing something, though?
What's the benefit of having it
separated out?
Nikolay: You can put it on different disks, faster disks for example; analysis-wise it's also easier.
Michael: Because the volume is
likely to be so much higher than
the other
Nikolay: Yeah, maybe. Some rotation policies could be different.
And analysis-wise, by nature, I see a bunch of log messages which are related to query optimization.
And I also see a bunch of log messages which are related to startup, autovacuum, checkpointing, like, not about query processing, but everything else, right?
And separating them, to me, sounds reasonable.
So imagine it would be possible, like, what do you think?
Imagine there is a new setting which says everything related to query processing, including log_min_duration_statement-based entries.
Everything that can contain a query text, actually.
Everything.
Like, for example, slow queries exceeding log_min_duration_statement.
Queries that generated temporary files.
Queries which are slow because they are waiting for other queries and get logged after deadlock_timeout, 1 second by default, right?
Imagine all of them go to 1 place, and everything else stays in the old place.
But it's optional.
You can decide to choose it or not.
Michael: Yeah, I haven't thought long and
hard about it, but I do quite
like the simplicity of everything
being in 1 place.
But it can also be helpful, right?
I understand how if you're not
worried about slow queries and
you're looking for other things,
they can be really noisy and
get in the way.
And in a heavily loaded environment,
it would just be impossible
without tools to then get anything
useful out of that.
But like an autovacuum could be
the cause of some slow, like
some issue that you're facing with
query performance.
Or checkpoint could be the cause.
And maybe in a less heavily loaded environment where you're reviewing it manually, seeing that in order might help you debug what's actually happening here.
Nikolay: Yeah, makes sense, makes sense. And also, some errors: what will happen with queries which are erroring out?
Yeah.
But, but still, like, it could be a choice, right?
Because there are people who are interested in errors, corruption-related errors, I hope we will discuss them as well, and maybe checkpointer behavior and so on, and less interested in query processing itself.
They often say, you know, your huge queries are being logged just because they exceed 1 second, and they make our log really hard to read.
Yeah.
Well,
Michael: hard to read manually,
right?
With tools you can filter to the
bits.
Nikolay: Or with DuckDB: you can have CSV logs and talk to them through DuckDB queries, which I think is a great idea.
Quite a new and great idea.
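A hedged sketch of that CSV idea: csvlog requires the logging collector, which needs a restart, and the log path below is only an assumed default layout.

```sql
-- Write logs in CSV format alongside stderr (restart required for logging_collector).
ALTER SYSTEM SET logging_collector = on;
ALTER SYSTEM SET log_destination = 'stderr,csvlog';

-- The resulting .csv files can then be queried from DuckDB, for example:
--   SELECT * FROM read_csv_auto('/var/lib/postgresql/data/log/*.csv') LIMIT 10;
-- (the path is an assumption; adjust it to your data directory and log_directory)
```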
Michael: But I like this almost
as a way of splitting our conversation.
You mentioned the performance related
stuff and everything else.
Where do you want to start with
1 of those?
Nikolay: Yeah, we already talked
about log_checkpoints.
Let's maybe talk about this.
So in 15 it was changed to on and that's it, that's great.
I mean, it's easy.
Like I think everyone needs it.
And finally in 15. How long ago was it?
3 years ago.
Michael: Yeah, nearly.
Nikolay: It was realized that everyone should have it, everyone.
Michael: So our advice to people would be, if you're on a version
older than that, turn it on.
Nikolay: 100% so.
Michael: If it's not already.
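A minimal sketch of that advice for older versions; it takes effect on reload, no restart needed:

```sql
SHOW log_checkpoints;                     -- 'on' by default only since Postgres 15
ALTER SYSTEM SET log_checkpoints = on;
SELECT pg_reload_conf();
```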
Nikolay: Yeah, and another thing happened in 15, related to autovacuum logging.
So log_autovacuum_min_duration, I think before that it was minus 1.
So it means autovacuum is not logged at all.
And it was changed to 10 minutes.
In my opinion, it's not enough.
10 minutes is way too long.
It should be 10 seconds.
In some places we set it to 0 to see everything.
Of course, I can imagine, especially if you have many workers, and you also tuned autovacuum_naptime to a very low value so the workers sleep less between runs, right?
In this case, I imagine the amount of logging from autovacuum workers might be substantial.
But at least a 10-second or a few-seconds threshold is good.
Because if you have a lot of tables, with 10 minutes you will miss a lot of details, interesting details.
Michael: And it's not that frequent.
If we go back to that, you know, recording enough information to diagnose issues without excessive overhead: the overhead of logging all of your autovacuums, unless you have, as you say, a really very low naptime and maybe lots and lots of tables, like maybe you've got thousands of tables, thousands of schemas, like if they've multiplied out, I think I heard somebody that was talking about having millions of objects in their database.
Nikolay: All right, do you remember, if you have, for example, a naptime of 1 second, you have 10 workers, for example, and you have a long-running transaction holding the xmin horizon, so autovacuum cannot delete some dead tuples.
How often will those workers visit this table, which needs processing?
Maybe they will do it in a loop, right?
Michael: Yeah, good point.
I don't know.
Nikolay: And if the table is very small, they quickly process it, sleep 1 second, process it again, sleep 1 second, so it will be a lot of spam, right?
Unless there is additional logic in autovacuum, I don't remember 100%.
It would be good to double-check.
Michael: Worth checking, but also, once per second, when we're talking about logging, when we're talking about potential volume, it's also still not that bad.
Nikolay: But 10 workers, thousands of tables.
I can imagine something like this.
But also, nobody...
Like, the default naptime is 1 minute.
And the default number of workers is 3.
Yeah.
So I...
Anyway, like, I definitely...
We always recommend to our clients to see more of the autovacuum work in the logs.
It's really helpful: especially during some incidents, or after some incidents, we have this useful information to understand what happened in terms of data that cannot be deleted.
It also reports the xmin horizon, by the way.
It reports, like, the delta between the current transaction ID and the xmin horizon, so it's super useful.
Yeah.
Michael: Nice. So just to be clear: if you're on older than 15, make sure you have this set to something so that you're actually logging autovacuum at all.
And even if you're on 15 or above,
consider lowering it substantially
from the default.
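A sketch of that recommendation; the 10-second threshold is illustrative, not a one-size-fits-all value:

```sql
-- Default is -1 (off) before Postgres 15 and 10min on 15+.
ALTER SYSTEM SET log_autovacuum_min_duration = '10s';
SELECT pg_reload_conf();
```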
Nikolay: Yeah, 1 more thing here
is that if we don't have activity
at all, in this case, it will be
producing some log entries which
are probably not super useful.
Michael: If we don't have any autovacuum activity, or any kind of updates or deletes.
Nikolay: Yeah, or maybe it's SELECT-only for a long time.
Michael: But that was the argument
for not having checkpoint
logging on, if there's a silent
database.
Nikolay: There was a battle about
it, yeah.
Michael: Yeah, but we're talking
about very, very few log entries
in the grand scheme of things still?
Nikolay: I just mean if, for example, you build some service, you have lots of databases, or you build a platform for other people.
Michael: Sure, sure.
Nikolay: And you can imagine like
many of them will have databases
which are not super active, but
logging has costs.
Say you have like 10,000 databases and 9,000 of them are not super active, and you will be paying for this disk space for logs, right?
This will make you think differently
and try to save.
But this is exactly, like, a different situation compared to: okay, I have 1 database, it's a hot startup, an AI startup, it's growing.
It's a single monolith, or a few monoliths, or shards, it doesn't matter.
And we need to understand our workload
really well.
In this case, we need to log everything,
right?
Michael: Yeah, I get what you mean.
But going back to, let's say, the platform that has 10,000 databases of which 9,000 are quiet, in that case why not rotate, like, just delete logs more often?
Nikolay: Good point. So yeah, if you, for example, understand that you're going to log something that's probably not super useful in a not-loaded system, but you already know the maximum amount of that, how many lines it will be, and you know it will be rotated, then there is just a constant price for each node that you're going to accept.
And logs also can be compressed
really well.
Yeah I agree, I agree with this
point as well.
I just wanted us not to forget
about different perspectives people
might have.
Michael: Yeah, great.
Nikolay: 1 big pet versus a huge herd of cattle you need to take care of.
So yeah.
Good.
What about autovacuum, that's it, right?
So we need to know the details, especially about larger tables, especially when autovacuum is blocked.
And if you're, for example, a backend developer, I think it's interesting for you as well, because it may sound like autovacuum is just, okay, a garbage collector in Postgres, right?
But if you have a chance to deal with the performance of the system you develop, the garbage collection part is very important, because every time you deleted or updated something, or had an INSERT with a ROLLBACK, you produced dead tuples.
And basically, if you deleted something, did a COMMIT, and returned something to your application code, it's only half of the work done.
autovacuum will do the remaining half of the work, so to speak.
And you definitely want to know when it happened and in which volumes.
At some point you will probably need to process a lot of rows, like delete millions of rows.
We had an episode about this.
And people keep asking, I saw comments yesterday, comments like, okay, deletes, but what about updates?
Well, it's similar: updates also produce a new tuple and mark the old tuple as dead, and it's the same process that will clean it up.
And we need to understand its behavior.
And if we want to move really fast, we need to take care of cleaning, otherwise we will see a lot of huge gaps in storage.
Space won't be reclaimed, it won't be reused very quickly, and this is called bloat, right?
So we don't want this.
So we need to understand garbage collection in Postgres, which is called autovacuum.
Yeah, this is, like, that's why you want autovacuum logging, even if you are not a DBA but just a backend developer.
This was my point.
Michael: Yeah.
And it's just like you mentioned at the beginning, there are system views where we can look at when the last time autovacuum ran on each table was, which is useful.
But you can't see the couple of times before that, or...
Nikolay: You don't see details.
How many...
Yes, exactly.
Was the xmin horizon a problem, actually?
Yeah.
I wonder, by the way, is there some work in progress to extend it?
Michael: I don't know.
Nikolay: Because it's worth extending, I think.
pg_stat_all_tables or pg_stat_user_tables.
It's useful, but it's just a counter and a last timestamp.
Not enough.
Well, 2 counters, 2 last timestamps, for manual and automatic vacuuming.
Okay, good.
In the same release, just to close with Postgres 15, there was this log_startup_progress_interval.
I don't think it's for backend engineers.
It's more for DBAs, because if they tune checkpointer behavior and the distance between checkpoints is huge, significant, in this case startup might take a long time, and it's great that now we have this setting, and as you said, the default is quite good.
So we don't have to think about it at all.
Yeah, what's next?
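For reference, a sketch of the setting just mentioned; 10 seconds is the shipped default on Postgres 15+:

```sql
-- Logs progress of long-running startup operations (e.g. crash recovery) at this interval.
ALTER SYSTEM SET log_startup_progress_interval = '10s';
SELECT pg_reload_conf();
```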
Michael: What about log_statement?
We talked about that in a recent
episode as well.
The default there is none, right?
Nikolay: Yeah, exactly.
And I like to recommend people to set ddl, to put ddl there, just to trace all schema changes: when it was done, who did it.
Of course, the actor may override it.
Unless, yeah, so, but it's good to have some default there.
And I think there's actually an ability to forbid overriding some settings, right?
Michael: So I think there's almost 2 categories here, right?
You're talking about almost the security element of it.
But I think even if we assume it's the general case, it's still extremely useful to see when DDL changes were happening in the context of everything else.
Nikolay: The timestamp, the exact query.
For example, it may save you time in the future if you start dealing with logical replication, doing some experiments, and it's blocked by DDL.
You can trace when exactly it happened and which DDL it was, which is great.
Michael: Or even look back historically over the past 30 days and be like, which days do people run DDL on, and which ones don't they, you know, so that you could plan an upgrade around when it's unlikely to have DDL happening.
Nikolay: It's super helpful in various incident analyses, when you have a deployment and something went wrong.
We discussed it many times.
DDL, if it's not cooked properly, may be a reason for downtime or some issues.
And if you have log_statement equal
DDL, you can trace it.
Yes.
Yeah.
At least after the incident happened.
So, definitely DDL is a good idea.
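A sketch of the DDL-tracing setup discussed here; the "who" and "when" come from log_line_prefix escapes such as %m, %u and %d, which are assumed to be part of your prefix:

```sql
-- Log every CREATE / ALTER / DROP and similar statements.
ALTER SYSTEM SET log_statement = 'ddl';
SELECT pg_reload_conf();
```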
Michael: And the cost is very low,
right?
How much DDL are you running on
your system?
Nikolay: It's not a thousand per second, definitely.
Michael: No.
I hope.
Nikolay: You know, I see systems which
have DDL as a part of their
normal workload coming from users.
It happens.
Well, for example, if it's a multi-tenant
system, which creates
tables for each customer and if
they extend functionality, they
also create something.
Also temporary tables, which I
don't like at all.
Like I avoid temporary tables at
all costs, but people use them
and temporary table creation is
also DDL.
Michael: Yeah, good point.
Nikolay: So I can imagine that
there are cases when it can happen,
like we have a storm of DDL coming to the logs, but for me it's kind of an edge case rather than the norm.
Michael: So, great.
Yeah.
On the security side, I wondered about log_connections and log_disconnections, which are both off by default.
Nikolay: I don't think... this is something I sometimes see people turn on.
It depends on the particular system.
In many cases, it's off.
It's all off by default.
And this information is usually most interesting, it looks most interesting, to DBAs and DBREs, not backend developers.
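For completeness, a minimal sketch of the two settings touched on here, both off by default:

```sql
ALTER SYSTEM SET log_connections = on;
ALTER SYSTEM SET log_disconnections = on;
SELECT pg_reload_conf();
```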
Let's move on closer to workloads
and query optimization maybe.
Michael: Yeah, and I think we did
have an episode either about
pgaudit or about, did we do it
specifically about pgaudit or
about security or something like
that?
So I'll link up that episode because
I think that covers probably
the more interesting parts of the
DBA side of things.
Nikolay: Sounds good.
Yeah, maybe let's talk about log_lock_waits before we focus on workload analysis.
It's also part of workload analysis, just a specific part related to logs.
And we also discussed it recently,
I remember.
And this is definitely what we
recommend turning on because it's
off by default.
Yeah, because it's helpful.
Also during incidents, that's helpful.
If you're developing your app and didn't think properly about multi-session concurrency control and blocking situations, you might have incidents, a spike of active sessions, or maybe even downtime caused by heavy lock contention.
And having logs related to this is really helpful.
Really helpful, because without it, you can only diagnose something at runtime.
Okay, you see there's a spike of
exclusive locks.
So, with log_lock_waits, you will see that some session tried to run some query, but it was blocked by another session.
And it was waiting, our query was waiting, longer than deadlock_timeout, which is 1 second by default.
So if you, for example, try to update 1 row, but there is another transaction which is not committed yet, it may have already updated this row, but it's not committed, and we need to wait until the COMMIT or ROLLBACK of that transaction.
And our update lasts, like, the waiting period lasts 1 second.
In some systems it may be longer or shorter if deadlock_timeout was adjusted.
And this gets logged, and you see, okay, our session, this process ID executing this query, was waiting for long, and it was waiting for those process IDs.
Unfortunately, we don't see details about those sessions.
This is a disadvantage.
But it's already half of the coin visible.
Michael: Yeah, well, imagine trying to debug without this; reproducing the ordering for locking issues just seems to me to be so difficult without this kind of information.
Nikolay: Yeah, also it's helpful
if you have deadlock issues,
and you try to understand your
system better, this kind of logging
is helpful.
I cannot say it will answer all
your questions, unfortunately.
I had cases when deadlock troubleshooting
was extremely challenging,
and we had logging, we had everything,
but it was still challenging.
Sometimes it's really challenging.
You need to go to the application code to understand what exactly it's doing, but it's already something, and it's better to always have it.
Michael: Yeah, and this to me seems
like a no-brainer.
The trickier 1, balance-wise, seems
to be the time, whether to
reduce it from 1 second or increase
it.
And that seems quite loaded because
it's doing 2 things, right?
It's determining at what time lock waits get logged, but also at what time deadlocks get detected.
Nikolay: Deadlock detection happens.
Michael: Yeah, so it's doing 2
things.
Nikolay: I can, I understand the decision from an engineering perspective.
It was like, where should we log it?
Oh, this is where we're checking it.
It's a good time to log it, exactly at that time.
So I can imagine this decision, why it happened.
But I agree with you, there is some lack of flexibility, a little bit.
But it's also protecting us from huge volumes.
Like, let's talk a little bit about the observer effect.
Here I don't see it, actually.
So for each session, there is this 1-second delay, basically, by default, before it gets logged.
If we have a lot of sessions, okay, how many?
100?
100 active sessions.
Michael: Max connections.
Nikolay: Yeah, we might have 100 of these messages per second.
Well, it's a significant volume, but I think on a server with 100 active sessions, and I hope with many more cores, the disks are already quite powerful, so they can handle 100 entries per second for this type of logging.
But it might contribute in some cases, I think.
Again, in some edge or corner cases, it might contribute a lot.
So it's worth understanding.
But personally, I've never seen
the case when we decided to turn
it off.
And I think it should be on by
default, but it's off by default.
So advice from us, turn it on.
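A sketch of that advice; deadlock_timeout is shown only because it doubles as the lock-wait logging threshold, and lowering it is a separate, heavier decision:

```sql
ALTER SYSTEM SET log_lock_waits = on;
SHOW deadlock_timeout;        -- 1s by default; waits longer than this get logged
SELECT pg_reload_conf();
```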
Michael: Yeah, good 1.
Nikolay: What's next?
Michael: Should we move on to performance
things?
Nikolay: Yeah, maybe temporary
files first.
Michael: Yeah, good 1.
Kind of bridge the gap a little
bit.
Nikolay: Yeah, and by default it's disabled, minus 1.
Michael: Yes, so this is, in fact, actually I looked into it, this is logging when the temporary file gets deleted, interestingly.
But what's happening is a query is using some temporary storage, for example if a sort spills to disk, or it doesn't have enough working memory available to do a hash join, or any operation that uses temporary data.
So I had this under performance, but I guess it is also just general system-wide tuning as well, around config parameters like work_mem and, more recently, hash_mem_multiplier.
Nikolay: Yeah, yeah.
Yeah, I can imagine many queries will start producing a lot of logs if they need temporary files.
It's such a multi-sided question to me, because if you try to save on logging of temporary file generation cases, deletion cases, as you said, you already have issues with dealing with disk.
The query processing itself was dealing with disk.
It's better to capture that info and try to optimize it as soon as possible.
But will I put 0 there?
0 means let's log all the cases
even if it's just 8 kilobytes.
Maybe no, but what I would do,
what we usually recommend, I'm
sharing 3 pieces of advice today.
We say minus 1 is definitely not
good.
And of course, in some cases, you don't see temporary files generated right now, looking at pg_stat_activity or pg_stat_statements, because they also have a couple of columns for temporary files written and read, so you can see it there as well.
If you don't see a big problem right now, maybe okay, maybe not, let's not log it.
But why not be prepared for the future, right?
Michael: Yeah, I would say the
opposite.
I'd say if you're not seeing them
right now, you could even afford
to set it to 0 because you're just
not going to get much volume.
Nikolay: What if tomorrow memory is not enough, and we start logging very tiny amounts, tiny numbers, and it will be a lot of cases?
Michael: Well, it depends if you've tuned work_mem.
Even if work_mem is the default 4MB, it would most likely only be logging queries that are using more than 4MB of memory in sorts and things, so how many of those per second are you realistically running?
It seems unlikely to me that it's
going to be super high volume.
Nikolay: First of all, work_mem should be increased.
Yes.
Right.
But if you keep it very low, and suddenly data volumes become bigger and bigger, and for hash operations and ordering we need more and more, at some point we slightly exceed 4 megabytes, and many processes do it, for example, and boom, you need to generate temporary files, and with log_temp_files equal to 0 each of them adds additional text to the log, right?
My approach is different.
If we don't see it right now, but we grow, this is a scary moment for me.
I would prefer seeing very heavily loaded systems already having some temporary files.
And we understand them.
And we go from quite a significant value, like 10 megabytes, and we go down in stages.
Restart is not needed, by the way, for all the settings we discussed today.
So far, restart is not needed, which is perfect for going in iterations gradually, right?
So we gradually descend towards 0, but maybe not to 0, maybe 200 kilobytes or something.
Michael: But what's the difference?
Like it's on.
Are you going to get temporary
files that are 100 kilobytes?
Nikolay: Well, if we put 100 kilobytes into log_temp_files, we will see only the cases which exceed it.
Michael: Yeah, but what I mean is, if work_mem is 4MB and hash_mem_multiplier is 2 by default, we're only realistically going to have temp files that are, very roughly, 4MB to 8MB or more.
Nikolay: I understood you.
Well, obviously we think about
it differently and I'm not sure
who is right.
You think when memory is not enough, everything goes to a temporary file.
I think when memory is not enough, only the delta goes to a temporary file, which is a...
Michael: Good, good point.
I think it's actually different
in 2 different cases.
So, for example, like sorts, I
think it's all or nothing.
But I think, for example, with bitmap scans that have non-exact...
Nikolay: Michael, we have homework
to do.
Yeah.
Michael: It's a good point, though.
You're right.
There probably are some operations.
I think including bitmap scans.
Nikolay: This is an interesting question.
Yeah, I'll talk to some hackers.
It might sound very simple, actually.
Michael: I'm pretty sure with sorts
it's all or nothing.
But it's a good point that going
down in stages makes a load
of sense.
And I was just trying to say I
feel like you can skip it if you've
currently got it.
Nikolay: Yeah.
Why go down or nothing?
I understand this logic, yeah.
Michael: But for me, low is almost as good as 0.
Like, there's no difference realistically.
It's 100 kilobytes, great.
If that makes things much safer in extremely heavily loaded systems
with lots of potential max connections, great, low is still good.
Nikolay: Yeah.
But again, this is a useful parameter because, as we discussed at the beginning already, you can have cases where you see some normalized query, but you need examples to understand how to reproduce it and then decide what to do with it.
Raise work_mem globally or locally, that will be the question.
Michael: Or for 1 user, like an analytics user, that kind of thing.
Nikolay: Right, right, right, yeah.
Yeah, so okay, good.
And memory management might be quite challenging, like it's sometimes
not trivial.
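A sketch of the staged descent for log_temp_files described above; all values are illustrative:

```sql
ALTER SYSTEM SET log_temp_files = '10MB';   -- start high on a busy system
SELECT pg_reload_conf();
-- ...then lower it in iterations: '1MB', '100kB', maybe eventually 0 (log every temp file).

-- The settings that decide when spills happen in the first place:
SHOW work_mem;              -- 4MB by default
SHOW hash_mem_multiplier;   -- 2.0 by default on recent versions
```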
Good. So let's talk about the main part of query analysis.
pg_stat_statements is not enough, right?
So you need to have quite a low log_min_duration_statement,
at least 1 second, preferably lower, like half a second, maybe
100 milliseconds, to capture more cases of slow queries and then
optimize them.
Right?
Michael: Yeah, absolutely.
And I like your point about them being examples.
So it's not necessarily that the average of that query is going
to be running in 100 milliseconds.
We want to catch the outliers.
Like, what are the parameters for that query that mean it does
tend to run over 100 milliseconds?
So we get those examples.
It's perfect.
Nikolay: And
Michael: again, 1 that we can lower in stages?
Nikolay: Yeah, definitely.
So I would start, if it's never been set, like if it's a big unknown, from 5 seconds and go down.
It doesn't make any sense to start with a really large value, because I assume we're talking about OLTP systems, where, for example, 10 seconds is already an absolutely slow query.
I usually see 1 second or half a second as the value being used these days.
And if you've already performed a lot of optimization efforts and you want to improve even further, you can go down.
But maybe, maybe it's worth it, instead of this...
Oh, before we go there, there's actually an idea: you can go down to 0 but switch to sampling.
There is statement-level sampling and transaction-level sampling, which have been supported since a few years ago.
Michael: Yeah, you can have 2 levels,
can't you?
You can set log_min_duration_statement
to, let's say, 100 milliseconds,
and that will log everything above
100.
Then you can have like a second
threshold that is log_min_duration_sample
that's like, you could say
10 milliseconds and then you
separately set a log_statement_sample_rate
to, you could say
like 0.1 and that would do 10%
or 0.01, it would do 1%.
And that would log, let's say,
1% of the statements that were
between 10 and 100 milliseconds.
So, yeah, it would be a way of,
like, capturing some examples.
Nikolay: Yeah, like 1% of everything below, and, like, 100% if it's above the regular log_min_duration_statement threshold.
And this was added in Postgres 13, I think, and adjusted later, in 14, so we can say it's in almost all versions already.
Michael: All supported versions,
yeah.
Nikolay: All supported versions,
yeah.
Actually, 13 is the oldest already,
wow.
Yeah, time flies.
Good, good.
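A sketch of the two-tier setup Michael describes; thresholds and rates are illustrative only:

```sql
ALTER SYSTEM SET log_min_duration_statement = '100ms'; -- everything slower is always logged
ALTER SYSTEM SET log_min_duration_sample    = '10ms';  -- candidates for sampled logging
ALTER SYSTEM SET log_statement_sample_rate  = 0.01;    -- log ~1% of the 10-100ms statements
-- Transaction-level sampling also exists; 0 (the default) disables it:
ALTER SYSTEM SET log_transaction_sample_rate = 0;
SELECT pg_reload_conf();
```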
So we talked about this, but there's also the idea that instead of logging queries, let's log them with plans, and just jump straight to using auto_explain and not use what we just discussed, log_min_duration_statement and log_min_duration_sample, blah, blah, blah.
So auto_explain just brings plans, which is great.
And...
Michael: Yeah, and it also does
log the statement, so you do
still get the query.
But you could, in theory, have a third value, so you might want to log the queries above 100 milliseconds and have a separate threshold for logging auto_explain plans if you wanted to.
Nikolay: It was you who surprised me a lot a couple of years ago when you said auto_explain has supported sampling for ages.
Yeah.
I somehow hadn't noticed.
Yeah, but it's great.
You can log every execution with its plan.
You log the actual execution with the plan, but only a 1% sample of them, right?
Michael: Yeah, so this is slightly different, though, I think, because with the sampling on log_min_duration_statement, I get the impression, well, I'm not sure, maybe this is a question, but I get the impression it's tracking them all but only logging a percentage of them.
So you're then saving yourself
from an overhead around logging.
Because
Nikolay: it's...
Yes, calculate metrics and then log to disk.
Michael: Exactly.
So there's the observer effect
of observing, of tracking timings,
and then there's the second observer
effect of actually logging.
Whereas with auto_explain's sampling, you don't even measure a certain proportion of them, so it is subtly different, I think.
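A sketch of an auto_explain setup along these lines; the values are illustrative, and loading the module via shared_preload_libraries requires a restart (append it to any existing list rather than overwriting it):

```sql
-- postgresql.conf (restart required):  shared_preload_libraries = 'auto_explain'
LOAD 'auto_explain';   -- makes the parameters below recognizable in this session
ALTER SYSTEM SET auto_explain.log_min_duration = '500ms';
ALTER SYSTEM SET auto_explain.log_analyze = on;
ALTER SYSTEM SET auto_explain.log_buffers = on;
ALTER SYSTEM SET auto_explain.log_timing = on;     -- the per-node timings discussed later
ALTER SYSTEM SET auto_explain.sample_rate = 0.01;  -- instrument only ~1% of statements
SELECT pg_reload_conf();
```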
Nikolay: Yeah.
What would you recommend?
Like, we can have so many options here, and it's worth saying it's quite flexible.
We tune it all.
And then you think, okay, this is a big complex system.
We have so much logging.
We log everything.
Not everything, sometimes sampled, sometimes not.
But then you think, actually, you know what?
I still need pg_stat_statements, because I want reliable, higher-level metrics to understand.
Because I identify 10 queries I
need to work on, but where to
start?
What is the most influential in
various terms?
Like, I need pg_stat_statements to realize that.
And I connect them by query ID, and query ID in auto_explain is supported only since version 16, I think.
So this was a big missing piece
before 16.
But once you're on 16 or 17 or
maybe already 18 beta 1, I don't
know, then you connect by query
ID and you see, okay, based on
log analysis, I see these plans
need my work, need attention
or these queries, But I'm choosing
this one because I see it's
like 25% of all time spent based
on pg_stat_statements.
And this is a great way.
So this brings us to the idea that
having logs is not enough.
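A sketch of the workflow just described: rank query groups by their share of total time in pg_stat_statements, then look for auto_explain entries carrying the same query ID in the log (per the discussion, logged on 16+ when compute_query_id is enabled, which is the default 'auto'):

```sql
SELECT queryid,
       calls,
       round(total_exec_time)                                      AS total_ms,
       round(100 * total_exec_time / sum(total_exec_time) OVER ()) AS pct_of_total
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 10;
```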
Michael: Yeah, well, they're kind
of serving slightly different
purposes, aren't they?
Because if a query is taking 25%
of total CPU, like if it's just
taking 25% of our resources, It
would probably be relatively
easy to reproduce that.
If we grab the query, grab an example
that's, you know, relatively
slow, and run EXPLAIN ANALYZE
with all the parameters on that,
we'll probably get a slow query
plan.
But where auto_explain comes into its own is if queries are only, like, if they're slow sometimes but not always.
If there's very specific.
Exactly.
It's cases like that that might
be difficult to diagnose.
Maybe they happen at certain times
or with, you know, in cases
that you're struggling to put your
finger on.
Getting the plan at that moment
is extremely helpful.
And it's not always possible.
Nikolay: Do you think auto_explain
should be in a different log?
Michael: I mean I still...
Nikolay: Let me...
I maybe already mentioned this
recipe in the past, I used it.
So you have a tmpfs disk, a RAM disk.
Yeah.
Like a gigabyte, we have memory, so 1 gigabyte.
And the log goes there.
And you quickly rotate it to ensure it's not filled up.
And send it somewhere to storage.
And it would be great to have a separate log for such very bloaty things like auto_explain, right?
Michael: Well, and that's the big
difference between auto_explain
and just logging the query.
Because even queries can get long,
but query plans can be ridiculously
big.
Like the largest one I think I've
seen was getting close to 100
megabytes of
Nikolay: text.
Just one.
Michael: Just one query plan.
Okay, a lot of that was like, there
was a lot of like whitespace
and a lot of repeated
Nikolay: identifiers.
It should be compressed, immediately compressed, and sent over the network.
But compressed by whom?
The logging collector, or?
Yeah.
That's an interesting question.
Michael: So it is the big downside
in terms of volume.
Nikolay: So we see how much power we have right now for observability in query processing and query optimization, but it still feels like in heavily loaded systems it's hard, like, it's not super hard, but there are opportunities to improve.
100% so.
Michael: Yeah.
And if you've not got issues right now, it's great to set a bunch of these, and you can afford to set things like auto_explain up if you've got headroom; you can set these things up in advance and then change them if you start to get into, you know, 90% of CPU being used, and you can look into things you can do to reduce those.
But I really like a quote: I think somebody at an Oracle conference asked, what's the overhead of all of the performance analysis Oracle's doing?
Oh yeah, of
Nikolay: course, Tanel Poder.
Michael: Ah, nice.
Nikolay: Yeah, he said something like, I can pay like 10% or something in overhead, but save like 90% thanks to my optimization based on that information.
Michael: Yes, I think it's along those lines.
I think if you have this information.
Nikolay: It's like an investment, you know: if you have this but you're not going to use it, you're additionally losing, right?
But if you are definitely going to allocate your resources to optimize, you're ready to pay this penalty, or tax, because you know you will optimize and you will be in a better spot after it.
Michael: Yeah, exactly, it's not a liability, it's an asset at
that point.
Nikolay: And those 10% they will shrink, thanks to my optimization.
Compared to initial latencies, they may become smaller, right?
Michael: Maybe, if they're a percentage, I've got a feeling some
of them are fixed.
Nikolay: But if we are not sure that we will be able to optimize,
I can understand that it can sound risky to pay this additional
price without a reliable outcome.
Who knows how successful we will be.
But again, all of these things, except putting auto_explain into shared_preload_libraries, you can change them on the fly.
So it's so great.
You can temporarily lower all thresholds for a week or 2 while you're optimizing, for example, and then get back to your normal settings, and do it like every quarter, for example, like routine optimization.
So cycles of optimization.
Also, if it's non-production or something, you can be more aggressive in going down with thresholds, to see more, yeah.
Michael: It feels like 1 last thing on auto_explain.
The thing the docs and quite a few experts recommend is turning off the timing parameter.
So you collect the EXPLAIN ANALYZE information but without per-node timings.
It does reduce the overhead, of course.
But it also reduces the, like, fidelity of those plans, and it is quite useful.
Nikolay: Well, yeah, I don't agree with you.
For me, the most valuable parts are the structure and buffers, as usual.
Michael: Of course, I get that they're more useful, but I think quite often, in a lot of environments, collecting the timing information is also worth it.
Nikolay: Well, yeah.
You had a post long ago about this, but we also had a whole episode
about auto_explain.
Do you remember?
Michael: Yeah.
I'll link up the relevant episodes.
It's a
Nikolay: big topic.
auto_explain has many settings to tune, so it's a big topic.
Good, I think maybe that's it, right?
Michael: Yes, definitely.
Thanks so much, Nikolay.
Catch you next week.
Nikolay: Yeah, Bye-bye.