Advanced EXPLAIN
Nikolay: Hello, hello, this is
Postgres.FM.
My name is Nikolay, Postgres.AI,
and with me as usual my co-host
Michael, pgMustard.
Hi Michael, how are you doing?
Michael: Hello Nikolay, I'm good,
how are you?
Nikolay: Very good, very good.
So it was my turn to choose the
topic.
And I chose the topic naturally.
I thought we had never discussed EXPLAIN.
Never, ever did we discuss EXPLAIN, right?
We didn't do it.
But then I googled and found out
that you discussed it without
me.
So I thought, it doesn't count.
Let's discuss once again, but maybe
going deeper in some places.
Michael: Yeah, I covered the basics and
I think it's a good suggestion
to cover more advanced topics.
I think EXPLAIN is one of those topics
that you can just keep getting
more advanced.
And there's definitely a level
at which I'll get way out of my
depth.
But yeah, there's definitely room
for a second episode where
we go into more advanced topics
for sure.
Nikolay: Right.
And just to those who don't know,
you spend most of your professional
time dealing with plans because
pgMustard is helping people analyze
and improve queries, particular
queries, looking at EXPLAIN plans,
right?
Michael: Yeah, exactly.
It's a visualization tool for EXPLAIN
and then also tries to
provide performance advice based
on that.
So we normally get people's more complex query plans.
So I guess this is an area where I've got more practical experience than most.
Obviously, everyone knows when
you're building something, people
that build healthcare software
aren't always intimately familiar
with everything about healthcare.
There's certain things you get
really familiar with, but you
also end up having to deal with
things that are not to do with
that at all.
But yeah, I have over the past
few years come across probably
most variations of how performance
problems can be shown via
EXPLAIN plans.
I have some blind spots for sure, but yeah, it's been a big thing for me. But also for you, right? Like, a big part of your work has been performance consulting, and...
Nikolay: Yeah, well...
Michael: It's hard to do that without EXPLAIN.
Nikolay: For me, EXPLAIN is not
the number 1 tool usually, except in cases when a customer asks us to help with a specific query, which happens as well.
But usually we deal with analysis, like top-down analysis of the whole workload.
We start with pg_stat_statements and wait event analysis, Performance Insights or something like it.
Then we go down: we identify an aggregated query, or normalized query in official pg_stat_statements terminology.
Only then do we find examples of the badly behaving query.
Then we discuss how to improve it.
We need a proper playground, and maybe we will want to discuss how to make EXPLAIN plans match what we see in production.
And only then we go and check the
actual plan and try to improve
it and so on.
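To sketch that top-down starting point, something like the following (both queries are illustrative, not our exact tooling):

    -- Top consumers by total execution time:
    SELECT queryid, calls, mean_exec_time, total_exec_time
    FROM pg_stat_statements
    ORDER BY total_exec_time DESC
    LIMIT 10;

    -- Crude wait event sampling from pg_stat_activity:
    SELECT wait_event_type, wait_event, count(*)
    FROM pg_stat_activity
    WHERE state = 'active'
    GROUP BY 1, 2
    ORDER BY 3 DESC;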
Sometimes we have auto_explain,
right?
auto_explain is a great tool.
Michael: Yeah, we did an episode on that
too.
Nikolay: Yeah, yeah, I remember.
And yeah, if you have it, it's
great.
But it's not always so.
It's not enabled by default.
It requires some effort to be enabled.
Although it's present, it's available
everywhere.
So I wish I saw it more often.
Michael: Did you see I did a blog post on this this year, on which cloud providers provide access to auto_explain?
Nikolay: I don't remember, honestly. I need to check it.
Michael: I'll share it. It's not the most interesting read, but it's a good reference, and I find it useful for when somebody mentions a cloud provider to me.
Nikolay: So everyone has it right?
Michael: Yes, pretty much.
Not quite everyone but pretty much
everyone.
And most of them provide access to most parameters. EXPLAIN and auto_explain have a bunch of parameters, and I found it hard to check which providers offered which parameters.
And I've got good news as well,
actually.
I got a nice direct message from
somebody who works at AWS and
the parameters they didn't offer,
there's like 2 or 3 that they
didn't offer, they've submitted
a pull request for.
So I'm looking forward to the next
release of the RDS products,
which I think will include those.
Nikolay: That's good.
So yeah, of course, we also deal with EXPLAIN, but it's just 1 of the things, and we probably don't dive as deep as you do.
So where do you want to start this discussion?
Michael: It was your suggestion, where do you want to start?
Nikolay: Oh, but you're an expert, right?
Michael: Okay, alright, maybe.
I'm not sure, I still feel really uncomfortable with the word
expert.
Nikolay: I'm feeling very uncomfortable with the word analyze.
It should be execute, not analyze, right?
In the EXPLAIN command.
Michael: I mean, naming...
I've been thinking about naming for a while with this because
part of our tool, we give people something to copy and paste
so they can easily copy it and put it in front of their query
if they're using psql or an editor or something.
And it's getting longer and longer, the more parameters get added to EXPLAIN.
And the more information the better when you're using a tool: if you don't have to worry about the output getting long, and your visualization tool should be pointing people at the right place anyway, the more data the better, really.
The more information you get, the more likely it is you can point
out the problem.
However, it's become EXPLAIN and then, in parentheses, ANALYZE, BUFFERS, VERBOSE, SETTINGS, FORMAT JSON, and there's more.
Now there's SERIALIZE and MEMORY too, but most of these are off by default, so you have to specify all of them to get all of the information.
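As a sketch, the full invocation currently looks something like this (SERIALIZE and MEMORY need Postgres 17, and the query itself is just a placeholder):

    EXPLAIN (ANALYZE, BUFFERS, VERBOSE, SETTINGS, WAL, SERIALIZE, MEMORY, FORMAT JSON)
    SELECT * FROM orders WHERE customer_id = 123;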
So ANALYZE is 1 of them, but the reason I was thinking about naming is that I was wondering about suggesting something like EXPLAIN ALL.
The problem with that is there's so much advice out there in the wild: when you run EXPLAIN ANALYZE, be careful, because your query will be executed; if your query is data-modifying, it will modify the data; if your query is long, it will take that amount of time to run.
All of that advice wouldn't become obsolete exactly, but if we changed the name, it all becomes difficult: you have to go back and change all of the advice, or add it for both EXPLAIN ANALYZE and EXPLAIN ALL.
Nikolay: For 5 years, the old word could be supported.
But it's so confusing to explain to people that ANALYZE is actually executing the query, and that it's not the same ANALYZE as ANALYZE table_name, which also probably affects the plan, but you need to understand the difference.
If you run ANALYZE table_name, you will probably see a different plan afterwards in EXPLAIN and EXPLAIN ANALYZE, but not because these 2 words mean the same thing.
They are very different words.
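A minimal illustration of the 2 unrelated meanings (table name hypothetical):

    -- This ANALYZE executes the query and reports the actual plan:
    EXPLAIN ANALYZE SELECT * FROM t WHERE id = 1;

    -- This ANALYZE is a different command entirely: it refreshes planner
    -- statistics for the table, which can change the plans you see afterwards:
    ANALYZE t;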
Remember when we discussed Postgres 17, there was some confusion resolution there.
I wish this confusion would one day be resolved as well.
Michael: I agree in principle, I just think in practice it's
gonna be painful.
Nikolay: Why painful?
The word execute is not used yet.
First we introduced synonym and then in 5 years we ditched the
old…
Michael: It's used in prepared statements, right?
Like prepare, execute.
Nikolay: Yeah, but well, execute means execute, and analyze and
explain also means execute.
Okay, by the way, let me confess about something related to prepared statements.
A few weeks ago, we had an episode about planning time, which is also related to today's topic, because we mostly check planning time using EXPLAIN.
Of course, we can check it in pg_stat_statements, but again, not everyone has track_planning enabled in pg_stat_statements, so usually we just check manually, in an ad hoc way, using EXPLAIN.
Not auto_explain, sadly.
It cannot show planning time?
Wow.
Okay.
Good to know.
Michael: But in pg_stat_statements, it's there if track_planning is on, yeah.
Nikolay: Yeah, if track_planning.
So if we, for example, deal with partitioning, with a high number of partitions, planning time increases, as we discussed.
But I just blogged about it, and right after that we had a discussion on Twitter, and it turned out that my experiment was wrong; well, my ideas were not fully wrong, but not purely accurate, because planning time is affected only the very first time a query is executed in the session.
If you execute it once again, we already have the relation metadata cache, the relcache, and the overhead becomes almost 0.
At least for simple queries.
Michael: Yes, simple queries that can prune down to a single
partition or like a low number of partitions.
And it was David Rowley, who
Nikolay: pointed out… Yes, of course.
If there's no pruning, then it's a different story.
Of course, with the number of partitions growing, more work needs
to be done at planning time.
But yeah, I'm talking about partition pruning happening at planning
time.
The idea is only the very first execution suffers.
Michael: It's so, I mean I'm really impressed David noticed that
so quickly.
Yeah.
I mean he knows his stuff around
this topic, I'm pretty sure
he was involved in the fixes, I
think it was in Postgres 12 to
improve that exact metric.
So yeah, kudos to him for pointing
that out so politely as well.
I think he said it was misleading,
not wrong.
And also a really good reminder; do you remember when we had Melanie on to talk about benchmarking, how difficult benchmarking is, making sure you're measuring the thing you think you're measuring?
It's just a good reminder that
these things are so difficult.
But by publishing them, by publishing
your findings, and by the
Postgres community being so open
to correcting us when we're
wrong, we've all learned something, I think.
Nikolay: That's good.
Exactly.
It was very good.
And I cannot agree it's wrong, because there was no mistake there.
The problem was that the experiment didn't do the next step, which of course I should have done.
Actually, misleading is a good term here.
I extended the blog post right away.
It turned out that if you use just EXPLAIN, no execution, no ANALYZE at work, you can see it as you add partitions.
We checked from 0, 100, 200, and up to 1,000.
We were using EXPLAIN on a SELECT with a primary key lookup, which is basically also a partition key lookup.
First of all, you see an index scan on only 1 partition: partition pruning is working at planning time.
But planning time grows linearly with the number of partitions, and the number of planning buffers also grows linearly, for this kind of query; and it was range partitioning, we didn't check others.
But once you run EXPLAIN once again in the same session: only 3 buffer hits, always, regardless of the number of partitions.
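Roughly, the experiment looks like this (table and key names hypothetical; EXPLAIN (BUFFERS) without ANALYZE shows planning buffers as of Postgres 13):

    -- First run in a fresh session: planning reads catalog and index metadata,
    -- so planning buffers and planning time grow with the partition count.
    EXPLAIN (BUFFERS) SELECT * FROM events WHERE part_key = 42;

    -- Second run in the same session: the relcache is warm, so only a few
    -- buffer hits and near-zero planning time, regardless of partition count.
    EXPLAIN (BUFFERS) SELECT * FROM events WHERE part_key = 42;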
It's also a good relief, right?
So you think: oh, it's great, I'm just going to relax, and I can have 10,000 partitions for simple queries like this.
Of course, complex queries will
be different, I'm quite sure.
But if partition pruning works
at planning time, I'm happy, right?
It just tells me I need to take
care of proper connection poolers.
I need to have them always.
The problem with our discussion and my ideas from a few weeks ago is that I said: oh, if you have high overhead of planning time, which, as it turns out, applies only to first-execution planning, then just use prepared statements.
But this is definitely maybe not
only misleading, but also wrong,
because as I actually learned in
detail, if you use prepared statements,
you see that partition pruning
shifts to execution.
So the plan, a generic plan, includes
all partitions, right?
Because we cannot cache a plan which deals with only a single partition in this case; what if we need to execute the prepared statement with a different parameter?
We would probably need a different partition.
The cached plan includes all partitions, and if you run EXPLAIN for the prepared statement with a specific parameter, you see, at execution time, 'Subplans Removed'.
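A sketch of how to see this (names hypothetical):

    SET plan_cache_mode = force_generic_plan;
    PREPARE q (int) AS SELECT * FROM events WHERE part_key = $1;
    EXPLAIN (ANALYZE) EXECUTE q(42);
    -- The generic plan's Append node covers all partitions, but shows a line
    -- like "Subplans Removed: 999": pruning happened at execution time.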
This is interesting.
Which means that overhead shifted
to execution.
It's not good, right?
But again, overhead; which overhead?
If we have a warm relcache here, all good.
So it means that for a high number of partitions, it's not prepared statements that matter.
They can help, but they don't help fully, because of this shift of partition pruning.
But what actually helps a lot is
poolers.
We have poolers in all heavily
loaded projects.
And if you are preparing to have huge tables, like we always say, if you exceed 100 gigabytes, you need partitioning.
But it also means you do need proper connection pooling, to avoid the initial overhead of planning, basically.
With a high number of partitions, that first planning will not be good.
So back to explain.
Michael: Yeah, a question back to EXPLAIN, kind of linking them.
When you did the second EXPLAIN, extracting the buffers and things, did you use FORMAT JSON and parse the output?
Or what did you do?
Nikolay: It was regular text format.
Michael: And you just parsed it?
Nikolay: Yes, yeah.
And it was done by our AI, so it doesn't use JSON at this point; we just don't want a very bloated format.
And yeah, actually, we needed to fix how it works, because right now it works through psycopg, so Python.
If you ran a couple of EXPLAINs in one shot, it received only one result.
We fixed that, so now it sees both results.
The second result is always, in
our case, three buffer hits.
Planning time remains way below
one millisecond, I think maybe
10 microseconds or so, super tiny,
almost zero.
Bottom line here is that usually
we say, if you run
EXPLAIN (ANALYZE, BUFFERS), of course.
We had a couple of episodes about
it.
If you run it, always run it a
couple of times, maybe a few times
because of caches.
If you want to check timing, if
you're too obsessed by timing,
not my case.
I'm obsessed by buffers, not timing.
You need to run it multiple times
to check the second execution
because first time probably everything
is not cached, data is
not in the buffer pool, and not
in page cache, so run it twice.
That's a good piece of advice,
right?
But what we just discussed also means that even for planning, run it twice in the same session, right?
Because maybe you observe a lot of work if partitioning is involved, and especially a lot of work related to the lack of relcache entries.
Michael: I've seen this.
The other weird thing I've seen,
I don't know if you know the
answer to, is you can do a simple
ID lookup, like primary key
lookup and you will get planning
time, but no planning buffers
reported, even if you include the
buffers thing.
What's that?
How's that?
It's a bug I think.
Yeah.
Nikolay: Inside my team, last week, since we worked on this post, many people were actually involved, and this particular problem was very annoying.
Very annoying.
Like, why are buffers not reported for planning?
Okay, now they are reported.
And we cannot understand what it depends on.
So, once we understand, it will probably become a bug report, right?
Because something is wrong there; we requested them.
Michael: It's weird, but I mean
the text format is hurting you
there because by default in the
text format, if things are 0,
they don't get reported, it's like
roughly a general rule.
Nikolay: You think so?
Michael: Whereas in JSON format,
you'd still get buffers reported,
but they would have zeros.
It's not much better.
Nikolay: In our case, it should
be reported because it always
was something like some number
of hits.
But yeah.
Michael: So that's a different
bug, right?
Like all I meant is if planning
buffers
Nikolay: are 0,
Michael: I don't understand how
planning buffers can be 0 if
it's like it was actually doing
work, if it's actually, you know,
picking which index to scan.
So I found that…
Nikolay: In the case of a prepared, cached plan: if the plan is cached, there is still some planning time, but I suspect this planning time occurs during execution, because of these 'Subplans Removed' and the partition pruning happening there.
I'm not sure, by the way.
This is some nuance.
Michael: I think maybe like
parsing might count as planning
or like I think there's other steps.
Nikolay: Well, I'm looking at it right now.
I'm looking at the plan which checks execution of the forced generic plan.
plan_cache_mode is set to force_generic_plan, and we check execution for a prepared statement.
It's cached.
The plan is cached.
We see no planning-related buffers reported at all.
It's missing.
But planning time is like 38 microseconds.
And partition pruning happened
at execution time.
So I wonder what are those 38 microseconds.
If plan is cached, maybe just additional
checks, I don't know.
It's interesting, right?
Michael: You know there's multiple
stages, like, you know, I
can't remember all of them, but
1 of them, for example, is parsing.
1 of them is like rewriting, so
like you might rewrite the query.
I think some of those get bucketed at the moment as planning.
Nikolay: So probably you're right,
there are no buffer hits involved
at all and that's why it's not
reported at all.
Michael: But I've seen this, I've seen 0 reported when it's not a cached plan, like if it's a primary key lookup that you've run.
Nikolay: So where there is some
work for sure, right?
Michael: Yeah, or like in my head
it would make sense that there's
work.
Anyway, it's an interesting quirk.
Nikolay: And even if it's not a bug, the fact that you raised this topic and we discussed it inside my team is already something, right? It's not clear.
And yeah, if someone watching us is more experienced than us, please tell us.
This is super interesting piece
of information.
So what I wanted to say: you need to check the EXPLAIN plan a second time inside the same session, just to exclude the overhead you observed coming from the lack of relcache entries.
This is a lesson I learned last week.
This is a correction to the ideas I expressed a few weeks ago in the planning time episode.
Michael: And there's some good
practices that help mitigate
this kind of issue anyway.
Normally, people, when they're benchmarking or testing things, will run something 5 times and maybe take the best of the 5, or run things multiple times.
I don't know if you've ever seen how some sports do it; diving, for example: they give people scores out of 10 and then they cross out the top-scoring judge and the bottom-scoring judge.
Nikolay: And yeah, actually, we did that.
We did it, but due to some reasons, it was not done that way here.
We plan to improve this, and this kind of analysis should be done exactly that way.
You run multiple times, you check which data points are way off
and exclude them and then you take average.
That makes sense totally.
But in this case, just as we did it, it was a single-session experiment and we ran all the EXPLAINs there.
So the bottom line: if you run EXPLAIN 5 times and every time it's a new connection, the problem persists.
You need to run it multiple times inside the same connection.
And then, I would not agree with excluding the first 1 due to the lack of relcache, because it's also an interesting fact that, with a growing number of partitions, the very first execution can reach dozens of milliseconds and even hundreds of milliseconds.
It's a lot.
It means connection poolers are needed, and they need to be configured properly if you have thousands of partitions.
Otherwise, you will very often have a new connection established, executing for the first time with a huge number of partitions, and you get a penalty of dozens of milliseconds; for heavily loaded projects that's a lot, maybe 100 milliseconds.
It's huge overhead.
This fact is also interesting, right?
We learned 2 facts: the first-execution overhead is huge, growing linearly, for this particular, very trivial case; and there is basically no overhead for subsequent queries.
Wow.
This is like a sequence of surprises for me in this work.
I like it actually.
Let's maybe switch to another topic.
Maybe you have some ideas what to discuss next?
Michael: Yeah, well when I think of advanced EXPLAIN, I'm thinking
more like advanced query optimisation.
Maybe you're familiar and got good at reading EXPLAIN plans when
it's a relatively simple query, maybe some normal scan types,
some normal JOIN algorithms.
You're somewhat familiar with it, you can work your way around
that.
But I've spent several years looking at this and I still from
time to time see operation types that I've never seen before.
And things get complex, just in terms of how many operations there are, or there are things that complicate matters, like CTEs and InitPlans, and which order things happen in, things happening at the same time.
I think we had Haki Benita on relatively
recently talking about
how if you have multiple CTEs,
those happen at the same time
and independently of each other.
So that's confusing.
And how those appear in EXPLAIN is interesting, because it's shown as a tree, and they had to work out different display mechanisms for showing that.
That's not necessarily misleading, because it's hard to say how you should visualize it.
But that's what I'm thinking in
terms of like advanced topics
around explain.
How do you interpret those?
And also, what can you ignore?
If you want to do this, normally
your aim is, can I speed this
up?
And as things get complex, you
often want to simplify.
Like, what can I do to discard
a bunch of this information?
Like, which bits of this are already
fast that I can ignore?
Maybe I can run a simpler query that's only the part of the plan that's problematic.
That's what I'm thinking in terms
of more complex topics.
Nikolay: Right, I agree.
Reading complex plans.
This is it.
Yeah, I agree.
What actually helps here?
Experience helps, definitely.
For me, buffers are a must.
Buffers is a must.
First, I start reading execution time and planning time, but then I quickly scan the buffers, and they are cumulative: they are accumulated up to the root.
The root includes everything.
So you can go from root to leaves, basically.
It's an upside-down tree, right?
So it's top-down analysis again.
Okay, we understand the number of buffers, and understand how many megabytes or even gigabytes it sometimes is if you multiply by the block size, which in most cases is 8 kibibytes.
Then we go down and see where exactly
this huge volume came from.
This is my default type of work
with plans, including execution,
of course.
Because if you don't have execution,
you only have cost, and
buffers are reported only for planning,
as we just discussed.
In this case, usually it helps.
In most cases it helps.
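For example, if the root node reports something like the following (numbers made up), the conversion is simple arithmetic:

    Buffers: shared hit=117216 read=13130
    -- (117216 + 13130) blocks * 8 KiB/block = 130346 * 8 KiB, roughly 1018 MiB,
    -- so about 1 GiB of data touched by this query.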
And I wish I was also able to see lock-related information, right? Mixed into the plan.
Michael: Yeah.
Like waits.
Yeah.
It's difficult to see how they
would do that, but...
Nikolay: Not necessarily waits.
Michael: What do you mean by locks?
Nikolay: If we...
Yeah, yeah, actually maybe you're
right.
Yeah, what's happening under the
hood slightly, but it's already
super complicated output, right?
So...
Michael: Yeah, and there are macro versus micro issues here.
If you're on a test environment where you're the only person there, it's much less likely that your measurements are going to be incorrect.
Nikolay: When I said locks, I meant heavy locks.
At planning time, all indexes are locked, and all tables, all relations involved, are locked.
And that's a lot of overhead.
I wish I saw it.
It's interesting information.
AccessShareLock for all of them.
And if it's an UPDATE, okay, what kind of lock do I have here?
Michael: Oh, interesting.
Yeah, that's not what I was thinking
at all.
Nikolay: Yeah, you thought about lightweight locks.
Michael: Well, no, I was...
Oh, even a different thing.
Or wait events.
No, I was thinking, like, the easiest experiment to show this is: start 1 transaction and do some DDL, drop a column or something, anything that takes a really heavy lock on that table.
Then open a second transaction and just do a simple SELECT query on that table.
Go back to the first transaction and commit, go back to the second 1, and see that your query was just waiting that entire time.
If you run that query with EXPLAIN in that second session...
Nikolay: If you commit it, it's not waiting anymore.
Michael: But the execution time of that EXPLAIN will be huge.
Nikolay: Oh yeah, yeah, yeah, yeah,
yeah, yeah, yeah, you're
right.
And this execution time could be explained if we saw basically a pie chart, or a distribution of wait events.
Michael: There will be no node, there will be no operation in that EXPLAIN plan that shows that it was waiting on an exclusive lock.
It will just show that 1 of the
operations was extremely slow.
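The experiment described above, as a minimal sketch (table and column names hypothetical):

    -- Session 1: take a heavy lock and hold it.
    BEGIN;
    ALTER TABLE t DROP COLUMN payload;   -- takes an ACCESS EXCLUSIVE lock

    -- Session 2: this blocks until session 1 finishes.
    EXPLAIN (ANALYZE) SELECT * FROM t WHERE id = 1;

    -- Session 1: release the lock.
    COMMIT;

    -- Session 2 now prints a plan whose total reported time includes the whole
    -- lock wait, with no node in the tree indicating the wait itself.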
Nikolay: Simple wait event
analysis for this particular
session would help.
Michael: Yes, but it doesn't
come up that often to me.
Maybe it's 1 of the blind spots
because I don't see those, but
I don't see plans that have these
really unexpected bottlenecks.
Nikolay: I would like to be able to, not by default, but I would like to be able to see this wait event analysis, and heavy locks as well, because I want to understand this query: which locks is it going to acquire when I go to production?
When execution finished, what locks were acquired?
I think these would be interesting extensions, but the output is already so huge.
Michael: And as I mentioned, we're getting more and more; I think this is part of the reason people are adding more parameters to EXPLAIN, like write-ahead logging information.
Nikolay: It should not be by default,
it's too much.
Michael: Well, I actually have
a proposal here.
I haven't suggested it on any of
the mailing lists yet, but I
would like your opinion.
I thought about adding them by
default when you include the parameter
verbose.
So if you do explain, analyze verbose,
give me everything.
Nikolay: You just don't want to update your website snippet.
Michael: Yes, yeah, I don't want to update
that snippet for sure, but
also I want people to only have
to remember to do that and they'll
get whatever Postgres can give
them.
Verbose means really long.
It means everything.
It kind of means like, I'm okay
with loads of information, give
me it all.
So it's already kind of in the
meaning.
Anyway, I haven't proposed it yet,
but that would also come with
buffers, so I thought you might
be happy.
Nikolay: Okay.
And also, if we talk about what's missing and what you cannot get from EXPLAIN: at some point, maybe some extension of this could be, imagine if we saw operations at the disk level as well.
Then we would understand not only hits and reads for the buffer pool, but also hits and reads for the page cache, for example, and maybe CPU analysis as well, for this particular execution, if it is EXPLAIN ANALYZE.
Michael: Right, that's a good point.
It's not exactly what you're talking about, but loads of people don't realize, because it's not an EXPLAIN parameter, that with track_io_timing on you can start to get some insights into the disk-level performance.
And that's improving in each release; even in 17 we got some improvements there.
Buffers are split into shared, local and temp; I/O timings weren't initially, and are now split into all 3.
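A sketch of what that looks like (the query is a placeholder):

    SET track_io_timing = on;
    EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM big_table;
    -- Nodes that actually hit the disk now include lines like:
    --   I/O Timings: shared read=123.456
    -- separating time spent on reads from everything else.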
Nikolay: Yeah, that's great.
But I'm talking about like basically pg_stat_kcache, but for
EXPLAIN.
Michael: I understand, but the timings are a good start, right?
Nikolay: I agree.
Yeah.
And actually, I think it's possible to have all of this right now already, but you need to connect many pieces.
For example, it could be a tool which runs EXPLAIN, but also looks at wait event analysis for this PID in pg_stat_activity, and at locks as well.
And what else?
It collects everything into a big report for a particular query.
And also physical metrics: if we have physical access, we can get them from /proc.
Knowing the PID; I think we had a prototype of this for our old bot.
So it's just counters, counters from /proc.
Michael: You can, because you've got the query identifier, which is the normalized 1 that you can match to pg_stat_statements, I'm imagining.
Yeah, you could start to tie all of these things together.
Nikolay: Yeah, a lot of stuff is possible.
Who's doing this, by the way?
Maybe someone is doing this.
Michael: I don't know the monitoring tools well enough to know who's doing some of this at the micro level, but if they are, let us know.
If you're using 1, or if you're building 1 that does this, it would be cool to hear.
To be honest though, it goes back to some of the conversations we've had about the macro-to-micro zoom: once you've done your macro analysis and you've identified a few problematic queries, like the main consumers, if you've got a few bottlenecks, EXPLAIN is a really good tool for doing that micro part.
Once you've got some sensible set of parameters that you know will be slow, it's not that hard to reproduce, normally.
The difficult thing is then going through a 200-line EXPLAIN plan.
You asked what can help.
I think we've put a lot of effort into helping with that stage
and a lot of the other tool providers have as well.
1 thing we do that I think is really helpful is that if a subplan is fast, as a proportion of your query, we collapse it by default.
We don't hide it; we just say, there are 20 nodes here, probably you don't want to look here.
Which means that a 300-line plan can go down to like 20.
Nikolay: That's cool.
Michael: But those kinds of things, I think, can really help initially.
It's not good for a Postgres hacker, someone trying to work out the cost model for a new feature.
But for somebody just trying to speed up a single query, you don't need to look at the 200 partitions that were never executed, or that took 0 milliseconds.
Nikolay: Yeah.
And you mentioned query ID. For example, I think it was last week, we were working with 1 of the customers, and we saw that auto_explain was used, which is great.
And there was a task to analyze a lot of plans, and we still plan to do it.
I'm very thankful to you that we have an integration in our AI right now, and your system helps our system.
I plan to perform some massive analysis at a larger scale.
But the interesting thing is that
if you want to analyze a lot
of queries from auto_explain, you
do need some...
You basically need to compute...
It's called compute_query_id parameter,
right?
You need to turn it on because
by default it's auto and for auto_explain
it will not work.
Michael: It does if you have verbose on.
Nikolay: Okay, in auto_explain.
Okay, in our case we didn't have verbose on, I think.
Michael: But only as of 16, I think.
Nikolay: It's better to turn it
on, actually.
Michael: Maybe, yeah.
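A sketch of the settings involved (the threshold value is just an example):

    # postgresql.conf
    shared_preload_libraries = 'auto_explain'
    compute_query_id = on                 # 'auto' only kicks in when a module such as
                                          # pg_stat_statements requests query IDs
    auto_explain.log_min_duration = '1s'
    auto_explain.log_verbose = on         # needed for the query ID to appear in
                                          # auto_explain output (Postgres 16+)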
Nikolay: The question of overhead,
and let's, by the way, talk
about overhead in a second.
But imagine you have a lot of query cases, various queries from the log.
The presence of the query ID first gives you an understanding of how many different normalized, or aggregated, queries you are dealing with.
Second, you can then join this information with pg_stat_statements, go back to the macro level, and understand: okay, this query is not very influential, because it's a tiny part of the whole; or maybe vice versa, it's influential, although we see only a few occurrences in our slow log, because we have a threshold by time.
auto_explain.log_min_duration; we usually set it to 1 second, for example.
If we have only a few cases, but we see in the workload, oh, this is actually 50% of all total exec time, then we do need to think about it properly, and the influence of this optimization will be high.
Let's do it.
This helps you understand, when you have a lot of queries to deal with, where to start.
You want to start with the most influential ones.
So query ID is a good thing.
I'm glad we have it everywhere
in EXPLAIN, in the logs, in
pg_stat_statements, you can connect all
the dots.
And pg_stat_activity as well, right?
So wait event analysis is possible
just joining by query ID as
well, if you run many, many times,
right?
So, yeah, great.
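For instance, with a query ID taken from an auto_explain log entry (the ID here is made up), a lookup like this shows how influential that query is overall:

    SELECT queryid, calls, total_exec_time,
           round((100 * total_exec_time / sum(total_exec_time) OVER ())::numeric, 1) AS pct
    FROM pg_stat_statements
    ORDER BY total_exec_time DESC
    LIMIT 10;

    -- Or directly: SELECT * FROM pg_stat_statements WHERE queryid = 1234567890123456789;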
What about overhead?
I think it's also an advanced topic.
I know you blogged about it, and OnGres did.
You talked about auto_explain, right?
But EXPLAIN itself also has overhead.
Michael: Well, EXPLAIN can do both; it can under-report and it can over-report the timing.
Nikolay: Django, I'm not a fan
of Django.
Michael: Well, but bear with me.
I think people focus a lot...
There are different things.
The OnGres thing was about the queries being slower: you actually cause some overhead by observing, and that's true when you're running EXPLAIN ANALYZE with timing, because timing is on with ANALYZE by default.
But it's also true if you're running auto_explain; there's also some overhead to measuring and to logging.
If you log a lot of plans, that's some overhead as well.
So there are those overheads, but there's also the possibility that it under-reports, which I think throws some people off.
So this is 1 of the reasons they added the parameter SERIALIZE.
If you are doing a lot of work after the query is executed, in order to send the data to the client...
People don't really care only about the server-side time.
The server-side time is important, that's what we get from EXPLAIN ANALYZE, but what they normally care about is the user experience, the full client-to-server-to-client time.
Nikolay: And- I cannot agree here,
but okay, go on.
Michael: Okay, well, what I mean is, if you're getting a report that 'my dashboard is slow', there's no point only speeding up the server-side time if you're sending huge amounts of data.
And SERIALIZE covers part of that work: it's getting the data ready to transmit.
And then there's the network overhead as well, of actually sending all that data.
So, yeah, I guess you're right.
Nikolay: It's not network cost.
It's not transmission cost.
It's the cost of... well, 'cost' may not be a good word, it's overloaded here, right?
It's time, right?
Michael: SERIALIZE is reported as time and amount of data.
Nikolay: What about cost?
Okay.
Michael: I actually don't think it has a cost related to it.
I think it's 1 of those things that gets reported in the summary section, like just-in-time compilation and planning.
Nikolay: It's about serialization, not transmitting.
Michael: Yes, it's the step before transmitting: getting the data ready to transmit.
Nikolay: It's still everything
on the server.
Michael: For example, decompressing TOASTed data.
Nikolay: And it's a Postgres 17 feature, right?
So it just goes deeper, to understand what's happening on the server side, like execution, and what part of it is associated with, attributed to, serialization, right?
Michael: Well, and not just
that.
Before, if you ran EXPLAIN ANALYZE,
SELECT some data that had
been TOASTed from this table, you
would get a really fast execution
plan because it wouldn't do the
serialization.
It would say this executes really
quickly.
So it wasn't just that you couldn't see the time; a query that might take 100 milliseconds when you run it from psql could run in 1 millisecond or 10 milliseconds if you put EXPLAIN ANALYZE in front of it, much, much faster, because it didn't have to do that expensive step.
So this is 1 of those cases that used to be, and not just used to be: by default this is still the behavior in version 17, if you run a plain EXPLAIN ANALYZE.
Nikolay: SELECT star from blah blah blah; and before 17 we didn't see this part at all.
Michael: Even in 17 you don't see it unless you also explicitly ask: EXPLAIN (ANALYZE, SERIALIZE).
That's what I've been saying.
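A sketch (Postgres 17+; table name hypothetical):

    EXPLAIN (ANALYZE, SERIALIZE) SELECT payload FROM documents;
    -- Adds a summary line such as:
    --   Serialization: time=245.321 ms  output=52034kB  format=text
    -- i.e. the time spent detoasting and converting rows for output,
    -- which a plain EXPLAIN ANALYZE skips entirely.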
Nikolay: And what about verbose?
Verbose includes it, right?
Michael: No it doesn't.
But this is my proposal that verbose
should include all of these
additional parameters.
Nikolay: It should, of course, I agree.
Well, okay.
It is as it is right now.
Michael: And I have to actually say, EXPLAIN is awesome.
EXPLAIN in Postgres in particular is the best I've seen in any database that I've worked with.
Others, like MySQL for example, are catching up; they got EXPLAIN ANALYZE in version 8.
But the amount of detail we can already get is beyond the other open-source databases I've seen.
So it is good; it's just that it can always be better, right?
Nikolay: Yeah, yeah, yeah.
If somebody is watching us thinking, oh, a couple of haters,
right?
Haters of EXPLAIN.
It's just a lot of stuff, yeah, yeah.
But a lot of stuff is helping all the time, it's true.
Do you want to discuss overhead from buffers option?
Michael: I couldn't find any.
I did some really basic experiments, with very, very small queries, but I couldn't get any additional overhead.
There wasn't a measurable extra overhead of requesting buffers.
I think you said you did an experiment once that did show it, but I don't remember.
Nikolay: Yeah.
On timing.
It cannot show up in buffers; buffers are basically invariant.
Of course, there can be hits or reads moving back and forth between them.
But if everything is cached, it's very stable: you're reading the same data, so run the same SELECT multiple times and you will see the same number of buffers every time.
That's one of the reasons I like them.
But if you include BUFFERS, I did see in the past, some time ago, that timing increases, because a lot of buffer counters need to be maintained and reported.
If it's a complex query, this overhead adds up in many places, and the timing without the BUFFERS option is very different compared to with it.
Michael: I've seen single queries where that's true, and I think Lukas Fittl from pganalyze has done some examples of this.
I think I've only seen it in queries that have nested loops with lots of loops.
That's the example everybody turns to.
Nikolay: For example, a join with a nested loop.
Michael: But the reason I think it's slightly overhyped is that when you're optimising, you don't always care.
If the EXPLAIN ANALYZE takes 4 seconds, or maybe let's pick 400 milliseconds, and your real execution time is 300 milliseconds: if you can speed the 400 milliseconds up to 50 milliseconds, the 300 might go down to 40, and directionally you're still looking for the same things.
And by asking for buffers information,
you're getting more information,
you're more likely to be able to
optimize your query.
So I don't care as much as most
that it's slightly different.
Nikolay: Maybe.
I only care that including BUFFERS doesn't change buffers.
That's it.
So for me, when I start working with buffers, I'm less concerned about timing.
Focus on buffers; optimize, optimize, optimize; and then go back to the result and see: oh, timing has improved drastically, because we reduced the number of buffer hits and reads.
That's it.
This is a very, very common path for me.
Just reduce: say you have a sequential scan with lots of buffer hits and reads.
We add an index: okay, a couple, 3, 4, 5 buffer hits, that's it.
Time to look at timing: oh, of course, we expect it to be a thousand times faster, right?
Michael: I like, you know, I like buffers a lot, but for me they'll always be secondary, because you look at timings first and last; that's the main aim.
The main aim is to get things faster.
And yes, in maybe 99% of cases, it's going to be I/O.
That's the limiting factor.
But in some queries, there'll be a sort in memory.
There'll be something that doesn't report extra buffers in EXPLAIN.
There'll be just-in-time compilation, or there'll be something else, like maybe trigger timing.
Nikolay: Just-in-time compilation
is off.
Michael: But all I'm saying is: for planning time, actual planning time, you now do get buffers.
But trigger times, just-in-time compilation, there are a few other examples, sorts in memory, hashes in memory: some of these other bottlenecks don't report buffers.
So why not look at timings and then buffers?
Also, I almost always see query plans with both these days, so it's kind of a non-issue.
Nikolay: That's great that buffers are present.
At some point I started saying: if there are no buffers, I'm not dealing with this query.
This is what I say to customers: go and give it to me with BUFFERS, because we are missing super important information.
Okay, if you put it in second place, it should still always be present.
Yeah.
I'm sure.
Michael: That's what it was like before.
I think that's probably enough.
I've got a talk that I gave a few years ago called EXPLAIN Beyond the Basics, which I would have thought covered some of the bits of this, but actually it took a completely different path: it went down some of the less common types of issue that you can spot with EXPLAIN, which we haven't really talked about.
So I can include that, if anybody wants to know a bit more of the specifics.
That format is also a bit nicer, because you can actually see examples of EXPLAIN plans, so it's probably a bit better video-wise.
And yet, the reason I see buffers more often now is that everybody suggests adding it.
You, but also some of the other tools as well; depesz, for example, added it to the default command that he suggests people run.
So there's loads of people suggesting it.
Nikolay: That's news to me.
I remember it was missing and it
was a big disadvantage.
Michael: He added it.
It's great.
Nikolay: That's good.
Yeah.
Michael: Anything else you wanted to add?
Nikolay: I think it's enough.
Michael: Me too.
Nikolay: Thank you.
See you later.
Bye.