Custom vs generic plan
Nikolay: Hello, hello, this is
PostgresFM, episode number 96.
My name is Nikolay, founder of
Postgres.AI, and as usual, my
co-host is Michael from pgMustard.
Hi, Michael.
Michael: Hello, Nikolay.
Nikolay: So pgMustard, as I mentioned
last time, is focusing on
explain plans and helping optimize
Postgres queries.
And the topic I chose this time
is very, very close to that topic,
right?
I mean, it's a subtopic.
I know a lot of very in-depth material has been discussed around this topic, but let's just focus on the basics, for awareness.
This topic is how plan caching
works and generic versus custom
plans.
Because at a very high level, we
don't know about them at all.
We know that EXPLAIN shows us a
plan.
We know there is plan caching.
We know that pgBouncer recently started supporting prepared statements, so you can avoid planning time, which is sometimes very costly.
For example, if you have a lot of partitions and a lot of indexes, lock manager contention can happen.
So, especially for queries with very high frequency, for example a thousand times per second, it's good to just perform planning once per session and then rely on the cached plan.
But it's not really once per session,
right?
There are more nuances in this
area.
So let's uncover them and discuss.
Michael: Yeah, nice.
At a kind of high level, I see
people that are using ORMs and
frameworks using prepared statements,
sometimes without realizing
it, or sometimes they know but
they don't really know what it
means.
Do you see a lot of people, like
is that the same in your experience?
A lot of people using this without
realizing, or are you seeing
quite a lot of conscious choice,
like conscious use of prepared
statements, even for people that
aren't using it via an ORM or
similar?
Nikolay: Well, you know my point
of view on this world.
On one hand, I think people are quite
smart in general,
Michael: but
Nikolay: they are smart only when
they are very motivated.
And we know a lot of engineers
don't distinguish between binary tree
and B-tree at all, because they
are not motivated enough to understand
the differences.
Similarly here, a lot of people,
in my opinion, don't understand
that one normalized query in pg_stat_statements
might have different
plans depending on parameters.
This simple truth is something
that should be understood, but
it can be understood only when
you are motivated, when you have
some incident, for example, and
you start off thinking, oh, okay,
this query can behave differently
depending on parameters.
So this is like basic knowledge.
And honestly, I didn't realize
it for quite long in my career
as well.
I thought, okay, this is so, but
why is it important?
Well, it's very important because, looking at pg_stat_statements and understanding heavy queries, it's not easy to jump to optimizing those queries, especially when we talk about a mixed situation: a quite high-frequency query which also has issues with imperfect index choices, an imperfect plan, not just a clean index-only scan.
So it means that we need to understand,
okay, we depend on parameters,
but which parameters should we
choose for optimization?
Michael: When you say for optimization, do you mean we look at pg_stat_statements with its parameterized queries, and decide which parameters to put in in order to run EXPLAIN (ANALYZE, BUFFERS)?
Nikolay: Right, right, right.
This is unfortunately a big gap
between macro and micro levels
of query analysis, which still
exists.
Of course, with the addition of
query ID to logs and to pg_stat_activity,
we have this bridge between pg_stat_statements and individual
queries, which we can find in logs
or pg_stat_activity.
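For illustration, a minimal sketch of that bridge, assuming Postgres 14 or newer with pg_stat_statements installed and query ID computation enabled:

```sql
-- Match what is running right now (pg_stat_activity) to its normalized
-- entry in pg_stat_statements via the query ID
SELECT a.pid,
       a.query_id,
       s.query AS normalized_query,
       s.calls,
       s.mean_exec_time
FROM pg_stat_activity a
JOIN pg_stat_statements s ON s.queryid = a.query_id
WHERE a.state = 'active';
```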
But still, I remember 1 lesson
I learned in practice when I spent
1 hour or so optimizing some query.
It was some parameter which was
true or false.
And I was trying to optimize for
value which was present in 90%
of rows, majority of rows.
And it was, of course, difficult
because selectivity is low,
cardinality is high, right?
But then I understood that the whole hour spent was wasted, because that case never happens: the application never, ever uses this value in the query.
So the idea is that we need to understand real cases, not super-rare or non-existent ones; we need to understand the cases which hurt user experience, for example, and focus on them.
In this case, generic plan is something
really weird in this
picture I just painted, right?
So what do you think?
Like generic and custom and maybe
let's talk about how planner
uses them and then discuss what
should we do about it at all.
So this 5 times rule, what do you
think?
Michael: Yeah, I think that catches
people out sometimes.
So the way prepared statements work, at least in my understanding, is that you can prepare a query in advance.
That will do the parse stage, and several stages up to but not including the execution of that query; then you can ask it to execute with parameters.
Then, on the subsequent executions, you don't have to re-parse; you skip several of those steps.
Well, planning at first does have to be done again, right?
But the parsing and the rewriting stages, yes, on a per-session basis we don't have to do those again.
But there is also this weird kind
of initial phase.
So if you've done prepare, if you've
prepared the statement and
then you execute it, the first
5 times, Postgres will plan each
of those 5 executions separately
based on the parameters that
are given.
So in the case you gave, if you used true, the value covering 90% of rows, every single time for those 5 executions, it's going to plan it the same way each time.
And on the sixth execution it will then look at the cost of those plans.
If their average cost is not cheaper than the generic plan it has calculated, and I think there's some subtlety around exactly how precise that comparison is, then it will choose its generic plan thereafter.
It will flip to not planning each
new execution.
So sixth time, seventh time, eighth
time, ninth time, forever
onwards, during that session, it
won't replan.
So now you're not only saving on
the parsing, on the rewriting
stages, you're also saving on the
planning stage, which you mentioned
already.
But that's not true if, for example,
you did 3 or 4 executions
with true and then a couple of
executions with false in your
example, then you're going to get
a different kind of average
cost for those 5 executions.
Then that will get compared to
the generic plan and then a different
decision will get made.
So those first 5 executions do actually determine quite a lot of what happens thereafter, which is interesting.
But yeah, the one time I've seen those 5 become important is... this can be really confusing for folks.
If you're trying to debug something and it's fast for a while, and then suddenly it goes slow, this is one of those telltale signs once you've been around a while.
If you notice that it's fast for 5 executions and then slow on the sixth, you should look into this issue: custom plan versus generic plan.
It's kind of a smoking gun, a telltale sign that this could be what you're hitting.
But aside from that, I haven't
really thought too much about
it.
I don't think it's configurable,
that 5 number.
I think it's hard-coded.
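As a rough sketch of how to watch that flip happen, assuming a hypothetical table t with a boolean column flag (the generic_plans and custom_plans counters in pg_prepared_statements require Postgres 14 or newer):

```sql
-- Hypothetical table: t(id bigint primary key, flag boolean)
PREPARE q(boolean) AS SELECT count(*) FROM t WHERE flag = $1;

EXECUTE q(true);  -- executions 1 through 5 get custom plans,
EXECUTE q(true);  -- each planned with the actual parameter value
EXECUTE q(true);
EXECUTE q(true);
EXECUTE q(true);
EXECUTE q(true);  -- from the 6th execution on, the cached generic plan may be chosen

-- Postgres 14+: check which kind of plan has been used so far
SELECT name, generic_plans, custom_plans FROM pg_prepared_statements;
```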
Nikolay: Yeah, it's, by the way,
why 5?
Why 5?
Michael: Yeah, good question.
Nikolay: Some decision, right?
That's it.
But what is the generic plan, exactly?
The generic plan is just that we don't take the parameters into account at all, right?
Does that mean we don't take the distribution of values into account either?
Or do we…
Michael: Yeah, good question.
Did you look into how it's calculated
in the first place?
Nikolay: No, but it's an interesting
question.
Michael: I could be wrong, but
I would have guessed that it would
be based on some statistics at
the time.
If you were implementing this feature,
you would think if you've
got a data distribution there,
maybe you'd factor that in.
But it's difficult if the distribution is skewed.
If it's completely uniform, it's quite easy.
Like if we're talking about primary
key lookups, it's primary
key lookup query and every single
one's unique and every single
one's only going to return 1 row.
It's only scanning 1 table, that
kind of thing.
The generic plan is really easy
because every single parameter
you could provide to that is going
to be planned the same way,
even in the custom plan mode.
So generic plan in that case is
quite easy.
But in the case where it's a skewed
distribution, some values
have more entries than others,
it becomes tricky, I think, to
know how you would plan that in
a generic way.
So I don't actually know.
Nikolay: For me, a generic plan is a good thing when we don't have a lot of time to spend finding particular examples in logs or in pg_stat_activity.
In pg_stat_activity, we have a different problem: track_activity_query_size.
This parameter, which is 1024 bytes by default, means that really long query texts are truncated and we don't see the full query.
Parameters are usually at the end, right?
Select all columns, our favorite, from a table with so many column names, so I don't see the parameter.
Something equals... truncated.
I've seen it more than once.
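For reference, a minimal sketch of raising that limit; the value is in bytes and changing it requires a server restart:

```sql
SHOW track_activity_query_size;  -- default is 1024 bytes

-- Allow longer query texts in pg_stat_activity; takes effect after a restart
ALTER SYSTEM SET track_activity_query_size = '32kB';
```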
So when we don't want to spend time and want something really fast, there is this trick with a prepared statement: we prepare the statement and then just use nulls for all the values.
This gives us a generic plan, but we can also say SET plan_cache_mode TO force_generic_plan, which exists, as we just checked before we started recording this episode.
Surprisingly, the plan_cache_mode configuration setting has existed for many, many years.
So it's supported in all currently supported versions, right?
Michael: Yes, since version 12.
Nikolay: Exactly.
So 12 is the oldest currently supported
version.
A few months left until the end
of it.
Michael: Yeah.
Nikolay: Yeah, yeah.
So this trick allows us, looking at pg_stat_statements, to just take some normalized query, which has these $1, $2 placeholders and no parameter values at all, they are removed, and to use a prepared statement with this nulls trick and get some generic plan, and already think about how bad the situation is in general, not in particular cases, but overall.
But this trick, as you absolutely correctly mentioned before we recorded this episode, doesn't work if we have, for example, a primary key there.
Because "primary key equals null" won't work properly: given the not-null nature of a primary key, the planner knows that nothing will be returned.
So it doesn't show us the proper plan at all.
Michael: Yeah, if you don't set plan_cache_mode to force_generic_plan, then you'll get a custom plan, and the planner will shortcut it because it knows nothing can match, yeah.
But if you do force a generic plan on version 12 onwards, then you will get the generic plan you're hoping for.
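A minimal sketch of that nulls trick with force_generic_plan, assuming a hypothetical table t with a bigint primary key id:

```sql
SET plan_cache_mode = force_generic_plan;

-- Normalized query shape taken from pg_stat_statements, $1 left as a placeholder
PREPARE q(bigint) AS SELECT * FROM t WHERE id = $1;

-- NULL is fine as a throwaway value here, because the plan is generic anyway
EXPLAIN EXECUTE q(NULL);

RESET plan_cache_mode;
```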
Nikolay: Well, that's great.
I didn't realize that.
So yeah, good, good.
But if we are lucky enough to already have Postgres 16, all these tricks are not needed, because we can just say EXPLAIN (GENERIC_PLAN).
It's a new option of the EXPLAIN command, and it's very handy.
We just ask EXPLAIN for a generic plan and don't care about parameters.
And it's interesting that we can even use $1, $2, the query text right from pg_stat_statements.
This is super convenient.
We can say EXPLAIN (GENERIC_PLAN) and that text.
We can even probably automate this,
right?
And see, for example, if we have
sequential scans in generic
plans and put warnings somewhere
and so on.
But of course, this won't work if we want to actually execute, if we want EXPLAIN (ANALYZE, BUFFERS).
In that case, we need to replace $1, $2 with some values.
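For illustration, a minimal sketch of the Postgres 16+ option described above; the table and columns are hypothetical:

```sql
-- $1, $2 can be left exactly as they appear in pg_stat_statements;
-- GENERIC_PLAN cannot be combined with ANALYZE, since nothing is executed
EXPLAIN (GENERIC_PLAN)
SELECT * FROM orders WHERE customer_id = $1 AND created_at >= $2;
```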
Michael: Yeah, and it makes sense,
right?
We can't do the execution unless
we specify something, otherwise
what are you executing?
So it makes sense.
And kudos to Laurenz Albe from CyberTec, who added this and has blogged about it.
So I'll share the blog post about
that.
Nikolay: I actually remember this wave from Laurenz and also Lukas Fittl, who blogged about generic plans and also got excited.
Also, Franck Pachot, I think, joined at some point.
I don't know who started it, but I remember a wave of multiple blog authors writing posts about generic plans a few years ago, and it was great.
Michael: I don't think it's super surprising that this has come out of a Postgres consultancy, and that Lukas is excited about it.
So I think it's 1 of those things
that really helps people help
others.
Because if you really know what
you're doing, like these people
all do, you can now set up Postgres
in a way where you can find
parameter values in the vast majority
of cases, or you can reconfigure
things so that it does log the
query ID, or you can get your
slowest queries logged with parameters.
So there are these ways of doing
it now, but if people haven't
set that up in advance and you're
helping them with a production
issue, this is a really helpful
tool to give you a clue as to
what's going on.
Now, EXPLAIN without execution will only ever give you a clue, because you don't have the execution statistics, so you don't know for sure why it's slow.
But those clues are often enough for experienced people to form a theory, a hypothesis as to what's going wrong, and that gives them something to get on with instead of saying, "I have no idea why this is slow."
It's extra information to make the guess a little bit more educated, rather than an actual answer.
Nikolay: Yeah, and we can have
a holistic approach now, like checking
everything in an automated fashion,
not guessing and so on.
It's great.
So, yeah, what else is worth mentioning in this area?
Maybe partitioning, which we like.
It's funny that you found this commit in Postgres 15 which improved the situation.
My general impression, if you use
a lot of partitioning, you
should perform major upgrades promptly,
not lagging at all, because
every year a lot of things are
improved.
So this is a quite in-depth thing that was improved in Postgres 15, related to partition pruning, the exclusion of irrelevant partitions: if a user asks for some data in a query, Postgres can remove from consideration the partitions which are for sure irrelevant.
For example, we have range partitioning by time, by date, and the user wants fresh data, so we definitely don't need to look at partitions with very old data.
But this pruning might happen both at planning time and at execution time, which can be a big surprise.
And you want this to happen at planning time, basically; that's better.
Because it means we can later rely on the cached plan, or if we use prepared statements, we can rely more on cached plans.
And then we have a very good situation: partition pruning was already done at planning time, and we just execute. That's it.
But it's tricky when we talk about
generic plan, because generic
plan doesn't take into account
the values.
So something was improved in Postgres
15 in this area, right?
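A rough sketch of telling the two apart, assuming a hypothetical table events range-partitioned by created_at: plan-time pruning simply leaves irrelevant partitions out of the plan, while pruning of a generic plan at execution shows up as "Subplans Removed":

```sql
-- Plan-time pruning: with a constant, only matching partitions appear in the plan
EXPLAIN SELECT count(*) FROM events WHERE created_at >= '2024-01-01';

-- Execution-time (initial) pruning: the generic plan keeps all partitions,
-- and EXPLAIN reports "Subplans Removed: N" when it is executed
SET plan_cache_mode = force_generic_plan;
PREPARE recent(timestamptz) AS
  SELECT count(*) FROM events WHERE created_at >= $1;
EXPLAIN EXECUTE recent('2024-01-01');
RESET plan_cache_mode;
```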
Michael: Yeah, We tried to look
into it before the call, didn't
we?
And couldn't...
Nikolay: Yeah, I couldn't reproduce
the problem.
Michael: Yeah, so we tried on an
old version before these.
So it would be interesting to hear
from people that have reproduced
this.
I didn't see a blog post about
it, I saw it referenced a couple
of times.
Nikolay: Yeah, Lukas Fittl, in that blog post about generic plans, mentioned that before, we had a problem with generic plans: partition pruning didn't work at planning time, it worked only at execution time.
I couldn't reproduce it, but if it's so, it means that, for example, if you have a lot of partitions with a lot of indexes, at execution time you involve all of them.
In general, if you have a lot of partitions, just checking all of them, even without returning anything, is a big overhead, a huge overhead.
But at least at execution time Postgres doesn't put access share locks on all the indexes, whereas at planning time it does.
And if partition pruning doesn't work at planning time, it's a nightmare.
This causes lock...
Michael: ...manager spikes.
Well, I think that would happen at execution too.
Yeah.
Nikolay: So,
Michael: a question for you.
Because we do now have this plan_cache_mode, it got added in 12, and it's one of the things we can use to force a generic plan or a custom plan while we're debugging; pre-version 16, before we had this EXPLAIN (GENERIC_PLAN), we could use that for debugging.
But I don't think I've ever seen anybody change it.
So, plan_cache_mode got added.
You have 3 options: auto, force_custom_plan and force_generic_plan.
auto is the default, so auto acts as it did as...
Nikolay: As you described.
Michael: Yes, so the first 5 executions...
Nikolay: times custom, then decision
is made.
Yeah.
What choice?
Well, I
Michael: was going to ask because
I think I got the impression
that some of these ORMs, like ActiveRecord,
for example, uses
prepared statements.
I got the impression the main reason
they do so is to protect
against SQL injection attacks,
not for performance reasons.
And that might be not quite accurate.
But there is this other benefit
that seems to be mentioned quite
a lot around prepared statements.
Now, I wondered if you'd ever seen anybody change it.
If you were only using prepared statements for SQL injection protection, and you weren't as bothered about the performance benefits in some cases, or the pitfalls in other cases, you might want to allow Postgres to replan every single time.
I can imagine some people considering
turning force custom plan
on.
I can imagine like the upside of
doing so.
I can't see quite the same upside for force_generic_plan, because all you're doing is saving those 5; well, I think you'd only save those 5 custom-planned executions and then you'd flip to the generic plan anyway.
Maybe there's another case for
that.
But yeah, have you seen anybody
changing that setting in any
cases?
Nikolay: I see that almost everyone doesn't know about it.
It's very deep; it feels like some kind of fine-tuning.
I see people have fears, for example, about relying on prepared statements globally, because of this idea: what if the generic plan won't work well?
Michael: Right?
And...
So generic plan is bad for a specific
parameter...
Nikolay: If generic plan is chosen,
it's cached, and then it's
bad for some cases and we have
performance degradation for particular
parameters.
But also, this idea to use prepared statements in general, I don't think these days it matters that much as protection from SQL injection.
I think the main reason is planning time: it's usually fast, but not always fast, as we know.
And the fact that during planning time Postgres locks all the indexes is a super big surprise that many people, including myself, had not long ago.
And we had benchmarks showing this small overhead added with each index, for selects, not for updates.
And at some point the total number of relations, both tables and indexes, exceeds 16, which is also a hard-coded constant.
If a query is very frequent, 1,000 or 2,000 per second, and primary key lookups usually are, we get a lock manager spike, and partitioning increases the probability of it.
And that's why getting rid of planning
is good.
And when you get rid of planning,
it's worth understanding this
behavior of how caching works and
generic versus custom.
But just changing this globally,
I'm not sure this is a popular
decision.
But of course, my experience is
very limited.
Michael: Let us know if you have
changed it.
It'd be interesting to hear from
you.
Nikolay: Yeah, it's an interesting
topic anyway.
But in general, I think it's worth spending effort to get rid of planning for high-frequency queries, especially for partitioned tables.
Michael: I have one more related question.
So now pgBouncer supports prepared statements.
I saw they have a max_prepared_statements... well, it's off by default, which I didn't realize.
They have a max_prepared_statements parameter.
Do you have any opinions on tuning that?
I saw that they said a value of 100 might make sense, might be a sensible start.
Nikolay: Yeah, some high value
is good.
I think it's like, depends on the
situation, of course, right?
I mean, but in general, it's not
an issue to just increase it.
Yeah.
Cool.
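For reference, a minimal pgbouncer.ini sketch; 100 is only the example value mentioned here, not a recommendation:

```ini
[pgbouncer]
; 0 keeps prepared statement support in transaction pooling mode off;
; a non-zero value enables it and caps how many are tracked per server connection
max_prepared_statements = 100
```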
But again, I see the tendency to be afraid of turning on prepared statements globally in already existing, heavily loaded projects.
If you start from scratch, it's a very good decision to use prepared statements from the very beginning.
Then you just grow and see the problems and fix them as usual, like you would do anyway, and the confidence in using them is already there.
But switching a big project from 0 to 100% coverage is a very hard decision to make.
Michael: That's hard as in scary
or?
Nikolay: Just people are not sure
and that's the problem.
And there is no good path.
Maybe there is a good path to it.
But in a couple of big companies
I observed recently, people
decided not to proceed with this
after a lot of consideration.
And it's kind of an interesting
topic.
Instead, it feels safer to start using prepared statements only for specific queries.
Michael: That's a slow way in.
Maybe for new features.
Nikolay: Not new features.
Not at all.
Again, the problem is usually this lock manager overhead.
When we have 1,000-2,000 QPS, we see a very simple query, select something from the table where ID equals something, a primary key lookup.
And if we plan every time: the select itself is fast, planning itself is also fast, but if you have a lot of backends competing, the access share lock itself is not a problem at all, but the lock manager has overhead because fast path is false.
And this is what multiple companies experienced recently.
And Jeremy Schneider said the lock manager strikes back, or bites back, or something like that.
Last year was a lock manager year in terms of the performance issues people experienced.
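As an aside, a minimal sketch for spotting this while it is happening: relation locks that did not fit into the per-backend fast-path slots show up in pg_locks with fastpath = false and go through the shared lock manager.

```sql
SELECT pid, relation::regclass AS relation, mode, fastpath
FROM pg_locks
WHERE locktype = 'relation'
  AND NOT fastpath;
```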
Michael: So you're saying, yeah,
so I'll share the episode where
we discussed that in more detail.
And so you're saying they choose
to turn prepared statements
on only for those queries, because
it reduces the issue there,
but not globally.
Okay, makes sense.
Nikolay: Just to remove these hotspots.
Yeah.
Only for these primary key lookups.
Michael: I mean, if we're talking about these super fast queries, even though both execution and planning time are fast, it could be that planning time is more than execution time.
It's quite common.
Nikolay: It happens.
Michael: So even if it wasn't for the issues you're describing, you could maybe increase throughput.
There's a lot of overhead to be saved by not having to plan those queries.
Nikolay: Yeah, shaving off planning time is a good idea in many cases in general.
For example, partitioning, right?
If partitioning is there, planning time can be high if you have a lot of partitions; it can depend on the number of partitions.
This is the problem as well.
In this case, of course, getting rid of it is good.
Nice.
Cool.
Anything else you want
Michael: to add to this 1?
Nikolay: No.
It's an interesting thing to keep
in mind when dealing with plan
optimization.
So use EXPLAIN (GENERIC_PLAN) in Postgres 16 and in future Postgres versions.
Michael: Yeah, if you can't, as
a starting point, or if you can't
get
Nikolay: parameters.
Michael: Lazy mode.
Lazy mode.
Nikolay: If you don't want to go to the logs to find examples, or maybe log_min_duration_statement was not configured somehow.
Michael: Exactly.
If it wasn't configured for the
time where you're having issues,
or if you can't turn it on easily,
like if you've got a lot of
process at your company before
changing some of these parameters,
you know, so there might be reasons
why you can't get real values.
But of course, I think the reason I don't know as much as maybe I should on this is: if you can get the real values, performance optimization work becomes so much easier.
Nikolay: Of course.
Michael: There's less guesswork.
Nikolay: Yeah.
This generic plan should be something like a partial solution in terms of analysis.
It's a partial analysis.
It should always come with a remark, you know: maybe this is it, but maybe not.
Michael: Yeah.
Like a stepping stone on the way
to the solution, but like it's
not a solution in itself.
Nikolay: Right.
On the other hand, it might give
very quick insights what to
do and how to speed up.
So yeah,
Michael: good point.
Yeah, good point.
Yeah.
Nice one, Nikolay.
Thank you so much.
Nikolay: Thank you.
Have a good day.
Bye.
Michael: You too.