Plan flips
Michael: Hello and welcome to PostgresFM,
a weekly show about
all things PostgreSQL.
I am Michael, founder of pgMustard,
and I'm joined as always
by Nik, founder of Postgres.AI.
Hey Nik!
Nikolay: Hi Michael, how are you?
Michael: I am good, how are you?
Nikolay: Very good.
Fantastic.
Oh
Michael: great.
What are we talking about this
week?
Nikolay: Not about AI.
Michael: Yes, for a few.
We have a detox.
Nikolay: Everyone is talking about
AI, so we are not going to
talk about AI.
Michael: Great, so what are we
talking about?
Nikolay: Let's talk about plan flips.
It's quite a complex topic, because I feel Postgres lacks a lot of stuff here, and I'm not sure what it needs to have in the future to be better.
And there has been a lot of criticism, including a recent wave of criticism against Postgres in this area,
After some incidents from one company,
the company name is Clerk,
right?
Yeah.
Clerk.
Clerk.
How do you pronounce it?
Michael: I say Clerk.
Nikolay: Clerk.
Okay.
That's interesting.
Michael: So it's E,
Nikolay: but you say I.
I don't know.
Yeah.
Okay.
Yeah.
And there was an outage or series
of outages.
I don't know.
Michael: Yeah.
Nikolay: Two weeks ago.
And, as usual, kudos from me to all companies who share their experience.
This is great.
They shared experience.
They had a blog post with timeline
and root cause analysis to
some extent and then I saw on Twitter,
I saw discussions which
looked like Postgres is not good
enough, let's choose better systems
because look what you might have
with Postgres.
Michael: Yeah, look what can happen
and...
Nikolay: Yeah, might happen, can
happen.
Michael: Exactly.
Something quite innocuous.
So they've said the plan flip was maybe not the root cause, but the thing that took them down, due to insufficient statistics.
But yeah, they're an infrastructure provider, right?
They do authentication, and if auth is down, that means all of their customers are down.
Yeah, exactly.
So this is a huge issue for them and for all of their customers.
They are very wary of things that
can change unexpectedly, understandably.
But equally, lots of services that
need extreme high availability
run on Postgres, so how can they
do it?
I think that would be a good discussion.
Nikolay: So in my opinion, their
root cause analysis is halfway
done.
And naming a plan flip as the root cause is not all right.
Why?
The 5 Whys, you know this approach.
Michael: Yeah, I think to a...
Nikolay: Yeah, basic approach for
root cause analysis is ask
why until you cannot answer at
all.
So their analysis was saying: the problem happened because Postgres triggered ANALYZE, obviously, an automatic ANALYZE by the autovacuum process, which caused the plan flip.
So they did only 2 steps in this 5 Whys approach.
They said, okay, what happened?
Why are we down?
Because statistics changed.
Why is this?
Because of a plan flip.
Why?
Because it was triggered by autovacuum.
So 2 or 3 steps.
But you can go deeper, right?
Yeah.
You can go deeper.
I think there was 1 more.
Michael: I think they did increase the statistics target, for example, to make it less likely to happen again, but there still are more steps, for sure.
Nikolay: yeah they increased
default_statistics_target?
Michael: Maybe not default but
for the table I think.
Or for the column, I forget which.
Nikolay: Yeah, it's either Column
or global.
Anyway, that's interesting because
in my opinion there are different
plan flips, different kinds of
plan flips.
And I'm still trying to figure
out for myself, but if you have
a plan flip where it suddenly flipped after recalculation of statistics, that's interesting, because that might happen.
This is actually a difficult type, a hard type of plan flip.
But the question is why it happened only now, not 1 hour ago, for example.
Michael: And what's to stop it happening again?
Nikolay: Yeah, yeah.
And this is interesting, right?
Michael: So I have some thoughts on this specific case, but I think it might actually be useful for us to go back to basics a little bit and define what we mean by a plan flip, and what the normal causes of them are.
I think this is a little bit of a special case that should be interesting and might be completely avoidable, but a lot of cases aren't avoidable, and I think that's an interesting discussion.
But yeah, in terms of plan flips, I wanted to ask you: do you define them only when they're bad?
Because we have a cost-based optimizer, right?
The reason we can flip query plans is because the planner has
choices and it makes those choices based on estimated costs of
doing different join algorithms or different scan types.
But flipping is part of its design, right?
The same query running on different data as the data changes,
perhaps should use a different execution plan.
It would be faster to use a different execution plan.
So flipping is designed, it's deliberate, but the idea is it
flips to a better plan when the data suits that.
Nikolay: So sudden plan change is normal.
Michael: Yes.
So, yes.
When people say plan flip, it feels like they're only ever
Nikolay: talking about that.
Michael: Yes, exactly.
Nikolay: Yeah, because this is when it changes to a wrong state.
Because the planner sometimes makes mistakes.
Michael: Yes, or misestimations.
Nikolay: For example, if you have random_page_cost = 4, which is the default, I saw many situations where we have 2 candidate plans which have very similar costs under these circumstances.
So for this data, what we have in pg_statistic, and these settings, including random_page_cost = 4, I saw cases where there are 2 plans.
1 is obviously wrong: for example, a sequential scan on a huge table.
And the other is obviously right, which is an index scan.
I'm remembering some simple case, simple in the sense that it's obvious to a human, or even to AI, that this plan shouldn't be chosen for this table, because it's a sequential scan of the whole table.
But since Postgres thinks that random I/O is so expensive, it suddenly chooses the sequential scan plan instead of the index, and you don't notice until some point, because these costs are very close to each other.
And then suddenly, when a little bit more data comes to the table and ANALYZE happens, the sequential one, the bad one, wins, and that's it.
And this we usually solve by adjusting random_page_cost, telling the planner: please don't choose sequential plans.
I know Tomas Vondra has a great proof that maybe random_page_cost = 4 is not bad; maybe it should even be bigger.
But I have a brain split here, because I see his reasoning makes sense theoretically, but what I saw in production proves the opposite.
I saw these plan flips happening suddenly, and it was like some ticking bomb: it was sitting there, sitting, and then boom, suddenly the planner chooses the sequential scan, and we are down because of the timeouts, right?
Michael: Yeah, I have a theory about this; I've been thinking a little bit about it.
I think Tomas's work is very good in theory, because he cares about the optimal point to flip; he cares about how much data we need to be reading before a sequential scan is faster.
But in practice, especially with
large tables, the flipping point
is not that important.
The main point is we almost never
want to request that much data.
Often, especially in OLTP, we want very small SELECTs.
We're only selecting 1 record at a time, or only a very few records.
So we want to put the flipping point so far out of sight that even if our row estimate is really bad... because often it's a combination of 2 things, right?
It's not just the cost of doing random access; it's also how many rows we expect to be reading.
And if that's off by a lot, and the cost is high, those multiply together, and suddenly Postgres is thinking: you know what, I might as well scan the whole table here, because I'm expecting to read so much.
But if we decrease random_page_cost, we get away with the bad row estimate.
So I think you can fix either and
get towards it.
But if you do both, then you massively
decrease your chance
of this kind of issue.
Nikolay: I hear you, and in my mind this connects very well to the topic of OLTP versus analytical versus hybrid, because there are good advancements in the area of analytical query processing for Postgres, in various ways.
And I strongly believe that there is a very simple case called pure OLTP.
Pure OLTP means that we need to serve users who are mostly humans, and they have human perception.
Our very first episode was about this.
200 milliseconds: this is when they recognize something is already happening.
And if it's significantly more than 200 milliseconds, it feels slow, which means we need to have our queries below 20 milliseconds, because 1 web page consists of multiple queries, usually roughly 10, say 10, as a rough guess, right?
It means we need to be below 20 milliseconds, and in this case, by no means do I want any sequential scan to be applied.
Never. Like, I don't want it; just forget about it.
I want to turn it off completely.
Michael: I would counter that sequential scans on tiny tables are fine, right?
If it's only got 100 rows in it,
but also an index scan on that
table would be fine as well because
it's still only got 100 rows.
So the times where sequential scan
is best don't matter anymore
because an index scan, whilst being
less efficient, or even a
bitmap scan would be absolutely
fine, even if it's twice as slow,
or even 5 times as slow because
it's such a tiny table.
Nikolay: But we also have different others; there is also the bitmap scan, the bitmap index scan plus bitmap heap scan, so it's like in the middle in how the planner thinks about the cost.
It's interesting.
So I think about how we'd tune some service for OLTP, and this is obviously OLTP.
People didn't want this service to...
Michael: of course,
Nikolay: Yeah.
We definitely want very strict timeouts.
I think 15 seconds is too long.
We should go down to 10 seconds.
So statement_timeout should be 10 seconds.
Even transaction_timeout should be 10 seconds, to be radical.
Let's introduce the concept of radical OLTP.
Yes, you have some workloads that need to be analytical, where some people can wait, or some background jobs.
Yes, a CREATE INDEX taking a couple of hours is also a different story.
But if it's a pure-OLTP mobile app, which many people depend on, set transaction_timeout and statement_timeout to very low values, be very strict, and just cut those queries which start running longer.
In this case, if you, for example, have a bunch of them happening, there's a chance that on a powerful machine you will survive.
If you don't have it, it will be a snowball effect of many sessions being stuck and running a sequential scan or something non-optimal, and it will be consuming resources.
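As a sketch of those safeguards, assuming the 10-second values suggested above; note that transaction_timeout only exists in Postgres 17 and later:

```sql
-- Cut off any single query that runs too long:
ALTER SYSTEM SET statement_timeout = '10s';
-- Don't let sessions sit idle inside an open transaction:
ALTER SYSTEM SET idle_in_transaction_session_timeout = '10s';
-- Cap whole transactions (available since Postgres 17):
ALTER SYSTEM SET transaction_timeout = '10s';
SELECT pg_reload_conf();
```

Long-running maintenance work (index builds, migrations) then needs a session-level override, e.g. `SET statement_timeout = 0;` in that session only.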
Michael: The other thing it might
do is it might more quickly
point you in the right direction
as to what's causing the issues
that you start...
Nikolay: Yeah, how to diagnose.
Let's maybe, I think I'm pushing
us to some like problems and
resolutions already but let's go
back.
You raised a very good point about what a plan flip is.
In my head, it's a negative term: a plan flip is a sudden plan change in the wrong direction, when basically the planner chooses a plan we don't want, like a sequential scan, or maybe a bitmap scan which is timing out, for example.
Michael: Yeah, I think it might even be worse than that; I think it has to be noticeably worse.
If it flips and it's 1.1 times slower, I think most people don't really care.
But if it's double or triple, or it takes you from being at 90% CPU to 100% CPU, it crosses that threshold.
Maybe, again, it's the episode we did with Tomas on performance cliffs: it's a cliff for you then.
Nikolay: Yeah, in my opinion, the performance cliff term is different, and it's macro, while plan flip is a micro term.
It's just 1 session, and we see that suddenly the execution of this query became much worse.
It was 10 milliseconds; suddenly it became like 1 minute.
Boom.
And the obvious example is: we had an index scan, but suddenly the planner...
And usually it happens in complex queries, right?
So joins, subqueries, CTEs, multi-stage; something complex.
And then boom.
While performance cliffs, for me, are at the macro level: like, some lightweight lock contention wasn't happening, and then suddenly it happens.
There is no official terminology
here.
So I'm just reading some blog posts
and trying to adjust my own
mind to them mostly, right?
This is how I feel.
So plan flip is negative term,
right?
So we don't want it to happen.
And we started with difficult example.
This company obviously experienced
this difficult example.
And my question to them is: what random_page_cost did you have?
Did you have statement_timeout, idle-in-transaction session timeout (maybe not related in this case), and transaction_timeout?
So, safeguards, safeguards which would protect from a huge wave of sessions executing this bad plan.
And also, what were your autovacuum settings?
Michael: I think that's relevant.
Nikolay: Before we go to an increased
number of buckets with
statistics target, these are macro
settings we need to keep in
mind.
Michael: I had another thought.
This, for me, felt like a data skew issue.
And so they mentioned the column in question was 99.9996% NULL
values.
And this query was looking for the not-null values, of which there are still 17,000; so we're talking about billions of records, but only 17,000 were not NULL.
In cases like that, where you've got such data skew, you could be looking at increased statistics for those columns, or often you want a partial index on a column like that, right?
Because you don't care about indexing the billions of NULL values; you only care about indexing the not-null values.
And once you've got the perfect index on that, and I know it could be a more complex query where you want different columns, but if you've got those columns in the index, with a WHERE clause saying the value is not null, then actually the planner is almost certainly going to choose it, because it's such a perfect index for that query, and you don't have to worry about misestimations.
It's never going to pick a seq scan in that case.
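A sketch of that partial-index idea; the table and column names here are hypothetical stand-ins for the case described:

```sql
-- Billions of rows, almost all NULL in external_id; index only the rare
-- non-NULL rows, giving a tiny index that is cheap to maintain:
CREATE INDEX CONCURRENTLY idx_events_external_id
    ON events (external_id)
    WHERE external_id IS NOT NULL;

-- A query whose WHERE clause implies the index predicate can use it:
SELECT *
FROM events
WHERE external_id IS NOT NULL
  AND external_id = 'abc';
```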
Nikolay: So we definitely lack information here.
Yeah.
We didn't see the plans.
We don't know their settings.
We even don't know the version.
Right.
So we can only guess.
And again, there are difficult situations with plan flips.
I saw quite difficult ones and where obviously like everything
was normal.
But what I'm trying to say is: you can get a plan flip because you didn't do a good job configuring Postgres well enough and putting in these safeguards for OLTP workloads.
And this is what we do, number 1, with all the clients we're doing consulting for, and also the clients who install our monitoring and checkup service.
And this is the number 1 thing: we recognize we are OLTP.
The priority is to have super fast queries, and we need to prepare the basics, right?
Unfortunately, they cannot protect us in 100% of cases.
But say you have your autovacuum not tuned.
For example, there is this relatively new setting; I remember Darafei raised this topic.
We didn't have autovacuum_vacuum_insert_scale_factor, the one which triggers on inserts; by default, it's 20%.
And also we have the regular autovacuum_vacuum_scale_factor, which triggers on updates and deletes, also 20%.
For analyze, it's 10%, 0.10, which
means 10% of your table needs
to be changed.
If it's a huge table, it's a lot
of data.
Again, this data skew is a good
point.
You had 1 distribution, but suddenly
all the new records, they
are very different.
And if you don't have properly tuned autovacuum, you accumulate that data a lot.
And then suddenly, after 10 or 20% of data changed or was inserted, autovacuum finally ran ANALYZE.
Suddenly we have a flip.
If it happened earlier, maybe it
also would be bad.
It's actually a good point.
Maybe we still need the statistics target change, but in my opinion, it's better to have the process running more frequently on large tables.
So statistics will represent the actual data better, more often, right?
Allocate more resources, have more workers.
This is what we do all the time with larger services.
And unfortunately, I would say 2 out of 3 clients coming to us, and most of our clients are startups with, say, a database exceeding 100 gigabytes, up to 10 or 20 terabytes, this is the sweet spot for us, where they've experienced some problems already, roughly 2 out of 3 have poorly tuned autovacuum.
Poorly tuned.
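Per-table autovacuum tuning along those lines might look like this; the table name and thresholds are illustrative, not a recommendation for every workload:

```sql
-- Make autoanalyze and autovacuum fire on ~1-2% of changed rows instead of
-- the 10%/20% defaults, so statistics track reality on a large table:
ALTER TABLE big_table SET (
    autovacuum_analyze_scale_factor       = 0.01,  -- default 0.1
    autovacuum_vacuum_scale_factor        = 0.02,  -- default 0.2
    autovacuum_vacuum_insert_scale_factor = 0.02   -- default 0.2, Postgres 13+
);
```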
Michael: Only 2 out of 3 is quite
surprising to me.
Nikolay: Yeah, this is my rough estimate.
I can check the statistics.
Michael: You need to update your statistics.
Nikolay: My internal ones, yeah; I need to ask a guy to maintain statistics.
Michael: Anyway, I had a question, though.
default_statistics_target is 100, and I quite often see people increasing it to 1,000 when an issue like this happens, or at least for specific columns.
Other than increased analyze time, which is important but not that important... and it doesn't increase that much, and also it's not that important for OLTP systems, right?
Like, it's a background thing.
I'm wondering...
Nikolay: I'm not really sure during
upgrades though.
Michael: Only some types of upgrades,
right?
Because if you've got such an important
infrastructure OLTP system,
you're probably not going to be
doing pg_upgrade.
Like, you're probably not going
to be doing those types of upgrades
that care about analyze.
You're probably going to be doing
logical failover.
Nikolay: You still need to analyze.
You create this logical replica; you need to analyze this replica.
Michael: Not while you're, you
don't have to be down during that.
That could be done while the other
system's serving queries.
Nikolay: You are right.
You are right.
Michael: And the statistics will have been getting updated while it's being populated, because vacuum will be running and analyze will be running.
So it's not an issue in that case.
So I was wondering for really important
OLTP systems, why not
just boost the default_statistics_target
everywhere and just
pay that extra cost?
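Either change mentioned here can be expressed directly; the table and column names are hypothetical, and the value 1000 is the commonly discussed bump, not a universal prescription:

```sql
-- Globally: a larger sample and more histogram buckets for every column:
ALTER SYSTEM SET default_statistics_target = 1000;  -- default is 100
SELECT pg_reload_conf();

-- Or only for one skewed column:
ALTER TABLE events ALTER COLUMN external_id SET STATISTICS 1000;

-- Either way, the new target only takes effect at the next ANALYZE:
ANALYZE events;
```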
Nikolay: Did we raise this question
with Tomas Vondra or no?
Michael: I don't think so.
Nikolay: Maybe we raised it when
he was on PostgresTV YouTube
channel.
I remember the opinion that 100
buckets is a lot.
Right?
This topic connects to partitioning
for me.
Because when I see a table with 10 billion records, like a table multi-terabyte in size, it's: oh, this is already quite late.
We should have it partitioned.
In this case, we would have many tables, and at least we would have a localized problem.
Only for this partition would we have a problem.
Michael: And we would have, let's
say it had 30 partitions, we
would have 30 times more data in
the samples
Nikolay: because it...
Different statistics, yeah.
Yeah.
Yeah.
So I don't know.
I see people raise it.
And I'm like, okay, does it affect ANALYZE?
Yeah, it does.
How much can we afford?
Okay, set it to 1,000 globally.
Okay, how to prove whether it helps or not?
We need holistic analysis of plans.
This is actually what we tried to do many years ago, before the current state of the art with branching and so on.
Michael: Actually, I disagree; I've just realized I disagree.
I don't think this is about on
average improving things.
I don't think this is a case of
on average I want a slightly
better execution plan.
I think this is a case of I never
want to completely flip to
an awful plan.
So it's about risk mitigation; it's like mitigating the outliers, not about improving the average.
I would actually accept slightly worse average latencies.
Nikolay: I'm not talking about averages.
I'm talking about complete analysis.
When you have, for example, thousands of plans based on production, you have a clone of production, and using branching you can create replicas and compare plans before and after the change.
This is the what-if framework which we have with Database Lab.
And you can compare plans.
There's a question of how to properly compare plans at scale.
We talked about this before.
Our approach is to just compare structure, cost, buffers, and also timing.
But timing is tricky, because it's volatile and so on.
If cost changes seriously, that's already a signal to investigate what happened.
If structure changes significantly, that's also a signal.
For structure, by the way, maybe you don't need the whole structure; you probably just need the access methods: you had 3 index scans, suddenly you have 2 index scans and 1 sequential scan, right?
So it's bad, right?
So what happened?
Anyway, this holistic approach
is great.
So you can have a huge amount of
plans, and then you have the
ability to do the change and compare
before and after.
Again, I haven't visited the topic
of default_statistics_target
particularly in this context for
quite a while, but it would
be great to do it.
And again, I have a bunch of unfinished
thoughts.
This topic, plan flips, is unfinished
in my head.
I'm still thinking what's better.
So once again, you don't need an average or something.
It's a classic optimization problem.
You want to optimize something, and you don't want everything else to degrade significantly.
You say: I'm okay if something degrades a maximum of, say, 10% on some metric.
There's a question of which metric to choose, also, but I want this one thing to be improved.
This is 1 thing.
So we have a framework; we don't have all the answers yet.
The second thing is how to understand that we are on the edge of a flip.
Or a cliff, if you like; doesn't matter.
How to understand?
Michael: Yeah, I actually don't know if this is as important as how we stick with what we've got, I think.
Nikolay: I want to predict the flip.
Michael: Yeah, but I think it's
really difficult to predict bad
plan flips without also predicting
good ones.
There would be so many false positives.
If you think about how you would
do it.
Nikolay: I forgot to mention that in our case, we execute on a real database, full size, a full clone, using database branching, provisioned very quickly.
And we see not only the planner but also the executor in action, like realistic testing, right?
Full-fledged realistic testing.
We execute hundreds or thousands of queries, with samples properly collected.
So it's not that we have 100 samples and 90 of them are the same query ID.
We take a query ID, we collect multiple samples for each query ID, and then we execute them before and after the change.
We developed this framework specifically to predict plan flips for major upgrades.
Michael: Yeah, which is, it's slightly
different, right?
I think
Nikolay: it's different.
Michael: I think it's different.
I think why it's different is that often people talk about plan flips in the context of: I didn't change anything and this thing just happened.
Whereas with a major version upgrade, you know in advance you're doing it.
Nikolay: You can look at it.
Michael: Yeah, you have this reference point if we change.
Nikolay: You have a new engine, basically.
Michael: Yeah, yeah.
So I think psychologically, at least, it's different.
at least it's different.
I understand that in practice there's
a lot of the same harnesses.
Nikolay: So that's why I'm saying
there are easy plan flips and
there are difficult plan flips.
Easy is when you did some action; you upgraded Postgres.
You jumped from 13 to 18, 5 versions, for example.
There are new nodes in plans, right?
They never existed before.
So definitely
Michael: new settings, new configuration
parameter defaults.
Nikolay: In my opinion, plan flip
analysis should be a part of
major Postgres upgrade.
And we have it for our clients.
Yeah, we have it for our clients.
It still requires some work; it's semi-automated right now.
I hope at some point we will fully automate it.
And it shines when you have database branching properly implemented.
But there are difficult plan flips.
So to explain how to properly categorize them, let's stop for a little bit and think: what does the planner take into account when choosing plans?
It's only 3 things.
It's the version, which defines the version of the planner, basically: which engine we are using.
And second, it's settings, not all settings.
My current vision is: the planner settings, plus work_mem.
Why am I saying plus work_mem?
Because for the planner settings there is a category.
If you check pg_settings, 1 of the columns will be category, right?
And there is a category for planner settings, but work_mem is outside it.
So, plus work_mem.
And finally, the number 3 is pg_statistic
content.
And that's it.
It doesn't care about hardware
or actual data.
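Those planner-relevant settings can be listed from pg_settings; the category names below are from stock Postgres, where the planner settings live under "Query Tuning" and work_mem under "Resource Usage / Memory":

```sql
-- All planner settings, plus work_mem, which sits in a different category:
SELECT name, setting, category
FROM pg_settings
WHERE category LIKE 'Query Tuning%'
   OR name = 'work_mem'
ORDER BY category, name;
```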
Michael: Yeah, true.
It's not actually the data volume,
it's how much data it thinks
it has, which is the statistics.
Nikolay: Yeah, and if the statistics are lagging, it might live in a different reality.
That's why I say I would prefer... it might still be wrong, but I would prefer it to be wrong sooner and to be closer to reality, right?
Yeah.
So when we change engine, it's
obvious we know when it happens.
And before that, we can perform
this plan flip analysis holistically
according to our methodology.
This is great.
We can predict this plan flip and
prepare a resolution, solution
for it in advance.
This is what we do with clients.
Second, if we change settings,
basically it's the same.
We also know when we do it.
For example: okay, we're going to raise default_statistics_target globally.
This will help this plan.
Question: how will other plans behave?
We don't want them to degrade significantly, at least not now.
The same approach can be applied.
The third thing is the pg_statistic content changing.
Again, I would prefer it to be changing more often, like continuously changing, so autoanalyze happens more often, to reflect the current reality.
For this we need autovacuum tuning.
And then, how to predict?
Michael: Also, you missed something, which is that we could gather more statistics; it's a sample, and we could increase that.
Nikolay: Yeah, this is mitigation
activity already.
I agree that moving from 100 to
1,000 sometimes might make sense.
And honestly, like, I don't have
strong opinion here.
I have clients with 100, with 1000.
In the past, I had clients who
thought that going down to 10
is a good idea.
I don't have them anymore somehow.
Somehow, right now, all the thoughts are in the direction of raising this, not reducing it.
Yeah, but I remember we had such discussions.
Maybe we should raise it.
Counterintuitive ideas; sometimes we know them, and we don't have them anymore.
Anyway, the first 2 things we usually control.
We don't have self-driving Postgres yet, so we control them.
And we can perform plan flip analysis, if we have such a tool.
But...
Michael: Wait, there's 1 more thing.
It's not available everywhere,
but another thing that does affect
this is, for example, if you're
using hints with pg_hint_plan.
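For reference, with the pg_hint_plan extension installed and loaded, a hint is just a special comment in front of the query; the table and index names here are hypothetical:

```sql
LOAD 'pg_hint_plan';

/*+ IndexScan(events idx_events_external_id) */
SELECT * FROM events WHERE external_id = 'abc';
```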
Nikolay: Okay, hints.
Michael: I think that's a big topic,
though, in production.
Nikolay: That's a great topic.
Let's consider it an exception, because it's not a major approach right now.
It's like very rare.
Is that true?
I hope so.
Michael: I think in other databases, this is actually often 1 of the main approaches.
Nikolay: Yeah, and you just try to open another gate or something.
And I agree, let's open it, but let's just finish with this while I think about the second plan.
Again, if you change the major version, you control it; you can predict the plan flip.
If you change settings, same thing.
You control it, you can predict.
But if changing pg_statistic content due to data changes leads to some plan flip, this is hard to predict.
How to predict it?
When I think about it, I want to be able to check the second plan.
Or maybe the first 5 plans which were almost winning, because in this case I can say: okay, I see the winning plan, the current plan, has cost, say, 500, but there is also a second plan which has cost 490, and it's quite close.
If it's close, say below 20%, if the difference is below 20%, I will start thinking: might it flip soon?
This is why I want the second plan.
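Postgres doesn't expose the runner-up plan, but one rough way to see how close a rival plan's cost is, sketched here with hypothetical names, is to disable the winning access method in a session and re-run EXPLAIN:

```sql
BEGIN;
-- Winning plan and its estimated total cost:
EXPLAIN SELECT * FROM events WHERE external_id = 'abc';

-- Penalize the chosen access method to surface the alternative
-- (use enable_indexscan = off if the winner was an index scan):
SET LOCAL enable_seqscan = off;
EXPLAIN SELECT * FROM events WHERE external_id = 'abc';
ROLLBACK;
```

One caveat: the enable_* flags don't truly remove a plan; they add a large cost penalty (shown as "Disabled" nodes in Postgres 17+), so when no alternative exists, the reported cost includes that penalty and isn't directly comparable.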
Michael: I think this is incredibly
tricky though.
Maybe it's simple, but for example,
parameters matter a lot.
Like per query, different parameters.
If you get, for example...
Nikolay: Oh yes, parameters matter.
Yeah, I agree.
Michael: Yeah, a huge amount.
If you look at the solutions that SQL Server and Oracle have (bear with me, because I think SQL Server and Oracle are a bit further ahead of Postgres on this, and maybe DB2 sometimes has some features in this kind of area), the solutions they've ended up implementing are around forcing the existing plan.
Yeah, so: plan pinning.
Nikolay: Aurora has it.
Postgres Aurora has it.
Michael: Interesting.
So yeah, these are the solutions
that other databases have ended
up implementing rather than detection
of, rather than trying
to predict a plan flip, what they
try and do is pin it or freeze
it and then maybe require approval
when it does flip.
So instead of trying to predict
it, they prevent it and then
have you have it as like an opt-in
thing.
Maybe this change for the better,
But maybe you want someone
to approve it first.
Nikolay: So there are several things here.
You touched on several things.
First, depending on parameters: I agree with you.
But in a system which is quite big, the Pareto principle is usually at work, basically.
Like, a lot of the workload is defined by a limited number of specific plans.
And this is how we... in our plan flip analysis, we have a methodology to identify not only normalized queries from pg_stat_statements, but also particular examples, particular parameters.
So we would reproduce exact queries with exact execution, so: plans which contribute to the workload the most.
For example, contribute in terms of calls, or in terms of total execution and planning time.
So we want to cover like 80% of the workload in several directions, in several metrics.
And this is the magic: how we collect those samples and prepare our query set for testing, so it would represent the majority of the workload.
Once we found a way to do it, there is some tail.
Those queries are less important, because if you just think about database time in terms of execution and planning, if some query makes up only a minority of that time, even if its plan flips, it wouldn't affect the whole cluster so much, right?
Yeah, that makes sense, right?
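A sketch of picking that working set from pg_stat_statements, ranking normalized queries by their share of total execution time; the extension must be installed, and the 20-row cutoff is arbitrary:

```sql
SELECT queryid,
       calls,
       round(total_exec_time::numeric, 1) AS total_exec_ms,
       round((100 * total_exec_time
              / sum(total_exec_time) OVER ())::numeric, 2) AS pct_of_db_time
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 20;
```

Repeating the ranking by calls, or by mean time, gives the other "directions" mentioned here, so the combined set covers most of the database time in several metrics.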
Michael: It does and actually reminds
me of a different idea
Nikolay: Maybe I'm wrong; you know, it might be like: okay, right now it's only like 1% of the whole time, and then due to a plan flip, boom, suddenly it's half of the time.
Michael: It reminds me though of
something that I think could
help here, it's not in Postgres,
actually maybe there's a couple
of examples in Postgres core at
the moment, but the idea that
the plan is chosen and then is stuck to, no matter what, during execution.
It could be, and I think Tomas actually brought this up at the end of our call on performance cliffs, it could be adaptive at runtime.
If it realizes it's chosen a bad plan, instead of sticking to it till the end...
Nikolay: We should invite Andrey Lepikhov.
Andrey Lepikhov, who I think, yeah,
who can talk about this adaptive
query optimization a lot.
I consider him an expert in this
area.
Michael: Great.
But do you see what I mean?
It's able to adapt during execution.
Maybe we avoid the issue, or at
least it's way less bad when
it happens.
Nikolay: Yeah.
And also, to finish my thought: maybe, as I said, this approach is wrong.
Maybe some plan which contributed only like 1% to the whole database time is now suddenly half of it.
Because of this, I think my gut told me that we need multiple directions.
In 1 of the directions, when we collect this working set, we also order by calls.
So it should be quite...
So for a query which had the plan
flip to influence the whole
cluster so much that it causes
downtime, it should be quite frequent
enough, I think.
If it's super infrequent, It should
be fine.
If it's not super frequent, it
means that it won't occupy multiple
backends in parallel.
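That frequency argument can be put in numbers with Little's law: the backends a query keeps busy are roughly its arrival rate times its mean latency. All figures below are invented for illustration:

```python
# Concurrent backends a query occupies ≈ calls/sec × mean latency (s),
# by Little's law. The numbers are hypothetical.

def busy_backends(calls_per_sec: float, latency_ms: float) -> float:
    return calls_per_sec * latency_ms / 1000.0

# A rare query whose plan flips from 10 ms to 1 s stays harmless:
rare_before = busy_backends(0.5, 10)      # 0.005 backends on average
rare_after = busy_backends(0.5, 1000)     # 0.5 backends on average

# The same 100x flip on a hot query can exhaust the connection pool:
hot_after = busy_backends(2000, 1000)     # 2000 backends
```

So even a 100x latency regression on a rare query occupies well under one backend on average, while the same flip on a hot query swamps any realistic pool, unless timeouts cut it short.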
Michael: Unless it gets really
bad, yeah.
Nikolay: Unless you forgot to tune
statement_timeout or
transaction_timeout.
True.
Yeah, we have multiple vectors, and I think this is a good approach,
a good enough approach; maybe it can be improved.
So I'm very curious about this area.
And I'm coming from practice.
Michael: Yeah, I do think there are still complex cases where even
the same query in pg_stat_statements could have 4 different plans
that are actually all good.
Let's say you're a multi-tenant system: maybe tiny customers have
this plan, typical huge customers have that plan, huge customers
with some weird distribution have another, or maybe healthcare
customers look different for some other reason.
Yeah.
Nikolay: Yeah.
Yeah.
And in the analysis, again, as I said, we collect multiple plans
per query ID.
Why multiple?
Because of this.
We take, say, the top 5 for each query ID, so we collect multiple
of them.
But all of them we take from the top N by each metric.
So we try to capture how influential this query is, and multiple
plans might appear there.
It would be great to have a plan ID and shift the aggregation
logic there, but for now only query IDs remain.
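One hypothetical way to approximate a plan ID while only query IDs exist natively: hash a normalized plan text, so the same plan shape maps to the same ID regardless of cost and row estimates. The `plan_id` function is an illustrative sketch, not a real Postgres or pganalyze API:

```python
import hashlib

def plan_id(plan_text: str) -> str:
    # crude normalization: drop volatile cost/row tokens before hashing,
    # so only the plan's shape contributes to the identifier
    normalized = " ".join(
        tok for tok in plan_text.split()
        if not tok.startswith(("cost=", "rows="))
    )
    return hashlib.sha1(normalized.encode()).hexdigest()[:12]

a = plan_id("Index Scan on users cost=0.43..8.45 rows=1")
b = plan_id("Index Scan on users cost=0.43..9.99 rows=2")
c = plan_id("Seq Scan on users cost=0.00..15.00 rows=100")
# a == b: same plan shape, different estimates; a != c: different shape
```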
Michael: I think pganalyze, Lukas Fittl, did some work on a
plan ID.
Nikolay: Yeah, yeah, I saw it, and I commented on LinkedIn.
It's great work.
Absolutely needed, I think.
So then, if we have this, there are 2 questions.
First, how to understand that some plan candidate is approaching
and the plan can suddenly flip soon.
Because in large tables, data usually is not changing too much
suddenly.
I agree with you.
Michael: It's like gradual.
Nikolay: Yeah.
Yeah, it's quite gradual.
So the planner behavior involving large tables is dangerous
when we have 2 plans.
This is my hypothesis, right?
So again, for me, a lot of open questions still.
It might be on the edge.
And I want to feel this edge for each query ID, or for each
plan, slightly in advance, to be able to predict.
I think someone recently did some work in this area to show plan
candidates.
I might be mistaken, but I saw something somewhere; I didn't
dive deep.
And the second thing is, in my opinion, hints and plan freezing
are 2 great features.
I put them into the enterprise bucket; for me it's an enterprise
thing, because for larger systems it's interesting to do both
hints and plan freezing, and probably we should have both.
But I also understand the reasoning that it makes the system
complex and requires a lot of management.
There is a price you pay.
If you put something, you start to be responsible for it.
You need to maintain it.
And if you pin some plan, same thing: you're responsible for
it now, right?
Basically, do we have the capacity to manage it?
Because maybe later it helps less, and maybe then it's not
helping but doing the opposite.
That's the problem.
Michael: Absolutely.
But I do think it's interesting that these other databases, given
the choice between the 2, went in that direction.
I'm not saying it's a good idea, but they've clearly thought
about it and there's some merit to it.
And it's been working for many years, of course, with issues,
of course, with unintended side effects, but it is a solution
I think worth considering.
So yeah, this is good.
I thought it was actually really interesting.
Nikolay: A lot of open questions.
Of course I wish... I saw you sent me this post about Postgres 19
improvements in terms of plan management, and plan hints were
mentioned there.
I wish the Aurora team open sourced this.
It's not called Aurora specifically; it's called the apg_plan_mgmt
extension.
APG plan management.
Yeah.
So it's a great thing, and I wish it existed in the open source
ecosystem.
I think eventually it should exist in the open source ecosystem.
Because when would I use it?
When suddenly we have a plan flip and we know which plan we would
prefer: let's just pin it right now, before we apply a permanent
solution, to mitigate as fast as possible without any deployments.
Michael: But let's say we're Clerk, and we know we've got 15
to 20 really important queries; they're on the hot path for every
single auth request, they're really frequent, they run thousands
of times a second.
We could deliberately pin all
of those or put hints into all of those
Nikolay: to make
Michael: them really unlike...
Pardon me?
Nikolay: So this is different.
So what I'm saying is temporarily
pin before we apply permanent
solution.
What you're saying is a permanent
pin, which means there might be
some cost you will pay to maintain it.
Michael: Maybe, yes, but my point
is at least we can then guarantee
it won't go bad suddenly.
Maybe it will gradually
degrade, and we can look at fixing
it if it's going to degrade.
I think it's more likely to degrade
gradually instead of all of a
sudden, and then we can fix it
Monday to Friday, 9 till 5,
instead of during a huge database
outage.
Nikolay: And again, just partition
your tables.
I know it costs some effort, but
with AI it's easier.
Michael: And gather statistics
on the parent partition.
Nikolay: Yeah, which is another
topic.
It's fixed.
Laurenz Albe from CYBERTEC fixed
it.
I think it will be in 19.
Michael: Nice.
So automatic gathering of statistics.
Nikolay: So it's a headache during
upgrades especially.
I think it's solved now.
I mean, now means
Michael: in the future.
Nikolay: In the future.
Yeah.
Great talk.
Nice one.
Yeah.
Enjoyed it.
Thank you.
Michael: Likewise.
Thanks so much and catch
you next time.