Michael: Hello and welcome to PostgresFM,
a weekly show about

all things PostgreSQL.

I am Michael, founder of pgMustard,
and I'm joined as always

by Nik, founder of Postgres.AI.

Hey Nik!

Nikolay: Hi Michael, how are you?

Michael: I am good, how are you?

Nikolay: Very good.

Fantastic.

Oh

Michael: great.

What are we talking about this
week?

Nikolay: Not about AI.

Michael: Yes, for a few.

We have a detox.

Nikolay: Everyone is talking about
AI, so we are not going to

talk about AI.

Michael: Great, so what are we
talking about?

Nikolay: Let's talk about plan flips. It's quite a complex topic, because I feel Postgres lacks a lot of stuff here, and I'm not sure what it needs to have in the future to be better.

And there is a lot of criticism of Postgres in this area, including a recent wave of it.

After some incidents from one company,
the company name is Clerk,

right?

Yeah.

Clerk.

Clerk.

How do you pronounce it?

Michael: I say Clerk.

Nikolay: Clerk.

Okay.

That's interesting.

Michael: So it's E,

Nikolay: but you say I.

I don't know.

Yeah.

Okay.

Yeah.

And there was an outage or series
of outages.

I don't know.

Michael: Yeah.

Nikolay: Two weeks ago.

And as usual, kudos to all companies who share their experience.

This is great.

They shared their experience.

They had a blog post with timeline
and root cause analysis to

some extent and then I saw on Twitter,
I saw discussions which

looked like Postgres is not good
enough, let's choose better systems

because look what you might have
with Postgres.

Michael: Yeah, look what can happen
and...

Nikolay: Yeah, might happen, can
happen.

Michael: Exactly.

Something quite innocuous.

So they've said the plan flip was, maybe not the root cause, but the thing that took them down, due to insufficient statistics. But yeah, they're an infrastructure provider, right? They do authentication, and if auth is down, that means all of their customers are down. Exactly, so this is a huge issue for them and for all of their customers.

They are very wary of things that
can change unexpectedly, understandably.

But equally, lots of services that
need extreme high availability

run on Postgres, so how can they
do it?

I think that would be a good discussion.

Nikolay: So in my opinion, their
root cause analysis is halfway

done.

And naming the plan flip as the root cause is not all right.

Why?

The 5 Whys, you know this approach.

Michael: Yeah, I think to a...

Nikolay: Yeah, the basic approach for root cause analysis is to ask why until you cannot answer at all.

Their analysis was saying: the problem happened because Postgres triggered ANALYZE, obviously, an automatic ANALYZE by the autovacuum process, which caused the plan flip.

So they did only 2 steps in this 5 Whys approach.

They said: okay, what happened? Why are we down? Because statistics changed. Why is this? Because of a plan flip. Why? Because it was triggered by autovacuum.

Two or three steps.

But you can go deeper, right?

Yeah.

You can go deeper.

I think there was 1

Michael: more, I think they did increase the statistics target, for example, to make it less likely to happen again. But there still are more steps, for sure.

Nikolay: yeah they increased
default_statistics_target?

Michael: Maybe not default but
for the table I think.

Or for the column, I forget which.

Nikolay: Yeah, it's either Column
or global.

Anyway, that's interesting because
in my opinion there are different

plan flips, different kinds of
plan flips.

And I'm still trying to figure
out for myself, but if you have

plan flip when suddenly like it
flipped after recalculation of

statistics, that's interesting because that might happen.

This is actually a difficult type, this is a hard type of plan

flip.

But the question is why it happened only now, not 1 hour ago,

for example.

Michael: And what's to stop it happening again?

Nikolay: Yeah, yeah.

And this is interesting, right?

Michael: So I have some thoughts on this specific case, but I think it might actually be useful for us to go back to basics a little bit and define what we mean by a plan flip and what the normal causes of them are. I think this is a little bit of a special case that should be interesting, and might be completely avoidable, but a lot of cases aren't avoidable, and I think that's an interesting discussion.

But yeah, in terms of plan flips, I wanted to ask you: do you define them only as when they're bad? Because we have a cost-based optimizer, right?

The reason we can flip query plans is because the planner has

choices and it makes those choices based on estimated costs of

doing different join algorithms or different scan types.

But flipping is part of its design, right?

The same query running on different data as the data changes,

perhaps should use a different execution plan.

It would be faster to use a different execution plan.

So flipping is designed, it's deliberate, but the idea is it

flips to a better plan when the data suits that.

Nikolay: So sudden plan change is normal.

Michael: Yes.

So, yes.

When people say plan flip, it feels like they're only ever

Nikolay: talking about that.

Michael: Yes, exactly.

Nikolay: Yeah, because this is when it changes to a wrong state. Because the planner sometimes makes mistakes.

Michael: Yes, or misestimations.

Nikolay: For example, if you have random_page_cost = 4, which is the default, I saw many situations where we have 2 candidate plans with very similar costs under these circumstances. So for this data, what we have in pg_statistic, and these settings, including the default random_page_cost of 4, I saw cases where there are 2 plans.

1 is obviously wrong.

For example, sequential scan for huge table.

And another is obviously right, which is index scan.

I'm remembering some simple case, simple in terms of it's obvious to a human, or even AI, that this plan shouldn't be chosen for this table, because it's a sequential scan on the whole table. But since Postgres thinks, for example, that random I/O is so expensive, it suddenly chooses the sequential scan plan instead of the index. And you don't notice until some point, because these costs are very close to each other. And then suddenly, when a little bit more data comes to the table and ANALYZE happens, the bad sequential one wins and that's it.

And this we usually solve by adjusting
random_page_cost.

Telling the planner that please
don't choose sequential plans.

We had Tomas Vondra on, and he has a great proof that maybe random_page_cost 4 is not bad; maybe it should even be bigger. But I have a split brain here, because I see that his reasoning makes sense theoretically, but what I saw in production proves the opposite.

I saw these plan flips happening just suddenly. It was like some ticking bomb: it was sitting there, sitting, and then boom, suddenly the planner chooses the sequential scan and we are down, because queries time out, right?

Michael: Yeah, I have a theory about this. I've been thinking a little bit about it, and Tomas's work is very good in theory, because he cares about the optimal time to flip; he cares about how much data we need to be reading before a sequential scan is faster.

But in practice, especially with
large tables, the flipping point

is not that important.

The main point is we almost never
want to request that much data.

Often, especially in OLTP, we want
very small SELECTs.

We're only selecting 1 record at a time, or only a very few records.

So the flipping point, we want to put it so far out of sight that even if our row estimate is really bad, we're safe. Because often it's a combination of 2 things, right? It's not just the cost of doing random access, it's also how many rows we expect to be reading. And if that's out by a lot, and the cost is high, those multiply together, and suddenly Postgres is thinking: you know what, I might as well scan the whole table here, because I'm expecting to read so much.

But if we can, if we decrease the
random_page_cost, we get away

with the bad row estimate.

So I think you can fix either and
get towards it.

But if you do both, then you massively
decrease your chance

Nikolay: of

Michael: this kind of issue.
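A toy sketch of that multiplication, with made-up numbers; this is not Postgres's real cost model, just the two knobs named above, the row estimate and random_page_cost, compounding:

```python
# Toy cost comparison: NOT the real planner formulas, just the two factors
# that multiply together in an index scan's cost estimate.

def seq_scan_cost(table_pages, seq_page_cost=1.0):
    # Sequential scan: read every page at sequential cost.
    return table_pages * seq_page_cost

def index_scan_cost(estimated_rows, random_page_cost):
    # Pessimistic sketch: one random heap page fetch per estimated row.
    return estimated_rows * random_page_cost

PAGES = 1_000_000  # a large table

# Accurate estimate (100 rows), default random_page_cost = 4: index wins.
assert index_scan_cost(100, 4.0) < seq_scan_cost(PAGES)

# Estimate off by 3000x AND random_page_cost = 4: seq scan "wins".
assert index_scan_cost(300_000, 4.0) > seq_scan_cost(PAGES)

# Same bad estimate, but random_page_cost = 1.1 (SSD-era value): index wins.
assert index_scan_cost(300_000, 1.1) < seq_scan_cost(PAGES)
```

Fixing either factor alone can be enough; fixing both leaves much more headroom, which is the point being made here.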

Nikolay: I hear you and this in
my mind connects very well to

the topic of OLTP versus analytical
versus hybrid because there

are good advancements in the area
of analytical queries processing

for Postgres in various ways.

And I also strongly believe that there is a very simple case called pure OLTP. Pure OLTP means that we need to serve users who are humans mostly, maybe, and they have human perception.

Our very first episode was about
this.

200 milliseconds, this is when
they recognize something is already

happening.

And if it's significantly more than 200 milliseconds, it feels slow, which means we need to have our queries below 20 milliseconds, because 1 web page consists of multiple queries, usually roughly 10, say 10, as some guess, right? It means we need to be below 20 milliseconds each, and in this case, by no means do I want any sequential scan to be applied.

Never, like, I don't want it, like,
just forget about it.

I want to turn it off completely.
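The budget arithmetic behind this is simple to spell out (the 200 ms threshold and 10 queries per page are the episode's rough guesses):

```python
# A page feels instant under ~200 ms of human perception, and one page
# typically issues ~10 queries (worst case, sequentially).
perception_budget_ms = 200
queries_per_page = 10

per_query_budget_ms = perception_budget_ms / queries_per_page
assert per_query_budget_ms == 20  # each query must stay under ~20 ms
```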

Michael: I would counter that sequential scans on tiny tables
are fine, right?

If it's only got 100 rows in it,
but also an index scan on that

table would be fine as well because
it's still only got 100 rows.

So the times where sequential scan
is best don't matter anymore

because an index scan, whilst being
less efficient, or even a

bitmap scan would be absolutely
fine, even if it's twice as slow,

or even 5 times as slow because
it's such a tiny table.

Nikolay: But there are also others: there is the bitmap scan, the bitmap index and heap scan, so it's like in the middle in how the planner thinks about the cost. It's interesting.

So I think how we tune some service
for OLTP, and this is obviously

OLTP.

People didn't want this service
to,

Michael: of course,

Nikolay: Yeah.

We definitely want very strict
timeouts.

I think 15 seconds is too long. We should go down to 10 seconds. So statement_timeout should be 10 seconds. Even transaction_timeout should be 10 seconds, to be radical. Let's introduce the concept of radical OLTP.

Yes, you have some workloads that need to be analytical, things people can wait on, or some background jobs. Yes, a CREATE INDEX taking a couple of hours is also a different story. But if it's a pure OLTP mobile app which many people depend on, set transaction_timeout and statement_timeout to a very low value, be very strict, and just cut those queries which start running longer.

In this case, if you, for example, have a bunch of them happening, there's a chance that on a powerful machine you will survive. If you don't have it, there will be a snowball effect of many sessions being stuck, running a sequential scan or something not optimal, and consuming resources.
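A back-of-envelope way to see that snowball is Little's law: concurrent sessions ≈ arrival rate × average duration. The rates below are made up for illustration:

```python
# Little's law: average concurrent sessions = arrival rate * avg duration.
# Made-up rates; the point is how a duration blowup multiplies into backends.
def concurrent_sessions(queries_per_second, avg_duration_s):
    return queries_per_second * avg_duration_s

QPS = 500

# Healthy OLTP: 20 ms queries keep ~10 backends busy.
assert concurrent_sessions(QPS, 0.020) == 10
# After a plan flip to 10 s queries with no timeout: 5,000 stuck backends.
assert concurrent_sessions(QPS, 10.0) == 5_000
# A 1 s statement_timeout caps the damage at ~500 backends.
assert concurrent_sessions(QPS, 1.0) == 500
```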

Michael: The other thing it might
do is it might more quickly

point you in the right direction
as to what's causing the issues

that you start...

Nikolay: Yeah, how to diagnose.

Let's see. I think I'm pushing us to problems and resolutions already, but let's go back. You raised a very good point about what a plan flip is. In my head it's also a negative term: a plan flip is a sudden plan change in the wrong direction, when basically the planner chooses a plan we don't want, like a sequential scan, or maybe a bitmap scan which is timing out, for example.

Michael: Yeah I think that it might
even be worse than that I

think it might even be like it's
noticeably worse.

Nikolay: If

Michael: it flips and it's 1.1 times slower, I think most people don't really care. But if it's double or triple, or it takes you from being at 90% CPU to 100% CPU, it crosses that threshold. Maybe, again, it's the Tomas episode we did on performance cliffs: it's a cliff for you then.

Nikolay: Yeah, in my opinion the performance cliff term is different, and it's macro, while plan flip is a micro term. It's just 1 session, and we see that suddenly execution of this query became much worse. It was 10 milliseconds, and suddenly it became like 1 minute.

Boom.

And obvious example is like we
had an index scan, but suddenly

the planner...

And usually it happens in complex
queries, right?

So joins, subqueries, CTE, multi-stage,
like it's something complex.

And then boom.

While performance cliffs for me are macro level: some lightweight lock contention that wasn't happening before suddenly happens.

There is no official terminology
here.

So I'm just reading some blog posts
and trying to adjust my own

mind to them mostly, right?

This is how I feel.

So plan flip is negative term,
right?

So we don't want it to happen.

And we started with difficult example.

This company obviously experienced
this difficult example.

And my question to them is: what random_page_cost did you have? Did you have statement_timeout, idle_in_transaction_session_timeout (maybe not related in this case), and transaction_timeout? So safeguards, safeguards which would protect from a huge wave of sessions executing this bad plan. And also, what were your autovacuum settings?

I think

Michael: that's relevant.

Nikolay: Before we go to an increased
number of buckets with

statistics target, these are macro
settings we need to keep in

mind.

Michael: I had another thought.

This for me felt like a data skew issue.

And so they mentioned the column in question was 99.9996% NULL values. And this query was looking for the not-null values, of which there are still 17,000. So we're talking about billions of records, but only 17,000 were not NULL.

In cases like that, where you've got such a data skew, you could be looking at increased statistics for those columns, or often you want a partial index on a column like that, right? Because you don't care about indexing the billions of NULL values, you only care about indexing the not-null ones. And once you've got the perfect index on that (I know it could be a more complex query where you want different columns), if you've got those columns in an index with a WHERE the value IS NOT NULL clause, then the planner is almost certainly going to choose it, because it's such a perfect index for that query, and you don't have to worry about misestimations. It's never going to pick a seq scan in that case.
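Why such a skewed column is hard for ANALYZE can be shown with the episode's own numbers. A sketch, assuming the documented sampling rule of 300 × statistics target rows:

```python
# The episode's numbers: ~17,000 non-NULL values in a 99.9996%-NULL column.
# ANALYZE samples 300 * statistics_target rows (the documented rule).
not_null_rows = 17_000
not_null_fraction = 1 - 0.999996                 # 4e-6
total_rows = not_null_rows / not_null_fraction   # ~4.25 billion rows

def expected_not_null_in_sample(stats_target):
    sample_rows = 300 * stats_target
    return sample_rows * not_null_fraction

# Default target 100 -> 30,000-row sample -> ~0.12 non-NULL rows expected:
# the sample usually contains none of the interesting values at all.
assert expected_not_null_in_sample(100) < 1
# Even target 1000 (300,000-row sample) expects only ~1.2 of them.
assert expected_not_null_in_sample(1000) < 2
```

Which is why a partial index sidesteps the problem entirely: it does not depend on the sample ever seeing those rare rows.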

Nikolay: So we definitely lack information here.

Yeah.

We didn't see the plans.

We don't know their settings.

We even don't know the version.

Right.

So we can only guess.

And again, there are difficult situations with plan flips.

I saw quite difficult ones and where obviously like everything

was normal.

But what I'm trying to say is, you can get a plan flip because you didn't do a good job configuring Postgres well enough and putting in these safeguards for OLTP workloads. And this is what we do, number 1, with all the clients we are consulting for, and also the clients who install our monitoring and checkup service. This is the number 1 thing: we recognize we are OLTP, the priority is to have super fast queries, and we need to prepare the basics, right?
Unfortunately, they cannot protect us in 100% of cases.

But maybe you have your autovacuum not tuned. For example, there is this relatively new setting; I remember Darafei raised this topic back when we didn't have it: autovacuum_vacuum_insert_scale_factor. So for vacuuming there are 2 now, right? One triggers on inserts, and by default it's 20%. And we also have the regular autovacuum_vacuum_scale_factor, which triggers on updates and deletes, also 20%.

For analyze, it's 10%, 0.10, which
means 10% of your table needs

to be changed.

If it's a huge table, it's a lot
of data.
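The trigger condition is the documented formula, which is easy to put into numbers:

```python
# Documented autoanalyze trigger:
#   changed_tuples > autovacuum_analyze_threshold
#                    + autovacuum_analyze_scale_factor * reltuples
def rows_until_autoanalyze(reltuples, threshold=50, scale_factor=0.10):
    return threshold + scale_factor * reltuples

# A 1-billion-row table with defaults waits for ~100M changed rows:
assert rows_until_autoanalyze(1_000_000_000) == 100_000_050
# Dropping the scale factor to 0.01, a common tuning for large tables,
# makes analyze fire roughly 10x sooner:
assert rows_until_autoanalyze(1_000_000_000, scale_factor=0.01) == 10_000_050
```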

Again, this data skew is a good
point.

You had 1 distribution, but suddenly
all the new records, they

are very different.

And if you don't have properly tuned autovacuum, you accumulate a lot of that data. And then suddenly, after 10 or 20% of the data changed or was inserted, autovacuum finally ran analyze. Suddenly we have a flip.

If it happened earlier, maybe it
also would be bad.

It's actually a good point.

Maybe still we need the statistics
target, but we, like in my

opinion, it's better to have more
frequent process running on

large tables.

So statistics will represent actual
data better, more often,

right?

To allocate more resources, to
have more workers.

This is what we do all the time
with larger services.

And unfortunately, we see, like
I would say 2 out of 3 clients

coming, most of our clients are
startups, which like have say

database exceeding 100 gigabytes
up to 10 or 20 terabytes.

This is like sweet spot for us
where they experienced some problems

already and 2 out of 3 roughly
have poorly tuned autovacuum.

Poorly tuned.

Michael: Only 2 out of 3 is quite
surprising to me.

Nikolay: Yeah, this is my rough estimate. I can check statistics, you

Michael: need to update your statistics.

Nikolay: My internal ones, yeah. I need to ask a guy to maintain my statistics.

Michael: Anyway, I had a question though. default_statistics_target is like 100, and I quite often see people increasing it to 1,000 when an issue like this happens, or at least for specific columns. Other than increased analyze time, which is important but not that important, and it doesn't increase that much, and it's also not that important for OLTP systems, right? It's a background thing. I'm wondering...

Nikolay: I'm not really sure during
upgrades though.

Michael: Only some types of upgrades,
right?

Because if you've got such an important
infrastructure OLTP system,

you're probably not going to be
doing pg_upgrade.

Like, you're probably not going
to be doing those types of upgrades

that care about analyze.

You're probably going to be doing
logical failover.

Nikolay: You still need to analyze.

Still not.

You create this logical replica,
you need to analyze this replica.

Michael: Not while you're, you
don't have to be down during that.

That could be done while the other
system's serving queries.

Nikolay: You are right.

You are right.

Michael: And the statistics will
have been getting updated while

it's being populated because vacuum
will be running and analyze

will be running.

So it's not in this.

So I was wondering for really important
OLTP systems, why not

just boost the default_statistics_target
everywhere and just

pay that extra cost?

Nikolay: Did we raise this question
with Tomas Vondra or no?

Michael: I don't think so.

Nikolay: Maybe we raised it when
he was on PostgresTV YouTube

channel.

I remember the opinion that 100
buckets is a lot.

Right?

This topic connects to partitioning
for me.

Because when I see a table with
10 billion records, like a table

like multi-terabyte in size, It's,
oh, this is like already quite

late.

We should have it partitioned.

In this case, we would have many
tables and at least we would

have localized problem.

Only for this partition we have
a problem.

Michael: And we would have, let's
say it had 30 partitions, we

would have 30 times more data in
the samples

Nikolay: because it...

Different statistics, yeah.

Yeah.

Yeah.

So I don't know. I see people raise it. Okay, does it affect analyze? Yeah, it does. How much? Can we afford it? Okay, set it to a thousand globally. Okay, how to prove it helps or not? We need holistic analysis of plans. This is actually what we tried to do many years ago, before the current state of the art of things with branching and so on.

Michael: Actually, I disagree. I've just realized I disagree.

I don't think this is about on
average improving things.

I don't think this is a case of
on average I want a slightly

better execution plan.

I think this is a case of I never
want to completely flip to

an awful plan.

So it's about risk mitigating,
it's like mitigating the outliers

not about improving.

I would actually accept slightly
worse on average latencies.

Nikolay: I don't talk about average.

I don't talk about average.

I'm talking about complete analysis.

When you have, for example, thousands
of plans based on production,

you have a clone of production
and using branching you can create

replicas and compare plans before
and after the change.

This is the "what if" framework which we have with Database Lab.

And you can compare plans.

There's a question of how to properly compare plans at a large scale.

We talked about this before.

Our approach is just compare structure
and cost and the buffers

and also timing.

But timing, it's tricky because
it's volatile and so on.

If cost changes seriously, this
is already a signal to investigate

what happened.

If structure changes significantly,
also a signal.

Structure, by the way: maybe you don't need the whole structure, you probably just need the access methods. Like you had 3 index scans, and suddenly you have 2 index scans and 1 sequential scan, right? So it's bad, right? So what happened?
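That access-method diff is straightforward to sketch against EXPLAIN (FORMAT JSON) output; the plan trees below are hand-made stand-ins for real EXPLAIN results:

```python
# Diff the access methods of two EXPLAIN (FORMAT JSON) plan trees.
from collections import Counter

def node_types(plan):
    counts = Counter([plan["Node Type"]])
    for child in plan.get("Plans", []):
        counts += node_types(child)
    return counts

before = {"Node Type": "Nested Loop", "Plans": [
    {"Node Type": "Index Scan"},
    {"Node Type": "Index Scan"},
]}
after = {"Node Type": "Nested Loop", "Plans": [
    {"Node Type": "Index Scan"},
    {"Node Type": "Seq Scan"},
]}

appeared = node_types(after) - node_types(before)
# A Seq Scan appeared where there was none before: the signal to investigate.
assert appeared == Counter({"Seq Scan": 1})
```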

Anyway, this holistic approach
is great.

So you can have a huge amount of
plans, and then you have the

ability to do the change and compare
before and after.

Again, I haven't visited the topic
of default_statistics_target

particularly in this context for
quite a while, but it would

be great to do it.

And again, I have a bunch of unfinished
thoughts.

This topic, plan flips, is unfinished
in my head.

I'm still thinking what's better.

So once again, you don't need average
or something.

You want, it's a classic optimization
problem.

You want to optimize something
and you don't want everything

else to degrade significantly.

You say, I'm okay if something
degrades like maximum 10% on some

metric There's a question which
metric to choose also, but I

want this to be improved.

This is 1 thing.

So we have a framework; we don't have all the answers yet. The second thing is how to understand that we are on the edge of a flip. Or a cliff, if you like, doesn't matter.

How to understand.

Yeah, I think,

Michael: I actually don't know
if this is important as much as

how do we stick with what we've
got.

I think

Nikolay: I want to predict the
flip.

Michael: Yeah, but I think it's
really difficult to predict bad

plan flips without also predicting
good ones.

There would be so many false positives.

If you think about how you would
do it.

Nikolay: I forgot to mention that
in our case, We execute on

real database, full size, full
clone, using database branching

provisioned very quickly.

And we see not only planner, but
executor in action, like realistic

testing, right?

Full-fledged realistic testing.

We execute hundreds or thousands
of queries, samples properly

collected.

So it's not that we have 100 samples and 90 of them are the same query ID.

We take query ID, we collect multiple
samples for each query

ID, and then we execute before
and after the change.

We developed this framework specifically
to predict plan flips

for major upgrades.

Michael: Yeah, which is, it's slightly
different, right?

I think

Nikolay: it's different.

Michael: I think it's different.

I think why it's different is often
people talk about plan flips

in the context of I didn't change
anything and this thing just

happened yeah whereas with a major
version upgrade you know in

advance you're doing

Nikolay: it, you

Michael: can look at it, yeah. You have this reference point: if we change...

Nikolay: You have a new engine.

Michael: Yeah, yeah. So I think psychologically at least it's different.

I understand that in practice there's
a lot of the same harnesses.

Nikolay: So that's why I'm saying
there are easy plan flips and

there are difficult plan flips.

Easy is when you did some action,
you upgraded Postgres.

You jumped between 13 to 18, 5
versions for example.

There are new nodes and plans,
right?

Never existed before.

So definitely

Michael: new settings, new configuration
parameter defaults.

Nikolay: In my opinion, plan flip
analysis should be a part of

major Postgres upgrade.

And we have it for our clients.

Yeah, we have it for our clients.

It still requires some work; it's semi-automated right now. I hope at some point we will fully automate it. And it shines when you have database branching properly implemented.
But there are difficult plan flips.

So to explain how to properly categorize it, let's stop a little bit and think: what does the planner take into account when choosing plans? It's only 3 things. First, the version, which defines the version of the planner, basically which engine we are using. Second, settings, and not all settings: my current vision is the planner settings plus work_mem.

Why am I saying plus work_mem? Because for the planner settings there is a category: if you check pg_settings, 1 of the columns is category, right? And there is a category for planner settings, but work_mem is outside it. So, plus work_mem.

And finally, the number 3 is pg_statistic
content.

And that's it.

It doesn't care about hardware
or actual data.

Michael: Yeah, true.

It's not actually the data volume,
it's how much data it thinks

it has, which is the statistics.

Nikolay: Yeah, and if the statistics are lagging, it might live in a different reality. That's why I say I would prefer, and it might still be wrong, but I would prefer it to be wrong sooner and to be closer to reality, right?

Yeah.

So when we change engine, it's
obvious we know when it happens.

And before that, we can perform
this plan flip analysis holistically

according to our methodology.

This is great.

We can predict this plan flip and
prepare a resolution, solution

for it in advance.

This is what we do with clients.

Second, if we change settings,
basically it's the same.

We also know when we do it.

For example, okay, we're going
to raise default_statistics_target

globally.

This will help to this plan.

Question, how will other plans
behave?

We don't want them to degrade significantly,
at least now.

The same approach can be applied.

The third thing is pg_statistic
content is changing.

Again, I would prefer it to be changing more often, like continuously changing, so autoanalyze should happen more often to reflect the current reality. For this we need autovacuum tuning. And then, how to predict?

Michael: Also, you missed something
which is we could gather

more statistics, like it's a sample
and we could increase that.

Nikolay: Yeah, this is mitigation
activity already.

I agree that moving from 100 to
1,000 sometimes might make sense.

And honestly, like, I don't have
strong opinion here.

I have clients with 100, with 1000.

In the past, I had clients who
thought that going down to 10

is a good idea.

I don't have them anymore somehow.

Somehow right now, all the thoughts are in the direction of raising it, not reducing it. But I remember we had such discussions: maybe we should lower it, counterintuitive ideas sometimes, you know. We don't have that anymore.

Anyway, the first 2 things we usually control.

We don't have self-driving Postgres
yet, so we control it.

And we can perform plan flip analysis
if we have such tool.

But...

Michael: Wait, there's 1 more thing.

It's not available everywhere,
but another thing that does affect

this is, for example, if you're
using hints with pg_hint_plan.

Nikolay: Okay, hints.

Michael: I think that's a big topic,
though, in production.

Nikolay: That's a great topic.

Let's consider it an exception, because it's not a major approach right now. It's very rare.

Is that true?

I hope so.

Michael: I think in other databases,
this is actually often 1

of the main.

Yeah, and

Nikolay: You just try to open another gate or something, and I agree, let's open it, but let's just finish with this thread while I think about the second plan.

Again, if you change major version,
you control it, you can predict

plan flip.

If you change settings, the same
thing.

You control it, you can predict.

But if changing pg_statistic content due to data changes leads to some plan flip, this is hard to predict. How to predict it? When I think about it, I want to be able to check the second plan, or maybe the first 5 plans which were almost winning. Because in this case I can say: okay, I see the current winning plan has cost, say, 490, but there is also a second plan which has cost 500, and it's quite close. If the difference is below, say, 20%, I will start thinking: might it flip soon? This is why I want the second plan.
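That closeness check is a one-liner. A sketch, noting that Postgres doesn't actually expose runner-up plans today, so the costs here are hypothetical:

```python
# If a runner-up plan's estimated cost is within a margin of the winner's,
# flag it as flip risk. Hypothetical costs: Postgres does not expose
# runner-up plans, which is exactly the complaint above.
def flip_risk(winning_cost, runner_up_cost, margin=0.20):
    return (runner_up_cost - winning_cost) / winning_cost < margin

assert flip_risk(490, 500) is True    # ~2% apart: might flip soon
assert flip_risk(500, 900) is False   # 80% apart: stable for now
```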

Michael: I think this is incredibly
tricky though.

Maybe it's simple, but for example,
parameters matter a lot.

Like per query, different parameters.

If you get, for example...

Nikolay: Oh yes, parameters matter.

Yeah, I agree.

Michael: Yeah, a huge amount.

If you look at the solutions that SQL Server and Oracle have, and bear with me, because I think SQL Server and Oracle are a bit further ahead of Postgres on this, and maybe DB2 sometimes has some features in this kind of area: the solutions they've ended up implementing are around forcing the existing plan. Yeah, so plan pinning.

Nikolay: Aurora has it.

Postgres Aurora has it.

Michael: Interesting.

So yeah, these are the solutions
that other databases have ended

up implementing rather than detection
of, rather than trying

to predict a plan flip, what they
try and do is pin it or freeze

it and then maybe require approval
when it does flip.

So instead of trying to predict it, they prevent it, and then you have it as an opt-in thing. Maybe the change is for the better, but maybe you want someone to approve it first.

Nikolay: So there are several things
here.

You touched several things.

First, depending on parameters.

I agree with you.

But in a system which is quite big, usually the Pareto principle is working, basically: a lot of the workload is defined by a limited number of specific plans.

And this is how we...

In our plan flip analysis, we have
methodology to identify not

only normalized queries from pg_stat_statements, but also particular

examples, particular parameters.

So we would reproduce exact queries
with exact execution, so

plans, which contribute to workload
the most.

For example, contribute in terms of calls, or in terms of total execution and planning time.

So we want to cover like 80% of
workload in several directions,

in several metrics.

And this is the magic, like how
we collect those samples and

prepare our testing query set for
testing, so it would represent

the majority of workload.
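The selection step can be sketched like this, with made-up pg_stat_statements-style rows:

```python
# Pick top queries by total execution time until ~80% of database time is
# covered. The rows are made-up stand-ins for pg_stat_statements output.
def covering_set(queries, target_fraction=0.80):
    total = sum(q["total_exec_time"] for q in queries)
    chosen, covered = [], 0.0
    for q in sorted(queries, key=lambda q: q["total_exec_time"], reverse=True):
        chosen.append(q["queryid"])
        covered += q["total_exec_time"]
        if covered / total >= target_fraction:
            break
    return chosen

workload = [
    {"queryid": 1, "total_exec_time": 700.0},
    {"queryid": 2, "total_exec_time": 150.0},
    {"queryid": 3, "total_exec_time": 100.0},
    {"queryid": 4, "total_exec_time": 50.0},
]
# The top 2 queries already cover 85% of total time; the tail can wait.
assert covering_set(workload) == [1, 2]
```

In practice the same pass would be repeated over other metrics, such as calls, as described next.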

Once we found a way to do it, there
is some tail.

Those queries are less important, because if you just think about database time in terms of execution and planning, if some query defines only a minority of that time, even if its plan flips, it won't affect the whole cluster so much, right? Yeah, that makes sense, right?

Michael: It does and actually reminds
me of a different idea

Nikolay: Maybe I'm wrong, you know what it might be like: okay, now it's only like 1% of the whole time, and then due to a plan flip, boom, suddenly it's half of the time.

Michael: It reminds me, though, of something that I think could help here. It's not in Postgres, although maybe there's a couple of examples in Postgres core at the moment, but the idea is that the plan is chosen and then is stuck to no matter what during execution. I think Tom actually brought this up at the end of our call on performance cliffs: it could be adaptive at runtime. If it realizes it's chosen a bad plan, instead of sticking to it till

Nikolay: the end.

We should invite Andrey Lepikhov,

who, I think, can talk about
this adaptive

query optimization a lot.

I consider him an expert in this
area.

Michael: Great.

But do you see what I mean?

It's able to adapt during execution.

Maybe we avoid the issue, or at
least it's way less bad when

it happens.

Nikolay: Yeah.

And also, to finish my thought:
maybe, as I said, maybe this

approach is wrong.

Maybe some plan which contributed
only like 1% to the whole database

time is now suddenly half.

Because of this, my gut
told me that we need multiple

directions.

One of the directions: when we collect
this working set, we also

order by calls.

So it should be quite...

So for a query's plan
flip to influence the whole

cluster so much that it causes
downtime, the query should be quite

frequent, I think.

If it's super infrequent, it should
be fine.

If it's not super frequent, it
means that it won't occupy multiple

backends in parallel.

Michael: Unless it gets really
bad, yeah.

Nikolay: Unless you forgot to tune
the statement timeout or transaction

timeout.

True.
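The timeouts just mentioned can act as a guardrail, so a single flipped plan cannot hold many backends for minutes at a time. A minimal sketch, where the role name is hypothetical and transaction_timeout requires Postgres 17 or later:

```sql
-- Cap how long any single statement or transaction may run.
-- statement_timeout exists in all supported versions;
-- transaction_timeout was added in Postgres 17.
ALTER ROLE app_user SET statement_timeout = '30s';     -- app_user is hypothetical
ALTER ROLE app_user SET transaction_timeout = '2min';
```

The specific values are workload-dependent; the point is that unbounded execution is what lets a plan flip escalate into a full outage.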

Yeah, we have multiple vectors, and I think this is a good approach,

a good-enough approach; maybe it can be improved.

So I'm very curious about this area.

And I'm coming from practice.

Michael: Yeah, I do think there are still complex cases where you

might have clusters: even the same query in pg_stat_statements

could have 4 different plans that are actually all good.

Let's say you're a multi-tenant system: maybe tiny customers

have this plan, typical huge customers have this plan, huge customers

with a weird distribution have that plan, or maybe healthcare customers

look different for some other reason.

Yeah.

Nikolay: Yeah.

Yeah.

And in the analysis, again, as I said, we collect multiple plans

per query ID.

Why multiple?

Because of this.

So we take, say, the top 5 for each query ID; we collect multiple

of them.

And all of them we take from this top N by each metric.

So we try to capture the concept of how influential this

query is, and multiple plans might appear there.

It would be great to have a plan ID, and to shift the aggregation

logic there, but for now only query IDs remain.

Michael: I think pganalyze, Lukas, did some work on a

plan ID.

Nikolay: Yeah, I saw it, and I commented

on LinkedIn; it's great work.

Absolutely needed, I think.
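Until a plan ID exists in core, one stock way to correlate plans with query IDs is auto_explain together with compute_query_id; a sketch, with arbitrary thresholds:

```sql
-- Log slow plans together with the query identifier, so different
-- plans observed for the same queryid can be grouped from the logs.
LOAD 'auto_explain';                           -- or add to shared_preload_libraries
SET compute_query_id = on;
SET auto_explain.log_min_duration = '100ms';   -- arbitrary threshold
SET auto_explain.log_verbose = on;             -- VERBOSE output includes "Query Identifier"
```

This only captures plans of queries that actually run slowly, which is a limitation compared with true per-plan aggregation.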

So then, if we have this, there are 2 questions.

First: how to understand that some plan candidate is approaching,

and the plan can suddenly flip soon.

Because with data in large tables, data usually is not changing

too much suddenly.

I agree with you.

Michael: It's like gradual.

Nikolay: Yeah.

Yeah, it's quite gradual.

So the planner behavior involving large tables is dangerous

when we have 2 competing plans.

This is my hypothesis, right?

So again, for me, a lot of open questions still.

It might be on the edge.

And I want to feel this edge for each query ID, or for

each plan, slightly in advance, to be able to predict.

I think someone recently did some work in this area, to show plan

candidates.

I might be mistaken, but I saw something somewhere; I didn't

dive deep.

And the second thing: in my opinion, hints and plan freezing

are 2 great features.

I put them in the enterprise bucket; for me it's an enterprise

thing, because doing hints and plan freezing is interesting

for larger systems, and probably we should have both.

But I also understand the reasoning that it makes the system complex

and requires a lot of management.

There is a price you pay.

If you put something, you start to be responsible for it.

You need to maintain it.

And if you pin some plan, it's the same thing: you're responsible

now, right?

Basically, do we have the capacity to manage it?

Because maybe later it helps less, and maybe then it's not helping

but doing the opposite.

That's the problem.

Michael: Absolutely.

But I do think it's interesting that these other databases, given

the choice between the 2, went in that direction.

I'm not saying it's a good idea, but they've clearly thought

about it and there's some merit to it.

And it's been working for many years, of course, with issues,

of course, with unintended side effects, but it is a solution

I think worth considering.

So yeah, this is good.

I thought it was actually really interesting.

Nikolay: A lot of open questions.

Of course, I wish... I saw you sent me this list of Postgres 19 improvements

in terms of plan management, and hinted plans are mentioned there.

I wish the Aurora team open sourced this.

It's not Aurora specifically.

It's called the apg_plan_mgmt extension:

APG plan management.

Yeah.

So it's a great thing,

and I wish it existed in the open source ecosystem.

I think eventually it should.

When would I use it? When suddenly we have a plan flip,

we know which plan we would prefer, so let's just pin it right

now, before we apply a permanent solution: mitigate as fast

as possible, without any deployments.

Michael: But let's say we're Clerk, and we know we've got 15

to 20 really important queries; they're on the hot path for every

single auth request, they're really frequent, they run thousands

of times a second. We could deliberately pin all

of those, or put hints into all of those

Nikolay: to make

Michael: them really unlike...

Pardon me?

Nikolay: So this is different.

So what I'm saying is: temporarily
pin, before we apply a permanent

solution.

What you're saying is a permanent pin,
which means there might be some

cost you pay to maintain it.

Michael: Maybe yes, but my point
is that at least we can then guarantee

it won't go bad suddenly.

Maybe it will gradually
degrade, and we can look at fixing

it if it does.

I think it's more likely to degrade
gradually instead of all

of a sudden, which means
we can then fix it Monday to Friday,

9 till 5, instead of during
a huge database outage.
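One built-in approximation of "pinning" hot-path queries is prepared statements with a forced generic plan. This pins the generic plan rather than an arbitrary chosen plan, so it is only a partial substitute; the query and names below are hypothetical:

```sql
-- force_generic_plan makes the prepared statement always use the
-- generic plan, so the planner's choice cannot flip per call.
SET plan_cache_mode = force_generic_plan;
PREPARE session_lookup (text) AS
    SELECT user_id FROM sessions WHERE token = $1;  -- hypothetical hot-path query
EXECUTE session_lookup('example-token');
```

The trade-off is the one discussed above: if the generic plan is wrong for some parameter values, you have now guaranteed the wrong plan.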

Nikolay: And again, just partition
your tables.

I know it costs some effort, but
with AI it's easier.

Michael: And gather statistics
on the parent partition.

Nikolay: Yeah, which is another
topic.

It's fixed.

Laurenz Albe from CYBERTEC fixed
it.

I think it will be in 19.
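Until that lands, the usual workaround, especially after an upgrade or pg_restore, is to gather parent-level statistics manually, since autovacuum does not process partitioned parent tables themselves. A sketch with a hypothetical table name:

```sql
-- ANALYZE on a partitioned table recurses into its partitions and also
-- builds the parent-level ("rolled-up") statistics that autovacuum skips.
ANALYZE measurements;   -- 'measurements' is a hypothetical partitioned table
```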

Michael: Nice.

So automatic gathering of statistics.

Nikolay: So it's a headache during
upgrades especially.

I think it's solved now.

I mean now it means

Michael: in the future.

Nikolay: In the future.

Yeah.

Great talk.

Nice one.

Yeah.

Enjoyed it.

Thank you.

Michael: Likewise.

Thanks so much and catch

you next time.
