Michael: Hello and welcome to Postgres.FM, a weekly show about

all things PostgreSQL.

I am Michael, founder of pgMustard, and as usual, I'm joined

by my co-host, Nikolay, founder of Postgres.AI.

Hey, Nikolay.

Nikolay: Hi, Michael.

Michael: Hello.

Today, we are excited to have a special guest, Tomas Vondra,

who is a Postgres major contributor and committer now working

at Microsoft on their Postgres team.

Welcome to the show, Tomas.

Tomas: Hello, everyone.

Michael: Right, so a few months ago, you gave an excellent talk

about where performance cliffs come from, and we have been excited

to talk to you about this.

So perhaps as a starting point could you let us know a little

bit like how you got interested in this as a topic or why you

chose to talk about it there.

Tomas: So I think I started thinking about performance cliffs

when I actually started working on Postgres or started using

Postgres.

Because performance cliffs are something people often run into and have to investigate: sudden changes in behavior after small changes in the query.

A single additional row in a query could flip the behavior in a way that is completely unexpected and unpredictable, and that causes problems for robustness and so on.

And I actually started with Postgres as a user or a developer

using Postgres.

And one of the things I was responsible for was investigating performance

problems.

So that's how I actually came to work on performance cliffs.

But after I started working on development of Postgres, you also

see the other side of performance cliffs, which is like why it

actually happens in practice, because the database needs to do

some decisions.

So it's also an engineering problem which is interesting for

me as a developer committer.

And I've been working on the planner some of the time, and that's where some of the performance cliffs actually happen, right?

Because of decisions made either during planning or then during

the execution.

So I've been reading some papers, and that's actually where I learned that there even is a name for this, which is the performance cliff.

And there is a whole area of research for robustness, like making

robust decisions where the performance cliffs do not happen at

all or not as often.

So that's why I'm interested in this.

Michael: Nice, and yeah, I really like the term performance cliff.

It's so intuitive, given the example you were talking about of adding one additional row.

So if we go back a few steps, if we're adding one row incrementally

up until the cliff, we might get
a certain gradient, might even

be relatively horizontal and then
all of a sudden there's a massive

spike and that's like the cliff.

I'm probably thinking of it as a spike, but maybe a cliff would be a drop.

So yeah I really like that as a
phrase.

You mentioned kind of planner versus
execution time cliffs and

I guess we're talking about individual
execution more than kind

of system level cliffs, mostly
here.

From your experience as a user, or maybe since you've been looking at trying to prioritize which things to work on, where do you see the majority of cliffs coming from?

Like plan time, execution time, some mix?

Tomas: So that's a good question.

I don't know because I'm inherently
biased, right?

The fact that a lot of performance cliffs, for example, happen during planning means that you can't actually see a lot of cliffs that happen later, right?

So I don't have very good data to explain that, or to say that most of the cliffs happen at some point.

But I would say that the sooner the decision needs to be made in query planning and query execution, the less information you actually have to make a good, robust decision.

And the more likely it is that
it will be wrong, right?

Because for example, we do planning
based on statistics, and

those are inherently inaccurate,
right?

There will be a lot of details
that we remove from the statistics

to keep it small, and therefore,
those are the things that can

actually cause problems.

And then there is, of course, the
thing that even the cost model

itself is like extremely simplified,
right?

So I would say that the main problem is that we are making a lot of decisions early in the query execution.

And because of that, we are kind of fixed in a certain execution path, and that may not be the optimal one, right?

So that's where I think the performance cliffs come from, kind of.

Nikolay: I wanted to ask, there should be a challenge not only about planner versus executor, but also about the behavior of both of them in a single session, when there is no multi-user situation, versus various contention situations, when we have many, many sessions working in parallel and all of them have some issue.

For example, the infamous lightweight lock manager (LWLock:LockManager) contention happens when, during planning, the planner needs to lock all the indexes for the table and we exceed 16 locks, so fast-path locking becomes unavailable for some relations.

And we cannot see it when we're just checking a single query, but it becomes a performance cliff immediately when we have thousands of transactions per second doing the same thing, and planning becomes a problem, right?
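For context, a minimal sketch of how that situation can be observed while it is happening, assuming you can query the catalogs during the workload (these are standard views, not anything from the talk):

```sql
-- Relation locks that overflowed the per-backend fast-path slots
-- show up with fastpath = false:
SELECT pid, relation::regclass AS rel, mode, fastpath
FROM pg_locks
WHERE locktype = 'relation'
  AND NOT fastpath;

-- Under high concurrency, the contention appears as backends waiting
-- on the shared lock manager lightweight lock:
SELECT pid, wait_event_type, wait_event, query
FROM pg_stat_activity
WHERE wait_event_type = 'LWLock'
  AND wait_event = 'LockManager';
```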

I'm curious, how do you think this should be understood, with what tools, and earlier, not when you've already fallen from that cliff, right?

But before. This is the trickiest part: how to diagnose cliffs which are going to happen soon in our system.

Tomas: So, I understand the example
that you've been describing,

but I think that's a slightly different
performance cliff and

not what I usually mean by that
term, because you are kind of

like talking about like resource
exhaustion, right?

You have a resource, which is like
a lock queue or like the buffer

for the fast-path locks, which was
increased or like made larger

in Postgres 18.

Nikolay: So we shouldn't have...

Tomas: ...the problem.

Nikolay: And it's configurable
now, right?

I think it's maybe configurable.

Yes.

Tomas: It's tied to the expected number of locks per transaction, so it is configurable in this way, yes.

So it's not a fixed size like before.

But I think what I mean by performance cliffs, and what I describe in my talk, is mostly performance cliffs that happen because of a binary decision somewhere in the processing, right?

And the locking bottleneck is not
that, right?

I mean, like, yes, you do have
the case where you can either

fit the fast path lock or not,
right?

And if not, you have to do something
much more expensive.

That's a problem, but it's not the thing that I had in mind thinking about performance cliffs.

Nikolay: Yeah, now I remember when
I was checking your talk,

this is exactly what I thought,
like we have different views

on this term.

And I grabbed this term from various blog posts where people talked about the multixact SLRU buffer problem, like it was also a cliff.

Sub-transactions also get called a performance cliff, like saying sub-transactions are cursed, don't use them at all, because there is a cliff hidden there and so on.

So this definition is very different
from yours.

I understand, yeah, yeah, I agree
with you.

Tomas: Yeah, I don't feel like
I'm the authority to define what

exactly that means.

I'm trying more to explain what I've been describing, like the cases that I've been discussing in my talk.

Because also the reasons why that
happens are slightly different,

right?

For the performance cliffs, as I understand them, or as I discuss them,

it's more like an engineering question because you have multiple

algorithms that you can choose from.

Each of those algorithms is kind of optimal, best suited for

certain type of conditions or selectivity ranges or something

like that.

And therefore, if you make the decision based on the assumption that you know the correct answer, a wrong assumption locks you into the wrong algorithm, right?

And I think the bottlenecks that you've been describing are more

about like say changes in the hardware available.

Because years ago when we initially sized the number of

fast-path locks, I think 16 was probably fine.

It was reasonable.

You probably didn't have that many tables or connections, and

it wasn't such a problem.

But nowadays we have many more CPUs, many more cores available

usually.

And we also have things like partitioning, which can completely explode the number of things that you need to lock.

So I think it's more like a consequence of changes in Postgres

that leads to hitting this bottleneck.

I'm not saying it's not a valid problem.

It definitely is.

We need to improve that.

But it's a slightly different thing.

Nikolay: Right, yeah, I agree, yeah.

Michael: The topic's already huge if we only limit it to single

session, single query performance cliffs.

So if we think about that, just those, you talked about both

reducing kind of the frequency of these, so making them less

likely to happen, but also reducing the severity of them.

So it's kind of accepting they will happen, making them less of a cliff, or less of a drop, when they do.

Can you talk us through some things from a user perspective that

we can do to reduce the severity and or to reduce the likelihood

of running into them?

Tomas: That's a good question.

I think it probably depends on each individual case of the performance

cliff.

And in some cases you probably can't do anything.

But it's just good to know that this is what's happening.

You can see the talk on YouTube, and I'm actually invited to give it at the San Francisco meetup in a week.

But the first example that I gave in that talk is about the number of elements in the IN condition, right?

Where you have column IN and then an array of elements.

And there is a flip because we switch exactly from one algorithm to the other, like from linear search to hash.

But that's hard-coded, right?

There's nothing the user can do about that.

The one thing you could do is to pick the right size of the list, right?

But that's not a very practical
thing.
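To make that flip concrete, here is a small hypothetical sketch (the table and values are made up; the exact flip point is an internal, hard-coded constant):

```sql
-- Hypothetical setup to observe the IN-list flip Tomas describes.
CREATE TABLE events AS
SELECT g AS id, (g % 1000) AS kind
FROM generate_series(1, 1000000) g;
ANALYZE events;

-- Short list: each row is checked with a linear search of the array.
SELECT count(*) FROM events
WHERE kind IN (1, 2, 3, 4);

-- Longer list: past the hard-coded threshold the executor builds a
-- hash table once and probes it per row, so adding one more element
-- no longer adds a per-row comparison.
SELECT count(*) FROM events
WHERE kind IN (1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12);
```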

But I think in many cases just
knowing what's causing the problem

can be a very useful thing, because you can say, okay, this is what's happening, I can't do anything about that, I have quantified the impact, and we just have to live with that for now.

I can imagine patches doing something,
like improving that by

adjusting the point where exactly
that flips, by doing some better

estimation.

But at this point, people can't
do anything about that.

For some of the cases in planning,
you can do the usual thing,

like trying to convince the planner to make the right decision,

right?

And there are different ways to
do that.

I don't know what's the best solution
to that.

Michael: Things I really liked were, I think proactively people can adjust some of the cost parameters.

Like, we still have random_page_cost at 4 as the default, even on most cloud providers, so things like that, when we've mostly got SSDs these days, just feel like they're going to be more likely to push plan flips at the wrong times.

If there are other cost parameters that are massively out as well, I'm not aware of any, but that feels like a footgun, something that if people change proactively, or reduce proactively, they'd be less likely to encounter this kind of thing.
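As a sketch of the kind of proactive change Michael means here (the value is illustrative and workload-dependent, not a recommendation):

```sql
-- On SSD-backed storage the default random_page_cost of 4.0 overstates
-- the cost of random I/O relative to sequential I/O (seq_page_cost = 1.0),
-- which can shift plan-flip points away from where they should be.
ALTER SYSTEM SET random_page_cost = 1.1;  -- illustrative value
SELECT pg_reload_conf();
```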

I'm also aware you did a lot of
the work on extended statistics

and that feels like another tool
that, used well, could start

to teach the planner about certain
relationships and make cardinality

errors far less likely.

And again, maybe then less likely to have plan flips at the worst times.

When we do have the plan flip,
it's more likely to be closer

to the optimal point.

So we're less likely to have a
severe cliff at least.

What do you think?
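For reference, the kind of extended statistics object Michael is alluding to might look like this (table and column names are hypothetical):

```sql
-- Teach the planner that city and zip are correlated, so it stops
-- multiplying their selectivities as if they were independent columns.
CREATE STATISTICS addr_city_zip (dependencies, ndistinct)
    ON city, zip FROM addresses;
ANALYZE addresses;  -- extended statistics are built during ANALYZE
```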

Tomas: I think you are absolutely right that tuning some of the parameters, like random_page_cost, for example, is exactly one of the things you can do to convince the planner to make the correct decisions more often, right?

Because you are essentially moving
the flip point between the

costs closer to the actual flip
point, right, between the execution

costs, like measured in time, for
example.

So that's definitely true.

I think most of the time we actually make the correct decision for regular queries, because the selectivity where we would have a problem is kind of in the middle somewhere, right?

And usually queries do have very low or very high selectivity, something like that.

So in between is not very common; it can happen.

But I don't really have a perfect
solution to that.

Extended statistics definitely
can help, but it's still the very

early decision.

It's going to help with improving
that in some cases.

And I know you had a whole podcast
about that.

The problem with the extended statistics
is that it still is

like an extremely simplified representation,
approximation of

the data.

That's the whole point of having
statistics, to have a very compact

thing, very compact data set, which
you can use to do the planning

quickly.

Which means that it definitely
is missing some of the details.

And that can be fatal, right?

So I think what we need to focus
on in the future is kind of

like making the execution part
a bit more robust, right?

Like moving some of the decisions
to that instead of expecting

the decisions in planning to be
perfectly correct all the time,

to kind of like allow the execution
to actually recover from

mistakes.

Nikolay: Change its mind on the fly, right?

Change the execution path.

Tomas: Kind of, right?

Or like, yes, there are different
approaches to that.

Some are kind of like difficult
to do correctly, like replanning,

for example, right?

Or you could have a typical problem for a planner, like a sequence of nested loops, right?

When you have underestimated the
join size or like a condition

selectivity, then you pick a nested
loop and then it explodes

and takes forever.
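A hedged illustration of that failure mode (all names are hypothetical): the misestimate shows up in EXPLAIN ANALYZE as a gap between estimated and actual rows feeding the nested loop.

```sql
-- Correlated filters often make the planner expect very few rows,
-- so it picks a nested loop that then runs the inner side far more
-- often than estimated.
EXPLAIN (ANALYZE, BUFFERS)
SELECT *
FROM orders o
JOIN order_items i ON i.order_id = o.id
WHERE o.status = 'pending'
  AND o.region = 'EU';
-- Watch for "rows=1" estimates vs "actual rows=50000" above the
-- Nested Loop node.
```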

So there are approaches which kind
of blend 2 different paths

through the join, and either do
like a hash join or like a nested

loop, right?

But there are practical issues with what you can do, limiting the behavior of the optimizer, right?

Because if you want to do this,
then for example, you cannot

emit any tuples from that join
until you know that that's the

correct approach, right?

Because then you wouldn't be actually
able to flip to the other,

and so on.

Nikolay: And I see another challenge, it's like user experience.

We're all scared about the probability of plan flips sometimes, right?

And it's dangerous, especially when you do an upgrade, a major upgrade; plan flips are not uncommon.

And I know, for example, AWS Aurora has an extension, which is not open source, to freeze plans so no flips are possible.

What you described, like if I see some plan in EXPLAIN, but then during execution the plan can be changed, it makes my head spin, like, how can I control it?

I'm not controlling it, I just want to understand what's happening, right?

Because EXPLAIN will lie in this case, right?

Or maybe you see something that could help, to see 2 plans in EXPLAIN, or 3 plans, and we know that.

Tomas: But we already can show
this kind of stuff in the plan,

right?

When you have a sub-plan, we can either have a regular plan or a hashed sub-plan, I think.

So we can show alternative plans.

I don't think that would be a problem.
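For reference, the existing case Tomas points to can be seen in an EXPLAIN of a hashable NOT IN (the query is illustrative):

```sql
-- The filter line may show "hashed SubPlan" when the executor
-- materializes the subquery into a hash table instead of re-scanning
-- it for every outer row:
EXPLAIN
SELECT *
FROM orders o
WHERE o.customer_id NOT IN (SELECT c.id FROM blocked_customers c);
-- e.g.  Filter: (NOT (hashed SubPlan 1))
```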

Obviously, that wouldn't tell you
what exactly happened during

execution.

It would need some additional improvements.

But yes, it adds ambiguity to the EXPLAIN plan, right?

Because it would tell you, these
are the possibilities, right?

You don't know what's going to happen, and it could actually differ between executions of the same query, even, right?

But that's the price for higher
robustness, right?

Nikolay: Right.

I'm just thinking about observability tools; many people want them to be improved, like to track...

pg_stat_statements is not enough to have plans, one normalized query can have multiple plans.

In the past, there were attempts
to have pg_stat_plans or something.

There are also extensions which allow you to see the plan for an ongoing query, right?

I think if this direction is going to be the norm, altering the plan on the fly, then observability tooling needs improvements for sure, because we need to understand what's happening.

And auto_explain as well, right? Probably it should report the initial plan and then the additional plan, or something like this, the alterations.

So I just feel an even bigger need to improve the observability part of Postgres as well.
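As one concrete piece of that observability story, auto_explain can at least capture the plan that actually ran for slow queries (settings are illustrative):

```sql
-- Usually configured via shared_preload_libraries; LOAD also works
-- for the current session.
LOAD 'auto_explain';
SET auto_explain.log_min_duration = '500ms';  -- log plans of slow queries
SET auto_explain.log_analyze = on;            -- include actual row counts
SET auto_explain.log_buffers = on;            -- include buffer usage
```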

Tomas: Yes.

I have nothing against observability.

Nikolay: Not a question, just feeling.

Tomas: And I think I haven't tried
like pg_stat_plans for a while,

but I think it still exists, it's
still available, right?

I think it's very useful.

I think there's also a patch to
actually add ability to watch

a plan as it's being executed,
right?

Which can actually help, that would
be a tremendous benefit when

investigating queries that never
complete, where you can't actually

get like explain analyze and so
on, or just watch the execution

of a query in another backend.

So there is a patch for that.

It's definitely not going to be in Postgres 18, but I think Rafael

actually is working on that.

I hope I didn't get the name wrong.

Yeah, but again, I don't think that's necessarily about the performance

only, right?

That's about monitoring in general.

Michael: This concept of robustness is really interesting.

I don't think I'd come across it as a defined term until reading

some of the papers that you link to at the end of your talk and

I found it fascinating.

I think it aligns well with my experience of what people really

care about in performance.

I think, correct me if I'm wrong, but I got the impression

it's really about trying to avoid really bad worst case situations

at the cost of potentially slightly worse average execution time,

for example, if we want to optimize based on time.

So this idea that on average we might perform slightly worse, but our worst case will be nowhere near as bad; the P99 would be way better, or way less severe, but the P50 might be a bit higher.

Is that about right?

Tomas: Yeah, I think that's exactly right.

You're exchanging, you know, worse average performance for having to do fewer investigations of performance issues.

And I think that's a perfectly reasonable trade-off.

I mean, I do like working on micro-optimizations and micro-benchmarks

and everything.

That's great.

But in the end, someone needs to be using the database, right?

That's the point.

I mean, until people had, I don't know, like one database and they cared about that only, that was fine.

As soon as you have like a larger fleet of database servers that

are actually used by people or by applications or something,

you definitely will value robustness, like not having issues,

even if the peak performance is maybe 25% lower or something

like that.

Because it's much easier to spend a bit more on the hardware

than having to run a large team investigating and fixing issues

and firefighting all the time.

That just doesn't work at scale.

I think it's also what some of the Cloud hosting providers realize.

It's a bit against my nature because I definitely do want to

do the best performance possible.

I've been benchmarking a lot and doing performance-related patches

for a long time.

At some point, I think it's kind of pointless, right?

Because people will not be actually able to use that in production

anyway.

Michael: Yeah.

I mean, there are definitely...

It feels like there are cases, and there are people trying to get the best of both worlds, right? Trying to get a faster-than-we-currently-have solution that is also more robust.

Like, we had Peter Geoghegan on to talk about the work he's been doing in the direction of skip scan, but that work seems to make some decisions post-planning, during execution, as to whether it's going to skip, or continue trying to skip, or just run through the index sequentially, depending on certain statistics it gathers along the way.

That strikes me as work in this
direction that in the real world

doesn't seem to give up too much
on the optimal path either.

Tomas: So I think you are talking
about the multi-dimensional

access method that Peter was talking
about.

And I think it's actually a win-win
situation in this case, right,

because he actually can make better decisions based on the a priori information that he has at the planning stage, right?

So he actually is not sacrificing
anything.

He's improving the behavior in many cases.

So I think it's amazing what he's
doing with indexes.

And I've been grateful to actually
collaborate with him on the

index prefetching patch.

And his comments were very useful.

And I think it's actually Peter Geoghegan who pointed out a quote from Goetz Graefe, who is one of the authors of the papers, who said that choice is confusion.

By adding more different ways to execute a query to the optimizer, you are just giving it more chances to make a mistake, right?

So I think the main parts of the features that improve robustness need to be in the execution path, right?

Kind of like adjusting based on
the actual data, right?

Michael: Yeah, so we've got, probably more than this, but 3 main times we're having to make, or the planner's having to make, decisions that are really impactful for plan flips.

We've got scan type, so we've got
kind of sequential scan, index

scan, bitmap scan, index only scan,
perhaps some others if you've

got extensions installed.

And then we've got join algorithm.

And then we've also got join order.

Based on your talk, I've read a bit about the first 2, but nothing about the last one in terms of execution-time optimizations.

Is there anything around that that
you'd be looking into?

Tomas: So for the first 2 parts that you mentioned, there are actually, you know, papers proposing more robust, on average slower, but more robust execution paths.

I think that would be interesting
to have some of those as extensions

in Postgres.

Because we do have the custom scan
interface and we can actually

inject a different path.

So that would be fine.

For the join order search, I'm
not sure.

I'm aware of some issues in that.

But I think for the typical queries,
the algorithm actually works

quite fine.

But there are some examples of
queries where that actually blows

up.

And I don't know if that's what you've been pointing at, but a star join, for example, is a great example, because we end up exploring n factorial different plans.

And we can actually improve that.

I actually posted like a proof
of concept patch for improving

this case.

But there's a lot of work that
needs to be done on that.

But that could actually help tremendously
because I think in

another talk, I compared like performance
for different kinds

of workloads since like Postgres
8, I think.

And on a star join, I mean an OLTP star join, which means you select one row, or a small number of rows, from the fact table, and then have like 20 dimensions, you know, joining to it.

So for that, we didn't actually
improve throughput at all since

Postgres 8.

Right?

Michael: Wow.

Tomas: Which is crazy, compared to the order of magnitude, or 2 orders of magnitude, better throughput for the other OLTP workloads.

I think that's a clear bottleneck
there.

And it's exactly because the join
order search is like not great.

We do have something we call genetic query optimization, but I think almost no one actually uses that, because the default configuration parameters actually force us to do a different thing.

So we don't actually use the genetic
optimizer at all.

Nikolay: Well, when we have 20
tables joined, it's used, right?

Because there's a threshold.

Tomas: No, no, no, no, no, no,
no.

Because first we split that at
8 into smaller join searches,

right?

And we do the regular algorithm
for that.

Nikolay: I see.

Tomas: So you would actually need
to change the defaults, and

only then would the genetic algorithm
be used.
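For reference, a sketch of the defaults Tomas is describing (the values shown are the stock settings): with explicit JOIN syntax, join_collapse_limit caps each join search below geqo_threshold, so the genetic optimizer never triggers unless you raise the collapse limits.

```sql
SHOW join_collapse_limit;  -- 8: explicit JOINs are planned in chunks of 8
SHOW from_collapse_limit;  -- 8
SHOW geqo_threshold;       -- 12: GEQO only runs at 12 or more FROM items

-- To actually exercise the genetic optimizer on a 20-table join:
SET join_collapse_limit = 20;
SET from_collapse_limit = 20;
```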

Nikolay: I didn't know this.

Okay.

Tomas: Yeah.

Good.

I didn't know that either until I started doing the benchmarks to compare the different versions.

I was wondering, why is the genetic optimizer not actually visible anywhere, right? And it's because of this, so people don't use that.

Michael: When you enabled it, did it improve on later versions

versus version 8?

Tomas: I think it did.

Okay.

But I don't remember the details.

I haven't like included that in the charts at all, for example.

But that's a good question.

I don't know.

Michael: Thinking about like next steps for Postgres.

What would you be most excited to see people working on or what

are you looking to work on next that you're excited about?

Tomas: So I definitely would like to see someone working on the

robustness patches for either the SmoothScan, I do have a proof

of concept patch for that, or the G-Join, I think they call it.

That's a robust algorithm for joins.

I think that would be extremely useful and I'm 99% sure it could

be done as an extension.

So that makes it much easier to hack on without breaking everything

else.

And it also makes it immediately available for users, which I

think is, that's very useful.

And then I think there are actually some smaller patches where people could improve individual performance cliffs, right?

If I go back to the example that I used in the talk, I can imagine

people experimenting with choosing the flip, the number at which

we flip from a linear search to a hash search, right?

So I can imagine people doing some runtime measurements and like

doing that.

I think another problem for robustness is just-in-time compilation for us, right?

Which adds a completely different level of robustness issues, like unpredictable behavior.

Nikolay: We just switched...

Tomas: ...it off.

Nikolay: We just switched it off.

Michael: Yeah, yeah.

But that's great.

Tomas: Yes, I think that's the default recommendation, but it's

also quite sad, right?

Because the just-in-time compilation can be really helpful for

some queries.

Michael: But it's a robustness decision, right?

Yes.

Like, Nikolay often, and I've seen other people as well, chooses potentially worse average performance on certain queries in exchange for not getting those outliers.

So it's exactly about robustness.
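As a sketch of the two mitigations being discussed (the thresholds are illustrative, not recommendations):

```sql
-- Option 1: disable JIT entirely, trading peak speed for predictability.
ALTER SYSTEM SET jit = off;

-- Option 2: keep JIT but raise its cost thresholds so it only engages
-- for genuinely expensive queries (the default jit_above_cost is 100000).
ALTER SYSTEM SET jit_above_cost = 1000000;
ALTER SYSTEM SET jit_optimize_above_cost = 5000000;
SELECT pg_reload_conf();
```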

Tomas: Yes, I think that's a great example showcasing the decisions

people make, right?

And they almost always choose better
robustness if that's a production

workload, right?

I think that's fine, right?

And I think we could have like
a...

I don't know if you had a session
about just-in-time compilation,

but I think there's a whole set
of things people could do to

improve that.

It could be better planning or
it could be using a different

library for JIT.

Because a lot of the problems that we have are actually due to LLVM not being a well-suited library for the kind of JIT that we do, for this use case, right?

Michael: Yeah, interesting.

We haven't done an episode on it.

I think you're right, it would be a good one to do.

Who would you recommend we talk
to about that?

Is it Andres or who is it?

Tomas: I think Andres is the one person who knows the most about that, because I think he's the one implementing that.

So you can also complain to him
and he's the right person.

But I think there are actually other people who submitted some patches for improvements.

Nikolay: Cool.

Tomas: I think it was like 2 years ago, there was a blog post about someone actually implementing a custom JIT, and that might be another good guest, I think.

Michael: I completely forgot about that, that's a good idea.

And the other thing is, I think it's based solely around a limit, a cost limit.

That feels very crude to me.

Tomas: Yes.

As I say, like the planning is
very, very simplistic.

It makes like essentially the decision
for the whole query at

once, right?

Michael: Yeah.

Tomas: I mean, the fact that it's based on a cost limit, that's roughly fine.

But crossing the cost limit for the whole query can also enable just-in-time compilation in parts of the query where that's not really useful, right?

So that's definitely one of the things that we could improve.

I mean, it's not a perfect solution for all the cases, but it would be, I think, a better approach.

But I'm sure there are reasons why it was done like this, right? So I don't want to make definitive statements about solutions.

Michael: Yeah, it would be an interesting discussion.

And was there anything we should have asked about that we didn't?

Tomas: No, I think we discussed a lot of the different parts that I mentioned in the talk, so I think that's fine.

Michael: It's a really excellent talk, and we definitely didn't discuss it all, and it comes across really well with examples.

So I will include the link to the talk, and to the slides as well.

And I'll include a link to the San Fran talk.

So if anybody's got questions themselves, I guess you're attending

live so they can ask them at the meetup as well.

Tomas: Yeah, or they can just like send me an email or something.

I also do something I call office hours, where people can just

let me know what they would like to talk about.

And it could be a patch they are considering, or whatever.

And we can definitely schedule a call or something.

That's also what I do for the community.

Yes.

Michael: Yeah, that's awesome.

I did actually want to say thank you for everything you're doing,

because you're also doing some mentoring and some patch ideas

for new potential committers.

And I think that that kind of work is very, very valued and very,

very valuable.

So thank you.

Awesome.

Well, thanks so much for joining us, Tomas.

And yeah, I hope you have a great week.

Tomas: Bye.

Creators and Guests

Tomas Vondra (Guest)
Principal Software Engineer at Microsoft, PostgreSQL Major Contributor, and PostgreSQL Committer
