
Performance cliffs
Michael: Hello and welcome to Postgres.FM, a weekly show about
all things PostgreSQL.
I am Michael, founder of pgMustard, and as usual, I'm joined
by my co-host, Nikolay, founder of Postgres.AI.
Hey, Nikolay.
Nikolay: Hi, Michael.
Michael: Hello.
Today, we are excited to have a special guest, Tomas Vondra,
who is a Postgres major contributor and committer now working
at Microsoft on their Postgres team.
Welcome to the show, Tomas.
Tomas: Hello, everyone.
Michael: Right, so a few months ago, you gave an excellent talk
about where performance cliffs come from, and we have been excited
to talk to you about this.
So perhaps as a starting point could you let us know a little
bit like how you got interested in this as a topic or why you
chose to talk about it there.
Tomas: So I think I started thinking about performance cliffs when I actually started working on Postgres, or started using Postgres.
Because performance cliffs are something people often run into and have to investigate: sudden changes in behavior after small changes in the query.
A single additional row in a query could kind of flip the behavior in a way that is completely unexpected and unpredictable, and it's causing problems for robustness and so on.
And I actually started with Postgres as a user or a developer
using Postgres.
And 1 of the things I was responsible for was investigating performance
problems.
So that's how I actually came to work on performance cliffs.
But after I started working on development of Postgres, you also see the other side of performance cliffs, which is why it actually happens in practice, because the database needs to make some decisions.
So it's also an engineering problem, which is interesting for me as a developer and committer.
And I've been working on the planner some of the time, and that's where some of the performance cliffs actually happen, right?
Because of decisions made either during planning or then during the execution.
So I've been reading some papers, and that's actually where I learned that there even is a name for this, which is the performance cliff.
And there is a whole area of research for robustness, like making
robust decisions where the performance cliffs do not happen at
all or not as often.
So that's why I'm interested in this.
Michael: Nice, and yeah, I really like the term performance cliff.
It's so intuitive, especially with the example you were talking about of adding 1 additional row.
So if we go back a few steps, if
we're adding 1 row incrementally
up until the cliff, we might get
a certain gradient, might even
be relatively horizontal and then
all of a sudden there's a massive
spike and that's like the cliff.
I'm probably thinking of it as a spike, but maybe a cliff would be more of a drop.
So yeah I really like that as a
phrase.
You mentioned kind of planner versus
execution time cliffs and
I guess we're talking about individual
execution more than kind
of system level cliffs, mostly
here.
From your experience as a user, or maybe since you've been looking at trying to prioritize which things to work on, where do you see the majority of cliffs coming from?
Like plan time, execution time, some mix?
Tomas: So that's a good question.
I don't know because I'm inherently
biased, right?
The fact that a lot of performance cliffs, for example, happen during planning means that you can't actually see a lot of cliffs that happen later, right?
So I don't have very good data to explain that, or to say that most of the cliffs happen at some particular point.
But I would say that the sooner the decision needs to be made in query planning and query execution, the less information you actually have to make a good, robust decision.
And the more likely it is that it will be wrong, right?
Because for example, we do planning
based on statistics, and
those are inherently inaccurate,
right?
There will be a lot of details
that we remove from the statistics
to keep it small, and therefore,
those are the things that can
actually cause problems.
And then there is, of course, the
thing that even the cost model
itself is like extremely simplified,
right?
So I would say that the main problem is that we are making a lot of decisions early in the query execution.
And because of that, we are kind of fixed in a certain execution path, and that may not be the optimal 1, right?
So that's where I think the performance cliffs kind of come from.
Nikolay: I wanted to ask, there should be a challenge not only about planner versus executor, but also about their behavior, like in a single session when there is no multi-user situation, versus various contention situations when we have many, many sessions working in parallel and all of them have some issue.
For example, the infamous LockManager lightweight lock contention happens when, during planning, the planner needs to lock all the indexes for the table and we exceed 16.
Fast-path locking becomes not available for some relations.
And we cannot see it when we're just checking a single query, but it becomes a performance cliff immediately when we have thousands of transactions per second doing the same thing and planning becomes a problem, right?
I'm curious, how do you think this
should be understood, like
with what tools and earlier, not
when you already fell from that
cliff, right?
But before, this is the trickiest,
like how to diagnose cliffs
which are going to happen soon
with our system.
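A minimal sketch of how one might look for this kind of pressure before falling off the cliff, using the standard system views; thresholds and interpretation are left to the reader:

    -- backends currently waiting, grouped by wait event; LockManager here is a warning sign
    SELECT wait_event_type, wait_event, count(*)
      FROM pg_stat_activity
     WHERE wait_event IS NOT NULL
     GROUP BY 1, 2
     ORDER BY 3 DESC;

    -- how many locks each backend is holding outside the fast path
    SELECT pid, count(*) FILTER (WHERE NOT fastpath) AS non_fastpath_locks
      FROM pg_locks
     GROUP BY pid
     ORDER BY 2 DESC;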
Tomas: So, I understand the example
that you've been describing,
but I think that's a slightly different
performance cliff and
not what I usually mean by that
term, because you are kind of
like talking about like resource
exhaustion, right?
You have a resource, which is like
a lock queue or like the buffer
for the fast-path locks, which was
increased or like made larger
in Postgres 18.
Nikolay: So we shouldn't have...
Tomas: ...the problem.
Nikolay: And it's configurable
now, right?
I think it's maybe configurable.
Yes.
Tomas: It's tied to the expected number of locks per transaction, so it is configurable in this way, yes.
So it's not fixed size like before.
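As a rough sketch, assuming the Postgres 18 behavior where the fast-path slot count is derived from max_locks_per_transaction:

    -- Postgres 18: fast-path lock slots scale with this setting
    ALTER SYSTEM SET max_locks_per_transaction = 256;
    -- this particular setting only takes effect after a server restart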
But I think what I mean by performance cliffs, and what I describe in my talk, is mostly performance cliffs that happen because of a binary decision somewhere in the processing, right?
And the locking bottleneck is not
that, right?
I mean, like, yes, you do have
the case where you can either
fit the fast path lock or not,
right?
And if not, you have to do something
much more expensive.
That's a problem, but it's not the thing that I had in mind thinking about performance cliffs.
Nikolay: Yeah, now I remember when
I was checking your talk,
this is exactly what I thought,
like we have different views
on this term.
And I grabbed this term from various
blog posts where people
talked about multixact, SLRU
buffer problem, like it was also
a cliff.
Sub-transactions too, people also calling this a performance cliff, like saying sub-transactions are cursed, don't use them at all, because there is a cliff that is hidden there and so on.
So this definition is very different
from yours.
I understand, yeah, yeah, I agree
with you.
Tomas: Yeah, I don't feel like
I'm the authority to define what
exactly that means.
I'm trying to more explain what
I've been describing, like the
cases that I've been discussing
in my talk.
Because also the reasons why that
happens are slightly different,
right?
For the performance cliffs, as I understand them, or as I discuss them, it's more like an engineering question, because you have multiple algorithms that you can choose from.
Each of those algorithms is kind of optimal, best suited for a certain type of conditions or selectivity ranges or something like that.
And therefore, you make the decision based on the assumption that you know the correct answer, right?
And I think the bottlenecks that you've been describing are more
about like say changes in the hardware available.
Because years ago when we initially sized the number of
fast-path locks, I think 16 was probably fine.
It was reasonable.
You probably didn't have that many tables or connections, and
it wasn't such a problem.
But nowadays we have many more CPUs, many more cores available
usually.
And we also have things like partitioning, which can completely explode the number of things that you need to think about.
So I think it's more like a consequence of changes in Postgres
that leads to hitting this bottleneck.
I'm not saying it's not a valid problem.
It definitely is.
We need to improve that.
But it's a slightly different thing.
Nikolay: Right, yeah, I agree, yeah.
Michael: The topic's already huge if we only limit it to single
session, single query performance cliffs.
So if we think about that, just those, you talked about both
reducing kind of the frequency of these, so making them less
likely to happen, but also reducing the severity of them.
So kind of accepting they will happen, make them less of a cliff or less of a drop when they do.
Can you talk us through some things from a user perspective that
we can do to reduce the severity and or to reduce the likelihood
of running into them?
Tomas: That's a good question.
I think it probably depends on each individual case of the performance
cliff.
And in some cases you probably can't do anything.
But it's just good to know that this is what's happening.
In the talk, which you can see on YouTube, and which I'm actually invited to give at the San Francisco meetup in a week, the first example that I gave is about the number of elements in the IN condition, right?
Where you have column IN and then an array of elements.
And there is a flip because we switch exactly from 1 algorithm to the other, like from linear search to hash.
But that's hard-coded, right?
There's nothing the user can do about that.
The one thing you could do is to
pick the right size of the element,
of the list, right?
But that's not a very practical
thing.
But I think in many cases just
knowing what's causing the problem
can be a very useful thing because
you can say like okay, this
is what's happening, I can't do
anything about that, I have quantified
the impact, and we just have to live with that for now.
I can imagine patches doing something,
like improving that by
adjusting the point where exactly
that flips, by doing some better
estimation.
But at this point, people can't
do anything about that.
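A minimal way to see the flip Tomas describes, with a throwaway table; the exact element count at which the executor switches to a hashed lookup is an internal constant, so the list sizes here are only illustrative:

    CREATE TABLE t AS SELECT generate_series(1, 1000000) AS x;
    -- short list: evaluated as a linear series of comparisons
    EXPLAIN ANALYZE SELECT count(*) FROM t WHERE x IN (1, 2, 3);
    -- longer constant list: the executor may switch to a hashed lookup,
    -- visible as a step change in execution time as the list grows
    EXPLAIN ANALYZE SELECT count(*) FROM t WHERE x IN
      (1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20);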
For some of the cases in planning,
you can do the usual thing,
like trying to convince the planner
to do the right decision,
right?
And there are different ways to
do that.
I don't know what's the best solution
to that.
Michael: Things I really liked were, I think proactively people can adjust some of the cost parameters, like we still have random_page_cost at 4 as the default even on most cloud providers, so things like that, when we've mostly got SSDs these days, just feel like they're going to be more likely to push plan flips at the wrong times.
Some of these, if there are other
cost signals that are like
massively out as well, I'm not
aware of any, but that feels like
a foot gun or something that if
people change proactively or
reduce proactively, they'd be less
likely to encounter this kind
of thing.
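As a sketch of that kind of proactive change, assuming SSD-backed storage:

    -- bring random_page_cost closer to reality for SSDs
    ALTER SYSTEM SET random_page_cost = 1.1;
    SELECT pg_reload_conf();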
I'm also aware you did a lot of
the work on extended statistics
and that feels like another tool
that, used well, could start
to teach the planner about certain
relationships and make cardinality
errors far less likely.
And again, maybe then less likely to have plan flips at the worst times.
When we do have the plan flip,
it's more likely to be closer
to the optimal point.
So we're less likely to have a
severe cliff at least.
What do you think?
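A sketch of the extended statistics idea, with made-up table and column names:

    -- teach the planner about the dependency between two correlated columns
    CREATE STATISTICS orders_country_city (dependencies)
      ON country, city FROM orders;
    ANALYZE orders;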
Tomas: I think you are absolutely
right that like tuning some
of the parameters for like random
page costs, for example, and
something like that is exactly
one of the things you can do to
convince the planner to make the
correct decisions more often,
right?
Because you are essentially moving
the flip point between the
costs closer to the actual flip
point, right, between the execution
costs, like measured in time, for
example.
So that's definitely true.
I think most of the time we actually make the correct decision for regular queries, because the selectivity where we would have a problem is kind of in the middle somewhere, right?
And usually queries have very low or very high selectivity.
So in between is not very common, though it can happen.
But I don't really have a perfect
solution to that.
Extended statistics definitely
can help, but it's still the very
early decision.
It's going to help with improving
that in some cases.
And I know you had a whole podcast
about that.
The problem with the extended statistics
is that it still is
like an extremely simplified representation,
approximation of
the data.
That's the whole point of having
statistics, to have a very compact
thing, very compact data set, which
you can use to do the planning
quickly.
Which means that it definitely
is missing some of the details.
And that can be fatal, right?
So I think what we need to focus
on in the future is kind of
like making the execution part
a bit more robust, right?
Like moving some of the decisions
to that instead of expecting
the decisions in planning to be
perfectly correct all the time,
to kind of like allow the execution
to actually recover from
mistakes.
Nikolay: Change its mind on the fly, right?
Change the execution path.
Tomas: Kind of, right?
Or like, yes, there are different
approaches to that.
Some are kind of like difficult
to do correctly, like replanning,
for example, right?
Or you could have like a typical
problem for a planner, it's
like a sequence of nested loops,
right?
When you have underestimated the
join size or like a condition
selectivity, then you pick a nested
loop and then it explodes
and takes forever.
So there are approaches which kind
of blend 2 different paths
through the join, and either do
like a hash join or like a nested
loop, right?
But there are like practical issues,
what you can do limiting
the behavior of the optimizer,
right?
Because if you want to do this,
then for example, you cannot
emit any tuples from that join
until you know that that's the
correct approach, right?
Because then you wouldn't be actually
able to flip to the other,
and so on.
Nikolay: And I see another challenge, it's like user experience.
We're all scared about the probability of plan flips sometimes, right?
And it's dangerous, especially when you do an upgrade, a major upgrade, plan flips are not uncommon.
And I know, for example, AWS Aurora, RDS Aurora, has an extension, which is not open source, to freeze plans so no flips are possible.
What you described, like if I see some plan in EXPLAIN, but then during execution the plan can be changed, it makes my head think, how can I control it?
I'm not controlling, I just understand what's happening, right?
Because EXPLAIN will lie in this case, right?
Or maybe you see something that could help, to see 2 plans in EXPLAIN, or 3 plans, and we know that?
Tomas: But we already can show
this kind of stuff in the plan,
right?
Where when you have a sub-plan,
we can either have a regular
plan or a hashed sub-plan, I think.
So we can show alternative plans.
I don't think that would be a problem.
Obviously, that wouldn't tell you
what exactly happened during
execution.
It would need some additional improvements.
But yes, it adds ambiguity to the EXPLAIN plan, right?
Because it would tell you, these are the possibilities, right?
You don't know what's going to happen, and it could actually differ between executions of the same query even, right?
But that's the price for higher
robustness, right?
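For reference, the existing case Tomas mentions is easy to see; with hypothetical table names, something like this typically shows a Filter line reading (NOT (hashed SubPlan 1)) when the planner expects the subquery result to fit in memory, and a plain SubPlan otherwise:

    EXPLAIN SELECT *
      FROM orders o
     WHERE o.customer_id NOT IN (SELECT c.id FROM blocked_customers c);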
Nikolay: Right.
I'm just thinking observability tools, many people want them to be improved, like to track... pg_stat_statements is not enough to have plans, 1 normalized query can have multiple plans.
In the past, there were attempts to have pg_stat_plans or something.
There are also extensions which allow you to see the plan for an ongoing query, right?
I think if this direction is going to become the norm, to alter the plan on the fly, then observability tooling needs improvements for sure, because we need to understand what's happening, and auto_explain as well, right?
Probably it should report the initial plan and then the additional plan, or something like this, the alterations.
So I just feel the bigger need to improve the observability part of Postgres as well.
Tomas: Yes.
I have nothing against observability.
Nikolay: Not a question, just feeling.
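As a sketch of what auto_explain already offers today (loaded per session here; in production it would normally go into shared_preload_libraries):

    LOAD 'auto_explain';
    SET auto_explain.log_min_duration = '1s';  -- log plans of statements slower than 1 second
    SET auto_explain.log_analyze = on;         -- include actual row counts and timings
    SET auto_explain.log_format = 'json';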
Tomas: And I think I haven't tried
like pg_stat_plans for a while,
but I think it still exists, it's
still available, right?
I think it's very useful.
I think there's also a patch to
actually add ability to watch
a plan as it's being executed,
right?
Which can actually help, that would
be a tremendous benefit when
investigating queries that never
complete, where you can't actually
get like explain analyze and so
on, or just watch the execution
of a query in another backend.
So there is a patch for that.
It's definitely not going to be in Postgres 18, but I think Rafael
actually is working on that.
I hope I didn't get the name wrong.
Yeah, but again, I don't think that's necessarily about the performance
only, right?
That's about monitoring in general.
Michael: This concept of robustness is really interesting.
I don't think I'd come across it as defined a term until reading
some of the papers that you link to at the end of your talk and
I found it fascinating.
I think it aligns well with my experience of what people really
care about in performance.
I think that, correct me if I'm wrong, but I got the impression it's really about trying to avoid really bad worst-case situations at the cost of potentially slightly worse average execution time, for example, if we want to optimize based on time.
So this idea that on average, we might perform slightly worse, but our worst case will be nowhere near as bad, so the P99 would be way better or way less severe, but the P50 might be a bit higher.
Is that about right?
Tomas: Yeah, I think that's exactly right.
You're exchanging, you know, worse average performance for having to do fewer investigations of performance issues.
And I think that's a perfectly reasonable trade-off.
I mean, I do like working on micro-optimizations and micro-benchmarks
and everything.
That's great.
But in the end, someone needs to be using the database, right?
That's the point.
And, I don't know, while people had just 1 database and they cared about that only, that was fine.
As soon as you have like a larger fleet of database servers that
are actually used by people or by applications or something,
you definitely will value robustness, like not having issues,
even if the peak performance is maybe 25% lower or something
like that.
Because it's much easier to spend a bit more on the hardware
than having to run a large team investigating and fixing issues
and firefighting all the time.
That just doesn't work at scale.
I think it's also what some of the Cloud hosting providers realize.
It's a bit against my nature because I definitely do want to
do the best performance possible.
I've been benchmarking a lot and doing performance-related patches
for a long time.
At some point, I think it's kind of pointless, right?
Because people will not be actually able to use that in production
anyway.
Michael: Yeah.
I mean, there are definitely...
It feels like there are cases, and there are people trying to get the best of both worlds, right, trying to get a faster-than-we-currently-have solution that is also more robust.
Like we had Peter Geoghegan on to talk about the work he's been doing in the direction of skip scan, but that work seems to make some decisions post-planning, during execution, as to whether it's going to skip, or continue trying to skip, or just run through the index sequentially, depending on certain statistics it gathers along the way.
That strikes me as work in this
direction that in the real world
doesn't seem to give up too much
on the optimal path either.
Tomas: So I think you are talking
about the multi-dimensional
access method that Peter was talking
about.
And I think it's actually a win-win
situation in this case, right,
because he actually can do better
decisions based on the a priori
information that he has at the
planning stage, right?
So he actually is not sacrificing
anything.
He actually, he's improving the
behavior in many cases.
So I think it's amazing what he's
doing with indexes.
And I've been grateful to actually
collaborate with him on the
index prefetching patch.
So, and his comments were very
useful.
And I think it's actually Peter Geoghegan who pointed out a quote from Goetz Graefe, who is one of the authors of the papers, who said that choice is confusion.
By adding more different ways to
execute a query to the optimizer,
then you are just giving it more
chance to make a mistake, right?
So I think the robustness, like
the main parts of the features
that improve robustness needs to
be at the execution path, right?
Kind of like adjusting based on
the actual data, right?
Michael: Yeah, so we've got probably more than this, but 3 main times where we're having to make, or the planner's having to make, decisions that are really impactful for plan flips.
We've got scan type, so we've got
kind of sequential scan, index
scan, bitmap scan, index only scan,
perhaps some others if you've
got extensions installed.
And then we've got join algorithm.
And then we've also got join order.
Based on your talk, I've read a bit about the first 2, but nothing about the last 1 in terms of execution-time optimizations.
Is there anything around that that
you'd be looking into?
Tomas: So the first 2 parts that
you mentioned, that's like,
there are actually, you know, papers
proposing like more robust,
on average slower, but more robust
execution paths.
I think that would be interesting
to have some of those as extensions
in Postgres.
Because we do have the custom scan
interface and we can actually
inject a different path.
So that would be fine.
For the join order search, I'm
not sure.
I'm aware of some issues in that.
But I think for the typical queries,
the algorithm actually works
quite fine.
But there are some examples of
queries where that actually blows
up.
And I don't know if that's what you've been pointing at, but a star join, for example, is a great example, because we end up exploring n factorial different plans.
And we can actually improve that.
I actually posted like a proof
of concept patch for improving
this case.
But there's a lot of work that
needs to be done on that.
But that could actually help tremendously
because I think in
another talk, I compared like performance
for different kinds
of workloads since like Postgres
8, I think.
And on a star join, I mean an OLTP star join, which means you select 1 row, or a small number of rows, from the fact table, and then have like 20 dimensions, you know, joining to it.
So for that, we didn't actually
improve throughput at all since
Postgres 8.
Right?
Michael: Wow.
Tomas: Which is like crazy, like
compared to the order or 2 orders
of magnitude better throughput
for the other OLTP workloads.
I think that's a clear bottleneck
there.
And it's exactly because the join
order search is like not great.
We do have something we call genetic
optimization, but I think
almost no one actually uses that
because the default configuration
parameters actually force us to
do a different thing.
So we don't actually use the genetic
optimizer at all.
Nikolay: Well, when we have 20
tables joined, it's used, right?
Because there's a threshold.
Tomas: No, no, no, no, no, no,
no.
Because first we split that at
8 into smaller join searches,
right?
And we do the regular algorithm
for that.
Nikolay: I see.
Tomas: So you would actually need
to change the defaults, and
only then would the genetic algorithm
be used.
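The defaults Tomas is referring to look roughly like this; only by raising the collapse limits past geqo_threshold would a 20-table join ever reach the genetic optimizer:

    SHOW geqo_threshold;       -- 12 by default
    SHOW join_collapse_limit;  -- 8 by default
    SHOW from_collapse_limit;  -- 8 by default

    -- something like this is needed before GEQO sees the whole join problem
    SET join_collapse_limit = 20;
    SET from_collapse_limit = 20;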
Nikolay: I didn't know this.
Okay.
Tomas: Yeah.
Good.
I didn't know that either until I started doing the benchmarks to compare the different versions.
I was wondering, why is the genetic optimizer not actually visible anywhere, right? And it's because of this, so people don't use that.
Michael: When you enabled it, did it improve on later versions
versus version 8?
Tomas: I think it did.
Okay.
But I don't remember the details.
I haven't like included that in the charts at all, for example.
But that's a good question.
I don't know.
Michael: Thinking about like next steps for Postgres.
What would you be most excited to see people working on or what
are you looking to work on next that you're excited about?
Tomas: So I definitely would like to see someone working on the
robustness patches for either the SmoothScan, I do have a proof
of concept patch for that, or the G-Join, I think they call it.
That's a robust algorithm for joins.
I think that would be extremely useful and I'm 99% sure it could
be done as an extension.
So that makes it much easier to hack on without breaking everything
else.
And it also makes it immediately available for users, which I
think is, that's very useful.
And then I think there are actually some smaller patches where people could improve individual performance cliffs, right?
If I go back to the example that I used in the talk, I can imagine
people experimenting with choosing the flip, the number at which
we flip from a linear search to a hash search, right?
So I can imagine people doing some runtime measurements and like
doing that.
I think another problem for robustness is just-in-time compilation, right?
Which adds a completely different level of robustness issues, like unpredictable behavior.
Nikolay: We just switched
Tomas: it off.
Nikolay: We just switched it off.
Michael: Yeah, yeah.
But that's great.
Tomas: Yes, I think that's the default recommendation, but it's
also quite sad, right?
Because the just-in-time compilation can be really helpful for
some queries.
Michael: But it's a robustness decision, right?
Yes.
Like Nikolay often does, and I've seen other people as well, choosing potentially worse average performance on certain queries in exchange for not getting those outliers.
So it's exactly about robustness.
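The robustness-flavored options range from disabling JIT outright to only raising its cost thresholds, roughly:

    -- disable JIT for the session (or cluster-wide via ALTER SYSTEM)
    SET jit = off;

    -- or keep it, but let it kick in only for genuinely expensive plans
    SET jit_above_cost = 1000000;
    SET jit_inline_above_cost = 2000000;
    SET jit_optimize_above_cost = 2000000;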
Tomas: Yes, I think that's a great example showcasing the decisions
people make, right?
And they almost always choose better
robustness if that's a production
workload, right?
I think that's fine, right?
And I think we could have like
a...
I don't know if you had a session
about just-in-time compilation,
but I think there's a whole set
of things people could do to
improve that.
It could be better planning or
it could be using a different
library for JIT.
Because a lot of the problems that we have are actually due to LLVM not being a well-suited library for the kind of JIT that we do, for our use case, right?
Michael: Yeah, interesting.
We haven't done an episode on it.
I think you're right, it would
be a good 1 to do.
Who would you recommend we talk
to about that?
Is it Andres or who is it?
Tomas: I think Andres is like the
1 person who knows about that
most because I think he's the 1
implementing that.
So you can also complain to him
and he's the right person.
But I think there are actually other people who submitted some patches for improvements to that.
Nikolay: Cool.
Tomas: I think it was like 2 years ago there was a blog post about someone actually implementing a custom JIT, and that might be another good guest, I think.
Michael: I completely forgot about that, that's a good idea.
And the other thing is, I think it's based solely around a limit, a cost limit.
That feels very crude to me.
Tomas: Yes.
As I say, like the planning is
very, very simplistic.
It makes like essentially the decision
for the whole query at
once, right?
Michael: Yeah.
Tomas: I mean, the fact that it's based on a cost limit, that's roughly fine, but crossing the cost limit for the whole query can also enable just-in-time compilation in parts of the query where that's not really useful, right?
So that's definitely 1 of the things
that we could improve.
I mean, it's not a perfect solution for all the cases, but it would be, I think, a better approach.
But I'm sure there are reasons why it was done like this, right, so I don't want to make definitive statements about solutions.
Michael: Yeah, it would be an interesting discussion.
And was there anything we should have asked about that we didn't?
Tomas: No, I think we discussed a lot of the different parts that I mentioned in the talk, so I think that's fine.
Michael: It's a really excellent talk, and we definitely didn't discuss it all, and it comes across really well with examples.
So I will include the link to the talk and the slides as well.
And I'll include a link to the San Fran talk.
So if anybody's got questions themselves, I guess you're attending
live so they can ask them at the meetup as well.
Tomas: Yeah, or they can just like send me an email or something.
I also do something I call office hours, where people can just
let me know what they would like to talk about.
And it could be like a patch they are considering or whatever.
And we can definitely schedule a call or something.
That's also what I do for the community.
Yes.
Michael: Yeah, that's awesome.
I did actually want to say thank you for everything you're doing,
because you're also doing some mentoring and some patch ideas
for new potential committers.
And I think that that kind of work is very, very valued and very,
very valuable.
So thank you.
Awesome.
Well, thanks so much for joining us, Tomas.
And yeah, I hope you have a great week.
Tomas: Bye.