Overhead of pg_stat_statements and pg_stat_kcache
Michael: Hello and welcome to PostgresFM, a weekly show about
all things PostgreSQL.
I am Michael, founder of pgMustard,
and this is my co-host,
Nikolay, founder of Postgres.ai.
Welcome back, Nikolay.
How are you getting on and what
would you like to talk about
today?
Nikolay: Hi, Michael.
First of all, thank you for keeping
the, how is it called, the
ball rolling, or how to say, long
time no see or long time no
hear.
Yeah.
Yeah.
Thank you for very interesting
interviews.
I liked listening to them while
I was slightly off, but maybe
it's time to return to our regular
format or have a mix, I don't
know what we will decide next,
but today we have a regular format,
right?
Michael: Yeah.
Nikolay: Yeah, and the topic I brought today is the overhead of observability tools, the most popular ones.
Well, one is the most popular one, pg_stat_statements.
And usually people, like I remember
exactly when we didn't have
pg_stat_statements, we had only
log-based analysis.
We didn't have wait events, didn't
have pg_stat_statements like
15 years ago, right?
And we only had log-based analysis
and there was a tool written
in PHP called pgFouine, and then
another tool replaced it, written
in Perl called pgBadger.
And people always figured out first that, oh, usually we see only the tip of the iceberg: we see only some queries, the slowest ones.
But you cannot properly perform a full, holistic analysis if you don't see the fast queries, because fast queries might be consuming even more resources than slow queries.
So what experienced DBAs usually did was say, okay, I'm going to switch on logging of all queries for a few minutes to collect everything and then analyze.
In very many cases it worked well.
Yes, there is an observer effect, because we might put our Postgres down completely if we log all queries, but not always.
Sometimes it depends.
So yeah, this is quite understandable.
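For listeners who haven't done this trick: it's just temporarily dropping the logging threshold to zero and reverting it afterwards. A minimal sketch (assuming superuser-level access; these are standard PostgreSQL parameters, not anything specific from the episode):

```sql
-- Temporarily log every statement (careful: this is exactly the observer
-- effect Nikolay mentions, and it can be heavy on busy systems).
ALTER SYSTEM SET log_min_duration_statement = 0;
SELECT pg_reload_conf();

-- ...collect logs for a few minutes, feed them to pgBadger, then revert:
ALTER SYSTEM RESET log_min_duration_statement;
SELECT pg_reload_conf();
```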
But then pg_stat_statements was created, and some DBAs, I remember, were saying, oh, you know, we still need to study its overhead.
And then there were some benchmarks, but I don't remember any benchmarks I would say I trust.
But I remember many experienced
folks started saying, oh, you
know, yes, there is overhead, but
it's below 10%.
And since this thing is measuring
everything, it's observing
the whole workload, it's worth keeping
it enabled.
Let's have it enabled.
I remember also there was some
number, like, people, many people
saying, you know, pg_stat_statements
overhead is 7%.
It's kind of strange, but it was 7%.
7%?
7%.
I don't remember any benchmarks
that proved it, but I remember
it was like the consensus was below
10%, so we are all good.
If we try to find a good benchmark
from these early days of
pg_stat_statements, it would be
great.
I don't think it exists.
But honestly, we as a community
should perform good benchmarks
there.
I don't remember that.
It doesn't mean they don't exist, of
course, but...
Michael: I did a little bit of
searching beforehand because I
was interested in if there was
any up-to-date ones as well.
I couldn't find anything great,
but there were a few trustworthy
sources.
One was pganalyze, a commercial monitoring tool that uses pg_stat_statements.
In their frequently asked questions,
they gave an estimate of
approximately 1% of CPU time, which
I thought was interesting
that they're telling their customers.
Anyway, as we're going to discuss, this is all very dependent on workload, right? But at least that's interesting, and I trust them enough to say that's probably based on some tests on their side, even if I didn't see the actual benchmarking.
The other things I've seen are a question on the DBA Stack Exchange where somebody quotes various Postgres authors, and a benchmarking series.
One was a more realistic workload that measured about 0.5% overhead, and another was more of a stress test that measured about 10%.
So those are numbers, but that's
a wide range of different numbers.
And as we're going to discuss later,
you've done some work that
shows even that it can be very
different, which is cool.
Nikolay: I hope by the end of this
episode, people will start
having some ideas why this range
can be so wide.
But I can imagine like 0.5% or
10%.
In both cases, it can be so.
I mean, it depends on the workload, actually.
It very much depends on the workload.
And it's also not only about the workload; we will talk about that later, right?
So, yeah, okay.
But when I say overhead, of course
I mean CPU overhead, first
of all, because we don't think
about disk I/O or disk usage, disk
space usage here, network, nothing
like that, only pure CPU overhead.
Nikolay: By the way, when guys like RDS and others, probably inheriting this from Oracle, consider the wait event being null in pg_stat_activity as CPU, marking it green, "CPU", in Performance Insights.
Do they mean that all the other wait events are not CPU?
I don't understand this.
Michael: I don't know.
Nikolay: Because many of them, for example lightweight locks and spinlocks, are purely CPU.
Michael: I know we have several
listeners from AWS, so maybe
they can let us know.
Nikolay: We have listeners, yeah.
And on the other hand, if you say
CPU instead of null, null means
unknown.
And according to the PostgreSQL documentation and source code, it means no wait.
In reality, it means either really no wait, and maybe some kind of CPU work, but in many cases it also means a wait event which has not been created yet, so that code path is not covered.
For example, PostgreSQL 16: we recently had some benchmarks, and we saw a wait event which you see in PostgreSQL 16 but don't see in PostgreSQL 14, because it had not been created there yet.
So I'm thinking, okay, in Performance Insights on RDS, and I think in many other systems, maybe in Cloud SQL as well, or this ad hoc PgBouncer tool written in Java, they also like to use a green color and say this is CPU, but it's actually not CPU, it's null.
Because CPU doesn't exist in the list of wait events in pg_stat_activity.
So if you mark it as CPU for 14 because it was null, but then you start distinguishing it for 16, something is not right here.
So I'm asking, is it really CPU?
So interesting question, but it's
slightly off topic.
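To make the "null as CPU" debate concrete, a tool that samples pg_stat_activity is effectively doing something like the query below; treating the null bucket as CPU is the questionable part (a hedged sketch, not any vendor's actual query):

```sql
-- Active sessions grouped by wait event. NULL is what many tools color green
-- and label "CPU", although per the docs it only means "no wait event reported".
SELECT coalesce(wait_event_type || ': ' || wait_event,
                'no wait event (often shown as "CPU")') AS wait,
       count(*) AS sessions
FROM pg_stat_activity
WHERE state = 'active'
GROUP BY 1
ORDER BY sessions DESC;
```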
Back to pg_stat_statements.
Let's talk about benchmarks we
had.
So, the idea was we got some credits
from Google Cloud.
And it's great, this year we will
be performing, I mean, we,
I mean, Postgres.ai team, we will
be performing a lot of benchmarks.
I'm going to do a lot of post...
We already had a lot of benchmarks,
I think thousands of them
if you count each iteration, and
hundreds of them if you count
the whole benchmark consisting
of many iterations.
So we were curious how many TPS we can squeeze from PostgreSQL 16 on various big machines, Intel and AMD, and there was a good article from 2016 from Alexander Korotkov, when both the PostgreSQL and MySQL folks, I remember, teamed up.
It was interesting.
I don't remember if somebody from Percona was also working on MySQL, but the goal was to demonstrate that both systems can show a million TPS.
And during that work, some contention
issues were improved, fixed
in Postgres.
So, Postgres reached a million
TPS on some strong machines.
And I like that benchmark because
it's simple, it's pgBench,
select only.
So, it's only selects, very simple
ones.
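For reference, the whole workload in that mode is one statement from pgbench's built-in select-only script, with :aid drawn randomly per transaction:

```sql
-- This is what pgbench -S ("select only") executes, over and over:
SELECT abalance FROM pgbench_accounts WHERE aid = :aid;
```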
So we just repeated the same benchmark.
And as usual for such benchmarks,
it's actually stress testing,
because you are exploring this
edge.
It's not what you need to do for
your application, for example,
unless you're on purpose studying
it.
I hate this behavior being the default in pgbench.
It should not be so, because it provokes people to perform stress tests instead of regular load tests exploring a normal situation, for example 25 or 50% CPU load.
But in our case, we just want to
squeeze as much as we can.
So we did this, and I published
a blog post about our new AI
bot and it has details and links
to details how we did it.
So we took the newest Intel, fourth generation Xeon Scalable, Sapphire Rapids it's called, right?
And also fourth generation AMD EPYC: the C3 and C3D instances on GCP.
The biggest Intel one has 176 vCPUs, and the AMD one has 360 vCPUs.
And both have more than a terabyte of RAM.
So it's insane, but we use spots.
AWS has spots, GCP has spot instances, it's
cool.
You pay a couple of bucks for such
an experiment, honestly.
Like it's not...
I realized even if we didn't have
credits, I would probably pay
for myself, it's interesting to
explore these things.
Because we just provision temporary
machines for 1 or 2 hours
and spot means huge discount.
Yeah, so this kind of experiment is a classic experiment for stress testing.
You first start with 1 connection, then more and more connections, and you control...
We chose Alexander Korotkov's approach: in pgbench, both -c and -j are set to the same value.
So we start from 1, then we put 50; with such huge machines this step is reasonable, jumping to 50 right away.
Then 100, then 150, 200, and so on until, I think, 500, exceeding the number of vCPUs in both cases.
So, we use postgresql_cluster, which is an open-source project, a very good thing: a set of Ansible playbooks to provision Patroni clusters with a lot of extras.
It's maintained by Vitaliy, who works on my team.
It's a great tool, and it has a lot of things, including those we discussed together over the years, which he just added into postgresql_cluster.
So we used it, and because of that we had a lot of observability tools, including those for query analysis: pg_stat_statements, pg_stat_kcache, which is much less popular, and pg_wait_sampling for wait event analysis.
And then we saw a very strange
picture.
Up to 50 connections we grow in terms of TPS, kind of approaching a million TPS, but then we go down very quickly.
And on Intel we went down even more, but that's a different story.
And thanks to pg_wait_sampling, we saw that the number one wait event was related to pg_stat_kcache.
So we immediately realized this is an observer effect from pg_stat_kcache.
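For those who haven't used pg_wait_sampling: the top wait events come out of its profile view, roughly like this (a sketch; check the extension's docs for the exact view and column names in your version):

```sql
-- Aggregate sampled wait events across all backends; in this benchmark the
-- pg_stat_kcache-related wait dominated this list.
SELECT event_type, event, sum(count) AS samples
FROM pg_wait_sampling_profile
GROUP BY event_type, event
ORDER BY samples DESC
LIMIT 10;
```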
For those who don't know, pg_stat_kcache is an additional extension on top of pg_stat_statements which provides physical-level metrics: user CPU, system CPU, real disk I/O at the physical level, context switches.
So it's very useful for understanding real CPU usage, unlike the "wait event is null" approach, which I think is wrongly presented in RDS Performance Insights.
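A typical way to use it is joining pg_stat_kcache to pg_stat_statements by queryid, something like the sketch below (column names follow recent pg_stat_kcache releases; older ones use user_time/system_time, so verify against your installed version):

```sql
-- Top queries by real CPU (user + system) as reported by pg_stat_kcache.
SELECT s.queryid,
       left(s.query, 60)                     AS query_start,
       k.exec_user_time + k.exec_system_time AS cpu_seconds,
       k.exec_reads                          AS bytes_read_physically
FROM pg_stat_statements s
JOIN pg_stat_kcache     k USING (userid, dbid, queryid)
ORDER BY cpu_seconds DESC
LIMIT 10;
```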
This thing is trustworthy; we have used it for many years, and a few big companies use it as well.
It's not super popular: RDS doesn't have it, Cloud SQL doesn't have it, most managed Postgres providers don't have it.
I know Yandex Cloud has it.
Michael: I think they might be the only ones that have it.
Nikolay: Maybe. But also one of the biggest e-commerce companies we worked with has had it for many years, and they have very critical systems running on Postgres, so it's...
But in this case, the overhead was so huge, like, what's happening?
I immediately published it on Twitter, and Vitaliy created an issue in the pg_stat_kcache GitHub repository, and then the maintainers created a fix in 4 hours, I think.
So I think they expected something.
There was some additional lock which was removed in this fix, with the idea that this lock is something old and not really needed, but in our case it was slowing everything down when contention is high, when many, many sessions fight...
Not fight, but basically, what's happening with pgbench: you have by default only 4 queries, right?
If you say select-only, dash uppercase S (-S), it's only a single query, one select to the pgbench_accounts table, that's it.
In this case, hundreds of sessions try to increment metrics in a single record of pg_stat_kcache, right?
And due to that lock, which was removed in the latest version, it was not good.
I mean, the more sessions we have, the bigger the contention and the bigger the overhead is.
And we saw it: as we increased the number of clients, the number of connections and jobs (the -c and -j pgbench parameters), TPS went down.
So we have more, we have room,
we have more vCPUs, right?
But we cannot use them properly
because all of them try to execute
basically the same normalized query.
Parameters don't matter here.
Michael: Yeah, I pulled the chart
up from your tweet and just
to give people an idea, at 50 clients
it's about 550,000 TPS.
At 100 clients it does go up to
about 700,000 TPS, but at 150
we're down below the 50 client
rate and it's only 400,000.
Then it settles at about 200 clients
and above, at about 300,000
TPS, which is less than we had
at 50.
So it's a super interesting curve
and cool to hear that you got
so quickly to the root cause and
that the team was able to fix
it so quickly.
Nikolay: Yeah, in a few hours.
Well, let's split the story here
into 2 paths.
The first path is purely pg_stat_kcache, and the second path is without it, because when we saw this overhead, we continued our exploration of maximum TPS on modern Intel and modern AMD without pg_stat_kcache, of course.
Because we didn't expect it to
be fixed so quickly.
But when the fix was ready, we
asked the bot to repeat this benchmark.
It was interesting to convince
the bot that it's safe to download
fresh code from GitHub and compile
it.
It hallucinated, saying it's against
policy.
When I asked which policy, it said PostgreSQL policy.
So it was funny; I went checking, do we have such a policy?
I checked all our docs, because the bot does know our documents.
But it was just pure hallucination.
So then we convinced it, it verified, and indeed we saw that the fix resolves the problem, and there's no more such overhead.
So it was good, and the next week a new release of pg_stat_kcache was issued.
But here's an interesting question: as I mentioned, I trust those people, and not just trust, I worked with some of them, I touched production systems with my hands, and I didn't see such problems on production for years.
But here we saw obviously a very
big overhead.
Why so?
Why have we survived with this on production?
This is an interesting question.
Let's return to it after we
explore the second path.
Second path: okay, forget about pg_stat_kcache, we only have pg_stat_statements and pg_wait_sampling, and we know the overhead of both is not so big.
And then we just explore from 1
to 500 connections on both modern
Intel platform and modern AMD,
176 vCPUs on one and 360 vCPUs on
another.
And a huge surprise: again, the same workload, pgbench -S, so select-only, a single query, very simple, sub-millisecond latency, and Intel behaves not well at all.
It behaves similar to what we had with pg_stat_kcache enabled.
But higher, yes, higher: it reached 1 million TPS, I think, or maybe slightly below it, and then went down.
And this going down happened even before we reached the number of vCPUs, as we increased the number of connections and jobs in pgbench.
It's not normal.
Some kind of problem, obviously.
AMD was also not ideal, but it didn't go down like that.
It went down only slightly, right?
It demonstrated almost a plateau.
Very different behavior.
And we started studying what's
happening here, and obviously
the problem was pg_stat_statements
in this case.
And we saw it, I think, from wait
events by pg_wait_sampling, but
we also collected flame graphs
and we obtained 2 very different
pictures for these platforms.
Everything is the same, Ubuntu
22.04, Postgres 16 latest version,
everything is like, some kind of
tuning applied, kind of default
tuning we usually apply to new
clusters under our control.
Nothing special, nothing fancy,
but very different behavior.
And on the flame graphs we could see that, in the case of Intel, on Sapphire Rapids, inside the pg_stat_statements code path a huge s_lock (spinlock) call consumed a lot of time.
And in the case of AMD this call is much smaller.
So in the case of Intel it was like 75% of the whole CPU time in the flame graph.
Like, why?
What's happening?
Something not normal at all.
And I talked to a few guys, talked
to Andrey Borodin and Alexander
Korotkov, and interestingly, they
both mentioned the idea that
probably pg_stat_statements needs
sampling here.
Right?
Sampling.
So, it...
Michael: Yeah, so it's an interesting
idea.
We don't currently have a parameter.
Yeah, we don't...
Like for a lot of the other things
we have, like for logging,
for example, because there's such
overhead to logging, we have
parameters for that kind of thing.
We can sample to only measure 1
in 10 or 1 in 100.
Yeah, yeah.
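For reference, these are the kinds of sampling knobs Michael is referring to on the logging side; availability depends on version (the statement-sampling parameters appeared around PostgreSQL 13, auto_explain.sample_rate much earlier):

```sql
-- Log only ~1 in 10 statements that exceed the sampling threshold...
ALTER SYSTEM SET log_min_duration_sample = '100ms';
ALTER SYSTEM SET log_statement_sample_rate = 0.1;
-- ...and have auto_explain consider ~1 in 100 queries.
ALTER SYSTEM SET auto_explain.sample_rate = 0.01;
SELECT pg_reload_conf();
```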
Well, I'm thinking actually, I know we've discussed this before: for log_min_duration_statement I think it's more recent, but for auto_explain it's been there for a long time, for example.
And the 75% makes sense as well.
Just to give people an idea of the exact numbers, I've pulled up that chart as well.
On Intel with pg_stat_statements, it does get to very nearly a million TPS without pg_stat_kcache, and then drops a little less rapidly, but still it drops down to 500,000 at 150 clients and then down to about 300,000 by the time you get into the hundreds of clients.
Whereas AMD with pg_stat_statements continues to rise: at 100 clients it's about 1.2 million, it gets above 1.5 million at about 150 clients, then it seems saturated and it's mostly a plateau, with a slight decline as you get more.
So that's about 5 times more by the time you get there, you know, 300,000 versus 1.5 million.
It starts to make sense that that's like 75% overhead, I guess.
Nikolay: Yeah, so in all cases, when you reach the usual number of vCPUs, you go down.
But normally you go down slowly, right?
But the picture demonstrated on Intel was very acute, going down very fast.
And this is not normal.
And of course, to confirm that pg_stat_statements was involved, and also to get the numbers we wanted, we wanted big numbers, right?
When I posted it on Twitter, of course, people started liking and reposting.
With AMD, I think, without pg_stat_statements, just removing it, we got 2.5 million TPS, right?
On AMD.
Michael: Nearly, yeah.
Nikolay: Almost, almost, right.
About 2 million TPS.
My tweet was 2 million TPS on PostgreSQL
16.
Right.
Well, yeah, it was funny to see the reaction from the CEO of PlanetScale.
Like, it's not a real workload.
Well, no, it's not a real workload.
It's select-only.
It's a purely synthetic workload.
We're just exploring some edge of what the system can provide.
Of course, it's not realistic, it's just one select, that's it.
Well, maybe there are some systems which need mostly this kind of workload.
And this is an interesting question, because if they have such a workload, they suffer bigger problems from pg_stat_statements.
I think this is the main idea of today's episode: look at your workload and understand the nature of it.
But obviously, the CEO of PlanetScale is not interested in single-node performance, because their main thing is sharding.
So a single node should not provide millions of TPS, right?
It's not normal, right?
Michael: It doesn't help the marketing.
Nikolay: Right, right.
Everyone realizes it's select only,
everything is cached in memory,
but still, 2.5 million TPS, wow.
On a machine you can just rent easily.
Well, it's an expensive machine, I
think.
If it's not spot, if it's normal,
without any discounts, it's
above $10,000 per month.
So it's an expensive machine.
Yeah.
Of course.
But it's possible.
And this is just Postgres with
minimal tuning, right?
It's good.
Like 2 million TPS, whoo-hoo.
Of course, I'm slightly sad comparing
to 2016, how many years
have already passed?
Like 8?
3,000,000?
Maybe we can do more, we can squeeze
maybe more, but it's a separate
question.
We will probably think about it.
But returning to pg_stat_statements, what's happening?
Again, a lot of sessions compete, trying to update the same record in pg_stat_statements; it's just a single record for that one select query, that's it.
If it was different queries, it would be okay.
So I even started calling this workload pathological.
But then I say, okay, what about all the SaaS systems and social media?
Do they have some query which should be fast and is executed very often?
And the answer is yes.
Usually, if people work with your system, you need to select...
It can be cached, of course, but I saw it many times: some primary key lookup to tables like users or posts or projects or blogs or something.
And you see that most sessions, I mean not only database sessions, but for example sessions in terms of HTTP communication, web sessions, most of them need this query, right?
In this case, you probably...
You might have this problem.
You might have this observer effect.
And, of course, the solution would
be to start caching them,
probably, right?
And so on.
But, this is still an issue.
So, if you have, for example, a thousand TPS or more of some primary key lookup, you probably already might have this problem, overhead from pg_stat_statements, which for the whole workload you have is maybe not so big, as we mentioned, 1% or 0.5%.
But if you zoom into this part of the workload, the primary key lookups, the overhead there is probably bigger, and maybe the guys I talked to are right and sampling would be a good solution.
Maybe pg_stat_statements could guess: oh, this is a high-frequency query, a lot of QPS are happening here, the calls number is high, maybe I just need to switch to sampling.
Maybe, I don't know.
It's an interesting question.
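One rough way to check whether your own workload has this shape, a single normalized query dominating executions, is to look at how concentrated calls are in pg_stat_statements:

```sql
-- Which normalized queries account for most executions?
SELECT queryid,
       calls,
       round(100.0 * calls / sum(calls) OVER (), 1) AS pct_of_all_calls,
       left(query, 60) AS query_start
FROM pg_stat_statements
ORDER BY calls DESC
LIMIT 10;
```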
And of course, another interesting question was: why Intel?
And I don't have an answer yet.
We see these flame graphs, and we realize, okay, this code is running much longer than on AMD.
Right now there is an idea to explore older Intels, Cascade Lake and maybe even older Xeons, which may be used much more in production systems.
And maybe also older EPYCs, third and second generation maybe.
We also have an issue, not related, which we discussed before, with lightweight lock contention in the lock manager, where AMD behaves worse than Intel.
So it's kind of interesting.
But what we need to understand: if a lot of sessions run the same query, it can be bad in terms of the lock manager.
So the solution would be to get rid of planning time using prepared statements; this is ideal.
Or just reduce the frequency and you won't notice this, right?
Or make sure fast-path locking is always used, which means you have only a few indexes and partition pruning works in your plans.
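To illustrate the prepared-statement idea for such a hot primary-key lookup, a minimal sketch using the pgbench table from earlier (the statement name is just an example):

```sql
-- Plan once per session; repeated executions skip planning and much of the
-- lock acquisition work that comes with it.
PREPARE account_lookup (int) AS
  SELECT abalance FROM pgbench_accounts WHERE aid = $1;

EXECUTE account_lookup(42);
EXECUTE account_lookup(43);
```

In pgbench terms this is what -M prepared does; applications usually get the same effect through their driver's prepared statements.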
And also, you have an observer effect in pg_stat_statements in this case.
And second, Intel versus AMD.
I don't have answers here, but
it's interesting to just dig into
it and understand it.
Michael: The thing I've heard you
say multiple times in the past
is when you're doing this kind
of work, or when you're looking
at your system, it's trying to
understand where the bottleneck
is.
Something I really like about this
piece of work that you've
done is you first identified that
pg_stat_kcache was a bottleneck
and then switched that off and
then tried to work out what the
next bottleneck was, looks like
it actually might be pg_stat_statements,
even though maybe in your head
you were thinking, it probably
isn't, but let's try turning it
off and see if that makes a difference.
That made a big difference.
So it's each time, like even when
you're thinking about maybe
trying to get above that 2.5 million
or maybe 3 million, we have
to work out what's the current
bottleneck.
Like that's how we...
Yeah, exactly.
So that's a really nice thing that
I don't see enough people
thinking about on their own workloads,
but also when they're
benchmarking, what are we currently
limited by?
Nikolay: Right, right, right.
So when we perform any performance research, like benchmarks, analysis, root cause analysis after some incidents on production, or when we try to reproduce problems: all these database systems are complex things, and the workload usually is quite complex.
So to study the whole, we need to apply what I mentioned before we had this call: René Descartes, or however you pronounce it in English.
Michael: I think that's, well,
he's French, right?
Nikolay: He's French, yes.
Michael: Yeah.
I like René Descartes.
Right.
Nikolay: So the idea is we need to properly split the whole into segments and then try to study each segment separately.
When you study each segment separately, you get to know how each of them behaves.
For example, okay, there is a high-frequency select.
Let's study how it behaves without anything else.
We get to know how it usually behaves without anything else, in isolation, right?
By the way, when we study it, we can also divide it into smaller pieces.
For example, okay, let's remove this, let's remove that extension.
So we go deeper and deeper, down to basically minimal pieces.
It takes time, of course, right?
But then we know how the small pieces behave, and we can try to compose them back into a complex workload and study the ensemble as a whole.
This is a regular scientific approach, I think one of the oldest ones, but we must do it here, and I cannot fully agree with the PlanetScale CEO.
It's not a normal workload, it's not, but we study it because it's present in our complex production workloads, right?
For example, primary key lookups with 1,000 or more QPS.
That's not uncommon.
Michael: Yeah.
Well, but to be clear, I'm still a big fan of this: for every workload I've ever seen, or every setup I've ever seen, especially if we're talking about SaaS companies, that kind of thing, I would still encourage them to have pg_stat_statements, unless they've done some testing that shows it is somehow affecting their workload.
Nikolay: I agree.
Even if it was 30%, I would say
in many cases we still need to
have it.
Because without it, we are blind.
Michael: Yeah.
But if it was 30%, if we did find that out, it would be cool to have sampling.
If we then took 1 in 10, we could reduce that maybe to 3 percent.
Maybe it's not quite linear, but you know.
Nikolay: But sampling here should
be smart.
It should be applied only to high
frequency queries. So anyway,
I think we're almost out of time.
We are.
The bottom line, yeah, the bottom line...
I think we should also check the ARM platform as well and see how it behaves.
So there is a difference in pg_stat_statements behavior on regular queries, regular slower queries like updates or deletes, versus high-frequency, very fast selects like primary key lookups, and there is a difference between AMD and Intel.
So this is the bottom line.
And I'm excited to see the results
of our further investigation of
what's happening.
We plan to understand details here.
Michael: Nice one.
Thanks so much, Nikolay.
Thank you everyone for listening
and catch you soon.