Overhead of pg_stat_statements and pg_stat_kcache

Michael: Hello and welcome to PostgresFM, a weekly show about

all things PostgreSQL.

I am Michael, founder of pgMustard,
and this is my co-host,

Nikolay, founder of Postgres.ai.

Welcome back, Nikolay.

How are you getting on and what
would you like to talk about

today?

Nikolay: Hi, Michael.

First of all, thank you for keeping
the, how is it called, the

ball rolling, or how to say, long
time no see or long time no

hear.

Yeah.

Yeah.

Thank you for very interesting
interviews.

I liked listening to them while
I was slightly off, but maybe

it's time to return to our regular
format or have a mix, I don't

know what we will decide next,
but today we have a regular format,

right?

Michael: Yeah.

Nikolay: Yeah, and the topic I brought today is the overhead of observability tools, the most popular ones.

Well, 1 is the most popular one,
pg_stat_statements.

And I remember exactly the time when we didn't have pg_stat_statements, we had only log-based analysis.

We didn't have wait events, didn't
have pg_stat_statements like

15 years ago, right?

And we only had log-based analysis
and there was a tool written

in PHP called pgFouine, and then
another tool replaced it, written

in Perl called pgBadger.

And the first thing people always figured out was: oh, usually we see only the tip of the iceberg, only some queries, the slowest ones, but you cannot properly perform a full, holistic analysis if you don't see the fast queries, because fast queries might be consuming even more resources than slow queries.

So what experienced DBAs usually did, they said: okay, I'm going to switch on logging of all queries for a few minutes to collect everything and then analyze it.

In very many cases it worked well.

Yes, there is an observer effect, because we might put our Postgres down completely if we log all queries, but not always.

Sometimes it depends.

So yeah, this is quite understandable.

But then pg_stat_statements was created, and some DBAs, I remember, were saying: oh, you know, we still need to study its overhead.

And then there were some benchmarks, but I don't remember any benchmarks I would say I trust.

But I remember many experienced
folks started saying, oh, you

know, yes, there is overhead, but
it's below 10%.

And since this thing is measuring
everything, it's observing

the whole workload, it's worth keeping
it enabled.

Let's have it enabled.

I also remember there was some specific number, with many people saying, you know, pg_stat_statements overhead is 7%.

It's kind of strange, but, well, 7%.

7%?

7%.

I don't remember any benchmarks that proved it, but I remember the consensus was below 10%, so we are all good.

If we try to find a good benchmark
from these early days of

pg_stat_statements, it would be
great.

I don't think it exists.

But honestly, we as a community
should perform good benchmarks

there.

I don't remember any such benchmark.

It doesn't mean they don't exist, of
course, but...

Michael: I did a little bit of searching beforehand because I was interested in whether there were any up-to-date ones as well.

I couldn't find anything great,
but there were a few trustworthy

sources.

One was pganalyze, a commercial monitoring tool that uses pg_stat_statements.

In their frequently asked questions,
they gave an estimate of

approximately 1% of CPU time, which
I thought was interesting

that they're telling their customers.

Anyway, as we're going to discuss, this is all very dependent on workload, right? But at least that's interesting, and I trust them enough to say that's probably based on some tests on their side, even if I didn't see the actual benchmarking. And then the other things I've seen are a question on the DBA Stack Exchange, where somebody quotes various Postgres authors, and a benchmarking series.

One, with a more realistic workload, measured about 0.5% overhead, and another, more of a stress test, measured it at about 10%.

So those are numbers, but that's
a wide range of different numbers.

And as we're going to discuss later, you've done some work that shows it can vary even more than that, which is cool.

Nikolay: I hope by the end of this
episode, people will start

having some ideas why this range
can be so wide.

But I can imagine like 0.5% or
10%.

In both cases, it can be so.

I mean, it depends on the workload, actually.

It very much depends on the workload.

And it's also not only about the workload; we will talk about that later, right?

So, yeah, okay.

But when I say overhead, of course
I mean CPU overhead, first

of all, because we are not thinking about disk I/O, disk space usage, network, nothing like that here, only pure CPU overhead.

Nikolay: By the way, when guys like RDS and others, probably inheriting this from Oracle, consider wait_event being null in pg_stat_activity as CPU, marking it as green, CPU green, in Performance Insights.

Do they mean that all the other wait events are not CPU?

I don't understand this.

Michael: I don't know.

Nikolay: Because many of them, for example lightweight locks and spinlocks, are purely CPU.

Michael: I know we have several
listeners from AWS, so maybe

they can let us know.

Nikolay: We have listeners, yeah.

And on the other hand, if you say
CPU instead of null, null means

unknown.

And according to PostgreSQL documentation
and source code, it means

no wait.

In reality, it means either really no wait, maybe some kind of CPU work, but in many cases it also means a wait event which is not yet created, code that is not yet covered by one.

For example, PostgreSQL 16: we recently had some benchmarks and we saw a wait event which you see in PostgreSQL 16, but you don't see it in PostgreSQL 14 because it didn't exist there yet.

So I'm thinking, okay, in Performance Insights on RDS, and I think in many other systems, maybe in CloudSQL as well, or this PgBouncer ad hoc tool written in Java, they also like to use the green color and say this is CPU, but it's actually not CPU, it's null.

Because CPU doesn't exist in the
list of wait events in pg_stat_activity.

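For illustration, here is roughly the kind of relabeling such dashboards do (a sketch, not the actual query RDS runs): an active backend with a NULL wait_event simply gets counted as "CPU".

```sql
-- A sketch of how a monitoring tool might bucket active sessions
-- (not RDS's actual query): a NULL wait_event is simply labeled "CPU".
SELECT coalesce(wait_event_type || ': ' || wait_event,
                'CPU (really: no wait event reported)') AS bucket,
       count(*) AS active_sessions
FROM pg_stat_activity
WHERE state = 'active'
GROUP BY 1
ORDER BY 2 DESC;
```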

So if you mark it as CPU on 14, because there it was null, but then you start distinguishing it on 16, something is not right here.

So I'm asking, is it really CPU?

So it's an interesting question, but it's slightly off topic.

Back to pg_stat_statements.

Let's talk about benchmarks we
had.

So, the idea was we got some credits
from Google Cloud.

And it's great, this year we, I mean the Postgres.ai team, will be performing a lot of benchmarks.

I'm going to do a lot of posts...

We already had a lot of benchmarks,
I think thousands of them

if you count each iteration, and
hundreds of them if you count

the whole benchmark consisting
of many iterations.

So we were curious how many TPS
we can squeeze from PostgreSQL

16 on various big machines, Intel,
AMD, and there was a good

article from 2016 by Alexander Korotkov, when the PostgreSQL and MySQL folks, I remember, teamed up.

It was interesting.

I don't remember exactly, I think somebody from Percona was also working on the MySQL side; the goal was to demonstrate that both systems can show a million TPS.

And during that work, some contention
issues were improved, fixed

in Postgres.

So, Postgres reached a million
TPS on some strong machines.

And I like that benchmark because
it's simple, it's pgBench,

select only.

So, it's only selects, very simple
ones.

So we just repeated the same benchmark.

And as usual for such benchmarks,
it's actually stress testing,

because you are exploring this
edge.

It's not what you need to do for
your application, for example,

unless you're on purpose studying
it.

I hate this behavior being default
in pgBench.

It should not be the default, because it provokes people to perform stress tests instead of regular load tests that explore a normal situation, for example 25 or 50% of CPU load.

But in our case, we just want to
squeeze as much as we can.

So we did this, and I published a blog post about our new AI bot, and it has details and links on how we did it.

So we took the newest Intel, 4th generation Xeon Scalable, Sapphire Rapids it's called, right?

And also 4th generation AMD EPYC: the C3 and C3D instances on GCP.

The biggest Intel one has 176 vCPUs, and the AMD one has 360 vCPUs.

And both have more than a terabyte
of RAM.

So it's insane, but we use spots.

AWS has spots, GCP has spot instances, it's
cool.

You pay a couple of bucks for such
an experiment, honestly.

Like it's not...

I realized even if we didn't have
credits, I would probably pay

for myself, it's interesting to
explore these things.

Because we just provision temporary machines for 1 or 2 hours, and spot means a huge discount.

Yeah, so, this kind of experiment,
it's a classic experiment for

stress testing.

You first start with 1 connection,
then more and more connections,

and you control...

We chose Alexander Korotkov's approach.

In pgBench both -c and -j are the same.

So we start from 1, then we put
50.

With such huge machines, this step
is reasonable.

Jumping to 50 right away.

Then 100, then 150, 200, and so
on until, I think, 500.

Exceeding the number of vCPUs in
both cases.

So since we use postgresql_cluster, which is an open-source project, a very good thing, a set of Ansible playbooks to provision Patroni clusters with a lot of extras.

It's maintained by Vitaliy, who is working on my team.

So it's a great tool and it has a lot of things, including those we discussed together over the years and he just added into postgresql_cluster.

So we used it, and because of that we had a lot of observability tools, including those for query analysis: pg_stat_statements, pg_stat_kcache, which is much less popular, and pg_wait_sampling for wait event analysis.
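For context, enabling this trio looks roughly like this (a sketch of what such playbooks set up; all three need a shared_preload_libraries entry and a restart, and pg_stat_kcache builds on pg_stat_statements):

```sql
-- Enabling the three extensions (requires a restart;
-- pg_stat_kcache depends on pg_stat_statements being loaded too).
ALTER SYSTEM SET shared_preload_libraries = 'pg_stat_statements, pg_stat_kcache, pg_wait_sampling';
-- ...restart Postgres, then in the target database:
CREATE EXTENSION IF NOT EXISTS pg_stat_statements;
CREATE EXTENSION IF NOT EXISTS pg_stat_kcache;
CREATE EXTENSION IF NOT EXISTS pg_wait_sampling;
```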

And then we saw a very strange
picture.

Up to 50 connections we grow in terms of TPS, kind of approaching a million TPS, but then it goes down very quickly.

And on Intel we went down even
more, but it's a different story.

And thanks to pg_wait_sampling, we saw that the number 1 wait event is related to pg_stat_kcache.

So we immediately realized this is an observer effect from pg_stat_kcache.
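This is roughly how it shows up, a sketch using pg_wait_sampling's cumulative profile view:

```sql
-- Top wait events since the profile was last reset; a kcache-related
-- lock dominating this list was the red flag here.
SELECT event_type, event, sum(count) AS samples
FROM pg_wait_sampling_profile
GROUP BY event_type, event
ORDER BY samples DESC
LIMIT 10;
```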

For those who don't know, pg_stat_kcache is an additional extension on top of the pg_stat_statements extension which provides physical-level metrics: user CPU, system CPU, real disk I/O at the physical level, context switches.

So it's very useful for understanding real CPU usage, unlike this wait_event-is-null thing, which I think is wrongly presented in RDS Performance Insights.

This thing is trustworthy; we have used it for many years, and a few big companies use it as well.
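For example, a per-query CPU breakdown can be pulled with something like this (a sketch; the column names follow recent pg_stat_kcache versions, which split planning and execution metrics, so older releases use user_time, system_time, reads, writes instead):

```sql
-- Per-query user/system CPU and physical I/O, joined with pg_stat_statements.
SELECT s.queryid,
       s.calls,
       k.exec_user_time,      -- user CPU, seconds
       k.exec_system_time,    -- system CPU, seconds
       k.exec_reads,          -- bytes read at the physical level
       k.exec_writes,         -- bytes written at the physical level
       left(s.query, 60) AS query
FROM pg_stat_statements AS s
JOIN pg_stat_kcache() AS k USING (userid, dbid, queryid)
ORDER BY k.exec_user_time + k.exec_system_time DESC
LIMIT 10;
```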

It's not super popular: RDS doesn't have it, CloudSQL doesn't have it, most managed Postgres providers don't have it.

I know Yandex Cloud has it.

I think they might be the

Michael: only ones that have it.

Nikolay: Maybe. But also one of the biggest e-commerce companies we worked with has had it for many years, and they have very critical systems running on Postgres, so it's...

But in this case, the overhead was so huge, like, what's happening?

And I immediately published it on Twitter, and Vitaliy created an issue in the pg_stat_kcache GitHub repository, and then the maintainers created a fix in 4 hours, I think.

So I think they expected something.

There was some additional lock which was removed in this fix, with the idea that this lock is something old and not really needed, but in our case it was slowing everything down when contention is high, when many, many sessions fight...

Not fight, but basically what's happening with pgbench is that you have by default only 4 queries, right?

If you say select-only, dash uppercase S, it's only a single query, only one select to the pgbench_accounts table, that's it.
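That built-in select-only script is essentially just this, with pgbench picking a random :aid per transaction:

```sql
-- pgbench --builtin=select-only (-S), essentially:
\set aid random(1, 100000 * :scale)
SELECT abalance FROM pgbench_accounts WHERE aid = :aid;
```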

In this case, like hundreds of
sessions try to increment metrics

in a single record of pg_stat_kcache,
right?

And due to that lock, which was
removed in the latest version,

it was not good.

I mean, the more sessions we have, the bigger the contention and the bigger the overhead is, and we see it: as we increase the number of clients, the number of connections and jobs, the -c and -j pgBench parameters, TPS goes down.

So we have more, we have room,
we have more vCPUs, right?

But we cannot use them properly
because all of them try to execute

basically the same normalized query.

Parameters don't matter here.
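You can see that collapse directly in pg_stat_statements: the whole run lands in one normalized entry, so every backend is bumping the same row (a sketch):

```sql
-- All executions of the -S workload collapse into a single normalized entry.
SELECT queryid, calls, rows
FROM pg_stat_statements
WHERE query LIKE '%pgbench_accounts%';
```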

Michael: Yeah, I pulled the chart
up from your tweet and just

to give people an idea, at 50 clients
it's about 550,000 TPS.

At 100 clients it does go up to
about 700,000 TPS, but at 150

we're down below the 50 client
rate and it's only 400,000.

Then it settles at about 200 clients
and above, at about 300,000

TPS, which is less than we had
at 50.

So it's a super interesting curve
and cool to hear that you got

so quickly to the root cause and
that the team was able to fix

it so quickly.

Nikolay: Yeah, in a few hours.

Well, let's split the story here
into 2 paths.

First path is purely pg_stat_kcache
and the second path is without

it, because when we saw this overhead
we continued our discovery

of maximum TPS on modern Intel
and modern AMD without pg_stat_kcache,

of course.

Because we didn't expect it to
be fixed so quickly.

But when the fix was ready, we
asked the bot to repeat this benchmark.

It was interesting to convince
the bot that it's safe to download

fresh code from GitHub and compile
it.

It hallucinated, saying it's against
policy.

When I asked which policy, it said PostgreSQL policy.

So it was funny; I went checking, do we have such a policy?

I checked all our docs, because the bot does know our documents.

But it was just pure hallucination.

So then we convinced it, verified, and indeed we saw that the fix resolves the problem and there is no more such overhead.

So it was good, and the next week a new release of pg_stat_kcache was issued.

But it's an interesting question: as I mentioned, I trust those people, and not just trust, I worked with some of them and touched production systems with my own hands, and I didn't see such problems for years in production.

But here we saw obviously a very
big overhead.

Why so?

Why did we survive with this in production?

This is an interesting question.
Let's return to it after we

explore the second path.

Second path: okay, forget about pg_stat_kcache, we only have pg_stat_statements and pg_wait_sampling, and we know the overhead of both is not so big.

And then we just explore from 1 to 500 connections on both the modern Intel platform and the modern AMD, 176 vCPUs on one and 360 vCPUs on the other.

And a huge surprise was Intel behaved...

Again, the same workload, pgBench -S,
so select only, single query,

very simple, sub-millisecond latency.

The huge surprise was that Intel doesn't behave well at all.

It behaves similarly to what we had with pg_stat_kcache enabled.

But higher, yes, higher, it reached
1 million I think or so,

maybe slightly below it, and then
went down.

Going down like this even before you reach the number of vCPUs, as you increase the number of connections and jobs in pgBench, is not normal.

Some kind of problem, obviously.

While AMD also was not great, it didn't go down like that.

It went down slightly, right?

It demonstrated almost a plateau.

Very different behavior.

And we started studying what's
happening here, and obviously

the problem was pg_stat_statements
in this case.

And we saw it, I think, from wait
events by pg_wait_sampling, but

we also collected flame graphs
and we obtained 2 very different

pictures for these platforms.

Everything is the same, Ubuntu
22.04, Postgres 16 latest version,

everything is like, some kind of
tuning applied, kind of default

tuning we usually apply to new
clusters under our control.

Nothing special, nothing fancy,
but very different behavior.

And on the flame graphs we could see that in the case of Intel, across these few repeats, the pg_stat_statements code path had a huge s_lock (spinlock) call that consumed a lot of time.

And in the case of AMD this call is much smaller.

So in the case of Intel it was like 75% of the whole CPU time in the flame graph.

Like, why?

What's happening?

Something is not normal at all.

And I talked to a few guys, talked
to Andrey Borodin and Alexander

Korotkov, and interestingly, they
both mentioned the idea that

probably pg_stat_statements needs
sampling here.

Right?

Sampling.

So, it...

Michael: Yeah, so it's an interesting
idea.

We don't currently have a parameter for that.

Yeah, we don't...

Like, for a lot of the other things we have, for logging, for example, because there's such overhead to logging, we have parameters for that kind of thing.

We can sample to only measure 1
in 10 or 1 in 100.

Yeah, yeah.

Well, I'm thinking actually, I know we've discussed this before, for log_min_duration_statement, I think, since more recently, and for auto_explain from a long time ago, for example.

But the 75% makes sense as well. Just to give people an idea of the exact numbers, I've pulled up that chart as well: on Intel with pg_stat_statements,

it does get to very nearly a million TPS without pg_stat_kcache, and then drops a little bit less rapidly, but it still drops down to 500,000 at 150 clients and then down to about 300,000 by the time you get into the hundreds of clients,
whereas AMD with pg_stat_statements

continues to rise at 100 clients
to about 1.2 million, gets above

1.5 million at about 150 clients,
then seems saturated and it's

mostly a plateau, slight decline
as you get more.

So that's about 5 times more by
the time you get to, you know,

300,000 versus 1.5 million.

Starts to make sense that that's
like 75% overhead, I guess.

Nikolay: Yeah, so in all cases, when you reach the number of vCPUs, you go down.

But normally you go down slowly,
right?

But the picture demonstrated on Intel was very acute, acutely going down, right?

Going down very fast.

And this is not normal.

And of course, to confirm that pg_stat_statements was involved, and also to get the numbers we wanted, we wanted big numbers, right?

When I posted it on Twitter, of
course, people started liking

and reposting.

Like, with AMD, I think we got, without pg_stat_statements, just removing it, we got 2.5 million TPS, right?

On AMD.

Michael: Nearly, yeah.

Nikolay: Almost, almost, right.

About 2 million TPS.

My tweet was 2 million TPS on PostgreSQL
16.

Right.

Well, yeah, it was funny to see the reaction from the CEO of PlanetScale.

Like, it's not real workload.

Well, no, it's not real workload.

It's select-only.

It's purely synthetic workload.

We're just exploring the edge of what the system can provide.

Of course, it's not realistic,
it's some select, that's it.

Well, maybe there are some systems
which need mostly this kind

of workload.

And this is an interesting question, because if they have such a workload, they suffer bigger problems from pg_stat_statements.

I think this is the main idea of
today's episode.

Look at your workload and understand
the nature of it.

But obviously, the CEO of PlanetScale is not interested in single-node

performance, because their main
thing is sharding.

So single-node should not provide
millions of TPS, right?

It's not normal, right?

Michael: It doesn't help the marketing.

Nikolay: Right, right.

Everyone realizes it's select only,
everything is cached in memory,

but still, 2.5 million TPS, wow.

On a machine you can just rent easily.

Well, it's an expensive machine, I
think.

If it's not spot, if it's normal,
without any discounts, it's

above $10,000 per month.

So it's an expensive machine.

Yeah.

Of course.

But it's possible.

And this is just Postgres with
minimal tuning, right?

It's good.

Like 2 million TPS, whoo-hoo.

Of course, I'm slightly sad comparing to 2016; how many years have already passed?

Like 8?

3 million?

Maybe we can do more, we can squeeze
maybe more, but it's a separate

question.

We will probably think about it.

But returning to pg_stat_statements,
what's happening?

Again, a lot of sessions compete trying to update the same record in pg_stat_statements, just a single record, a single select query, that's it.

If it was different queries, it
would be okay.

So, I even started calling this
workload pathological.

But then I say, okay, what about
all SaaS systems and social

media?

Do they have some query which should be fast and is executed in very many cases?

And the answer is yes.

Usually, if people work with your
system, you need to select...

It can be cached, of course, but
I saw it many times, some primary

key lookup to tables like users
or posts or projects or blogs

or something.

And you see most sessions, I mean not only database sessions, but, for example, sessions in terms of HTTP communication, web sessions.

Most of them need this query, right?

In this case, you probably...

You might have this problem.

You might have this observer effect.

And, of course, the solution would
be to start caching them,

probably, right?

And so on.

But, this is still an issue.

So, if you have, for example, a thousand TPS or more of some primary key lookup, you might already have this problem, overhead from pg_stat_statements, which for the whole workload you have is maybe not so big, as we mentioned, it can be 1% or 0.5%.
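For illustration, one way to spot these high-frequency queries in pg_stat_statements is simply to order by calls (a sketch; column names assume PostgreSQL 13 or newer):

```sql
-- Highest-frequency normalized queries; a primary-key lookup with a huge
-- "calls" count is exactly the kind of hotspot discussed here.
SELECT queryid,
       calls,
       round(mean_exec_time::numeric, 3) AS mean_exec_ms,
       round(total_exec_time::numeric)   AS total_exec_ms,
       left(query, 60)                   AS query
FROM pg_stat_statements
ORDER BY calls DESC
LIMIT 10;
```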

But if you zoom into this part of the workload, the primary key lookups, probably there this overhead is bigger, and maybe the guys I talked to are right and sampling would be a good solution.

Maybe pg_stat_statements could guess: oh, this is a high-frequency query, a lot of QPS are happening here, the number of calls is high, maybe I just need to switch to sampling it. Maybe, I don't know.

It's an interesting question.

And of course, another interesting
question was why Intel?

And I don't have an answer yet.

We see these flame graphs, we realize, okay, this code is running much longer on Intel than on AMD.

Right now there is an idea to explore older Intel Cascade Lake, and maybe even older Xeons, which may be used much more in production systems.

And maybe also older EPYCs, third and second generation maybe.

We also have an issue, not related, but we also observe an issue with lightweight lock LockManager contention, we discussed it before, and there AMD behaves worse than Intel.

So it's kind of interesting.

But what we need to understand,
if a lot of sessions run the

same query, it can be bad in terms
of lock manager.

So the solution would be to get rid
of planning time using prepared

statements, this is ideal.

Or just reduce frequency and you
won't notice this, right?

Or make sure fast-path locking always applies, which means that you have only a few indexes and partition pruning works in plans.
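A minimal sketch of the prepared-statement route (with pgbench the equivalent is -M prepared; the statement name here is just for illustration):

```sql
-- Plan once, then reuse: after several executions Postgres can switch to a
-- cached generic plan, removing per-execution planning (and its lock manager
-- pressure) for this hot lookup.
PREPARE account_lookup(int) AS
    SELECT abalance FROM pgbench_accounts WHERE aid = $1;

EXECUTE account_lookup(42);
```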

And also, you have an observer effect in pg_stat_statements in this case.

And second, Intel versus AMD.

I don't have answers here, but
it's interesting to just dig into

it and understand it.

Michael: The thing I've heard you
say multiple times in the past

is when you're doing this kind
of work, or when you're looking

at your system, it's trying to
understand where the bottleneck

is.

Something I really like about this
piece of work that you've

done is you first identified that
pg_stat_kcache was a bottleneck

and then switched that off and
then tried to work out what the

next bottleneck was, and it looks like it actually might be pg_stat_statements,

even though maybe in your head
you were thinking, it probably

isn't, but let's try turning it
off and see if that makes a difference.

That made a big difference.

So it's each time, like even when
you're thinking about maybe

trying to get above that 2.5 million
or maybe 3 million, we have

to work out what's the current
bottleneck.

Like that's how we...

Yeah, exactly.

So that's a really nice thing that
I don't see enough people

thinking about on their own workloads,
but also when they're

benchmarking, what are we currently
limited by?

Nikolay: Right, right, right.

So when we perform any performance research, like benchmarks, analysis, root cause analysis after some incidents in production, or we try to reproduce problems... A database system is a complex thing, and the workload usually is quite complex.

So to study the whole, we need to apply, like I mentioned before we had this call, the approach of René Descartes, or however you pronounce it in English.

Michael: I think that's, well,
he's French, right?

Nikolay: He's French, yes.

Michael: Yeah.

I like René Descartes.

Right.

Nikolay: So the idea is we need to properly split the whole into segments and

then try to study each segment
separately.

You study each segment separately and learn how each of them behaves.

For example, okay, there is a high frequency
select.

Let's study how it behaves without
anything else.

We know how it usually behaves without anything else, in a vacuum, right?

By the way, when we study it, we
also can divide it into smaller

pieces.

For example, okay, let's remove
this, let's remove that extension.

So go deeper, deeper, so basically
minimal pieces.

We start, it takes time, of course,
right?

But then we know how the small pieces behave, and we can try to compose them back into a complex workload and study the ensemble as a whole.

This is a regular scientific approach,
I think one of the oldest

ones, but we must do it here, and I cannot agree with the PlanetScale CEO. It's not normal, it's not, but we study it because it's present in our complex production workloads, right?

For example, primary key lookups
with 1,000 or more QPS.

It's not uncommon.

Michael: Yeah.

Well, but to be clear, for every workload I've ever seen, or every setup I've ever seen, especially if we're talking about SaaS companies, that kind of thing, I would still encourage them to have pg_stat_statements, unless they've done some testing that shows it's somehow affecting their workload.

Nikolay: I agree.

Even if it was 30%, I would say
in many cases we still need to

have it.

Because without it, we are blind.

Michael: Yeah.

But if it was 30%, if we did find that out, it would be cool to have sampling.

If we then took 1 in 10, we could
reduce that maybe to 3 percent.

Also, maybe it's not quite linear,
but you know.

Nikolay: But sampling here should
be smart.

It should be applied only to high
frequency queries. So anyway,

I think we're almost out of time.

We are.

The bottom line, yeah, the bottom
line, check...

I think we should also check the ARM platform as well and see how it behaves.

So there is a difference in pg_stat_statements behavior on regular queries, or regular slower queries like updates or deletes, versus highly frequent, very fast selects like primary key lookups, and there is a difference between AMD and Intel.

So this is the bottom line.

And I'm excited to see the results
of our further investigation of

what's happening.

We plan to understand details here.

Michael: Nice one.

Thanks so much, Nikolay.

Thank you everyone for listening
and catch you soon.
