Postgres FM | Transcript: Four million TPS

July 5, 2024 • 46 Minutes

Four million TPS

Michael: Hello and welcome to Postgres
FM, a weekly show about

all things PostgreSQL.

I am Michael, founder of pgMustard,
and this is my co-host Nikolay,

founder of Postgres.AI.

Hey Nikolay, what are we talking
about today?

Nikolay: Hi Michael, let's talk
about a lot of TPS we can get

from a single Postgres node.

I recently achieved something.

I'm curious, is it some interesting
achievement or like no?

I'm sure it's possible to get more
on bigger machines.

Because my machines were big enough,
right?

It's like just 360 vCPUs and just
2 terabytes of RAM or something,

maybe 1.5 terabytes of RAM.

Yeah.

Fourth generation AMD Epic.

It's, there is already 5th generation
AMD Epic and there are Intel,

Intel's also, Xeon Scalable, bigger
instances and so on, it's

possible.

But I've almost achieved 4 million
transactions per second, full-fledged

transactions, not just like what
other NoSQL databases have,

it's transactions.

Read-only, we should admit, read-only,
and meaningful transactions.

It was a table with, I think, 10
million rows.

It's a tiny table for this machine.

Everything is cached, of course.

And it was a primary key lookup
and PgBench.

Regular PgBench accounts table,
you know it, right?

So...

Michael: Is it single read per
transaction?

Single query per transaction?

Is that right?

Nikolay: That's a good question
actually, let's double check,

I will double check it right now,
but I think yes, it's just

select from PgBench, random
number, random ID, and that's

it, should be, should be.

Michael: I was just hoping you'd
get to multiply your 4 million

by 2 or something.

Well, well, I get queries per second.

Nikolay: Let me tell you what you
don't know.

Yesterday I was driving to LA and
I used this like I think

I should check my eye and
get even more TPS.

And thankfully, I had AI with autopilot
in the car, right?

So I could, maybe it's still not
a good idea to use phone, But

it's just a quick question, right?

So I asked my bot to repeat this
TPS.

Oh, I forgot to mention, all these
experiments are done with

our Postgres.AI bot, which simplifies
experimentation a lot.

So If you need to provision a huge
machine, run some PgBench

during a couple of hours according
to some script, you just ask

for it, then you think, oh, what
if we change something else?

It's just a few words.

You just ask, and then it reports
back in a couple of hours.

So I asked, can you please check
not regular select only for

PgBench, which is very standard
Postgres for benchmarks, quite

simple but widely used, but rather
let's just use SELECT;

Not meaningful, right?

Not dealing with data at all, but
still it needs to be parsed,

right?

And so on.

Michael: I had it on my list of
questions for you to ask what

would happen if we got...

I was going to suggest SELECT 1,
which I guess is the beginner's

version.

Nikolay: I can tell you we exceeded
12000000,

Michael: wow nice.

Nikolay: Yeah but it's I don't
consider it fair because it's

just heartbeat.

Well it

Michael: depends what you're measuring
right like If you're trying

to measure what's the most we can
achieve, then sure, it's meaningful.

But if you're actually trying to
do something meaningful, read-only

is probably not realistic.

There's so many things you could
say, and I like the idea of

what's the maximum we could get
if we don't worry about constraints,

about how realistic it is.

Nikolay: Right.

So yeah, in this case, we only
checked the parser and general

transaction, read-only transaction,
not full-fledged transaction,

all these pieces.

But we didn't check how Postgres
works with, for example, buffer

pool with reading data and so on.

Because I know there was some contention
in the past, even for

read-only queries, which was improved,
I think, in 9.6 on the

road to achieve 1 million TPS.

We will probably discuss this slightly
later.

In general, what's interesting,
I needed to explain, I hate SELECT

1 statements.

You know why, right?

No.

Because 1 is not needed.

Michael: Fair enough.

Nikolay: You can say SELECT and
pause this.

It will return you a result set
consisting of 0 rows and 0 columns.

Michael: Can you just run semicolon
on its own or something?

Nikolay: Yes, you can.

I think we also discussed it.

Michael: Must have done.

Nikolay: Yeah, it has issues, negative
side effects, because,

for example, you won't see it in
logs, if somehow it's slow.

It's like empty query.

You won't see it in pg_stat_statements,
for example, unlike regular

SELECT;

And I failed to convince bot to
run SELECT;

Somehow it didn't work.

And I also had difficulties convincing
bot to select.

It started telling me...

I think it was Gemini.

It was this passive-aggressive
manner, it's Gemini for sure.

So it started telling me, you know,
like it's incorrect, it's

invalid SQL statement.

So let's like, it offered me, let's
put 1 at least there, or

where is the table name?

Like, okay, let's, I said, no tables,
no tables, it was just

like idle queries, like no real
action.

And then said, like, let's put
at least 1 because otherwise it's

incorrect statement.

I forced, like, are you sure?

You can check.

So we had some battle.

But in general, yeah, it worked
and we achieved 6 million TPS,

which was good.

But SELECT; didn't work for our
bot.

I think it's possible.

I don't expect a huge win compared
to SELECT.

Maybe it will be parser.

I don't know.

You can

Michael: run the experiment manually,
right?

Could you just submit the JSON
yourself?

Nikolay: It's better to convince
the bot.

Michael: You prefer to fight the
LLM, fair enough.

Nikolay: A little bit, yeah, because
every time we do it, I find

opportunities to improve, of course.

But back to meaningful results,
which is interesting.

We almost achieved 4 million, but
originally it was just 1 million.

And when I saw just 1 million on
modern Postgres, we checked

16 and 17 beta 1.

By the way, beta 2 is already out.

I think we should update and switch
to it in our tests.

So I expected much more because
I remember the post from Alexander

Korotkov from 2016, when I remember
he worked with Percona people,

Sveta Smirnova particularly.

They had like a friendly battle,
let's see what's better, Postgres

or MySQL, in terms of how many
TPS we can get from a single node.

And I think both Postgres and MySQL
learned something from it.

And Alexander created a few patches
because some issues were

found, some contention was found.

And it was improved.

In Postgres 9.5 it was not so good
and in 9.6, it was improved just

based on this work.

And here achieved 1 million TPS.

MySQL also achieved 1 million TPS.

Good.

But I was surprised like so many
years, like 8 years, right?

We have so many Postgres versions,
basically 8 Postgres versions

already since then, and maybe 7.

And much better hardware.

Not much, but significantly better
hardware.

So I expected to get more from
this.

Michael: Well, we looked into the
machines, didn't we?

And even back then, they had managed
to get their hands on 144

VCPU machine.

And I think you mentioned yours
was a bit more than double that,

was it?

Nikolay: 360, and I think they
had the Intel's, much older Intel's,

and I had AMD, which is very easily
available on GCP, which I

use because I have credits there,
right?

So yeah, before we started this
call, I cannot miss this.

You showed me what AWS has.

Super impressive.

How many, almost 668, almost 800.

No, more, more.

Almost 900.

Michael: 896 vcpu a month.

Nikolay: And it's Intel scalable
fourth generation.

Michael: Yeah, it'll cost you.

Nikolay: But also 32 terabytes
of RAM maximum.

Yeah.

2224x large.

Wow.

And they don't have spots on it.

Of course, this is something new.

Yeah, and it's fourth generation,
it's scalable.

What I found interesting, like
during last year and this year,

various benchmarks, not this, like,
toy benchmark, it's a side

project for me, but working with
customers, various cases.

What I found, what my impression
right now from Intel's AMD,

like recent versions of them available
in cloud, is that AMD,

like for example, on GCP, you can
get more vCPUs and reach more

TPS on very simple workloads.

And it will be cheaper in general
in terms of how many, like

if you divide TPS per dollar, find
TPS per dollar, it will be

cheaper probably, and you can achieve
higher throughput for simple

workloads.

But when it comes to complex workloads
and various problems like

lightweight lock manager contention
we discussed a few times.

And in general, complex workloads
and behavior of auto vacuum

and so on, AMD looks less preferable
and Intel's are winning.

So that's why I'm additionally
impressed that AWS offers Intel

scalable Xeons, 4th generation,
almost 900 CPUs.

It's insane for me.

32 terabytes.

Wow.

And you think I can exceed 4 million
TPS, right?

It's a joke, I guess.

Well, it'd be interesting

Michael: to hear more about what
you think the current bottlenecks

are, like where you think.

But I guess, like my stupid brain
is thinking, like, this is

probably CPU constrained at this
point.

Throwing more at it makes sense
that it could...

Like, throw double at it, maybe
you can get close to doubling

it.

Nikolay: Well, we collect a lot
of information.

We have a wait event analysis,
we have flame graphs collected

automatically for each step.

And it still needs to be analyzed
deeper.

But what I see right now, the bottlenecks
are mostly communication.

Let's actually maybe take a step
or a couple of steps back and

discuss what issues I had originally.

Why only 1000000?

Michael: Can we take a step back
further than that?

Why were you doing this?

Yeah,

Nikolay: it's a good question.

I was asking myself why do we do
this.

Well, for fun, first of all, actually.

I was very curious.

I remember that work.

I think for Alexander and Sveta
it was also for fun 8 years ago.

But it led to some interesting
findings and eventually to optimizations.

So this fun was converted to something
useful.

And we started doing this for fun
because we have big machines,

we have very good automation, we
have newer Postgres.

I was just curious how easy it
is for us to achieve 1000000 and

maybe go further.

And first thing we saw, since we
have a complex system, a lot

of things, the observability tools,
first thing we saw is that

1 of extensions is a huge bottleneck.

It was pg_stat_kcache.

And it even didn't let us achieve
1 million TPS, quickly showing

huge observer effect, meaning that
just measuring pg_stat_kcache

extends pg_stat_statements to see
metrics related to physical metrics

like CPU, real CPU, user time,
system time, then IOD, real disk

IOD, not just talking to the page
cache, and context switches,

and very good detailed physical
query analysis.

And we like to have it, but if
you have it until the latest versions,

it quickly became a huge overhead
when you have a lot of queries

per second for a specific query.

Michael: I remember we discussed
this back in, I just looked

up the episode back in February,
about overhead.

I'll link that episode up as well.

Nikolay: Right, And yeah, just
to recap, we found that it shows

up in the wait event analysis using
pg_wait_sampling.

Great.

And we just quickly excluded it
and also reported to the maintainers

and They fixed it in 4 hours.

It was amazing.

And we tested it with Bot, confirmed
that now it's improved.

So this is how we achieved 1000000.

But why only 1000000?

We have newer Postgres and better
hardware.

Did I answer the question why,
by the way?

Michael: You said for fun.

Nikolay: For fun, right.

Yeah, this is the key, actually,
here.

I enjoy it.

You know, bot, let's just run pgBench
that way, or this way.

Let's collect more data points.

Let's visualize it.

Let's repeat this, but with an
adjusted version.

Let's compare various versions
for example.

Michael: So it sounds like for
fun but also you've got credits

and also dogfooding, right?

Like trying out products, trying
it out, here's an idea.

Nikolay: Yeah, we try to collect
successful experiments, and

we do it on smaller machines as
well, but on bigger machines

we collect just to understand how
the system is working.

Of course, we're moving towards
more useful benchmarks, for example,

when it will be already some specific
database and specific kind

of workload and maybe not only
PgBench and so on.

And all the tooling will be in
place and automation is super

high, so you can just talk to chatbot
and get results visualized

and collected with all details
and you don't miss any point.

So we collect almost 80 artifacts
for each datapoint.

It's huge.

It reflects many years of experience,
I would say, not only mine.

For example, you can say, okay,
let's compare versions 12 to

17 beta 1.

This is what I did as well.

And I confirmed that for regular
PgBench workload, or for Select-only

PgBench workload, no difference.

These versions are stable, no regression,
but also no wins.

But for regular...

Michael: 12 through 17.

Nikolay: Yeah.

But for select-only, this is exactly
like workload which allowed

me to achieve almost 4 million
TPS.

For this, 12 is losing significantly.

Michael: Wait, wait, wait.

You lost me.

So for read-only...

No,

Nikolay: no, for regular.

Michael: Oh, for read-write.

Nikolay: You know, regular transactions,
PagerBench has, It's

like, I don't know, like a couple
of updates, insert, delete,

a couple of selects, and they are
packaged as a single transaction,

and then you just say, okay, let's
just check it.

And the way we check it, it's called
like actually not load testing,

but stress testing, specialized
version of load testing.

And we check in the edge, we say,
okay, with 1 client, how many?

50 clients, how many?

100 clients, how many?

And then you have a graph dependency
of TPS, how TPS depends

on the number of clients.

Yeah, and it's interesting that
we cannot say 0 and due to scale

we need to jump with big steps
like 50.

So I say 1, 50, 100, then dot,
dot, dot, 500.

And the bot understands that it
needs to go with 50 increments,

but first dot is shifted to 1 and
it's okay.

So it's interesting, it adjusts
very quickly.

So, yeah, for fun and this, and
back to versions comparison,

So we found that for select only,
we don't see a difference,

but for regular workload and we
don't get a million transactions

there, we have only like 30,000
maybe, 40,000, something like

quite lower because it's already
writing transactions and so

on.

We saw that 12 is losing apparently,
and 16, 17 are winning slightly

compared to like average for these
old versions.

But yeah, I think we will continue
exploring and maybe test some

specific workloads, not just standard
ones.

Michael: And just to check, this
is like completely untuned,

like just default config.

Okay.

Nikolay: It's tuned.

Good to check.

Yeah.

We use PostgreSQL cluster, it's
Ansible playbooks maintained

by Vitaly, who works with me and
Vitaly Kukharik.

And it's a great project if you
don't like Kubernetes, for example,

and actually containers.

You just need bare Postgres installed
on Debian, Ubuntu or something,

and you want automation for it,
this is a good project.

It goes with everything, like everything
most popular stuff,

like Patroni, pgBackRest or WAL-G, what else?

Like a lot of simple things.

Actually, even with TimescaleDB,
it's packaged.

So you can choose to use it and
quickly have it on your cloud

or...

So for self-managed Postgres, it's
a good option.

It takes time to install, of course,
because it will be running

like apt-get install.

So it's coming with some tuning.

It's similar to what cloud providers
do, like 25% for shared

buffers.

It also adjusts operational system
parameters, like kernel settings.

Michael: I guess the dataset is
so small that 25% still so easily

fits within.

Nikolay: Yeah, of course.

And maybe we can tune additionally,
but I didn't see the point

yet.

This default tuning this project
provides was also like...

I know it quite well because we
adjusted in the past together

a few times, so I know how it works.

So I relied on it for now, but
maybe we will additionally adjust

some things.

But queries are super simple, right?

Data is cached.

So...

Michael: I understand.

I'm just trying to think, like,
the major things, like, all cached,

I guess, basic, like, write-ahead
log stuff, when you've got

the read-write stuff going on.

Doesn't matter.

Yeah.

Nikolay: It doesn't matter because
we don't write.

Michael: So in 1 of them, in the
30, 000, I reckon the 30, 000

is low and you could get much higher
there.

Nikolay: Well, yeah, if we talk
about writing transactions, of

course, we should think about checkpoints
and so on.

And it comes with some adjustments
based on the size of machine

already.

So I just relied on this default
behavior.

We can look further and good thing,
we consider settings as 1

of artifacts we collect.

So for history, we record everything,
all custom settings and

some system info as well.

So we can revisit it later and
think how to improve and next

step.

But bottlenecks are not there yet.

Michael: Yeah.

And I was wondering, last question
I had on the kind of the top

line number.

I get keeping pg_wait_sampling
for being able to introspect,

for being able to kind of try and
work out where is the bottleneck

now.

It's great for that.

But doesn't it also add a small
amount of overhead?

I know it's only sampling, so it's
probably minimal.

But Were you tempted to try without
it as well and just see what

you got?

Nikolay: I think we did.

And I think it's very minimal.

It's worth revisiting as well.

But for now, our impression is
that PageVision sampling among...

We have...

That's why, by the way, we encountered
the problems with pg.kcache,

because we used basically the package
we use for various kinds

of experiments in the case of self-managed
approach, right?

Michael: Had it more automatically
on by default.

Nikolay: Yeah, We already had these
observability tools there,

and this is how we found the pg.stat_kcache
problem.

As for pg.wait_sampling among all
3 extensions, pg.stat_kcache, pg.stat_statements,

and pg.wait_sampling, by the way,
both pg.wait_sampling and pg.stat_kcache

depend on pg.stat_statements because
they basically extend it for

query analysis.

They all provide query analysis.

Regular pg_stat_statements provides
query analysis like regular

database metrics like calls, timing,
IOMetrics at upper level,

buffer pool metrics, right?

Kcache goes down to physical level,
CPU and disk.

And PgWaitSampling for each query,
it also provides a profile

in terms of wait events, which
is super useful.

It's like different angles of analysis.

So if you can have them all free,
it's super good.

Also worth mentioning, it's off-topic
but interesting.

Until very recently, PgWaitSampling
didn't register events

where weight event is null, which
most weight event analysis

we know, such as RDS Performance
Insights, they picture it as

a green area called CPU.

By the way, Alexander Korotkov,
who's originally the author of

PgWaitSampling, confirmed my
idea that it's not fair to name

it CPU.

It should be like CPU or some unknown
weight event which is not

yet implemented.

Because not everything is covered
by a weight event analysis

in the code base.

Until recently, the pre-reward
sampling had a huge problem.

It didn't register nulls, naming
them somehow, at least like

CPU or something, at all.

But guys from CyberTech implemented
it a couple of weeks ago.

It's huge news.

It's good.

Thank you.

Thank you, guys, if you listen
to us.

I'm happy that it's implemented.

And I know GCP Cloud SQL also has
PGWait sampling.

And I like the idea that unlike
in Performance Insights, in RDS,

we have interface to this data.

I mean, good interface with SQL,
right?

We can just query this data right
from our database and build

our own observability extensions
or automation.

For experiments, it's super important
because I want to take

a snapshot and it's easier for
me just to take a snapshot using

SQL.

Let's collect profile, global,
and per query.

We dump it to CSV files basically.

That's it.

For long-term storage.

So we quickly found Kcache, solved
it, and we stuck at 1 million

TPS, which made me sad because
it just repeated the experiment

Alexander had 8 years ago.

So what's next?

Do you know what was next?

Michael: I've cheated.

I've read your post already.

Nikolay: You know.

Michael: Yeah.

And I think the last episode as
well gave it away in February

with page set statements being
next, right?

Yeah.

Nikolay: Huge bottleneck for such
cases.

I consider this edge case because
normally we don't have super

high frequent query, super fast,
super high frequent, and a lot

of CPUs.

But theoretically it can happen.

For example, you have Postgres
and a few replicas, and you need

to...

Like you have some huge table,
maybe not so huge, single index,

and you need just to find records.

Basically, almost like a key value
approach, right?

Maybe partitioned.

And then queries are super fast,
much, like a lot below 1 millisecond.

So in this case, if you have a
query, which after you remove

parameters is the same.

You have a single normalized query.

Very high frequency.

Michael: Yeah, is it how many?

There are probably only 2 or 3
normalized queries in that pgbench.

Nikolay: Well, by the way, I double-checked
for select only,

it's just a single query.

Michael: It's 1.

Okay, wow.

Nikolay: It has also set this macro
backslash set to get random,

but I think it's client-side.

So it's not going to Postgres.

It's only for PgBench itself to
generate a random number.

And I don't think it's a bottleneck.

I hope not.

Oh, by the way, our PgBench clients
were sitting on the same

host as Postgres, which is interesting.

Because they consume CPU as well.

And unfortunately, right now we
don't collect host stats and

so on.

We don't know how much of CPU was
used.

But usually it's noticeable.

It's not like, not 50% usually,
when you do experiments like

that, co-hosting clients and servers.

But I think it's noticeable, I
would say maybe 10-20%.

So if you offload them somehow,
but if you offload them, you

bring a network.

Yeah.

And we will get back to that point.

So, single machine, clients are
on the same machine, eating some

CPU.

And if you have this pattern, just
a single query, normalized

query, pre-resource statements
becomes a huge bottleneck.

Just removing pre-resource statements,
we jumped from 1 million

to 2.5 million TPS.

Michael: Yeah, wow.

Nikolay: Yeah.

And then what's next?

We reached 3 million.

How?

Let's recall.

Then we reached 3 million just
because I forgot I originally

I should use it.

I remember Alexander used it in
2016.

Prepare statements, right?

I forgot to use it and how did
I understand that I forgot?

Looking at flame graphs.

Our bot collects flame graphs.

Ah, nice.

Yeah.

And I just, I was inspecting flame
graphs and I just saw planning

time.

And I'm thinking, oh, we spend
a lot on planning time.

Right?

Let's just get rid of it.

So the right way is just to say
dash capital M hyphen, capital

M prepared for PgBench.

Michael: And it's a perfect workload
for it because it's simple,

same query over and over again,
no variants.

Perfect.

Nikolay: Right.

So also interesting that when you
like, again, step back, when

you see that starting from 150,
100, 150, 200, you go to 500

clients, then you have 360 cores.

And when you see it goes up to
like 100 or 150, and then it's

plateau.

It also gives you a strong feeling
it's not normal, right?

Because it means other Postgres
doesn't scale in terms of number

of parallel clients to number of
cores.

I expected it to scale to number
of cores.

Of course, we have clients running
on the same host, so maybe

the situation will start earlier,
but definitely not at 100 or

150 if you have 360 cores.

So this is how I saw this is not
right.

And it happened with pg_stat_kcache
and then pg_stat_statements.

Situation started earlier.

So yeah, some bottleneck, obviously.

But now the curve looks already
much better.

It's already like we reach maximum
point already close to number,

of course, 300 or 360, maybe 400,
actually, slightly bigger.

So then it doesn't make sense to
increase, but we just check

it, trying to find a plateau.

So prepare statements gave us more
than an additional half a

million of TPS.

Michael: Oh, I expected more, actually.

I was...

Only because, you know, when you're
looking at, like, super fast

primary key lookups, quite often
planning time is more than execution

time in a simple EXPLAIN ANALYZE.

Nikolay: Interesting, yeah.

Michael: Yeah, so it's just interesting
to me that it wasn't

like double, but yeah, half a million
was nothing to look down

our noses at.

Nikolay: Yeah.

Well, interesting.

Maybe I should think about it and
explore additionally, but this

is how it is.

Michael: Did you do force generic
plan as well?

Nikolay: No, not yet.

But mind

Michael: you, that would only,
that would only like be 5.

Yeah, no, that probably won't help.

Nikolay: Here's the thing, it's
in my to-do, maybe for this week

or next, when I have time.

I don't have time.

This is, again, this is a side
project just for fun and understanding

general behavior of Postgres.

What I remember about this generic
plan, Forrest, is that Jeremy

Schneider posted a very good overview
of problems others had

with log manager recently, right?

And I remember some benchmark,
some specific case someone showed

on Twitter.

If you just tell pgbench, If you
adjust user, for example, alter

user, and set force plan cache
mode, this force generic plan,

it leads to huge degradation in
terms of TPS.

So I need to check it.

And lightweight lock manager pops
up in top of wait events.

So I need to double check it because
I don't understand the nature

of it.

It's super interesting.

Michael: I realized soon after
saying it that the reason I thought

it would help is not going to help
here because you've got the

same query over and over, not lots
of little different queries.

Nikolay: But that should be cached
for sooner, right?

Michael: Yeah, but like after 5
executions, which is probably

like half a millisecond total.

Nikolay: Yeah, it should be no
difference in my opinion, but

no degradation.

Why degradation?

It's something interesting to check,
or maybe I just misunderstood.

This is my to-do to clarify.

So I guess it's interesting, exploring
postgres behavior on the

edge.

So let's just recap.

PgStart.kcache, we removed it,
also fixed, but did we get it

back?

We should get it back, actually,
because now it's much better.

They just removed some log, and
it's good again.

Then we found PgStart statements.

We achieved 1000000.

We removed PgStart statements.

This is the key.

By the way, when I say we keep
only pg_wait_sampling, but it

also depends on PgStart statements,
I also feel some inconsistency

in my...

Michael: I've looked it up.

When you said it, I got confused.

It doesn't depend on it.

It's just more stuff is possible.

Nikolay: It's optional, right?

Yeah.

It can run alone, as we do right
now in these experiments.

But if pg_buffercache is present,
we can join data using query

ID, right?

Michael: Exactly.

Nikolay: This is a good correction,
because I said the wrong

thing.

But pg_stat_kcache does depend on
pg_stat_kcache, you cannot install

it without it, this is for sure.

Okay.

And yeah, and these guys are not
available in cloud environments

usually.

PgWord sampling is in CloudSQL.

I mean, not in cloud, in managed
Postgres.

PgWord sampling is available on
CloudSQL, but that's it.

But maybe after this podcast, people
will consider adding it,

especially if pg_wait_sampling,
which is really great, especially

now with this fix from Cybertech.

So we removed pg_stat_statements, reached
2.5 million.

We added prepared statements, exceeded
3 million.

And then final step, how I almost
approached 4 million.

It's like 3.75 million TPS.

How?

Like Last optimization.

It's actually not super fair optimization.

My last optimization was, let's
just switch from TCP connection

to Unix domain socket connection,
which is possible in limited

cases, because it works only when
you're on the same node.

You should exclude network completely.

And of course, this is obvious,
TCP/IP connections are heavier,

definitely, like more overhead
than just when processes talk

through Unix domain socket.

It's much more lightweight.

And this allowed me to jump from
3 something to 3.75 million

TPS, which is good.

And also I found that 17 works
worse on Unix domain socket than

16.

Yeah, this is also in my to-do.

Why?

For TCP connections, no difference.

It's on this edge case, like select
only.

But for Unix domain socket, I
see degradation and it's worth

exploring why.

So it's maybe I found some degradation
for 17.

I also need to double check.

Michael: Quite a big difference.

Yeah.

Oh, no, no, no.

So sorry.

Nikolay: 7 to 8% as I remember
on higher number of clients.

Michael: Yeah, sure.

Sure, sure, sure.

I made the mistake of looking at
17 beta chart and looking at

TCP versus Unix, not 17 versus
16.

Nikolay: the way, when you have
a long chat with bot and different

series of experiments, then you
want to cross-compare something,

it's super easy.

Just visualize this and that on
the same graph, that's it.

Michael: Much easier than getting
it to give you the correct

JSON I saw from the chat transcript.

Nikolay: Well, yeah, it depends.

And Gemini is better in JSON than
GPT.

Well, we have a lot of fun stories.

So Let's not go there.

But it happens.

So, yeah, almost 4 million TPS
for quite meaningful workloads.

I think it's worth checking bigger
tables, for example, to see

how it depends.

For example, we can just take like
300 clients and then check,

draw different picture.

It's in my mind, like for example,
let's take very small table,

bigger, bigger, bigger and huge,
right?

And how it will degrade, for example,
right?

Michael: When you say bigger table,
do you mean like 100 million

rows instead of 10 million rows?

Nikolay: 100 million rows, yeah.

200 million rows, half a billion
rows, billion rows, partition,

not partition.

Like it's many, it's like a huge
maze.

You can go in many directions and
turn and then have like dead

end maybe, and you turn and inspect
another.

Like it's cool.

The cool thing is that I do it
like anywhere.

Like I don't know, like I'm eating
breakfast and checking what's

up with our experiments.

Let's do something else.

It's just my current hobby, you
know, to explore.

I think we should have more tools
like Sysbench, for example,

to have different kinds of workloads
to be brought up to this.

So yeah, that's it.

I'm very curious for select-only
PgBench, anyone had more?

But now, I was thinking, oh, maybe
it's the biggest number.

It's definitely the biggest I ever
saw, but I didn't see everything.

Maybe someone has bigger numbers
or saw bigger numbers or like

totally pg-bitch.

Michael: If they have, I couldn't
find it.

I did do some looking before the
episode just to try and see

if anyone pushed it further.

If it's been published, I've struggled
to find it.

Nikolay: But it's definitely possible
on these huge machines

on AWS, right?

Because you have more vCPUs and
as we see Postgres scales quite

well in terms of number of clients
and number of CPUs.

Michael: That's the direction.

If you're talking about that maze,
the way I would be tempted

to go with this is with the currently
available cloud-provided

VMs, what's the most TPS I can
get?

I don't care about cheating.

I don't care if it's only select
semicolon.

And I just would love to know,
what is that number?

Is it, I'm guessing it's above
10 million, but is it 12?

Is it 15?

Like, how many could we, like,
that would be super interesting.

How

many millions?

Nikolay: Actually, it's another good point.

By the way, I forgot to mention
that the problem with pg_stat_statements

statements identified.

Again, if you have a select-only
key-value approach and you need

to squeeze more, pg_stat_statements
can drop performance 3 times

here, basically.

So it's worth removing it.

And of course, there are already
discussions.

Let's mention there is a new discussion
started recently, I think,

by Michael Pacquere.

Sorry, I pronounce all this wrong.

How to optimize, what to do with
producer statements in general,

and he reacted on Twitter on my
benchmark.

So definitely there is an opportunity
here to improve producer

statements.

1 idea was start sampling at some
point.

Michael: Sampling is the obvious
1.

Nikolay: Also, like, deal with
high contention case differently,

and how then, like, update less.

There are ideas.

Michael: I would hate for the takeaway
here to be that there's

a real overhead here of pg_stat_statements, mostly because this

is so much the extreme version,
right?

This is the same query being worked

Nikolay: on.

I put it everywhere.

Disclaimer, this is edge case.

On production, you probably don't
see it.

With your workload, you won't see
it.

With your workload, probably it's
like a few percent only overhead.

But in some cases, when your workload
is just 1 or a few normalized

queries and the frequency is super
high, you have a lot of VCPUs

and they start competing, this
is a high contention situation.

In this case, just check it.

In terms of wait event analysis,
PG stat statements didn't show

up in wait event analysis until
I think Postgres 16, because

basically it was not covered by
wait events, I think.

Interesting.

Maybe I remember incorrectly, but
definitely it was not covered

until some version, and then it
started to be covered.

So you can see it, if you have
a very fresh version of Postgres,

you can see it in, for example,
RDS performance insights.

In this case, it's worth just removing
it, but I think some guys

have such kind of workload.

It's not for everyone.

Quite rare, let's say.

It's quite rare.

But it's valid, key value, right?

Just selects.

You need to serve a lot of, like,
you need to respond very quickly

using just B-tree, index search, just
single row, super fast query,

and then you have a lot.

So just removing pg_stat_statements,
you can get more.

So this leads to the question of
how to optimize pg_stat_statements.

There is an interesting discussion
in Hacker News.

Let's just advertise it a little
bit for those who are interested.

Michael: Yeah, for sure.

Nikolay: And another point I wanted
to mention, you started this

discussion on various clouds.

Our bot knows already about price,
about cost, so we could potentially

check different machines on GCP
and bring also AWS and start

discovering, like, I want 1 million
TPS, just select where it's

cheaper, which machine, like ARM,
Intel, AMD, let's compare different

clouds, different machines, where?

It's interesting, right?

I mean, performance cost.

Michael: I think so.

I liked your transactions per second
per dollar.

TPS per dollar, I reckon you should go

Nikolay: with.

Starting from very simple trivial
workload like these selects

and then maybe to extend it to
closer to what you have and then

you can decide which machine is
better for you.

Michael: This is something I've
actually been struggling with.

I think it could be really useful
is how do you let someone know

that their workload could be done
on the next instance size down,

you know, with maybe through a
bit of optimization.

But if you give people if people
have a rough idea of their workload,

maybe they know it's all OLTP.

And they know roughly like the
average transactions per second,

what size instance should they
roughly be on?

Like just as a rule of thumb.

Nikolay: This is called capacity
planning.

Michael: I know, but also I think
so many people are so over

provisioned because they're so
under tuned.

Nikolay: Many people are under
provisioned.

Michael: For example, I have

Nikolay: cases when people like
when, yeah, the people already

hit the ceiling for specific kind
of platform, for example, for

example, Graviton on AWS or Yandex,
for example, and you have

already the biggest size.

Maybe it's time to, I don't know,
like to scale somehow or to

take different kinds of machines
because you see how big Intel

you have.

But it's not available in RDS yet.

So I see cases both under-provisioned
and maybe you're right,

it's more common.

And over-provisioned is more common.

When you have CPU, like 5% maximum,
and you pay for everything.

else, like why?

But I also see cases when it's
already hitting ceilings and it's

time to do something about it,
right?

Of course.

Maybe I see them because people
go to us for consulting and help,

right?

So they don't go for, well, sometimes
we have this, like, let's

optimize our spendings, right?

But it's like, we're mostly useful
when you have problems with

performance and how to grow further.

Michael: Yeah, well, I see the
opposite.

I think not because they come to
me, or they come to us, it's

more that I know friends, like
founder friends and things, quite

often the cheapest solution in
the early days is double the instance

size.

You know, if you're going from
$30 to $60 to $120, there's not

many engineering hours that can
go into fixing that.

You might as well just upgrade
and then worry about it later.

For example, you

Nikolay: became a big fan of Graviton,
arm, right?

And you decide, oh, it's great.

And it's like checked performance
costs, great.

But maybe it's not like it was
good a year ago, but maybe not

now.

And things change very quickly.

And different nodes, instances,
and types of instances are added

in the cloud very quickly.

So in an ideal world, we should
have some rule how to find the

best optimal choice for instance,
family, and size.

And also we should have great migration
automation to switch.

0 downtime, right?

Michael: That's the other thing.

I think people are scared of going
back down.

Nikolay: Test, test and then switch.

Everything is possible, but requires
some effort, unfortunately.

So, good.

4000000 TPS, almost.

Michael: Almost, yeah.

Or 6 million.

I'm going with 6 million TPS.

Nikolay: And maybe someone already
has this machine with almost

900 vCPUs, and they can show us
10 million or something.

Michael: I'd expect more.

Nikolay: Maybe.

10 million TPS next milestone for
Postgres on a single node.

Great.

By the way, final note, I tried
to find information about MongoDB.

I asked guys, I found only slightly
outdated benchmarks, like

3 years old or so, and I didn't
see anything above, like from

a single node.

I even didn't care about if they
have transaction mode enabled

or not, because they can cheat
here, right?

But I didn't see 4 million, I saw
only 2 million, 2.5 million

or something TPS.

That's it.

So I think Postgres is very competitive.

Michael: It's a nice area.

Nikolay: Yeah.

Okay.

Michael: Good.

Awesome.

Cheers, Nikolay.

Creators and Guests

Host

Michael Christofides

Founder of pgMustard

Host

Nikolay Samokhvalov

Founder of Postgres AI

Four million TPS

Creators and Guests

Some kind things our listeners have said