Getting started with benchmarking
Michael: Hello and welcome to
Postgres.FM, a weekly show about
all things PostgreSQL.
I am Michael, founder of pgMustard,
and as usual, I'm joined
by Nikolay, founder of Postgres.AI.
Hello, Nikolay.
Nikolay: Hi, Michael.
Michael: Today is no normal episode
though, because we are delighted
to be joined by Melanie Plageman,
who is a Database Internals
Engineer working at Microsoft and
major contributor and committer
to PostgreSQL.
It's a real honor to have you on
the show, Melanie.
Welcome.
Melanie: Thank you.
I'm excited to be here.
Michael: We're excited to have
you.
There are so many things we could
have chosen as a topic today,
but off the back of a couple of
talks you've given at Postgres
conferences, we've opted to go
for the topic of getting started
with benchmarking.
Would you mind giving us a little
background about why this is
a topic you're particularly interested
in?
Melanie: Yeah, sure.
So since I started working at Microsoft
a few years ago, a lot
of the work that our team does
is performance related.
So I would say as a Postgres engineer,
even for features that
are not performance related, you
have to make sure that you understand
their performance impact.
But we tend to work on things like
AIO and other features where
you wouldn't do them if they didn't
have a positive performance
impact.
So assessing that is a huge part
of my job.
And it's an area where every little
detail matters.
And the things that don't matter
are kind of counterintuitive.
It's kind of an art, and it's so hard to learn, like a dark art, you know.
I think I've been really blessed in that I've gotten to talk to and learn from
people who know how to do it, like Andres Freund and other people
at Microsoft and in the community, but not everyone gets to learn
from people who have 15 or 20 years of experience.
So I like talking about it and
trying to see if I can help people
get started and demystify what
I can about it.
Nikolay: Yeah, that's a good point.
So you think it can be improved, right, and people can understand benchmarks better?
What are the tools, do you think, to find solutions better
and understand bottlenecks faster and so on?
We get misleading results all the time: we think we found something,
but it's wrong, and then one day later you realize it's sometimes
completely the opposite, right?
So you think it can be improved?
Melanie: Right.
So I think one of the things that
users have going for them is
that they have a baseline understanding
of how you should configure
your database.
And some of the mistakes that I made as a beginner doing performance work,
you might not make.
So I remember like the first set
of benchmarks that I did, I
didn't configure shared buffers
at all.
This was years ago, so I give myself
a little bit of an out.
But basically, I was looking at it and saying, this is supposed
to be an improvement and it looks like it's actually hurting performance.
And I was showing Andres and he was like, well, tell me how you
configured it, and I said I didn't change any configuration.
And he was like, okay, we'll just throw all of this away.
And so some of the basic stuff
around benchmarking, I think that's
something that developers who maybe
don't have as much user experience
have to learn, but users have that.
I think one of the things that I
see happen more than you would
think is that there's some unexplained
performance impact or,
like a lot of times you're, you
develop a change, you want to
see how does it improve performance,
and when it doesn't, or
it has an effect that you can't
explain, you have to start investigating.
One of the things that people almost never do, at least performance
engineers, is look at the logs.
And maybe because it's a development branch, right, maybe you actually
had a bug and it's crashing, or there's something in the logs,
and that's what's slowing it down.
And it's such a basic thing that you don't think about.
You're looking at iostat and these more advanced tools
when we really just need to look at the log.
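For illustration, a minimal sketch (not from the episode) of that basic log check; the path and the error levels scanned are assumptions to adjust for your setup:

```python
# Hypothetical helper: scan the server log so a "fast" run that was actually
# crashing or erroring is obvious before you trust the benchmark numbers.
from collections import Counter

LOG_PATH = "/var/log/postgresql/postgresql.log"  # assumed location

def summarize_log(path: str) -> Counter:
    """Count ERROR/FATAL/PANIC lines in the server log."""
    counts: Counter = Counter()
    with open(path, errors="replace") as f:
        for line in f:
            for level in ("ERROR", "FATAL", "PANIC"):
                if level in line:
                    counts[level] += 1
    return counts

if __name__ == "__main__":
    print(summarize_log(LOG_PATH))
```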
Nikolay: Or it was fast because
it didn't do any work, right?
Melanie: Yeah.
Nikolay: And looking at logs, you
can notice it.
Yeah.
Good point.
Sometimes, even some companies, I won't name them, publish benchmarks
that claim something, and if you have 10-plus years of experience,
you quickly have some ideas about what's wrong with those benchmarks.
For example, full-text search in Postgres is bad, it shows O(N).
You think, how come?
You quickly realize it's not possible.
And you look at it and realize, oh yeah, they just didn't
build a GIN index at all.
Right?
But if you don't have experience, it's easy to make mistakes.
And my idea, like it would be good
if something would help people
to brainstorm what's wrong or what
else to look at as early as
possible.
So we need some brainstorming mechanism,
right?
What do you think?
Melanie: Yeah, almost like a checklist
or that kind of thing.
I think that's one of the reasons
why when I started out with development,
I thought, you know, with Postgres,
there's not enough basic
resources like for performance
work that give you a checklist.
But then I think I realized over time that you almost need that
one-on-one interaction with someone who has more experience,
because it's just so individual and so hard to give general advice to people.
So I think that there are basic
questions you can ask yourself.
I think a common case for a lot of users doing benchmarking is that
they're thinking about upgrading.
And so they're like, let me just see if it's slower or faster, whatever, right?
In that case, using pgbench is not necessarily going to help you,
because first of all, benchmarking what you actually want to
benchmark with pgbench or any benchmarking tool is hard.
And second of all, what if that's
not like your workload at all?
You know, so let's say that in
Postgres there were changes to
the way that vacuum works and you
run a benchmark for less than
a few minutes.
Well, like, did you even end up
having anything be vacuumed?
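As a hedged illustration of that point, one way to confirm whether vacuum did anything during a run is to snapshot the vacuum counters before and after; the DSN here is hypothetical:

```python
# Sketch: diff pg_stat_user_tables counters around a benchmark run to see
# whether any (auto)vacuum work actually happened. Connection string is made up.
import psycopg2

def vacuum_counts(dsn: str) -> dict:
    """Return per-table vacuum/autovacuum counters."""
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        cur.execute("""
            SELECT relname, vacuum_count, autovacuum_count, last_autovacuum
            FROM pg_stat_user_tables
            ORDER BY relname
        """)
        return {r[0]: r[1:] for r in cur.fetchall()}

# Usage: take one snapshot before the run and one after, then compare.
# If nothing changed, the run never exercised the vacuum behavior you care about.
before = vacuum_counts("dbname=bench")
# ... run the benchmark ...
after = vacuum_counts("dbname=bench")
```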
You almost have to think about
what do I actually care about?
So if you're a user who's worried about upgrading, I guess
you guys would know better than me, but you'll probably have
options like running your real workload, maybe in non-production,
on a newer version and seeing how it is.
And that's the ideal scenario.
But if that's not an option for
you, then I think what you have
to do is really think about what
are the highest risk areas for
me?
Like, do I have one critical report
I have to run?
Or, you know, I need to make sure
that I'm able to reach this
TPS at this time of day, or whatever
it is.
And then you have to have a way
to reproduce that.
Like you almost have to have a
minimal repro for the performance
question that you're asking and
be able to test it.
And I would say even then, like,
I don't know if you're gonna
get a perfect answer outside of
actual production environment.
Like, of the use cases that users have for benchmarking,
I haven't heard of one where I think it's going to work out for them.
But I don't know, maybe you can tell me what people want to use it for.
Nikolay: Let me shamelessly ask your opinion about the approach I
ended up with for upgrades with our customers, big and small,
doesn't matter.
So I was a big fan of replaying workload like 5 years ago, but
we don't have tooling for it.
And mirroring, PgCat, I think, has mirroring, but it's hard.
It has to sit between clients and server, and that's very critical
infrastructure.
Usually pgBouncer is there, and you cannot just quickly replace
it with PgCat, and there's overhead and so on.
And mirroring is actually a tricky thing too.
So what I ended up doing is splitting it into two things.
The first thing is general load testing with pgbench.
And usually that's already been done by hackers and so on, like
performance farms or something.
And we can check it additionally with some synthetic workload,
pgbench or sysbench, doesn't matter.
So this is one thing.
And usually we rely on community
results or we can extend them
as well.
But as for our own workload for our project, instead of replaying
the whole thing, we think, okay, we forget about the lock manager, buffer
pool, and so on, all those components, and we care only about
plan flips.
We want just to check the planner
behavior, and we just grab
query examples, like 100 of them,
which are most critical in
terms of total time or calls or
something from pg_stat_statements.
Sometimes we collect multiple examples
for one normalized query
in pg_stat_statements.
And then this is our testing set.
And then we don't actually care about the machine we use.
It can be a smaller machine.
We just have an exact clone of the production database to have the same
data and pg_statistic.
And then we also adjust settings, all the planner settings and work_mem,
which is not a planner setting but also affects planner behavior.
And then we just reproduce plans, upgrade, reproduce plans again,
and compare and see if some plans changed.
And we also use BUFFERS, of course; we usually look at costs and BUFFERS.
We don't care about timing, because it can be a smaller machine
and caches can be different.
And if some plan flip occurs, we quickly find it.
And it can happen on a small machine, in a shared environment,
basically.
pgbench is not needed for a plan flip check.
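For readers who want a concrete picture, here is a rough, hypothetical sketch of that plan-comparison step; the file name, DSNs, and the very crude "flip" test are all assumptions, and a real setup would also pin planner settings and work_mem to production values:

```python
# Sketch: collect EXPLAIN (FORMAT JSON) plans for the same query set on two
# clones (old and new major version) and flag queries whose plans changed.
import json
import psycopg2

def plans(dsn: str, queries: list[str]) -> list[dict]:
    """Cost-only EXPLAIN for each query; nothing is executed."""
    out = []
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        for q in queries:
            cur.execute("EXPLAIN (FORMAT JSON) " + q)
            raw = cur.fetchone()[0]
            plan = raw if isinstance(raw, list) else json.loads(raw)
            out.append(plan[0]["Plan"])
    return out

def flipped(old: dict, new: dict) -> bool:
    """Very crude comparison: did the top plan node change?"""
    return old["Node Type"] != new["Node Type"]

queries = [q for q in open("queries.sql").read().split(";") if q.strip()]
before = plans("dbname=clone_pg15", queries)   # clone on the old version
after = plans("dbname=clone_pg16", queries)    # same data on the new version
for q, b, a in zip(queries, before, after):
    if flipped(b, a):
        print("possible plan flip:", q.strip()[:60])
```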
Melanie: Right.
Nikolay: And that's it.
Melanie: Yeah, I think if I understand
correctly, at some point
you use pgbench, and then you realize
that it wasn't actually
going to help you with this particular
case.
Nikolay: It's super hard to reproduce
actual workload.
Melanie: Yeah, so I support that.
I think really, honestly, what you're doing, looking for plan flips
and starting with pg_stat_statements, is probably the best thing
that you could do to try to make sure that when you upgrade,
it's going to work out for you.
And I think that's one of the only things that you can realistically
reproduce, because the planner doesn't take into account other things
that are happening.
Each query is planned in isolation with the statistics and the data
that you're talking about.
It's not looking at overall system load.
I mean, maybe we should do things like that, but the planner doesn't
do that right now.
So it's actually possible to reproduce that.
So I think that's a really good
sort of precautionary thing that
you can do to see if the next version
of Postgres is going to
cause you problems.
So yes, I sign off on that.
But that makes sense as a strategy.
Nikolay: Before that, we were using pgbench, and we took the most
frequent and most time-consuming queries from pg_stat_statements
and called it a crafted workload.
We tried to find some parameters or examples of queries and
put them into pgbench using -f, and then the @ sign to weight them,
to have some balanced workload, and then ignore some errors if they
happen, like foreign key violations or something.
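As a hedged sketch of what that looked like, something along these lines, with made-up file names, weights, and database name:

```python
# Sketch of a "crafted workload": one file per captured query, weighted with @
# so the mix roughly matches production call ratios.
import subprocess

scripts = {"top_select.sql": 8, "top_update.sql": 2}  # weight ~ share of calls

cmd = ["pgbench", "--no-vacuum", "-T", "300", "-c", "16", "-j", "4"]
for fname, weight in scripts.items():
    # pgbench runs custom scripts given with -f; "@N" sets the relative weight
    cmd += ["-f", f"{fname}@{weight}"]

subprocess.run(cmd + ["mydb"], check=True)
```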
I cannot say it worked well.
It helped, but then I just realized,
why do we care about buffer
pool and so on?
Let's just split this.
Melanie: So you were doing that before upgrading to see if you
would get the same results.
Nikolay: I stopped doing this because it's hard, actually, and time
consuming.
And if we care about plan flips,
that's it.
We have a recipe.
If we care about improvements in
some components of Postgres,
we can research them using some
simple workloads like pgbench.
That's it.
Melanie: Right.
Yeah.
You really can't use pgbench to
replicate real workloads because,
so for example, if I want to analyze
a patch that I'm doing and
I want to replicate some scenario,
I might run multiple instances
of pgbench at the same time.
So I run 1 that's doing a SELECT
query, and another 1 that's
doing a transactional workload.
And you can combine different pgbenches,
and you can sequence
them, and that kind of thing.
But ultimately, each pgbench is
going to do the same thing, no
matter what kind of custom script
you provide.
So there are different things you can do, like have variables in the script
and then interpolate random numbers and things like that.
But in a real system, you'll have some kind of work happening
and then another, different kind of work happening, interspersed
or at unpredictable intervals, and with pgbench, during a run,
all of the workers are going to be doing the same thing,
whatever's in the pgbench script.
So it's very, very hard to replicate realistic scenarios with it.
And there's lots of creative things
you can do.
Like you can have some if statement
logic where you're like,
okay, if this worker is this worker
number, then do this other
thing.
But at that point, I think it's not useful if you're trying to
understand what the performance impact is going to be for your
real-world workload.
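For illustration, a small, hypothetical example of that kind of script, using pgbench's random() function and \if branching on the automatic :client_id variable; it assumes standard pgbench tables and made-up thresholds:

```python
# Sketch: write a custom pgbench script where the first few clients only read
# and the rest apply small updates, then run it for 60 seconds.
import pathlib
import subprocess

SCRIPT = r"""
\set aid random(1, 100000)
\if :client_id < 4
    -- the first few clients act as "readers"
    SELECT abalance FROM pgbench_accounts WHERE aid = :aid;
\else
    -- everyone else applies small updates
    UPDATE pgbench_accounts SET abalance = abalance + 1 WHERE aid = :aid;
\endif
"""

pathlib.Path("mixed.sql").write_text(SCRIPT)
subprocess.run(["pgbench", "-n", "-T", "60", "-c", "8", "-j", "4",
                "-f", "mixed.sql", "mydb"], check=True)
```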
Like I can see users maybe using
it to figure out, to understand
a particular quirk or something
like that.
Or maybe if you're thinking about
changing your hardware, using
a different SKU from the same vendor
or something like that,
you can try to understand how it
might affect general Postgres
workloads.
And then it could be helpful.
But pgbench is mainly a developer tool.
And I mean, that's what it's marketed at, not that it's really marketed,
but I don't think it's represented to anyone as a tool for
users to understand the Postgres performance of their workloads.
But with all the other benchmarks that are out there, I think
pgbench kind of gets lumped in with them.
And it's not really useful for anything other than development.
In my opinion, it's very hard to make it useful for comparing,
say, Postgres to MySQL or that kind of thing, because you can't use them.
Nikolay: For different hardware options, for example, it can be useful.
Melanie: Yeah, you can use it for
that for sure.
Yeah, but it definitely serves a different purpose than the
TPC benchmarks, for example, which allow you to compare across,
you know, databases.
And then of course we also have
a community of people that are
doing performance testing from
version to version of Postgres
that's different than just testing
1 patch and saying, what is
this patch doing?
They're like sort of looking for
broad themes and regressions
across versions.
And you can use pgbench for that,
but ultimately, I would say
it's easier or more meaningful
to do something like run the TPC
benchmarks when you're looking
to see if Postgres has a regression
from 1 version to another for the
purpose of community work.
Michael: You mentioned we've got
a community of people doing
that.
I've seen a couple, but I don't
know of many.
Do you have any names or things
I could look into?
Melanie: So I think the benchmarker that I follow most closely,
and that many people follow, is Mark Callaghan.
Michael: Yeah, great.
Melanie: Yeah, so he's great.
And he actually came to PGConf.dev
this year and I got to meet
him.
He's kind of one of my I-was-starstruck-when-I-met-him people,
you know, and he's super nice.
Nikolay: I like his blog's name,
Small Datum.
Melanie: Yeah, yeah.
And he's great because he's an engineer who has a ton of experience
doing performance work over a long time, mainly in the MySQL community.
And with benchmarking, describing your methodology is so important,
so he does that in great detail.
And he's active and doing these benchmarks all the time.
He's providing a lot of really useful information for the community
around, you know, version-to-version comparisons of Postgres,
different types of benchmarks, on different sized machines
and that kind of thing.
So he's great.
And if you want to learn more about
benchmarking, I would say
his blog posts are fairly advanced.
So like, if you're getting started,
it might not be the best
resource.
I don't know.
But I still am kind of like, okay,
let me read through this.
And you know, he's getting down
to looking at how does your CPU
frequency governor affect your
benchmark results, right?
So it's definitely past configuring shared buffers, for sure.
So he's great.
And then within the Postgres community, there are developers like
Tomas Vondra and of course Andres Freund who do benchmarks
when they're developing, and then usually they'll look at the
release holistically.
But most of those results are just getting posted on hackers,
because it takes so much time to do what Mark does: describe your
methodology, double-check every result, investigate every discrepancy,
and then publish it and be accountable for it, because people
will come after you, they'll be like, no, that is not a regression,
or whatever.
You have to be ready to defend your work.
And it's almost like a full-time job in itself.
So a lot of the benchmarks that
get done within the community
are people
doing them and then just saying,
hey, I found this.
Can anyone else look into it?
There's also Alexander Lakhin, who runs TPC-DS and I think TPC-H
as well every release and does a lot of investigation.
And he's great because he does
so much investigation of bugs
and like reporting of bugs and
reproducing very hard to reproduce
bugs.
So if he finds some sort of regression,
performance regression,
he'll also bisect it down to the
commit that actually caused
it.
And as a developer, it's nice
to have someone doing that
level of analysis.
And so again, he's not necessarily publishing
a methodology post and all of that.
Then there are some people that maintain the kits, because if you
want to implement the TPC benchmarks, it's not like you just run them;
it's more involved than that.
There are people that maintain different kits, like the bash
scripts and stuff.
And I think Mark Wong just did a new one for TPC-DS.
So he has a lot of database internals
experience.
And then he was involved with the
TPC council for a while.
And so he kind of is an expert
on just the TPC benchmarks.
And so he tends to put together
these kits to try to help people
because it's not that straightforward.
And then I know there's some ongoing
work with the Postgres performance
farm initiative, but it's really
hard, right, to agree upon what
is a useful benchmark in general
to run.
So you can find some regression or whatever, but actually proving
that it's correct or valid is a lot of work on the part of everyone else.
So I think in some ways, publishing
benchmarks, you can't just
say I set up this farm and I ran
all these benchmarks and now
I expect people to go look at them.
As a benchmarker, you're kind of
accountable for the analysis
and trying to find out if your
results are valid.
Nikolay: Practically useful.
One day, my team found a bottleneck in the WAL sender for logical replication.
It was so easy to reproduce: you just create some transactions
with deletes and roll them back, and at some point, very quickly,
you reach a hundred percent of CPU for the WAL sender.
And it led to bad conclusions, that in production we won't be
able to use logical for many things, but it turned out that this
kind of workload doesn't happen in the wild at all, because it's
a very specific kind of workload.
So I guess this bottleneck is maybe not worth fixing right now, right?
I mean, it's interesting.
Melanie: Yeah, you can definitely
report things like that on
the hackers mailing list or on
performance mailing lists and
have a discussion about, you know,
if it's kind of the right
thing to focus on and also, you
know, if it's reproducible.
Nikolay: I had some discussions with some hackers about this
and didn't find agreement that it's worth fixing.
And then I realized that in production we don't see such a bottleneck
at all.
So I just postponed this research, maybe for the future.
What do you think about observability tools to use in benchmarks
of features?
What tools, like extensions or additional tools and so on, do you
look at?
And how do you avoid the observer effect as well, right?
Melanie: Yeah, I mean, I tend not to worry too much about the
observer effect, depending on what tools I'm using, as long as
you don't have something like log_min_duration_statement set to 0.
But if you're comparing to another version of Postgres and you're
using the same observability tools, it's sort of the price you
have to pay, some overhead.
But every person that does performance
work is different.
When I started, I made the mistake
of trying to like generalize
and make some of my tools useful
to other people and no one wants
that.
So I learned that the hard way.
But personally, I find it very hard to look at aggregated
numbers at the end of a benchmark run and make sense of them.
That might be what you present to other people, because
they don't want to look at every detailed graph or chart that
you produced.
But while I'm doing the work, I can't do it at all without visual
representations and charts and things like that.
So for the actual charting, I use Matplotlib and pandas and that
kind of thing.
But I have all sorts of scripts that do things, because I use
a lot of data input sources.
So, depending on the patch, I'll query pg_stat_io, or pg_stat_activity,
or pg_stat_all_tables, and gather certain information every 2 seconds
or every 1 second.
And then I have scripts that take this output, basically make
CSVs out of it, and load it into pandas data frames.
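A minimal sketch of that sampling loop, under assumptions (the DSN, interval, and output path are invented; autocommit is enabled so each poll sees fresh statistics rather than a per-transaction snapshot):

```python
# Sketch: poll a statistics view every couple of seconds, write CSV,
# then load it into pandas afterwards for charting.
import csv
import time
import psycopg2

def sample(dsn: str, out_path: str, seconds: int = 60, interval: float = 2.0):
    conn = psycopg2.connect(dsn)
    conn.autocommit = True  # avoid serving cached stats from one long transaction
    cur = conn.cursor()
    with open(out_path, "w", newline="") as out:
        writer = None
        deadline = time.monotonic() + seconds
        while time.monotonic() < deadline:
            cur.execute("SELECT now() AS ts, * FROM pg_stat_io")
            if writer is None:
                writer = csv.writer(out)
                writer.writerow([d[0] for d in cur.description])  # header row
            writer.writerows(cur.fetchall())
            time.sleep(interval)
    conn.close()

# Afterwards: df = pandas.read_csv(out_path, parse_dates=["ts"]) and plot.
sample("dbname=bench", "pg_stat_io_samples.csv")
```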
So for example, I've been working on vacuum recently, and one of
the things that you consider when you're doing vacuum performance
work is how often vacuum work is being done.
So you might want to look at the wait events and see whether
autovacuum workers are waiting on the vacuum delay wait event a lot
and not actually doing the vacuuming work, and whether you have the
right autovacuum settings to actually be able to observe the thing
you're trying to observe.
So, you know, just gathering that
information and then plotting
it correctly and that kind of thing.
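As a hedged example of that wait-event check, something like the following query of pg_stat_activity, with an assumed DSN, sampled repeatedly during the run:

```python
# Sketch: are autovacuum workers mostly sleeping on the cost-based delay
# (Timeout / VacuumDelay) instead of doing useful work?
import psycopg2

QUERY = """
SELECT wait_event_type, wait_event, count(*)
FROM pg_stat_activity
WHERE backend_type = 'autovacuum worker'
GROUP BY 1, 2
"""

with psycopg2.connect("dbname=bench") as conn, conn.cursor() as cur:
    cur.execute(QUERY)
    for event_type, event, n in cur.fetchall():
        # Lots of Timeout/VacuumDelay samples over time suggests vacuum is
        # throttled rather than actually pruning or freezing much.
        print(event_type, event, n)
```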
Another thing: because Postgres uses buffered I/O right now,
I tend to use external tools for I/O observability, like iostat,
because you can't actually tell with the reads and writes
in the Postgres statistics, since those could be reads or writes to the
kernel buffer cache.
You could be reading from there, and that's obviously going to
be quite different from actual reads from disk.
So I use iostat, depending on what I'm doing.
And then there are also different files, well, I use Linux,
so there are different CPU statistics or memory usage statistics
that you can get.
So a lot of what my scripts do is just read whatever files
in the file system have that information, over time
at different intervals, so that I can see, okay, what's the memory
utilization, or what kinds of effects is this patch having on
my resource utilization?
So that's a source that I use.
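A small sketch in the same spirit, reading memory statistics from /proc on Linux; the fields and interval are arbitrary choices:

```python
# Sketch: sample a few /proc/meminfo fields alongside a benchmark run.
import time

def meminfo() -> dict:
    """Parse /proc/meminfo into a {field: value-in-kB} dict."""
    out = {}
    with open("/proc/meminfo") as f:
        for line in f:
            key, rest = line.split(":", 1)
            out[key] = int(rest.strip().split()[0])
    return out

for _ in range(5):
    m = meminfo()
    print(time.time(), m["MemAvailable"], m["Dirty"], m["Writeback"])
    time.sleep(2)
```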
And then I also use some different
Postgres extensions, again,
depending on the patch.
So pg_buffercache, again, I just query it at intervals and then see,
of the shared buffers that I have, how many are pinned.
And it helps me to see how the
patch is affecting the behavior
of shared buffers and utilization
there.
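For illustration, a hypothetical version of that pg_buffercache poll; it assumes the extension is installed (CREATE EXTENSION pg_buffercache) and an invented DSN:

```python
# Sketch: summarize shared buffers by pinned, dirty, and used counts.
import psycopg2

SQL = """
SELECT count(*) FILTER (WHERE pinning_backends > 0)    AS pinned,
       count(*) FILTER (WHERE isdirty)                 AS dirty,
       count(*) FILTER (WHERE relfilenode IS NOT NULL) AS used,
       count(*)                                        AS total
FROM pg_buffercache
"""

with psycopg2.connect("dbname=bench") as conn, conn.cursor() as cur:
    cur.execute(SQL)
    print(dict(zip([d[0] for d in cur.description], cur.fetchone())))
```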
Basically all of the statistics views are fair game.
And then there's the output from pgbench itself, so I parse that.
There's the progress output, where pgbench does some averaging
and that kind of thing.
And I also parse the execution reports that come out, they call each one
a transaction, which is very confusing; it's literally just an execution
of the script by a worker.
So you might get an average TPS, but because of how it's calculated,
that might not actually show you the P99 latency of an individual
execution of one of the scripts.
So I typically parse all the execution reports, there's one line
for each execution, and then do various analyses of that and plot them.
So that's an input, and there are always new sources.
It's funny, I'll bring some analysis challenge and say, I haven't figured
out why there's this weird spike in I/O at this one particular time.
And then he's like, well, here's this other input source you
could use to investigate this.
And I'm like, I have 55 input sources, what?
How is it possible there's something else that I could look at?
But I guess like everything is
caused by something.
And so it's possible to find out
what it is, I guess, if you
look hard enough.
Nikolay: Right.
Don't you feel like it would be so good to have P99, P95 in pgbench?
It would be so good, right?
I'm also using pgbench all the time, like 10-second, 30-second runs.
It's so great to see how caches are filled and so on,
but there's a lack of percentiles.
Melanie: Yeah.
Yeah.
So if you do get the execution reports, you can just read them in
with some Python tooling.
Nikolay: Execution reports, is that -r or...?
Melanie: It's --log, and then you provide a log prefix.
Yeah.
So that gives you the actual time of each execution of the script.
Nikolay: So this is slow, no?
Melanie: Well, I mean, it's not
free.
Yeah, that's true.
But I think if you wanted the P99
latency, I'm sure there's statistical
methods for getting it without
recording every execution, but
like you would need something like
that.
Nikolay: Yeah.
Where do you put it? Do you use an actual file there, or some RAM
disk, or memory or something?
Melanie: I just, I use tmpfs and
then when it's over, I copy
it somewhere to a disk.
Yeah.
Nikolay: Good idea.
I will add it to my tool set.
Melanie: Yeah.
At first, I was pretty apprehensive about doing it because it
sounded like a lot of work, but once you actually have the executions,
you can literally just use the statistics module in Python:
you give it the quantile you want and it just calculates it.
It's 2 lines of code, or 3 or 4, whatever.
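A minimal sketch of that, assuming the default per-transaction log format from pgbench --log, where the third column is the latency in microseconds; the file glob is an assumption:

```python
# Sketch: parse pgbench per-transaction log files and compute tail latencies.
import glob
import statistics

latencies_ms = []
for path in glob.glob("pgbench_log.*"):
    with open(path) as f:
        for line in f:
            fields = line.split()
            latencies_ms.append(int(fields[2]) / 1000.0)  # latency column, in us

# statistics.quantiles with n=100 returns the 1st..99th percentile cut points.
pcts = statistics.quantiles(latencies_ms, n=100)
print(f"p95={pcts[94]:.2f} ms  p99={pcts[98]:.2f} ms  n={len(latencies_ms)}")
```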
Nikolay: Yeah, and you mentioned pg_stat_io.
This was great when I first saw it; this should have existed many
years ago already.
We discussed it with Michael, and here on Postgres.FM we often discuss
the lack of BUFFERS in plans, for example, right?
In individual query plans.
We try to advertise that it's so important to have BUFFERS in execution plans.
And this is the macro level, the high level, so it's good to have it, right?
But unfortunately, I cannot say I've used it, because it's only in
Postgres 16, and we don't have that in most production systems yet.
But maybe next year I will start seeing it in production with
our customers.
Thank you for doing this.
My question is how you came up
with this idea originally?
Melanie: Well, I can't take credit
for it.
So Andres has done a lot of performance
work for years and years.
So when I joined Microsoft, one of
the first projects he suggested
was to add some view that would
improve observability of I/O.
And at that time, we didn't have
the shared memory stats system.
So, the patch was much bigger at
that point because of the way
the stats collector worked.
So I did an early version of it,
and then I used it in my own
development, and I ended up working
on some of the AIO work and
found it really useful for that.
And then I ended up doing some
review of the shared memory stats
patch.
And after that, it made the pg_stat_io
patch much smaller and
simpler.
So then I went back to it.
But I honestly, like For me, I
just looked at it as this helps
me as a developer doing benchmarking.
So it's cool that users like it
because I don't have that perspective
really.
Michael: Obviously it's useful
for benchmarking itself, but I'm
wondering about the other way around,
like, did you have to look
at its overhead and how did you
go about that?
Melanie: Yeah.
So it was much easier, like I said, once we had shared memory
stats, because then it didn't need any extra infrastructure;
it's basically just another data structure, a couple of other data
structures, and then everything works the way that it does now.
We already had some counting for different kinds of reads and
writes, because that's what EXPLAIN ANALYZE and that kind of thing use,
we had those counters.
And then there were some reads and writes in pg_statio_all_tables,
one of the views that has some I/O things.
And then there was some for pg_stat_statements.
And then in the other places where we didn't, there wasn't too much
risk, because if you're adding some timing for, you know,
the background writer or checkpointer, those are processes
where it's okay to add a tiny bit of overhead.
So you didn't have to think too much about it.
I think the hard thing was actually
finding all of the places
that we could possibly do IO for
the types of IO that we were
representing and then thinking
about how to categorize it.
And especially for the different
buffer access strategies.
So like that's one of the things
that was completely not surfaced
to users was how buffer access
strategies work and how they reuse
buffers.
And so for copy and vacuum and
things like that, literally you
have no information about that.
So I think just figuring out how to represent all of that, that
was probably the biggest part; in terms of performance impact, that patch
required the least performance work, just because there was a lot of
instrumentation already, it just wasn't complete, if that makes sense.
Michael: So you're saying like
we're already paying a lot of
the overhead, like we're running
Postgres 15, for example, we're
doing a lot of that counting anyway.
So storing it, but like also displaying
it in an extra view,
there's very little added in actual
instrumentation.
Okay, cool.
Melanie: Yeah, I mean, for backends that were doing queries,
we had counting of reads and writes and hits in most places.
So the places where it was sort of net new were mostly for types
of processes where we don't mind a little bit of overhead,
if there was any.
And it would be so small.
I mean, it's measuring I/O, right?
So if you're doing I/O, the cost of that will cover
any sort of overhead of instrumentation for the most part.
Nikolay: Unless it's hits; actually, it also counts hits.
And I found in some cases, for example, monitoring systems try
to avoid storing these metrics at all, for example from pg_stat_statements.
But I had some incidents where we lacked hit numbers.
But pg_stat_io has hits, right?
I'm looking at the documentation just to confirm.
Melanie: Yeah, it does have hits.
Nikolay: And this might be kind of an overhead of counting.
This might be interesting, because it doesn't have real I/O, I
mean, in terms of disk.
Melanie: Yeah, it actually was already being counted.
We had hits somewhere else, in one of the other views, or maybe it
was for pg_stat_statements, I'm not sure.
So that was already kind of being counted.
And when you go to get something from shared buffers, there's
locking and all sorts of other things that happen that are expensive
enough that it was pretty negligible.
Nikolay: Nice.
Yeah, okay.
But are there some benchmarks proving that it didn't add overhead?
Or like just...
Melanie: Wait, you want to see
my receipts?
Nikolay: Yeah, yeah.
I'm just curious.
Is it just a general feeling that it doesn't bring overhead, or
is there solid proof of it in terms of benchmarks?
Melanie: That's a good point.
It's mostly based on a feeling.
So, like...
Nikolay: I'm curious, is it worth
checking this, for example?
Because maybe we can strip it out
somehow or compare 15 versus
16 in various cases and see if...
Melanie: I would look at those benchmarks, but I guess my gut
feeling would be that, because when track_io_timing is on we were
already doing a lot of the measurements, it wouldn't be much different.
Nikolay: But actually, I can answer myself.
We have benchmarks with insane numbers.
Melanie: Oh, you do? Okay.
Nikolay: No, no, I mean, 15, we compare all versions and we try
to reach as many TPS as possible.
We achieved how many, Michael? 4 million?
Michael: Nearly. 3.75, I think.
Nikolay: Yeah, just select-only pgbench, right?
And we didn't see any degradation
for version 16 at all.
It's already some kind of proof
that there is no overhead here.
And we have wait events and flame
graphs and so on.
We would notice it.
Michael: But Nikolay, you wouldn't have had track_io_timing on.
Nikolay: I want to check it.
And maybe repeat it.
No way.
Yeah, I will check.
Michael: If you did, you should
repeat again.
Maybe you'll get 4 million.
Melanie: Okay, I'm checking right
now.
Nikolay: Good point.
Michael: But yeah, this is great.
It's super interesting: how do you choose which things to benchmark
and which things not to, then?
Melanie: Yeah.
So, developing intuition around that...
The first thing that I worked on professionally as a software engineer
was Greenplum, which is a fork of Postgres.
So almost all of my software engineering experience comes from
being a Postgres engineer.
I was pretty new to software engineering,
right?
So I think a lot of software engineers
develop intuition around,
like, you know, if you take a lock
every millisecond versus every
30 seconds, like, which 1 of those
is okay, or whatever, right?
Like, it obviously depends on the
context.
But I think there's a lot of sort
of performance intuition that
people develop just from being
software engineers.
And that's sort of like the first
step of, you know, thinking
about, okay, well, is it okay for
me to take an exclusive lock
here?
Do I need to find a lock-free design
for this or whatever it
is?
And that's when you have something
where you had a performance
question and you said, is it okay
that I do X, then you have
to think about, okay, how would
I benchmark this?
And in a lot of cases, maybe not
benchmark, how would I profile
this?
How would I evaluate or microbenchmark
this or something like
that?
So, for example, in 17, I worked on some refactoring of heap pruning,
which is kind of scary for different reasons, because of
data corruption stuff.
But from the perspective of performance, because we do on-access
heap pruning when you're doing a SELECT query or reading in the buffer
to do an update, changing that code, adding any sort of additional
instructions, potentially has overhead.
So of course, like if you're doing
actual heap pruning, then you
have IO from the WAL and you have
IO from writing out the dirty
buffer and that kind of thing.
And so you have to think about: is the thing that I'm changing
actually going to matter in the context that it's in?
And then I think evaluate from
there.
And also, how important would it be if there was a difference, right?
On every SELECT query, if it's a little bit more expensive,
that's obviously bad, even if it's only a tiny bit more expensive.
So Heikki and I did a lot of microbenchmarking around those
changes to make sure that they seemed right.
And then you have to design the
benchmark so that it's actually
exercising what you think might
be the bottleneck, which is its
own challenge.
And right now I'm working on something that has to do with improving
eager freezing so that you freeze more data, so VACUUM freezes
more data sooner, and there's a bunch of different pieces to it.
But one of them is a new sort of responsibility that the background
writer would have for maintaining a new data structure that contains
some new statistics.
And one of the questions I got in review was, have you profiled
this to make sure that it doesn't have an overhead?
And my first reaction was, but it's the background writer.
No one cares about the background writer's performance, you know?
It's different from adding some overhead on every SELECT query.
But again, it was doing something
that was like more, I'd say,
heavyweight.
Like it's doing some calculations
that could take time and doing
them while holding an exclusive
lock.
And so even though it's Background
Writer, like, it's worth proving
that, you know, the number of extra
instructions, for example,
is tolerable.
And that's a judgment call.
But what you want is a paper trail,
too, that you, like, did
your due diligence.
Which is why it was funny when
you were asking me about pg_stat_io's
performance impact, because in
that case, I kind of thought,
it's negligible compared to the
fact that you're doing IO first
of all, and second of all, we had
a lot of that instrumentation
there.
But for most other things, like,
I haven't had a patch other
than that probably in the last
couple years where there wasn't
a discussion of what the performance
impact could be and then
benchmarking or profiling.
Nikolay: Nice.
Yeah, Michael, it was on.
track_io_timing was on.
Michael: Well, that's good news
for Melanie.
There's potential for 4 million.
Nikolay: Yeah, and we didn't see a difference between 15 and
16, so I mean...
Wow, wow, wow.
Yeah, I already asked our robot to double-check with track_io_timing
off, so we'll see that as well.
Interesting.
I hope we'll have more TPS.
I'm curious what you are working on right now, and the direction of
future work, ideas, and so on.
Melanie: Yeah, so I mean, I started
a project last release that
had to do with reducing the amount
of WAL emitted by vacuum.
And that was like kind of a refactoring,
but it spun out a lot
of other projects.
And I think I sort of wanted to
take a break from vacuum this
release, but it didn't work out
that way.
So I’m working on a couple of different
things around vacuum.
And some of them are some leftover
pieces around combining WAL
records that vacuum used to emit,
like up to 6 WAL records per
block.
So we've combined a few of them
and there's more to do.
So that's 1 of the things, but
sort of the more probably exciting
work is around the freezing work.
So that was basically born out of work Peter Geoghegan did a few years
ago, I think it was 16, to have us do more eager freezing.
So one of the things we hear a lot
from users is they have, say,
insert-only tables and none of
the data gets frozen.
And then all of a sudden they have
anti-wraparound vacuums and
then eventually their system becomes
read-only.
But even before the anti-wraparound vacuums, if you don't have
any actual pruning or vacuuming to do, it can be expensive
from an I/O perspective.
And so it can slow your whole system
down.
And so being more aggressive about
freezing data sooner was something
that people had been asking about.
And so Peter worked on that and
it's really difficult to find
a heuristic that works for determining
whether or not to freeze
something because if you have a
table where you're doing a lot
of updates, like a standard pgbench
built-in workload, then
it's a waste to freeze things because
freezing emits WAL and
it dirties the page and that kind
of thing.
So finding a heuristic that freezes
the stuff you want to freeze
and doesn't freeze the stuff you
don't want to freeze is then
a journey.
And one of the parts that's really
hard is that you have to try
to come up with all of the workloads
that are important, the
best case and worst case, especially
the worst case workloads
that users might have and think
about how it would perform.
And that's so hard.
So that took a lot of time.
And I spent a lot of time improving
my different benchmarking
setups and coming up with these
different workloads and evaluating
them, looking at them.
So that's one thing I've been working
on.
And then last release, there's the new read stream API that Thomas Munro
did, which does vectored I/O, does larger reads.
And I worked on the different users of it.
So, sequential scans and so on, and I had some additional users
that didn't go into 17 because they have performance effects that
you have to analyze really closely and they weren't ready.
Nikolay: And one of them was...?
Michael: Did you use the word analyze deliberately there?
Was it analyze?
Melanie: Yeah.
No, no, analyze went in actually.
Michael: Oh, cool.
Nice.
Melanie: That was the laws work.
Yeah.
So analyze went in, but a bitmap
heap scan and vacuum didn't.
Michael: Got it.
Melanie: So vacuum is an interesting one, because streaming vacuum
sounds great, but vacuum uses a ring buffer to keep it from having
too much of a negative effect on shared buffers, so you don't wash
out all of the things from your workload that are resident in
shared buffers.
It reuses buffers, which means that it typically has to write WAL,
and there's a lot of calculation around how big the ring is,
such that the WAL writer can help vacuum write out the WAL,
so vacuum's not doing all of it.
And if you do I/O and if you do
reads in a different size, that
means that you're dirtying and
needing to clean buffers at a
different rate.
And that actually was affecting
how many individual WAL writes
and syncs that we would do.
So in a given fsync, you can have different amounts of data,
but each fsync is a system call and has overhead, right?
So you want to amortize that cost
over a large enough amount
of data and also have WAL writer
help to do some of the work.
So by doing bigger I/Os, we are
actually making the performance
worse in some cases because of
these interactions with additional
WAL syncs.
So Thomas Munro has been working on, I think he's calling
it streaming write-behind.
It's kind of a way of thinking, eventually for the purposes of AIO,
about how backends and maintenance processes and writes in general
in Postgres can make sure that, if they're doing larger writes,
they're actually being conscious of when they should do those writes,
based on the WAL that you're required to emit.
So like if you're doing reads,
you don't have to really think
about that.
But if you're writing out data,
the WAL associated with those
dirty buffers has to be flushed
first.
So you have to think about when
do I wanna do writes?
Because you need to think about
when do you actually want to
do the WAL writes.
So he's been working on that and
it's a tough problem.
A lot of benchmarking, a lot of
thinking about it and buffer
access strategies really complicate
it.
And those are used for large selects.
Those are used for, you know, copy,
vacuum.
So he's doing some work around
that and I'll probably sort of
jump in after some of this vacuum
stuff is done and try to work
on some of the AIO work that's
coming up.
And I think that ultimately, like
just between those things,
that'll probably be most of what
I end up doing on my own.
But there's something I would like to see: for pg_stat_io, there's been
a patch to add WAL-related I/O statistics to it that's been around
for about a release.
And we basically haven't come to an agreement on what the right thing
to do is, because, like, our thought was that, since you can configure
your block size for Postgres and you can configure it for WAL,
the WAL stats would be in units of the WAL block size.
But that actually isn't always the unit that we do reads and writes
in exactly, though it usually is.
So now we're actually talking about rejiggering pg_stat_io,
especially in light of vectored I/O, to change it.
Instead of, you know, having a thousand writes and the block size
and multiplying them to get the number of bytes, do we change it
to represent it differently and put it just in the number of bytes?
And how do you represent that?
If you're doing that, do you want the number of writes to be
the number of system calls?
So we were just thinking about
how to actually represent it,
but I would love to see WAL stats
go into pg_stat_io in 18 also.
So we'll see.
And there's also a bunch of other
exciting things going on, like
Masahiko Sawada is working on parallelizing
the first and third
phases of vacuum.
So he did that really exciting work in 17 on the TID store,
I think that's my favorite feature from 17.
Nikolay: Can you explain it to us?
Michael: Yeah, I don't know that one.
Melanie: Yeah, so one of people's biggest complaints
about vacuum is when they have to do multiple rounds of index
vacuuming, because even if you set maintenance_work_mem
to a high value, you might end up filling it up with dead TIDs,
and then you have to do a round of index vacuuming.
And if your indexes are very big, you can imagine that that ends
up really affecting your vacuum performance.
So what he did was introduce a new data structure that organizes
the dead TIDs in a way that is much, much more efficient.
That made it so that you're using way less memory for the
actual dead TID storage, and you can also end up sort of changing
it on the fly; basically, you're not tied to the same restrictions
that we had around the size of maintenance_work_mem before.
So most people are never going to need to do multiple passes
of index vacuuming.
So I think that people with very
large indexes and large tables
are going to see their vacuum performance
be a lot better.
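As a hedged illustration of how you might see whether you are hitting that multi-pass case today, pg_stat_progress_vacuum exposes index_vacuum_count for a running vacuum; the DSN is invented:

```python
# Sketch: watch a running vacuum; index_vacuum_count > 1 means dead-TID
# storage filled up and indexes are being scanned again, which is the case
# the TID store work largely removes.
import psycopg2

SQL = """
SELECT pid, relid::regclass AS tbl, phase,
       heap_blks_scanned, heap_blks_total, index_vacuum_count
FROM pg_stat_progress_vacuum
"""

with psycopg2.connect("dbname=bench") as conn, conn.cursor() as cur:
    cur.execute(SQL)
    for row in cur.fetchall():
        print(row)
```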
Nikolay: I have big customers with
big partitioned tables.
We talk about partitioning for
years and it's hard and they will
benefit.
So vacuuming will be faster, basically,
right?
Melanie: Yeah.
I mean, if you don't have this
problem with the multiple index
vacuuming passes, then maybe not.
But it's something we hear a lot
from sophisticated customers.
Nikolay: And also, I remember that before PostgreSQL 17, only
up to 1 gigabyte could be used for it.
Melanie: Yeah, that restriction is lifted now.
Nikolay: Is it related to this work?
Melanie: Yes, it is related, yeah.
Nikolay: I suspected so, good.
Melanie: Yeah.
So that's really cool.
And I think it's something that's
hard to explain unless you've
had the problem.
I think if you haven't had this
problem, you're not really thinking
about it.
But for people that do have this
problem, I think it's going
to make a big difference.
He's building on that work, and there are a lot of things that we
can do with vacuum because of it.
This read stream API will make it somewhat easier to
do parallelization of the first and third phases of vacuuming,
so he's working on that as well.
And then, you know, the dreams of having global indexes involved a step
where you persisted the dead TIDs to disk, because otherwise you
wouldn't be able to do it.
Basically, there's discussion of being able to split up the phases
of vacuuming and be able to vacuum indexes at some point, come back
to it, and do it later.
Like, vacuum just does the first phase of vacuuming.
And if you want to do the phases of vacuum separately, you have
to have the dead tuples that you need, basically what index
entries you need to delete, and you need to save that somewhere that's
not in memory.
And so there are some modifications to the TID store that they can make
that would probably make it easy to persist.
And I don't know that he's planning on working on that now, but
Dilip had done some work on dead TID storage.
And so I think there's like a lot
of exciting things around vacuum
that'll be happening.
And there's also been discussion, which I know has happened in
the past, but new discussion, about autovacuum scheduling.
And not just autovacuum scheduling, but the cost-based delay and
how to change that to be more adaptive: to use statistics that are
collected while vacuuming to decrease the delay adaptively if you're
not able to finish the vacuum appropriately, and things like that.
So we'll see.
Nikolay: So we talked to Peter Geoghegan in the past as well.
And I remember the duplication.
So you both sound not like people who dreamed of getting rid of vacuum
completely; you sound like there is big potential for improvements,
and it's happening, so it's going to stay, but with improvements.
I'm trying to simplify for our audience.
Melanie: The undo people? I mean, they still...
Nikolay: ...would have needed vacuum.
The undo people, right?
Melanie: I think that there has been a lot more work on vacuum
this release than there has been...
Nikolay: 17 or 18?
Melanie: 17 and 18.
Nikolay: 17, 18.
Yeah, yeah, that's great.
That's great.
So, yeah, many people need it.
I mean, Postgres instances.
So yeah.
Michael: I like this kind of work.
I think Peter Geoghegan's done a lot
of work in the last few releases,
and you and others that focus on
things that will help people.
Well, almost everybody will benefit
without having to change
anything.
And that's, those are so powerful.
They're so easily forgotten or
easily ignored, but everybody
benefits, it gets better for almost
everybody without A, noticing
and B, having to do anything.
Just so powerful, those kind of
changes.
So thank you for working on those
and encouraging others to as
well.
Appreciate it a lot.
Melanie: Yeah, the dream is that we get rid of all of those autovacuum
GUCs, because that's terrible.
I don't know, the fact that there are so many blog posts...
I mean, yeah, they should all be gone.
Your autovacuum configuration should be basically nothing.
The system should figure out what to do.
Nikolay: We often discuss outdated
defaults, but this is a radical
position.
Melanie: Right.
Yeah.
I mean, it's not going to happen
overnight, but I think my vision,
a lot of people I think want this.
I think the cost-based delay system is really hard to understand
because it's not intuitive.
The relationships, like scale factor and all of that, it's just
like, what is that?
I mean, I understand why it happened from a developer perspective,
but I think that getting vacuum to do the right thing is
our job as engineers and shouldn't be users' job, basically.
So we'll see.
Michael: Wow.
Well, we don't do clips on this
show, but if we did, I think
that would be one.
Nikolay: Also number of workers,
3 workers.
If you have like almost 200 cores
and you have only 3 workers
by default, this is insane, right?
Melanie: Yeah.
Well, actually, so I will take
it back.
I would say that the one configuration
that I think users should
be able to provide is the max number
of workers because like
that's your system preference,
right?
I mean, maybe not the minimum number of workers, but the maximum number.
You should be able to keep the system from having 50 processes
just doing vacuuming work if you want, right?
I think you should be able to set that.
Nikolay: Because this is the second phase of writes.
We need it, right?
Because if you just deleted something or updated something, the work
is not done until it's vacuumed, and you need more workers anyway.
Michael: Also, Postgres doesn't know how many cores there are, right?
Nikolay: Doesn't know.
Melanie: It doesn't take that into
account.
Yeah.
I mean, people are allowed to use their servers for other things
at the same time.
Maybe you don't, but we sort of don't assume that we have access
to everything, I guess.
Michael: Yeah, makes sense.
Nikolay: Thank you for coming.
It was very interesting.
Thank you for this work again, thank you for the pg_stat_io work
and so on, and the benchmarks.
We will keep an eye on it, and good luck.
Good luck.
Melanie: Thank you.
Michael: Yeah, thanks so much.
Melanie: Thanks.
Michael: Bye.