
Return of the BUFFERS
Nikolay: Hello, hello, this is
PostgresFM.
I'm Nikolay from Postgres.AI.
And as usual, my co-host is Michael
from pgMustard.
Hi, Michael.
Hello, Nikolay.
So I chose a very boring topic.
We've talked about it a lot — I
hope this is the last time we're
going to talk about it, maybe.
Right? BUFFERS, or pages, or blocks,
or however you name it, depending
on the perspective you have. But
BUFFERS episode 3 — which
title are you going to choose?
Michael: Oh, it's like the third
in a trilogy, isn't it?
Maybe it's like Return of the BUFFERS
or something, I don't know.
Nikolay: Right, but we have a few
reasons to revisit this topic
today.
Of course, this topic has been
very close to my heart.
For a couple of years, we discussed
how important it is to consider
BUFFERS when optimizing queries
at micro-level, when we talk
about EXPLAIN ANALYZE, or at macro-level when we talk
about pg_stat_statements and other
extensions, and aggregate
everything and see parts of workload
or the whole workload.
And recently, finally, as we already
mentioned a few times, BUFFERS
were added to EXPLAIN ANALYZE
by default after a few unsuccessful
attempts from previous years.
I think it was David Rowley, right?
Michael: Yes, and Guillaume Lelarge
played a big part as well.
Nikolay: Right, and all previous
hackers also who made even unsuccessful
attempts made some shift in this
direction.
Of course, it's really a small
technical change.
So now you don't need to write
the word BUFFERS; you just write
EXPLAIN ANALYZE.
And as a reminder, EXPLAIN just
provides the plan for the query —
the one the planner, the optimizer,
is going to choose.
And then EXPLAIN ANALYZE provides
both the plan and the actual execution:
it actually executes the query,
so you see more data.
And by default, buffer information —
buffer hits, reads, dirtied and
written — was not present in the plan.
And that was a bummer for those
of us dealing with queries at scale,
right?
Because this is important information,
as we discussed, and
we will discuss again, I think.
But now in Postgres 18, you don't
need to write BUFFERS explicitly.
This information will be included
by default.
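As a quick sketch of the difference (the table and column here are just hypothetical examples):

```sql
-- Before Postgres 18: BUFFERS has to be requested explicitly.
EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM my_table WHERE id = 42;

-- Postgres 18+: BUFFERS is included by default whenever ANALYZE is used.
EXPLAIN ANALYZE SELECT * FROM my_table WHERE id = 42;
```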
And that's great, that's great.
I think it's a super important
small change which will affect
lives of many engineers.
Because...
Michael: Yeah, I think it's just worth noting —
I know you've alluded to this
already — that this is an upcoming version,
Postgres 18.
It's been committed, and things
that get committed have a very
good chance of making it into the final version.
But things do get reverted.
So there's a small chance that we'll still have to talk about
this more in the future. But yeah, we already had the ability to request
BUFFERS — we could already write EXPLAIN (ANALYZE, BUFFERS) and other
parameters. But a lot of the time, when people are asking for help
online, or when people set up auto_explain (although I do see BUFFERS
more in auto_explain, I don't know what your experience is), or when
people are writing blog posts, or are just getting familiar with Postgres,
they'll come across EXPLAIN ANALYZE and then not know that BUFFERS exists,
or not use it when it would actually have shown the issue. So there's a bunch
of cases where, by having it on by default, we'll be able to help
more people online when they first have an issue, without having
to ask them to go back and get another query plan — or rather, it will
just be in those query plans by default if they're on Postgres 18 or newer.
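For context, requesting BUFFERS in auto_explain already works on current versions; a minimal sketch, assuming you can change settings (the threshold is just an example, and auto_explain has to be preloaded, for instance via shared_preload_libraries, for cluster-wide logging):

```sql
LOAD 'auto_explain';                          -- per-session; use shared_preload_libraries for the whole cluster
SET auto_explain.log_min_duration = '500ms';  -- example threshold for logging slow queries
SET auto_explain.log_analyze = on;            -- log actual execution, not just the plan
SET auto_explain.log_buffers = on;            -- include buffer hits/reads/dirtied/written
```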
Nikolay: Yeah, yeah.
And I know you also have an opinion — and I agree with you — that some
other things should be included if we had something like an ALL option,
or if VERBOSE covered everything.
There is VERBOSE, but it doesn't include everything, which is
slightly strange and unexpected if you don't know the details.
But let's focus only on BUFFERS.
It's really hard for me to imagine it will be reverted.
It might be, but I think, yeah.
Michael: I could just imagine this aging really badly — aging
like milk, as we say sometimes. Just in case,
I wanted to make people aware that it might happen.
Nikolay: Yeah, so we're talking about Postgres 18, which is going to
be released this fall, in many months, and it will hit production
only in a couple of years on average, maybe even more.
So it's still in the distant future.
But I wanted to think in advance, since we are both building
some tools and methodologies and everything:
is it going to have a positive effect on
how we optimize queries in the future?
For example, defaults matter a lot at scale, and I learned this
building three social networks in the past.
A simple lesson: I remember some spammer sending a message on behalf
of the administration, just "send a text message
to this number", without any reasoning.
And if you send that to a million people, some of them
will follow it.
Because, you know, why not?
Of course, if you add some reasoning, it works even better.
So defaults matter a lot.
And we can put the words "don't
forget BUFFERS" everywhere, but if
the default behavior remains
without BUFFERS, there will be
a lot of people who will not use
it, right?
And this is what we see if you
go to explain.depesz.com — it has
history.
Actually, explain.depesz.com has also
advocated for BUFFERS since not long ago,
which is great.
But if you just check the history,
you see a lot of plans provided
without BUFFERS, right?
Because of defaults.
Now we have good reason to expect
that after a couple of years
we will see the majority of plans with
BUFFERS.
Do you agree?
Michael: Yes, maybe not fully, but
definitely in 5 years' time — maybe in
2 to 4 years — it will become the majority.
Nikolay: Well, let's say if it's
a plan from version 18 or newer — then
again, 99% or even almost 100%
of those will include BUFFERS,
because only a few people will bother
turning it off.
Michael: Oh, I meant actually people
still running very old versions.
Yeah, that was my understanding.
Nikolay: But imagine we're already
thinking about people with
new versions and I think we will
have BUFFERS.
What does it mean for analysis?
Are we going to highlight this
information or use this information
even more intensively?
What do you think?
Michael: Well, good question.
I think people will ask...
People already have lots of questions
when looking at
EXPLAIN ANALYZE output for the first time —
what does this statistic mean, what does
that statistic mean — and I think
BUFFERS are another one that
people are at first a little
bit confused by. What does it mean?
Like what is that number?
Is that a lot?
Is it not a lot?
What unit is it in?
All these kind of questions.
So I suspect it will immediately
add a bit of extra confusion
in people's heads.
But once they're familiar, I can't
see how it isn't helpful.
It's just going to be so helpful,
especially once people start,
like, people and tools start converting
that into data volume
sizes.
I know it's imperfect, but let's
accept the imperfection: each
buffer is an 8-kilobyte page, so
multiplying the count by 8 kilobytes
gives you an idea of the data volume
being read.
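A tiny sketch of that conversion, assuming the default 8 kB block size:

```sql
-- 131,072 buffer hits/reads × 8 kB per buffer = exactly 1 GiB of page accesses
-- (possibly touching some of the same pages repeatedly).
SELECT pg_size_pretty(131072 * 8192::bigint);  -- "1024 MB"
```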
When people see a query and notice
it's reading a gigabyte of
data, even if it's from shared
memory, when it's not using an
index or when it's using a specific
index, and then it's only
using a few kilobytes when it's
using a different index, suddenly
they realize the index is helping
speed things up because it's
more efficient because Postgres
is only having to read far fewer
blocks, like much less in order
to serve that query.
I think it becomes a lot more intuitive
that indexes help because
they're a more efficient way of
running certain queries, not
because they're magic.
And I think this is currently not
amongst Postgres folks, but
definitely in the developer community
as a whole, there's this
perception that indexes are magic.
And I think BUFFERS being present
helps — it's a good step towards:
no, they're not magic, they're just
a very clever idea
that makes things really simple.
Nikolay: Yeah, yeah, that's a good
point. I was listening to you
and thought: I'm contradicting
myself when I say buffer hits
and reads and written and dirtied.
It's incorrect — it should be operations:
buffer reads,
hits, writes, and I don't know
how to deal with dirt.
Michael: Dirties?
I don't know.
Nikolay: Yeah, yeah, yeah.
Pieces of dirt, right?
No, no, no, like actions of dirt.
But anyway, like I'm saying, these
are operations. And if you
sum up the numbers, it might be
the same buffer that you hit
multiple times.
Not sure about the other operations,
but hits — in a nested loop
we can definitely have hits for
the same buffer many, many,
many times.
But when you say we need to convert,
I totally agree.
I did it for a long time.
I noticed very long ago that once
you convert it to bytes, megabytes,
gigabytes — sometimes terabytes
actually, tebibytes, right?
Gibibytes.
Once you convert it, engineers
have an aha moment.
Always.
I mean, those who see it for the first
time.
Because 100 BUFFERS — 100 buffer hits
or reads — there's nothing behind
that number for them, actually.
It's hard to feel.
But once you say: okay, here
we read 800 kibibytes just to return
a single 8-byte number —
you can already feel how much that is.
To return 8 bytes, you are reading
800 KiB, and each KiB is 1024 bytes.
Wow, that's not very efficient, right?
How about having, I don't know, a hash table or something
to find this number? And then: oh, okay, this is why — because
it's a sequential scan.
And this sequential scan is just scanning the whole table.
If we remember big O notation, a sequential scan has terrible,
terrible performance here, because there are so many operations.
And it also depends on the size of each tuple — how many tuples
are present in one 8-kilobyte block, one 8-kilobyte buffer, right?
So if only a few tuples fit there, we need to read a lot of data
from memory or from disk.
So once you switch to a B-tree index, it becomes only a few
buffer operations.
Even if you have a billion rows, it will be something like 7.
I don't remember exactly, but
Michael: it's
Nikolay: like 7 buffer hits.
That's it.
And okay, each of them is still 8 kilobytes, which is a lot,
but it's already much better than a sequential scan,
which would be insanely slow in this case.
And you think, okay, that's why it's slow, because we deal with
huge volumes of data.
And when we apply index, we suddenly deal with very low volumes
of data.
Right?
And this is the number one reason why the database becomes fast.
This is how indexing works:
we just reduce the number of I/O
operations.
And when I say I/O
operations, I often include buffer hits in memory.
Because, well, operations with memory versus operations
with disk — it's like a 1,000 times difference.
That's a lot, of course, but we can still compare them, and
one read is roughly like 1,000 hits to memory.
Unless that read is also served from memory, from the page cache —
in that case they're very similar.
So what I'm trying to say: we talked about this reasoning, and
I agree with you — we need to convert to bytes and present it.
At the same time, we somehow need a remark that
these buffer hits might be to the same buffer.
It's some abstract volume of data:
we can have a kilobyte or a megabyte of data hitting the same 8 kB
buffer, right?
So it's okay.
But comparing volumes of data is so good.
Because first of all, you start thinking about data volumes,
which is the number one reason for slowness, especially if we set
concurrency aside.
Because concurrency is a different topic.
If we take just one query, there are cases when it can be a very
CPU-intensive workload, but in the majority of cases, for databases,
a single query will be I/O-intensive.
And our goal is to help Postgres — with indexes, maybe with some
redesign — to deal with as few operations with memory
and disk as possible, which means we need to focus on BUFFERS.
Michael: Yeah, are there any...
I think there might be a few other exceptions, but...
Well, maybe it's only one — I think maybe CPU as well, actually.
I was thinking of JIT compilation.
Nikolay: Probably. Recently we had a discussion about
RLS, and there we had inefficient CPU work — like checking
some volatile function, or a stable function too, right?
So if it's in a loop over all rows, it's CPU-intensive work.
Michael: Yeah, I was just trying to think — there
have been a couple of other times I've not been able to
spot issues by looking at BUFFERS.
But it's not because Postgres isn't reading
data, it's just that it's not reporting it in EXPLAIN ANALYZE yet.
So actually, maybe getting it on by default will encourage some
more reporting of BUFFERS as well.
So, for example, I don't think triggers report the BUFFERS they read.
Like the famous case of not indexing
a foreign key and then having an ON DELETE CASCADE:
that trigger might tell you it's taking many, many seconds
because it's having to do a sequential scan of the referencing
table, but no BUFFERS are reported as part of that.
Same with in-memory operations, like sorts or hashes.
They could be really slow — maybe you've
got quite a large work_mem, and maybe it's the dominant part of
the query plan — but no BUFFERS are reported as part of that,
because it's in memory.
It will report the BUFFERS if it spills to disk.
For reads of tables and indexes, we get reports even if they come
from memory, but for operations like sorts and hashes, we don't.
So there are a few things that are still doing work where it would
be good to get those numbers.
But it doesn't go against what you're saying in principle.
It just means I can't use what Postgres is actually giving me
in EXPLAIN (ANALYZE, BUFFERS) to actually diagnose the issue there.
Nikolay: Yeah, I remember we also touched on the topic that it would
be really great to see details
for BUFFERS.
For example, how many BUFFERS came from
the heap, how many from indexes — maybe distinguishing
each index — some breakdown to
look at.
Because if we go back to the original
approach for dealing with
EXPLAIN ANALYZE — I see even
many hackers' blog posts
provide plans, ANALYZE plans, but
they still don't use BUFFERS.
I hope this will change over time,
of course.
So usually: okay, we have timing,
and we try to guess — this time
is lost here, why?
Okay, because of a sequential scan.
But in terms of data volumes, what
do we have?
Only rows, right?
Expected and actual rows.
Logical rows.
And sometimes we think: okay, we
fetch one row here.
Michael: Oh, we get width as well —
like an average estimated width.
Multiplying it by rows gives you some idea,
but it's only the width of
what's being returned, not the width
of what's being read.
Nikolay: Exactly.
Yeah, we can multiply width by rows
and get, basically, a number of
bytes. But that's the logical
level, and the underlying level
might be very different —
drastically different.
For example, if there are a lot
of dead tuples that Postgres
had to check to return this one row,
right?
Maybe it checked a thousand dead
tuples.
And that's hidden if we look only
at rows and timing.
And we have no idea why the timing
is so bad, right?
Okay, we returned one row from the
index scan, but why so slow?
Here's why.
We go to BUFFERS and we see that
a lot of BUFFERS were hit or
read — it doesn't matter which.
That's why the timing is so bad.
And then, my point is that it's
already very useful. But imagine
if we saw: oh, actually, from this
particular index, from this
particular table, we got these
buffer operations.
It means that to return this one row
took so many BUFFERS, even with
an index scan — that's bad.
Oh, that's why — because of bloat.
This row had many versions that weren't
cleaned up, and the index scan needs
to check the heap to get version
information.
So it would be useful to have
detailed BUFFERS, a buffer
breakdown maybe, right?
And I
Michael: had this case recently.
In January I was doing
some office hours calls, just to
get an idea of what kind of issues
people were facing, and I had some
really good conversations — thank
you to everybody who jumped in on those. In one of them they
were using our tool, pgMustard, and one of the
tips we show people — we call it read efficiency — is to look for
exactly that: when are you reading lots of data?
For long-term users, we actually used to call it table bloat
potential, or bloat potential.
So it could be index bloat, could be table bloat.
But we renamed it read efficiency a few years ago because there
are other possible causes of it as well.
So it's a tricky one.
And in this office hours call, they
were hitting quite a few read efficiency issues, but they couldn't
diagnose exactly where it was: was it at the index level, was it at the
table level? They looked at both, they didn't seem bloated,
they reindexed, they tried a lot of things, and they couldn't
reduce it. So it was a really interesting call. But
yeah, that was my first port of call — this used to be true
more in older versions of Postgres, but indexes especially
can get extremely bloated, and so can tables in some cases. But yeah,
it's amazing that you can suddenly see that by turning on
BUFFERS.
And that is something I wasn't expecting to be true.
Because I see bloat as more of a system-wide issue,
but to be able to spot it in query analysis was really cool.
Nikolay: Oh, bloat can be very tricky.
Sometimes, with our consulting clients, overall bloat
is low, but some query suffers.
And it turns out to be what we usually call local bloat:
for particular IDs, the situation is super bad.
It can also be not bloat at all, but sparsely
stored records, distributed among many BUFFERS.
And you expect: okay, I have a very narrow table, only 4 or
5 columns, so I expect a lot of tuples to fit inside one page.
But somehow, reading 1,000 rows, I deal with thousands
of buffer operations — what's happening here?
And if we just check ctid, we can see that each row is
stored in its own page, so we have a lack of data locality
here.
And so clustering with pg_repack, without downtime, could help, or
partitioning usually helps in such situations, and so on.
Yeah, yeah.
So BUFFERS are exactly the way to feel all these
problems better.
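A rough sketch of that ctid check (table and column names are hypothetical):

```sql
-- How many distinct heap pages hold the rows we fetch?
-- ctid is (page, offset); going through text to point lets us pull out the page number.
SELECT count(DISTINCT (ctid::text::point)[0]) AS heap_pages,
       count(*)                               AS rows_fetched
FROM orders
WHERE customer_id = 123;
-- If heap_pages is close to rows_fetched, locality is poor:
-- almost every row lives on its own 8 kB page.
```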
But I agree, I like your idea to talk about read efficiency and
maybe write efficiency as well.
Because for the end user, it's obvious.
Okay, I return only 25 rows on my
page.
So I expect not gigabytes of data to be fetched from the buffer pool,
or even worse, from disk, right?
Or from the page cache, which sits between disk and the buffer pool.
I expect maybe only some kilobytes, right?
Maybe tens of kilobytes.
Not even megabytes, if it's quite a narrow table.
Postgres should not do a lot of work to show 25 rows.
If it does a lot of work, something is not right and we need
to optimize this query.
And without BUFFERS, we only can guess.
With BUFFERS, we see it, right?
With buffer details, it would be even better, right?
Someday maybe we will have detailed BUFFERS per each database
object.
Some statistics would be interesting, I think, to observe.
And there is criticism of this approach.
I recently had an interesting discussion on Twitter, on X, and
somebody told me that in Oracle there is the ability to flush caches.
Right.
And in Postgres, it's tricky.
Usually it means we need to restart Postgres to flush the
buffer pool, and also to do the "echo 3" dance to flush the Linux
caches, if you want to go hardcore and check
a very cold state.
And this is interesting.
Michael: I think you can restart.
Nikolay: Restart what?
Michael: The Database.
Like if you're testing locally, for example, or on a clone, if
you restart the Database, doesn't that reset caches?
Nikolay: Restarting Postgres will empty only the buffer
pool.
But if it's Linux, we also have the page cache, right?
We can flush the page cache with a simple command — of course,
if you have sudo.
Michael: Yes,
Nikolay: so it's easy.
I always Google it: echo 3 to some path, and then sync, and
that's it.
Michael: If you're a managed service, what's the best you can
do?
Nikolay: Well, on a managed service, you cannot do this, of course.
Michael: Restart is like, I
Nikolay: think the best
Michael: you can
Nikolay: do.
Reboot.
Reboot, yeah.
Yeah, yeah, reboot.
Well, this is hardcore.
So let's talk about this.
I also see a constant desire to
check the cold state in the Database
Lab we built.
Some users use it for query optimization
with a bot we have,
Joe bot — very old stuff, but it
works really well for query optimization
and also for experiments, right?
So on a thin clone, on a branch,
you can check various
ideas. And there people say: okay,
I see the hot state.
Data is already cached.
I want to check the cold state.
And I think it's a natural desire,
but yeah.
Well, every time we say there is
no easy way to do it.
And interestingly — I didn't
know this — in
PostgreSQL 17, in pg_buffercache,
there is a new function, pg_buffercache_evict,
and you can try to use it.
I think we are going to try it
at some point, when Postgres 17
is more present on production
systems.
We will probably consider adding
this to make the buffer pool
empty, right?
But unfortunately, I suspect this
function won't work perfectly.
It's definitely not for production,
first of all, right?
It's for lab environments
like the ones we build.
And unfortunately, per the documentation,
it won't evict BUFFERS
that are pinned — for example, by
autovacuum, a vacuum process,
and so on.
So in some cases it will fail,
and we will need to think about
how to deal with that.
Of course, restart is definitely
working in this case.
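A sketch of what that could look like on Postgres 17+ in a lab environment (the table name is hypothetical):

```sql
CREATE EXTENSION IF NOT EXISTS pg_buffercache;

-- Try to evict one relation's buffers from shared_buffers.
-- Pinned buffers cannot be evicted, so count how many actually were.
-- Note: this only affects Postgres's buffer pool, not the OS page cache or ZFS ARC.
SELECT count(*) FILTER (WHERE pg_buffercache_evict(bufferid)) AS evicted,
       count(*)                                               AS candidates
FROM pg_buffercache
WHERE relfilenode = pg_relation_filenode('my_table');
```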
So if we think about why people
need the cold-cache case at all —
what's your opinion on this?
Michael: Well, I think it's coming
from a good engineering rationale.
I want to make sure the worst case
isn't too bad.
It's definitely, you want good
engineers to be thinking about
edge cases, to be thinking about
the worst possible case.
But I also think that in practice,
the kinds of optimizations
we're doing here, the whole point
of the discussion here is we're
saying try and reduce the amount
of data being read to as minimal
as possible.
If it's an important query, try
and get a very, very good access
path for that query.
That will involve getting the buffer
numbers down as much as
possible, whether they're cached
or not.
Maybe you want to explain to people
in the PR, or you need to
explain to your team, how much
faster has this made it?
So I can understand wanting to
compare cold cache state to cold
cache state, and then having these
kind of relative differences.
But I would encourage going the
opposite route.
It's much easier to work with warm
cache, especially when you're
doing query optimization, you're naturally going to warm up the
cache if you're running the same query over and over again.
So I would encourage more going the opposite direction and say
compare...
Yes, it definitely applies for OLTP, but even for OLAP, for analytical
queries, I don't see the downside of saying to your team —
I mean, it's still EXPLAIN ANALYZE, right?
If you're showing query execution plans before and after, which
is often what people are doing — kind of "this is what it
was, this is what it will be" — it's already imperfect in terms
of reported numbers.
It's not exactly the same as what the client's seeing.
So if we're already imperfect, I don't see the downside of comparing
warm cache to warm cache, so I tend to run things a few times,
get the...
Nikolay: Before optimization and after optimization, we compare
two cases; both should be warm.
Even if we still have reads — for example, the buffer pool may be
smaller in a lab environment.
We usually have a much smaller buffer pool because we run multiple
Postgres instances on the same machine.
It's a shared environment.
So if you deal with large volumes of data, they might not fit in the
buffer pool, and you are going to see reads.
But the idea is: let's just sum up reads and hits, maybe even
writes and dirties, however we name them, right?
All four buffer operations — if we just sum them and use that
as a universal metric, this is now our number.
Or maybe just reads and hits, it depends, right?
But overall, we just see this is the overall volume of data.
And this is what you return: 25 rows.
Michael: I like that a lot; that's exactly what we do. But I get
the sense that what people really want is to compare timings.
Because sometimes it's coming from a complaint, right, or something.
So they want to say: this query that used to take 20 seconds is now 500 milliseconds.
Or this query that used to be 200 milliseconds is now less than
a millisecond.
So they want this kind of difference in timing, even though behind
the scenes it's a difference in BUFFERS.
So for that, if you want kind of apples to apples comparison,
instead of trying to get cold and be fair on the cold side, I'd
just say it's easier to be fair and do it on the warmer side.
I know it's not perfect, and it might be that your real users
are hitting a cold cache state, but if that's what you want,
that's the direction I would go.
Nikolay: Yeah, well, we have slightly different methodologies
here.
I usually say: okay, we start from the statement that the query is
slow, it takes like 30 seconds or one minute.
It's super slow, unacceptable; we want it below 100 milliseconds
or 10 milliseconds.
And then I say: forget about timing for now, completely.
Focus only on BUFFERS.
If we check BUFFERS and see a low volume — 100 buffer
hits and reads to return 25 rows — we know our micro-optimization
is over.
We need to go and see the whole picture of what's happening with
concurrency.
Maybe this query is waiting on lock acquisition for all those seconds —
that's it, right?
So it's already a different kind of problem.
But if we confirm large volumes of data, we forget about time.
We focus only on obtaining sane numbers for BUFFERS.
Sane means, okay, hundreds, thousands, not more.
Even thousands to return 25 rows is already quite a lot.
And we must do it.
Sometimes, of course, it's not easy.
Sometimes we need to do denormalization, some complex surgery
on our database involving long-lasting operations with backfilling
and so on. But once we achieve this, in most cases we get a
sane number of BUFFERS, and then we say: okay, now we can compare
timing.
And comparing timing in a lab environment, it still might differ
from production.
Michael: Yeah, of course.
Nikolay: But of course, I know, I use it as well.
I say: we improved timing a hundred times, or 10,000 times.
It happens, right?
And this is good for a final comment at the end, you know — but only
once we've fully done the work on BUFFERS. That's me.
Michael: Yeah, given it's you — with your reputation
and your teachings, you could leave that last line as "we improved
buffer reads", or read efficiency, for example.
Read efficiency — and it would probably be almost the exact same
number.
Nikolay: Not necessarily, but yeah, maybe.
Michael: You'd be surprised how often it's close.
Nikolay: Okay, good, good, good.
Yeah, that's good.
Maybe I should think about it and also present this information
during consulting activities, and in tools as well.
Yeah, I think it's good.
So yeah, read efficiency.
You take the number of rows and the width, multiply them, get
bytes from there, and then...
Michael: Yeah, it's a tricky one, because for different scan types
you expect different amounts of efficiency, so yes, it's slightly
Nikolay: more involved. Also interesting, right?
Michael: Yeah, but I think the main differences are: Sequential
Scan and Bitmap Scan have some similarities,
and Index Scan and Index Only Scan have some similarities, but then
looped index scans are somewhat different from range
index scans in terms of how many tuples — in a loop, for example,
you can't get fewer than the number of loops; it has to
do at least that many.
So yeah, there's some subtlety, but that's it essentially.
Nikolay: Yeah. But back to the cold state — I also agree,
I can imagine you would need it.
And unfortunately, as I see it, it's possible only in a lab environment
where you are alone.
Because flushing the operating system page cache, for example,
is a global operation.
Michael: And when you control the hardware, right?
Or like the, you know, the VMs.
Nikolay: Well, if it's a managed setup, again, a restart, as
you said, is also an option — but again, this only works if you're
alone, and it's expensive.
We aim to make experiments super, extremely fast
and cost-efficient.
So it means a shared environment.
Many people work on the same machine, so if one decides to flush
everything, others will notice, right?
And with this function, pg_buffercache_evict, if we manage
to make it work, I see it being useful.
Imagine we have the page cache — or, in the case of DBLab,
it's the ZFS ARC — which is shared.
If one block is cached, it's cached for all clones, all branches,
which is very good in terms of efficiency.
But sometimes, indeed, engineers want to check the inefficient situation.
Okay, in this case, I think what we will do: if a user
requests it for their particular database, their particular Postgres
instance running on this machine, we can evict maybe everything
from the buffer pool, or just particular tables and indexes.
And if we succeed, there is a good chance these buffers are still
cached in the underlying cache, right?
Page cache or ARC.
In this case, query timing is not going to be very different,
but the plans will be different.
We will see reads instead of hits.
Observing buffer numbers, we will see the difference —
okay, reads now, not hits.
And what I'm thinking is, in this case, we could consider this
situation relatively cold.
Michael: Lukewarm.
Nikolay: Yeah, yeah.
And we could just say: okay, we know that reading from disk is
roughly a thousand times slower than from memory,
so we could do some estimation
of the real cold state.
Michael: Well, or there's that...
How often do you see clients having
track_io_timing on?
Nikolay: Oh, very often.
Michael: Great.
Okay, that's really positive.
Because that gives us
an idea of how long those reads,
as opposed to hits, took.
And you could do some division —
I/O timing by reads — to get an
idea of where they're coming from.
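A small sketch of that, assuming you have the privilege to enable it (the query is just an example):

```sql
SET track_io_timing = on;
EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM my_table WHERE id = 42;
-- With track_io_timing on, the plan includes "I/O Timings" lines,
-- so you can see how much time went to actual reads rather than buffer hits.
```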
I think there might also be some
work — I think I saw something,
either a commitfest entry or a
hackers thread, trying to get
Nikolay: the details.
It's like — there's just pg_stat_kcache,
and that's not at the micro level, right?
Yeah.
Well, it's possible if you have
access to everything: from /proc,
for the process number, there's I/O
information, you can get it from there, I think.
Michael: But it would be good to
have it in Postgres core so
that we could even get it from
managed services, for example.
Nikolay: Yeah, yeah.
Well, there is something that can
be done here to improve observability
of this part, and to work with a particular
query if you have control.
Yeah.
But what I'm trying to say is: even
with this evict function, we
can get some interesting information,
even if we cannot allow
ourselves to reset everything,
including the page cache.
So it can already be useful in
non-production environments, on
a clone, to study query behavior.
Yeah, yeah.
So I feel there's a lot of things
that can be improved in this
area to work with queries.
And it's especially important to
continue this work, because I
suspect we will use some tools
to work on this at massive scale.
For example — we already know this from consulting,
we did it with a few clients —
with auto_explain, we fetched
a lot of queries, hundreds usually,
sometimes thousands.
It's insane.
Nice examples with plans.
And then, well, actually, you know,
because we partnered, in
some cases we use pgMustard to provide
additional insights, which
is great.
And then we have a lot of stuff.
Unfortunately, I still don't see how
to avoid the human here.
Well, I like to involve
a human there, but I would like
to involve them less, you know.
Because right now we have a
lot of cases of query analysis
like that, and then we need to
draw some common picture from it, right? Going
above all those plans and saying: okay, we have a pattern here.
For example, bloat is obvious, or
index health is not good.
And to draw this picture, a human is heavily involved right now.
I'm thinking that over time we will probably build some additional
macro-level analysis for these micro-level cases, hundreds or thousands
of them.
This is going to be great.
And then, looking at BUFFERS — actually, we have BUFFERS there
as well.
You were right that in auto_explain, timing is often not present.
People are afraid of the overhead; BUFFERS also have overhead, but
it's smaller.
Right.
And yeah, we had articles, you had articles about this as well.
Yeah.
So yeah, with the cold cache it's tricky, and there is this
function — that's it.
So, Postgres 17 plus.
What else?
There is the macro level, right?
And at the macro level we can talk about pg_stat_statements: blocks
read, hit.
By the way, there, yeah, it's also the naming — we talked about it.
The naming suggests data volumes, but it's okay.
So: read, hit, written, dirtied,
for shared buffer blocks,
in pg_stat_statements. There is also pg_stat_kcache if we want
real disk I/O to be observed.
And again, I advocate including pg_stat_kcache in any setup.
It's a great extension.
And there we can talk about the macro level: okay, this
query ID has this volume per second, or this volume per query,
or it's responsible for 90% of all hits to the buffer pool happening
in the database.
Maybe it doesn't look super slow, but it's so intensive with the
shared buffer pool.
Michael: That's a good point.
I almost always encourage people to look at their top queries
by total time.
Unfortunately, with pg_stat_statements at least, there's no
kind of total blocks column.
Because it's split into these many, many columns, you could order
by shared hit plus shared read and get a good sense of one type,
or by dirtied for a different one. But yeah, in most cases I'm thinking of,
to get a good system-wide aggregation, you'd have to sum two columns,
I think, to get a good...
Nikolay: Reads and hits, right?
Michael: Ordering.
Yeah.
Nikolay: Yes, yes.
So we usually...
It's an underestimated approach with
pg_stat_statements: ordering by the
sum of reads and hits.
Michael: Yeah.
Nikolay: It's super important.
This is how you identify I/O-intensive
queries.
Right now they might be mostly hits,
but if the database grows, they
can be converted more and more
into reads.
Or vice versa: you double your
memory and buffer pool size, and
the reads will be gone.
But just sum them and consider
them together to understand how
I/O-intensive the query is.
And think: oh, this query probably
needs optimization, especially
if it returns only one row or zero rows.
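A minimal sketch of that ordering, assuming the default 8 kB block size (the LIMIT is arbitrary):

```sql
-- Top queries by total buffer traffic (hits + reads), with an approximate data volume.
SELECT queryid,
       calls,
       shared_blks_hit + shared_blks_read                          AS total_shared_blks,
       pg_size_pretty((shared_blks_hit + shared_blks_read) * 8192) AS approx_volume
FROM pg_stat_statements
ORDER BY total_shared_blks DESC
LIMIT 20;
```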
So, yeah, in this case, we can
also consider buffer-focused optimization,
which is important.
I think it's underestimated.
But it's also worth mentioning
an interesting blog post from
Andrei Lepikhov, who gave, I would
say, a hacker-researcher perspective
on this. Actually, I didn't
mention this, but it's my point
for sure for the lab environments we
build: it's important to say that
timing is super volatile, it's unpredictable.
You can have one timing today and
another tomorrow, one on production
and another on a clone
of production — very different timing.
And it would be a big task to
track all the reasons why timing
is unpredictable.
And there is a simple methodology: if we
deal with a complex problem, let's
split it into pieces and analyze
the pieces separately.
So when we take a clone, bring
a single query there and optimize
it, we already get rid of concurrency
issues — heavy locks, lightweight
locks, everything — we just forget
about them.
We deal with a single query,
alone, on one clone, and that's already
super good.
But we still need something
reliable in terms of metrics
to optimize against.
And timing is not reliable, because
it's unpredictable.
It depends on the state of caches.
If it's a shared environment, maybe
the CPU and disk are super busy,
and there's noise from neighbors
while we optimize, right?
But once you focus on BUFFERS during
optimization, you have an invariant
metric.
Okay, it can be reads versus hits,
but the sum will be constant
if you repeat the query.
For this particular clone, for
this particular query, with these parameters,
everything is the same.
The plan is the same, right?
So you have the same thing.
It's repeatable, it's reproducible.
This gives you a foundation for
optimization, because without
it you have a very shaky foundation,
not a solid one,
right?
Timing.
So then you optimize and you...
I like your idea about how to report
the optimization — actually, I used it.
I said: okay, timing improved 10,000 times.
And I remember I mentioned BUFFERS
as well, but I didn't expect
people to value that comment
so much.
Maybe with BUFFERS by default,
we should push this methodology
more and more.
But what Andrei is talking about
in this article is that for
research purposes and so on, instead
of timing, maybe we should
focus only on reads.
Reads could be a target for
optimization in themselves, and a basis
for deciding whether the system works
efficiently or not.
Because we are talking about databases,
and I/O is always the number
one thing in terms of what takes
time.
And I like this idea.
Yes.
This is very close to what
I've seen for many years building
lab environments, non-production
environments, which can help
people scale their teams and databases
and processes and so on.
Yeah, so what's your opinion about
this article?
Michael: Yeah, I thought it was
good.
I think this blog's
great, but it's very
hacker-focused.
And I think that's good, right?
This is a good topic for hackers
as well.
There's a lot of blog posts I see
from people often explaining
performance issues and not including
BUFFERS, and it seems like
a missed opportunity to educate
on that front.
So this is a great argument for
the people doing the educating,
I think, or even doing the implementations
of why it can be helpful
to see this.
I think he makes some good points
as well about hardware changing
over time and his initial paragraph
is even about I can't reproduce
this paper.
Nikolay: I can't reproduce timing, yes.
Michael: Yeah, because of all this.
And it's not just that — this is a
slightly different point — there was
also insufficient setup information,
which is really common in benchmarks:
it's really common not to include
which exact machines were being
used, or which exact...
So without that information, BUFFERS
can still be used — as in,
"oh, this is interesting."
Sure, there might be optimizations
over time: Postgres might
use fewer BUFFERS for certain operations,
or certain indexes might become
more efficient.
There's been a bunch of work to
optimize B-trees in Postgres
over the years, and I would expect
that to lead to slightly fewer
buffer reads in a bunch of cases.
But they would be easier to see,
and easier to see the differences
in than timings.
And I think he's also got a really
good graph here that shows
there's not zero variance in
buffer numbers, but it's much,
much lower and way more stable
than timings.
Nearly completely stable, with some
exceptions — which is my experience
as well.
Nikolay: Yeah, yeah, yeah.
I agree. Especially, for example,
if you run the query for the first time
and the relcache is not populated yet,
it will give you,
Michael: yeah, planning
Nikolay: time, it will give you
huge overhead and huge buffer
numbers, but a second execution will
fully shave that off.
Michael: Yeah, planning BUFFERS.
Nikolay: Planning BUFFERS is a
separate topic.
It's a very important topic.
I like to have it.
Michael: Me too, but I don't fully
understand them yet.
Like why do we get so many more
in the first execution and then
why do we sometimes get none reported
at all?
Nikolay: Ah, okay.
You mean, yeah, yeah, yeah.
I think we discussed it maybe,
not sure if you, I saw some very
weird behavior in planning BUFFERS.
Yeah.
Maybe something is not tracked
as well, as you've mentioned.
Michael: I think it
Nikolay: must be.
Foreign keys, yeah, yeah, yeah.
So maybe we should dig into this
topic and raise it, the discussion
about that.
Yeah.
So back to Andrei's blog post.
Michael: Yes.
Nikolay: I agree.
And like, "I cannot reproduce it" —
we build lab environments that are sometimes
very different from production,
just for the sake of cost optimization.
If you have a 360-core EPYC,
a 5th-generation EPYC, on production,
and almost a terabyte or more of
memory, it doesn't mean you cannot
optimize your queries on a Raspberry
Pi.
You can, on magnetic disks, very
cheap.
And that query, fully cached, might
be slow for you on production —
maybe like a second.
You go and optimize, and you understand
that you have shitty hardware.
In this case, you just see: okay,
it takes minutes or hours.
But the BUFFERS are the same, and the plan
is the same.
You can make decisions on super
cheap hardware — smaller caches,
everything smaller, a very slow
disk — you can still make cautious
decisions and move forward with
optimization, find a solution,
see that you improved buffer numbers,
say, a thousand times, and
expect a similar effect, maybe the
same effect, on production, affecting
timing accordingly.
Right.
And this is the power
of lab environments.
I think everyone should have them —
super cheap, shared,
so many people and CI pipelines
can work together — and focusing on
BUFFERS is magic.
Right?
Yeah, I
Michael: think it's easy.
I think you're completely right.
I think it's easy for people to
make the mistake of then also
not using sufficient data volumes.
I know you don't.
I know that's not what you're advocating
for.
But yeah, it's the data volumes
that matter way more than the
power of the disks or the specific
CPU.
Nikolay: Yeah, usually people say:
okay, we are going to have
weaker hardware for non-production,
and we are going to focus
on relative optimization.
So if it was a second in production
and 10 seconds in this weak non-production
environment, we are going to
optimize from there.
But that is still shaky, because of
the state of caches and so on.
Just focus on BUFFERS, and then
you can understand it.
Again, there are exceptions, I
agree, such as what we discussed with
RLS and so on.
It's possible that your particular
case is CPU-intensive, but
those are exceptions to the main
rule.
In the absolute majority of cases, you
will be dealing with BUFFERS
during optimization, right?
So congrats on the Postgres 18 change.
And I also want to thank Andrei
Lepikhov for this blog post,
which is very welcome.
I think we need more posts about
the importance of I/O metrics
at various levels: the one-query
level, the micro level, or the macro
level, like pg_stat_statements and pg_stat_kcache.
So yeah, I hope this topic, this
methodology, will grow and become
more popular.
Michael: Yeah, agreed.
Another thank you to the people
involved in getting this committed
to Postgres 18. Oh, and I think this
might be our first episode since
we both got Postgres Contributor
status, Nikolay — so congratulations
to you, and thank you to whoever nominated
us, or whatever the process is there.
Thank you to everyone involved.
Nikolay: Good, see you next time.
Michael: See you soon, bye.