Nikolay: Hello, hello, this is
PostgresFM.

I'm Nikolay from Postgres.AI.

And as usual, my co-host is Michael
from pgMustard.

Hi, Michael.

Hello, Nikolay.

So I chose a very boring topic.

And we talked about it a lot. I hope this is the last time we're going to talk about it, maybe, right? BUFFERS, or pages, or blocks, or however you name it depending on the perspective you have, right? But BUFFERS episode 3: which title are you going to choose?

Michael: Oh, it's like the third
in a trilogy, isn't it?

Maybe it's like Return of the BUFFERS
or something, I don't know.

Nikolay: Right, but we have a few
reasons to revisit this topic

today.

Of course, this topic has been
very heartwarming for me.

For a couple of years, we discussed
how important it is to consider

BUFFERS when optimizing queries
at micro-level, when we talk

about EXPLAIN ANALYZE, or at macro-level when we talk

about pg_stat_statements and other
extensions, and aggregate

everything and see parts of workload
or the whole workload.

And recently, finally, as we already
mentioned a few times, BUFFERS

were added to EXPLAIN ANALYZE
by default after a few unsuccessful

attempts from previous years.

I think it was David Rowley, right?

Michael: Yes, and Guillaume Lelarge
played a big part as well.

Nikolay: Right, and all previous
hackers also who made even unsuccessful

attempts made some shift in this
direction.

Of course, it's really a small
technical change.

So now you don't need to write
the word BUFFERS.

You just write EXPLAIN ANALYZE.

And as a reminder, EXPLAIN just
provides a plan for the query,

which planner is going to choose,
optimizer is going to choose.

And then EXPLAIN ANALYZE actually
provides both the plan and actually

execution.

It actually executes the query,
so you see more data.

And by default, buffer information,
buffer hits, reads, BUFFERS,

dirtied and written, we are not
present in the plan.

And that was a bummer for us who
dealt with queries at scale,

right?

So because this is important information,
as we discussed, and

we will discuss again, I think.

But now in Postgres 18, you don't
need to write BUFFERS explicitly.

This information will be included
by default.
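As a rough illustration of the syntax change (the table and filter here are hypothetical, just to show the options):

```sql
-- Before Postgres 18: BUFFERS has to be requested explicitly
EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM accounts WHERE id = 42;

-- Postgres 18 (as committed): ANALYZE includes buffer statistics by default
EXPLAIN ANALYZE SELECT * FROM accounts WHERE id = 42;

-- ...and it can still be switched off explicitly if needed
EXPLAIN (ANALYZE, BUFFERS OFF) SELECT * FROM accounts WHERE id = 42;
```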

And that's great, that's great.

I think it's a super important small change which will affect the lives of many engineers.

Because...

Michael: Yeah, I think it's just worth noting, I know you've alluded to this already, that this is in an upcoming version, Postgres 18.

It's been committed, and things
that get committed have a very

good chance of making it to the final version.

But things do get reverted.

So there's a small chance that we'll still have to talk about

this more in future but yeah we already had the ability to request

BUFFERS we could already add EXPLAIN (ANALYZE, BUFFERS) and other

parameters but a lot of the time when people are asking for help

online or when people set up auto_explain, although I see BUFFERS more in auto_explain, I don't know what your experience is, but yeah, when people are asking for help or people are writing blog posts or people are

just getting familiar with Postgres, they'll come across

EXPLAIN ANALYZE and then not know that BUFFERS exists, or not use it when it would actually have shown the issue. And so there's a bunch of cases where, by having it on by default, we'll be able to help more people online when they first have an issue, without having to ask them to go back and get another query plan. Or it will just be more likely to be in, well, it will be in those query plans by default if they're on Postgres 18 or newer.
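For reference, a minimal auto_explain setup along the lines Michael describes might look like this (the threshold is just a placeholder; auto_explain.log_buffers only has an effect when log_analyze is on):

```sql
-- Load auto_explain for the current session (or add it to
-- shared_preload_libraries / session_preload_libraries in postgresql.conf)
LOAD 'auto_explain';

-- Log plans for statements slower than an arbitrary threshold
SET auto_explain.log_min_duration = '500ms';

-- Include execution statistics and buffer usage in the logged plans
SET auto_explain.log_analyze = on;
SET auto_explain.log_buffers = on;
```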

Nikolay: Yeah, yeah.

And I know you also have an opinion, and I agree with you, that some other things should be included if we provide something like ALL or VERBOSE. There is VERBOSE, but it doesn't include everything, and this is slightly strange and unexpected if you don't know the details.

But let's focus only on BUFFERS.

It's really hard for me to imagine it will be reverted.

It might be, but I think, yeah.

Michael: I could just imagine this aging really, you know, aging

like milk, we say sometimes, just in case.

I just wanted to make people aware that might have happened.

Nikolay: Yeah, so we talk about Postgres 18, which is going to

be released this fall in many months and it will hit production

only in a couple of years, on average, maybe, or maybe even more.

So it's still in far future, in distant future.

But I wanted to just think in advance, since we are both building

some tools and methodologies and everything, I wanted to think,

is it going to have some positive, I think, positive effect on

how we are going to optimize queries in the future.

For example, defaults matter a lot at scale, and I learned it

building 3 social networks in the past.

The simple lesson: I remember some spammer, on behalf of the administration, was sending the message "send a text message to this number", without any reasoning. And if you send it to 1,000,000 people, there will be some people who will follow.

Because, you know, like, why not?

Of course, if you add some reasoning, it will work even more.

So defaults matter a lot.

And we can put words like "don't forget BUFFERS, don't forget BUFFERS" everywhere, but if the default behavior remains without BUFFERS, there will be a lot of people who will not use it, right?

And this is what we see if you go to explain.depesz.com, it has history. And, of course, actually, even explain.depesz.com has also advocated for BUFFERS since not long ago, which is great.

But if you just check history,
you see a lot of plans provided

without BUFFERS, right?

Because defaults.

Now we have good reason to expect that after a couple of years we will see the majority of plans with BUFFERS.

Do you agree?

Michael: Yes, maybe not in full, but definitely in 5 years' time, maybe in 2, 3, 4 years, it will become the majority.

Nikolay: Well, let's say if it's version 18 plus, then again 99% or even almost 100% of plans will include BUFFERS, because only a few people will bother turning it off.

Michael: Oh, I meant actually people
still running very old versions.

Yeah, that was my understanding.

Nikolay: But imagine we're already
thinking about people with

new versions and I think we will
have BUFFERS.

What does it mean for analysis?

Are we going to highlight this
information or use this information

even more intensively?

What do you think?

Michael: Well, good question.

I think people will ask...

People already have lots of questions when looking at EXPLAIN ANALYZE output for the first time, and people are like: what does this statistic mean, what does that statistic mean? And I think BUFFERS are another one that at first people are a little bit confused by. What does it mean? Like, what is that number? Is that a lot? Is it not a lot? What unit is it in? All these kinds of questions.

So I suspect it will immediately
in people's heads have a bit

of extra confusion.

But once they're familiar, I can't
see how it isn't helpful.

It's just going to be so helpful,
especially once people start,

like, people and tools start converting
that into data volume

sizes.

I know it's imperfect, but let's accept the imperfection of multiplying it by 8 kilobytes, each buffer being a page; multiply by 8 kilobytes and you get an idea of the data volume being read.
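A small sketch of that conversion: with the default 8 kB block size, a buffer count from a plan node can be turned into an approximate data volume (12345 is a made-up number of buffers here):

```sql
-- block_size is 8192 bytes on most builds; multiply the buffer count
-- reported by EXPLAIN by it to get an approximate volume
SELECT pg_size_pretty(12345 * current_setting('block_size')::bigint) AS approx_volume;
-- roughly 96 MB for 12,345 buffer hits/reads
```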

When people see a query and notice
it's reading a gigabyte of

data, even if it's from shared
memory, when it's not using an

index or when it's using a specific
index, and then it's only

using a few kilobytes when it's
using a different index, suddenly

they realize the index is helping
speed things up because it's

more efficient because Postgres
is only having to read far fewer

blocks, like much less in order
to serve that query.

I think it becomes a lot more intuitive that indexes help because they're a more efficient way of running certain queries, not that they're magic.

And I think this is currently not
amongst Postgres folks, but

definitely in the developer community
as a whole, there's this

perception that indexes are magic.

And I think BUFFERS being present helps; it's like a good step towards: no, they're not magic, they're just a very clever idea that makes things really simple.

Nikolay: Yeah, yeah, that's a good point. I was listening to you and thought I'm contradicting myself when I say buffers hit and read and written and dirtied. It's incorrect, it should be operations: buffer reads, hits, writes, and I don't know how to deal with dirt.

Michael: Dirties?

I don't know.

Nikolay: Yeah, yeah, yeah.

Pieces of dirt, right?

No, no, no, like actions of dirt.

But anyway, like I'm saying, these are operations. And if you sum up some numbers, it might be the same buffer which you hit multiple times. Not sure about other operations, but for hits, definitely, in a nested loop we can have hits for the same buffer many, many, many times.

But when you say we need to convert,
I totally agree.

I did it for a long time.

I noticed very long ago that once you convert it to bytes, megabytes, gigabytes, sometimes terabytes actually, tebibytes, right? Gibibytes.

Once you convert it, engineers have an aha moment.

Always.

I mean, those who see it for the first time.

Because 100 buffer hits, or 100 buffer reads, there's nothing to them, actually. It is what it is. Hard.

But once you say, okay, here we read 800 kibibytes to return a number, like an 8-byte number, you can already feel how much it is... To return 1 byte, you are reading 100 or 800... how many? Kibibytes. Each kibibyte is 1024 bytes. Wow, it's not very efficient, right?

How about having, I don't know, like a hash table or something to find this number? And then: oh, okay, this is why, because it's a sequential scan. And this sequential scan is just scanning the whole table. If we remember big O notation, a sequential scan has terrible, terrible performance, because of so many operations. And it also depends on the size of each tuple, how many tuples are present in one 8-kilobyte block, 8-kilobyte buffer, right? So if there are only a few tuples there, we need to read a lot of data from memory or from disk.

So once you switch to a B-tree index, it becomes only a few

buffer operations.

Even if you have a billion rows, it will be like 7.

I don't remember exactly, but

Michael: it's

Nikolay: like 7 buffer hits.

That's it.

And even, okay, they are still like 8 kilobytes, it's a lot,

but it's already much better than if it would be sequential scan,

which is insanely slow in this case.
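A minimal sketch of that comparison (hypothetical table and column), where the buffer numbers rather than the timings tell the story:

```sql
-- Without a suitable index: a sequential scan touches every page of the table,
-- so the plan shows something like "Buffers: shared hit=54055" for a few million rows
EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM accounts WHERE id = 12345;

CREATE INDEX ON accounts (id);

-- With a B-tree index: typically just a handful of buffers
-- (root page -> inner page -> leaf page, plus the heap page)
EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM accounts WHERE id = 12345;
```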

And you think, okay, that's why it's slow, because we deal with

huge volumes of data.

And when we apply index, we suddenly deal with very low volumes

of data.

Right?

And this is number 1 reason why database becomes fast.

This is how indexing works.

Like, we just reduce the number of I/O operations. And when I say I/O operations, I often include operations with memory, hits. Because, well, we know operations with memory versus operations with disk, it's like a 1,000 times difference. It's, of course, a lot, but still, like, we can compare it, and 1 read is roughly like 1,000 hits to memory. Unless this read is also from memory, from the page cache in this case.

So they're very similar.

So what I'm trying to say, we talked about this reasoning and I agree with you, we need to convert to bytes and present it. At the same time, we need to somehow have some remark that these buffer hits might be to the same buffer. It's not only... It's a volume... It's some abstract volume of data. We can have a kilobyte or a megabyte of data hitting the same 8-kB buffer, right?

So it's okay.

But comparing volumes of data, it's so good.

Because first of all, you start thinking about data volumes,

which is the number 1 reason for slowness, especially if we forget

about concurrency.

Because concurrency is a different topic.

If we take just 1 query, there are cases when it can be a very CPU-intensive workload, but in the majority of cases it will be an I/O-intensive workload for 1 query, for databases, right? And our goal is to help Postgres, with indexes, with maybe some redesign, to deal with as few operations with memory and disk as possible, and it means we need to focus on BUFFERS.

Michael: Yeah, are there any...

I think there might be a few other exceptions, but I think they're

more...

Yeah, I don't think that...

Well, maybe it's only one. I think maybe CPU as well, actually. I was thinking JIT compilation, that's...

Nikolay: Probably. Recently we had a discussion about RLS, and there we had inefficient memory work, CPU work, like checking some volatile function or stable function also, right? So if it's in a loop for all rows, it's CPU-intensive work.

Michael: Yeah, there have been, I was just trying to think, there

have been a couple of other times I've really not been able to

spot issues by looking at BUFFERS.

But they're not actually, it's not because Postgres isn't reading

data, it's just it's not reporting it in EXPLAIN ANALYZE yet.

So actually, maybe getting it on by default will encourage some

more reporting of BUFFERS as well.

So, for example, I don't think triggers report the BUFFERS read. Like, if you have, you know, the famous case of not indexing a foreign key and then having ON DELETE CASCADE. That trigger might tell you it's taking many, many seconds, because it's having to do a sequential scan of the referencing table, but no BUFFERS are reported as part of that.

Same with in-memory operations, like sorts or hashes. They could be really slow even though they're done in memory, maybe you've got quite a large work_mem, and maybe it's the dominant part of the query plan, but no BUFFERS are reported as part of that. Because it's in memory. It will report the BUFFERS if it spills to disk. For reads of tables and indexes, we get reports even if it's from memory. But for operations like sorts and hashes, we don't. So there are a few things that are still doing work where it would be good to get those reported.

But it doesn't go against what you're saying in principle.

It just means I can't use what Postgres is actually giving me

in EXPLAIN (ANALYZE, BUFFERS) to actually diagnose the issue there.

Nikolay: Yeah, I remember we also touched on the topic that it would be really great to see details for BUFFERS. For example, how many BUFFERS from heap, from index, maybe distinguished per index, some details to see. Because if we go back to the original approach for dealing with EXPLAIN ANALYZE, I see that even many hackers, blog posts, provide plans, ANALYZE plans, but they don't use BUFFERS still.

I hope this will change, of course,
over time.

So usually, okay, we have time,
we try to guess, okay, this time

is lost here, why?

Okay, because sequential scan.

But in terms of data volumes, what
do we have?

Only rows, right?

Expected and actual rows.

Logical rows.

And sometimes we think, okay, we
fetch 1 row here.

Michael: Oh, we get width as well.

Like an average estimated width.

Which is...

Multiplied by rows it gives you some idea, but it's only the width of what's being returned, not the width of what's being read.

Nikolay: Exactly.

This is like, okay.

Yeah, we can multiply width by rows and get, basically, the number of bytes. But it will be at the logical level, and the underlying level might be very different, drastically different. For example, if there are a lot of dead tuples which Postgres checked to return this very row, right? Maybe it checked a thousand dead tuples. And it's hidden if we look only at rows and timing. And we have no idea why timing is so bad, right? Okay, we returned 1 row in the index scan, but why so bad? Here's why: we go to BUFFERS and we see that a lot of BUFFERS were hit or read. I don't know, it doesn't matter. That's why timing is bad.

And then, my point is that it's
already very useful, But imagine

if we saw, oh, actually, from this
particular index, from this

particular table, we got those
operations with BUFFERS.

It means that to return this row,
so many BUFFERS, even with

index scan, it's bad.

Oh, that's why, because it's bloat.

This row had many versions not cleaned up, and the index scan needs to check the heap table to get version information. So this would be useful: to have detailed BUFFERS maybe, right?

Michael: And I had this case recently. In January, I was doing some office hours calls just to
some office hours calls just to
get an idea of what kind of issues

people were facing, and I had some really good conversations, thank you to everybody that jumped in on those. But one of them, they were using our tool pgMustard, and one of the tips we show people, we call it read efficiency, looks for exactly that. When are you reading lots of data? And we actually, for long-term users, used to call it table bloat potential, or like bloat potential. So it could be index bloat, could be table bloat. But we renamed it read efficiency a few years ago because there are other possible causes of it as well. So it's a tricky one.

And so on this office hours call, they were hitting quite a few read efficiency issues, but they couldn't diagnose exactly where it was: was it index level, was it table level? They looked at both, they didn't seem bloated, they reindexed, they tried a lot of things, and they couldn't reduce it. So it was a really interesting call. But yeah, that was my first port of call: this used to be true more in older versions of Postgres, but indexes especially can get extremely bloated, so can tables in some cases. But yeah, it's amazing that you can suddenly see that by turning on BUFFERS. And that is something I wasn't expecting to be true. Because I see bloat as kind of more of a system-wide issue, but to be able to spot it in query analysis was really cool.

Nikolay: Oh, bloat can be very tricky.

Sometimes we have, with our consulting clients also, we have

like bloat is low, but some query suffers. And it turns out, we usually call it local bloat: for particular IDs, the situation is super bad. It can also be not bloat, but, you know, sparsely stored records. They are distributed among many BUFFERS. And you expect, okay, I have a very narrow table, only like 4 or 5 columns, I expect a lot of tuples to fit inside 1 page. But somehow, reading 1,000 records, rows, I deal with thousands of buffer operations, what's happening here? And if we just check ctid, we can understand that each row is stored in its own page, so we have a lack of data locality here.

And so clustering with pg_repack without downtime could help, or

partitioning helps usually in such situations and so on.
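A small sketch of that ctid check (the table and filter are hypothetical): the first component of ctid is the heap page number, so if the number of distinct pages is close to the number of rows, the rows are sparsely stored:

```sql
-- ctid is (page, offset); going through text and point lets us extract
-- the page number and count how many distinct heap pages the rows live in
SELECT count(*)                               AS rows_matched,
       count(DISTINCT (ctid::text::point)[0]) AS heap_pages_touched
FROM events
WHERE account_id = 123;
```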

Yeah, yeah, yeah.

So, BUFFERS are exactly the way to feel all these problems better.

But I agree, I like your idea to talk about read efficiency and

maybe write efficiency as well.

Because for end user, it's obvious.

Okay, I have, I return this data, I return only 25 rows on my

page.

So I expect not gigabytes of data to be fetched from buffer pool,

or even worse, from disk, right?

Or from page cache, which is between disk and the buffer pool.

So I expect maybe only like some kilobytes, right?

Maybe tens of kilobytes.

Even not megabytes, if it's quite a narrow table.

Postgres should not do a lot of work to show 25 rows.

If it does a lot of work, something is not right and we need

to optimize this query.

And without BUFFERS, we only can guess.

With BUFFERS, we see it, right?

With buffer details, it would be even better, right?

Someday maybe we will have detailed BUFFERS per each database

object.

Some statistics would be interesting, I think, to observe.

And there is criticism of this approach.

I recently had an interesting discussion on Twitter, on X, and somebody told me that in Oracle there is the ability to flush caches. Right. And in Postgres, it's tricky. Usually it means we need to restart Postgres to flush the buffer pool, and also to echo 3 to the relevant path to flush the Linux caches, if you want to go hardcore and check a very cold state.

And this is interesting.

Michael: I think you can restart.

Nikolay: Restart what?

Michael: The Database.

Like if you're testing locally, for example, or on a clone, if

you restart the Database, doesn't that reset caches?

Nikolay: Restart of Postgres will only empty the buffer pool. But if it's Linux, we also have the page cache, right? We can flush the page cache with a simple command, of course, if you have sudo.

Michael: Yes.

Nikolay: So it's easy. I always Google it: echo 3 to some path, and then sync, and that's it.

Michael: If you're on a managed service, what's the best you can do?

Nikolay: Well, on a managed service, you cannot do this, of course.

Michael: Restart is, I think, the best you can do.

Nikolay: Reboot. Reboot, yeah. Yeah, yeah, reboot.

Well, this is hardcore.

So let's talk about this.

I also see a constant desire to
check the cold state in the Database

Lab we built.

Some users use it for query optimization with some bot we have, Joe bot, very old stuff, but it works really well for query optimization and also for experiments, right? So you can check on a thin clone, on a branch, you can check various ideas. And there people say, okay, I see the hot state.

Data is already cached.

I want to check the cold state.

And I think it's a natural desire, but yeah.

Well, every time we say there is
no such easy way.

And interestingly, I didn't know this, but in PostgreSQL 17, in pg_buffercache, there is a new function, pg_buffercache_evict, and you can try to use it. I think we are going to try it at some point, when Postgres 17 is more present on production systems.

We probably will consider adding
this to make the buffer pool

empty, right?

But unfortunately, this function,
I suspect it won't work really

well.

It's definitely not for production,
first of all, right?

It's for like lab environments
we build.

And unfortunately, per documentation,
it won't evict BUFFERS

which are pinned by, for example,
autovacuum, vacuum process

and so on.

So in some cases it will fail,
we will need to think how to deal

with it.

Of course, restart is definitely
working in this case.
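For what it's worth, a sketch of how this might be used on Postgres 17+ (the relation name is hypothetical, and as Nikolay notes, pinned buffers will simply fail to be evicted):

```sql
CREATE EXTENSION IF NOT EXISTS pg_buffercache;

-- Try to evict all shared buffers that belong to one relation;
-- pg_buffercache_evict() returns false for buffers it cannot evict (e.g. pinned ones)
SELECT count(*) FILTER (WHERE evicted)     AS evicted,
       count(*) FILTER (WHERE NOT evicted) AS skipped
FROM (
  SELECT pg_buffercache_evict(bufferid) AS evicted
  FROM pg_buffercache
  WHERE relfilenode = pg_relation_filenode('my_table')
) s;
```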

So if we think about why people need the cold state at all, what's your opinion on this?

Michael: Well, I think it's coming
from a good engineering rationale.

I want to make sure the worst case
isn't too bad.

It's definitely, you want good
engineers to be thinking about

edge cases, to be thinking about
the worst possible case.

But I also think that in practice,
the kinds of optimizations

we're doing here, the whole point
of the discussion here is we're

saying try and reduce the amount
of data being read to as minimal

as possible.

If it's an important query, try
and get a very, very good access

path for that query.

That will involve getting the buffer
numbers down as much as

possible, whether they're cached
or not.

Maybe you want to explain to people
in the PR, or you need to

explain to your team, how much
faster has this made it?

So I can understand wanting to
compare cold cache state to cold

cache state, and then having these
kind of relative differences.

But I would encourage going the
opposite route.

It's much easier to work with warm
cache, especially when you're

doing query optimization, you're naturally going to warm up the

cache if you're running the same query over and over again.

So I would encourage more going in the opposite direction and say compare... Yes, it definitely applies for OLTP, but even for OLAP, analytical queries, I don't see the downside of saying to your team, these, I mean, it's still EXPLAIN ANALYZE, right?

If you're showing query execution plans before and after, which

is often what people are doing, kind of show this is what it

was, this is what it will be, it's already imperfect in terms

of reporting numbers.

It's not exactly the same as what the client's seeing.

So if we're already imperfect, I don't see the downside of comparing

warm cache to warm cache, so I tend to run things a few times,

get the...

Nikolay: Before optimization and after optimization, we compare

2 cases, both should be warm, and we focus on not...

Even if we have reads still, for example, buffer pool may be

smaller in lab environment.

We usually have much smaller buffer pool because we run multiple

Postgres instances on the same machine.

It's a shared environment.

So if you deal with large volumes of data, it might not fit the

buffer pool, so you are going to see reads.

But the idea is: let's just sum up reads and hits, maybe even writes and dirties, however we name it, right? So all 4 buffer operations, if we just sum them up and use that as a universal metric, this is now our number of... Or just maybe reads and hits, it depends, right? But overall, we just see this is the overall volume of data. And this is what you return, 25 rows.

Michael: I like that a lot, that's exactly what we do. But I get the sense that what people really want is to compare timings. So they want to say, this query that used to take... Because sometimes it's coming from a complaint, right, or something. This query that used to take 20 seconds is now 500 milliseconds.

Or this query that used to be 200 milliseconds is now less than

a millisecond.

So they want this kind of difference in timing, even though behind

the scenes it's a difference in BUFFERS.

So for that, if you want kind of apples to apples comparison,

instead of trying to get cold and be fair on the cold side, I'd

just say it's easier to be fair and do it on the warmer side.

I know it's not perfect, and it might be that your real users

are hitting a cold cache state, but if that's what you want,

that's the direction I would go.

Nikolay: Yeah, well, we have slightly different methodologies

here.

I usually say, okay, we start from the statement, the query is

slow, it takes like 30 seconds or 1 minute.

It's super slow, unacceptable, we want it below 100 milliseconds

or 10 milliseconds.

And then I say, forget about timing for now, completely.

Focus only on BUFFERS.

If we check BUFFERS and see a low volume, 100 BUFFERS, 100 buffer

hits and reads, to return 25 rows, we know our micro-optimization

is over.

We need to go and see the whole picture of what's happening with

concurrency.

Probably we'll have some... Maybe this query is waiting on lock acquisition for all those seconds.

That's it, right?

So it's already this.

But if we confirm large volumes of data, we forget about time.

We focus only on obtaining sane numbers for BUFFERS.

Sane means, okay, hundreds, thousands, not more.

Even thousands to return 25 rows, it's already quite a lot.

And we must do it.

Sometimes, of course, it's not easy.

Sometimes we need to do denormalization, some complex surgery on our database involving long-lasting operations with backfilling and so on, but once we achieve this, in most cases we can have sane numbers of BUFFERS, and then we say, okay, now we can compare timing. And comparing timing in a lab environment, it still might differ from production.

Michael: Yeah, of course.

Nikolay: But of course, I know, I know, I use it as well. I say, we improved timing a hundred, like 10,000 times. It happens, right? And this is good for a final comment, you know, in the end, but only when we did the work on BUFFERS fully, right? This is me.

Michael: Yeah, given it's you, with your reputation and your teachings, you could leave that last line as "we improved buffer reads", or read efficiency, for example. Read efficiency, and it would probably be almost the exact same number.

Nikolay: Not necessarily, but yeah, maybe.

Michael: You'd be surprised how often it's close.

Nikolay: Okay, good, good, good.

Yeah, that's good.

Maybe I should think about it and also present this information

during consulting activities and in tools as well.

Yeah, I think it's good.

So yeah, read efficiency.

And you take number of rows and width and just multiply, get

bytes from there and then...

Michael: Yeah, it's a tricky one, because for different scan types you expect different amounts of efficiency, so yes, it's slightly

Nikolay: more involved, also interesting, right?

Michael: Yeah, but I think the main differences are: sequential scans are very different from, well, Sequential Scan and Bitmap Scan have some similarities.

Index Scan and Index Only Scan have some similarities, but then

like looped index scans are somewhat different to like range

index scans in terms of how many tuples, like in a loop for example,

you can't get fewer than the number of loops, like it has to

do at least that many.

So yeah, there's some subtlety, but yeah, that's it essentially.

Nikolay: Yeah, but back to the cold state, I also agree. I can imagine you would need it. And unfortunately, it's possible, as I see it, only in a lab environment where you are alone. Because flushing the operating system page cache, for example, is a global operation.

Michael: And when you control the hardware, right?

Or like the, you know, the VMs.

Nikolay: Well, if it's a managed service situation, again, restart, as you said, is also an option, but again, this only works if you're alone, and this is expensive. We aim to make experiments super, super, super, extremely fast and cost-efficient. So it means a shared environment. Many people work on the same machine, so if one decides to flush everything, others will notice, right?

And with this function, pg_buffercache_evict, I think if we manage to make it work, I see it as useful. Even imagine we have the page cache, or in the case of DBLab it's the ZFS ARC, which is common: if one block is cached, it's cached for all clones, all branches, and it's very good in terms of efficiency. But sometimes, indeed, engineers want to check the inefficient situation. Okay, in this case, I think what we will do is: if a user requests it for their particular database, their particular Postgres instance running on this machine, we can ask to evict maybe everything from the buffer pool, or for particular tables and indexes, say everything. And if we succeed, there are chances that these buffers are cached in the underlying cache, right? Page cache or ARC.

In this case, query timing is not going to be very different,

but plans will be different.

We will see reads instead of hits.

Observing buffer numbers, we will see the difference.

Okay, reads now, not hits.

And what I'm thinking, in this case, we could consider this situation relatively cold.

Michael: Lukewarm.

Nikolay: Yeah, yeah, yeah.

And we could just say, okay, we know that reading from disk is

roughly a thousand times slower than from memory.

So we could do some estimation of the real cold state.

Michael: Well, or there's that...

How often do you see clients having
track_io_timing on?

Nikolay: Oh, very often.

Michael: Great.

Okay, that's really positive.

Because that gives us an idea of how long those reads, rather than hits, took. And you could do some division of reads by I/O timing to get an idea of where they're coming from.
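As a quick sketch of what that looks like (any table works; this one is hypothetical), with track_io_timing enabled the plan shows I/O timings alongside the buffer counts, so you can see how much of the time was spent waiting on reads:

```sql
SET track_io_timing = on;

-- With BUFFERS, plan nodes now also include lines like
--   "I/O Timings: shared read=12.345"
-- next to "Buffers: shared hit=... read=..."
EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM accounts WHERE id = 12345;
```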

I think there might also be some work, I think I saw something, either in a commitfest entry or a hackers thread, to try and get

Nikolay: the details.

It's like kcache, but at the micro level, right? Yeah. Well, it's possible if you have access to everything; it's possible from proc, process number, io, you can get it from there, I think.

Michael: But it would be good to
have it in Postgres core so

that we could even get it from
managed services, for example.

Nikolay: Yeah, yeah.

Well, there is something that can be done here to improve observability of work with a particular query, if you have control.

Yeah.

But I'm trying to say, like, even with this evict function, we can get some interesting information, even if we cannot allow ourselves to reset everything, including the page cache.

So it already can be useful in
non-production environments to

study the query behavior if it's
a clone.

Yeah, yeah.

So I feel there's a lot of things
that can be improved in this

area to work with queries.

And especially it's important to continue this work, because I suspect we will use some tools to work on this at a massive scale.

Like for example, if we fetched
from...

We already know during consulting,
we did it with a few clients.

With auto_explain, we fetched
a lot of queries, hundreds usually,

sometimes thousands.

It's insane.

Nice.

Examples with plans.

And then, well, actually, you know,
because we partnered, in

some cases we use pgMustard to provide
additional insights, which

is great.

And then we have a lot of stuff.

Unfortunately, you know, unfortunately, I still don't see how to avoid a human here. Well, not like that, I like to involve a human there, but I would like to involve them less, you know. Because usually, right now, we have a lot of cases of query analysis like that, and then we need to draw some common picture from it, right? And this is like going above all those plans and saying, okay, we have a pattern here.

For example, bloat is obvious or something like this, like, or

index health is not good.

And to draw this picture, a human is involved heavily right now. I'm thinking, like, over time probably we will build some additional macro-level analysis for micro-level cases, for hundreds or thousands of them. This is going to be great. And then looking at BUFFERS, and actually we have BUFFERS there as well. You were right that in auto_explain, timing is often not present. People are afraid of the overhead. BUFFERS also have overhead, but kind of smaller.

Right.

And yeah, we had articles, you had articles about this as well.

Yeah.

So yeah, and so with cold cache it's tricky, and there is this

function, that's it.

So Postgres 17 plus.

What else?

There is macro level, right?

And at the macro level we can talk about pg_stat_statements: blocks read, hit. By the way, there, yeah, it's also the naming, we talked about it. The naming suggests it's data volumes, but it's okay. And so read, hit, written, dirtied, for shared buffer blocks, in pg_stat_statements.

Also, there is pg_stat_kcache if we want real disk I/O to be observed. And again, I advocate including pg_stat_kcache in any setup. It's a great extension.

And there we can talk about the macro level, like, okay, this query ID has this volume per second, or this volume per query, or it's responsible for 90% of all hits to the buffer pool happening in the database. Maybe it doesn't look super slow, but it's so intensive with the shared buffer pool.

Michael: That's a good point.

I almost always encourage people to look at their top queries

by total time.

Unfortunately, with pg_stat_statements, at least, there's no

kind of total blocks.

Because it's split into these many, many steps, you could order by shared hit plus shared read and get a good sense of one type, or by dirtied for a different one. But yeah, you'd have to sum; in most cases I'm thinking of that would give a good system-wide aggregation, you need to sum 2 columns, I think, to get a good...

Nikolay: Reads and hits, right?

Michael: Ordering.

Yeah.

Nikolay: Yes, yes.

So we usually... It's an underestimated approach with pg_stat_statements, to order by the sum of reads and hits.

Michael: Yeah.

Nikolay: It's super important.

This is how you identify I/O-intensive
queries.

And now they can be mostly hits, but if the database grows, they can be converted more and more into reads. Or vice versa, you're going to double your memory size and the buffer pool size, so reads will be gone. But just sum them and consider them together to understand how I/O-intensive the query is.

And think, oh, this query probably
needs optimization, especially

if it returns only 1 row or 0 rows.
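A sketch of that ordering, assuming the pg_stat_statements extension is installed: sum the shared buffer hits and reads per query and sort by the total to find the most I/O-intensive statements, whether they are currently served from cache or not:

```sql
SELECT queryid,
       calls,
       rows,
       shared_blks_hit + shared_blks_read AS total_buffers,
       pg_size_pretty((shared_blks_hit + shared_blks_read) * 8192) AS approx_volume,
       left(query, 60) AS query_start
FROM pg_stat_statements
ORDER BY shared_blks_hit + shared_blks_read DESC
LIMIT 20;
```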

So, yeah, in this case, we can
also consider buffer-focused optimization,

which is important.

I think it's underestimated.

But also, it's worth mentioning this interesting blog post from Andrei Lepikhov, who gave, I would say, a hacker-researcher perspective on this. Actually, I didn't mention that this is my point for sure for the lab environments we build: it's important to say timing is super volatile, it's unpredictable. You can have one timing today, another tomorrow, one on production, another on a clone of production. Very different timing.

And it will be a big question to
track all reasons why timing

is unpredictable.

And there is a concept, there is a simple methodology: if we deal with a complex problem, let's split it into pieces and analyze the pieces separately.

So when we take a clone and bring
a single query and optimize

it, we already get rid of concurrency
issues, heavy locks, lightweight

locks, everything, like forgetting
about them.

And we deal with a single query
alone in 1 clone and it's already

super good.

But we still need to have something
reliable in terms of metrics

to optimize.

And timing is not reliable because
it's unpredictable.

It depends on the states of caches.

If it's a shared environment, maybe
CPU and disk are super busy,

and there's noise from neighbors
when we optimize in this case,

right?

But once you focus on BUFFERS inside optimization, you have an invariant metric. Okay, it can be reads versus hits, but the sum of it will be constant if you repeat the query.

For this particular clone, for
this particular query, these parameters,

everything is the same.

Plan is the same, right?

So you have the same thing.

It's repeatable, it's reproducible.

This gives you a foundation for optimization, because without it you have a very shaky foundation, not a solid foundation, right? Timing.

So, and then you optimize and you...

I like your idea about optimization,
like, actually I used it.

I used it.

I said, okay, 10,000 times improvement in timing. And I remember I mentioned BUFFERS as well, but I didn't expect people would value this comment a lot.

Maybe with BUFFERS by default,
we should push on this methodology

more and more.

But what Andrei is talking about
in this article is that for

research purposes and so on, instead
of timing, maybe we should

focus only on reads.

And reads are something that could be a target in itself for optimization, and a basis for deciding whether the system works efficiently or not. Because we are talking about databases, and I/O is always the number one thing in terms of what takes time.

And I like this idea.

Yes.

So like this is very close to what
I see for many years building

lab environments, non-production
environments, which can help

people scale their teams and databases
and processes and so on.

Yeah, so what's your opinion about
this article?

Michael: Yeah, I thought it was
good.

I think all of, I think this blog's great, but it's very, I think it's very hacker-focused. And I think that's good, right? Like, this is a good topic for hackers as well.

There's a lot of blog posts I see
from people often explaining

performance issues and not including
BUFFERS, and it seems like

a missed opportunity to educate
on that front.

So this is a great argument for
the people doing the educating,

I think, or even doing the implementations
of why it can be helpful

to see this.

I think he makes some good points
as well about hardware changing

over time and his initial paragraph
is even about I can't reproduce

this paper.

Nikolay: I can't reproduce timing, yes.

Michael: Yeah, because of all this. And it's not just that, it's also, and this is a slightly different point, but there's insufficient setup information, which is really common in benchmarks; it's really common to not include which exact machines were being used, or which exact...

So without that information, BUFFERS
can still be used as like,

oh, this is interesting.

Sure, there might be optimizations over time, so that Postgres can use fewer BUFFERS for certain operations, or certain indexes become more efficient.

There's been a bunch of work to
optimize B-trees in Postgres

over the years, and I would expect
that to lead to slightly fewer

buffer reads in a bunch of cases.

But they would be easier to see,
and easier to see the differences

in than timings.

And I think he's also got a really
good graph here that shows

that there's not 0 variance in
buffer numbers, but it's much,

much lower and it's way more stable
than timings.

Nearly completely stable with some
exceptions, which is my experience

as well.

Nikolay: Yeah, yeah, yeah.

I agree. Especially, for example, if you run the query for the first time and the cache is not there yet, it will give you,

Michael: yeah, planning

Nikolay: time, it will give you huge overhead and huge buffer numbers, but the second execution will fully shave it off.

Michael: Yeah, planning BUFFERS.

Nikolay: Planning BUFFERS is a
separate topic.

It's a very important topic.

I like to have it.

Michael: Me too, but I don't fully
understand them yet.

Like why do we get so many more
in the first execution and then

why do we sometimes get none reported
at all?

Nikolay: Ah, okay.

You mean, yeah, yeah, yeah.

I think we discussed it maybe,
not sure if you, I saw some very

weird behavior in planning BUFFERS.

Yeah.

Maybe something is not tracked
as well, as you've mentioned.

Michael: I think it must be.

Nikolay: Foreign keys, yeah, yeah, yeah.

So maybe we should dig into this
topic and raise it, the discussion

about that.

Yeah.

So back to Andrei's blog post.

Michael: Yes.

Nikolay: I agree.

And like, "I cannot reproduce": we build lab environments sometimes very different from production, just for the sake of cost optimization. Like, if you have a 360-core EPYC, 5th generation EPYC, on production, and almost a terabyte or more of memory, it doesn't mean you cannot optimize your queries on a Raspberry Pi. You can, on magnetic disks, very cheap.

And that query fully cached might
be slow for you on production,

might be like a second.

You go and optimize, you understand
that you have shitty hardware.

In this case, you just see, okay,
it takes minutes or hours.

But BUFFERS are the same, and the plan is the same. You can make decisions on super cheap hardware, with smaller caches, everything smaller, a very slow disk; you can still make cautious decisions and move forward with optimization, find a solution, see that you optimized buffer numbers like a thousand times, and expect a similar effect, maybe the same effect, on production, affecting time accordingly.

Right.

And this is the, this is the power
of lab environments.

I think everyone should have them
super cheap, shared.

So many people and CI pipelines
work together and focusing on

BUFFERS is magic.

Right?

Michael: Yeah, I think it's easy.

I think you're completely right.

I think it's easy for people to
make the mistake of then also

not using sufficient data volumes.

I know you don't.

I know that's not what you're advocating
for.

But yeah, it's the data volumes
that matter way more than the

power of the disks or the specific
CPU.

Nikolay: Yeah, usually people say,
okay, we are going to have

weaker hardware for non-production,
and we are going to focus

on relative optimization.

So if it was a second in production,
10 seconds in this non-production,

which is weak, we are going to
optimize from there.

But it's still shaky because of the state of the caches and so on. Just focus on BUFFERS, and then you can understand it.

Again, there are exceptions, I
agree, such as we discussed with

RLS and so on.

It's possible that your particular case is CPU-intensive, but these are exceptions to the main rule. In the absolute majority of cases, you will be dealing with BUFFERS during optimization, right?

So congrats on the Postgres 18 change.

And I also want to thank Andrei
Lepikhov for this blog post,

which is very welcome.

I think we need more posts about the importance of I/O metrics at various levels, like the single-query level, the micro level, or the macro level, like pg_stat_statements, pg_stat_kcache.

So yeah, I hope this topic will
grow and be more popular, this

methodology.

Michael: Yeah, agreed.

Another thank you to the people involved in getting this committed to Postgres 18, and big thanks. Oh, and I think this might be our first episode since we both got Postgres Contributor status, Nikolay, so congratulations to you, and thank you to whoever nominated us, or whatever the process is there.

Thank you to everyone involved.

Nikolay: Good, see you next time.

Michael: See you soon, bye.
