Michael: Hello and welcome to Postgres.FM, a weekly show about

all things PostgreSQL.

I am Michael, founder of pgMustard, and this is Nikolay, founder

of Postgres.AI.

Hey Nikolay, how's it going?

Nikolay: Hello, Michael.

Going great.

How are you?

Michael: Yeah, good.

Thank you.

So what are we talking about this week?

Nikolay: Let's talk about why we can have writes on replicas.

Michael: Yeah, or the thing I've seen people get confused by

is why they can get writes even on the primary for SELECT queries,

like for read queries.

But yeah, I like your framing.

Nikolay: Yeah, it can happen on primary as well.

Yeah.

I mean, this surprise can happen on the primary as well.

Michael: And specifically where people might notice this is seeing buffers in query plans, like shared...

Nikolay: Dirtied.

Exactly.

Michael: But also, I think the place I've seen people spot it,

at least when they mention it in blog posts, is pg_stat_statements

or monitoring tools that are looking at pg_stat_statements

and seeing that some of the top offending queries for buffers

dirtied could even be SELECT queries, and that really surprised

them.

So it's kind of one of those surprising moments where you think,

ah, is something badly wrong?

And then you look into it and realize maybe not.
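For anyone who wants to check their own system, a sketch of the kind of pg_stat_statements query that surfaces this (assuming the extension is installed and preloaded; column names are from its documented view):

```sql
-- Top statements by shared buffers dirtied; SELECTs can and do appear here.
SELECT queryid,
       left(query, 60)     AS query,
       calls,
       shared_blks_dirtied,
       shared_blks_written
  FROM pg_stat_statements
 ORDER BY shared_blks_dirtied DESC
 LIMIT 10;
```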

Nikolay: Yeah, actually, to be fully clear, writes can happen

during SELECTs because of temporary files as well, right?

If you think about writes to disk.

But this is pretty straightforward.

Again, in pg_stat_statements, we can see it as temp bytes or blocks written and read, and...

Michael: Yeah, is it local or is it temp?

I think it's local.

Nikolay: No, local is different.

I always forget what local

Michael: is.

No, you're right.

You're right.

Local is for temporary tables, that's why I get confused.

Local is for temporary tables and temporary objects.

But temp is for, for example, when a sort or hash spills to disk because it's too big for work_mem (or work_mem times hash_mem_multiplier); then you get temp blocks read and temp blocks written.

But I think what people are spotting is the dirtied, and being like, wait, that shouldn't...

Dirtied normally means, you know, an UPDATE or DELETE.
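The distinction discussed here maps onto two separate column families in the pg_stat_statements view; a hedged sketch of looking at both:

```sql
-- local_* counters grow when a statement touches temporary TABLES/objects;
-- temp_* counters grow when sorts or hashes spill temporary FILES
-- (work_mem overflow). Both are distinct from shared_* buffer counters.
SELECT queryid,
       local_blks_read,
       local_blks_written,   -- temporary tables and objects
       temp_blks_read,
       temp_blks_written     -- spill files from sorts/hashes
  FROM pg_stat_statements
 ORDER BY temp_blks_written DESC
 LIMIT 10;
```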

Nikolay: It's so confusing, right?

I'm checking the pg_stat_statements documentation, and for local blocks read, dirtied, written, and local block read time and write time, it only says "total number of local blocks read by the statement".

Total number of local blocks, as if everyone understands what a local block is.

Well, local block, okay, I think you're right.

It's a block.

I also noticed, the more I deal with Postgres, after how many years, like 20 years, right? The more I dislike the fact that we call operations blocks.

These are operations, not blocks.

Because if we say 10 blocks written, it's actually 10 block writes.

Maybe it's the same block which we've written 10 times, right?

So it should be block reads, block writes, block... dirty, I don't know the proper word for it.

Dirtyings, I don't know.

But anyway, here, of course, I think it would be great to have an understanding of what a local block is.

A local block, I think, is, again, an operation.

Local block operations are what is caused when we work with temporary tables, which I always try to avoid for different reasons.

Maybe we should talk one day about temporary tables.

But temp blocks are not about temporary tables, rather about temporary files, which can happen during execution of a query dealing with normal tables anytime, because work_mem is not enough, right?

So again, this is easy.

This part is easy.

If we see some SELECT or UPDATE or anything producing a lot of temporary blocks, read or written, we need to just consider raising work_mem, or optimizing queries so they deal with smaller volumes of data, right?

For example, increase the selectivity of our... I don't know, it's a separate topic anyway.

Let's just point that writes can happen because of temporary

files.

Of course, SELECTs can produce temporary files because work_mem

is not enough.

That's it.
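As a quick, hedged illustration of a spill (the exact plan output will vary by version and machine):

```sql
-- Force a spill with a deliberately tiny work_mem (64kB is the minimum),
-- then sort more data than fits in it.
SET work_mem = '64kB';

EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM generate_series(1, 100000) AS g ORDER BY random();

-- The Sort node in the plan then reports something like:
--   Sort Method: external merge  Disk: ...kB
--   Buffers: temp read=... written=...

RESET work_mem;
```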

But then sometimes we have huge work_mem, definitely enough, and a SELECT is still writing to disk.

Or, well, technically dirtying: blocks dirtied.

These operations, I will keep calling them operations, not data. These events, these operations can happen during SELECTs.

And if it's just dirtied, it doesn't necessarily go to disk immediately, because that is the work for the checkpointer and the background writer.

But it's already not good.

Like, a SELECT is producing some writes to the buffer pool, to the shared_buffers area.

And of course, if it's written, it's even worse, because I'm pretty sure this also makes the SELECT slow, similarly to what happens when it creates a temporary file and writes to it, right?

Michael: Yeah, good point. Although I think in the two cases that we're going to be talking about... actually, no, no, there could be a lot of them.

I was thinking it's generally only, in a lot of the cases I've seen, one block at a time.

But actually, you could easily have a scan doing a lot of these.

So yeah, it could be a lot of data.

Nikolay: Yeah, and actually, to be absolutely precise, a block written doesn't necessarily go to disk, because it's written to the page cache.

And then pdflush or something, this is already about Linux, should write it to disk at some point.

Written doesn't necessarily mean that it goes to disk.

And I think to understand that, we should use a different extension.

Because pg_stat_statements is great, it's available everywhere.

I think this is the most popular extension, at least among observability ones.

And there is an opinion that it should go to core at some point. It should stop being an extension, especially after the query ID propagated to pg_stat_activity and auto_explain, right?

Definitely, pg_stat_statements should be in core.

But what I'm saying is that there is pg_stat_kcache, which extends pg_stat_statements to allow users to see metrics related to actual physical resources, such as user CPU, system CPU, and also disk I/O.

Real disk writes, real disk reads: how many bytes are written, how many bytes are read for each query ID, which is great.
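A hedged sketch of reading those physical counters; note that column names vary across pg_stat_kcache versions (recent 2.x releases prefix execution-time counters with `exec_`, and the `pg_stat_kcache_detail` view joins them to query text):

```sql
-- Statements by actual bytes written to disk, per the kernel's accounting.
SELECT left(d.query, 60)  AS query,
       d.exec_reads,      -- bytes physically read from disk
       d.exec_writes,     -- bytes physically written to disk
       d.exec_user_time,  -- user CPU seconds
       d.exec_system_time -- system CPU seconds
  FROM pg_stat_kcache_detail AS d
 ORDER BY d.exec_writes DESC
 LIMIT 10;
```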

I think we talked about BUFFERS finally being committed to EXPLAIN ANALYZE by default: you run EXPLAIN ANALYZE in Postgres 18, and unless it's reverted, BUFFERS will always be included in the output.

I hope it won't be reverted.

It's a great victory. We talked about it for a couple of years on this podcast, I think almost since the very beginning.

And I was thinking what topic to choose next, to complain about occasionally, regularly.

Michael: Or maybe campaign rather
than complaint.

Nikolay: Campaign, okay.

Michael: More positive, yeah.

Nikolay: Right, so what kind of campaign to launch after this?

Like, I was happy, but I was like, oh, I will miss having something to talk about, buffers and how EXPLAIN lies and so on.

So I chose this topic.

I think we should focus on all managed Postgres platform builders and campaign for them to include pg_wait_sampling and pg_stat_kcache in their extensions.

Because this is super important
to have them both to be able

to do proper query analysis in
different aspects.

pg_stat_statements is great.

It has a lot of metrics.

It keeps growing in terms of number
of metrics, but it doesn't

have what pg_stat_kcache and pg_wait_sampling
provide.
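To make the pg_wait_sampling side concrete, a sketch of its sampled wait-event profile (assuming the extension is installed and preloaded via shared_preload_libraries):

```sql
-- Aggregate sampled wait events across all backends: a SQL-accessible
-- equivalent of an "active session history" style view.
SELECT event_type,
       event,
       sum(count) AS samples
  FROM pg_wait_sampling_profile
 GROUP BY event_type, event
 ORDER BY samples DESC
 LIMIT 10;
```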

So we need those extensions on each serious database, especially heavily loaded ones, and pg_stat_kcache is great here.

And this would allow us to see: oh, actually this query, being a SELECT, caused actual physical disk writes.

Michael: Yeah, I'm going to have
a think.

If we only get 1 campaign, I'm
not sure that's mine, but maybe

I like it.

It's a good idea.

Nikolay: Okay, please consider this.

I'm definitely launching a campaign, at least on my Twitter and LinkedIn. Actually, I already launched it and I saw a reaction.

So, I mentioned that these extensions are present on all serious setups which we have helped to maintain for quite some time.

And pg_stat_kcache has been used in serious databases I'm dealing with for a very long time, many years.

And I know some platforms use it,
but in different countries,

right?

And the only case I see is pg_wait_sampling
is available on Google

Cloud SQL.

pg_stat_kcache is not available there,
but pg_wait_sampling is,

which is great.

Michael: Are you sure?

Nikolay: I'm sure.

pg_wait_sampling is available,
so you can have...

Imagine RDS: they have Performance Insights, which is great.

But this information is not available through an SQL interface.

I'm not sure about CloudWatch, maybe it's available there, but it's a different API.

I want to see details right in SQL, because I have tools to analyze pg_stat_statements, and I want to extend them to active session history analysis and also to physical metrics, which pg_stat_kcache provides.

So Cloud SQL has pg_wait_sampling,
but it doesn't have pg_stat_kcache.

Others don't have any of these
2.

And the reaction I saw was from one of the Cloud SQL engineers, who promised to think about it.

I mean, it would be great, and Cloud SQL could be much better if they had pg_stat_kcache.

I think they're already much better in terms of observability, better than RDS, because pg_wait_sampling is better than Performance Insights.

Usually we combine multiple tools for observability: for active session history on RDS or Aurora you go to one place, their own place, and for everything else you go to another place.

It's not convenient.

So in the case of Google, they have a tool, I think Query Insights or something like this.

They have it exposed in their own monitoring interface, but they also expose it for anyone through the SQL interface, which is great, right?

Through pg_wait_sampling.

If they add pg_stat_kcache, I will be happy, but I'm also going to ask others to consider these two extensions.

For example, Crunchy Bridge, Supabase, who else? Everyone.

I actually think RDS should consider pg_wait_sampling as well, and pg_stat_kcache as well.

Why not?

So, yeah, it should just be an option. Users could decide.

So back to our question.

I think if you have pg_stat_kcache, you can see actual writes happening from queries, and you can see SELECTs, and you know work_mem is enough, so it's not temporary files.

You can confirm it through pg_stat_statements, or also through individual query execution via EXPLAIN (ANALYZE, BUFFERS), and that's it, you're left wondering what's happening, right?

Why is a SELECT writing to disk?

Or at least, why is it writing to the buffer pool, which is already quite a surprise, right?

Michael: Yeah.

And well, until recently, I knew this was a phenomenon. I'd read about it a couple of times, seen it on the mailing list and in a post or two.

But I only knew of one of the potential reasons, and the reason I think this came up recently is there was a good blog post by Alex Jesipow. I'm not sure how to pronounce that, sorry. He mentioned a second reason as well.

So the first reason is related to the setting of hint bits, and maybe we can go into detail on that one first, because that's the one I've seen most often and it comes up the most.

Do you want to discuss that one first?

Nikolay: Yeah, let's do it

Michael: So why do why do we have
hint bits and what are hint bits?

Nikolay: By the way, maybe it is Alex Jesipow. But Alex is from Germany, I see, and works at Luminovo, which is interesting, because we worked together last year, I mean with his colleagues.

It's interesting, I just realized. It's cool.

Yeah, I'm going to send him this episode for sure.

So we have two cases, right?

Two cases.

And I'm not sure which one is easiest.

Let's start with the easiest.

Michael: I think hint bits are easiest to understand, and it's the most common case.

So my understanding of this is: when new data is written to a page, at that moment Postgres doesn't yet know if it could be part of a much larger, longer transaction, and we don't yet know if that transaction is going to commit or get aborted and rolled back.

So at that moment we cannot say on the data page that this transaction has been committed and therefore should be visible to new reads of this page.

That information lives in a separate log once the transaction does commit.

And that means if somebody's reading the page, they need to check which of these row versions has already been committed for sure. And if there are any ambiguous ones, where we don't know yet, then based on these hint bits, based on these four different flags, we can tell what status it is.

Is it unknown? Is it definitely committed? Is it already out of date, like, is it already being replaced?

So in the case of these reads causing writes: it's a subsequent read coming along, seeing a new row version that we don't know yet whether it's been committed, checking the commit log, and then, having read it once, setting the flag and dirtying the page, so that any future reads don't have to check.
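This is easy to reproduce; a hedged sketch using a hypothetical demo table (exact buffer counts vary, and with asynchronous commit the hint-bit setting can lag slightly):

```sql
-- Freshly inserted rows have no hint bits yet.
CREATE TABLE hint_demo (id int);
INSERT INTO hint_demo SELECT generate_series(1, 100000);

-- First read after the inserting transaction commits: each tuple's
-- visibility check consults the commit log and sets hint bits, so the
-- plan typically shows "Buffers: shared hit=... dirtied=...".
EXPLAIN (ANALYZE, BUFFERS) SELECT count(*) FROM hint_demo;

-- Second read: hint bits are already set, so typically no buffers
-- are dirtied anymore.
EXPLAIN (ANALYZE, BUFFERS) SELECT count(*) FROM hint_demo;
```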

Nikolay: Right, yeah. So I think this can be invisible, dead, frozen, right, for a tuple. And the most interesting part of it is that this write can happen on the replica, which is kind of a surprise for people, right?

How come?

Michael: Oh, sorry, just to go back to hint bits: there's a really good page on the Postgres wiki that describes it really succinctly, and there are four hint bits: xmin committed, xmin aborted, xmax committed, xmax aborted.
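For the curious, the bits are visible with the pageinspect extension; a sketch against a placeholder table name `some_table` (the mask values come from `src/include/access/htup_details.h`, where the "aborted" bits are named HEAP_XMIN_INVALID and HEAP_XMAX_INVALID):

```sql
CREATE EXTENSION IF NOT EXISTS pageinspect;

-- t_infomask carries the hint bits for each tuple on page 0.
SELECT lp,
       t_xmin,
       (t_infomask & x'0100'::int) > 0 AS xmin_committed,
       (t_infomask & x'0200'::int) > 0 AS xmin_aborted,   -- HEAP_XMIN_INVALID
       (t_infomask & x'0400'::int) > 0 AS xmax_committed,
       (t_infomask & x'0800'::int) > 0 AS xmax_aborted    -- HEAP_XMAX_INVALID
  FROM heap_page_items(get_raw_page('some_table', 0));
```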

Nikolay: Yeah.

So, yeah, that's it actually.

So we just need to know about this phenomenon, and that's it.

It should not be a surprise that it happens.

And subsequently...

Michael: And it's healthy, right? It's not a sign of anything having gone wrong; it's healthy, like, it will happen quite often.

Nikolay: Yeah, and a subsequent call is not causing this, right? Because it's already updated. You can consider it kind of like warming up caches, but vice versa, because it's writes.

So this overhead is only present on the very first call with the same parameters.

Of course, different parameters can cause such writes in different pages.

Michael: And it's a necessary result
of us wanting to be able

to serve things concurrently.

Because of MVCC, we need versions
of rows.

And we need to then know which one should be visible to which transactions.

So in order to do that, this is
a necessary part of that.

Nikolay: Right.

So yeah, let's talk about the second
case.

Michael: Yeah, this.

Yeah.

The second was more interesting
to me.

Like I hadn't come across this
before.

Nikolay: Yeah.

Let's talk about it.

Michael: So in Alex's post, I think you're probably right on the surname pronunciation, but I'm not going to try it again.

He describes these as "page pruning", he calls it, which I have not heard it called before. And when I think of pruning, my mind naturally goes to, for example, the mechanism for removing an empty page at the end of the heap.

Is that also called pruning? But this is different.

Nikolay: I think that's truncation.

If you talk about removing the last page, when vacuum does it, it's truncation.

Michael: That makes sense.

So this is almost like... I think before the call you mentioned it was described as a kind of in-page vacuum.

Yeah. So it's like...

Nikolay: And it can happen during SELECTs, on the fly, which is interesting, right? Which is also strange.

But let's recall how HOT updates are organized, which is a great feature, unfortunately not always available, because it requires two special conditions to be met.

First: we are changing values only of columns which are not indexed.

And second: there is enough empty space in the same page where our old tuple is stored.

In this case, so-called HOT chains are created, and there can be multiple versions of the same tuple in the same page.

In this case, what happens: if we have an index scan, we only know the page and offset for the first version.

Indexes are not updated in this case, because it's a HOT update.

This is an optimization to fight index write amplification, which is a terrible Postgres MVCC behavior many projects are suffering from: when updating one row on a table with many indexes, we need to update all of the indexes, producing a lot of WAL writes and just making updates slow, heavy, and so on.

In the case of HOT updates, those two conditions I mentioned are met. Postgres writes only to the same page where the tuple is already stored, because there is enough space, and it doesn't update indexes at all, because we are changing a value which is not indexed.

And it produces a new tuple version, a new row version, inside the same page, creating a chain. It can be new, new, new, and when we have an index scan, it points to the first version in the page, and then it's quick to jump between versions and find the actual one inside the same page.
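A quick, hedged way to see how much benefit HOT is giving you, via the standard statistics views:

```sql
-- Share of updates that were HOT (avoided touching the indexes),
-- per table; columns are from the pg_stat_user_tables view.
SELECT relname,
       n_tup_upd,
       n_tup_hot_upd,
       round(100.0 * n_tup_hot_upd / nullif(n_tup_upd, 0), 1) AS hot_pct
  FROM pg_stat_user_tables
 ORDER BY n_tup_upd DESC
 LIMIT 10;
```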

Michael: One crucial thing that I think becomes important later is that the chain information is stored in the header of the page, whereas the data from the row version is stored at the end of the page.

Nikolay: Right, right.

This is true.

Yes, it's so.

And if we already have old versions, at some point they need to be deleted.

It can happen during vacuuming, or it can happen earlier if, while dealing with this page, Postgres says: okay, we can clean up old versions right now, why not?

And it can happen during a SELECT as well, which is very strange.

Yeah.

Michael: Well, it's so cool.

But I think, again, there's a condition where it will only do this if there is not that much space left on the page.

I think the number is 10% of the page left in terms of free space. And there's some subtlety around whether you've changed fillfactor.
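Fillfactor is the knob in question; a hedged sketch with a hypothetical table name (the default heap fillfactor is 100, meaning pages are packed full):

```sql
-- Reserve ~15% free space per heap page on future inserts, improving
-- the odds that updates find room on the same page and stay HOT.
-- Takes effect for newly written pages; existing pages need a rewrite
-- (e.g. VACUUM FULL or pg_repack) to be repacked.
ALTER TABLE orders SET (fillfactor = 85);
```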

Nikolay: And so, which means that this is an effort to maintain the hotness of updates further, because without this, we would need to go to another page.

And that would definitely lead to updating all indexes, right?

Michael: Yes.

But I think the optimization of not doing it unless the page is quite full means we don't have to do it that often.

So for example, if we have, say, a really quick 10-15 updates of the same row, and they all fit on the same page, and we're having reads of that row in between, we're not cleaning up each version each time we do a read; we're waiting until we get full and then doing it in one go.

So I think it's a pretty smart optimization.

Nikolay: It's a kind of trade-off, a balance between two worse situations and so on.

So we don't do it too often, but also we try not to allow this tuple to drift to another page, which would cause the need to update indexes, because indexes point to the old page.

This is quite interesting.

Again, this can happen during a SELECT, and yeah. But can it happen only on the primary, yes or no? It's my guess, I honestly don't know, because can it happen on a replica?

Michael: Yeah, well, I actually don't know. Well, let us know in the comments.

Nikolay: Yeah, this is an interesting question, because if it's happening on a replica, it means we have very different content of pages on the replica and the primary.

Vacuum happening on a replica? Please, no, because this changes the content of the page.

I cannot understand how page content would be synchronized in terms of...

Michael: ...tuples. It's a physical replica, yeah.

Nikolay: Yeah, so it should happen
only on the primary because...

Michael: Okay, yeah.

Nikolay: And the replica should
just get this content.

Now, with hint bits it's different.

Hint bits add extra information and don't change how tuples are stored inside the page.

This thing changes the page: it's cleaning up space for new tuples.

So it should happen on the primary, I think.

Michael: That makes sense.

Yeah.

Logically on a physical replica.

Nikolay: Right.

Okay.

Well, good.

Michael: There's one more open question.

I saw, when this latest blog post was shared to Reddit, there was a really interesting comment that no one has replied to, and there's almost no activity on it, but somebody posed a question, and I think it's a good one: whether there might be a third possible cause of this.

And so it's more of a question to the listeners, because I'm not actually sure, I don't know myself and haven't had, you know, time to confirm for sure.

But there's a PD_ALL_VISIBLE flag on pages.

And that might be a third way: if SELECTs can flip that based on visibility of all the tuples in the page, then that might be a third case. But I don't know if SELECTs can. So yeah, again, if you know, I'd be really interested to hear.
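For anyone investigating that flag: the pg_visibility extension exposes both the visibility map and the page-header bit; a hedged sketch against a placeholder table name `some_table`:

```sql
CREATE EXTENSION IF NOT EXISTS pg_visibility;

-- all_visible/all_frozen come from the visibility map;
-- pd_all_visible is the flag in the heap page header itself.
SELECT blkno, all_visible, all_frozen, pd_all_visible
  FROM pg_visibility('some_table'::regclass)
 LIMIT 10;
```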

Nikolay: Yeah, yeah. So I guess that's it, right?

Michael: So yeah, great blog post. There's also an old one from Nikolay Sivko on the Okmeter blog that I'll share.

So there are a few articles to read more about this.

Nikolay: Correction: Nikolay currently has a new startup, which is called Coroot.

Yes.

Yeah, so Okmeter is his old startup.

And Peter Zaitsev, founder of Percona, joined Nikolay recently, not very long ago. Coroot is a great project.

But this, I guess this blog post
was created while Nikolay was

working on the previous project
and just started observing the

writes happening from SELECTs and
was very surprised and decided

to blog about it.

Michael: Yeah, well it was a monitoring
project as well, right?

Nikolay: Yeah, it had very good Postgres-related features, which I guess were in some ways inherited by Coroot.

But Coroot is amazing, just a couple of words about it.

If you like flame graphs: Coroot can show them to you dynamically. Imagine you have a dashboard, you choose a time range for Postgres, and you see flame graphs, but hanging down, so they are like 180 degrees rotated, or mirrored, right? Horizontally mirrored.

You can see details, if you have debug symbols installed; I think everyone should always have debug symbols installed. You see deep details of what's happening, where time is spent.

Dynamic diagnostics like this, it's crazy.

I think probably I should add this to my recommendations for managed Postgres platform builders, because considering Coroot is also a good thing for observability.

Michael: It's open source, right?

Nikolay: It has open source parts.

I think it's like an open core model, I'm not sure.

But the main thing is definitely open source.

Again, I'm not sure.

Michael: Thanks so much, Nikolay.

Thanks everyone for listening.

We'll catch you next week.

Nikolay: Thank you, Michael.

See you soon.
