Reads causing writes
Michael: Hello and welcome to Postgres.FM, a weekly show about
all things PostgreSQL.
I am Michael, founder of pgMustard, and this is Nikolay, founder
of Postgres.AI.
Hey Nikolay, how's it going?
Nikolay: Hello, Michael.
Going great.
How are you?
Michael: Yeah, good.
Thank you.
So what are we talking about this week?
Nikolay: Let's talk about why we can have writes on replicas.
Michael: Yeah, or the thing I've seen people get confused by
is why they can get writes even on the primary for SELECT queries,
like for read queries.
But yeah, I like your framing.
Nikolay: Yeah, it can happen on primary as well.
Yeah.
I mean, this surprise can happen on the primary as well.
Michael: And specifically where people might notice this is seeing if they get buffers in query plans, like shared...
Nikolay: Dirtied. Exactly.
Michael: But also, I think the place I've seen people spot it,
at least when they mention it in blog posts, is pg_stat_statements
or monitoring tools that are looking at pg_stat_statements
and seeing that some of the top offending queries for buffers
dirtied could even be SELECT queries, and that really surprised
them.
So it's kind of one of those surprising moments where you think,
ah, is something badly wrong?
And then you look into it and realize maybe not.
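For anyone who wants to spot this in their own system, a query along these lines over pg_stat_statements (the dirtied and written counters are standard columns there) surfaces the top statements by buffers dirtied, and plain SELECTs can absolutely show up:

```sql
-- Top statements by shared buffers dirtied; SELECTs can appear here too.
SELECT queryid,
       left(query, 60)     AS query_start,
       calls,
       shared_blks_dirtied,
       shared_blks_written
FROM pg_stat_statements
ORDER BY shared_blks_dirtied DESC
LIMIT 10;
```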
Nikolay: Yeah, actually, to be fully clear, writes can happen during SELECTs because of temporary files as well, right? If you think about writes to disk. But this is pretty straightforward. We can see it in pg_stat_statements as well, as temporary blocks written and read, and...
Michael: Yeah, is it local or is it temp?
I think it's local.
Nikolay: No, local is different. I always forget what local is.
Michael: No, you're right. You're right. Local is for temporary tables, that's why I get confused. Local is for temporary tables and temporary objects.
But temp is for, like, if for example a sort or hash spills to disk because it's too big for work_mem, or work_mem times hash_mem_multiplier, then you get temp blocks read and temp blocks written.
But I think what people are spotting is the dirtied, and being like, wait, that should...
Dirtied normally means, you know, an UPDATE or a DELETE.
Nikolay: It's so confusing, right? I'm checking the pg_stat_statements documentation, and for local blocks read, dirtied, and written, and local block read time and write time, it only says "total number of local blocks read by the statement". Total number of local blocks, as if everyone understands what a local block is. Well, local block, okay, I think you're right. It's a block.
The longer I deal with Postgres, what is it now, almost 20 years, right? The more I dislike the fact that we call operations blocks. These are operations, not blocks. Because if we say 10 blocks written, it's actually 10 block writes. Maybe it's the same block which we've written 10 times, right? So it should be block reads, block writes, block... Dirty, I don't know the proper word for it. Dirtyings, I don't know.
But anyway, here, of course, I think it would be great to have an explanation of what a local block is. A local block, I think, is, again, an operation. Local block operations are what we get when we work with temporary tables, which I always try to avoid for different reasons. Maybe we should talk 1 day about temporary tables. But temp blocks are not about temporary tables; they are about temporary files, which can be created during execution of a query dealing with normal tables, anytime, because work_mem is not enough, right?
So again, this is easy. This part is easy. If we see some SELECT or UPDATE or anything producing a lot of temp blocks read or written, we just need to consider raising work_mem, or optimizing the query so it deals with smaller volumes of data, right? For example, I don't know, increase the selectivity of our... I don't know, it's a separate topic anyway. Let's just note that writes can happen because of temporary files. Of course, SELECTs can produce temporary files because work_mem is not enough. That's it.
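A quick way to see this case in action: force a sort to spill by shrinking work_mem and watch the temp counters in EXPLAIN. This is a sketch; the exact numbers and the spill threshold depend on the system:

```sql
SET work_mem = '64kB';  -- deliberately tiny, to force a spill
EXPLAIN (ANALYZE, BUFFERS)
SELECT g FROM generate_series(1, 100000) AS g ORDER BY g;
-- The plan should report something like:
--   Sort Method: external merge  Disk: ...kB
--   Buffers: temp read=... written=...
RESET work_mem;
```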
But then sometimes we have huge work_mem, definitely enough, and a SELECT is still writing. Or, well, technically dirtying: blocks dirtied. These operations, I will keep calling them operations, not data. These operations can happen during SELECTs. And if a block is just dirtied, it doesn't necessarily go to disk immediately, because that's the job of the checkpointer and the background writer. But it's already, like, not good. A SELECT is producing writes to the buffer pool, to the shared_buffers area. And of course, if it's written, it's even worse, because I'm pretty sure this also makes the SELECT slow, similarly to when it creates a temporary file and writes to it, right?
Michael: Yeah, good point. Although I think in the 2 cases that we're going to be talking about... actually, no, there could be a lot of them. I was thinking it's generally, in a lot of the cases I've seen, only 1 block at a time. But actually, you could easily have a scan doing a lot of these. So yeah, it could be a lot of data.
Nikolay: Yeah, and actually, to be absolutely precise, a written block doesn't necessarily go to disk either, because it's written to the page cache. And then pdflush or something, this is already about Linux, should write it to disk at some point. Written doesn't necessarily mean that it reaches disk.
And I think to see that, we should use a different extension. pg_stat_statements is great, it's available everywhere. I think it's the most popular extension, at least it should be among people who care about observability. And there's an opinion that it should go to core at some point, that it should stop being an extension, especially after query ID was propagated to pg_stat_activity and auto_explain, right? Definitely, pg_stat_statements should be in core.
But what I'm saying is that there is pg_stat_kcache, which extends pg_stat_statements to let users see metrics related to actual physical resources, such as user CPU, system CPU, and also disk I/O. Real disk writes, real disk reads: how many bytes are written and how many bytes are read for each query ID, which is great.
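As a sketch of how that looks in practice (column names have changed between pg_stat_kcache versions, so treat these as an approximation; recent versions report per-phase byte counters such as exec_reads and exec_writes):

```sql
-- Physical I/O per query, joined with pg_stat_statements by queryid.
SELECT s.queryid,
       left(s.query, 50) AS query_start,
       k.exec_reads      AS physical_bytes_read,
       k.exec_writes     AS physical_bytes_written
FROM pg_stat_kcache() AS k
JOIN pg_stat_statements AS s USING (queryid)
ORDER BY k.exec_writes DESC
LIMIT 10;
```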
I think we talked about BUFFERS finally being committed to EXPLAIN ANALYZE by default. In Postgres 18, unless it's reverted, and I hope it won't be reverted, EXPLAIN ANALYZE will always include BUFFERS in the output, which is a great victory. We talked about it for a couple of years on this podcast, I think almost since the very beginning.
And I was thinking what topic to
choose next to complain about
occasionally, regularly.
Michael: Or maybe campaign rather
than complaint.
Nikolay: Campaign, okay.
Michael: More positive, yeah.
Nikolay: Right, so what kind of campaign to launch after this? I was happy, but I was like, oh, I will miss talking about BUFFERS in EXPLAIN and so on. So I chose this topic. I think we should focus on all managed Postgres platform builders and campaign for them to include pg_wait_sampling and pg_stat_kcache in their supported extensions.
Because it's super important to have them both to be able to do proper query analysis in different aspects. pg_stat_statements is great. It has a lot of metrics, and it keeps growing in terms of the number of metrics, but it doesn't have what pg_stat_kcache and pg_wait_sampling provide. So we need those extensions on each serious database, especially heavily loaded ones, and pg_stat_kcache is great here. And this would allow us to see: oh, actually this query, being a SELECT, caused actual physical disk writes.
Michael: Yeah, I'm going to have
a think.
If we only get 1 campaign, I'm
not sure that's mine, but maybe
I like it.
It's a good idea.
Nikolay: Okay, please consider it. I'm definitely launching a campaign, at least on my Twitter and LinkedIn. Actually, I already launched it, and I saw a reaction. I mentioned that these extensions are present on all serious setups which we've helped maintain for quite some time. And pg_stat_kcache has been used in serious databases I'm dealing with for a very long time, many years. And I know some platforms use it, but in different countries, right?
And the only case I see is that pg_wait_sampling is available on Google Cloud SQL. pg_stat_kcache is not available there, but pg_wait_sampling is, which is great.
Michael: Are you sure?
Nikolay: I'm sure.
pg_wait_sampling is available,
so you can have...
Imagine RDS: they have Performance Insights, which is great. But that information is not available through a SQL interface. I'm not sure about CloudWatch, maybe it's available there, but it's a different API. I want to see details right in SQL, because I have tools to analyze pg_stat_statements, and I want to extend that to active session history analysis and also to physical metrics, which pg_stat_kcache provides.
So Cloud SQL has pg_wait_sampling, but it doesn't have pg_stat_kcache. Others don't have either of these 2. And the reaction I saw was from one of the Cloud SQL engineers, who promised to think about it. I mean, it would be great, and Cloud SQL could be much better if they had pg_stat_kcache. I think they're already much better in terms of observability, better than RDS, because pg_wait_sampling is better than Performance Insights.
Usually we combine multiple tools for observability: for active session history on RDS or Aurora, you go to one place, their own place, and for everything else, you go to another place. It's not convenient. In the case of Google, they have a tool, I think it's called Query Insights or something like this. They have it exposed in their own monitoring interface, but they also expose it for anyone through the SQL interface, which is great, right? Through pg_wait_sampling.
If they add pg_stat_kcache, I will be happy. But I'm also going to ask others to consider these two extensions. For example, Crunchy Bridge, Supabase, who else? Everyone. I actually think RDS should consider pg_wait_sampling as well, and pg_stat_kcache too. Why not? It should just be an option. Users could decide.
So back to our question. I think if you have pg_stat_kcache, you can see actual writes happening from queries, and you can see SELECTs where you know work_mem is enough, so it's not temporary files. You can confirm it through pg_stat_statements, or for an individual query execution via EXPLAIN (ANALYZE, BUFFERS). And then you're left wondering what's happening, right? Why is a SELECT writing to disk? Or at least writing to the buffer pool, which is already quite a surprise, right?
Michael: Yeah. Well, until recently, I knew this was a phenomenon, and I'd read about it a couple of times, seen it on the mailing list and in a post or two. But I only knew of one of the potential reasons. And the reason I think this came up recently is that there was a good blog post by Alex Jesipow, I'm not sure how to pronounce that, sorry, who mentioned a second reason as well.
So the first reason is related to the setting of hint bits, and maybe we can go into detail on that one first, because that's the one I've seen most often and it comes up the most. Do you want to discuss that one first?
Nikolay: Yeah, let's do it
Michael: So why do why do we have
hint bits and what are hint bits?
Nikolay: By the way, maybe it's Alex Jesipow. Alex is from Germany, I see, and works at Luminovo, which is interesting because we worked together last year. I mean, with his colleagues. It's interesting, I just realized. It's cool. Yeah, I'm going to send him this episode for sure.
So we have 2 cases, right?
2 cases.
And I'm not sure which 1 is the easiest. Let's start with the easiest.
Michael: I think hint bits is the easiest to understand, and it's the most common. So my understanding is: when new data is written to a page, at that moment Postgres doesn't yet know if it's part of a much larger, longer transaction, and we don't yet know if that transaction is going to commit or get aborted and rolled back. So at that moment we cannot mark on the data page that this transaction has been committed and therefore should be visible to new reads of the page. That information lives in a separate log once the transaction does commit. And that means if somebody's reading the page, they need to check which of these row versions has already been committed for sure. And for the ambiguous ones, the ones where we don't know yet, these hint bits, these 4 different flags, can tell us what status it is. Is it unknown? Is it definitely committed? Is it already out of date, like already replaced? So in the case of these reads causing writes, it's a subsequent read coming along, seeing a new row version that we don't know yet whether it's been committed, checking the commit log, and then, having read it once, setting the hint bit and dirtying the page, writing to the page again, so that any future reads don't have to check.
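This is easy to reproduce in a scratch database. A sketch (exact buffer counts will vary, and on a busy system a checkpoint or the background writer may get there first):

```sql
CREATE TABLE hint_demo AS
SELECT g AS id FROM generate_series(1, 100000) AS g;
CHECKPOINT;  -- flush, so the next dirtying clearly comes from the read

-- First read after the loading transaction committed:
EXPLAIN (ANALYZE, BUFFERS) SELECT count(*) FROM hint_demo;
--   Buffers: shared hit=... dirtied=...   <- hint bits being set

-- Second read: hint bits are already set, so nothing gets dirtied:
EXPLAIN (ANALYZE, BUFFERS) SELECT count(*) FROM hint_demo;
--   Buffers: shared hit=...
```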
Nikolay: Right, yeah. So the tuple can be, like, invisible, dead, frozen, right? And the most interesting thing is that this write can happen on a replica, which is kind of a surprise for people, right? How come?
Michael: Oh, sorry, just to go back to hint bits: there's a really good page on the Postgres wiki that describes it really succinctly, and there are 4 hint bits: xmin committed, xmin aborted, xmax committed, xmax aborted.
Nikolay: Yeah. So, yeah, that's it, actually. We just need to know about this phenomenon, and that's it. It should not be a surprise that it happens. And subsequently...
Michael: And it's healthy, right? It's not a sign of anything having gone wrong. In a healthy system it will happen quite often.
Nikolay: Yeah, and a subsequent call doesn't cause this again, right? Because the page is already updated. You can consider it kind of like warming up caches, but vice versa, because it's writes. So this overhead is only present on the very first call with the same parameters. Of course, different parameters can cause such writes in different pages.
Michael: And it's a necessary result
of us wanting to be able
to serve things concurrently.
Because of MVCC, we need versions
of rows.
And we need to then know which
1 should be visible to which transactions.
So this is a necessary part of that.
Nikolay: Right.
So yeah, let's talk about the second
case.
Michael: Yeah, this.
Yeah.
The second was more interesting
to me.
Like I hadn't come across this
before.
Nikolay: Yeah.
Let's talk about it.
Michael: So in Alex's post, and I think you're probably right on the surname pronunciation, but I'm not going to try it again, he describes these as "page pruning", which I had not heard it called before. And when I think of pruning, my mind naturally goes to, for example, the mechanism for removing an empty page at the end of the heap. Is that also called pruning? But this is different.
But this is different.
Nikolay: I think it's truncation. If you talk about removing the last page, when vacuum does it, it's truncation.
Michael: That makes sense.
So this is almost like, I think you mentioned before the call, it was described as a kind of in-page vacuum. Yeah. So it's like...
Nikolay: And it can happen during SELECTs, on the fly, which is interesting, right? Which is also strange. But let's recall how HOT updates are organized. It's a great feature, unfortunately not always available, because it requires 2 conditions to be met. First, we are changing values only in columns which are not indexed. And second, there is enough free space in the same page where our old tuple is stored. In this case, so-called HOT chains are created, and there can be multiple versions of the same tuple in the same page. Indexes are not updated in this case, because it's a HOT update. This is an optimization to fight index write amplification, which is a terrible Postgres MVCC behavior many projects suffer from: when updating 1 row in a table that has many indexes, we need to update all of the indexes, producing a lot of WAL writes and just making updates slow and heavy. In the case of HOT updates, when those 2 conditions I mentioned are met, Postgres writes only to the same page where the tuple is already stored, because there is enough space, and it doesn't update indexes at all, because we are changing values which are not indexed. And it produces a new tuple, a new row version, inside the same page, creating a chain: it can be new, new, new. And when we have an index scan, the index points to the first version in the page, and then it's quick to jump between versions and find the actual 1 inside the same page.
Michael: 1 crucial thing that I think becomes important later is that the chain information is stored in the header of the page, whereas the data from the row versions is stored at the end of the page.
Nikolay: Right, right.
This is true.
Yes, it's so.
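If you're curious how often updates in your tables actually qualify as HOT, the standard statistics views track it; this uses only documented columns of pg_stat_user_tables:

```sql
SELECT relname,
       n_tup_upd,
       n_tup_hot_upd,
       round(100.0 * n_tup_hot_upd / nullif(n_tup_upd, 0), 1)
           AS hot_update_pct
FROM pg_stat_user_tables
ORDER BY n_tup_upd DESC
LIMIT 10;
```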
And when we have old versions, at some point they need to be deleted. It can happen during vacuuming, or it can happen earlier, if, while dealing with this page, Postgres decides: okay, we can clean up the old versions right now, why not? And it can happen during a SELECT as well, which is very strange. Yeah.
Michael: Well, it's so cool. But I think, again, there's a condition: it will only do this if there is not much space left on the page. I think the number is 10% of the page left in terms of free space. And there's some subtlety around whether you've changed fillfactor.
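One way to actually watch pruning, sketched with the pageinspect extension on a throwaway table (superuser tooling; the prune_demo table name is just for illustration):

```sql
CREATE EXTENSION IF NOT EXISTS pageinspect;

CREATE TABLE prune_demo (id int PRIMARY KEY, val int);
INSERT INTO prune_demo VALUES (1, 0);
UPDATE prune_demo SET val = val + 1;  -- val is not indexed: HOT update
UPDATE prune_demo SET val = val + 1;

-- Line pointers of page 0: lp_flags 1 = normal, 2 = redirect
-- (the root of a HOT chain), 0 = unused (reclaimed by pruning).
SELECT lp, lp_flags, t_ctid
FROM heap_page_items(get_raw_page('prune_demo', 0));
```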
Nikolay: Which means that this is an effort to maintain the HOT-ness of updates further, because without this, we would need to go to another page, and that would definitely lead to updating all the indexes, right?
Michael: Yes.
But I think the optimization of not doing it unless the page is quite full means we don't have to do it that often. So for example, if we have, say, a really quick 10-15 updates of the same row, and they all fit on the same page, and we're having reads of that row in between, we're not cleaning up old versions on each read; we're waiting until the page gets full and then doing it in 1 go. So I think it's a pretty smart optimization.
Nikolay: It's a kind of trade-off, a balance between 2 bad situations. So we don't do it too often, but we also try not to let this tuple drift to another page, which would cause the need to update indexes, because indexes point to the old page. This is quite interesting. Again, this can happen during a SELECT. But can it happen only on the primary? Yes or no? It's my guess, I honestly don't know. Because can it happen on a replica?
Michael: Yeah, well, actually, I don't know. Let us know in the comments.
Nikolay: Yeah, this is an interesting question, because if it's happening on a replica, it means we have very different page contents on the replica and the primary. Vacuum happening on a replica? Please, no, because this changes the content of the page. I cannot understand how page content could stay synchronized in terms of tuples otherwise.
Michael: It's a physical replica, yeah.
Nikolay: Yeah, so it should happen only on the primary, because...
Michael: Okay, yeah.
Nikolay: And the replica should just get this content. Now, with hint bits it's different. Hint bits add information and don't change how tuples are stored inside the page. This thing changes the page: it's cleaning up space for new tuples. So it should happen on the primary, I think.
Michael: That makes sense.
Yeah.
Logically on a physical replica.
Nikolay: Right.
Okay.
Well, good.
Michael: There's 1 more open question. When this latest blog post was shared to Reddit, I saw a really interesting comment that no 1 has replied to, there's almost no activity on it, but somebody posed a question, and I think it's a good 1: whether there might be a third possible cause of this. And it's more of a question to the listeners, because I'm not actually sure, I don't know myself and haven't had a chance to confirm for sure.
But there's a PD_ALL_VISIBLE flag on pages. And that might be a third way: if SELECTs can flip that based on the visibility of all the tuples in the page, then that might be a third case. But I don't know if SELECTs can. So yeah, again, if you know, I'd be really interested to hear.
Nikolay: Yeah, yeah. So I guess that's it, right?
Michael: So yeah, great blog post. There's also an old 1 from Nikolay Sivko on the Okmeter blog that I'll share. So there are a few articles to read more about this.
Nikolay: Correction: Nikolay currently has a new startup, which is called Coroot. Yes. So Okmeter is his old startup. And Peter Zaitsev, founder of Percona, joined Nikolay recently, not very long ago. Coroot is a great project. But I guess this blog post was written while Nikolay was working on the previous project: he just started observing writes happening from SELECTs, was very surprised, and decided to blog about it.
Michael: Yeah, well it was a monitoring
project as well, right?
Nikolay: Yeah, it had very good Postgres-related features, which I guess were in some ways inherited by Coroot. But Coroot is amazing, just a couple of words about it. If you like flame graphs, Coroot can show them dynamically: imagine you have a dashboard, you choose a time range for Postgres, and you see flame graphs, but hanging down, like rotated 180 degrees, or mirrored, right? Horizontally mirrored. You can see details if you have debug symbols installed, and I think everyone should always have debug symbols installed. You see deep details of what's happening, where time is spent. Dynamic diagnostics like this, it's crazy. I think I should probably add this to my recommendations for managed Postgres platform builders, because considering Coroot is also a good thing for observability.
Michael: It's open source, right?
Nikolay: It has an open source part. I think it's like an open core model, I'm not sure. But the main thing is definitely open source. Again, I'm not sure.
Michael: Thanks so much, Nikolay.
Thanks everyone for listening.
We'll catch you next week.
Nikolay: Thank you, Michael.
See you soon.