Massive DELETEs
Michael: Hello and welcome to Postgres
FM, a weekly show about
all things PostgreSQL.
I am Michael, founder of pgMustard.
This is my co-host Nikolay, founder
of Postgres.AI.
Hello Nikolay, what are we talking
about today?
Nikolay: Hi Michael, we wanted
to talk about one of my favorite
topics, massive delete, I usually
call it, but maybe you can
call it somehow differently.
Michael: Yeah, I really like this.
So when you say massive delete,
do you mean deleting a lot of
rows all at once?
Right.
Nikolay: In one SQL statement.
So for example, 10 million rows,
1 billion rows, or even just 1 million.
That's already enough to feel how
bad this is for Postgres.
So we can discuss in detail what's
happening if you do it.
I like this type of thing because
it's not only bad, it's also
good.
And we can explore that as well
because if you do it on production,
it's bad, but if you do it on purpose
in some testing environment,
it can be very helpful.
Michael: Oh, I see what you mean.
So it's a useful tool for certain
things.
Right.
Okay, cool.
Yeah, I get it.
Nikolay: Stress test,
Michael: yeah.
So you mentioned this in a previous
episode, that it would be
good to cover this in more detail,
and I was really interested.
But I think it's important to say
what we're not gonna cover
as well.
So massive deletes, some people
might also be thinking of like
some other things that sound like
deletes.
Like some people often implement
soft deletes, which would be
like implemented by updates.
We're not going to talk about that
kind of thing.
We're talking about only actual
Postgres deletes at this point.
Nikolay: Some people complain why
soft deletes are not already
implemented by default in the database
system, like, natively.
Well, maybe it's a good idea
to have natively, but...
Yeah, we will skip this topic.
So, maybe it's another episode
because it has interesting nuances.
Yeah, but let's talk about what's
happening if you want to delete
many, many rows, how to do it correctly
on production, not to
put your database on knees, and
how to use it in non-production
environments for something good,
for good reasons.
Right?
Michael: I like it.
Where do
Nikolay: you want to start?
Michael: Maybe with some use cases,
like When are you generally
seeing people do these huge deletes?
I've got a few examples I've seen,
but I'd be interested in yours.
Nikolay: Well, usually people do
massive deletes by mistake.
Not understanding how Postgres
works.
And MVCC and so on, and they expect
it's a simple operation.
They probably saw some warning
that you shouldn't expect immediate
reduction of disk space used.
For example, you delete from some
table and think, okay, I will
free some space.
No, you won't free some space because
it's only part of the work.
Another part is Autovacuum, which
will need to actually delete
those tuples.
But, yeah, what?
Michael: And even then, you won't
see, if you're monitoring disk
space, you won't see that as free
space again.
Nikolay: It depends.
Autovacuum also can truncate relations
if some pages at the end
are deleted, all tuples are deleted
and vacuumed.
So physically deleted already,
right?
Pages become free, it will truncate
them and it will reclaim disk
space.
Michael: Yeah, okay.
But I feel like those cases are
the exception rather than the
norm.
I see a lot more cases where people
are deleting a subset of
data from a non-partitioned table
and really surprised when their
relation doesn't decrease in size.
But just to cover a few of the
other basics, I think this is
actually a topic that's coming
up more and more because we have
these new privacy laws that require
people to be able to, to
have the right to be forgotten.
And deleting data is quite a safe
way of not having a data leak.
The less data you have, the less
data that can be leaked security-wise.
So I feel like there's these forcing
functions.
I think also people are starting
to get serious about having
data retention policies.
So how long should we keep each
type of data around for?
So I think there are some of these
use cases that mean this is
coming up more.
And in the past, at least in the
last 10 years, I saw a lot more
people kind of just storing everything
forever and not really
thinking about it.
So I didn't know if this is something
you're seeing more often
or it's just an age-old problem.
Nikolay: Well, yes.
So I think GDPR and everything,
it's not that big, because, you
know, a single user delete is
usually just some fraction of
the whole, right?
But usually, massive delete happens
when we need to clean up
and understand there's some old
data which is not super useful,
we pay for it, and we want to postpone
the moment when we need
to scale our instances, database
instances, machines, and so
on.
In this case, in general, I would
like to mention before COVID,
I went to the VLDB conference, which
is a conference I know since
being a kid, basically, when I
was a student learning database
theory.
I've heard about this conference.
So I went to it because it was
very close in Los Angeles.
I remember a keynote, some researcher,
she was presenting the
talk, which was interesting.
It was about delete basically.
She showed the exponential growth
of the total volume of data in
our databases in the world.
It was like phew.
Because storage becomes cheaper,
we produce a lot of data, big
data, huge data, and so on.
This is like some zettabytes or
something.
It's an insane curve, insane, with
some forecast.
And she said, we spent decades
learning how to store data properly,
ACID, not to lose data, to be reliable,
highly available, and so on.
Now it's coming time to learn how
to clean up and delete.
Understand which data can be safely
deleted, how to delete it
efficiently and so on.
And I was super inspired.
I also did a talk at a conference
because around the same time
I had a production incident.
A senior engineer, a back-end engineer,
with a good understanding
of analytical databases actually,
got a task to delete some old
data, preparing for some marketing
campaign.
Because the forecast was saying
we either need to invest a lot
to upgrade or we need to clean
up before we do this huge marketing
campaign.
So he went to production and
just performed delete from table
where created at older than 1
year.
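For illustration, that kind of statement looks roughly like this; the table and column names (events, created_at) are made up here, not the actual schema:

```sql
-- one statement, one transaction, millions of rows touched at once
DELETE FROM events
WHERE created_at < now() - interval '1 year';
```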
He estimated in advance that it
will be 10 million rows and we
got an incident.
Downtime, more than 10 minutes,
cost the company a lot of money.
I was super impressed.
A lot of money.
And he almost left the company
himself because he was super embarrassed.
And at the same time, there was
this VLDB keynote about delete,
and I'm saying, like, we
need something, right?
We need, at least we need to educate
people that delete should
be split into batches.
Of course, if you have good disk,
maybe you'll be fine.
But in that case, we had quite
slow...
It was some enterprise-level,
some Sun, I think, or some
old system.
It was on-premises.
And also, at the same time, I was
advising this company; I was
helping as a consultant with
Postgres.
The company was growing super fast,
a huge startup.
And I was saying, guys, you do
need to increase max_wal_size.
1 gigabyte is not enough.
So, with the default max_wal_size
setting, an untuned checkpointer,
and quite slow storage.
I think maximum throughput for
writes was maybe 600 megabytes
per second, maybe 500.
It's not like current modern NVMe
which gives you like 2 gigabytes
per second, for example.
These 2 factors plus the idea let's
delete 10 million rows, it's
not a huge number, right?
So checkpointer became crazy because
a lot of...
What's happening?
First of all, we find rows.
I mean, executor finds rows.
In that case, it was also unfortunate
that it was not a sequential
pattern.
The rows were selected in created_at
order, but physically they were
scattered, so access starts sparsely,
and first Postgres needs to put the
xmax value into this hidden system column.
It needs to put the ID of the current
transaction which deletes, right?
That value goes into the xmax column
for the first row, and the page becomes dirty.
Dirty means in memory we have a
different page than on disk,
so we need to flush it to disk
eventually.
A checkpointer needs to do it,
basically.
And then a different row is a different
tuple, basically a physical row version,
in a different page, so we dirty a
different page.
So if you need to update xmax in
100 rows, it's almost 100 pages
dirtied already, which is very
inefficient, because of this pattern
of basically random access.
It's not random, but it's not sequential.
This is the key, right?
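As a rough way to see the xmax mechanics being described (the events table and the id value are illustrative):

```sql
-- session 1: start a delete but don't commit yet
BEGIN;
DELETE FROM events WHERE id = 1;

-- session 2: the hidden system columns show the deleting transaction's ID in xmax,
-- while the tuple itself stays in place in the same page
SELECT ctid, xmin, xmax FROM events WHERE id = 1;
```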
And then the checkpointer sees, okay,
I have a max_wal_size of 1 gigabyte.
And it was Postgres, I think, 9.5,
9.6.
At that time, it means that real
distance between 2 checkpoints
was 3 times lower.
Only 300 megabytes.
You can read in Yegor Rogov's book,
Postgres Internals, a very
good explanation mentioning the
improvements and why the effective
limit is not really even max_wal_size
but 3 times smaller.
Now I think it's either 2 times
smaller or something, I already
don't remember.
So it means that the checkpointer comes
and says, okay, it's time already,
we have accumulated too many dirty
buffers, we need to flush them to
the page cache first of all, and then
something like pdflush will go
and flush them to disk.
So it starts working, producing
a lot of I/O, and it converts to
disk I/O already.
And then, boom, a different row happens
to be in the same page which was
just flushed.
But we update it again.
I mean, we update different tuples
on the same page, it becomes
dirty again.
And Checkpointer says, okay, again,
a lot of work.
Michael: So it's duplicating effort
by flushing too excessively.
Is that what you're saying?
Nikolay: Yeah.
So the distance between checkpoints
was like 20 seconds only.
Also, I learned that if I'm not
mistaken, I think many years
already passed, checkpoints can
overlap sometimes for a couple
of seconds.
It's still happening, but it's
insane.
So the checkpoint produced a lot of I/O,
and the disk couldn't handle this I/O.
And database became unresponsive
and we had a critical incident,
basically downtime, failovers,
it's insane, everything became
insane, the downtime was huge,
and that's not good.
But as I remember, there are 2
effects.
Not only do you need to flush the
same page multiple times if
you have this non-sequential access
pattern, but also once the
checkpointer or anything else made
the page clean, we have a
different effect.
Michael: Full-page images.
Nikolay: Yes, yes.
So full-page images, full-page
writes.
Since full-page writes is on, it
means that after a checkpoint,
if we visit this page again with
our change, our delete, it
goes in full to the WAL, right?
Michael: Whereas if it was in a
previous, if it had managed to
sneak into the previous before
the check, the next checkpoint,
it would have been a much smaller
amount of work.
Nikolay: Yeah, yeah.
If we had, for example, a huge
max_wal_size, like 100 gigabytes,
afford it, we understand if
we crash, startup time will be
longer, replica provisioning also
takes longer, because the recovery
point
Michael: takes time.
Nikolay: And then in this case,
even if we have random, not random,
almost non-sequential access pattern,
we visit the same page
multiple times, not sequentially.
And not only will the checkpointer
flush it just once instead
of multiple times, but also only
the first change will go to
the WAL as a full-page image, and
subsequent changes will be
represented just as tuples.
And this means WAL generation
decreases significantly if we
do this.
So it was helpful to me as a consultant
because I finally found
an argument that we need checkpoint
tuning.
It was the beginning of my practice
in the US, so it was hard
for me to find good arguments,
but this case showed okay.
I just did a very good series of
experiments.
You know I'm a big fan of experiments.
So you just start experimenting
with max_wal_size and run the
delete each time: at 1 gigabyte,
this is the I/O for this massive
delete; then 2 gigabytes, 4 gigabytes,
a logarithmic approach, up to 64
or 128 gigabytes, for example.
And you can draw a good curve.
Right?
Look at how the I/O behaves.
And usually, if you also have
monitoring, and this delete
takes like a minute or 2 minutes,
5 minutes, you can see in monitoring
that you have a plateau for your I/O
because it's saturated.
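A sketch of one iteration of that experiment, with illustrative values; max_wal_size can be changed with a reload, no restart needed:

```sql
-- set the next max_wal_size value to test
ALTER SYSTEM SET max_wal_size = '16GB';
SELECT pg_reload_conf();

-- reset the shared counters so this run can be measured from zero,
-- then run the massive DELETE under test and read the stats afterwards
SELECT pg_stat_reset_shared('bgwriter');
```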
Michael: Are you taking the total
I/O?
So let's say you're doing the same
delete each time and you're
looking at total I/O across the
time and showing that there's
less I/O as we increase the...
Nikolay: I just, yeah, I just,
Well, I did several things in
this experiment, and I think it's
worth maybe a good how-to article
or something.
Actually, it was 2 phases, 2 parts
of the whole experiment.
First is to study this I/O behavior
and checkpointer behavior.
So I did a snapshot of pg_stat_bgwriter,
which, as we know, until a recent
version contained not only background
writer activity, but also checkpointer
and backend activity for cleaning
dirty buffers.
So, as usual, I converted
buffers to gigabytes, because
this is how any engineer can start
understanding what's happening.
If you say buffers, nobody understands
except DBAs, right?
If you multiply it by 8 kilobytes,
you have gigabytes, megabytes,
everyone starts understanding.
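Something like this query does that conversion, assuming an older Postgres where the checkpointer and backend counters still live in pg_stat_bgwriter (newer versions move the checkpointer stats into pg_stat_checkpointer):

```sql
SELECT checkpoints_timed,
       checkpoints_req,
       pg_size_pretty(buffers_checkpoint * 8192::bigint) AS written_by_checkpointer,
       pg_size_pretty(buffers_clean      * 8192::bigint) AS written_by_bgwriter,
       pg_size_pretty(buffers_backend    * 8192::bigint) AS written_by_backends
FROM pg_stat_bgwriter;
```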
So I just showed that with the default
settings, the checkpointer had
a lot of I/O.
And also much more WAL was generated.
But I also made screenshots of
monitoring showing that we had
a plateau situation for disk I/O, disk
write I/O.
But when we had already like 16
gigabytes, 32 gigabytes, we already
see like a spiky pattern up and down
and this is good.
It means we have room.
Michael: Because of the batches?
Why is it spiky?
Nikolay: Yeah, some batching, something
like that; it's of course
checkpoint_completion_target,
which is close to 1, so it should
be spread out, but there is
batching there, obviously.
And this is good.
We must reiterate it: a plateau is bad.
Michael: Right.
So.
Actually, it's a good point that
this helps, this optimization,
like tuning the checkpointer is
helping with massive deletes,
regardless of whether we batch
or not.
It's helping in the case where
we don't batch and it's helping
in the case where we do batch.
Is that right?
Nikolay: Right now I'm talking
only about how tuning would have
prevented this incident.
I think it would convert a P1,
priority 1 or criticality 1, incident
into a P2.
So basically we have a slow database,
but it's not down.
Because of what we just discussed:
a well-tuned checkpointer has
less I/O to perform.
Michael: Okay.
I understand the spikiness now.
We're talking about that's when
the checkpoints are happening.
Nikolay: Yeah.
The I/O from the checkpointer,
the disk I/O, was spiky, and that's good.
It means we are not having a plateau,
we are not saturated.
Of course, it's better if you know your
numbers, you know the limit
of I/O your storage device
can handle, and you can draw
this line on the graph and understand
how far you are from it.
Basically, regular SRE practices,
starting from utilization, saturation,
errors, should be applied here to
study this incident and perform
root cause analysis.
And this was great.
It was obvious that if we reach
like 16 gigabytes or 32 gigabytes,
we are in much better shape.
We just need to have the disk I/O
headroom. And there was also a
second phase of the experiment.
I think we had an episode about
max_wal_size and checkpoint tuning.
So, the second phase, I won't go into
detail there, but in the second phase
you do need to understand recovery
time in the worst situation.
And I invented the term like double
worst situation, double unlucky.
So worst situation if your database
crashed right before checkpoint
completes and double unlucky if
at the same time you had massive
delete or something like that.
In this case it means a lot of
work during recovery.
So yeah, that's it actually.
This is what a massive delete looks
like in the wild, and you should
avoid it.
Michael: Yeah, So how do we then
go from the...
I guess we've gone from priority
1 incident to priority 2.
How do we make this not an incident
at all?
Nikolay: Batching.
So just split it to batches.
I think there might be cases when
logically you cannot afford
splitting to batches because you
must delete everything in 1
transaction.
But in my practice, I never saw
this.
I mean, I always could convince
people to split to batches and
do everything in different transactions.
So in this case, we need just to
understand what is the ideal
batch size for us.
Not too small, not too big.
Too small means a lot of transaction
overhead, too big, we just
discussed.
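A minimal sketch of a batched delete, in the simple stateless form discussed below; the table, the id primary key, and the batch size of 5,000 are all assumptions:

```sql
-- run repeatedly (from the application or a shell loop) until it deletes 0 rows
WITH batch AS (
    SELECT id
    FROM events
    WHERE created_at < now() - interval '1 year'
    ORDER BY created_at
    LIMIT 5000
)
DELETE FROM events e
USING batch
WHERE e.id = batch.id;
```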
I was
Michael: gonna say, I haven't seen
a delete use case like that.
We're deleting stuff, right?
We don't need it anymore, that's
almost by definition.
So why would you need it to all
be gone or none of it to be gone?
That doesn't make sense to me.
Nikolay: Potentially there might
be a case when you don't want
users to see parts of the
Michael: old data.
So you want them to not see anything,
but you'd rather they saw
everything than part of it.
Nikolay: Yeah, in this case, you'd
rather just adjust the application logic
so the application cannot reach
that data already, right?
Even if it's present in the database,
but you hide it already,
maybe like that, based on timestamps
or something.
But again, this is just a theoretical
discussion.
In practice I didn't see any cases
when we couldn't.
The benefits of batched deletes
for Postgres, with its MVCC model,
are always much higher
than experiencing this pain
and these risks.
So yeah, batches, and we know the
ideal size of batches, we discussed
it many, many times, starting from
the very first episode we
had almost 2 years ago.
Michael: And very recently in the
Don't Do This episode, right?
Nikolay: Yeah, yeah, yeah.
So ideal, my recipe is just try
to be below 1 second, but maybe
not below 100 milliseconds or 50.
So this is the ideal size to me based
on human perception.
Michael: Yeah.
You mentioned something in the
recent episode about it degrading
over time.
So let's say you can, you get a
clone or you have a replica of
production where you're testing
batch sizes and you get the sweet
spot.
Maybe you find out that you can
delete 5,000 rows in just under
a second.
And you think let's go with 5,000
as the batch size.
You mentioned
Nikolay: 1 hour later, you see
the same delete already takes
1 minute.
What's happening?
Michael: Yeah, so I was gonna ask,
not only what's causing it,
but how would you, like, do you
run the test for longer?
Or like, how do you think about
that stuff?
Nikolay: Well, you just need to
understand that when you delete
or update, you produce dead tuples,
and it means that only the
first part of 2 parts of the operation
is done.
It's done sequentially by your
SQL query, but there is a second
super important part, which is
vacuuming.
So delete is not complete when
you see a transaction committed.
It's complete only logically, but
physically we have these dead
tuples, basically garbage, unless
some transactions still need
those versions.
And you need to understand vacuum.
And actually, in the same company,
to perform the delete properly,
of course I split it into batches, and
I was a big fan of, oh,
I want to create a simple query
which is basically stateless.
I don't want to deal with remembering
the last ID or last timestamp
of the previous batch.
Basically, no pagination.
I didn't want to have pagination,
but this has a problem similar
to pagination for SELECTs, when you
show a huge result set split into pages
to users.
Very similar, because it's basically
the same problem.
So I wanted to do something like
stateless and I just relied
on Postgres on some index, I checked
indexes used, and then already
in production I saw this degradation.
So degradation was because of a
lot of dead tuples and auto vacuum
couldn't catch up with my speed
of deletion.
I also remember an interesting
thing.
Since then, I'm a big fan of single-threaded
maintenance jobs.
I implemented multiple threads
originally, but then I saw that
even a single thread is too fast
for auto vacuum.
And we don't need parallelization
here.
Of course, parallelization in my
case was based on SELECT FOR
UPDATE SKIP LOCKED.
Like fancy, super cool.
It was maybe like 5, 6 years ago,
like, let's do fancy stuff,
we will have like 5 workers, let's
go.
But you don't need it here.
1 thread is enough to clean up.
Because otherwise you need to speed
up vacuum somehow, and vacuum
for a single table is always single-threaded.
You cannot have multiple workers
cleaning up dead tuples from
a single physical table.
Michael: Non-partitioned, yeah.
Nikolay: If it's partitioned, yes,
they can work on multiple
partitions at the same time, but
if it's a single table, no.
Michael: I want to come back to
partitioning.
But in the meantime, like, so massive
delete also has a big impact
on, like, I was thinking about
index only scans, not for the
delete, but for other like concurrent
reads.
Let's say we're in a situation
where we are deleting a lot of
tuples from, not necessarily random,
but lots of different pages;
the visibility map is going
to be, well, there are going to
be a lot of pages that have changes
and therefore can't be true
index-only scans. So I was thinking
for that case we might actually
want to run a vacuum manually every
certain number of batches.
Nikolay: Yeah, this is what I did
and I also thought maybe I
should.
Yeah, I did.
I did manual, not manual, I mean
my script did it.
But after n batches, like 100,000
batches, vacuum.
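As a sketch, the loop simply interleaves something like this every n batches; the table name is an example:

```sql
-- prune dead tuples and their index entries, and refresh the visibility map,
-- so subsequent batches don't slow down
VACUUM events;
```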
Michael: And how did you determine
n?
Nikolay: Good question.
Well, experiments.
So if you want to go stateless,
you need to know a lot of dead tuples
will be a problem, because the index
you use, on the primary key or
on the creation timestamp, created_at,
also needs vacuum, since it will be
pointing to dead tuples, and that's
why subsequent batches will be
slower; degradation over
time will happen.
But if you vacuum, you get rid
of these links to dead tuples
And it becomes good again.
So based on that, applying the
rule that we don't want our transactions
to exceed 1 or 2 seconds because
users might notice some bad
things.
When we have a long transaction,
we also block...
We also are some kind of a problem
for all autovacuum workers
globally for our database.
So if you see, oh, it degraded
to 2 seconds, this is the right
time to run vacuum, right?
But I eventually chose not to do
vacuum, actually.
I had many iterations of my solution,
and finally I gave up and
decided to go stateful.
So I performed just this keyset
pagination, just based on the
last timestamp.
I select the next batch based on
that.
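A hedged sketch of that stateful, keyset-paginated batch; :last_created_at is a placeholder for the value the script remembers from the previous batch, and the table and column names are illustrative:

```sql
WITH batch AS (
    SELECT id, created_at
    FROM events
    WHERE created_at > :last_created_at                 -- start right after the previous batch
      AND created_at < now() - interval '1 year'        -- overall retention boundary
    ORDER BY created_at
    LIMIT 5000
),
deleted AS (
    DELETE FROM events e
    USING batch
    WHERE e.id = batch.id
    RETURNING batch.created_at
)
SELECT max(created_at) AS last_created_at FROM deleted; -- feed into the next batch
```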
It's super fast even if we had
a lot of dead tuples.
So I don't care.
Because autovacuum is good at skipping
pages it can skip.
It's not like 10 or 15 years ago.
It's already quite good.
So multiple runs versus just single
big run.
This is a good question, by the
way, worth exploring.
Is it good to run vacuum quite
often in this case, versus let's
run it just once.
In terms of how much I/O it will produce,
in terms of how many disk
operations in total, this is a
good experiment to conduct and
compare these numbers.
Michael: Well, and doesn't it depend
a little bit on like, I
think it depends a bit on your
workload, right?
If you have no other concurrent
users, running 1 vacuum is likely
more efficient than running lots
of small ones.
But if you have other concurrent
users and you're forcing heap
fetches, then maybe your IO impact
from those reads outweighs
the duplicate effort on vacuum.
Or even when I say duplicate effort
on vacuum, I guess we're
talking about
Nikolay: the same kind of a similar...
Yeah, you're bringing up a good
point here.
If we decide to run a single
vacuum, not frequent ones, we
will live with a lot of buffers
which are not marked in the visibility
map.
Yeah.
Yeah, and index-only scans will
degrade.
But you add complexity here.
I know, I know.
In my case, I don't care about
other users for now.
I like this approach.
First you study the problem alone
and then you start adding what
you did, like, oh, we have other
users.
Let's already understand the problem
in a vacuum, so to speak.
Michael: I know what
Nikolay: you mean.
In single user mode, basically.
Then you start adding considerations,
what about other users?
If you know how Postgres behaves,
how your workload behaves for
single user situation, It's already
better.
You are much more confident when
you start looking at others
as well.
Yeah, it's a good point.
If we rely on index-only scans
a lot, probably we should vacuum
frequently to avoid heap fetches.
You saw it in your plans at pgMustard
quite a lot, right?
Michael: Yeah, it's quite a common
one.
It's not the biggest issue, but
it comes up, especially with range scans
where people are returning
a lot of rows or doing
aggregates that rely on index-only
scans, in these kind of hybrid
workloads with analytical-style queries.
They can degrade quite quickly in certain
environments if you go from
an index-only scan across a few
thousand pages to an index-only
scan with a few thousand heap fetches.
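A rough way to check how much of a table is still all-visible, which is what index-only scans depend on ('events' is an example name):

```sql
SELECT relname,
       relpages,
       relallvisible,
       round(100.0 * relallvisible / nullif(relpages, 0), 1) AS pct_all_visible
FROM pg_class
WHERE relname = 'events';
```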
Nikolay: Yeah, so your deletes
can be a problem for these.
So SELECTs might degrade because
the visibility map is outdated.
Michael: Exactly.
Nikolay: So yeah, more frequent
vacuums might be good.
It's an interesting discussion,
interesting point as well.
But in my case, when I gave up
on my idea to be stateless, I liked
it so much.
I also switched to the single-threaded
approach, forgetting about this
SELECT FOR UPDATE SKIP LOCKED.
And so I just need 1 value to always
remember, so easy.
And then performance became very
predictable, reliable, stable.
So all batches were the same size
and latencies were the same.
Also interesting, when you do it,
you start thinking, oh, I want
a progress bar basically.
I want to predict ETA, how much
time left.
And if it's degrading or not stable,
this latency of each delete,
you cannot predict reliably.
But once you perform keyset pagination
here, you know your
batch, it takes like 300 milliseconds
each batch.
Great.
I integrated it into my SQL with
a progress bar.
It reported the percentage, how much
was left, and the ETA.
Michael: So, quite
Nikolay: an important thing.
Michael: A quick question on the
ordering of batches.
Do you try and do any kind of natural…
You mentioned created_at being
a little bit random, but I would
have expected, unless the tuples
are getting updated a fair amount,
that created_at might be a relatively
natural ordering.
Nikolay: There should usually be
a very good correlation between
physical location and created_at.
You can check it easily:
SELECT ctid, created_at, and ORDER BY
created_at, with some LIMIT.
ctid is the physical location, and
you can see how correlated the
physical location is with created_at.
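For example, with an illustrative table name; pg_stats also keeps a ready-made correlation estimate for the column:

```sql
-- eyeball it: ctid is (page, offset), so it should grow along with created_at
SELECT ctid, created_at
FROM events
ORDER BY created_at
LIMIT 20;

-- or use the planner's statistic directly: values close to 1.0 mean well correlated
SELECT tablename, attname, correlation
FROM pg_stats
WHERE tablename = 'events' AND attname = 'created_at';
```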
In my case it did have a lot of
updates in the past, so it was
basically everywhere in the table.
That's why the incident happened,
actually.
Yeah, it
Michael: makes sense.
Nikolay: So, yeah.
Oh, no.
I ordered by modified_at and I
even considered creating an index
on modified_at.
But this is an anti-pattern to me.
An index on modified_at is a problem
because you immediately start losing
HOT updates, heap-only tuple updates,
because each update, usually via a
trigger or something, changes this value,
and if you have an index on it,
by definition Postgres cannot
do heap-only tuple updates.
That means you need to deal
with the index write amplification
problem, which is nasty.
Michael: So you're optimizing your
delete, but you're messing
up your updates.
Nikolay: So now
Michael: you are considering other
users.
Nikolay: I remember I went deeper.
I decided, okay, honestly, I don't
want this index to be built.
I know it will bite me back because
of heap-only tuples.
I don't want to lose them.
I checked statistics and I saw
we have a good ratio of them among
all updates.
We have a lot of hot updates.
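Those statistics can be checked with something like this ('events' again being an example table):

```sql
SELECT relname,
       n_tup_upd,
       n_tup_hot_upd,
       round(100.0 * n_tup_hot_upd / nullif(n_tup_upd, 0), 1) AS hot_update_pct
FROM pg_stat_user_tables
WHERE relname = 'events';
```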
So I decided, you know what I decided?
I decided to rely on the created_at
index, but then somehow perform
logical replication with modified_at
based on partial correlation.
It was something crazy.
It worked quite well.
I don't remember details, but I
did some tricks there.
Avoiding this.
At first I had already coordinated
the index creation with the guys.
They said, oh yeah, let's do it.
But then I said, no, no, no, no.
In a different company, like 1 year
before that, I had already made
this mistake.
I suggested some index and then
degradation of updates happened
because we lost hot updates.
Actually, that incident led me
to the idea we need to verify
all changes, holistically checking
all queries, ideally, what
we have.
And this experimentation approach
and so on.
It's all connected and interesting.
But maybe let's skip this, since
we are out of time.
I have a question.
Michael: 2 more quick, 2 more ones.
I guess they're not quick, but
when we're done with this large
delete, what do you, like, are
there any kind of maintenance
tidy up tasks that you recommend
doing?
I was thinking, like, we did a whole
episode on index maintenance;
I think rebuilding indexes concurrently
is my...
Nikolay: This question exactly
is a good bridge to what I wanted
to discuss as the last topic.
I wanted to discuss how massive
delete can be useful in non-production.
Michael: Okay,
Nikolay: yeah.
And let's talk about this, why
your question is a bridge.
Because now I remember very well,
but in my career I forgot about
this many, many, many times.
A delete doesn't touch indexes at
all.
This is the key.
Indexes are not touched at all during
a delete.
Delete just puts xmax, that's it.
And if the transaction is considered
committed, then vacuum will remove
this, once no other transactions
need this version because they're
all already in the future.
In this case, the tuple will be deleted.
And the links to this tuple in the
indexes will also be deleted asynchronously
by vacuum, but our backend is
not dealing with indexes at all.
It only needs to use one index to
find the scope of work, the tuples
we need to touch.
Oh, well, also during planning,
as we know, all indexes are considered
by a planner and an access share lock
is acquired.
So I don't see any index maintenance
here at all.
Michael: Well, I was just thinking
about them being, let's say
we deleted 10% of our table.
Nikolay: Well, yeah, you're right.
Right, so if we don't do frequent
vacuuming...
Michael: Even if we do frequent
vacuuming, there'll still be
like 10% bloat, especially if it's
like created_at or something
where we're deleting old data.
Nikolay: Yes, vacuum doesn't rebalance
the B-tree, so the B-tree will probably
be in worse shape.
I would say if we deleted a big
fraction of the table data,
we probably need to check bloat
in indexes and probably we'll
need to rebuild them.
But I already got used to relying
on automated rebuild jobs,
which we suggest implementing on
weekends or something.
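For a one-off cleanup after a big delete, the non-blocking rebuild looks like this; the index and table names are illustrative, and REINDEX CONCURRENTLY needs Postgres 12 or newer:

```sql
REINDEX INDEX CONCURRENTLY events_created_at_idx;
-- or rebuild all indexes on the table:
REINDEX TABLE CONCURRENTLY events;
```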
Michael: But this feels like a
good time to reconsider.
Anyway, it felt worth mentioning.
I know you want to bridge to that
1, but I had one more quick one.
And that's like, more and more
I'm thinking about some of these
use cases are deleting, like a
multi-tenant SaaS application
wants to delete a single tenant.
And if you've partitioned, like
I was thinking about partitioning
and detached, like the another
way of avoiding big deletes is
if you can just drop a whole partition.
Nikolay: Or truncate it.
Yeah, depending.
Michael: Sorry, yeah, like, you
know, detach it.
Nikolay: PgQ, long ago Skype implemented
this: like, three partitions
for queue-like workloads, and then
a round-robin approach, truncating
when possible.
It's much faster, it's much better.
No work for many components, and
it's good.
Yeah, cool.
I agree.
And indexes on partitions are
physical, so if the partition is
truncated, the index is also truncated.
So, great.
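A sketch of that partitioning alternative; the partition names are illustrative, and DETACH ... CONCURRENTLY needs Postgres 14 or newer:

```sql
-- drop a whole partition instead of deleting row by row
ALTER TABLE events DETACH PARTITION events_2022_01 CONCURRENTLY;
DROP TABLE events_2022_01;

-- or, for queue-like round-robin reuse as in PgQ, just empty it
TRUNCATE events_2022_01;
```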
Yeah, and let's touch a little
bit on the last topic.
Why delete is useful.
So, you know, Database Lab and ZFS:
branching, thin cloning, and iteration.
Starting from the idea
of experimentation, it should start
from the same point in each
iteration.
You need to, you compare things,
you need to compare apples versus
apples, right?
Always apples to apples.
So each iteration must start from
the same state of database.
And this is difficult if table
is huge, database is huge, and
you want to test it on big hardware.
In case of single-threaded behavior,
if you don't care about
checkpoint or vacuuming and so
on, usually thin clones that the
Database Lab Engine provides are great
if you just study plans and
so on.
But in this case, we need to consider
the same file system, and we need
to understand checkpointer behavior,
WAL, everything, vacuum.
So we need a dedicated clone when
we perform this experiment
for checkpoint tuning and to study
this behavior.
In this case, for this tuning,
I found it super cool that this kind
of workload gives us a very good,
interesting tool.
If we perform a massive delete but
roll it back instead of committing,
the physical layout remains the
same.
I mentioned it multiple times in
previous episodes and this is
a super cool observation.
You can just do BEGIN, DELETE,
ROLLBACK, massively.
This puts good pressure on your
WAL system, right?
The checkpointer too; it's cool.
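The stress-test pattern itself is tiny; the table name and predicate are illustrative:

```sql
-- massive delete that never commits: real WAL, checkpoint, and replication work,
-- but the data and the physical layout stay exactly the same
BEGIN;
DELETE FROM events WHERE created_at < now() - interval '1 year';
ROLLBACK;
```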
And it means that, of course, for
vacuum it won't do anything,
because the transaction is marked
as rolled back, so even if xmax
is already updated, the tuples
survive, they're still alive.
But Postgres generates a lot of
work for such workload.
It also puts pressure on the backup
system and the replication system,
and the checkpointer also has a lot
of work.
So it means you can stress test
many components just with
this kind of workload.
And I know we mentioned last week,
right, we had Sai as a guest.
For those who haven't watched
that episode, I suggest it; it was a
super interesting discussion about
logical replication with PeerDB
founder Sai.
So I mentioned that this is exactly
how I found that we can easily
reach and saturate single core
on WAL sender.
So you just delete millions of rows,
rollback.
It spams the WAL, and the WAL sender
hits 100% of a single vCPU,
becoming a bottleneck quite
quickly.
Fortunately, in production, as
I mentioned, I couldn't find such
situation, but in non-production
it's easy to reproduce.
So I found this very handy tool
to stress test Postgres and for
checkpoint tuning and so on.
That's why massive deletes are not only
bad but also good.
Useful.
But only when rolled back.
Yeah.
Isn't it funny and exciting that
the physical layout of the table
doesn't change?
You deleted and rolled back, but
it's the same.
This is
Michael: one of those things where
if you talk to Oracle guys,
you'll get them sweating by this
point because of a different implementation.
It's only because of Postgres'
specific MVCC implementation.
Because in Oracle they have the
undo log, it's like the opposite
trade-off.
So it's so interesting that the
implementation details are so...
It's good to know them in order
to work out how to test things.
Nikolay: Yeah, Postgres is expecting
you to perform rollbacks
often.
Michael: Yeah, It's good to make
use of it when it's advantageous.
Nikolay: Right.
So tuple remains in the same spot
in the page, doesn't shift,
unlike in other systems.
But still, xmax is updated, so
page becomes dirty and must be
flushed to
Michael: disk.
Yep, makes sense.
Nikolay: Yeah, I think we explored
it quite well enough, quite
deep.
Takeovers, takeaways.
First, perform checkpoint tuning;
check out our episode about
that.
And prohibit massive deletes;
perform deletes in batches,
roughly not longer than 1 or 2
seconds.
Unless you don't have users who
deal with your database a lot
and maybe you can go with like
30 seconds in this case.
Michael: Well, at least consider
batching and try yeah.
I just wanted to
Nikolay: have some rule and for
regular OLTP 1 second is a good
rule.
Michael: Yeah great I like it.
Yeah, I think it's a healthy
recommendation as well.
Even if you wouldn't go down,
or wouldn't have degradation
that users would notice, it's just an
unhealthy thing to do, to go
to excess.
It's kind of like extremes: if
you can keep things more level,
you'll probably see better results.
Yeah, and
Nikolay: a couple more takeaways.
Vacuuming: don't forget about regular
vacuum, which affects the index-only
scans we discussed.
Maybe you want frequent and quite
aggressive vacuum, to
go faster even if more I/O is consumed.
And also partitioning.
Maybe your massive delete is just
a truncate; in this case, all
those negative effects can go away.
Yeah.
Yeah.
Love it.
Michael: Thanks so much, Nikolay.
Nikolay: Thank you, Michael.