autovacuum
Nikolay: Everyone who is running
Postgres has autovacuum, and
most of people heard about autovacuum
and vacuuming and bloat
and this garbage collection.
But a lot of databases grow with
problems being unnoticed and
they start tuning autovacuum
too late.
So this topic today is brought
by you Michael, right?
Hi Michael.
Michael: Hi Nik.
Nikolay: And we definitely had
this topic discussed in the past
in terms of like vacuum and bloat,
but we never discussed the
vacuum itself and how to tune it
properly and what are the best
practices.
And yeah, I'm glad that you brought
it.
Let's talk about this.
Michael: Sounds good.
Yeah.
I'm looking forward to hearing
your views on this kind of from
a seeing lots and lots of different
clients perspective because
I get the impression dealing with
people less frequently that
Often it's left really late before
people start tuning or to
vacuum and when I say late a startup
will grow to relatively
significant size before touching
those parameters And I often
think they could have avoided quite
a few headaches and got into
issues slower, if that makes sense,
by just changing a few settings
much earlier.
So yeah, I was keen to get your
thoughts on If you see the same
and also like, when would you advise
people like startup life's
busy, when would you advise people
looking at this kind of thing?
Nikolay: Yeah.
First of all, every single client
which comes to us, we talk
about autovacuum tuning on any
platform.
Doesn't matter.
Self-managed, any managed service.
We always touch this topic and
always like in hundred percent
of cases there is something that
should be done there.
Which means your point is absolutely
right.
It should be done earlier because
when people come to us for
help it's already, they experience
already some problems, some
troubles, right, and this is a
reactive approach.
Michael: And just to clarify, when
you say 100 percent, do you
mean some percentage have done
loads of tuning and still have
issues and some percentage have
done nothing at all.
What's the kind of distribution
like?
Nikolay: It's hard to say.
Actually I checked how many clusters
we observed lately.
It was like 140 or something where
we used thorough like comprehensive
health analysis and tuning and
recommendations and work closely
and so on.
So I'd say 0% have an ideal picture
in terms of how autovacuum
is tuned.
Most customers running on managed
services, especially RDS, have,
let me say it, half-ass tuned
autovacuum.
This is new term we can use here
because this is like, it amazes
me why it's so.
We can dive into details soon.
And only few, like literally that's
a several, I saw they have
very well tuned autovacuum and
we just shift focus to why it's
still not enough and what we should
do with specific pieces of
workload.
Michael: Sure, makes sense.
Nikolay: Local tuning already,
what to do about it.
And in most cases there we also
involve, we focus on specific
workload, but also we go, of course,
beyond just autovacuum
tuning, because usually this is
like some pathological workload
or some, there are some problems
which are not conflicting, how
to say, like they are blocking
autovacuum work, right?
Yes.
Before we dive into details, maybe
let's just explain to wider
audience what autovacuum is, because
I think many still misunderstand
it.
Recently I talked to, you know,
like our focus is fast growing
startups and I talked to technical
founders quite often.
These guys are super smart.
Technically they understand, they
definitely understand throughput,
latency, all the numbers and so
on.
But I see they just didn't have
time to dive into autovacuum,
what it is.
They heard bloat issue, MVCC, Postgres
is very widely criticized,
right?
But what autovacuum is and what to
do about it, what they need
to have, it's like, it's always
a good topic to discuss and dive
into.
So my simple explanation, autovacuum
is just garbage collection,
first of all.
It has a few more jobs, but the
main job, original job, is garbage
collection.
When you have updates, deletes,
or failed inserts, by the way,
the third part is usually forgotten,
but failed inserts, all
of them produce successful updates,
successful deletes, and failed
inserts, rolled back inserts.
They all produce dead row versions
called tuples, right?
Michael: Probably also rolled back
updates.
Nikolay: Oh yes, you're right.
Yeah.
Successful deletes, rolled back
inserts, or any updates.
Yeah.
Besides HOT updates, which are
nuances, right?
Even HOT
Michael: updates, yeah.
Nikolay: If we want to be thorough,
like we are destroying my
intention to provide simple explanation.
Let's keep it short.
So it's first of all garbage collections.
So if, especially for guys who
run Postgres on machines with a
lot of cores, like typical like
Intel machine 96 cores or 48
or like AMD hundreds of cores,
right?
Michael: They- We have very, we
have a very different definition
of what typical is for-
Nikolay: Well, typical big, large
database.
Michael: Sure, sure.
Nikolay: So it's, it amazes me,
like I, I saw, I told you, I
saw Postgres clusters achieving
10, 20 terabytes with default
autovacuum settings.
Somehow surviving, already having
5 replicas.
It's insane.
Michael: This is what I meant in
terms of like, should people
be doing it earlier or is it okay?
There's a certain argument.
Nikolay: It's not
Michael: okay, no.
Yeah, I think I agree, but as you
say, they've somehow got by
even to that large extent.
But I've also seen smaller systems
really struggling because
autovacuum hasn't been tuned.
So far fewer, like maybe not even
a terabyte, but struggling
because of a certain workload.
So yeah, it's not just size, right?
Nikolay: It can be a table of 2
gigabytes in size, but it's mostly
bloated, like 99.9%.
I saw it many times, especially
with queue-like workloads.
And then it can suffer, small database,
but it suffers from the
fact that autovacuum is not tuned
and maybe something is blocking
it.
Garbage collection job is blocking
it.
So we need to combine both things
here.
But my point is, autovacuum is
just garbage collection.
If you tune Postgres, or maybe
your provider tuned, for example,
to have bigger shared_buffers,
bigger cache, right, and a work_mem
for sort and join operations and
so on.
If it's already tuned and max_connections
has increased, maybe
a pooler is in front of it, like
you tuned it to handle workload.
Why the heck garbage collection
remains untuned?
It should be tuned together with
everything else, right?
So it should be.
But we see it's often lagging in
tuning.
And it can be a local workload
on small database already suffering,
or it can be a big cluster and
garbage collection mostly default.
autovacuum is mostly default.
But that being said, we should
mention that there are a few more
jobs that the vacuum has.
Maintenance statistics.
Yeah?
Do you want to
Michael: stick to the vacuum ones
first?
Because I think the name autovacuum
is quite a good name in
that it's all of the jobs that
vacuum does but automated But
with 1 addition, which is there
is also an auto-analyze
feature Right, which is what you're
just about to talk about
in terms of statistics, but I feel
like the other vacuum jobs
are worth—do you think of it like
as garbage collection, statistics,
and then the other vacuum jobs?
Nikolay: — It's just easier to
memorize.
Statistics also should be maintained
up to date.
If it's lagging, it may affect
performance, might affect performance.
But you're right, there are also
a couple of more jobs.
Transaction ID wraparound.
If you have a lot of transactions
per day, we know there are
sometimes some like 1000000000
per day.
It's a lot.
In this case, churn is high and
int4 is not enough.
Half of int4 is not enough.
transaction ID wraparound prevention
should be very frequent
in each tuple, in each row version.
And also there is Another very
important job, especially for
performance, it's maintaining visibility
maps.
It's a bitmap with 2 bits for each
page.
Page is 8 kibibytes, right?
And 2 bits, is the whole page fully
visible, All visible?
Means all tuples are visible to
all transactions.
And second is all frozen.
Yeah, just previous topic.
Transaction ID wraparound prevention.
So not to visit this page again
if we know that all the tuples
are in the past, they're already
marked frozen, so they're definitely
in the past, even if their xmin,
xmax look like they are in the
future.
This is freezing.
It marks like...
All frozen bit means like it's
a whole page in the past.
The next vacuum will just skip
it.
That's optimization as well, right?
So for more frequent vacuuming,
not to do the same job once again,
right?
Michael: I want to come back to
that, but before we do, The all
visible is really important for
index only scan performance,
right?
So if an index only scan can know
that every single tuple on
that page is visible to all transactions
then it can serve the
result it finds in the index because
it knows it can't have changed.
So that's a really neat optimization
without having to do what
it calls a heap fetch.
So it's like fetching the data
from the table instead of from
the index.
So if you ever if you see an index
only scan that ends up with
a lot of heap fetches that's because
Postgres couldn't...
Yeah it might it might be that
information was in the index but
because it Postgres didn't know
for sure that it hadn't changed
it still went and got it from the
heap.
So it might be the exact same information
that it already had
from the index, but there's this
inefficiency if vacuum hasn't
run recently enough.
And it only takes, because it's
that bit 0 or 1, it only takes
1 tuple on the page to have changed
for any of the tuples on
that page to lose out on index
only scans.
Nikolay: I think if information
is not in the index, you mean
if we select columns which are
not part of index definition,
it will be index scan immediately,
not index only scan.
Michael: Yeah, of course.
Nikolay: And index only scan already
means that information is
present.
Information means values, but what
is never present in indexes
is visibility information.
This xmin, xmax.
So For that we need heap, but if
visibility map bit says for
this page it's all visible, this
can be skipped.
So if the fetch's number goes lower,
ideally it should be 0,
then it's a perfect situation,
it's a true index-only scan.
If it's high, it approaches the
performance of index scan, which
is like always consulting heap.
Yeah, naming should be better here,
but my position on naming,
I'm always like, index scan could
be renamed to something which
would be explicitly mentioning
that heap is always consulted.
Anyway, this is a different topic.
I think we covered quite well what
autovacuum does, and again,
the main job is just garbage collection.
Also, I wanted to mention, autovacuum
is improving a lot from
release to release.
And it's enabled by default, of
course.
It should be disabled only by experts.
For example, PgQ, Skype engineers
decided to disable autovacuum
for queue partitions.
And it's also disabled in PgQue tool,
which is built on top of
PgQ.
Name is awful, right?
So anyway, there autovacuum is
disabled, but normally it should
be never disabled, and it's enabled
by default.
And it's done, drop, and so on.
So it's just a garbage collection
plus additional features and
it's improving from release to
release and Funny fact I recently
purchased some Chinese modern auto
vacuum for home with washing
feature.
It's very similar, like advancement,
like I had a few years ago,
I had a couple of them and they
were now there, like so many
features, so it became much smarter.
And they have multiple functions.
So I enjoy how it works really.
But it requires some tuning as
well.
Michael: Yeah, I was gonna say,
Bruce brought this up in our
episode with him, didn't he?
He said people complain about vacuum
and bloat with Postgres
and MVCC, but he's saying, which
version are you complaining
about?
Because it is continually getting
better.
But I would say that some of the
defaults are still very conservative,
or maybe make sense at smaller
table sizes.
And they do have scale factors
built in, but they make less and
less sense I think as tables get
bigger, some of them.
Nikolay: So my position on defaults,
they are all BS defaults.
They are like for, I call it Postgres,
kitchen kettle Postgres,
which should have, I don't know,
like 1 gigabyte memory and some
Raspberry Pi or something, a smart
pod or something.
This is it.
For them it's okay, but for any
even like small cluster with
8 cores, 16 gig of memory, it's
not enough at all.
And the settings can be split to
3 categories for simplicity.
Number of workers, which requires
restart.
The only thing which requires restarted
number is number of workers,
which is 3 by default.
If you have 8 cores, 3 maybe is
okay.
It's okay.
If you have 32, 64, 48, 96 cores,
3 is not enough, but we see
it all the time.
People come to us, 96 cores, 3
workers.
This is 1 thing.
Michael: You say it's not enough.
I think you're right in the sense
it's not a sensible default.
Like why not give it more in case
it needs it?
Nikolay: Right.
Michael: But as you said, sometimes
people come to you and they've
only had 3 workers the whole time
and somehow it's not fallen
over.
Nikolay: It's grinding through
challenges.
Yeah.
Like super bloated tables, super
bloated indexes, and it's a
lot of dirt to clean, which is
great for us.
We have opportunity to help them.
Instead of framing, oh, how bad
it is, I actually have it, and
it's actually our product strategy.
We frame it, oh, great, we can
help you really quickly, really
quickly.
Either way, we have low-hanging
fruit.
Let's go.
With our tooling and methodologies,
it's super simple.
This low-hanging fruit, let's go,
and your database will fill
and breathe quite soon.
Much better.
You can even consider downgrading
or postponing upgrade for a
couple of years after this.
Yeah.
And this is great.
Yeah, this is 1 thing.
And this is super important thing,
we just discussed.
This is how garbage collection
works and updates, deletes, and
sometimes inserts.
They do just part of the job.
Right?
And these poor autovacuum workers
need to do the rest of the
job.
Michael: The tidying up afterwards,
Nikolay: yeah.
You have 1,000 max_connections,
meaning you have up to 1,000
back-end workers which produce
all those dead tuples.
And only 3 that should catch up
with them.
It's nonsense, right?
It feels like it's nonsense.
So it should be tuned.
And simple rule, like we say, like
actually let's consider like
25% of vCPU count, But we should
do it very properly, thinking
about memory management and so
on.
I know there is opposition to this
view, but this is my position.
It's a simple case, just let's
give more workers, especially
if you have partitioning.
Because, fun fact, a worker processing
a table, if it's a huge
table, it will be processing it
sequentially, your table and
its indexes, and that's it.
If it's a partitioned table, it
can be distributed.
Each partition can be, you can
parallelize this, right?
So it makes sense to have more
workers if you have more partitions.
And it's great to have partitions
and more workers together.
This is the best situation.
And 2 more categories.
1 defines frequency of revisiting.
Frequency, how frequent workers
come to do stuff.
And another defines how fast they
can throughput, basically,
or throttling.
I think
Michael: throttling's the better
way of thinking about it, yeah.
Nikolay: Throttling, yeah, limiting.
It's cost_limit and cost_delay.
And how often they visit defines
by scale factors and threshold,
but I'm okay to think about only
scale factors, actually.
At first glance, it's enough.
So we have 3 scale factors today,
right?
Or maybe there is also in Postgres
18, there is
autovacuum_vacuum_max_threshold.
This is a new stuff in Postgres
18, which changes the picture.
So usually we talked about only
scale factor and threshold.
Scale factor by default, 3 scale
factors, but the basic 1 is
autovacuum_vacuum_scale_factor,
right?
And it is either 10 or 20 percent.
So I think 20 percent, right?
Michael: Yeah, so I suppose because
the threshold by default
is 0.2, right, so 20 percent.
Let's say you've got 1000000000
rows in the table.
It doesn't wait to get to 200 million,
it starts at 100 million.
Nikolay: Yeah, so we don't allow
more than 100 million, which
is a huge number.
Maybe huge, maybe not.
100 million dead tuples in the
table.
Michael: It's quite a lot though,
yeah.
Nikolay: Yeah, so the basic is
the vacuum scale factor.
This is the first thing.
And it's 20% by default, which
means you need to accumulate 20%
of dead tuples in the table before
it starts garbage collection.
When I just say this, I had recent
experiences.
When I said this to some startup
CTO who which is growing and
so on he like, oh We don't have
garbage collection before 20%
of dirt accumulated This is
Michael: Of course when when things
are small, it's not a big
deal.
But when things get big, this is
just a huge number.
And it's not just...
Nikolay: Yeah, it's...
Let's say you have like almost
terabyte size table in terms of
size, billion rows unpartitioned,
which is common, I see this.
20% means 200 million rows, dead
tuples, right?
Which is by the way above that
threshold introduced in Postgres
18.
So, it's like, wow.
Michael: But it's not just storage,
right?
It's also polluting caches and
indexes.
It's everything.
Nikolay: Oh yes, oh yes.
This is big point.
And actually, let me add 1 important
notice here.
Usually, it's not a big deal for
smaller tables.
But in some cases, we see local...
So if, for example, 1 row updated
many times, thousands of times.
For 1 logical row, you have many
versions.
And globally, like table is big,
but for this not lucky row,
You have so many versions and they
can be, might be scattered
among many pages.
So, they are sitting there as data
tuples and for this specific
query select where id equals something.
This means that all of them need
to be fetched to be checked
in terms of visibility for this
transaction and it degrades for
a particular ID, like local degradation.
Lots of queries are fine, but this
unlucky because it had a lot
of updates or something, right?
It became really degraded.
And it's like, what's happening
here?
This is where p99s or something
are good to have because average
will hide it, average latency and
so on.
And this means you should be vacuuming
more frequently, even
on smaller tables.
So this local degradation due to
dead tuple for particular logical
rows, dead tuple count is high.
So it's happening, and it's happening
silently.
That's the problem.
And tooling, which is like lacking
p99s and so on, is not helping
us to understand.
That's why still logging of slow
queries helps, and auto_explain
helps to catch these like edge
cases and start analyzing why
it's happening.
The next day you select the same
row, maybe it's already vacuumed
and degradation has gone.
But in that moment, if you go back
in time and select it, you
see so many buffers.
This particular query, it's very
like it's primary key lookup,
but too many buffers somehow.
That's why.
So it's important to feel it because
it's really like subtle
problem.
It's hard to catch it if you don't
know where to look at.
Yeah.
And you're right.
So scale factors.
And since Postgres 13, there is
also autovacuum_vacuum_insert_scale_factor,
and there is also
scale factor for analyzing.
For insert, it's really great.
I remember Darafei, the guy who,
from Belarus, I think, he introduced
it.
It was so obvious, why didn't we
have it?
We insert a lot with double table
in size, but vacuum doesn't
come.
And we need vacuum not to clean
up the tuples after successful
insert, but to rebuild visibility
maps, first of all.
And yeah, it is great because if
you have select count or something,
you want index-only scan to be
with low heap fetches, as you
mentioned.
So anyway, there are these scale
factors, And it's enough at
first glance to look only at them.
Yes, you can go deeper and think
about thresholds and how additional
tuning, but scale factors is already
enough.
You want them to be really low
in OLTP workloads, just 1%.
Sometimes 2%, sometimes 1%, I see
people choose, but 1% is is
simple rule.
Set it to 1 percent for all scale
factors and benefit.
Michael: And to give people an
idea of how drastic a change that
is, these scale factors are 20
percent for the vacuum threshold,
10 percent for the analyze threshold,
and I think 20% also for
the insert scale factor.
Nikolay: Yes.
So
Michael: dropping from 20 to 1
might feel like a huge change,
it is quite a huge change, But
you're not repeating the same
work, right?
Like it's not, there's a tiny bit
of repetition.
Yeah, it's, this is a slight difference
to analyze, isn't it,
in that if you run vacuum twice
in quick succession on the same
table, It's not doing all of the
same work again.
It's done a lot of the cleanup.
It's like tidying a room.
Nikolay: Some work
Michael: can
Nikolay: be done again.
For example, it's very nuanced
here.
It's super interesting.
Like you mentioned, it's not only
about data storage.
It's polluting memory, WAL.
WAL means replication, WAL backups.
It's polluting everything.
And memory is super, memory is
expensive these days, right?
So we should optimize, not about
data storage, we should think
about how many pages we need to
keep in this working set.
But the ideal situation when you
raise the workers, make a vacuum
to visit us frequently with lowering
scale factors, all of them
to say 1% is very rough rule here.
But also partition quite well,
because in this case some partitions,
especially if it's historic data,
they will be processed with
marked all visible, all frozen,
and very rarely those pages will
be touched.
So vacuuming will be like a breeze,
like skipping, right?
Unlike if it's a huge, messy heap
with terabyte in size or so,
and any page can be touched at
any given time.
It means it will be need to process
again.
Of course it means that if you
have infrequent vacuuming of a
huge table, of course you're doing
work less, because maybe this
page was visited multiple times
before it was vacuumed.
And if you start vacuuming frequently
and between 2 processing
these pages are revisited by some
rates, then it needs to be
vacuumed again.
So it's multiple processing of
the same page versus a single
processing.
So we cannot say it will be, but
it's highly optimized.
Vacuum is highly optimized.
But if you have a huge table, not
partition, maybe it will be
like frequency, high frequency
vacuuming will cost you some extra,
but it's worth it usually.
So there are nuances here and trade-offs,
of course.
But usually in OLTP, we very strongly
recommend to lower down
scale factors so vacuuming happens
more frequently.
And the fact is that if you're
on managed service, I see it's
quite common.
This is a big problem.
RDS included, for example.
They don't care about this somehow
and they keep defaults.
I would like to hear their reasoning
why they don't proactively
tune it for the customers.
And because I
Michael: mean, I mean, I mean,
I'm just makes, because they're
a scale factor, right?
They can make sense at any scale,
right?
Like it, it's fine when you're
small, These are inexpensive things
to run on small tables, and it
works when it's big.
There's no reason not to tune it
from the start for all customers.
You don't have to be careful about
which instance size.
Nikolay: And I don't complain,
because as I said, I enjoy when
I see the problem.
And I'm excited to help and have
low-hanging fruit and see
how health next day is much better.
But of course we need also automated
reindexing, we need the
pg_repack to be used and so on.
By the way, we mentioned that Postgres
19 will have...
Yeah.
Yeah, it will have...
Michael: REPACK and REPACK CONCURRENTLY.
Nikolay: That's awesome.
And we should somehow revisit this
topic, maybe inviting the
guy who made it, right?
Michael: Yeah, I think it was...
Antonín, right?
Yeah, Antonín was...
We had him on, didn't we, to talk
about pg_squeeze, which inspired
the work.
Nikolay: Remember in the very end
of the episode, I learned about
these plans to make VACUUM FULL
concurrently basically, which
is like became REPACK CONCURRENTLY.
I couldn't believe, I had like
doubts, is it like, but here we
go, like this is happening a year
later, it's planned to be released,
I hope it won't be reverted, it's
a super important feature to
have, it's interesting how out
of the box.
Michael: I have to admit when I,
yeah, I saw that REPACK, they
introduced effectively a rename
of VACUUM FULL to REPACK and that
was going to make 19 but it didn't
have a concurrently option
and I was like, oh, that's such
a shame but it makes sense.
Nikolay: Another year of waiting.
Michael: And then about a week
later, REPACK CONCURRENTLY gets
committed as well.
So yeah, that was great.
Nikolay: It's yeah, it was a mood
swing for me as well.
It was like, oh no.
Oh, yeah.
So anyway, back to scale factors.
This is probably 1 of the most
important to lower scale factors
for OLTP workloads and benefit
from more frequent vacuuming
and analyzing as well, right?
Michael: Just a question, is there
any stage that people are
at where you wouldn't recommend
this.
I was thinking when should people
be thinking about these things
but I think those 2 are such simple
or those 3 let's say are
such simple settings to change
you could change them from the
start and never think about it
again like
Nikolay: globally I
Michael: was thinking.
Nikolay: This is complex and this
is complex thing in terms of
we cannot talk about autovacuum
tuning completely without analyzing
what's blocking the work.
Michael: If anything.
Nikolay: Yes and that's why I'm
like we recently developed a
dashboard for xmin horizon analysis
because we we probably need
to revisit it once again in terms
of long transactions versus
pure xmin horizon blocker analysis
and autovacuum throughput
analysis and again like this I
cannot recommend enough Laurenz
Albe recent blog post about how
he monitors autovacuum.
Combining with what, how we monitor
autovacuum, I think it's
much like we leveled up monitoring
of autovacuum blockers basically,
right?
So xmin horizon, there are 5 reasons
why it can be blocked.
And if you keep those problems
unresolved for hours, or sometimes
dozens of hours, tuning autovacuum
to run more frequently, especially
if you reduce also autovacuum_naptime, how
long it sleeps between runs,
and raised workers.
And also, some problems might happen.
For example, it might start consuming
too much resources just
trying to do something.
It cannot do something.
And also, logging can be, you can
have observer effect.
If there is a, like it tries to
do something, it cannot, but
it reports and many workers report.
And if you, at extreme, you have
log_autovacuum_min_duration
set to 0, meaning all attempts
are logged.
In this case, if you have 25 workers,
for example, and they cannot
do, and they autovacuum_naptime 1 second,
it's a storm of logs.
This is 1 problem I can definitely
imagine.
Michael: You mean 1 millisecond?
Nikolay: No, by default autovacuum_naptime
is 30 seconds.
Michael: What am I thinking of
that's 2 milliseconds by default?
Nikolay: Oh, actually I was wrong.
I'm looking at it right now, it's
60, 1 minute.
autovacuum_naptime is by default 1 minute,
which is quite long in my opinion,
so we usually say let's choose
a few seconds, but I admit aggressive
tuning without resolution of xmin horizon
blockers might lead
to storm of logging and resource
utilization as well.
So this is what I can think of
from top of my head.
Michael: Yeah, so sorry, I was
thinking of cost, there's a cost
delay.
Nikolay: Right, this is the last
thing we haven't touched.
So as I said, There are like throughput
in terms of number of
workers, there is frequency, how
often vacuum and analyzing happens,
and also the remaining piece, big
piece is throttling.
You said we should say throttling,
right?
So 2 parameters.
And in Postgres, who is responsible
for naming?
So there is an attempt to save
on some defaults or something.
There is some parameters that have
relationships.
Inheritance, right?
So there are 2 parameters,
cost_delay and cost_limit.
There are vacuum_cost_delay,
vacuum_cost_limit, this pair.
And by default, I think cost_delay
is 0, means it's unthrottled.
If you run vacuum manually, it's
unthrottled.
It tries to go full speed, usually
bounded by I/O, disk.
Right?
And there is also autovacuum_vacuum_cost_delay,
autovacuum_vacuum_cost_limit, 2 parameters.
1 of which I think
autovacuum_vacuum_cost_limit is set to minus
1, which means it inherits from
vacuum_cost_limit, which is set
to 2 milliseconds, right?
It was 200 milliseconds and then
it was, or 20 milliseconds in
the past before Postgres.
Michael: Yeah, the delay is, I've
just looked it up:
autovacuum_vacuum_cost_delay is
Nikolay: 2 milliseconds.
Yeah, I confused everything, I
see.
So autovacuum_vacuum_cost_delay
is 2 milliseconds and it was
before Postgres 12 it was 30, so
it became 2 in Postgres 12 which
is great.
And autovacuum_vacuum_cost_limit
is minus 1 means that it inherits
from vacuum_cost_limit Which is
by default 200
Michael: Which is like an arbitrary
unit of like work right and
there's some various things that
make up, yeah.
Nikolay: Yeah, I your CPU cycles,
there is some math.
What I don't like about acquisition
of EDB, of SecondQuadrant
by EDB, they just killed the blog.
Blog was brilliant.
And they had a great blog post
about some math.
If you think only about the I/O part,
all defaults before Postgres
12 yielded to 8 MB/s of reads.
So with 120 milliseconds, it was
roughly 8 mb of reads.
So bumping 20 milliseconds to 2
milliseconds, I remember it was
first considered to raise cost
limit, but then decided to reduce
frequency of this
throttling, right?
Because it makes things smoother,
right?
So going down from 20 milliseconds
to 2 milliseconds and maintaining
the cost_limit 200, inherited from
vacuum_cost_limit.
I don't like this, just put straightforward
defaults.
It doesn't hurt.
Minus 1 makes life of humans much
harder, much harder, right?
Like it's, okay, anyway, maybe
I should just propose it.
Maybe it wasn't proposed.
Maybe it was proposed.
I don't know.
It's My battle over dozens of,
over decades already.
This is happening decades, this
suffering from memorizing what
minus 1 means.
We forgot actually memory management.
Fourth piece, right?
There is minus 1 there as well.
autovacuum_work_mem is minus 1
by default, which means you need
to go and consult and there is
big interesting trick happened
in Postgres 17, separate story.
So that's
Michael: maintenance_work_mem,
right?
Nikolay: Yeah, maintenance_work_mem.
If we have time, we can talk about
that as well and dive into
it a little bit.
So, minus 1 for
autovacuum_vacuum_cost_limit inherited 200,
this like arbitrary cost, not arbitrary
actually, but some cost
like 2 milliseconds.
Since Postgres 12, meaning in all
current versions, it means by
default that can just take reads
of 80 MB/s,
roughly, which is not enough in
modern systems.
Ah, and it's shared among all workers,
unless you start tuning
a table level, which makes things
much more complex because then
quotas with throttling becomes
local for this table, for this
worker which is processing specific
table.
I prefer not to go there.
We can have a separate episode
for table level analysis.
Tuning, table level tuning.
So big picture is like by default
it has only 80 MB/s
of reads and if you
have powerful, even EBS volume
which I don't know it's a nitro,
old nitro architecture so it
definitely NVMe, like it has
good throughput.
80 is tiny among all workers.
Again, this is garbage collection
for everything.
So it should have several hundred
megabytes per second at least.
But the good thing is, for example,
RDS fixes this, especially
on Aurora we see enormous quota.
Throttling there is quite good
in terms of like it has a lot
of power.
But in many other cases, we see
it needs attention and we need
to give more like, how to say,
work capacity because otherwise
it's too slow And that's not good.
Michael: It never catches up, if
that makes sense.
Nikolay: It's a kind of situation.
You see all the workers you allocated,
they're always busy.
This is number 1, how you can understand
that something is not
okay.
And second, we have a special report,
we have a query, I have
a blog post about that, and we
also have it in our monitoring,
like a queue or line of tables
which sit there and should be
already processed but not processed
yet.
And this, like, basically a wait
line for tables.
And I think Laurenz also has in
his blog post a slightly different
analysis.
But the idea is the same.
How bad is it, the picture?
It should be done already, but
not yet done.
This is how you also feel this
situation happening.
Michael: From the other side of
things, I saw a good post recently
by Jeremy Schneider about the risk
of completely removing throttling.
So he wrote a good post recently
about an issue they encountered
once.
I think they reduced the...
Nikolay: I haven't seen it.
Sometimes people put to Cron.
We haven't touched this specific
cases related to MultiXact
member situation when people have
this approaching and they need
to vacuum separately.
Switching of indexes because index
vacuuming because for that
problem you don't need, and this
is actually another job to clean
up those things.
You don't need indexes to be involved
and this slows down the
whole thing if you have many indexes.
So anyway, I wonder, I need to
look at that, but, and when you
do manual vacuuming it's unthrottled,
as I said, because
vacuum_cost_delay is 0.
So it's not checking at all.
Let's go.
Michael: But I think that's, isn't
that because every now and
again, you want to run a manual
vacuum as like an emergency procedure?
Like,
Nikolay: yeah, it makes sense.
I agree.
But if it's throttled and it's
unthrottled, Actually, this is
super interesting for me.
In the past, I had problems with
unthrottled vacuum, but it was
when we had some enterprise disks
and self-hosted Postgres and
our capacity was like 400 megabytes
per second.
I also am very curious right now
in Postgres 18 with AIO and
so on, like how a single worker
might behave.
Remember I told you I was super
impressed to see the pg_basebackup,
which is single threaded
and I expected 100 megabytes
of reads from it.
Suddenly it amazed me, approaching
like gigabyte per second of
reads in Postgres 18.
So maybe a vacuum also, I'm not
sure, like this I haven't visited.
So- Yeah,
Michael: I think it was me that
suggested it might be AIO related.
Nikolay: Yeah, but let me correct
myself.
So there is fourth piece memory,
super important.
And if we don't think about it,
it might bite badly, especially
if we raise workers.
So all our clients, we say, okay,
we raised workers, but we're
jumping from say Postgres 15 to
17, you must limit memory for
autovacuum in advance because
it's by default minus 1 means
it takes the value from maintenance_work_mem.
And maintenance_work_mem people
usually raise quite quickly thinking
they will have much faster index
creation.
So they put 2 gigabytes or so.
Now I think like 25 workers, 2
gigabytes is already a lot, right?
And the problem is like before
Postgres 17, you might not notice
it because autovacuum workers
suddenly limit themselves with
1 gigabyte and then suddenly jump
to no obvious limit anymore.
No 1 gigabyte limit anymore for
autovacuum workers.
And If you have 4 gigabytes, I
saw 8 gigabytes, like these things,
right?
And many workers, it's huge memory
consumption.
Michael: Or risk, like not necessarily,
but there's the risk
that they could take up that much,
right?
It's not pre-allocated or anything.
It's just that's the budget they
have.
Nikolay: It can influence and it
can increase risks of out of
memory.
—
Michael: Yeah, yeah, yeah.
— OOM killer.
Nikolay: —
Michael: So, but then you can set,
yeah, great.
But then you can set the autovacuum_work_mem.
I've forgotten the exact name of
it.
Nikolay: — Yeah, I found like half
a gigabyte, some math should
be done there, and we usually like,
we have it, like some formulas.
So usually half a gigabyte, even
maybe lower, and it will be
okay, right?
It's interesting topic, but yeah,
so usually for maintenance
work, again, I wish like this minus
1 is killing me, like decayed
already.
Like why we don't put explicit
value right away?
Why this like I need to go in some
maze and remember.
It took me many years to settle
in terms of these, like, relationship
between vacuum and autovacuum
settings.
And for non-expert who is not working
with Postgres settings
every day, it's impossible to memorize.
Right?
It's super tricky.
Michael: So you would have hard-coded
values for each even if
they start with the same value?
Nikolay: Doesn't hurt.
Yeah, so if we have maintenance_work_mem by default, I don't know,
what is it, like 128 megabytes
or so, like maybe more.
Just copy paste.
This is not the place where we
should save on copy pasting.
We shouldn't save like few bytes.
Minus 1 is 2 bytes, right?
128, it's okay, a few bytes more.
This is not the place where we
should save it, right?
This is this complexity, extra
complexity, which is like, okay,
I need to redirect this frustration
to proper place.
Michael: Yeah, by the way it's
64 megabytes default
maintenance_work_mem.
I didn't realize it was
Nikolay: that low.
Postgres, yeah, Postgres.
Yeah.
Michael: All right, awesome.
And actually, the last thing I
wanted to say, so as a little
summary, if people have less than
8 vCPUs, maybe leave, you can
probably leave the workers in place,
but still change the thresholds
and it's just those 3 thresholds
will get you like 90% of the
way there probably.
I think most of...
Reduce, you said down to 1% I think
that makes a lot of sense.
I've generally advised people to
like 5 or 2% as like a starting
point but 1% makes a lot of sense.
It's a bit more, but I think it
Nikolay: won't save you from large
tables It won't save you from
this problem of local degradation
for particular rows It's just
like maybe you need to think about
it additionally, but overall
it's like good, very rough tuning
for
Michael: all the teams.
Yeah, absolutely.
I think that'll get a lot of people
a lot further.
Most providers let you set that
globally.
I did come across 1 recently that
only lets you change these
things at a table level, at which
point my advice would be focus
on your important tables and just
change a few of those.
But that was only 1 provider.
Nikolay: Or just migrate.
Michael: Or migrate, yeah.
That small task of migrating database.
But yeah, so that sounds wise.
And then, yeah, memory tuning,
make sure it's not using
maintenance_work_mem if you've upped that.
Smart advice.
Nikolay: Yeah, there are other
parameters, but it's like already
deeper things.
Michael: Yeah, sounds good.
Nikolay: My simple rough tuning
list, raise number of workers.
This requires restart, unfortunately,
then make visits more frequent,
decreasing scale factors to 1%,
for example, roughly, and give
more power, unless it's already
done by provider, raising, for
example, raising cost_limit.
Oh yeah, cost_limit, right?
And default settings, as I said,
Postgres default settings, because
not providers default settings.
So it's roughly 80 MB/s for disk reads.
Think how much you want, like maybe
just raise it, maybe sometimes
4 or 5 times, depending on disk
capacity.
It's very important to understand
that.
And finally, autovacuum_work_mem,
set it explicitly, minus 1 sucks in all
senses.
It's just very hard to deal with.
I like
Michael: that.
I like setting it explicitly, even
if you plan to keep it the
same as your maintenance_work_mem
now.
Nikolay: Yes, like freeze it before
upgrading to 17 for example.
Or before touching number of autovacuum
workers.
Just make it explicit so to simplify
math analysis including
with AI so it won't jump additional
like mind hop.
Michael: And just avoiding that
future change where somebody
wants to increase maintenance_work_mem
and doesn't think
Nikolay: at
Michael: that moment about
autovacuuming.
Nikolay: Because these are 2 separate
things.
They operational, they are very
separate.
Indexing and vacuuming, autovacuuming.
Yeah, and that's it.
These 4 areas for rough tuning,
it's enough.
It should become bloat situation
much better, but there are also
blockers.
If You can tune a lot, tune everything
perfectly, but if you
have some xmin horizon blocker,
and we have episode about xmin horizon,
then it won't help.
Bloat will accumulate anyway.
Michael: Yeah.
Great.
All right.
Nikolay: Good.
Yeah.
Michael: Nice one, Nik.
Thanks so much.
Nikolay: Yeah, I hope it was helpful.
See you next time.
Michael: See you next time.