Nikolay: Everyone who is running
Postgres has autovacuum, and

most of people heard about autovacuum
and vacuuming and bloat

and this garbage collection.

But a lot of databases grow with
problems being unnoticed and

they start tuning autovacuum
too late.

So this topic today is brought
by you Michael, right?

Hi Michael.

Michael: Hi Nik.

Nikolay: And we definitely had
this topic discussed in the past

in terms of like vacuum and bloat,
but we never discussed the

vacuum itself and how to tune it
properly and what are the best

practices.

And yeah, I'm glad that you brought
it.

Let's talk about this.

Michael: Sounds good.

Yeah.

I'm looking forward to hearing
your views on this kind of from

a seeing lots and lots of different
clients perspective because

I get the impression dealing with
people less frequently that

Often it's left really late before
people start tuning or to

vacuum and when I say late a startup
will grow to relatively

significant size before touching
those parameters And I often

think they could have avoided quite
a few headaches and got into

issues slower, if that makes sense,
by just changing a few settings

much earlier.

So yeah, I was keen to get your
thoughts on If you see the same

and also like, when would you advise
people like startup life's

busy, when would you advise people
looking at this kind of thing?

Nikolay: Yeah.

First of all, every single client
which comes to us, we talk

about autovacuum tuning on any
platform.

Doesn't matter.

Self-managed, any managed service.

We always touch this topic and
always like in hundred percent

of cases there is something that
should be done there.

Which means your point is absolutely
right.

It should be done earlier because
when people come to us for

help it's already, they experience
already some problems, some

troubles, right, and this is a
reactive approach.

Michael: And just to clarify, when
you say 100 percent, do you

mean some percentage have done
loads of tuning and still have

issues and some percentage have
done nothing at all.

What's the kind of distribution
like?

Nikolay: It's hard to say.

Actually I checked how many clusters
we observed lately.

It was like 140 or something where
we used thorough like comprehensive

health analysis and tuning and
recommendations and work closely

and so on.

So I'd say 0% have an ideal picture
in terms of how autovacuum

is tuned.

Most customers running on managed
services, especially RDS, have,

let me say it, half-ass tuned
autovacuum.

This is new term we can use here
because this is like, it amazes

me why it's so.

We can dive into details soon.

And only few, like literally that's
a several, I saw they have

very well tuned autovacuum and
we just shift focus to why it's

still not enough and what we should
do with specific pieces of

workload.

Michael: Sure, makes sense.

Nikolay: Local tuning already,
what to do about it.

And in most cases there we also
involve, we focus on specific

workload, but also we go, of course,
beyond just autovacuum

tuning, because usually this is
like some pathological workload

or some, there are some problems
which are not conflicting, how

to say, like they are blocking
autovacuum work, right?

Yes.

Before we dive into details, maybe
let's just explain to wider

audience what autovacuum is, because
I think many still misunderstand

it.

Recently I talked to, you know,
like our focus is fast growing

startups and I talked to technical
founders quite often.

These guys are super smart.

Technically they understand, they
definitely understand throughput,

latency, all the numbers and so
on.

But I see they just didn't have
time to dive into autovacuum,

what it is.

They heard bloat issue, MVCC, Postgres
is very widely criticized,

right?

But what autovacuum is and what to
do about it, what they need

to have, it's like, it's always
a good topic to discuss and dive

into.

So my simple explanation, autovacuum
is just garbage collection,

first of all.

It has a few more jobs, but the
main job, original job, is garbage

collection.

When you have updates, deletes,
or failed inserts, by the way,

the third part is usually forgotten,
but failed inserts, all

of them produce successful updates,
successful deletes, and failed

inserts, rolled back inserts.

They all produce dead row versions
called tuples, right?

Michael: Probably also rolled back
updates.

Nikolay: Oh yes, you're right.

Yeah.

Successful deletes, rolled back
inserts, or any updates.

Yeah.

Besides HOT updates, which are
nuances, right?

Even HOT

Michael: updates, yeah.

Nikolay: If we want to be thorough,
like we are destroying my

intention to provide simple explanation.

Let's keep it short.

So it's first of all garbage collections.

So if, especially for guys who
run Postgres on machines with a

lot of cores, like typical like
Intel machine 96 cores or 48

or like AMD hundreds of cores,
right?

Michael: They- We have very, we
have a very different definition

of what typical is for-

Nikolay: Well, typical big, large
database.

Michael: Sure, sure.

Nikolay: So it's, it amazes me,
like I, I saw, I told you, I

saw Postgres clusters achieving
10, 20 terabytes with default

autovacuum settings.

Somehow surviving, already having
5 replicas.

It's insane.

Michael: This is what I meant in
terms of like, should people

be doing it earlier or is it okay?

There's a certain argument.

Nikolay: It's not

Michael: okay, no.

Yeah, I think I agree, but as you
say, they've somehow got by

even to that large extent.

But I've also seen smaller systems
really struggling because

autovacuum hasn't been tuned.

So far fewer, like maybe not even
a terabyte, but struggling

because of a certain workload.

So yeah, it's not just size, right?

Nikolay: It can be a table of 2
gigabytes in size, but it's mostly

bloated, like 99.9%.

I saw it many times, especially
with queue-like workloads.

And then it can suffer, small database,
but it suffers from the

fact that autovacuum is not tuned
and maybe something is blocking

it.

Garbage collection job is blocking
it.

So we need to combine both things
here.

But my point is, autovacuum is
just garbage collection.

If you tune Postgres, or maybe
your provider tuned, for example,

to have bigger shared_buffers,
bigger cache, right, and a work_mem

for sort and join operations and
so on.

If it's already tuned and max_connections
has increased, maybe

a pooler is in front of it, like
you tuned it to handle workload.

Why the heck garbage collection
remains untuned?

It should be tuned together with
everything else, right?

So it should be.

But we see it's often lagging in
tuning.

And it can be a local workload
on small database already suffering,

or it can be a big cluster and
garbage collection mostly default.

autovacuum is mostly default.

But that being said, we should
mention that there are a few more

jobs that the vacuum has.

Maintenance statistics.

Yeah?

Do you want to

Michael: stick to the vacuum ones
first?

Because I think the name autovacuum
is quite a good name in

that it's all of the jobs that
vacuum does but automated But

with 1 addition, which is there
is also an auto-analyze

feature Right, which is what you're
just about to talk about

in terms of statistics, but I feel
like the other vacuum jobs

are worth—do you think of it like
as garbage collection, statistics,

and then the other vacuum jobs?

Nikolay: — It's just easier to
memorize.

Statistics also should be maintained
up to date.

If it's lagging, it may affect
performance, might affect performance.

But you're right, there are also
a couple of more jobs.

Transaction ID wraparound.

If you have a lot of transactions
per day, we know there are

sometimes some like 1000000000
per day.

It's a lot.

In this case, churn is high and
int4 is not enough.

Half of int4 is not enough.

transaction ID wraparound prevention
should be very frequent

in each tuple, in each row version.

And also there is Another very
important job, especially for

performance, it's maintaining visibility
maps.

It's a bitmap with 2 bits for each
page.

Page is 8 kibibytes, right?

And 2 bits, is the whole page fully
visible, All visible?

Means all tuples are visible to
all transactions.

And second is all frozen.

Yeah, just previous topic.

Transaction ID wraparound prevention.

So not to visit this page again
if we know that all the tuples

are in the past, they're already
marked frozen, so they're definitely

in the past, even if their xmin,
xmax look like they are in the

future.

This is freezing.

It marks like...

All frozen bit means like it's
a whole page in the past.

The next vacuum will just skip
it.

That's optimization as well, right?

So for more frequent vacuuming,
not to do the same job once again,

right?

Michael: I want to come back to
that, but before we do, The all

visible is really important for
index only scan performance,

right?

So if an index only scan can know
that every single tuple on

that page is visible to all transactions
then it can serve the

result it finds in the index because
it knows it can't have changed.

So that's a really neat optimization
without having to do what

it calls a heap fetch.

So it's like fetching the data
from the table instead of from

the index.

So if you ever if you see an index
only scan that ends up with

a lot of heap fetches that's because
Postgres couldn't...

Yeah it might it might be that
information was in the index but

because it Postgres didn't know
for sure that it hadn't changed

it still went and got it from the
heap.

So it might be the exact same information
that it already had

from the index, but there's this
inefficiency if vacuum hasn't

run recently enough.

And it only takes, because it's
that bit 0 or 1, it only takes

1 tuple on the page to have changed
for any of the tuples on

that page to lose out on index
only scans.

Nikolay: I think if information
is not in the index, you mean

if we select columns which are
not part of index definition,

it will be index scan immediately,
not index only scan.

Michael: Yeah, of course.

Nikolay: And index only scan already
means that information is

present.

Information means values, but what
is never present in indexes

is visibility information.

This xmin, xmax.

So For that we need heap, but if
visibility map bit says for

this page it's all visible, this
can be skipped.

So if the fetch's number goes lower,
ideally it should be 0,

then it's a perfect situation,
it's a true index-only scan.

If it's high, it approaches the
performance of index scan, which

is like always consulting heap.

Yeah, naming should be better here,
but my position on naming,

I'm always like, index scan could
be renamed to something which

would be explicitly mentioning
that heap is always consulted.

Anyway, this is a different topic.

I think we covered quite well what
autovacuum does, and again,

the main job is just garbage collection.

Also, I wanted to mention, autovacuum
is improving a lot from

release to release.

And it's enabled by default, of
course.

It should be disabled only by experts.

For example, PgQ, Skype engineers
decided to disable autovacuum

for queue partitions.

And it's also disabled in PgQue tool,
which is built on top of

PgQ.

Name is awful, right?

So anyway, there autovacuum is
disabled, but normally it should

be never disabled, and it's enabled
by default.

And it's done, drop, and so on.

So it's just a garbage collection
plus additional features and

it's improving from release to
release and Funny fact I recently

purchased some Chinese modern auto
vacuum for home with washing

feature.

It's very similar, like advancement,
like I had a few years ago,

I had a couple of them and they
were now there, like so many

features, so it became much smarter.

And they have multiple functions.

So I enjoy how it works really.

But it requires some tuning as
well.

Michael: Yeah, I was gonna say,
Bruce brought this up in our

episode with him, didn't he?

He said people complain about vacuum
and bloat with Postgres

and MVCC, but he's saying, which
version are you complaining

about?

Because it is continually getting
better.

But I would say that some of the
defaults are still very conservative,

or maybe make sense at smaller
table sizes.

And they do have scale factors
built in, but they make less and

less sense I think as tables get
bigger, some of them.

Nikolay: So my position on defaults,
they are all BS defaults.

They are like for, I call it Postgres,
kitchen kettle Postgres,

which should have, I don't know,
like 1 gigabyte memory and some

Raspberry Pi or something, a smart
pod or something.

This is it.

For them it's okay, but for any
even like small cluster with

8 cores, 16 gig of memory, it's
not enough at all.

And the settings can be split to
3 categories for simplicity.

Number of workers, which requires
restart.

The only thing which requires restarted
number is number of workers,

which is 3 by default.

If you have 8 cores, 3 maybe is
okay.

It's okay.

If you have 32, 64, 48, 96 cores,
3 is not enough, but we see

it all the time.

People come to us, 96 cores, 3
workers.

This is 1 thing.

Michael: You say it's not enough.

I think you're right in the sense
it's not a sensible default.

Like why not give it more in case
it needs it?

Nikolay: Right.

Michael: But as you said, sometimes
people come to you and they've

only had 3 workers the whole time
and somehow it's not fallen

over.

Nikolay: It's grinding through
challenges.

Yeah.

Like super bloated tables, super
bloated indexes, and it's a

lot of dirt to clean, which is
great for us.

We have opportunity to help them.

Instead of framing, oh, how bad
it is, I actually have it, and

it's actually our product strategy.

We frame it, oh, great, we can
help you really quickly, really

quickly.

Either way, we have low-hanging
fruit.

Let's go.

With our tooling and methodologies,
it's super simple.

This low-hanging fruit, let's go,
and your database will fill

and breathe quite soon.

Much better.

You can even consider downgrading
or postponing upgrade for a

couple of years after this.

Yeah.

And this is great.

Yeah, this is 1 thing.

And this is super important thing,
we just discussed.

This is how garbage collection
works and updates, deletes, and

sometimes inserts.

They do just part of the job.

Right?

And these poor autovacuum workers
need to do the rest of the

job.

Michael: The tidying up afterwards,

Nikolay: yeah.

You have 1,000 max_connections,
meaning you have up to 1,000

back-end workers which produce
all those dead tuples.

And only 3 that should catch up
with them.

It's nonsense, right?

It feels like it's nonsense.

So it should be tuned.

And simple rule, like we say, like
actually let's consider like

25% of vCPU count, But we should
do it very properly, thinking

about memory management and so
on.

I know there is opposition to this
view, but this is my position.

It's a simple case, just let's
give more workers, especially

if you have partitioning.

Because, fun fact, a worker processing
a table, if it's a huge

table, it will be processing it
sequentially, your table and

its indexes, and that's it.

If it's a partitioned table, it
can be distributed.

Each partition can be, you can
parallelize this, right?

So it makes sense to have more
workers if you have more partitions.

And it's great to have partitions
and more workers together.

This is the best situation.

And 2 more categories.

1 defines frequency of revisiting.

Frequency, how frequent workers
come to do stuff.

And another defines how fast they
can throughput, basically,

or throttling.

I think

Michael: throttling's the better
way of thinking about it, yeah.

Nikolay: Throttling, yeah, limiting.

It's cost_limit and cost_delay.

And how often they visit defines
by scale factors and threshold,

but I'm okay to think about only
scale factors, actually.

At first glance, it's enough.

So we have 3 scale factors today,
right?

Or maybe there is also in Postgres
18, there is

autovacuum_vacuum_max_threshold.

This is a new stuff in Postgres
18, which changes the picture.

So usually we talked about only
scale factor and threshold.

Scale factor by default, 3 scale
factors, but the basic 1 is

autovacuum_vacuum_scale_factor,
right?

And it is either 10 or 20 percent.

So I think 20 percent, right?

Michael: Yeah, so I suppose because
the threshold by default

is 0.2, right, so 20 percent.

Let's say you've got 1000000000
rows in the table.

It doesn't wait to get to 200 million,
it starts at 100 million.

Nikolay: Yeah, so we don't allow
more than 100 million, which

is a huge number.

Maybe huge, maybe not.

100 million dead tuples in the
table.

Michael: It's quite a lot though,
yeah.

Nikolay: Yeah, so the basic is
the vacuum scale factor.

This is the first thing.

And it's 20% by default, which
means you need to accumulate 20%

of dead tuples in the table before
it starts garbage collection.

When I just say this, I had recent
experiences.

When I said this to some startup
CTO who which is growing and

so on he like, oh We don't have
garbage collection before 20%

of dirt accumulated This is

Michael: Of course when when things
are small, it's not a big

deal.

But when things get big, this is
just a huge number.

And it's not just...

Nikolay: Yeah, it's...

Let's say you have like almost
terabyte size table in terms of

size, billion rows unpartitioned,
which is common, I see this.

20% means 200 million rows, dead
tuples, right?

Which is by the way above that
threshold introduced in Postgres

18.

So, it's like, wow.

Michael: But it's not just storage,
right?

It's also polluting caches and
indexes.

It's everything.

Nikolay: Oh yes, oh yes.

This is big point.

And actually, let me add 1 important
notice here.

Usually, it's not a big deal for
smaller tables.

But in some cases, we see local...

So if, for example, 1 row updated
many times, thousands of times.

For 1 logical row, you have many
versions.

And globally, like table is big,
but for this not lucky row,

You have so many versions and they
can be, might be scattered

among many pages.

So, they are sitting there as data
tuples and for this specific

query select where id equals something.

This means that all of them need
to be fetched to be checked

in terms of visibility for this
transaction and it degrades for

a particular ID, like local degradation.

Lots of queries are fine, but this
unlucky because it had a lot

of updates or something, right?

It became really degraded.

And it's like, what's happening
here?

This is where p99s or something
are good to have because average

will hide it, average latency and
so on.

And this means you should be vacuuming
more frequently, even

on smaller tables.

So this local degradation due to
dead tuple for particular logical

rows, dead tuple count is high.

So it's happening, and it's happening
silently.

That's the problem.

And tooling, which is like lacking
p99s and so on, is not helping

us to understand.

That's why still logging of slow
queries helps, and auto_explain

helps to catch these like edge
cases and start analyzing why

it's happening.

The next day you select the same
row, maybe it's already vacuumed

and degradation has gone.

But in that moment, if you go back
in time and select it, you

see so many buffers.

This particular query, it's very
like it's primary key lookup,

but too many buffers somehow.

That's why.

So it's important to feel it because
it's really like subtle

problem.

It's hard to catch it if you don't
know where to look at.

Yeah.

And you're right.

So scale factors.

And since Postgres 13, there is
also autovacuum_vacuum_insert_scale_factor,

and there is also
scale factor for analyzing.

For insert, it's really great.

I remember Darafei, the guy who,
from Belarus, I think, he introduced

it.

It was so obvious, why didn't we
have it?

We insert a lot with double table
in size, but vacuum doesn't

come.

And we need vacuum not to clean
up the tuples after successful

insert, but to rebuild visibility
maps, first of all.

And yeah, it is great because if
you have select count or something,

you want index-only scan to be
with low heap fetches, as you

mentioned.

So anyway, there are these scale
factors, And it's enough at

first glance to look only at them.

Yes, you can go deeper and think
about thresholds and how additional

tuning, but scale factors is already
enough.

You want them to be really low
in OLTP workloads, just 1%.

Sometimes 2%, sometimes 1%, I see
people choose, but 1% is is

simple rule.

Set it to 1 percent for all scale
factors and benefit.

Michael: And to give people an
idea of how drastic a change that

is, these scale factors are 20
percent for the vacuum threshold,

10 percent for the analyze threshold,
and I think 20% also for

the insert scale factor.

Nikolay: Yes.

So

Michael: dropping from 20 to 1
might feel like a huge change,

it is quite a huge change, But
you're not repeating the same

work, right?

Like it's not, there's a tiny bit
of repetition.

Yeah, it's, this is a slight difference
to analyze, isn't it,

in that if you run vacuum twice
in quick succession on the same

table, It's not doing all of the
same work again.

It's done a lot of the cleanup.

It's like tidying a room.

Nikolay: Some work

Michael: can

Nikolay: be done again.

For example, it's very nuanced
here.

It's super interesting.

Like you mentioned, it's not only
about data storage.

It's polluting memory, WAL.

WAL means replication, WAL backups.

It's polluting everything.

And memory is super, memory is
expensive these days, right?

So we should optimize, not about
data storage, we should think

about how many pages we need to
keep in this working set.

But the ideal situation when you
raise the workers, make a vacuum

to visit us frequently with lowering
scale factors, all of them

to say 1% is very rough rule here.

But also partition quite well,
because in this case some partitions,

especially if it's historic data,
they will be processed with

marked all visible, all frozen,
and very rarely those pages will

be touched.

So vacuuming will be like a breeze,
like skipping, right?

Unlike if it's a huge, messy heap
with terabyte in size or so,

and any page can be touched at
any given time.

It means it will be need to process
again.

Of course it means that if you
have infrequent vacuuming of a

huge table, of course you're doing
work less, because maybe this

page was visited multiple times
before it was vacuumed.

And if you start vacuuming frequently
and between 2 processing

these pages are revisited by some
rates, then it needs to be

vacuumed again.

So it's multiple processing of
the same page versus a single

processing.

So we cannot say it will be, but
it's highly optimized.

Vacuum is highly optimized.

But if you have a huge table, not
partition, maybe it will be

like frequency, high frequency
vacuuming will cost you some extra,

but it's worth it usually.

So there are nuances here and trade-offs,
of course.

But usually in OLTP, we very strongly
recommend to lower down

scale factors so vacuuming happens
more frequently.

And the fact is that if you're
on managed service, I see it's

quite common.

This is a big problem.

RDS included, for example.

They don't care about this somehow
and they keep defaults.

I would like to hear their reasoning
why they don't proactively

tune it for the customers.

And because I

Michael: mean, I mean, I mean,
I'm just makes, because they're

a scale factor, right?

They can make sense at any scale,
right?

Like it, it's fine when you're
small, These are inexpensive things

to run on small tables, and it
works when it's big.

There's no reason not to tune it
from the start for all customers.

You don't have to be careful about
which instance size.

Nikolay: And I don't complain,
because as I said, I enjoy when

I see the problem.

And I'm excited to help and have
low-hanging fruit and see

how health next day is much better.

But of course we need also automated
reindexing, we need the

pg_repack to be used and so on.

By the way, we mentioned that Postgres
19 will have...

Yeah.

Yeah, it will have...

Michael: REPACK and REPACK CONCURRENTLY.

Nikolay: That's awesome.

And we should somehow revisit this
topic, maybe inviting the

guy who made it, right?

Michael: Yeah, I think it was...

Antonín, right?

Yeah, Antonín was...

We had him on, didn't we, to talk
about pg_squeeze, which inspired

the work.

Nikolay: Remember in the very end
of the episode, I learned about

these plans to make VACUUM FULL
concurrently basically, which

is like became REPACK CONCURRENTLY.

I couldn't believe, I had like
doubts, is it like, but here we

go, like this is happening a year
later, it's planned to be released,

I hope it won't be reverted, it's
a super important feature to

have, it's interesting how out
of the box.

Michael: I have to admit when I,
yeah, I saw that REPACK, they

introduced effectively a rename
of VACUUM FULL to REPACK and that

was going to make 19 but it didn't
have a concurrently option

and I was like, oh, that's such
a shame but it makes sense.

Nikolay: Another year of waiting.

Michael: And then about a week
later, REPACK CONCURRENTLY gets

committed as well.

So yeah, that was great.

Nikolay: It's yeah, it was a mood
swing for me as well.

It was like, oh no.

Oh, yeah.

So anyway, back to scale factors.

This is probably 1 of the most
important to lower scale factors

for OLTP workloads and benefit
from more frequent vacuuming

and analyzing as well, right?

Michael: Just a question, is there
any stage that people are

at where you wouldn't recommend
this.

I was thinking when should people
be thinking about these things

but I think those 2 are such simple
or those 3 let's say are

such simple settings to change
you could change them from the

start and never think about it
again like

Nikolay: globally I

Michael: was thinking.

Nikolay: This is complex and this
is complex thing in terms of

we cannot talk about autovacuum
tuning completely without analyzing

what's blocking the work.

Michael: If anything.

Nikolay: Yes and that's why I'm
like we recently developed a

dashboard for xmin horizon analysis
because we we probably need

to revisit it once again in terms
of long transactions versus

pure xmin horizon blocker analysis
and autovacuum throughput

analysis and again like this I
cannot recommend enough Laurenz

Albe recent blog post about how
he monitors autovacuum.

Combining with what, how we monitor
autovacuum, I think it's

much like we leveled up monitoring
of autovacuum blockers basically,

right?

So xmin horizon, there are 5 reasons
why it can be blocked.

And if you keep those problems
unresolved for hours, or sometimes

dozens of hours, tuning autovacuum
to run more frequently, especially

if you reduce also autovacuum_naptime, how
long it sleeps between runs,

and raised workers.

And also, some problems might happen.

For example, it might start consuming
too much resources just

trying to do something.

It cannot do something.

And also, logging can be, you can
have observer effect.

If there is a, like it tries to
do something, it cannot, but

it reports and many workers report.

And if you, at extreme, you have
log_autovacuum_min_duration

set to 0, meaning all attempts
are logged.

In this case, if you have 25 workers,
for example, and they cannot

do, and they autovacuum_naptime 1 second,
it's a storm of logs.

This is 1 problem I can definitely
imagine.

Michael: You mean 1 millisecond?

Nikolay: No, by default autovacuum_naptime
is 30 seconds.

Michael: What am I thinking of
that's 2 milliseconds by default?

Nikolay: Oh, actually I was wrong.

I'm looking at it right now, it's
60, 1 minute.

autovacuum_naptime is by default 1 minute,
which is quite long in my opinion,

so we usually say let's choose
a few seconds, but I admit aggressive

tuning without resolution of xmin horizon
blockers might lead

to storm of logging and resource
utilization as well.

So this is what I can think of
from top of my head.

Michael: Yeah, so sorry, I was
thinking of cost, there's a cost

delay.

Nikolay: Right, this is the last
thing we haven't touched.

So as I said, There are like throughput
in terms of number of

workers, there is frequency, how
often vacuum and analyzing happens,

and also the remaining piece, big
piece is throttling.

You said we should say throttling,
right?

So 2 parameters.

And in Postgres, who is responsible
for naming?

So there is an attempt to save
on some defaults or something.

There is some parameters that have
relationships.

Inheritance, right?

So there are 2 parameters,
cost_delay and cost_limit.

There are vacuum_cost_delay,
vacuum_cost_limit, this pair.

And by default, I think cost_delay
is 0, means it's unthrottled.

If you run vacuum manually, it's
unthrottled.

It tries to go full speed, usually
bounded by I/O, disk.

Right?

And there is also autovacuum_vacuum_cost_delay,

autovacuum_vacuum_cost_limit, 2 parameters.

1 of which I think
autovacuum_vacuum_cost_limit is set to minus

1, which means it inherits from
vacuum_cost_limit, which is set

to 2 milliseconds, right?

It was 200 milliseconds and then
it was, or 20 milliseconds in

the past before Postgres.

Michael: Yeah, the delay is, I've
just looked it up:

autovacuum_vacuum_cost_delay is

Nikolay: 2 milliseconds.

Yeah, I confused everything, I
see.

So autovacuum_vacuum_cost_delay
is 2 milliseconds and it was

before Postgres 12 it was 30, so
it became 2 in Postgres 12 which

is great.

And autovacuum_vacuum_cost_limit
is minus 1 means that it inherits

from vacuum_cost_limit Which is
by default 200

Michael: Which is like an arbitrary
unit of like work right and

there's some various things that
make up, yeah.

Nikolay: Yeah, I your CPU cycles,
there is some math.

What I don't like about acquisition
of EDB, of SecondQuadrant

by EDB, they just killed the blog.

Blog was brilliant.

And they had a great blog post
about some math.

If you think only about the I/O part,
all defaults before Postgres

12 yielded to 8 MB/s of reads.

So with 120 milliseconds, it was
roughly 8 mb of reads.

So bumping 20 milliseconds to 2
milliseconds, I remember it was

first considered to raise cost
limit, but then decided to reduce

frequency of this
throttling, right?

Because it makes things smoother,
right?

So going down from 20 milliseconds
to 2 milliseconds and maintaining

the cost_limit 200, inherited from
vacuum_cost_limit.

I don't like this, just put straightforward
defaults.

It doesn't hurt.

Minus 1 makes life of humans much
harder, much harder, right?

Like it's, okay, anyway, maybe
I should just propose it.

Maybe it wasn't proposed.

Maybe it was proposed.

I don't know.

It's My battle over dozens of,
over decades already.

This is happening decades, this
suffering from memorizing what

minus 1 means.

We forgot actually memory management.

Fourth piece, right?

There is minus 1 there as well.

autovacuum_work_mem is minus 1
by default, which means you need

to go and consult and there is
big interesting trick happened

in Postgres 17, separate story.

So that's

Michael: maintenance_work_mem,
right?

Nikolay: Yeah, maintenance_work_mem.

If we have time, we can talk about
that as well and dive into

it a little bit.

So, minus 1 for
autovacuum_vacuum_cost_limit inherited 200,

this like arbitrary cost, not arbitrary
actually, but some cost

like 2 milliseconds.

Since Postgres 12, meaning in all
current versions, it means by

default that can just take reads
of 80 MB/s,

roughly, which is not enough in
modern systems.

Ah, and it's shared among all workers,
unless you start tuning

a table level, which makes things
much more complex because then

quotas with throttling becomes
local for this table, for this

worker which is processing specific
table.

I prefer not to go there.

We can have a separate episode
for table level analysis.

Tuning, table level tuning.

So big picture is like by default
it has only 80 MB/s

of reads and if you
have powerful, even EBS volume

which I don't know it's a nitro,
old nitro architecture so it

definitely NVMe, like it has
good throughput.

80 is tiny among all workers.

Again, this is garbage collection
for everything.

So it should have several hundred
megabytes per second at least.

But the good thing is, for example,
RDS fixes this, especially

on Aurora we see enormous quota.

Throttling there is quite good
in terms of like it has a lot

of power.

But in many other cases, we see
it needs attention and we need

to give more like, how to say,
work capacity because otherwise

it's too slow And that's not good.

Michael: It never catches up, if
that makes sense.

Nikolay: It's a kind of situation.

You see all the workers you allocated,
they're always busy.

This is number 1, how you can understand
that something is not

okay.

And second, we have a special report,
we have a query, I have

a blog post about that, and we
also have it in our monitoring,

like a queue or line of tables
which sit there and should be

already processed but not processed
yet.

And this, like, basically a wait
line for tables.

And I think Laurenz also has in
his blog post a slightly different

analysis.

But the idea is the same.

How bad is it, the picture?

It should be done already, but
not yet done.

This is how you also feel this
situation happening.

Michael: From the other side of
things, I saw a good post recently

by Jeremy Schneider about the risk
of completely removing throttling.

So he wrote a good post recently
about an issue they encountered

once.

I think they reduced the...

Nikolay: I haven't seen it.

Sometimes people put to Cron.

We haven't touched this specific
cases related to MultiXact

member situation when people have
this approaching and they need

to vacuum separately.

Switching of indexes because index
vacuuming because for that

problem you don't need, and this
is actually another job to clean

up those things.

You don't need indexes to be involved
and this slows down the

whole thing if you have many indexes.

So anyway, I wonder, I need to
look at that, but, and when you

do manual vacuuming it's unthrottled,
as I said, because

vacuum_cost_delay is 0.

So it's not checking at all.

Let's go.

Michael: But I think that's, isn't
that because every now and

again, you want to run a manual
vacuum as like an emergency procedure?

Like,

Nikolay: yeah, it makes sense.

I agree.

But if it's throttled and it's
unthrottled, Actually, this is

super interesting for me.

In the past, I had problems with
unthrottled vacuum, but it was

when we had some enterprise disks
and self-hosted Postgres and

our capacity was like 400 megabytes
per second.

I also am very curious right now
in Postgres 18 with AIO and

so on, like how a single worker
might behave.

Remember I told you I was super
impressed to see the pg_basebackup,

which is single threaded
and I expected 100 megabytes

of reads from it.

Suddenly it amazed me, approaching
like gigabyte per second of

reads in Postgres 18.

So maybe a vacuum also, I'm not
sure, like this I haven't visited.

So- Yeah,

Michael: I think it was me that
suggested it might be AIO related.

Nikolay: Yeah, but let me correct
myself.

So there is fourth piece memory,
super important.

And if we don't think about it,
it might bite badly, especially

if we raise workers.

So all our clients, we say, okay,
we raised workers, but we're

jumping from say Postgres 15 to
17, you must limit memory for

autovacuum in advance because
it's by default minus 1 means

it takes the value from maintenance_work_mem.

And maintenance_work_mem people
usually raise quite quickly thinking

they will have much faster index
creation.

So they put 2 gigabytes or so.

Now I think like 25 workers, 2
gigabytes is already a lot, right?

And the problem is like before
Postgres 17, you might not notice

it because autovacuum workers
suddenly limit themselves with

1 gigabyte and then suddenly jump
to no obvious limit anymore.

No 1 gigabyte limit anymore for
autovacuum workers.

And If you have 4 gigabytes, I
saw 8 gigabytes, like these things,

right?

And many workers, it's huge memory
consumption.

Michael: Or risk, like not necessarily,
but there's the risk

that they could take up that much,
right?

It's not pre-allocated or anything.

It's just that's the budget they
have.

Nikolay: It can influence and it
can increase risks of out of

memory.

Michael: Yeah, yeah, yeah.

— OOM killer.

Nikolay: —

Michael: So, but then you can set,
yeah, great.

But then you can set the autovacuum_work_mem.

I've forgotten the exact name of
it.

Nikolay: — Yeah, I found like half
a gigabyte, some math should

be done there, and we usually like,
we have it, like some formulas.

So usually half a gigabyte, even
maybe lower, and it will be

okay, right?

It's interesting topic, but yeah,
so usually for maintenance

work, again, I wish like this minus
1 is killing me, like decayed

already.

Like why we don't put explicit
value right away?

Why this like I need to go in some
maze and remember.

It took me many years to settle
in terms of these, like, relationship

between vacuum and autovacuum
settings.

And for non-expert who is not working
with Postgres settings

every day, it's impossible to memorize.

Right?

It's super tricky.

Michael: So you would have hard-coded
values for each even if

they start with the same value?

Nikolay: Doesn't hurt.

Yeah, so if we have maintenance_work_mem by default, I don't know,

what is it, like 128 megabytes
or so, like maybe more.

Just copy paste.

This is not the place where we
should save on copy pasting.

We shouldn't save like few bytes.

Minus 1 is 2 bytes, right?

128, it's okay, a few bytes more.

This is not the place where we
should save it, right?

This is this complexity, extra
complexity, which is like, okay,

I need to redirect this frustration
to proper place.

Michael: Yeah, by the way it's
64 megabytes default

maintenance_work_mem.

I didn't realize it was

Nikolay: that low.

Postgres, yeah, Postgres.

Yeah.

Michael: All right, awesome.

And actually, the last thing I
wanted to say, so as a little

summary, if people have less than
8 vCPUs, maybe leave, you can

probably leave the workers in place,
but still change the thresholds

and it's just those 3 thresholds
will get you like 90% of the

way there probably.

I think most of...

Reduce, you said down to 1% I think
that makes a lot of sense.

I've generally advised people to
like 5 or 2% as like a starting

point but 1% makes a lot of sense.

It's a bit more, but I think it

Nikolay: won't save you from large
tables It won't save you from

this problem of local degradation
for particular rows It's just

like maybe you need to think about
it additionally, but overall

it's like good, very rough tuning
for

Michael: all the teams.

Yeah, absolutely.

I think that'll get a lot of people
a lot further.

Most providers let you set that
globally.

I did come across 1 recently that
only lets you change these

things at a table level, at which
point my advice would be focus

on your important tables and just
change a few of those.

But that was only 1 provider.

Nikolay: Or just migrate.

Michael: Or migrate, yeah.

That small task of migrating database.

But yeah, so that sounds wise.

And then, yeah, memory tuning,
make sure it's not using

maintenance_work_mem if you've upped that.

Smart advice.

Nikolay: Yeah, there are other
parameters, but it's like already

deeper things.

Michael: Yeah, sounds good.

Nikolay: My simple rough tuning
list, raise number of workers.

This requires restart, unfortunately,
then make visits more frequent,

decreasing scale factors to 1%,
for example, roughly, and give

more power, unless it's already
done by provider, raising, for

example, raising cost_limit.

Oh yeah, cost_limit, right?

And default settings, as I said,
Postgres default settings, because

not providers default settings.

So it's roughly 80 MB/s for disk reads.

Think how much you want, like maybe
just raise it, maybe sometimes

4 or 5 times, depending on disk
capacity.

It's very important to understand
that.

And finally, autovacuum_work_mem,
set it explicitly, minus 1 sucks in all

senses.

It's just very hard to deal with.

I like

Michael: that.

I like setting it explicitly, even
if you plan to keep it the

same as your maintenance_work_mem
now.

Nikolay: Yes, like freeze it before
upgrading to 17 for example.

Or before touching number of autovacuum
workers.

Just make it explicit so to simplify
math analysis including

with AI so it won't jump additional
like mind hop.

Michael: And just avoiding that
future change where somebody

wants to increase maintenance_work_mem
and doesn't think

Nikolay: at

Michael: that moment about
autovacuuming.

Nikolay: Because these are 2 separate
things.

They operational, they are very
separate.

Indexing and vacuuming, autovacuuming.

Yeah, and that's it.

These 4 areas for rough tuning,
it's enough.

It should become bloat situation
much better, but there are also

blockers.

If You can tune a lot, tune everything
perfectly, but if you

have some xmin horizon blocker,
and we have episode about xmin horizon,

then it won't help.

Bloat will accumulate anyway.

Michael: Yeah.

Great.

All right.

Nikolay: Good.

Yeah.

Michael: Nice one, Nik.

Thanks so much.

Nikolay: Yeah, I hope it was helpful.

See you next time.

Michael: See you next time.

Some kind things our listeners have said