Michael: Hello and welcome to Postgres.FM, a weekly show about

all things PostgreSQL.

I am Michael, founder of pgMustard, and I'm joined as usual by

Nik, founder of PostgresAI.

Hey Nik!

Nikolay: Hi Michael, how are you?

Michael: I am good, how are you?

Nikolay: Very good.

Michael: Great.

And what are we talking about this week?

Nikolay: Disks.

If you imagine the regular database icon, the picture we usually use to visualize a database on various diagrams, it consists of disks, right?

Michael: Yeah, I'm thinking of a cylinder, normally with like 3 layers?

Nikolay: Yeah, or 4. And obviously databases and disks are close to each other, right?

But my first question, why do we keep calling them disks?

Michael: Like outdated term you mean?

Nikolay: Yeah, obviously.

I don't know.

Michael: What does the D in SSD stand for?

Nikolay: Yeah, actually sometimes we say logical volumes, storage volumes, something like this.

And in cloud context, especially EBS volumes, right?

We talk about them like that. In all cases it's still acceptable to say disks, but they don't look like disks anymore, right? They are rectangular, with microchips instead of rotational devices, right?

Michael: Yeah, makes sense.

Nikolay: In most cases, not in all cases.

Rotational devices can be still seen in the world, but not often

if we talk about OLTP databases because it's not okay to use

rotational devices if you want good latency.

But yeah, so disks, because databases, they require good disks,

and they depend on it heavily in most cases, not in all.

Sometimes it's fully cached, so we don't care if it's cached,

right?

Michael: Yeah, I was gonna ask you about that.

Because I think even in the fully cached state, if we've got

a lot of writes, for example, we might still want really good

disks.

There's things where we're still writing out to disk and we want

that to be fast, not just reading from.

Nikolay: But we are not writing to disk.

If we move to the Postgres context, we don't write to the disk

except to WAL, right?

WAL.

Yes.

Yeah, and that's it.

Well, yeah, I agree it can be expensive
if a lot of data is written.

So, yeah, you're right.

Because we need to write our tuples. And if it's a full-page write after a checkpoint, we need to write the whole page, an 8-kilobyte page.

Yes, and we need to get an fsync before the commit is finalized.

So definitely it goes to disk. But data in terms of tables and indexes is written only to memory, and normally it's flushed at checkpoint, or written first to the page cache, and then the page cache can use pdflush or something like that to write it further to disk.

But yeah, in terms of fsync, write
latency is important.

It affects commit time.

By the way, I just had a case,
it's slightly off topic, but I

published a tweet and LinkedIn
post about LISTEN/NOTIFY.

I added them to the list of deprecated
stuff.

Michael: It's not deprecated, right?

But you're saying you recommend
not using it yeah at scale?

Nikolay: Yeah well if...

Michael: Or possibly at all.

Nikolay: Yes, my Postgres vision deviates from the official vision in some cases. For example, the official documentation says don't set statement_timeout globally because blah, blah, blah.

And I don't agree with this.

In OLTP, it's a good idea to set it globally to some value and override it locally when needed.

And here, LISTEN/NOTIFY, I just see we should abandon it completely until it's fully redesigned, because there is a global lock. One of our customers, Recall AI, published a great post about this because they had outages. And it's related to the topic we discussed in an interesting way.

To reproduce it, I used a bigger
machine.

And the issue is with NOTIFY: at commit time, it takes a global lock to serialize NOTIFY events. A global lock, like on the whole database, an exclusive lock. Insane.

And if commit is fast, everything
is fine.

But if in the same transaction you write something, the commit waits a little bit for WAL, right?

In this case, contention starts
because of that lock.

So if you have a lot of commits which are writing something to WAL, meaning they need an fsync and they need to wait on disk, and the disk is slow, and you use NOTIFY...

This doesn't scale.

Performance will be terrible very
soon.

At some concurrency level, you
will have issues and you will

see commits spanning many milliseconds, dozens of milliseconds, and then up to seconds.

And eventually the system will be down.
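The dynamic Nikolay describes can be sketched outside of Postgres. This is a hypothetical Python simulation, not Postgres code: a single lock stands in for the global NOTIFY lock, held while each commit waits for a simulated WAL fsync, so commits serialize and total time grows linearly with concurrency:

```python
import threading
import time

notify_lock = threading.Lock()  # stands in for the single global NOTIFY lock

def commit_with_notify(fsync_delay: float) -> None:
    # Each committing backend holds the global lock while it waits
    # for the WAL fsync, so commits are fully serialized.
    with notify_lock:
        time.sleep(fsync_delay)  # simulated slow-disk fsync

def run(n_clients: int, fsync_delay: float) -> float:
    threads = [threading.Thread(target=commit_with_notify, args=(fsync_delay,))
               for _ in range(n_clients)]
    start = time.monotonic()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.monotonic() - start

# 10 concurrent commits with a 5 ms fsync take at least 50 ms of wall
# time, because the lock serializes them regardless of core count.
elapsed = run(10, 0.005)
```

With a fast fsync the serialization is barely visible; with a slow disk the lock hold time dominates and throughput collapses, which is the contention pattern described above.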

Anyway, this is related to slow disks.

You're right, if write latency is bad, we might have issues.

Michael: Yeah, but you're right too, that the majority of the

time we care about the quality of our disks, it's when our data

isn't fully in memory and we're worrying about reading things

either from, well, from disk, or even from the operating system.

It's hard to tell from Postgres sometimes where it's coming from.

But we have a

Nikolay: bit of documentation.

It's impossible to tell in Postgres unless you have pg_stat_kcache

extension.

That's why, since BUFFERS is already on by default in Postgres 18, again, I advise all people who develop systems with Postgres: if possible, include the extensions pg_wait_sampling and pg_stat_kcache. And kcache can show it.

Michael: Yeah.

I think you're right, it's impossible to be certain without those. But, for example, through I/O timings, which is another thing people might want to consider having on, obviously with a bit of overhead.

Nikolay: track_io_timing, you mean?

Michael: track_io_timing gives you an indication. Like, if you're seeing not too many reads from either the disk or the operating system, and the I/O timings are bad, you've got a clue that it's coming from disk.

Nikolay: Yeah, indirectly we can guess that this time was spent.

Yeah, not many, it's a good point because sometimes it's fully

cached.

In the page cache we see reads, and if there are so many of them, your timing is spent reading from the page cache to the buffer pool and disk is not involved. That's if volumes are huge. But if volumes are not huge and significant time is still spent, very likely it's from disk.

Michael: Exactly.

Nikolay: Yeah, yeah.

Michael: Yeah.

It's not novel; this is something we added to our product just as a tip. It doesn't come up that often. Like it's not...

Nikolay: Enabled track_io_timing?

Michael: Because most people don't have that on, we actually just use the buffers, like shared read, and then the timing, the total time of the operation.

Nikolay: It's a pity it's not on.

In big systems we have it on, and I never saw big problems on modern hardware, at least Intel and ARM, Graviton2 on Amazon. I just see it's working well.

There is a utility so you can check your infrastructure and understand if it's worth enabling, but my default recommendation is to enable it.

Of course there might be an observer effect, and it can be double-checked if you want to be serious with this change, but I just see that we enable it.

Michael: Yeah, it's all to do with
the performance of the system

clock checks.

And I think, for example, the setups I've seen with really bad performance are dev systems running Postgres inside Docker and things like that, which still have really slow system clock lookups.

But most people aren't doing that
with production Postgres databases.

And I haven't seen even any of
the cloud providers have slow,

I think it's pg_test_timing or something
like

Nikolay: that.

pg_test_timing, I double checked.

Michael: So yeah, you can run it
really easily.

And.

Nikolay: But what if it's managed
Postgres?

You cannot run it there.

In this case, you need to understand
what type of instance is

behind that managed Postgres instance.

Take the same instance in the cloud.

For example, if it's RDS, from the RDS instance name you can easily understand which EC2 instance this is.

Right?

Yeah.

You can install it, it will be...

Well, operating system matters
also, right?

Michael: There are some, yeah,
there are some tricks you can

do, like do things that would call
the system clock a lot, like

nested loop type things, or count
aggregations, things like that,

like trying to get lots and lots
of loops.

Nikolay: Ah, you're talking about
testing at higher level, at

Postgres level.

Yeah.

Oh, that's a good idea.

Yeah, a lot of nested loops. And you test with this and without this, running like 100 times, taking the average, for example, and comparing averages. And as you can guess, yeah, it's a good test, by the way.
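What pg_test_timing measures at the system level can be approximated in any language by timing the clock calls themselves. A rough Python sketch, hypothetical and illustrative only (pg_test_timing itself reports per-loop overhead in nanoseconds and a histogram, which this does not reproduce):

```python
import time

def clock_read_overhead_ns(n: int = 200_000) -> float:
    # Roughly what pg_test_timing measures: the average cost of one
    # clock lookup, which is the price track_io_timing pays per I/O.
    start = time.perf_counter_ns()
    for _ in range(n):
        time.perf_counter_ns()
    return (time.perf_counter_ns() - start) / n

overhead = clock_read_overhead_ns()
# On a healthy system this is typically tens of nanoseconds; if it is
# orders of magnitude higher (as in some Docker/dev setups), enabling
# track_io_timing costs correspondingly more.
```

The same idea underlies the higher-level test described above: run a loop-heavy query with and without timing enabled and compare averages.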

Michael: I think the first time
I saw that was Lukas Fittl.

I think he must have done a 5 minutes
of Postgres episode on

this kind of thing.

So I'll link that up.

Nikolay: Yeah, I'm glad we touched on this because, again, our default recommendation is to have it enabled.

It's super helpful in pg_stat_statements analysis and in EXPLAIN ANALYZE plans, so yeah, track_io_timing, if possible, should be enabled.

And this is related to disks directly,
of course.

Yeah.

Although, strictly speaking, it's not the timing of disks. It's the timing of reading from the page cache to the buffer pool.

So it might include pure memory
timing as well.

That's why you-

Michael: But it does.

Nikolay: Yeah, yeah.

That's why your comment about large or not large volumes is important.

But honestly, if you are a backend engineer, for example, listening to this episode, I can easily imagine that in 1 month you will forget about this nuance and will think of track_io_timing as being only about disks.

Right?

And it's okay, because it's really a super narrow topic to remember, to memorize.

Yeah.

Michael: Well, and I guess this is moving the topic on a tiny bit, but if you're on a managed Postgres setup, which a lot of backend engineers working with Postgres are, you don't have control over the disks.

You're probably not going to migrate
provider just for quality

of disks.

Maybe you would, but it would have
to be really bad and you'd

have to be in a setup that really
was hammering them.

Maybe super write-heavy workload
or huge data volumes that you

can't afford to have enough memory
for, you know, those kinds

of edge cases where you're really
hammering things.

Nikolay: Well, there are 2 big
areas where things can be bad.

Bad means saturation, right?

Yeah.

We can saturate disk space, run out of disk space so to speak, and we can saturate disk I/O.

Both happen quite often.

And managed Postgres providers
are not all equal, and clouds

are not all equal.

They manage disk capacities quite
differently.

For example, at Google, at GCP, I know regular PD-SSD, quite old stuff. It has a maximum of 1,200 MB/s separately for reads and separately for writes, speaking of throughput. And 100,000 or 120,000 IOPS maximum.

And I know from the past discussions
with Google engineers that

actually the real capacity is bigger, but it was not sustainable. So it was not guaranteed all the time. They could raise the bar, but then it would not be guaranteed, so they decided to choose the guaranteed bar for us.

Michael: Makes sense.

Nikolay: Yeah, but basically we're not using it at its full potential. We could use more, right? But we cannot.

So they throttle it.

Michael: Okay, interesting.

Artificially,

Nikolay: to have guaranteed capacity
for this disk I/O.

Michael: Interesting.

I guess the subtlety that I was
missing was not when you're at

the maximum.

So in between tiers, imagine you're
in a much smaller setup.

I see a lot of people just upgrading to the next level up within that cloud provider to get more IOPS.

You know, if you're on Aurora,
just scaling up a little bit instead

of switching from an Aurora to
Google Cloud.

But you're right.

When you're at the last, or second
to last level is when people

start to worry, isn't it?

When you're at the last level,
you can't just scale up on that

cloud provider anymore.

So yeah, really good point.

Nikolay: And also, at Google, for example, I know these rules, and they're artificial. So this throttling, what I just told you, can also be throttled additionally if you don't have many vCPUs. The maximum possible limits are achieved only if you have 32, as I remember, vCPUs or more.

If it's less, the limits are lower. It can also depend on the family, I think. The instance family.

Interesting.

So complex rules.

On Amazon, on AWS, EBS volumes, okay, there are gp2, gp3, io1. You choose between them, and there are also provisioned IOPS. Really complex, right?
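The interaction between IOPS caps, throughput caps, and block size reduces to simple arithmetic. A sketch with hypothetical volume numbers, not any specific cloud's real figures:

```python
def throughput_mib_s(iops: int, block_size_kib: int) -> float:
    # IOPS * block size = throughput; whichever advertised ceiling
    # you reach first is your effective limit.
    return iops * block_size_kib / 1024

# Hypothetical volume: 16,000 IOPS cap and 1,000 MiB/s throughput cap.
# With 8 KiB Postgres pages, the IOPS cap allows only ~125 MiB/s,
# so you hit the IOPS ceiling long before the throughput ceiling.
effective = min(throughput_mib_s(16_000, 8), 1_000)
```

This is why block size matters when testing: the same volume looks throughput-bound with large sequential blocks and IOPS-bound with small random pages.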

Michael: And you haven't even mentioned
burst IOPS yet.

Nikolay: Yeah, yeah.

So hitting IOPS limits is really
easy, actually.

If you

Michael: insist on smaller.

Yeah, well, the times I see people
hitting it is like they're

doing a massive migration.

Nikolay: No, no, that's not it.

Michael: Big import.

Okay, so when? Just, like, growing?

Nikolay: Yeah, the project just grows and then latency, database latency, becomes worse.
Why?

We check and we see... well, if you have experience looking at graphs, you can easily identify some plateau. It's not an ideal plateau, there are usually some small spikes, but you feel: oh, we are hitting the ceiling here. Checking disk I/O. It's not a cliff, no, it's a wall instead of a cliff.

A cliff, and this is an important distinction, is when everything was okay, okay, okay, and then suddenly, with slightly more load or something, you are completely down, or down drastically, 50-plus percent.

Michael: Okay, yeah.

Nikolay: Here we have a wall: everything is okay, okay, and then slightly not okay. Okay, okay, slightly not okay. And then more load is coming and we start queuing processing, right, accumulating active processes.

So with a performance cliff, if you raise load slowly, there is an acute drop in the capability to process the workload. In the case of hitting the ceiling in terms of saturation of disk I/O or CPU, it's different.

You grow your load slowly, and as you grow further, things become worse, worse, worse. It's not acute. It's slightly more, more, more, and things become very bad only if you grow a lot further, right? So it's not an acute drop. It's like hitting a wall. It feels like hitting a wall.

You know, imagine many lines in a store, for example. We have several cashiers, 8 for example, and normally lines should be 1 or 2 people, 0 or 1 only. This is ideal throughput, everything is good. We haven't saturated them. Once we saturate them, we see lines accumulating. And latency, meaning how much time we spend processing each customer, starts to grow, but it doesn't grow acutely, boom, no.

A performance cliff here is, for example, if we talk about cash only, no cards involved: we had remaining cash for change in all lines, right? The cashiers can say, okay, you have change, I have change, okay, we're processing. And then suddenly we're out of cash to give change. This is an acute performance cliff. They say, okay, we cannot work anymore. Boom.

Right.

We need to wait until someone goes somewhere, like we need 15 minutes of waiting. This is the important distinction between a performance cliff and hitting the wall or ceiling.
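The "wall" shape can be made concrete with the classic M/M/1 queue formula, where mean response time is 1/(μ − λ): latency degrades smoothly as utilization approaches 100%, rather than dropping off a cliff. A sketch with illustrative numbers (one "cashier", rates chosen arbitrarily):

```python
def mm1_response_time_s(arrival_rate: float, service_rate: float) -> float:
    # M/M/1 mean response time: 1 / (service_rate - arrival_rate).
    # It grows smoothly as utilization approaches 1: a wall, not a cliff.
    if arrival_rate >= service_rate:
        raise ValueError("saturated: the queue grows without bound")
    return 1.0 / (service_rate - arrival_rate)

SERVICE_RATE = 100.0  # ops/second a single server ("cashier") can handle
# 50% utilization -> 20 ms, 90% -> 100 ms, 99% -> 1 s.
latency = {u: mm1_response_time_s(u * SERVICE_RATE, SERVICE_RATE)
           for u in (0.5, 0.9, 0.99)}
```

Real disks and databases are not M/M/1, but the qualitative point carries over: each increment of load near saturation costs disproportionately more latency, without a single sudden break point.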

Michael: Okay, I haven't heard that strict a definition before. It sounds to me like you're describing the difference between a blackout and a brownout. Have you heard of a brownout? A blackout is kind of like your database can't accept writes anymore, or even SELECTs, like no reads, everything is down. A brownout would be: it's still working, but people are seeing spinning loaders, maybe it loads after 30 seconds, or maybe some people are hitting timeouts and some people aren't, and there's the queuing issue like in the supermarket you talked about. Performance is severely degraded, but it's not completely offline. It's still working, at least for some people. So it feels like that's the kind of distinction you're talking about.

Nikolay: Yeah, and a brownout can become a blackout if you keep adding load. If the saturation happened at some workload level and you then give it 10x, of course it will be a blackout, because of context switching and so on. But it's different.

But a performance cliff happens very quickly. It's a much more acute situation.

Michael: I think I'm also biased
by the cases that I've seen

which are more acute because they
are bulk loads or backfills

where they are running at a much
much higher rate than they would

normally be, they're consuming
IOPS at a much higher rate than

they normally would, so they hit it really fast, and it's like running into the wall extremely fast. But I guess if you approach the wall slowly, it's not going to hurt quite as much.

Yeah.

Okay, I think I understand.

Nikolay: Yeah, back to disks.

Definitely, we should check disk I/O usage and saturation risks.

So you

Michael: mean like monitoring, alerts for when we're close to our limits?

Yeah.

Nikolay: Yeah.

And also it might be interesting. For example, I remember, I don't know about right now, but many years ago on RDS: we asked, okay, a small system, maybe we need 10,000 IOPS, but somehow we see saturation at 2,500. Oh, there is RAID, actually. We have 4 disks, and that's why. Okay, okay.

So there are interesting nuances
there.

So understanding your limits is super important.

And like, I think clouds could
do a better job explaining where

the limits are.

Because right now you need to do
a lot of legwork to figure out

what is your advertised limit.

For example, as I said, at GCP
you need to understand how many

vCPUs.

Also disk size. I forgot, like 10 terabytes, I think, is when you reach the maximum. Or 1 terabyte. My memory fools me a little bit.

So you need to take into account many factors to understand,

oh, our theoretical limit is this.

And then ideally you should test it to see that it can be achieved.

Testing is also interesting because, of course, it depends on

block size you're using.

And it also depends on whether you're testing through the page cache or with direct I/O, right? So writing directly to the device.

And then you go to the graphs and monitoring and see some disk

I/O in terms of IOPS and throughput, separately reads and writes,

and then you think, okay, let's draw a line here.

This is our limit.

So what I'm saying, they should draw the line.

Clouds should draw the line.

They know all these damned rules, right, which are really complex.

So this should be automated.

This line should be automated.

Okay, with this, this, and this, and this, we give you this.

This is your line in terms of the capabilities of your disk. And here you are, okay, at 50%.

Okay, I know.

Right now it's like a whole day of work for someone to understand all the details, double-check them, and then correct mistakes. Even if you know all the nuances, you still return to this topic and think, oh, I forgot this. Redo.

So, yeah.

Yeah.

Michael: When you mentioned the terabytes thing: I was working with somebody a while back who wasn't using the disk space they already had. Let's say they had a 1 terabyte disk and only a couple hundred gigabytes of data. But they expanded their disk to a few terabytes so that they would get more provisioned IOPS, because that was the way to get them. So is that what you're talking about? You need a certain size?

Yeah.

Nikolay: So the rule for throttling is so multi-factor.

You need to read a lot of docs.

And like with GCP, AWS, I have pages which I read many, many

times per year, carefully trying to remember, oh, this rule I

forgot again.

Why isn't this automated?

Someone can say, okay, these limits depend on block sizes. Okay, but if it's RDS, the block size is already chosen. Postgres uses 8 kilobytes. If it's ext4, it's 4 kilobytes there.

Everything is already defined.

So we can talk about limits for
throughput quite well, right?

So yeah, this is, I think, lack
of automation here.

Michael: Also, you mentioned the number of vCPUs. I guess they have all the settings, they...

Nikolay: They have all the knowledge, and yeah, they define these rules.

Michael: yeah

Nikolay: So give me this usage level and an understanding of how far from saturation I am. Because it's so important. In reality, we wait until that plateau I mentioned, and only then do we go and do something about it and raise the bar. There should even be alerts: your database is spending 80-plus percent of its capacity on disk I/O, be prepared to upgrade, you know, add more.
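The alert being asked for is trivial once the provider exposes the limit. A hypothetical sketch (function name and threshold are mine) that just compares observed usage against an advertised ceiling:

```python
def disk_io_alert(observed_iops: float, limit_iops: float,
                  threshold: float = 0.80) -> tuple[float, bool]:
    # Returns (utilization, should_alert). The hard part in practice
    # is not this math but knowing limit_iops, which depends on volume
    # type, volume size, vCPU count, and instance family.
    utilization = observed_iops / limit_iops
    return utilization, utilization >= threshold

# e.g. sustained 85,000 IOPS against a 100,000 IOPS ceiling -> alert.
util, fire = disk_io_alert(85_000, 100_000)
```

The complaint in the discussion is precisely that `limit_iops` is buried in multi-factor throttling rules, so nobody draws this line automatically.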

Michael: Yeah, well, I was gonna say, sometimes there are perverse incentives here, where they're not incentivized to help you improve your performance, so that you upgrade. But in this case the incentives should be aligned.

Nikolay: Yeah. At the same time, these complaints we are currently expressing all remind me of the complaints of a guy sitting on an airplane, saying there is no legroom and so on. You're sitting in the air, flying 30,000 feet above the ground. And it's magic, right? So these EBS volumes, PD-SSD, other newer disks on GCP, or NVMes, they are great. I mean, snapshots, the elasticity of everything, it's great. We just want even more.
Michael: It's good that you're being positive about them. But I feel like I hear quite a lot of people saying that this is still 1 of the cases where self-hosting is better.

So actually, I think a lot of the time with the cloud you're paying for hardware that might be a bit on the older side, and you have no control over that.

So it's, yeah, I'm interested in
your take on that as somebody

who's historically been pro-self-managing
or some hybrid version.

Nikolay: So I love clones and snapshots, that's why... EBS volumes and what RDS has. Even with lazy load involved, when we restore from a snapshot and it's actually getting the data from S3, it still feels like magic, and it's great for reproducing incidents and so on.

And snapshots are cheap because
they are stored in S3.

At GCP it's the same, although
there is lazy load there as well,

although their documentation still
doesn't admit it.

But just looking at the price, we understand that snapshots of Google Cloud disks are stored in GCS, the S3 analog.

It's great.

But also, think about a cluster of 3 nodes, or 4, 5, 6, up to 10 nodes and more. Some people have more.

Database is basically copied to
all replicas and on replicas

it's stored on disk and disk becomes
more and more expensive over

time.

So it can be significant.

It can be even more than compute
sometimes.

That's the point.

If we have a large database but the working set is not that large, we can have much smaller memory, like not a big compute instance. We had these cases, for example with a lot of time series data. And the disk is much bigger than you would expect.

And then all replicas need to have
the same disk.

And this disk, if it's EBS volume,
it becomes expensive.

Very expensive and contributes
to costs so much.

So then you think, why not use local disks? Well, we used local disks for benchmarks. It was an i3 instance, like years ago, 7 years ago maybe. I started liking them because it's always included in the price of the EC2 instance, right?

And it's super fast.

It's basically 1 order of magnitude faster in terms of IOPS. It can give you a million IOPS these days already. And throughput, 3 gigabytes per second.

Michael: Well, and the resiliency,
like if you've already got

replicas provisioned for failovers,
you don't need the resiliency

that the cloud...

Nikolay: The point is they are
ephemeral.

So if restart happens, you might
lose this data.

But if restart happens, we have
replicas.

Michael: Yes, that's what I mean.

So that doesn't actually matter.

In fact, this reminds me a lot
of the PlanetScale stuff that's

been...

The PlanetScale Postgres, I think
they call it Metal.

They've got 2 products, but the
metal 1 has the local disks and

this is a lot of the

Nikolay: Yeah, but you can have local ephemeral disks on virtual machines too, of course of smaller size.

Michael: metal yeah sorry all I
meant was they're doing a lot

of their publicity, a lot of their
blog posts and things are

relevant to this discussion.

You don't have to use their services
and also you could do it

on a much smaller scale.

Nikolay: Yeah.

And it's so big cost saving and it brings so much more disk I/O

capacity.

Amazing.

But there is a-

Michael: And latency reduction, right?

Like because the systems are just closer together.

Nikolay: Yeah, yeah, yeah.

So it can handle workloads much better, in terms of OLTP workloads. There are 2 caveats: the ephemeral property, and also limits in terms of space, and we didn't touch the disk space topic yet.

Michael: Yeah, yeah, yeah.

We have a whole separate episode on that.

But yeah, we should still touch on that.

Nikolay: Right.

And on AWS, I like local disks much more because they are usually bigger: each disk is bigger, and the aggregated disk volume is also bigger. On GCP, I think local disks are somehow still only 375 gigabytes each, which looks old, but you can stack a lot of them, I think up to 72, or how many?

Michael: Terabytes.

Nikolay: Terabytes, yeah, quite a lot.

But in this case you need to really like maybe go with metal,

like the maximum, take a whole machine basically, right?

But it's possible, but this 72 terabytes will be your hard limit,

hard stop.

And it's not that bad.

Like, good.

Michael: Most people will be fine.

Nikolay: Yeah, yeah.

It's, it's okay.

I mean, to have this limit.

But it's a hard limit.

Michael: But you're, yeah, the hard limit is the interesting

thing.

So you're saying, let's say we start on small machines and they

only have a set amount, and we suddenly realize we're at 80 or

90% capacity.

Nikolay: Right, but at the same time, EBS volumes have a limit of 64 terabytes, and PD-SSD on GCP has the same limit, 64 terabytes. And RDS and Google CloudSQL also have hard stops at 64 terabytes. Aurora has 128 terabytes, double that size. And that's it.

So these are hard stops.

And I think in 2025 this is not a lot of data anymore. 50, 100 terabytes, we had an episode about it. It's already achievable for bigger startups.

So RDS, I don't know, I think they should solve it soon. And I think CloudSQL, Google Cloud SQL, should solve it soon. But to my knowledge, they haven't solved it yet.

So if you approach this, it's hard
stop.

And basically, you need to go to
self-managed maybe, right?

And there you can combine multiple
EBS volumes.

Michael: Most that we've talked
to that do this shard at that

point.

Nikolay: This is a different route.

Yeah, that's why I think for PlanetScale it's easier to choose local disks and deal with those hard limits in size as well.

Because if there is a rebalancing,
if it's 0 downtime rebalancing,

you can just make sure no shards
will reach that limit, that's

it.

It's a good way to scale further.

Michael: Yeah, they have that for
MySQL, but they don't have

that for Postgres, so, well not
yet, they're building it.

Nikolay: They announced it, right?

Michael: Yeah, well, they announced
building it, I think lots

of people are announcing building
sharding at the moment.

Nikolay: Well I see Multigres
already has some code.

I even commented in a couple of
places proposing some improvements.

Michael: Yeah well I know they
all have some code, right?

Like PgDog's got some code.

Nikolay: It's not just code, PgDog is already there, you can test it already.

Yeah.

I think Multigres also will have
some at some point.

Michael: All I mean is that you
can shard in other ways, right,

without these solutions.

Like Notion talked about doing
it, Figma have done it.

Nikolay: So-called application-side sharding, as I call it.

Michael: Yeah, but they did it
without leaving RDS in those cases.

So it is interesting.

But I thought you were going to go in a different direction here. I thought it was more about the practicalities of expanding. So let's say you're not at the dozens-of-terabytes limit, whatever your provider has; let's say you're at 1 terabyte and you just want to expand to 2 terabytes.
That's often really easy.

You can do it at a few clicks of
a button without any downtime

in a lot of providers.

Whereas if you've got local disks,
is it a bit more complicated?

Nikolay: Yeah, you know what?

I think these days RDS also provides options with local NVMes.

Michael: Wow, okay.

I heard

Nikolay: about this.

Yeah, the instance, I'm double-checking right now, for example X2idn. And it has local NVMe, several terabytes, up to, I think, not that many actually, 4 terabytes.

Interesting.

So there might be a hybrid approach, where you have EBS volumes and use local NVMe as a caching layer for both reads and writes.

Michael: But then what would you do?

Would you set up some replicas with larger disks and then fail

over to those?

How are you managing a migration to larger local disks?

Nikolay: When you hit 64 terabytes?

Michael: Well, no, when you, let's say you've started with smaller,

like you started with local disks that are smaller.

Nikolay: Ah, with local disks, yeah, I think you do a switchover approach, of course. You need a different instance with bigger capacity in terms of disk space. Of course, here again, the elasticity and automation that cloud providers have for network-attached disks is great.

But let's also criticize it.

So EBS volumes have auto-scaling, but only in 1 direction. For example, if we re-shard, we need to re-provision and then switch over.

Or say we didn't have autovacuum tuning in place, or we screwed up with long-running transactions or abandoned logical slots, so we accumulated a lot of bloat, say 80% bloat. Okay, we reindexed, repacked, and now we sit with a lot of free disk space. We don't need it during the next year. Why should we pay for it, right?

And shrinking is not automated.

But of course, you can provision new replica with smaller disk

and then switch over.

And when I think about switchover, you know, I decided to force myself, and my team as well, to have a mind shift toward self-driving Postgres. We talked about it. And in this particular case: we eliminated a lot of bloat, we want the disk to be smaller, we need to switch over. But a switchover also means a maintenance window, yeah, because...

Michael: What's the shift there?

What did you used to think?

Nikolay: Say again?

Michael: What was the mindset change that you had?

Nikolay: So for operations like adding disk space, removing disk space when it's not needed, getting rid of bloat and so on, the level of automation must be much higher.

It should be just an approval from a DBA or some senior backend engineer, or a CTO if it's a small startup, just an approval.

Yeah, we need to shrink disk space.

We don't want to pay for all those
terabytes.

And automation should be very high.

Repacking, and then without downtime,
we have a smaller disk.

But to achieve this right now,
there are so many moving parts. For example,

you can provision a node
with a smaller disk.

It can be local, it can be an EBS volume,
doesn't matter.

But then you need to switch over
without downtime, so you need a

PgBouncer or PgDog layer with
pause-resume support.

And then orchestrate it properly.

RDS Proxy, for example, doesn't
support pause-resume.

So you must have some
small downtime.

And usually people say, oh, it's
just 30 seconds.

Well, I disagree.

Why should we lose anything?

This is just some routine operation.

Why should we show errors
to customers?

Let's raise the bar and have pure
0 downtime for everything.

And auto-scaling, well, it could be auto-scaling, but auto-scaling

means it makes the decision itself.

That's too much.

Let's step back: I can make the decision myself, but

I want full automation, right?

And we don't have it.

We have it for increasing
disk space, which is good

for EBS volumes.

We don't need a switchover,
so it's 0 downtime.

You can say: add 1 terabyte.

This is what people do all the
time.

And I think there is a checkbox for
auto-scaling, so even RDS can

decide to add more disk space itself,
right?

Which is good.

Michael: Yeah, like if you get
within 10%, for example.

But yeah, only up, yeah, as you
said.

Nikolay: Yeah, at least we will
avoid downtime.

I also saw in some places a trick where people put

some file, like some gigabytes,
filled with zeros, on the disk.

So if we run out of disk space,
we delete this file.

Oh no.

Yeah, just like something sitting
there.

We can invent some funny name for
this approach.

Yeah, but just an emergency...

It's like reserved connections
within max_connections, 3 connections

reserved for the admin.

So it's reserved disk space you can
quickly delete to buy yourself some

time to increase disk space.
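This trick can be sketched roughly like this (the path, size, and name are made up for illustration; keep the file on the same filesystem as PGDATA, but never inside the PGDATA directory):

```shell
# Hypothetical "ballast" reserve: pre-allocate space you can free instantly.
BALLAST=./pg_ballast_do_not_use   # example path; in real life, on the DB volume

# Reserve ~100 MiB (you'd use gigabytes in practice).
dd if=/dev/zero of="$BALLAST" bs=1M count=100 2>/dev/null

# Emergency: the volume is nearly full. Delete the ballast to buy time
# while you grow the disk properly.
rm -f "$BALLAST"
echo "ballast removed, space reclaimed"
```

Note that on copy-on-write or compressing filesystems a file of zeros may not actually reserve physical space, so a file of random data is safer there.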

Michael: Yeah, on the disk space
thing, the only thing I think

people sometimes get caught out
by is having alerts early enough.

Like, sometimes you need
quite a lot of spare disk space

in order to reclaim space with
a repack, for example.

Nikolay: yeah

Michael: You need at least the size of the
table you're repacking

free in order to do the operation,
so it...

Nikolay: Makes sense.

Michael: Yes, so either
start with your smallest tables,

which is not going to make the
most difference, or try and

set that alert quite early.

Yeah.
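As a rough sketch, standard catalog functions can show which tables would need the most headroom (pg_repack builds a full copy of the table and its indexes side by side, so you need roughly the table's total size free):

```sql
-- Largest tables first: repacking one needs roughly its total size
-- (table plus indexes) in free disk space while the copy is built.
SELECT c.relname,
       pg_size_pretty(pg_total_relation_size(c.oid)) AS total_size
FROM pg_class c
JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE c.relkind = 'r'
  AND n.nspname NOT IN ('pg_catalog', 'information_schema')
ORDER BY pg_total_relation_size(c.oid) DESC
LIMIT 10;
```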

But yeah, is there anything else
you wanted to make sure we talked

about?

Nikolay: Yeah, well, I think
it's a good idea to understand

some numbers, right?

Latencies also.

We didn't talk about latencies.

What latency is normal?

The very old rule was: you look at monitoring.
If it's SSD, it can

be an EBS volume.

These days EBS volumes are
also NVMe with most modern instance

families.

And there's a very rough old
rule: 1 millisecond.

I already have a feeling we'll have
a discussion, like with a previous

episode where I shared some old
rule and someone disagreed with

it.

Yeah, rules might be already outdated.

So if the rule was 1 millisecond, these
days maybe we should go lower,

half a millisecond.

If it's a local disk, it should be
even lower.

This is the point where we think
it's okay.

If it's more, well, back in those
days we thought up to 5 to

10 milliseconds was okay.

But these days this is
not okay anymore.

10 milliseconds is definitely slow
these days for SSDs, and for

NVMe particularly.

This is the latency at which
you should start worrying.

So basically, in monitoring, we
should control usage, saturation

risks, and latency as well.

This is like the regular USE method
or the 4 golden signals, right?

So we control these things and
also errors.

And yeah, we check these things
and understand where we are right

now.

Should we start worrying already?

Yeah, simple, it's actually simple.
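One way to eyeball this from inside Postgres is pg_stat_database (this is a sketch; it requires track_io_timing = on, otherwise the timing columns stay at zero):

```sql
-- Average read time per block, per database, in milliseconds.
-- Requires track_io_timing = on.
SELECT datname,
       blks_read,
       blk_read_time,
       round((blk_read_time / NULLIF(blks_read, 0))::numeric, 3)
         AS avg_ms_per_block_read
FROM pg_stat_database
WHERE blks_read > 0
ORDER BY blk_read_time DESC;
```

Newer Postgres versions also have the pg_stat_io view with more detailed per-backend-type I/O timings.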

And my recommendation is also to know
your theoretical limits based

on the docs; as I said, it's not
trivial.

Another recommendation: if you
use some particular setup in

the cloud, always test it to understand
the actual limits, and if they

don't match the theoretical, advertised
limits, you should understand

why.

And testing is easy.

I usually prefer fio, a simple
program.

I like the snippets GCP provides.

If you just search for
SSD, disk, GCP, performance,

you will see a bunch of snippets.

The only warning: I managed to
destroy PGDATA several times,

because, you know, some of those
snippets use direct I/O, and if you

try to test... It was never
production, but still, I made

mistakes.

If you try to test your disk
capabilities with direct I/O and

you use a volume which is used for
PGDATA, forget about your PGDATA.

And this is a good way to get
silent corruption as well,

because Postgres might even keep
working for some time, until

you reach a point when it touches
the areas you wrote to.

So yeah, those are practical pieces
of advice.

Michael: We've talked in the past about
stress testing with pgbench.

Is it actually a benefit in this case?

Could we use it, because what we
kind of want to do at this point

is a stress test?

Nikolay: Right, but pgbench
tests everything, including Postgres.

In our methodology, let's split
everything into pieces and study

them, if possible, separately.

So disk I/O should be understood
separately from Postgres.

We've had this many times, by the way.

We'd start with, oh, let's pgbench...

We're talking about disks here.

Let's forget about Postgres completely
for now.

Right?

Michael: So try and isolate.

Nikolay: Not completely, actually.

We will usually keep in mind that
pages are 8 kilobytes.

Michael: Yeah.

Well, I was thinking about managed
providers, like, how would you

test on RDS what the IOPS
should be?

Nikolay: That's a tricky question, right?

That's a tricky question.

Michael: I think pgbench would
be a good solution there.

Nikolay: pgbench, yes, but you
can try to guess which instance,

well, the instance is easy to guess,
but which disks are there,

the IOPS and so on, right?

And then you can provision the
same EC2 instance and the

disk you guessed.

But again, as I said, one day
I discovered they use RAID.

So there's a stripe there.

And if you want to do the same,
probably you will have a different

setup.

That's an issue.

Also, I know Cloud SQL has it for bigger customers, Enterprise

Plus or something, I don't remember; they also have

caching with local NVMes.

Michael: Yes,

Nikolay: yes.

Yeah, it's good, but it's really tricky to reproduce and test,

right?

So yeah, I think it's tricky to test disks for RDS.

But it's yet another reason to think about who controls your database,

and why you cannot connect to your own database server using SSH and

see what's happening under the hood.

Michael: Probably a good place to end it.

Nikolay: Yeah, let's do it.

Michael: All right, Nice one, Nikolay, thanks so much.

Nikolay: Thank you.

Michael: See you next week.

Some kind things our listeners have said