WAL and checkpoint tuning

Nikolay talks Michael through a host of nuances about WAL and checkpoint tuning.

Michael: Hello, and welcome to Postgres FM, a
weekly show about all things PostgreSQL.

I'm Michael, founder of pgMustard.

And this is my co-host Nikolay, founder of Postgres.AI.

Hey Nikolay, what are we going to talk about today?

Nikolay: Hi, Michael, let's talk about checkpoint tuning,

Michael: Yes.

Write-ahead log in general, right?

WAL configuration.

We had a really good request for this topic from one of our listeners.

Nikolay: Right.

So not only checkpoint tuning, but checkpoint tuning is a part of it.

Michael: Yes, absolutely.

Thank you to Chelsea for this one.

Nikolay: Uh-huh.

Thank you.

Very interconnected topics. We should think about them both: WAL configuration and checkpoint tuning. They come together.

Michael: Yeah.

Should we go through them one at a time? In terms of, well, what is a checkpoint?

What are checkpoints?

Should we start there?

Or with what WAL is?

Nikolay: Right.

So what are checkpoints? To talk about checkpoints, we need to talk about WAL first, the write-ahead log. The write-ahead log is the basic, absolutely fundamental concept of a database system, not only a relational one, which says that first the change is recorded in a different place, in some kind of binary log, the write-ahead log, and only then is it applied in memory, in data pages.

This allows us to build a recovery system. This allows us to survive unexpected restarts, failures, various bugs, and so on. And in Postgres, this also allows us to have physical replication, because it's built on top of the recovery subsystem.

So if you change some row in a table, first this change is recorded into the write-ahead log, and only when we know that this change is recorded reliably, that it's already on disk, only then can the user see the commit. The actual change in the data may still be only in memory, not flushed to disk yet. Right?

So basically we record twice: we record the change to the write-ahead log, and we change the data in the place where it's stored permanently. But the flushing to disk in the second case doesn't happen synchronously, so we can see the commit while the data files, table and index files, are not yet in their actual state on disk. And the checkpoint...

It's the process of writing so-called dirty.

Blocks, dirty buffers uh, two disk so dirty in this
context means changed, but not yet saved on disk.

So, if it's already saved, it's called clean buffer.

If it's not yet saved, it's called dirty buffer.

And when we have dirty buffers, it means that we changed something: we already reflected these changes in WAL (that's the rule, write ahead), but they are not yet saved on disk in the data files. When a checkpoint happens, all dirty buffers are saved, and it means that next time we need to start up, for example after some crash or power loss, anything, we don't care about previous changes because they're already reflected on disk.

Michael: Yeah.

So we only have to worry about things since the last checkpoint.

Nikolay: Right.

And we need to redo them.

Right, right, right.

Michael: Yeah.

And just before we move on from the basics, because I had to read up a bit about this: the main reason we need this is to prevent data loss on recovery.

So it's the idea of the dual system. The idea of having this is for the D in ACID, I believe, the durability side. So preventing...

Nikolay: Right.

ACID is like the core concept: a database system should be such that if it says data is committed, it can never lose it. Otherwise it's a bad system. The write-ahead log is exactly what allows us to have that.

And without checkpoints, we would need to keep a lot of changes and replay a lot of changes, and startup time after any crash would be very long.

Right.

And so checkpoints happen all the time. They happen kind of on a schedule.

They can also happen when, for example, we need to shut down the server or restart: there is a so-called shutdown checkpoint. Postgres doesn't report that shutdown is completed until this shutdown checkpoint finishes.

Right.

So it's also important to understand, and that's why sometimes we can see that shutdown takes significant time: because we have a lot of dirty buffers and we need to save them first.

Michael: Yeah, on this topic, I saw a good recommendation. I think it was from you actually, around Postgres upgrades: the idea of triggering a checkpoint manually to reduce the time needed.

Nikolay: Right.

Because during the shutdown checkpoint, Postgres doesn't respond to new queries anymore. It says: shutting down, come back later. Right.

But we can run a manual checkpoint, the SQL command CHECKPOINT, beforehand, in parallel with normal traffic. So we save a lot of dirty buffers ourselves, just by running it in a terminal, in psql, for example. And then when the shutdown checkpoint happens, it's already fast, because there's a very low number of dirty buffers.

Right?

Michael: Yeah.

Nikolay: So it's always recommended, when we need to restart the server, for example for a minor upgrade, or when we need to perform a switchover, that this additional checkpoint, an explicit checkpoint I would say, not necessarily manual, it can be scripted of course, should be there, because after it the shutdown checkpoint will happen. So we need to help it to be shorter, faster.
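For illustration, a minimal sketch of that explicit checkpoint, run from psql before a planned restart (the restart step itself depends on your setup and is only hinted at in a comment):

    -- run an explicit checkpoint first, while Postgres still accepts connections;
    -- it flushes most dirty buffers so the final shutdown checkpoint is short
    CHECKPOINT;
    -- then restart, e.g. with pg_ctl or your service manager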

Michael: Yeah.

So you already mentioned a couple of uses for the write-ahead log: one is recovery on crash, one is replication. I believe point-in-time recovery is another one, so tools like pgBackRest make use of it, I believe.

Nikolay: Right, right.

Well, it's a different topic, but of course, right. If we store full copies of PGDATA, by the way, they are never in a consistent state. If you copy PGDATA on a live server, it's not consistent, so you always need a bunch of WALs to be able to reach a consistency point. But you can also store the whole stream of WALs in addition, and this allows you to take a so-called base backup, a copy of PGDATA corresponding to some point in time, and then you can replay additional WALs and reach a new point in time, and it can be arbitrary. So if you store a continuous stream of WALs in your archive, you can choose any point you want.

The only problem usually is that if your database is quite big, the initial copy of PGDATA takes time. A very rough rule is one terabyte per hour; it can be slower, it can be faster, but it's a very rough rule. So if you have a 10 terabyte database, be prepared for 5 or 10 or 15 hours of initial copy. And I wonder why people don't use cloud snapshots for that more often, to speed it up. It would make complete sense, but I know cloud snapshots in Amazon and Google are sometimes not reliable; there are issues with them sometimes. Also, making them takes time.

But I'm going into a different field; let's postpone the discussion of backups and other things. But you're right, point-in-time recovery is another area of application of WAL. The primary goal, though, is to allow us to recover from unexpected shutdown or power loss. I think it's like in games, right?

Michael: I lost you briefly

Nikolay: Yeah.

I see, I see. Let me listen to you then.

Michael: Well, yeah, I think it's getting better. So we have a lot of parameters for tuning these things, a lot of parameters for controlling how fast these things happen, how much has to happen before they kick in.

There are a few settings that it seems we really shouldn't touch pretty much ever as well. But it also seems that, especially on write-heavy workloads, there are some really big wins we can get by changing some of them.

Nikolay: Right.

Defaults are usually not good enough for you, right?

Michael: Okay.

Is that generally true? Or, if I have a read-heavy application and not a ton of writes, how would I know if I'm hitting problems? What are the first telltale signs?

Nikolay: Well, that's a good question.

Usually it's good to evaluate the distance between two daily backups, for example. You can do it even in SQL: you can take two LSNs. Usually when a backup is taken, you know its LSN, the log sequence number. It's like a position in WAL; it's all sequential, and it has a specific structure. There are a few articles, we will attach them, explaining how to read an LSN and how to understand this structure.

So if you take two LSNs and convert them to the pg_lsn data type in Postgres, you can subtract one from another, and the difference will be in bytes. Then you can run pg_size_pretty on this difference and see it in megabytes or gigabytes (gibibytes actually, but Postgres doesn't use this notation), so you can understand how much WAL you generate during a day.
Right?
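A minimal sketch of that arithmetic in SQL (the literal LSN values here are made up; in practice you would take them from your backups, or from pg_current_wal_lsn() a day apart):

    -- difference between two LSNs, in bytes, pretty-printed
    SELECT pg_size_pretty(
             pg_wal_lsn_diff('A1/5A000000'::pg_lsn, 'A0/10000000'::pg_lsn)
           ) AS wal_generated_per_day;

    -- or simply subtract the pg_lsn values, as described above
    SELECT pg_size_pretty('A1/5A000000'::pg_lsn - 'A0/10000000'::pg_lsn);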

If this size is small, like a gigabyte or 10 gigabytes per day, probably you don't need specific checkpoint and WAL tuning. But if you have one terabyte generated per day, it's quite a lot, and I'm sure you need to move away from the defaults, and you will have better performance, better everything.

For example, wal_compression is not enabled by default. And...

Michael: I missed that one.

That's

Nikolay: Yeah, but I'm going to check whether it's enabled in recent versions, because Postgres defaults are improving, but they still lag. If you have heavily loaded systems, you definitely want to tune. So in Postgres 13, for example, wal_compression is disabled by default.

Let's talk about what is written to WAL. Oh, first of all, to simplify, the explanation people usually use about checkpoints: it's like in games, you want to save your progress, right? And if something bad happens, you will repeat fewer steps. It's a very simple analogy.

Yeah, in Postgres 14 wal_compression is still not enabled. Yeah, Postgres 15: still not enabled. And I think this should be enabled in most setups. I'm almost sure it's enabled on RDS.
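As a sketch, enabling it on a self-managed instance looks roughly like this (managed services expose the same parameter through their own configuration interfaces):

    -- compress full-page images written to WAL; a reload is enough, no restart needed
    ALTER SYSTEM SET wal_compression = on;
    SELECT pg_reload_conf();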

So you could, for example, do checkpoints very rarely, say once per week. It's an insanely low frequency, but in this case there are high chances that if a crash happens, you will need to wait a long time while Postgres replays many WALs, right? There's a lot to do in terms of REDO, and during this period you will be down; your system is down. So, not good.

That's why, logically, I would say it's good to have checkpoints more often.

Right.

Michael: It seems like a Goldilocks problem, right? Too often and you have a lot of overhead, but too infrequent
and it will take a long time to recover.

So it feels like there's a balance.

Nikolay: There is a trade-off here. And there are two kinds of overhead; we will talk about them in a second. But to understand where the overhead comes from, besides dirty buffers, let's talk about what is written to WAL.

By default, full_page_writes is enabled, right? And what is a full-page write? If you change anything in some table, in some row, it may be a very small change, but Postgres writes the whole page to the write-ahead log. Why? Because there is a mismatch: a buffer is usually 8 kibibytes in size, but the file system probably uses a 4 kibibyte block size, and you don't want a partial write, where writing to disk reported success but only half of the page was actually written. That's why full-page writes are needed.

And by the way, the early talks about Aurora from Grant McAllister, if I'm not mistaken, are very good. We'll find links to the YouTube videos and probably a slide deck. They explain this problem about full-page writes and their big overhead very well.

So when the first change to a page occurs, the first time it's written after a checkpoint, it's a full-page write, right? If you change it once again, only a small delta is written, so it's not a full page. But that holds only until the next checkpoint: once a checkpoint has happened, initial changes are again full-page writes. If checkpoints are very frequent, we have a lot of full-page writes. If checkpoints are not frequent, we very often have repetitive changes to the same page.

For example, we inserted something, then we insert once again into the same page: it's a new change again. And again, we update something, we have a HOT update, heap-only tuple update, so we change the same page. This means we touch the same page and write to it multiple times. So I'm saying not only the number of writes matters; the nature of the writes also matters. If you have HOT updates touching the same page many, many times, you will benefit a lot from rare checkpoints, because only one full-page write will happen after the checkpoint, and then you benefit: every subsequent change writes only a little bit to the write-ahead log, until the next checkpoint of course.

Michael: Yeah, that's super interesting. And I think it also explains why some people I've seen, and I suspect this is a very bad idea, turn off full-page writes in order to increase throughput. But it feels like a very risky thing to do, and I feel like we're going to cover quite a few better options.

Nikolay: Some systems can afford it. You need to understand what your file system and your disks are and what settings they use. In some cases it's possible, but it's quite dangerous: you should understand all the risks and be a hundred percent sure that it's possible on your system. But usually we don't go this way. Usually we use ext4 with a 4K block size and so on, and we want full-page writes enabled.

So, back to compression: compression is applied only to full-page writes.

Michael: Interesting.

Nikolay: Right.

As I understand it, Postgres doesn't compress the small changes. It compresses only the full-page images: the first time we change something in a page, we record this page fully, and we can compress it. And compression is not enabled by default.

And if you enable it, you can see a huge benefit in terms of how much WAL you write. Why do we care about volume here? Because if we have a write-heavy system, writing a lot additionally to WAL is additional overhead on disk. And if you have 10 replicas, sometimes people have that, all replicas need to receive this data. Physical replication, and logical as well, works through WAL, through the write-ahead log. So if we write a lot, we need to send a lot over the network.

So we want wal_compression enabled to compress all full-page writes, and we want checkpoints to happen rarely, to have fewer full-page writes as well. So I would tune max_wal_size and checkpoint_timeout to have very infrequent checkpoints. But in this case, if they are not frequent, again, startup time after a crash, and also failover, for example, the timing of these procedures will be very bad, long minutes.

Sometimes I see various engineers struggle to understand why, for example, shutdown takes so long, or why the start takes so long, and they become nervous, and in extreme cases they use kill -9. It's sort of survivable, Postgres survives because we have the write-ahead log and we just redo, but REDO also takes a long time. They kill it; it's not acceptable. Only in rare cases should we do it, it's a last resort; we should not do it. And after that Postgres starts, and startup also takes many minutes, and they're still nervous. It's not a good situation. That's why people need to understand how much WAL needs to be written and the distance between checkpoints.

Michael: Yeah, let's go back to a couple of those you mentioned. So my understanding is that checkpoint_timeout is the maximum time between checkpoints, and its default is quite low: five minutes.

Nikolay: Very low.

Michael: So what would be a sensible starting point for most people in terms of that?

Nikolay: Yeah, so usually the main metric here is how long you can afford being down in the bad case. In the case of an incident, this is the main number you need to understand. You need to talk with your business people and find some number: okay, we can be down for up to two minutes, for example. From there, you start thinking: we have this requirement, or SLO, service level objective, if we are SREs, right?

So if we have two minutes, let's think: during two minutes, how much can we replay? We can measure it with an experiment. We can, for example, set checkpoint_timeout and max_wal_size to insanely big values. Then we generate a lot of writes, with pgbench for example, wait until one checkpoint finishes and the next one is about to finish, and then we kill the instance, we crash our Postgres on purpose. Then we watch the recovery and just measure its speed: how many bytes of WAL we can replay per second or per minute. And this gives us an understanding of how much WAL we can afford to replay so as not to exceed two minutes of downtime, for example.
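A rough sketch of the setup for such an experiment (the values are deliberately extreme and meant only for a disposable test instance, never production; the crash itself would be done outside SQL, for example with kill -9 on the postmaster):

    -- push checkpoints far away so the experiment controls how much WAL replay happens
    ALTER SYSTEM SET checkpoint_timeout = '1d';
    ALTER SYSTEM SET max_wal_size = '1000GB';
    SELECT pg_reload_conf();
    -- then generate writes (e.g. with pgbench), crash the server on purpose,
    -- and time how many bytes of WAL crash recovery replays per second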

Michael: Yep.

Perfect.

Nikolay: From this, we can start thinking. It's very important to understand recovery speed in terms of bytes per second, bytes per minute, or gigabytes per minute, anything like that. From there, we can understand how many bytes of WAL we produce when everything is normal, during quiet and busy hours. Usually at night, for example, we have lower activity; during the daytime on working days we probably have more activity, right.

But usually we say, okay, we produce, for example, one WAL per second; it's quite a good speed. Each WAL means a file. There is also confusion in terminology, because as I remember the documentation says a WAL file is some abstract thing, like two gigabytes, and a WAL segment is 16 megabytes. But if you go to the pg_wal directory, you will see each file is usually 16 megabytes. As I remember, RDS uses 64-megabyte WALs. So when I say WALs, each WAL is usually 16 megabytes. So one WAL per second during normal, quite busy hours means we produce 16 megabytes per second, or so. Right.

So, okay. What is our replay speed? What is our production speed? And from there, we can understand over how much time we generate the amount of WAL data which will give us two minutes of recovery time, right. Quite complex, I understand.
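As a sketch of that arithmetic (all numbers hypothetical): if crash recovery replays about 50 MB of WAL per second and the business allows two minutes of downtime, the WAL accumulated between checkpoints should stay under roughly 50 MB/s times 120 s:

    -- hypothetical replay speed and downtime budget
    SELECT pg_size_pretty(50 * 1024 * 1024 * 120::bigint) AS wal_budget_between_checkpoints;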

Michael: Well, it feels like luckily we've got that second parameter in max_wal_size...

Nikolay: No, no, no, no. Let's pause. Yes.

I'm talking about how, in my opinion, to understand what a normal checkpoint_timeout is for you. Right? So in this case we understand: okay, recovery speed is this, production speed is this. So we can conclude that, to have no more than two minutes of recovery time, we can produce this amount of WAL data. So we will set checkpoint_timeout to probably half an hour, or maybe 15 minutes, 30 minutes. It depends, of course; observing a concrete system we can make some conclusion: okay, we want 30 minutes, for example.

But then we start distinguishing planned checkpoints and requested checkpoints. Postgres has two logics. One logic is the schedule: when the time comes to have a checkpoint, every 30 minutes, for example; by default it's five minutes, which I think is too low. Right. But then there is another parameter called max_wal_size, and I think it's a very, very important parameter to understand. It's our protection for the cases when we have elevated activity and we want more frequent checkpoints, because we want to be protected from writing too much data to WAL and, again, a longer wait. The same logic: if we understand how much we produce, the speed of production, we can say, okay, max_wal_size also roughly corresponds to that. So checkpoint_timeout and max_wal_size, their tuning can be correlated here.

Michael: Yeah.

So my understanding is, it sounds like we should rely on checkpoint_timeout for the majority of the time; that should be the thing that kicks off checkpoints. But if more WAL is generated than the amount we're expecting, we can set an amount in max_wal_size...

Nikolay: We should say that the default there is very small: the one gigabyte default is insanely small for modern workloads. I usually recommend going up, sometimes up to a hundred gigabytes. But we need to understand the recovery trade-off, right? So we need to measure recovery and guarantee our business that we will not be down for more than, for example, two minutes or five minutes.

Right.
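As a sketch, raising it on a self-managed instance might look like this (the value here is only an example; as stressed above, it should come from your own recovery-time measurements):

    ALTER SYSTEM SET max_wal_size = '100GB';  -- example value, not a blanket recommendation
    SELECT pg_reload_conf();                  -- takes effect on reload, no restart needed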

But, right, max_wal_size protects us in the cases when we have more writes, and Postgres can decide to perform a requested checkpoint. We see it in the logs, by the way; logging of checkpoints we enable always. As far as I remember, recently the default was changed and logging is now enabled; I remember the discussion on the pgsql-hackers mailing list. So log_checkpoints should be enabled for all checkpoints, I'm a hundred percent sure this is what you want. The default was off: disabled in Postgres 12, disabled in Postgres 14, but enabled in Postgres 15, which will be released very soon. So this is a new change in Postgres 15: log_checkpoints is enabled, and I recommend enabling it on any Postgres.

I also saw some DBAs notice that, like, 90% of all checkpoints are requested, that they occur according to max_wal_size.

Is this a problem? No, it's not a problem, because a requested checkpoint and a timed checkpoint, planned on schedule, are actually the same; there's no big difference. But of course you want everything to be in order; it's just a sign that you probably need to reconsider settings. It's not an urgent situation.

Right.
Right.

Michael: Yeah,

Nikolay: Well,

Michael: good.

Nikolay: There is another one we didn't mention: checkpoint_completion_target. And by default it's 0.7, or...

Michael: This changed. Yeah, I looked this up: until very recently it was 0.5.

Nikolay: Oh, 0.5 is terrible, I would say.

It's not what you want.

Michael: Yeah.

But in 14 it was increased to 0.9.

Nikolay: Great, this is a good number. So what is it? When you run a manual checkpoint, an explicit checkpoint, it goes at full speed: it writes dirty buffers to disk as fast as possible, and it produces some stress on the disks. It's okay as a one-off, but normally you want to be more gentle with your disk system, right? So that's why we spread the checkpoint over time. A checkpoint_completion_target of 0.9 means that between two checkpoints, 90% of the time we want to spend writing, and 10% of the time we are resting. Maybe you want even more, 0.99, I don't know, like 99% of the time.

And this is important, because it's hard to understand the distance between checkpoints; it's quite a tricky question. Logging will report something, but you can think about when a checkpoint starts as the beginning, so 10 minutes between them, or 15 minutes, or 30, it's fine.

But what I wanted to get across is very tricky. It bothered me for a few years actually, and only the book I already mentioned, PostgreSQL 14 Internals by Egor Rogov, explained it to me. I read it in Russian even earlier; now it's published, both parts are published in English. It's very good; it explains everything in detail, with links to the source code. And finally I understood why, if we set max_wal_size to one gigabyte, the distance measured in bytes can be something like 300 megabytes, so roughly three times smaller.
Michael: Why is that?

Nikolay: so explanations is interesting.

I'm looking at it right now.

So I know I knew it from practice.

I just like when I said, well, I saw, oh, you have default one gigabyte.

You know that it means that the actual distance measured
and bites will be 300 megabytes is tiny, like distance.

It means that checkpoint will disturb your system constantly.

And I even saw a case at a very large company where people had some cleanup job happening in the background. Before a big marketing event, they disabled this job. Then a couple of months later they realized that the job was still disabled, and some engineer, a very experienced one but not a Postgres expert, said: okay, this job was not painful at all, it was working for many years. So he went ahead and tried to delete 10 million rows using one DELETE, and put the system down for 10 minutes, because they didn't have checkpoint tuning in place. max_wal_size was the default, one gigabyte, so the actual distance was around 300 megabytes, as I will explain.

It means that when you produce a lot, you have checkpoints happening all the time, boom, boom, boom, boom, and a lot of full-page writes, boom, boom, boom. It's not compressed, and the disks were quite good, like enterprise disks, but not NVMe unfortunately, and this situation happened and they went down for 10 minutes, all from one DELETE. I even gave a talk in Russia some time ago just about this case: how one line of DELETE can put your system down, even a critical system that worked very well before. So checkpoint tuning is an important thing to have.

So if you have one gigabyte: until Postgres 11, if you have checkpoint_completion_target close to 1, you should divide by three; since Postgres 11, you should divide by two. So if you have one gigabyte of max_wal_size, the actual distance will be roughly half a gigabyte, if your checkpoint_completion_target is 0.9, because Postgres needs to keep everything since the last checkpoint and also everything between the previous one and the latest one.

Michael: Oh,

Nikolay: And before Postgres 11, an additional cycle was needed. So two full cycles and a tail, right? And this tail is not behind us, it's in front of us. Right. So if checkpoint_completion_target is 0.9, roughly three intervals were needed.

Michael: Yeah.

Nikolay: That's why you need to raise max_wal_size anyway.

Right?
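A sketch of that arithmetic (the divisors are the rule of thumb described above, not an exact formula):

    -- PG 11+:  distance ≈ max_wal_size / (1 + checkpoint_completion_target)
    -- earlier: distance ≈ max_wal_size / (2 + checkpoint_completion_target)
    SELECT pg_size_pretty(1073741824 / (1 + 0.9)) AS approx_distance_pg11_plus,
           pg_size_pretty(1073741824 / (2 + 0.9)) AS approx_distance_pre_pg11;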

Michael: Yeah, it seems like almost everybody would want to increase that one. There are some other interesting ones that I'd be keen on your view on. And also, actually, I guess: are people on cloud providers, you mentioned RDS a couple of times, generally more protected from this, because things have been tuned already?

Nikolay: Tuning here means increasing max_wal_size, but doing it not blindly, understanding recovery.

Michael: Yeah.

The other one, min_wal_size, for example: I've read that it can increase write performance if you increase that number.

Nikolay: From my practice, I cannot say anything here. I didn't dive deeply enough to discuss this one.

But max_wal_size is my favorite topic.

Michael: Yeah.

If you haven't had to worry about min_wal_size, then I can't imagine it's that important. So, yeah, good to know.

And the only other one I wanted to ask about is...

Nikolay: Sorry, this is so important, I just want to emphasize it. So if we have a very short distance between checkpoints in terms of max_wal_size, and we have unexpected, or maybe expected, someone decided to do it, a lot of write activity... We can measure this with experiments, and what I found is that thin clones are good for iterating, but we cannot use thin clones here, because we need our disks and file system to behave exactly the same as in production.

So I found a good recipe for a workload which will not touch our physical layout of data. It's a transaction with a massive DELETE, like deleting 10 or a hundred million rows, but we cancel it, we roll it back. DELETE writes to xmax, we discussed it a couple of months ago: it writes the transaction ID which deleted the tuple, but if the transaction got cancelled, this is effectively zero, and zero means the tuple is still alive. So we produce a lot of WAL, we produce big stress on the system, but afterwards nothing has changed, and we can do another experiment on the same system. It's a perfect workload for a lab, right?
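A minimal sketch of that rolled-back DELETE workload (table name, filter, and row count are hypothetical; run it only on a disposable test instance):

    -- generates a lot of WAL and dirty pages, but leaves the data logically unchanged
    BEGIN;
    DELETE FROM test_table WHERE id <= 10000000;
    ROLLBACK;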

So we can have a sequence of experiments with different max_wal_size values,

and see, using iotop or iostat, or using monitoring. I recommend Netdata because it has an export button: you can export all graphs and see how the workload behaved. And usually, if you have one gigabyte of max_wal_size and disks that are not very powerful, you'll see a plateau, because the disks are saturated. Right. Then double your max_wal_size: a plateau again. Double it again. At some point you will see that your system under such stress no longer shows a plateau. And when max_wal_size is small, iotop shows, if you order by write throughput, that the checkpointer is number one: it writes 200, 300, I don't know, 500 megabytes per second.

Also, I promised to explain the two kinds of overhead. One we already discussed: full-page writes. If we just finished a checkpoint and we need to start another one because max_wal_size commands us to have them very often, it's like insane checkpointing, like the checkpointer went mad. Right? For example, I saw one every 15 seconds, boom, boom, boom, boom, because these deletes write something like 300 megabytes quite quickly. So again, again, again, with default settings. So full-page writes are one type of overhead.

Also, by the way, make your deletes non-sequential.
So basically, oh, also make your deletes no se.

for example, you can have some index on some random column and you can say
let's delete first 10 million rows ordered by this column, but it's random.

So first apple is first page.

Second.

Apple is page number thousand.

So we , jump between various page.

and this is the worst type because we, we could benefit.

Like if you, if they are sequential, probably all changes
in one page will happen inside one checkpointing cycle.

But if we jump between pages, we constantly produce a lot of full page rights
and we need to produce them once again, because checkpoint just finished.

Right.

This is the worst situation.

And this happens, this is exactly what put that system down that I explained.
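A sketch of such a deliberately random-order delete for the lab (table, columns, and count are hypothetical, and it is wrapped in a rolled-back transaction as above):

    -- delete rows in the order of an index on a random-ish column,
    -- so consecutive deletes touch scattered heap pages
    BEGIN;
    DELETE FROM test_table
    WHERE id IN (
        SELECT id FROM test_table ORDER BY random_value LIMIT 10000000
    );
    ROLLBACK;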

The second one I didn't realize at first, but it's quite obvious. The second overhead is quite obvious: our page was dirty, then there was a checkpoint, now it's clean. Okay, we change it once again, it becomes dirty again, and we need to write it again. If the two writes happened inside one checkpointing cycle, we would produce only one disk write. But if the two writes happened in different checkpoint cycles, we need to perform two disk writes.

Michael: It's more IO.

Nikolay: Right, right.

So a sequential delete is not that bad, but a random delete according to some index can be, right?

Michael: That's a good point. So as well as all...

Nikolay: Sorry, I just had so much fun with this. I spent some months exploring it, and we made a lot of very good, I would say enterprise-scale, experiments. I can take any system and show exactly how recovery will behave and exactly how you need to tune it; I can show graphs. It's quite expensive research in terms of time and probably money, but I think big systems need to understand their workload, their system, and understand what will happen.

So this random delete, I named it "doubly unfortunate": you are unfortunate because you crashed, and you are unfortunate twice because you crashed during some random, intensive writes. In this case you definitely want to understand your max_wal_size and checkpoint_timeout, and you want this checkpointer write graph not to have a plateau but to be spiky; that is a sign that you have some room for other I/O. So our research shows: okay, at 16 gigabytes or 32 gigabytes we already don't have a plateau, so this is our desired setting for max_wal_size, maybe a hundred gigabytes, remembering it effectively gets divided by two; we need to understand checkpoint_timeout as well. And then we say: okay, but in this case recovery time will be this at normal times, and at bad times, when somebody is writing randomly a lot, it can be like 10 minutes. Can you afford it, or is that not good here? So you see how much fun I had in the past with max_wal_size especially.

Michael: Yeah, this is great, and I hope people are encouraged to go and check.

And you can easily check this, by the way, can't you? As with all Postgres parameters, you can just use SHOW: show max_wal_size. If you get one gigabyte back, maybe it's time to have a look at that. Same with checkpoint_timeout: show checkpoint_timeout, and check what comes back; it might say five minutes or 300 seconds. Again, another one to look at.

Anything else?
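A quick sketch of those checks in psql (the defaults noted in the comments are those of the Postgres versions discussed in this episode):

    SHOW max_wal_size;        -- default 1GB; often worth raising on busy systems
    SHOW checkpoint_timeout;  -- default 5min
    SHOW wal_compression;     -- default off in the versions discussed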

Nikolay: There are other things, but let's stop at this point because we are definitely out of time here.

I apologize for too many details in this case.

Michael: I don't think that's the kind of feedback we'll get, but if anybody thinks we did go into too many details, let me know.

Nikolay: Right.

Michael: I don't think that's gonna be the case.

Nikolay: And again, I want to advertise Netdata here, because if you do some experiments, you take the same virtual machine, same settings, everything as in production, you do this very unfortunate massive delete, rolled back, and you check it with various metrics: install Netdata and export the whole dashboard, with all the disk IO and everything, to a file. And then you can compare: you can open several files in a browser and see exactly the difference in behavior for different settings. It's so convenient, and you can store those artifacts long term.

Michael: Yeah, I really enjoyed it when you showed me that.

I also wanted to advertise a few things. There are a couple of great websites for checking out parameters, if you want to see what they mean. Obviously the Postgres documentation is great, but there's also postgresqlco.nf, which I'll link up, and pgPedia as well, which I find great for this kind of thing; they have a section on this that I found useful. So I'll share those as well.

Nikolay: Right. But I suppose if you have a loaded OLTP system, you will probably set checkpoint_timeout to 15 or 30 minutes and max_wal_size to something like 32 gigabytes at least, maybe more. But it's better to conduct full-fledged research and make decisions based on your requirements from the business.

Michael: Yeah,

Nikolay: Good.

Michael: Thank you so much, Nikolay.

Thanks everyone.

Nikolay: Thank you everyone.

Yes, share, share, share. Sharing is important, most important probably.

And by the way, I have a special request to our listeners today: if you have an iOS device, please go to Apple Podcasts and like us, please, and write some review. That would...

Michael: Yeah, I dunno if you saw, but we got a nice one already.

So thank you to that person.

Nikolay: Good.

Thank you, Michael.

Michael: Cheers everyone.

Nikolay: Bye
