Michael: Hello and welcome to Postgres.FM,
a weekly show about

all things PostgreSQL.

I am Michael, founder of pgMustard
and I'm joined as always

by Nik, founder of PostgresAI.

Hey

Nik.

Nikolay: Hi Michael.

Michael: And today we have with
us David Steele, who is a significant

contributor to PostgreSQL and the
creator and maintainer of both

pgBackRest, which we'll be talking
about today, and pgaudit.

Welcome, David.

David: Thank you very much.

Good to be here, Nik and Michael.

Michael: It's a pleasure to have
you.

Right, to get started, I wondered
if you could give us a little

bit of history, a little bit of
the origin story, perhaps of

pgBackRest.

David: Sure.

That's easy enough.

Actually, pgBackRest was born
at the Dublin conference in 2013

of conversations that Stephen Frost
and Cynthia Shang and I and

Magnus Hagander and some others
were having. At that time, we

were working on a fairly large
database for the time, around

50 TB.

That doesn't sound so big these
days, but in 2013, that was a

pretty significant database.

You know, basically we needed to
make backups, of course, as

you do.

And just the available tools were
not up to the task.

It was just simply too large.

And we really needed to be able to
do incrementals.

And there was nothing that would
do incrementals plus compression

at the same time, et cetera, et
cetera.

So, as you do in open source, we
decided that we would build our

own thing.

I remember originally it was going
to be a pretty simple project.

Write it in Perl, keep it simple,
just be 1 file, so it would

be easy to copy around and distribute,
etc.

Yeah, that didn't last very long.

Obviously it grew pretty quickly.

So I built it.

I built the initial software that
was usable in about 40 hours

and to basically solve our initial
problem and then just kept

building on it as we had problems
and bugs and other things I

would build on it and build on
it.

Convinced the company I was working
for, Resonate, which is an

ad tech company, to open source
it.

And then when I left there, I kept
noodling around at it.

I took some time off and just kept
working on it.

Got the restore functionality working
well.

And then that's when I got hired
into Crunchy.

At first they weren't super interested
in it but Stephen was.

So I got a little bit of time to
work on it and then eventually

got more time to work on it and
then eventually we hired Cynthia

to come and work on it with me.
That went on for a while, and

then finally we decided to migrate
it to C. We did the whole

C migration, which took 2 years.
That was painful because we

basically didn't write
any new features for 2 years.

We fixed bugs only. And we could
only write a new feature if it

lived entirely in the C code.

And over time, it was more and
more possible for that to happen,

but it was a tough migration.

Two test suites, two of them. It was
just, I didn't even want to

think about it.

It was not a lot of fun.

And that's basically the project
as it is today.

Now it's, of course, we're rolling
along.

It's a C project.

We have features based on user
demand, based on chasing performance.

Performance is always the big thing.

Like how do we move a bunch of
data quickly?

How do we do it efficiently?

How do we keep the repo as small
as possible, block incremental

backups, the list goes on and on.

Nikolay: I have a question here about
history.

Maybe you remember those discussions.
Remember the

times of Slony and then Londiste,
when there was discussion about whether

Postgres should have replication
inside?

And there are many such discussions,
same for auto failover and

so on.

And for the case of replication,
the idea to have it in core

won.

For the case of auto failover, it
lost.

Still auto failover is outside.

I'm very curious what you think
about the idea to have full-fledged

backup solution inside core and
why it's not happening.

David: One interesting historical
fact, and one of the reasons

why some of the other committers
were involved in really early

planning, and part of the reason
for the migration to C, was the

idea that we could actually
maybe make pgBackRest the core

solution for backup.

Now, this is not something that
was endorsed by core broadly;

it was just discussions that
I had with some committers, because

they were interested in having
a more comprehensive solution

in core.

Obviously, they wanted it to be
written in C.

And so that was 1 of the reasons
that drove the adoption of C.

As for actually having something
like pgBackRest in core, it's

a little tricky because the pgBackRest
project moves a lot faster

than core does. So to have it on
the yearly cycle would be... But

let's say we had put pgBackRest
into core from the beginning,

maybe as soon as it was migrated
to C.

It would not be nearly as far along
as it is now. At the time

we were doing 12 releases a year,
we went down to 6, and now

we're currently at 4, which I think
is just about the right tempo

for a project of this type.

But being able to release features
4 times a year, get new stuff

out there, get people trying it,
get people testing it, that

kind of stuff.

I think it's pretty important.

And as pgBackRest gets more stable,
maybe it would be more appropriate.

But at the same time, the project
has diverged significantly

from... you know, we use a lot of
the same concepts as Postgres:

MemContext, and error handling
looks very similar, et cetera,

et cetera.

But it's all pretty different, too.

pgBackRest grew its own way.

So it'd be very difficult to get
in there.

I think the idea is pg_basebackup
was going to be that tool.

Getting incremental backups into
pg_basebackup was a big step,

but it's still not a complete tool.

You have to have tooling on top
of pg_basebackup for it to

be at all usable, especially for
incremental, because reconstructing

all those incrementals, going and
fetching them and uncompressing

them and getting them all ready
for the pg_combinebackup tool

to run.

That's a significant amount of
work and obviously it doesn't

do anything with WAL archiving,
expiration, there's just the

list of all the things that you
need to do and it's a pretty

big list and it's intimidating,
honestly.

You know how core goes, right?

It's intimidating to even contemplate
getting something like

that into core.

So I think it would be a good idea,
but also, think about storage

drivers.

pg_basebackup supports
POSIX, right?

But most backup tools support S3,
GCS, Azure, SFTP, etc.

Because those are the tools that
people are actually using.

So all that would have to go into
core as well.

All the storage drivers, people
would have to...

Nikolay: Maybe not everything could
go.

It's possible, Postgres is very
extensible.

If just the core thing would go
to the core, but expose some

interfaces, particular drivers
could stay outside.

David: And in theory, that's
what we've done a bit with pg_basebackup,

but the amount of stuff that needs
to be done by the outside

tool is huge.

Now there are quite a few tools
that are based on pg_basebackup

to do their page level slash block
level incrementals.

Barman, obviously, is the most
well-known one.

I think pgmoneta is one of them.

There's a newer one that I saw recently
that's also using it.

So you can certainly build a tool
around pg_basebackup.

And I think the idea would be that
you'd extend and extend pg_basebackup,

and then people would take that
feature out of their tool.

And eventually maybe pg_basebackup
would be a thing that does

everything.

But at the pace that it's actually
evolving, I would expect that

to be able to reach feature parity
with something like pgBackRest

or WAL-G or Barman in approximately
30 years.

I'm exaggerating a little bit,
but if I look at the actual progress

that pg_basebackup has made over
the years, that's where we are.

Since that incremental, nothing's
been done with that, even though

there are pretty huge performance
implications for restores.

To restore any very large database
with that page incremental

format, you basically need... let's
say you've got a database

that's a terabyte: you need at
least 2 terabytes to do the restore,

minimum, and it depends on how
many incrementals you have, so

you could need 2 or 3 terabytes to
do the restore.

All the files need to be copied
down regardless of whether those

blocks are used or not.

Everything needs to be uncompressed
and then fed into pg_combinebackup

which then rewrites everything
to a different location.
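
A back-of-the-envelope sketch of that space requirement, assuming every backup in the chain is staged uncompressed before the combined copy is written to a separate location. The function name and the sizes below are illustrative and are not taken from pg_combinebackup itself.

```python
from typing import Sequence

TB = 10**12  # decimal terabyte, used only for readable example numbers


def combine_restore_space(full_bytes: int,
                          incremental_bytes: Sequence[int],
                          database_bytes: int) -> int:
    """Rough peak disk space for a restore that stages the whole chain
    and then rewrites the combined result somewhere else."""
    # Every file in the chain is copied down and uncompressed first.
    staged = full_bytes + sum(incremental_bytes)
    # The combined database is then written to a different location.
    output = database_bytes
    return staged + output


# A 1 TB database restored from a full backup only.
print(combine_restore_space(1 * TB, [], 1 * TB) / TB)  # 2.0

# The same database with two hypothetical 250 GB incrementals in the chain.
print(combine_restore_space(1 * TB, [TB // 4, TB // 4], 1 * TB) / TB)  # 2.5
```

The point of the sketch is only that the staging area grows with each incremental in the chain, while the output copy stays the size of the database.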

So from a scalability standpoint,
it has a pretty serious problem.

And we're 2 versions on from it
being introduced and no one has

actually even thought about
addressing any of those

issues.

I don't want to harp on this too
much, but the main point is

getting stuff into core is hard.

It takes me years.

I've been working on a very small
change for backup just to mark

pg_control when a backup label
is required.

So you do a backup, a backup label
is required to do the recovery.

If the user deletes the backup
label, they can end up with corruption,

silent corruption.

It's quite annoying actually.

And I've seen people who know Postgres
really well, hackers do

this and not understand what happened.

Nikolay: I saw it many times already
in various teams.

It's annoying.

David: So 2 years ago I introduced
a small patch for Postgres

to just put a flag in pg_control
to mark it when we actually

need a backup label.

So if you start Postgres and backup
label's not there, it will

just stop and say, no, you must
have backup label.

Please provide.

That was 2 years ago.

It goes through review occasionally,
but really nothing there.

In pgBackRest, we actually implemented
that about 3 years ago.

What we do, this is a bit of a
hack but it works pretty well,

what we do is we overwrite the
last checkpoint in pg_control,

we write in the hex value DEAD.

So we get a report from a user,
because Postgres will come up

and say it's unable to
find checkpoint DEAD.

And that's actually not a valid
checkpoint at all because it's

under the first WAL segment limit.

So Postgres will mark it as invalid,
it will throw an error,

and then when we get a report from
the user we immediately know:

hey, you tried to start this and you
deleted the backup label. And they're

like, yeah, I deleted the backup label.
And boom, there you go.
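
To illustrate why that marker can never be a real checkpoint location, here is a small sketch. It assumes the marker is interpreted as the LSN 0/DEAD and that WAL segments have the default 16 MiB size; the helper names are made up for the example and are not pgBackRest or Postgres functions.

```python
WAL_SEGMENT_SIZE = 16 * 1024 * 1024  # assumption: default 16 MiB segments


def parse_lsn(text: str) -> int:
    """Convert an LSN written as 'high/low' in hex into a byte position."""
    high, low = text.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def wal_segment_number(lsn: int, segment_size: int = WAL_SEGMENT_SIZE) -> int:
    """Index of the WAL segment that contains this byte position."""
    return lsn // segment_size


marker = parse_lsn("0/DEAD")              # assumed form of the marker value
first_segment = parse_lsn("0/1000000")    # first byte of segment 1

print(hex(marker), wal_segment_number(marker))  # 0xdead 0
print(wal_segment_number(first_segment))        # 1
```

Under these assumptions the marker falls in segment 0, below the start of the first WAL segment, which matches the description of it being rejected as an invalid checkpoint.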

But I wrote that patch 2 years

ago, and I'm still hoping to get
that into Postgres.

So that's like the speed at which
Postgres can operate sometimes,

especially for people like me who
aren't really committers.

So can you imagine trying to get
something the size of pgBackRest

or complexity of pgBackRest into
Postgres?

Nikolay: Yeah, I wish we could highlight
this, put a link to this

patch and maybe drive some attention
to it because some people

who participate in Postgres hacking
listen to us and maybe...

I agree it's super annoying and
right now all the other tools,

they need to...

Basically, if you create full copy,
full backup on a replica,

you are responsible for placing
this backup label yourself.

And it's also how restore works
in Postgres.

I made a fix recently, and this fix
was very quick because it was

obvious.

It was just a problem in the Postgres
code in the restore path:

how Postgres works with the
backup label and also signal

files.

If you look at how people use it,
they see some error in the logs and

then: okay, I will just delete
some file, right?

To get it going. And this is what's
happening all the time, right?

But I also noticed that pgBackRest,
maybe because of

this mechanism you explained,
when making a backup on a replica

requires a connection to the primary.
And this surprised me a lot

recently when I started to code,
right?

David: Yeah, that actually doesn't
have anything to do with

that specifically.

The reason why we did that is you
get a better backup that way.

Nikolay: Better backup.

Not all backups are equal.

David: You get a better backup.

First of all, there are some things
about backup from standby.

I don't necessarily want to get
into that because it's pretty

esoteric, but there's some things
about backups from standby

that I don't 100% trust.

But I do believe it works.

The reason why a primary backup
is better, though, is you get

a couple things.

You get stats from the primary,
which is actually pretty nice.

If you do a backup from standby
and you restore, you've got stats

from, say, a standby.

And if you're actually doing analysis
of statistics, it's pretty

shocking to suddenly have
all of your index patterns, scan

patterns, read patterns fully change,
everything.

So I'm not talking about table
statistics, sorry, but basically

scan statistics.

Maybe I'm using the wrong word.

Michael: Shoot.

No, I think you, I think I understand
number of seq scans on a

specific.

David: Yeah, I think they're
both, I think they're both stats,

but, but I'm not talking about
the planner statistics.

I'm talking about like the actual
usage stats.

Another thing is if you're doing
a backup on the standby, you

can't actually finish the backup
and verify that all the WAL

has reached the archive.

You can, but you might have to
wait a day for that to happen

or whatever.

We feel it's really important to
make sure that this backup,

when we mark the backup as done,
we want it to be done.

Not hoping that someday this WAL
archive is going to arrive.

We want it to be done at that moment.

So that's pretty important.

Nikolay: Do you mean the backup is
self-sufficient?

It positively affects RTO, right?

Like recovery time objective.

It reduces the time needed to recover.

David: Well, at the very least,
we want to know that the backup

can be recovered to consistency.

Right.

Now, if you want to do point-in-time
recovery, that's going to

be after the backup finishes, you'll
continue WAL archiving,

you need to monitor that as well.

But at the point where the backup
finishes, we want that to definitely

be true.

Other things like getting actual
logs from the primary are more

interesting than getting logs from
a standby.

Although I don't really recommend
that people put logs in their

PGDATA directory at all.

It's actually quite common.

Nikolay: Hold on one second.

Too many things here.

First of all, when you say logs,
it's WAL files, right?

David: No, sorry, in this case
I do mean logs.

So like basically textual logs
that are generated by Postgres,

and a lot of people will put those
in PGDATA.

Nikolay: Okay, I didn't get it.

Why do we care about logs when
making backup?

Ah, because we don't want them
to be backed up, right?

David: We do if you have them
in...

So if people are putting their
logs in PGDATA and they expect

them to be from the primary, it
can be surprising when they're

logs from a standby or something
like that.

I see.

They're not seeing the information
that they expect to get.

Maybe they have auditing turned
on in the primary and not on

the standby.

You recover the backup and now
all your audit records are gone,

etc.

So now I'm not a proponent of putting
logs in PGDATA on the

primary, but there are people who
do this and they want to preserve

those logs.

That's another reason.

So there's a whole raft of reasons,
but basically, so what we

do is we copy everything that's
replicated from the standby.

Oh, the other reason why we really
wanted to do this, although

honestly we've never implemented
it, is once you've done this,

you can actually parallelize a
backup across all available standbys.

Because you're coordinating everything
on the primary, you wait

for the standbys to reach the checkpoint
where the backup started,

which we already do, of course.

We do a bunch of consistency checks.

And then you would be able to parallelize
your backup across

all the available standbys and
really supercharge it.

And it's not even that hard to
do.

We just haven't really, there hasn't
been demand for it.

And there's always a million things
to write.

You know how it is.

So we just haven't gotten to it
yet.

But that was another big reason
to do it that way, because it

allows us to parallelize backup
in a way that otherwise wouldn't

be possible if you're just backing
up from a single standby.

Nikolay: There is a dilemma here
where, first of all, making a full

backup on the primary, especially if
it's full, not incremental,

full, is a huge stress for disks.

David: Yeah, you don't want that.

Nikolay: Yeah.

So usually people offload it to
some standby, and then the problem

is also which standby, because
it's also stress.

If it's a read replica, it's stress
for those reads as well.

So distributing totally makes sense,
but there is also, like,

since you talk and focus a lot
on corruption and consistency,

I was always curious: what if we, for
example, have an allocated standby

only for backups? I think
Crunchy Bridge had an issue (we

had a customer there)
when they made

full backups on the primary.

I was super shocked, like,
how come? It was fixed then,

and I think full backups went to
the HA replica, which is for HA.

It's allocated, it doesn't receive
reads.

But then I'm curious, if some corruption
happens, it might happen

only, there are many kinds of corruption,
right?

I can think about, at least theoretically,
some corruption might

be only on 1 node, but not on others.

And if you always make backups
from only one node and you don't

notice, this corruption propagates
to all backups.

So some rotation at least is good,
but combining and parallelization,

it's also interesting.

I'm just thinking: in SQL Server,
for example, there is a mechanism

to self-heal if, for example, primary
notices some pages are

corrupted, it can grab them from
replicas, which was implemented

long ago.

This is cool technology.

I'm curious about this dilemma.

We also want to distribute load so as not
to hurt our user traffic, but

also, what's happening with corruption?

This is a Pandora's box of topics.

David: That'd be an interesting
idea actually, too.

If you were doing backups from
multiple standbys, you could compare

and contrast.

If there was corruption on 1 standby,
you could say, hey, why

don't I try grabbing this file
from the other standby instead,

fail it, send it back to the main
process, which would reissue

that job on a different standby.

That's a pretty cool idea.

Nikolay: You talk about corruption,
like data checksums level

of corruption but there may be
a higher level of corruption like

index level, like logical level,
foreign keys, like a lot of

stuff.

David: Oh sure and your corruption
scenarios on primaries and

standbys are quite different too
because the way things are being

written out on a primary is actually
fairly different to the

way it's being written out on a
standby.

Also the standby is essentially
a single threaded operation.

Nikolay: Let's not go there.

David: It's terrible.

Yeah, obviously it's terrible for
performance. But from the standpoint

of any issues there might be with
locking and parallelism and

other things on a primary, the
standby is not going to see those

bugs, probably.

And so there are whole classes
of bugs that can happen on a primary

that could cause corruption that
aren't going to happen on a

standby.

So from that perspective, it might
actually be safer to do backups

on a standby as well, just because
it's a simpler path.

The data goes through a much simpler
path to get to disk on the

standby than it does on the primary,
as a rule, unless you really

have 1 writer on the primary, which
is not the way things tend

to run these days, not for interesting
databases at least.

So yeah.

And as you say, the corruption
we're detecting is only checksum

corruption.

There are some projects out there
where people have combined

pgBackRest with amcheck.

So basically when it does recovery,
like basically you call this

command, it automatically integrates
pgBackRest with some sanity

checks afterwards and does amcheck
and some other things to check

a higher level of consistency than
just the page checksums.

Nikolay: Yeah, when you do recovery
it's a good point usually.

This is what we usually do.

Not all recovery attempts, but
some of them, percentage of them,

should also check index health.

David: This is really useful
in test recovery.

So hopefully you're testing your
recovery, right?

And that's a great time to do amcheck
because then you can have

confidence in your emergency hair-on-fire
recovery that it actually

does work because you practiced
it, you ran amcheck on all the

practice runs.

So at that point you should feel
pretty confident to run the

production restore without having
to take the penalty of running

amcheck at that time, if you've
been properly testing.

Nikolay: I agree with everything,
but you cannot do it on all runs

because full-fledged amcheck takes
a lot of time.

David: Yeah, that's why I say
it's really for test restore,

right?

So you're testing things and it's
tricky, right?

Okay, I guess you can say amcheck
takes 2 days, so we're going

to do this once a week.

And then we'll do more recovery
testing, but only run amcheck

once a week.

I think that's a perfectly valid
way to go about it, to think

about it.

Absolutely.

Nikolay: I'm also very curious, did
you think about some more observability

capabilities in pgBackRest to understand
that.

My final goal is, for example,
I have 100 clusters.

I just want to know actual RPO,
RTO all the time, measured somehow,

during those restore tests, during
the backup process itself.

And just to have some high-level
view, this is what's happening.

That's it.

Now you need to build a lot to
have that.

David: RPO you can measure using
the information that pgBackRest

provides in the info command.

Because it'll tell you
how up to date

you are on your WAL segments,
right?

So that's going to give you your
RPO.

So I'm 1 segment behind, 10 segments
behind, 5 segments behind.

So at least you know what your
recovery point is based on the

information that you get.
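
One way to turn that into a number is to compare the newest archived segment name with the segment the primary is currently writing. This is a minimal sketch, assuming the default 16 MiB segment size (so 256 segments per 4 GiB log ID) and that both names have already been obtained; the function names are hypothetical.

```python
# Assumption: 16 MiB segments, so each 4 GiB "log" holds 256 segments.
SEGMENTS_PER_LOG = 0x100000000 // (16 * 1024 * 1024)


def segment_index(wal_name: str) -> int:
    """Map a 24-hex-digit WAL file name to a running segment index.

    The name is timeline (8 digits), log ID (8 digits), segment (8 digits).
    """
    if len(wal_name) != 24:
        raise ValueError(f"not a WAL segment name: {wal_name!r}")
    log_id = int(wal_name[8:16], 16)
    segment = int(wal_name[16:24], 16)
    return log_id * SEGMENTS_PER_LOG + segment


def segments_behind(last_archived: str, current: str) -> int:
    """How many segments the archive trails the primary by."""
    if last_archived[:8] != current[:8]:
        raise ValueError("segments are on different timelines")
    return segment_index(current) - segment_index(last_archived)


# Example names invented for illustration.
print(segments_behind("0000000100000002000000FE",
                      "000000010000000300000001"))  # 3
```

The example crosses a log ID boundary on purpose, since simply subtracting the last eight hex digits would give the wrong answer there.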

RTO is a much trickier scenario
though.

Nikolay: RPO is also tricky because
you think like a technical guy

and I'm also a technical guy.

We think about bytes.

Business guys think about seconds,
hours of data loss.

How many hours of data were lost?

And this you need to translate
somehow.

David: It's absolutely fair, and
you're not going to be able

to get that timing exactly.
But you could use, for instance,

our repo-ls command to go and grab
information about the WAL

and find out how much WAL you're
generating per second.

It's not going to give you exactly
the time, but it's going to

be awfully close.

So you could estimate, you're like,
OK, we're generating 5 WAL

per minute.

We're 2 WAL behind.

And therefore, our recovery point
is this many bytes behind and

this many seconds behind.
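
The arithmetic described here can be sketched as follows, assuming the default 16 MiB WAL segment size and a roughly steady WAL generation rate. The function and parameter names are illustrative.

```python
def estimate_rpo(segments_behind: int,
                 segments_per_minute: float,
                 segment_bytes: int = 16 * 1024 * 1024):
    """Return (bytes behind, seconds behind) for the archive.

    segments_per_minute is a measured average, so the time figure is
    an estimate rather than an exact recovery point.
    """
    bytes_behind = segments_behind * segment_bytes
    seconds_behind = segments_behind * 60 / segments_per_minute
    return bytes_behind, seconds_behind


# The numbers from the conversation: 5 WAL per minute, 2 WAL behind.
print(estimate_rpo(2, 5))  # (33554432, 24.0)
```

So with those figures the archive would be about 32 MiB, or roughly 24 seconds, behind the primary.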

So you would actually be able to
estimate that reasonably well

using just the tools that are available
in pgBackRest.

Although actually estimating the
time of the WAL, you would

have to roll your own for that
because that would have to be

something you do for repo.

So the nice thing about repo-ls
is it's a command we give you

and it works on any repository
whether you're on S3, GCS, SFTP,

POSIX.

So you can use that tool to get
information about the repository

that's not included in our JSON
info and it will work in any

environment, on any repository,
on any storage.

So you don't have to muck about
with AWS CLI over here and the

Azure CLI over here and whatever
the GCS equivalent is and et

cetera, et cetera.

And then, okay, so you could sort
that out, I think, with the

tools that are available with pgBackRest,
but RTO is still a

thing that needs to be tested.

We can't tell you RTO at backup
time.

And since Postgres is famously
slow for recovery... maybe not slow,

but certainly things don't get
written as fast as they get written

on the primary.

That much we can agree on.

Recovery actually, it goes pretty
well.

And Postgres has gotten pretty
good at things.

pgBackRest prefetches WAL segments.
All this combined actually

gives you pretty
good performance overall.

But again, you're going to have
to test it to know how long it

takes you to do recovery.

Fun little example: there have
been... and this hasn't happened just

one time, but the first time it happened
was in the early days of

pgBackRest, because our thing has
always been high volume performance,

et cetera.

So there are a number of people
over the years who have used

pgBackRest because they can't use
replication.

Because replication simply will
not keep up with their load ever.

So what they do is they basically
continuously take and restore

backups.

And that way, they can measure
the RTO pretty well, actually,

because they're constantly doing
these recoveries.

And they get to a point where they...

And then we'll see how long it
takes to get to the point where

they were when they did the restore,
right?

Because they can never actually
catch up, but they know how long

it's going to take.

If the primary fails, how long
will it be until the standby is

up to date using WAL from the archive?

Downtime will be this.

And they know it exactly because
they do it every single day.

And this is a use case that I've
seen multiple times now.

People have brought it up with
me.

It's a really interesting use case.

Nikolay: So they are bumping into
this 100% single-CPU

problem of the startup process on replicas.

So that also means they live without
replicas.

These replicas, they cannot
use them because they are lagging

basically, right?

David: Yeah, they're pretty far
behind.

So they really are just there to...

I mean, obviously you can make
up scenarios where a replica like

this could be used.

Let's say you're doing reporting
and you know your replica lag

is 6 hours, then you know by 8
AM you can start running reports,

that kind of thing.

So you can game it.

But you're right, it severely reduces
the usability of the replica

in that scenario.

But they just need something because
they can't start from...

They're just trying to minimize
the time it takes for them to

get to running again.

Because if they start from a restore
of their 50 TB backup, that's

going to be however long that takes.

pgBackRest is pretty fast, but
50 terabytes is 50 terabytes.

And then they start recovery.

Nikolay: One hour.

I showed an article: 37 terabytes
per hour.

We can do it in one hour, but it's local
NVMe to local NVMe.

It's very different.

David: Exactly.

As usual, the backup is... you know,
I've gotten a lot of flak

over time. We do a
lot of checksumming and we use

pretty heavy-duty checksums, and
people are like, oh, those checksums

are so expensive, how can you stand
how expensive they are?

They're not expensive once you
actually start pushing stuff.

If you're writing to your local
SSD, then yeah, the checksum

overhead is 5% in pgBackRest.

I've measured it.

We have tests for this.

So yeah, it looks like a lot.

But then you actually start pushing
stuff out to S3 and poof!

It just disappears.

You don't even see it anymore.

The trick is getting this kind
of volume in a realistic environment.

People don't store their backups
locally.

At least they shouldn't be.

That's the message we try to get.

Nikolay: I should repeat the tests
with S3.

I'm very curious about the level
of parallelization we can get.

Of course, the machine
where we recover

should itself have local NVMes.

I'm very curious.

David: We have a...

Yeah, sorry.

Let me just address that really
quickly.

We do have a good, interesting
optimization coming, hopefully

in the next release, if I can get
it reviewed, by July 10th,

that's our feature freeze.

It's basically
adding prefetch for object

stores, plus something else. I'm
sure other people do this, but

I haven't really figured out what
it should be called, so I've

been calling it readover, hoping
someone will come up with a better

name.

A lot of times we're reading
through, say, a bundle and we

need n files, or n minus m
files, and so we're basically

starting a bunch of new reads. We'll
grab these 3 files, and

then there's a file we're skipping,
so we'll start a new read,

grab 4 files, and then we have to
skip 2 files, so we'll start

a new read, and so on. So now
we have

this thing called readover,
which is also under review, and

it will skip over a configurable
block of bytes.

It's configured
by default to 64K.

So if we're reading through the
file and we have this 1 small file

that's breaking up the read that
we're doing, we'll just read it
and throw it away and then continue.
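
A rough sketch of the readover idea as described here: when the gap between two wanted byte ranges is no larger than a threshold, keep reading through the gap instead of starting a new request. This is not pgBackRest's implementation; the function name, the (offset, size) representation, and the example numbers are assumptions.

```python
def plan_reads(wanted, readover=64 * 1024):
    """Coalesce wanted (offset, size) ranges into as few reads as possible.

    Two ranges are merged when the unwanted gap between them is at most
    `readover` bytes; the gap is read and discarded by the caller.
    """
    reads = []
    for offset, size in sorted(wanted):
        if reads:
            start, length = reads[-1]
            gap = offset - (start + length)
            if gap <= readover:
                # Extend the current read through the gap and this range.
                reads[-1] = (start, max(length, offset + size - start))
                continue
        # Gap too large (or first range): start a new read request.
        reads.append((offset, size))
    return reads


MIB = 1024 * 1024
wanted = [
    (0, MIB),                    # first file in the bundle
    (MIB + 8 * 1024, MIB),       # 8 KiB of unwanted data before this one
    (10 * MIB, MIB),             # far away, so a separate request
]
print(plan_reads(wanted))
# [(0, 2105344), (10485760, 1048576)]
```

With the threshold set to 0 the same input produces three separate reads, which is the behavior the readover setting is meant to avoid for small gaps.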

And I picked 64K because, based
on my testing,

it's cheaper to read 64K in pretty
much all scenarios, no

matter what your storage type is,
whether it's reduced redundancy

or, call it, the longer-term storage
classes; I can't think of the name

right now.

And so 64K is basically always
a win.

Even if you're paying for egress,
which you will be in some cases,

you're still going to pay less
if you just read the 64K.

And from a latency standpoint,
it's always a win.

Because starting a new read is
just expensive in terms of time.

And there's actually a cost to
starting a read as well.

Once you've actually got a file
open and you're reading, on normal

S3 storage you're not paying for
that egress. But if you actually

start a new request, you do pay
for that new request. Anyway, these are some

interesting features that would
make the tests that you were

just talking about very
interesting to run. It would be a

very different scenario than pgBackRest
today.

Nikolay: Longer-term storage,
Glacier, right?

David: Yeah, no, I'm not thinking
of Glacier.

I'm thinking of the one in between.

I actually set all my buckets to
transition to it automatically

after 30 days.

So just transition to that.

You can still read it instantly,
but it costs.

If you go and read it, it costs.

And people have been
wanting to add this as a feature

where you can go into the storage
class immediately for all backups

and WAL.

Nikolay: Yeah.

Standard IA.

David: IA, thank you.

Infrequent access, that's the one.

And it's actually a bad idea to
write IA initially, because in

a healthy repo, you're actually
reading the repo.

And as soon as you read 1 backup
out of that repo, you've destroyed

all the advantages of IA.

It's all gone.

The idea of IA is I've got an older
backup from 6 months ago

that I'm holding for compliance
and now suddenly I need to recover

it.

Great.

You do that once every 2 years.

Now you're using IA correctly,
but if you're writing all your

backups into IA initially, you'll
just end up spending more money.

The only scenario where it works
out is where you write and never

ever do any recovery or any WAL
recovery ever, which is just

not a use case to me.

Nikolay: Back to these cases you are familiar with,
I'm very curious how much

WAL data was written per second
in those cases where

the problem was that they cannot have
replicas with near-zero lag,

right?

David: Yeah.

Nikolay: Some hundreds of megabytes
per second, I suspect.

David: It can be.

It really depends on what you're
doing.

Let's say you are, the simplest
possible case is you're replaying

sequential writes on an unindexed
table.

Right?

That's your simplest possible case.

That'll run like gangbusters and
that'll definitely go into hundreds

of megabytes or gigabytes.

Maybe not per second, but hundreds
of megabytes per second I think

becomes pretty reasonable at that
point.

There's still some, still quite
a bit of exchange to get the

WAL segments, move them, rename
them, do this kind of stuff,

but it's quite fast.

But as soon as you start doing
interesting things like say updating

indexes, then your write volume
is going to go down significantly.

Then it's all going to be I/O latency.

So the CPU will drop below 100,
you'll go into I/O wait, and you're

going to be looking at latency.

So if your CPU actually is 100,
you're probably doing really

simple writes, generally speaking.

And you can maintain a really high
throughput if you're doing

that.

If your CPU drops, if you've got
complicated index writes and

stuff like that, your CPU will
generally not be 100 anymore.

And you'll be into I/O wait instead.

So I played around with this in
a bunch of different scenarios,

trying to figure out ways to help
Postgres and maybe game Postgres.

Certain scenarios are just slow.

Nikolay: It makes total sense because
if it's like simple, it's

I/O bound and if on primary, we
have hundreds of cores doing this

work, but on the back we have only
1 trying to catch up to do

the same work, basically, logically.

It's terrible.

Yeah, that makes sense.

Simple work.

Yeah.

David: And unfortunately, all
the cool work that Andres, Tomas,

et al. have been doing on async I/O,
it's

not going to help that scenario
very much on the standby.

It's great on the primary, but
it just makes a situation where

the primary can write even more
data than the standby can ever

possibly keep up with.

Again, because single threaded,
so those async operations are

only going to buy you so much.

Direct I/O and stuff obviously
will help.

So there are improvements being
made in the single threaded recovery.

So make no mistake, I'm not
saying there aren't, because

there definitely are, but they're
not keeping pace with the improvements

that are being made on the primary
side, which can be an order

of magnitude greater than what's
happening on the standby.

Nikolay: Yeah, exactly.

I used, I think, Intel machines
in this experiment where I achieved

30-something terabytes per hour
with pgBackRest, and I compared

it to pg_basebackup, and I expected
200-300 megabytes per second

as usual.

But since it was Postgres 18, I
saw 1 gigabyte per second.

I was super surprised.

David: It's actually gotten better.

I remember the article.

Yeah.

And that jibes with what I've
been seeing in terms of overall

I/O improvements
in Postgres.

So they're impacting everything,
make no mistake, but obviously

that recovery bottleneck is still
there.

And there's not much that pgBackRest
can do about that, except

just make sure that Postgres always
has the WAL.

As soon as Postgres requests a
WAL segment, we've got it right

there waiting for it.

And we've got recommendations on how
to set up your storage so that

we can do a move into the pg_wal
directory so it's as fast as

it can possibly be.

We're not doing a copy.

We don't fsync things, we don't
et cetera, et cetera.

So we're trying to run everything
as fast as possible, but there

are limits.

Nikolay: This makes total sense and
it's great insight.

Like improving things around, you
might highlight some difficult

problems and they might become
even more acute, right?

So everything improved, so you have
better performance.

And much faster you can reach
the problem with lagging replication.

David: Yeah, there was definitely a time,
before pgBackRest and

other parallel WAL implementations,
when getting the WAL

was the bottleneck and Postgres
kind of merrily sailed along

and did its thing and it was just
WAL that was the issue.

Now it's usually Postgres.

Another place this shows up,
and people

have noticed and started pointing
it out to me, although I already

knew it, is that if you're doing
log shipping, you can do recovery

much faster than if you are doing
replication.

Because log shipping compressed
segments, which are prefetched

and decompressed asynchronously
and then moved into the pg_wal

directory is quite a bit faster
than streaming uncompressed WAL

segments over the network from
the primary basically every day

of the week if you have high volume.

So that really brings into question
the idea of I brought up

a standby and I want to do recovery.

How do I do the best thing, where
I'm going to go and log ship

as long as I can because that's
faster and also puts less load

on the primary and then only switch
over to getting data from

the primary when I have to?

At what point do you switch back, and
stuff like that?

Postgres doesn't currently manage
that very well and the rules

are a little bit arcane.

So with some archivers (we haven't really
done this in pgBackRest

yet) there are certain things
you can do: basically, you

can artificially give an error
to Postgres to make it switch

back over to replication.

Do you get what I'm saying?

So let's say we get to pretty much
the end.

Nikolay: You're moving very fast.

So you're talking about replica
where both restore_command and

primary_conninfo are configured,
right?

Yes.

And then you say Postgres has a
very strict precedence rule.

I think it searches the local pg_wal
directory first, then it uses

streaming replication and only
then it uses restore_command,

if I'm not mistaken.

David: I believe that's the way
it works, yeah.

Nikolay: And then you say you have
a way to fool Postgres to achieve

what you want, and what you want
is to get WALs from archive

because they are compressed and
you can parallelize it at the restore_command

level, and only switch to
primary_conninfo, to streaming

replication, in some cases.

Right?

And then you have to do it with some
error; this part I don't get.

I can imagine throwing an error
from a restore_command, I cannot

imagine anything at primary_conninfo,
that's a problem.

David: The situation is, let's
say you are, you've been doing

log shipping because that's the
fastest thing to do, so you're

getting WAL segments.

Now, Postgres, if it's basically
decided on log shipping, isn't

going to switch away from that
until something happens.

The idea is that the actual
WAL get routine knows when

it's at the end of what it has.

So let's say Postgres, so you get
to the end and now the next

WAL segment arrives 2 minutes
later.

Now you've essentially built in
2 minutes of lag because Postgres

will patiently wait 2 minutes for
that next WAL segment to arrive

as long as it doesn't get an error.

So what you, in this situation,
what you can do is actually inject

an error into the archive-get to
say, I know I'm at the end of

the archive stream in the repo
because I can see it.

I'll throw an error to force Postgres
back to replication.
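
The decision David describes can be sketched in a few lines. This is not pgBackRest behaviour (he says above that it has not been done there); it is a hypothetical illustration of a restore_command helper that fails fast once it can see the requested segment lies past the end of the archive. The function name, the `max_wait_seconds` parameter and the action strings are all invented for the example, and it assumes the segments are on one timeline so that their file names sort in WAL order.

```python
def archive_get_action(requested, archived, waited_seconds=0.0, max_wait_seconds=60.0):
    """Decide what a hypothetical archive-get should do for one WAL request.

    Returns "deliver" (hand the segment to Postgres), "wait" (it may still be
    arriving) or "error" (exit non-zero so Postgres can try streaming).
    """
    if requested in archived:
        return "deliver"

    newest = max(archived, default=None)

    # The request is beyond everything in the repo: we are at the end of the
    # archive stream, so waiting would only add lag. Fail immediately.
    if newest is None or requested > newest:
        return "error"

    # A segment older than the newest one is missing. Give it a bounded
    # amount of time to show up before giving up.
    if waited_seconds < max_wait_seconds:
        return "wait"
    return "error"


archived = {
    "000000010000000000000001",
    "000000010000000000000002",
    "000000010000000000000004",
}
```

A restore_command wrapper would translate "deliver" into exit status 0 and "error" into a non-zero status.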

Nikolay: Yeah, streaming replication.

I think I got it wrong.

I think restore_command has precedence
over streaming replication.

David: I think it has.

Nikolay: I always forget that.

But the recipe you described
implies that.

But it's an interesting recipe.

David: And a lot of times, even
if the streaming replication

were prioritized, let's say you
restored a backup and

you need to get WAL from 3 days
ago to start doing recovery.

There's a very good chance that's
not going to be on the primary.

So you're going to, you're going
to go over to log shipping at

that point anyway.

But the other thing you would want
to do is, let's say you start

lagging too badly on the primary,
ideally Postgres would go back

to log shipping because it's quite
a lot faster.

Historically that wasn't true,
but with modern archivers, like

something you'd find in pgBackRest
or WAL-G, we can do that

a lot faster than replication.

So in theory, if you're lagging
behind enough on replication,

you switch back to log shipping,
catch up, then go back to replication

in order to have as little lag
as possible.

So the question is, how do you
coordinate?

On the pgBackRest side, we can
game it to force Postgres

to switch to streaming rep, but
we can't really get it to go

back to log shipping.

Now, these are discussions that
have been had.

I don't know if we've...

I can't remember if we've had them
on hackers, but it's something

we've certainly talked about at
conferences.
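
As a rough illustration of the coordination problem, the switching rule David wishes Postgres had could look like the sketch below. Nothing like this exists in Postgres today; the function name, the source labels and both thresholds are assumptions made up for the example. Two separate thresholds are used so the standby does not flip back and forth when the lag hovers around a single value.

```python
def next_wal_source(current, lag_bytes, archive_above, stream_below):
    """Pick where a standby should fetch WAL from next.

    current       -- "stream" or "archive"
    lag_bytes     -- how far replay is behind the primary
    archive_above -- when streaming, fall back to the archive above this lag
    stream_below  -- when on the archive, return to streaming below this lag
    """
    if archive_above <= stream_below:
        raise ValueError("archive_above must be greater than stream_below")

    if current == "stream" and lag_bytes >= archive_above:
        return "archive"  # too far behind: log shipping catches up faster
    if current == "archive" and lag_bytes <= stream_below:
        return "stream"   # nearly caught up: streaming gives the lowest lag
    return current


GIB = 1024 ** 3
MIB = 1024 ** 2
```

For example, a standby streaming with 10 GiB of lag and `archive_above=4 * GIB` would move to the archive, and stay there until its lag drops under `stream_below`.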

Nikolay: Yeah, that's an interesting
idea.

It would be great to have more
control here and more flexible

configuration.

David: Some kind of...

Nikolay: Sorry.

Go on.

We have huge delay because we are
on very different parts of

the world, right?

3 very different parts of the world.

David: We are very geographically
distributed, that's for sure.

So yeah, so on the 1 side, the
archiver would have to tell Postgres

that it's caught up.

So rather than throwing a nasty
error, we could

return a code, value 2 or
something, to say, hey, I'm at

the end of the WAL in the
repository.

Nikolay: Or just signal the lag.

Signal the lag.

This is my lag, that's it.

And then the mechanism will decide
which precedence, adjust precedence

based on knowledge about lags.

David: Or that.

And honestly, if you're going to
do that, the way to do that

would be inside the archive_library
interface.

David: And you're familiar with the
archive_library, right?

Nikolay: Yes.

David: We don't have one for pgBackRest
yet because it's not needed for

performance in pgBackRest and
it doesn't really provide any

other benefits.

So we haven't really done it yet,
although now that several versions

of Postgres support it, I'm looking
at probably adding that this

year or next year at the latest.

Because then maybe we could push
some of these ideas into Postgres,
to saying, hey, let's flip back
and forth between archive-get

and streaming replication based
on what's...

And passing stats back and forth
inside the archive_library would

be easy.

You don't need any fancy return
values or JSON or something.

It would just be an API call that
you would use within Postgres.

Nikolay: Postgres 15, when it was
introduced.

I wanted to clarify very clearly
that we talk about fetching

WALs and streaming, but after
that Postgres needs to replay

and then we have this startup process,
100% CPU.

This cannot be solved anyhow, right?

So if you manage to...

Usually the problem is there.

This is what I'm trying to say.

It depends on the system, I think,
but even with a simple, regular

streaming replication, which is
no compression, nothing, I usually

don't see any problems with receiving
WAL and just putting it

to file with WAL receiver, right?

The problem is usually on the startup
process.

These are my observations on multiple
systems.

David: And that's true.

A lot of it depends on your network.

If you're running things in multiple
availability zones and stuff

like that, or across cloud providers
or whatever, that situation can

definitely change.

Nikolay: Yeah.

David: But certainly if you've got
primary and standby locally, with the network

speeds these days, you're right.

Streaming replication should not
be a problem in that environment.

In this day and age.

Nikolay: My observations are with
good network and single region.

I agree with you if you introduce
network complexity, yes.

Michael: That sounds all very clever
and like obviously needed

at super high scale.

We've talked to a lot of people
and it seems like there's quite

a few projects in the space of
sharding these days and it feels

like people that go the sharded
route sidestep this problem because

each of the shards just has lower
volume, like by definition.

So I'm wondering if it's just like
an orthogonal thing to

pgBackRest, like sharding can happen
and each 1 can be backed up.

Does it affect you in any way or
are you thinking along those

lines at all?

David: From a pgBackRest standpoint,
no, because it's really

from our...

So the things in this arena that
I know support pgBackRest are

like Greenplum and Multigres.

And in both those situations, it's
a very straightforward

pgBackRest scenario.

We're just doing recovery, they
give us a point in time to recover

to, and it's up to them to figure
out how to get the shards

back in sync, and then we just
give them what they ask for.

Multigres looks enough like Postgres
that pgBackRest just is

cool with it.

On the Greenplum side, there's
a company that has created a fork

that specifically supports Greenplum
that'll read its control

files and do that and we've never
brought that into core mostly

because I don't want to get into
Greenplum and testing Greenplum

and also annoyingly the oldest
version of Greenplum is still

based on Postgres 9.4, which we
expired a couple of years ago.

So it's a little bit painful to
have.

So that company now
not only

has their fork, but they're
also supporting versions

of Postgres that we don't support
anymore so that they can still

support Greenplum 6, which is based
on 9.4.

And maybe that's something we could
bring into Core Postgres,

but it's a resource issue.

We just simply don't have the resources
for it and it's just

easier to focus on open source
Postgres.

When you drive people towards the
idea that if you make your

stuff look like Postgres it'll
just work, if you muck about with

pg_control, like the parts of...

So we even have a feature where
you can add stuff to pg_control

as long as you add it to the end.

And we'll automatically figure
out, so you tell us your Postgres

16, we'll automatically figure
out where the checksum lives,

based on the size of your pg_control
and all that kind of stuff,

and work all that out.

But the contract is that the part
of pg_control that was part

of that version of Postgres must
be the same.

Because we read all over pg_control.

It's extremely important for our
operations, so we can't just,

you can't move things around, you're
just allowed to add to the

end.

So we do make concessions to forks.

Things like it works with EDB,
their enterprise server, and other

things that have new control versions
and stuff like that.

As long as you kind of know what
you're doing.

We call it a maintainer feature.

So the idea is you shouldn't
be mucking around with this feature

if you are not a maintainer of
a fork

and deploying pgBackRest for
that particular fork, or documenting

how pgBackRest should be used
with that fork, or something.

If you have to use these options
on a regular cluster, then you've

probably done something wrong.

You shouldn't be doing that.

But obviously I can't control what
people do.

That's part of the fun of it, right?

Michael: Yeah.

Talking of maintenance, is it a
good time to transition into

how you have been maintaining it
over the years, kind of what

the situation was, and how it's changed
recently?

So I'm wondering if you wanted
to share a bit of that.

David: Yeah, maintenance has
always been a big part of pgBackRest.

1 thing is we have a very comprehensive
test suite for a project

of this size.

We test on 5 different architectures.

We test on a variety of distros.

We have 100% unit test coverage.

We have integration tests.

We test our doc code,
our test code, we test everything.

And just keeping that test suite
up to date is a bit of a challenge.

It's not the core code.

The core code, the C code,
doesn't

break because RHEL 10 comes out.

What breaks is our tests and the
documentation and all these

sorts of things.

So it's a non-trivial task just
to keep all that going.

And then you have things like say,
Cirrus CI just going away

on a month's notice.

So we need to migrate off of Cirrus
CI.

Luckily, we're already on GitHub
Actions, so we're able to do

that fairly easily.

When I say maintenance, I also
include bug fixes that come in.

We get a pretty regular stream
of bug reports.

Some are bugs and some are not.

Most of them these days tend to
be pretty weird edge cases.

Enough people are using pgBackRest
that the really obvious

bugs get caught pretty quickly.

If we release something and there's
bugs in it, I get reports

the next day, almost.

Bug fixes are maintenance, the documentation
is maintenance, interacting

with the community and answering
questions and feature requests

and other things like that, that's all
maintenance.

So that's actually a pretty big
part of the job.

And after I left, after I did not
transition to Snowflake, so

Snowflake bought Crunchy Data, I did
not transition.

I wasn't happy with their terms,
so I decided just to take some

time and travel.

I kept maintaining pgBackRest,
but at that point I was really

just maintaining it.

Which is fine, but I really got
to the point of thinking about

restarting development on
pgBackRest again.

I was like, I think I might need
a job.

And so I started trying to get,
I actually put sponsorship links

and I've been including sponsorship
with every release since

last summer.

It was around for basically a whole
year, not getting a lot of

traction of course.

Then I started spending more time
on it.

Then I got to the point where I
realized maybe this whole sponsorship

thing isn't gonna work out.

And that got us to the recent crisis
where I announced that I

wasn't gonna work on pgBackRest
anymore.

And suddenly I got sponsorship.

And when I say suddenly, it really
was suddenly.

Within 3 weeks, pretty much everything
was worked out, and I

was able to make an announcement
that the project was going to

continue to be maintained, and
etc.

But not just maintained.

The key is adding new features.

Right now, I've been doing, on
pgaudit and pgBackRest, I've been

doing a lot of cleanup work because
they've been sitting for

a while, especially pgaudit wasn't
really getting a whole lot

of love.

So I've been working on that, but
I'll be done with that at the

end of this week, basically.

And then it's time just to dive
into big new features again,

which I'm really excited about.

I've got a lot of ideas.

The sponsors have a lot of ideas,
as you can imagine.

David: The great thing is the ideas
that the sponsors have are

directly aligned with big ticket
items that have been on our

list for a while.

They're not asking for outlandish
stuff.

They're asking for repo-to-repo
backup, for instance, is a big

ask.

And that's been on the list for
a long time, because that's a

pretty important thing to be able
to do.

Streaming, doing streaming WAL
replication.

So you can have RPO 0 if you want,
without having a standby,

so the idea is you have RPO 0 without
a standby.

Nikolay: pg_receivewal, right?

David: Yeah, although we would
work directly at the protocol

level.

And in fact, we might...

I'm actually thinking, given that
we have this archive_library

thing, and given that pgBackRest
can just live in Postgres,

I'm not even sure if we need to
deal with the protocol level.

We could just sit there as a companion
to Postgres and stream

WAL out.

Nikolay: Isn't it expensive to sit
as Postgres?

Like, I don't get it.

What exactly is the idea?

David: So basically, instead
of a replication slot, we would

just be keeping up with the current
write WAL location.

Where are we currently writing
in WAL?

That's the point that we should
currently be targeting.

And although we would still need
a replica, if we really wanted

to say a synchronous replication,
RPO 0 stuff, we'd probably

still need a replication slot for
that.

The other thought we had is that
we could actually game, we could

create a replication slot but not
actually use it and just update

the row.

So basically do our own background
archiving and then update

the row in Postgres to our current
position, what we've actually

pushed out to storage.

And if I do this, I want to do
it right.

In a lot of cases, the things that
are doing the WAL receiver,

they're not actually, let's say
your eventual destination is

S3, they're not actually writing
packets off to S3.

They're storing the stuff locally
on whatever host the WAL receiver

is running on and then when they
get a full WAL segment, they're

pushing that to S3.

But what I would want to do here
is actually have a couple of

parameters that decide what your
acceptable lag is.

Let's say you're willing to have
3 seconds of lag, or so many

bytes.

And we can measure that.

In this situation, we'd be able
to measure either 1 of those

very easily.

So we would say, OK, if we get
to that point where we're going

to hit that lag, we'll actually
write a chunk of WAL out to

S3 and store it.

And the WAL archive-get routine
would actually be written so

that it would be able to...

Later these chunks would be assembled,
of course, reassembled

into a WAL segment to keep things
kind of tidy.

But the WAL get routine would
actually be able to understand

if you get to the end of the WAL
stream and all you've got is

these chunks, we'll be able to
read them out and reconstruct

them and send them to Postgres.

Because I
feel like with the implementations

that are out there right now, yeah, you're
getting the WAL off of

the primary, which is good.

Don't get me wrong, that's a pretty
important thing, but I want

to get the WAL all the way to
the repository, all the way to

the final storage.

So if everything goes away, and
all you've got is the S3 bucket,

then you would still have everything up to
whatever you set that lag to,

unless we
fall behind or other things happen.

Obviously, there's scenarios.

But I think it could be a lot more
efficient because then we

could figure out, hey, they're
generating WAL at a ridiculous

rate.

We actually need to switch over
to just compressing and pushing

whole WAL segments and then come
back to chunking up portions

of this WAL segment.

And with the replication protocol
you don't really have that

option because you've just got
this fire hose that's sending

you data and what you really want
to do is say, no, this is too

much.

Also, it's a lot slower than just
compressing and shipping.

So, no, we're going to go back
to doing whole segments, and when

we get caught up on the whole segments,
we'll start doing the

most recent segment in chunks,
back and forth.

I think the best way to do that,
that I'm thinking, I'm always

an outside the box kind of guy,
is actually in an archive_library

sitting next to Postgres and just
asking Postgres, where are

you, where are you, you know, kind
of thing.

And we can actually monitor, you
know, check stuff on disk, so

we'd be able to see, oh, we've
just gotten a new WAL segment,

so everything in the old WAL segment
is fair game, etc., etc.

So there's plenty of heuristics
we can use to improve this.

So that's the way I'm leaning right
now, and that gives us the

archive_library and this WAL receiver
RPO feature at the same

time.
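
To make the proposal a bit more tangible, here is a minimal sketch of the two decisions David outlines: when to push a partial chunk of the current WAL segment, and when to give up on chunks and ship whole compressed segments instead. This is a planned idea, not an existing pgBackRest feature, so every name and threshold below is hypothetical.

```python
from dataclasses import dataclass


@dataclass
class LagTarget:
    """Acceptable lag between Postgres and the repository (hypothetical knobs)."""
    max_seconds: float
    max_bytes: int


def should_push_chunk(unpushed_bytes, seconds_since_push, target):
    """Push a partial chunk once either lag limit is about to be exceeded."""
    if unpushed_bytes == 0:
        return False  # nothing new has been written, nothing to push
    return (seconds_since_push >= target.max_seconds
            or unpushed_bytes >= target.max_bytes)


def push_mode(wal_bytes_per_second, chunking_limit_bytes_per_second):
    """Chunk the newest segment at modest rates; ship whole segments when
    WAL is generated faster than chunking could keep up with."""
    if wal_bytes_per_second > chunking_limit_bytes_per_second:
        return "segments"
    return "chunks"


target = LagTarget(max_seconds=3.0, max_bytes=1024 * 1024)
```

The three seconds here mirror the example lag David mentions; the byte limit and the rate limit are arbitrary values chosen for illustration.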

Nikolay: RPO control.

This is better RPO control, basically.

David: Yeah, better.

Sorry, I meant to say RPO 0.

So if we could actually game this
so that we're not actually doing streaming

replication, but update the row
in Postgres for replication,

then we could actually do synchronous
replication this way.

And we could basically do synchronous
replication to S3.

If someone wanted such a thing,
yikes, it would be slow.

Nikolay: I'm very curious.

Barman was the first tool which used
only streaming replication, right?

I remember this.

So what is your opinion
on using streaming replication

for backups?

And this new CYBERTEC tool released
last week, pg_hardstorage,

I quickly checked, they don't use
pg_receivewal either, they

just implement the protocol, they work
with the protocol and they pretend

to be a replica.

David: Yeah, that's what most people have
done.

I think Barman might still be using
pg_receivewal, but most other

implementations have gone straight
to the protocol.

The protocol's not that complicated,
to be honest.

Nikolay: Like all uppercase.

David: If you've got libpq, once
you've got that set up, especially

if you're using libpq directly,
it's actually fairly trivial.

The replication protocol is very
simple, which it should be.

There's nothing wrong with simplicity.

I'm all about it.

I think ultimately it's a big bottleneck
making backups through

the streaming replication.

So for very large databases, it's
just going to be a pretty big

bottleneck.

Obviously someday we can add parallelism,
do all these things

and etc.

There's a lot of...

Well, they got compression.

So that's good.

Although I think...

Is the compression server side
or is it client side?

I think it might still be client
side.

No, it's server

Nikolay: side Okay,

David: I can't remember.

Anyway, that's actually a pretty
solvable problem.

But you've got basically a problem of
scale: now you need to back up something really

large and you're pushing everything
through this little pipe.

Nikolay: pg_basebackup has compression
since Postgres 15 and you

can...

Yeah.

Right?

Or

David: no.

It does.

I just can't remember which side
it happens on, whether it happens

on the server side or the client

Nikolay: side.

David: I think it is.

Nikolay: The --compress=server-gzip
option.

David: I believe it is, yeah.

Nikolay: But it's still single threaded.

David: That's 1 bottleneck taken
care of because you're not

pushing uncompressed data over
the network, which you definitely

don't want to do.

But it's still single threaded.

And that's a huge limitation on
the backup side, obviously.

And then on the recovery side,
we still have this problem of

all the data needs to be recovered
before anything can be done

if you're doing block incremental.

If you're not doing page incremental,
then you can make that

a bit more efficient.

You can stream the tar file from
S3 and decompress it as you

go and do various things.

But if you're doing page incremental,
which I think pretty much

everyone wants to do, that becomes
a huge bottleneck on the restore

side.

Nikolay: I'm sorry.

We never had an episode so compressed.

I know we already are way over
time, but it's so compressed.

You compress so much knowledge
and I like it so much, but can

you explain, like elaborate page
level versus block level and

why you think it's important?

David: That's my confusion because
what was introduced in Postgres 17,

the page level incremental with
the WAL summarizer and everything,

in pgBackRest we call that block
level incremental.

And the reason why we call it block
level is we don't always

operate at page size.

We operate at block size.

And then we actually combine blocks
into what we call a super

block for compression.

So let's say I want to get 1 block
out, I might have to retrieve

3 compressed blocks or 4 compressed
blocks, depending on the

super block size, to actually get
that.

But the important thing is we can,
let's say we've got a 1 gigabyte

file and we need to recover 1 block,
we can go recover 3 blocks

to do that instead of 1000 blocks,
however many blocks are in

a 1 gig file.

I think it's usually 1,200 at
our largest block size.

So we can go recover 3 blocks instead
of 1,200 blocks, even

though we have to recover 3 blocks
just to write 1 block.

So the main thing is we're able
to go, let's say you've got a

standby or a primary that's failed
for some reason, you don't

know why.

So you're gonna do a delta restore.

So in a delta restore, we go and
we look at every file. And for

the block level stuff, we chop
it up into pieces, and then we

do a hash on each piece, and then
we do a hash for the entire

file.

If the whole file hash matches
what's in the backup manifest,

then we're done.

Move on.

Next file.

If the whole file hash doesn't
match, then we can go recover

what we call the block map, which
is a map that tells us where

all the blocks live in all the
backups.

You might have 10 incremental backups
since your last full, and

now we need to go figure out where
are the blocks, where are

the blocks that we need located
and go grab those blocks.

And in the other implementations
that have been done, you can't

do that.

You basically have to go recover
everything and some of them

will have good things like WAL-G
will, as it's reading it,

it'll know that it doesn't need
the block and it'll just throw

it away.

That kind of read over thing I
was talking about before.

So it'll just throw away the blocks
it doesn't need, but it's

still reading them out sequentially.

We're actually able to go random
access and pull out just the

blocks you need.
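
The comparison step of a block-level delta restore, as David describes it, can be sketched as follows. This is a simplified illustration, not pgBackRest's code: the function names are invented, the manifest is reduced to a file hash plus a list of per-block hashes, and a truncated BLAKE2 digest from the standard library stands in for the per-block XXHash mentioned later in the conversation. The block map lookup across several incremental backups is left out.

```python
import hashlib


def block_hash(block):
    """Short per-block hash (stand-in for the real block checksum)."""
    return hashlib.blake2b(block, digest_size=8).hexdigest()


def blocks_to_restore(local, manifest_file_hash, manifest_block_hashes, block_size):
    """Return the indexes of blocks that must be fetched from the repo."""
    # Whole-file hash matches the manifest: nothing to do, move on.
    if hashlib.sha1(local).hexdigest() == manifest_file_hash:
        return []

    # Otherwise compare block by block and keep only the ones that differ.
    stale = []
    for index, expected in enumerate(manifest_block_hashes):
        chunk = local[index * block_size:(index + 1) * block_size]
        if block_hash(chunk) != expected:
            stale.append(index)
    return stale


# Build a tiny "backup" of three blocks and its manifest entries.
BLOCK = 8
backup = b"A" * 8 + b"B" * 8 + b"C" * 4
file_hash = hashlib.sha1(backup).hexdigest()
block_hashes = [block_hash(backup[i:i + BLOCK]) for i in range(0, len(backup), BLOCK)]

damaged = b"A" * 8 + b"X" * 8 + b"C" * 4
```

Only the blocks returned by `blocks_to_restore` would need to be located and pulled from the repository; everything else already on disk is left alone.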

And it's extremely powerful because
we have that delta restore

concept where we can actually just
recover part of a cluster.

And that's where the power of those
checksums that we spend, in

theory, 5% of our time on, but actually
it's really much less than

1%, that's where that all comes
in.

On the pg_basebackup side right now,
you basically have to recover the full and all the incrementals,

decompress everything all at the
same time, and present that to pg_basebackup,

sorry, pg_combinebackup,
which will then rewrite

that into another directory.

So at a minimum to do recovery,
you're looking at double your

database size and actually more.

And you don't really know what
that more is, unless you've stored

it.

It's up to you to actually store
that information to find out

what is the total number of bytes
that I'm going to need to even

pull down the data.

On the pgBackRest side, when
we're doing a recovery, nothing

ever hits disk except in PGDATA.

So we're just reconstructing those
files inside PGDATA.

There's no spooling, there's no
spooling during backup, there's

no spooling during recovery, there's
no spooling during archive

push, there is spooling during
archive-get because we actually

prefetch WAL files and we store
them so that we can hand them

over to Postgres.

So we, we spool on that side, but
for the most part, we're 0

disk operation.

And to be really efficient, that's
the way you need to be looking.

And right now the design in pg_basebackup
is exactly not that.

A lot of disk I/O.

They've done some tricks with,
if you're on ZFS or other file

systems that can do copy-on-write
tricks and do other fun things

to try to minimize the number of
writes that you're going to

do.

I don't know if it actually minimizes
the space that you need.

On some file systems that might
be true.

Nikolay: Thank you for explanation,
it was great.

David: The whole idea was the
feasibility of pg_basebackup

using the streaming replication
as a backup tool.

And I think, yeah, you can do it,
but when you're working at

scale, it starts to introduce a
lot of limitations.

And from day 1, Gabriele Bartolini
wrote a pretty good thing

after I announced that I wasn't
working on pgBackRest about

his and my fundamental disagreement
about how this tool should

work.

And his idea is that the copy thing
should be owned by Postgres.

So Postgres should do all the file
copying.

And my idea was that no, the copy
should be built into the tool.

Because then we can do, the sky's
the limit.

We can do anything we want.

And I think the performance that
you can get from pgBackRest

speaks to the value of having that
complexity.

And it is, it's a lot of complexity,
don't get me wrong, to have

all that built into pgBackRest,
but the payoff is performance.

And also reliability.

We have checksums that we can check
on everything.

We know when the data that's coming
back is good.

That's 1 thing that kind of peeves
me is that if you, like let's

say you were doing pg_combinebackup
it doesn't actually verify

that the data is correct.

So if you want to verify, you actually
have to run pg_verifybackup

as a separate step before
you run pg_combinebackup.

And in my mind, those should always
be combined into a single

operation.

You're looking at the data, you're
verifying it, you're writing

it out.

And the cost for that, again, oh,
checksums are expensive.

Yeah, but if you're streaming the
data directly from S3 and checksumming

it and writing it out as it comes
in, the checksumming cost disappears.

You don't see it anymore.

If you copy all the data from S3
locally, decompress it, and

then start running checksums on
it, it looks really painful because

you weren't able to, the idea for
me is let's gain all the latency

that we can, the storage latency
that we can.

So while we're waiting for the
storage to give us something,

we'll send it the next request
to S3 and while it's thinking,

sending this data, we checksum.

We're done with that block, the
next block is ready for us.

We pull that in, we start checksumming,
we've asynchronously

sent the next request to S3.

So we've already got more data
coming.

In the next version of Postgres,
if everything goes well, we'll

also have prefetch.

So we won't be asynchronously fetching
1 block, we'll be asynchronously

fetching by default 4.
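
The overlap between waiting on storage and checksumming can be sketched roughly as follows. This is a hedged Python illustration, not how pgBackRest implements it: `fetch_block`, `restore_with_prefetch`, the simulated latency, and the use of a thread pool are all assumptions, and the default of 4 simply mirrors the prefetch depth mentioned above.

```python
import hashlib
import time
from collections import deque
from concurrent.futures import ThreadPoolExecutor


def fetch_block(store, index):
    """Hypothetical stand-in for a storage request (e.g. an S3 range read)."""
    time.sleep(0.01)  # simulated storage latency
    return store[index]


def restore_with_prefetch(store, prefetch=4):
    """Checksum each block while up to `prefetch` later requests are in flight."""
    digest = hashlib.sha1()
    restored = bytearray()
    pending = deque()
    next_index = 0
    with ThreadPoolExecutor(max_workers=prefetch) as pool:
        while next_index < len(store) or pending:
            # Keep the request queue full before doing any CPU work.
            while next_index < len(store) and len(pending) < prefetch:
                pending.append(pool.submit(fetch_block, store, next_index))
                next_index += 1
            block = pending.popleft().result()  # oldest request, keeps order
            digest.update(block)  # hashing overlaps with in-flight fetches
            restored.extend(block)
    return bytes(restored), digest.hexdigest()


blocks = [bytes([i]) * 8192 for i in range(16)]
data, checksum = restore_with_prefetch(blocks)
```

With `prefetch=1` this reduces to the "one request ahead" pattern described first; larger values keep more requests outstanding while the current block is hashed.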

And just pulling that data in as
fast as we can, but even as

fast as you can get stuff over
the network, that cost of checksumming

disappears. But it gives you power.

And pgBackRest is the only thing
that has something equivalent

to delta restore.

And all of that is powered by the
checksums.

That's where all that comes from.
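
To make the link between checksums and delta restore concrete, here is a minimal Python sketch under stated assumptions. It models the manifest, the backup, and the target directory as in-memory dictionaries; the names `delta_restore` and `sha1_hex` and this manifest layout are hypothetical and are not pgBackRest's actual format or API.

```python
import hashlib


def sha1_hex(data):
    return hashlib.sha1(data).hexdigest()


def delta_restore(manifest, backup, target):
    """Bring `target` in line with `manifest`, copying only files that differ.

    manifest: path -> expected SHA-1 hex digest (hypothetical layout)
    backup:   path -> file contents held in the backup repository
    target:   path -> file contents currently on disk (modified in place)
    """
    copied = []
    for path in list(target):
        if path not in manifest:
            del target[path]  # not part of the backup, so remove it
    for path, expected in manifest.items():
        current = target.get(path)
        if current is not None and sha1_hex(current) == expected:
            continue  # already matches the backup, nothing to transfer
        target[path] = backup[path]
        copied.append(path)
    return copied


backup = {"base/1": b"aaaa", "base/2": b"bbbb", "base/3": b"cccc"}
manifest = {path: sha1_hex(data) for path, data in backup.items()}
target = {"base/1": b"aaaa", "base/2": b"XXXX", "stray": b"junk"}
copied = delta_restore(manifest, backup, target)
```

Only the changed and missing files are transferred; the unchanged file is recognized by its checksum and left alone.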

And then we have the block level
delta restore as well, which

is powered by the block level hashes,
which are actually XXHash,

not SHA-1.

Because the blocks are a maximum
of 88k.

Although you can configure that, you
can also configure how much

of the XXHash you're going to use
for a particular block size.

But we did a lot of analysis on
this and looking at file systems

and how many bits they were using
to protect blah blah blah etc.

And basically came up with checksum
sizes that were appropriate

for each block size that we support.

The whole way we do block incremental
is actually a pretty

complicated topic.

I did a talk on it once and it
was a complete disaster because

it was just too much.

Nikolay: You go too deep.

Truncating 32-bits you said, right?

Or truncating hashes you said,
right?

David: Oh yeah, so the 32-bits.

So what we do is we generate a
128-bit hash and then we use however

many bits out of that, well, bytes
actually, that we want.

The minimum number of bytes we'd
use for an 8k block is 6 bytes,

which is actually a lot.

That's 50% more than would be used
for 4k pages on ZFS, for instance,

or Btrfs.
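
The truncation described here, generating a 128-bit hash and keeping only the leading bytes, can be sketched in Python as below. This is an illustration only: BLAKE2b with a 16-byte digest is used as a stand-in because XXHash is not in the Python standard library, and the function name `block_checksum` is an assumption.

```python
import hashlib


def block_checksum(block, size=6):
    """Return the first `size` bytes of a 128-bit hash of `block`.

    BLAKE2b with a 16-byte digest stands in for the 128-bit XXHash
    mentioned above, which is not in the Python standard library.
    """
    full = hashlib.blake2b(block, digest_size=16).digest()  # 128 bits
    return full[:size]


page = bytes(8192)  # one 8k block of zeros
checksum = block_checksum(page)
bits = len(checksum) * 8
# For an ideal hash, the chance that a different block has the same checksum.
collision_odds = 2 ** -bits
```

Six bytes give 48 bits, so for an ideal hash the odds of a single accidental match are 1 in 2^48.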

So we go a little bit crazy on
the checksums because we just

want everything to be right.
And then when you're doing block

incremental, we actually have
2 levels of checksum. So we use

the block checksums to reconstruct
the file, and then we also

checksum the entire file and the
SHA-1 hash of the entire file

plus all the block checksums have
to match.

And the chances of that getting
a collision, you know, like a

piece of data that actually satisfies
the conditions, all the

block checksums matched and the
SHA-1 checksum of the whole file

matches.

We're talking heat death of the
universe probability here.

Right?

It's just simply not going to happen.

You're going to have disk corruption
a billion times before

anything like this ever fails,
as far as my math works out.

Michael: Well, it sounds reasonable.

Nikolay: It's a lot of interesting
topics, I feel.

And I like the direction you are
thinking, especially RPO control.

This is super cool thing because
I see the demand because Postgres

has been used in like critical
systems and people just don't

know what's happening there.

Like how, what's happening in the
case of disaster, right?

Like it's hard to answer simple
questions to business leaders.

David: Yeah.

It's not 1 size fits all.

Right.

I did.

You guys are familiar with ARIN,
right?

The internet number registry.

I know the people who run the database
over there, and they use

pgBackRest.

I worked with them for years in
another company, and they're

all RPO 0 stuff, right?

But their write volume is stupidly
low.

So they've got 1 primary, which
receives 100 writes an hour on

a really busy day, and then they
replicate that, and then you've

got the whole internet reading
to figure out what blocks are

allocated to who at any given point,
and where they should be

routed, that kind of thing.

So the read volume is pretty big,
the write volume is extremely

tiny, so they do synchronous replication
because they can't lose

anything and they have low enough
write volume they can get away

with it, so they would be a great
candidate for a pgBackRest

RPO 0 solution that allows them to
synchronously write to S3.

I bet they'd love it.

But for most people, that kind
of solution just isn't going to

work.

People ask me how I can work on
this software year after year.

And the reason is because the problems
are actually really interesting

and challenging and extraordinarily
complex in terms of the solutions

that we have to come up with to
solve these problems.

It's, it's, pgBackRest is a relatively
small project, but it's

dense.

The stuff that we do is...

Nikolay: Let me uncompress a little
bit.

You mentioned ARIN, it's American
Registry for Internet Numbers,

right?

David: Right.

That's the 1.

Nikolay: So it's basically the important
piece of internet, right?

David: Very important piece.

Yeah.

Very important piece.

Nikolay: It's interesting that you
consider like volumes are not

huge and the streaming replication
and so on.

It's interesting and challenging.

But I have cases where it's like
just not answered.

RPOs are not answered and extremely
important cases.

Important companies also, they
simply don't know.

So if you build this, I think it
will be good.

It will be useful, helpful, and
so on.

David: It's a tricky thing though,
to some extent, when you

put those tools in people's hands,
because of course there's

some level in the hierarchy that's
going to say, always RPO 0.

Of course, we have to have synchronous
replication.

Nikolay: Why would we ever do that?

We cannot afford data loss.

We cannot afford, we need to use
the synchronous replication

and so on.

And also we cannot, we don't have
split brains.

You mentioned Gabriele Bartolini,
right?

I also have my opinion about CloudNativePG
and solutions he

chose in this tool.

Yeah, but like at high level we
are great, but if you look inside

you see data loss is possible,
split brains are possible, almost

everywhere.

And it's really extremely hard
to achieve, and I like so much

that you focus on these topics like
RPO control, and especially corruption

and checksums everywhere.

It's great.

David: My whole job in life is
to protect the data, protect

everyone's data.

And that's what I think about all
the time.

Like how can pgBackRest be the
most reliable thing?

Even if it's not the fastest, or
I think it's generally the fastest,

but if it's not the most feature
rich, there are a couple of

features that other programs have
that we don't have, although

we're working on that.

But everything we introduce has
to be performant and ultimately

absolutely reliable. People trust
the software to protect their

data.

Now they have an obligation to,
they need to, I get people at

conferences that will come to me
and say, you saved my job, pgBackRest

saved my job.

And I'm like, I really appreciate
you saying that, but actually

you saved your job because you
set up backup, you tested restores,

and when the big day happened,
you were ready.

Nikolay: Yes, I set up restores.

David: Because they'll tell me,
oh yeah, we set up weekly restore

tests like you recommended.

And I'm like, okay, great.

I can recommend these things, but
a lot of people don't do it.

So if you actually go and do the
stuff that the backup people

recommend, kudos to you. You deserve
all the credit for saving

your job, not me. But don't get me
wrong, I still like hearing the

stories. It's great to know that
pgBackRest is useful and valued,

of course.

Nikolay: Untested backups are Schrödinger
backups.

They should not be considered proper
backups.

Yeah, backups must be tested.

David: I can't remember where
I found that, but I put that

on 1 of my very early talks, Schrödinger's
Backup.

It wasn't me, it wasn't original
to me, but I can't remember

where I saw it.

But as soon as I saw it, I was
like, yes.

Nikolay: It's an obvious idea, honestly.

Multiple people might invent it.

Anyway, way over time, I enjoyed
so much talking to you.

Final question, we discussed a
lot of technical detail, very

advanced, I must say.

I think it's okay if some people
don't get everything because

we definitely dived into multiple
areas very deep.

We also touched the situation about
sponsorship.

We haven't touched the situation
about maintainers.

What would help you to have a 2nd
big maintainer?

David: A 2nd big maintainer would
be good for a couple of reasons.

1, to help me write the big features.

These days, I'm the only 1 writing
big features and I'm just

1 person.

And the other thing is review.

So review is a big bottleneck for
me.

I can actually produce quite a
lot of code, but it's got to be

gone over carefully by somebody.

I also, I'm using LLMs now for
review as well.

So I'll ask Claude Code to review anything
I write before I send

it to any person just to catch
the really obvious stuff.

And these days the not so obvious
stuff is getting pretty good.

But I also need someone to bounce
ideas off of.

Another maintainer would be good
for that.

Someone who I could reliably chat
with and be like, so I was

thinking about X, Y, and Z and
I just, you know how it is when

you vocalize something, it becomes,
sometimes you don't even

need to get input from them.

Just by explaining the problem
to them, you, oh yeah, I know

what I need to do here.

It's so obvious to me now.

So that's why I really need another
maintainer.

We're working on that right now.

That should be, I don't wanna make
any announcements or say anything

in regard to this, but this is something
that's actively being worked

on to get another maybe not full-time
maintainer, but very active,

very involved maintainer.

Nikolay: That's, that would be great.

Yeah.

David: And to some extent, it's
hard to know how big we could

scale this project.

It's an interesting question, how many people
are interested.

Also how big are the problems we're
solving?

Can 2 people handle everything?

Probably.

Especially with tools these days,
helping.

We're getting more contributions
from the community.

Those are coming in a pretty regular
stream now so I don't have

to write everything.

I don't have to find all the bugs,
I don't have to fix all the

bugs, I don't have to do little
authentication tweaks for S3

or blah blah blah.

Most of that is coming from outside
contributors, and I just

review it, and if it's reasonable,
commit it.

So we're building community that
way as well, but if we can get

1 more person who's really looking
at it regularly, that would

be great.

And that would, of course, be in
addition to Stefan, with Data

Egret, who actually already spends
quite a bit of time on pgBackRest

as well, review and
testing, et cetera.

Nikolay: Cool.

Great.

Thank you.

I don't have any more questions.

I have, but I will post them, because
they are super technical

again.

Michael: David, thank you so much.

Thanks for joining us, but also
thanks for all the maintenance

you've done over the years.

David: Oh, absolutely.

It's definitely been my pleasure.

Michael: Congrats on getting all
the sponsors, and thanks to them

as well for keeping it going.

David: Yeah, like kudos to the
sponsors.

They have made this possible.

I'm just, I'm back to work.

I'm back to what I love doing and
everyone gets to benefit.

So I think this has worked out
really well.

So a really cool open source story.

Michael: Yeah, it is absolutely
good.

Great open source story.

Nikolay: Thank you.

Have a great week.

David: Yeah, thank you very much
for having me.

Creators and Guests

David Steele
Guest
Creator and maintainer of pgBackRest and pgAudit, Significant PostgreSQL Contributor
