pgBackRest
Michael: Hello and welcome to Postgres.FM,
a weekly show about
all things PostgreSQL.
I am Michael, founder of pgMustard
and I'm joined as always
by Nik, founder of PostgresAI.
Hey
Nik.
Nikolay: Hi Michael.
Michael: And today we have with
us David Steele, who is a significant
contributor to PostgreSQL and the
creator and maintainer of both
pgBackRest, which we'll be talking
about today, and pgaudit.
Welcome, David.
David: Thank you very much.
Good to be here, Nik and Michael.
Michael: It's a pleasure to have
you.
Right, to get started, I wondered
if you could give us a little
bit of history, a little bit of
the origin story, perhaps of
pgBackRest.
David: Sure.
That's easy enough.
Actually, pgBackRest was born
at the Dublin conference in 2013
of conversations that Stephen Frost
and Cynthia Shang and I and
Magnus Hagander and some others
were having. At that time, we
were working on a fairly large
database for the time, around
50TB.
That doesn't sound so big these
days, but in 2013, that was a
pretty significant database.
You know, basically we needed to
make backups, of course, as
you do.
And just the available tools were
not up to the task.
It was just simply too large.
And we really needed to be able to
do incrementals.
And there was nothing that would
do incrementals plus compression
at the same time, et cetera, et
cetera.
So, as you do in open source, we
decided that we would build our
own thing.
I remember originally it was going
to be a pretty simple project.
Write it in Perl, keep it simple,
just be one file, so it would
be easy to copy around and distribute,
etc.
Yeah, that didn't last very long.
Obviously it grew pretty quickly.
So I built it.
I built the initial software that
was usable in about 40 hours
and to basically solve our initial
problem and then just kept
building on it as we had problems
and bugs and other things I
would build on it and build on
it.
Convinced the company I was working
for, Resonate, which is an
ad tech company, to open source
it.
And then when I left there, I kept
noodling around at it.
I took some time off and just kept
working on it.
Got the restore functionality working
well.
And then that's when I got hired
into Crunchy.
At first they weren't super interested
in it but Stephen was.
So I got a little bit of time to
work on it and then eventually
got more time to work on it and
then eventually we hired Cynthia
to come and work on it with me
and that went on for a while. And
then finally we decided to migrate
it to C, and we did the whole
C migration, which was 2 years. So
that was painful because we
basically didn't write
any new features for 2 years.
We fixed bugs only. And we could
only write a new feature if it
lived entirely in the C code.
And over time, it was more and
more possible for that to happen,
but it was a tough migration.
Two test suites, two of them. It was
just... I didn't even want to
think about it.
It was not a lot of fun.
And that's basically the project
as it is today.
Now it's, of course, we're rolling
along.
It's a C project.
We have features based on user
demand, based on chasing performance.
Performance is always the big thing.
Like how do we move a bunch of
data quickly?
How do we do it efficiently?
How do we keep the repo as small
as possible, block incremental
backups, the list goes on and on.
Nikolay: I have a question here about
history.
Maybe you remember those discussions.
Remember the times of Slony and
then Londiste, and there was a
discussion:
should Postgres have replication
inside?
And there are many such discussions,
same for auto failover and
so on.
And for the case of replication,
the idea to have it in core
won.
For the case of auto failover, it
lost.
Still auto failover is outside.
I'm very curious what you think
about the idea to have a full-fledged
backup solution inside core and
why it's not happening.
David: One interesting historical
fact is, and one of the reasons
why some of the other committers
were involved in really early
planning, and part of the reason
for the migration to C was the
idea was that we could actually
maybe make pgBackRest the core
solution for backup.
Now, this is not something that
was endorsed by core broadly;
it was just discussions that
I had with some committers, because
they were interested in having
a more comprehensive solution
in core.
Obviously, they wanted it to be
written in C.
And so that was one of the reasons
that drove the adoption of C.
As for actually having something
like pgBackRest in core, it's
a little tricky because the pgBackRest
project moves a lot faster
than core does. So to have it on
the yearly cycle would be... But
let's say we had put pgBackRest
into core, like from the beginning,
maybe as soon as it was migrated
to C.
It would not be nearly as far along
as it is now. At the time
we were doing 12 releases a year,
we went down to 6, and now
we're currently at 4, which I think
is just about the right tempo
for a project of this type.
But being able to release features
4 times a year, get new stuff
out there, get people trying it,
get people testing it, that
kind of stuff.
I think it's pretty important.
And as pgBackRest gets more stable,
maybe it would be more appropriate.
But at the same time, the project
has diverged significantly
from... You know, we use a lot of
the same concepts as Postgres:
MemContext, and error handling
looks very similar, et cetera,
et cetera.
But it's all pretty different,
too. pgBackRest grew its own way.
So it'd be very difficult to get
in there.
I think the idea is pg_basebackup
was going to be that tool.
Getting incremental backups into
pg_basebackup was a big step,
but it's still not a complete tool.
You have to have tooling on top
of pg_basebackup for it to
be at all usable, especially for
incremental, because reconstructing
all those incrementals, going and
fetching them and uncompressing
them and getting them all ready
for the pg_combinebackup tool
to run.
That's a significant amount of
work and obviously it doesn't
do anything with WAL archiving,
expiration, there's just the
list of all the things that you
need to do and it's a pretty
big list and it's intimidating,
honestly.
You know how core goes, right?
It's intimidating to even contemplate
getting something like
that into core.
So I think it would be a good idea,
but also, think about storage
drivers.
pg_basebackup supports
POSIX, right?
But most backup tools support S3,
GCS, Azure, SFTP, etc.
Because those are the tools that
people are actually using.
So all that would have to go into
core as well.
All the storage drivers, people
would have to...
Nikolay: Maybe not everything could
go.
It's possible, Postgres is very
extensible.
If just the core thing would go
to the core, but expose some
interfaces, particular drivers
could stay outside.
David: And in theory, that's
what we've done a bit with pg_basebackup,
but the amount of stuff that needs
to be done by the outside
tool is huge.
Now there are quite a few tools
that are based on pg_basebackup
to do their page level slash block
level incrementals.
Barman, obviously, is the most
well-known one.
I think pgmoneta is one of them.
There's a newer one that I saw recently
that's also using it.
So you can certainly build a tool
around pg_basebackup.
And I think the idea would be that
you'd extend and extend pg_basebackup,
and then people would take that
feature out of their tool.
And eventually maybe pg_basebackup
would be a thing that does
everything.
But at the pace that it's actually
evolving, I would expect that
to be able to reach feature parity
with something like pgBackRest
or WAL-G or Barman in approximately
30 years.
I'm exaggerating a little bit,
but if I look at the actual progress
that pg_basebackup has made over
the years, that's where we are.
Since that incremental feature landed,
nothing's been done with it, even though
there are pretty huge performance
implications for restores.
To restore any very large database
with that page incremental
format, you basically need, let's
say you've got a database
that's a terabyte, you need at
least 2 terabytes to do the restore
minimum, and it depends on how
many incrementals you have, so
you could need 2, 3 terabytes to
do the restore.
All the files need to be copied
down regardless of whether those
blocks are used or not.
Everything needs to be uncompressed
and then fed into pg_combinebackup
which then rewrites everything
to a different location.
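The space arithmetic described here can be put into a few lines. This is only a rough sketch of the reasoning in the conversation, not a description of how pg_combinebackup accounts for space: the function name is hypothetical, sizes are uncompressed gigabytes, and it assumes the combined output is about the size of the full backup.

```python
def restore_space_needed(full_gb, incremental_gbs):
    """Rough peak disk space (GB) for a restore that first stages every
    backup uncompressed and then rewrites the result to a new location."""
    # Everything is copied down and uncompressed, used or not.
    staged = full_gb + sum(incremental_gbs)
    # Assumption: the combined cluster is about as large as the full backup.
    output = full_gb
    return staged + output


print(restore_space_needed(1000, []))          # full backup only
print(restore_space_needed(1000, [250, 250]))  # full backup plus two incrementals
```

With a 1 TB database this gives the "at least 2 terabytes" floor mentioned above, and the total grows with each incremental that has to be staged.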
So from a scalability standpoint,
it has a pretty serious problem.
And we're 2 versions on from it
being introduced and no one has
even thought about actually
addressing any of those
issues.
I don't want to harp on this too
much, but the main point is
getting stuff into core is hard.
It takes me years.
I've been working on a very small
change for backup just to mark
pg_control when a backup label
is required.
So you do a backup, a backup label
is required to do the recovery.
If the user deletes the backup
label, they can end up with corruption,
silent corruption.
It's quite annoying actually.
And I've seen people who know Postgres
really well, hackers, do
this and not understand what happened.
Nikolay: I saw it many times already
in various teams.
It's annoying.
David: So 2 years ago I introduced
a small patch for Postgres
to just put a flag in pg_control
to mark it when we actually
need a backup label.
So if you start Postgres and backup
label's not there, it will
just stop and say, no, you must
have backup label.
Please provide.
That was 2 years ago.
It goes through review occasionally,
but really nothing there.
In pgBackRest, we actually implemented
that about 3 years ago.
What we do, this is a bit of a
hack but it works pretty well,
what we do is we overwrite the
last checkpoint in pg_control,
we write in the hex value DEAD.
So we know if we get a report from
a user, because Postgres will
come up and say it's unable to
find checkpoint DEAD.
And that's actually not a valid
checkpoint at all because it's
under the 1st WAL segment limit.
So, Postgres will mark it as invalid,
it will throw an error,
and then when we get a report from
the user we immediately know:
hey, you tried to start this and you
deleted the backup label. And they're
like, yeah, I deleted the backup label.
And boom, there you go.
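A small sketch of why a checkpoint location of DEAD stands out. It assumes the default 16 MiB WAL segment size and that the first real WAL segment is segment 1, so any location smaller than one segment cannot be genuine. The constant and function names are hypothetical, not pgBackRest or Postgres identifiers.

```python
WAL_SEGMENT_SIZE = 16 * 1024 * 1024  # assumption: default 16 MiB segments
INVALID_CHECKPOINT = 0xDEAD          # marker value mentioned in the conversation


def format_lsn(lsn):
    """Render a 64-bit WAL location in the usual high/low hex form."""
    return f"{lsn >> 32:X}/{lsn & 0xFFFFFFFF:X}"


def is_before_first_segment(lsn, segment_size=WAL_SEGMENT_SIZE):
    """True when the location falls below the start of WAL segment 1."""
    return lsn < segment_size


print(format_lsn(INVALID_CHECKPOINT), is_before_first_segment(INVALID_CHECKPOINT))
```

Because the marker is both impossible as a real location and easy to spot in an error message, a single log line is enough to tell that the backup label was removed.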
But I wrote that patch 2 years
ago and I'm still hoping to get
that into Postgres.
So that's like the speed at which
Postgres can operate sometimes,
especially for people like me who
aren't really committers.
So can you imagine trying to get
something the size of pgBackRest
or complexity of pgBackRest into
Postgres?
Nikolay: Yeah, I wish we could highlight
this, put a link to this
patch and maybe drive some attention
to it because some people
who participate in Postgres hacking
listen to us and maybe...
I agree it's super annoying and
right now all the other tools,
they need to...
Basically, if you create a full copy,
a full backup on a replica,
you are responsible for placing
this backup label yourself.
And it's also how restore works
in Postgres.
I fixed this recently, and this fix
was very quick because it was
obvious.
It was just a problem in code and
Postgres in the restore path.
How Postgres is working with the
backup label and also signal
files.
If you look how people use it,
they see some error in logs and
then, okay, I will just delete
some file, right?
To get it going. And this is what's
happening all the time, right?
But also I noticed pgBackRest
requires, maybe because of
this mechanism you explained, well,
when making a backup on a replica
it requires a connection to the primary.
And this surprised me a lot
recently when I started to code,
right?
David: Yeah, that actually doesn't
have anything to do with
that specifically.
The reason why we did that is you
get a better backup that way.
Nikolay: Better backup.
Not all backups are equal.
David: You get a better backup.
First of all, there are some things
about backup from standby.
I don't necessarily want to get
into that because it's pretty
esoteric, but there's some things
about backups from standby
that I don't 100% trust.
But I do believe it works.
The reason why a primary backup
is better, though, is you get
a couple things.
You get stats from the primary,
which is actually pretty nice.
If you do a backup from standby
and you restore, you've got stats
from, say, a standby.
And if you're actually doing analysis
of statistics, it's pretty
shocking to suddenly have
all of your index patterns, scan
patterns, read patterns fully change,
everything.
So I'm not talking about, sorry,
table statistics, but basically
scan statistics.
Maybe I'm using the wrong word.
Michael: Shoot.
No, I think I understand:
number of seq scans on a
specific...
David: Yeah, I think they're
both stats,
but I'm not talking about
the planner statistics.
I'm talking about like the actual
usage stats.
Another thing is if you're doing
a backup on the standby, you
can't actually finish the backup
and verify that all the WAL
has reached the archive.
You can, but you might have to
wait a day for that to happen
or whatever.
We feel it's really important to
make sure that this backup...
When we mark the backup as done,
we want it to be done.
Not hoping that someday this WAL
archive is going to arrive.
We want it to be done at that moment.
So that's pretty important.
Nikolay: Do you mean the backup is
self-sufficient?
It positively affects RTO, right?
Like recovery time objective.
It reduces the time needed to recover.
David: Well, at the very least,
we want to know that the backup
can be recovered to consistency.
Right.
Now, if you want to do point-in-time
recovery, that's going to
be after the backup finishes. You'll
continue WAL archiving, and
you need to monitor that as well.
But at the point where the backup
finishes, we want that to definitely
be true.
Other things like getting actual
logs from the primary are more
interesting than getting logs from
a standby.
Although I don't really recommend
that people put logs in their
PGDATA directory at all.
It's actually quite common.
Nikolay: Hold on one second.
Too many things here.
First of all, when you say logs,
it's WAL files, right?
David: No, sorry, in this case
I do mean logs.
So like basically textual logs
that are generated by Postgres,
and a lot of people will put those
in PGDATA.
Nikolay: Okay, I didn't get it.
Why do we care about logs when
making backup?
Ah, because we don't want them
to be backed up, right?
David: We do if you have them
in...
So if people are putting their
logs in PGDATA and they expect
them to be from the primary, it
can be surprising when they're
logs from a standby or something
like that.
I see.
They're not seeing the information
that they expect to get.
Maybe they have auditing turned
on in the primary and not on
the standby.
You recover the backup and now
all your audit records are gone,
etc.
So now I'm not a proponent of putting
logs in PGDATA on the
primary, but there are people who
do this and they want to preserve
those logs.
That's another reason.
So there's a whole raft of reasons,
but basically, so what we
do is we copy everything that's
replicated from the standby.
Oh, the other reason why we really
wanted to do this, although
honestly we've never implemented
it, is once you've done this,
you can actually parallelize a
backup across all available standbys.
Because you're coordinating everything
on the primary, you wait
for the standbys to reach the checkpoint
where the backup started,
which we already do, of course.
We do a bunch of consistency checks.
And then you would be able to parallelize
your backup across
all the available standbys and
really supercharge it.
And it's not even that hard to
do.
We just haven't really... There hasn't
been demand for it.
And there's always a million things
to write.
You know how it is.
So we just haven't gotten to it
yet.
But that was another big reason
to do it that way, because it
allows us to parallelize backup
in a way that otherwise wouldn't
be possible if you're just backing
up from a single standby.
Nikolay: There is a dilemma here
where, first of all, making a full
backup on the primary, especially if
it's full, not incremental,
is a huge stress for disks.
David: Yeah, you don't want that.
Nikolay: Yeah.
So usually people offload it to
some standby, and then the problem
is also which standby, because
it's also stress.
If it's a read replica, it's stress
for those reads as well.
So distributing totally makes sense.
But also, since you talk and focus
a lot on corruption and consistency,
I was always curious about the case
where we, for example, have a standby
allocated only for backups. I think
Crunchy Bridge had an issue, we
had a customer there, when they made
full backups on the primary.
I was super shocked, like,
how come? It was fixed then,
and I think full backups went to
the HA replica, which is for HA.
It's allocated, it doesn't receive
reads.
But then I'm curious, if some corruption
happens... There are many kinds
of corruption, right?
I can think that, at least theoretically,
some corruption might
be only on one node, but not on others.
And if you always make backups
from only one node and you don't
notice, this corruption propagates
to all backups.
So some rotation at least is good,
but combining and parallelization,
it's also interesting.
I'm just thinking, in SQL Server,
for example, there is a mechanism
to self-heal: if, for example, the
primary notices some pages are
corrupted, it can grab them from
replicas, which was implemented
long ago.
This is cool technology.
I'm curious about this dilemma.
We also want to distribute load, not
to hurt our user traffic, but
also, what's happening with corruption?
This is a Pandora's box of topics.
David: That'd be an interesting
idea actually, too.
If you were doing backups from
multiple standbys, you could compare
and contrast.
If there was corruption on one standby,
you could say, hey, why
don't I try grabbing this file
from the other standby instead:
fail it, send it back to the main
process, which would reissue
that job on a different standby.
That's a pretty cool idea.
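The retry idea floated here could look roughly like the sketch below. It is purely illustrative, since the conversation makes clear that backing up from multiple standbys is not implemented: `read_file` and `is_valid` are hypothetical callbacks standing in for "copy this file from that standby" and "the copy passes its checks".

```python
def fetch_from_standbys(file_name, standbys, read_file, is_valid):
    """Try each standby in turn and return (standby, data) for the first
    copy that passes validation; fail only if every standby's copy is bad."""
    for standby in standbys:
        data = read_file(standby, file_name)
        if is_valid(data):
            return standby, data
        # Validation failed: fall through and reissue the job elsewhere.
    raise RuntimeError(f"no standby returned a valid copy of {file_name}")


# Toy usage: the first standby holds a damaged copy, the second a good one.
copies = {"standby1": b"corrupt", "standby2": b"good"}
result = fetch_from_standbys(
    "base/1/1234",
    ["standby1", "standby2"],
    read_file=lambda standby, name: copies[standby],
    is_valid=lambda data: data == b"good",
)
print(result)
```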
Nikolay: You talk about corruption
at the data checksums level,
but there may be a higher level
of corruption, like
index level, logical level,
foreign keys, a lot of
stuff.
David: Oh sure, and your corruption
scenarios on primaries and
standbys are quite different too,
because the way things are being
written out on a primary is actually
fairly different to the
way it's being written out on a
standby.
Also the standby is essentially
a single threaded operation.
Let's not
Nikolay: go there,
David: it's terrible.
Yeah, obviously it's terrible for
performance. But from the standpoint
of any issues there might be with
locking and parallelism and
other things on a primary, the
standby is not going to see those
bugs, probably.
And so there are whole classes
of bugs that can happen on a primary
that could cause corruption that
aren't going to happen on a
standby.
So from that perspective, it might
actually be safer to do backups
on a standby as well, just because
it's a simpler path.
The data goes through a much simpler
path to get to disk on the
standby than it does on the primary,
as a rule, unless you really
have 1 writer on the primary, which
is not the way things tend
to run these days, not for interesting
databases at least.
So yeah.
And as you say, the corruption
we're detecting is only checksum
corruption.
There are some projects out there
where people have combined
pgBackRest with amcheck.
So basically when it does recovery,
like basically you call this
command, it automatically integrates
pgBackRest with some sanity
checks afterwards and does amcheck
and some other things to check
a higher level of consistency than
just the page checksums.
Nikolay: Yeah, when you do recovery,
it's usually a good point for that.
This is what we usually do.
Not all recovery attempts, but
some of them, a percentage of them,
should also check index health.
David: This is really useful
in test recovery.
So hopefully you're testing your
recovery, right?
And that's a great time to do amcheck
because then you can have
confidence in your emergency
hair-on-fire recovery that it actually
does work because you practiced
it, you ran amcheck on all the
practice runs.
So at that point you should feel
pretty confident to run the
production restore without having
to take the penalty of running
amcheck at that time, if you've
been properly testing.
Nikolay: I agree with everything,
but you cannot do it on all runs
because full-fledged amcheck takes
a lot of time.
David: Yeah, that's why I say
it's really for test restore,
right?
So you're testing things and it's
tricky, right?
Okay, I guess you can say amcheck
takes 2 days, so we're going
to do this once a week.
And then we'll do more recovery
testing, but only run amcheck
once a week.
I think that's a perfectly valid
way to go about it, to think
about it.
Absolutely.
Nikolay: I'm also very curious, did
you think about some more observability
capabilities in pgBackRest?
My final goal is, for example,
I have 100 clusters.
I just want to know actual RPO,
RTO all the time, measured somehow,
during those restore tests, during
the backup process itself.
And just to have some high-level
view, this is what's happening.
That's it.
Now you need to build a lot to
have that.
David: RPO you can measure using
the information that pgBackRest
provides in the info command.
Because it'll tell you
how up to date
you are on your WAL segments,
right?
So that's going to give you your
RPO.
So I'm 1 segment behind, 10 segments
behind, 5 segments behind.
So at least you know what your
recovery point is based on the
information that you get.
RTO is a much trickier scenario
though.
Nikolay: RPO is also tricky because
you think like a technical guy
and I'm also a technical guy.
We think about bytes.
Business guys think about seconds,
hours of data loss.
How many hours of data were lost?
And this you need to translate
somehow.
David: It's absolutely fair, and
you're not going to be able
to get that timing exactly.
You could use, for instance,
our repo-ls command to go and grab
information about the WAL
and find out how much WAL you're
generating per second.
It's not going to give you exactly
the time, but it's going to
be awfully close.
So you could estimate, you're like,
OK, we're generating 5 WAL
per minute.
We're 2 WAL behind.
And therefore, our recovery point
is this many bytes behind and
this many seconds behind.
So you would actually be able to
estimate that reasonably well
using just the tools that are
available in pgBackRest.
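The estimate described here is simple arithmetic, sketched below. The function name is hypothetical, and the 16 MiB segment size is an assumption (the Postgres default); the segment counts and rate would come from whatever you collect from the repository.

```python
def estimate_rpo(segments_behind, segments_per_minute,
                 segment_size=16 * 1024 * 1024):
    """Turn 'N WAL segments behind' into approximate bytes and seconds,
    given the observed WAL generation rate in segments per minute."""
    bytes_behind = segments_behind * segment_size
    seconds_behind = segments_behind * 60 / segments_per_minute
    return bytes_behind, seconds_behind


# The example from the conversation: generating 5 WAL per minute, 2 WAL behind.
estimate = estimate_rpo(segments_behind=2, segments_per_minute=5)
print(estimate)
```

As noted in the conversation, this is an approximation: it assumes WAL is generated at a steady rate over the period being measured.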
Although actually estimating the
time of the WAL, you would
have to roll your own for that
because that would have to be
something you do for repo.
So the nice thing about repo-ls
is it's a command we give you
and it works on any repository
whether you're on S3, GCS, SFTP,
POSIX.
So you can use that tool to get
information about the repository
that's not included in our JSON
info and it will work in any
environment, on any repository,
on any storage.
So you don't have to muck about
with AWS CLI over here and the
Azure CLI over here and whatever
the GCS equivalent is and et
cetera, et cetera.
And then, okay, so you could sort
that out, I think, with the
tools that are available with pgBackRest,
but RTO is still a
thing that needs to be tested.
We can't tell you RTO at backup
time.
And since Postgres is famously
slow for recovery... Maybe not slow,
but certainly things don't get
written as fast as they get written
on the primary.
That much we can agree on.
Recovery actually, it goes pretty
well.
And Postgres has gotten pretty
good at things.
pgBackRest prefetches WAL segments.
All this combined actually
gives you pretty
good performance overall.
But again, you're going to have
to test it to know how long it
takes you to do recovery.
Fun little example: there have
been, and this hasn't happened
just one time, but the first time
it happened was in the early days of
pgBackRest, because our thing has
always been high volume performance,
et cetera.
So there are a number of people
over the years who have used
pgBackRest because they can't use
replication.
Because replication simply will
not keep up with their load ever.
So what they do is they basically
continuously take and restore
backups.
And that way, they can measure
the RTO pretty well, actually,
because they're constantly doing
these recoveries.
And they get to a point where they...
And then we'll see how long it
takes to get to the point where
they were when they did the restore,
right?
Because they can never actually
catch up, but they know how long
it's going to take.
If the primary fails, how long
will it be until the standby is
up to date using WAL from the archive?
A downtime will be this.
And they know it exactly because
they do it every single day.
And this is a use case that I've
seen multiple times now.
People have brought it up with
me.
It's a really interesting use case.
Nikolay: So they are bumping into
this 100% single-CPU
problem of the startup process on replicas.
So that means also they live without
replicas.
With these replicas, they cannot
use them because they are lagging
basically, right?
David: Yeah, they're pretty far
behind.
So they really are just there to...
I mean, obviously you can make
up scenarios where a replica like
this could be used.
Let's say you're doing reporting
and you know your replica lag
is 6 hours, then you know by 8
AM you can start running reports,
that kind of thing.
So you can game it.
But you're right, it severely reduces
the usability of the replica
in that scenario.
But they just need something because
they can't start from...
They're just trying to minimize
the time it takes for them to
get to running again.
Because if they start from a restore
of their 50TB backup, that's
going to be however long that takes.
pgBackRest is pretty fast, but
50 terabytes is 50 terabytes.
And then they start recovery.
Nikolay: 1 hour.
I showed an article: 37 terabytes
per hour.
We can do it in 1 hour, but it's local
NVMe to local NVMe.
It's very different.
David: Exactly.
As usual, the backup is, you know...
I've gotten a lot of flak
over time about checksums. We do a
lot of checksumming and we use
pretty heavy-duty checksums, and
people are like, oh, those checksums
are so expensive, how can you stand
how expensive they are?
They're not expensive once you
actually start pushing stuff.
If you're writing to your local
SSD, then yeah, the checksum
overhead is 5% in pgBackRest.
I've measured it.
We have tests for this.
So yeah, it looks like a lot.
But then you actually start pushing
stuff out to S3 and poof!
It just disappears.
You don't even see it anymore.
The trick is getting this kind
of volume in a realistic environment.
People don't store their backups
locally.
At least they shouldn't be.
That's the message we try to get.
Nikolay: I should repeat the tests
with S3.
I'm very curious about the level
of parallelization we can get.
Of course, the machine
where we recover
should have local NVMes.
I'm very curious.
David: We have a...
Yeah, sorry.
Let me just address that really
quickly.
We do have a good, interesting
optimization coming, hopefully
in the next release, if I can get
it reviewed by July 10th;
that's our feature freeze.
Basically it's
adding prefetch for object
stores and something I call...
I'm sure other people do this, but
I haven't really figured out what
it should be called, so I've
been calling it readover, hoping
someone will come up with a better
name.
But a lot of times we're reading
through, say, a bundle and we
need n number of files or n-m number
of files, and so we're basically
starting a bunch of new reads. So
we'll grab these 3 files and
then there's a file we're skipping,
so we'll start a new read,
grab 4 files, and then we have to
skip 2 files, so we'll start
a new read, and so on. So
we have
this thing now called readover,
which is also under review, and
it will skip over a configurable
block of bytes.
So if we're reading through the
file and suddenly... And it's configured
by default to 64K.
So if we have this one small file
that's breaking up the read that
we're doing, we'll just read it
and throw it away and then continue.
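One way to picture the readover idea is as range coalescing: if the gap between two wanted byte ranges is small enough, keep reading through it rather than starting a new request. The sketch below is not pgBackRest code. The function name and the (offset, size) representation are hypothetical, and only the 64K default comes from the conversation.

```python
def plan_reads(wanted, read_over=64 * 1024):
    """Merge wanted (offset, size) ranges into as few reads as possible,
    reading through (and discarding) gaps of at most `read_over` bytes."""
    reads = []
    for offset, size in sorted(wanted):
        if reads:
            last_offset, last_size = reads[-1]
            last_end = last_offset + last_size
            if offset - last_end <= read_over:
                # Small gap: extend the current read instead of starting a new one.
                new_end = max(last_end, offset + size)
                reads[-1] = (last_offset, new_end - last_offset)
                continue
        reads.append((offset, size))
    return reads


# Three files close together (one 10,000-byte gap) and one far away.
wanted = [(0, 100_000), (100_000, 50_000), (160_000, 10_000), (1_000_000, 5_000)]
plan = plan_reads(wanted)
print(plan)
```

Here the small gap is read and thrown away, so four wanted ranges become two requests; the distant range still gets its own read because skipping that far would cost more than a new request.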
And I picked 64K because, based
on my testing,
it's cheaper to read 64K pretty
much in all scenarios, no
matter what your storage type is,
whether it's reduced redundancy
or, call it, the longer-term storage
classes and stuff; I can't think
of the name right now.
And so 64k is basically always
a win.
Even if you're paying for egress,
which you will be in some cases,
you're still going to pay less
if you just read the 64K.
And from a latency standpoint,
it's always a win.
Because starting a new read is
just expensive in terms of time.
And there's actually a cost to
starting a read as well.
Once you've actually got a file
open and you're reading from normal
S3 storage, you're not paying for
that egress. But if you actually
start a new request, you do pay
for that new request. Anyway, some
interesting features that would
make the tests that you were
just talking about very
interesting to run. It would be a
very different scenario than pgBackRest
today.
Nikolay: Longer-term storage,
Glacier, right?
David: Yeah, no, I'm not thinking
of Glacier.
I'm thinking of the 1 in between.
I actually set all my buckets to
transition to it automatically
after 30 days.
So just transition to that.
You can still read it instantly,
but it costs.
If you go and read it, it costs.
And people have been
wanting to add this as a feature
where you can go into that storage
class immediately for all backups
and WAL.
Nikolay: Yeah.
Standard IA.
David: IA, thank you.
Infrequent access, that's the 1.
And it's actually a bad idea to
write IA initially, because in
a healthy repo, you're actually
reading the repo.
And as soon as you read 1 backup
out of that repo, you've destroyed
all the advantages of IA.
It's all gone.
The idea of IA is I've got an older
backup from 6 months ago
that I'm holding for compliance
and now suddenly I need to recover
it.
Great.
You do that once every 2 years.
Now you're using IA correctly,
but if you're writing all your
backups into IA initially, you'll
just end up spending more money.
The only scenario where it works
out is where you write and never
ever do any recovery or any WAL
recovery ever, which is just
not a use case to me.
Nikolay: Back to these cases you
are familiar with,
I'm very curious how much
WAL data was written per second
in those cases where the problem
was that they cannot afford
replicas with near 0 lag,
right?
David: Yeah.
Nikolay: Some hundreds of megabytes
per second, I suspect.
David: It can be.
It really depends on what you're
doing.
Let's say the simplest
possible case is you're replaying
sequential writes on an unindexed
table.
Right?
That's your simplest possible case.
That'll run like gangbusters and
that'll definitely go into hundreds
of megabytes or gigabytes.
Maybe not per second, but hundreds
of megabytes per second I think
becomes pretty reasonable at that
point.
There's still some, still quite
a bit of exchange to get the
WAL segments, move them, rename
them, do this kind of stuff,
but it's quite fast.
But as soon as you start doing
interesting things like say updating
indexes, then your write volume
is going to go down significantly.
Then it's all going to be I/O latency.
So the CPU will drop below 100,
you'll go into I/O wait, and you're
going to be looking at latency.
So if your CPU actually is 100,
you're probably doing really
simple writes, generally speaking.
And you can maintain a really high
throughput if you're doing
that.
If your CPU drops, if you've got
complicated index writes and
stuff like that, your CPU will
generally not be 100 anymore.
And you'll be into I/O wait instead.
So I played around with this in
a bunch of different scenarios,
trying to figure out ways to help
Postgres and maybe game Postgres.
Certain scenarios are just slow.
Nikolay: It makes total sense because
if it's like simple, it's
I/O bound and if on primary, we
have hundreds of cores doing this
work, but on the back we have only
1 trying to catch up to do
the same work, basically, logically.
It's terrible.
Yeah, that makes sense.
Simple work.
Yeah.
David: And unfortunately, all
the cool work that Andres et
al have been doing on async I/O,
Tomas, and et cetera, it's
not going to help that scenario
very much on the standby.
It's great on the primary, but
it just makes a situation where
the primary can write even more
data than the standby can ever
possibly keep up with.
Again, because single threaded,
so those async operations are
only going to buy you so much.
Direct I/O and stuff obviously
will help.
So there are improvements being
made in the single threaded recovery.
So make no mistake, I'm not
saying there aren't, because
there definitely are, but they're
not keeping pace with the improvements
that are being made on the primary
side, which can be an order
of magnitude greater than what's
happening on the standby.
Nikolay: Yeah, exactly.
I used, I think, Intel machines
in this experiment where I achieved
30-something terabytes per hour
with pgBackRest, and I compared
it to pg_basebackup, and I expected
200-300 megabytes per second
as usual.
But since it was Postgres 18, I
saw 1 gigabyte per second.
I was super surprised.
David: It's actually gotten better.
I remember the article.
Yeah.
And that jibes with what I've
been seeing in terms of overall
I/O improvements
in Postgres.
So they're impacting everything,
make no mistake, but obviously
that recovery bottleneck is still
there.
And there's not much that pgBackRest
can do about that, except
just make sure that Postgres always
has the WAL.
As soon as Postgres requests a
WAL segment, we've got it right
there waiting for it.
And we've got recommendations on how
to set up your storage so that
we can do a move into the pg_wal
directory so it's as fast as
it can possibly be.
We're not doing a copy.
We don't fsync things, we don't
et cetera, et cetera.
So we're trying to run everything
as fast as possible, but there
are limits.
Nikolay: This makes total sense and
it's great insight.
Like improving things around, you
might highlight some difficult
problems and they might become
even more acute, right?
So everything improved, so you have
better performance, and you hit
the problem with lagging
replication much faster.
David: Yeah, there was definitely
a time before pgBackRest and
other parallel WAL implementations
when getting the WAL
was the bottleneck and Postgres
kind of merrily sailed along
and did its thing and it was just
WAL that was the issue.
Now it's usually Postgres.
Another place this shows up, and
people have noticed and started
pointing it out to me, although I
already knew it, is that if you're doing
log shipping, you can do recovery
much faster than if you are doing
replication.
Because log shipping compressed
segments, which are prefetched
and decompressed asynchronously
and then moved into the pg_wal
directory is quite a bit faster
than streaming uncompressed WAL
segments over the network from
the primary basically every day
of the week if you have high volume.
So that really raises the question:
I've brought up
a standby and I want to do recovery.
How do I do the best thing, where
I log ship
as long as I can, because that's
faster and also puts less load
on the primary, and then only switch
over to getting data from
the primary when I have to?
At some point you switch back and
stuff like that.
Postgres doesn't currently manage
that very well and the rules
are a little bit arcane.
So some archivers, we haven't really
done this in pgBackRest
yet, but there's certain things
you can do to basically, you
can artificially give an error
to Postgres to make it switch
back over to replication.
Do you get what I'm saying?
So let's say we get to pretty much
the end.
Nikolay: You're moving very fast.
So you're talking about replica
where both restore_command and
primary_conninfo are configured,
right?
Yes.
And then you say Postgres has a
very strict precedence rule.
I think it searches local pg_wal
directory first, then it uses
streaming replication and only
then it uses restore_command,
if I'm not mistaken.
David: I believe that's the way
it works, yeah.
Nikolay: And then you say you have
a way to fool Postgres to achieve
what you want, and what you want
is to get WALs from archive
because they are compressed and
you can parallelize it at the
restore_command level and only switch
to primary_conninfo, to streaming
replication, in some cases.
Right?
And then you do something with some
error, and this part I don't get.
I can imagine throwing an error
from a restore_command, I cannot
imagine anything at primary_conninfo,
that's a problem.
David: The situation is, let's
say you are, you've been doing
log shipping because that's the
fastest thing to do, so you're
getting WAL segments.
Now, Postgres, if it's basically
decided on log shipping, isn't
going to switch away from that
until something happens.
The idea is that the actual
WAL get routine knows when
it's at the end of what it has.
So let's say Postgres, so you get
to the end and now the next
WAL segment arrives 2 minutes
later.
Now you've essentially built in
2 minutes of lag because Postgres
will patiently wait 2 minutes for
that next WAL segment to arrive
as long as it doesn't get an error.
So what you, in this situation,
what you can do is actually inject
an error into the archive-get to
say, I know I'm at the end of
the archive stream in the repo
because I can see it.
I'll throw an error to force Postgres
back to replication.
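The error-injection idea David describes can be sketched in a few
lines. This is only an illustration, not how pgBackRest's archive-get
works: the function name fetch_segment and the directory-based
"archive" are assumptions made up for the example.

```python
import shutil
from pathlib import Path


def fetch_segment(archive_dir: Path, segment: str, dest: Path) -> int:
    """Return 0 if the segment was copied to dest, 1 if the archive lacks it."""
    source = archive_dir / segment
    if not source.is_file():
        # Hypothetical policy: we can see that we are at the end of the
        # archive, so fail right away instead of waiting for the next
        # segment to arrive.
        return 1
    shutil.copyfile(source, dest)
    return 0
```

Postgres passes the segment name and destination path to
restore_command as %f and %p and treats a non-zero exit status as
"segment not available", which is what gives a standby with
primary_conninfo set the chance to try streaming instead.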
Nikolay: Yeah, streaming replication.
I think I got it wrong.
I think restore_command has precedence
over streaming replication.
David: I think it has.
Nikolay: I always forget that.
But the recipe you described
implies that.
But it's an interesting recipe.
David: And a lot of times, even
if the streaming replication
were prioritized, let's say you
recover, restored a backup and
you need to get WAL from 3 days
ago to start doing recovery.
There's a very good chance that's
not going to be on the primary.
So you're going to, you're going
to go over to log shipping at
that point anyway.
But the other thing you would want
to do is, let's say you start
lagging too badly on the primary,
ideally Postgres would go back
to log shipping because it's quite
a lot faster.
Historically that wasn't true,
but with modern archivers, like
something you'd find in pgBackRest
or WAL-G, we can do that
a lot faster than replication.
So in theory, if you're lagging
behind enough on replication,
you switch back to log shipping,
catch up, then go back to replication
in order to have as little lag
as possible.
So the question is, how do you
coordinate?
On the pgBackRest side, we can
game it to force Postgres
to switch to streaming rep, but
we can't really get it to go
back to log shipping.
Now, these are discussions that
have been had.
I don't know if we've...
I can't remember if we've had them
on hackers, but it's something
we've certainly talked about at
conferences.
Nikolay: Yeah, that's an interesting
idea.
It would be great to have more
control here and more flexible
configuration.
David: Some kind of...
Nikolay: Sorry.
Go on.
We have huge delay because we are
on very different parts of
the world, right?
3 very different parts of the world.
David: We are very geographically
distributed, that's for sure.
So yeah, so on the 1 side, the
archiver would have to tell Postgres
that it's caught up.
So rather than throwing a nasty
error, we could
return a code, value 2 or
something, to say, hey, I'm at
the end of the WAL in the
repository.
Nikolay: Or just signal the lag.
Signal the lag.
This is my lag, that's it.
And then the mechanism will decide
the precedence, adjust precedence
based on knowledge about lags.
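A lag-driven choice between the two WAL sources might look like the
sketch below. Nothing like this exists in Postgres today, as the
conversation makes clear; the function name, the "archive" and
"stream" labels, and the 1 GiB threshold are all assumptions for
illustration.

```python
def choose_wal_source(current: str, replay_lag_bytes: int,
                      archive_caught_up: bool,
                      switch_to_archive_bytes: int = 1 << 30) -> str:
    """Pick "archive" (log shipping) or "stream" (streaming replication)."""
    if current == "archive":
        # Stay on log shipping until the archiver reports that the standby
        # has consumed everything in the repository.
        return "stream" if archive_caught_up else "archive"
    # Currently streaming: fall back to log shipping only when the standby
    # is far behind and the archive still has WAL it has not replayed.
    if replay_lag_bytes > switch_to_archive_bytes and not archive_caught_up:
        return "archive"
    return "stream"
```

The two inputs correspond to the two signals discussed here: the
archiver saying "I'm at the end of the WAL in the repository" and the
standby's own measure of how far behind it is.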
David: Or that.
And honestly, if you're going to
do that, the way to do that
would be inside the archive_library
interface.
David: And you're familiar with
the archive_library, right?
Nikolay: Yes.
David: We don't have one for
pgBackRest
yet because it's not needed for
performance in pgBackRest and
it doesn't really provide any
other benefits.
So we haven't really done it yet,
although now that several versions
of Postgres support it, I'm looking
at probably adding that this
year or next year at the latest.
Because then maybe we could push
some of these ideas into Postgres,
saying, hey, let's flip back
and forth between archive-get
and streaming replication based
on what's...
And passing stats back and forth
inside the archive_library would
be easy.
You don't need any fancy return
values or JSON or something.
It would just be an API call that
you would use within Postgres.
Nikolay: Postgres 15, when it was
introduced.
I wanted to clarify very clearly
that we talk about fetching
WALs and streaming, but after
that Postgres needs to replay
and then we have this startup process,
100% CPU.
This cannot be solved anyhow, right?
So if you manage to...
Usually problem is there.
This is what I'm trying to say.
It depends on the system, I think,
but even with a simple, regular
streaming replication, which is
no compression, nothing, I usually
don't see any problems with receiving
WAL and just putting it
to file with WAL receiver, right?
The problem is usually in the startup
process.
These are my observations on multiple
systems.
David: And that's true.
A lot of it depends on your network.
If you're running things in multiple
availability zones and stuff
like that, or across cloud providers
or whatever, that situation can
definitely change.
Nikolay: Yeah.
David: But certainly if you've got
primary and standby locally, with
the network speeds these days,
you're right.
Streaming replication should not
be a problem in that environment
in this day and age.
Nikolay: My observations are with
good network and single region.
I agree with you if you introduce
network complexity, yes.
Michael: That sounds all very clever
and like obviously needed
at super high scale.
We've talked to a lot of people
and it seems like there's quite
a few projects in the space of
sharding these days and it feels
like people that go the sharded
route sidestep this problem because
each of the shards just has lower
volume, like by definition.
So I'm wondering if it's just like
an orthogonal thing to
pgBackRest, like sharding can happen
and each 1 can be backed up.
Does it affect you in any way or
are you thinking along those
lines at all?
David: From a pgBackRest standpoint,
no, because it's really
from our...
So the things in this arena that
I know support pgBackRest are
like Greenplum and Multigres.
And in both those situations, it's
a very straightforward
pgBackRest scenario.
We're just doing recovery, they
give us a point in time to recover
to, and that's up for them to figure
out how to get the shards
back in sync, and then we just
give them what they ask for.
Multigres looks enough like Postgres
that pgBackRest just is
cool with it.
On the Greenplum side, there's
a company that has created a fork
that specifically supports Greenplum
that'll read its control
files and do that and we've never
brought that into core mostly
because I don't want to get into
Greenplum and testing Greenplum
and also annoyingly the oldest
version of Greenplum is still
based on Postgres 9.4, which we
expired a couple of years ago.
So it's a little bit painful to
have.
So that company is actually basically
supporting now, not only
do they have their fork, but they're
also supporting versions
of Postgres that we don't support
anymore so that they can still
support Greenplum 6, which is based
on 9.4.
And maybe that's something we could
bring into Core Postgres,
but it's a resource issue.
We just simply don't have the resources
for it and it's just
easier to focus on open source
Postgres.
When you drive people towards the
idea that if you make your
stuff look like Postgres it'll
just work, if you muck about with
pg_control, like the parts of...
So we even have a feature where
you can add stuff to pg_control
as long as you add it to the end.
And we'll automatically figure
out, so you tell us you're Postgres
16, we'll automatically figure
out where the checksum lives,
based on the size of your pg_control
and all that kind of stuff,
and work all that out.
But the contract is that the part
of pg_control that was part
of that version of Postgres must
be the same.
Because we read all over pg_control.
It's extremely important for our
operations, so we can't just,
you can't move things around, you're
just allowed to add to the
end.
So we do make concessions to forks.
Things like it works with EDB,
their enterprise server, and other
things that have new control versions
and stuff like that.
As long as you kind of know what
you're doing.
We call it a maintainer feature.
So the idea is you shouldn't
be mucking around with this feature
if you are not a maintainer of
a fork and deploying pgBackRest for
that particular fork or documenting
how pgBackRest should be used
with that fork or something.
If you have to use these options
on a regular cluster, then you've
probably done something wrong.
You shouldn't be doing that.
But obviously I can't control what
people do.
That's part of the fun of it, right?
Michael: Yeah.
Talking of maintenance, is it a
good time to transition into
how you have been maintaining it
over the years, kind of what
the situation was, and it's changed
recently.
So I'm wondering if you wanted
to share a bit of that.
David: Yeah, maintenance has
always been a big part of pgBackRest.
1 thing is we have a very comprehensive
test suite for a project
of this size.
We test on 5 different architectures.
We test on a variety of distros.
We have 100% unit test coverage.
We have integration tests.
We test our doc code, our test
code, we test everything.
And just keeping that test suite
up to date is a bit of a challenge.
It's not the core code.
The C code doesn't break because
RHEL 10 comes out.
What breaks is our tests and the
documentation and all these
sorts of things.
So it's a non-trivial task just
to keep all that going.
And then you have things like say,
Cirrus CI just going away
on a month's notice.
So we need to migrate off of Cirrus
CI.
Luckily, we're already on GitHub
Actions, so we're able to do
that fairly easily.
When I say maintenance, I also
include bug fixes that come in,
So we get a pretty regular stream
of bug reports.
Some are bugs and some are not.
Most of them these days tend to
be pretty weird edge cases.
Enough people are using pgBackRest
that the really obvious
bugs get caught pretty quickly.
If we release something and there's
bugs in it, I get reports
the next day, almost.
Bug fixes are maintenance, the documentation
is maintenance, interacting
with the community and answering
questions and feature requests
and other things like that's all
maintenance.
So that's actually a pretty big
part of the job.
And after I left, after I did not
transition to Snowflake, so
Snowflake bought Crunchy Data, I did
not transition.
I wasn't happy with their terms,
so I decided just to take some
time and travel.
I kept maintaining pgBackRest,
but at that point I was really
just maintaining it.
Which is fine, but it really got
to the point of thinking about
really restarting development on
pgBackRest again.
I was like, I think I might need
a job.
And so I started trying to get,
I actually put sponsorship links
and I've been including sponsorship
with every release since
last summer.
It was around for basically a whole
year, not getting a lot of
traction of course.
Then I started spending more time
on it.
Then I got to the point where I
realized maybe this whole sponsorship
thing isn't gonna work out.
And that got us to the recent crisis
where I announced that I
wasn't gonna work on pgBackRest
anymore.
And suddenly I got sponsorship.
And when I say suddenly, it really
was suddenly.
Within 3 weeks, pretty much everything
was worked out, and I
was able to make an announcement
that the project was going to
continue to be maintained, and
etc.
But not just maintained.
The key is adding new features.
Right now, I've been doing, on
pgaudit and pgBackRest, I've been
doing a lot of cleanup work because
they've been sitting for
a while, especially pgaudit wasn't
really getting a whole lot
of love.
So I've been working on that, but
I'll be done with that at the
end of this week, basically.
And then it's time just to dive
into big new features again,
which I'm really excited about.
I've got a lot of ideas.
The sponsors have a lot of ideas,
as you can imagine.
David: The great thing is the
ideas that the sponsors have are
directly aligned with big ticket
items that have been on our
list for a while.
They're not asking for outlandish
stuff.
They're asking for repo-to-repo
backup, for instance, is a big
ask.
And that's been on the list for
a long time, because that's a
pretty important thing to be able
to do.
Streaming, doing streaming WAL
replication.
So you can have RPO 0 if you want,
without having a standby,
so the idea is you have RPO 0 without
a standby.
Nikolay: pg_receivewal, right?
David: Yeah, although we would
work directly at the protocol
level.
And in fact, we might...
I'm actually thinking, given that
we have this archive_library
thing, and given that pgBackRest
can just live in Postgres,
I'm not even sure if we need to
deal with the protocol level.
We could just sit there as a companion
to Postgres and stream
WAL out.
Nikolay: Isn't it expensive to sit
as Postgres?
Like, I don't get it.
What exactly is the idea?
David: So basically, instead
of a replication slot, we would
just be keeping up with the current
write WAL location.
Where are we currently writing
in WAL?
That's the point that we should
currently be targeting.
Although if we really wanted to
do, say, synchronous replication,
RPO 0 stuff, we'd probably
still need a replication slot for
that.
The other thought we had is that
we could actually game, we could
create a replication slot but not
actually use it and just update
the row.
So basically do our own background
archiving and then update
the row in Postgres to our current
position, what we've actually
pushed out to storage.
And if I do this, I want to do
it right.
In a lot of cases, the things that
are doing the WAL receiver,
they're not actually, let's say
your eventual destination is
S3, they're not actually writing
packets off to S3.
They're storing the stuff locally
on whatever host the WAL receiver
is running on and then when they
get a full WAL segment, they're
pushing that to S3.
But what I would want to do here
is actually have a couple of
parameters that decide what your
acceptable lag is.
Let's say you're willing to have
3 seconds of lag, or so many
bytes.
And we can measure that.
In this situation, we'd be able
to measure either 1 of those
very easily.
So we would say, OK, if we get
to that point where we're going
to hit that lag, we'll actually
write a chunk of WAL out to
S3 and store it.
And the WAL archive-get routine
would actually be written so
that it would be able to...
Later these chunks would be assembled,
of course, reassembled
into a WAL segment to keep things
kind of tidy.
But the WAL get routine would
actually be able to understand
if you get to the end of the WAL
stream and all you've got is
these chunks, we'll be able to
read them out and reconstruct
them and send them to Postgres.
Because I feel like with the
implementations that are out there
right now, yeah, you're
getting the WAL off of
the primary, which is good.
Don't get me wrong, that's a pretty
important thing, but I want
to get the WAL all the way to
the repository, all the way to
the final storage.
So if everything goes away, and
all you've got is the S3 bucket,
then you would still have everything
up to whatever you set that lag to,
unless we
fall behind or other things happen.
Obviously, there's scenarios.
But I think it could be a lot more
efficient because then we
could figure out, hey, they're
generating WAL at a ridiculous
rate.
We actually need to switch over
to just compressing and pushing
whole WAL segments and then come
back to chunking up portions
of this WAL segment.
And with the replication protocol
you don't really have that
option because you've just got
this fire hose that's sending
you data and what you really want
to do is say, no, this is too
much.
Also, it's a lot slower than just
compressing and shipping.
So, no, we're going to go back
to doing whole segments, and when
we get caught up on the whole segments,
we'll start doing the
most recent segment in chunks,
back and forth.
I think the best way to do that,
that I'm thinking, I'm always
an outside the box kind of guy,
is actually in an archive_library
sitting next to Postgres and just
asking Postgres, where are
you, where are you, you know, kind
of thing.
And we can actually monitor, you
know, check stuff on disk, so
we'd be able to see, oh, we've
just gotten a new WAL segment,
so everything in the old WAL segment
is fair game, etc., etc.
So there's plenty of heuristics
we can use to improve this.
So that's the way I'm leaning right
now, and that gives us the
archive_library and this WAL receiver
RPO feature at the same
time.
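The chunk-versus-segment policy described in this answer can be
sketched roughly as below. This feature does not exist yet, so every
name here (LagLimits, next_action, reassemble) and the default limits
are hypothetical; the sketch only shows the decision logic and the
later reassembly of chunks into a segment.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class LagLimits:
    max_seconds: float = 3.0       # acceptable lag in time (assumed default)
    max_bytes: int = 1024 * 1024   # acceptable lag in bytes (assumed default)


def next_action(pending_bytes: int, seconds_since_push: float,
                segments_behind: int, limits: LagLimits) -> str:
    """Decide what the archiver should do next."""
    if segments_behind > 0:
        # WAL is being generated too fast for chunking: compress and push
        # whole segments until we have caught up.
        return "push_segment"
    if pending_bytes == 0:
        return "wait"
    if (pending_bytes >= limits.max_bytes
            or seconds_since_push >= limits.max_seconds):
        # We are about to exceed the acceptable lag: push a partial chunk.
        return "push_chunk"
    return "wait"


def reassemble(chunks: Dict[int, bytes]) -> bytes:
    """Join chunks keyed by start offset, refusing gaps or overlaps."""
    data = bytearray()
    for offset in sorted(chunks):
        if offset != len(data):
            raise ValueError(f"chunk at offset {offset} is not contiguous")
        data += chunks[offset]
    return bytes(data)
```

Checking segments_behind first mirrors the "back and forth" David
mentions: whole segments while behind, chunks of the most recent
segment once caught up.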
Nikolay: RPO control.
This is better RPO control, basically.
David: Yeah, better.
Sorry, I meant to say RPO 0.
So if we could actually game this
so that we're not doing streaming
replication, but update the row
in Postgres for replication,
then we could actually do synchronous
replication this way.
And we could basically do synchronous
replication to S3.
If someone wanted such a thing,
yikes, it would be slow.
Nikolay: I'm very curious.
What is your opinion about this:
Barman was the first tool which used
only streaming replication, right?
I remember this.
So what is your opinion
on using streaming replication
for backups?
And this new CYBERTEC tool released
last week, pg_hardstorage,
I quickly checked, they don't use
pg_receivewal either, they
just implement the protocol, they work
with the protocol and they pretend
to be a replica.
David: Yeah, that's what most
people have done.
I think Barman might still be using
pg_receivewal, but most other
implementations have gone straight
to the protocol.
The protocol's not that complicated,
to be honest.
Nikolay: Like all uppercase.
David: If you've got libpq, once
you've got that set up, especially
if you're using libpq directly,
it's actually fairly trivial.
The replication protocol is very
simple, which it should be.
There's nothing wrong with simplicity.
I'm all about it.
I think ultimately it's a big bottleneck
making backups through
the streaming replication.
So for very large databases, it's
just going to be a pretty big
bottleneck.
Obviously someday we can add parallelism,
do all these things
and etc.
There's a lot of...
Well, they got compression.
So that's good.
Although I think...
Is the compression server side
or is it client side?
I think it might still be client
side.
Nikolay: No, it's server side.
David: Okay, I can't remember.
Anyway, that's actually a pretty
solvable problem. But you've
got basically a scale problem: now you
need to back up something really
large and you're pushing everything
through this little pipe.
Nikolay: pg_basebackup has compression
since Postgres 15 and you
can...
Yeah.
Right?
Or no?
David: It does.
I just can't remember which side
it happens on, whether it happens
on the server side or the client
side.
I think it is.
Nikolay: Compress server-gzip
option.
David: I believe it is, yeah.
Nikolay: But it's still single threaded.
David: That's 1 bottleneck taken
care of because you're not
pushing uncompressed data over
the network, which you definitely
don't want to do.
But it's still single threaded.
And that's a huge limitation on
the backup side, obviously.
And then on the recovery side,
we still have this problem of
all the data needs to be recovered
before anything can be done
if you're doing block incremental.
If you're not doing page incremental,
then you can make that
a bit more efficient.
You can stream the tar file from
S3 and decompress it as you
go and do various things.
But if you're doing page incremental,
which I think pretty much
everyone wants to do, that becomes
a huge bottleneck on the restore
side.
Nikolay: I'm sorry.
We've never had such a compressed
episode.
I know we are already way over
time, but it's so compressed.
You compress so much knowledge
and I like it so much, but can
you elaborate on page
level versus block level and
why you think it's important?
David: That's my confusion, because
what was introduced in Postgres 17,
the page level incremental with
the WAL summarizer and everything,
in pgBackRest we call that block
level incremental.
And the reason why we call it block
incremental is we don't always
operate at page size.
We operate at block size.
And then we actually combine blocks
into what we call a super
block for compression.
So let's say I want to get 1 block
out, I might have to retrieve
3 compressed blocks or 4 compressed
blocks, depending on the
super block size, to actually get
that.
But the important thing is we can,
let's say we've got a 1 gigabyte
file and we need to recover 1 block,
we can go recover 3 blocks
to do that instead of 1,000 blocks,
however many blocks are in
a 1 gig file.
I think it's usually 1,200 at
our largest block size.
So we can go recover 3 blocks instead
of 1,200 blocks, even
though we have to recover 3 blocks
just to write 1 block.
So the main thing is we're able
to go, let's say you've got a
standby or a primary that's failed
for some reason, you don't
know why.
So you're gonna do a delta restore.
So in a delta restore, we go and
we look at every file And for
the block level stuff, we chop
it up into pieces, and then we
do a hash on each piece, and then
we do a hash for the entire
file.
If the whole file hash matches
what's in the backup manifest,
then we're done.
Move on.
Next file.
If the whole file hash doesn't
match, then we can go recover
what we call the block map, which
is a map that tells us where
all the blocks live in all the
backups.
You might have 10 incremental backups
since your last full, and
now we need to go figure out where
the blocks that we need are
located and go grab those blocks.
And in the other implementations
that have been done, you can't
do that.
You basically have to go recover
everything and some of them
will have good things like WAL-G
will, as it's reading it,
it'll know that it doesn't need
the block and it'll just throw
it away.
That kind of read over thing I
was talking about before.
So it'll just throw away the blocks
it doesn't need, but it's
still reading them out sequentially.
We're actually able to go random
access and pull out just the
blocks you need.
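The delta restore flow just described (whole-file hash first, then
per-block hashes, then a block map lookup) can be sketched as follows.
This is a simplified illustration, not pgBackRest's actual format: the
8 KiB block size, the use of SHA-1 for block hashes, and the shape of
block_map (block index to backup label and offset) are all assumptions.

```python
import hashlib
from typing import Dict, List, Tuple

BLOCK_SIZE = 8192  # assumed block size for this sketch


def block_hashes(data: bytes, block_size: int = BLOCK_SIZE) -> List[bytes]:
    """Hash each fixed-size piece of the file."""
    return [hashlib.sha1(data[i:i + block_size]).digest()
            for i in range(0, len(data), block_size)]


def blocks_to_fetch(local: bytes, file_hash: bytes, wanted: List[bytes],
                    block_size: int = BLOCK_SIZE) -> List[int]:
    """Return indexes of blocks that must be pulled from the repository."""
    if hashlib.sha1(local).digest() == file_hash:
        return []  # whole-file hash matches the manifest: nothing to do
    have = block_hashes(local, block_size)
    return [i for i, h in enumerate(wanted)
            if i >= len(have) or have[i] != h]


def plan_reads(needed: List[int],
               block_map: Dict[int, Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group the needed blocks by the backup that holds them."""
    plan: Dict[str, List[int]] = {}
    for index in needed:
        backup_label, offset = block_map[index]
        plan.setdefault(backup_label, []).append(offset)
    return plan
```

With a plan like this, a restore can issue random-access reads for only
the listed offsets instead of streaming every backup sequentially.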
And it's extremely powerful because
we have that delta restore
concept where we can actually just
recover part of a cluster.
And that's where the power of those
checksums, which in theory cost
us 5% of our time, but actually
it's really much less than
1%, that's where that all comes
in.
On the pg_basebackup side
right now, you basically have to
recover all the full and all the
incrementals, decompress everything
all at the same time, present that
to pg_combinebackup, which will
then rewrite that into another
directory.
So at a minimum to do recovery,
you're looking at double your
database size and actually more.
And you don't really know what
that more is, unless you've stored
it.
It's up to you to actually store
that information to find out
what is the total number of bytes
that I'm going to need to even
pull down the data.
On the pgBackRest side, when
we're doing a recovery, nothing
ever hits disk except in PGDATA.
So we're just reconstructing those
files inside PGDATA.
There's no spooling, there's no
spooling during backup, there's
no spooling during recovery, there's
no spooling during archive
push, there is spooling during
archive-get because we actually
prefetch WAL files and we store
them so that we can hand them
over to Postgres.
So we spool on that side, but
for the most part, we're a zero-disk
operation.
And to be really efficient, that's
the way you need to be looking.
And right now the design in pg_basebackup
is exactly not that.
A lot of disk I/O.
They've done some tricks with,
if you're on ZFS or other file
systems that can do copy-on-write
tricks and do other fun things
to try to minimize the number of
writes that you're going to
do.
I don't know if it actually minimizes
the space that you need.
On some file systems that might
be true.
Nikolay: Thank you for the explanation,
it was great.
David: The whole idea was the
feasibility of pg_basebackup
using the streaming replication
as a backup tool.
And I think, yeah, you can do it,
but when you're working at
scale, it starts to introduce a
lot of limitations.
And from day 1, Gabriele Bartolini
wrote a pretty good thing
after I announced that I wasn't
working on pgBackRest about
his and my fundamental disagreement
about how this tool should
work.
And his idea is that the copy thing
should be owned by Postgres.
So Postgres should do all the file
copying.
And my idea was that no, the copy
should be built into the tool.
Because then we can do, the sky's
the limit.
We can do anything we want.
And I think the performance that
you can get from pgBackRest
speaks to the value of having that
complexity.
And it is, it's a lot of complexity,
don't get me wrong, to have
all that built into pgBackRest,
but the payoff is performance.
And also reliability.
We have checksums that we can check
on everything.
We know when the data that's coming
back is good.
That's 1 thing that kind of peeves
me is that if you, like let's
say you were doing pg_combinebackup
it doesn't actually verify
that the data is correct.
So if you want to verify, you actually
have to run pg_verifybackup
as a separate step before
you run pg_combinebackup.
And in my mind, those should always
be combined into a single
operation.
You're looking at the data, you're
verifying it, you're writing
it out.
And the cost for that, again, oh,
checksums are expensive.
Yeah, but if you're streaming the
data directly from S3 and checksumming
it and writing it out as it comes
in, the checksumming cost disappears.
You don't see it anymore.
If you copy all the data from S3
locally, decompress it, and
then start running checksums on
it, it looks really painful.
The idea for me is let's game
all the latency that we can, the
storage latency that we can.
So while we're waiting for the
storage to give us something,
we'll send the next request
to S3, and while it's thinking,
sending this data, we checksum.
We're done with that block, the
next block is ready for us.
We pull that in, we start checksumming,
we've asynchronously
sent the next request to S3.
So we've already got more data
coming.
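The overlap of storage latency and checksumming can be sketched with a
single background fetch. This is an illustration under assumptions, not
pgBackRest's implementation: fetch_block is a hypothetical callable
standing in for an S3 range read, and SHA-1 is used for the running
checksum.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Tuple


def fetch_and_checksum(fetch_block: Callable[[int], bytes],
                       block_count: int) -> Tuple[bytes, str]:
    """Fetch blocks in order, hashing each one while the next is in flight."""
    digest = hashlib.sha1()
    out = bytearray()
    if block_count == 0:
        return b"", digest.hexdigest()
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(fetch_block, 0)
        for index in range(block_count):
            block = pending.result()  # wait for the block requested earlier
            if index + 1 < block_count:
                # Request the next block before doing any CPU work.
                pending = pool.submit(fetch_block, index + 1)
            digest.update(block)      # checksum while storage is "thinking"
            out += block
    return bytes(out), digest.hexdigest()
```

Because the hash is updated while the next request is outstanding, the
checksum time is hidden behind the fetch whenever storage is the slower
side.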
In the next version of Postgres,
if everything goes well, we'll
also have prefetch.
So we won't be asynchronously fetching
1 block, we'll be asynchronously
fetching by default 4.
And just pulling that data in as
fast as we can, but even as
fast as you can get stuff over
the network, that cost of checksumming
disappears.
But it gives you power.
And pgBackRest is the only thing
that has something equivalent
to delta restore.
And all of that is powered by the
checksums.
That's where all that comes from.
And then we have the block level
delta restore as well, which
is powered by the block level hashes,
which are actually XXHash,
not SHA-1.
Because the blocks are a maximum
of 88k.
Although you can configure that, you
can also configure how much
of the XXHash you're going to use
for a particular block size.
But we did a lot of analysis on
this and looking at file systems
and how many bits they were using
to protect blah blah blah etc.
And basically came up with checksum
sizes that were appropriate
for each block size that we support.
The whole way we do block incremental
is actually a pretty
complicated topic.
I did a talk on it once and it
was a complete disaster because
it was just too much.
Nikolay: You go too deep.
Truncating 32-bits you said, right?
Or truncating hashes you said,
right?
David: Oh yeah, so the 32-bits.
So what we do is we generate a
128-bit hash and then we use however
many bits out of that, well, bytes
actually, that we want.
The minimum number of bytes we'd
use for an 8k block is 6 bytes,
which is actually a lot.
That's 50% more than would be used
for 4k pages on ZFS, for instance,
or Btrfs.
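The truncation itself is simple to sketch. XXHash is not in the Python standard library, so BLAKE2b with a 128-bit digest is used here purely as a stand-in; `block_checksum` and its `size` parameter are hypothetical names.

```python
import hashlib


def block_checksum(block, size=6):
    """Hash a block to 128 bits, then keep only the first `size` bytes."""
    if not 1 <= size <= 16:
        raise ValueError("size must be between 1 and 16 bytes")
    # BLAKE2b is a stand-in so the sketch needs only the standard library;
    # the hash described in the conversation is XXHash.
    full = hashlib.blake2b(block, digest_size=16).digest()  # 128 bits
    return full[:size]


page = bytes(8192)            # one 8 KiB block of zeros
short = block_checksum(page)  # 6 bytes, i.e. 48 bits
```

Keeping fewer bytes per block makes the stored block map smaller, at the price of a higher (but still tiny) chance that two different blocks share a checksum.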
So we go a little bit crazy on
the checksums because we just
want everything to be right.
And then when you're doing block
incremental, we actually have
2 levels of checksum. So we use
the block checksums to reconstruct
the file, and then we also
checksum the entire file, and the
SHA-1 hash of the entire file
plus all the block checksums have
to match.
And the chances of that getting
a collision, you know, like a
piece of data that actually satisfies
the conditions, all the
block checksums match and the
SHA-1 checksum of the whole file
matches, we're talking heat death of the
universe probability here.
Right?
It's just simply not going to happen.
You're going to have disk corruption
a billion times before
anything like this ever fails,
as far as my math works out.
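A minimal sketch of the 2 levels of checking described above, assuming truncated per-block checksums plus a SHA-1 over the whole file. BLAKE2b again stands in for XXHash, and all names are hypothetical rather than taken from pgBackRest.

```python
import hashlib


def block_checksum(block, size=6):
    """Truncated 128-bit block hash (BLAKE2b as a stand-in for XXHash)."""
    return hashlib.blake2b(block, digest_size=16).digest()[:size]


def reconstruct(blocks, block_checksums, file_sha1):
    """Rebuild a file from blocks, checking each block and then the whole file."""
    if len(blocks) != len(block_checksums):
        raise ValueError("block count does not match checksum count")
    sha1 = hashlib.sha1()
    parts = []
    for i, (block, expected) in enumerate(zip(blocks, block_checksums)):
        # Level 1: every block must match its own checksum.
        if block_checksum(block) != expected:
            raise ValueError(f"block {i} failed its checksum")
        sha1.update(block)
        parts.append(block)
    # Level 2: the reassembled file must match the whole-file SHA-1.
    if sha1.hexdigest() != file_sha1:
        raise ValueError("whole-file SHA-1 mismatch")
    return b"".join(parts)
```

The second level also catches mistakes the first cannot see, such as valid blocks assembled in the wrong order.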
Michael: Well, it sounds reasonable.
Nikolay: It's a lot of interesting
topics, I feel.
And I like the direction you are
thinking, especially RPO control.
This is super cool thing because
I see the demand because Postgres
has been used in like critical
systems and people just don't
know what's happening there.
Like how, what's happening in the
case of disaster, right?
Like it's hard to answer simple
questions to business leaders.
David: Yeah.
It's not 1 size fits all.
Right.
I did.
You guys are familiar with ARIN,
right?
The internet number registry.
I know the people who run the database
over there, and they use
pgBackRest.
I worked with them for years in
another company, and they're
all RPO 0 stuff, right?
But their write volume is stupidly
low.
So they've got 1 primary, which
receives 100 writes an hour on
a really busy day, and then they
replicate that, and then you've
got the whole internet reading
to figure out what blocks are
allocated to who at any given point,
and where they should be
routed, that kind of thing.
So the read volume is pretty big,
the write volume is extremely
tiny, so they do synchronous replication
because they can't lose
anything and they have low enough
write volume they can get away
with it, so they would be a great
candidate for a pgBackRest
RPO 0 solution that allows them to
synchronously write to S3.
I bet they'd love it.
But for most people, that kind
of solution just isn't going to
work.
People ask me how I can work on
this software year after year.
And the reason is because the problems
are actually really interesting
and challenging and extraordinarily
complex in terms of the solutions
that we have to come up with to
solve these problems.
It's, it's, pgBackRest is a relatively
small project, but it's
dense.
The stuff that we do is...
Nikolay: Let me uncompress a little
bit.
You mentioned ARIN, it's American
Registry for Internet Numbers,
right?
David: Right.
That's the 1.
Nikolay: So it's basically the important
piece of internet, right?
David: Very important piece.
Yeah.
Very important piece.
Nikolay: It's interesting that you
consider like volumes are not
huge and the streaming replication
and so on.
It's interesting and challenging.
But I have cases where it's like
just not answered.
RPOs are not answered and extremely
important cases.
Important companies also, they
simply don't know.
So if you build this, I think it
will be good.
It will be useful, helpful, and
so on.
David: It's a tricky thing though,
to some extent, when you
put those tools in people's hands,
because of course there's
some level in the hierarchy that's
going to say, always RPO 0.
Of course, we have to have synchronous
replication.
Nikolay: Why would we ever do that?
We cannot afford data loss.
We cannot afford, we need to use
the synchronous replication
and so on.
And also we cannot, we don't have
split brains.
You mentioned Gabriele Bartolini,
right?
I also have my opinion about CloudNativePG
and solutions he
chose in this tool.
Yeah, but like at high level we
are great, but if you look inside
you see data loss is possible,
split brains are possible, almost
everywhere.
And it's really extremely hard
to achieve and I like so much
you focus on these topics like
RPO control, and especially corruption
and checksums everywhere.
It's great.
David: My whole job in life is
to protect the data, protect
everyone's data.
And that's what I think about all
the time.
Like how can pgBackRest be the
most reliable thing?
Even if it's not the fastest, or
I think it's generally the fastest,
but if it's not the most feature
rich, there are a couple of
features that other programs have
that we don't have, although
we're working on that.
But everything we introduce has
to be performant and ultimately
absolutely reliable. People trust
the software to protect their
data.
Now they have an obligation to,
they need to, I get people at
conferences that will come to me
and say, you saved my job, pgBackRest
saved my job.
And I'm like, I really appreciate
you saying that, but actually
you saved your job because you
set up backup, you tested restores,
and when the big day happened,
you were ready.
Nikolay: Yes, I set up restores.
David: Because they'll tell me,
oh yeah, we set up weekly restore
tests like you recommended.
And I'm like, okay, great.
I can recommend these things, but
a lot of people don't do it.
So if you actually go and do the
stuff that the backup people
recommend, kudos to you. You deserve
all the credit for saving
your job, not me. But don't get me
wrong, I still like hearing the
stories. It's great to know that
pgBackRest is useful and valued,
of course.
Nikolay: Untested backups are Schrödinger
backups.
They should not be considered proper
backups.
Yeah, backups must be tested.
David: I can't remember where
I found that, but I put that
on 1 of my very early talks, Schrödinger's
Backup.
It wasn't me, it wasn't original
to me, but I can't remember
where I saw it.
But as soon as I saw it, I was
like, yes.
Nikolay: It's an obvious idea, honestly.
Multiple people might invent it.
Anyway, way over time, I enjoyed
so much talking to you.
Final question, we discussed a
lot of technical detail, very
advanced, I must say.
I think it's okay if some people
don't get everything because
we definitely dived into multiple
areas very deep.
We also touched the situation about
sponsorship.
We haven't touched the situation
about maintainers.
What would help you to have a 2nd
big maintainer?
David: A 2nd big maintainer would
be good for a couple of reasons.
1, to help me write the big features.
These days, I'm the only 1 writing
big features and I'm just
1 person.
And the other thing is review.
So review is a big bottleneck for
me.
I can actually produce quite a
lot of code, but it's got to be
gone over carefully by somebody.
I also, I'm using LLMs now for
review as well.
So I'll ask Claude Code to review anything
I write before I send
it to any person just to catch
the really obvious stuff.
And these days the not so obvious
stuff is getting pretty good.
But I also need someone to bounce
ideas off of.
Another maintainer would be good
for that.
Someone who I could reliably chat
with and be like, so I was
thinking about X, Y, and Z and
I just, you know how it is when
you vocalize something, it becomes,
sometimes you don't even
need to get input from them.
Just by explaining the problem
to them, you, oh yeah, I know
what I need to do here.
It's so obvious to me now.
So that's why I really need another
maintainer.
We're working on that right now.
That should be, I don't wanna make
any announcements or say anything
in regard to this, but this is something
that's actively being worked
on to get another maybe not full-time
maintainer, but very active,
very involved maintainer.
Nikolay: That's, that would be great.
Yeah.
David: And to some extent, it's
hard to know how big we could
scale this project.
It's an interesting question, how many people
are interested.
Also how big are the problems we're
solving?
Can 2 people handle everything?
Probably.
Especially with tools these days,
helping.
We're getting more contributions
from the community.
Those are coming in a pretty regular
stream now so I don't have
to write everything.
I don't have to find all the bugs,
I don't have to fix all the
bugs, I don't have to do little
authentication tweaks for S3
or blah blah blah.
Most of that is coming from outside
contributors, and I just
review it, and if it's reasonable,
commit it.
So we're building community that
way as well, but if we can get
1 more person who's really looking
at it regularly, that would
be great.
And that would, of course, be in
addition to Stefan, with Data
Egret, who actually already spends
quite a bit of time on pgBackRest
as well, review and
testing, et cetera.
Nikolay: Cool.
Great.
Thank you.
I don't have any more questions.
I have, but I will post them, because
they are super technical
again.
Michael: David, thank you so much.
Thanks for joining us, but also
thanks for all the maintenance
you've done over the years.
David: Oh, absolutely.
It's definitely been my pleasure.
Michael: Congrats on getting all
the sponsors and thanks to them
as well for keeping it going.
David: Yeah, like kudos to the
sponsors.
They have made this possible.
I'm just, I'm back to work.
I'm back to what I love doing and
everyone gets to benefit.
So I think this has worked out
really well.
So a really cool open source story.
Michael: Yeah, it is absolutely
good.
Great open source story.
Nikolay: Thank you.
Have a great week.
David: Yeah, thank you very much
for having me.