Michael: Hello and welcome to PostgresFM, a weekly show about

all things PostgreSQL.

I am Michael, founder of pgMustard, and I'm joined as usual by

Nik, founder of Postgres.AI.

Hey, Nik.

Nikolay: Hi, Michael.

And let's welcome our guest.

Michael: Yeah, we are joined by a very special guest, Sugu, who

is a co-creator of Vitess, co-founded PlanetScale, and is now

at Supabase working on an exciting project called Multigres.

So welcome, Sugu.

Sugu: Thank you.

Glad to be here.

Michael: All right.

It's our pleasure.

So it's my job to ask you a couple of the easy questions to start

off.

So what is Multigres and why are you working on it?

Sugu: Multigres is a Vitess adaptation for Postgres.

It's been on my mind for a long time, many years, and we've even

had a few false starts with this project.

And I guess there is a timing for everything and finally the

timing has come.

So I'm very excited to get started on this finally.

Michael: Yeah, timing is an interesting one. It feels like for many years I was looking at PlanetScale and Vitess specifically, very jealously, thinking: you can promise the world, you can promise horizontal scaling with a relational database for OLTP, all of the things that people want, and we didn't really have a good answer for it in Postgres.

But all of a sudden, in the last few months, it seems almost,

there are now 3, 4 competing projects all doing it.

So why now?

Why is it all happening now?

Sugu: I would say, I think there is a timing for every new idea

to gather momentum.

Like the idea may be good, but the timing may be bad for one reason

or another.

But once the timing is good, then it is obvious that it is the

right time.

And it feels like now is that time.

Like I would say, for example, PayPal, there have been previous

attempts at moving money online, but it never worked.

But PayPal just came at the right time and therefore took off

during that time.

Many such examples where, you know, some ideas came too early

and didn't take off.

But for Postgres, this feels like it's the right time.

Nikolay: That's interesting.

Let me reveal something.

We met in 2018, I think, at Percona
conference.

And in 2019, we met again in Nutanix
at the meetup.

Go meetup, right.

And discussions about Vitess for Postgres started.

And a couple of times we had a call, I tried to involve a couple of guys, and from my understanding it never worked because you could not do it yourself, being busy with other things, MySQL-related I guess, and the guys who looked at the complexity of it didn't proceed.

And actually, in 1 case, they decided to build from scratch. It was the SPQR project. It's live, and it's sharding for Postgres.

Sugu: Yeah, Andrey Borodin.

Nikolay: Yeah, Andrey Borodin.

Borodin.

Yeah.
Yeah.

So and other folks also involved.

And so for me, it was disappointing that it didn't work.

And at some point I saw the message
in Vitess, I think, that

we are not going to do it.

So like, don't expect.

Sugu: I feel so bad because I was
so excited about doing it.

And then I realized, oh my God,
you know.

Nikolay: But now you started and
last week PlanetScale decided

to support Postgres.

So what's happening?

Like, I don't understand.

Just, just right time.

Right.

Yeah.

Enough companies using Postgres which really needed it.

Sugu: At least 1 horse will win.

Yeah.

Nikolay: So yeah, it's great.

But, but yeah, long, long story
to this point.

Michael: Yeah.

Sometimes when there are multiple projects, there are slight differences in philosophy or approach or trade-offs, a willingness to trade 1 thing off in relation to another.

And I saw your plan, I really liked
that you used, you mentioned

building incrementally.

So Vitess is a huge project, lots and lots of features, but

I've heard you talk in the past
about building it quite incrementally

while at YouTube.

You know, it didn't start off as
complex as it is now, obviously,

and you did it kind of 1 feature
at a time.

And it sounds like that's the plan
again with Multigres.

Is that different to some of the other projects?

Or what do you see as your philosophy and how it might differ

slightly to some of the others?

Sugu: I think my philosophy is that I would say I don't want

to compromise on what the final project is going to look like.

So any path that deviates me from hitting the target on the final

project, I do not want to take.

But any shortcuts that can get us early results, that we can

do, but only those that ensure that it doesn't risk what we become

eventually, which is a project that should feel native as if

it was for Postgres, by Postgres kind of thing.

I want it to be a pure Postgres project.

Nikolay: In this case, I must ask: are you sure you're still good with the choice of the Go language?

Because sometimes we see projects which have really sub-millisecond latency on average, like 100 microseconds, for example.

And Go definitely will bring a few hundred microseconds of latency overhead.

So usually it's not a big deal, but maybe in some cases it is some deal, right?

Are you happy with Go?

Yeah, I mean- Because you were 1 of the first big Go language

users building Vitess, as we know from various interviews and

so on.

Yeah.

So it's still a good choice because now there is Rust, right?

Sugu: Yes.

I would say, by the way, when we started Go, compared to where

Go is today, it was a nightmare.

Nikolay: Yeah, like 10 milliseconds.

Sugu: Like, yeah, like 10 milliseconds or something round trip

is what we were paying for.

Those days we had hard disks, by the way.

So that's another 3 to 5 milliseconds just within the database.

But things are a lot better now.

And at this point, the way I would put it is: the trade-offs are in favor of Go, let's put it that way.

Mainly because there is a huge amount of existing code that you

can just lift and port.

And rewriting all of that in Rust is going to just delay us.

And at least in Vitess, it has proven itself to scale for, you

know, like multi, hundreds of terabytes.

And the latencies that people see are, they are not affected

by a couple of hundred microseconds.

So I think the, and plus there's, there's this inherent acceptance

of this network latency for storage and stuff.

And if you bring the storage local, then this actually wins out

over anything that's there.

Nikolay: This is exactly what I wanted to mention.

Yeah, I see PlanetScale right now, they came out with Postgres

support, but no Vitess.

And I'm very curious how much it will take for them to bring

it and compete with you.

It's another interesting question.

But from the past week, my main impression is that the main stake is on local storage.

And this is great because local storage for Postgres, we use

it in some places where we struggle with EBS volumes and so on,

but it's considered like not standard, not safe, blah blah blah.

There are companies who use it, I know.

I know myself, right?

And it's great.

Like today, for example, with Patroni, and since Postgres 12, we don't need to restart nodes when we have a failover.

So if we lose a node, forget about that node, we just fail over and so on.

And with local storage, not a big deal.

But now I expect with your plans to bring local storage, it will

become more, like I expect it will be more and more popular,

and that's great.

So you shave off latency there and keep Go, which brings 200 microseconds or so.

Sugu: Good compromise.

Effectively, it's a win because 1 network hop completely eliminates

language level overheads.

Nikolay: Sounds good.

Maybe your Go will improve additionally.

Michael: I wanted to go back, Sugu, you mentioned not wanting

to compromise on feeling Postgres native.

That feels to me like a really big statement, coming from Vitess

being very MySQL specific, saying you want to be Postgres native

feels like it adds a lot of work to the project.

It feels like a lot to me.

What does it mean?

Is that about compatibility with the protocol?

What does it mean to be Postgres native?

Sugu: There's 2 answers.

1 is, why do we still think we can bring Vitess if it was built

for MySQL?

And how do you make it Postgres native?

That's because of Vitess's history.

For the longest time, Vitess was built not to be tied to MySQL.

It was built to be a generic SQL 92 compliant database.

That was actually our restriction for a very long time until

the MySQL community said, you need to support all these MySQL

features, otherwise we won't

Nikolay: move forward.
CTEs, right?

Common Table Expressions, WITH, right?

It's, I guess, an SQL 99 feature, not 92.

Sugu: Yeah, I think the first parser
I built was SQL 92, which

is the most popular 1 that I know
of.

So that's answer 1.

Answer 2 is more with the behavior
of Postgres.

What we want is completely mimic
the Postgres behavior right

from the beginning.

Basically, in other words, we plan
to actually copy or translate

what we can from the Postgres engine
itself, where that behavior

is very specific to Postgres.

And the goal is not compatibility
just at the communication layer,

but even internally, possibly even,
you know, recreating bugs

at the risk of recreating bugs.

Nikolay: In this case, it's like, so there is Citus, which in the hands of Microsoft got everything open sourced.

So before, resharding was only
in paid version, now it's in free

version, open source, so it's fully
open source.

And they put Postgres in between,
so they don't need to mimic

it, they can use it, right?

And latency overhead is surprisingly
low, we checked it.

Well, it's C, but it's whole database
in between.

But it's sub millisecond, so it's
acceptable as well.

I think it's half a millisecond
or so in our experiments with

simple select 1 or something, SELECT,
just like.

So don't you think it's like quite,
like in comparison, it's

quite a challenging point when
you say I'm going to mimic a lot

of stuff, but they just use Postgres
in between?

Sugu: Yeah, yeah, I think there's
a difference in architecture

between our approach between Multigres
versus Citus.

I think the main difference is
it's a single coordinator for

Citus.

And there is some bottleneck issue
with that.

If you scale to extremely large
workloads, like that goes into

millions of QPS, hundreds of terabytes.

So having that single bottleneck,
I think would be a problem

in the future.

Whereas,

Nikolay: yeah.

I understand you can put multiple, like you can have multiple nodes there, and also you can put a PgBouncer to mitigate the connection issues.

So it can scale as well.

Sugu: That's good.

That's not... that's something I didn't know before.

So, yeah, it's possible then that
they may also have something

that can viably scale for OLTP.

Nikolay: Yeah.

So we're still exploring this and
more benchmarks are needed.

Actually, I'm surprised how few
and not comprehensive benchmarks

there are published for this.

Sugu: Yeah, what I know of Citus
is probably what you told me

when we met.

So that knowledge is about 5 years old.

Michael: Yeah, yeah.

Another big difference, and this is typically Nikolay's question, is on the license front. I think you've picked about as open a license as you could possibly pick, which is not the case, I think, for many of the other projects.

So that feels to me like a very Supabase thing to do, and also in line with what Vitess did, and that seems to me like a major advantage in terms of collaborating with others, other providers also adopting this or working with you to make it better.

What's your philosophy on that side of things?

Sugu: My philosophy is, my metric
for success is adoption.

Yeah.

And the only way to have a project
be adopted is to have a good

license, a license that people
are confident to use.

That has been the case from day
1 of Vitess.

We actually first launched with
BSD license, which is even more

permissive than Apache.

And then when we moved the project
to CNCF, they said, oh, no,

no, we need Apache license, which
is why we converted it to that

1.

So adoption has always been, yeah.

Nikolay: Why do they say it?

Do you know?

Sugu: Why CNCF wants Apache?

I think Apache is a pretty good
license.

They just made it a policy.

I mean, had we asked to keep the
BSD license, they would have

allowed us, but we didn't feel
like it was a problem to move

to Apache.

Nikolay: Yeah, and I remember you
described like, when you did

it, like in YouTube, you thought
about external users.

You need external users for this
project to grow.

I guess at Google, AGPL is not
popular at all, we know.

Sugu: Oh, banned.

Nikolay: Yeah, yeah, banned.

And Citus is AGPL,

Sugu: which is interesting.

Nikolay: Also, compared to Citus,
I think you have chances to

be compatible with RDS and other
managed Postgres, to work on

top of them, unlike Citus, which
requires extension and so on,

right?

Sugu: Correct.

Correct.

Yes.

This was actually something that we learned very early on, wanting it to work.

We made a 5 line change on MySQL
just to make Vitess work initially.

And it was such a nightmare to
keep that binary up, to keep that

build running.

Fork.

Yeah, to keep that fork alive.

So we decided, no, it's like, we
are going to make this work

without a single line of code of
change in MySQL.

And that actually is what helped
Vitess move forward.

Because people would come in with
all kinds of configurations

and say, make it work for this.

So in this case, actually, we'll
probably talk about the consensus

part.

That is 1 part that we think it
is worth making a patch for Postgres,

and we're going to work hard at
getting that patch accepted.

But I think what we will do is
we will also make Multigres work

for unpatched Postgres for those
who want it that way.

Except they will lose all the cool
things about what consensus

can give you.

Nikolay: I'm smiling because we
have so many variations of Postgres

these days.

I would expect people coming not
only with different configurations

of Postgres, but also with various
flavors like Aurora.

We have a client who just migrated
from regular Postgres to AlloyDB.

Hello from Google Cloud.

And they already sharded on application
side, but imagine they

could come to you and say, let's
support AlloyDB now.

It's almost Postgres, right?

So yeah, so these things might
happen as well.

Sugu: Don't they claim full compatibility
with Postgres?

I thought you...

Nikolay: Not full, but most of it.

They did interesting stuff in memory, like column storage in memory for tables.

It's row storage on disk, but column
storage in memory.

But it looks like kind of Postgres,
and we actually even had

some questions answered from my
team unexpectedly because we

don't normally work with AlloyDB,
but it looks like Postgres.

So I could imagine the request,
let's support AlloyDB as well.

Anyway, but my question: I remember a feature in Vitess where it works with RDS and managed MySQL.

Did those features, like, has this feature survived?

Sugu: No, actually later we decided
that at least, we call it

actually managed versus unmanaged.

Managed means that Vitess manages its databases.

And unmanaged means that the database
is managed by somebody

else, Vitess just acts as a proxy
to serve queries.

At some point of time, we realized
that supporting both is diluting

our efforts.

And that's when we decided, okay,
you know, it's not worth it

to try and make this work with
every available version that exists

out there in the world.

And we said, okay, we will do only
managed, which means that

we will manage it ourselves.

And if you want, we'll build the
tools to migrate out of wherever

you are, and we'll make it safe,
we'll make it completely transparent.

In other words, you deploy Vitess
on both and then we'll migrate

your data out without you having
to change your application.

But then Vitess can be more intentional
about its features,

more opinionated about how clusters
are to be managed.

And we were able to commit to that
because at that point, Vitess

had become mature enough.

People were completely trusting
it.

They actually preferred it over
previous other managed solutions.

So it wasn't a problem at that
time.

Nikolay: Yeah, it's not a surprise.

That's why I asked.

Because you talk about local disks,
backups, HA, a lot of stuff,

right?

And basically-

Sugu: Yeah, 5 nines is like what
Vitess shoots for.

And like most big companies that
run Vitess do operate at that

level of availability.

Nikolay: So what's the plan, what is Multigres going to support?

So only managed version, right?

Sugu: Only, So it would be, yes,
it would be only managed versions

because I believe that the cluster
management section of Vitess

will port directly over to Postgres,
which means that you will,

once it goes live, it will be coming
with batteries included

on cluster management, which should
hopefully be equal to or

better than what is already out
there.

So I don't see a reason why we
should try to make it work with

everything that exists today.

Nikolay: So it means this is the same as with Citus, it doesn't work with RDS on 1 hand, but on the other hand, I see it's not only a sharding solution.

It's everything, which is great.

I mean, it's interesting, super
interesting.

So a lot of problems will be solved.

And I expect even more managed
services will be created.

I don't know how it will continue,
like in terms of Supabase,

because of the very open license
and so on.

But also I expect that many people
will think, reconsider their

opinion about managed.

We had episode about this.

This is my usual opinion about
managed services because they

hide superuser from you, they
don't provide you access, it's

hard to troubleshoot problems.

In this case, if those problems are solved, this gives you a new way to run Postgres; so if many problems are solved, it's great.

If you want

Sugu: to do so,

Nikolay: for example.

Sugu: Yeah, if you may not know, the initial focus of Vitess

is actually solving these problems first.

Sharding actually came much later. Protecting the database, making sure that it survives abusive queries: basically, that's what we built Vitess for initially.

And the counterpart of taking away power from the user, like

you said, is 1 is, well, we now know exactly how to make sure

that the cluster doesn't go down.

And 2, we countered that by building really, really good metrics.

So when there is an outage, you can very quickly zero in on a query.

If a query was responsible, Vitess will have it on top of it,

like on the top of the line, saying that this is a query that's

killing your database.

So we built some really, really good metrics, which should become available in Postgres, probably from day 1.

Nikolay: That's interesting.

I didn't see, maybe I missed it, I didn't see it in the readme you're writing right now in the project.

Sugu: There's a last section called observability.

I missed that.

I need to revisit.

Nikolay: We're actually building something there as well for

regular Postgres.

I'm very curious, I will definitely revisit this, interesting,

okay.
So yeah, great.

Yeah.

Michael: Also, I feel like this is quite a big difference on

the, at least with Citus in terms of the philosophy or at least

the origin story.

I feel like that started much more with OLAP-focused features, in terms of distributed queries parallelised across multiple shards, and aggregations, and columnar, and loads of things that really benefit OLAP workloads. Whereas this has come from a philosophy of: let's not worry about optimizing for those cross-shard queries, this is much more, let's optimize for the single-shard, very short, quick OLTP queries, and let's make sure we protect it against abusive queries.

So it feels like it's coming, architecturally, it's coming from

a very different place of what to optimize for first.

Sugu: And historically, that was YouTube's problem.

Surviving the onslaught of a huge number of QPS and making sure

that 1 single QPS doesn't take the rest of the site down.

Michael: Yeah.

Yeah.

Perfect.
Makes loads of sense.

So actually, before we move on too much from that, what, where

do you see sharding as becoming necessary?

Like, is it just a case of a total number of QPS, or like a writes per second type thing?

We've talked about sharding in
the past and talked about kind

of a max that you can scale up
to perhaps in terms of writes,

in terms of WAL.

WAL per second I think was the
metric we ended up discussing.

Are there other reasons or kind
of bottlenecks that you see people

getting to that sharding then kind
of makes sense as it's now

time or you should be considering
at this point?
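
(As a rough way to measure the WAL-per-second metric mentioned here, a psql sketch you could run on the primary; the functions are standard Postgres, the 10-second window is arbitrary.)

    -- sample the WAL position, wait, and compute bytes of WAL generated per second
    SELECT pg_current_wal_lsn() AS start_lsn \gset
    SELECT pg_sleep(10);
    SELECT pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), :'start_lsn') / 10)
           AS wal_per_second;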

Sugu: Well, there is a physical limiting factor, which is the single machine: if you max out your single machine, that is your Postgres server, then that's the end of your scale.

There is nothing more to do beyond
that.

And there are a lot of people already
hitting those limits from

what I hear.

And the sad part of it is they
probably don't realize it.

As soon as that limit is hit, in
order to protect the database,

they actually push back on engineering
features indirectly, saying

that, you know, this data, can
you make it smaller?

Can you somehow lower the QPS?

Or could you put it elsewhere?

Nikolay: Let's stop showing this
number on front page.

Sugu: Yeah, yeah.

And it affects the entire organization.

It's a very small, it's a very
subtle change.

But the entire organization slows
down.

Like we experienced that at YouTube
when we were at our limits.

Like, the default answer from a DBA was always no.

We used to even kid: the answer is no, what's your question?

And when we started sharding, it took us a while to change our answer to say: you know, bring your data, we like it, we can scale as much as you want. Believe it or not, we went from 16 shards to 256 in no time.

And the number of features in YouTube
exploded during that time

because there was just no restriction
on how much data you wanted

to put.

And coming back here, reaching the upper limit of a machine is actually something you should never do.

It's very unhealthy for a large
number of reasons, like even

if there is a crash, like how long
is it going to take to recover?

Like the thing that we found out
is once you can shard, it actually

makes sense to keep your instances
way, way small.

So we used to run like 20 to 50 instances of MySQL per machine.

And that was a lot healthier than
running big ones.

For a couple of reasons.

One is, if you try to run so many
threads within a process, that

itself is a huge overhead for the
machine.

And it doesn't do that very efficiently,
whereas it does it better

if you run it as smaller instances,
I think.

It's more of a feeling, but I don't
know if there is proof or

whatever.

But like Go, for example, wouldn't
do well.

Go, I think, beyond a certain memory
size or beyond a certain

number of goroutines would start
to slow down, would not be

as efficient as it was before.

Mainly because the data structures
to keep track of those threads

and stuff, they are growing bigger.

But more importantly, on an outage,
a smaller number of users

are affected.

If you have 256 shards and 1 shard goes down, it is 1/256th of the outage, right?

And so the site looks a lot healthier,
behaves a lot healthier.

There's less panic if a shard goes
down.

So people are, you know, a lot
less stressed managing such instances.

Nikolay: Right, I wanted to mention
that this discussion was

with Lev Kokotov, PgDog, which
is a competitor as well, a new

sharding tool written in Rust.

And we discussed that there is
a big limitation when Postgres...

So physical replication has a limitation because it's a single-threaded process on the standby.

If we reach somewhere like 150, 200, 250 megabytes per second, depending on the cores and also the structure of tuples and so on, we hit 1 single CPU at 100%, 1 process, and it becomes a bottleneck, and the standbys start lagging.

It's a big nightmare because usually by that time, at high scale, you have multiple replicas and you offload a lot of read-only queries there.

And then you don't know what to do, except, as you described, let's remove this feature and slow down development, and this is not fun at all.
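
(For reference, a quick way to watch the replication lag Nikolay describes, from the primary; pg_stat_replication and these columns are standard Postgres.)

    -- how far each standby is behind in applying WAL
    SELECT application_name,
           replay_lag,                                               -- time-based lag
           pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS bytes_behind
    FROM pg_stat_replication;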

So what I'm trying to do here is
trying to move us to discussion

of replication not physical but
logical.

I noticed your plans involve heavily
logical replication in Postgres.

But we know it's improving every year.

So like when we started the discussion
5, 6 years ago, it was

much worse.

Right now it's much better.

Many things are solved, improved,
but many things still are not

solved.

For example, schema changes are
not replicated, right?

And sequences, there is work in
progress, but if it's committed,

it will be only in Postgres 19,
not in 18.

So it means like long wait for
many people.
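
(A minimal sketch of the gap being described, with hypothetical table and connection names: logical replication carries row changes but not DDL, so schema changes have to be applied on both sides by some outside mechanism.)

    -- on the publisher
    CREATE PUBLICATION shard_pub FOR TABLE users;

    -- on the subscriber
    CREATE SUBSCRIPTION shard_sub
        CONNECTION 'host=publisher dbname=app user=replicator'
        PUBLICATION shard_pub;

    -- DDL is NOT replicated: this has to be run on the subscriber as well,
    -- otherwise applying subsequent row changes for the new column will fail
    ALTER TABLE users ADD COLUMN plan text;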

So what are your plans here?

Are you ready to deal with problems
like this?

Pure Postgres problems, you know?

Sugu: Yeah, yeah.

If you ask me, I think the Postgres
problems are less than what

we faced with MySQL.

Nikolay: I wanted to involve physical as well, because of this great talk by Kukushkin, which describes very bad anomalies when data loss happens and so on.

Let's talk about this.

Sugu: Yeah, we should talk about
both.

I think overall the Postgres design
is cleaner, is what I would

say.

Like you can feel that from things.

Like the design somewhat supersedes
performance, which I think

in my case is a good trade-off,
especially for sharded solutions,

because some of these design decisions
affect you only if you

are running at, you know,

Nikolay: if

Sugu: you're pushing it really,
really hard, then these design

decisions affect you.

But if your instances are small
to medium size you won't even

know and then you benefit from
the fact that these designs are

good.

So I actually like the approaches
that Postgres has taken with

respect to the WAL as well as
logical replication.

And by the way, I think logical
replication theoretically can

do better things than what it does
now, and we should push those

limits.

But yes, I think the issue about schema changes not being part of logical replication, it feels like that is also a theoretically solvable problem, except that people haven't gotten to it.

I think there are issues about
the transactionality of DDLs which

doesn't even exist in MySQL.

So at least in Postgres, it exists
in most cases.

There are only a few cases where
it is not.

And for such things like a proxy
layer, like Multigres or Vitess,

it's not a problem for them, because you can say, oh, OK, this

particular construct is not transactionally
safe, well, then

we'll even prevent you from doing
it transactionally because

we don't want you to get the wrong
impression.

We'll let you do it non-transactionally,
and we know that it's

non-transactional, and therefore,
we can do something about it,

right?

Those abilities didn't exist previously, but eventually, if it

becomes transactional, then we
can actually include it in a transaction.

Nikolay: Yeah, just for those who
are curious, because there

is a concept, all DDL in Postgres
is transactional.

Here we talk about things like CREATE INDEX CONCURRENTLY, because we had a discussion offline about this before recording.

So yeah, CREATE INDEX CONCURRENTLY can be an issue, but you obviously have a solution for it.

That's great.
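
(For readers less familiar with this corner: CREATE INDEX CONCURRENTLY is the classic example of Postgres DDL that refuses to run inside a transaction block, which is what makes it awkward for a proxy layer that wraps statements in transactions.)

    BEGIN;
    CREATE INDEX CONCURRENTLY idx_users_email ON users (email);
    -- ERROR:  CREATE INDEX CONCURRENTLY cannot run inside a transaction block
    ROLLBACK;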

Sugu: The way I would say it is, we have dealt with much worse with MySQL.

So this is much better than what was there then.

Sounds good.

Yeah.

Nikolay: Okay, good.

And let's talk about physical replication because I saw you are

going to use it.

So each shard is going to have a standby.

And yeah, so with quorum commit, right?

So like we don't lose data because data is on local disks by

default, as I understand, like ephemeral in cloud.

So we want to be sure that data is written to at least 2 places,

for example, or 3, right?

Configurable, of course.
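
(For reference, this is roughly what that quorum configuration looks like in stock Postgres today, whose corner cases are discussed next; the settings are standard, the standby names are placeholders.)

    -- wait for acknowledgement from any 1 of the 2 listed standbys before COMMIT returns
    ALTER SYSTEM SET synchronous_standby_names = 'ANY 1 (standby_a, standby_b)';
    ALTER SYSTEM SET synchronous_commit = 'on';   -- or remote_write / remote_apply
    SELECT pg_reload_conf();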

Here, there's this interesting talk by Kukushkin, he presented it recently at an online conference by Microsoft, describing that synchronous replication in Postgres is not what you think.

Sugu: Correct.

So, right.

Correct.

Nikolay: What are you going to do about this?

Sugu: Well, I was just chatting with someone and essentially

synchronous replication is theoretically impure when it comes

to consensus.

I think it's provable that if you use synchronous replication

then you will hit corner cases that you can't handle.

And the most egregious situation is that it can definitely lead to some level of split brain, but in some cases it can even lead to downstream issues.

Because it's a leaky abstraction, it's a leaky implementation,

there are situations where you can see a transaction and think

that it is committed.

And later, the system may fail.

And in the recovery, may choose not to propagate that transaction,

or may not be able to, and it's going to discard that transaction

and move forward.

Nikolay: But this is the same as with asynchronous replication,

it's the same, we're just losing some data, right?

Sugu: Exactly, yeah, it is the same as asynchronous replication,

Yes.

Nikolay: It's not split-brain, it's just data loss.

Sugu: It's data loss, correct.

It's data loss, but for example, if you are running like a logical

replication off of 1 of those, then that logical replication

may actually propagate it into an external system and now you

have corrupted downstream systems that don't match the source.

So those risks exist, and at Vitess scale people see this all the time, for example, and they have to build defenses against this

and it's very, very painful.

It's not impossible, but it's very hard to reason about failures

when a system is behaving like this.

So that is the problem with synchronous
replication.

And this is the reason why I feel
like it may be worth patching

Postgres.

Because there is no existing primitive
in Postgres on which you

can build a clean consensus system.

I feel like that primitive should
be in Postgres.

Nikolay: I now remember from Kukushkin's talk, there is another case, when a transaction on the primary looks like it's not committed because we wait for a replica, but the replica somehow lost the connection or something.

And the client thinks it's not committed because the commit was not returned.

But then it suddenly looks committed.

It's like not data loss, it's data un-loss

Sugu: somehow.

Boom.

Nikolay: Like suddenly, and this
is not all right as well.

And when you think about consensus, I think you are very good at describing these things, like concepts in distributed systems.

It feels like if you have 2 places
to write, definitely there

will be corner cases where something
will go off if you don't

use two-phase commit, right?

Sugu: Correct.

Nikolay: And here we have this.

But when you say you're going to
bring something with consensus,

it immediately triggers my memory
how difficult it is and how

many attempts it was made to bring
pure HA to Postgres, just

to have auto-failover.

All of them failed.

All of them.

And they say, let's be outside
of Postgres.

So here maybe it will be similar
complexity to bring these 2

inside Postgres.

Is it possible to build it outside
this thing?

Sugu: It is not possible to build
it outside.

Because if it was, that is what
I would have proposed.

The reason is because building
it outside is like putting band-aid

over the problem.

It will not solve the core problem.

The core problem is you've committed data in 1 place, and that data can be lost, and there is a gap when the data can be read by someone. That is the root cause of the problem.

That is unsolvable.

Even if, later, Raft may choose to honor that transaction or not.

And that becomes ambiguous, but
we don't want ambiguity, right?

Nikolay: What if we created some extension point for commit, like made it extendable to talk to some external system to understand whether it can be finalized or something?

I don't know, consensus.

Correct.
Sugu: Correct.

So if you, essentially, if you
reason through about this, your

answer will become a two-phase
system.

Nikolay: Yeah.

Sugu: Without a two-phase system.

Nikolay: Which scares me.

But as I told you, two-phase commit in the Postgres OLTP world is considered really, really slow, and the rule is: let's just avoid it.

I see your enthusiasm, and I think... I couldn't find good benchmarks, zero published.

Sugu: This is not two-phase commit,
by the way.

This is two-phase synchronization.

Nikolay: I understand.

It's not, in two-phase commit,
it's like more communication happens.

I understand this.

Sugu: So two-phase synchronization,
the network overhead is exactly

the same as full sync, because
the transaction completes on the

first sync.

Later it sends an acknowledgement
saying that yes, I'm happy,

you can commit it, but the transaction
completes on the first

sync, so it will be no worse than
full sync.

Nikolay: Yeah, compared to current
situation when primary commit

happens, but there is a lock which
is being held until- Correct,

Sugu: it is the same cost.

Nikolay: Yeah, it is the same cost.

We wait for the standby, and for the user it looks like, when the lock is released, okay, the commit happened.

But the problem with this design: if, for example, the standby restarts, the lock is automatically released and the commit is there, and it's unexpected.

This is data un-loss, right?

So you are saying we can redesign
this, network cost will be

the same, but it will be pure.

Yeah, that's great, I like this.

I'm just thinking, will it be acceptable?

Because bringing autofailover
is not acceptable.

There was another attempt last
year from someone and with great

enthusiasm, let's bring autofailover
inside Postgres.

Actually, maybe you know this guy,
it was Konstantin Osipov who

built a Tarantool database system.

It's like an in-memory database.

He was ex-MySQL, in performance, after Zaitsev.

Zaitsev was at MySQL, then Osipov was at MySQL.

And so Konstantin came to Postgres
saying let's build this.

Great enthusiasm, but it's extremely hard to convince people to accept such a big thing in core.

So if you say it's not a big thing, this already...

Sugu: So I'll probably have to explain it in a bigger

blog, but essentially, now that
I've studied the problem well

enough, the reason why it's hard to implement consensus in Postgres

with the write-ahead log is because they are trying to make Raft work with

the write-ahead log.

And there are limitations about how the Raft, how commits work

within Postgres that mismatch with how Raft wants commits to

be processed.

And that mismatch, so far, I have not found a way to work around

that.

But, the variation of Raft can be made to work.

Nikolay: Interesting.

Sugu: The way the, I don't know if you know about my blog series

that I wrote when I was at PlanetScale.

It's an 8 part blog series about generalized consensus.

People think that Raft is the only way to do consensus, But it

is 1 of a thousand ways to do consensus.

So that blog series explains the rules you must follow if you

have to build a consensus system.

And if you follow those rules, you will get all the properties

that are required by a consensus system.

So This 1 that I have, the design that I have in mind, follows

those rules, and I am able to prove to myself that it will work,

but it's not Raft.

It's going to be similar to Raft.

I think we can make Raft also work, but that may require changes

to the write-ahead log, which I don't want to do.

So this system I want to implement without changes to the write-ahead log

as possibly a plugin.

Nikolay: Yeah.

Well, now I understand another reason why you cannot take Patroni: not only because it's Python, but also because you need another version of the consensus algorithm.

Sugu: Correct, correct.

Nikolay: And among those hundreds, thousands, millions of ways.

Sugu: By the way Patroni can take this and use it because it's

very close to how FullSync works.

Nikolay: Good.

Okay.

Michael: I was just thinking, watching Alexander Kukushkin's

talk, he said a couple of things that were interesting.

1 is that he was surprised that this hasn't happened upstream.

So you definitely have an ally in Kukushkin in terms of trying

to get this upstreamed, but also that he thinks every cloud provider

has had to patch Postgres in order to offer their own high availability products with Postgres; each 1 has had to patch it.

And they are having to, you mentioned earlier today how painful

it is to maintain even a small patch on something.

Nikolay: I don't think it's every, I think it's Microsoft for

sure, knowing where Kukushkin works at.

Yeah.

But maybe more, not every.

Yeah,

Michael: All I mean is that there are a growing number of committers working for hyperscalers and hosting providers.

So I suspect you might have more optimism for consensus, or at least a few allies in terms of getting something committed upstream. So I personally think there might be a growing chance of this happening, even if it hasn't in the past for some reason.

Sugu: Yeah, I feel like also being
new to the Postgres community,

I am feeling a little you know,
shy about proposing this upfront.

So what I am thinking of doing
is at least show it working, show

it working at scale, have people
gain confidence that this is

actually efficient and performant
and safe.

So I also plan to, I don't know
if you've heard of FlexPaxos,

which is actually, in my opinion,
a better way to handle durability,

because today's cloud environments
are a lot more complex, and

a simple majority-based quorum
is actually very hard to configure

if your needs are different, which
actually FlexPaxos does handle.

It's actually something I'm a co-inventor
of, of some sort.

And this blog post...

Nikolay: I only heard the name.

That's it.

So I like...

Yeah.
Can you explain a little bit?

Not super difficult.

Sugu: Oh sure.

Yeah.

So the...

Actually, let me explain what is
the reason why.

So FlexPaxos was published a few
years ago, about 7 years ago

or so.

And you'll see my name mentioned
there, which I feel very proud

of.

And the blog series that I wrote is actually a refinement of FlexPaxos.

And that actually explains better
why these things are important.

The reason why it's important is
because people think of consensus

as either a bunch of nodes agreeing
on a value, right?

That's what you commonly hear.

Or you think of like reaching majority,
reaching quorum is important.

But the true reason for consensus
is just durability.

When you ask for a commit and the
system says, yes, I have it,

you don't want the system to lose
it.

So instead of defining quorum and all those things, define the problem as it is and solve it the way it was asked: how do you solve the problem of durability in a transactional system?

And the simple answer to that is,
make sure your data is elsewhere.

Michael: Yeah, I love how simple
you make it.

Sugu: Yeah, if you make sure your
data is elsewhere, if there

is a failure, your challenge is to find out where the data is

and continue from where it went.

And that is all that consensus is about.

And then all you have to do is have rules to make sure that these

properties are preserved.

And Raft is only just 1 way to do this.

So if you look at this problem, if you approach this problem

this way, you could ask for something like, I just want my data

to go across availability zones.

As long as it's in a different availability zone, I'm happy.

Or you can say, I want the data to be across regions, or I want

at least 2 other nodes to have it, right?

So that's your durability requirement.

But you could say, I want 2 other nodes to have it, but I want

to run 7 nodes in the system, or 20 nodes.

It sounds outrageous but it is actually very practical.

In YouTube we had 70 replicas, but the data had to be in only 1 other node for it to be durable, and we were able to run this at scale.

The trade-off is that when you do a failover, you have a wild goose chase looking for the transaction that went elsewhere, but you find it and then you continue.

And so that is basically the principle of this consensus system.

And that's what I want to bring in Multigres.

While making sure that, for the people that want simpler majority-based quorums, those also work using the same primitives.

Michael: Just quickly to clarify, when you say the wild goose

chase, is that because it was 1 of 70, but different transactions

could have gone to different ones of the 70, or is it always the same 1, but you have to know which 1 that is?

Sugu: No, it could be any one. But the way we ran it, at any given point of time there's only 1 primary, which means that there is only 1 transaction that you have to chase down.

Michael: The latest 1.

Sugu: The latest 1, yes.

Michael: Yeah, makes sense.

Sugu: Yeah.

There was a time when we found that transaction in a different country.

So we had to bring it back home and then continue.

It happened once in, whatever, the 10 years that we ran it.

Nikolay: It's interesting that talking about sharding, we need

to discuss these things, which are not sharding per se, right?

So it's about HA inside each shard, right?

Sugu: It's actually like what I would call healthy database principles,

which is, I think, somewhat more important than sharding.

Nikolay: Yeah.

Yeah.

Yeah.

Michael: It is true that it is to do with it being a distributed

system, right?

And that is because it's sharded, no?

Sugu: I think they are orthogonal.

Okay.

Yeah, I think sharding, like you can do sharding on anything,

Right?

Like you can do sharding on RDS.

Somebody asked me, like, what about Neon?

I said, you can do sharding on Neon too.

It's like you put a proxy in front and then it does the sharding.

But the problem with sharding is it is not just a proxy.

That's what people think of it when they first think of the problem

because they haven't looked ahead.

Once you have sharded, you have to evolve.

You start with 4 shards, then you have to go to 8 shards.

And the evolution is not linear this way.

Actually, it's an exponential growth because 4, 8, 16.

But at some point of time, it changes because your sharding scheme

itself will not scale.

Like if you, for example, are in a multi-tenant workload and

you say shard by tenant, at some point of time, a single tenant

is going to be so big that they won't fit in an instance.

And that we have seen.

And at that time, we had to change the sharding scheme.

So how do you change the sharding scheme?

Slack had to go through this, where they were a tenant-based

sharding scheme, and a single tenant just became too big.

They couldn't even fit 1 tenant in 1 shard.

So they had to change their sharding scheme to be user-based.

They actually talk about it in 1 of their presentations.

And Vitess has the tools to do these changes without actually

you incurring any kind of downtime, which again, Multigres

will have.

I keep talking about Vitess, but these are all things that Multigres

will have, which means that you are future-proofed when it comes to this.

And these are extremely difficult problems to solve.

Because when you're talking about changing the sharding scheme,

you are basically looking at a full crisscross replication of

data.

And across data centers.

Nikolay: Yeah, and also, as I know, it has version 3, right?

It was when you changed, basically created, a new planner, right, to deal with arbitrary queries and understand how to route them properly and where to execute them, whether it's a single shard or it's global or it's different shards and so on.

Like what's, are you going to do the same with Postgres?

I think yes.

Right.

Sugu: So that's the part that I'm still on the fence about.

By the way, the v3 now has become
Gen 4.

It's actually much better than
what it was when I built it.

The problem with v3 is that it is still not a full query engine.

It doesn't support the full query set yet.

It supports like 90% of it, I would say, but not everything.

The temptation is, there's the Postgres engine that supports everything.

So I'm still debating how do we
bring the 2 together?

If it was possible to do a simple
git merge, I would do it.

But obviously this 1 is in C, this
was in Go.

And the part that I'm trying to
figure out is how much of the

sharding bias exists in the current
engine in Vitess.

If we brought the Postgres engine
as is, without that sharding

bias, would this engine work well
for a sharded system?

Nikolay: So this looks like the Citus sort of approach, if you bring in the whole Postgres.

There's a library, libpg_query,
by Lukas Fittl, which basically

takes the parser part of Postgres
and brings it to...

And there is a Go version of it
as well.

Sugu: Oh, libpg_query you said?

Nikolay: Yeah, yeah, yeah, I will
send it to you.

So many, many, many, many systems
use it when we need to parse.

Yeah, 1 day I told the SPQR guys about this, and eventually I think they ended up using it.

I think Lev Kokotov uses it, PgDog also uses it, the Rust version.

Sugu: Is it like 100% Postgres
compatible?

Nikolay: Well, it's based on Postgres
source code.

So the parser is the real one, but it's not the whole Postgres.

So maybe you should consider this.

Right?

If you're thinking about parsing,
I mean, queries and so on,

but I'm very curious, I also noticed
you mentioned routing, like

read-only queries routed to replicas
automatically.

And this concerns me a lot, because many Postgres developers, I mean users, they use PL/pgSQL functions, PL/Python functions, anything, which are writing data.

And the standard way to call a function is SELECT, SELECT function name.

So understanding that this function is actually writing data is not trivial.
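
(A minimal illustration of the problem, with invented names: the call below looks like a read to anything that only inspects the top-level SQL, yet it writes.)

    CREATE FUNCTION record_view(p_video_id bigint) RETURNS void
    LANGUAGE plpgsql AS $$
    BEGIN
        UPDATE videos SET views = views + 1 WHERE id = p_video_id;  -- a write
    END;
    $$;

    SELECT record_view(42);  -- syntactically a SELECT, semantically a write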

Right, and we know about pgpool, which all my life I've just avoided.

I touched it a few times and decided not to use it at all, because it tries to do a lot of stuff at once, and I always considered, no, I'm not going to use this tool.

So pgpool solves it by saying, okay, let's build a list of functions which are actually writing, or something like this.

So it's like a patch approach, you know, a workaround approach.

So this is going to be a huge challenge,
I think, if you...

Sugu: Yeah, yeah.

Nikolay: For automatic routing, it's a huge challenge.

Sugu: Yeah, I think this is the reason why I think it is important

to have the full Postgres functional engine in Multigres, because

then these things will work as intended is my hope.

What we will have to do is add our own sharded understanding

to these functions and figure out what does it mean to call this

function, right?

If this function is going to call out to a different shard, then

that interpretation has to happen at the higher level.

But if that function is going to be accessing something within the shard, then push the whole thing down, push the whole SELECT along with the function down, and let the individual Postgres instance do it.

Nikolay: Yeah, but how to understand?

Function can contain another function and so on.

It can be so complex in some cases.

It's also funny that there is still, and actually Google Cloud, CloudSQL supports it, a kind of language, it's not a language, called PL/Proxy, which does sharding.

For those who have workload only in functions, this can route

to proper shard.

It was created at Skype.

It still exists, but not super popular these days.

But there is a big requirement to write everything in functions.

In your case, if you continue like this, I would expect in some case you would say, okay, don't use functions.

But I'm afraid it's not possible.

Like, I love functions.

Actually, Supabase loves functions because they use PostgREST, right?
PostgREST, like, it provokes you to use functions.

Sugu: Oh, really?

Oh, yeah, yeah.

Actually, I saw that.

Yeah.

So in Vitess, I feel like this was a mistake that we made, which is: if we felt that any functionality that you used didn't make sense, like, if I were you, I wouldn't do this, right, because it won't scale, it's a bad idea, those we didn't support.

We didn't want to support them.

We said, no, we will never do this for you, because we will not give you a rope long enough to hang yourself.

Basically, that was our philosophy.

But in Multigres, we want to move away from that, which means

that if you want to call a function that writes, have at it.

Nikolay: Just put a comment, it's going to write something.

Sugu: Yeah.

If you want a function that calls
a function that writes have

at it.

If we cannot, like the worst case
scenario for us is we don't

know how to optimize this.

And what we will do is we'll execute the whole thing on the VTGate side.

Nikolay: There's another, I remember there is an interesting solution in, I think, AWS RDS Proxy, when they, as I know it, maybe I'm wrong, when they needed to create a global... it's called, I think, Aurora Global Database maybe, or something like this.

So there is a secondary cluster
living in a different region

and it's purely read-only, but
it accepts writes.

And when write comes, this proxy
routes it to original primary,

waits until this write is propagated
back to replica and responds.

Sugu: Oh, wow.

Yeah.

I don't think that feature, I don't
know how even that feature

can be supported.

Nikolay: No, no, it's just some
exotic, interesting solution

I just wanted to share.

Maybe, you know, if we, for example,
if you originally route

a write to a replica, then somehow
in Postgres you understand,

oh, it's actually a write.

Sugu: Okay, Yeah, so maybe 100%
is theoretically impossible to

support.

Nikolay: Yes, it's okay.

It's super exotic.

Okay.

Sugu: But I think if people are
doing things like that, it means

that they are trying to solve a
problem that doesn't have a good

existing solution.

Nikolay: Exactly.

Sugu: So if we can find a good
existing solution, I think they'll

be very happy to adopt that instead
of whatever they were trying

to do.

Nikolay: Well, this is just a multi-region setup.

I saw more than 1 CTO who wanted it for dealing with Postgres, like: say we are still single-region, we need to be present in multiple regions in case 1 AWS region is down, right? So it's about availability and business characteristics. So yeah, anyway, okay, it's exotic but interesting still.

Michael: you've got a lot of work
ahead of you, Sugu.

Sugu: I feel like we barely covered
like 1 of so many topics.

Nikolay: Let's touch something
else.

Maybe it's a very long episode,
but it's worth it, I think.

It's super interesting.

What else?

What else?

Sugu: I think the other interesting
1 would be 2PC and isolation.

Nikolay: Isolation from what?

Sugu: Like the 1 issue with a sharded solution is that, again, this was a philosophy: for the longest time in Vitess we didn't allow 2PC.

We said, you shard it in such a way that you do not have distributed transactions.

And many people lived with that.

And some people actually did

Nikolay: not adopt Vitess.

Let me interrupt you here, because this is like the best feature I liked about Vitess, this materialize feature where data is brought over.

Sugu: Oh yeah, materialized is another topic.

That's actually a better topic than 2PC.

Nikolay: Well, yeah, because this is your strength, right?

So this is like, I love this idea, basically distributed materialized

view, which is incrementally updated.

Sugu: Yes, yes, yes.

Nikolay: That's great.

We need it in Postgres ecosystem.

Just maybe as a separate project, even, you know, like we lack

it everywhere.

So yeah, this is how you avoid distributed transactions basically,

right?

Sugu: No, this is 1 way to avoid it.

1 way, yeah.

Like there are 2 use cases where materialized views are super

awesome.

You know a table that has multiple foreign keys, that has foreign

keys to 2 different tables is the classic use case, where the

example I gave was a user that's producing music and listeners

that are listening to it, which means that the row where I listen

to this music has 2 foreign keys, 1 to the creator and 1 to the

listener.

And where should this row live?

Should this row live with the creator or should this row live

with the listener is a classic problem.

And there is no perfect solution for it.

It depends on your traffic pattern.

But what if the traffic pattern is 1 way in 1 case and another

way in another case?

There is no perfect solution.

So this is where, in Multigres, what you could do is say: okay, in most cases this row should live with the creator, let's assume that, right? So then you say this row lives with the creator and we shard it this way, which means that if you join the creator table with this event row, it'll be all local joins.

But if you join the listeners table with this event row, it's a huge cross-shard wild goose chase.

So in this case, you can say materialize this table using a different

foreign key, which is the listeners
foreign key into the same

sharded database as a different
table name.

And now you can do a local join
with the listener and this event

table.

And this materialized view is near
real-time, basically the time

it takes to read the WAL and apply
it.
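
(A hypothetical sketch of the schema in this example; table and column names are invented, and the final comment describes the materialize idea as explained here, not any specific Multigres syntax.)

    CREATE TABLE creators  (creator_id  bigint PRIMARY KEY, name text);
    CREATE TABLE listeners (listener_id bigint PRIMARY KEY, name text);

    -- the event row has 2 foreign keys, so it can be co-located with only 1 of them
    CREATE TABLE listens (
        listen_id   bigint PRIMARY KEY,
        creator_id  bigint REFERENCES creators,
        listener_id bigint REFERENCES listeners,
        listened_at timestamptz
    );

    -- shard listens by creator_id, and keep a near-real-time materialized copy
    -- (say, listens_by_listener) sharded by listener_id, so joins from either
    -- side stay local to a shard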

And this can go on forever.

And this is actually also the secret
behind resharding, changing

the sharding key.

This is essentially a table that is present in real time with 2 sharding keys.

If you say, oh, at some point of time, this one is more authoritative, all you have to do is swap this out.

Make 1 the source, the other is
a target.

You've changed your sharding key.

Actually, changing the sharding key works exactly like this for a table.

Nikolay: Distributed denormalization
technique.

This is what

Michael: it is.

Sugu: Yeah, yeah, yeah, exactly.

And the other use case is when
you reshard, you leave behind

smaller tables, reference tables,
we call them.

And they have to live in a different
database because they are

too small and even if you shard
them, they won't shard well.

Like if you have a billion rows
in 1 table and a thousand rows

in a smaller table, you don't want
to shard your thousand row

table.

And there's no benefit to sharding
that either.

So it's better that that table
lives in a separate database.

But if you want to join between
these 2, how do you do it, right?

The only way you join is join at
the application level, read

1 and then read the other.

And so at high QPS, it's not efficient.

So what we can do is actually materialize
this table on all the

shards as reference, and then all
joins become local.
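
(A small sketch of the reference-table idea, with invented names: a tiny lookup table copied onto every shard so that joins against it never leave the shard.)

    -- small lookup table, materialized as a reference table on every shard
    CREATE TABLE currencies (code text PRIMARY KEY, name text);

    -- with a local copy on each shard, this join never crosses shards
    SELECT o.id, o.amount, c.name
    FROM   orders o
    JOIN   currencies c ON c.code = o.currency_code;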

Nikolay: Yeah.

And you definitely need logical
replication for all of this.

So this is where we started, like
challenges with logical replication.

Sugu: Yeah, yeah.

You do have... so the reason why 2PC is still important is because there are trade-offs to this solution, which is, there's a lag.

So it takes time for the things
to go through the WAL and come

to the other side.

Whereas 2PC is essentially, basically
the transaction system

itself trying to complete a transaction,
which means that it

will handle cases where there are
race conditions, right?

If somebody else tries to change
that row elsewhere while this

row is being changed, 2PC will
block that from happening, whereas

in the other case, you cannot do
that.

Nikolay: Yeah, if it's just user views of some video like on

YouTube, we can say, okay, there will be some lag, probably some

small mistake, it's fine.

But if it's financial data, it should be 2PC, but latency of

write will be high, throughput will be low, right?

This is...

Sugu: I actually want to...

I read the design of...

Which is, again, by the way, a very elegant API, and I assume... I can see the implementation from the API, and I don't think we will see performance problems with 2PC.

Nikolay: We need to benchmark it

Sugu: We will, we will benchmark it, but I will be very surprised.

I think there are some isolation issues that we may not have

time to go through today because it's a long topic.

Like the way 2PC is currently supported in Postgres, I think

it'll perform really well.
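For reference, the building blocks Postgres exposes for this are the PREPARE TRANSACTION, COMMIT PREPARED, and ROLLBACK PREPARED statements; the little coordinator below is only an illustrative sketch with placeholder DSNs and SQL, not how Multigres drives 2PC.

    # Requires max_prepared_transactions > 0 on each participating primary.
    # A real coordinator also persists its state so it can resolve in-doubt
    # transactions after a crash; this sketch skips that entirely.
    import psycopg2

    def two_phase_commit(dsns, statements, gid="demo-tx-1"):
        conns = [psycopg2.connect(dsn) for dsn in dsns]
        for conn in conns:
            conn.autocommit = True  # manage transaction boundaries explicitly
        prepared = []
        try:
            # Phase 1: run each shard's work and PREPARE it.
            for conn, sql in zip(conns, statements):
                cur = conn.cursor()
                cur.execute("BEGIN")
                cur.execute(sql)
                cur.execute(f"PREPARE TRANSACTION '{gid}'")
                prepared.append(conn)
            # Phase 2: every participant prepared, so commit everywhere.
            for conn in conns:
                conn.cursor().execute(f"COMMIT PREPARED '{gid}'")
        except Exception:
            # Roll back whatever managed to prepare before the failure.
            for conn in prepared:
                try:
                    conn.cursor().execute(f"ROLLBACK PREPARED '{gid}'")
                except Exception:
                    pass
            raise
        finally:
            for conn in conns:
                conn.close()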

Nikolay: Isolation issues when we're in read committed and use 2PC.

You mean this, right?

Not in repeatable read.

In default.

Sugu: Yeah, read committed, I think there will be some tradeoffs on read committed, but not the kind that will affect most applications.

MVCC will be the bigger challenge.

But from what I hear, most people don't use it; the most common use case is read committed.

Nikolay: Of course, as default.

Yeah, it's faster.

Sugu: It's the default.

Yeah.

So people won't even notice, I think.

Nikolay: They're already in a bad state anyway.

It won't be worse.

Sugu: It won't be worse.

Yes.

Nikolay: Yes.

Yeah.
So 2PC, of course, depends a lot on the distance between nodes, right?

Like if they are far, the client is somewhere, the 2 nodes are somewhere, and if it's different availability zones, it depends, right?

So this distance is a big contributor to latency, right?

Network.

Because there are 4 communication messages that are needed.

So.

Sugu: Correct, correct.

Actually, I'd have to do the mathematics for it, but you're probably right.

It's about double the number of
round trips.
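A back-of-envelope illustration of that doubling, with purely assumed latencies; none of these numbers come from a benchmark.

    rtt_ms = 0.5                           # assumed same-AZ round trip
    single_shard_commit = 1 * rtt_ms       # ~0.5 ms: one commit round trip
    two_pc_commit = 2 * rtt_ms             # ~1.0 ms: prepare + commit phases
    cross_az_rtt_ms = 1.5                  # assumed cross-AZ round trip
    two_pc_cross_az = 2 * cross_az_rtt_ms  # ~3.0 ms: why proximity matters
    print(single_shard_commit, two_pc_commit, two_pc_cross_az)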

Nikolay: Yeah, if we put everything
in 1 AZ, client and both

primaries, we are fine.

But in reality, they will be in
different places.

And if it's different regions, it's a nightmare, of course.

But at least it's...

Sugu: Yeah, the 2PC is not done by the client, by the way.

The 2PC would be done by VTGate, which should have the nodes nearby.

Nikolay: Ah, it should have. In 1 availability zone?

Sugu: Unless you did some crazy configuration, they should be pretty close to each other.

Nikolay: Pretty close still means different availability zones in the general case, right?

Sugu: No, no.

The availability zone is only for
durability.

Nikolay: Okay.

Sugu: For replica level.

But with 2PC, you're coordinating between 2 primaries, which may actually be on the same machine for all you care.

It's not...

Well,

Nikolay: Imagine the real practical
case.

We have a sharded schema, we have...

Every shard has a primary and a couple of standbys, right?

Sugu: Correct.

Correct.

Nikolay: So are you saying that
we need to keep primaries all

in the same availability zone?

Sugu: That's usually how things
are.

Nikolay: Ah, interesting.

I didn't know about this.

By the way, I wanted to rant a little bit about the PlanetScale benchmarks they published last week.
They compared to everyone.

They compared to everyone.

I wanted just, I'm sorry, I will take a little bit of time.

They compared to everyone, and they just published, like, PlanetScale versus something.

And on this very topic, on the charts, we have PlanetScale in a single AZ, everything, client and server, in the same AZ.

And the line which is like the normal case, client in a different AZ.

And the line with the same AZ is active, the normal line is not active.

And for others, like Neon, Supabase, everyone, it's different.

And of course, PlanetScale looks really good, because by default they presented numbers for the same availability zone.

And below the chart, everything is explained, but who reads it, right?

So people just see the graphs.

And you can unselect and select the proper PlanetScale numbers and see that they are similar.

But by default, the same-AZ number is chosen.

And this is like benchmarking,
you know, like.

Sugu: Well, I think if you look at the architecture, even in a fair comparison, PlanetScale should come out ahead, like the performance of a local disk, of course, should.

Nikolay: But this was SELECT 1, not disks.

Sugu: SELECT 1 is not a benchmark.

Nikolay: Well, it was part of the benchmark.

It's just checking the query path, but it fully depends on where the client and server are located.

So what's the point of showing better numbers by just putting the client closer?

I don't like that part of that
benchmark.

Sugu: Okay, yeah.

I saw the publications, but I didn't go into the details because

I thought, well, it has to be faster
because it's on local disks.

So why?

Nikolay: Yeah, for data which is
not fully in cache, of course.

Yeah, local disks are amazing versus EBS volumes.

Sugu: You're right.

Yeah, if the data is in cache, then, yeah, the performance of everything would be the same.

Nikolay: Yeah, well, I wanted to
share this.

I was annoyed about this, but I
fully support the idea of local

disks.
It's great.

I think we need to use them more
in more systems.

Sugu: I think, I wouldn't be surprised
if you reached out to

PlanetScale.

They may be willing to, like if
you want to run your benchmarks,

they may be willing to give you
the...

Nikolay: Yeah, the source code is published, and in general the benchmarks look great, the idea is great.

And actually, with local disks, the only concern is usually the limit, the hard limit.

We cannot have more space.

But if you have a sharded solution, there is no such thing.

Sugu: Correct.

But speaking about the hard limit, with today's SSDs you can buy a 100-plus-terabyte SSD, a single SSD, and you can probably stack them up,

Nikolay: 1 next to the other.

Yeah, yeah.

But in the cloud there are limitations, for instance.

Sugu: Okay, yeah.

I saw AWS SSD over 100 terabytes.

Nikolay: In Google Cloud, 72 terabytes is the hard limit, for Z3 metal.

And I didn't see more.

So 72 terabytes, it's a lot, but sometimes it's already notable.

Sugu: Yeah, at that limit, your storage is not the limit.

You will not be able to run a database of that size on a single

machine.

Why not?
You will max out your CPU.

Nikolay: We have cases, CPU.

Well, again, the problem will be replication.

If we talk about a single node, we can have...

Sugu: Or replication.

Nikolay: ...360 cores in AWS, almost 1,000 cores already for Xeon Scalable generation 5 or something.

So hundreds of cores.

The problem is Postgres design.

If replication, physical replication was multi-threaded, we could

scale more.

Sugu: By the way, replication is not the only problem.

Backup and recovery.

If your machine goes down, you're down for hours, recovering something of that size.

Nikolay: Yeah.

Not many hours.

Someone in my team recently saw 17 or 19 terabytes per hour for

recovery with pgBackRest or WAL-G.

In AWS.

I was like, my jaw dropped.

On our podcast I usually say 1 terabyte.

Sugu: Can you repeat that?

17 or 19 terabytes per hour?

Nikolay: 17 terabytes per hour.

With local disks.

So this is important.

With EBS, it's impossible.

Sugu: It's good to know.

Nikolay: Yeah.
With Michael, I was always saying 1 terabyte is like what you

should achieve.

If it's below, it's bad.

Now I'm thinking 1 terabyte is already...

Yeah, yeah.

So with EBS volumes, we managed to achieve, I think, 7 terabytes

per hour to restore with WAL-G.

And that's great.

Sugu: But there's a danger there.

You could become a noisy neighbor.

So we actually built throttling in our restore just to prevent

noisy neighbors.
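To put those throughput numbers in perspective, a quick back-of-envelope calculation; the database size here is just an assumed example.

    db_size_tb = 20
    print(db_size_tb / 17)  # ~1.2 hours at 17 TB/hour (local NVMe, many threads)
    print(db_size_tb / 7)   # ~2.9 hours at 7 TB/hour (EBS volumes)
    print(db_size_tb / 1)   # 20 hours at the old 1 TB/hour rule of thumb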

Nikolay: With local disks, you lose the ability to use EBS snapshots, cloud disk snapshots.

Sugu: Correct, correct, yeah.

Nikolay: That's what you lose, unfortunately.

And they're great, and people enjoy them more and more.

Yeah, so I agree, and as I remember, for 17 terabytes it was 128 threads of WAL-G or pgBackRest, I don't remember, on local disks.

I need to update my...

Sugu: Technology is changing too fast.

Nikolay: Exactly, yeah.

And hundreds of cores, terabytes of RAM already, right?

Like,

Michael: yeah.

Yeah, yeah.
But it does go straight to your point of the smaller they are,

the faster you can recover still.

Sugu: You don't hit some of these limits.

These systems were not designed with these types of limits in

mind.

Some weird data structure, suddenly the limit of this is only

100 items.

And you hit those limits and then you're stuck.

Recently, Metronome had an issue.

They had that, the...

MultiXact.

The MultiXact thing, which nobody had ever run into before, but they hit that problem.

Nikolay: Yeah, we saw many problems also when you're on the edge.

And it actually pushes Postgres forward sometimes, but if you want to be on the safe side...

I really like this kind of resilience characteristic where, even if it's down, only a small part of your system is down.

That's great.

Sugu: Correct.

Nikolay: Yeah.

That's a mature architecture already.

Sugu: That actually makes it easier to achieve 5 nines uptime.

Because that's the way you calculate.

Like if only 1 node is down, you divide it by the number of users being affected.

Nikolay: Budget.

Sugu: Downtime budget.

Yeah, budget.

Yeah.
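A small worked example of that budget math; the shard count and outage length are assumed for illustration.

    minutes_per_year = 365 * 24 * 60
    five_nines_budget = minutes_per_year * (1 - 0.99999)  # ~5.3 minutes/year
    shards = 64
    outage_minutes = 30                                   # 1 shard down for 30 min
    user_weighted_downtime = outage_minutes / shards      # ~0.47 minutes charged
    print(round(five_nines_budget, 2), round(user_weighted_downtime, 2))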

Nikolay: That's good.

Cool.

I think it's maybe 1 of the longest episodes we've had.

Enjoyed it.

Oh my God.

I enjoyed it.

I hope we will continue the discussion of issues with logical replication, for example, and so on, and maybe whether things will be improved and so on.

Looking forward to testing the POC once you have it.

Thank you so much

for coming.
Sugu: I am so excited.

Nikolay: Yeah, thank you.

Michael: Yeah, are there any last things you wanted to add, or anything you wanted help from people on?

Sugu: I would say it feels like
nothing is happening on the repository

except me pushing, you know, a
few pushes, a few things, changes,

but a huge amount of work is happening
in the background.

Like some of this design work about consensus is almost ready to go, and there's also hiring going on, there are people coming on board very soon, so you will see this snowball.

It's a very tiny snowball right now, but it's going to get very big as momentum builds up.

So pretty excited about that.

We may still have 1 or 2 spots
open to add to the team, but it's

filling up fast.

So if any of you are very familiar, this is a very high bar to contribute to Multigres.

You have to understand consensus,
you have to understand query

processing.

But if there are people who want
to contribute, we are still

looking for maybe 1 or 2 people
and also on the orchestration

side and the Kubernetes side of
things.

Yeah, so that's...

Nikolay: Do you mind a small joke
in the end?

Just not to finish on...

Sugu: I do not mind at all.

Let's hear it.

Nikolay: Yeah, so I know what you're
doing.

You're writing a lot of markdown
right now and then you will

feed it to AI.

Sugu: I wish!

Oh my god.

I almost hope that day never comes, but it is so fun working on this project, creating it.

Why do I want to give it to an
AI to do it, you know?

Nikolay: Okay.

Good.

Thank you.

I enjoyed it a lot.

Michael: Yeah.
Yeah.

Thank you so much for joining us.

It's great to have you as part
of the Postgres community now

and I'm excited to see what you
get up to.

Sugu: And me too.

Michael: Wonderful, thanks so much.
