
Multigres
Michael: Hello and welcome to PostgresFM, a weekly show about
all things PostgreSQL.
I am Michael, founder of pgMustard, and I'm joined as usual by
Nik, founder of Postgres.AI.
Hey, Nik.
Nikolay: Hi, Michael.
And let's welcome our guest.
Michael: Yeah, we are joined by a very special guest, Sugu, who
is a co-creator of Vitess, co-founded PlanetScale, and is now
at Supabase working on an exciting project called Multigres.
So welcome, Sugu.
Sugu: Thank you.
Glad to be here.
Michael: All right.
It's our pleasure.
So it's my job to ask you a couple of the easy questions to start
off.
So what is Multigres and why are you working on it?
Sugu: Multigres is a Vitess adaptation for Postgres.
It's been on my mind for a long time, many years, and we've even
had a few false starts with this project.
And I guess there is a timing for everything and finally the
timing has come.
So I'm very excited to get started on this finally.
Michael: Yeah, timing is an interesting one. It feels like for many years I was looking at PlanetScale and Vitess specifically, very jealously, thinking: you can promise the world, you can promise horizontal scaling with a relational database for OLTP, all of the things that people want, and we didn't really have a good answer for it in Postgres.
But all of a sudden, in the last few months, it seems like there are now 3 or 4 competing projects all doing it.
So why now?
Why is it all happening now?
Sugu: I would say, I think there is a timing for every new idea
to gather momentum.
Like the idea may be good, but the timing may be bad for one reason
or another.
But once the timing is good, then it is obvious that it is the
right time.
And it feels like now is that time.
Take PayPal, for example: there had been previous attempts at moving money online, but it never worked.
PayPal just came at the right time and therefore took off.
Many such examples where, you know, some ideas came too early
and didn't take off.
But for Postgres, this feels like it's the right time.
Nikolay: That's interesting.
Let me reveal something.
We met in 2018, I think, at Percona
conference.
And in 2019, we met again in Nutanix
at the meetup.
Go meetup, right.
And discussions about Vitess for
Postgres started.
And a couple of times we had a call.
I tried to involve a couple of guys, but from my understanding it never worked, because you could not do it yourself, being busy with other things, MySQL-related I guess, and the other guys were looking at the complexity involved and didn't proceed.
And actually, in 1 case, they decided to build from scratch.
It was the SPQR project.
It's alive, and it's sharding for Postgres.
Sugu: Yeah, Andrey Borodin.
Nikolay: Yeah, Andrey Borodin.
Borodin.
Yeah.
Yeah.
And other folks were also involved.
And so for me, it was disappointing that it didn't work.
And at some point I saw a message in Vitess, I think, saying we are not going to do it, so don't expect it.
Sugu: I feel so bad because I was
so excited about doing it.
And then I realized, oh my God,
you know.
Nikolay: But now you started and
last week PlanetScale decided
to support Postgres.
So what's happening?
Like, I don't understand.
Just the right time, right?
Yeah.
Enough companies using Postgres that really need it.
Sugu: At least 1 horse will win.
Yeah.
Nikolay: So yeah, it's great.
But, but yeah, long, long story
to this point.
Michael: Yeah.
Sometimes when there are multiple projects, there are slight differences in philosophy or approach or trade-offs, a willingness to trade 1 thing off in relation to another.
And I saw your plan; I really liked that you mentioned building incrementally.
So Vitess is a huge project, lots and lots of features, but I've heard you talk in the past about building it quite incrementally while at YouTube.
You know, it didn't start off as
complex as it is now, obviously,
and you did it kind of 1 feature
at a time.
And it sounds like that's the plan
again with Multigres.
Is that different to some of the other projects?
Or what do you see as your philosophy and how it might differ
slightly to some of the others?
Sugu: I think my philosophy is that I would say I don't want
to compromise on what the final project is going to look like.
So any path that deviates me from hitting the target on the final
project, I do not want to take.
But any shortcuts that can get us early results, those we can do, but only the ones that don't risk what we become eventually, which is a project that should feel native, as if it was for Postgres, by Postgres, kind of thing.
I want it to be a pure Postgres project.
Nikolay: In this case, I must ask: are you sure you're still happy with the choice of the Go language?
Because sometimes we see projects which have really like sub
millisecond latency on average, like 100 microseconds, for example,
on average.
And Go definitely will bring a few hundred microseconds of latency
overhead.
So usually it's not a big deal, but maybe in some cases it's
some deal, right?
Are you happy with Go?
Yeah, I mean- Because you were 1 of the first big Go language
users building Vitess, as we know from various interviews and
so on.
Yeah.
So is it still a good choice? Because now there is Rust, right?
Sugu: Yes.
I would say, by the way, when we started with Go, compared to where Go is today, it was a nightmare.
Nikolay: Yeah, like 10 milliseconds.
Sugu: Like, yeah, like 10 milliseconds or something round trip
is what we were paying for.
Those days we had hard disks, by the way.
So that's another 3 to 5 milliseconds just within the database.
But things are a lot better now.
And at this point, the way I would put it is: the trade-offs are in favor of Go, let's put it that way.
Mainly because there is a huge amount of existing code that you
can just lift and port.
And rewriting all of that in Rust is going to just delay us.
And at least in Vitess, it has proven itself to scale to, you know, hundreds of terabytes.
And the latencies that people see are not affected by a couple of hundred microseconds.
Plus, there's this inherent acceptance of network latency for storage and stuff.
And if you bring the storage local, then this actually wins out over anything that's there.
Nikolay: This is exactly what I wanted to mention.
Yeah, I see PlanetScale right now, they came out with Postgres
support, but no Vitess.
And I'm very curious how much it will take for them to bring
it and compete with you.
It's another interesting question.
But from the past week, my main impression is that the main stake is on local storage.
And this is great because local storage for Postgres, we use
it in some places where we struggle with EBS volumes and so on,
but it's considered like not standard, not safe, blah blah blah.
There are companies who use it, I know.
I know myself, right?
And it's great.
Like today, for example, with Patroni, and since Postgres 12, we don't need to restart standby nodes when we have a failover.
So if we lose a node, we forget about that node, we just fail over and so on.
And with local storage, that's not a big deal.
But now, with your plans to bring local storage, I expect it will become more and more popular, and that's great.
So you shave off latency there and keep Go, which brings its 200 microseconds.
Sugu: Good compromise.
Effectively, it's a win because 1 network hop completely eliminates
language level overheads.
Nikolay: Sounds good.
And maybe Go will improve further as well.
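A back-of-the-envelope sketch of the trade-off being weighed here; all numbers are illustrative assumptions, not measurements from either project:

```go
package main

import "fmt"

func main() {
	// Illustrative latencies in microseconds (assumptions, not benchmarks):
	const (
		networkDiskRead = 500.0 // hypothetical network-attached (EBS-style) volume read
		localDiskRead   = 50.0  // hypothetical local NVMe read
		goProxyOverhead = 200.0 // per-query Go proxy overhead, the figure mentioned above
	)
	saved := networkDiskRead - localDiskRead
	fmt.Printf("local disk saves ~%.0f µs; Go proxy adds ~%.0f µs; net win ~%.0f µs\n",
		saved, goProxyOverhead, saved-goProxyOverhead)
}
```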
Michael: I wanted to go back, Sugu, you mentioned not wanting
to compromise on feeling Postgres native.
That feels to me like a really big statement, coming from Vitess
being very MySQL specific, saying you want to be Postgres native
feels like it adds a lot of work to the project.
It feels like a lot to me.
What does it mean?
Is that about compatibility with the protocol?
What does it mean to be Postgres native?
Sugu: There's 2 answers.
1 is, why do we still think we can bring Vitess if it was built
for MySQL?
And how do you make it Postgres native?
That's because of Vitess's history.
For the longest time, Vitess was built not to be tied to MySQL.
It was built to be a generic SQL 92 compliant database.
That was actually our restriction for a very long time until
the MySQL community said, you need to support all these MySQL
features, otherwise we won't move forward.
Nikolay: CTEs, right?
Common Table Expressions, WITH, right?
It's, I guess, a SQL:99 feature, not 92.
Sugu: Yeah, I think the first parser
I built was SQL 92, which
is the most popular 1 that I know
of.
So that's answer 1.
Answer 2 is more with the behavior
of Postgres.
What we want is to completely mimic the Postgres behavior right from the beginning.
Basically, in other words, we plan to actually copy or translate what we can from the Postgres engine itself, where that behavior is very specific to Postgres.
And the goal is compatibility not just at the communication layer, but even internally, possibly even, you know, at the risk of recreating bugs.
Nikolay: In this case, I must ask: there is Citus, which, in the hands of Microsoft, got everything open sourced.
Before, resharding was only in the paid version; now it's in the free version, so it's fully open source.
And they put Postgres in between, so they don't need to mimic it, they can just use it, right?
And the latency overhead is surprisingly low, we checked it.
Well, it's C, but it's a whole database in between.
But it's sub-millisecond, so it's acceptable as well.
I think it's half a millisecond or so in our experiments with a simple SELECT 1 or something like that.
So don't you think it's quite a challenging point, in comparison, when you say you're going to mimic a lot of stuff, but they just use Postgres in between?
Sugu: Yeah, yeah, I think there's a difference in architecture between Multigres and Citus.
I think the main difference is
it's a single coordinator for
Citus.
And there is some bottleneck issue
with that.
If you scale to extremely large workloads, ones that go into millions of QPS, hundreds of terabytes.
So having that single bottleneck,
I think would be a problem
in the future.
Whereas,
Nikolay: yeah.
I understand you can have multiple coordinator nodes there, and also you can put PgBouncer in front to mitigate the connection issues.
So it can scale as well.
Sugu: That's good.
That's something I didn't know before.
So, yeah, it's possible then that
they may also have something
that can viably scale for OLTP.
Nikolay: Yeah.
So we're still exploring this and
more benchmarks are needed.
Actually, I'm surprised how few, and how non-comprehensive, the published benchmarks for this are.
Sugu: Yeah, what I know of Citus is probably what you told me when we met.
So my knowledge is about 5 years old.
Michael: Yeah, yeah.
Another big difference, and this is typically Nikolay's question, is on the license front. I think you've picked about as open a license as you could possibly pick, which is not the case, I think, for many of the other projects.
So that feels to me like a very Supabase thing to do, and also in line with what Vitess did, and that seems to me like a major advantage in terms of collaborating with others: other providers also adopting this, or working with you to make it better.
What's your philosophy on that side of things?
Sugu: My philosophy is, my metric
for success is adoption.
Yeah.
And the only way to have a project
be adopted is to have a good
license, a license that people
are confident to use.
That has been the case from day
1 of Vitess.
We actually first launched with
BSD license, which is even more
permissive than Apache.
And then when we moved the project
to CNCF, they said, oh, no,
no, we need Apache license, which
is why we converted it to that
1.
So adoption has always been, yeah.
Nikolay: Why do they say it?
Do you know?
Sugu: Why CNCF wants Apache?
I think Apache is a pretty good
license.
They just made it a policy.
I mean, had we asked to keep the
BSD license, they would have
allowed us, but we didn't feel
like it was a problem to move
to Apache.
Nikolay: Yeah, and I remember you
described like, when you did
it, like in YouTube, you thought
about external users.
You need external users for this
project to grow.
I guess at Google, AGPL is not
popular at all, we know.
Sugu: Oh, banned.
Nikolay: Yeah, yeah, banned.
And Citus is AGPL.
Sugu: Which is interesting.
Nikolay: Also, compared to Citus,
I think you have chances to
be compatible with RDS and other
managed Postgres, to work on
top of them, unlike Citus, which
requires extension and so on,
right?
Sugu: Correct.
Correct.
Yes.
This was actually something that we learned very early on, this need to work with unmodified databases.
We made a 5-line change to MySQL just to make Vitess work initially.
And it was such a nightmare to keep that binary up, to keep that build running, to keep that fork alive.
So we decided: no, we are going to make this work without a single line of code changed in MySQL.
And that actually is what helped
Vitess move forward.
Because people would come in with
all kinds of configurations
and say, make it work for this.
So in this case, actually, we'll probably talk about the consensus part.
That is the 1 part where we think it is worth making a patch for Postgres, and we're going to work hard at getting that patch accepted.
But I think what we will do is
we will also make Multigres work
for unpatched Postgres for those
who want it that way.
Except they will lose all the cool
things about what consensus
can give you.
Nikolay: I'm smiling because we
have so many variations of Postgres
these days.
I would expect people coming not
only with different configurations
of Postgres, but also with various
flavors like Aurora.
We have a client who just migrated
from regular Postgres to AlloyDB.
Hello from Google Cloud.
And they already sharded on application
side, but imagine they
could come to you and say, let's
support AlloyDB now.
It's almost Postgres, right?
So yeah, so these things might
happen as well.
Sugu: Don't they claim full compatibility with Postgres?
I thought...
Nikolay: Not full, but most of it.
They did interesting stuff in memory, like column storage in memory for tables.
It's row storage on disk, but column storage in memory.
But it looks like kind of Postgres, and we actually even had some questions answered by my team unexpectedly, because we don't normally work with AlloyDB, but it looks like Postgres.
So I could imagine the request,
let's support AlloyDB as well.
Anyway, my question: I remember a feature in Vitess where it could work with RDS and managed MySQL.
Has this feature survived?
Sugu: No. Actually, later we decided against that. We call it managed versus unmanaged.
Managed means that Vitess manages its databases.
And unmanaged means that the database is managed by somebody else; Vitess just acts as a proxy to serve queries.
At some point of time, we realized
that supporting both is diluting
our efforts.
And that's when we decided, okay,
you know, it's not worth it
to try and make this work with
every available version that exists
out there in the world.
And we said, okay, we will do only
managed, which means that
we will manage it ourselves.
And if you want, we'll build the
tools to migrate out of wherever
you are, and we'll make it safe,
we'll make it completely transparent.
In other words, you deploy Vitess
on both and then we'll migrate
your data out without you having
to change your application.
But then Vitess can be more intentional
about its features,
more opinionated about how clusters
are to be managed.
And we were able to commit to that
because at that point, Vitess
had become mature enough.
People were completely trusting
it.
They actually preferred it over
previous other managed solutions.
So it wasn't a problem at that
time.
Nikolay: Yeah, it's not a surprise.
That's why I asked.
Because you talk about local disks,
backups, HA, a lot of stuff,
right?
And basically-
Sugu: Yeah, 5 nines is like what
Vitess shoots for.
And like most big companies that
run Vitess do operate at that
level of availability.
Nikolay: So what's the plan, what is Multigres going to support?
Only the managed version, right?
Sugu: Yes, it would be only managed versions, because I believe that the cluster management section of Vitess will port directly over to Postgres, which means that once it goes live, it will be coming with batteries included for cluster management, which should hopefully be equal to or better than what is already out there.
So I don't see a reason why we
should try to make it work with
everything that exists today.
Nikolay: So it means this is the same as with Citus: it doesn't work with RDS on 1 hand, but on the other hand, I see it's not only a sharding solution.
It's everything, which is great.
I mean, it's super interesting.
So a lot of problems will be solved.
And I expect even more managed
services will be created.
I don't know how it will continue,
like in terms of Supabase,
because of the very open license
and so on.
But also I expect that many people will reconsider their opinion about managed services.
We had an episode about this.
This is my usual opinion about managed services: they hide superuser from you, they don't provide you access, it's hard to troubleshoot problems.
In this case, if those problems are solved, this gives you a new way to run Postgres; if many problems are solved, it's great, if you want to run it that way, for example.
Sugu: Yeah, you may not know, but the initial focus of Vitess was actually solving these problems first; sharding actually came much later.
Protecting the database, making sure it survives abusive queries: basically, that's what we built Vitess for initially.
And the counterpart of taking away power from the user, like you said, is, 1, well, we now know exactly how to make sure that the cluster doesn't go down.
And 2, we countered that by building really, really good metrics.
So when there is an outage, you can very quickly zero in on a query.
If a query was responsible, Vitess will have it at the top of the list, saying that this is the query that's killing your database.
So we built some really, really good metrics, which should become available for Postgres, probably from day 1.
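A minimal sketch of the kind of per-query accounting being described, so the worst offender floats to the top; the types and names are hypothetical, not the actual Vitess or Multigres metrics API:

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// queryStats aggregates cost per normalized query fingerprint.
type queryStats struct {
	Fingerprint string
	Calls       int
	TotalTime   time.Duration
}

// topOffenders sorts by total time, so the query "killing your database"
// shows up on the top line during an outage.
func topOffenders(stats []queryStats) []queryStats {
	sort.Slice(stats, func(i, j int) bool { return stats[i].TotalTime > stats[j].TotalTime })
	return stats
}

func main() {
	stats := []queryStats{
		{"select 1", 1000000, 30 * time.Second},
		{"select * from events where user_id = ?", 90000, 50 * time.Minute},
	}
	for _, s := range topOffenders(stats) {
		fmt.Printf("%-42s calls=%-8d total=%s\n", s.Fingerprint, s.Calls, s.TotalTime)
	}
}
```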
Nikolay: That's interesting.
Maybe I missed it, but I didn't see that in the readme you're writing right now in the project.
Sugu: There's a last section called observability.
Nikolay: Ah, I missed that.
I need to revisit it.
We're actually building something there as well for regular Postgres.
I'm very curious, I will definitely revisit this, interesting,
okay.
So yeah, great.
Yeah.
Michael: Also, I feel like this is quite a big difference on
the, at least with Citus in terms of the philosophy or at least
the origin story.
I feel like that started much more with OLAP-focused features: distributed queries parallelised across multiple shards, aggregations, columnar storage, loads of things that really benefit OLAP workloads. Whereas this has come from a philosophy of: let's not worry about optimizing for those cross-shard queries; this is much more, let's optimize for the very short, quick, single-shard OLTP queries, and let's make sure we protect against abusive queries.
So it feels like, architecturally, it's coming from a very different place in terms of what to optimize for first.
Sugu: And historically, that was YouTube's problem.
Surviving the onslaught of a huge number of QPS, and making sure that 1 single query doesn't take the rest of the site down.
Michael: Yeah.
Yeah.
Perfect.
Makes loads of sense.
So actually, before we move on too much from that: where do you see sharding as becoming necessary?
Is it just a case of a total number of QPS, or a writes-per-second type thing?
We've talked about sharding in
the past and talked about kind
of a max that you can scale up
to perhaps in terms of writes,
in terms of WAL.
WAL per second I think was the
metric we ended up discussing.
Are there other reasons or kinds of bottlenecks that you see people hitting, where sharding then makes sense, where it's now time, or you should at least be considering it?
Sugu: Well, there is a physical limiting factor: if you max out your single machine, that is your Postgres server, then that's the end of your scale.
There is nothing more to do beyond
that.
And there are a lot of people already
hitting those limits from
what I hear.
And the sad part of it is they
probably don't realize it.
As soon as that limit is hit, in
order to protect the database,
they actually push back on engineering
features indirectly, saying
that, you know, this data, can
you make it smaller?
Can you somehow lower the QPS?
Or could you put it elsewhere?
Nikolay: Let's stop showing this
number on front page.
Sugu: Yeah, yeah.
And it affects the entire organization.
It's a very small, it's a very
subtle change.
But the entire organization slows
down.
Like we experienced that at YouTube
when we were at our limits.
We like the default answer from
a DBA was always no.
We used to even kid, no?
The answer is no.
What's your question?
And when we started sharding, it took us a while to change our answer to: you know, bring your data, we like it, we can scale as much as you want.
Believe it or not, we went from 16 shards to 256 in no time.
And the number of features in YouTube
exploded during that time
because there was just no restriction
on how much data you wanted
to put.
And coming back here: reaching the limit of a machine is actually something you should never do.
It's very unhealthy for a large
number of reasons, like even
if there is a crash, like how long
is it going to take to recover?
The thing that we found out is, once you can shard, it actually makes sense to keep your instances way, way small.
So we used to run like 20 to 50 instances of MySQL per machine.
And that was a lot healthier than running big ones.
For a couple of reasons.
One is, if you try to run so many
threads within a process, that
itself is a huge overhead for the
machine.
And it doesn't do that very efficiently,
whereas it does it better
if you run it as smaller instances,
I think.
It's more of a feeling, but I don't
know if there is proof or
whatever.
But like Go, for example, wouldn't
do well.
Go, I think, beyond a certain memory
size or beyond a certain
number of goroutines would start
to slow down, would not be
as efficient as it was before.
Mainly because the data structures
to keep track of those threads
and stuff, they are growing bigger.
But more importantly, on an outage,
a smaller number of users
are affected.
If you have 256 shards and 1 shard goes down, it is 1/256th of the outage, right?
And so the site looks a lot healthier,
behaves a lot healthier.
There's less panic if a shard goes
down.
So people are, you know, a lot
less stressed managing such instances.
Nikolay: Right, I wanted to mention
that this discussion was
with Lev Kokotov, PgDog, which
is a competitor as well, a new
sharding tool written in Rust.
And we discussed that there is a big limitation in Postgres: physical replication has a limitation because it's a single-threaded process on the standby.
If we reach somewhere around 150, 200, 250 megabytes per second, depending on the cores and also the structure of the tuples and so on, we hit 1 single CPU at 100%, 1 process, and it becomes a bottleneck, and the standbys start lagging.
It's a big nightmare, because by that time, at high scale, you usually have multiple replicas and you offload a lot of read-only queries there.
And then you don't know what to do, except, as you described, let's remove this feature and slow down development, and this is not fun at all.
So what I'm trying to do here is move us to a discussion of replication, not physical but logical.
I noticed your plans rely heavily on logical replication in Postgres.
But we know it's improving every year.
So when we started the discussion 5 or 6 years ago, it was much worse.
Right now it's much better.
Many things are solved, improved,
but many things still are not
solved.
For example, schema changes are
not replicated, right?
And sequences, there is work in
progress, but if it's committed,
it will be only in Postgres 19,
not in 18.
So it means like long wait for
many people.
So what are your plans here?
Are you ready to deal with problems
like this?
Pure Postgres problems, you know?
Sugu: Yeah, yeah.
If you ask me, I think the Postgres
problems are less than what
we faced with MySQL.
Nikolay: I wanted to bring physical replication in as well, because of this great talk by Kukushkin, which describes very bad anomalies, when data loss happens and so on.
Let's talk about this.
Sugu: Yeah, we should talk about
both.
I think overall the Postgres design
is cleaner, is what I would
say.
Like you can feel that from things.
Like the design somewhat supersedes
performance, which I think
in my case is a good trade-off,
especially for sharded solutions,
because some of these design decisions
affect you only if you
are running at, you know, if you're pushing it really, really hard; then these design decisions affect you.
But if your instances are small
to medium size you won't even
know and then you benefit from
the fact that these designs are
good.
So I actually like the approaches
that Postgres has taken with
respect to the WAL as well as
logical replication.
And by the way, I think logical
replication theoretically can
do better things than what it does
now, and we should push those
limits.
But yes, the issue about schema changes not being part of logical replication: it feels like that is also a theoretically solvable problem, except that people haven't gotten to it.
I think there are issues about
the transactionality of DDLs which
doesn't even exist in MySQL.
So at least in Postgres, it exists
in most cases.
There are only a few cases where
it is not.
And for a proxy layer like Multigres or Vitess, it's no problem, because you can say: oh, OK, this particular construct is not transactionally safe; well, then we'll even prevent you from doing it transactionally, because we don't want you to get the wrong impression.
We'll let you do it non-transactionally, and we know that it's non-transactional, and therefore we can do something about it, right?
Those abilities didn't exist previously.
But eventually, if it becomes transactional, then we can actually include it in a transaction.
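A toy illustration of the proxy-level guard being described: detect a construct that Postgres itself cannot run transactionally and refuse to accept it inside a transaction. The detection here is naive string matching, purely for illustration, not how Multigres would do it:

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

var errNonTransactional = errors.New(
	"CREATE INDEX CONCURRENTLY cannot run inside a transaction; run it standalone")

// guard rejects non-transactional constructs inside an open transaction,
// so the client never gets the wrong impression about atomicity.
func guard(inTransaction bool, sql string) error {
	if inTransaction && strings.Contains(strings.ToUpper(sql), "CREATE INDEX CONCURRENTLY") {
		return errNonTransactional
	}
	return nil
}

func main() {
	fmt.Println(guard(true, "create index concurrently idx on t(a)"))  // rejected
	fmt.Println(guard(false, "create index concurrently idx on t(a)")) // allowed
}
```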
Nikolay: Yeah, just for those who are curious, because there is a notion that all DDL in Postgres is transactional: here we're talking about things like CREATE INDEX CONCURRENTLY, because we had a discussion offline about this before recording.
So yeah, CREATE INDEX CONCURRENTLY can be an issue, but you obviously have a solution for it.
That's great.
Sugu: The way I would say it is, we have dealt with much worse with MySQL.
So this is much better than what was there then.
Sounds good.
Yeah.
Nikolay: Okay, good.
And let's talk about physical replication because I saw you are
going to use it.
So each shard is going to have a standby.
And yeah, so with quorum commit, right?
So that we don't lose data, because data is on local disks by default, as I understand, which are ephemeral in the cloud, we want to be sure that data is written to at least 2 places, for example, or 3, right?
Here, this interesting talk by Kukushkin, he presented it recently
at an online conference by Microsoft, describing that synchronous
replication in Postgres is not what you think.
Sugu: Correct.
So, right.
Correct.
Nikolay: What are you going to do about this?
Sugu: Well, I was just chatting with someone and essentially
synchronous replication is theoretically impure when it comes
to consensus.
I think it's provable that if you use synchronous replication
then you will hit corner cases that you can't handle.
And the most egregious situation is that it can definitely lead to some level of split brain, but in some cases it can even lead to downstream issues.
Because it's a leaky abstraction, it's a leaky implementation,
there are situations where you can see a transaction and think
that it is committed.
And later, the system may fail.
And in the recovery, may choose not to propagate that transaction,
or may not be able to, and it's going to discard that transaction
and move forward.
Nikolay: But this is the same as with asynchronous replication,
it's the same, we're just losing some data, right?
Sugu: Exactly, yeah, it is the same as asynchronous replication,
Yes.
Nikolay: It's not split-brain, it's just data loss.
Sugu: It's data loss, correct.
It's data loss, but for example, if you are running like a logical
replication off of 1 of those, then that logical replication
may actually propagate it into an external system and now you
have corrupted downstream systems that don't match the source.
So those risks exist, and at Vitess scale people see this all the time, for example, and they have to build defenses against it, and it's very, very painful.
It's not impossible, but it's very hard to reason about failures
when a system is behaving like this.
So that is the problem with synchronous
replication.
And this is the reason why I feel
like it may be worth patching
Postgres.
Because there is no existing primitive
in Postgres on which you
can build a clean consensus system.
I feel like that primitive should
be in Postgres.
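To make the anomaly concrete, here is a simplified timeline of the gap being discussed, as a sketch of one reading of Postgres synchronous replication behavior, not an exhaustive account of Kukushkin's talk:

```go
package main

import "fmt"

// A simplified timeline of the synchronous-replication gap discussed above.
func main() {
	for _, step := range []string{
		"1: client issues COMMIT; primary flushes the commit record to its local WAL",
		"2: primary waits for the synchronous standby's acknowledgement",
		"3: the wait is interrupted (crash, restart, or cancel) before the ack arrives",
		"4: the client gets an error and assumes the transaction was lost...",
		"5: ...but the commit was durable locally, so after recovery it is visible anyway",
		"6: if instead failover promotes a standby that never received that WAL,",
		"   the transaction is discarded, and a logical decoding consumer may",
		"   already have shipped it to a downstream system",
	} {
		fmt.Println(step)
	}
}
```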
Nikolay: I now remember from Kukushkin's talk, there is another case, when a transaction on the primary looks like it's not committed, because we're waiting for a replica, but the replica somehow lost the connection or something.
And the client thinks it's not committed, because the commit was not returned.
But then it suddenly looks committed.
It's not data loss, it's data un-loss somehow.
Sugu: Boom.
Nikolay: Suddenly, and this is not all right either.
And when you think about consensus, I think you are very good at describing these concepts in distributed systems.
It feels like if you have 2 places to write, there will definitely be corner cases where something goes wrong if you don't use two-phase commit, right?
Sugu: Correct.
Nikolay: And here we have this.
But when you say you're going to bring something with consensus, it immediately triggers my memory of how difficult it is, and how many attempts were made to bring pure HA to Postgres, just to have auto-failover.
All of them failed.
All of them.
And they said: let's stay outside of Postgres.
So here, maybe it will be of similar complexity to bring this inside Postgres.
Is it possible to build it outside?
Sugu: It is not possible to build
it outside.
Because if it was, that is what
I would have proposed.
The reason is because building
it outside is like putting band-aid
over the problem.
It will not solve the core problem.
The core problem is: you've committed data in 1 place, that data can be lost, and there is a gap when the data can be read by someone. That is the root cause of the problem.
That is unsolvable from the outside.
Even if you act later, Raft may choose to honor that transaction or not.
And that becomes ambiguous, but we don't want ambiguity, right?
Nikolay: What if we created some extension point for commit, made it extendable to talk to some external thing, to understand whether the commit can be finalized, or something?
I don't know, consensus.
Sugu: Correct.
So essentially, if you reason this through, your answer will become a two-phase system.
Nikolay: Yeah.
Sugu: Without a two-phase system...
Nikolay: Which scares me.
But as I told you, two-phase commit in the Postgres OLTP world is considered really, really slow, and the rule is: let's just avoid it.
I see your enthusiasm, but I couldn't find good published benchmarks, zero.
Sugu: This is not two-phase commit,
by the way.
This is two-phase synchronization.
Nikolay: I understand.
It's not; in two-phase commit, more communication happens.
I understand this.
Sugu: So two-phase synchronization,
the network overhead is exactly
the same as full sync, because
the transaction completes on the
first sync.
Later it sends an acknowledgement
saying that yes, I'm happy,
you can commit it, but the transaction
completes on the first
sync, so it will be no worse than
full sync.
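A minimal sketch of the flow as just described, under the stated property that the client-visible commit completes on the first sync; the names and types are hypothetical, not the actual Multigres design:

```go
package main

import "fmt"

type follower struct{ name string }

// hasWAL simulates a follower acknowledging receipt of a WAL position.
func (f follower) hasWAL(lsn int) bool { return true } // assume a healthy follower

// commitTwoPhaseSync: phase 1 waits for 1 follower to hold the WAL
// (durability), exactly like full sync, and the commit returns to the
// client there. Phase 2 just tells followers the decision; in a real
// system it would be sent asynchronously, adding no client latency.
func commitTwoPhaseSync(lsn int, followers []follower) {
	for _, f := range followers {
		if f.hasWAL(lsn) {
			fmt.Printf("lsn %d durable on %s; commit returns to the client here\n", lsn, f.name)
			break
		}
	}
	for _, f := range followers {
		fmt.Printf("phase 2 (async in reality): tell %s that lsn %d is committed\n", f.name, lsn)
	}
}

func main() {
	commitTwoPhaseSync(42, []follower{{"standby-1"}, {"standby-2"}})
}
```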
Nikolay: Yeah, compared to the current situation, when the primary commit happens but there is a lock which is being held until...
Sugu: Correct, it is the same cost.
Nikolay: Yeah, it is the same cost.
We wait for the standby, and for the user it looks like, when the lock is released: okay, the commit happened.
But the problem with this design is, if, for example, the standby restarts, the lock is automatically released, the commit is here, and it's unexpected.
This is data un-loss, right?
So you are saying we can redesign this, the network cost will be the same, but it will be pure.
Yeah, that's great, I like this.
I'm just thinking, will it be acceptable?
Because bringing auto-failover in was not accepted.
There was another attempt last year, from someone with great enthusiasm: let's bring auto-failover inside Postgres.
Actually, maybe you know this guy, it was Konstantin Osipov, who built the Tarantool database system.
It's an in-memory database.
He's ex-MySQL; he worked on performance, after Zaitsev.
Zaitsev was ex-MySQL, and then Osipov was at MySQL.
And so Konstantin came to Postgres saying: let's build this.
Great enthusiasm, but it's extremely hard to convince people that such a big thing should be in core.
So if you say it's not a big thing, this already...
Sugu: I'll probably have to explain it in a bigger blog post, but essentially, now that I've studied the problem well enough: the reason why it's hard to implement consensus in Postgres with the write-ahead log is because people try to make Raft work with the write-ahead log.
And there are limitations in how commits work within Postgres that mismatch how Raft wants commits to be processed.
And that mismatch, so far, I have not found a way to work around.
But a variation of Raft can be made to work.
Nikolay: Interesting.
Sugu: I don't know if you know about the blog series that I wrote when I was at PlanetScale.
It's an 8-part blog series about generalized consensus.
People think that Raft is the only way to do consensus, but it is 1 of a thousand ways to do consensus.
So that blog series explains the rules you must follow if you have to build a consensus system.
And if you follow those rules, you will get all the properties that are required of a consensus system.
So the design that I have in mind follows those rules, and I am able to prove to myself that it will work, but it's not Raft.
It's going to be similar to Raft.
I think we can make Raft also work, but that may require changes
to the write-ahead log, which I don't want to do.
So this system I want to implement without changes to the write-ahead log
as possibly a plugin.
Nikolay: Yeah.
Well, now I understand another reason why you cannot take Patroni: not only because it's Python, but also because you need another version of the consensus algorithm.
Sugu: Correct, correct.
Nikolay: And among those hundreds, thousands, millions of ways.
Sugu: By the way, Patroni could take this and use it, because it's very close to how full sync works.
Nikolay: Good.
Okay.
Michael: I was just thinking, watching Alexander Kukushkin's
talk, he said a couple of things that were interesting.
1 is that he was surprised that this hasn't happened upstream.
So you definitely have an ally in Kukushkin in terms of trying
to get this upstreamed, but also that he thinks every cloud provider has had to patch Postgres in order to offer their own high availability products with Postgres; each 1 has had to patch it.
And they have to maintain that; you mentioned earlier today how painful it is to maintain even a small patch on something.
Nikolay: I don't think it's every one; I think it's Microsoft for sure, knowing where Kukushkin works.
Yeah.
But maybe more, not every one.
Yeah,
Michael: All I mean is that there are a growing number of committers working for hyperscalers and hosting providers.
So I suspect you might have more reason for optimism about consensus, or at least a few allies, in terms of getting something committed upstream.
So I personally think there might be a growing chance of this happening, even if it hasn't in the past for some reason.
Sugu: Yeah, I feel like also being
new to the Postgres community,
I am feeling a little you know,
shy about proposing this upfront.
So what I am thinking of doing
is at least show it working, show
it working at scale, have people
gain confidence that this is
actually efficient and performant
and safe.
So I also plan to... I don't know if you've heard of FlexPaxos, which is actually, in my opinion, a better way to handle durability, because today's cloud environments are a lot more complex, and a simple majority-based quorum is actually very hard to configure if your needs are different, which FlexPaxos does handle.
It's actually something I'm a co-inventor
of, of some sort.
And this blog post...
Nikolay: I've only heard the name.
That's it.
Yeah.
Can you explain a little bit?
Not in a super difficult way.
Sugu: Oh sure.
Yeah.
So...
Actually, let me explain the reason why.
FlexPaxos was published a few years ago, about 7 years ago or so.
And you'll see my name mentioned there, which I feel very proud of.
And the blog series that I wrote is actually a refinement of FlexPaxos.
And that actually explains better
why these things are important.
The reason why it's important is
because people think of consensus
as either a bunch of nodes agreeing
on a value, right?
That's what you commonly hear.
Or you think of like reaching majority,
reaching quorum is important.
But the true reason for consensus
is just durability.
When you ask for a commit and the
system says, yes, I have it,
you don't want the system to lose
it.
So instead of defining quorums and all those things, define the problem as it is and solve it the way it was asked: how do you solve the problem of durability in a transactional system?
And the simple answer to that is: make sure your data is elsewhere.
Michael: Yeah, I love how simple
you make it.
Sugu: Yeah, if you make sure your
data is elsewhere, if there
is a failure, your challenge is to find out where the data is
and continue from where it went.
And that is all that consensus is about.
And then all you have to do is have rules to make sure that these
properties are preserved.
And Raft is only just 1 way to do this.
So if you look at this problem, if you approach this problem
this way, you could ask for something like, I just want my data
to go across availability zones.
As long as it's in a different availability zone, I'm happy.
Or you can say, I want the data to be across regions, or I want
at least 2 other nodes to have it, right?
So that's your durability requirement.
But you could say: I want 2 other nodes to have it, but I want to run 7 nodes in the system, or 20 nodes.
It sounds outrageous, but it is actually very practical.
At YouTube we had 70 replicas, but the data had to be on only 1 other node for it to be durable, and we were able to run this at scale.
The trade-off is that when you do a failover, you have a wild goose chase looking for the transaction that went elsewhere, but you find it and then you continue.
And so that is basically the principle of this consensus system.
And that's what I want to bring in Multigres.
While making sure that the people that want simpler majority
based quorums to also work using the same primitives.
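The quorum arithmetic behind this is small enough to sketch. The FlexPaxos-style rule is that any recovery quorum must intersect every durability quorum; the numbers below use the YouTube-style configuration described above to show where the "wild goose chase" comes from:

```go
package main

import "fmt"

// recoveryQuorum returns the minimum number of nodes a failover must reach
// to be guaranteed to overlap every durability quorum of size w out of n
// (the two quorums must intersect, so r + w > n, i.e. r = n - w + 1).
func recoveryQuorum(n, w int) int { return n - w + 1 }

func main() {
	// Classic majority: 5 nodes, writes acked by 3, recovery reads 3.
	fmt.Println(recoveryQuorum(5, 3)) // 3
	// YouTube-style: ~70 replicas, data durable on just 2 nodes (the primary
	// plus 1 other), so a failover may have to chase the latest transaction
	// across up to 69 nodes.
	fmt.Println(recoveryQuorum(70, 2)) // 69
}
```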
Michael: Just quickly to clarify: when you say the wild goose chase, is that because it was 1 of 70, and different transactions could have gone to different ones of the 70? Or is it always the same 1, but you have to know which 1 that is?
Sugu: No, it could be any 1 of them. But the way we ran it, at any given point of time there's only 1 primary, which means that there is only 1 transaction that you have to chase down.
Michael: The latest 1.
Sugu: The latest 1, yes.
Michael: Yeah, makes sense.
Sugu: Yeah.
There was a time when we found that transaction in a different country.
So we had to bring it back home and then continue.
It happened once in the, whatever, 10 years that we ran it.
Nikolay: It's interesting that talking about sharding, we need
to discuss these things, which are not sharding per se, right?
So it's about HA inside each shard, right?
Sugu: It's actually like what I would call healthy database principles,
which is, I think, somewhat more important than sharding.
Nikolay: Yeah.
Yeah.
Yeah.
Michael: It is true that it is to do with it being a distributed
system, right?
And that is because it's sharded, no?
Sugu: I think they are orthogonal.
Okay.
Yeah, I think sharding, like, you can do sharding on anything, right?
Like you can do sharding on RDS.
Somebody asked me, like, what about Neon?
I said, you can do sharding on Neon too.
It's like you put a proxy in front and then it does the sharding.
But the problem with sharding is it is not just a proxy.
That's what people think of it when they first think of the problem
because they haven't looked ahead.
Once you have sharded, you have to evolve.
You start with 4 shards, then you have to go to 8 shards.
And the evolution is not linear this way.
Actually, it's an exponential growth because 4, 8, 16.
But at some point of time, it changes because your sharding scheme
itself will not scale.
Like if you, for example, are in a multi-tenant workload and
you say shard by tenant, at some point of time, a single tenant
is going to be so big that they won't fit in an instance.
And that we have seen.
And at that time, we had to change the sharding scheme.
So how do you change the sharding scheme?
Slack had to go through this, where they were a tenant-based
sharding scheme, and a single tenant just became too big.
They couldn't even fit 1 tenant in 1 shard.
So they had to change their sharding scheme to be user-based.
They actually talk about it in 1 of their presentations.
And Vitess has the tools to do these changes without you actually incurring any kind of downtime, which, again, Multigres will have.
I keep talking about Vitess, but these are all things that Multigres will have, which means that you are future-proofed when it comes to this.
And these are extremely difficult problems to solve.
Because when you're talking about changing the sharding scheme,
you are basically looking at a full crisscross replication of
data.
And across data centers.
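A tiny illustration of why a sharding-key change is a "full crisscross": a row's new shard is unrelated to its old one, so in the worst case every old shard has to stream to every new shard:

```go
package main

import "fmt"

func main() {
	oldShards, newShards := 16, 16
	// With the same key, a split only fans out locally; with a new key,
	// rows scatter arbitrarily, so you need one stream per (old, new) pair.
	fmt.Printf("worst case: %d replication streams\n", oldShards*newShards) // 256
}
```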
Nikolay: Yeah, and also, as I know, Vitess has a version 3 of the planner, right?
It was when you basically created a new planner to deal with arbitrary queries and understand how to route them properly and where to execute them: is it a single shard, or is it global, or is it different shards, and so on.
Are you going to do the same with Postgres?
I think yes.
Right.
Sugu: So that's the part where I'm still on the fence.
By the way, v3 has now become Gen 4.
It's actually much better than what it was when I built it.
The problem with v3 is that it still doesn't support the full query set yet.
It supports like 90% of it, I would say, but not everything.
On the temptation side, there's the Postgres engine, which supports everything.
So I'm still debating how do we
bring the 2 together?
If it was possible to do a simple
git merge, I would do it.
But obviously this 1 is in C, this
was in Go.
And the part that I'm trying to
figure out is how much of the
sharding bias exists in the current
engine in Vitess.
If we brought the Postgres engine
as is, without that sharding
bias, would this engine work well
for a sharded system?
Nikolay: So this looks like Citus, sort of, if you bring in the whole of Postgres.
There's a library, libpg_query, by Lukas Fittl, which basically takes the parser part of Postgres and extracts it...
And there is a Go version of it as well.
Sugu: Oh, libpg_query you said?
Nikolay: Yeah, yeah, I will send it to you.
Many, many systems use it when they need to parse.
Yeah, 1 day I told the SPQR guys about this, and eventually I think they ended up using it.
I think Lev Kokotov uses it, PgDog also uses it, the Rust version.
Sugu: Is it like 100% Postgres
compatible?
Nikolay: Well, it's based on the Postgres source code.
So the parser is the real one, but it's not the whole of Postgres.
So maybe you should consider this, right?
If you're thinking about parsing queries and so on... but I'm very curious, I also noticed you mentioned routing, like read-only queries routed to replicas automatically.
And this concerns me a lot, because many Postgres developers, I mean users, they use PL/pgSQL functions, PL/Python functions, anything, which are writing data.
And the standard way to call a function is SELECT, SELECT function name.
So understanding that this function is actually writing data is not trivial.
Right, and we know pgpool, which all my life I've just avoided.
I touched it a few times and decided not to use it at all, because it tries to do a lot of stuff at once, and I always considered: no, I'm not going to use this tool.
So pgpool solves it by saying: okay, let's build a list of functions which are actually writing, or something like this.
So it's a patch approach, you know, a workaround approach.
So this is going to be a huge challenge, I think, if you...
Sugu: Yeah, yeah.
Nikolay: For automatic routing, it's a huge challenge.
Sugu: Yeah, I think this is the reason why I think it is important
to have the full Postgres functional engine in Multigres, because
then these things will work as intended is my hope.
What we will have to do is add our own sharded understanding
to these functions and figure out what does it mean to call this
function, right?
If this function is going to call out to a different shard, then
that interpretation has to happen at the higher level.
But if that function is going to be accessing something within the shard, then push the whole thing down, push the whole SELECT along with the function down, and let the individual Postgres instance do it.
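A sketch of the pgpool-style allowlist workaround Nikolay describes, using the Go port of libpg_query mentioned above (github.com/pganalyze/pg_query_go). A leading SELECT is not enough evidence of a read, since SELECT some_function() may write, so this toy router refuses replica routing unless every called function is on a known-read-only list. The JSON field names ("FuncCall", "sval") follow recent library versions and may vary; treat this as an assumption-laden sketch, not the Multigres approach:

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"

	pg_query "github.com/pganalyze/pg_query_go/v5"
)

// readOnlyFuncs is the allowlist: only functions known not to write may
// run on a replica. Anything unknown is assumed to write.
var readOnlyFuncs = map[string]bool{"now": true, "count": true}

// calledFuncs walks the parse tree (as generic JSON) collecting FuncCall names.
func calledFuncs(node interface{}, out *[]string) {
	switch v := node.(type) {
	case map[string]interface{}:
		if fc, ok := v["FuncCall"].(map[string]interface{}); ok {
			if names, ok := fc["funcname"].([]interface{}); ok && len(names) > 0 {
				if s, ok := names[len(names)-1].(map[string]interface{}); ok {
					if str, ok := s["String"].(map[string]interface{}); ok {
						if name, ok := str["sval"].(string); ok {
							*out = append(*out, name)
						}
					}
				}
			}
		}
		for _, child := range v {
			calledFuncs(child, out)
		}
	case []interface{}:
		for _, child := range v {
			calledFuncs(child, out)
		}
	}
}

// routableToReplica reports whether a statement is provably read-only.
func routableToReplica(sql string) (bool, error) {
	if !strings.HasPrefix(strings.ToUpper(strings.TrimSpace(sql)), "SELECT") {
		return false, nil // non-SELECT always goes to the primary
	}
	js, err := pg_query.ParseToJSON(sql)
	if err != nil {
		return false, err
	}
	var tree interface{}
	if err := json.Unmarshal([]byte(js), &tree); err != nil {
		return false, err
	}
	var funcs []string
	calledFuncs(tree, &funcs)
	for _, f := range funcs {
		if !readOnlyFuncs[f] {
			return false, nil // might write: route to the primary
		}
	}
	return true, nil
}

func main() {
	for _, q := range []string{"SELECT now()", "SELECT record_payment(42)"} {
		ok, _ := routableToReplica(q)
		fmt.Printf("%-26s -> replica? %v\n", q, ok)
	}
}
```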
Nikolay: Yeah, but how do you understand that?
A function can call another function and so on.
It can be so complex in some cases.
It's also funny that there is still, actually, a kind of language, it's not a language, called PL/Proxy, which is sharding; Google Cloud SQL supports it.
For those who have their workload only in functions, it can route to the proper shard.
It was created at Skype.
It still exists, but it's not super popular these days.
But there is a big requirement: you write everything in functions.
In your case, I would expect at some point you would say: okay, don't use functions.
But I'm afraid it's not possible.
Like, I love functions.
Actually, Supabase loves functions, because they use Postgres, right?
Postgres, like, provokes you to use functions.
Sugu: Oh, really?
Oh, yeah, yeah.
Actually, I saw that.
Yeah.
So in Vitess, I feel like we made a mistake, which is: if we felt that any functionality you used didn't make sense, like, if I were you, I wouldn't do this, right, because it won't scale, it's a bad idea, those things we didn't support.
We didn't want to support them.
We said: no, we will never do this for you, because we'll not give you a rope long enough to hang yourself.
Basically, that was our philosophy.
But in Multigres, we want to move away from that, which means that if you want to call a function that writes, have at it.
Nikolay: Just put a comment, it's going to write something.
Sugu: Yeah.
If you want a function that calls a function that writes, have at it.
The worst-case scenario for us is that we don't know how to optimize it, and what we will do then is execute the whole thing on the VTGate side.
Nikolay: There's another interesting solution I remember, I think in AWS RDS Proxy; as I know it, maybe I'm wrong, they needed to create a global thing, it's called, I think, Aurora Global Database, or something like this.
So there is a secondary cluster living in a different region, and it's purely read-only, but it accepts writes.
And when a write comes, this proxy routes it to the original primary, waits until the write is propagated back to the replica, and responds.
Sugu: Oh, wow.
Yeah.
I don't even know how that feature can be supported.
Nikolay: No, no, it's just some
exotic, interesting solution
I just wanted to share.
Maybe, you know, for example, if you originally route a query to a replica, and then somehow in Postgres you understand: oh, it's actually a write.
Sugu: Okay, yeah, so maybe 100% is theoretically impossible to support.
Nikolay: Yes, it's okay.
It's super exotic.
Okay.
Sugu: But I think if people are
doing things like that, it means
that they are trying to solve a
problem that doesn't have a good
existing solution.
Nikolay: Exactly.
Sugu: So if we can find a good
existing solution, I think they'll
be very happy to adopt that instead
of whatever they were trying
to do.
Nikolay: Well, this is just a multi-region setup.
I've seen more than 1 CTO who wanted it, for dealing with Postgres, like: we are still single-region, we need to be present in multiple regions in case 1 AWS region is down, right?
So, availability and business characteristics.
Anyway, okay, it's exotic, but interesting still.
Michael: So you've got a lot of work ahead of you, Sugu.
Sugu: I feel like we barely covered
like 1 of so many topics.
Nikolay: Let's touch something
else.
Maybe it's a very long episode,
but it's worth it, I think.
It's super interesting.
What else?
What else?
Sugu: I think the other interesting
1 would be 2PC and isolation.
Nikolay: Isolation from what?
Sugu: Like, the 1 issue with a sharded solution is that, again, this was the philosophy: for the longest time in Vitess we didn't allow 2PC.
We said: you shard in such a way that you do not have distributed transactions.
And many people lived with that.
And some people actually did not adopt Vitess.
Nikolay: Let me interrupt you here, because this is the best feature I liked about Vitess: this materialize feature, where data is brought over.
Sugu: Oh yeah, materialized is another topic.
That's actually a better topic than 2PC.
Nikolay: Well, yeah, because this is your strength, right?
So, I love this idea: basically a distributed materialized view, which is incrementally updated.
Sugu: Yes, yes, yes.
Nikolay: That's great.
We need it in Postgres ecosystem.
Just maybe as a separate project, even, you know, like we lack
it everywhere.
So yeah, this is how you avoid distributed transactions basically,
right?
Sugu: No, this is 1 way to avoid it.
1 way, yeah.
There are 2 use cases where materialized views are super awesome.
A table that has foreign keys to 2 different tables is the classic use case.
The example I gave was a user that's producing music and listeners that are listening to it, which means that the row saying I listened to this music has 2 foreign keys, 1 to the creator and 1 to the listener.
And where should this row live?
Should this row live with the creator or should this row live
with the listener is a classic problem.
And there is no perfect solution for it.
It depends on your traffic pattern.
But what if the traffic pattern is 1 way in 1 case and another
way in another case?
There is no perfect solution.
So this is where, in Multigres, what you could do is say: okay, in most cases this row should live with the creator, let's assume that, right?
So then you say this row lives with the creator and we shard it this way, which means that if you join the creator table with this event row, it'll be all local joins.
But if you join the listeners table with this event row, it's a huge cross-shard wild goose chase.
So in this case, you can say: materialize this table using a different foreign key, the listener's foreign key, into the same sharded database under a different table name.
And now you can do a local join
with the listener and this event
table.
And this materialized view is near
real-time, basically the time
it takes to read the WAL and apply
it.
And this can go on forever.
And this is actually also the secret
behind resharding, changing
the sharding key.
This is essentially a table that is present in real time with 2 sharding keys.
If you say, oh, at some point of time, this 1 is more authoritative, all you have to do is swap them: make 1 the source, the other the target.
You've changed your sharding key.
Actually, the change sharding key
works exactly like this for
a table.
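A minimal sketch of what such a materialization loop amounts to: tail the source's change stream and upsert each row into a copy sharded by the other key. The event shape and the names here are hypothetical, not Multigres's actual design:

```go
package main

import "fmt"

// rowEvent is a simplified logical-replication event for the "listens" table.
type rowEvent struct {
	CreatorID  int // original sharding key
	ListenerID int // alternate key we materialize by
	Data       string
}

// shardFor is an illustrative hash-routing function.
func shardFor(key, shards int) int { return key % shards }

// materialize applies the source's WAL stream to a copy of the table that is
// sharded by ListenerID, so listener-side joins become shard-local.
func materialize(events <-chan rowEvent, shards int) {
	for ev := range events {
		target := shardFor(ev.ListenerID, shards)
		// In a real system this would be an idempotent upsert on the target shard.
		fmt.Printf("upsert into shard %d: listener=%d creator=%d %q\n",
			target, ev.ListenerID, ev.CreatorID, ev.Data)
	}
}

func main() {
	ch := make(chan rowEvent, 2)
	ch <- rowEvent{CreatorID: 7, ListenerID: 1001, Data: "listened to track 9"}
	ch <- rowEvent{CreatorID: 7, ListenerID: 2002, Data: "listened to track 3"}
	close(ch)
	materialize(ch, 4) // near real-time: lag is just WAL read + apply
}
```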
Nikolay: A distributed denormalization technique, this is what it is.
Sugu: Yeah, yeah, yeah, exactly.
And the other use case is when
you reshard, you leave behind
smaller tables, reference tables,
we call them.
And they have to live in a different
database because they are
too small and even if you shard
them, they won't shard well.
Like if you have a billion rows
in 1 table and a thousand rows
in a smaller table, you don't want
to shard your thousand row
table.
And there's no benefit to sharding
that either.
So it's better that that table
lives in a separate database.
But if you want to join between
these 2, how do you do it, right?
The only way you join is join at
the application level, read
1 and then read the other.
And so at high QPS, it's not efficient.
So what we can do is actually materialize
this table on all the
shards as reference, and then all
joins become local.
Nikolay: Yeah.
And you definitely need logical
replication for all of this.
So this is where we started, like
challenges with logical replication.
Sugu: Yeah, yeah.
So the reason why 2PC is still important is because there are trade-offs to this solution, which is: there's a lag.
It takes time for things to go through the WAL and come out the other side.
Whereas 2PC is essentially, basically
the transaction system
itself trying to complete a transaction,
which means that it
will handle cases where there are
race conditions, right?
If somebody else tries to change
that row elsewhere while this
row is being changed, 2PC will
block that from happening, whereas
in the other case, you cannot do
that.
Nikolay: Yeah, if it's just user views of some video like on
YouTube, we can say, okay, there will be some lag, probably some
small mistake, it's fine.
But if it's financial data, it should be 2PC, but latency of
write will be high, throughput will be low, right?
This is...
Sugu: I actually want to...
I read the design of 2PC in Postgres, which is, again, by the way, a very elegant API.
And from the API I can see the implementation, and I don't think we will see performance problems with 2PC.
Nikolay: We need to benchmark it
Sugu: We will, we will benchmark it, but I will be very surprised.
I think there are some isolation issues that we may not have
time to go through today because it's a long topic.
Like the way 2PC is currently supported in Postgres, I think
it'll perform really well.
Nikolay: Isolation issues when we sit in read committed and use
2PC.
You mean this, right?
Not in repeatable read.
In default.
Sugu: Yeah, there will be some trade-offs on read committed, but not the kind that will affect most applications.
MVCC will be the bigger challenge.
But from what I hear, the most common use case is read committed.
Nikolay: Of course, as default.
Yeah, it's faster.
Sugu: It's a default.
Yeah.
So people won't even notice, I think.
Nikolay: They're already in a somewhat bad state.
It won't be worse.
Sugu: It won't be worse.
Yes.
Nikolay: Yes.
Yeah.
So 2PC, of course, depends a lot on the distance between nodes, right?
Like, if they are far apart: the client is somewhere, the 2 nodes are somewhere else, and if it's different availability zones, it depends, right?
So this distance is a big contributor to latency, right?
The network.
Because there are 4 communication messages that are needed.
So.
Sugu: Correct, correct.
Actually, I have the mathematics for it somewhere.
But you're probably right.
It's about double the number of
round trips.
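As a rough back-of-envelope (my framing of the message count above, not an exact protocol accounting), with the coordinator sending the prepare and commit rounds to both shards in parallel:

```latex
% RTT = coordinator-to-primary round-trip time, T_exec = execution time.
% A single-shard commit needs one round trip; 2PC needs prepare + commit
% rounds, i.e. the 4 messages mentioned above:
\[
  T_{\text{single}} \approx T_{\text{exec}} + \mathrm{RTT},
  \qquad
  T_{\text{2PC}} \approx T_{\text{exec}} + 2\,\mathrm{RTT}
\]
% Inside one AZ (RTT well under 1 ms) the extra round trip is negligible;
% across regions (tens of ms) it dominates commit latency.
```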
Nikolay: Yeah, if we put everything
in 1 AZ, client and both
primaries, we are fine.
But in reality, they will be in
different places.
And if it's different regions, it's a nightmare, of course.
But at least it's...
Sugu: Yeah, the 2PC is not done by the client, by the way.
The 2PC would be done by the VTGate, which should have the nodes nearby.
Nikolay: Ah, it should have.
In 1 availability zone?
Sugu: Unless you did some crazy configuration, they should be pretty close to each other.
Nikolay: Pretty close still means different availability zones in the general case, right?
Sugu: No, no.
The availability zone is only for
durability.
Nikolay: Okay.
Sugu: For the replica level.
But with 2PC, you're coordinating between 2 primaries, which may actually be on the same machine for all you care.
It's not...
Nikolay: Well, imagine the real practical case.
We have a sharded schema, we have...
Every shard has a primary and a couple of standbys, right?
Sugu: Correct.
Correct.
Nikolay: So are you saying that
we need to keep primaries all
in the same availability zone?
Sugu: That's usually how things
are.
Nikolay: Ah, interesting.
I didn't know about this.
By the way, I wanted to rant a little bit about the PlanetScale benchmarks they published last week.
They compared themselves to everyone.
I'm sorry, I will take a little bit of time.
They compared themselves to everyone, and they just published these charts, PlanetScale versus something.
And on this very topic, on the charts, we have PlanetScale in a single AZ: everything, client and server, in the same AZ.
And there is a line for the normal case, where the client is in a different AZ.
But the same-AZ line is active by default, and the normal line is not.
And for the others, like Neon, Supabase, everyone, it's different.
And of course, PlanetScale looks really good, because by default they presented numbers for the same availability zone.
Below the chart, everything is explained, but who reads it, right?
So people just see the graphs.
And you can unselect and select the proper PlanetScale numbers and see that they are similar.
But by default, the same-AZ number is chosen.
And this is, like, benchmarking, you know...
Sugu: Well, I think if you look at the architecture, even in a fair comparison PlanetScale should come out ahead, because of the performance of a local disk, of course.
Nikolay: But this was SELECT 1, not disks.
Sugu: SELECT 1 is not a benchmark.
Nikolay: Well, it was part of the benchmark.
It's just checking the query path, but it fully depends on where client and server are located.
So what's the point of showing better numbers just by putting the client closer?
I don't like that part of that
benchmark.
Sugu: Okay, yeah.
I saw the publications, but I didn't go into the details because
I thought, well, it has to be faster
because it's on local disks.
So why?
Nikolay: Yeah, for data which is
not fully in cache, of course.
Yeah, local disks are amazing versus EBS volumes.
Sugu: You're right.
Yeah, if the data is in cache,
then there is, yeah, then all
performance, the performance of
everything would be the same.
Nikolay: Yeah, well, I wanted to
share this.
I was annoyed about this, but I
fully support the idea of local
disks.
It's great.
I think we need to use them more
in more systems.
Sugu: I think, I wouldn't be surprised, if you reached out to PlanetScale, like, if you want to run your benchmarks, they may be willing to give you the...
Nikolay: Yeah, the source code is published, and in general the benchmarks look great, the idea is great.
And actually, with local disks, the only concern is usually the hard limit: we cannot have more space.
But if you have a sharded solution, there is no such thing.
Sugu: Correct.
But speaking about the hard limit: with today's SSDs, you can buy a 100-plus-terabyte SSD, a single SSD, and you can probably stack them up, 1 next to the other.
Nikolay: Yeah, yeah.
But in the cloud there are limitations, for instance.
Sugu: Okay, yeah.
I saw an AWS SSD over 100 terabytes.
Nikolay: In Google Cloud, 72 terabytes is the hard limit for Z3 metal.
And I didn't see more.
So 72 terabytes, it's a lot, but sometimes it's already notable.
Sugu: Yeah, at that limit, your storage is not the limit.
You will not be able to run a database of that size on a single
machine.
Why not?
You will max out your CPU.
Nikolay: We have such cases.
CPU... well, again, the problem will be replication.
If we talk about a single node, we can have...
Sugu: Or replication.
Nikolay: 360 cores in AWS, almost 1,000 cores already for Xeon Scalable generation 5 or something.
So hundreds of cores.
The problem is Postgres's design.
If physical replication were multi-threaded, we could scale more.
Sugu: By the way, replication is not the only problem.
Backup and recovery.
If your machine goes down, you're down for hours, recovering something of that size.
Nikolay: Yeah.
Not many hours.
Someone on my team recently saw 17 or 19 terabytes per hour for recovery with pgBackRest or WAL-G.
In AWS.
I was like, my jaw dropped.
On our podcast I usually say 1 terabyte.
Sugu: Can you repeat that?
17 or 19 terabytes per hour?
Nikolay: 17 terabytes per hour.
With local disks.
So this is important.
With EBS, it's impossible.
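For scale, that figure is just arithmetic on the quoted number; the comparison to network storage below is an assumption based on published per-volume limits, worth re-checking:

```latex
% 17 TB restored per hour, expressed as sustained throughput:
\[
  \frac{17\ \text{TB}}{1\ \text{h}}
  = \frac{17 \times 10^{12}\ \text{bytes}}{3600\ \text{s}}
  \approx 4.7\ \text{GB/s}
\]
% For comparison, a single EBS gp3 volume tops out around 1 GB/s,
% which is why this rate is realistic only on local NVMe disks.
```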
Sugu: It's good to know.
Nikolay: Yeah.
With Michael, I was always saying 1 terabyte per hour is what you should achieve.
If it's below that, it's bad.
Now I'm thinking 1 terabyte is already...
Yeah, yeah.
So with EBS volumes, we managed to achieve, I think, 7 terabytes
per hour to restore with WAL-G.
And that's great.
Sugu: But there's a danger there.
You could become a noisy neighbor.
So we actually built throttling into our restore just to prevent noisy neighbors.
Nikolay: With local disks, you lose the ability to use EBS snapshots, cloud disk snapshots.
Sugu: Correct, correct, yeah.
Nikolay: That's what you lose, unfortunately.
And they're great, and people enjoy them more and more.
Yeah, so I agree, and as I remember, for 17 terabytes it was
128 threads of 4g or pgBackRest I don't remember local disks I
need to update
Sugu: my technology is changing too fast.
Nikolay: Exactly, yeah.
And hundreds of cores, terabytes of RAM already, right?
Like,
Michael: yeah.
Yeah, yeah.
But it does go straight to your point: the smaller they are, the faster you can recover, still.
Sugu: You don't hit some of these limits.
These systems were not designed with these types of limits in mind.
Some weird data structure suddenly has a limit of only 100 items.
And you hit those limits and then you're stuck.
Recently, Metronome had an issue.
They had that, the...
MultiXact.
The MultiXact thing, which nobody has ever run into before, but they hit that problem.
Nikolay: Yeah, we also saw many problems when you're on the edge.
And it actually pushes Postgres forward sometimes, but if you want to be on the safe side...
I really like this kind of resilience characteristic: even if something is down, only a small part of your system is down.
That's great.
Sugu: Correct, yeah.
Nikolay: That's a mature architecture already.
Sugu: That actually makes it easier to achieve 5 nines of uptime.
Because that's the way you calculate it.
Like, if only 1 node is down, you divide by the number of users being affected.
Nikolay: The budget.
Sugu: The downtime budget.
Yeah, the budget.
Yeah.
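As a worked example of that downtime-budget arithmetic (the shard count and outage duration are illustrative, not from the discussion):

```latex
% Five nines of uptime allows about 5.26 minutes of downtime per year:
\[
  525{,}600\ \text{min/yr} \times (1 - 0.99999) \approx 5.26\ \text{min/yr}
\]
% If 1 of 100 equal shards is down for 30 minutes, only 1% of users are
% affected, so it consumes just a fraction of that budget:
\[
  30\ \text{min} \times \tfrac{1}{100} = 0.3\ \text{min}
\]
```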
Nikolay: That's good.
Cool.
I think it's maybe 1 of the longest episodes we've had.
Enjoyed it.
Oh my God.
I enjoyed it.
I hope we will continue the discussion of issues with logical replication, for example, and maybe whether things will be improved, and so on.
Looking forward to testing the POC once you have it.
Thank you so much
for coming.
Sugu: I am so excited.
Nikolay: Yeah, thank you.
Michael: Yeah, are there any last things you wanted to add, or anything you wanted help from people on?
Sugu: I would say it feels like nothing is happening on the repository except me pushing, you know, a few things, changes, but a huge amount of work is happening in the background.
Like, some of this design work about consensus is almost ready to go.
And there's also hiring going on; there are people coming on board very soon.
So you will see this snowball: it's a very tiny snowball right now, but it's going to get very big as momentum builds up.
So, pretty excited about that.
We may still have 1 or 2 spots open to add to the team, but it's filling up fast.
So, if any of you are very familiar with this space: it's a very high bar to contribute to a codebase like Multigres.
You have to understand consensus, you have to understand query processing.
But if there are people who want to contribute, we are still looking for maybe 1 or 2 people, also on the orchestration side and the Kubernetes side of things.
Yeah, so that's...
Nikolay: Do you mind a small joke
in the end?
Just not to finish on...
Sugu: I do not mind at all.
Let's hear it.
Nikolay: Yeah, so I know what you're
doing.
You're writing a lot of markdown
right now and then you will
feed it to AI.
Sugu: I wish!
Oh my god.
I almost hope that day never comes, but it is so fun working on this project, creating it.
Why do I want to give it to an
AI to do it, you know?
Nikolay: Okay.
Good.
Thank you.
I enjoyed it a lot.
Michael: Yeah.
Yeah.
Thank you so much for joining us.
It's great to have you as part
of the Postgres community now
and I'm excited to see what you
get up to.
Sugu: And me too.
Michael: Wonderful, thanks so much.