
PgDog
Nikolay: Hello, hello.
This is Postgres.FM.
My name is Nikolay, Postgres.AI,
and as usual, co-host
is Michael, pgMustard.
Hi, Michael.
Michael: Hello, Nikolay.
Nikolay: And we have a super interesting
guest today, Lev Kokotov.
Hi, thank you for coming.
Lev: Thanks.
Hi, Nikolay.
Glad to be here.
Nikolay: Yeah, and the reason we
invited you is when I saw PgCat,
it was interesting.
At that time, I had a spike of
interest in sharding solutions,
and I saw PgCat was started as
a connection pooler alternative
to PgBouncer.
Actually, at the same time, a few
other teams started some new
projects.
But then sharding was, oh, okay,
sharding was there.
It was quite a straightforward approach,
just SQL comments.
It was interesting, but I'm not
a cat guy.
I'm a dog guy.
And once I saw PgDog started,
focused on sharding, it obviously
attracted much more attention
from me.
So that's why we invited you, actually.
Awesome.
Yeah.
Let's talk about, I don't know,
like maybe sharding itself and
probably let's start with the idea
that not everyone needs it,
because on a single Postgres cluster
you can grow and reach a
multibillion valuation, go to
IPO; I have examples.
But obviously, sometimes it's really
tough, right?
Tough job to maintain a huge monolith.
So what's your take in general
about the idea that you don't
need sharding?
Lev: Oh yeah, well that's a good
one.
I agree in principle.
So you know, Postgres can be pushed
quite far and we pushed it
very, very far at Instacart, pre-IPO
I should add.
But we IPO'd with a massively sharded
database.
And we absolutely had to because
a lot of things started to run
quite slow.
We had a lot of writes.
Instacart was... well, you'd think most apps
are 90% read, 10% write.
I don't know if that was the case
for us.
I think we're a little bit more
like maybe like 80-20, maybe
70-30, but we were doing a lot
of writes.
Nikolay: Hold on.
80-20 means 80 what, writes or reads?
As usual, reads, right?
Lev: Yeah, yeah, reads.
Yeah, yeah.
I'm thinking like 90% typical workloads, 90% reads, 10% writes.
Nikolay: Or even more, or even more sometimes.
Social media is definitely more reads, right?
But here's the thing.
Lev: Yeah, here it was a little bit different, not by
much, but even that small percentage, like 10%, 15% writes,
was enough to push our r5.24xlarge, I don't know, like 192 cores,
almost a terabyte of RAM, over the edge.
Nikolay: Sounds like RDS.
Actually, I'm super curious.
It's off topic, but I'm super curious about the so-called Instacart
zero-downtime upgrade approach, because at Postgres.AI
we believe it leads to corruption.
But I hope we will discuss it at a different time.
Let's hold on to this.
Lev: Yeah, happy to talk about it on another podcast.
I was one of the guys who worked on it.
Nikolay: Consider yourself invited.
Zero-downtime upgrades made on managed Postgres, which doesn't
allow changing recovery_target_lsn.
Lev: Yep.
I did it myself.
We had zero-downtime.
Nikolay: Yeah.
It's super interesting topic.
Maybe in a few weeks, I'm really looking forward.
So yeah, I understand the writes part; we can scale reads easily,
because it's just replicas, up to some point, right?
But for writes, we have only the primary.
And you need to go either to services, microservices, or sharding,
or a combination of them.
There are several routes, actually 2 major routes here.
Lev: Yeah, Instacart actually did what we called functional
sharding.
I don't know if that's a real term in the industry, but we just
took tables out of our main database and put them into a different
database, you know, functional sharding, air quotes.
Nikolay: Vertical split.
Vertical partitioning.
Exactly.
Lev: And that happened even before I joined the company.
So that was way before, you know, IPO talks and all that stuff.
So you can run out of capacity on a single machine quite quickly
if you use it a lot.
Sorry, that's a little bit of a tautology, but that's just the
case.
Like, it depends on your workload.
We did a lot of machine learning.
So we wrote a lot of bulk data, a lot, like daily.
We would have like hundreds of
gigabytes of data that were completely
new data every day, basically.
So a lot of writes were happening
and sharding was the only way,
basically.
We were running out of...
Our vacuum was always behind, and
when vacuum is behind, everyone
gets a little bit scared.
Performance is worse as well.
We were getting a lot of lock contention
on the WAL.
That happens a lot.
The WAL is single-threaded to
this day, and that's totally fine,
but you end up, you know, when
you write like hundreds of thousands
of transactions per second.
Nikolay: Well, it's single-threaded
if you use WAL writer, but
backends can write to WAL themselves
if you turn it off.
Like this setting, I forgot.
Lev: Yeah, No, you're right.
You're absolutely right.
But there's still a lock.
There's still a lock.
Yeah.
You have to grab a lock.
Nikolay: We see it in performance
insights.
If it's on RDS, like queries are
spending time on LWLock.
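(For readers on vanilla Postgres: you can approximate this kind
of wait event analysis by repeatedly sampling pg_stat_activity;
a minimal sketch:)

```sql
-- Sample wait events; WAL-related LWLocks show up as
-- WALWrite, WALInsert, and WALBufMapping.
SELECT wait_event, count(*)
FROM pg_stat_activity
WHERE wait_event_type = 'LWLock'
  AND wait_event LIKE 'WAL%'
GROUP BY wait_event
ORDER BY count(*) DESC;
```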
Lev: That's the guy.
And that's the guy who takes your
website offline every Sunday
morning, like clockwork.
Nikolay: On the one-night-defensive
foreclose.
And I would say the traditional optimization
would be: let's write
less.
But only in Postgres 13, I believe,
were WAL-related metrics added
to pg_stat_statements and EXPLAIN.
Before that, it was very difficult to understand
which queries... well, with
Performance Insights you can use
wait event analysis
to identify queries which are
WAL-write intensive, right?
And maybe optimize them, but it's
still a limited approach,
right?
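(On Postgres 13+ this got easier; assuming the pg_stat_statements
extension is installed, a query like this surfaces the most
WAL-intensive statements:)

```sql
-- wal_bytes, wal_fpi, and wal_records were added in Postgres 13.
SELECT substring(query for 60) AS query,
       calls,
       wal_bytes,
       wal_fpi,      -- full-page images, often the largest contributor
       wal_records
FROM pg_stat_statements
ORDER BY wal_bytes DESC
LIMIT 10;
```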
Lev: Well it's limited in its success
in a company because when
you tell somebody like, hey, can
you write less to your database?
People are like, I'm sorry, but
isn't that your job?
You can write less,
Nikolay: but we have, well, WAL
writes are a problem per query.
Consider a query which writes to
a table, and 2 situations: this
table has a couple of indexes versus
20 indexes. When we insert
something, every index gets an
insert, and it amplifies the
amount of WAL written for the
same insert.
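(You can see this amplification directly on Postgres 13+ with
EXPLAIN; the orders table here is hypothetical:)

```sql
BEGIN;
EXPLAIN (ANALYZE, WAL)   -- reports WAL records and bytes generated
INSERT INTO orders (customer_id, total) VALUES (1, 9.99);
ROLLBACK;                -- the WAL is still written; the row is not kept
```

Comparing the WAL figure with 2 indexes versus 20 on the same
table makes the amplification concrete.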
Lev: It's really important to know
that every single index has
an important business use case.
And that's why it was put there
in the first place.
So like all of these are invariants.
Like when you say, I have 20
indexes on a table and it's
write-heavy, like, that's by design.
And when you want to write more
data to that table, because your
business is growing, like us as
a database engineering team,
we're like, all right, we're going
to make it happen, because
we can't go to like 25 individual
teams and tell them like, hey,
can you fix your query, they're
going to be like, I don't have
the time, I'm trying to sell groceries
here.
You know, like I have other concerns.
Nikolay: But anyway, I'm trying
to say I agree with you that
without sharding, we are limited.
Like, we need to squeeze,
squeeze, squeeze, but it's very
limited.
And at some point, you cannot say
we can grow 2x anymore, right?
Lev: That's right.
And that 2x is actually at the
low end.
What most engineering leaders expect
is like 3, 5, 6X.
10.
Yeah, 10 to have that runway.
If you don't have that runway in
your system, it's a red flag
for most.
Nikolay: And it goes to CTO level,
basically.
The decision is like, should we
migrate out of Postgres to, I
don't know where, right?
Lev: Exactly.
Nikolay: I've seen this so many times,
including sitting in the CTO
position myself, like 15 years
ago.
I know this pain of sharding.
Yeah.
Lev: It's really funny.
You're just kind of sitting there
and be like, Hey guys, I need
10X capacity.
And your database engineer is like,
well, it's the WAL.
There's nothing I can do.
It's the WAL.
It's like, it's just a disconnect
between these, like, you
know, these 2 people. You need
to, like, you just need to make
this thing scale.
Like, that's the end.
And a lot of this is out of your control.
So,
Nikolay: yeah.
Like, we can also enable
compression for full-page writes.
We can also partition so writes
become more local, and so on. I
don't know, like, you're skeptical,
right?
Lev: I am, because all of that,
partitioning is great.
I would never speak bad against
partitioning, but...
Nikolay: Not every partitioning,
sorry for interrupting, not
every partitioning.
If, for example, people say, okay,
we are going to spread all
customers evenly among
partitions, this is not going to help us
in terms of WAL writes.
But if you say we will have a hot
partition receiving most
writes, and kind of archive partitions
where vacuum and everything
has stabilized,
in this case it's super beneficial
for WAL writes and write-intensive
workload optimization,
right?
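(A minimal sketch of that hot/cold layout, with illustrative
names; recent writes land in one small partition while older
partitions stay cold:)

```sql
CREATE TABLE events (
    id         bigint GENERATED ALWAYS AS IDENTITY,
    created_at timestamptz NOT NULL,
    payload    jsonb
) PARTITION BY RANGE (created_at);

-- Hot partition: receives almost all writes.
CREATE TABLE events_2025_06 PARTITION OF events
    FOR VALUES FROM ('2025-06-01') TO ('2025-07-01');

-- Cold/archive partition: blocks rarely change, so full-page
-- writes, vacuum work, and incremental backup churn stay low.
CREATE TABLE events_2025_05 PARTITION OF events
    FOR VALUES FROM ('2025-05-01') TO ('2025-06-01');
```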
Lev: Absolutely, but then you're
going to end up having to read
those cold partitions anyway.
It's a temporary band-aid, which
we did.
Again, absolutely, we partitioned
a lot of tables and that helped
us with a lot of things.
But at the end of the day, you
just need more compute.
That's the bitter lesson of databases.
I don't know if you're familiar
with the bitter lesson in AI,
where you just need more machines
to solve the problem.
It's the same thing in databases.
You just need more compute to just
be able to write, read more
stuff and write more stuff.
Nikolay: Yeah, I'm still pulling
us back intentionally because
we are having conversations about
like scaling problems.
And an interesting insight I got
yesterday, talking to a very
experienced Postgres expert, is that
partitioning not only helps
with data locality, write locality,
full-page writes, and
WAL volume as well, but also
with backups, surprisingly, if
we do incremental backups with
WAL-G or pgBackRest, or with the new incremental
API; I still need to learn about
this more.
Or snapshots, cloud snapshots of
EBS volumes; RDS also relies
on them, as I understand.
Imagine writes are spread out everywhere,
and you constantly
change a lot of blocks, versus
you change only specific blocks,
and some blocks are not changed.
And it's less pressure on backups
as well.
It's interesting and it affects
DR and RPO, RTO.
But I just wanted to ask. Like,
my understanding: if we go to
sharding blindly and miss some
optimizations... I remember I was
dealing with Mongo, maybe 10 plus
years ago, and I saw each node of Mongo
could handle much less workload
than 1 node of Postgres.
Maybe jumping to sharding
too early is premature optimization,
and you will not be in good shape
in terms of how much money
you spend on compute nodes, because
you missed a lot of optimization
steps.
What do you think?
Lev: Absolutely.
If your first step is like, my
database is broken, I need to
shard, you missed a lot.
Of course, you should look at all
the possible optimizations.
And that's really important to
keep you on a single machine.
I think the benchmark for me,
for when you need to start thinking
about sharding, is when you've tried
like 3 or 4 or 5 different things,
and then you have a person
or 2 working on it full-time,
and at the end of the day they're
like, I ran out of ideas.
Nikolay: Do you have some numbers,
like, for modern hardware, like
how much WAL can be written per
day, for example, like a terabyte,
5 terabytes, where it's already
the edge, right?
Lev: Yeah, for sure.
I can give you the exact number.
We benchmarked Postgres in the
best possible scenario, actually.
It was on EC2, it had NVMe drives,
it had ext4 on RAID 0.
So we did not, we did not want
durability.
We wanted performance.
That thing could write 4 or 5 gigabytes
per second using Bonnie++.
Now Postgres was able to write
about 300 megabytes per second
of just like whatever.
I think we were using COPY, like
the fastest possible way to dump
data into PG.
And it's just, it's the nature
of the beast, right?
You have to like process it, you
have to put it into a table.
I'm sure there's a lot of
checks and balances going on in between,
and again, lock contention and
all of that stuff.
So you can't necessarily squeeze
everything out of your machine,
even if you have the capacity.
So ultimately you need to split
the processes themselves between
multiple machines once you reach
that number.
And then, you know, for everything
else, if you have, again,
if you have write amplification,
things are getting a little
bit more tricky.
And then, so I think that's
usually the number:
once you're writing like, you
know, 200 megabytes per second
of just the WAL.
And you can see that in RDS; there's
a graph for just
WAL writes.
I would even start earlier.
200 megabytes
Nikolay: of WAL per second. Each
WAL file is 16 megabytes, but on
RDS it's 64.
They changed it.
So how many WAL files... like, it's
a lot, right?
Lev: It's a lot, and that's
the scale you're looking
at.
Nikolay: 10 plus is already too
much.
Yeah, I agree.
Lev: Yeah.
So that's the red line, right?
If you're at the red line and you're
only now thinking about sharding,
you're gonna have a bad time.
And that's okay.
It's okay to have a bad time.
These lessons are learned when
you're having a bad time.
I had a really bad time.
Nikolay: Yeah, and I pulled up a bc
calculator just to check: 200 megs
per second gives you 17 terabytes
per day.
Yeah,
Lev: that's mild, honestly, for
what we did at Instacart.
We had like petabytes of WAL in
S3, like after like a week.
And that's just, again, the
nature of the beast.
Nikolay: Multiple clusters?
Lev: Single cluster.
Yeah, it's just, it is.
It's called selling groceries.
What can I tell you?
We had, like, you know, many hundreds
of gigabytes of machine
learning data that would come in
every day because the model
was retrained on a nightly basis
and all the embeddings have
changed, right?
Again, the use cases are almost
like, I mean, they're interesting,
but they're kind of off topic,
I guess, because they're just
like, you just write a lot of data.
That's what you do for a living,
right?
That's what Postgres is for.
That's what databases are for.
So the numbers are like, well,
this is the number, and that's
when you should split up.
Nikolay: Yeah, so let's probably
settle on more sane situations.
Let's say 10 terabytes per day
of WAL data, or something.
It's already too much.
If you approach that, then soon
you will be hitting a wall, or rather
several WALs inside, obviously,
you will be hitting.
Including lightweight ones.
Yeah, yeah, several "walls".
Yeah, I was trying to say lightweight
locks, WAL-related lightweight
locks, WALWrite, right?
There are several of them.
Yeah, but it's, yeah, this is the
scale where it gets scary.
Lev: It's hard to put a number
on it because you can push that
further, and who knows what's going
to happen to your app.
Maybe that's as high as you'll
go.
Nikolay: And not everyone has local
NVMe drives, because in cloud environments
they are ephemeral.
It's kind of exotic and risky.
Lev: It is.
It's definitely risky.
I love those because when you click
reboot on the machine, it
wipes the encryption key and all
your data is gone.
So you better be careful which
button you press.
There's like a hot restart, which
is safe, and there's the actual
restart, which will delete all
your data.
So stay on RDS if you can.
Nikolay: But no, I don't agree
here, because you should
just use Patroni and multiple replicas,
at least 3 nodes,
and you'll be fine losing 1 node.
Lev: You lose 1, something goes
bad anyway... That's totally
off topic.
Nikolay: Okay, thank you.
For me, it's enough to understand
that there are cases definitely
when you do need to split vertically,
and splitting vertically
usually is limited as well.
It's hard sometimes, and at some
point we need sharding.
Yeah, so it's enough to explain
that definitely there are cases
where you need to scale Postgres
beyond 1 primary; the split can be
vertical or horizontal.
The result is sharding, right?
Lev: That's right.
Nikolay: So, good.
Yeah.
This is a great explanation.
Sharding is definitely needed,
but not for everyone maybe, but
for many.
Lev: I mean, yeah, I hope not for
everyone.
That would be, I mean, you know,
we solve the problem once and
for all.
Nikolay: Actually.
Anyone has access to
Lev: it, yeah.
Nikolay: Last week, Michael and
I had a discussion about snapshots,
and I said that Cloud SQL has a hard
limit, 64 terabytes, because
this is the limit for GCP persistent
disks, including PD-SSD.
And I said RDS obviously allows
you to grow further,
beyond 64, but it doesn't.
After we recorded, I checked, and
it has the same limitation, 64 terabytes.
It's another reason to have multiple
primaries and scale beyond
1 cluster.
Lev: Well, you'll hit that sooner
on Postgres because of the
32 terabyte table limit.
And actually I did hit that limit
once and that was a scary error.
The only thing that it says is
like, there's no more bytes to
be allocated or something like
that in some kind of database
file.
I'm like, what?
You know, and then you're done.
Like, there's nothing you can do.
You have to move data out of
that table.
You have to take the database offline,
obviously.
Nikolay: It's a single table.
Okay.
Okay.
Lev: Yeah.
A single table.
Yeah.
Which is actually not uncommon.
Nikolay: Well, it's still not super
common as well.
Yeah.
I see only cases of, like, 10
terabytes per table.
And it's already when any DBA should
scream, where is partitioning?
Lev: Well, DBAs like to scream,
but application engineers be
like, I can't, aren't you the DBA?
Like, what am I doing?
Do something, right?
Nikolay: But partitioning unfortunately
requires application
code changes.
Lev: Yeah, precisely.
Nikolay: Yeah, so sharding is needed
for big projects, obviously.
Yeah.
Agreed.
What's next?
Michael: Can we go?
I reckon it's time to go back to
the origin story of PgDog and
obviously then PgCat.
It would be great to hear like
a little bit of that story, Lev.
Lev: Yeah, absolutely.
I'm happy to.
Do you guys care about the name
of the project?
I don't know if, or do you want
to know what the inspiration
for that came from?
Nikolay: Sure, it's fun, yeah.
Lev: Okay, well, when I started
PgCat, we just got a new cat
and I really loved that cat.
And I was working on Postgres,
so the 2 things came together
naturally.
You could probably guess the origin
story of PgDog now, I got
a dog, and then I'm like, look,
I love this dog.
What can I call my next project?
Obviously PgDog.
Yeah.
So the naming issue solved.
PgCat came from the idea that it
was really simple.
Sharding was not even in
scope back then.
It was just like, we ran PgBouncer.
PgBouncer could only talk to 1
database at a time.
It makes sense, you know, you were
pooling like just the Postgres
instance.
We had a bunch of replicas.
We needed to load balance.
And we needed a load balancing
algorithm that was smart.
When a replica went offline because
of, again, hardware issues,
scaling issues, whatever, we needed
to remove it from the rotation
without affecting the app.
So we would regularly lose the
replica.
And then most of the site would
go offline because we had a Ruby
gem that would randomize access
to those replicas, and when 1
of them broke, it's just, you know,
it worked okay.
But doing this in the application
code is really hard.
Especially in Ruby, there's a way
to inject exceptions into Ruby,
like, sideline, and basically that
breaks your state.
So we had like multiple gems working
against each other and we
just needed to do that in a place
where it made more sense.
Like a load balancer is typically
outside the application because
you have multiple applications,
so it can serve any of them.
So I just built a load balancer
basically.
It was after actually I left Instacart.
I was just doing it as a side project
just to keep my mind going.
So I built it.
It was really simple.
Used banning logic, which was kind
of novel at the time for Postgres.
If you receive 1 single error from
the database, it's removed
from the rotation.
It's very aggressive, but Postgres
never throws errors, like
network-related errors, unless
there's a serious problem.
So that actually worked pretty
well.
I talked to a friend of mine at
Instacart.
I was like, hey, look, I built
this on the side.
That looks fun, right?
Like, we thought about this.
And he's like, yeah, all right.
Didn't you quit?
I'm like, yeah, but no, I have
some free time to work on this.
Right.
And he's like, okay.
And then he took the code, added
a bunch of features that I didn't
add.
Cause obviously I didn't have a
use case anymore for it.
And he's like, oh, great.
We're going to deploy it.
Right.
And we tried it and they use it
and they're, I mean, they put
so much work into it.
They wrote a blog post about it.
You probably know about this.
So that went pretty well, you know,
so it's working in production.
It's great.
And then I'm like, all right, well,
you know, sharding is the
next 1.
They have a bunch of sharded databases
that we sharded and adding
sharding routing to that would
be great because again, it was
done in the application layer and
application layer routing,
I think we'll all agree is a little
bit iffy, especially if you
have more than 1 app written in
more than 1 language, like you
have to repeat the same logic across
all apps.
So I added just a comment-based system.
I knew like there's 2 sharding
schemes at Instacart.
1 uses, you know, actually
the hashing function from hash partitioning
in Postgres.
I love that because you can actually
split data both at the client
and in the server.
So you have multiple ways to move
your data around and the other
1 is just a custom 1 like we use
like SHA1 and take the last
few bytes and then mod that that's
just you know, it's random,
but it was available in multiple
systems as well The data for
that came from Snowflake so we
could actually Shard the data
in Snowflake and then ingest it
into the instances directly.
And then on the routing layer in
Ruby, same hashing function,
you know, the sharding key is always
available.
So that was good.
So I added both of them and they're
like, great, that's great.
And they tried, I think, the SHA1
function and I think it's working
pretty well for them.
So that was fun.
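(A sketch of that second scheme (SHA1, take the last few bytes,
mod by the shard count) in SQL; names are illustrative, and it
assumes the pgcrypto extension for digest(). The first scheme
can reuse Postgres's own hash partitioning logic, e.g. via
satisfies_hash_partition.)

```sql
CREATE EXTENSION IF NOT EXISTS pgcrypto;

-- Hypothetical helper: map a sharding key to a shard number.
CREATE FUNCTION shard_for(key text, shard_count int) RETURNS int AS $$
    -- last 3 bytes of the SHA1 digest, read as an integer, mod N
    SELECT ('x' || encode(substring(digest(key, 'sha1') from 18 for 3),
                          'hex'))::bit(24)::int % shard_count;
$$ LANGUAGE sql IMMUTABLE;

SELECT shard_for('customer-42', 12);  -- same key always maps to the same shard
```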
Then I started another company
that had nothing to do with any
of anything.
PostgresML, you might've heard
about it.
That came from, you know, the idea
that we shouldn't ingest, you
know, hundreds of gigabytes
of machine learning data into
Postgres; why not just ingest
a 3 megabyte model and run
inference online.
You know, it was okay.
Stayed a couple of years, 2 and a half
years there.
Didn't work out,
like a lot of startups. So I
left.
And then I had some free time and
I'm like, well, what do I like
to do in my free time, guys?
Writing Postgres poolers.
This is what I do.
This is what I do to, you know,
rest on vacation.
I write Postgres poolers.
I'm like, well, let's do sharding
for real this time.
Let's actually take what we built
at Instacart, make it into
software, because that's what we
do, right?
We come up with an idea, well,
we find a problem, we find a solution,
we write it in code, and we don't
have to solve it again every
single time manually for like,
you know, hopefully hundreds,
hopefully thousands of use cases
for this.
You know, we'll see.
I'm still doing my research.
But yeah, so that's PgDog.
Sharding is, you know, the selling
point, number 1.
Everything else is, you know, there's
obviously, it's obviously
a pooler, it's obviously a load
balancer.
It has all the features that PgCat
has.
Almost all of them, I'm adding
them as I go.
It's a rewrite, it's brand new.
I like new code.
That's what everyone loves to hear.
Hey, you rewrote it from scratch.
Great.
You know, that code that we battle-tested
in production, serving
like half a million transactions
per second.
Well, that's obsolete now, I guess.
Nikolay: You're just going to
Lev: take this brand new code base
and check it out.
Nikolay: Yeah.
And it's written in Rust.
Lev: Yeah, absolutely.
Nikolay: Yeah, it's one of my
dreams to find some spare time
to learn and try it, because, yeah,
it looks like many folks are moving
to Rust.
So what do you think?
Lev: Well, it took me about a decade
to get good at it.
So the sooner you start, the sooner
you'll get it.
You know, it takes about 10 years
to be good at it.
So at least that's what it took
me.
That's okay.
Again, if you start today, eventually
you'll get good at it and
that's okay.
It's a journey, you don't have
to learn it immediately.
It's just, it's such a paradigm
shift in how things work.
The compiler is very aggressive
about checking for things, especially
concurrency errors, which is for
multithreaded asynchronous poolers,
very important.
I don't have concurrency bugs in
PgDog or PgCat because I'm using
Rust.
I don't have data races.
And that's really important.
And the number of bugs that I ship
is considerably lower because
the compiler is like, hey, listen,
this variable, you're not
using it.
And I'm like, oh yeah, crap, I'm
actually importing the wrong
variable here.
I'm using the wrong variable.
Good catch, because that was going
to be a bug.
A lot of things that are just not
available in other languages,
Rust makes really nice.
So it's really worth it. Again,
ask anyone who writes Rust,
and they'll say it's the best thing
that ever happened to them since
sliced bread.
I'm like, that's true.
I haven't thought about sliced
bread in a while, but Rust is
great.
Cool.
Nikolay: Another side question
is license.
I saw feedback.
I joined that feedback.
So, PgCat was, I think, on Apache
or MIT.
I don't remember exactly, but kind
of permissive.
And for PgDog, you chose AGPL,
right?
Can you elaborate a little bit
why?
Lev: Yeah, yeah, of course.
Yeah, happy to.
The answer is really simple.
AGPL is actually pretty misunderstood.
You can use that code and the application
anywhere you want and
never have to share anything back.
As long as you use it internally
and you don't provide, like,
you don't use PgDog like publicly
as like a service for running
PgDog.
And even if you do that, in that
case all you have to do is just
tell us what you changed
and send us patches.
So like, it's pretty minimal.
But it's a red flag for everyone
and that's okay.
Yeah, I'm building a company around
it.
Building a company around MIT code
base is, I think, probably
possible.
I've never done it successfully.
Building a company around AGPL,
I think, has been done before.
And I think it's probably fine.
But if it becomes a hurdle, I'm
not like married to it.
I just thought AGPL looks cool.
I like the ideas behind it.
I like free and open source code.
I don't think MIT is necessarily
the original idea behind, you
know, free and open source code.
I like MIT because I don't have
to think about it.
Nikolay: I'm checking PostgresML.
PostgresML is MIT, right?
Lev: Yeah.
Nikolay: Codebase.
Rust and MIT.
So it's interesting how you decided
to change it.
I agree it's misunderstood.
But it's already so.
The majority of people misunderstand
it, and we cannot change
that with a single project.
So it's reality.
Yeah, but you don't care if
Lev: it's open.
Nikolay: Okay, yeah, like you do
it.
Lev: Yeah, yeah, because if
somebody tells me, look,
I would love to use your code,
but AGPL is a deal-breaker, I'll
be like, well, we'll work something
out.
You know, that's not a big deal.
But I think it's a good thing to have a good license.
It's important.
Michael: You mentioned starting a company around it.
It strikes me it's going to be tough.
Like, obviously, new codebase, and the main use cases are at scale.
But normally startups, like, the easiest way of getting started
is serving smaller companies, right?
Like it's harder going straight to the enterprise with
something that's not yet proven. But you've got a plan,
though; it'd be great to hear what's the plan.
Lev: What's the plan?
It's okay, don't freak out, it's gonna be okay.
Yes, it's not actually that uncommon to have enterprise startup
products.
If the problem is interesting enough, there's always going to
be somebody who's going to be like, oh, great.
Somebody's working on it full time.
That'll be amazing.
At this early stage, how this works usually is I'm looking for
design partners.
So it's companies like Instacart who are like, hey, this is a
great idea.
We're gonna try it out.
We're gonna develop it together.
And at the end of the day, it's gonna be in production because
we built it together.
And that's actually good because you wanna build it with the
users.
Like you don't wanna build it by yourself like for several years
and then show up and be like, hey, does anyone need Postgres
sharding?
And Nikolay's like, well, I don't know, maybe, maybe not, depends.
So what I'm actively looking for right now, like codebase is
okay.
I'm sure there's bugs in it, performance issues, and that's totally
fine.
I'm just looking for people who'd be like, this is an interesting
idea.
I like the idea of Postgres sharding.
I like the way it's done at the pooler.
It's not done as an extension.
It's not done as some kind of other thing that I can't even think
of.
I like the way it's done.
So I'd like to try it out and help you finish the job.
You know, by deploying it in production, by benchmarking it,
by finding bugs, by reporting bugs, by even fixing bugs would
be great, but not required.
That's my job.
Nikolay: I can confirm you react very
quickly to requests.
I remember looking at PgCat, I
had the idea that mirroring, let's
have mirroring to have like a kind
of A-B testing, A-B performance
testing, right in production.
And you implemented it, it was
great.
I think it was you, or no?
Lev: Actually, to be perfectly
correct, it was actually Mostafa
at Instacart who implemented it,
because he had a use case for
it.
He was my design partner.
Nikolay: It's a very common request,
and the only problem, like,
you need to have this pooler already
in production to use it.
This is the trickiest part.
Lev: It's 0 to 1.
It's always tricky.
New stuff, especially in the hot
path, always going to be hard.
But if the problem is there, and
If the problem is big enough,
I'll find my champion.
Yeah, exactly.
Nikolay: So yeah, and you mentioned
PgDog.
Maybe let's move back to technical
discussions.
A little bit out of business and
license and so on.
So first of all, it's not like
PgCat; it's not explicit sharding
where you command with SQL comments
how to route.
PgDog has automated routing,
right?
And second thing, there's no Postgres
middleware for this.
So it's just a pooler with routing.
Can you explain architectural decisions
here and what do you
use and what kind of components?
You've got a parser from Postgres
to understand queries.
I'm very curious how you are going
to automatically route SELECTs
of functions which write,
for example, right?
Or SELECT FOR UPDATE, which you
cannot route to a physical standby,
a replica, right?
Or I don't know, something else,
like how you are going to, or
You obviously will have some limitations,
you already have some limitations,
right?
So can you talk about this a little
bit?
Lev: Yeah, of course.
SELECT FOR UPDATE is actually really
simple.
That's a clear intent to write
something.
So that's an easy 1, straight to
the primary.
No problem.
That's an easy 1, which I actually
should implement.
Now that I'm thinking about it,
I'm routing it to the replica
right now.
Bug issue incoming.
Thank you very much.
Nikolay: You're welcome.
Lev: The other 1 is the functions.
Obviously it's impossible to know
if a function is writing or
not by just looking at it.
Even if you look at the code, you're
not going to know.
So like static analysis, I don't
think is necessarily possible.
So for that 1, I think it should
be pretty easy.
You put it in the config, you have
a list of functions that actually
write.
Nikolay: This is what pgpool does,
right?
Lev: I think so, yeah.
I'm not sure.
Nikolay: pgpool does everything,
so I'm sure they do it as well.
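(Pgpool-II does indeed expose this as configuration; roughly,
in pgpool.conf, with illustrative function names:)

```
write_function_list = 'record_payment,refresh_rollups'
read_only_function_list = ''
```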
Lev: Exactly.
You have to be careful.
You can't do everything.
People don't believe.
I know that approach is not perfect
because if you add a new
function, you have to update the
config, and you're always going
to forget it; you're always going
to have issues.
So for that 1, I don't have a solution.
My theory so far is that it's
not as common as I'd like to think, but
I will probably be proven wrong,
and then we'll figure something
out, probably.
Some kind of migration process
that says: if you want to add
a function that writes, send it
through the pooler. Like, you should
be writing migrations and sending
them through the
pooler, not some kind of side channel.
And you can probably mark that
function as like, hey, this function
writes, like put in a comment or
something.
And then PgDog is gonna be like,
great, good to know.
You're gonna need persistent storage
for that kind of stuff,
which you could probably implement.
Nikolay: Yeah, I agree.
Some dogs behave like cats sometimes.
Lev: Exactly.
Yeah, exactly.
People will forget to put that
comment in and there's going to
be issues.
But that's just software, you know.
When there's people involved, there's
always going to be...
The more manual stuff you have
to do, the more problems there
are going to be.
But if you're writing a function
by hand in the first place,
you know, That's just got to be
part of your review process.
Michael: I was reading Hacker News
comments and somebody asked
about aggregates as well.
I think there's a limitation around
those at the moment.
What's the story there?
Lev: Well, when I posted the thing,
I didn't have support for
aggregates at all, but then I'm
like, hey, you know what, let's
add some.
I've been thinking about it for months.
I might as well just do a couple of simple ones.
So I added, yeah, I added count, a simple 1: you just sum
the counts across all shards.
Max, min, again, super simple.
I added sum as well.
Sum is really just sum everything.
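(Conceptually, the cross-shard merge looks like this; table names
are illustrative:)

```sql
-- Step 1: the same aggregate runs on every shard.
SELECT count(*) AS c, min(price) AS mn, max(price) AS mx, sum(price) AS sm
FROM items;

-- Step 2: the router merges the per-shard rows.
SELECT sum(c), min(mn), max(mx), sum(sm)
FROM per_shard_results;

-- Note: avg cannot be merged from per-shard averages; it has to be
-- computed as sum(sm) / sum(c) from shipped sums and counts.
```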
Nikolay: Yeah.
Yeah.
Some of the... Let's, let's make a small comment here.
It's MapReduce basically, like an analogy for it.
It's cool, right?
So.
Lev: It is cool, right?
This is cool.
Like, you know, it's scale
Nikolay: to billions, trillions and so on.
Lev: Precisely.
Yeah, this is like a MapReduce for Postgres is phenomenal.
I'm actually, okay, so I'm going to release a blog post in a
couple of days.
I'm actually sharding and doing MapReduce for pgvector.
I don't know if you've heard about this 1.
Yeah, yeah.
That one's really fun.
pgvector is like a-
Nikolay: No, no, pgvector we know, but what do you do with it?
Lev: Well, you're going to have to wait till the blog post comes
out, but it's really fun.
I'm doing both MapReduce and like a machine learning algorithm
to route queries in the cluster.
Because like scaling, like searching vectors is a completely
different problem than searching B-trees.
So I don't know how many people will need that solution.
Nikolay: Well, HNSW is terrible if you go beyond 1 million records.
It's a big problem still, so.
Lev: That's the number I had in my blog post as well.
I don't know why 1 million just feels like the right number,
but that's exactly what I said.
I'll answer you, over a million, you probably need to shard your
pgvector index.
Nikolay: Or use a different approach, yeah.
Exactly.
That's great.
And MapReduce: someone told me in the past that PL/Proxy by Skype,
very old tech, also looked like MapReduce, but it required you
to use only functions, which is a huge limitation, especially
if you have an ORM or GraphQL; it's a big showstopper.
And also it was Postgres in the middle, right?
For routing and for this MapReduce.
But in your case, this is more
lightweight software in the middle,
PgDog, right?
And it does some simple, like,
arithmetic operations.
Yeah.
And do you plan to define some
interface for more advanced operations
that the user could define, like,
beyond simple sum or count
or other aggregates?
Lev: I haven't worked much with
custom data types, UDFs, and all
that stuff, so that's going to
be a learning curve for me, I'm
sure.
I'm sure it's not that hard, but
once you add custom functions,
you need to add custom logic.
I think that should be pretty straightforward
to implement, if
there's synchronization between,
you know... if this is working.
Nikolay: This would give full-fledged
MapReduce capabilities
to this, right?
Lev: Yeah, absolutely.
More open
Nikolay: and interesting perspectives,
I suppose.
Lev: Absolutely, yeah.
If I find someone who thinks this
is cool as well, we could definitely
build it together.
Nikolay: Not only thinks it's cool,
but has some production to
try, right?
Because it's just- Absolutely.
Exactly.
Lev: Yes, absolutely, yeah.
I think that would be pretty terrific.
It will work pretty well, but yeah.
So again, there's a lot of interesting
things about aggregates.
For example, percentiles:
notoriously
difficult, basically impossible
to solve, I think, at the sharding
level, because you need to look
at the whole dataset to compute
them.
You can approximate it.
Approximation functions should
be a feature that we add
to PgDog that says, you
know what, I don't care about
the exact number. Like average,
the simplest 1, you could estimate
it.
I think
Nikolay: For percentiles, we
could define some custom data
type, right, to remember how many
members were analyzed and so
on; like, I don't know, to bring
not just 1 number from each
shard, but a couple of numbers,
and then it would be possible
to understand percentiles, maybe.
Maybe.
Not super difficult.
Michael: Feels like a similar,
like HyperLogLog had some similar,
like, I don't know how you do that
cross-shard, but it feels
like there might be some secret
sauce in what they've done already
that could be applied cross-shard.
Lev: Yeah, HyperLogLog is like
a counter, basically, approximates
how many members are in a set.
Actually, there's an extension
for it in Postgres, which you
can use.
Yeah, it's pretty fun.
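(The mergeability is exactly what makes it shard-friendly; a
sketch assuming the postgresql-hll extension, with illustrative
table names:)

```sql
CREATE EXTENSION hll;

-- On each shard: build a HyperLogLog sketch of the values seen there.
SELECT hll_add_agg(hll_hash_text(user_id)) AS shard_hll FROM events;

-- On the router: the union of HLLs is itself an HLL, so approximate
-- distinct counts can be merged across shards.
SELECT hll_cardinality(hll_union_agg(shard_hll)) FROM per_shard_results;
```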
But yeah, it's, I need a statistician,
like a data scientist
Nikolay: to come
Lev: and be like, all right, this
is how you approximate percentiles.
And I'd be like, great.
Do you know Rust?
Good stuff.
Yeah.
Nikolay: What we won't do with Rust.
Lev: Exactly, yeah.
Michael: Well, so what is, like,
what's next?
What are you looking, I guess it
depends on what people want,
but what does tomorrow look like
or the next week?
Lev: Well, you're gonna be shocked
to hear this, but I'm an engineer
who does sales now. Literally,
yeah, as I said, I'm
building a company. So I'm literally
just sending as many
LinkedIn messages and emails as I can to whoever
I can think of, to find design
partners, for people to be like,
hey, I want to help.
This problem exists.
First of all, that's the first
feedback I need to get.
Be like, I would like sharding,
I would like it solved, and I
would like it solved your
way.
That would be great.
Or, you know, just tell me how
you'd like it to be solved.
And if there's an overlap,
and I think there should be overlap,
we'll solve it together.
So that's what I'm doing mostly,
but you know I'm an engineer,
so I need, I need like a safe space
from, from all the social
activities.
So I still code.
So that's why the pgvector sharding
is coming out because I
needed to code a little bit and
I thought the idea would be cool.
And I'm gonna keep doing that.
I'm gonna keep adding these kind
of features, keep adding tests,
benchmarks, fixing bugs, finding
more use cases, like the SELECT
FOR UPDATE case that I forgot.
But yeah, that's that's the plan.
Nikolay: I have a specific question
about rebalancing without downtime.
Right.
If 1 shard is too huge, others
are smaller.
We need to rebalance.
What do you think?
Will this feature, first of all,
be inside the open source core offering?
Because we remember in Citus it
was not, until the Microsoft decision,
as I understand, or like a Citus team
decision, to make everything open source.
And a second question, actually, triggered by this one:
how do you compare PgDog to Citus?
Lev: Yeah, I would like it to stay open source; I don't foresee
myself writing closed source code.
Maybe things around deployments and orchestrating the stuff in
companies.
My idea, we'll see if it's realistic, is to sell managed, on-prem
deployments of PgDog to companies.
So probably the actual code that orchestrates that stuff
will be proprietary, you know, mostly because I'm embarrassed
how much bash I use.
Nobody really wants to know how that sausage is made.
Nikolay: Do you follow Google Bash code style?
Lev: Yeah, somebody taught me to use curly braces for my bash
variables, and ever since then I've been doing that religiously.
So I learn new things every day.
But I think the core will stay open source forever.
I don't see...
Even that data migration part, like that's a, that's a known,
there's a known solution for it.
There's no point of building a closed source solution that does
the same thing.
Like it's already been solved.
So might as well just...
Nikolay: It doesn't exist yet, right?
This, this rebalancing feature.
Lev: Well the rebalancing feature is basically, again, it depends
on your sharding key.
It depends how you store data on the shard.
Like Instagram wrote that blog post a long time ago where they
pre-partition everything into smaller shards.
And that's how Citus does it underneath.
Like, instead of, you see, if you say you want 3 shards,
it's not going to build you 3 tables.
It's going to build 128 tables and move them between the shards.
And then when 1 shard gets...
How?
How?
Well, logical replication became a thing in Postgres 10.
So, Citus uses that to move things around.
I think logical replication makes sense up to a point.
You really have to catch the tables at the right time.
Once they get a little bit too big, logical replication can't
catch up anymore.
So that's going to be an orchestration problem.
I've seen logical kind of...
You can
Nikolay: partition it virtually, like what PeerDB did.
They implemented virtual partitioning, splitting.
If your primary key or partition
key, partition key in this case,
allows you to define some ranges, you
can use multiple streams to
copy initially and then even to
have CDC.
So it's kind of interesting.
That's right.
So yeah.
Well,
Lev: it's funny that you mentioned
that, because, what is it, Postgres
16 allows us to now create logical
replication for...
Nikolay: ...with binary equals true.
Postgres 17.
17?
Or 16.
Maybe 16.
1 of those.
Maybe you are right.
I just looked at the documentation
a few days ago and I already
forget.
Yeah, but still, binary is good,
but it's the whole table.
Maybe we don't need it.
If we rebalance, we don't.
Well, in case you already split into
partitions, it's fine.
But if not, yeah, I'm very interested
to understand design decisions
here.
It's going to be interesting.
Because I think for sharding at
large scale, this is 1 of the
key features to understand.
Lev: Yeah.
Well, so when you start from,
like, 1 big database and shard
it into, like, 12, what we did at
Instacart was: we created a replication
slot, we took a snapshot, we restored
it 12 different
times, deleted the data that's
not part of each shard, synchronized
it with logical replication, and
launched.
So deleting data is faster than
writing it from...
Nikolay: Copy it as a physical replica
first, right?
Or copy it logically, like provisioning
a logical replica, basically
dump/restore, before binary.
Lev: No dump restore.
So RDS, we were on RDS, so EBS.
Nikolay: This leads us to this
discussion about upgrades, because
recovery_target_lsn doesn't exist there.
How... like, let's follow up on this.
This is a good bridge to zero-downtime
upgrades on RDS.
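(For context, on self-managed Postgres this is just a recovery
setting; the LSN value here is illustrative:)

```
# postgresql.conf on a standby being rolled forward to an exact point:
recovery_target_lsn = '2/5F000060'
recovery_target_action = 'promote'
```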
Lev: Yeah, those are fine.
Oh yeah, so the second question,
yeah, Citus.
So the difference is mostly philosophical
in the architecture.
Citus runs inside the database,
which limits it in 2 ways.
First of all, the database host
has to allow you to run Citus.
Maybe because of AGPL, maybe because
they just don't want you
competing with their internal products.
Again, you know, it's business,
it's all fair game.
And then the second 1 is performance.
When you run something inside Postgres
that needs to be massively
parallel, you know, you're limited
by the number of processes
you can spawn and by number of
connections you can serve.
So PgDog is asynchronous, Tokio,
Rust, lightweight.
It's not even threaded.
I mean, it's multi-threaded, but
it's mostly like asynchronous,
like task-based runtime.
So you can connect, I mean, I'm
gonna pull up a big number here
just for, you know, for effect,
that you could have like a million
connections going to PgDog from
a single machine, and that technically
should work, because it's using
epoll underneath.
But for Postgres, you could probably
do like, you know...
Nikolay: You need PgBouncer or
something in front of
Lev: it.
You need PgBouncer and you need
most of those connections to
be idle because concurrency-wise
Postgres can only do maybe like,
you know, 2 per core.
That's the myth at least.
So again, like Citus, single machine,
they have some kind of
support for multiple coordinators,
but I think the readme just
says like, please contact us.
Nikolay: In any case, you're in the
hands of Microsoft: you need
to go to Azure, or you need to
self-host everything.
You cannot use it on RDS, because
extensions are required.
In your case, no extensions are required.
This is the key.
You can run it with RDS, because
extensions are not needed.
Exactly.
So you can run it on any Postgres.
I see a lot of guys trying
to develop the extension ecosystem.
Over time, I became a big opponent
of the extensions idea, because
we have a lot of managed services,
and if you develop an extension,
it takes a lot of time to bring
that extension to a managed provider.
It takes years.
So if you can do something without
extensions, it might be better
in some cases.
And sharding maybe is such a case.
Lev: No, absolutely.
Yeah.
Because if you develop an extension
that gets installed by RDS,
like, I don't know if they're gonna
pay
Nikolay: you for that.
Approve it and support.
Lev: Approve it or support it or
all that stuff.
RDS notoriously upgrades like once
a year.
Nikolay: We need to wrap up soon,
but a few words about
encryption.
Does PgDog support encryption
already?
Because it's super important for...
And it can be a bottleneck.
I know...
The Odyssey connection pooler was created
because PgBouncer needed...
2 layers of PgBouncers to handle
a lot of...
Yeah, this is what several companies
did in the past: two layers
of PgBouncers, because of the TLS handshake.
So, okay.
Tell us about the handshake.
What, what's happening in this area,
encryption?
Lev: Sure.
Yeah.
It supports TLS.
It's using a library from the Tokio
ecosystem, one of the Rust libraries
that implements TLS.
It's completely fine, you can use
TLS.
And, my personal favorite, I finally
found a library that implements
SCRAM-SHA-256 authentication.
So now that's finally supported.
Nikolay: Yeah, I saw MD5 is going
to be deprecated in the next
few years in Postgres, so...
Lev: It still comes up.
I mean, it's been deprecated for
10 years, and people still use
it just because it's really simple
to implement.
And SCRAM is really hard.
Nikolay: And one more small question
about...
Yes, about prepared statements.
I know PgCat supported them, right?
In transaction pool mode, right?
Lev: Yeah, the implementation wasn't
great, but it does support
it.
Yeah, PgDog supports them too.
Much better implementation this
time.
Nikolay: Okay, cool.
Well, I'm not out of questions.
I'm out of time.
Yes!
But it was absolutely interesting.
Thank you so much.
I'm definitely rooting for it and going
to follow the project.
Best of luck to you and your new company.
Maybe Michael has some questions
additionally.
I took the microphone for too long
this time.
Apologies.
Michael: Well, Lev, is there anything
we should have asked that
we didn't?
Lev: Oh, no you guys are, I think
you covered it.
Michael: Well, really nice to meet
you, thanks so much for joining
us, and yeah, catch you next week, Nikolay.
Nikolay: Thank you so much, bye-bye.
Thank you.