Nikolay: Hello, hello.

This is Postgres.FM.

My name is Nikolay, Postgres.AI,
and as usual, co-host

is Michael, pgMustard.

Hi, Michael.

Michael: Hello, Nikolay.

Nikolay: And we have a super interesting
guest today, Lev Kokotov.

Hi, thank you for coming.

Lev: Thanks.

Hi, Nikolay.

Glad to be here.

Nikolay: Yeah, and the reason we
invited you is when I saw PgCat,

it was interesting.

At that time, I had a spike of
interest to sharding solutions,

and I saw PgCat was started as
a connection pooler alternative

to PgBouncer.

Actually, at the same time, a few
other teams started some new

projects.

But then sharding was, oh, okay, sharding was there.

It's quite a straightforward way, just SQL comments.

It was interesting, but I'm not
a cat guy.

I'm a dog guy.

And once I saw PgDog started, focused on sharding, it obviously attracted much more attention from me.

So that's why we invited you, actually.

Awesome.

Yeah.

Let's talk about, I don't know,
like maybe sharding itself and

probably let's start with the idea
that not everyone needs it

Because on a single Postgres cluster, you can grow and reach a multibillion valuation, go to IPO. I have examples.

But obviously, sometimes it's really
tough, right?

Tough job to maintain a huge monolith.

So what's your take in general
about the idea that you don't

need sharding?

Lev: Oh yeah, well that's a good
1.

I agree in principle.

So you know, Postgres can be pushed
quite far and we pushed it

very, very far at Instacart, pre-IPO
I should add.

But we IPO'd it as a massively sharded database.

And we absolutely had to because
a lot of things started to run

quite slow.

We had a lot of writes.

Instacart was, you think most apps
are 90% read, 10% write.

I don't know if that was the case
for us.

I think we're a little bit more
like maybe like 80-20, maybe

70-30, but we were doing a lot
of writes.

Nikolay: Hold on.

80-20 means 80 writes or 80 reads? As usual, reads, right?

Lev: Yeah, yeah, reads.

Yeah, yeah.

I'm thinking typical workloads are like 90% reads, 10% writes.

Nikolay: Or even more, or even more sometimes.

Social media is definitely more reads, right?

But here's the thing.

Lev: Yeah, here it was a little bit different, not by much, but even that small percentage, like 10%, 15% writes, was enough to push our r5.24xlarge, I don't know, like 192 cores, almost a terabyte of RAM, over the edge.

Nikolay: Sounds like RDS.

Actually, I'm super curious.

It's off topic, but I'm super curious about the so-called Instacart zero-downtime upgrade approach, because at Postgres.AI we believe it leads to corruption.

But I hope we will discuss it at a different time.

Let's hold on to this.

Lev: Yeah, happy to talk about it on another podcast.

I was 1 of the guys who worked on it.

Nikolay: Consider yourself invited.

Zero-downtime upgrades made on managed Postgres, which doesn't allow changing recovery_target_lsn.

Lev: Yep.

I did it myself.

We had zero-downtime.

Nikolay: Yeah.

It's super interesting topic.

Maybe in a few weeks, I'm really looking forward.

So yeah, I understand. Reads we can scale easily, because it's just replicas, until some point, right?

But for writes, we have only the primary.

And you need to go either to services, microservices, or sharding, or a combination of them.

There are several routes here, actually 2 major routes.

Lev: Yeah, Instacart actually sharded, what we call functionally sharded.

I don't know if that's a real term in the industry, but we just took tables out of our main database and put them into a different database, you know, functional sharding, air quotes.

Nikolay: Vertical split.

Vertical partitioning.

Exactly.

Lev: And that happened even before I joined the company.

So that was way before, you know, IPO talks and all that stuff.

So you can run out of capacity on a single machine quite quickly

if you use it a lot.

Sorry, that's a little bit of a tautology, but that's just the

case.

Like, it depends on your workload.

We did a lot of machine learning.

So we wrote a lot of bulk data, a lot, like daily.

We would have like hundreds of
gigabytes of data that were completely

new data every day, basically.

So a lot of writes were happening
and sharding was the only way,

basically.

We were running out of...

Our vacuum was always behind, and
when vacuum is behind, everyone

gets a little bit scared.

Performance is worse as well.

We were getting a lot of lock contention
on the WAL.

That happens a lot.

The WAL is single-threaded to this day, and that's totally fine, but you end up hitting it, you know, when you write like hundreds of thousands of transactions per second.

Nikolay: Well, it's single-threaded
if you use WAL writer, but

backends can write to WAL themselves
if you turn it off.

Like this setting, I forgot.

Lev: Yeah, no, you're right.

You're absolutely right.

But there's still a lock.

There's still a lock.

Yeah.

You have to grab a lock.

Nikolay: We see it in Performance Insights.

If it's on RDS, queries are spending time on LWLock.

Lev: That's the guy.

And that's the guy who takes your
website offline on Sunday,

every morning, like a clockwork.

Nikolay: On the one-night-defensive
foreclose.

And I would say traditional optimization
would be, let's write

less.

But I assume only in Postgres 13, WAL-related metrics were added to pg_stat_statements and EXPLAIN.

Before that, it was very difficult to understand which queries... well, with Performance Insights you can, based on wait event analysis, identify queries which are WAL-write intensive, right?

And maybe optimize them, but it's still a limited approach, right?
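As a rough illustration of what became possible with those metrics: a minimal sketch, using Rust with the postgres crate and a hypothetical connection string, that pulls the top WAL-producing queries from pg_stat_statements (the wal_bytes, wal_records, and wal_fpi columns exist since Postgres 13; EXPLAIN (ANALYZE, WAL) gives the same information for a single statement).

```rust
use postgres::{Client, NoTls};

fn main() -> Result<(), postgres::Error> {
    // Hypothetical connection string; requires the pg_stat_statements extension.
    let mut client = Client::connect("host=localhost user=postgres dbname=postgres", NoTls)?;

    // Top 10 queries by total WAL generated (columns available since Postgres 13).
    let rows = client.query(
        "SELECT queryid, calls, pg_size_pretty(wal_bytes) AS wal
           FROM pg_stat_statements
          ORDER BY wal_bytes DESC
          LIMIT 10",
        &[],
    )?;

    for row in rows {
        let queryid: i64 = row.get("queryid");
        let calls: i64 = row.get("calls");
        let wal: String = row.get("wal");
        println!("queryid={queryid} calls={calls} wal={wal}");
    }
    Ok(())
}
```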

Lev: Well it's limited in its success
in a company because when

you tell somebody like, hey, can
you write less to your database?

People are like, I'm sorry, but
is that your job?

You can write less,

Nikolay: But we have, well, WAL writes are a problem per query.

Consider a query which writes to a table, and 2 situations: this table has a couple of indexes versus 20 indexes.

When we insert something, every index gets an insert, and it amplifies the amount of WAL written for the same insert.

Lev: It's really important to know
that every single index has

an important business use case.

And that's why it was put there
in the first place.

So like all of these are invariants.

Like when you say I have 20 indexes on a table and it's write-heavy, that's by design.

And when you want to write more
data to that table, because your

business is growing, like us as
a database engineering team,

we're like, all right, we're going
to make it happen, because

we can't go to like 25 individual
teams and tell them like, hey,

can you fix your query, they're
going to be like, I don't have

the time, I'm trying to sell groceries
here.

You know, like I have other concerns.

Nikolay: But anyway, I'm trying to say I agree with you that without sharding, we are limited. Like, we need to squeeze, squeeze, squeeze, but it's very limited.

And at some point, you cannot say we can grow 2x anymore, right?

Lev: That's right.

And that 2X is actually on the low end.

What most engineering leaders expect
is like 3, 5, 6X.

10.

Yeah, 10 to have that runway.

If you don't have that runway in
your system, it's a red flag

for most.

Nikolay: And it goes to CTO level,
basically.

The decision is like, should we
migrate out of Postgres to, I

don't know where, right?

Lev: Exactly.

Nikolay: I've seen this so many times, including sitting in the CTO position myself, like 15 years ago.

Like I know this pain of sharding.

Yeah.

Lev: It's really funny.

You're just kind of sitting there
and be like, Hey guys, I need

10X capacity.

And your database engineer is like,
well, it's the WAL.

There's nothing I can do.

It's the WAL.

It's like, it's just a disconnect between these, you know, these 2 people. You just need to make this thing scale.

Like that's the end.

And a lot of this is out of your control.

So,

Nikolay: yeah.

Like, we can also enable compression for full-page writes.

We can also partition so writes become more local, and some... I don't know, you're skeptical, right?

Lev: I am, because all of that,
partitioning is great.

I would never speak bad against
partitioning, but...

Nikolay: Not every partitioning,
sorry for interrupting, not

every partitioning.

If, for example, people say, okay, we are going to spread all customers evenly among partitions, this is not going to help us in terms of WAL writes.

But if you say, we will have a hot partition receiving most writes, and kind of archive partitions where vacuum and everything has stabilized, in this case it's super beneficial for WAL writes and write-intensive workload optimization, right?
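A minimal sketch of the hot/cold layout Nikolay describes, with a hypothetical schema and connection string; only the time-range idea matters here, not the exact columns.

```rust
use postgres::{Client, NoTls};

fn main() -> Result<(), postgres::Error> {
    // Hypothetical connection string and schema.
    let mut client = Client::connect("host=localhost user=postgres", NoTls)?;

    // Time-range partitioning: current writes land in one "hot" partition,
    // while older partitions stop changing, which keeps vacuum, full-page
    // writes, and incremental backups focused on a small part of the data.
    client.batch_execute(
        "CREATE TABLE events (
             created_at timestamptz NOT NULL,
             payload    jsonb
         ) PARTITION BY RANGE (created_at);

         CREATE TABLE events_2025_06 PARTITION OF events
             FOR VALUES FROM ('2025-06-01') TO ('2025-07-01'); -- hot
         CREATE TABLE events_2025_05 PARTITION OF events
             FOR VALUES FROM ('2025-05-01') TO ('2025-06-01'); -- cold",
    )?;
    Ok(())
}
```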

Lev: Absolutely, but then you're
going to end up having to read

those cold partitions anyway.

It's a temporary band-aid, which
we did.

Again, absolutely, we partitioned
a lot of tables and that helped

us with a lot of things.

But at the end of the day, you
just need more compute.

That's the bitter lesson of databases.

I don't know if you're familiar
with the bitter lesson in AI,

where you just need more machines
to solve the problem.

It's the same thing in databases.

You just need more compute to just
be able to write, read more

stuff and write more stuff.

Nikolay: Yeah, I'm still pulling
us back intentionally because

we are having conversations about
like scaling problems.

And interesting insight I've got
yesterday talking to a very

experienced Postgres expert, that
partitioning not only helps

with data locality, write locality, and full-page writes, and WAL volume as well, but also with backups, surprisingly, if we do incremental backups with WAL-G, pgBackRest, or with the new incremental backup API. I still need to learn about this more.

Or snapshots, cloud snapshots of
EBS volumes and RDS also relies

on it, as I understand.

Imagine writes are spread out everywhere,
and you constantly

change a lot of blocks, versus
you change only specific blocks,

and some blocks are not changed.

And it's less pressure on backups
as well.

It's interesting and it affects
DR and RPO, RTO.

But I just wanted to ask. My understanding is, if we go to sharding blindly and miss some optimizations... I remember I was dealing with Mongo, maybe 10-plus years ago, and I saw each node of Mongo could handle much less workload than 1 node of Postgres.

Maybe jumping to sharding too early is premature optimization, and you will not be in good shape in terms of how much money you spend on compute nodes, because you missed a lot of optimization steps.

What do you think?

Lev: Absolutely.

If your first step is like, my
database is broken, I need to

shard, you missed a lot.

Of course, you should look at all
the possible optimizations.

And that's really important to
keep you on a single machine.

I think the benchmark for me, for when you need to start thinking about sharding, is when you've tried like 3 or 4 or 5 different things, and you have like a person or 2 working on it full-time, and at the end of the day they're like, I ran out of ideas.

Nikolay: Do you have some numbers for modern hardware, like how much WAL can be written per day, for example, a terabyte, 5 terabytes, where it's already at the edge, right?

Lev: Yeah, for sure.

I can give you the exact number.

We benchmarked Postgres in the best possible scenario, actually.

It was on EC2, it had NVMe drives, it had ext4 on a RAID 0.

So we did not want durability.

We wanted performance.
We wanted performance.

That thing could write 4 or 5 gigabytes
per second using Bonnie++.

Now Postgres was able to write
about 300 megabytes per second

of just like whatever.

I think we were using COPY, like the fastest possible way to dump data into PG.

And it's just, it's the nature
of the beast, right?

You have to like process it, you
have to put it into a table.

I'm sure there's a lot of checks and balances going on in between, and again, lock contention and all of that stuff.

So you can't necessarily squeeze everything out of your machine even if you have the capacity.

So ultimately you need to split
the processes themselves between

multiple machines once you reach
that number.

And then, you know, for everything
else, if you have, again,

if you have write amplification,
things are getting a little

bit more tricky.

And then, so I think that's usually the number: once you're writing like, you know, 200 megabytes per second of just the WAL.

And you can see that in RDS, there's a graph for that, just WAL writes.

I would even start earlier.

200 megabytes

Nikolay: of WAL per second. Each WAL segment is 16 megabytes, but on RDS it's 64, they changed it.

So how many WAL files per second is that? It's a lot, right?

Lev: It's a lot, and that's the scale you're looking at.

Nikolay: 10 plus is already too
much.

Yeah, I agree.

Lev: Yeah.

So that's the red line, right?

If you're at the red line and you're
thinking about sharding,

we're gonna have a bad time.

And that's okay.

It's okay to have a bad time.

These lessons are learned when
you're having a bad time.

I had a really bad time.

Nikolay: Yeah, and I pulled up the bc calculator just to check: 200 megs per second gives you about 17 terabytes per day.
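If you want to measure this on your own system rather than eyeball an RDS graph, a small sketch (hypothetical connection string) that samples pg_current_wal_lsn() twice and reports the WAL generation rate:

```rust
use postgres::{Client, NoTls};
use std::{thread, time::Duration};

fn main() -> Result<(), postgres::Error> {
    // Hypothetical connection string; run against the primary.
    let mut client = Client::connect("host=localhost user=postgres", NoTls)?;

    // Total WAL ever written, in bytes, as of "now".
    let lsn_bytes = |c: &mut Client| -> Result<i64, postgres::Error> {
        let row = c.query_one(
            "SELECT pg_wal_lsn_diff(pg_current_wal_lsn(), '0/0')::bigint",
            &[],
        )?;
        Ok(row.get(0))
    };

    let start = lsn_bytes(&mut client)?;
    thread::sleep(Duration::from_secs(60));
    let end = lsn_bytes(&mut client)?;

    // 200 MB/s sustained works out to roughly 17 TB of WAL per day.
    let mb_per_sec = (end - start) as f64 / 60.0 / 1_000_000.0;
    println!("~{mb_per_sec:.1} MB of WAL per second");
    Ok(())
}
```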

Yeah,

Lev: that's mild, honestly, for
what we did at Instacart.

We had like petabytes of WAL in
S3, like after like a week.

And That's just, again, it's the
nature of the beast.

Nikolay: Multiple clusters?

Lev: Single cluster.

Yeah, it's just, it is.

It's called selling groceries.

What can I tell you?

We had like, you know, multi, hundreds
of gigabytes of machine

learning data that would come in
every day because the model

was retrained on a nightly basis
and all the embeddings have

changed, right?

Again, the use cases are almost
like, I mean, they're interesting,

but they're kind of off topic,
I guess, because they're just

like, you just write a lot of data.

That's what you do for a living,
right?

That's what Postgres is for.

That's what databases are for.

So the numbers are like, well,
this is the number, and that's

when you should split up.

Nikolay: Yeah, so let's probably get back to more sane situations.

Let's say 10 terabytes per day or something of WAL data.

It's already too much.

If you approach that, sooner or later it's a wall you will be hitting. Several walls, actually, obviously you will be hitting.

Including lightweight...

Yeah, yeah, several walls.

Yeah, I was trying to say lightweight locks, WAL-related lightweight locks, WALWrite, right?

There are several of them.

Yeah, but this is the scale where it gets scary.

Lev: It's hard to put a number
on it because you can push that

further, and who knows what's going
to happen to your app.

Maybe that's as high as you'll
go.

Nikolay: And not everyone has local NVMes, because in cloud environments they are ephemeral.

It's kind of exotic and risky.

Lev: It is.

It's definitely risky.

I love those because when you click
reboot on the machine, it

wipes the encryption key and all
your data is gone.

So you better be careful which
button you press.

There's like a hot restart, which
is safe, and there's the actual

restart, which will delete all
your data.

So stay on RDS if you can.

Nikolay: But no, I don't agree here, because you should just use Patroni and multiple replicas, at least 3 nodes, and you'll be fine losing 1 node.

Lev: You lose 1, something goes bad... anyway, that's totally off topic.

Nikolay: Okay, thank you.

For me, it's enough to understand
that there are cases definitely

when you do need to split vertically,
and splitting vertically

usually is limited as well.

It's hard sometimes, and at some
point we need sharding.

Yeah, so it's enough to explain
that definitely there are cases

where you need to scale Postgres
beyond 1 primary, it can be

vertical or horizontal.

The result is sharding, right?

Lev: That's right.

Nikolay: So, good.

Yeah.

This is great explanation.

Sharding is definitely needed,
but not for everyone maybe, but

for many.

Lev: I mean, yeah, I hope not for
everyone.

That would be, I mean, you know,
we solve the problem once and

for all.

Nikolay: Actually.

Anyone has access to

Lev: it, yeah.

Nikolay: Last week, Michael and I had a discussion about snapshots, and I said that Cloud SQL has a hard limit, 64 terabytes, because this is the limit for GCP persistent disks, including PD-SSD.

And I said RDS obviously allows you to grow further, beyond 64, but it doesn't.

After we recorded, I checked, and it has the same limitation, 64 terabytes.

It's another reason to have multiple
primaries and scale beyond

1 cluster.

Lev: Well, you'll hit that sooner
on Postgres because of the

32 terabyte table limit.

And actually I did hit that limit
once and that was a scary error.

The only thing that it says is
like, there's no more bytes to

be allocated or something like
that in some kind of database

file.

I'm like, what?

You know, and then you're done.

Like, there's nothing you can do.

You have to move data out of that table.

You have to take the database offline, obviously.

Nikolay: It's a single table.

Okay.

Okay.

Lev: Yeah.

A single table.

Yeah.

Which is actually not uncommon.

Nikolay: Well, it's still not super
common as well.

Yeah.

I see only cases of like 10 terabytes per table.

And that's already when any DBA should scream, where is partitioning?

Lev: Well, DBAs like to scream,
but application engineers be

like, I can't, aren't you the DBA?

Like, what am I doing?

Do something, right?

Nikolay: But partitioning unfortunately
requires application

code changes.

Lev: Yeah, precisely.

Nikolay: Yeah, so sharding is needed
for big projects, obviously.

Yeah.

Agreed.

What's next?

Michael: Can we go?

I reckon it's time to go back to
the origin story of PgDog and

obviously then PgCat.

It would be great to hear like
a little bit of that story, Lev.

Lev: Yeah, absolutely.

I'm happy to.

Do you guys care about the name
of the project?

I don't know if, or do you want
to know what the inspiration

for that came from?

Nikolay: Sure, it's fun, yeah.

Lev: Okay, well, when I started
PgCat, we just got a new cat

and I really loved that cat.

And I was working on Postgres,
so the 2 things came together

naturally.

You could probably guess the origin
story of PgDog now, I got

a dog, and then I'm like, look,
I love this dog.

What can I call my next project?

Obviously PgDog.

Yeah.

So the naming issue solved.

PgCat came from the idea that it was really simple.

Sharding was not even in scope back then.

It was just like, we ran PgBouncer.

PgBouncer could only talk to 1 database at a time.

It makes sense, you know, you were pooling just the 1 Postgres instance.

We had a bunch of replicas.

We needed to load balance.

And we needed a load balancing
algorithm that was smart.

When a replica went offline because
of, again, hardware issues,

scaling issues, whatever, we needed
to remove it from the rotation

without affecting the app.

So we would regularly lose a replica.

And then most of the site would go offline, because we had a Ruby gem that would randomize access to those replicas, and when 1 of them broke... it worked okay, but doing this in the application code is really hard.

Especially in Ruby, where there's a way to inject exceptions into Ruby threads, like Sidekiq, and basically that breaks your state.

So we had multiple gems working against each other, and we just needed to do that in a place where it made more sense.

A load balancer is typically outside the application, because you have multiple applications, so it can handle them all in one place.

So I just built a load balancer
basically.

It was after actually I left Instacart.

I was just doing it as a side project
just to keep my mind going.

So I built it.

It was really simple.

Used banning logic, which was kind of novel at the time for Postgres.

If you receive 1 single error from
the database, it's removed

from the rotation.

It's very aggressive, but Postgres
never throws errors, like

network-related errors, unless
there's a serious problem.

So that actually worked pretty
well.
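A minimal sketch of that "ban on first error" idea, not PgCat's actual implementation: replicas are picked round-robin, and a reported error takes one out of rotation for a while.

```rust
use std::sync::Mutex;
use std::time::{Duration, Instant};

// Sketch of "ban on a single error" load balancing, loosely modeled on the
// idea described here; not PgCat's actual code.
struct Pool {
    replicas: Vec<String>,                // e.g. "replica-1:5432"
    banned: Mutex<Vec<(usize, Instant)>>, // (replica index, when it was banned)
    ban_for: Duration,
}

impl Pool {
    /// Round-robin over replicas, skipping anything currently banned.
    fn pick(&self, counter: usize) -> Option<usize> {
        let banned = self.banned.lock().unwrap();
        (0..self.replicas.len())
            .map(|i| (counter + i) % self.replicas.len())
            .find(|i| {
                !banned
                    .iter()
                    .any(|(b, since)| b == i && since.elapsed() < self.ban_for)
            })
    }

    /// One network-level error is enough to take a replica out of rotation.
    fn report_error(&self, replica: usize) {
        self.banned.lock().unwrap().push((replica, Instant::now()));
    }
}

fn main() {
    let pool = Pool {
        replicas: vec!["replica-1:5432".into(), "replica-2:5432".into()],
        banned: Mutex::new(Vec::new()),
        ban_for: Duration::from_secs(60),
    };
    pool.report_error(0); // replica-1 returned an error, so ban it
    assert_eq!(pool.pick(0), Some(1)); // traffic now only goes to replica-2
}
```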

I talked to a friend of mine at Instacart.

I was like, hey, look, I built
this on the side.

That looks fun, right?

Like, we thought about this.

And he's like, yeah, all right.

Didn't you quit?

I'm like, yeah, but no, I have
some free time to work on this.

Right.

And he's like, okay.

And then he took the code, added
a bunch of features that I didn't

add.

Cause obviously I didn't have a
use case anymore for it.

And he's like, oh, great.

We're going to deploy it.

Right.

And we tried it and they use it
and they're, I mean, they put

so much work into it.

They wrote a blog post about it.

You probably know about this.

So that went pretty well, you know,
so it's working in production.

It's great.

And then I'm like, all right, well,
you know, sharding is the

next 1.

They have a bunch of sharded databases
that we sharded and adding

sharding routing to that would
be great because again, it was

done in the application layer and
application layer routing,

I think we'll all agree is a little
bit iffy, especially if you

have more than 1 app written in
more than 1 language, like you

have to repeat the same logic across
all apps.

So I added just a comment-based system.
I knew like there's 2 sharding
schemes at Instacart.

1 uses, you know, the actually
the hashing function from partitions

in Postgres.

I love that because you can actually
split data both at the client

and in the server.

So you have multiple ways to move
your data around and the other

1 is just a custom 1: we use SHA1, take the last few bytes, and then mod that. It's, you know, random, but it was available in multiple systems as well.

The data for that came from Snowflake, so we could actually shard the data in Snowflake and then ingest it into the instances directly.

And then on the routing layer in
Ruby, same hashing function,

you know, the sharding key is always
available.

So that was good.
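A sketch of the second, custom scheme as described: SHA-1 the sharding key, take the last few bytes, mod by the shard count. The sha1 crate is an assumed dependency, and the exact byte slice and shard count are illustrative, not Instacart's or PgDog's actual function.

```rust
use sha1::{Digest, Sha1}; // the sha1 crate is an assumed dependency

// Hash the sharding key with SHA-1, take the last few bytes, mod by the
// number of shards. Illustration of the scheme described above only.
fn shard_for(key: &str, shards: u64) -> u64 {
    let digest = Sha1::digest(key.as_bytes()); // 20 bytes
    let tail: [u8; 8] = digest.as_slice()[12..20].try_into().unwrap();
    u64::from_be_bytes(tail) % shards
}

fn main() {
    // The same function can run in the app, in an ETL job, and in the pooler,
    // so every layer agrees on where a given customer lives.
    println!("customer 12345 lives on shard {}", shard_for("12345", 12));
}
```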

So I added both of them and they're
like, great, that's great.

And they tried, I think, the SHA1
function and I think it's working

pretty well for them.

So that was fun.

Then I started another company
that had nothing to do with any

of anything.

PostgresML, you might've heard
about it.

That came from, you know, the idea that we shouldn't ingest hundreds of gigabytes of machine learning data into Postgres; instead, we should just ingest a 3 megabyte model and run inference online.

You know, it was okay.

Stayed about 2 and a half years there.

Didn't work out. That's the fate of a lot of startups. I left.

And then I had some free time and
I'm like, well, what do I like

to do in my free time, guys?

Writing Postgres poolers.

This is what I do.

This is what I do to, you know,
rest on vacation.

I write Postgres poolers.

I'm like, well, let's do sharding
for real this time.

Let's actually take what we built
at Instacart, make it into

software because That's what we
do, right?

We come up with an idea, well,
we find a problem, we find a solution,

we write it in code, and we don't
have to solve it again every

single time manually for like,
you know, hopefully hundreds,

hopefully thousands of use cases
for this.

You know, we'll see.

I'm still doing my research.

But yeah, so that's PgDog.

Sharding is, you know, selling point number 1.
Everything else is, you know, there's
obviously, it's obviously

a pooler, it's obviously a load
balancer.

It has all the features that PgCat
has.

Almost all of them, I'm adding
them as I go.

It's a rewrite, it's brand new.

I like new code.

That's what everyone loves to hear.

Hey, you rewrote it from scratch.

Great.

You know, that code that we battle
tested in production and serving

like half a million transactions
per second.

Well, that's obsolete now, I guess.

Nikolay: You're just going to

Lev: take this brand new code base
and check it out.

Nikolay: Yeah.

And it's written in Rust.

Lev: Yeah, absolutely.

Nikolay: Yeah, it's in my kind
of dreams to find some spare time

to learn and try it because, yeah,
it looks like many folks move

to Rust.

So what do you think?

Lev: Well, it took me about a decade
to get good at it.

So the sooner you start, the sooner
you'll get it.

You know, it takes about 10 years
to be good at it.

So at least that's what it took
me.

That's okay.

Again, if you start today, eventually
you'll get good at it and

that's okay.

It's a journey, you don't have
to learn it immediately.

It's just, it's such a paradigm
shift in how things work.

The compiler is very aggressive
about checking for things, especially

concurrency errors, which is for
multithreaded asynchronous poolers,

very important.

I don't have concurrency bugs in
PgDog or PgCat because I'm using

Rust.

I don't have data races.

And that's really important.

And the number of bugs that I ship
is considerably lower because

the compiler is like, hey, listen,
this variable, you're not

using it.

And I'm like, oh yeah, crap, I'm
actually importing the wrong

variable here.

I'm using the wrong variable.

Good catch, because that was going
to be a bug.

A lot of things that just are not available in other languages, Rust makes really nice.

So it's really worth it. Again, ask anyone who writes Rust, and they're like, that's the best thing that ever happened to me since sliced bread.

I'm like, that's true.

I haven't thought about sliced
bread in a while, but Rust is

great.

Cool.

Nikolay: Another side question
is license.

I saw feedback.

I joined that feedback.

So, PgCat was, I think, on Apache
or MIT.

I don't remember exactly, but kind
of permissive.

And for PgDog, you chose AGPL,
right?

Can you elaborate a little bit
why?

Lev: Yeah, yeah, of course.

Yeah, happy to.

The answer is really simple.

AGPL is actually pretty misunderstood.

You can use that code and the application
anywhere you want and

never have to share anything back.

As long as you use it internally
and you don't provide, like,

you don't use PgDog like publicly
as like a service for running

PgDog.

And even if you do that in that
case, all you have to do is just

tell us like, what did you change
and send us patches, like what

did you change?

So like, it's pretty minimal.

But it's a red flag for everyone
and that's okay.

Yeah, I'm building a company around
it.

Building a company around MIT code
base is, I think, probably

possible.

I've never done it successfully.

Building a company around AGPL,
I think, has been done before.

And I think it's probably fine.

But if it becomes a hurdle, I'm
not like married to it.

I just thought AGPL looks cool.

I like the ideas behind it.

I like free and open source code.

I don't think MIT is necessarily
the original idea behind, you

know, free and open source code.

I like MIT because I don't have
to think about it.

Nikolay: I'm checking PostgresML.

PostgresML is MIT, right?

Lev: Yeah.

Nikolay: Codebase.

Rust and MIT.

So it's interesting how you decided
to change it.

I agree it's misunderstood.

But it's already so.

The majority of people misunderstand it, and we cannot change that with a single project.

So it's reality.

Yeah, but you don't care if

Lev: it's open.

Nikolay: Okay, yeah, like you do
it.

Lev: Yeah, yeah, because like if
somebody tells me like look

I would love to use your code,
but AGPL is a deal-breaker I'll

be like well, we'll work something
out.

You know, that's not a big deal.

But I think it's a good thing to have a good license.

It's important.

Michael: You mentioned starting a company around it.

It strikes me it's going to be tough.

Like, obviously, it's a new codebase, and the main use cases are at scale.

But normally for startups, the easiest way of getting started is serving smaller companies, right?

It's harder going straight to the enterprise with something that's not yet... Like, what's the... You've got a plan, though. It'd be great to hear: what's the plan?

Lev: What's the plan?

It's okay, don't freak out, it's gonna be okay.

Yes, it's not actually that uncommon to have enterprise startup

products.

If the problem is interesting enough, there's always going to

be somebody who's going to be like, oh, great.

Somebody's working on it full time.

That'll be amazing.

At this early stage, how this works usually is I'm looking for

design partners.

So it's companies like Instacart who are like, hey, this is a

great idea.

We're gonna try it out.

We're gonna develop it together.

And at the end of the day, it's gonna be in production because

we built it together.

And that's actually good because you wanna build it with the

users.

Like you don't wanna build it by yourself like for several years

and then show up and be like, hey, does anyone need Postgres

sharding?

And Nikolay's like, well, I don't know, maybe, maybe not, depends.

So what I'm actively looking for right now, like codebase is

okay.

I'm sure there's bugs in it, performance issues, and that's totally

fine.

I'm just looking for people who'd be like, this is an interesting

idea.

I like the idea of Postgres sharding.

I like the way it's done at the pooler.

It's not done as an extension.

It's not done as some kind of other thing that I can't even think

of.

I like the way it's done.

So I'd like to try it out and help you finish the job.

You know, by deploying it in production, by benchmarking it,

by finding bugs, by reporting bugs, by even fixing bugs would

be great, but not required.

My job.

Nikolay: I can confirm you're very quick in reacting to requests.

I remember looking at PgCat, I
had the idea that mirroring, let's

have mirroring to have like a kind
of A-B testing, A-B performance

testing, right in production.

And you implemented it, it was
great.

I think you were, like you were, or no?

Lev: Actually, to be perfectly correct, it was actually Mostafa at Instacart who implemented it, because he had a use case for it.

He was my design partner.

Nikolay: It's a very common request,
and the only problem, like,

you need to have this pooler already in production to use it.

This is the trickiest part.

Lev: It's 0 to 1.

It's always tricky.

New stuff, especially in the hot path, is always going to be hard.

But if the problem is there, and if the problem is big enough, I'll find my champion.

Yeah, exactly.

Nikolay: So yeah, and you mentioned
PgDog.

Maybe let's move back to technical
discussions.

A little bit out of business and
license and so on.

So first of all, it's not like PgCat; it's not explicit sharding, where you command with SQL comments how to route.

PgDog has automated routing, right?

And second thing, there's no Postgres middleware for this.

So it's just a pooler with routing.

Can you explain architectural decisions
here and what do you

use and what kind of components?

You got a parser from Postgres
to understand queries.

I'm very curious how you are going to automatically route SELECTs of functions which are writing, for example, right?

Or SELECT FOR UPDATE, which you cannot route to a physical standby, a replica, right?

Or I don't know, something else. You obviously will have some limitations, already have some limitations, right?

So can you talk about this a little
bit?

Lev: Yeah, of course.

Select for update is actually really
simple.

That's a clear intent to write
something.

So that's an easy 1, straight to
the primary.

No problem.

That's an easy 1, which I actually
should implement.

Now that I'm thinking about it,
I'm routing it to the replica

right now.

Bug issue incoming.

Thank you very much.

Nikolay: You're welcome.

Lev: The other 1 is the functions.

Obviously it's impossible to know if a function is writing or

not by just looking at it.

Even if you look at the code, you're
not going to know.

So like static analysis, I don't
think is necessarily possible.

So for that 1, I think it should
be pretty easy.

You put it in the config, you have
a list of functions that actually

write.
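A simplified sketch of that routing decision. PgDog actually parses queries with the Postgres parser; this version just uses string matching, and the config entries in main are hypothetical.

```rust
// Where a statement should go. A sketch of the routing decision discussed
// above; PgDog uses a real Postgres parser, this just matches strings, and
// the write_functions list stands in for a config entry.
enum Route {
    Primary,
    Replica,
}

fn route(query: &str, write_functions: &[&str]) -> Route {
    let q = query.to_lowercase();

    // Anything that isn't a SELECT, or that declares write intent
    // (SELECT ... FOR UPDATE / FOR SHARE), goes to the primary.
    if !q.trim_start().starts_with("select")
        || q.contains(" for update")
        || q.contains(" for share")
    {
        return Route::Primary;
    }

    // SELECTs calling a function known (from config) to write also go to the primary.
    if write_functions.iter().any(|f| q.contains(*f)) {
        return Route::Primary;
    }

    Route::Replica
}

fn main() {
    let writes = ["refresh_user_stats(", "log_event("]; // hypothetical config entries
    assert!(matches!(
        route("SELECT * FROM users WHERE id = 1 FOR UPDATE", &writes),
        Route::Primary
    ));
    assert!(matches!(
        route("SELECT refresh_user_stats(42)", &writes),
        Route::Primary
    ));
    assert!(matches!(
        route("SELECT * FROM users WHERE id = 1", &writes),
        Route::Replica
    ));
}
```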

Nikolay: This is what pgpool does,
right?

Lev: I think so, yeah.

I'm not sure.

Nikolay: pgpool does everything,
so I'm sure they do it as well.

Lev: Exactly.

You have to be careful.

You can't do everything.

People don't believe.

I know that approach is not perfect, because if you add a new function, you have to update the config, and you're always going to forget, you're always going to have issues.

So for that 1, I don't have a solution.

My theory so far is that it's not that common, but I will probably be proven wrong, and then we'll figure something out.

Some kind of migration process that says: if you want to add a function that writes, send it through the pooler. Like, you should be writing migrations and sending them through the pooler, not through some kind of side channel.

And you can probably mark that
function as like, hey, this function

writes, like put in a comment or
something.

And then PgDog is gonna be like,
great, good to know.

You're gonna need persistent storage
for that kind of stuff,

which you could probably implement.

Nikolay: Yeah, I agree.

Some dogs behave like cats sometimes.

Lev: Exactly.

Yeah, exactly.

People will forget to put that
comment in and there's going to

be issues.

But that's just software, you know.

When there's people involved, there's
always going to be...

The more manual stuff you have
to do, the more problems there

are going to be.

But if you're writing a function
by hand in the first place,

you know, That's just got to be
part of your review process.

Michael: I was reading Hacker News
comments and somebody asked

about aggregates as well.

I think there's a limitation around
those at the moment.

What's the story there?

Lev: Well, when I posted the thing, I didn't have support for aggregates at all, but then I'm like, hey, you know what, let's add some.

I've been thinking about it for months.

I might as well just do a couple of simple ones.

So I added, yeah, I added count, a simple 1: you just sum the counts across all shards.

Max and min, again, super simple.

I added sum as well.

Sum is really just summing everything.
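The "reduce" step he describes, sketched out: each shard returns one partial row, and the pooler combines them (count as a sum of counts, sum as a sum of sums, max as a max of maxes). Illustration only, not PgDog's code.

```rust
// Partial aggregate state returned by each shard for something like
// SELECT count(*), sum(total), max(total) FROM orders.
#[derive(Default, Debug)]
struct Partial {
    count: i64,
    sum: f64,
    max: Option<f64>,
}

// The "reduce" step: combine one partial result per shard into the final
// answer the client sees. Illustration only.
fn merge(parts: &[Partial]) -> Partial {
    parts.iter().fold(Partial::default(), |mut acc, p| {
        acc.count += p.count;              // count = sum of counts
        acc.sum += p.sum;                  // sum   = sum of sums
        acc.max = match (acc.max, p.max) { // max   = max of maxes
            (Some(a), Some(b)) => Some(a.max(b)),
            (a, b) => a.or(b),
        };
        acc
    })
}

fn main() {
    let per_shard = vec![
        Partial { count: 10, sum: 100.0, max: Some(40.0) },
        Partial { count: 7, sum: 55.0, max: Some(61.5) },
    ];
    println!("merged: {:?}", merge(&per_shard));
}
```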

Nikolay: Yeah.

Yeah.

Some of the... Let's make a small comment here.

It's MapReduce, basically, an analogy for it.

It's cool, right?

So.

Lev: It is cool, right?

This is cool.

Like, you know, it's scale

Nikolay: to billions, trillions and so on.

Lev: Precisely.

Yeah, a MapReduce for Postgres is phenomenal.

I'm actually, okay, so I'm going to release a blog post in a

couple of days.

I'm actually sharding and doing MapReduce for pgvector.

I don't know if you've heard about this 1.

Yeah, yeah.

That one's really fun.

pgvector is like a-

Nikolay: No, no, pgvector we know, but what do you do with it?

Lev: Well, you're going to have to wait till the blog post comes

out, but it's really fun.

I'm doing both MapReduce and like a machine learning algorithm

to route queries in the cluster.

Because like scaling, like searching vectors is a completely

different problem than searching B-trees.

So I don't know how many people will need that solution.

Nikolay: Well, HNSW is terrible if you go beyond 1 million records.

It's a big problem still, So.

Lev: That's the number I had in my blog post as well.

I don't know why 1 million just feels like the right number,

but that's exactly what I said.

I'll answer you, over a million, you probably need to shard your

pgvector index.

Nikolay: Or use a different approach, yeah.

Exactly.

That's great.

And MapReduce, someone told me in the past that PL/Proxy by Skype, very old tech, also looked like MapReduce, but it required you to use only functions, which is a huge limitation, especially if you have an ORM or GraphQL; it's a big showstopper.

And also it was Postgres in the middle, right?

For routing and for this MapReduce.

But in your case, this is more
lightweight software in the middle,

PgDog, right?

And it does some simple, like,
arithmetic operations.

Yeah.

And do you plan to define some
interface for more advanced operations

that the user could define, like,
beyond simple sum or count

or other aggregates?

Lev: I haven't worked much with custom data types, UDFs, and all that stuff, so that's going to be a learning curve for me, I'm sure.

I'm sure it's not that hard, but once you add custom functions, you need to add custom logic.

I think that should be pretty straightforward to implement if there's synchronization between, you know... this could work.

Nikolay: This would give full-fledged
MapReduce capabilities

to this, right?

Lev: Yeah, absolutely.

More open

Nikolay: and interesting perspectives,
I suppose.

Lev: Absolutely, yeah.

If I find someone who thinks this
is cool as well, we could definitely

build it together.

Nikolay: Not only thinks it's cool,
but has some production to

try, right?

Because it's just- Absolutely.

Exactly.

Lev: Yes, absolutely, yeah.

I think that would be pretty terrific.

It will work pretty well, but yeah.

So again, there's a lot of interesting
things about aggregates

that, you know, for example, like
percentiles, like notoriously

difficult, basically impossible
to solve, I think, at the sharding

level, because you need to look
at the whole dataset to compute

it.

You can approximate it.

Approximation functions should
be like a feature that we add

to like PgDog that says like, you
know what, I don't care about

the exact number, like average,
the simplest 1, you could estimate

it.

I think

Nikolay: for percentile, like
we could define some custom data

type, right, to remember how many
members were analyzed and so

on, like, I don't know, to bring
not just 1 number from each

shard, but a couple of numbers
and then it would be possible

to understand percentiles, maybe,
should be.

Maybe.

Not super difficult.
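A tiny sketch of that "couple of numbers per shard" idea for the simplest case: avg() cannot be merged from per-shard averages, but it can from per-shard (sum, count) pairs. Percentiles would need richer state (histograms, t-digest, and so on).

```rust
// avg() cannot be merged from per-shard averages, but it can be if every
// shard returns a (sum, count) pair. Sketch only; percentiles need richer
// per-shard state (histograms, t-digest, and so on).
fn global_avg(per_shard: &[(f64, i64)]) -> Option<f64> {
    let (sum, count) = per_shard
        .iter()
        .fold((0.0, 0), |(s, c), (ps, pc)| (s + ps, c + pc));
    (count > 0).then(|| sum / count as f64)
}

fn main() {
    // (sum, count) pairs returned by 3 hypothetical shards.
    let parts = [(1000.0, 10), (500.0, 4), (0.0, 0)];
    println!("global avg = {:?}", global_avg(&parts));
}
```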

Michael: Feels like a similar,
like HyperLogLog had some similar,

like, I don't know how you do that
cross-shard, but it feels

like there might be some secret
sauce in what they've done already

that could be applied cross-shard.

Lev: Yeah, HyperLogLog is like
a counter, basically, approximates

how many members are in a set.

Actually, there's an extension
for it in Postgres, which you

can use.

Yeah, it's pretty fun.

But yeah, it's, I need a statistician,
like a data scientist

Nikolay: to come

Lev: and be like, all right, this
is how you approximate percentiles.

And I'd be like, great.

Do you know Rust?

Good stuff.

Yeah.

Nikolay: What one won't do with Rust.

Lev: Exactly, yeah.

Michael: Well, so what is, like,
what's next?

What are you looking, I guess it
depends on what people want,

but what does tomorrow look like
or the next week?

Lev: Well, you're gonna be shocked to hear this, but I'm an engineer who does sales now. Yeah, as I said, I'm building a company.

So I'm literally just sending as many LinkedIn messages and emails as I can to whoever I can think of, to find design partners, people who'd be like, hey, I want to help.

This problem exists.

First of all, that's the first feedback I need to get.

Be like, I would like sharding, I would like it solved, and I would like it solved your way.

That would be great.

Or, you know, just tell me how you'd like it to be solved.

And if there's an overlap, and I think there should be overlap, we solve it together.

So that's what I'm doing mostly,
but you know I'm an engineer,

so I need, I need like a safe space
from, from all the social

activities.

So I still code.

So that's why the pgvector sharding
is coming out because I

needed to code a little bit and
I thought the idea would be cool.

And I'm gonna keep doing that.

I'm gonna keep adding these kind
of features, keep adding tests,

benchmarks, fixing bugs, finding
more use cases, like SELECT

for UPDATE that I forgot.

But yeah, that's the plan.

Nikolay: I have a specific question about rebalancing without downtime.

Right?

If 1 shard is too huge and others are smaller, we need to rebalance.

What do you think? Will this feature, first of all, be inside the open source core offering?

Because we remember in Citus it was not, until the Microsoft decision, as I understand, or like a Citus team decision, to make everything open source.

And a second question, actually, triggered by this question: how do you compare PgDog to Citus?

Lev: Yeah, the open source part, I would like it to stay. I don't foresee myself writing closed source code.

Maybe things around deployments and orchestrating the stuff in companies.

My idea, we'll see if it's realistic, is to sell managed on-prem deployments of PgDog to companies.

So the actual code that orchestrates that stuff will probably be proprietary, you know, mostly because I'm embarrassed how much bash I use.

Nobody really wants to know how that sausage is made.

Nikolay: Do you follow Google Bash code style?

Lev: Yeah, somebody taught me to use curly braces for my bash

variables, and ever since then I've been doing that religiously.

So I learn new things every day.

But I think the core will stay open source forever.

I don't see...

Even that data migration part, that's a known... there's a known solution for it.

There's no point in building a closed source solution that does the same thing.

It's already been solved.

So might as well just...

Nikolay: It doesn't exist yet, right?

This rebalancing feature.

Lev: Well the rebalancing feature is basically, again, it depends

on your sharding key.

It depends how you store data on the shard.

Like Instagram wrote that blog post a long time ago where they

pre-partition everything into smaller shards.

And that's how Citus does it underneath.

Instead of, you see, if you say you want 3 shards, it's not going to build you 3 tables.

It's going to build 128 tables and move them between the shards.
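A sketch of that pre-partitioning layout: rows hash into a fixed, larger number of logical shards, and a small placement map assigns logical shards to physical nodes, so rebalancing means moving whole logical shards rather than re-hashing rows. The numbers and node names are made up.

```rust
use std::collections::HashMap;

// Sketch of the "many small logical shards" layout: data hashes into a fixed,
// larger number of logical shards, and a placement map assigns each logical
// shard to a physical node. Rebalancing then means moving whole logical
// shards (for example with logical replication), not re-hashing rows.
const LOGICAL_SHARDS: u64 = 128;

fn logical_shard(key: u64) -> u64 {
    key % LOGICAL_SHARDS // placeholder; any stable hash of the sharding key works
}

fn main() {
    // Hypothetical assignment of 128 logical shards to 3 physical nodes.
    let mut placement: HashMap<u64, &str> = HashMap::new();
    for s in 0..LOGICAL_SHARDS {
        placement.insert(s, ["node-a", "node-b", "node-c"][(s % 3) as usize]);
    }

    let key = 123_456_u64;
    let shard = logical_shard(key);
    println!("key {key} -> logical shard {shard} -> {}", placement[&shard]);
}
```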

And then when 1 shard gets too big... How?

How?

Well, logical replication became a thing in Postgres 10.

So Citus uses that to move things around.

I think logical replication makes sense up to a point.

You really have to catch the tables at the right time.

Once they get a little bit too big, logical replication can't catch up anymore.

So that's going to be an orchestration problem.

I've seen logical kind of...

Nikolay: You can partition it virtually, like what PeerDB did.

They implemented virtual partitioning, splitting.

If your primary key or partition key, the partition key in this case, allows you to define some ranges, you can use multiple streams to copy initially and then even to do CDC.

So it's kind of interesting.

That's right.

So yeah.

Well,

Lev: It's funny that you mentioned that, because, what is it, Postgres 16 allows us to now do logical replication in...

Nikolay: Binary format.

Binary equals true.

Postgres 17.

17?

Or 16.

Maybe 16.

1 of those.

Maybe you are right.

I just looked at the documentation a few days ago and I already forget.

Yeah, but still, binary is good, but it's the whole table.

Maybe we don't need it.

If we rebalance, we don't.

Well, in case you've already split partitions, it's fine.

But if not, yeah, I'm very interested to understand the design decisions here.

It's going to be interesting.

Because I think for sharding at
large scale, this is 1 of the

key features to understand this.

Lev: Yeah.

Well, so when you start from 1 big database to shard it into, like, 12, what we did at Instacart was: we created a replication slot, we snapshotted the database and restored it 12 different times, deleted the data that's not part of each shard, synchronized it with logical replication, and launched.

So deleting data is faster than writing it from...

Nikolay: Copy it as a physical replica first, right?

Or copy it logically, like provisioning a logical replica, basically dump-restore, before binary.

Lev: No dump restore.

So we were on RDS, so EBS snapshots.

Nikolay: This leads us to the discussion about upgrades, because recovery_target_lsn doesn't exist there.

Let's follow up on this.

This is a good bridge to zero-downtime upgrades on RDS.

Lev: Yeah, those are fine.

Oh yeah, so the Second question,
yeah, Citus.

So the difference is mostly philosophical
in the architecture.

Citus runs inside the database, which limits it in 2 ways.

First of all, the database host has to allow you to run Citus. Maybe because of AGPL, maybe because they just don't want you competing with their internal products. Again, you know, it's business, it's all fair game.

And then the second 1 is performance.

When you run something inside Postgres that needs to be massively parallel, you're limited by the number of processes you can spawn and by the number of connections you can serve.

So PgDog is asynchronous, Tokio,
Rust, lightweight.

It's not even threaded.

I mean, it's multi-threaded, but
it's mostly like asynchronous,

like task-based runtime.

So you can connect... I mean, I'm gonna pull up a big number here just for effect: you could have like a million connections going to PgDog from a single machine, and that technically should work, because it's using epoll underneath.

But for Postgres, you could probably do, like, you know...

Nikolay: You need PgBouncer or
something in front of

Lev: it.

You need PgBouncer and you need
most of those connections to

be idle because concurrency-wise
Postgres can only do maybe like,

you know, 2 per core.

That's the myth at least.
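The task-per-connection model he's describing, in the shape of a minimal Tokio accept loop (assumes the tokio dependency with the usual features; not PgDog's code). Each client costs an async task rather than an OS process, which is why a pooler can hold far more connections than Postgres itself.

```rust
use tokio::net::TcpListener;

// Minimal shape of a task-per-connection accept loop; not PgDog's code.
// Assumes tokio = { version = "1", features = ["full"] } in Cargo.toml.
#[tokio::main]
async fn main() -> std::io::Result<()> {
    let listener = TcpListener::bind("0.0.0.0:6432").await?;
    loop {
        let (socket, addr) = listener.accept().await?;
        // Each client costs one lightweight async task, not an OS process.
        tokio::spawn(async move {
            // A real pooler would speak the Postgres wire protocol here,
            // pick a server connection, and proxy bytes both ways.
            println!("client connected from {addr}");
            drop(socket);
        });
    }
}
```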

So again, like Citus, single machine,
they have some kind of

support for multiple coordinators,
but I think the readme just

says like, please contact us.

Nikolay: In any case, you're in the hands of Microsoft: you need to go to Azure, or you need to self-host everything.

You cannot use it on RDS, because extensions are required.

In your case, no extensions are required.

This is the key.

You can run it on RDS because the
extensions are not needed.

Exactly.

So you can run it on any Postgres.

I see a lot of guys trying to develop the extension ecosystem.

Over time, I became a big opponent of the extensions idea, because we have a lot of managed services, and if you develop an extension, it takes a lot of time to bring that extension to a managed provider.

It takes years.

So if you can do something without
extensions, it might be better

in some cases.

And sharding maybe is such a case.

Lev: No, absolutely.

Yeah.

Because if you develop an extension
that gets installed by RDS,

like, I don't know if they're gonna
pay

Nikolay: you for that.

Approve it and support.

Lev: Approve it or support it or
all that stuff.

RDS notoriously upgrades like once
a year.

Nikolay: We need to wrap up soon,
but like a few words about

encryption.

Does PgDog support encryption now
already?

Because it's super important for...

And it can be a bottleneck.

I know...

The Odyssey connection pooler was created because PgBouncer needed 2 layers of PgBouncers to handle a lot of...

Yeah, this is what several companies did in the past, two layers of PgBouncers, because of the TLS handshake.

So, okay.

Tell us about the handshake. What's in this area, encryption?

Lev: Sure.

Yeah.

It supports TLS.

It's using a library from the Tokio ecosystem, one of the Rust libraries that implements TLS.

It's completely fine, you can use
TLS.

And my personal favorite: I finally found a library that implements SCRAM-SHA-256 authentication.

So now that's finally supported.

Nikolay: Yeah, I saw MD5 is going to be deprecated in the next few years in Postgres, so...

Lev: It still comes up.

I mean, it's been deprecated for 10 years, and people still use

it just because it's really simple
to implement.

And SCRAM is really hard.

Nikolay: And one more small question
about...

Yes, about prepared statements.

I know PgCat supported them, right?

In transaction pool mode, right?

Lev: Yeah, the implementation wasn't
great, but it does support

it.

Yeah, PgDog supports them too.

Much better implementation this
time.

Nikolay: Okay, cool.

Well, I'm not out of questions.

I'm out of time.

Yes!

But it was absolutely interesting.

Thank you so much.

I'm definitely rooting for it and going to follow the project.

Best of luck to you and your new company.

Maybe Michael has some questions
additionally.

I took the microphone for too long
this time.

Apologies.

Michael: Well, Lev, is there anything
we should have asked that

we didn't?

Lev: Oh, no you guys are, I think
you covered it.

Michael: Well, really nice to meet
you, thanks so much for joining

us and yeah catch next week Nikolay.

Nikolay: Thank you so much, Bye
bye.

Thank you.
