Nikolay: Hello, hello, this is
PostgresFM, your favorite podcast

about Postgres.

And as usual, my name is Nik,
PostgresAI, and Michael is here

as usual as well from pgMustard.

Hi, Michael.

Michael: Hello, Nik.

Nikolay: And we have a great guest
for the second time, Lev, PgDog.

Hello, Lev, thank you for coming.

Lev: Hello.

So thank you for having me for
a second time.

Nikolay: Yeah, I'm sure there's
a lot new to discuss.

So yeah, I think it's a great idea
to see what has changed since

last time we talked, because it
was around your first release,

I guess, right?

And yeah, so where to start?

Lev: Where to start?

Yeah, we're in a different year,
2026.

So it's been almost 8 months since
we last chatted, something

like that.

7 months.

Now we're in January.

Michael: Yeah, I think 10.

Lev: 10 months.

Oh boy.

I mean, I've been working full
time on this stuff, so I have

a whole list of things that have
been developed ever since.

I think the biggest thing is PgDog
is very much real and in production

with real companies and real customers.

You know, last time we chatted,
it was like a thing, but now

it's a real thing.

So that's the big update.

It's in prod.

Most of the code I think I would
comfortably say is stable and

performant, which obviously I would
say that, but there's real

usage behind this which is really
cool.

Let's see, there's a long list
of features that's been developed

since then.

Michael: Before we go on to those,
I wondered of the people using

it in production, what are the
primary drivers?

Is it sharding or is it like an
upgraded pooler?

You know, I feel like there are
a few selling points to PgDog

that we talked about last time
and I wondered which ones were

driving like the primary adoption.

Lev: That's a great question.

One of our early adopters, they went
straight for sharding, which

to me was like, wow, you're brave,
sir.

And they went straight for it and
they replaced Citus with it,

so that was really cool.

Nikolay: Can you name some names
or?

Lev: They're on my website.

1 of their logos, I'll let you
guess which 1.

I'm still trying to close them
as an official customer, so soon

I'll be able to say their name,
but their logo is there.

And yeah, they're pushing, I think,
over 100k QPS per shard.

And they have like 12 shards, I think.

Yeah, yeah, it's kind of crazy. They're using direct-to-shard queries, so that's the real reason, you know, why it worked so early: they didn't use any of the fancy aggregates and sorting across multiple shards.

That's the real tricky part.

We've made some progress there, but it's still not 100% compatible with how Postgres does it.

It's on its way, though.

But other use cases, load balancer
is a huge thing.

Like people love having the ability
to talk to multiple replicas

without actually manually creating
a connection pool for each.

You know, there's health checks
and failover.

1 of the big uses is we're deployed alongside, you know, Aurora or RDS, where Patroni is basically not used.

And it can detect primary-replica failover, and it'll move the traffic to the new primary within a second or so, which is really great.

And then, of course, when 1 of the replicas starts failing, it'll ban it and then use the remaining replicas to retry the query.

So you get that, you know, that
amazing, like 4 nines of uptime

for Postgres, which is pretty

Nikolay: cool.

I have questions about all these
features.

Yeah.

All right.

Before that, I have an underlying question.

10 months. Is it really 10 months or 10 years?

Because you use AI.

Do you use AI to develop?

I do.

Yes.

Claude, right?

Lev: Yeah.

I use Claude, Opus specifically.

Nikolay: Claude Code, Opus 4.5 right now.

Right.

So this is what you use.

Lev: That's right.

Yeah.

For only certain areas of the code,
not all of it.

There's some areas where I don't
trust it at all, but some areas

Like we have a, we're building, like, a dashboard with all the metrics and stuff.

Nikolay: Not critical paths, right?

Lev: Yeah.

Critical paths are still organic
code, if you will.

Organic code.

Exactly.

Yeah.

This is the first

Nikolay: time I hear this.

It's, it's hilarious.

Lev: Yeah.

Homegrown organic code.

But, I don't write JavaScript anymore, for example.

I don't write CSS anymore, like that kind of stuff.

Nikolay: So you use Claude for these, like, additional pieces of software, not the super critical parts.

And I'm sure you ask it to deduplicate and reuse the code as much as possible, recycle existing pieces of code.

Stop.

Okay.

It's

Lev: crazy how much that machine loves to generate, like, copy-pasted code all around.

It's allergic to function reuse.

You really have to tell it like, hey, yeah.

Nikolay: Yeah, I have a special additional code review process, which basically targets this deduplication vector.

But it's possible to achieve.

If you don't do that, by default, you will get a lot of duplicated

code.

Yeah, that's great to hear.

Okay, I see, by the way, great logos on your website.

I missed that, I haven't visited the front page for a while.

I see Coinbase, Circleback, we use Circleback at work.

Some other names, that's great, congrats.

Lev: Thank you, hard at work.

Nikolay: Yeah, so what is your current pace?

How often do releases happen?

How do you ship?

And if you still write a lot of code, like, I guess it's not like a new version every day.

It's less, right?

Lev: Yep.

Not every day.

Our release schedule is every Thursday.

I'm trying to be very diligent about it.

Like whatever code that's in main is usually stable, but I give

it a week before I tag a release.

And then every Thursday I compile release notes and ship it out.

And then you get it in GitHub and there's a Docker image that

gets built for both x86 and ARM.

So you can pretty much run it anywhere, including on your Mac,

if you wanted to.

And we have a Helm chart if you're using Kubernetes; our Helm chart is getting a lot of good usage and contributions from a bunch of people.

So, you know, I'm not a Kubernetes expert, but they are, so they're helping me along.

So if you're running Kube, you should be able to deploy PgDog

in like 30 seconds, honestly.

But otherwise, you know, it's just a container.

You can run it anywhere.

Nikolay: Right, and you use libpg_query for query parsing, to build ASTs, and yeah, we recently optimized it, right?

So it's not merged, your optimization is not merged to upstream yet, but can you tell us a little bit more about it?

Also, specifically, I think my question is also about whether it's worth optimizing.

Like, how many microseconds do we save on each query?

Should we worry about those microseconds at all?

Lev: Yeah, you should.

It was...

First of all, full disclosure.

I told Claude to do it.

And then he did it.

I checked it.

I was very open about it.

I put it in the PR.

I know how open source projects hate this kind of stuff.

Like, you know, 3,000 lines of boilerplate.

I was up front, look, guys, this is Claude-generated.

I reviewed it.

It works okay.

But give it a few months before seriously looking at it.

But the idea is in libpg_query, there's a translation layer between

the C library and a bunch of other bindings.

So notably, like we use the Rust binding and the translation

layer is in protobuf.

So every single time you call parse or fingerprint, every time

you get the AST back, basically, it goes through a serialization

and deserialization process with protobuf inside the same process.

And that's really slow, actually.

So I just replaced it, only for the Rust bindings, with direct C-to-Rust and Rust-to-C mappings of the Postgres AST, and that improved the translation part of it by, honestly, I'm not even exaggerating, like 5x.

The throughput went from, I think, 1,000 per millisecond, or per second, to like 5 or 10k, or something like that.

The benchmarks are in the PR, but it's huge.

And I think another important factor is CPU utilization.

So for us, we're a network proxy, so our CPU budget has to be

pretty reasonable.

If you can decrease it by 5x, and most of our CPU comes from query parsing...

Everything else is pretty straightforward stuff, just I/O back and forth.

And atomics.

So for us, it's huge.

Nikolay: So this optimization is just like, you reduced the number of times the query is copied between, like, various buffers.

Lev: Exactly.

And the protobuf serialization is probably one of the fastest in the industry, but it's still very, very slow compared to just copying a C struct in memory, like a memcpy.

Nikolay: It sounds to me like shorter
queries should suffer less

from this than huge CTEs, you know,
like multi-kilobytes in size.

They should suffer even more, right?

Lev: Yeah, yeah, definitely.

I haven't measured how much faster slow queries or long queries get.

I just took a pretty sizable query, like 600 lines, and used that as my benchmark.

Because a lot of people who use
PgDog surprisingly like to write

long and complex SQL statements
for their database.

So that's what I went with and
that's where latency will really

matter.

So yeah, for small ones, maybe it's smaller, but doing protobuf serialization and deserialization in memory is a big waste of time when you can just map a C struct directly with a few allocations.
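
For a rough sense of the numbers here, a minimal sketch of how one might measure parse throughput with the pg_query Rust crate (the libpg_query binding discussed above); the query text and iteration count are placeholders, not the benchmark from the PR:

```rust
// Hypothetical micro-benchmark: time repeated calls to pg_query::parse(),
// the call that used to pay a protobuf round trip on every invocation.
use std::time::Instant;

fn main() {
    // Placeholder query; Lev benchmarked a ~600-line statement.
    let sql = "SELECT id, total FROM orders WHERE customer_id = $1 ORDER BY created_at DESC LIMIT 10";
    let iterations = 100_000;

    let start = Instant::now();
    for _ in 0..iterations {
        let _ast = pg_query::parse(sql).expect("query should parse");
    }
    let elapsed = start.elapsed();
    println!(
        "{iterations} parses in {:?} (~{:.0} per millisecond)",
        elapsed,
        iterations as f64 / elapsed.as_secs_f64() / 1000.0
    );
}
```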

Nikolay: Right, right.

I also saw there was optimization
of like SIMD vector calculations.

So like, this is like cool stuff
or what?

Like is it, or you saw some real
benefit?

Lev: Well, the vector one is kind
of funny.

There's a blog post about it, but
basically I tried to shard

a pgvector index among multiple
Postgres instances.

And to do that, I implemented one
of their algorithms, IVFFlat

inside PgDog, and then it's able
to basically like separate the

vectors between multiple databases
and then search for them using

IVFFlat.

And for that I needed SIMD to calculate,
I think I'm supporting,

we're supporting cosine distance
only, just as a POC, just to

see if anyone even cares about
this stuff.

And that's where the SIMD optimization
comes from, which is to

calculate the cosine distance between
the input and one of the

centroids, because we store the
centroids basically in a config

file.

Everything in PgDog right now is config-driven because I trust the disk not to do anything.

Eventually we'll move away into
something more dynamic.

Right now it's like you pre-calculate
your centroids in advance

and then you store them and then
you can basically query one shard

or multiple shards to get your
probes, the number of probes,

if you will.

And then underneath, each Postgres database has its own IVFFlat or HNSW index.

So it's an approximation on top
of an approximation, which is

fun because you can scale pgvector
into like presumably billions

of billions

Nikolay: of vectors, like others announced, right, and we said it's not true because it's not a single instance.

So, yeah.

Yeah.

Yeah.

1,000,000,000 vectors on a single instance,
I haven't seen this yet.

Lev: No, no, no, it's not possible
with pgvector, but with sharding

it is.
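
A minimal sketch of the routing step described here: compute the cosine distance from the query vector to each pre-computed centroid and pick the nearest shard. The centroid values are made up, and PgDog's real implementation reads centroids from config and uses explicit SIMD rather than this plain loop:

```rust
fn cosine_distance(a: &[f32], b: &[f32]) -> f32 {
    let (mut dot, mut na, mut nb) = (0.0f32, 0.0f32, 0.0f32);
    // Plain loop; with the right target flags the compiler can auto-vectorize
    // this, which is where a SIMD-style speedup comes from.
    for (x, y) in a.iter().zip(b.iter()) {
        dot += x * y;
        na += x * x;
        nb += y * y;
    }
    1.0 - dot / (na.sqrt() * nb.sqrt())
}

/// Returns the index of the shard whose centroid is nearest to `query`.
fn route_to_shard(query: &[f32], centroids: &[Vec<f32>]) -> usize {
    centroids
        .iter()
        .enumerate()
        .min_by(|(_, a), (_, b)| {
            cosine_distance(query, a)
                .partial_cmp(&cosine_distance(query, b))
                .unwrap()
        })
        .map(|(i, _)| i)
        .unwrap()
}

fn main() {
    // Toy centroids for two shards.
    let centroids = vec![vec![1.0, 0.0, 0.0], vec![0.0, 1.0, 0.0]];
    let query = vec![0.9, 0.1, 0.0];
    println!("query goes to shard {}", route_to_shard(&query, &centroids));
}
```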

Nikolay: Right.

So if you have a sharding schema defined, like, you can store a lot.

Right, so it's a great feature.

Anyone using it or is it just showcase
and that's it?

Lev: I don't think anyone, I think
I got an email from somebody

who was like, this is really cool
and I'm like, well, this is

how you need to use it.

And they're like, cool.

I haven't heard back from them.

But yeah, it's more like a...

Nikolay: Speaking of performance,
let's talk about two-phase

commit.

There is support for it, right?

Can you describe your experience and view on using two-phase commit in larger systems?

Oh, we have a dog.

Lev: We have a dog. And we have
a dog on two-phase commit as well.

Okay, just for

Nikolay: those who are listening
on the audio version, we just

saw a dog actually walking in the background.

That's great.

Lev: Big dog.

She is the reason why PgDog
is named PgDog.

Nikolay: Right.

You explained that last time.

Right.

So what about two-phase commit and the performance overhead it has, and so on?

Lev: I'll be completely honest
with you.

I haven't benchmarked it very much.

Nobody really talked to me about
using it yet in production.

So I don't know how much the Postgres
internals themselves create

an overhead.

I just know from the proxy side,
it's 2 extra statements to execute.

And also the additional complexity
of managing phase 1, phase

2 of the commit.

So if phase 1 fails, you need to
roll back.

And if phase 2 fails, you need
to roll forward or commit.

And you also need to make PgDog crash-resistant, which we haven't done yet.

So there's still work to do there
to make it, I think, production

grade.

But from our point of view, it
was just an extra couple of statements

to execute.
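
A minimal sketch of the two phases just described, assuming one tokio_postgres connection per shard, max_prepared_transactions enabled on the servers, and a placeholder table; this is an illustration, not PgDog's internals:

```rust
use tokio_postgres::Client;

// Phase 1 prepares the transaction on every shard; phase 2 commits them.
// If phase 1 fails anywhere, roll back what was already prepared; if phase 2
// fails, the coordinator must keep retrying ("roll forward"), which is the
// crash-recovery part that isn't built yet.
async fn two_phase_commit(shards: &[Client], gid: &str) -> Result<(), tokio_postgres::Error> {
    // Phase 1: do the work and PREPARE TRANSACTION on each shard.
    for (i, shard) in shards.iter().enumerate() {
        let phase1 = shard
            .batch_execute(&format!(
                "BEGIN;
                 INSERT INTO events (payload) VALUES ('example');
                 PREPARE TRANSACTION '{gid}_{i}';"
            ))
            .await;
        if let Err(e) = phase1 {
            // Abort the failing shard's open transaction, if any.
            let _ = shard.batch_execute("ROLLBACK").await;
            // Roll back the shards that already prepared successfully.
            for (j, done) in shards.iter().enumerate().take(i) {
                let _ = done
                    .batch_execute(&format!("ROLLBACK PREPARED '{gid}_{j}'"))
                    .await;
            }
            return Err(e);
        }
    }

    // Phase 2: commit everywhere.
    for (i, shard) in shards.iter().enumerate() {
        shard
            .batch_execute(&format!("COMMIT PREPARED '{gid}_{i}'"))
            .await?;
    }
    Ok(())
}
```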

Nikolay: Yeah, okay, understood.

So this is just like, also like
some foundation for future work,

maybe, right?

Definitely, yeah.

Lev: After we get

Nikolay: some feedback from production
systems.

Okay, I see.

Another thing I noticed also, like, it's a fun feature.

You created an ID generator, right?

Which is like, we, like, Postgres 18 brings UUID version 7, but it's 16 bytes.

It's quite a lot of space spent on a single ID value.

Yours is 8 bytes, right?

Lev: Yeah, bigint, yeah.

Nikolay: It's an 8-byte integer, but it behaves similarly.

So there is a timestamp component in it.

This is a cool feature, and it's going to serve us until when?

70 years, right?

70 years left.

It's not a lot, but for many systems it should be enough.

What inspired you to build this?

Lev: I have a potential production deployment, and there was big usage of omni-sharded tables, or replicated tables, or mirrored tables, or whatever you want to call them.

Basically, the data in the table is the same on all shards, and they needed to generate the primary key for that row.

And you couldn't rely on the sequence, because even if you used two-phase commit, which they originally thought they would do, there's still a chance that the sequence numbers on each shard will be different, especially because sequences are consumed whether the transaction succeeds or fails.

So the ID could be different on each shard.

So you needed an external ID generation service.

And the first thought was about using a third database, or another database, to use its sequence, to pull it in and then insert it into the query, which creates a single point of failure.

But it's much simpler to actually be able to generate a unique bigint inside your proxy, inside PgDog, which is not that hard

Nikolay: to do.

Couldn't we use sequences? We could just have some extra offset added, like, for example, if we say we don't need more than 100 billion per shard, we could say on shard number 1 we always add 100 billion, on the second shard we add 200 billion.

Why not?

Michael: They need to be the same
on every shard.

This is the table that's replicated.

Nikolay: Okay.

I see.

Lev: I like

Michael: it.

I think it's interesting.

So I guess it's because we don't
have unique constraints across

shards, right?

Lev: Yeah, that's right.

Well, we don't need them basically
because the ID generation

is guaranteed to be unique.

Before.

Yeah, before.

Yeah.

I mean, all the good database engineers will tell you you should still have a unique constraint, even for UUIDs, which are theoretically guaranteed to be unique, but even in this case, you don't necessarily need 1.

So the idea is, again, so you insert
the same ID into all the

shards and you need to be able
to generate it quickly and efficiently.

And that's why this is, like, the first of its kind: a PgDog-executed Postgres function, okay?

Cause we're using the Postgres parser, right?

So why not, you know, just hook into the query execution.

Nikolay: Extract.

Lev: step towards Citus

Nikolay: and having the whole Postgres in the middle there, no?

Lev: I mean, you heard it here first on PostgresFM.

No, I don't know.

I think it's a good start, but
it was so simple to implement.

You extract, you find the function
execution, and then you pull,

you generate a unique ID inside
the pooler.

And the way you do that is, well,
you get the timestamp.

You kind of assume your clock is
reasonably accurate.

Then you have an internal sequence
number so you can generate

multiple IDs per millisecond.

So the clock tick is a millisecond.

And then there's another unique
number, call it node ID or PgDog

node ID.

So you can have multiple PgDog
containers in front of the same

deployment and it will guarantee
that none of them will hit a

duplicate from the

Nikolay: same list.

No collisions, yeah.

Lev: No collisions, exactly.

So, you know, the 70 years is a little bit arbitrary.

We can reduce any 1 of these components and increase the timestamp component, for example by reducing the number of PgDog nodes that you can deploy.

Right now it's 1,024, which, if you're running that many, you're probably Facebook.

But you can reduce that by half
and give that extra bit to the

timestamp.

And then you can also reduce the number of IDs generated per millisecond.

Right now it's 4 million.

You can generate 4 million per
millisecond.

It's a lot.

You could probably reduce that.

Actually, no.

Let me double check.

I think it's probably 4,000 per
millisecond.

That sounds more reasonable.
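
As a back-of-the-envelope sketch of the layout being described: 63 usable bits of a signed bigint split into 41 bits of milliseconds, 10 bits of node ID and 12 bits of sequence give exactly 1,024 nodes, 4,096 IDs per millisecond and roughly 70 years of range. The exact widths and epoch in PgDog may differ:

```rust
use std::time::{SystemTime, UNIX_EPOCH};

const NODE_BITS: u64 = 10; // 2^10 = 1,024 PgDog nodes
const SEQ_BITS: u64 = 12; // 2^12 = 4,096 IDs per millisecond per node
// 63 - 10 - 12 = 41 timestamp bits => 2^41 ms ~= 69.7 years of range.
// Hypothetical custom epoch (2024-01-01 UTC); the real epoch is an
// implementation choice and decides when the ~70 years run out.
const EPOCH_MS: u64 = 1_704_067_200_000;

struct IdGenerator {
    node_id: u64,
    last_ms: u64,
    sequence: u64,
}

impl IdGenerator {
    fn new(node_id: u64) -> Self {
        assert!(node_id < (1 << NODE_BITS));
        Self { node_id, last_ms: 0, sequence: 0 }
    }

    fn next(&mut self) -> i64 {
        let now_ms = SystemTime::now()
            .duration_since(UNIX_EPOCH)
            .unwrap()
            .as_millis() as u64
            - EPOCH_MS;
        if now_ms == self.last_ms {
            // Same millisecond: bump the sequence (a real generator would
            // wait for the next tick once 4,096 is exhausted).
            self.sequence += 1;
        } else {
            self.last_ms = now_ms;
            self.sequence = 0;
        }
        ((now_ms << (NODE_BITS + SEQ_BITS)) | (self.node_id << SEQ_BITS) | self.sequence) as i64
    }
}

fn main() {
    let mut generator = IdGenerator::new(1);
    // Time-ordered values that fit a regular BIGINT primary key column.
    println!("{} {}", generator.next(), generator.next());
}
```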

Michael: 4 million per second would
make sense.

4 million.

Lev: That's right, yeah.

This is where documentation, or
good documentation,

Michael: comes

Lev: in handy.

Michael: That's a big change from
last time as well.

Lev: Yep.

The biggest thing I'm proud of
is adding an icon next to every

single link.

Just, you know, being very, like, attentive to detail, I suppose.

But yeah, like, every single comment
I get from people is like,

well, is this documented?

Is this documented?

And somebody's talking about our
feature, and I'm like, good

point.

Now it is.

Thank you.

So if it's not in the docs, I want
to say it doesn't exist, but

that's not true.

There's a backlog of documentation
to be done.

Yeah.

Nikolay: I should help with this.

Lev: I would love for you guys
to help out with anything.

Nikolay: Not I, AI.

Oh, AI.

Maybe I as well.

I would love to help with AI.

Yeah, I'm looking at PgDog from
time to time and definitely

looking for some use case.

And as maybe not all of you know, my team benchmarked it.

And so I'm looking forward to revisiting that benchmark to see how that optimization related to protobuf helped.

I'm very curious.

This time we are going to test it not with trivial queries like SELECT, that's it, but with some heavy...

That's why I mentioned, like, I
expect some benefits from...

For heavier queries, bigger text.

Yeah, so yeah.

Documentation, I think AI should
help with it.

Like this is my plan.

Lev: No, no way, man.

Like, because they all sound the
same.

They all look the same.

Like you could give it a couple
of tips, like, you'll sound original,

but like if you write text using
AI, it just all reads like it

was written by the same.

Nikolay: But at least the reference could be... the usual problem with reference-like documentation, which lists various functions, config params, everything, is that it usually lags.

And I think AI is great to maintain it.

Lev: Yeah, I mean, I don't disagree.

For me, writing documentation is
kind of like a, it's like a

ritual, you know, I go through
the things that we built, and

I'm like, oh, document, and then
I look at what I wrote.

I'm like, wait a minute.

This is wrong.

I'm putting myself into the user's
shoes.

I'm reading the docs.

I'm like, OK, this is what it looks
like.

Nikolay: It's an additional point where you can verify the whole picture and see if it all makes sense, or what needs to be adjusted additionally.

Yeah, I agree.

Also, demos, right?

Like when you think how to show
what was built, sometimes you

end up changing everything.

Lev: Yep.

That has happened before.

Nikolay: Because if you simply cannot explain it, it means that it probably should be done differently, right?

Lev: Exactly, exactly.

So,

Nikolay: let's talk maybe about
hard stuff, like prepared statements,

what happened to them, like how
PgDog supports prepared statements,

transaction mode and so on.

Can you, can you...

Lev: Funny.

Yeah.

You know, we have to stop sharding,
but prepared statements is

where we really have issues.

Nikolay: Yeah, let's talk about
it

Lev: a little bit.

Sure.

They work.

They're fine.

Everything's great.

It works, I think.

Nikolay: You saved them, right?

Lev: Yeah, they're cached.

It's a global cache.

So if multiple clients send the
same statement, the same statement

on the server side is executed.

So pretty high cache hits.

Nikolay: Implicit or explicit,
like protocol level or prepare

keyword?

Lev: Protocol level right now and
partial support for explicit,

Nikolay: you

Lev: know, since we have the parser.

For the explicit ones I need to
differentiate on the number.

Sometimes you can pass the data
types in the prepared statement,

sometimes you don't.

I just need to add a little bit
more support to the parser there

to extract the differentiation
between like different unique...

Identical prepared statements,
but they expect different data

types for some reason.

So protocol ones work.

I guarantee it.

If they don't, I swear.

They do.

But the other kind is still a work
in progress.
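
A minimal sketch of the kind of global cache described here, keyed on the statement text plus the declared parameter-type OIDs so that identical text prepared with different types gets its own server-side statement; the naming scheme is illustrative, not PgDog's internals:

```rust
use std::collections::HashMap;

#[derive(Hash, PartialEq, Eq, Clone)]
struct StatementKey {
    sql: String,
    param_types: Vec<u32>, // type OIDs from the Parse message; may be empty
}

#[derive(Default)]
struct PreparedCache {
    names: HashMap<StatementKey, String>,
    counter: u64,
}

impl PreparedCache {
    /// Returns the server-side statement name and whether it was a cache hit.
    fn name_for(&mut self, sql: &str, param_types: &[u32]) -> (String, bool) {
        let key = StatementKey { sql: sql.to_string(), param_types: param_types.to_vec() };
        if let Some(name) = self.names.get(&key) {
            return (name.clone(), true); // already prepared on the server
        }
        self.counter += 1;
        let name = format!("__pgdog_{}", self.counter); // hypothetical prefix
        self.names.insert(key, name.clone());
        (name, false) // miss: caller must send Parse to the server
    }
}

fn main() {
    let mut cache = PreparedCache::default();
    let (_, hit) = cache.name_for("SELECT * FROM users WHERE id = $1", &[]);
    assert!(!hit);
    let (_, hit) = cache.name_for("SELECT * FROM users WHERE id = $1", &[]);
    assert!(hit); // a second client reuses the same server-side statement
}
```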

Nikolay: Yeah, and have you seen any issues with prepared statements used together with partitioning, or do your clients not use partitioning at all because they have sharding?

Why use partitioning?

Lev: I think we use them with partitioning.

I haven't seen any issues.

Like if the query, if the prepared statement plan is wrong, Postgres will return an error, and we evict it from the cache and then we re-prepare it.

Nikolay: It's not that simple,
right?

Because the first few calls are not cached yet with prepared statements, because it uses a custom plan for the first 5 times.

There is a hard-coded constant
in PostgreSQL.

And with partitioning, it's much
more complicated.

I have a series of blog posts about
this.

It's like super complicated.

Yeah, I'm curious about this.

Maybe just that nobody encountered
any issues yet, but yeah,

I'm curious about this.

Lev: No clue.

Yeah, let me know what I need to
fix.

Nikolay: Yeah, I understand.

Very practical.

I understand that.

Michael: Yeah, but it's also, also
Nik, I think you're like,

this feels like the kind of thing
that actually gets a little

bit better with sharding.

Like you just reduce the QPS, you reduce the load on each shard, so everything that could cause issues around that is, however many shards you have, times less bad, or less likely to hit the cliff?

Nikolay: Absolutely, sharding reduces the need for partitioning, 100% so, but maybe it doesn't fully eliminate it, because partitioning, if it's time-based, still makes sense within 1 shard.

We know cases of really large databases
which combine both approaches.

Michael: Yeah, for maintenance
reasons it still makes complete

sense, right?

Nikolay: Right.

Yeah, yeah.

So yeah, it's tricky.

And if you want prepared statements, oh, it's super complex.

Lev: So what's the deal?

What doesn't work?

Nikolay: So partitioning brings the cache not at planning time, but at execution time.

And this increases complexity, because if you just deal with the planning-time cache, it's quite simple.

So if it's not partitioned, only the planner can decide to use the cache or not, right?

That's it.

If it's partitioned, the decision point is shifted to execution time.

And there is a point when basically you are without the cache and you cannot avoid it, because when the cache is being built, at that execution, it's execution number 6, or the fifth execution.

So at that point, the executor doesn't have this cache, because it's being built.

So we are without the cache.

And in high-QPS cases, when the frequency of calls is super high, there is a moment when we must live without the cache, and it might be super non-performant.

So performance can be terrible in that window,

Lev: I see.

Nikolay: Which doesn't happen in the unpartitioned case at all.

This is like, I already start forgetting the details, because there are so many nuances if you don't live there, if you only visit it like I did for a couple of weeks. But I have several blog posts about this problem with demonstrations.
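
One escape hatch, not mentioned in the conversation but related to the problem Nikolay describes, is Postgres' plan_cache_mode setting: forcing custom plans keeps pruning at plan time, at the cost of re-planning every call. A minimal illustration, with tokio_postgres used only for the example:

```rust
use tokio_postgres::Client;

// With force_custom_plan, the prepared statement is re-planned on every
// execution, so only the matching partition is planned and locked; the
// trade-off is paying the planning cost each time.
async fn force_custom_plans(session: &Client) -> Result<(), tokio_postgres::Error> {
    session
        .batch_execute("SET plan_cache_mode = force_custom_plan")
        .await
}
```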

Lev: See, this is the prime example
why being a proxy for Postgres

sharding is the best place to be
because to me, that problem

is completely irrelevant.

Right?

Because I'm just passing statements
through, right?

I think, unless I'm, because like, they will work or they won't,

but from a performance perspective, like if you have trouble

with partitioned tables and prepared statements, you just don't

use prepared statements, right?

Nikolay: Yeah, yeah, yeah.

Partitioning also brings the problem of an explosion of access share locks.

Because...

I'm talking about this particular case, but yeah.

So if you have a lot of partitions, the plan cache is 1 idea, but also we have routing to a specific partition, which is called constraint exclusion, right?

We need to route to a specific partition.

This is another idea.

And at planning time, if the planner knows where to route, this is the normal case, when you don't have prepared statements.

It just routes to the specific partition, all good.

No problem.

We lock only 1 partition with all its indexes.

But if we postpone it to execution, it needs to lock all the partitions with all their indexes, and it's terrible.

With sharding, I don't know, you just...

Lev: Well, you reduce the QPS, so...

Nikolay: Yeah, this is already...

And the lock manager is not a problem anymore.

Or you just need to use Postgres 18 and raise max_locks_per_transaction.

In this case, the lock manager is also fine.

So yeah, anyway, this is an interesting topic.

And I'm curious if some people combine sharding and partitioning.

I think they should sometimes, they should sometimes, and it's just in really big setups, yeah.

Lev: I mean, as a guy who's trying to build sharding: you should always use sharding unless you can't, but, like, that's, you know...

Michael: I think there's a maintenance argument.

Imagine if you've got, like, a logs table per shard and you want to delete old logs after a certain... you've got a retention period, you probably still want partitioning to make those deletes really efficient.

I think there's still a use case for it but it's not so much

for performance reasons but it's for maintenance reasons, that'd

be my guess.

Nikolay: Yeah, to build indexes, to have vacuum working more efficiently, and so on and so on.

Michael: Yeah, but I think those are, I'd still characterize those slightly differently, I know they are maintenance, but they're kind of still performance reasons, right?

You want them, because you don't want the table to be too big.

I'm talking about doing like deletes efficiently, being able

to, instead of having to like do batch deletes you can...

Exactly.

Nikolay: It's exactly... how's it called, not data retention, but there is some additional term.

Michael: Yeah, yeah, the opposite of...

Lev: Yeah, there could be like a legal requirement for data retention.

I think that's the only reason
why people would ever delete data,

right?

Well, the other reason is performance.

So if you eliminate the performance argument, like if you make performance just free because sharding makes it that way, I mean, obviously that's not the case, but like, no.

Nikolay: Yeah, and if you compare
sharding and partitioning,

with partitioning, we often are
willing to have 1 active partition

where we write and all others are
archived.

In case of sharding, it's strange,
right?

Because a lot of hardware is not
used.

Like if you have only 1 shard read-write
and others like archive,

it's strange, simply strange, right?

So you want to distribute writes
usually.

Lev: You want it to be equivalent, like you want, like, a perfect... like all shards have the same load.

And that's why you'd usually use, like, hash-based partitioning if you can, again.

Or the way you design your schema
for sharding is you make sure

that each shard roughly has the
same load.

If you have a hot shard, you're gonna have a hot shard problem

sooner rather than later.

Nikolay: Yeah.

Lev: Exactly, especially for like
multi-tenant, like what I call

multi-tenant is B2B basically,
where your sharding key is like

a customer ID, if you will, and
then 1 of your customers grows

big and you need a second sharding
tier, or if you will, a second

partitioning tier.

Nikolay: And does PgDog support
resharding without downtime

already, you know?

Oh yeah, yeah, yeah, of course.

How's it technically working?

Lev: I mean, it's good.

It's been tested, so that's kind
of cool.

The idea is, I mean, I don't want
to say simple, but I think

practical.

For re-sharding, this is where
the contrast between how this

works and like NoSQL databases
work is a little bit more apparent.

With NoSQL, you kind of add another
machine and then it kind

of rebalances all the data using
existing machines.

The PgDog approach is you create
a brand new empty cluster, and

then you copy all the data out
into the new cluster.

And this decision is actually not
because of a particular limitation,

but it's intentional.

Because you want to reduce the
load on the existing cluster during

re-sharding, because usually when
you start re-sharding, it's

already under load, it's already
underperforming, it's already

having problems.

So you don't want to add additional load with resharding to it.

So you create a new, brand new cluster, you move the data over

and we are able to use replicas since Postgres 16 to copy data

out with logical replication, that was a huge thing.

That was, I think, one of the game changers that made this possible,

is logical replication slots on replicas.

Why?

Why does it matter?

Why does it matter?

Because you can create 16 replicas and parallelize your copying.

Why does it matter?

Nikolay: Disk I/O or what?

Why does it matter?

Like, you don't want to use disk I/O on the primary or what?

Lev: Well, you definitely don't.

Your primary is already shot.

Like when you're sharding, your primary is already having a really, really bad time.

That's the main reason behind sharding.

You want to scale your writes.

So adding additional replication load to an already underwater

database is bad.

And especially when you want to create, let's say, you go from

one shard to 16 shards.

Extreme example, but realistic.

You're going to add 16 replication streams to your primary.

That's like 16 times like streaming the same data.

The thing is going to fall over.

I guarantee it.

Nikolay: But the overhead is amplified here.

So the same work is done on different CPUs, vCPUs.

Lev: Yeah, yeah, decoded and recoded.

Nikolay: Yeah, yeah, yeah.

So.

You want to shard this load, right?

Lev: Exactly, exactly.

We're able to use one replica per table basically for, for re-sharding.

So if you have, like, say 1,000 tables, you're able to shard your sharding stream and, like, split it into 16 different machines or 32 different machines and sync it in parallel, which I think is pretty huge.

It's all the same LSN, right?

Because they're replicas, right?

So you can synchronize at the end and make sure that you got,

you're at the right point and like pause the live traffic and

then wait for the logical stream to catch up and cut over to

the new cluster.

Like all of that is, the pieces are there, you just need a button

to make it work.

We haven't built the button yet, but we've built all the levers

you need to push as an operator.
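
A condensed sketch of the cutover levers Lev describes, with placeholder publication and slot names; connection setup, the actual copy, and error handling are elided:

```rust
use tokio_postgres::Client;

async fn reshard_outline(
    source_primary: &Client,
    source_replica: &Client, // Postgres 16+: logical slots are allowed on replicas
) -> Result<(), tokio_postgres::Error> {
    // 1. Publish the tables to move on the source cluster.
    source_primary
        .batch_execute("CREATE PUBLICATION reshard_pub FOR ALL TABLES")
        .await?;

    // 2. Create a logical slot on a replica, so the copy load stays off the
    //    already-busy primary (one slot/replica per table or table group lets
    //    the copy run in parallel across machines).
    source_replica
        .batch_execute(
            "SELECT pg_create_logical_replication_slot('reshard_slot', 'pgoutput')",
        )
        .await?;

    // 3. Copy the existing rows out (the proxy shards the COPY stream and
    //    writes each row to the shard its key hashes to), then stream changes
    //    from the slot until the new cluster has caught up.

    // 4. Pause write traffic at the proxy, wait for the logical stream to
    //    drain, then point the proxy at the new cluster and resume traffic.
    Ok(())
}
```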

So when

Nikolay: you say you create a cluster, you mean create a Postgres cluster, just basically 1 new primary node.

And when you say you copy, it means that there is already something

in PgDog that copies.

You just need to use it.

It's not on the shoulders of the user, right?

Lev: No, we hook into the replication protocol to stream the

rows, and we use COPY, binary COPY underneath, and we shard the COPY stream.

Nikolay: All right, this is the option since Postgres 16, or

14, I think.

Binary.

Lev: Oh, it's been a minute, yeah.

Nikolay: When you initiate a logical replica, I guess.

There is an option, binary equals true, for the initialization part.

Interesting, actually.

Lev: Yeah, look at this.

Nikolay: There is also a trick to convert physical to logical, which is official since Postgres 17. I think it's called pg_createsubscriber, a CLI tool.

Lev: Oh, interesting.

I'm using like pgoutput, which streams everything in text

Nikolay: format.

Yeah, yeah, yeah.

But yes, but right.

But still there are additional things here.

You can convert a physical replica to a logical replica, and we could do it before, but now it's official.

It's official.

Lev: Oh, I

Nikolay: see.

pg_createsubscriber.

Unfortunately, we won't be able to use it on most managed services

because they don't expose this.

They should, actually.

Yeah.

They should expose this in API.

I hope they will.

But in case of self-managed, it's good.

Lev: Yeah, with self-managed, you have a lot more options for

sure.

I think the original goal was to make this work in managed environments, because that's where they are.

Nikolay: Right.

Yeah.

You mentioned Aurora already, support of Aurora.

Lev: That's great.

Nikolay: Yeah.

And these replicas, so you create a logical replica out of a physical one, copying data and performing CDC.

So that's reading changes from a slot on a replica, which is supported since recently, right?

I'm curious, when you initialize the logical replica from this, is this a physical standby?

Does it have hot standby feedback
on or off?

Does it report to the primary that
it still needs some data?

Because if it does, the primary
health can be affected.

If you eventually switch over,
that bloat doesn't matter.

But if you roll back...

Lev: So I think

Nikolay: if this physical replica
is not used by anyone else,

I would keep HBF off.

To let it lag, I would raise the
max.

There are a couple of settings
which allow a replica to lag.

If hot standby feedback is off, it's allowed
to lag.

There are many, many nuances here
in this process.

I'm curious how you do it.

But okay, you're using logical
replica here.

Lev: Yeah, that's a good call out
actually.

Like my recommendation is to create
a dedicated replica for this

or dedicated set of replicas, but.

Nikolay: Without hot standby feedback, I would

Lev: do it.

Nikolay: If it's temporary replica,
why not?

Like, And also how this replica
is created, right?

Like it's an interesting topic.

Do you integrate with some backup
tools or you just, it's out

of scope, right?

For PgDog.

Lev: I feel like at the moment
it is, but maybe in 6 months it

won't be.

Like right now I'm relying on people
using like some kind of

reliable way to create replicas.

If you use RDS, you just click
a button.

Again, like if you're using like
any of the managed services,

you just click a button, you get
a replica, it's using disk snapshots.

You're self-hosting.

Like if you have ZFS, presumably
you'll have your snapshots somewhere.

You can restore from, from a snapshot
and resynchronize pretty

quickly.

But even then it takes a while
to, to sync the ZFS snapshot on

disk.

Tested that 1.

Takes minutes, sometimes hours.

Oh, well, if you have like a 2
terabyte file system.

Nikolay: You mean the first send/receive?

So between the hosts?

Yeah, I understand.

Lev: Yeah.

It's surprisingly limited by, I
think, actually network.

I needed to like compress it and
then do something else to it.

And I wrote a blog post about it on our old blog at Postgres...

Oh my god.

Michael: ML?

Lev: Thank you.

But the website went down so my
shame went down with it as well.

The blog post is gone.

I think it was a good blog post
but whatever.

I learned how to use compression
tools and ZFS and, you know,

big deal.

Michael: Just out of interest,
like, are people generally using

this with a managed service or
is it quite split between managed

service and self-managed?

Lev: Right now it's split.

A lot of managed service use, especially, like, failover for Aurora is a big deal.

Like failover detection, because
you don't run Patroni typically.

So it's able to like query the
pg..., sorry, what was the

name of the function?

Nikolay: pg_is_in_recovery.

Lev: pg_is_in_recovery, thank you.

Nikolay: In my opinion, to stop
confusing people.

Lev: The people who know, know,
and the people who don't know,

don't need to know.

Yeah, if you know, you know.

Yeah.

Right.

Yeah, and I'm sure you know Aurora
loves to failover on a perfectly

good day.

So it helps a lot with uptime.

But 1 of my biggest sharded deployments
is literally like on-prem,

on-prem, like they rent their own data center and everything.

So it varies.

Like that's the beauty of it.

Like operating on the application
network layer, you really don't

care that much where you're deployed.
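
A minimal sketch of that failover check: ask each host pg_is_in_recovery() and treat the one answering false as the primary (polling interval, banning and retry logic omitted):

```rust
use tokio_postgres::Client;

async fn is_primary(conn: &Client) -> Result<bool, tokio_postgres::Error> {
    let row = conn.query_one("SELECT pg_is_in_recovery()", &[]).await?;
    let in_recovery: bool = row.get(0);
    Ok(!in_recovery) // the primary is the node that is NOT in recovery
}
```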

Nikolay: So based on this, almost a year since you started this, what's your current vision for your customer who needs pg_auto_failover?

And what's your vision for next
future steps?

Where do you move?

Lev: Yeah, this is a really interesting question, because sharding is always going to be, like, the north star, because it explains to people: this is a scaling product, ultimately.

You can start anywhere.

I already feel like it's a better connection pooler than PgBouncer

Nikolay: I

Lev: already feel like the load
balancing aspect of it is a game

changer and sharding as much as it is a work in progress, is

the North Star for us.

So that's how we want this project to evolve.

We want to make Postgres as scalable as Cassandra or Scylla or

Elasticsearch.

I don't know what people think about those particular databases,

but I think they're horizontally scaled, we can all agree.

So we want Postgres to be like that as well.

We're building, it hasn't been released yet, we're building an

enterprise edition of PgDog, which adds the classic things like

monitoring, understanding how everything works, a control plane, something that we can sell as a product to people, you know, like you don't have to go through the many, many steps required to set up, like, a sharding layer and network proxy.

So that should come out within the next few months, hopefully.

Nikolay: Will it include automated sharding with AI?

Like, press a button and your database is sharded already.

Lev: You want automatic sharding with AI as part of our feature set?

Nikolay: I think we

Michael: just want more consulting clients.

Lev: I don't think we'll ever get automatic sharding with AI,

but I don't know.

Right now my idea for that is like dump the, use pg_dump, get

the schema out, pipe it into ChatGPT and ask it what to do.

With a prompt.

And test.

Yeah, exactly.

And test.

Yeah.

I don't know.

A human being is able to come up with a sharding key really quickly.

Like I talked to like a senior database person at a company or

an app engineer and they're like, oh, my sharding key is like,

they already know it's org ID or customer ID or like they instinctively

know how the data is organized and they even know what the edge

cases are.

You know, like I don't think you need AI for that 1, unless you literally have never seen this database before, but yeah, that's the way I think about it.

Nikolay: I don't think

Michael: Yeah, on the vision stuff.

How closely are you paying attention to the other sharding

projects in the space and how do you differ in terms of your

North Star versus what you think theirs is based on what they're

saying?

Lev: I'll be completely honest with you, I literally never look

at them.

And it's not because I think I'm cool or something, it's literally I just don't have the time.

Like, I have something in my mind, I'm building it. What the other guys are doing is usually more of a distraction than a learning point to me personally.

So I have my vision of my North Star and I'm gonna follow it until I hit a wall or something.

And so far I haven't.

So that's how I think about it.

Nikolay: Great words to finish
this podcast, I guess.

Michael: Is there, is there anything
left that you wanted to

like mention that we haven't asked
about?

Lev: Ah, well, let's see, I have
like a, I'm gonna do a post

on Hacker News soon with all the
stuff that we've done, but I'm

just gonna blab with you, I know
it's kind of boring.

We added a bunch of support for aggregate functions, you know, we do query rewriting now, which is really cool.

Like we take the AST and we add whatever we need to that query to execute it.

1 thing that I think is really,
really phenomenal is we support

sharding key updates.

That's something I think Citus never got around to doing: changing the actual value of the sharding key and moving the rows between the shards.

Wow.

That one's really fun.

I don't know how often it's gonna
be used, but anytime somebody

asks us about sharding, they ask
about that feature, so it must

be useful.

We support multi-tuple inserts, we rewrite INSERT statements, we separate the tuples, and we send them individually to each shard, you know, all of that happens, you know, transparently, which is really cool.

We're just, we're focusing on usability,
like the idea is you

just deploy this and you don't
change your app, like don't rewrite

your queries, don't rewrite your
code and just make it work.

But, you know, obviously we're not there yet, but we're getting closer.

So every week there's something
new that comes out.
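
A toy version of the multi-tuple INSERT rewrite just described: tuples are grouped by sharding key and re-emitted as one statement per shard. The modulo hash and table layout are placeholders; a real router would use Postgres' own partition hash functions and parameter binding:

```rust
use std::collections::HashMap;

fn shard_for(key: i64, shards: usize) -> usize {
    // Toy routing function; stands in for the real hashing scheme.
    key.rem_euclid(shards as i64) as usize
}

fn split_insert(rows: &[(i64, &str)], shards: usize) -> HashMap<usize, String> {
    let mut per_shard: HashMap<usize, Vec<String>> = HashMap::new();
    for (customer_id, name) in rows {
        per_shard
            .entry(shard_for(*customer_id, shards))
            .or_default()
            .push(format!("({customer_id}, '{name}')"));
    }
    per_shard
        .into_iter()
        .map(|(shard, tuples)| {
            (
                shard,
                format!(
                    "INSERT INTO users (customer_id, name) VALUES {}",
                    tuples.join(", ")
                ),
            )
        })
        .collect()
}

fn main() {
    let rows = [(1, "a"), (2, "b"), (3, "c"), (4, "d")];
    for (shard, sql) in split_insert(&rows, 2) {
        println!("shard {shard}: {sql}");
    }
}
```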

Nikolay: And it comes from the
fields, right?

Lev: Yep.

It comes from people asking me,
hey, is this supported?

I'm like, yes.

Almost, almost.

Next week.

Thank you for the suggestion.

Nikolay: Yeah.

Cool.

Yeah.

Production driven development.

Lev: It is.

I really enjoy it because every
single feature is immediately

used.

Like, it's great.

Nikolay: Cool.

Great.

Thank you.

Thank you for the update.

It's really good to see the project
is growing.

Excited to test it once again soon.

Lev: That would be great.

And I still have to review and
merge your PR.

Nikolay: It's OK.

Lev: It's been open for, I believe,
10 months, since the last

time we talked.

Nikolay: Something like this.

Good.

Michael: Well yeah really good
to see you again Lev.

Take care.

Lev: Thank you so much.

Michael: Catch you soon.

Lev: Yeah Thanks so much for having
me guys.

Bye bye.
