PgDog update
Nikolay: Hello, hello, this is
PostgresFM, your favorite podcast
about Postgres.
And as usual, my name is Nik,
PostgresAI, and Michael is here
as usual as well from pgMustard.
Hi, Michael.
Michael: Hello, Nik.
Nikolay: And we have a great guest
for the second time, Lev, PgDog.
Hello, Lev, thank you for coming.
Lev: Hello.
So thank you for having me for
a second time.
Nikolay: Yeah, I'm sure there's
a lot new to discuss.
So yeah, I think it's a great idea
to see what has changed since
last time we talked, because it
was around your first release,
I guess, right?
And yeah, so where to start?
Lev: Where to start?
Yeah, we're in a different year,
2026.
So it's been almost 8 months since
we last chatted, something
like that.
7 months.
Now we're in January.
Michael: Yeah, I think 10.
Lev: 10 months.
Oh boy.
I mean, I've been working full
time on this stuff, so I have
a whole list of things that have
been developed ever since.
I think the biggest thing is PgDog
is very much real and in production
with real companies and real customers.
You know, last time we chatted,
it was like a thing, but now
it's a real thing.
So that's the big update.
It's in prod.
Most of the code I think I would
comfortably say is stable and
performant, which obviously I would
say that, but there's real
usage behind this which is really
cool.
Let's see, there's a long list
of features that's been developed
since then.
Michael: Before we go on to those,
I wondered of the people using
it in production, what are the
primary drivers?
Is it sharding or is it like an
upgraded pooler?
You know, I feel like there are
a few selling points to PgDog
that we talked about last time
and I wondered which ones were
driving like the primary adoption.
Lev: That's a great question.
One of our early adopters, they went
straight for sharding, which
to me was like, wow, you're brave,
sir.
And they went straight for it and
they replaced Citus with it,
so that was really cool.
Nikolay: Can you name some names
or?
Lev: They're on my website.
One of their logos; I'll let you guess which one.
I'm still trying to close them
as an official customer, so soon
I'll be able to say their name,
but their logo is there.
And yeah, they're pushing, I think,
over 100k QPS per shard.
And they have like 12 shards, I think.
Yeah, it's kind of crazy.
They're using direct-to-shard queries, so that's the real reason why it worked so early: they didn't use any of the fancy aggregates and sorting across multiple shards.
That's the real tricky part.
Although we've made some progress there, it's still not 100% compatible with how Postgres does it.
It's on its way though.
But other use cases, load balancer
is a huge thing.
Like people love having the ability
to talk to multiple replicas
without actually manually creating
a connection pool for each.
You know, there's health checks and failover.
One of the big uses is we're deployed alongside, you know, Aurora or RDS, where Patroni is basically not used.
And it can detect primary-replica failover, and it'll move the traffic to the new primary within a second or so, which is really great.
And then, of course, when one of the replicas starts failing, it'll ban it and then use the remaining replicas for read queries.
So you get that, you know, that amazing, like, four nines of uptime for Postgres, which is pretty cool.
Nikolay: I have questions about all these features.
Yeah.
All right.
Before that, I have an underlying question.
10 months. Is it really 10 months or 10 years?
Because you use AI.
Do you use AI to develop?
Lev: I do.
Yes.
Nikolay: Claude, right?
Lev: Yeah.
I use Claude, Opus specifically.
Nikolay: Claude Code, Opus 4.5 right now.
Right.
So this is what you use.
Lev: That's right.
Yeah.
For only certain areas of the code,
not all of it.
There's some areas where I don't trust it at all, but some areas, like, we're building a dashboard with all the metrics and stuff.
Nikolay: Not critical paths, right?
Lev: Yeah.
Critical paths are still organic
code, if you will.
Nikolay: Organic code. This is the first time I hear this. It's hilarious.
Lev: Exactly. Yeah. Homegrown organic code.
But, I don't write JavaScript anymore, for example.
I don't write CSS anymore, like that kind of stuff.
Nikolay: So you use Claude Code for these additional pieces of software, not the super critical parts.
And I'm sure you ask it to deduplicate and reuse the code as much as possible, recycle existing pieces of code.
Lev: Stop. Okay. It's crazy how much that machine loves to generate copy-pasted code all around.
It's allergic to function reuse.
You really have to tell it, like, hey, yeah.
Nikolay: Yeah, I have a special additional codebase review process, which is basically targeting this deduplication vector.
But it's possible to achieve.
If you don't do that, by default, you will get a lot of duplicated
code.
Yeah, that's great to hear.
Okay, I see, by the way, great logos on your website.
I missed that, I haven't visited the front page for a while.
I see Coinbase, Circleback; we use Circleback at work.
Some other names, that's great, congrats.
Lev: Thank you, hard at work.
Nikolay: Yeah, So what is your current pace?
How often releases happen?
How do you ship?
And if you still write code a lot, I guess it's not like a new version every day.
It's less often, right?
Lev: Yep.
Not every day.
Our release schedule is every Thursday.
I'm trying to be very diligent about it.
Like whatever code that's in main is usually stable, but I give
it a week before I tag a release.
And then every Thursday I compile release notes and ship it out.
And then you get it in GitHub and there's a Docker image that
gets built for both x86 and ARM.
So you can pretty much run it anywhere, including on your Mac,
if you wanted to.
And we have a Helm chart if you're using Kubernetes; our Helm chart is getting a lot of good usage and contributions from a bunch of people.
So, you know, I'm not a Kubernetes expert, but they are, so they're helping me along.
So if you're running Kube, you should be able to deploy PgDog
in like 30 seconds, honestly.
But otherwise, you know, it's just a container.
You can run it anywhere.
Nikolay: Right, and you use libpg_query for query parsing, to build ASTs.
And you recently optimized it, right?
Your optimization is not merged to upstream yet, but can you tell us a little bit more about it?
Specifically, my question is also: is it worth optimizing?
Like, how many microseconds do we save on each query?
Should we worry about those microseconds at all?
Lev: Yeah, you should.
It was...
First of all, full disclosure.
I told Claude to do it.
And then he did it.
I checked it.
I was very open about it.
I put it in the PR.
I know how open source projects hate this kind of stuff.
Like, you know, 3,000 lines of boilerplate.
I was up front, look, guys, this is Claude-generated.
I reviewed it.
It works okay.
But give it a few months before seriously looking at it.
But the idea is in libpg_query, there's a translation layer between
the C library and a bunch of other bindings.
So notably, like we use the Rust binding and the translation
layer is in protobuf.
So every single time you call parse or fingerprint, every time
you get the AST back, basically, it goes through a serialization
and deserialization process with protobuf inside the same process.
And that's really slow, actually.
So I replaced it, only for the Rust bindings, with direct C-to-Rust and Rust-to-C mappings of the Postgres AST, and that improved the translation part of it by, honestly, I'm not even exaggerating, like 5x.
The throughput went from something like 1,000 per millisecond, or per second, to like 5 or 10k, something like that.
The benchmarks are in the PR, but It's huge.
And I think another important factor is CPU utilization.
So for us, we're a network proxy, so our CPU budget has to be
pretty reasonable.
If you can decrease it by 5x... and most of our CPU comes from query parsing.
Everything else is pretty straightforward stuff, just I/O back and forth, and atomics.
So for us, it's huge.
Nikolay: So this optimization is basically that you reduced the number of times the query is copied between various buffers.
Lev: Exactly.
And the protobuf serialization is probably one of the fastest in the industry, but it's still very, very slow compared to just copying a C struct in memory, like a memcpy.
Nikolay: It sounds to me like shorter
queries should suffer less
from this than huge CTEs, you know,
like multi-kilobytes in size.
They should suffer even more, right?
Lev: Yeah, yeah, definitely.
I haven't measured how much faster slow queries or long queries get.
I just took a pretty sizable query, like 600 lines, and used that as my benchmark.
Because a lot of people who use
PgDog surprisingly like to write
long and complex SQL statements
for their database.
So that's what I went with and
that's where latency will really
matter.
So yeah, for small ones, maybe the gain is smaller, but doing protobuf serialization and deserialization in memory is a big waste of time when you can just map a C struct directly with a few allocations.
Nikolay: Right, right.
I also saw there was an optimization of SIMD vector calculations.
Is this just cool stuff, or did you see some real benefit?
Lev: Well, the vector one is kind
of funny.
There's a blog post about it, but
basically I tried to shard
a pgvector index among multiple
Postgres instances.
And to do that, I implemented one
of their algorithms, IVFFlat
inside PgDog, and then it's able
to basically like separate the
vectors between multiple databases
and then search for them using
IVFFlat.
And for that I needed SIMD to calculate,
I think I'm supporting,
we're supporting cosine distance
only, just as a POC, just to
see if anyone even cares about
this stuff.
And that's where the SIMD optimization
comes from, which is to
calculate the cosine distance between
the input and one of the
centroids, because we store the
centroids basically in a config
file.
Everything in PgDog right now is config-driven, because I trust the disk not to do anything.
Eventually we'll move away to something more dynamic.
Right now it's like you pre-calculate
your centroids in advance
and then you store them and then
you can basically query one shard
or multiple shards to get your
probes, the number of probes,
if you will.
And then underneath each Postgres
database has their own IVFFlat or HNSW index.
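For reference, the per-shard side of that setup is just standard pgvector; a minimal sketch, with made-up table and column names (the centroid routing itself happens inside PgDog):

```sql
-- On each shard: a regular pgvector table with its own local ANN index.
-- Table and column names are made up for illustration.
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE embeddings (
    id        bigint PRIMARY KEY,
    embedding vector(3)
);

-- Each shard keeps its own IVFFlat (or HNSW) index over its slice of the vectors.
CREATE INDEX ON embeddings
    USING ivfflat (embedding vector_cosine_ops) WITH (lists = 100);

-- Per-shard ANN search; which shard(s) get probed is decided by the proxy
-- from the pre-computed centroids stored in its config.
SET ivfflat.probes = 5;
SELECT id
FROM embeddings
ORDER BY embedding <=> '[0.1, 0.2, 0.3]'::vector
LIMIT 10;
```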
So it's an approximation on top of an approximation, which is fun, because you can scale pgvector into, presumably, billions of vectors.
Nikolay: Like others have announced, right, and we said it's not true because it's not a single instance.
So, yeah.
A billion vectors on a single instance, I haven't seen that yet.
Lev: No, no, no, it's not possible
with pgvector, but with sharding
it is.
Nikolay: Right.
So if you have sharding schema
defined, like you can store a
lot.
Right, so it's a great feature.
Anyone using it or is it just showcase
and that's it?
Lev: I don't think anyone, I think
I got an email from somebody
who was like, this is really cool
and I'm like, well, this is
how you need to use it.
And they're like, cool.
I haven't heard back from them.
But yeah, it's more like a...
Nikolay: Speaking of performance,
let's talk about two-phase
commit.
There is support for it, right?
Can you describe what's your
experience and view on using
two-phase commit in larger systems?
Oh, we have a dog.
Lev: We have a dog. And we have a dog on two-phase commit as well.
Nikolay: Okay, just for those who are listening to the audio version, we just saw a dog actually walking in the background.
That's great.
Lev: Big dog.
She is the reason why PgDog
is named PgDog.
Nikolay: Right.
You explained that last time.
Right.
So what about two-phase commit
and performance overhead it has
and so on?
Lev: I'll be completely honest
with you.
I haven't benchmarked it very much.
Nobody really talked to me about
using it yet in production.
So I don't know how much the Postgres
internals themselves create
an overhead.
I just know from the proxy side,
it's 2 extra statements to execute.
And also the additional complexity
of managing phase 1, phase
2 of the commit.
So if phase 1 fails, you need to
roll back.
And if phase 2 fails, you need
to roll forward or commit.
And you also need to make it resistant to PgDog crashes, which we haven't done yet.
So there's still work to do there
to make it, I think, production
grade.
But from our point of view, it
was just an extra couple of statements
to execute.
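For reference, the two extra statements per shard are Postgres's standard two-phase commit commands; a generic sketch with a hypothetical table, not PgDog's exact traffic:

```sql
-- Phase 1, run on every shard participating in the transaction:
BEGIN;
INSERT INTO orders (id, total) VALUES (12345, 99.00);
PREPARE TRANSACTION 'txn_12345';   -- persists the transaction and releases the session

-- Phase 2, once every shard prepared successfully:
COMMIT PREPARED 'txn_12345';

-- If any shard failed during phase 1, roll the prepared ones back instead:
ROLLBACK PREPARED 'txn_12345';

-- Requires max_prepared_transactions > 0 on each shard.
```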
Nikolay: Yeah, okay, understood.
So this is just, like, some foundation for future work, maybe, right?
Lev: Definitely, yeah.
After we get some feedback from production systems.
Nikolay: Okay, I see.
Another thing I noticed, it's a fun feature.
You created an ID generator, right?
Postgres 18 brings UUID version 7, but it's 16 bytes.
It's quite a lot of space spent for a single ID value.
Yours is 8 bytes, right?
Lev: Yeah, bigint, yeah.
Nikolay: It's an 8-byte integer, but it behaves similarly.
So there is a timestamp component in it.
This is a cool feature, and it's going to serve us until when?
70 years, right?
70 years left.
It's not a lot, but for many systems it should be enough.
What inspired you to build this?
Lev: I have a potential production deployment, and there was big usage of omni-sharded tables, or replicated tables, or mirrored tables, or whatever you want to call them.
Basically, the data in the table is the same on all shards, and they needed to generate the primary key for that row.
And you couldn't rely on the sequence, because even if you used two-phase commit, which they originally thought they would do, there's still a chance that the sequence numbers on each shard will be different, especially because sequences are consumed whether the transaction succeeds or fails.
So the ID could be different on
each shard.
So you needed an external ID generation
service.
And the first thought was about using a third database, another database, to use its sequence: pull the value in and then insert it into the query, which creates a single point of failure.
It's much simpler to actually be able to generate a unique bigint inside your proxy, inside PgDog, which is not that hard to do.
Nikolay: Couldn't the sequences just have some extra value added?
For example, if we say we don't need more than 100 billion per shard, we could say on shard number one we always add 100 billion, on the second shard we add 200 billion.
Why not?
Michael: They need to be the same
on every shard.
This is the table that's replicated.
Nikolay: Okay.
I see.
Lev: I like it.
Michael: I think it's interesting.
So I guess it's because we don't have unique constraints across shards, right?
Lev: Yeah, that's right.
Well, we don't need them, basically, because the ID generation is guaranteed to be unique beforehand.
I mean, all the good database engineers will tell you you should still have a unique constraint, even for UUIDs, which are theoretically guaranteed to be unique, but even in this case, you don't necessarily need one.
So the idea is, again, so you insert
the same ID into all the
shards and you need to be able
to generate it quickly and efficiently.
And that's why this is, like, the first of its kind: a PgDog-executed Postgres function, okay?
Cause we're using the Postgres parser, right?
So why not, you know, just hook into the query execution.
Nikolay: Extract. A step towards Citus, and having the whole Postgres in your middle there, no?
Lev: I mean, you heard it here first on PostgresFM.
No, I don't know.
I think it's a good start, but it was so simple to implement.
You extract it, you find the function call, and then you generate a unique ID inside the pooler.
And the way you do that is, well,
you get the timestamp.
You kind of assume your clock is
reasonably accurate.
Then you have an internal sequence
number so you can generate
multiple IDs per millisecond.
So the clock tick is millisecond.
And then there's another unique
number, call it node ID or PgDog
node ID.
So you can have multiple PgDog
containers in front of the same
deployment and it will guarantee that none of them will hit a duplicate from the same list.
Nikolay: No collisions, yeah.
Lev: No collisions, exactly.
So, you know, the 70 years is a
little bit arbitrary.
We can reduce any one of these components and increase the timestamp component, for example by just reducing the number of PgDog nodes that you can deploy.
Right now it's 1,024, which if
you're running that much, you're
Facebook probably.
But you can reduce that by half
and give that extra bit to the
timestamp.
And then you can also reduce the
number of per millisecond generation.
Right now it's 4 million.
You can generate 4 million per
millisecond.
It's a lot.
You could probably reduce that.
Actually, no.
Let me double check.
I think it's probably 4,000 per
millisecond.
That sounds more reasonable.
Michael: 4 million per second would
make sense.
4 million.
Lev: That's right, yeah.
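One plausible layout for such an 8-byte ID, purely as an illustration of the arithmetic (the exact bit widths PgDog uses may differ): about 41 bits of millisecond timestamp gives roughly 70 years, 10 bits covers 1,024 nodes, and 12 bits gives 4,096 IDs per node per millisecond, 63 bits in total.

```sql
-- Illustrative only: composing a snowflake-style bigint from the three components.
-- Bit widths are assumptions (41 + 10 + 12 = 63 bits, which fits a signed bigint);
-- PgDog generates this inside the proxy, not in SQL.
SELECT (
      (ms_since_epoch << 22)   -- ~41 bits of millisecond timestamp (~70 years)
    | (node_id        << 12)   -- 10 bits: up to 1,024 PgDog nodes
    |  seq                     -- 12 bits: up to 4,096 IDs per node per millisecond
)::bigint AS generated_id
FROM (VALUES (
    (extract(epoch FROM clock_timestamp()) * 1000)::bigint,  -- ms_since_epoch
    7::bigint,                                                -- node_id
    42::bigint                                                -- seq
)) AS v(ms_since_epoch, node_id, seq);
```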
This is where documentation, or good documentation, comes in handy.
Michael: That's a big change from
last time as well.
Lev: Yep.
The biggest thing I'm proud of
is adding an icon next to every
single link.
Just, you know, paying attention to detail, I suppose.
But yeah, like, every single comment
I get from people is like,
well, is this documented?
Is this documented?
And somebody's talking about our
feature, and I'm like, good
point.
Now it is.
Thank you.
So if it's not in the docs, I want
to say it doesn't exist, but
that's not true.
There's a backlog of documentation
to be done.
Yeah.
Nikolay: I should help with this.
Lev: I would love for you guys
to help out with anything.
Nikolay: Not I, AI.
Oh, AI.
Maybe I as well.
I would love to help with AI.
Yeah, I'm looking at PgDog from
time to time and definitely
looking for some use case.
And as maybe not all of you know, my team benchmarked it.
So I'm looking forward to revisiting that benchmark to see how that protobuf-related optimization helped.
I'm very curious.
This time we are going to test not with trivial queries, like a simple SELECT and that's it, but with some heavy ones.
That's why I mentioned I expect some benefits for heavier queries, bigger text.
Yeah, so yeah.
Documentation, I think AI should
help with it.
Like this is my plan.
Lev: No, no way, man.
Like, because they all sound the
same.
They all look the same.
You could give it a couple of tips so it'll sound original, but if you write text using AI, it just all reads like it was written by the same one.
Nikolay: But at least the reference could be.
The usual problem with reference-style documentation, which lists various functions, config params, everything, is that it usually lags.
And to maintain it, I think AI is great.
Lev: Yeah, I mean, I don't disagree.
For me, writing documentation is
kind of like a, it's like a
ritual, you know, I go through
the things that we built, and
I'm like, oh, document, and then
I look at what I wrote.
I'm like, wait a minute.
This is wrong.
I'm putting myself into the user's
shoes.
I'm reading the docs.
I'm like, OK, this is what it looks
like.
Nikolay: It's an additional point where you can verify the whole picture and see if it all makes sense, or needs to be adjusted additionally.
Yeah, I agree.
Also, demos, right?
Like when you think how to show
what was built, sometimes you
end up changing everything.
Lev: Yep.
That has happened before.
Nikolay: Because you simply cannot
explain it, it means that
it probably should be done differently,
right?
Lev: Exactly, exactly.
So,
Nikolay: let's talk maybe about
hard stuff, like prepared statements,
what happened to them, like how
PgDog supports prepared statements,
transaction mode and so on.
Can you, can you...
Lev: Funny.
Yeah.
You know, we have to stop sharding,
but prepared statements is
where we really have issues.
Nikolay: Yeah, let's talk about
it
Lev: a little bit.
Sure.
They work.
They're fine.
Everything's great.
It works, I think.
Nikolay: You saved them, right?
Lev: Yeah, they're cached.
It's a global cache.
So if multiple clients send the
same statement, the same statement
on the server side is executed.
So pretty high cache hits.
Nikolay: Implicit or explicit,
like protocol level or prepare
keyword?
Lev: Protocol level right now, and partial support for explicit, you know, since we have the parser.
For the explicit ones I need to
differentiate on the number.
Sometimes you can pass the data
types in the prepared statement,
sometimes you don't.
I just need to add a little bit
more support to the parser there
to extract the differentiation
between like different unique...
Identical prepared statements,
but they expect different data
types for some reason.
So protocol ones work.
I guarantee it.
If they don't, I swear.
They do.
But the other kind is still a work
in progress.
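The two flavors of explicit prepared statements being referred to look like this in plain Postgres (the users table is just an example); the pooler has to treat them as distinct when the declared types differ:

```sql
-- Explicit prepared statement without declared parameter types
-- (Postgres infers them from context):
PREPARE get_user AS
    SELECT * FROM users WHERE id = $1;

-- The same statement text, but with explicit parameter types:
PREPARE get_user_typed (bigint) AS
    SELECT * FROM users WHERE id = $1;

EXECUTE get_user(42);
EXECUTE get_user_typed(42);

DEALLOCATE get_user;
DEALLOCATE get_user_typed;
```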
Nikolay: Yeah, and have you seen
any issues with prepared statements
used together with partitioning
or your clients don't use partitioning
at all because they have sharding?
Why use partitioning?
Lev: I think we use them with partitioning.
I haven't seen any issues.
Like if the query, if the prepared
statement plan is wrong, Postgres
will return an error and we evict
it from the cache and then
we prepare it.
Nikolay: It's not that simple, right?
Because the first few calls are not cached yet: prepared statements use a custom plan for the first 5 executions.
There is a hard-coded constant in PostgreSQL.
And with partitioning, it's much more complicated.
I have a series of blog posts about this.
It's like super complicated.
Yeah, I'm curious about this.
Maybe just that nobody encountered
any issues yet, but yeah,
I'm curious about this.
Lev: No clue.
Yeah, let me know what I need to
fix.
Nikolay: Yeah, I understand.
Very practical.
I understand that.
Michael: Yeah, but also, Nik, this feels like the kind of thing that actually gets a little bit better with sharding.
You reduce the QPS, you reduce the load on each shard; everything that could cause issues around that is, however many shards you have, times less bad, or less likely to hit the cliff.
Nikolay: Absolutely, sharding reduces the need for partitioning, 100% so, but maybe it doesn't fully eliminate it, because partitioning, if it's time-based, still makes sense within one shard.
We know cases of really large databases
which combine both approaches.
Michael: Yeah, for maintenance
reasons it still makes complete
sense, right?
Nikolay: Right.
Yeah, yeah.
So yeah, it's tricky.
And if you want prepared statements, oh, it's super complex.
Lev: So what's the deal?
What doesn't work?
Nikolay: So partitioning moves the plan caching decision from planning time to execution time.
And this increases complexity, because if you just deal with the planning-time cache, it's quite simple.
So if it's not partitioned, only the planner can decide to cache or not, right?
That's it.
If it's partitioned, the decision point is shifted to execution time.
And there is a point when basically you are without the cache and you cannot avoid it, because while the cache is being built, at that execution, it's the sixth or the fifth execution, the executor doesn't have this cache yet, because it's being built.
So we are without the cache.
And in high-QPS cases, when the frequency of calls is super high, there is a moment when we must live without the cache, and it might be super non-performant.
So performance can be terrible at that moment.
Lev: I see.
Nikolay: Which doesn't happen in the unpartitioned case at all.
I'm already starting to forget the details, because there are so many nuances if you don't live there, if you only visit it like I did for a couple of weeks.
But I have several blog posts about this problem, with demonstrations.
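A quick way to see the behavior described here, using a hypothetical table: the first five EXECUTEs get custom plans, after which Postgres may switch to a cached generic plan, and plan_cache_mode lets you force either behavior:

```sql
-- Example table; the exact schema doesn't matter.
PREPARE get_user (bigint) AS
    SELECT * FROM users WHERE id = $1;

-- Executions 1-5 are planned as "custom plans" (the parameter value is known).
EXECUTE get_user(1);
-- After five executions Postgres compares costs and may cache a generic plan;
-- EXPLAIN shows "$1" instead of the literal once that happens.
EXPLAIN (COSTS OFF) EXECUTE get_user(1);

-- The switch-over can be controlled explicitly:
SET plan_cache_mode = force_custom_plan;   -- always re-plan with the parameter value
SET plan_cache_mode = force_generic_plan;  -- always use the cached generic plan
SET plan_cache_mode = auto;                -- default: decide after 5 custom plans
```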
Lev: See, this is the prime example
why being a proxy for Postgres
sharding is the best place to be
because to me, that problem
is completely irrelevant.
Right?
Because I'm just passing statements
through, right?
I think, unless I'm missing something, they will work or they won't, but from a performance perspective, if you have trouble with partitioned tables and prepared statements, you just don't use prepared statements, right?
Nikolay: Yeah, yeah, yeah.
Partitioning also brings the problem of an explosion of access share locks.
Because... I'm talking about this particular case, but yeah.
So if you have a lot of partitions: we have the plan cache, that's one idea, but we also have routing to a specific partition, which is called constraint exclusion, right?
We need to route to a specific partition.
This is another idea.
And at planning time, if the planner knows where to route, this is the normal case, when you don't have prepared statements.
It just routes to a specific partition, all good.
No problem.
We lock only one partition with all its indexes.
But if we postpone it to execution time, it needs to lock all partitions with all their indexes, and it's terrible.
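That lock footprint can be observed directly; a sketch with a hypothetical partitioned table (exact counts depend on the Postgres version, partition count, and indexes):

```sql
-- With a generic plan, runtime pruning happens in the executor, so the cached
-- plan still references, and on most released versions locks, every partition
-- and its indexes.
SET plan_cache_mode = force_generic_plan;
PREPARE q (int) AS SELECT * FROM measurements WHERE tenant_id = $1;

BEGIN;
EXPLAIN (COSTS OFF) EXECUTE q(1);
-- Count AccessShareLocks held by this backend: roughly the number of
-- partitions plus their indexes, instead of just one partition.
SELECT count(*) FROM pg_locks
WHERE pid = pg_backend_pid() AND mode = 'AccessShareLock';
ROLLBACK;
```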
With sharding, I don't know, you just...
Lev: Well, you reduce the QPS, so...
Nikolay: Yeah, this is already...
And the lock manager is not a problem anymore.
Or you just need to use Postgres 18 and raise max_locks_per_transaction.
In that case, the lock manager is also not a problem.
So yeah, anyway, this is an interesting topic.
And I'm curious if some people combine sharding and partitioning.
I think they should, sometimes.
It's just for really big setups, yeah.
Lev: I mean, as a guy who's trying to build sharding: you should always use sharding, unless you can't. But, like, that's, you know...
Michael: I think there's a maintenance argument.
Imagine you've got like a logs table per shard, and you want to delete old logs after a certain retention period.
You probably still want partitioning to make those deletes really efficient.
I think there's still a use case for it, but it's not so much for performance reasons, it's for maintenance reasons.
That'd be my guess.
Nikolay: Yeah, to build indexes, to have vacuum working more efficiently, and so on and so on.
Michael: Yeah, but I'd still characterize those slightly differently; I know they are maintenance, but they're kind of still performance reasons, right?
You want them because you don't want the table to be too big.
I'm talking about doing deletes efficiently, being able to, instead of having to do batch deletes, you can...
Exactly.
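The maintenance win being described is, for instance, detaching or dropping a whole partition instead of deleting rows; table names here are illustrative:

```sql
-- Time-partitioned logs table within a single shard.
CREATE TABLE logs (
    id         bigint      NOT NULL,
    created_at timestamptz NOT NULL,
    message    text
) PARTITION BY RANGE (created_at);

CREATE TABLE logs_2025_12 PARTITION OF logs
    FOR VALUES FROM ('2025-12-01') TO ('2026-01-01');
CREATE TABLE logs_2026_01 PARTITION OF logs
    FOR VALUES FROM ('2026-01-01') TO ('2026-02-01');

-- Retention: instead of DELETE plus vacuum on millions of rows,
-- detach (or drop) the expired partition almost instantly.
ALTER TABLE logs DETACH PARTITION logs_2025_12 CONCURRENTLY;
DROP TABLE logs_2025_12;
```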
Nikolay: It's exactly, how's it called, not data retention, but there is some additional term.
Michael: Yeah, yeah, the opposite of it.
Lev: Yeah, there could be, like, a legal requirement for data retention.
I think that's the only reason why people would ever delete data, right?
Well, the other reason is performance.
So if you eliminate the performance argument, like if you make performance just free because sharding takes care of it, I mean, obviously that's not the case, but, like, no.
Nikolay: Yeah, and if you compare
sharding and partitioning,
with partitioning, we often are willing to have one active partition where we write and all others are archived.
In case of sharding, it's strange, right?
Because a lot of hardware is not used.
Like, if you have only one shard read-write and the others are like an archive, it's strange, simply strange, right?
So you want to distribute writes
usually.
Lev: You want it to be even, like you want, ideally, all shards to have the same load.
And that's why you'd usually use, like, hash-based sharding if you can, again.
Or the way you design your schema for sharding is you make sure that each shard has roughly the same load.
If you have a hot shard, you're gonna have a hot shard problem sooner rather than later.
Nikolay: Yeah.
Lev: Exactly, especially for like
multi-tenant, like what I call
multi-tenant is B2B basically,
where your sharding key is like
a customer ID, if you will, and
then 1 of your customers grows
big and you need a second sharding
tier, or if you will, a second
partitioning tier.
Nikolay: And does PgDog support resharding without downtime already, you know?
Lev: Oh yeah, yeah, yeah, of course.
Nikolay: How's it technically working?
Lev: I mean, it's good.
It's been tested, so that's kind
of cool.
The idea is, I mean, I don't want
to say simple, but I think
practical.
For re-sharding, this is where
the contrast between how this
works and like NoSQL databases
work is a little bit more apparent.
With NoSQL, you kind of add another
machine and then it kind
of rebalances all the data using
existing machines.
The PgDog approach is you create
a brand new empty cluster, and
then you copy all the data out
into the new cluster.
And this decision is actually not
because of a particular limitation,
but it's intentional.
Because you want to reduce the
load on the existing cluster during
re-sharding, because usually when
you start re-sharding, it's
already under load, it's already
underperforming, it's already
having problems.
So you don't want to add additional load with resharding to it.
So you create a new, brand new cluster, you move the data over
and we are able to use replicas since Postgres 16 to copy data
out with logical replication, that was a huge thing.
That was, I think, one of the game changers that made this possible,
is logical replication slots on replicas.
Why does it matter?
Because you can create 16 replicas and parallelize your copying.
Nikolay: Why does it matter? Disk I/O or what?
Like, you don't want to use disk I/O on the primary, or what?
Lev: Well, you definitely don't.
Your primary is already shot.
Like if when you're sharding, your primary is already having
a really, really bad time.
That's the main reason behind sharding.
You want to scale your writes.
So adding additional replication load to an already underwater
database is bad.
And especially when you want to create, let's say, you go from
one shard to 16 shards.
Extreme example, but realistic.
You're going to add 16 replication streams to your primary.
That's like 16 times like streaming the same data.
The thing is going to fall over.
I guarantee it.
Nikolay: But overhead is amplified here.
So the same work is done on different CPUs, VCPUs.
Lev: Yeah, yeah, decoded and recoded.
Nikolay: Yeah, yeah, yeah.
So.
You want to shard this load, right?
Lev: Exactly, exactly.
We're able to use one replica per table, basically, for re-sharding.
So if you have, say, 1,000 tables, you're able to shard your sharding stream, basically split it across 16 different machines, or 32 different machines, and sync it in parallel, which I think is pretty huge.
It's all the same LSN, right?
Because they're replicas, right?
So you can synchronize at the end and make sure you're at the right point, and, like, pause the live traffic, then wait for the logical stream to catch up and cut over to the new cluster.
Like all of that is, the pieces are there, you just need a button
to make it work.
We haven't built the button yet, but we've built all the levers
you need to push as an operator.
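PgDog drives this through the replication protocol and binary COPY itself, but in plain SQL terms the building blocks look roughly like this; object names are made up, and this is a sketch of the idea rather than PgDog's actual mechanism:

```sql
-- On the source primary: one publication per table (or per group of tables),
-- so the streams can be split across standbys.
CREATE PUBLICATION pub_orders FOR TABLE orders;
CREATE PUBLICATION pub_events FOR TABLE events;

-- Since Postgres 16, the logical slots that feed these streams can live on
-- physical replicas, so each stream reads from a different standby:
SELECT pg_create_logical_replication_slot('resharding_orders', 'pgoutput');

-- On each new shard: a subscription per table set, pointing at a replica;
-- binary = true uses binary format for the copy.
CREATE SUBSCRIPTION sub_orders
    CONNECTION 'host=replica-1 dbname=app'
    PUBLICATION pub_orders
    WITH (slot_name = 'resharding_orders', create_slot = false, binary = true);
```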
Nikolay: So when you say you create a cluster, you mean create a Postgres cluster, basically one new primary node.
And when you say you copy, it means that there is already something
in PgDog that copies.
You just need to use it.
It's not on the shoulders of the user, right?
Lev: No, we hook into the replication protocol to stream the
rows and we use copy, binary copy underneath, and we shard the
copy stream.
Nikolay: All right, this is the option since Postgres 16, or
14, I think.
Binary.
Lev: Oh, it's been a minute, yeah.
Nikolay: When you initiate logical replica, I guess.
There is option binary equals true for initialization part.
Interesting, actually.
Lev: Yeah, look at this.
Nikolay: There is also a trick to convert physical to logical, which is official since Postgres 17; I think it's called pg_createsubscriber, a CLI tool.
Lev: Oh, interesting.
I'm using, like, pgoutput, which streams everything in text format.
Nikolay: Yeah, yeah, yeah.
But yes, right.
But still, there are additional things here.
You can convert a physical replica to a logical replica, and we could do it before, but now it's official.
It's official.
Lev: Oh, I see.
Nikolay: pg_createsubscriber.
Unfortunately, we won't be able to use it on most managed services
because they don't expose this.
They should, actually.
Yeah.
They should expose this in API.
I hope they will.
But in case of self-managed, it's good.
Lev: Yeah, with self-managed, you have a lot more options for
sure.
I think the original goal was to make this work in managed environments, because that's where most of them are.
Nikolay: Right.
Yeah.
You mentioned Aurora already, support of Aurora.
Lev: That's great.
Nikolay: Yeah.
And these replicas, so you create a logical replica out of a physical one, copying data and performing CDC.
So that streams from a slot on a replica, which is supported since recently, right?
I'm curious: when you initialize a logical replica from this physical standby, does it have hot standby feedback on or off?
Does it report to the primary that it still needs some data?
Because if it does, the primary's health can be affected.
If you eventually switch over, that bloat doesn't matter.
But if you roll back...
Lev: So I think...
Nikolay: If this physical replica is not used by anyone else, I would keep hot standby feedback off.
To let it lag, I would raise the max.
There are a couple of settings which allow a replica to lag.
If hot standby feedback is off, it's allowed to lag.
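The settings being alluded to, on a dedicated physical standby; the values here are just examples:

```sql
-- On the dedicated standby used only for the resharding copy:
ALTER SYSTEM SET hot_standby_feedback = off;          -- don't hold back vacuum on the primary
ALTER SYSTEM SET max_standby_streaming_delay = '4h';  -- let the standby lag instead of
                                                      -- cancelling long-running reads
SELECT pg_reload_conf();
```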
There are many, many nuances here
in this process.
I'm curious how you do it.
But okay, you're using logical
replica here.
Lev: Yeah, that's a good call out
actually.
Like my recommendation is to create
a dedicated replica for this
or dedicated set of replicas, but.
Nikolay: Without hot standby feedback, I would do it.
If it's a temporary replica, why not?
And also, how this replica is created, right?
It's an interesting topic.
Do you integrate with some backup tools, or is it out of scope for PgDog?
Lev: I feel like at the moment
it is, but maybe in 6 months it
won't be.
Like right now I'm relying on people
using like some kind of
reliable way to create replicas.
If you use RDS, you just click
a button.
Again, like if you're using like
any of the managed services,
you just click a button, you get
a replica, it's using disk snapshots.
If you're self-hosting, like if you have ZFS, presumably you'll have your snapshots somewhere.
You can restore from a snapshot and resynchronize pretty quickly.
But even then it takes a while to sync the ZFS snapshot on disk.
Tested that one.
Takes minutes, sometimes hours.
Oh, well, if you have like a 2
terabyte file system.
Nikolay: You mean the first send
receive?
So between the hosts?
Yeah, I understand.
Lev: Yeah.
It's surprisingly limited by, I
think, actually network.
I needed to, like, compress it and then do something else to it.
And I wrote a blog post about it on our old blog at Postgres...
Oh my god.
Michael: ML?
Lev: Thank you.
But the website went down so my
shame went down with it as well.
The blog post is gone.
I think it was a good blog post
but whatever.
I learned how to use compression
tools and ZFS and, you know,
big deal.
Michael: Just out of interest,
like, are people generally using
this with a managed service or
is it quite split between managed
service and self-managed?
Lev: Right now it's split.
There's a lot of managed service use; especially, failover for Aurora is a big deal.
Like failover detection, because
you don't run Patroni typically.
So it's able to like query the
pg..., sorry, what was the
name of the function?
Nikolay: pg_is_in_recovery.
Lev: pg_is_in_recovery, thank you.
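For reference, that's the check a proxy can run against each host to tell the primary from a replica:

```sql
-- Returns false on the primary, true on a standby still in recovery;
-- polling this after a failover shows which node became the new primary.
SELECT pg_is_in_recovery();
```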
Nikolay: In my opinion, to stop
confusing people.
Lev: The people who know, know,
and the people who don't know,
don't need to know.
Yeah, if you know, you know.
Yeah.
Right.
Yeah, and I'm sure you know Aurora
loves to failover on a perfectly
good day.
So it helps a lot with uptime.
But one of my biggest sharded deployments is literally, like, on-prem; they rent their own data center and everything.
So it varies.
Like that's the beauty of it.
Like operating on the application
network layer, you really don't
care that much where you're deployed.
Nikolay: So, based on this, almost a year since you started this, what's your current vision for your customer who needs pg_auto_failover?
And what's your vision for the next steps?
Where do you move?
Lev: Yeah, this is a really interesting question, because sharding is always going to be, like, the North Star, because it explains to people: this is a scaling product, ultimately.
You can start anywhere.
I already feel like it's a better connection pooler than PgBouncer.
I already feel like the load balancing aspect of it is a game changer.
And sharding, as much as it is a work in progress, is the North Star for us.
So that's how we want this project to evolve.
We want to make Postgres as scalable as Cassandra or Scylla or
Elasticsearch.
I don't know what people think about those particular databases,
but I think they're horizontally scaled, we can all agree.
So we want Postgres to be like that as well.
We're building, it hasn't been released yet, we're building an enterprise edition of PgDog, which adds the classic things like monitoring, understanding how everything works, a control plane, something that we can sell as a product to people.
You know, so you don't have to go through the many, many steps required to set up a sharding layer and network proxy.
So that should come out within the next few months, hopefully.
Nikolay: Will it include automated sharding with AI?
Like, press a button and your database is sharded already.
Lev: You want automatic sharding with AI as part of our feature set?
Nikolay: I think we
Michael: just want more consulting clients.
Lev: I don't think we'll ever get automatic sharding with AI,
but I don't know.
Right now my idea for that is: use pg_dump, get the schema out, pipe it into ChatGPT, and ask it what to do.
With a prompt.
And test.
Yeah, exactly.
And test.
Yeah.
I don't know.
A human being is able to come up with a sharding key really quickly.
Like I talked to like a senior database person at a company or
an app engineer and they're like, oh, my sharding key is like,
they already know it's org ID or customer ID or like they instinctively
know how the data is organized and they even know what the edge
cases are.
You know, like, I don't think you need AI for that one, unless you've literally never seen this database before.
But yeah, that's the way I think about it.
Nikolay: I don't think...
Michael: Yeah, on the vision stuff.
How closely are you paying attention to the other sharding projects in the space, and how do you differ in terms of your North Star versus what you think theirs is, based on what they're saying?
Lev: I'll be completely honest with you, I literally never look
at them.
And it's not because I think I'm cool or something, it's literally that I just don't have the time.
I have something in my mind, and I'm building it.
What the other guys are doing is usually more of a distraction than a learning point to me, personally.
So I have my vision of my North Star and I'm gonna follow it until I hit a wall or something.
And so far I haven't.
So that's how I think about it.
Nikolay: Great words to finish
this podcast, I guess.
Michael: Is there, is there anything
left that you wanted to
like mention that we haven't asked
about?
Lev: Ah, well, let's see.
I'm gonna do a post on Hacker News soon with all the stuff that we've done, but I'm just gonna blab about it with you, I know it's kind of boring.
We added a bunch of support for aggregate functions, you know, and we do query rewriting now, which is really cool.
Like, we take the AST and we add whatever we need to that query to execute it.
One thing that I think is really, really phenomenal is we support sharding key updates.
That's something I think Citus never got around to doing: changing the actual value of the sharding key and moving the rows between the shards.
Wow.
That one's really fun.
I don't know how often it's gonna
be used, but anytime somebody
asks us about sharding, they ask
about that feature, so it must
be useful.
We support multi-tuple inserts: we rewrite insert statements, we separate the tuples, and we send them individually to each shard, and all of that happens, you know, transparently, which is really cool.
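As an illustration of that rewrite, with a hypothetical table and sharding key (the exact rewriting PgDog performs may differ):

```sql
-- What the application sends through PgDog:
INSERT INTO users (id, org_id, email) VALUES
    (101, 1, 'a@example.com'),
    (102, 2, 'b@example.com'),
    (103, 1, 'c@example.com');

-- Roughly what each shard ends up executing after the proxy splits the
-- tuples by sharding key (org_id here):
-- shard 0:
INSERT INTO users (id, org_id, email) VALUES
    (101, 1, 'a@example.com'),
    (103, 1, 'c@example.com');
-- shard 1:
INSERT INTO users (id, org_id, email) VALUES
    (102, 2, 'b@example.com');
```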
We're just, we're focusing on usability,
like the idea is you
just deploy this and you don't
change your app, like don't rewrite
your queries, don't rewrite your
code and just make it work.
But, you know, obviously we're not there yet, but we're getting closer.
So every week there's something
new that comes out.
Nikolay: And it comes from the field, right?
Lev: Yep.
It comes from people asking me,
hey, is this supported?
I'm like, yes.
Almost, almost.
Next week.
Thank you for the suggestion.
Nikolay: Yeah.
Cool.
Yeah.
Production driven development.
Lev: It is.
I really enjoy it because every
single feature is immediately
used.
Like, it's great.
Nikolay: Cool.
Great.
Thank you.
Thank you for the update.
It's really good to see the project
is growing.
Excited to test it once again soon.
Lev: That would be great.
And I still have to review and
merge your PR.
Nikolay: It's OK.
Lev: It's been open for, I believe,
10 months, since the last
time we talked.
Nikolay: Something like this.
Good.
Michael: Well, yeah, really good to see you again, Lev.
Take care.
Lev: Thank you so much.
Michael: Catch you soon.
Lev: Yeah Thanks so much for having
me guys.
Bye bye.