Michael: Hello and welcome to PostgresFM, a weekly show about

all things PostgreSQL.

My name is Michael, founder of pgMustard, and I'm joined

as usual by Nik, founder of Postgres.AI.

Hey Nik.

Nikolay: Hi, Michael.

Michael: And today we have a special guest.

We have Harry Brundage from Gadget, co-founder and CTO there.

Hey, Harry, thanks for joining us.

Harry: Thanks so much for having me on.

I'm happy to be here.

Michael: It is our pleasure.

So you've been doing some cool things with Postgres for a while.

Recently blogged about a no downtime, a 0 downtime upgrade, but

there's a ton of other things we wanted to talk with you about

as well.

Where would you like to start?

Harry: Yeah, I'd love to tell you about this 0 Downtime upgrade,

just because, I don't know, it's like a point of pride of mine,

I guess.

It's always so frustrating to me when I get an email from 1 of

my providers.

It's like, hey, Sunday at 2 PM, we're going to be down for 30

minutes.

Like, hope that's OK.

So yeah, Gadget's an infrastructure provider.

Like, the developers who use us use us for mission critical apps,

so we just try to take it more seriously than that and like not,

you know, disappear.

So yeah, we did a 0 downtime upgrade with the help of Postgres.AI

and Nik, thank you so much for your help, Nik, what a treat.

Nikolay: Thank you for trusting us because you know, like, we

did it with other companies and always like in the beginning,

I like, I might hear pushback because understanding how much

work is needed there, you know, like, because it's far from

being well automated.

It should be in Postgres and it's improving, but

Harry: yeah.

Not quite yet.

Nikolay: Yeah, yeah, yeah.

So you did this journey and I thank you for not stopping.

Harry: Absolutely, absolutely.

So it was a lot of work,

Nikolay: it was, yeah.

Harry: I think too, it's like, to me it just kind of feels like

something that should be built into the managed providers.

Like I'm kind of surprised that so few of them have it.

I think the only 1 I could find that has like true 0 downtime

is PlanetScale.

Nikolay: But for Postgres,

Harry: the GCP options, I don't
know about the new Postgres one.

I guess I know they have it for
MySQL but not for Postgres yet.

Nikolay: For MySQL, yeah, I think
so.

I don't know details but for Postgres
I don't see yet anywhere.

Harry: But again, like what the
heck?

Like as an industry shouldn't this
have been the thing we kind

of like solved 20 years ago and
never looked back?

You know, I don't understand.

Nikolay: Yeah.

And there's no AI needed there.

It's just pure automation.

Harry: Yeah.

Terrifying button to press.

Nikolay: I think.

Harry: But

Nikolay: yeah, I think what happens
is that it should be automated

more in Postgres itself.

And most managed platforms like
RDS and others, they just sit

on the fence.

Well, I know they work and
do it, but they don't want to

automate things outside Postgres
too much.

For example, ANALYZE, right?

ANALYZE is not automated.

You always forget about this.

And they don't automate it, and
they just put it in documentation

on the shoulders of users.

Now, it's time to analyze.
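
For reference, the step being described is just running ANALYZE after pg_upgrade (or vacuumdb --all --analyze-in-stages); a minimal sketch, with a hypothetical connection string, might look like this:

```python
import psycopg2

# Hypothetical DSN for the freshly upgraded cluster.
conn = psycopg2.connect("host=new-primary.internal dbname=app user=postgres")
conn.autocommit = True
with conn.cursor() as cur:
    # ANALYZE with no arguments rebuilds planner statistics for every table
    # in the current database; repeat per database after pg_upgrade.
    cur.execute("ANALYZE;")
conn.close()
```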

And I started like asking around
like maybe a couple of years

ago, why, why, why 1 year ago and
then okay, Postgres 18 should

be out next week, right?

Finally, dump and
restore of statistics will be

fully automated.

So they will just exclude this
from their how-to manual, right?

So this is an example.

Things should be inside Postgres.

But I think still, they should
automate much better.

Indeed.

Not waiting for Postgres itself.

Anyway, I'm with you here, 100%.

Harry: I'm glad.

What I was going to say is I've
always attributed it, or the

absence of something like a true
0 downtime major version upgrade

to the fact that it's like more
than just Postgres that you need

to pull it off.

Right?

Like you need a connection pooler
or a proxy or something in

front of it sitting outside the
system and you need a bunch of

tight coordination between that
proxy and the rest of the system

to kind of orchestrate it all and
so like I know at least on

our previous Postgres provider like they ran the PgBouncer on

the same VM as Postgres itself.

And so they're kind of hamstrung by the architecture where they

can't have this piece that lives outside that can talk to 2 at

once, whereas we self-hosted our bouncers for exactly this reason

and can kind of do it all and orchestrate it all and it's like

sucks to self-host it but gets you that architectural flexibility

you know we needed that.

Nikolay: Yeah and PgBouncer supports pause/resume which helps you

achieve 0 downtime.
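
For reference, a minimal sketch of that pause/resume trick, issued over PgBouncer's admin console; the host, port, and user here are hypothetical, not Gadget's setup:

```python
import psycopg2

# PgBouncer's admin console is reached by connecting to the special
# "pgbouncer" database as an admin user.
admin = psycopg2.connect(host="pgbouncer.internal", port=6432,
                         dbname="pgbouncer", user="pgbouncer_admin")
admin.autocommit = True  # the admin console does not accept transactions
cur = admin.cursor()

cur.execute("PAUSE;")    # new client queries queue in the bouncer; server conns drain
# ... repoint the [databases] section at the new primary and write the config ...
cur.execute("RELOAD;")   # PgBouncer re-reads pgbouncer.ini
cur.execute("RESUME;")   # queued queries flow again, now against the upgraded cluster
```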

Harry: We're maybe nuts too and we run our PgBouncers in K8s

and do the whole rolling restart and whatever and it's turned

out to be, I think, an okay choice.

I think a lot of people may be shy away from that because it's

sort of super stateful, super critical.

Maybe this is a pet, not a cattle.

But we did it in case just because that's where the rest of our

compute runs and it's like easy to sort of bin pack and share

it all.

It actually really saved our butts recently because we're paying

a lot for cross-zone network transit to our databases.

Specifically, like once from our application pod to the bouncer,

and then again from the bouncer to the database.

And because the bouncer was in K8s, we could turn on the handy-dandy

like prefer zone local network routing configurations and kind

of have the application pod go to a zone local bouncer and then

to the database. And I think were we outside of K8s, that would

have been a lot more annoying to kind of like, teach all the

routing to be able to do the right thing.

Nik, you look scared.

You look skeptical.

Nikolay: No, no, I'm just thinking this is also a job for database

platform, I think, to take care of such things.

So yeah, and I know RDS has RDS Proxy, but it has cool features

PgBouncer doesn't and also lacks cool features PgBouncer has, so

it's a zoo.

And I think pause/resume is a wonderful feature.

Still, I have to emphasize it.

Michael: I was going to ask about your PgBouncer setup, actually.

Because your blog post on the upgrade is really cool, and goes

into good detail on the actual upgrade process.

But you didn't talk much about your PgBouncer setup, and how

many you're running.

And does all of your application traffic go through that?

Or does any of it skip it going to the database?

Is there anything interesting about that that you wanted to share

with people?

Harry: Good question.

I have 2 interesting tidbits.

The first is, I think our, so our PgBouncer deployment's a bit

weird, and then we have 9 instances.

Michael: Interesting.

Harry: Yeah, and we take our kind
of total connection count,

where we're targeting 400 kind
of server-side connections, and

split that up over 9. And the reason
for that is that same K8s

network-aware routing thing, where
the kind of up until very

recent versions of Kubernetes,
you needed 3 endpoints per availability

zone for the network-aware routing
to kick in.

So we had to use a minimum of 9
and initially we were really

skeptical about that because it
seems kind of bad right instead

of having like say 1 or 2 buckets
of a lot of connections, we

were kind of brokering them fairly.

You have many small buckets and
you could get kind of weird inefficiencies,

but it didn't come to pass where
that bit us.

It seemed actually okay in the
end.

And what I like about it is we
have a lot of CPU headroom, right?

Like PgBouncer, I think is still
largely single-threaded.

I don't know the innermost details
of it, but we're using the

fancy new prepared statement stuff
within it, and so I'm kind

of happy that I can give it 9 cores
instead of 2 or 4 or something

like that, and kind of be like,
you know.

Take as much as you need.

Michael: Yeah, nice.

Always good to have headroom.

Cool.

Anything else about the upgrade,
or like which bits were most

scary when Nik came to you with
the proposed plan, you said

pushing the button. Is it because
it is single-direction,

or did you have a rollback plan?

Harry: Good question.

We had a rollback plan, but it
was written rather than tested.

Because, so the way that we pulled
it off, or the way that we kind of

built the whole thing, is we built

a Temporal workflow.

So we're like huge fans of Temporal,
very easy to kind of script

something like this.

And the reason why we did this
is because it lets us get all

the nice kind of retries and idempotency
stuff for free, where

we get to write in a somewhat failure
oblivious way.

And we wrote our workflow to have
like a dry run capability,

where we do all but the very last
step.

So we would go do everything where
we like prepared the new host,

kind of replicated the data, made
sure we were caught up, did

the PgBouncer pause, and then
we just wouldn't reload the config

or change over and we'd just do
a resume and everything would

keep on going as it was.
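
A rough sketch of what a dry-run-capable workflow like this might look like, shown here with Temporal's Python SDK; the activity names are illustrative, not Gadget's actual code:

```python
from datetime import timedelta
from temporalio import activity, workflow

# Hypothetical activities; the real steps (provisioning, logical replication,
# PgBouncer PAUSE/RELOAD/RESUME) would live behind these names. Retries come
# for free from Temporal's default retry policy.
@activity.defn
async def provision_new_cluster() -> None: ...

@activity.defn
async def wait_for_replication_catchup() -> None: ...

@activity.defn
async def pause_pgbouncer() -> None: ...

@activity.defn
async def repoint_pgbouncer_at_new_primary() -> None: ...

@activity.defn
async def resume_pgbouncer() -> None: ...

@workflow.defn
class ZeroDowntimeUpgrade:
    @workflow.run
    async def run(self, dry_run: bool) -> None:
        timeout = timedelta(minutes=30)
        await workflow.execute_activity(provision_new_cluster,
                                        start_to_close_timeout=timeout)
        await workflow.execute_activity(wait_for_replication_catchup,
                                        start_to_close_timeout=timeout)
        await workflow.execute_activity(pause_pgbouncer,
                                        start_to_close_timeout=timeout)
        if not dry_run:
            # The one step the dry runs never exercised: the actual cutover.
            await workflow.execute_activity(repoint_pgbouncer_at_new_primary,
                                            start_to_close_timeout=timeout)
        await workflow.execute_activity(resume_pgbouncer,
                                        start_to_close_timeout=timeout)
```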

And so the terrifying part is we'd
just never actually done the

middle thing.

And we weren't really sure, like,
here's how we can test this,

other than like standing up a whole
parallel set of infrastructure,

which we were like, I'm not even
sure that will prove this 1

will work, right?

Like, just in my admittedly still
limited lifespan, like, the

things that go wrong are always
the things you didn't foresee,

right?

So it's like we dry-runned it a
million times, found all the

things we could foresee, and then
it was like the spooky, you

know.

Nikolay: Yeah, so really, it's
impossible to test switchover

in production, but I wanted to
emphasize it's possible to test

everything before that.

And it's great.

Because this is the beauty of this
procedure.

You create the whole cluster, which
is running already a new

version.

You see it's working.

You can touch it and it's great,
and so on.

But yeah, I agree, the scariest
part is the switchover, how to test

it properly.

Of course, it was tested in non-production,
but it's different.

Here we just need to jump at some
point.

Harry: Exactly.

It was the best kind of jump, where
it was like a big meeting

that we were all like scared of,
we all like sat down, Nik was

on the phone, and it was over in
like 7 minutes, where it was

like, oh, well, I guess it worked.

Like all the numbers look good.

Like yeah, that's it.

Onwards you know.

So I feel fortunate to work with
a team where, you know, that

was how it went as opposed to like
us running around screaming like

chickens with their heads cut off.

Nikolay: I wanted to thank you
for this, like emphasizing that

you use this Temporal framework
because we, like Postgres.AI, we're

always big fans of Go and we like
developed a lot of

stuff using Go.

But recently, our team, we have
a few Python guys right now.

And I just envision that probably
we'll learn from you in this

experience as well and do some
stuff with temporal as well.

So thank you for emphasizing.

Harry: Yeah, we're heavy, heavy
users of it.

We use it for these kinds of systems
administration tasks or

behind-the-scenes maintenance workflows.

But we use it for a lot of other
stuff too.

Like Gadget is mostly used today
by people building Shopify apps.

And when you build Shopify apps,
you have to receive a metric

shit ton of webhooks.

And so Gadget has like a background
kind of queuing system

that knows how to take a webhook
and process it later or process

it with retries or whatever.

So each 1 of those is a Temporal
workflow for us and so we're

doing I don't know something like
3k, 4k Temporal state transitions

per second also backed by Postgres
in this big-ass Temporal cluster.

And that's something I never ever
want to have to try to build

myself.

Like it's so complicated and nasty
and fun for some.

Michael: You mentioned in the post
also that sharding was on

your to-do list.

Yes.

You mentioned though you were using
a different database for

that now, like AlloyDB specifically
I think for the new setup,

but you said it was just for the
user data.

What's the separation there with
like the Shopify?

Is it the data of the Shopify apps,
for example?

Harry: Yeah, so Gadget is like
a general purpose app development

platform.

So If you want to make a to-do
list tracker or a ChatGPT clone

or a Shopify app, you can sign
up for Gadget and you get kind

of like a full stack dev environment.

So you get a place to run kind
of backend code.

It's all JavaScript, TypeScript.

You get a database or a sort of
virtual slice of a database to

store back-end data.

You get a front-end and React and
whatever and you're off to

the races.

You write back-end code, you write
front-end code, you define

your data models, you transact,
so on and so forth.

And then the Shopify part comes
in where we kind of have a managed

API connection to Shopify, where
we'll do the OAuth and the webhooks

and sync the historical data, et
cetera, et cetera.

So like many infrastructure providers,
we have a control plane

and a data plane.

The control plane stores our list
of users, who has what

access to what app, so on and so
forth.

Then the data plane is the actual
apps data, like this app has

a to-do model that stores, you
know, this many to-dos.

This app has the Shopify product
model and, you know, stores

this many Shopify products.

So when we started we just
stuck those in the same Postgres.

Yeah.

And that Postgres got too big and
so this major version upgrade

kind of precipitated splitting
the control plane and the data

plane which I think is like a pretty
obvious best practice.

Michael: When you say too big,
by which metric, like where was

the bottleneck
coming from?

Harry: Good question, so we had
about 8 terabytes of data in

the 1 instance, or sorry, our disk
was 8 terabytes big and we

were at 6 or something like that,
6.5, and we're like, okay,

this is scary, like we gotta figure
out the next thing, and we

were on PG13, which was end of
life long ago, and so we needed

to both do the upgrade and we just
wanted kind of a better roadmap

for how we were gonna deal with
growth.

Because Gadget has lots of these
e-commerce customers, BFCM is

always a quote fun unquote time
for us and like we'll see probably

this year like a 4x traffic increase
and like a 4x database QPS

increase and so that means like
we kind of need to be ready for

a 4xing of you know a database
CPU and our provider at the time

didn't sell a big enough 1.

So we were kind of like, we have
to do the horizontal shard,

or scaling sort of journey.

So what we did is we sharded in
order to move the bulk of that

6 terabytes out into newer instances
that had much more room

to grow, leaving only the reasonably
sized control plane left

over to do this major version upgrade.

And it simplified a few of the
steps of this 0 downtime major

version upgrade, because we didn't
have to replay 6 terabytes

worth of stuff on the new instance.

Michael: Yeah, yeah.

Makes sense.

But why not stick with the existing
provider for those shards?

Harry: 2 reasons, or 3 reasons.

The first is we're a Google shop,
and we get good pricing from

Google.

The second is Google claims up
to a 4X TPS increase with AlloyDB.

Like, I like Google and I don't
want to poo-poo them.

I'm not sure I've ever seen it
actually hit 4X, but we did see

like at least a 1.5X kind of like
TPS increase without really

much config change or tuning or
whatever.

I can't even say if it's actually
the same CPUs like under the

hood, you know what I mean?

I don't know what the initial provider
was using, but the instance

was faster for the same price,
so that was nice.

And then the other big thing with
Alloy is you don't pay cross-zone

network transit fees.

It's not like a normal VM.

I present that without comment
to the competitive forces that

may or may not be listening.

But yeah, it's free networking
to the Alloy instance.

Whereas if you're Aiven or Crunchy
or whatever and you run instances,

your customers have to pay from
2 out of 3 zones just to talk

to it, so that sucks.

And then the second thing is you
don't pay for replica storage.

It's a split storage compute sort
of setup, where the disk is

this sort of auto-scaled elastic
fancy log storage service thing,

and then you pay for your nodes
that kind of read and write from

that shared tier, but you don't
pay n times, you pay once.

So it's like 40-ish cents a gigabyte,
which is like more expensive

than like say a big fat EBS volume
but not that much more and

so if you're running like a lot
of replicas it ends up being

cheaper storage-wise and you don't
have like all the nasty IO

spikes from backups or any of that stuff.

Cool.

Nikolay: And unlike Aurora, I don't see the hidden cost here; with

Aurora, it's usually I/O costs.

Harry: Yeah, no I/O costs.

Nikolay: Yeah, that's interesting.

Harry: So our database bill came out to be about the same as

a pretty classic like GCP VM with, you know, they're not EBS

in GCP but like fat kind of SSD volumes, backing it, and we save

money on the networking, we save money on not having to pay for

storage again for the replicas, and we deleted our I/O spike

for the backups problem.

Nikolay: That's an interesting comparison.

So we're not even talking about features like the column store, or

column versus row store.

Even without that, it's a direct comparison.

Interesting.

Harry: Yeah.

We intend to make use of that column store.

Like, I'm skeptical.

I used to work on sort of like the big data warehousing team

at Shopify when I worked there and like I just don't really believe

that you can force Postgres to do the kind of super wide throughput

stuff you need for crazy scale analytics.

But you can definitely get pretty far with it.

You know what I mean?

So my guess is some of our operational reporting use cases where

you're traversing half a million, a million rows to answer a query

on a read replica with this column store is going to be fine.

But in the like billion, 10 billion scale, I think purpose-built

analytics databases still should exist.

Michael: We've had a few of those conversations recently.

Harry: I'm sure, I'm sure.

Michael: I had 1 more boring question on the major version upgrade

before we move on to more interesting things.

It looks like you're on Crunchy Data, you were on Crunchy Data

before for the control plane stuff and are going to be again

for the upgrade.

But you only upgraded I think from 13, which by the way you're

a bit harsh on yourself.

I don't think it's quite yet out of support.

I think it's

Harry: like a

Michael: couple of months time that'll be out of support.

But you only moved to 15 and I think Crunchy have always been

very good at supporting the latest versions so probably 17 is

available.

So why not?

Like why 15 not 16 or 17?

Harry: Yes, Crunchy has been great at that kind of thing.

I think the main reason was...

Shoot, you know what?

I don't even remember.

Michael: Maybe Nik knows.

Nikolay: I also don't remember.

I remember that Crunchy had different
end of life dates somehow.

Crunchy Bridge had different dates.

Harry: Yeah, they deprecate support,
or they stop sort of long-term

support before the full-on end
of the kind of core support.

But they, in fairness, they gave
us extreme advanced warning

and like we knew it was coming
for a long time and they supported

it for 4 years or whatever, which
is, I think, pretty good.

I don't remember.

I think it might have been a desire
to keep the versions homogeneous

across all our instances.

Sure.

And for really silly reasons, the
Alloy shards that we stood

up were stuck on 15 and couldn't
be newer.

Michael: Yes, that would explain
it.

So, Alloy, I think because it's
a fork and it's probably quite

hard to maintain fork, I don't
think they're as good at keeping

up with the major versions.

So, there's a good chance you wouldn't
even have the option of

going to 16 or 17 on there.

I don't know

Harry: for sure.

That's correct.

That's correct.

At the time, 16 was GA and 17 wasn't
even out yet.

Now 17 is out, but it's still in
preview.

Like, you can't, shouldn't run
production workloads on Alloy

17.

But the weirdest thing was we wanted
to go to 16, but they just

don't have pg_squeeze support on
their 16 version.

They have it on 15 and they have
it on 17 and not on 16 and couldn't

tell you why, but we squeeze a
lot, so we really cared about

that.

Michael: That makes so much sense.

In my head, the only reason I could
think of is because now you're

so good at major version upgrades,
you wanted to give yourself

more to do in the future.

Harry: Let's rinse and repeat.

Yeah.

Nice.

Michael: All right.

So you're running Postgres for
a lot of different workloads.

Are there any challenges or fun
things you wanted to share on

that front?

Harry: I think maybe a good 1 just
for the audience and like

if you're listening out there and
you think I'm full of it, like

please send me an email.

I would love to know what you think.

But we did the classic Postgres
is a queue thing early in our

life.

And I think that, well, it's
on my mind a lot.

So a lot of people who have like
background queuing problems

use Redis.

When we were at Shopify or when
I was at Shopify, we used this

library called Resque.

There's another really popular
1 called Sidekiq, and in Node, it's

BullMQ.

There's a litany of them, right?

They all seem to use Redis.

We decided that for our user, who
tends to be like a front-end

dev or sort of someone who isn't
deeply familiar with like data

consistency or data durability
concerns, that we didn't really

want to be like, hey, we'll happily
process your background jobs,

but we'll drop 0.5% of them and
not be able to tell you which

ones we dropped and not be able
to tell you when it happened.

So for our background queue product,
we said it's going to be

durable.

Not only is it going to be durable,
it's going to be transactional,

where it's like when you do some
business logic and you enqueue

a job within a transaction, if
the transaction commits, then

the job will get enqueued.

And if the transaction doesn't
commit, the job won't be enqueued,

and that's the transaction boundary.

So we were like, this is nice.

This is us kind of trying to prioritize
our customers' sanity.

And I think it's the kind of thing
that's like annoying, because

not that many of them are going
to notice or care.

It's just going to be no data quality
issues for this business

logic.

But that locked us into needing
Postgres to be on the hot path

for those enqueues, right?

We have to participate in that
business logic transaction to

kind of save the thing.

We knew we didn't want to build
a giant queue of our own.

So we knew we were going to get
it into Temporal and let Temporal

take over, but we sort of needed
like a Postgres outbox, if that

makes sense, where it's like, you
write to the outbox and then,

you know, you're slowly forklifting
out of the outbox into the

actual system eventually.

And then Temporal itself is durable.

And so, yeah.

So that's the architecture we went
with.
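
A minimal sketch of that transactional outbox shape, with an illustrative jobs_outbox table and hand-off function rather than Gadget's real schema:

```python
import json
import psycopg2

def enqueue_job(conn, job_type, payload):
    """Runs inside the caller's business transaction: the job row commits
    or rolls back together with the business data."""
    with conn.cursor() as cur:
        cur.execute(
            "INSERT INTO jobs_outbox (job_type, payload, state) "
            "VALUES (%s, %s, 'pending')",
            (job_type, json.dumps(payload)),
        )

def relay_batch(conn, submit_to_temporal):
    """Forklifts committed outbox rows into the durable queue (e.g. Temporal)."""
    with conn:  # one transaction per batch; commits on success
        with conn.cursor() as cur:
            cur.execute(
                "SELECT id, job_type, payload FROM jobs_outbox "
                "WHERE state = 'pending' "
                "ORDER BY id LIMIT 100 FOR UPDATE SKIP LOCKED"
            )
            for job_id, job_type, payload in cur.fetchall():
                submit_to_temporal(job_type, payload)  # caller-supplied hand-off
                cur.execute(
                    "UPDATE jobs_outbox SET state = 'enqueued' WHERE id = %s",
                    (job_id,),
                )
```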

But as we sort of built it out
more and more, we learned about

more limitations of Temporal.

Like it came out of Uber from this
system that kind of predates

it called Cadence.

And my guess is they learned all
this stuff the hard way and

built in a bunch of limits that
make it, you know, easier for

the operators to guarantee certain
sort of performance characteristics,

but really annoying for people
like me.

So 1 of those limits is when you
start a job, there's a maximum

payload size that's pretty small.

I forget what it is exactly.

I think it's like 50 megs or 5
megs or something like that.

And then as that job executes,
it can only generate X amount

of state transitions, as they're called,
or events, in its processing,

where it can only have a certain
number of retries, basically,

and you need to sort of re-architect
your temporal workflows

if you want to overcome that.

And again, in our case, our users
are kind of like, hey Harry,

I have this 400 megabyte video
file that I'd like to process

in a background action.

Can you just make it work please?

Can you not make me like re-architect
my entire thing to toss

it into cloud storage?

Whatever, it's ephemeral data.

Like I don't need it in the cloud.

I just want to push it to this
thing and work it within the background.

So we found that the Temporal limitations
were annoying enough

that we actually needed to stage
a bunch of each job's data outside

of Temporal.

And so we ended up kind of having
this nasty hybrid where that

Postgres outbox I mentioned sort
of became like a store for a

bunch of extra stuff and then the
temporal workflow as it went

it would be you know transacting
with its state storage and doing

its state transition things, but
then also updating what is now

no longer the outbox to be our
source of truth, it powered our

UI, it had all these extra params,
yada, yada, yada.

So we weren't even really using
Postgres as a queue.

We weren't having the classic SKIP
LOCKED problems and whatever,

but we were just having like this
sort of like request scale

data transmit, like 1 webhook
would be 1 write to this outbox

to enqueue, 1 write to start processing,
1 write to finish processing.

So just like massive, massive tuple
turnover.

And we also needed to build a UI
on this stuff.

So we only could get a few of them
to be like HOT updates, like

many of them were not.

And so we were just like destroying
this poor Postgres instance

with what I would call like less
important data than the actual

data plane data but still important
enough that we felt that

merited being on a disk.

So all that to say we ditched Postgres
for this and we switched

to Bigtable, which is like Google's
fancy super distributed key

value store and it friggin sucks.

It really sucks and the reason
is no transactions, right?

It's like a way worse model for
me as a programmer and a systems

designer, and a way more scalable, fundamentally supersized system

under the hood.

But boy, oh boy, do I miss the creature comforts, we'll call

them, of secondary indexes and transactions and a sane query

language, let alone the gazillion and 1 features that a full

relational system has over a key-value store like that.

So we use Postgres, I think, by default for all our workloads.

And then when we find something that we're like, okay, I think

there's maybe a better fit here.

We try to switch it out for the next best thing.

But it's like, it's really hard to beat.

And I know this is maybe unsurprising on a Postgres podcast,

but it's like, I just wish I didn't have to do that.

I could have stayed with my creature comforts and got my scale,

you know?

Nikolay: Yeah, so I'm curious what was the number 1 reason.

Is it like bloat maybe or?

Harry: Yeah, it was like bloat, tuple turnover and just like

the fact that we were like always vacuuming.

Nikolay: Yeah, yeah.

I can imagine.

Harry: Another big motivation for switching to something akin

to Bigtable is like compute elasticity, where we can kind of

like Bigtable auto scales by default.

You can't even turn it off, I don't think, where they just add

more nodes as demand kind of grows.

And for us, where we're kind of like the background subsystems

are the thing that absorb load, right?

When you have a flash sale on your Shopify shop and you need

to enqueue a bunch of stuff, we need this thing

to be able to handle unpredictable surges.

And it was a bummer at the time, we didn't have any way to do

a 0 downtime, like non-disruptive Postgres resize.

Now we do, but even then I don't really want to be doing it automatically,

like in response to load.

I kind of want something like Bigtable that's like, 5 seconds

later you can get your capacity expansion.

Nikolay: Yeah, well, I can see it easily.

Harry: I think, I think like My ideal would be a million tiny

Postgreses and have some fancy system for like disaggregating

a logical Postgres from a physical VM.

Or if I have like, let's say I shard my workload into 16,000

Postgreses, but they only run on 4 VMs.

And then as I need more compute, they kind of split out, get

bigger, and whatever.

But that's like a twinkle in old man Harry's eyes.

Nikolay: There is work in this direction by multiple teams right

now.

So we have good format.

Harry: Go on.

Go on.

Yeah.

Nikolay: Yeah.

There is PgDog.

There is Multigres and PlanetScale also.

They're all working in this direction.

Michael: And Multigres specifically is 1 of the things Sugu

mentioned as well.

From his experience at YouTube, he loved having lots of smaller

databases and he much preferred that in terms of reliability

and localized outages even.

So there were just so many benefits

Nikolay: Highly automated operational stuff is in the roadmap, like

upgrades as well

Michael: and rebalancing yeah

Harry: yeah I am forgive me for another brief aside but are you

familiar with Azure Cosmos DB?

Like the kind of main 1, not the Citus 1, the main 1.

Michael: I thought Cosmos was Citus.

I get confused by their names though.

Harry: So I think there is Azure Cosmos DB for Postgres, which

is Citus.

But I think there's, and I'm not deeply familiar, so I may be

totally wrong here, but I think there is a different original

Cosmos DB that's kind of the Azure answer to Dynamo and Bigtable.

Like they're kind of like super scale, sort of less sophisticated

query patterns, but guaranteed scale for whenever.

The way that it works, or the way that it scales is you specify

like a partition key with every piece of data, and you're only

allowed partition local transactions.

So it's not like a Spanner or a Cockroach where the whole key

space can be transacted against together.

It's just like a million tiny partitions and then every partition

could be a max of 50 gigs.

And to me, that's like so desirable because it means like 1 little

part, you get, like, node-local performance, but the same horizontal

scaling properties of the super scalers, but without necessarily

two-phase commit or Paxos or whatever.

And so I'm hoping someone manages to do that.

I think it's maybe, what's it called, Neki, like the Vitess

for Postgres.

Did I get that

Michael: right?

It's Neki.

I have no idea how to pronounce it.

Harry: Yeah, same.

I have no idea.

But I'm hoping that's what it is, which is like tiny little boundaries

that force me to design my data in a way that makes the transactions

efficient where you don't need a bunch of computers to agree

on a Transaction outcome and instead
you can just funnel them

to the right place
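
A toy sketch of that "funnel it to the right place" idea, assuming a made-up set of shard DSNs and a hash on the partition key; every transaction stays local to one small Postgres, so no cross-node agreement is needed:

```python
import hashlib
import psycopg2

# Hypothetical fleet of small Postgres instances.
SHARD_DSNS = [f"dbname=app host=pg-shard-{i}.internal" for i in range(16)]

def shard_for(partition_key: str) -> str:
    """Map a partition key deterministically onto one shard."""
    digest = hashlib.sha256(partition_key.encode()).digest()
    return SHARD_DSNS[int.from_bytes(digest[:4], "big") % len(SHARD_DSNS)]

def run_partition_local_txn(partition_key: str, statements):
    """All statements execute in one transaction on the partition's shard."""
    conn = psycopg2.connect(shard_for(partition_key))
    try:
        with conn:                    # single-shard transaction boundary
            with conn.cursor() as cur:
                for sql, params in statements:
                    cur.execute(sql, params)
    finally:
        conn.close()
```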

Nikolay: Unfortunately, they all,
not all but many of them, came

to me and asked, have you used
2PC?

I said, no, I always avoided it,
but so they need it, unfortunately.

Of course, like some workload will
avoid it, but some workload

will still need 2PC.

This is the sad truth.

Harry: Totally.

Totally.

I think, like, the thing is, I
used to work at Shopify circa,

I don't know, 2012, 2013.

We were running like a MySQL fleet
at the time, and we felt constrained

by the biggest database servers
we could buy.

Like we were sharding, but it was
still like the biggest customers

could push the 1 instance, I don't
know what it was, probably

64 threads, like to the max. I'm
not really feeling that way

anymore.

Like I feel like there are few workloads
that don't have a natural

partitioning, where you need more
than 1 server's worth of compute

just to satisfy 1 partition's worth
of work.

I mean, maybe you talked about this
last time.

But it's like, you can always split
it up, and when you split it up,

you don't need, you need, in aggregate,
many computers, but 1 transaction

boundary doesn't need more than 1.

And so I'm just, and like the hardware
has just gotten so much

better in that 10 years that I'm
not sure I could grow my company

fast enough to need that, right?

It's like maybe if we had like
a graph problem, maybe if we were

Facebook or whatever, but as it
stands right now.

Nikolay: That's true.

And on GCP, if we don't look at
AMD, we look at only Xeon Scalable

fifth generation, the best available
C4 nodes, I think, right?

You can get 192 cores or so, but
if you go to AWS, you will have

even more, like almost, I think,
800 cores also the same type

of processor, and this is already
huge.

So you can scale 1 node.

Harry: It's just bonkers.

Like, that's ridiculous.

Yeah.

It's a lot of doge coins, you know?

Sorry, so that was a long-winded
rant.

But what more can I tell you about
that temporal sort of Postgres

workload setup thing?

Michael: Before we started, you
mentioned 1 of the most interesting

things you're working on is that
you're serving other people's

workloads that you don't control
the design of, like schemas

and design decisions that you don't
control.

How do you stay sane while doing
that?

And also, what are the interesting
parts of that?

Harry: Great question.

You hit me with the question that
just invalidates my entire

prior rant, which is like, yeah,
you can partition your workload

if you think hard about the schema
design, but yeah.

So, okay, let me set the stage
a little.

So, Gadget's, like, reason for being.

Like, the reason why we're working
hard on this thing is because

we kind of believe software development
is still hard, unnecessarily.

That, like, when you go to spin
up an app on the weekend or your

next big business or just, like,
solve a small problem in your

life with software.

You spend a lot of time on what
we call like baloney problems

of provisioning infrastructure,
setting up connection poolers,

like worrying about the autovacuum
schedule, so on and so forth.

We just think that's all a distraction
from actually building

something useful.

It matters, like you can't ignore
it, but you also ideally don't

want to have to deal with it.

So similar to the people who came
before us, like the Firebases

of the world, We kind of want to
delete a bunch of those problems

from our customers' lives, take
them on ourselves, and allow

them to focus on whatever unique,
interesting custom problem

they want.

Like, recently, it might even be
allow cloud code to focus on

whatever unique, interesting problem
they want, right?

But, like, by kind of deleting
infrastructure foot guns and systems

integration, we just let people
kind of succeed faster.

And then I guess I should say,
bearing in mind the audience listening

here, it's not for everybody, right?

Like, if you have the perfect knowledge
of the Postgres schema

you want, Gadget's probably going
to be less efficient than that,

right?

Because we've added lots of layers
of abstraction.

We have our own operations and
infrastructure team kind of managing

this stuff.

So I'm not trying to say, like,
no 1 needs to care about these

problems.

I'm just trying to say that a lot
of people, I hope, don't have

to care about these problems so
that me and my team pull our

hair out stressing about them.

All this means is there's this
tension between asking the developer

to be more specific about their
data access patterns and adding

friction for them, like, on the
path to success, right?

Every time you say, do you want
4, 8, or 32 bytes for this number,

you're posing what I would
call a computer baloney question,

rather than: it's just a number,
like, you know, screw off, I

don't know how big it's ever gonna
be, just you figure it out, you know.

So in Gadget we have like a number
field type instead of an int and

a float and a decimal.

We've taken this

to a pretty intense extreme where
when you do data modeling in

Gadget you specify tables, we call
them models because they include

behavior as well, but you specify
tables and you specify fields,

you can specify relationships among
them, but you don't run any

SQL to realize those schemas.

And then you get a kind of fully
auto-generated GraphQL API for

interacting with those tables without
doing anything else.

You just, you specify your schema,
you get the API.

We generate a nice client for that
API, we generate nice docs

for that API, but the idea is like
you say, to do title string,

completed boolean, and now you
immediately can write some React

code for a form, or you can ask
the AI to write some React code

for a form to like transact against
that stuff.

All this to say, we're responsible
for the SQL that, or the schema

of those tables under the, like
the real tables under the hood,

as well as the SQL that that API
generates and executes.

So this is a win because we can
avoid things like the

N+1 problem, like we know how to
do that good.

We can implement things like 0
downtime cutovers or kind of maintenance

modes where we know how it's like
this app is in the process

of being moved to a different shard,
like, hold on a second,

we'll finish that and then serve
your request, you know.

It's powerful that Gadget sort
of owns the whole stack.

Oh, we can implement other stuff
like cloud storage, where you

can store data blobs in the right
way, instead of within Postgres,

you know.

But it means that all we know about
a given application is what

we've collected from the developer,
which is that list of tables,

the list of fields on the tables,
and the relationships among

them, Not stuff like, ah, this
is a shard key, or this is how

I intend to transact, or if I'm
going to make an HTTP call in

the middle of my transaction, like
2 days from now when I change

my code.

So we picked this bed and we made
it on purpose.

The idea is we want building high
quality software that scales

and like meets the demands of Black
Friday, Cyber Monday to be

more accessible and to not be locked
behind you know whatever

between the 3 of us 60 years of
Postgres experience or something

like that, right?

You should be able to build and
get an app going way faster without

that.

So the gist of it is we need some
way to structure our data under

the hood within our Postgres such
that people can't take down

the platform.

They can't use more than their
fair share.

We just need to be safe by default
where we're committing to

serving this API, and regardless
of what strange queries you

throw at it, strange transaction
patterns, poorly written code,

either out of ignorance or malice
or whatever, just needs to

still work.

Same as like a Firebase or yeah,
those other companies I mentioned

before.

And the way that we do that is
like kind of 2 main ways.

We don't give everyone their own
instance and say it's your job,

like you manage the performance.

I think that's what, say, a Neon
or Supabase does, which is

still helpful, but it's different,
right?

They're sort of saying, you get
this much resources and within

that, it's up to you to optimize.

We don't say that.

We charge by operation, more usage-based,
like say an Upstash

Redis or something like that.

And then we consider ourselves
Gadget on the hook for the performance

of those operations.

And that means that, again, we
have to find some way to not process

queries that could take us down.

So we do 2 things.

We refuse to execute a query unless
there's an index on the predicates.

And that's just a safety must,
basically.

We can't run unindexed queries
because we can't know that they'll

process fast enough for it to be
safe to run at scale.
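
A simplified sketch of what such a guard could look like; the metadata lookup here is illustrative, not Gadget's real implementation:

```python
class UnindexedFieldError(Exception):
    pass

def assert_filters_are_indexed(model, filters, sorts, indexed_fields):
    """Refuse to build SQL for any filter/sort field that has no index.

    `indexed_fields` would come from the platform's own schema metadata,
    since the platform creates every index itself."""
    for field in list(filters) + list(sorts):
        if field not in indexed_fields:
            raise UnindexedFieldError(
                f"{model}.{field} is not indexed; refusing to run an "
                "unbounded scan on shared infrastructure"
            )
```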

And then we rate limit.

And that allows us to say like,
you know, no 1 can have more

than their fair share.

But the crazy thing is almost no
1 hits their rate limits.

We consider that like a regrettable
private platform protection

mechanism as opposed to like a
pricing mechanism or like an incentive

and the reason that no 1 hits
their rate limits is because

if you just force every query to
be indexed like you get pretty

far.

The planner is
so good and like the bitmap index

combining stuff gets you so far
that almost nobody, or any time

we ever have a database CPU spike,
it's usually because something's

gone wrong with our creation of
the indexes, like Gadget's automations,

their Temporal workflow, surprise,
like have failed to set the

index up properly, or we like let
the code that makes the API

calls ship before the index was
actually in place, or something

like that, you know.

But it's like Postgres just works
if you do this.

So it's really naive, but it works.

Nikolay: Are you okay to share
how many indexes you've already

created?

Harry: Yeah, we have, I think,
probably over 2 million indexes

in our system right now.

Actually, sorry, I take that back.

At the time of the sharding we
did, but we've reduced that drastically

since then, but for kind of unrelated
reasons.

But yeah, basically if a user wants
to be able to, in their API,

sort or filter on a column, we
just say there has to be an index.

And so we have, I don't know, most
of the columns indexed.

And like, oh god, I can feel the
hate mail now, talking to a

giant legion of database engineers who are all screaming in their

cars right now.

Why are you indexing it all?

No, no, no.

The modern hardware that the respected

that the cost of this, while substantial, is not actually that

crazy.

It's higher, no doubt.

But we're not in the days where you need to be so incredibly

stingy about every index you make, especially when the cardinality

of the tables are like a million, 2 million records, as opposed

to like 10 gazillion, right?

When our first 10 gazillion customer comes along, we're going

to work with them closely to be like, you don't need that index.

Like, let's turn that off.

Your stuff will go faster.

But the vast majority of people, the overhead is it adds up,

but it's not so meaningful that our business is busted.

Nikolay: Yeah, so I wanted to say that this is a pet versus cattle

philosophy.

If we like think about indexes as a pet, we need to get rid of

unused indexes, take care of bloat and all.

But at your scale, before the schema became sharded, with millions of

indexes, you think about them as cattle as well.

Just inside 1 Postgres instance, right?

Before it became sharded.

This is an interesting approach.

And I agree, since we have small tables, we know there is

write amplification: if an update is not a HOT update, it

needs to update all those indexes, and some of them are not used,

but since the approach by design treats them as cattle, okay, it's not that

bad.

Yes.

This is an interesting case, I must say I never saw it before, it's

interesting.

Michael: I've heard of 1 other case, which Heap Analytics

blogged about, something about, yeah you know the 1, great.

I sure do.

So that was the only other time I heard about this.

Nikolay: Was it also where they discussed that they create indexes

without CONCURRENTLY, right?

Michael: Yeah, because you can't do more than 1 at the same time.

Nikolay: Yeah, to parallelize work.

Michael: Yeah.

Harry: What's neat or strange or weird about Gadget is we kind

of participate in the whole SDLC, like the whole life cycle of

like authoring a feature, debugging it, deciding it's good, merging

it and deploying to production.

Gadget has a development environment as well as the production

environment.

And so we kind of get to witness the moment that you want to

go live.

And we just do a normal style, call it Rails or whatever, migration

where on deploy, we create the indexes concurrently behind the

scenes.

Then we ship the code that makes use of them.
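
A minimal sketch of that deploy-time step; the names are illustrative, but the CONCURRENTLY and autocommit details are real Postgres requirements:

```python
import psycopg2

def ensure_index(dsn, table, column):
    conn = psycopg2.connect(dsn)
    # CREATE INDEX CONCURRENTLY cannot run inside a transaction block,
    # so autocommit is required.
    conn.autocommit = True
    index_name = f"idx_{table}_{column}"
    with conn.cursor() as cur:
        cur.execute(
            f'CREATE INDEX CONCURRENTLY IF NOT EXISTS "{index_name}" '
            f'ON "{table}" ("{column}")'
        )
    conn.close()

# Deploy order: ensure_index(...) first, then release the API code that
# filters or sorts on that column.
```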

And then it's like, we can do that because we're not sort of

exposing the raw SQL and asking people to manage their own schema,

so on and so forth.

But what do I want to share about this that I think is the most

interesting?

Oh, yeah.

So the first thing is, like, giant catalogs where you have, you

know, we had, I think, at peak in 1 instance, like 250,000 tables,

1.4 million indexes.

Giant catalogs suck, but were more manageable than I was guessing.

I was assuming there was going to be some O(n²) system

table inside Postgres, or some doinky little thing that just

breaks when things are that big.

But no, that instance operated well.

The thing that did break was a bunch of the analytics tools out

there.

Like we were using.

Like pg_stat_statements?

pg_stat_statements was OK.

It was big.

It takes a lot of data, but it wasn't, like, an ungodly amount

of performance overhead.

But, like, pganalyze and many of the kind of collectors, they're

like, let's get the size of every table every minute, you know?

It's like that's a hard no.

Nikolay: Our monitoring also didn't work properly because of

the catalogs.

Harry: Yeah, but you guys only had to make a couple tweaks.

I think we're talking about pgwatch2, like the Postgres.AI

remix, and it works well for us.

It's kind of the fastest 1.

I haven't tried pgMustard, but I must.

I

Michael: owe them a minimum.

It's not

Nikolay: a money-ching code.

Harry: It's all right.

So that was surprising.

But the other, actually, this is just a fun fact.

So we did our sharding, we moved this gazillion number of tables

over to AlloyDB.

We turned on their analytics thing.

It's called Advanced Query Insights.

And it's not bad.

It's a nice UI for pg_stat_statements, souped up a bit.

They store all the data for it in a private database in your

instance.

So like you have an almost superuser but not a full superuser

and there's this one database you can't touch that has it in it.

And currently, this is my fault, I set it all up and I was

like, oh, this is probably going to be not that expensive.

Currently our like AlloyDB Insights database is almost 1.5x the

size of the actual database.

And I think it's because the query diversity is so high, and

all the data that they're keeping about each one adds up so much,

that yeah, we're paying more than double in storage just for

the privilege of this thing.

And half the queries time out.

But again, I'm going to blame me for that, not Alloy for that.

We just need to turn that off.

Nikolay: I'm curious if the main, like if we have just one AlloyDB

database and we store metrics inside the same thing, if it's

down, how will we troubleshoot?

Harry: I think it is only like query insights like the equivalent

of pg_stat_statements.

I guess you're right, I don't know.

Michael: Isn't the same true for Postgres like with pg_stat_statements?

Nikolay: Well yeah, but that's why we need external monitoring

tool which will pull data and store it independently for troubleshooting.

I thought about this like self-observed or externally observed

system.

I think we actually need both.

But if it's just self-observed, it might be a problem.

Harry: Yeah, it goes full circle to that same thing we were talking

about with the infrastructure provider earlier, which is like,

they don't have anywhere else to put it unless they're going

to pay for it.

Right?

So by putting it in my instance, I pay for it, which I am paying

for dearly, and I'm now going to turn off, you know, but I agree.

We, you do get the like system metrics that they store in their,

you know, fancy time series database, but yeah.

Nikolay: Yeah, I didn't know about this.

Interesting.

Harry: Yeah, I think, if I may, the most interesting thing we've

done to manage this ungodly number of indexes and tables is classic

computer science, a hybrid strategy.

What we noticed is many, many, many Gadget apps are like tutorials.

It's like someone wants to kick the tires, or they're building

a to-do app and they only ever add three to-dos.

It's kind of excessive to create them one to-do table, one user table,

indexes on many of those columns to store zero to five rows.

So we use a thing we call a storage strategy pattern where different

Gadget models are backed by different data.

We still store them in the same Postgres, and that's so that

we can still transact against them together.

Like we support real, actual Postgres transactions, but we use

doinky little JSONB tables when the cardinality is small.

So we have 1 shared JSON table, it has an ID, a model identifier,

and then just field values, and we don't index that at all.

No GIN, no nothing.

We just do a table.

You could call it a table scan even though it's not, it's like

a range scan.

And then we just bound the cardinality of that.

And then when we witness the, I think it's like, 2,500th record,

we swap out kind of a real table that has the real indexes and

whatever.

And so that cut our table count and our index count down by a

lot, a lot, a lot.
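
A rough sketch of that hybrid storage strategy, with an illustrative schema and threshold rather than Gadget's actual ones:

```python
# Shared table for low-cardinality models: rows for many models live together,
# keyed by which model they belong to, with their fields packed into JSONB.
SHARED_TABLE_DDL = """
CREATE TABLE IF NOT EXISTS shared_small_models (
    model_id     bigint NOT NULL,   -- which app/model a row belongs to
    id           bigint NOT NULL,
    field_values jsonb  NOT NULL,   -- deliberately unindexed: no GIN, nothing
    PRIMARY KEY (model_id, id)      -- reads become a bounded range scan per model
);
"""

PROMOTION_THRESHOLD = 2_500  # roughly the row count mentioned above; illustrative

def storage_for(model_row_count: int) -> str:
    """Decide which backing store a model's reads and writes should hit."""
    if model_row_count < PROMOTION_THRESHOLD:
        return "shared_jsonb"      # scan the model's slice of the shared table
    return "dedicated_table"       # promoted: a real table with real indexes
```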

Again, that swap is 0 downtime, and it has to be.

That's again our commitment to our customers.

And that was actually quite difficult when someone's

doing a bulk load of data or whatever.

When we witness that 2,500th record, we have to kind of like pause

traffic, do the switcheroo, resume traffic without, like it'll

be a latency spike, but not an error spike.

And I think that's kind of, it's Gadget is a bundle, right?

Where we have this backend runtime, this frontend runtime, this

database, all these things that work nicely together.

And we see that as validation that the bundle's showing through.

Or it's like, we can only do that because we have control over

the app tier, control over the load balancers to be able to be

like, ah, hold on a second.

We've got to do some maintenance on your app real quick.

We won't drop any requests.

We'll do a little switcheroo without anybody knowing, and you're

off to the races with the full performance after that.

Michael: That's a really neat trick.

I like that.

You must have provisions in that code that doesn't allow

queries without indexes to, like, account for tiny tables... yeah,

oh my goodness.

This sounds painful.

Harry: We've been working on it for like 4 years, so yeah, yes,

but I think we're in a good position.

Like, I'm excited for BFCM this year, which for operations staff

is, I think, a rare thing to hear.

Like, I think I'm excited for the challenge, you know?

Michael: I saw the color drain from Nikolay's face.

Nikolay: Yeah.

Yeah.

Do you think there might be some point that true Postgres will

be fully replaced in your company by something?

Or will there always be a place
for true Postgres, not AlloyDB

Postgres-like?

Harry: Great question.

I think yes.

I do think it will be fully replaced.

And I think it's only because we
haven't seen any Alloy downsides.

Like while it's a fork, it's still
99% true Postgres.

So we have like pgvector support,
for example, and pg_squeeze

support.

It's not like a wire compatible
thing.

It's like mostly actually Postgres
with the WAL and the block

storage kind of ripped out.

Nikolay: It also has lightweight
locks and so on, all those problems.

Harry: Yes.

Yeah.

Yes, true.

I think we're going to stick on
Alloy, but not because it's the

very best, but because now we have
familiarity, Now we know we

can maintain it, now we can scale
to Valhalla.

So we've sort of unlocked something
and we've got other stuff

to do.

I think the thing that would make
me revisit that is OrioleDB,

or whatever the...

Whatever InnoDB for Postgres is.

Like, when that comes around and
I don't have to vacuum anymore,

that seems like a pretty epic win
that I'd be interested in realizing.

Nikolay: Makes sense.

Great answer.

Great, thank you so much.

Enjoyed.

Thank you

Harry: guys, I really appreciate
getting to nerd out.

These are the things that keep
me up at night.

It's like, 1.4 million indexes.

What do I do?

It's nice to let it out a little
bit.

Nikolay: Cool.

Thank you.

Michael: It's our pleasure.
