
Gadget's use of Postgres
Michael: Hello and welcome to PostgresFM, a weekly show about
all things PostgreSQL.
My name is Michael, founder of pgMustard, and I'm joined
as usual by Nik, founder of Postgres.AI.
Hey Nik.
Nikolay: Hi, Michael.
Michael: And today we have a special guest.
We have Harry Brundage from Gadget, co-founder and CTO there.
Hey, Harry, thanks for joining us.
Harry: Thanks so much for having me on.
I'm happy to be here.
Michael: It is our pleasure.
So you've been doing some cool things with Postgres for a while.
You recently blogged about a 0 downtime upgrade, but
there's a ton of other things we wanted to talk with you about
as well.
Where would you like to start?
Harry: Yeah, I'd love to tell you about this 0 Downtime upgrade,
just because, I don't know, it's like a point of pride of mine,
I guess.
It's always so frustrating to me when I get an email from 1 of
my providers.
It's like, hey, Sunday at 2 PM, we're going to be down for 30
minutes.
Like, hope that's OK.
So yeah, Gadget's an infrastructure provider.
Like, the developers who use us use us for mission-critical apps,
so we just try to take it more seriously than that and, like, not,
you know, disappear.
So yeah, we did a 0 downtime upgrade with the help of Postgres.AI
and Nik, thank you so much for your help, Nik, what a treat.
Nikolay: Thank you for trusting us, because, you know, we've
done this with other companies, and in the beginning I often
hear pushback once people understand how much work is needed
there, because it's far from being well automated.
It should be in Postgres, and it's improving, but
Harry: yeah.
Not quite yet.
Nikolay: Yeah, yeah, yeah.
So you went on this journey, and I thank you for not stopping.
Harry: Absolutely, absolutely.
So it was a lot of work,
Nikolay: it was, yeah.
Harry: I think too, it's like, to me it just kind of feels like
something that should be built into the managed providers.
Like I'm kind of surprised that so few of them have it.
I think the only 1 I could find that has like true 0 downtime
is PlanetScale.
Nikolay: But for Postgres?
Harry: Of the GCP options, I don't
know about the new Postgres one.
I know they have it for
MySQL, but not for Postgres yet.
Nikolay: For MySQL, yeah, I think
so.
I don't know the details, but for Postgres
I don't see it anywhere yet.
Harry: But again, like what the
heck?
Like as an industry shouldn't this
have been the thing we kind
of like solved 20 years ago and
never looked back?
You know, I don't understand.
Nikolay: Yeah.
And there's no AI needed there.
It's just pure automation.
Harry: Yeah.
Terrifying button to press.
Nikolay: I think...
Harry: But...
Nikolay: Yeah, I think what happens
is that it should be automated
more in Postgres itself.
And most managed platforms, like
RDS and others, just sit
on the fence.
Well, I know they work and
do it, but they don't want to
automate things outside Postgres
too much.
For example, ANALYZE, right?
ANALYZE is not automated.
You always forget about this.
And they don't automate it; they
just put it in the documentation,
on the shoulders of users:
now it's time to run ANALYZE.
And I started asking around
maybe a year or two
ago: why, why, why? And
then, okay, Postgres 18 should
be out next week, right?
Finally, dump and restore of statistics
will be fully automated.
So they can just exclude this step
from their how-to manual, right?
So this is an example.
Things should be inside Postgres.
But still, I think they should
automate much better, indeed,
not waiting for Postgres itself.
Anyway, I'm with you here, 100%.
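To make that ANALYZE point concrete, here's a minimal sketch (an illustration, not anything shown in the episode) of the manual post-upgrade statistics rebuild that providers document today, staged the way vacuumdb --analyze-in-stages does it, using node-postgres; the connection string is a placeholder.

```typescript
import { Client } from "pg";

// Rebuild planner statistics after pg_upgrade: coarse targets first so the
// planner has something usable quickly, then progressively finer ones.
async function analyzeInStages(connectionString: string): Promise<void> {
  const client = new Client({ connectionString });
  await client.connect();
  try {
    for (const target of [1, 10, 100]) {
      await client.query(`SET default_statistics_target = ${target}`);
      await client.query("ANALYZE");
    }
  } finally {
    await client.end();
  }
}

analyzeInStages(process.env.DATABASE_URL ?? "postgres://localhost/postgres")
  .catch((err) => {
    console.error("post-upgrade ANALYZE failed", err);
    process.exit(1);
  });
```

Once statistics survive the upgrade in Postgres 18, as discussed above, this step should disappear from the how-to manuals.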
Harry: I'm glad.
What I was going to say is I've
always attributed it, or the
absence of something like a true
0 downtime major version upgrade
to the fact that it's like more
than just Postgres that you need
to pull it off.
Right?
Like you need a connection pooler
or a proxy or something in
front of it sitting outside the
system and you need a bunch of
tight coordination between that
proxy and the rest of the system
to kind of orchestrate it all and
so like I know at least on
our previous Postgres provider like they ran the PgBouncer on
the same VM as Postgres itself.
And so they're kind of hamstrung by the architecture where they
can't have this piece that lives outside that can talk to 2 at
once, whereas we self-hosted our bouncers for exactly this reason
and can kind of do it all and orchestrate it all. It, like,
sucks to self-host, but it gets you that architectural flexibility,
you know? We needed that.
Nikolay: Yeah, and PgBouncer supports pause/resume, which helps you
achieve 0 downtime.
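For readers who haven't scripted it: the pause/resume dance looks roughly like this. A minimal sketch, assuming node-postgres, a PgBouncer admin console reachable on its "pgbouncer" virtual database, and placeholder host and credentials; the cutover step is whatever repoints the bouncer at the new primary, and error handling is elided.

```typescript
import { Client } from "pg";

// Hold client traffic at PgBouncer, perform the cutover, then release it.
async function switchoverViaBouncer(performCutover: () => Promise<void>) {
  const admin = new Client({
    host: "pgbouncer.internal", // placeholder host
    port: 6432,
    database: "pgbouncer",      // PgBouncer's admin console
    user: "pgbouncer_admin",    // must be listed in admin_users
    password: process.env.PGBOUNCER_ADMIN_PASSWORD,
  });
  await admin.connect();
  try {
    // PAUSE waits for in-flight queries to finish, then holds new ones.
    await admin.query("PAUSE");
    // Repoint the bouncer at the new primary (rewrite pgbouncer.ini,
    // flip DNS, etc.): whatever your cutover step is.
    await performCutover();
    // Pick up the new [databases] definition without dropping clients.
    await admin.query("RELOAD");
    // Release the held client connections against the new backend.
    await admin.query("RESUME");
  } finally {
    await admin.end();
  }
}
```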
Harry: We're maybe nuts too, and we run our PgBouncers in K8s
and do the whole rolling restart and whatever, and it's turned
out to be, I think, an okay choice.
I think a lot of people maybe shy away from that because it's
sort of super stateful, super critical.
Maybe this is a pet, not cattle.
But we did it in K8s just because that's where the rest of our
compute runs, and it's, like, easy to sort of bin pack and share
it all.
It actually really saved our butts recently, because we were paying
a lot for inter-zonal network transit to our databases.
Specifically, once from our application pod to the bouncer,
and then again from the bouncer to the database.
And because the bouncer was in K8s, we could turn on the handy-dandy
prefer-zone-local network routing configurations and kind
of have the application pod go to a zone-local bouncer and then
to the database.
And I think were we outside of K8s, that would
have been a lot more annoying, to kind of, like, teach all the
routing to be able to do the right thing.
Nik, you look scared.
You look skeptical.
Nikolay: No, no, I'm just thinking this is also a job for the database
platform, I think, to take care of such things.
And I know RDS has RDS Proxy, but it's a mixed bag: it has cool
features PgBouncer doesn't, but it also lacks cool
features PgBouncer has.
So it's a zoo.
And I think pause/resume is a wonderful feature.
Still, I have to emphasize it.
Michael: I was going to ask about your PgBouncer setup, actually.
Because your blog post on the upgrade is really cool, and goes
into good detail on the actual upgrade process.
But you didn't talk much about your PgBouncer setup, and how
many you're running.
And does all of your application traffic go through that?
Or does any of it skip it and go straight to the database?
Is there anything interesting about that that you wanted to share
with people?
Harry: Good question.
I have 2 interesting tidbits.
The first is, so, our PgBouncer deployment's a bit
weird, and we have 9 instances.
Michael: Interesting.
Harry: Yeah, and we take our kind
of total connection count,
where we're targeting 400 kind
of server-side connections, and
split that up over 9. And the reason
for that is that same K8s
network-aware routing thing, where,
up until very
recent versions of Kubernetes,
you needed 3 endpoints per availability
zone for the network-aware routing
to kick in.
So we had to use a minimum of 9,
and initially we were really
skeptical about that, because it
seems kind of bad, right? Instead
of having, say, 1 or 2 big buckets
of a lot of connections where we
were kind of brokering them fairly,
you have many small buckets, and
you could get kind of weird inefficiencies.
But it didn't come to pass where
that bit us.
It seemed actually okay in the
end.
And what I like about it is we
have a lot of CPU headroom, right?
Like, PgBouncer, I think, is still
largely single-threaded.
I don't know the innermost details
of it, but we're using the
fancy new prepared statement stuff
within it, and so I'm kind
of happy that I can give it 9 cores
instead of 2 or 4 or something
like that, and kind of be like,
you know, take as much as you need.
Michael: Yeah, nice.
Always good to have headroom.
Cool.
Anything else about the upgrade,
or, like, which bits were most
scary when Nik came to you with
the proposed plan? You mentioned
pushing the button; is that because
it's single-direction,
or did you have a rollback plan?
Harry: Good question.
We had a rollback plan, but it
was written rather than tested.
Because we sort of... so the
way that we pulled it off, or
the way that we kind of built the
whole thing, is we built a Temporal
workflow.
We're, like, huge fans of Temporal;
it's very easy to kind of script
something like this.
And the reason why we did this
is because it lets us get all
the nice kind of retries and idempotency
stuff for free, where
we get to write in a somewhat failure-oblivious
way.
And we wrote our workflow to have
like a dry run capability,
where we do all but the very last
step.
So we would go do everything where
we like prepared the new host,
kind of replicated the data, made
sure we were caught up, did
the PgBouncer pause, and then
we just wouldn't reload the config
or change over and we'd just do
a resume and everything would
keep on going as it was.
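As a rough illustration of that shape (not Gadget's actual code; the activity names are invented), a Temporal workflow in the TypeScript SDK with a dry-run switch might look like this, where every step runs except the irreversible cutover:

```typescript
import { proxyActivities } from "@temporalio/workflow";

// Hypothetical activity interface for the upgrade steps.
interface UpgradeActivities {
  provisionNewCluster(): Promise<void>;
  waitForLogicalReplicationCatchup(): Promise<void>;
  pauseBouncers(): Promise<void>;
  repointBouncersAtNewPrimary(): Promise<void>;
  resumeBouncers(): Promise<void>;
}

const acts = proxyActivities<UpgradeActivities>({
  startToCloseTimeout: "30 minutes",
  retry: { maximumAttempts: 5 }, // retries come from Temporal for free
});

export async function zeroDowntimeUpgrade(opts: { dryRun: boolean }): Promise<void> {
  await acts.provisionNewCluster();
  await acts.waitForLogicalReplicationCatchup();
  await acts.pauseBouncers();
  if (!opts.dryRun) {
    // The one step the dry run skips: actually flipping traffic over.
    await acts.repointBouncersAtNewPrimary();
  }
  await acts.resumeBouncers();
}
```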
And so the terrifying part is we
had just never actually done the
middle thing.
And we weren't really sure, like,
here's how we can test this,
other than like standing up a whole
parallel set of infrastructure,
which we were like, I'm not even
sure that will prove this 1
will work, right?
Like, just in my admittedly still
limited lifespan, like, the
things that go wrong are always
the things you didn't foresee,
right?
So it's like we dry-runned it a
million times, found all the
things we could foresee, and then
it was like the spooky, you
know.
Nikolay: Yeah, so it's really like...
it's impossible to test the switchover
in production, but I wanted to
emphasize that it's possible to test
everything before that.
And it's great.
Because this is the beauty of this
procedure.
You create the whole cluster, which
is already running the new
version.
You see it's working.
You can touch it, and it's great,
and so on.
But yeah, I agree, the scariest
part is the switchover and how to test
it properly.
Of course, it was tested in non-production,
but it's different.
Here we just need to jump at some
point.
Harry: Exactly.
It was the best kind of jump, where
it was like a big meeting
that we were all like scared of,
we all like sat down, Nik was
on the phone, and it was over in
like 7 minutes, where it was
like, oh, well, I guess it worked.
Like all the numbers look good.
Like yeah, that's it.
Onwards, you know.
So I feel fortunate to work with
a team where, you know, that
was how it went, as opposed to
us running around screaming like
chickens with our heads cut off.
Nikolay: I wanted to thank you
for this, for emphasizing that
you use this Temporal framework, because
we, Postgres.AI, we're
always big fans of Go, and we
developed a lot of
stuff using Go.
But recently, our team, we have
a few Python guys right now.
And I just envision that probably
we'll learn from you and this
experience as well and do some
stuff with Temporal as well.
So thank you for emphasizing that.
Harry: Yeah, we're heavy, heavy
users of it.
We use it for these kinds of systems
administration tasks or
behind-the-scenes maintenance workflows.
But we use it for a lot of other
stuff too.
Like Gadget is mostly used today
by people building Shopify apps.
And when you build Shopify apps,
you have to receive a metric
shit ton of webhooks.
And so Gadget has, like, a background
kind of queuing system
that knows how to take a webhook
and process it later, or process
it with retries, or whatever.
So each 1 of those is a Temporal
workflow for us, and so we're
doing, I don't know, something like
3k, 4k Temporal state transitions
per second, also backed by Postgres,
in this big-ass Temporal cluster.
And that's something I never ever
want to have to try to build
myself.
Like it's so complicated and nasty
and fun for some.
Michael: You mentioned in the post
also that sharding was on
your to-do list.
Yes.
You mentioned though you were using
a different database for
that now, like AlloyDB specifically
I think for the new setup,
but you said it was just for the
user data.
What's the separation there with
like the Shopify?
Is it the data of the Shopify apps,
for example?
Harry: Yeah, so Gadget is like
a general purpose app development
platform.
So if you want to make a to-do
list tracker or a ChatGPT clone
or a Shopify app, you can sign
up for Gadget and you get kind
of like a full stack dev environment.
So you get a place to run kind
of backend code.
It's all JavaScript, TypeScript.
You get a database or a sort of
virtual slice of a database to
store back-end data.
You get a front-end in React and
whatever, and you're off to
the races.
You write back-end code, you write
front-end code, you define
your data models, you transact,
so on and so forth.
And then the Shopify part comes
in where we kind of have a managed
API connection to Shopify, where
we'll do the OAuth and the webhooks
and sync the historical data, et
cetera, et cetera.
So like many infrastructure providers,
we have a control plane
and a data plane.
The control plane stores our list
of users, who has what
access to which app, so on and so
forth.
Then the data plane is the actual
apps data, like this app has
a to-do model that stores, you
know, this many to-dos.
This app has the Shopify product
model and, you know, stores
this many Shopify products.
So when we started, we just
stuck those in the same Postgres.
Yeah.
And that Postgres got too big, and
so this major version upgrade
kind of precipitated splitting
the control plane from the data
plane, which I think is like a pretty
obvious best practice.
Michael: When you say too big,
by which metric? Like, where
was the bottleneck
coming from?
Harry: Good question, so we had
about 8 terabytes of data in
the 1 instance, or sorry, our disk
was 8 terabytes big and we
were at 6 or something like that,
6.5, and we're like, okay,
this is scary, like we gotta figure
out the next thing, and we
were on PG13, which was end of
life long ago, and so we needed
to both do the upgrade and we just
wanted kind of a better roadmap
for how we were gonna deal with
growth.
Because Gadget has lots of these
e-commerce customers, BFCM is
always a quote-unquote "fun" time
for us, and we'll probably see
this year like a 4x traffic increase
and like a 4x database QPS
increase, and so that means we
kind of need to be ready for
a 4xing of, you know, database
CPU, and our provider at the time
didn't sell a big enough 1.
So we were kind of like, we have
to do the horizontal shard,
or scaling sort of journey.
So what we did is we sharded in
order to move the bulk of that
6 terabytes out into newer instances
that had much more room
to grow, leaving only the reasonably
sized control plane left
over to do this major version upgrade.
And it simplified a few of the
steps of this 0 downtime major
version upgrade, because we didn't
have to replay 6 terabytes
worth of stuff on the new instance.
Michael: Yeah, yeah.
Makes sense.
But why not stick with the existing
provider for those shards?
Harry: 2 reasons, or 3 reasons.
The first is we're a Google shop,
and we get good pricing from
Google.
The second is Google claims up
to a 4X TPS increase with AlloyDB.
Like, I like Google and I don't
want to poo-poo them.
I'm not sure I've ever seen it
actually hit 4X, but we did see
like at least a 1.5X kind of like
TPS increase without really
much config change or tuning or
whatever.
I can't even say if it's actually
the same CPUs like under the
hood, you know what I mean?
I don't know what the initial provider
was using, but the instance
was faster for the same price,
so that was nice.
And then the other big thing with
Alloy is you don't pay inter-zonal
network transit fees.
It's not like a normal VM.
I present that without comment
to the competitive forces that
may or may not be listening.
But yeah, it's free networking
to the Alloy instance.
Whereas if you're Aiven or Crunchy
or whatever and you run instances,
your customers have to pay from
2 out of 3 zones just to talk
to it, so that sucks.
And then the second thing is you
don't pay for replica storage.
It's a split storage-compute sort
of setup, where the disk is
this sort of auto-scaled elastic
fancy log storage service thing,
and then you pay for your nodes
that kind of read and write from
that shared tier, but you don't
pay n times, you pay once.
So it's like 40-ish cents a gigabyte,
which is more expensive
than, say, a big fat EBS volume,
but not that much more, and
so if you're running a lot
of replicas it ends up being
cheaper storage-wise, and you don't
have all the nasty I/O
spikes from backups or any of that stuff.
Cool.
Nikolay: And unlike Aurora, I don't see the hidden cost here; with
Aurora, it's usually I/O costs.
Harry: Yeah, no I/O costs.
Nikolay: Yeah, that's interesting.
Harry: So our database bill came out to be about the same as
a pretty classic GCP VM with, you know, they're not EBS
on GCP, but like fat SSD volumes backing it, and we save
money on the networking, we save money on not having to pay for
storage again for the replicas, and we deleted our I/O-spike-from-backups
problem.
Nikolay: That's an interesting comparison.
So we're not even talking about features like the column store
versus row store.
Even without that, it's a direct comparison.
Interesting.
Harry: Yeah.
We intend to make use of that column store.
Like, I'm skeptical.
I used to work on sort of like the big data warehousing team
at Shopify when I worked there and like I just don't really believe
that you can force Postgres to do the kind of super wide throughput
stuff you need for crazy scale analytics.
But you can definitely get pretty far with it.
You know what I mean?
So my guess is, for some of our operational reporting use cases where
you're traversing half a million, a million rows to answer a query
on a read replica, this column store is going to be fine.
But at the billion, 10-billion scale, I think purpose-built
analytics databases still should exist.
Michael: We've had a few of those conversations recently.
Harry: I'm sure, I'm sure.
Michael: I had 1 more boring question on the major version upgrade
before we move on to more interesting things.
It looks like you're on Crunchy Data; you were on Crunchy Data
before for the control plane stuff and are going to be again
for the upgrade.
But you only upgraded I think from 13, which by the way you're
a bit harsh on yourself.
I don't think it's quite yet out of support.
I think it's
Harry: like a
Michael: couple of months time that'll be out of support.
But you only moved to 15 and I think Crunchy have always been
very good at supporting the latest versions so probably 17 is
available.
So why not?
Like why 15 not 16 or 17?
Harry: Yes, Crunchy has been great at that kind of thing.
I think the main reason was...
Shoot, you know what?
I don't even remember.
Michael: Maybe Nik knows.
Nikolay: I also don't remember.
I remember that Crunchy had different
end of life dates somehow.
Crunchy Bridge had different dates.
Harry: Yeah, they deprecate support,
or they stop sort of long-term
support before the full-on end
of the kind of core support.
But in fairness, they gave
us a lot of advance warning,
and we knew it was coming
for a long time, and they supported
it for 4 years or whatever, which
is, I think, pretty good.
I don't remember.
I think it might have been a desire
to keep the versions homogenous
across all our instances.
Sure.
And for really silly reasons, the
Alloy shards that we stood
up were stuck on 15 and couldn't
be newer.
Michael: Yes, that would explain
it.
So, Alloy, I think because it's
a fork and it's probably quite
hard to maintain fork, I don't
think they're as good at keeping
up with the major versions.
So, there's a good chance you wouldn't
even have the option of
going to 16 or 17 on there.
I don't know
Harry: for sure.
That's correct.
That's correct.
At the time, 16 was GA and 17 wasn't
even out yet.
Now 17 is out, but it's still in
preview.
Like, you can't, shouldn't run
production workloads on Alloy
17.
But the weirdest thing was we wanted
to go to 16, but they just
don't have pg_squeeze support on
their 16 version.
They have it on 15 and they have
it on 17 and not on 16 and couldn't
tell you why, but we squeeze a
lot, so we really cared about
that.
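For context on the squeezing: pg_squeeze rewrites bloated tables online, without the long exclusive lock of VACUUM FULL. A rough sketch of an ad-hoc run follows, assuming the extension is installed; the squeeze.squeeze_table(schema, table) call is my recollection of its interface, so treat the exact signature as an assumption and check the pg_squeeze docs for your version.

```typescript
import { Client } from "pg";

// Ask pg_squeeze to rewrite one table in place (copy, catch up via logical
// decoding, then swap), reclaiming bloat without blocking writers for long.
async function squeezeTable(connectionString: string, schema: string, table: string) {
  const client = new Client({ connectionString });
  await client.connect();
  try {
    await client.query("CREATE EXTENSION IF NOT EXISTS pg_squeeze");
    // Assumed interface; verify against your pg_squeeze version.
    await client.query("SELECT squeeze.squeeze_table($1, $2)", [schema, table]);
  } finally {
    await client.end();
  }
}
```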
Michael: That makes so much sense.
In my head, the only reason I could
think of is because now you're
so good at major version upgrades,
you wanted to give yourself
more to do in the future.
Harry: Let's rinse and repeat.
Yeah.
Nice.
Michael: All right.
So you're running Postgres for
a lot of different workloads.
Are there any challenges or fun
things you wanted to share on
that front?
Harry: I think maybe a good 1 just
for the audience and like
if you're listening out there and
you think I'm full of it, like
please send me an email.
I would love to know what you think.
But we did the classic Postgres-as-a-queue
thing early in our
life.
And I think that... well, it's
on my mind a lot.
So a lot of people who have
background queuing problems
use Redis.
When I was at Shopify, we used this
library called Resque.
There's another really popular
1 called Sidekiq, and in Node it's
BullMQ.
There's a litany of them, right?
They all seem to use Redis.
We decided that for our user, who
tends to be like a front-end
dev or sort of someone who isn't
deeply familiar with like data
consistency or data durability
concerns, that we didn't really
want to be like, hey, we'll happily
process your background jobs,
but we'll drop 0.5% of them and
not be able to tell you which
ones we dropped and not be able
to tell you when it happened.
So for our background queue product,
we said it's going to be
durable.
Not only is it going to be durable,
it's going to be transactional,
where it's like when you do some
business logic and you enqueue
a job within a transaction, if
the transaction commits, then
the job will get enqueued.
And if the transaction doesn't
commit, the job won't be enqueued,
and that's the transaction boundary.
So we were like, this is nice.
This is us kind of trying to prioritize
our customers' sanity.
And I think it's the kind of thing
that's like annoying, because
not that many of them are going
to notice or care.
It's just going to be no data quality
issues for this business
logic.
But that locked us into needing
Postgres to be on the hot path
for those enqueues, right?
We have to participate in that
business logic transaction to
kind of save the thing.
We knew we didn't want to build
a giant queue of our own.
So we knew we were going to get
it into Temporal and let Temporal
take over, but we sort of needed
like a Postgres outbox, if that
makes sense, where it's like, you
write to the outbox and then,
you know, you're slowly forklifting
out of the outbox into the
actual system eventually.
And then Temporal itself is durable.
And so, yeah.
So that's the architecture we went
with.
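A generic sketch of that outbox shape (not Gadget's actual schema; the table and column names here are invented for illustration), using node-postgres: the job row commits or rolls back with the business write, and a separate relay later forklifts committed rows into the durable queue.

```typescript
import { Pool } from "pg";

const pool = new Pool({ connectionString: process.env.DATABASE_URL });

// Enqueue inside the caller's transaction: same commit, same rollback.
async function createTodoAndEnqueueJob(title: string) {
  const client = await pool.connect();
  try {
    await client.query("BEGIN");
    await client.query("INSERT INTO todos (title) VALUES ($1)", [title]);
    await client.query(
      "INSERT INTO job_outbox (kind, payload) VALUES ($1, $2)",
      ["todo.created", JSON.stringify({ title })]
    );
    await client.query("COMMIT");
  } catch (err) {
    await client.query("ROLLBACK");
    throw err;
  } finally {
    client.release();
  }
}

// Relay: forklift committed outbox rows into the real queue (for example,
// by starting a Temporal workflow), then clear them.
async function drainOutbox(startWorkflow: (kind: string, payload: unknown) => Promise<void>) {
  const { rows } = await pool.query(
    "SELECT id, kind, payload FROM job_outbox ORDER BY id LIMIT 100"
  );
  for (const row of rows) {
    await startWorkflow(row.kind, row.payload);
    await pool.query("DELETE FROM job_outbox WHERE id = $1", [row.id]);
  }
}
```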
But as we sort of built it out
more and more, we learned about
more limitations of Temporal.
Like it came out of Uber from this
system that kind of predates
it called Cadence.
And my guess is they learned all
this stuff the hard way and
built in a bunch of limits that
make it, you know, easier for
the operators to guarantee certain
sort of performance characteristics,
but really annoying for people
like me.
So 1 of those limits is when you
start a job, there's a maximum
payload size that's pretty small.
I forget what it is exactly.
I think it's like 50 megs or 5
megs or something like that.
And then as that job executes,
it can only generate X amount
of state transitions, as they're called,
or events, in its processing,
which means it can only have a certain
number of retries, basically,
and you need to sort of re-architect
your Temporal workflows
if you want to overcome that.
And again, in our case, our users
are kind of like, hey Harry,
I have this 400 megabyte video
file that I'd like to process
in a background action.
Can you just make it work please?
Can you not make me like re-architect
my entire thing to toss
it into cloud storage?
Whatever, it's ephemeral data.
Like I don't need it in the cloud.
I just want to push it to this
thing and work it within the background.
So we found that the Temporal limitations
were annoying enough
that we actually needed to stage
a bunch of each job's data outside
of Temporal.
And so we ended up kind of having
this nasty hybrid where that
Postgres outbox I mentioned sort
of became a store for a
bunch of extra stuff, and then the
Temporal workflow, as it went,
would be, you know, transacting
with its state storage and doing
its state transition things, but
then also updating what is now
no longer the outbox but our
source of truth; it powered our
UI, it had all these extra params,
yada, yada, yada.
So we weren't even really using
Postgres as a queue.
We weren't having the classic SKIP
LOCKED problems and whatever,
but we were just having this
sort of request-scale
data traffic, where 1 webhook
would be 1 write to this outbox
to enqueue, 1 write to start processing,
1 write to finish processing.
So just, like, massive, massive tuple
turnover.
And we also needed to build a UI
on this stuff.
So we could only get a few of them
to be HOT updates; like,
many of them were not.
And so we were just destroying
this poor Postgres instance
with what I would call less
important data than the actual
data plane data, but still important
enough that we felt it
merited being on a disk.
So all that to say, we ditched Postgres
for this, and we switched
to Bigtable, which is Google's
fancy super-distributed key-value
store, and it friggin' sucks.
It really sucks and the reason
is no transactions, right?
It's like a way worse model for
me as a programmer and a systems
designer, and a way more scalable, fundamentally supersized system
under the hood.
But boy, oh boy, do I miss the creature comforts, we'll call
them, of secondary indexes and transactions and a sane query
language, let alone the gazillion and 1 features that a full
relational system has over a key-value store like that.
So we use Postgres, I think, by default for all our workloads.
And then when we find something that we're like, okay, I think
there's maybe a better fit here.
We try to switch it out for the next best thing.
But it's like, it's really hard to beat.
And I know this is maybe unsurprising on a Postgres podcast,
but it's like, I just wish I didn't have to do that.
I could have stayed with my creature comforts and got my scale,
you know?
Nikolay: Yeah, so I'm curious what was the number 1 reason.
Is it like bloat maybe or?
Harry: Yeah, it was like bloat, tuple turnover and just like
the fact that we were like always vacuuming.
Nikolay: Yeah, yeah.
I can imagine.
Harry: Another big motivation for switching to something akin
to Bigtable is compute elasticity, where, like,
Bigtable auto-scales by default.
You can't even turn it off, I don't think; they just add
more nodes as demand grows.
And for us, the background subsystems
are the thing that absorb load, right?
When you have a flash sale on your Shopify shop and you need
to enqueue a bunch of stuff, we need this thing
to be able to handle unpredictable surges.
And it was a bummer at the time, we didn't have any way to do
a 0 downtime, like non-disruptive Postgres resize.
Now we do, but even then I don't really want to be doing it automatically,
like in response to load.
I kind of want something like Bigtable that's like, 5 seconds
later you can get your capacity expansion.
Nikolay: Yeah, well, I can see it easily.
Harry: I think, I think my ideal would be a million tiny
Postgreses, and some fancy system for, like, disaggregating
a logical Postgres from a physical VM.
Or, let's say, I shard my workload into 16,000
Postgreses, but they only run on 4 VMs.
And then as I need more compute, they kind of split out, get
bigger, and whatever.
But that's like a twinkle in old man Harry's eyes.
Nikolay: There is work in this direction by multiple teams right
now.
So we have good format.
Harry: Go on.
Go on.
Yeah.
Nikolay: Yeah.
There is PgDog.
There is Multigres and PlanetScale also.
They're all working in this direction.
Michael: And Multigres specifically is 1 of the things Sugu
mentioned as well.
From his experience at YouTube, he loved having lots of smaller
databases and he much preferred that in terms of reliability
and localized outages even.
So there were just so many benefits
Nikolay: Highly automated operational tasks are on the roadmap, like
upgrades as well.
Michael: And rebalancing, yeah.
Harry: Yeah. Forgive me for another brief aside, but are you
familiar with Azure Cosmos DB?
Like the kind of main 1, not the Citus 1, the main 1.
Michael: I thought Cosmos was Citus.
I get confused by their names though.
Harry: So I think there is Azure Cosmos DB for Postgres, which
is Citus.
But I think there's, and I'm not deeply familiar, so I may be
totally wrong here, but I think there is a different original
Cosmos DB that's kind of the Azure answer to Dynamo and Bigtable.
Like they're kind of like super scale, sort of less sophisticated
query patterns, but guaranteed scale for whenever.
The way that it works, or the way that it scales is you specify
like a partition key with every piece of data, and you're only
allowed partition local transactions.
So it's not like a Spanner or a Cockroach where the whole key
space can be transacted against together.
It's just like a million tiny partitions and then every partition
could be a max of 50 gigs.
And to me, that's so desirable, because it means, within 1 little
partition, you get local performance, but the same horizontal
scaling properties of the super-scalers, without necessarily needing
two-phase commit or Paxos or whatever.
And so I'm hoping someone manages to do that.
I think it's maybe, what's it called, Neki, like the Vitess
for Postgres.
Did I get that
Michael: right?
It's Neki.
I have no idea how to pronounce it.
Harry: Yeah, same.
I have no idea.
But I'm hoping that's what it is, which is, like, tiny little boundaries
that force me to design my data in a way that makes the transactions
efficient, where you don't need a bunch of computers to agree
on a transaction outcome, and instead
you can just funnel them
to the right place.
Nikolay: Unfortunately, they all...
not all, but many of them came
to me and asked: have you used
2PC?
I said, no, I always avoided it.
But they need it, unfortunately.
Of course, some workloads will
avoid it, but some workloads
will still need 2PC.
This is the sad truth.
Harry: Totally.
Totally.
I think, like... so I used to work at Shopify circa,
I don't know, 2012, 2013.
We were running a MySQL fleet
at the time, and we felt constrained
by the biggest database servers
we could buy.
Like, we were sharding, but it was
still like the biggest customers
could push the 1 instance, I don't
know what it was, probably
64 threads, to the max. I'm
not really feeling that way
anymore.
Like, I feel like there are few workloads
that don't have a natural
partitioning, where you need more
than 1 server's worth of compute
just to satisfy 1 partition's worth
of work. I mean, maybe you
talked about this last
time. But it's like, you can
always split it up, and when you
split it up, like, you
need, in aggregate, many computers,
but 1 transaction boundary
doesn't need more than 1.
And like the hardware
has just gotten so much
better in those 10 years that I'm
not sure I could grow my company
fast enough to need that, right?
It's like, maybe if we had
a graph problem, maybe if we were
Facebook or whatever, but as it
stands right now...
Nikolay: That's true.
And on GCP, if we don't look at
AMD and look only at Xeon Scalable
fifth generation, the best available
is C4 nodes, I think, right?
You can get 192 cores or so, but
if you go to AWS, you will have
even more, like almost, I think,
800 cores with the same type
of processor, and this is already
huge.
So you can scale 1 node.
Harry: It's just bonkers.
Like, that's ridiculous.
Yeah.
It's a lot of doge coins, you know?
Sorry, so that was a long-winded
rant.
But what more can I tell you about
that Temporal-and-Postgres
workload setup thing?
Michael: Before we started, you
mentioned 1 of the most interesting
things you're working on is that
you're serving other people's
workloads, with schemas
and design decisions that you don't
control.
How do you stay sane while doing
that?
And also, what are the interesting
parts of that?
Harry: Great question.
You hit me with the question that
just invalidates my entire
prior rant, which is like, yeah,
you can partition your workload
if you think hard about the schema
design, but yeah.
So, okay, let me set the stage
a little.
So, Gadget's, like, reason for being.
Like, the reason why we're working
hard on this thing is because
we kind of believe software development
is still hard, unnecessarily.
That, like, when you go to spin
up an app on the weekend, or your
next big business, or just, like,
solve a small problem in your
life with software,
you spend a lot of time on what
we call baloney problems
of provisioning infrastructure,
setting up connection poolers,
like worrying about the autovacuum
schedule, so on and so forth.
We just think that's all a distraction
from actually building
something useful.
It matters, like you can't ignore
it, but you also ideally don't
want to have to deal with it.
So similar to the people who came
before us, like the Firebases
of the world, We kind of want to
delete a bunch of those problems
from our customers' lives, take
them on ourselves, and allow
them to focus on whatever unique,
interesting custom problem
they want.
Like, recently, it might even be
to allow Claude Code to focus on
whatever unique, interesting problem
they want, right?
But, like, by kind of deleting
infrastructure foot guns and systems
integration, we just let people
kind of succeed faster.
And then I guess I should say,
bearing in mind the audience listening
here, it's not for everybody, right?
Like, if you have the perfect knowledge
of the Postgres schema
you want, Gadget's probably going
to be less efficient than that,
right?
Because we've added lots of layers
of abstraction.
We have our own operations and
infrastructure team kind of managing
this stuff.
So I'm not trying to say, like,
no 1 needs to care about these
problems.
I'm just trying to say that a lot
of people, I hope, don't have
to care about these problems so
that me and my team pull our
hair out stressing about them.
All this means is there's this
tension between asking the developer
to be more specific about their
data access patterns and adding
friction on their
path to success. Right? Like,
every time you say, do you want
4, 8, or 32 bytes for this number,
you're posing what I would
call a computer baloney question,
rather than: it's just a number,
like, you know, screw off, I
don't know how big
it's ever gonna be, just you figure
it out, you know? So in Gadget we
have, like, a number field type
instead of an int and a float and
a decimal. We've taken this
to a pretty intense extreme where
when you do data modeling in
Gadget you specify tables, we call
them models because they include
behavior as well, but you specify
tables and you specify fields,
you can specify relationships among
them, but you don't run any
SQL to realize those schemas.
And then you get a kind of fully
auto-generated GraphQL API for
interacting with those tables without
doing anything else.
You just, you specify your schema,
you get the API.
We generate a nice client for that
API, we generate nice docs
for that API, but the idea is, like,
you say: to-do, title string,
completed boolean, and now you
can immediately write some React
code for a form, or you can ask
the AI to write some React code
for a form to like transact against
that stuff.
All this to say, we're responsible
for the schema
of those tables, like
the real tables under the hood,
as well as the SQL that that API
generates and executes.
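Purely as an illustration of what owning that lowering step looks like (Gadget's real internals aren't public, so the field types and model shape here are invented), a sketch of turning a field list into DDL:

```typescript
// Hypothetical Gadget-style field types lowered to Postgres column types.
type FieldType = "string" | "number" | "boolean" | "dateTime";

interface ModelField {
  name: string;
  type: FieldType;
}

const columnType: Record<FieldType, string> = {
  string: "text",
  number: "numeric", // one "number" type instead of int / float / decimal
  boolean: "boolean",
  dateTime: "timestamptz",
};

function modelToDDL(model: string, fields: ModelField[]): string {
  const cols = fields
    .map((f) => `  "${f.name}" ${columnType[f.type]}`)
    .join(",\n");
  return (
    `CREATE TABLE "${model}" (\n` +
    `  id bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,\n` +
    `${cols}\n);`
  );
}

// e.g. modelToDDL("todo", [
//   { name: "title", type: "string" },
//   { name: "completed", type: "boolean" },
// ])
```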
So this is a win because we can
avoid things like the
N+1 problem, like we know how to
do that good.
We can implement things like 0
downtime cutovers or kind of maintenance
modes where we know how it's like
this app is in the process
of being moved to a different shard,
like, hold on a second,
we'll finish that and then serve
your request, you know.
It's powerful that Gadget sort
of owns the whole stack.
Oh, we can implement other stuff
like cloud storage, where you
can store data blobs in the right
way, instead of within Postgres,
you know.
But it means that all we know about
a given application is what
we've collected from the developer,
which is that list of tables,
the list of fields on the tables,
and the relationships among
them. Not stuff like, ah, this
is a shard key, or this is how
I intend to transact, or if I'm
going to make an HTTP call in
the middle of my transaction, like
2 days from now when I change
my code.
So we picked this bed and we made
it on purpose.
The idea is we want building high
quality software that scales
and like meets the demands of Black
Friday, Cyber Monday to be
more accessible and to not be locked
behind you know whatever
between the 3 of us 60 years of
Postgres experience or something
like that, right?
You should be able to build and
get an app going way faster without
that.
So the gist of it is we need some
way to structure our data under
the hood within our Postgres such
that people can't take down
the platform.
They can't use more than their
fair share.
We just need to be safe by default
where we're committing to
serving this API, and regardless
of what strange queries you
throw at it, strange transaction
patterns, poorly written code,
either out of ignorance or malice
or whatever, just needs to
still work.
Same as like a Firebase or yeah,
those other companies I mentioned
before.
And the way that we do that is
like kind of 2 main ways.
We don't give everyone their own
instance and say it's your job,
like you manage the performance.
I think that's what, say, a Neon
or Supabase does, which is
still helpful, but it's different,
right?
They're sort of saying, you get
this much resources and within
that, it's up to you to optimize.
We don't say that.
We charge by operation, more usage-based,
like say an Upstash
Redis or something like that.
And then we consider ourselves,
Gadget, on the hook for the performance
of those operations.
And that means that, again, we
have to find some way to not process
queries that could take us down.
So we do 2 things.
We refuse to execute a query unless
there's an index on the predicates.
And that's just a safety must,
basically.
We can't run unindexed queries
because we can't know that they'll
process fast enough for it to be
safe to run at scale.
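As a rough sketch of that guard (an illustration of the idea, not Gadget's implementation), using node-postgres and the system catalogs: check that the filter column is the leading column of some index before agreeing to run the query.

```typescript
import { Pool } from "pg";

const pool = new Pool({ connectionString: process.env.DATABASE_URL });

// Is `column` the leading column of any index on `table`?
async function hasLeadingIndexOn(table: string, column: string): Promise<boolean> {
  const { rows } = await pool.query(
    `SELECT 1
       FROM pg_index i
       JOIN pg_class c ON c.oid = i.indrelid
       JOIN pg_attribute a ON a.attrelid = c.oid AND a.attnum = i.indkey[0]
      WHERE c.relname = $1
        AND a.attname = $2
      LIMIT 1`,
    [table, column]
  );
  return rows.length > 0;
}

async function guardedFilter(table: string, column: string, value: unknown) {
  if (!(await hasLeadingIndexOn(table, column))) {
    throw new Error(`refusing to filter ${table} by ${column}: no index on the predicate`);
  }
  // Identifiers come from the platform's own schema registry, never from
  // user input, so interpolating them here is acceptable for this sketch.
  return pool.query(`SELECT * FROM "${table}" WHERE "${column}" = $1 LIMIT 100`, [value]);
}
```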
And then we rate limit.
And that allows us to say like,
you know, no 1 can have more
than their fair share.
But the crazy thing is almost no
1 hits their rate limits.
We consider that a regrettable,
private platform-protection
mechanism, as opposed to a
pricing mechanism or an incentive.
And the reason that no 1 hits
their rate limits is because,
if you just force every query to
be indexed, you get pretty
far.
The planner is
so good, and the bitmap index
combining stuff gets you so far,
that almost nobody... or, any time
we ever have a database CPU spike,
it's usually because something's
gone wrong with our creation of
the indexes, like Gadget's automations,
their Temporal workflow, surprise,
like have failed to set the
index up properly, or we like let
the code that makes the API
calls ship before the index was
actually in place, or something
like that, you know.
But it's like Postgres just works
if you do this.
So it's really naive, but it works.
Nikolay: Are you okay to share
how many indexes you've already
created?
Harry: Yeah, we have, I think,
probably over 2 million indexes
in our system right now.
Actually, sorry, I take that back.
That was at the time of the sharding we
did; we've reduced that drastically
since then, but for kind of unrelated
reasons.
But yeah, basically if a user wants
to be able to, in their API,
sort or filter on a column, we
just say there has to be an index.
And so we have, I don't know, most
of the columns indexed.
And like, oh god, I can feel the
hate mail now, talking to a
giant legion of database engineers who are all screaming in their
cars right now.
Why are you indexing it all?
No, no, no.
The modern hardware that the respected Nik just mentioned means
that the cost of this, while substantial, is not actually that
crazy.
It's higher, no doubt.
But we're not in the days where you need to be so incredibly
stingy about every index you make, especially when the cardinality
of the tables are like a million, 2 million records, as opposed
to like 10 gazillion, right?
When our first 10 gazillion customer comes along, we're going
to work with them closely to be like, you don't need that index.
Like, let's turn that off.
Your stuff will go faster.
But for the vast majority of people, the overhead adds up,
but it's not so meaningful that our business is busted.
Nikolay: Yeah, so I wanted to say that this is a pet versus cattle
philosophy.
If we think about indexes as pets, we need to get rid of
unused indexes, take care of bloat, and all that.
But at your scale, before the schema became sharded, with millions of
indexes, you think about them as cattle as well,
just inside 1 Postgres instance, right?
Before it became sharded.
This is an interesting approach.
And I agree: since we have small tables, we know there is
write amplification, because if an update is not HOT, it
needs to update all those indexes, and some of them are not used.
But since the approach by design treats them as cattle, okay, it's not that
bad.
Yes.
This is an interesting case; I must say I never saw it before. It's
interesting.
Michael: I've heard of 1 other case which was Heap Analytics
blogged about, something about, yeah you know the 1, great.
I sure do.
So that was the only other time I heard about this.
Nikolay: Was it also where they discussed that they create indexes
without CONCURRENTLY, right?
Michael: Yeah, because you can't do more than 1 at the same time.
Nikolay: Yeah, to parallelize work.
Michael: Yeah.
Harry: What's neat or strange or weird about Gadget is we kind
of participate in the whole SDLC, like the whole life cycle of
like authoring a feature, debugging it, deciding it's good, merging
it and deploying to production.
Gadget has a development environment as well as the production
environment.
And so we kind of get to witness the moment that you want to
go live.
And we just do a normal style, call it Rails or whatever, migration
where on deploy, we create the indexes concurrently behind the
scenes.
Then we ship the code that makes use of them.
And then it's like, we can do that because we're not sort of
exposing the raw SQL and asking people to manage their own schema,
so on and so forth.
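That deploy-time ordering, sketched (names are placeholders; this assumes node-postgres, which runs each statement outside an explicit transaction by default, as CREATE INDEX CONCURRENTLY requires): build the index concurrently, check it actually became valid, and only then release the code that depends on it.

```typescript
import { Client } from "pg";

async function createIndexThenShip(connectionString: string) {
  const client = new Client({ connectionString });
  await client.connect();
  try {
    // Concurrent build: no long write lock, but it can fail and leave an
    // INVALID index behind, so verify before shipping dependent code.
    await client.query(
      'CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_todos_title ON "todos" ("title")'
    );
    const { rows } = await client.query(
      `SELECT i.indisvalid
         FROM pg_index i
         JOIN pg_class c ON c.oid = i.indexrelid
        WHERE c.relname = 'idx_todos_title'`
    );
    if (!rows[0]?.indisvalid) {
      throw new Error("index build failed or is invalid; not shipping dependent code");
    }
    // ...now flip the deploy / enable the API feature that uses the index.
  } finally {
    await client.end();
  }
}
```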
But what do I want to share about this that I think is the most
interesting?
Oh, yeah.
So the first thing is, like, giant catalogs where you have, you
know, we had, I think, at peak in 1 instance, like 250,000 tables,
1.4 million indexes.
Giant catalogs suck, but were more manageable than I was guessing.
I was assuming there was going to be some O(n²) system
table inside Postgres, or some doinky little thing that just
breaks when things are that big.
But no, that instance operated well.
The thing that did break was a bunch of the analytics tools out
there.
Like, we were using...
Like pg_stat_statements?
pg_stat_statements was OK.
It was big.
It takes a lot of data, but it wasn't, like, an ungodly amount
of performance overhead.
But, like, pganalyze and many of the kind of collectors, they're
like, let's get the size of every table every minute, you know?
It's like that's a hard no.
Nikolay: Our monitoring also didn't work properly because of
the catalogs.
Harry: Yeah, but you guys only had to make a couple tweaks.
I think we're talking about pgwatch2, like the Postgres.AI
remix, and it works well for us.
It's the kind of fastest 1.
I haven't tried pgMustard, but I must.
I
Michael: owe them a minimum.
It's not
Nikolay: a money-ching code.
Harry: It's all right.
So that was surprising.
But the other thing, actually, this is just a fun fact.
So we did our sharding, we moved this gazillion number of tables
over to AlloyDB.
We turned on their analytics thing.
It's called Advanced Query Insights.
And it's not bad.
It's a nice UI for pg_stat_statements, souped up a bit.
They store all the data for it in a private database in your
instance.
So like you have an almost superuser but not a full superuser
and there's this one database you can't touch that has it in it.
And currently, this is my fault, I set it all up and I was
like, oh, this is probably going to be not that expensive.
Currently our like AlloyDB Insights database is almost 1.5x the
size of the actual database.
And I think it's because the query diversity is so high, and
all the data that they're keeping about each one adds up so much,
that yeah, we're paying more than double in storage just for
the privilege of this thing.
And half the queries time out.
But again, I'm going to blame me for that, not Alloy for that.
We just need to turn that off.
Nikolay: I'm curious: if we have just one AlloyDB
database and we store metrics inside the same thing, and it's
down, how will we troubleshoot?
Harry: I think it is only like query insights like the equivalent
of pg_stat_statements.
I guess you're right, I don't know.
Michael: Isn't the same true for Postgres like with pg_stat_statements?
Nikolay: Well yeah, but that's why we need external monitoring
tool which will pull data and store it independently for troubleshooting.
I thought about this like self-observed or externally observed
system.
I think we actually need both.
But if it's just self-observed, it might be a problem.
Harry: Yeah, it goes full circle to that same thing we were talking
about with the infrastructure provider earlier, which is like,
they don't have anywhere else to put it unless they're going
to pay for it.
Right?
So by putting it in my instance, I pay for it, which I am paying
for dearly, and I'm now going to turn off, you know, but I agree.
You do get the system metrics that they store in their,
you know, fancy time series database, but yeah.
Nikolay: Yeah, I didn't know about this.
Interesting.
Harry: Yeah, I think, if I may, the most interesting thing we've
done to manage this ungodly number of indexes and tables is classic
computer science, a hybrid strategy.
What we noticed is many, many, many Gadget apps are like tutorials.
It's like someone wants to kick the tires, or they're building
a to-do app and they only ever add three to-dos.
It's kind of excessive to create them one to-do table, one user table,
indexes on many of those columns to store zero to five rows.
So we use a thing we call a storage strategy pattern, where different
Gadget models are backed by different data stores.
We still store them in the same Postgres, and that's so that
we can still transact against them together.
Like we support real, actual Postgres transactions, but we use
doinky little JSONB tables when the cardinality is small.
So we have 1 shared JSON table, it has an ID, a model identifier,
and then just field values, and we don't index that at all.
No GIN, no nothing.
We just do a table.
You could call it a table scan even though it's not, it's like
a range scan.
And then we just bound the cardinality of that.
And then when we witnessed the, I think it's like 22,500th record,
we swap out kind of a real table that has the real indexes and
whatever.
And so that cut our table count and our index count down by a
lot, a lot, a lot.
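An illustrative version of that hybrid (names and the threshold constant are invented placeholders, not Gadget's real schema): small models live as rows in one shared, unindexed JSONB table, and past a cardinality cutoff the platform promotes them to a real table with real indexes.

```typescript
import { Pool } from "pg";

const pool = new Pool({ connectionString: process.env.DATABASE_URL });
const PROMOTION_THRESHOLD = 2500; // placeholder cutoff

// Shared table sketch: shared_model_rows(id bigint, model_id text, field_values jsonb)
async function insertSmallModelRow(modelId: string, fieldValues: object) {
  await pool.query(
    "INSERT INTO shared_model_rows (model_id, field_values) VALUES ($1, $2)",
    [modelId, JSON.stringify(fieldValues)]
  );
  const { rows } = await pool.query(
    "SELECT count(*)::int AS n FROM shared_model_rows WHERE model_id = $1",
    [modelId]
  );
  if (rows[0].n >= PROMOTION_THRESHOLD) {
    await promoteToDedicatedTable(modelId);
  }
}

async function promoteToDedicatedTable(modelId: string) {
  // In the real system this is the paused-traffic, latency-spike-but-no-errors
  // switcheroo described in the episode: create the dedicated table and its
  // indexes, copy the JSONB rows across, flip the model's storage strategy,
  // then clean up. Elided here.
}
```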
Again, that swap is 0 downtime, and it has to be.
That's again our commitment to our customers.
And that was actually quite difficult, in case someone's
doing a bulk load of data or whatever: we witness that 2,500th
record, and we have to kind of pause
traffic, do the switcheroo, and resume traffic. Like, it'll
be a latency spike, but not an error spike.
And I think that's kind of it: Gadget is a bundle, right?
Where we have this backend runtime, this frontend runtime, this
database, all these things that work nicely together.
And we see that as validation that the bundle's showing through.
Or it's like, we can only do that because we have control over
the app tier, control over the load balancers to be able to be
like, ah, hold on a second.
We've got to do some maintenance on your app real quick.
We won't drop any requests.
We'll do a little switcheroo without anybody knowing, and you're
off to the races with the full performance after that.
Michael: That's a really neat trick.
I like that.
You must have provisions in that code that doesn't allow
queries without indexes, to, like, account for tiny tables. Yeah,
oh my goodness.
This sounds painful.
Harry: We've been working on it for like 4 years, so yeah, yes,
but I think we're in a good position.
Like, I'm excited for BFCM this year, which for operations staff
is, I think, a rare thing to hear.
Like, I think I'm excited for the challenge, you know?
Michael: I saw the color drain from Nikolay's face.
Nikolay: Yeah.
Yeah.
Do you think there might be some point when true Postgres will
be fully replaced in your company by something else?
Or will there always be a place
for true Postgres, not the AlloyDB
Postgres-like thing?
Harry: Great question.
I think yes.
I do think it will be fully replaced.
And I think it's only because we
haven't seen any Alloy downsides.
Like, while it's a fork, it's still
99% true Postgres.
So we have pgvector support,
for example, and pg_squeeze
support.
It's not like a wire-compatible
thing.
It's mostly actually Postgres
with the WAL and the block
storage kind of ripped out.
Nikolay: It also has lightweight
locks and so on, all those problems.
Harry: Yes.
Yeah.
Yes, true.
I think we're going to stick on
Alloy, but not because it's the
very best, but because now we have
familiarity, now we know we
can maintain it, now we can scale
to Valhalla.
So we've sort of unlocked something
and we've got other stuff
to do.
I think the thing that would make
me revisit that is OrioleDB,
or whatever the...
Whatever InnoDB for Postgres is.
Like, when that comes around and
I don't have to vacuum anymore,
that seems like a pretty epic win
that I'd be interested in realizing.
Nikolay: Makes sense.
Great answer.
Great, thank you so much.
Enjoyed.
Thank you
Harry: guys, I really appreciate
getting to nerd out.
These are the things that keep
me up at night.
There are, like, 1.4 million indexes.
What do I do?
It's nice to let it out a little
bit.
Nikolay: Cool.
Thank you.
Michael: It's our pleasure.