To 100TB, and beyond!
Michael: Hello and welcome to Postgres
FM, a weekly show about
all things PostgreSQL.
I am Michael, founder of pgMustard,
and I'm joined as usual by
Nikolay, founder of Postgres.AI.
Hey Nikolay.
Nikolay: Hi Michael.
Michael: And this week we have
not 1, not 2, but 3 special guests
with us to celebrate episode number
100.
And we asked you, our listeners,
for ideas for this episode.
And we got some great suggestions,
including the topic of scaling
to 100 terabytes for episode 100.
So we thought that...
Nikolay: And beyond.
Michael: Oh, and beyond.
Yes, always beyond.
Yeah, no downtime allowed.
We're delighted to welcome 3 people
who have successfully scaled
Postgres exceptionally well.
And those are Arka Ganguli from
Notion.
Hey, Arka.
Arka: Hey, good to be here.
Really excited to talk about Postgres.
Michael: And hey, Sammy, Sammy
Steele from Figma.
Sammy: Thanks so much for having
me on.
Michael: And finally, Derk van
Veen from Adyen.
Derk: Yeah, thank you very much
for the invitation.
I'm honored.
Michael: We are honored to have
all 3 of you.
So yeah, thank you all for joining.
So we're gonna try a kind of panel
round table style discussion.
I'll play good cop trying to keep
us on track, but you are all
welcome to chime in with questions
for each other.
And I know Nikolay will as well.
So, let's start on the simpler
side of things.
And starting with you, Derk, would
you mind letting us know what
kind of things you're using Postgres
for at Adyen there?
Derk: Well, that's an easy question.
We keep our data within Postgres
databases.
The thing is we are processing
financial transactions.
And the thing about financial transactions
is you'd better not
lose them.
Because if you have a social media
kind of platform, right, and
you refresh your page, and you're
missing a picture and refresh
again, and the picture's there,
nobody really cares.
But if you refresh your bank app
and your money is gone, then
people, for some reason, they kind
of freak out.
So yeah, we try to not lose a single
bit of data within our Postgres
databases.
Nikolay: So you don't use data
type money as well, right?
Derk: No.
Okay.
Michael: Nice.
Thanks so much.
Sammy, how about at Figma?
Sammy: Hi, everyone.
I'm Sammy and at Figma, our platform
is an online web-based SaaS
software that allows designers
to collaborate together.
And you know, kind of the magic
secret sauce is this real-time
platform where you can comment
on a design and it immediately
shows up for other people.
And so all of that core data is
powered by Postgres and it has
to be really reliable, really fast
and low latency, so it can
feel real-time.
And we've grown 100X in the last
4 years.
So that's where we're getting our
100 for this episode from.
Michael: Nice, even more hundreds,
we like it.
And Arka, how about Notion?
Arka: Yeah, so unsurprisingly,
we also use Postgres to store
our data and make sure that it
continues being there when our
customers expect it to be.
For those of you that don't know,
Notion is kind of like a collaborative
workspace.
You can use it for documents, wikis,
product management, and
it has a lot of AI built in recently.
So really it's like the place where
a lot of customers store
their entire knowledge base and
run their company or their personal
use cases as well.
Nikolay: I just wanted to mention
that I'm a user of all 3 companies'
products. For example, Notion:
I'm a big fan of lists and so on,
so it's a great tool for that,
for the kind of person I am.
And Figma, I actually remember
I created a lot of interfaces
in Figma myself.
It's super cool, too, and especially
this real-time thing as
well.
And somehow I noticed Figma replaced
all things I used in the
past.
And Adyen, I just told Derk before
we started recording that every
time I go with my daughter to Legoland,
I see Adyen on the payment machine
when I pay for something.
And I think, oh, Postgres processing
is there.
So it's cool.
Thank you for coming once again.
Michael: I love how humble you
all are, assuming people might
not know who your companies are
as well.
It's nice.
Let's flip the order and let's
look at how you set things up
Postgres-wise.
So how things were in the past,
how things are set up now (I realize
this might be quite an in-depth
answer), and any future plans,
perhaps, as well.
So maybe you can describe that at
the highest level; that would be
awesome.
Okay, we'll start on your side, Arka.
Arka: Sure.
So I joined Notion about a little
bit over 3 years ago.
And when I got there, we had actually
just finished like our
first sharding effort, but essentially
at Notion, we run a pretty
simple deployment.
We use RDS, which is Amazon's managed
database offering.
We run Postgres on that.
And we run pgBouncer in front
of our RDS cluster for connection
pooling.
And that's pretty much it.
We do have sharding, like I mentioned,
which basically is just
like splitting up the data across
more databases.
And the logic for how to talk to
the database is actually controlled
by the application.
So depending on some ID, we know
what database to send your query
to.
And yeah, that's kind of how it
is even today.
We have a few more databases than
when I started, but the overall
high level architecture is pretty
much the same.
Michael: Nice.
I've read both of your sharding
and resharding blog posts.
I'll link both up in the show notes.
But I'm guessing that will continue
to work. Any reason it won't continue
to work for the foreseeable future?
Arka: Yeah, I think our plans are
not really to move away from
Postgres.
It's worked really well for us
so far, and we've got a lot of
people at the company now that
have built up a lot of knowledge
around how to operate this thing
at scale.
So definitely something we want
to stick with moving forward.
Probably we do have to re-shard
in the future at some point, but
in the longer term we're actually
looking at moving to more of a
cellular architecture, where we
can deploy, maybe, different versions
of Notion to scale our capacity,
with their own databases and caches
and things like that.
So we don't have to continuously
spend our time re-sharding over
and over again, because that's
a very time intensive and operationally
intensive kind of thing for engineers
to take on.
So we probably do have to re-shard
in the near future, but in
the longer term, we hope our more
like cellular architecture
will be able to scale much further.
Michael: Super interesting.
Sammy, let's go to you.
I know you're sharding as well,
but slightly differently, I believe.
Sammy: Yeah, well, we definitely
have a lot in common with what
Arka just described.
And so your blog posts were a great
inspiration for us as we
were in the early days of horizontal
sharding.
So similar high-level stacks, in
that we're also on RDS Postgres,
Amazon's managed system, and we
also use pgBouncer for connection
pooling, at least for now.
And where we start diverging from
Notion is that I think our
data model is somewhat more complex
and less naturally a good
fit for sharding.
My understanding is Notion had
shard IDs based on the orgs or
the workspaces and that that worked
pretty well.
But at Figma, we have a lot of
data not associated with an org
and data moves quite frequently
between orgs, which makes that
kind of sharding model quite hard.
And so we actually ended up going
with a more tactical approach
of, for each table, picking a set
of a few shard keys that would
be more optimal.
And we ended up building a proxy
layer so that instead of the
application having to be aware
of shards, we actually have this
DB proxy service we built, which
is a Golang service.
And it's able to do all the routing
and handle all of the horizontal
sharding, scatter gathers where
a query hits many shards, and
then you aggregate results back.
And so that's kind of a superpower
for us to have a lot more
control in that layer.
And long-term, we have sharded
our simplest, highest write rate
tables, but we still have a ways
to go until everything is sharded.
And we have to build out a lot
more infrastructure to support
that.
So unlike Notion, we actually expect
re-shard operations to be
quite common.
And so our goal is 1-click
failover, where you can just hit
a button and have a re-shard
operation happen transparently
in the background.
Nikolay: Sounds like logical replication
is involved, but maybe
we'll talk about that later.
Derk: I'm so interested in this
proxy layer.
Did you build it yourself in-house?
Sammy: We did.
Yeah.
So it's a Golang service.
So it has a couple of components.
It has a query engine, which has
a parsing layer that takes in
raw SQL and transforms it to an
AST.
And for that part we did borrow CockroachDB's
Postgres parser, which
has mostly worked well for us with
a few modifications. But then,
for the rest of the logic, we have
a topology layer that we've
built that knows where all the
databases are and what shard keys
map to what tables.
And so the query engine really
does all that evaluation and query
rewriting.
Derk: Yeah, that sounds pretty
mind blowing to me.
It's really cool.
Nikolay: Yeah, there are a couple
of projects like PgCat and
SPQR which
Sammy: do that,
Nikolay: but maybe when you started
they were at a very early stage,
right?
Sammy: Yes, actually to clarify,
we wish we had named this
something besides dbproxy, because
PgCat would not be a replacement
for what we have. We actually
might use PgCat to replace
pgBouncer, because we are hitting
some pgBouncer scaling problems.
I'd be curious if you guys are
also running into challenges.
But dbproxy is really, it's a stateless
service that talks to
every single database.
And so a request gets routed to
it, and it has a connection pooling
layer which talks to a pgBouncer-
or PgCat-like layer that is
specific to a single host.
So it's really playing more of
a high-level orchestrator role.
Michael: So cool.
And I love that we've already got
2 different approaches to sharding,
1 that's transparent to the application
and the developers, and
1 that's not.
And I think we're about to get
a third answer from Derk's side.
How are things set up at Adyen
and past, present, future?
Derk: Let's start with the past.
I think it was, by far, our most
well-read blog article: how to
upgrade a 50 terabyte database.
And looking back on it, that was
quite a while ago.
I think 2 years ago we had an
internal event, and it was like,
no, it's not how you upgrade a
50 terabyte database anymore.
It's how you upgrade a terabyte
database with multiple zeros
at the end.
Nikolay: At least 2, I guess, right?
Derk: Yeah, we have 2 zeros in
the end.
Nikolay: And not just upgrade,
but a 0 downtime upgrade, I just
want to clarify, because a
regular upgrade is not a super
big problem if you can afford a
couple of hours of downtime.
Derk: Yeah, we did think about
ordinary upgrades, but by our
rough calculation it would take
us like 4 or 5 weeks of downtime,
which didn't fit the bill.
Nikolay: With hard links?
Derk: No, that 1 is much faster.
Fortunately, I think we limited
it to 15 minutes of downtime.
And we already had a big fight
over those 15 minutes, but it had
to happen.
And it's not even database
downtime; it's more that all
the infrastructure around it
needed to be fixed, which required
downtime.
If you look purely at Postgres,
I think it was minutes, but it's
not just a database.
Nikolay: I know this problem very
well.
So
Derk: Yeah, that's
Nikolay: everything around.
It's more and more minutes and
then sometimes hours, right?
Sammy: Yeah.
I'm curious, what's the biggest
individual table or do you have
partitioning under the hood there
to have so many terabytes on
1 box?
Derk: What is the biggest table
we have?
I don't know, but the biggest partition
is over 30 terabytes,
which is a single partition.
Sammy: Wow.
And you don't run into vacuuming
issues with that?
Derk: I think we are the most skilled
company when it comes to
vacuum and how to tweak vacuum.
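For listeners who want a concrete starting point, per-table autovacuum settings are the usual first lever. A minimal sketch, with a hypothetical table name and illustrative numbers, not Adyen's actual settings:

    -- Trigger autovacuum on a big, update-heavy table far sooner than the
    -- defaults would (the default 0.2 scale factor means billions of dead
    -- tuples can accumulate on a multi-terabyte partition before a vacuum starts).
    ALTER TABLE payments_2024_q1 SET (
        autovacuum_vacuum_scale_factor = 0,
        autovacuum_vacuum_threshold    = 1000000,
        autovacuum_vacuum_cost_delay   = 1,          -- ms; throttle less
        autovacuum_freeze_max_age      = 100000000   -- freeze earlier than the 200M default
    );

    -- Watch running vacuums
    SELECT pid, relid::regclass, phase, heap_blks_scanned, heap_blks_total
    FROM pg_stat_progress_vacuum;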
Nikolay: And rebuilding your indexes,
which holds back the xmin horizon,
right?
Derk: Yeah, we have seen many vacuum
issues.
When I started to work with Postgres
and I was installing this
database, transaction wraparound
was some kind of weird theoretical
thing.
And then I joined Adyen, and it
became something you deal with
every week, at least.
Nikolay: So yesterday I asked my
Twitter audience what they would
ask people who manage many, many,
many terabytes.
And 1 of the questions was: how
are you?
I guess this is the case when you
have problems like 30 terabytes
in 1 table.
How long is the index creation
time, or rebuild time?
Is it days?
Yeah, days.
Days.
During which the xmin horizon is
held back, right?
So we accumulate bloat in the whole
database, for all tables, right?
It's a super annoying problem.
Derk: But most of these tables
are partitions.
So first we create an index per
partition, and then finally in
the end, we create the index on
the parent.
Otherwise we don't make it before
we hit wraparound.
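For anyone who hasn't seen that pattern: CREATE INDEX CONCURRENTLY can't run directly on a partitioned parent, so the usual workaround looks roughly like this (hypothetical table and index names):

    -- 1. Create an (initially invalid) index on the partitioned parent only
    CREATE INDEX payments_created_at_idx ON ONLY payments (created_at);

    -- 2. Build the index on each partition without blocking writes
    CREATE INDEX CONCURRENTLY payments_2024_q1_created_at_idx
        ON payments_2024_q1 (created_at);

    -- 3. Attach each partition's index; once every partition is attached,
    --    the parent index becomes valid automatically
    ALTER INDEX payments_created_at_idx
        ATTACH PARTITION payments_2024_q1_created_at_idx;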
Nikolay: Oh, wraparound.
It's also a problem, but I'm talking
about also vacuum and bloat
problems, so multiple problems
here.
Well, that's super interesting
experience.
Derk: It keeps us busy.
But it's also, it's an honor to
have these problems. I think
all 3 of us got these big databases
because the company is very
successful.
Nikolay: Right.
Derk: And I think at the end of
the day, having a big database
is actually a design problem.
You forgot to design in time for
the size you need to accommodate.
Sammy: See, I actually disagree.
I think it's a good, an expected,
problem.
If you design a company for scale
too early, then you're probably
not going to have a company, if
you're trying to solve those problems
when you're 5 or 10 people.
I think 1 of the cool things about
Postgres is all of our companies
did get very far on 1 host.
Like Figma existed for 7 or 8 years
and was pretty successful
before we had to actually start
scaling out.
Derk: That's definitely true.
Nikolay: I agree with that.
And my team helped a couple of
companies who went to IPO while
on a single Postgres cluster,
and they got valuations of billions
of dollars with 1 cluster, which
was impressive.
Of course, later it should be changed,
but yeah, I agree that with
1 cluster you can scale a lot.
But I would not like to be in
your shoes dealing with multi-day
index rebuilds and also transaction
ID wraparound problems.
So how do you escape from these
problems today?
Like, 30 terabytes in 1 partition,
it's quite difficult, right?
Is it to be partitioned into
smaller partitions, or...?
Derk: Yeah, I'm thinking about
how to answer this question.
I think before I joined Adyen,
I tried to manage my database
in a proper way to do it by the
book and by the rules and everything
is fine.
And Adyen is pushing me to find
which boundaries can be bent
or violated without too much risk.
I think that's the way to put it.
We are still on the safe side,
but with more knowledge you're
better able to decide when you
cross a boundary, when it's
really dangerous, and when it's
safe to do but not strictly
advised.
Sammy: I think that's definitely
one of the benefits of staying
on a system like Postgres that
you end up understanding really
well: all these limits are extremely
specific to the workloads
and the tables, and even what works
for one table at Figma doesn't
work for another.
And so there is a lot of just accumulated
organizational knowledge
that makes it easier to run these
systems at scale and understand
actually at what point do things
get scary.
Nikolay: Yeah, I agree.
If, for example, index creation
takes one day, but our real transaction
ID doesn't grow a lot, it's not
a big problem.
But if it grows very fast, then
we have two kinds of problems.
So transaction ID wraparound and
bloat accumulated.
I agree.
So it's very specific to particular
workloads.
Michael: I want to go back to Derk
quickly because you've done
such a great series of blog posts
on partitioning.
I'll share those.
I've reread them today, and part
3 promises a part 4.
So I'm hopeful of getting
another installment at some point.
But yeah, would you say
partitioning is largely the way
you've scaled there, or are there
other strategies?
Like, are you moving data out?
How are you dealing with it,
and how are you planning to in
the future?
Derk: Yeah, I think we started
with horizontal sharding and then
within every shard, tables are
partitioned, but financial transactions,
you need to be able to refund them
for a very, very long time.
So it takes quite a long time before
you can actually archive
data.
So we have a lot of old data, which
is still around, but these
partitions kind of shift further
to the back of the row,
and vacuum is much easier if you
don't change a lot on a partition.
But at the moment we are actually
reaching the limits of our
sharding solution again.
So we basically start redesigning
the entire system again.
Arka: You know, that's really interesting.
Because at Notion, it's interesting
that you chose Postgres partitioning
as the thing that got you really
far, because at Notion we actually
ran into, I don't know, really weird
issues with dropping indexes on
partitioned tables. But that's
because I think this database was
running on Postgres 12.
And I think a lot of these things
may have been fixed in later
releases.
But we actually went the opposite
way, where we said: we don't
want Postgres partitions, let us
manage our own.
Because, I don't know, there was some
weird behavior we ran into
when creating and dropping indexes,
especially because some things
you can't do concurrently on partitioned
tables.
Derk: Yeah, dropping indexes is
a nightmare.
I always tell my developers, you
can create indexes on a partitioned
table, but if you want to drop
them, I come after you.
Arka: Yeah.
Fun problem.
Michael: Actually, Derk, you said
something earlier about knowing
which rules you can kind of bend
or break.
One of the ones you mentioned in
those posts was around adding
check constraints.
And so adding them in a not valid
state, but then not running
validate afterwards.
I think you mentioned just updating
the system catalog.
And that's such a cool trick.
But like, how?
Yeah, I guess you have to be...
Nikolay: ...comfortable with it.
We used this trick like 15 years
ago, when validate was not an
option.
So we used something like that,
but it's an undocumented
thing.
Don't do it if you don't own the
database. For example, if
you're a consultant, this is not
something you should recommend,
because who knows what will happen
next.
If you own the database, if you work
inside the company, it's probably
okay.
It's not official.
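For context, the trick being discussed looks roughly like this; the table name is hypothetical, and the catalog update is the undocumented, at-your-own-risk part:

    -- Documented and safe: adds the constraint without scanning the table
    ALTER TABLE payments
        ADD CONSTRAINT payments_amount_positive CHECK (amount > 0) NOT VALID;

    -- Documented, but scans the whole (huge) table:
    -- ALTER TABLE payments VALIDATE CONSTRAINT payments_amount_positive;

    -- The undocumented shortcut: mark it validated by hand, only if you are
    -- certain every existing row already satisfies the check
    UPDATE pg_constraint
    SET convalidated = true
    WHERE conrelid = 'payments'::regclass
      AND conname = 'payments_amount_positive';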
By the way, I wanted to emphasize
that the problems we just started
touching on, most people who listen
to us don't have them, right?
Because these are extreme problems.
Derk: I'm not completely sure.
I think partitioning becomes useful
for a lot of people way before
you hit the limit where you have
to.
So if you're listening to this
podcast and you think, yeah,
partitioning is too far away, just
think again. Because if you
do it now, when you have all the
time in the world to figure
it out and find a good strategy,
then you don't end up with a huge
partition here, a small partition
there, weirdly partitioned things.
So I would say: just start in time.
Nikolay: I'm on the same page with
you here, because you mentioned
this rule of 100 gigabytes as the
threshold at which you need to
start partitioning.
We apply the same rule everywhere:
if you exceed 100 gigabytes, it's
time to partition. I agree.
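As a reference for anyone approaching that threshold, declarative range partitioning looks roughly like this (hypothetical schema):

    CREATE TABLE events (
        id         bigserial,
        created_at timestamptz NOT NULL,
        payload    jsonb,
        PRIMARY KEY (id, created_at)   -- unique keys must include the partition key
    ) PARTITION BY RANGE (created_at);

    CREATE TABLE events_2025_01 PARTITION OF events
        FOR VALUES FROM ('2025-01-01') TO ('2025-02-01');

    CREATE TABLE events_2025_02 PARTITION OF events
        FOR VALUES FROM ('2025-02-01') TO ('2025-03-01');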
Sammy: Interesting.
We have a lot of tables larger
than that that aren't partitioned,
but probably about a terabyte is
what we aim for max for our
largest shards, although we have
a 116 terabyte table today.
But another interesting topic that
I think is more relevant to
people are upgrades.
We have built this no downtime
tooling that we've mostly used
for re-shard and horizontal sharding
operations, but more recently
we've been using it for no downtime
major version upgrades with
the ability to roll back if you
run into problems.
And so I think that's something
that could be interesting for
anyone who's running Postgres.
Upgrades are always a hard thing
to do.
Nikolay: Yeah, by the way, we forgot
to mention that Adyen is
self-managed.
It's not on RDS, right?
Because it's very different.
Yeah.
I'm very curious how you did this
in the RDS case, not using Blue-Green
deployments, right?
Just fully...
Sammy: At least today...
Well, Blue-Green only very recently
became available for Postgres.
Last year it was really only MySQL.
And so we'll probably explore it
going forward.
1 blocker, which we're actually
talking to the RDS team about today,
is rollback.
Exactly.
There is no way to roll back.
Nikolay: And that
Sammy: is the main thing.
Nikolay: That goes against the
Blue-Green deployment idea.
It should be symmetric: when you
switch over, reverse
replication should be set up
immediately.
Without this, you don't have a rollback
plan, right?
Sammy: Exactly.
It means data loss.
Yeah, so that's what our failover
operation does.
I think it's also nice to have
more control when you do it locally
because we're able to, for instance,
do a no-downtime replica
failover.
And because most of our P0 critical
workflows read from replicas,
that means that most failovers
don't actually incur any downtime.
Whereas, you know, if you're using
a managed thing like Blue-Green,
you lose that kind of control.
Nikolay: Yeah, well, I'm super
curious about details here, but
maybe it should be a separate discussion.
And first of all, I wanted to thank
all of you.
We invited you because you shared
very great posts.
So this knowledge sharing is super
important.
So I would be happy to see more
and more coming.
And maybe if we talk more and exchange
ideas, we could collect
better materials, for example,
how to do 0 downtime upgrades,
because I know very good recipe
for self-managed Postgres, but
I don't know for RDS, because they
don't allow you to control
recovery target LSN.
So I'm very curious to talk about
details, but maybe in a separate
discussion.
Because I know Michael has a different
plan, not to talk half
an hour about upgrades.
Michael: Well, maybe we could actually
do a quick lightning round
of which major version each of
you are currently running, if
you know and are happy to share.
Nikolay: Is it a single version
only?
Michael: Also a good question.
Sammy, are you happy to share that
1?
Sammy: Sure, I can definitely talk
about that.
Well, we are newly, as of 2 months
ago, fully on Postgres 13.
We had a couple of Postgres 11,
very legacy boxes.
They were some of our earliest,
largest databases, and that
actually was the main motivation
for building out this upgrade
workflow: to get those onto
Postgres 13.
And we would love to use some of
the features that are available
in Postgres 15 and 16.
So we're probably going to
look into fleet-wide upgrades
sometime next year to get to a
newer version.
But this is where these one-click
upgrades or failovers really
matter, because today it's quite
toilsome, and it would be pretty
painful to upgrade the whole fleet.
Michael: Yeah, absolutely.
How about you, Arka?
Arka: Yeah, we're actually, like,
90% on Postgres 15.
And it is my life's mission
Nikolay: to
Arka: get that final 10% over.
Yeah, like most of our databases
are on actually Postgres 15.
We finished that upgrade last year.
And yeah, we really wanted some
of the features that I think
got introduced after Postgres
14, especially around being
able to filter the replication
stream, because that's something
we want to use in the future, probably,
if we want to re-shard;
that'll be a very useful thing
for us.
And just wanted to touch on Sammy's
point a little bit.
We actually use the exact same
kind of, I think, failover that
you guys use as well at Figma based
on pgBouncer and with the
reversibility as well.
And that's something that has been
really, really surprisingly
amazing.
Nikolay: Yeah, you mean pgBouncer
pause-resume.
I guess this is why you don't use
RDS Proxy, being on RDS, right?
Because it doesn't have pause-resume.
Yeah, that's great.
I also was surprised last year
that it can handle really heavy
workloads.
Yeah.
Just a few seconds spike of latency,
you can upgrade and switch
over to a different primary.
Arka: Yeah, it was one of those things
where, when we were writing
out the tech spec for it, I was
like: is this thing actually going
to work?
Nikolay: The problem with this,
by the way, is that there are no
good materials about it.
Nobody writes about it; it's like
a public secret.
Nobody talks about it somehow,
but it works really well.
So we need more materials proving
that pause-resume works well
in pgBouncer.
Sammy: Yeah, it's been great for
us.
We have some pgBouncer scaling
problems, but everything on the
failover's pause-resume side
has been flawless.
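For anyone curious what that looks like mechanically, the switchover is driven through pgBouncer's admin console; a rough sketch, with the database name and the repointing step left abstract:

    -- On pgBouncer's admin console (for example: psql -p 6432 -U pgbouncer pgbouncer)
    PAUSE mydb;    -- wait for in-flight transactions to finish, queue new ones

    -- ...promote the new primary, repoint the [databases] entry for mydb
    -- at it in pgbouncer.ini, then:
    RELOAD;

    RESUME mydb;   -- queued client queries continue against the new primary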
Michael: That's so good to hear.
And finally, Derk, on the major
version side of things.
Derk: We are fully on 13, and we're
having discussions about the next
version we want to upgrade to,
because every version has nice
things.
I want certain partitioning functionality;
a colleague wants
more logical replication functionality.
So, of course, 17 would be the
ideal, but then, do we really want
to run 17 already on the biggest
clusters we have?
So, yeah, to be continued.
Michael: And I think also the fact
that, I mean, Notion being
on 15 was somewhat of a surprise
to me.
I think because of these major
version upgrade challenges, I
see so many companies, even at
the cutting edge of what Postgres
can do, lagging several years behind
because upgrades are so
difficult.
So I know we won't get a chance
to discuss all the reasons why,
but I think it's proof because
we're all lagging.
Sammy: Yeah, well, I actually think
it's not for us so much fear
of newer versions.
You know, I actually worked in
MySQL before Postgres, and MySQL
major version upgrades were truly
terrifying.
These were like 2 or 3-year-long
efforts where you found a lot
of major incompatibilities.
But one thing I love about Postgres
is it is mostly backward compatible
across major versions.
And so it's not so much that we're
scared of the newer ones; it's
that we can't take downtime
to do these upgrades.
And so it's much more manual toil.
Michael: Yeah, exactly.
I think that's the case for a lot
of people.
So the last thing I was going to
ask each of you is whether you have
a funny or scary (and I don't know
which one I want most) story that
you can share with us related to
your scaling.
If any of you have got one prepared
already, feel free to jump in.
Or Nikolay?
Nikolay: Lightweight locks, lock
manager, multixact transactions,
multixact IDs and so on, something
like that, maybe.
Derk: Yeah, I think we definitely
got bitten by the LockManager,
like most people who start doing
partitioning.
It's like,
Nikolay: yeah, well,
Derk: we started partitioning
and we did a good
job.
And I built this framework around
partitioning.
By the way, Michael, the fourth
blog post, we are writing
it at the moment.
Yes.
So it will be there.
It's a nice idea.
Now I lost the rest of my story.
Oh yeah, partitioning.
Nikolay: So it sounds like you have
prepared statements.
Derk: Yeah, we do have prepared
statements.
But you know everything about prepared
statements, right?
You try it 5 times, then the optimizer
decides, do we go with
the generic plan or the custom
plan?
But we also created a lot and a
lot of partitions and a lot of
partitioned tables.
So we definitely ran out of locks.
And basically every time
we created a new partition
for a table, the CPU usage jumped
10%.
But jumping 10% was fine, because
we had a lot of CPU to spare.
So by the time it alerted, it was
pretty high.
And then we were looking back, and
we saw these jumps of 10% and
10% and 10%.
And then we realized the next jump
of 10% would put us above
100%.
And then we are not in a good position.
So then we had this small window
with all kinds of things prepared,
like: don't create any partitions
anymore on this cluster; can
we break open partition
boundaries so we can basically
extend the final partition to hold
more data?
And then we were researching this
problem, like: why is the CPU
usage jumping like this?
And in the end, we started to force
the optimizer to pick generic
plans because custom plans were
faster in execution, but definitely
not at prepare time.
So we won like a few milliseconds
in execution time, but we were
paying like 100 milliseconds in
prepare time.
And that was really heating up
the CPUs.
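The knob Derk is describing has existed since Postgres 12; a minimal sketch (the database name is hypothetical):

    -- Default is 'auto': after 5 executions the planner compares the average
    -- custom-plan cost against the generic plan and may switch on its own.
    -- Forcing generic plans skips the expensive per-execution planning:
    ALTER DATABASE payments SET plan_cache_mode = force_generic_plan;

    -- or per session, to limit the blast radius while testing:
    SET plan_cache_mode = force_generic_plan;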
Nikolay: Because during planning,
Postgres locks all tables and
all partitions, with all their
indexes, if partition pruning
is not working.
It's terrible.
Derk: Yeah, so you're running out
of your fastpath locks, because
you need to lock like 20 partitions
and 20 indexes on these partitions
for a single table.
And then you join with another
partitioned table, and all these
locks go to the lock manager,
which is burning CPU cycles.
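You can watch this happen in pg_locks: in the versions discussed here each backend has only 16 fast-path slots, and everything beyond that goes through the shared lock manager. A quick check:

    -- Locks that did not fit in the per-backend fast-path slots
    -- (these are the ones contending on the shared lock manager)
    SELECT pid, count(*) AS non_fastpath_locks
    FROM pg_locks
    WHERE NOT fastpath
    GROUP BY pid
    ORDER BY non_fastpath_locks DESC
    LIMIT 10;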
Nikolay: You're in a good club,
a bunch of companies that have experienced
this over the last couple of years.
Derk: Yeah, I was
Sammy: going to say, we hit this
for vacuuming on our bigger
tables.
When you get above like 4 terabytes,
you start seeing that at
the end of the cycle, when you
run analyze and the cache gets
cleared out, you suddenly
have these custom plans that
are 5x more expensive.
And that's when CPU spikes to 100%
for a few seconds.
Nikolay: I have a question to all
of you.
When you have these problems, do
you discuss this with people
who can influence this?
I mean, hackers somehow.
Do you raise these problems in
mailing lists or somewhere else
where hackers live?
Arka: We work with AWS support
quite a bit.
Nikolay: Well, some hackers there
are for sure, but not all of
them, right?
Sammy: Yeah, we're good friends
with the AWS RDS team and they
have actually a team that contributes
to Postgres, so we try
to leverage them to get some of
these patches in.
But it's honestly a lot of us reading
the source code ourselves
as well and trying to figure out
workarounds.
Nikolay: By the way, just last
week, I think, I realized that
the SLRU patches originally created
by Andrey Borodin, which will
be in Postgres 17, have already
been adopted by AWS RDS Aurora
for almost a couple of years.
I was super surprised.
Actually, Andrey was also surprised.
But you're not going to use Aurora,
right?
Instead of regular RDS.
Arka: Maybe.
I think, I mean, for Notion, I
think we are actually starting
to look into if Aurora makes sense.
Mostly because the AWS team promises
a lot better replication
for global replicas compared to
the RDS built-in replicas.
So it's something we're considering,
but we haven't actually
used before.
So I was curious if any of you have
any experience with that, or
with scaling read replicas in general?
Because that's also something we
haven't leveraged too much at
Notion yet.
Sammy: Well, less scaling read
replicas, but actually my scary
story was going to be around moving
things more to replicas. Because
at the same time that we were growing
100X, we were dealing with
a lot more pressure on the reliability
of our system, and trying
to move towards 4 nines globally
as a company.
And our application, because
it runs in a web browser and we have
a kind of single-threaded Ruby
monolith that serves a lot of
traffic, is very latency sensitive:
if your database slows
down for a minute, then the whole
application can back up and
your whole website can be down.
And so 1 of our biggest
reliability risks and most persistent
pain points is that RDS does have
some big latency spikes on EBS
that most customers don't notice,
but we were really hitting them.
And so our 2 big solutions there
were to basically remove
writes from our most business-critical
routes wherever possible,
and to put all reads, or 99% of reads,
onto replicas for these P0
user workflows.
And then we built out replica hedging.
So every request hits 2 replicas
and returns the first result
that is successful, which
allows us to tolerate any 1
replica being down.
And that's been hugely useful.
We also got onto io2 for EBS, which
has been a game changer for
EBS reliability.
Arka: Oh, nice.
How many replicas do you guys normally
have off 1 database?
Sammy: Today there
are only 2 large replicas.
In the future we want to move towards
more smaller replicas.
Derk: We have multiple, let's put
it that way.
Reading from a replica is much
harder when it's financial
data, because you want to have the
correct data, and it might have just
been updated.
So we try to move a lot of stuff
to the replicas, but it's kind
of hard.
Arka: Yeah, that would definitely
be a problem for financial
data.
Nikolay: Are we talking about asynchronous
replicas only, or
maybe some semi-synchronous or
quorum commit approach?
Derk: We have both.
Arka: Also, cache invalidation,
I think that's 1 of the bigger
concerns for us at Notion, because
we're super, super heavy users
of memcache.
And reasoning about how that works
for replicas is another thing
that we don't really want to do,
but we have to do.
I don't really have a super scary
story, but actually it's something
I've been working on recently,
which we ran into some surprising
behavior.
So maybe it'd be just fun to talk
about.
So recently at Notion, we've been
trying to upgrade that last
10%.
And of course those databases are
just the hardest because they're
just so big.
So we're trying to upgrade this
partitioned table that is around
20 terabytes, with like 16 partitions
on each database.
So, you know, each partition is
about a terabyte-ish.
And we were having trouble getting
this to work with just regular
logical replication.
So we actually started exploring
DMS, which is like an AWS service
that lets you migrate data between
two separate databases.
Side note, it's actually super
cool.
Like you can go from like a completely
random system to like
another new system.
It doesn't have to be like Postgres
to Postgres only.
I don't know how good it is at
huge scale.
Nikolay: It just requires a department
in your company to manage.
Arka: Yeah, so originally I was
like, oh yeah, I can do this.
And then it's been like 4 months
now and I'm still like, oh my
God, this is a lot.
But anyway, 1 of the problems we
ran into was actually related
to Postgres bloat.
Cause it turns out with DMS, you
get a lot of kind of knobs that
you can tune for the concurrency
of like the copy.
And that goes pretty fast.
Like we're able to copy, you know,
a terabyte of data in maybe
12, 14 hours with the way we have
our DMS set up.
But the thing we noticed after
we set up the target databases
is that somehow the data size, well,
so we use pganalyze as
our Postgres monitoring tool,
and we noticed that the data
size literally on disk was
like 3 times larger on the new
tables, for some reason, than the
old ones.
And you know, we were going from
like Postgres 12 to Postgres
15, so at first we thought maybe
it was like a TOAST compression
thing.
Because we know that the TOAST compression
algorithm changed between,
I think, 12 and 14 or something.
It went from pglz to LZ4 or
something.
I don't remember exactly but I
know that that changed.
So I was like, oh, is it
the new TOAST compression that's
causing the data size to be
literally much larger on disk?
Which obviously leads to much worse
latency, because then every
1 of your queries is doing much
more I/O.
Nikolay: Buffers.
Buffers.
Arka: Yeah, so you know, we tried
a few tests, like changing
the compression back to the old
1.
We weren't fully convinced still.
But it turns out we
were just hitting bloat, and
this clicked when I was
listening to 1 of your earlier
podcasts, I think on Postgres bloat,
and I realized that our vacuums
weren't actually fully completing
on the new table, because there
were just so many new transactions
being written that the vacuum could
not keep up, which meant that the
bloat just kept on accumulating
and accumulating.
And on top of that, a lot of the
data in this table was being
TOASTed, because they're just JSONB
columns.
So anyway, it was just like 2 weeks
of investigation to be like,
oh, we should just run a full vacuum.
And turns out that brought the
data size right back to what it
should be.
So it was a fun little kind of
investigation.
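A rough sketch of the kind of check that surfaces this, with a hypothetical table name; note that VACUUM FULL rewrites the table under an ACCESS EXCLUSIVE lock, which is only acceptable here because the target wasn't serving production traffic yet:

    -- Are dead tuples piling up faster than autovacuum can clear them?
    SELECT relname, n_live_tup, n_dead_tup, last_autovacuum, last_vacuum
    FROM pg_stat_user_tables
    ORDER BY n_dead_tup DESC
    LIMIT 10;

    -- Total size including TOAST and indexes, before and after
    SELECT pg_size_pretty(pg_total_relation_size('blocks_partition_01'));

    -- Rewrites the table and its TOAST data, reclaiming the bloat,
    -- but holds an ACCESS EXCLUSIVE lock for the duration
    VACUUM (FULL, VERBOSE, ANALYZE) blocks_partition_01;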
Nikolay: It sounds like pganalyze
doesn't tell you about the xmin
horizon and things like why
a vacuum can be lagging and skipping
some dead tuples that cannot be
deleted.
pganalyze has a very good vacuum
dashboard.
So I'm very curious, does it speak
about
Arka: the xmin horizon?
It actually does point those things
out.
We just didn't have the new databases
in pganalyze yet, because
we were like: oh, we're not running
production traffic on these,
so we don't need it there.
When we put it there, we realized
that, oh yeah, like the vacuums
are falling behind.
Nikolay: I'm just personally
very annoyed that Postgres
monitoring systems lack better
alerts on the xmin horizon and long
transactions, which are 2 separate
things, basically.
A long transaction can happen and
nobody tells you.
It's not right if you sometimes
have 2 Postgres monitoring systems
and nobody tells you about this.
So I'm curious if pganalyze has
some alerts.
Arka: It does.
Nikolay: Okay, okay.
Arka: It sends you an email if,
for example, you have had any
transaction running for over an
hour.
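For reference, the underlying check is simple enough to run anywhere; a sketch that looks at long-running transactions plus the other usual xmin-horizon holders:

    -- Transactions open for more than an hour
    SELECT pid, state, xact_start, now() - xact_start AS xact_age, query
    FROM pg_stat_activity
    WHERE xact_start < now() - interval '1 hour'
    ORDER BY xact_start;

    -- Replication slots can hold the horizon back too
    SELECT slot_name, active, xmin, catalog_xmin
    FROM pg_replication_slots;

    -- ...and forgotten prepared transactions
    SELECT gid, prepared, owner FROM pg_prepared_xacts;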
Nikolay: Right, but if you use
logical replication, that's not
the only reason for the xmin horizon
being held back.
So, yeah.
Arka: Right.
Nikolay: Okay.
Arka: For us, like, pganalyze has
been really great, actually.
Like, I love working with that
team.
Shout out to Lucas.
Like, he's been so amazing.
Nikolay: And also content, again,
back to content sharing.
Lucas does a great job and his
team does a great job.
Sammy: Yeah, Lucas is awesome.
I talked to him as well a few weeks
ago, and everything at pganalyze,
their blogs, are really
great.
Okay, I think it's interesting
you guys had the data copy be
such a big bottleneck.
We also had a big pain point there
when we were first doing vertical
sharding.
And what we ended up finding out
is we kind of went into the
source code for Postgres and logical
replication and realized
that it's really, really expensive
to have indexes on your database
table while you're doing logical
replication, and much faster
to drop the indexes and then recreate
them afterwards.
So it went from weeks to a day
for a multi-terabyte data copy
when you drop all of your indexes
and foreign keys.
Arka: Yeah, that's a great insight.
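Sketched out, that workflow looks something like this (publication, subscription, table and column names are all hypothetical); the point is simply that the initial copy into index-free tables is far cheaper:

    -- On the source
    CREATE PUBLICATION big_tables_pub FOR TABLE payments, transactions;

    -- On the destination: create the tables with primary keys only,
    -- leaving secondary indexes and foreign keys off for now
    CREATE SUBSCRIPTION big_tables_sub
        CONNECTION 'host=old-primary dbname=app user=replicator'
        PUBLICATION big_tables_pub;    -- initial copy starts immediately

    -- Once the subscription has caught up, add the rest back
    CREATE INDEX CONCURRENTLY payments_account_id_idx ON payments (account_id);
    ALTER TABLE payments
        ADD CONSTRAINT payments_account_fk FOREIGN KEY (account_id)
        REFERENCES accounts (id) NOT VALID;    -- validate later if needed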
Nikolay: Yeah, I'm also curious
whether you're talking about regular
logical replica provisioning, where
data is copied logically, or maybe,
for example, there is a binary
copy option, and also,
if you need it for upgrades, for
example, or you need the majority
of the data, sometimes it's better
to convert a physical replica to
a logical one, which I hope will
soon become a standard recipe inside
Postgres itself, but for now it's
only possible if you orchestrate
it properly.
I'm very curious whether you do this
trick on RDS, both of you, or
not?
Sammy: We rely a lot on logical
replication for all of our core
failover tooling, so we haven't
really explored physical replication.
I'm not sure if we really can because
of RDS.
Nikolay: On RDS it's possible if
you apply the so-called Instacart
recipe, but
Sammy: there
Nikolay: are debates about whether
it's safe to do.
Actually, Lucas's blog had a post
about it as well recently.
Arka: You're talking about the
thing where you create a replication
slot, take a backup, and then advance
the replication slot.
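For reference, the recipe being described looks roughly like this on the source side. It's only a sketch: the hard part is orchestrating it safely, the LSN below is illustrative, and on RDS the physical copy is a snapshot restore rather than a manual base backup:

    -- 1. Create a logical slot and note the LSN it starts at
    SELECT * FROM pg_create_logical_replication_slot('upgrade_slot', 'pgoutput');

    -- 2. Take a physical copy of the database (snapshot, backup and restore)
    --    and note the exact LSN that copy corresponds to.

    -- 3. Fast-forward the slot past everything already contained in the copy
    SELECT pg_replication_slot_advance('upgrade_slot', '0/5D3A1B28'::pg_lsn);

    -- 4. On the new cluster, create a subscription that uses this slot
    --    (create_slot = false, copy_data = false) and let it catch up.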
Nikolay: Yeah, either this or...
Yeah, on RDS only this, yes.
Arka: Yeah,
Nikolay: yeah.
On self-managed, you can play with
recovery target LSN, right?
Arka: Yeah.
Nikolay: So I'm curious, is it
only in Stackgres or what...
Arka: We've actually used that
before too.
So actually, before we knew the
trick that Sammy just told us
about with the indexes, we could
never get logical replication
to catch up ever to our tables.
So we actually ended up using the
Instacart trick to do our first
round of upgrades that I ever did
at Notion.
It worked well.
I think 1 of the problems you run
into with that, though, is that,
similarly, when you provision the
new database off the backup, you
have to do the extra step of actually
dropping the indexes again.
Because even just catching up with
logical replication from the
point where you took the backup,
for a sufficiently large database,
still took us like a day or so,
just to catch up on the changes
since the backup was taken.
So I think, yeah, that trick
could work really well, and probably
combined with dropping the indexes
it would speed things up significantly.
But actually for us, after that,
we've always just used regular
logical replication, and it's been
fast enough if we can provision
the schemas without any indexes.
Nikolay: Actually, dropping indexes
leads us to 1 topic I didn't want
to miss:
huge WAL volumes being generated,
and index write amplification.
Can anyone say anything about
these problems? Have you experienced
something like that, or...
Because this is 1 of the key criticism
points from Uber when
they posted this infamous article,
right?
Postgres has index write amplification.
We have HOT updates, but still,
it's a problem, right?
Actually, Derk, I remember an article
from maybe a couple of
years ago about fill factor, right?
Is it still a tool you use actively?
Derk: Well, fill factor is not
really a tool, right?
It's just a parameter you can set
for your tables and your indexes.
Nikolay: But the tool is to use
it to provoke more hot updates.
Derk: Yeah.
If we lose HOT updates, like we
did last week, we basically double
our WAL usage.
And that was only for a single
table, for some reason.
We had a longer-running query,
which means pretty long, and
it was holding back HOT updates.
And that means you just double
the WAL volume on your system,
but we have many tables, right?
And then only this single
table becomes responsible for
half the WAL of
the entire cluster.
For me, that's an amazing amount
of WAL that you can generate, or
skip creating, by
using HOT updates.
So yeah, if we have heavily
updated tables, we always try
to find the right fill factor.
I've been trying to come up with
a formula: which tables need a
fill factor, and how high
do you set it?
And it's still a mystery to me
because we have some tables which
do fine at 95% fill factor and
some do really great at 70%.
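For anyone wanting to experiment, the knob itself and the feedback loop look like this (hypothetical table name; a new fillfactor only affects newly written pages until the table is rewritten):

    -- Leave 10% free space in each heap page so updates can stay on-page (HOT)
    ALTER TABLE account_balances SET (fillfactor = 90);

    -- Check what share of updates are actually HOT
    SELECT relname, n_tup_upd, n_tup_hot_upd,
           round(100.0 * n_tup_hot_upd / nullif(n_tup_upd, 0), 1) AS hot_pct
    FROM pg_stat_user_tables
    ORDER BY n_tup_upd DESC
    LIMIT 10;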
Nikolay: And also you need to protect
this table from new indexes
being created and losing all the
hot updates somehow, right?
If developers decide to add one more
index.
Derk: Yeah, I can't.
I have no way to force my developers.
Sammy: I'm actually curious.
A big pain point we have right
now is that for really expensive
index creations, there's no good
way of throttling them.
If it eats a lot of your IOPS,
there's not really a Postgres
way we've found to tune this.
Maybe you want the index creation
to happen much more slowly in the
background over a few days for
some really large tables.
Nikolay: But it will hold xmin
again, this infamous problem
from...
Sammy: Sure, but I mean, maybe
3 days of holding that is better
than an hour of the website being
down.
Michael: This might be a stupid
question, but would reducing
maintenance_work_mem... can
you do that in a session?
Would that help?
Nikolay: More disk I/O maybe.
Sammy: Yeah, I think we have the
minimum.
Yeah, we're only using 1 thread
for the maintenance settings
there, but it has been,
a couple of times, enough to spike
our IOPS to basically the RDS limits.
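For reference, the knobs that do exist today control memory and parallelism rather than I/O, which is Sammy's point; a sketch with hypothetical names:

    -- Per-session settings for a large index build
    SET maintenance_work_mem = '4GB';
    SET max_parallel_maintenance_workers = 4;   -- 0 disables parallel builds

    CREATE INDEX CONCURRENTLY payments_created_at_idx ON payments (created_at);

    -- Note: the vacuum cost-based delay settings do not throttle CREATE INDEX,
    -- which is why there's no graceful way to slow it down.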
Derk: I always want indexes to
be created faster.
Nikolay: Yeah.
But it requires faster disk I/O
capabilities, of course.
Michael: This is what I love about
true scale: you just hit
trade-offs.
You hit times when 1 thing's best,
and you hit other times when
the complete opposite is best.
So yeah, I'm really conscious of
time.
You've all been amazing.
Thanks so much for joining.
I wanted to give you each an opportunity
to say anything else
you wanted to add or forgot to
say, or feel free to give a shout-out
if you're hiring, that kind
of thing.
Nikolay: Let me ask 1 more technical
question.
Does everyone use foreign keys
everywhere, or did you decide to
drop some of them, or all of them?
Arka: No, I will actually say something
about this.
I think like, honestly, we use
probably Postgres in like the
least interesting way possible.
For our sharded cloud databases,
we don't have foreign keys.
We don't do joins.
We don't have any triggers.
We don't really have anything
like that: we don't have any
generated columns.
Like we limit so many things that
you can do that we're basically
using Postgres as like a blob store
at this point, which is why
we end up with really giant databases
with a lot of indexes,
but the queries are quite simple
actually to retrieve the data.
And I think that's been like probably
1 of the primary reasons
that I've actually, and like my
team has actually been able to
kind of keep up with the growth
of our Postgres cluster because
we're able to do things like re-sharding
without having to, you
know, talk to every team at the
company and be like, can you
please change this query?
Can you please not do this thing?
You know, I think a lot of people
get really excited about using
all the fun features of relational
databases, but actually you
can use it in a very boring way,
but still get some of the benefits
of relational databases, such as
consistency and things like
that.
So I think that's, like, my 1 trick
for scaling with any database,
not just Postgres: MySQL and really
any other thing that you
can think of.
It's good to be aware of some of
the pitfalls of the more interesting
features.
Nikolay: So no foreign keys?
Arka: Absolutely not.
Okay.
Not allowed.
Okay.
Nikolay: Any more input?
Sammy: Yeah, we do allow foreign
keys on the sharding key.
And similarly, we allow joins on
the sharding key.
So for us, it's a matter of picking
a sharding key so that most
operations are happening within
that.
And then that allows us to avoid
rewriting a lot of application
logic.
Smart.
Derk: We also have foreign keys.
At the Postgres conference, which
is in 2 weeks' time, a little bit
less, I'll tell you how to find a
proper partitioning strategy,
because up to now I've been focusing
on partitioning a single table in
the best way possible, but I kind
of forgot to look at the bigger
picture.
So all my tables are partitioned
slightly differently, with different
partition boundaries.
And then the entire system becomes
a bit messy.
So now we are working on basically
straightening out all the partition
boundaries, because then the joins
and the foreign keys work
much more efficiently.
So it's partition pruning.
Michael: Clever.
Derk: I kid you not, some queries
became 20 times faster by straightening
out partition boundaries.
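The planner feature that aligned boundaries unlock is partition-wise join, which only kicks in when both tables are partitioned identically and joined on the partition key; a sketch with hypothetical tables:

    -- Off by default because they add planning overhead
    SET enable_partitionwise_join = on;
    SET enable_partitionwise_aggregate = on;

    -- With payments and refunds partitioned on the same boundaries,
    -- this join can run partition by partition and prune the rest:
    EXPLAIN (COSTS OFF)
    SELECT p.id, r.amount
    FROM payments p
    JOIN refunds r ON r.payment_id = p.id
                  AND r.created_at = p.created_at
    WHERE p.created_at >= DATE '2025-01-01';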
Michael: Wow.
Sammy: I'll make my last plug on
the hiring side.
My team is hiring really aggressively
right now.
We're basically trying to double
this year.
And in particular, we just had
an awesome senior staff engineer
become a manager.
And so we have a big gap in senior
staff, principal type engineers,
folks who have worked at scale
before.
So if you're interested in those
problems, please reach out to
me or come apply.
Arka: Notion is also hiring.
But my pitch is definitely not
as refined as Sammy's.
But if you want to come work on
fun problems, check out our careers
page.
Derk: I actually think this
might be the first time in
20 years that we are not expanding
my team this year.
Nikolay: Oh wow.
Michael: But we should plug Derk's
talk at the Postgres conference,
which you can watch.
So I'll include links to all of
these things.
One final thank you from me.
This has been amazing.
Thank you so much for joining us
for episode a hundred.
Nikolay: Yeah.
Thank you so much.
I just want to once again, like,
say thank you for the knowledge
sharing.
Super important.
Looking forward to more, definitely
to more blog posts and maybe
some run books published.
I don't know, like things like
you use in your daily work maintaining
these large databases, large Postgres
databases.
So I'm super curious to see more
from you, and maybe to collaborate
on some things, some knowledge
exchange and so on.
Thank you.
Derk: Thank you very much, both
of you.
It's really cool to be here, especially
in such good company.
Sammy: Yeah, this is a really fun
conversation.
Thanks for getting us all together.
Arka: Yeah.
Thanks again.
I really enjoyed this.
I hope to be back in the future
as well.
Nikolay: This was the best episode
number a hundred we could
imagine.