
MultiXact member space exhaustion
Nikolay: Hello, hello, this is
Postgres.FM.
This is Nik from Postgres.AI.
And as usual, co-host Michael.
Hi, Michael.
Michael: Hi, Nik.
Nikolay: Yeah.
And we invited a couple of guests
today, Nate and Andrew.
We invited them because the company Metronome experienced outages in May of 2025, which were extremely hard to diagnose, but these folks did a very good job and discussed all the details in an excellent blog post.
That blog post looked so great that we couldn't miss it.
So thank you for coming.
Let's dive into some very rare
Postgres problems.
Nate: Yeah, thanks for having us.
And hopefully sharing this out
will keep making it rare.
Andrew: Yeah, yes.
Thank you for having us.
It's a pleasure to be here.
Nikolay: Great.
So, Michael, where should we start?
Michael: How about we start with,
could you guys give us a little
bit of an overview of what Metronome
does, why you're using Postgres,
and what you were up to that was
hitting the limits this hard?
Nate: Yeah, so I can start with the overview, and then Andrew and I can pair on the exact way that we kind of blew our foot off here.
Andrew: Absolutely.
Nate: Yeah, so Metronome is a usage-based
billing platform.
And so, you know, when it was founded, I think coming up on 6 years ago now, at that point in time, just before COVID and lockdowns began, the target was really that there are a lot of infrastructure and service companies, and most of them are billing on some kind of heartbeat, you know, per-minute or sometimes per-second model.
That gets really tricky to do at
scale.
In the simplest cases, yeah, you're just incrementing a counter.
But as things ramp up and your account executives go out and cook up all different ways to finesse how they want to charge that client, it gets really complicated.
You add in things like committed
spend, credits, you want to
automate it, you want to cut people
off and they're not paying.
The complexity really compounds
immediately.
And one of the founders had worked on billing at Dropbox and had also encountered issues where, when you build this internally, which is where everyone kind of starts, it gets very easy to build a brittle system by accident.
It builds to your local maxima, fits your current pricing and packaging, and then launches get delayed.
Like they can't launch their product, or they can't charge for it for weeks or months.
So you get led into the spot where
you're delaying launches or
just like burning compute because
you can't actually charge for
it, which feels really silly.
And then, frequently, you just accept a lot of slippage, because building these systems to be perfectly accurate gets really tricky.
And thankfully there's a lot of
foundational pieces such as Kafka
exactly once and strong transactional
databases to make this
possible to build and have confidence
in.
But it's very difficult to do this
internally and prioritize
that degree of rigor.
And also, if you're building an infrastructure company, you know, let's say you're a database startup, your best engineers are there because they want to work on building a database; they're not actually there because they want to work on building billing.
And so we also think that, you know, kind of consolidating talent to work on this problem yields a better result.
And I think the marriage of the business logic domain, the exact correctness needs, and the overall data volumes (we were right place, right time for this AI boom) introduces a lot of very interesting challenges.
And well, we're about to talk about
one of them.
Nikolay: And you chose Postgres,
obviously, right?
Nate: Yeah, the Postgres decision
preceded both of us by a ways,
but I think we're broadly quite
happy with it.
Nikolay: Okay, good, good.
So do you know why Postgres was
chosen in the past?
Nate: Not precisely.
I mean, like, probably, yeah, we
immediately went to using RDS.
I don't know if it was immediately
Aurora.
Interesting.
Yeah.
I think it would be an interesting
decision as a small startup
who anticipates having a lot of
data scale to want to deal with
the schema change pain associated
with MySQL.
Nikolay: Yeah, I wanted to add a joke that maybe somebody googled for an open source relational database with a good money data type.
But yeah, you know, the money data type is what people shouldn't use in Postgres.
This is...
Nate: Yeah, no, I think we're mostly on numeric, but at this point we're kind of cursed that we didn't lock down the precision more tightly, because Postgres will give you far too much precision.
And kind of having to interoperate with any other system becomes difficult.
And so we're starting to figure
out how to unwind some of those
decisions.
Michael: Yeah, yeah.
That is an interesting tangent.
Especially, I guess, with people
billing by fractions of a cent
and things.
So yeah, anyway, probably too much
of a tangent, or maybe we'll
come back to it later.
Nate: Yeah, I mean, in short, yeah,
it gets very painful translating
between the different systems that
do or don't have that precision.
Sometimes even your billing provider, understandably, can't bill in fractions of a cent, which means that, depending on how all the different sub-products you paid for round together, you might not end up with the same total.
Michael: Yeah.
Yeah.
Nikolay: Can you tell us a little
bit about the scale?
I remember 30 terabytes on Aurora
from the blog post.
Anything in this area you could share to give us a feeling of the scale?
Nate: So we have 2 primary workloads
that are running on Postgres.
The 1 we'll mostly be talking about
is our main DB, which is
a common start-up pattern where
we've got our kitchen sink database
that has nearly every dataset within
it.
And yeah, that's ballooned out
to around 30 terabytes now and
growing terabyte plus per month
at this point.
And yeah, this wasn't an issue for a while.
And then depending on how our clients
leverage our platform,
you know, part of billing is generating
invoices and invoices
are usually characterized, you
know, with like, you know, a top
level total and whatnot, but then
you have a line item per product
usage.
And, you know, some of these companies
have a lot of different
products you can pay for.
You know, you look at the list
of different LLM model engines
that are available.
Sometimes you can use batch, async
versus real time, or like
infrastructure vendors who are
deployed to dozens of different
regions.
And it's easy for the cardinality
here to explode, especially
if you want to group it by a logical
application on their platform.
And we made, but we now regret, the decision to basically persist those line items as separate rows in a ginormous table in our Postgres database.
Nikolay: How big is that table?
Nate: It's about half the cluster
and growing.
Nikolay: Like 15 terabytes unpartitioned,
right?
Yep.
Yeah.
Yeah, well, I know this.
It's a pain,
Nate: Yeah.
There's some scaling problems you
get really ahead of and others
that you defer because I mean,
a lot of the technology in the
Postgres space is really fantastic
now.
Like Aurora does probably just
handle it, copy out, making sure
that you're dealing with vacuuming
and whatnot and all the other
kind of implicit thresholds.
Nikolay: Actually, I forgot to mention that those who watch Postgres.FM regularly should by this point have already watched the previous episode with Sugu, where we discussed Multigres and sharding.
And in that very episode, he mentioned your case, right?
So yeah, I wanted to answer that this is what we're going to do next week, but it was a secret at that time.
And before that we also had Lev Kokotov from PgDog.
Obviously sharding is a hot topic again: there was Citus in the past, there is SPQR, and before that we had PL/Proxy from Skype very long ago, like a previous era.
So is it right to say that you wouldn't hit the problem we are going to discuss today if you were on a sharded schema?
If your database was sharded, this probably wouldn't happen.
Nate: I think we've seen the app
suggest that it would be easier
to avoid it.
As part of the migration that we're about to mention, we're moving another table of generated data to a partitioning scheme.
And we've already seen a lot of positive effects there, in that you can run a lot of autovacuum workers in parallel.
And so it allows you to much more easily keep up with the rate of change.
Nikolay: This is partitioning, not sharding.
And with an unpartitioned table, the problem is that vacuum is single-threaded, right?
Yeah.
So with a partitioned schema, you can raise autovacuum_max_workers and benefit from it if you have a lot of CPUs and disk I/O capacity.
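A rough sketch of that tuning (not settings the guests said they changed): with many partitions, more autovacuum workers can run in parallel, but the shared cost budget is split across workers, so the cost limit usually has to go up as well.

    ALTER SYSTEM SET autovacuum_max_workers = 8;           -- default is 3; typically needs a restart
    ALTER SYSTEM SET autovacuum_vacuum_cost_limit = 2000;  -- default -1, i.e. fall back to vacuum_cost_limit (200)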
But what I'm trying to say is that problems like this can be avoided completely if we have sharding and every shard is smaller.
Nate: Every single shard is smaller.
Yeah, and that's what I was trying
to get at.
By partitioning it, we can kind
of see the effect of, you know,
if these were all in logical chunks,
that were maybe at max 100
gigabytes, it's much easier for
the garbage collection system
to keep up.
Nikolay: Yeah, yeah, yeah, yeah.
That's interesting.
Nate: It would be even easier if
we were across many machines.
But at the very least, I think we can infer most of the effect,
even if it's all on 1.
Nikolay: Yeah.
Okay.
Maybe it's already time to dive into the problem itself, right?
Nate: Yeah.
Or maybe briefly, like, why we created the problem.
Nikolay: Yeah.
Before you go into this space, I was curious, did you have any issues with multixacts before that at all?
Like maybe other kinds of issues, like observing in Performance Insights lightweight lock contention around multixacts, or a couple of those.
You never had them before, right?
Nate: Not in particular.
No, I mean, we had run into emergency autovacuums for sure.
Nikolay: Yeah, yeah, that's very interesting, because your path is definitely different from what we observed with other projects.
Usually different problems with multixacts came first, the contention happened, and so on.
Your path is quite unique in this case, I think.
Yeah.
So let's dive in.
Let's dive in.
Nate: Yeah, I really want to describe how we created the problem for ourselves.
And I really want Andrew to explain how we diagnosed it.
Because, like, that's the most interesting part here by far.
Yeah.
As we kept seeing these tables grow, you know, we kind of realized, also due to some access patterns that were introduced, it wasn't really tenable to keep just appending data month after month for every customer in terms of the line item table.
And yeah, this was also painful for the invoice table; there's a lot of write amplification between the two.
The average invoice has, you know, dozens and sometimes hundreds of line items, depending on how a client's using our platform.
And invoices have to be queried a lot of different ways.
We didn't really see a way of moving that outside of a relational store; being able to use foreign keys and constraints with the rest of our data was providing too much value to really do anything different there.
We did wind up partitioning it, just to enable the autovacuum process to keep up much more easily.
But we realized we could constrain the access pattern for the line items sufficiently that they didn't actually need to be in Postgres.
Because I guess within Metronome,
you know, over the course of
a month, you have the notion of
an invoice that's in a draft state
where we'll calculate on demand
for you, or we're not actually
persisting data associated with
it.
Then only after you go through
a finalization, which is typically
the first of the month, because
most billing periods are going
to be monthly, that we do all the
math, do the calculations,
persist that data.
From which you can infer our workload sounds pretty funky.
So we've got a steady trickle of
writes for configuration changes.
And then once a month, we have
to write out several billion rows
which you know amounts to like
1 to 2 terabytes of data.
Yeah
Michael: all on the same day.
Nate: As close to it as we can
get.
Nikolay: Yeah very different from
social media.
Very different.
Nate: Otherwise it's, you know, a more typical workload: we have 99-plus percent reads versus writes.
It's a lot of compute on demand.
So it's a little funky.
I think it's because the high spikes in write volume are so periodic that we haven't seen multixact problems in the past.
I think we'd likely been narrowly avoiding them, because it's so hard to observe whether or not you're approaching them.
I feel like this is one of the metastable states of Postgres, if you're familiar with that term, where often things are working, working, working, and then they just fall off a cliff or entirely lock up.
But the basic trigger here was that we identified we wanted to make this architecture change, and then we were running the very long migration process of pushing these line items into S3, which is what our back-end store was going to be, since their access pattern was so simple.
But also rewriting and moving the invoices from one table to another.
And unfortunately, we were doing that broadly one by one, and we'll get to why that's relevant.
Nikolay: So you mentioned a performance cliff.
This is a very good term when we talk about multixacts, and we had different cases in the past where, as I mentioned, contention happened.
And interestingly enough, 20, 15, 10 years ago we often saw opinions that foreign keys are dangerous, that they have huge overhead, don't use them because they slow down everything and so on.
But it's interesting that now we can agree with it, in a very different and unexpected way: yes, they slow down writes, but we can afford it.
And your table, which has grown to 15 terabytes, is a good example, right?
You survived with foreign keys and you paid that price to have good data consistency, right?
But then performance cliffs, right?
Nate: Yeah, and for that table
in particular, the reason why
we were able to move that was it
only had 1 foreign key, which
is back to the invoice row.
It was much simpler.
That table is effectively append-only.
It was a workload that just really
didn't have to be on Postgres.
I don't think we were taking advantage
of the qualities to justify
the cost.
Nikolay: So let's discuss multixacts
in general, like why we
should think about them.
Like, you know, my own story is
usually, like with multixacts,
I take some monitoring, this is
my hobby.
Take some Postgres monitoring and
we check which metrics they
have.
And usually they have transaction
wraparound risk analysis,
right?
Like to control freezing process
and to understand how far we
are from transaction ID wraparound.
My question number 1, where is
multixact ID wraparound analysis?
Most monitoring systems lack it.
But they lack it because it never
happens.
Usually, like, transaction ID wraparound
happens.
We know several blog posts and
horror stories from the past.
But multixact ID wraparound,
nobody experienced it.
So OK, Why should we think about
multixacts?
Tell us more technical details
what happened with your case.
Nate: Well, briefly I'll respond
to the point about foreign keys.
The community is right that you
don't always need them.
But I think what can be tricky is that when you're a very young company making lots of changes to your product, trying to find that product-market fit and get to the point where scaling becomes the biggest concern, foreign keys are really useful to prevent teams from stepping on each other and breaking your data.
And only when those product areas
and like the platform you're
building on top of become mature
enough, can you maybe afford
to put in the, you know, like pull
request level, you know, integrity
checks that make sure you're not
changing some access pattern
or maybe build the asynchronous,
you know, like offline data
consistency check to make sure
you're not violating it.
And this is a case where we didn't think critically about whether we had made that leap from one side to the other.
And in retrospect, we have.
And it's one of the mitigations we took.
Yeah.
Where I guess I would defend having a really strict schema early on, and then as you scale the platform, it just becomes necessary to progress some of those guardrails.
Nikolay: Yeah, to maintain good
quality of data, that's absolutely
great advice.
But let's discuss how multixacts
are created, and so on.
Andrew: Sure.
Yes.
So I suppose we should start with
what a multixact even is
and when it's used.
So basically, in Postgres, when
you have 2 or more transactions
trying to lock the same row at
the same time, for instance, when
you're using SELECT FOR SHARE or
SELECT FOR UPDATE or in fact
when you're inserting into a table
that has a foreign key to
another table, Postgres uses a
system called multixacts or multi
transactions.
And a multixact is basically a
logical object that represents
a set of transactions locking the
same row.
Each participating transaction
is called a multixact member.
And those are a little distinct,
but they are related.
And the way that a multixact
is stored is that it's represented
by a 32-bit integer ID and that's
where your wraparound issue
can happen with multixact IDs.
But in this scenario, that was
actually not the problem.
For every multixact, Postgres keeps an offset into a globally shared, append-only multixact membership space.
These are stored in the pg_multixact/members files on disk.
You can think of it logically like a simple least-recently-used cache that is contiguous, so you have to append to it every single time.
And so when you have to store these members, you look up an offset into this theoretically long array; each one of your transaction IDs is stored in one slot in this array, and this array is indexed by a 32-bit integer.
This membership array is what was exhausted in our case.
So due to the way that multixacts
are created and subsequently
reaped, it is easy to exhaust this
membership space when you
have a lot of parallel transactions
attempting to lock the same
row, whether it's through foreign
keys or whether you're doing
SELECT FOR SHARE or SELECT FOR
UPDATE.
And the reason for that is that
since these multixacts are immutable,
suppose that you have 3 transactions
trying to lock the same
row.
You would first create the first multixact between the first two transactions, and then the third transaction would come in and you would have to create yet another multixact with the old transaction IDs plus the new one.
But the membership space is not
relinquished until a vacuum occurs,
which can happen later, or perhaps
never in some cases, if you
are constantly adding new transactions
and you're super busy
doing all these updates.
And what's interesting is this.
The multi-transaction membership space cannot be reclaimed in
chunks.
It has to be reclaimed contiguously.
So suppose you have a very long-running transaction that is part
of a multixact that is very old.
It can hold up the entire vacuuming of subsequent multixacts
that are finished but are short-lived.
And it is not until the oldest multi-transaction is finished that the vacuum can actually reclaim all the membership space from the oldest multixact up to the most recent one.
And this scenario creates the ability to exhaust the multixact
membership space.
And that's what we experienced ourselves.
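A minimal sketch of how that plays out in SQL (the table names and values are hypothetical, not Metronome's schema):

    -- Session 1
    BEGIN;
    SELECT * FROM credit_types WHERE id = 1 FOR SHARE;

    -- Session 2, while session 1 is still open
    BEGIN;
    SELECT * FROM credit_types WHERE id = 1 FOR SHARE;
    -- The row's xmax now refers to a multixact whose members are the two transactions.

    -- Foreign-key checks take the same kind of lock implicitly: inserting a child row
    -- acquires FOR KEY SHARE on the referenced parent row.
    INSERT INTO invoices (id, credit_type_id) VALUES (1001, 1);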
Nikolay: Yeah, I wanted to add one thing.
It might sound strange that SELECT FOR UPDATE may also produce multixacts.
You also explored this, right?
It happens only when we have sub-transactions, I think, right?
So if it's a single transaction, select for update, there is
no place for multixacts.
But if we start doing savepoints there, basically we have different
transactions inside one transaction, and then multixact mechanism
starts working again.
And I just remember, like, how many years ago?
Four years ago, I explored this when I explored another big performance
cliff related to sub-transactions.
So there I noticed this as well.
Yeah.
Andrew: That is true.
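A minimal sketch of that savepoint case (hypothetical table), where even a single session ends up with a multixact because the savepoint introduces its own subtransaction ID:

    BEGIN;
    SELECT * FROM invoices WHERE id = 1001 FOR UPDATE;  -- xmax is still a plain transaction ID
    SAVEPOINT s1;
    UPDATE invoices SET status = 'finalized' WHERE id = 1001;
    -- The row is now locked by the top-level transaction and updated by the subtransaction,
    -- so Postgres records both of them in a multixact.
    COMMIT;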
Nikolay: Good introduction.
Thank you so much.
But that's not all of it, right?
There are also some problems like quadratic behavior, right?
Andrew: Yes, I hinted at that before.
So essentially, if you have, let's say, five transactions that are
attempting to lock the same row, you have to create the multixacts
one at a time by essentially cloning it with the previous transaction
IDs and adding that additional transaction ID one at a time.
So now you have a multixact with two, then three, then four, then five.
And you add that all up, and that becomes quadratically growing use of multixact membership space, at least until the vacuum can actually clean up those old multixacts.
And so that's where you will find quadratic consumption of the
multixact space.
And this can happen very quickly if you have a lot of concurrent
transactions attempting to, say, insert into a table with a foreign
key that links to another table that is a low cardinality table.
So they're all trying to reference that exact same row.
Nikolay: But it's also applicable when we also have an additional long-running transaction, right?
Because otherwise, cleanup will happen, right?
Andrew: Correct, so a long running
transaction will cause the
vacuum to essentially stop reclaiming
multixact membership
space until it is finished, because
it can only do it sequentially.
Nikolay: Should this long-running transaction also participate in the multixact mechanism, or can it be anything, even a simple SELECT?
Andrew: A simple SELECT probably
won't do it.
It would have to be in a multi-transaction
context.
So if you, let's say, have a multixact with two transactions in it and the first one finishes, that multixact is not cleaned up until the second transaction that is part of that multixact is also finished.
So all transactions in the multixact
have to finish before the
vacuum can reclaim it.
Nikolay: Yeah, so this highlights pieces of advice like partition your large tables and avoid long-running transactions if they're mixed with OLTP, right?
Just yet another reason to avoid long-running transactions.
And long-running transactions on replicas, reported back via hot_standby_feedback, they are harmless here, right?
We are debating right now how to properly monitor long-running transactions, because there are different negative effects and we need to distinguish them.
In this case, yeah, we should probably monitor long-running transactions which participate.
And by long, how long was it in your case?
Like, is one minute long, or is 30 minutes long in this case?
Andrew: I'm actually not sure.
I don't actually have the metrics
on that.
Nate, do you know?
Nate: I think 99.9%, probably 99.99%, of our queries are sub-second.
And I guess I previously worked on a database team at Stripe and learned some of these lessons really deeply, or have all the scar tissue, in short.
Running a platform at their scale is difficult.
And so I've been pushing us in that direction and kind of moving anything that we expect to be longer off of the online workload into our lakehouse.
Occasionally we have queries that go 5, 10 seconds, but I believe we have, like, a global timeout of about 30 seconds.
So I guess part of my understanding here is that part of these multixacts is that the multi-transaction can be daisy-chained as you have many concurrent overlapping transactions occur.
I think that's more what we were running into than a long-running one: we just had this steady stream of overlapping transactions.
And until you break that chain, you're in this, like, hosed position where you keep consuming space very quickly, at an increasing rate.
Andrew: Yes, I was merely stating the cleanup scenario and the challenges there, but you're absolutely correct.
When you have this daisy-chaining,
you know, constantly creating
new multixacts by adding new transaction
IDs to it, you will
very quickly exhaust your membership
space.
Michael: You say very quickly and very easily, but I mean, we are still talking about 2 billion, right?
There's been a debate in the Postgres world about 32-bit versus 64-bit for a long time, but we are still talking about 2 billion.
So you say very quickly and very easily at your scale, right?
Like, this is also due to how many things you're doing and maybe the pattern as well, like invoices that have lots of line items; it feels like there are specifics here that make this more likely.
I'm not saying it shouldn't be a problem, and I'm looking forward to hearing how you diagnosed this and what we can do better.
But I feel like we should say it's not necessarily very easy to hit this limit.
I think I found one other Stack Overflow post from somebody else who hit it; I don't know if you came across that while you were in the middle of this.
But I couldn't find any other examples on the mailing list, and it's super interesting.
I think we're seeing more and more companies using Postgres at this scale.
Nikolay mentioned a few of them that we talked to, but we also had an episode with folks from Notion and Figma and Adyen as well, all on that same episode.
So there are a lot of high-volume companies now on Postgres.
So I think this stuff is super interesting, but 2 billion, even with the daisy-chaining, still seems like a lot to me.
Nikolay: Let me disagree with you.
The debate is not really 64 versus 32: Postgres desperately needs 64-bit transaction IDs.
And, for example, OrioleDB already implemented it.
But OrioleDB is an alternative storage engine, and it's right now in the hands of Supabase in terms of development.
And I think right now there is a launch week and they published some benchmarks again; it's a great, great project.
So I think many projects dream about the day when we will have 64, not 32.
Right now only several forks have implemented it; there are commercial forks which implement 64-bit transaction IDs and multixact IDs.
Right?
Michael: I didn't know that.
Sadly not Aurora it sounds like.
Nikolay: Yeah.
So 2 billion is not a lot already.
It's not a lot.
Andrew: Yeah.
It's pretty easy to hit it if you're going at scale and you're hitting it with quadratic velocity.
You get to 2 billion quickly.
It does sound like a big number,
Nikolay: But it's not that big.
Michael, sorry for the interruption.
Michael, you can take a chessboard and start putting one grain on the first cell, and so on.
Michael: Oh, yeah.
So wait, maybe I misunderstood.
If there are 5 row changes we're
trying to do, do we create 2
to the 5 rows in this table?
Or is it like 1, then 2, then 3?
Is that only 5 rows?
So in the member space?
Andrew: Yeah, good question.
So basically, it's more of a sequencing thing.
So essentially, you have two transactions.
You start with two: you get one multixact taking up two members.
The third transaction comes in, you create a new multixact with three members, but that old one with two members has not been cleaned up yet, so now you're taking up five.
Then, excuse me, the fourth transaction comes in, you create a multixact with four members, so now you've added an additional four on top of that.
And now the fifth one comes in, and now you have a multixact with five members.
And that all adds together; it takes up membership space until the vacuum can come and reap those old multixacts.
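A quick back-of-the-envelope for that sequence (our sketch, not from the episode): for n transactions piling onto one row before any cleanup, the member slots consumed are 2 + 3 + ... + n, which is n(n+1)/2 - 1.

    SELECT n, n * (n + 1) / 2 - 1 AS member_slots_consumed
    FROM generate_series(2, 6) AS n;
    -- n = 2 -> 2, n = 3 -> 5, n = 4 -> 9, n = 5 -> 14, n = 6 -> 20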
Nikolay: Yeah, the chessboard example is good here, I guess.
We can try to understand the volume when we approach cell number 64, how big it will be.
Michael: Andrew, you've done such
a good job describing how this
could happen, but when you're in
the middle of this, I'm guessing
this is relatively new knowledge,
or like when you're in the
middle of this, how did you work
out what on earth was going
on?
Andrew: Yes, that's quite the journey.
So I'm actually really new to Metronome.
I just joined in April.
So when this incident was occurring,
it was only my fifth week
or so.
And this was occurring on a Saturday
right before a vacation
to Miami.
So, wow.
Nate: Well, let's rewind a little
bit, because I guess the most
embarrassing part of this is like
the first, what we now know
was the first occurrence, was the
full week prior.
Andrew: Yes.
Nate: Where our migration workload
had basically implicitly triggered
this.
We hadn't really known what to do, and unfortunately we weren't able to get to a full RCA.
And so, yeah, we got to our best
explanation, which is like,
oh, there's something in transaction
ID wraparounds that we're
not grokking correctly, or Amazon's
hiding something from us.
And then we proceeded to turn it
back on with a slightly lower
concurrency, you know, 5 days later.
And so like, yeah, so we didn't
get there immediately.
But then, you know, that was a P1 incident internally; we tried to be cautious, but didn't follow through sufficiently.
And so this is kind of the second chapter, where Andrew comes in.
Andrew: Yes, that's right.
I was not a participant in the first three incidents.
It was only the fourth one, which I think was on a Saturday, when the call went out from Cosmo to see if additional engineering resources could come and jump on, so I decided to.
But having been so new, I actually
didn't have any tactical knowledge
I could use to address all the
specific fires that were being
started by this.
So I decided to actually figure
out what exactly was happening.
And as Nate had said, we weren't
really sure, but we knew it
was related to multixact in some
way.
We just weren't really sure exactly
why.
And we were struggling to really
fully understand because in
our initial evaluation, we saw
that the number of multixact
IDs that was being created was
well below the threshold of like
400 million.
And we weren't really sure why.
So what I did is I took that knowledge
and I just went ahead
and looked into the Postgres source
code and looked at the multixact.c
file, which is this enormous
file that contains pretty much
all the logic that you'll need
to know about multixacts.
And in there, we were working with the hypothesis that this threshold of 400 million, even though we were only seeing vacuums start at around 200 million, was related in some way.
So I found the function called MultiXactMemberFreezeThreshold, which calculates the threshold at which you should basically start an autovacuum, based on the number of multixacts that are in use and, notably, the number of multixact members that are in use, and that's what caught my eye.
So from there, I mathematically proved that the number of members had to be extremely high given the code path that we were obviously hitting, because we were seeing vacuums occur at a specific number of multixacts in use.
And from there I was able to find log messages related to multixact membership exhaustion and correlate those with Datadog logs that we had found ourselves in the recent week.
And there we could conclude that
multixact members were being,
the membership space was being
exhausted.
And we were able to put together
what I would consider kind of
a hacky estimate of the number
of multixact members that are
currently in use by looking at
the files that are written for
each member and estimating by multiplying
by a constant how many
members are currently in use at
the moment.
And that's how we came across the
solution, or the cause, I should
say.
Michael: I think this is worth diving into a little bit, because I definitely didn't realize that there were at least two exhaustion possibilities with multixacts.
So we've got the total... well, why are there these two numbers?
Why do we have two?
There's the 2 billion, and you were monitoring at 400 million, which should be super safe, right?
Because we have the possibility to go up to 2 billion, but you want it to be cleaning up much sooner than that.
So you monitored at 400 million or so.
Why is there another limit?
Andrew: Yeah, so we have 2 limits
here.
We have an ID that we give to multixacts.
That's a 32-bit integer.
So, of course, you cannot assign
more than 2 billion to these
multixacts because we don't assign
negative IDs.
And then you have the multixact membership space.
Now that is represented logically as a simple LRU that is append-only.
And so we can only address up to 32 bits of membership space; each membership slot is addressed by a 32-bit integer.
And thus you have two different values that you have to worry about.
You have to worry about how many
multixacts you're creating and
how much membership space you're
consuming with those multixacts.
And that's where those 2 numbers
come from.
Michael: And 1 of the great things
in your blog post you mention
is how do you even monitor that?
Have you worked out a way since?
Is there anything you can do there?
Andrew: Yes, I think I might have just mentioned it offhand, but we use an estimate where we read the membership files, because they are written to disk.
We know that the files have to be a certain size, so we multiply by that constant, whatever the size of these members is, and we get an estimate for the number of members.
Now, this is not the best solution, which is why I submitted a Postgres patch to surface this telemetry directly.
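A sketch of that kind of estimate (our reconstruction, not a quote of the query they ran): with the default 8 kB block size, each segment file under pg_multixact/members holds 32 pages of 1,636 members, so the file count times 52,352 approximates members in use. pg_ls_dir needs superuser or an explicit grant, and it is usually available on managed services as well.

    SELECT count(*)              AS member_segment_files,
           count(*) * 32 * 1636  AS approx_members_in_use
    FROM pg_ls_dir('pg_multixact/members');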
Michael: Nice, Thank you.
Nikolay: Yeah, that's what I wanted to mention.
We have several things going on in the mailing lists after this blog post.
By the way, before that, why did you decide to post in such great detail?
Like, what was your intention?
To help the community, or?
Nate: I guess a couple of factors.
One, we do want to help the greater Postgres community.
I think there are lots of other situations where we've benefited
greatly by blog posts put out, in particular by companies like
Percona, as well as others who have maybe led us away from paths
that would have wound up being painful.
But another part is that internally at Metronome, we view ourselves as an infrastructure provider for our clients.
And while I think it's okay to say that we're not fully recognized as that yet, and we're still on our journey to solidifying that narrative, it is how we want to act, so that we're perceived that way.
And we are in the critical path for a lot of our clients and how they operate.
And so for us to have multiple multi-hour downtimes in the space of less than 10 days is pretty bad, and honestly not what you want from your infrastructure provider if you choose to view us that way.
And so having a really crisp explanation of, hey, we are a little
embarrassed by this, but here's what we learned so you can learn
from it, as well as a little bit of vetting of, we didn't run
into the easiest thing.
It is novel.
There is a reason why it took a while for us to sort it out.
And so I think a lot of our partners felt a lot better.
Better could be a strong term.
But when you have an honest RCA, and Cloudflare might be leading
the industry in how they operate in this vein, you feel a lot
better that your provider is actually learning from mistakes
and is going to do a better job of avoiding them than you might
internally if you're evaluating, should I just build this in-house
instead?
Nikolay: So great transparency and so on.
It pays off over time.
Yeah, that's great.
So I saw several mailing list discussions, but before that: I feel like with multixacts we can have different kinds of problems.
One is wraparound, right?
Okay, as I said, I never saw it, and that's why many monitoring systems don't even monitor it.
And we have several cases of SLRU limits being reached, right?
One of them was explored in the past; we can maybe provide links to blog posts describing it.
And this is your case.
And your case obviously bumped into a lack of observability, like how much of the capacity we've reached, how far from the performance cliff we are, right?
But besides that, I saw different topics discussed on the mailing list.
For example, there is a setting, vacuum_multixact_failsafe_age, which is about wraparound, actually.
Peter Geoghegan brought attention to the fact that this setting doesn't take into account multixact membership exhaustion.
So it's also an interesting side effect of your blog post, maybe fully unexpected, right?
Then I saw, Andrew, I think, your proposal to have multixact monitoring, like a system view, so any monitoring could have some graphs and also alerts, probably, right?
And at what stage is this proposal currently?
I saw it's in the commitfest, but it requires review, right?
Andrew: Correct.
Yes.
The patch is in the commitfest and, it definitely still requires
review.
Well, I've engaged with another contributor named Naga; we've actually submitted competing proposals.
His actually reads directly from the memory space, while mine uses the pg_stats infrastructure to surface this telemetry.
But yes, it is still just in review.
It has not been accepted yet, as far as I can tell.
Nikolay: Yeah, but it's great that it's registered in the commitfest.
And I think some people who listen to us maybe also watch our hacking sessions, which we do with Andrey and Kirk almost every week; we missed today because of some issues, but we are going to be back next week.
There we show how easy it actually is to test: you take Cursor, you tell it to just compile Postgres, apply the patch, and let's start using it.
And instead of diving too much into C code, you think about use cases and how to, for example, emulate problems, and so on.
This is actually fun.
And I encourage our audience to maybe start with their first review and test this patch; this can increase the chances that it will be included in Postgres 19, right?
Andrew: Yes, what's also kind of interesting is I know you earlier
mentioned that there was a debate about moving to 64-bit integers
and in fact part of this commitfest does include a proposal
to change those offsets to 64-bits.
So there's a lot of changes that are coming in 19, it seems,
if they're accepted, of course,
which would help alleviate this
multixact membership and multixact
issue writ large.
Nikolay: That's great, but it's
only next year.
Andrew: I suppose we can't go too
fast.
Michael: Well, and even longer if it's Aurora, right?
They generally have a bit of additional lag in getting to major versions, because it's a fork that they have to do a lot of work on.
On that note, though, you mentioned trust in an infrastructure provider, and how big issues can rock that trust and you need to see some root cause analysis or maybe some action.
Has this at any point rocked your confidence in Postgres, or less so Aurora, I guess, in terms of whether it's the right platform as you scale?
Maybe initially, and then since you've worked out exactly what the issue was, what are your thoughts there?
Andrew: Yeah, I don't think it
actually has really rocked our
confidence in Postgres.
I haven't seen anyone run for the
hills yet.
But you know, it definitely surfaces the fact that there is quite a bit to learn and that this is a very complex system.
But to that end, something I've learned, at least participating with the Postgres dev community and being on the mailing lists, is that there are a lot of very smart individuals who are part of this community, who are actively contributing, and who are passionately debating what should and shouldn't be part of it and how things should be designed.
So, in many ways, it actually may
have built my confidence to
see so many people paying attention
and caring about it from
all across the world, from Fujitsu,
Amazon, Google, ourselves,
even just regular folks doing it
in their hobby spare time.
So in many ways perhaps it's a
testament to how far open source
can go and how much confidence
we can have in these systems because
like you said this is a pretty
rare instance and many of the
core cases that most people need
to use are covered or at least
will be.
Michael: That's really good to hear.
And of course, when people hit issues with proprietary systems like SQL Server and Oracle, I'm sure that when they blog about it those conversations do happen internally at Microsoft or Oracle, but you can't see them, you can't see the actual conversations.
And part of the way you diagnosed this issue was diving into the source code, so there are some really cool factors that wouldn't have been possible if this wasn't open source.
And it's not just open source, but open discussions as well, open mailing lists.
So yeah, that's really good to hear, but it would have been very understandable if...
People have definitely left Postgres for less good reasons.
Not naming names, Uber.
Andrew: Yeah, I will say, as we
said in our blog post, 30 terabytes
is not a small amount of data to
move.
And so if we decided to make a
large decision like that, I believe
we would have to reckon with the
transition itself.
And so in many ways, you know,
we probably want to try to work
with what we got.
And thankfully, I think it's served
us well for the most part.
Nate: Yeah, there's something to
be said for the devil you know.
And, you know, there are lots of other Postgres-compatible systems that, you know, promise you the world, and typically something gives, whether it's the price, the actual performance on either the reads or the writes, or how maintainable it is to actually run.
And yeah, I think it's clear that
we're going to have to undergo
the investment to figure out what's
next, because we can only
go single writer for so long, especially
with the demands that
we put on the database on the first
of the month every single
month.
A great extreme.
Nikolay: Yeah.
And behind the scenes, we also chatted with Nate about the formula.
So Michael, for the first time, we should include a formula in the show notes.
There's a formula saying that roughly 63,000 overlapping transactions are enough to reach the 2 billion limit.
Nate: Yeah, and it's a really dangerous thing, doing some math on the fly that can be checked by anyone after the fact.
But yeah, as I worked it out, it's only 63,000 overlapping transactions, which is not that many.
Relatively easy to do with a highly concurrent workload operating one row at a time, with potential overlap between them.
You can consume that really quickly.
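A rough check of that on-the-fly math (our sketch, using the 2 billion figure quoted here): with n fully overlapping single-row lockers, member usage grows roughly as n squared over two, so n only needs to reach about the square root of 4 billion.

    -- n^2 / 2 >= 2,000,000,000  =>  n >= sqrt(4e9), roughly 63,000 overlapping transactions
    SELECT ceil(sqrt(2 * 2000000000.0)) AS overlapping_txns_needed;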
Andrew: It might also not be plain transactions.
If you take a savepoint, that's almost seen as like a sub-transaction and will also create a new multixact as well.
So if you're doing a lot of savepoints, you can easily hit that as well.
Nikolay: Are you using sub-transactions?
Andrew: I'm not actually sure myself,
but it is a possibility.
Nate: I don't believe we are, at
least not intentionally.
Nikolay: I think right now, in the latest versions of Postgres, there's the ability to change SLRU sizes to avoid SLRU overflow cases.
But my opinion about sub-transactions in projects like this, at this scale, remains: just avoid them.
Yeah, because they can bring other problems as well, like the subtrans SLRU can be overflowed and so on.
And they consume transaction IDs.
So yeah.
Good.
That's, I think, quite a good deep dive, unless we forgot anything.
Michael: Any last things you wanted
to add?
Nate: I think we can talk about
the mitigations, just so that
if someone's unfortunate enough
to run into this, that they have
a few breadcrumbs for how to work
around it.
Because it was rather expensive
for us to figure out what those
were, including Andrew and others
digging through the source
to really understand it.
But a lot of what we inferred after the fact was: we had this migration which was doing a bunch of row-by-row writes, not ideal if you're running a highly concurrent process here.
And so we did make those batches.
We also did a lot of deep thinking
on which of the foreign keys
we actually needed on this new
table.
Nikolay: Batches. How did you find the batch size that was ideal for you?
Nate: I don't know that we're at an ideal, but we climbed the hill far enough that it's not a problem.
Nikolay: I see.
Nate: And so it was kind of a trade-off between what throughput we could obtain and making sure that we weren't having the kind of over-time growth in our heuristic for the member space utilization, which is mainly based on how many of the offset files existed.
Yeah,
Nikolay: I usually say, when choosing batches, choose the biggest possible, but avoid locking rows for so long that it affects users.
We know users' perception is 200 milliseconds, so if you lock rows, one second should be the maximum, two seconds the absolute maximum, for the batch size choice.
Nate: Yeah, it's less often.
So our batches are even around, like, 200, which is enough to mitigate this concern and kind of keep all of the operations, you know, short, since there's a very online workload hitting this as well.
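A sketch of what that batching looks like (table and column names are hypothetical, not Metronome's schema): one transaction per batch of roughly 200 rows instead of one per row, so far fewer overlapping transactions hold foreign-key locks on the same parent rows at any moment.

    BEGIN;
    INSERT INTO invoice_line_item_new (id, invoice_id, credit_type_id, amount)
    SELECT id, invoice_id, credit_type_id, amount
    FROM invoice_line_item_old
    WHERE id BETWEEN 1 AND 200;   -- the next batch of ids chosen by the migration job
    COMMIT;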
Nikolay: These are not users in terms of the batch transactions, but users read the results, right?
Or no?
Nate: Nobody's using it now.
So sorry, this is in the context
of our greater migration to
move line items, line item construct
out of our database, as
well as partitioning the invoice
table, which has also grown
to be quite large.
And so, you know, kind of we would
read the old version, all
the line items in the invoice row,
write those to a new format
in S3, which is ultimately keyed
by the invoice ID.
And then a lot of logic, like we
can't just do a deterministic
mapping because then you get hot
partitions in S3, so you have
to do a little bit of extra work
and leave yourself a link in
the actual database row.
And so then we were writing those out to the new table, which is partitioned, one by one.
And our diagnosis, and it's a little hard to confirm this, because again there is no observability bit here, is that that was the big mistake: the single inserts, very quick ones.
The issue was that they all reference a handful of other tables via foreign keys, some of which were very low cardinality.
So, you know, we have like some
enum tables and we, you know,
for better or worse, in this case
worse, had chosen to use a
foreign key reference to them.
And so if every invoice is corresponding
to the USD credit type,
for example, you're going to have
a lot of overlapping transactions
all accessing that row to verify
the integrity of the data.
And so we kind of walked through it, and we realized that for a lot of these low-cardinality cases, we don't really change those code paths; we're quite confident they're going to be stable.
It's only the higher cardinality
cases.
An invoice also references its
customer that owns that invoice.
But that's a much higher cardinality
connection.
We're much less likely to have
overlapping transactions referring
to the same customer in a migration
like this.
Nikolay: Without foreign keys, did you think about periodic checks of referential integrity, like asynchronously?
Nate: Yeah, so we're going to be
adding those to run on our Lakehouse.
So yeah, we have a syncing process
that brings all the files
over into Parquet.
Eventually, we're going to finish the work of getting it to be CDC, so it's more incremental, probably using Iceberg.
And so on those, we'll run them regularly; I mean, it's much cheaper to run the checks there, for a lot of reasons.
Nikolay: More and more reasons to move data out of Postgres.
I understand it for analytical reasons, but yesterday S3 released a vector index.
If you haven't heard about it, it's interesting.
So I'm just curious what the place for Postgres will be in our systems.
But I totally understand, for analytical workloads, definitely.
Nate: Yeah.
I mean, everything that's OLTP-shaped
is going to be staying
in Postgres for some time.
Can't see a reason to move it anywhere
else.
Nikolay: Well, yeah.
Michael: Great.
In terms of mitigations, did you
tune any of the settings around
multixact parameters?
Yeah.
What did you do?
Which ones did you increase or
decrease?
Andrew: I don't know if there were actually any specific Postgres flags that we changed.
I don't think we changed any setting specifically.
But what we did do is introspect into the code base and determine the danger levels for the membership space.
And so we used that as a threshold for alerting ourselves with Grafana.
And if our membership space consumption breaches a certain level, what we will do is take actions to reduce the concurrency, the number of concurrent transactions that are occurring, until Postgres can recover that membership space.
So it's a little bit of a manual intervention at the moment, but I don't think there are any specific knobs that we can tune, because it is fundamentally limited by a 32-bit integer.
And the telemetry with respect to multixact membership numbers is not used anywhere else except in the detection of when autovacuum should occur, and that is sort of automatic and kind of hard-coded.
Nikolay: Yeah, and the current utilization, you use function
pg_get_multixact_members, right?
Or how do you?
Andrew: No, I don't believe that is it.
I think it's some specialized function where we actually have
to get all the files and read the file sizes themselves.
So it's not something from Postgres, I don't believe.
Nikolay: But it's Aurora.
You don't have access to files.
Andrew: There is some function, I can't remember the name.
Nikolay: It's pg_ls_dir or something, right?
So it lists files.
You can still use SQL to list the files... oh, OK, interesting.
Nate: The query originated from an Amazon blog post.
I'll try to pull it up; you can see if you want to include it in the show notes.
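For completeness, the function Nikolay asked about does exist, but it answers a narrower question: it lists the members of one specific multixact rather than estimating overall member-space usage. A hypothetical example:

    -- If a row's xmax happens to be a multixact ID, this lists its member transactions
    -- and their lock modes ('12345' is a made-up value).
    SELECT * FROM pg_get_multixact_members('12345'::xid);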
Nikolay: Let's make sure.
I don't remember whether this recipe, how to monitor this, was shared in your blog post.
Andrew: It was not; we did not post it there.
We actually intend to have a follow-up for a more engineering-focused audience, where we would detail these steps in the same way that we detailed them here, but in a more concrete format for engineers and not for executives and company leaders.
Nate: Yeah, and we were a bit more timeline constrained for initial
RCA.
And we are optimistic that we can link to Andrew's patch for
the second follow-up blog post.
Nikolay: Exactly.
That's awesome.
Because, probably, we should add it to monitoring practically everywhere in large systems, right?
This would be the main outcome of this discussion, probably, right, for those who listen and have big systems.
Andrew: Absolutely.
Nikolay: Until your patch gets accepted, Andrew, and everyone upgrades to Postgres 19, which will take some time.
Right, so...
Andrew: Yes, until then we definitely
have some guidance for
monitoring systems.
We admit though that it is not
the most ideal, but it is something
that people can use to guide themselves
away from disaster.
Michael: Well, and I think, well, there's the thread you mentioned from Peter Geoghegan.
He's somebody that has worked in the past on mitigation factors: so not only being able to monitor transaction ID wraparound, but he did some things to vacuum so that it's just less likely to get into that state in the first place.
And I think making that parameter he mentioned aware of the member space, or having some different parameter that kicks in earlier and does, like, an anti-wraparound vacuum or whatever the equivalent would need to be, would make this less likely to happen.
And of course, monitoring is a good idea anyway, but prevention seems even better.
So yeah.
Nikolay: Some optimization inside Postgres could be implemented here, because it's quadratic behavior.
At least, as in other cases in Postgres, it could be split into some pieces, like 128 pieces, you know, something like this.
I don't know.
I haven't seen discussion of actual optimization to get more capacity.
Have you seen it?
Michael: No, but I was looking through what parameters we even have that mention multixact.
One of them is in version 17: we got multixact_member_buffers, which is quite low by default, only 256 kilobytes or 32 pages, and we got multixact_offset_buffers, also low, 128 kilobytes by default.
So I'm wondering whether increasing those would help with the member space, maybe the offset buffers one.
And then, yeah, we also have the multixact failsafe age, which you mentioned, from version 14; that's the one from Peter Geoghegan.
And two much older ones: I did wonder about the multixact freeze min age, that's 5 million by default, and I wondered if lowering that would just make sure that the vacuum ran quicker when it did need to happen, you know, that kind of thing.
And then there's the multixact freeze table age, which is 150 million, so again I'm not sure if lowering that would just mean you get vacuums more often.
But I guess when we're talking about such a huge table, it's still going to be a problem when the vacuum does run, so until you get partitioning in place anyway...
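For reference, a sketch of those knobs in one place (our summary, not settings the guests said they changed; names and defaults are as of Postgres 17, and none of them raises the hard 32-bit member-space cap):

    -- SLRU buffer sizes added in Postgres 17 (defaults: 32 and 16 pages of 8 kB);
    -- they live in shared memory, so expect to need a restart.
    ALTER SYSTEM SET multixact_member_buffers = 64;
    ALTER SYSTEM SET multixact_offset_buffers = 32;
    -- Multixact freeze tuning (defaults: 5 million and 150 million); lowering these makes
    -- vacuum freeze and advance multixacts sooner, but adds no member-space capacity.
    ALTER SYSTEM SET vacuum_multixact_freeze_min_age = 1000000;
    ALTER SYSTEM SET vacuum_multixact_freeze_table_age = 75000000;
    -- vacuum_multixact_failsafe_age (Postgres 14+, default 1.6 billion) guards multixact ID
    -- wraparound but, as discussed above, not member-space exhaustion.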
Thank you so much, both of you.
It's been amazing learning about this, scary but great, and thank you for paying the pain and the cost so that many other people don't have to in future.
Nikolay: And special thanks for the transparency and for admitting your own mistakes in the past.
I especially appreciate how you talk freely about your own mistakes.
This is very valuable and a good example for others as well.
Andrew: Absolutely.
Thanks for having us.
It was a pleasure to be here.
Nikolay: Thank you.
Nate: Yeah.
It was a really fun chat.