MultiXact member space exhaustion

Nikolay: Hello, hello, this is
Postgres.FM.

This is Nik from Postgres.AI.

And as usual, co-host Michael.

Hi, Michael.

Michael: Hi, Nik.

Nikolay: Yeah.

And we invited a couple of guests
today, Nate and Andrew.

And we invited them because the company Metronome experienced outages in, I think, May of 2025, which were extremely hard to diagnose, but these folks did a very good job and discussed all the details in an excellent blog post.

And this blog post looked so great,
so we couldn't miss it.

So thank you for coming.

Let's dive into some very rare
Postgres problems.

Nate: Yeah, thanks for having us.

And hopefully sharing this out
will keep making it rare.

Andrew: Yeah, yes.

Thank you for having us.

It's a pleasure to be here.

Nikolay: Great.

So, Michael, where should we start?

Michael: How about we start with,
could you guys give us a little

bit of an overview of what Metronome
does, why you're using Postgres,

and what you were up to that was
hitting the limits this hard?

Nate: Yeah, so I can start with the overview, and then Andrew and I can pair on the exact way that we kind of blew our foot off here.

Andrew: Absolutely.

Nate: Yeah, so Metronome is a usage-based
billing platform.

And when it was founded, I think it's coming up on 6 years ago now, just before COVID and lockdowns began, the target was really that there are a lot of infrastructure and service companies, and most of them are billing on some kind of heartbeat, you know, per-minute or sometimes per-second model.

That gets really tricky to do at scale.

In the simplest cases, yeah, you're just incrementing a counter.

But as things ramp up and your
account executives go out and

cook up all different ways to finesse
how they want to charge

that client, it gets really complicated.

You add in things like committed spend and credits, you want to automate it, you want to cut people off when they're not paying.

The complexity really compounds
immediately.

And 1 of the founders had worked on billing at Dropbox and had also encountered issues where, when you build this internally, which is where everyone kind of starts, it gets very easy to build a brittle system by accident.

It builds to your local maxima, fits your current pricing and packaging, and then launches get delayed.

Like they can't launch their product, or they can't charge for it for weeks or months.

So you get led into the spot where
you're delaying launches or

just like burning compute because
you can't actually charge for

it, which feels really silly.

And then, also frequently, you just accept a lot of slippage

because building these systems
to be perfectly accurate gets

really tricky.

And thankfully there's a lot of
foundational pieces such as Kafka

exactly once and strong transactional
databases to make this

possible to build and have confidence
in.

But it's very difficult to do this
internally and prioritize

that degree of rigor.

And also if you're building an infrastructure company, you know, let's say you're a database startup, your best engineers are there because they want to work on building a database, they're not actually there because they want to work on building billing.

And so we also think that, you know, kind of consolidating talent to work on this problem yields a better result.

And I think the marriage of the business logic domain, the exact correctness needs, and the overall data volumes, where we're right place, right time for this AI boom, introduces a lot of very interesting challenges.

And well, we're about to talk about
one of them.

Nikolay: And you chose Postgres,
obviously, right?

Nate: Yeah, the Postgres decision
preceded both of us by a ways,

but I think we're broadly quite
happy with it.

Nikolay: Okay, good, good.

So do you know why Postgres was
chosen in the past?

Nate: Not precisely.

I mean, like, probably, yeah, we
immediately went to using RDS.

I don't know if it was immediately
Aurora.

Interesting.

Yeah.

I think it would be an interesting decision, as a small startup that anticipates having a lot of data scale, to want to deal with the schema change pain associated with MySQL.

Nikolay: Yeah, I wanted to add
a joke that maybe somebody googled

some open source relational
database with good money

data type.

But yeah, you know, money data
type is what people shouldn't

use in Postgres.

This is...

Nate: Yeah, no, I think we're mostly in numeric, but at this point we're kind of cursed that we didn't lock down the precision more tightly, because Postgres will give you far too much precision.

And kind of having to interoperate
with any other system becomes

difficult.

And so we're starting to figure
out how to unwind some of those

decisions.

Michael: Yeah, yeah.

That is an interesting tangent.

Especially, I guess, with people
billing by fractions of a cent

and things.

So yeah, anyway, probably too much
of a tangent, or maybe we'll

come back to it later.

Nate: Yeah, I mean, in short, yeah,
it gets very painful translating

between the different systems that
do or don't have that precision.

Sometimes even your billing provider, understandably, can't bill in fractions of a cent, which means that the different sub-products you paid for, once those round together, might not end up with the same total.

Michael: Yeah.

Yeah.

Nikolay: Can you tell us a little
bit about the scale?

I remember 30 terabytes on Aurora
from the blog post.

Anything like in this area you
could share to get us the feeling

of the scale?

Nate: So we have 2 primary workloads
that are running on Postgres.

The 1 we'll mostly be talking about
is our main DB, which is

a common start-up pattern where
we've got our kitchen sink database

that has nearly every dataset within
it.

And yeah, that's ballooned out
to around 30 terabytes now and

growing terabyte plus per month
at this point.

And yeah, this wasn't an issue for a while.

And then depending on how our clients
leverage our platform,

you know, part of billing is generating
invoices and invoices

are usually characterized, you
know, with like, you know, a top

level total and whatnot, but then
you have a line item per product

usage.

And, you know, some of these companies
have a lot of different

products you can pay for.

You know, you look at the list
of different LLM model engines

that are available.

Sometimes you can use batch, async
versus real time, or like

infrastructure vendors who are
deployed to dozens of different

regions.

And it's easy for the cardinality
here to explode, especially

if you want to group it by a logical
application on their platform.

And we made, though we now regret, the decision to basically persist those line items as separate rows in a ginormous table in our Postgres database.

Nikolay: How big is that table?

Nate: It's about half the cluster
and growing.

Nikolay: Like 15 terabytes unpartitioned,
right?

Yep.

Yeah.

Yeah, well, I know this.

It's a pain,

Nate: Yeah.

There's some scaling problems you
get really ahead of and others

that you defer because I mean,
a lot of the technology in the

Postgres space is really fantastic
now.

Like Aurora probably does just handle it, as long as you're making sure that you're dealing with vacuuming and whatnot and all the other kinds of implicit thresholds.

Nikolay: Actually, I forgot to mention that those who watch Postgres.FM regularly should, by this point, have already watched the previous episode with Sugu, where we discussed Multigres and sharding.

And in that very episode, he mentioned your case, right?

So I wanted to say that this is what we were going to do next week, but it was a secret at that time.

And before that we also had Lev Kokotov from PgDog.

Obviously sharding is currently a hot topic again, with Citus in the past, and there is SPQR.

Before that we had PL/Proxy from Skype very long ago, like a previous era.

So is it right to say that the
problem we are going to discuss

today, you wouldn't meet it if you were on a sharded schema?

If your database was sharded, this
wouldn't happen probably.

Nate: I think we've seen signs that suggest it would be easier to avoid it.

As part of the migration that we're about to mention, we're moving another table of generated data to a partitioned scheme.

And we've already seen a lot of
positive effects there, and that

you can run a lot of autovacuum
workers in parallel.

And so it allows you to much more
easily keep up with the rate

of change.

Nikolay: This is partitioning,
not sharding.

And with an unpartitioned table, the problem is vacuum is single-threaded, right?

Yeah.

So with a partitioned schema, you can raise autovacuum_max_workers and benefit from it if you have a lot of CPUs and disk I/O capacity.
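For reference, the knob being described here is autovacuum_max_workers; a minimal sketch with an arbitrary example value (on most versions, changing it requires a restart):

```sql
-- Illustrative only: with a partitioned table, several autovacuum workers can
-- process different partitions at the same time, so raising the worker count
-- can help if you have the CPU and disk I/O headroom. The value 8 is arbitrary,
-- and this GUC generally only takes effect after a server restart.
ALTER SYSTEM SET autovacuum_max_workers = 8;
```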

But what I'm trying to say is problems like this can be avoided completely if we have sharding and every shard is smaller.

Nate: Every single shard is smaller.

Yeah, and that's what I was trying
to get at.

By partitioning it, we can kind
of see the effect of, you know,

if these were all in logical chunks,
that were maybe at max 100

gigabytes, it's much easier for
the garbage collection system

to keep up.

Nikolay: Yeah, yeah, yeah, yeah.

That's interesting.

Nate: It would be even easier if
we were across many machines.

But at the very least, I think we can infer most of the effect,

even if it's all on 1.

Nikolay: Yeah.

Okay.

Maybe it's already time to dive into the problem itself, right?

Nate: Yeah. Or maybe briefly, like, why we created the problem.

Nikolay: Yeah.

Before you go into this, I was curious, did you have any issues with multixacts before that at all?

Like maybe other kinds of issues, like observing lightweight lock contention on multixacts in Performance Insights, or a couple of them.

You never had them before, right?

Nate: Not in particular.

No, I mean, we had run into emergency autovacuums for sure.

Nikolay: Yeah, yeah, that's very interesting, because your path is definitely different from what we observed with other projects.

Because usually different problems with multixacts came first, the contention happened, and so on.

Your path is quite unique in this case, I think.

Yeah.

So let's dive in.

Let's dive in.

Nate: Yeah, I really want to describe how we created the problem for ourselves.

And I really want Andrew to explain how we diagnosed it.

Cause like, that's the most interesting part here by far.

Yeah.

As we keep seeing these tables grow, you know, we kind of realized, also due to some access patterns that were introduced, it wasn't really tenable to keep just appending data month after month for every customer in terms of the line item table.

And yeah, this was also painful for the invoice table, which, you know, there's a lot of write amplification between the 2.

The average invoice has, you know, dozens and sometimes hundreds

of line items, depending on how a client's using our platform.

And so invoices have to be queried a lot of different ways.

We didn't really see a way of moving that outside of a relational store; being able to use foreign keys and constraints with it and the rest of our data was providing too much value to really do anything different there.

We did wind up partitioning it just to like enable the autovacuum

process to keep up much more easily.

But we realized we could constrain the access pattern for the

line items sufficiently that it didn't actually need to be in

Postgres.

Because I guess within Metronome, you know, over the course of a month, you have the notion of an invoice that's in a draft state where we'll calculate on demand for you, so we're not actually persisting data associated with it.

Then it's only after you go through a finalization, which is typically the first of the month, because most billing periods are going to be monthly, that we do all the math, do the calculations, and persist that data.

From which you can then infer our workload sounds pretty funky.

So we've got a steady trickle of
writes for configuration changes.

And then once a month, we have
to write out several billion rows

which you know amounts to like
1 to 2 terabytes of data.

Yeah

Michael: all on the same day.

Nate: As close to it as we can
get.

Nikolay: Yeah very different from
social media.

Very different.

Nate: Otherwise it's, you know, a more typical workload, you know, we have 99-plus percent reads versus writes.

It's a lot of compute on demand.

So it's a little funky, I think
that because the high spikes

in write volume are so periodic
is why we haven't seen multixact

problems in the past.

I think we'd likely narrowly been
avoiding them because it's

so hard to observe whether or not
you're approaching them.

I feel like this is 1 of the metastable states of Postgres, if you're familiar with that term, far from the definition of stable, where often things are working, working, working, and they just fall off a cliff or entirely lock up.

But the basic trigger here being we identified we wanted to make this architecture change, and then we were running the very long migration process of pushing these line items into S3, which is what our back-end store was going to be, since their access pattern was so simple.

But also rewriting and moving the invoices from 1 table to another.

And unfortunately, we were doing that broadly 1 by 1, which we'll get to why that's relevant.

Nikolay: So you mentioned performance
cliff.

This is a very good term when we talk about multixacts,

when we had different cases in
the past where, as I mentioned,

contention happened.

And interesting enough, like we
always, like 20, 15, 10 years

ago, we often saw opinions that
foreign keys are dangerous, like

they have huge overhead, don't
use them because like they slow

down everything and so on.

But interesting that now we can
agree with it, but in very different

way, in an unexpected way, because
they, yes, they slow down

writes, but we can afford it.

And your table, which has grown to 15 terabytes, is a good example,

right?

You survived with foreign keys
and you paid that price to have

good data consistency, right?

But then performance cliffs, right?

Nate: Yeah, and for that table
in particular, the reason why

we were able to move that was it
only had 1 foreign key, which

is back to the invoice row.

It was much simpler.

That table is effectively append-only.

It was a workload that just really
didn't have to be on Postgres.

I don't think we were taking advantage
of the qualities to justify

the cost.

Nikolay: So let's discuss multixacts
in general, like why we

should think about them.

Like, you know, my own story is
usually, like with multixacts,

I take some monitoring, this is
my hobby.

Take some Postgres monitoring and
we check which metrics they

have.

And usually they have transaction
wraparound risk analysis,

right?

Like to control freezing process
and to understand how far we

are from transaction ID wraparound.

My question number 1, where is
multixact ID wraparound analysis?

Most monitoring systems lack it.

But they lack it because it never
happens.

Usually, like, transaction ID wraparound
happens.

We know several blog posts and
horror stories from the past.

But multixact ID wraparound,
nobody experienced it.
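For reference, the wraparound-side check being described usually looks something like the query below; it covers multixact ID age only, not member-space consumption, which is exactly the gap discussed in this episode.

```sql
-- How close each database's oldest multixact ID is to the ~2 billion limit.
-- This is the classic wraparound check; it says nothing about member space.
SELECT datname,
       mxid_age(datminmxid) AS multixact_id_age,
       round(100.0 * mxid_age(datminmxid) / 2147483647, 2) AS pct_towards_wraparound
FROM pg_database
ORDER BY multixact_id_age DESC;
```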

So OK, why should we think about multixacts?

Tell us more technical details
what happened with your case.

Nate: Well, briefly I'll respond
to the point about foreign keys.

The community is right that you
don't always need them.

But I think what can be tricky is that when you're a very young company making lots of changes to your product, trying to find that product-market fit and get to the point where scaling becomes the biggest concern, foreign keys are really useful to prevent teams from stepping on each other and breaking your data.

And only when those product areas
and like the platform you're

building on top of become mature
enough, can you maybe afford

to put in the, you know, like pull
request level, you know, integrity

checks that make sure you're not
changing some access pattern

or maybe build the asynchronous,
you know, like offline data

consistency check to make sure
you're not violating it.

And this is a case where we didn't
think critically about have

we made that leap from 1 side to
the other.

And in retrospect, we have.

And it's 1 of the mitigations we
took.

Yeah.

Where I guess I would defend having a really strict schema early on, and then as you scale the platform, it just becomes necessary to progressively relax some of those guardrails.

Nikolay: Yeah, to maintain good
quality of data, that's absolutely

great advice.

But let's discuss how multixacts
are created, and so on.

Andrew: Sure.

Yes.

So I suppose we should start with
what a multixact even is

and when it's used.

So basically, in Postgres, when
you have 2 or more transactions

trying to lock the same row at
the same time, for instance, when

you're using SELECT FOR SHARE or
SELECT FOR UPDATE or in fact

when you're inserting into a table
that has a foreign key to

another table, Postgres uses a
system called multixacts or multi

transactions.

And a multixact is basically a
logical object that represents

a set of transactions locking the
same row.

Each participating transaction
is called a multixact member.

And those are a little distinct,
but they are related.
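A minimal two-session sketch of what Andrew is describing, with an invented table; pgrowlocks is an optional contrib extension that makes the resulting multixact visible:

```sql
-- Invented example: two sessions share-lock the same row, so its locker
-- becomes a multixact whose members are the two transaction IDs.
CREATE EXTENSION IF NOT EXISTS pgrowlocks;
CREATE TABLE t (id int PRIMARY KEY);
INSERT INTO t VALUES (1);

-- Session A:  BEGIN; SELECT * FROM t WHERE id = 1 FOR SHARE;   -- keep open
-- Session B:  BEGIN; SELECT * FROM t WHERE id = 1 FOR SHARE;   -- keep open

-- Session C: while both are holding their locks, "multi" is true and "xids"
-- lists both member transaction IDs.
SELECT locked_row, multi, xids, modes FROM pgrowlocks('t');
```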

And the way that a multixact
is stored is that it's represented

by a 32-bit integer ID and that's
where your wraparound issue

can happen with multixact IDs.

But in this scenario, that was
actually not the problem.

For every multixact, Postgres
keeps an offset into a globally

shared, append-only, multixact
membership space.

These are stored in the pg_multixact/members files on disk.

You can think of it logically like a simple least-recently-used cache that is contiguous, so you have to append to it every single time.

And so when you have to store these
members, you will have to

look up an offset into this theoretically
long array, and each

1 of your transaction IDs is stored
in 1 slot in this array,

and this array is indexed by a
32-bit integer.

This membership array is what was exhausted in our case.

So due to the way that multixacts
are created and subsequently

reaped, it is easy to exhaust this
membership space when you

have a lot of parallel transactions
attempting to lock the same

row, whether it's through foreign
keys or whether you're doing

SELECT FOR SHARE or SELECT FOR
UPDATE.

And the reason for that is that
since these multixacts are immutable,

suppose that you have 3 transactions
trying to lock the same

row.

You would first create the first
multixact between the first

2 transactions, and then the third
transaction would come in

and you would have to create yet
another multixact with the

old transaction IDs and then the
new 1.

But the membership space is not
relinquished until a vacuum occurs,

which can happen later, or perhaps
never in some cases, if you

are constantly adding new transactions
and you're super busy

doing all these updates.

And what's interesting is this.

The multi-transaction membership space cannot be reclaimed in

chunks.

It has to be reclaimed contiguously.

So suppose you have a very long-running transaction that is part

of a multixact that is very old.

It can hold up the entire vacuuming of subsequent multixacts

that are finished but are short-lived.

And it is not until the oldest multi-transaction is finished

that the vacuum can actually reclaim all the membership space

from the old multixact to the most recent one.

And this scenario creates the ability to exhaust the multixact

membership space.

And that's what we experienced ourselves.

Nikolay: Yeah, I wanted to add one thing.

It might sound strange that SELECT FOR UPDATE also may produce multixacts.

You also explored this, right?

It happens only when we have sub-transactions, I think, right?

So if it's a single transaction doing SELECT FOR UPDATE, there is no place for multixacts.

But if we start doing savepoints there, basically we have different

transactions inside one transaction, and then multixact mechanism

starts working again.

And I just remember, like, how many years ago?

Four years ago, I explored this when I explored another big performance

cliff related to sub-transactions.

So there I noticed this as well.

Yeah.
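A hedged sketch of the case Nikolay mentions, with an invented table name: once a savepoint creates a subtransaction that touches a row its parent already locked, the row's xmax has to record two locker IDs, which is done with a multixact.

```sql
-- Single session: the FOR UPDATE lock is held by the main transaction, then
-- the UPDATE runs in the subtransaction created by the savepoint, so Postgres
-- records both (sub)transaction IDs for the row in a multixact.
BEGIN;
SELECT * FROM t WHERE id = 1 FOR UPDATE;   -- plain xid lock so far
SAVEPOINT s1;
UPDATE t SET id = id WHERE id = 1;          -- lock + update by different xids
                                            -- => xmax becomes a multixact
COMMIT;
```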

Andrew: That is true.

Nikolay: Good introduction.

Thank you so much.

So yeah, that's not all of it, right?

There are also some problems, like quadratic behavior, right?

Andrew: Yes, I hinted at that before.

So essentially, if you have, let's say, five transactions that are

attempting to lock the same row, you have to create the multixacts

one at a time by essentially cloning it with the previous transaction

IDs and adding that additional transaction ID one at a time.

So now you have a multixact with two, then three, then four, then five.

And you add that all up, and that becomes a quadratically growing

use of multixact membership space, at least until the

vacuum can actually clean up those old multixacts.

And so that's where you will find quadratic consumption of the

multixact space.

And this can happen very quickly if you have a lot of concurrent

transactions attempting to, say, insert into a table with a foreign

key that links to another table that is a low cardinality table.

So they're all trying to reference that exact same row.

Nikolay: But it's also applicable when we also have an additional long-running transaction, right?

Because otherwise, cleanup will happen, right?

Andrew: Correct, so a long running
transaction will cause the

vacuum to essentially stop reclaiming
multixact membership

space until it is finished, because
it can only do it sequentially.

Nikolay: Is it this transaction,
long-running transaction, should

it also participate in multixact
mechanism or it can be any,

even SELECT, simple SELECT?

Andrew: A simple SELECT probably
won't do it.

It would have to be in a multi-transaction
context.

So if you, let's say, have a multixact and you have 2 transactions in it and the first 1 finishes, that multixact is not cleaned up until the second transaction, the other part of that multixact, is also finished.

So all transactions in the multixact
have to finish before the

vacuum can reclaim it.

Nikolay: Yeah, so this highlights pieces of advice like partition your large tables and avoid long-running transactions in OLTP, right?

Like just yet another reason to avoid long-running transactions.

And long-running transactions on replicas, reported via hot_standby_feedback, they are harmless here.

So, right, so, like, again, yeah, yeah.

I'm trying, we are debating right
now how to properly monitor

long running transactions because
there are different effects,

negative effects, and we need to
distinguish them.

In this case, yeah, we should probably monitor long-running transactions which participate.
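A minimal, hedged starting point for that kind of monitoring; the 5-minute cutoff is an arbitrary placeholder, and plain pg_stat_activity cannot tell you whether a given backend actually participates in a multixact:

```sql
-- Long-running transactions of the kind being discussed. Anything that stays
-- open keeps vacuum from reclaiming member space behind it, so the oldest
-- entries here are the ones to chase down first.
SELECT pid,
       now() - xact_start AS xact_age,
       state,
       left(query, 80) AS query
FROM pg_stat_activity
WHERE xact_start IS NOT NULL
  AND now() - xact_start > interval '5 minutes'
ORDER BY xact_age DESC;
```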

And by long, how much long was
it in your case?

Like 1 minute is long or 30 minutes
is long in this case?

Andrew: I'm actually not sure.

I don't actually have the metrics
on that.

Nate, do you know?

Nate: I think 99.9%, probably more nines, of our queries are sub-second.

And I guess I previously worked on a database team at Stripe, and had some of these lessons, just like, really deeply, or have all the scar tissue, in short.

Running a platform at their scale
is difficult.

And so, I've been pushing us in
that direction and kind of moving

anything that we expect to be longer
off of the online workload

into our lake house.

And so it's very, you know, occasionally
we have queries that

go 5, 10 seconds, but I believe we have a, like, you know, global timeout of about 30 seconds.

So I guess, like, part of my understanding here is that these multixacts can be daisy-chained as you have many concurrent overlapping transactions occur.

I think that's more what we were running into than a long-running 1: we just had this steady stream of overlapping transactions.

And until you break that chain, you're in this, like, hosed position where you keep, you know, very quickly and at an increasing rate, consuming space.

Andrew: Yes, I was merely stating the cleanup scenario and

the challenges there, but you're
absolutely correct.

When you have this daisy-chaining,
you know, constantly creating

new multixacts by adding new transaction
IDs to it, you will

very quickly exhaust your membership
space.

Michael: You say very quickly and very easily, but I mean, we are still talking about 2 billion, right?

Like, there's been a debate in the Postgres world about 32-bit versus 64-bit for a long time, but we are still talking about 2 billion.

So you say very quickly and very easily at your scale, right?

Like this is also due to how many
things you're doing and maybe

the pattern as well, like invoices
that have lots of line items

or, you know, like it feels like
there are specifics here that

make this more likely.

I'm not saying it shouldn't be
a problem, I'm looking forward

to hearing how you diagnose this
and what we can do better.

But I feel like we should say it's
not necessarily very easy

to hit this limit.

I'm not actually sure; I think I found 1 other Stack Overflow post from somebody else who hit it. I don't know if you came across that while you were in the middle of this, but I couldn't find any other examples on the mailing list. It's super interesting.

I think we're seeing more and more companies using Postgres at this scale. Nikolay mentioned a few of them that we talked to, but we also had an episode with folks from Notion and Figma and Adyen as well.

So there are a lot of high volume
companies now on Postgres.

So I think this stuff is super
interesting, but 2 billion still

seems like, even with the daisy
chaining, it still seems like

a lot to me.

Nikolay: Let me disagree with you.

There is no debate about, like, 64 or 32.

Postgres desperately needs 64-bit transaction IDs.

And for example, OrioleDB already
implemented it.

But OrioleDB is like alternative
storage engine and it's right

now in hands of Supabase in terms
of development.

And I think right now there is
launch week and they published

some benchmarks again, like it's
great, great project.

So I think many projects dream
about the day when we will have

64, not 32.

Right now only several forks have implemented it.

There are commercial forks which implement 64-bit transaction IDs and multixact IDs.

Right?

Michael: I didn't know that.

Sadly not Aurora it sounds like.

Nikolay: Yeah.

So 2 billion is not a lot already.

It's not a lot.

Andrew: Yeah.

It's pretty easy to hit it if you're
going at scale and you're

hitting it with quadratic velocity.

You get to 2 billion quickly.

It does sound like a big number,

Nikolay: but it's not that big.

Michael, sorry for the interruption.

Michael, you can take a chessboard and start putting 1 grain on the first cell, and so on.
Michael: Oh, yeah.

So wait, maybe I misunderstood.

If there are 5 row changes we're
trying to do, do we create 2

to the 5 rows in this table?

Or is it like 1, then 2, then 3?

Is that only 5 rows?

So in the member space?

Andrew: Yeah, good question.

So basically, it's more of a sequencing
thing.

So essentially, you have 2 transactions.

You start with 2.

You get 1 multixact taking up
2 members.

The third transaction comes in,
you create a new multixact

with 3 members, but that old 1
with 2 members has not been cleaned

up yet, so you're now taking up 5.

The third 1 comes in, I'm sorry,
excuse me, the fourth transaction

comes in, you create a multixact
with 4 and 3 and 2.

So now you've added an additional
4 on top of that.

And now the fifth 1 comes in, and now you have multixacts with 5, 4, 3, and 2 members.

And that all adds together; it takes up membership space until the vacuum can come along and reap those old multixacts.
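To put that walkthrough into one expression (ignoring any cleanup along the way): after the n-th overlapping transaction arrives there are live multixacts of sizes 2, 3, ..., n, so the member slots consumed are roughly

\[ \sum_{k=2}^{n} k \;=\; \frac{n(n+1)}{2} - 1, \]

which for Michael's 5 transactions is 2 + 3 + 4 + 5 = 14 member slots, not 2 to the 5.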

Nikolay: Yeah, chessboard example
is good here, I guess.

We can try to understand the volume when we approach cell number 64, how big it will be.

Michael: Andrew, you've done such
a good job describing how this

could happen, but when you're in
the middle of this, I'm guessing

this is relatively new knowledge,
or like when you're in the

middle of this, how did you work
out what on earth was going

on?

Andrew: Yes, that's quite the journey.

So I'm actually really new to Metronome.

I just joined in April.

So when this incident was occurring,
it was only my fifth week

or so.

And this was occurring on a Saturday
right before a vacation

to Miami.

So, wow.

Nate: Well, let's rewind a little
bit, because I guess the most

embarrassing part of this is like
the first, what we now know

was the first occurrence, was the
full week prior.

Andrew: Yes.

Nate: Where our migration workload
had basically implicitly triggered

this.

We hadn't really known what to
do.

And unfortunately we weren't able to get to a full RCA.

And so, yeah, we got to our best
explanation, which is like,

oh, there's something in transaction
ID wraparounds that we're

not grokking correctly, or Amazon's
hiding something from us.

And then we proceeded to turn it
back on with a slightly lower

concurrency, you know, 5 days later.

And so like, yeah, so we didn't
get there immediately.

But then, you know, that was a
P1 incident, you know, internally,

tried to be cautious, but didn't
follow through sufficiently.

And so this is like kind of in
the second chapter where Andrew

comes in.

Andrew: Yes, that's right.

I was not a participant in the first 3 incidents.

It was only the fourth 1, which
I think was on a Saturday in

which the call went out from Cosmo
to see if additional engineering

resources could come and jump on,
so I decided to.

But having been so new, I actually
didn't have any tactical knowledge

I could use to address all the
specific fires that were being

started by this.

So I decided to actually figure
out what exactly was happening.

And as Nate had said, we weren't
really sure, but we knew it

was related to multixact in some
way.

We just weren't really sure exactly
why.

And we were struggling to really
fully understand because in

our initial evaluation, we saw
that the number of multixact

IDs that was being created was
well below the threshold of like

400 million.

And we weren't really sure why.

So what I did is I took that knowledge
and I just went ahead

and looked into the Postgres source
code and looked at the multixact.c

file, which is this enormous
file that contains pretty much

all the logic that you'll need
to know about multixacts.

And in there, we were working with
the hypothesis that somehow

this threshold of 400 million,
but we were only seeing 200 million

before vacuums started occurring,
was related in some way.

So I found the function called MultiXactMemberFreezeThreshold, which calculates the threshold at which you should basically start an autovacuum, based on the number of multixacts that are in use and, notably, the number of multixact members that are in use, and that's what caught my eye.

So from there, I mathematically
proved that the amount of members

should be extremely high given
the code path that we are obviously

hitting, because we're seeing vacuums
occur at a specific number

of multixacts being used.
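A hedged sketch of the logic Andrew is describing, as I read MultiXactMemberFreezeThreshold() in multixact.c: once member consumption passes roughly half of the 2^32 member space, the effective multixact freeze age is scaled down below autovacuum_multixact_freeze_max_age, which would explain vacuums kicking in around 200 million multixacts instead of 400 million. The constants and the "current" counts below are assumptions for illustration, not values read from the real system.

```sql
-- Runnable, but purely illustrative arithmetic of the threshold, with
-- made-up "current" numbers plugged in; not a monitoring query.
WITH params AS (
  SELECT 2147483647::bigint AS safe_threshold,    -- ~half of the 2^32 member space
         3221225472::bigint AS danger_threshold,  -- ~three quarters of it
         400000000::bigint  AS freeze_max_age,    -- autovacuum_multixact_freeze_max_age
         2500000000::bigint AS members_in_use,    -- assumed member consumption
         200000000::bigint  AS multixacts_in_use  -- assumed multixact count
)
SELECT CASE
         WHEN members_in_use <= safe_threshold THEN freeze_max_age
         ELSE LEAST(freeze_max_age,
                    GREATEST(0, multixacts_in_use
                                - multixacts_in_use
                                  * (members_in_use - safe_threshold)
                                  / (danger_threshold - safe_threshold)))
       END AS effective_freeze_threshold
FROM params;
```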

And from there I was able to find
log messages related to multixact

membership exhaustion and correlate
that with Datadog logs that

we had found ourselves in the recent
week.

And there we could conclude that
multixact members were being,

the membership space was being
exhausted.

And we were able to put together
what I would consider kind of

a hacky estimate of the number
of multixact members that are

currently in use by looking at
the files that are written for

each member and estimating by multiplying
by a constant how many

members are currently in use at
the moment.

And that's how we came across the
solution, or the cause, I should

say.

Michael: I think this is worth diving into a little bit, because I definitely didn't realize that there were at least 2 exhaustion possibilities with multixacts.

So we've got the total, well, why are there these 2 numbers? Like, why do we have 2? There's the 2 billion, and you were monitoring at 400 million, which should be super safe, right?

Because we have the possibility to go up to 2 billion, but you want it to be cleaning up much sooner than that.

So you monitored at 400 million
or so.

Why is there another limit?

Andrew: Yeah, so we have 2 limits
here.

We have an ID that we give to multixacts.

That's a 32-bit integer.

So, of course, you cannot assign
more than 2 billion to these

multixacts because we don't assign
negative IDs.

And then you have the multixact
membership space.

Now that is represented logically as a simple LRU that is append-only.

And so we can only address up to
32 bits of membership space.

Each membership slot is addressed
by a 32-bit integer.

And thus now you have 2 different
values that you have to worry

about.

You have to worry about how many
multixacts you're creating and

how much membership space you're
consuming with those multixacts.

And that's where those 2 numbers
come from.

Michael: And 1 of the great things
in your blog post you mention

is how do you even monitor that?

Have you worked out a way since?

Is there anything you can do there?

Andrew: Yes, I think I might have
just offhand mentioned it,

but we use an estimate where we
read the membership files because

they are written to disk, and then we know that the files have
we know that the files have

to be a certain size and we multiply
by that constant, whatever

the size of these members are,
and we get an estimate for the

number of members.
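A hedged sketch of that file-counting heuristic; it needs superuser (or pg_read_server_files) and direct filesystem access, so it will not work as-is on most managed services, and the 1,636-members-per-8kB-page figure is my reading of multixact.c rather than anything the interview confirms:

```sql
-- Rough estimate of member-space consumption from the on-disk segment files.
-- Each members segment is 32 pages of 8 kB, and each page holds ~1636 members.
SELECT count(*)               AS member_segment_files,
       count(*) * 32 * 1636   AS approx_members_in_use
FROM pg_ls_dir('pg_multixact/members') AS f(name);
```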

Now, this is not the best solution,
which is why I submitted

a Postgres patch to surface this
telemetry directly.

Michael: Nice, Thank you.

Nikolay: Yeah, that's what I wanted to mention.

We have several things going on in mailing lists after this blog

post.
By the way, before this, why did you decide to post in such great detail?

Like, what was your intention?

To help the community, or...?

Nate: I guess a couple of factors.

1, we do want to help the greater Postgres community.

I think there are lots of other situations where we've benefited

greatly by blog posts put out, in particular by companies like

Percona, as well as others who have maybe led us away from paths

that would have wound up being painful.

But another part is that internally at Metronome, we view ourselves as an infrastructure provider for our clients.

And while I think it's okay to say that we're not fully recognized as that yet, and we're still on our journey to solidifying that narrative, it is how we want to act, and how we want to be perceived.

And we are in a critical path for a lot of our clients and how

they operate.

And so for us to have multiple multi-hour downtimes in the space of less than 10 days is pretty bad, and honestly not what you want from your infrastructure provider, if you choose to see us that way.

And so having a really crisp explanation of, hey, we are a little

embarrassed by this, but here's what we learned so you can learn

from it, as well as a little bit of vetting of, we didn't run

into the easiest thing.

It is novel.

There is a reason why it took a while for us to sort it out.

And so I think a lot of our partners felt a lot better.

Better could be a strong term.

But when you have an honest RCA, and Cloudflare might be leading

the industry in how they operate in this vein, you feel a lot

better that your provider is actually learning from mistakes

and is going to do a better job of avoiding them than you might

internally if you're evaluating, should I just build this in-house

instead?

Nikolay: So great transparency and so on.

It pays off over time.

Yeah, that's great.

So I saw several mailing list discussions, but before that I

also like, I feel like with multixacts, we can have different

kinds of problems.

1 is wraparound, right?

Okay, as I said, I never saw it.

And that's why many monitoring systems even don't monitor it.

And we have several cases of SLRU limits reached, right?

And 1 of them was explored in the past.

We can provide maybe blog posts, links to description.

And this is your case.

And your case obviously bumped into a lack of observability, like how much of the capacity we've reached, how far from the performance cliff we are, right?

But also besides that, I saw different topics were discussed

in mailing list.

For example, there is a setting vacuum_multixact_failsafe_age,

which is about wraparound, actually.

Peter Geoghegan brought attention to the fact that this setting doesn't take into account multixact membership exhaustion.

So it's also an interesting side effect from your blog post.

Maybe fully unexpected, right?

Then I saw, Andrew, I think, your proposal to have multixact monitoring, like a system view, so any monitoring could have some graphs and also alerts, probably, right?

And this proposal is currently at what stage?

I saw it's in the commitfest, but it requires review, right?

Andrew: Correct.

Yes.

The patch is in the commitfest and it definitely still requires review.

We've, well, I've engaged with another contributor named Naga, and we've actually submitted competing proposals.

His actually directly reads from the memory space, while mine uses the pg_stats infrastructure to surface this telemetry.

But yes, it is still just in review.

It has not been accepted yet, as far as I can tell.

Nikolay: Yeah, so it's great that it's registered in the commitfest, and I think some people who listen to us maybe also watch our hacking sessions, which we do with Andrey and Kirk almost every week.

We missed today because of some issues, but we are going to be back next week.

There we show how easy it is actually to test.

You take Cursor, you tell it to just compile Postgres, apply the patch, and let's start using it.

And just you, instead of diving too much into C code, you think

about use cases and how to, for example, emulate problems, right?

And so on.

This is actually fun.

And I encourage our audience to maybe start with their first review and test this patch, and this can increase the chances that it will be included in Postgres 19, right?

Andrew: Yes, what's also kind of interesting is I know you earlier

mentioned that there was a debate about moving to 64-bit integers

and in fact part of this commitfest does include a proposal

to change those offsets to 64-bits.

So there's a lot of changes that are coming in 19, it seems,

if they're accepted, of course,
which would help alleviate this

multixact membership and multixact
issue writ large.

Nikolay: That's great, but it's
only next year.

Andrew: I suppose we can't go too
fast.

Michael: Well, and even longer
if it's Aurora, right?

They generally have a bit of additional lag getting major versions, because it's a fork that they have to do a lot of work on.

On that note though, you mentioned
trust in an infrastructure

provider and how issues, big issues
can rock that trust and you

need to see some maybe root cause
analysis or maybe some action.

Has this at any point rocked your
confidence in Postgres or less

so Aurora, I guess, in terms of
is it the right platform as you

scale?

Maybe initially and then since
you've worked out exactly what

the issue was, what are your thoughts
there?

Andrew: Yeah, I don't think it
actually has really rocked our

confidence in Postgres.

I haven't seen anyone run for the
hills yet.

But you know, it definitely surfaces the fact that there is quite a bit to learn and that this is a very complex system.

But to that end, something I've learned, at least participating with the Postgres dev community and being on the mailing lists, is that there are a lot of very smart individuals who are part of this community and who are actively contributing and are passionately debating what should and shouldn't be part of it and how things should be designed.

So, in many ways, it actually may
have built my confidence to

see so many people paying attention
and caring about it from

all across the world, from Fujitsu,
Amazon, Google, ourselves,

even just regular folks doing it
in their hobby spare time.

So in many ways perhaps it's a testament to how far open source can go and how much confidence we can have in these systems, because, like you said, this is a pretty rare instance, and many of the core cases that most people need to use are covered, or at least will be.

Michael: That's really good to hear, and of course when people hit issues with proprietary systems like SQL Server and Oracle, I'm sure when they blog about it those conversations do happen internally at Microsoft or Oracle, but you can't see them, you can't see the actual conversations.

And part of the way you diagnosed this issue was diving into the source code, so there are some really cool factors that wouldn't have been possible if this wasn't open source.

Yeah, that's not just open source,
but open discussions as well,

open mailing lists.

So yeah, that's really good to
hear, but it would have been very

understandable if...

People have definitely left Postgres
for less good reasons.

So not naming names, Uber.

Andrew: Yeah, I will say, as we
said in our blog post, 30 terabytes

is not a small amount of data to
move.

And so if we decided to make a
large decision like that, I believe

we would have to reckon with the
transition itself.

And so in many ways, you know,
we probably want to try to work

with what we got.

And thankfully, I think it's served
us well for the most part.

Nate: Yeah, there's something to
be said for the devil you know.

And, you know, there are lots of other Postgres-compatible systems that, you know, promise you the world, and something typically gives, whether it's the price, the actual performance on either the reads or the writes, or how maintainable it is to actually run it.

And yeah, I think it's clear that
we're going to have to undergo

the investment to figure out what's
next, because we can only

go single writer for so long, especially
with the demands that

we put on the database on the first
of the month every single

month.

A great extreme.

Nikolay: Yeah.

And behind the scenes, we also
chatted with Nate about the formula.

So Michael, for the first time,
we should include some formula

in the show notes.

There's a formula saying that roughly 63,000 overlapping transactions are enough to reach the 2 billion limit.

Nate: Yeah, and it's a really dangerous thing to do some math on the fly that can be checked by anyone after the fact.

But yeah, as I worked it out, yeah,
only 63,000 overlapping

transactions, which is not that
many.

Relatively easy to do depending
on how you, you know, with a

highly concurrent workload operating
1 row at a time with a potential

overlap between them.

You can consume that really quickly.
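A quick back-of-the-envelope check of that figure, under the simplifying assumption that no cleanup happens while the chain is alive: n overlapping transactions consume roughly n squared over 2 member slots, so reaching 2 billion takes about

\[ \frac{n^2}{2} \approx 2\times 10^{9} \quad\Longrightarrow\quad n \approx \sqrt{4\times 10^{9}} \approx 63{,}000 \text{ overlapping transactions.} \]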

Andrew: It might not only be transactions.

If you take a savepoint, that's almost seen as like a sub-transaction and will also create a new multixact as well.

So if you're doing a lot of savepoints, you can easily hit that as well.

Nikolay: Are you using sub-transactions?

Andrew: I'm not actually sure myself,
but it is a possibility.

Nate: I don't believe we are, at
least not intentionally.

Nikolay: I think right now, in the latest versions of Postgres, there's the ability to change SLRU sizes to avoid SLRU overflow cases.

But my opinion about sub-transactions in projects like this, at this scale, remains: just avoid them.

Yeah, because they can bring other problems as well, like the subtrans SLRU can be overflown and so on.

And they consume transaction IDs.

So yeah.

Good.

That's, I think, quite a good deep dive, unless we forgot anything.

Michael: Any last things you wanted
to add?

Nate: I think we can talk about
the mitigations, just so that

if someone's unfortunate enough
to run into this, that they have

a few breadcrumbs for how to work
around it.

Because it was rather expensive
for us to figure out what those

were, including Andrew and others
digging through the source

to really understand it.

But a lot of what we inferred after
the fact was, so we had this

migration which was doing a bunch
of row by row writes, not ideal

if you're doing a highly concurrent
process here.

And so we did make those batches.

We also did a lot of deep thinking
on which of the foreign keys

we actually needed on this new
table.

Nikolay: Batches. How did you find the ideal batch size for you?

Nate: I don't know that we're at an ideal, but we climbed the hill far enough that it's not a problem.

Nikolay: I see.

Nate: And so it was kind of a trade-off
between what throughput

we could obtain and making sure
that we weren't having the kind

of like overtime growth in our
heuristic for the member space

utilization, which is mainly based
on how many of the offset

files existed.

Yeah,

Nikolay: I usually say, when choosing batches, choose the biggest possible, but avoid locking rows for so long that it affects users, because we know user perception is 200 milliseconds, so if you lock, like, 1 second should be the maximum, 2 seconds the absolute maximum, for the batch size choice.

Nate: Yeah, it's less often.

So we're at even batches of around, like, 200, which is enough to mitigate this concern and kind of keep all of the operations, you know, short, since there's a very online workload hitting this as well.

Nikolay: This is not users in terms
of batch transactions, but

users read results, right?

Or no?

Nate: Nobody's using it now.

So sorry, this is in the context
of our greater migration to

move line items, line item construct
out of our database, as

well as partitioning the invoice
table, which has also grown

to be quite large.

And so, you know, kind of we would
read the old version, all

the line items in the invoice row,
write those to a new format

in S3, which is ultimately keyed
by the invoice ID.

And then a lot of logic, like we
can't just do a deterministic

mapping because then you get hot
partitions in S3, so you have

to do a little bit of extra work
and leave yourself a link in

the actual database row.

And so then we were writing those
out to the new table, which

is partitioned 1 by 1.

And our diagnosis, and it's a little hard to confirm this, because again, there is no observability bit here, is that that was the big mistake, as well as the single inserts, like, very quick; the issue was that they all reference a handful of other tables via foreign keys, some of which were very low cardinality.

So, you know, we have like some
enum tables and we, you know,

for better or worse, in this case
worse, had chosen to use a

foreign key reference to them.

And so if every invoice is corresponding
to the USD credit type,

for example, you're going to have
a lot of overlapping transactions

all accessing that row to verify
the integrity of the data.
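A stripped-down illustration of that pattern, with invented table and column names rather than Metronome's actual schema: every concurrent insert's foreign-key check takes a FOR KEY SHARE lock on the same row of the tiny reference table, so overlapping inserters get grouped into multixacts on that one row.

```sql
-- Tiny enum-like reference table; in the scenario described, nearly every
-- invoice points at the same row (e.g. 'USD').
CREATE TABLE credit_types (id int PRIMARY KEY, name text);
INSERT INTO credit_types VALUES (1, 'USD');

CREATE TABLE invoices (
    id bigserial PRIMARY KEY,
    credit_type_id int NOT NULL REFERENCES credit_types (id)
);

-- Run from many sessions at once: each insert's FK check locks credit_types
-- row 1 with FOR KEY SHARE, so overlapping transactions pile up in multixacts.
INSERT INTO invoices (credit_type_id) VALUES (1);
```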

And so we kind of walked through it, and we realized a lot of these

low cardinality cases, we don't
really change those code paths.

We're quite confident they're going
to be stable.

It's only the higher cardinality
cases.

An invoice also references its
customer that owns that invoice.

But that's a much higher cardinality
connection.

We're much less likely to have
overlapping transactions referring

to the same customer in a migration
like this.

Nikolay: Without foreign keys,
did you think about periodical

checks of referential integrity,
like asynchronously?

Nate: Yeah, so we're going to be
adding those to run on our Lakehouse.

So yeah, we have a syncing process
that brings all the files

over into Parquet.

Eventually, we're going to finish the work on getting it to be CDC, so it's more incremental, probably using Iceberg.

And so on those, we'll run every,
I mean, it's much cheaper to

run the checks there, for a lot
of reasons.

Nikolay: More and more reasons
to move data out of Postgres.

I understand it for analytical
reasons, but yesterday S3 released a vector index.

If you haven't heard it, it's interesting.

So I'm just curious where the place
for Postgres will be in our

systems.

But I totally understand, for analytical workloads, definitely.

Nate: Yeah.

I mean, everything that's OLTP-shaped
is going to be staying

in Postgres for some time.

Can't see a reason to move it anywhere
else.

Nikolay: Well, yeah.

Michael: Great.

In terms of mitigations, did you
tune any of the settings around

multixact parameters?

Yeah.

What did you do?

Which ones did you increase or
decrease?

Andrew: I don't know if there were actually any specific Postgres flags that we changed.

I don't think we changed any constant specifically.

But what we did do is we did introspect into the code base and

determine the danger levels for membership spaces.

And so we used that as a threshold for alerting ourselves with

Grafana.

And so if our membership space consumption breaches a certain

level, what we will do is take actions to reduce the amount of concurrent transactions that are occurring until Postgres can recover that membership space.

So it's a little bit of a manual intervention at the moment,

but I don't think there's any specific knobs that we can tune

because it is fundamentally limited by a 32-bit integer.

And the telemetry with respect to multixact membership numbers

is not used anywhere else except in the detection of when autovacuum

should occur.

And that is sort of automatic, and kind of hard-coded.

Nikolay: Yeah, and the current utilization, you use function

pg_get_multixact_members, right?

Or how do you?

Andrew: No, I don't believe that is it.

I think it's some specialized function where we actually have

to get all the files and read the file sizes themselves.

So it's not something from Postgres, I don't believe.

Nikolay: But it's Aurora.

You don't have access to files.

Andrew: There is some function.

I can't remember the name.

Nikolay: It's like ls or something, right?

So it lists files.

You can still use SQL.

You can list files, oh, OK, interesting.

Nate: The query originated from an Amazon blog post.

I'll try to pull it up.

You can see if you want to include it in the show notes.

Nikolay: Let's make sure.

I don't remember if this recipe was shared in your blog post, how to monitor this.

Andrew: It was not, we did not post it there.

We actually intend to have a follow-up for a more engineering-based audience where we would detail these steps kind of in the

same way that we detailed it here, but in a more concrete format

for engineers and not for executives and company leaders.

Nate: Yeah, and we were a bit more timeline constrained for initial

RCA.

And we are optimistic that we can link to Andrew's patch for

the second follow-up blog post.

Nikolay: Exactly.

That's awesome.

Because probably we should add it to monitoring practically everywhere in large systems, right?

This would be the main outcome of this discussion, probably, right?

For those who listen and have big systems.

Andrew: Absolutely.

Nikolay: Before your patch, Andrew, gets accepted and everyone upgrades to Postgres 19, which will take some time.

Right, so...

Andrew: Yes, until then we definitely
have some guidance for

monitoring systems.

We admit though that it is not
the most ideal, but it is something

that people can use to guide themselves
away from disaster.

Michael: Well, and I think, there's the thread you mentioned from Peter Geoghegan; he's somebody that has worked in the past on mitigation factors.

So not only being able to monitor transaction ID wraparound, but he did some things to vacuum so that it's just less likely to get into that state in the first place.

And I think making that parameter he mentioned aware of the member space, or having some different parameter that kicks in and does, like, an anti-wraparound vacuum, or whatever the equivalent would need to be, earlier, would make this less likely to happen.

And of course, monitoring is a
good idea anyway, but prevention

seems even better.

So yeah.

Nikolay: Optimization inside Postgres could be implemented here, because it's quadratic behavior. At least, as in other cases in Postgres, it could be split into some pieces, like 128 pieces, you know, something like this.

I don't know.

I haven't seen discussion of actual
optimization to get more

capacity.

Have you seen it?

Michael: No, but I was looking through what parameters we even have that mention multixact, and 1 of them, in version 17, is multixact_member_buffers, which is quite low by default, only 256 kilobytes or 32 pages, and we got multixact_offset_buffers, also low, 128 kilobytes by default.

So I'm wondering whether increasing those would help with the member space, maybe that's the offset buffers one. And then, yeah, we also have vacuum_multixact_failsafe_age, which you mentioned, from version 14, that's the one from Peter Geoghegan, and 2 much older ones. I did wonder about the multixact freeze min age, which is 5 million by default; I wondered if lowering that would just make sure that the vacuum ran quicker when it did need to happen, you know, that kind of thing. And then there's the multixact freeze table age, which is 150 million, so again I'm not sure if lowering that would just mean you get vacuums more often, but I guess when we're talking about such a huge table, it's still going to be a problem with what the vacuum does, so, until you get partitioning in place anyway.
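For reference, the settings Michael is listing, with their current values and defaults, can be pulled straight from pg_settings; on versions before 17 the two *_buffers GUCs simply won't appear.

```sql
-- Inspect the multixact-related knobs mentioned above.
SELECT name, setting, unit, boot_val
FROM pg_settings
WHERE name IN ('multixact_member_buffers',
               'multixact_offset_buffers',
               'vacuum_multixact_failsafe_age',
               'vacuum_multixact_freeze_min_age',
               'vacuum_multixact_freeze_table_age',
               'autovacuum_multixact_freeze_max_age')
ORDER BY name;
```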

Thank you so much, both of you. It's been amazing learning about this, scary but great, and thank you for paying the pain and the cost so that many fewer people have to in future.

Nikolay: And special thanks for the transparency and admitting your own mistakes in the past.

I especially appreciate how you talk freely about your own mistakes.

This is very valuable and a good example for others as well.

Andrew: Absolutely.

Thanks for having us.

It was a pleasure to be here.

Nikolay: Thank you.

Nate: Yeah. It was a really fun chat.
