Nikolay: Hello, hello, this is Postgres FM.

I'm Nik, PostgresAI.

And as usual, here, Michael, pgMustard.

Hi, Michael.

Michael: Hi, Nik.

Nikolay: I proposed this topic because I thought we have a gap

here.

Before we started recording, you said we have already episode

about transaction_timeout, but it's very related but it's not

the same.

So I wanted to dive deep into long-running transactions.

What do you think?

Michael: Yeah, I was surprised we didn't have an episode on it

specifically already.

We had a first ever episode, didn't we, on slow queries and I

guess long transactions as part of that.

But we hadn't done a specific episode on it.

And I guess as a long-running podcast maybe it's definitely time

to revisit.

Yeah,

Nikolay: sometimes idle in podcasting.

So I would like to challenge us and think by the end of this

recording, we record in advance as usual because some preparation

is needed, post-processing is needed, thank you for doing this

by the way.

So I would like to challenge us and think should this episode

be called long-running transactions or somehow else?

Because you already understand what I'm talking about, right?

Michael: Only because of our pre-discussion did I get a good hint
from you.

But is it more the problems they cause that you want to talk about,
and therefore maybe it's only a subset of long-running transactions
that are an issue?

Nikolay: So listeners who listen to this already see the decision:
did we keep the long-running transactions title, or did we decide
to change it?

But the problem in my head is simple.

I see in monitoring tools, in blog posts, and everywhere, like

everywhere actually, that long-running transactions are bad.

And at the same time, I remember, actually it was an interesting

moment, I had an experimental educational course where I explained

various concepts of database health, and we touched this topic,

how long-running transactions are harmful to autovacuum, and I said
you just open a transaction and it will block autovacuum.

I said it's so easy to demonstrate: let's open a couple of psql
sessions, run VACUUM VERBOSE, and we will see it has dead tuples
but cannot clean them up, because there is a long-running
transaction in progress.

And I just wrote begin and it didn't work.

It cleaned up the tuples perfectly.

I think I also tried again with BEGIN and a SELECT, doing some
select and keeping the transaction open, being in idle-in-transaction
state according to pg_stat_activity.

And again, it didn't work.

And I looked stupid.

And the reason I looked stupid is because we actually have 2
problems and we constantly mix them.

And saying long running transactions
block vacuum is wrong.

Not all of them block vacuum.

Some of them block vacuum.

And we have, in my mind, we should
improve tooling and materials

and everywhere understanding and
start distinguishing 2 problems.

1 is lock-related and 1 is vacuum-related.

We should change monitoring tools
and snippets, everything, to

hunt 2 problems separately.

Diagnose them separately.
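For reference, the two problems can be diagnosed separately with a sketch like this (assuming enough privileges to see other sessions in pg_stat_activity):

```sql
-- Long-running transactions by time (lock-related risk),
-- and whether each one also pins the xmin horizon (vacuum-related risk).
SELECT pid,
       state,
       now() - xact_start AS xact_duration,  -- "long-running" by wall clock
       backend_xmin,                         -- non-NULL: holds the xmin horizon
       age(backend_xmin) AS xmin_age         -- horizon distance, in transactions
FROM pg_stat_activity
WHERE xact_start IS NOT NULL
ORDER BY xact_start;
```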

Michael: Yeah, should we go back
to the basics though, in terms

of, maybe let's first cover the
long-running transactions that

are harmful, or like can be harmful,
and then reverse our way

backwards to the ones that aren't
and therefore trying to distinguish

between them.

Nikolay: Some people say long transactions, which I think is an
oversimplification, because a transaction's text may be very long
but it runs fast, or it may be very short but very long-running.

Michael: Wait, what does long mean
if it's not just about the

time?

Nikolay: A lot of text.

Michael: Oh, sure.

OK.

I didn't even think about that.

Nikolay: Like, I'm not a person
who tries to be pedantic.

I gave up; for example, my scientific mentor distinguished between
database systems and databases, saying you cannot say Postgres is a
database, you must say Postgres is a database system or database
management system.

I already gave up.

You cannot fight with old people.

They will say it's a good database, meaning a database system.

So it's okay to say long transactions.

It's not okay for my texts I write
and like products we build.

There we always pursue precision
and we say long-running transactions

right but at a very high level
if you have long-running transactions

there are 2 dangers at least actually
there are maybe a few more

but 2 main dangers are some locks
are acquired.

And as we know, like this is a
simple rule to remember, we talked

about this so many times locks
are released only at the end of

transaction.

They cannot be released before
rollback or commit, that's it.

So it means that transaction, if
it lasts long, accumulates a

lot of locks and even access share
locks can be harmful, right?

Sometimes.

Yeah,

Michael: even if the statement that acquired the lock was short,

if it's followed by statements that are slower, it's not...

Nikolay: The

Michael: lock is acquired, that's it.

Exactly.

Nikolay: The lock is held until the very end of the transaction.

It might block someone like DDL.

DDL might block other SELECTs, as we also discussed many times.

And this is a big surprise; this two-step effect can actually put
the whole server down.

This is a big deal.

So if you just selected something and keep the transaction open

for long, you just, for example, should avoid changing schema.

Or you have a dump happening.

Very common situation.

There's a dump happening or something like initialization of

logical replica, which lasts a couple of hours, for example.

And at the same time, we have deployment trying to change the

DDL, trying to change schema, right?

And yeah, this is a very dangerous situation, because if you don't
have lock_timeout and retries, which we also talked about a lot,
these cases can cause downtime.

So lock acquisition plus holding these locks for long.

This is a danger number 1.

And danger number 2 is how MVCC in Postgres is organized.

MVCC is multi-version concurrency control, for new folks; I know we
have new folks all the time, and specifically there is the vacuum
problem.

So to clean up dead tuples promptly, server should be in good

state.

And if you keep the transaction long, in some cases, not in all,

as we will discuss now, your long-running transaction might block,

not might block, it will block.

If, for example, you inserted a row and just keep the transaction
open, this is a real transaction.

It has a real transaction ID, right?

This will definitely block vacuum, and vacuum will need to skip the
tuples which became dead after your transaction started.

Because, and this is global, this is biggest surprise here for

people, as I see, to realize this is global.

You inserted into 1 table and keep the transaction open.

But vacuum cannot clean any table.

It can clean only dead tuples which became dead before, because of
xmax; this is a hidden column in every row, and you can SELECT xmax
from any table.

xmax for dead rows is filled with the ID of the transaction which
deleted this row version, with UPDATE or DELETE.

And if this xmax is in the past compared to your transaction

ID, which holds this snapshot, right?

This will be possible to delete, but if it's later, Postgres

thinks maybe your transaction will need it.
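A quick way to see these hidden columns, sketched in a scratch database (the table name t is just for illustration):

```sql
CREATE TABLE t (id int);
INSERT INTO t VALUES (1);

-- xmin: the transaction that created this row version;
-- xmax = 0 means no transaction has deleted or updated it yet.
SELECT xmin, xmax, id FROM t;

-- After an UPDATE or DELETE elsewhere, the old row version's xmax
-- is filled with the deleting transaction's ID.
```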

Michael: Yeah, it could still, if that transaction wants to read

that row, let's say the 1 that got updated or deleted, it needs

to return the old version.

Yeah.

According to its snapshot, so it's still needed, or theoretically,

still needed.

It's a feature, right?

Nikolay: It's a feature, but also we need to think about transaction

isolation levels here.

Michael: True.

Nikolay: Because by default, so if you have a transaction ID,

so basically in pg_stat_activity, you can see the column

backend_xmin, which defines this horizon that your transaction needs,

this snapshot, it defines this snapshot.

And if you have repeatable read, which is more than its name
suggests, it's already a higher isolation level.

It's stricter.

It's snapshot isolation, in fact.

And it's not default, but it's used during things like dumping.

pg_dump will use repeatable read.

It opens its transaction at the repeatable read level.

And also logical replication initialization.

And sometimes in multiple threads, multiple connections.

So in this case, this snapshot needs to be held because this

transaction which is running in repeatable read needs this snapshot

for consistency.

All Queries will need to read from this very snapshot.

So all tables you read during dumping must be from the same moment,

right?

While in traditional, like, default transaction isolation, which

is read committed, it's shifting.

Snapshot is shifting, so it's not being held.

And if you, for example, open transaction, SELECT from 1 table,

SELECT from another table, SELECT from another table, you will

see that this backend_xmin is shifting, right?

Which means that our horizon is shifting and we're not blocking

vacuum like we could if we were dumping or just using a
transaction at repeatable read.
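The experiment described here can be sketched with two psql sessions (column names as in the discussion):

```sql
-- Session 1: the snapshot (and backend_xmin) is pinned for the whole
-- transaction at this isolation level.
BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
SELECT count(*) FROM pg_class;
-- ...keep the transaction open...

-- Session 2: watch backend_xmin. For the repeatable read session it stays
-- constant between statements; for a read committed session it shifts
-- with each new statement.
SELECT pid, backend_xmin, age(backend_xmin)
FROM pg_stat_activity
WHERE backend_xmin IS NOT NULL;
```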

Which means that not all long-running transactions are harmful to
vacuum.

Right?

Does it make sense?

Michael: Yeah, at least from a bloat perspective.

So the, actually, probably from everything in terms of vacuum,

I was thinking there might be an exception around like transaction

ID wraparound, for example, if we had a very long running transaction,

but if it didn't have a transaction ID, then actually that's

not gonna be an issue for a transaction
ID wraparound either.

Nikolay: That's an important distinction, because if we inserted
something, or we ran txid_current(), or pg_current_xact_id() in new
versions, I'm constantly forgetting, I'm old.
So you have allocated real XID.

It's 1 story, but if you're just
reading and you don't have,

your transaction doesn't have allocated
XID, it has only basically,

there is concept of virtual XID,
which is optimization to save

space because we have only 4 byte
transaction IDs, right?

So it's like, it's not a lot these
days.

And since most of your queries, most of your transactions, are
read-only transactions, it would not be wise to let them consume
from this space, right?

Yeah.

So there is an optimization with virtual transaction IDs, and purely
reading transactions only consume from that space, right?

Virtual transaction IDs.

So we don't have a real XID there, which means that you need to
distinguish these topics, and my point is that not all long-running
transactions are harmful.

But at the same time, if you're
just reading, but this reading

itself is still happening, you
are reading from a huge table,

it lasts long, definitely backend_xmin
will be like some value

not changing in pg_stat_activity.

Even a reading transaction in read committed, in this case, will be
harmful.

So I'm talking only about cases
when we read something And we

either read something else constantly,
small reads, right?

Or we read and sit in idle transaction
state, not doing anything

and not closing this transaction.

Michael: Yeah.

Are you talking, is this a bit
of an edge case?

Like who's actually doing this
a lot?

Like when does it happen?

Nikolay: What exactly?

Michael: Like a small read and
it doesn't get committed.

Nikolay: Oh it happens all the
time.

The number 1 reason is humans.

So they...

Michael: Like opening an editor
or something.

Nikolay: Yeah.

In psql or even in more cases some
graphic interface, right?

They open a transaction in production, read something, and then
maybe also change something, right?

More like usually when you are
reading you are not opening transaction.

Like when you're just exploring
stuff, you run some queries to

understand the state of database.

Usually you don't use begin-commit
transaction block.

But when you try to write, sometimes
you do.

And abandoned transaction from
humans, so common.

It's very common.

So just abandoned, forgot to close.

In some cases, this is called turning off auto-commit, right?

So you, in some editors, you can say, OK, I don't like auto-commit,

I will turn it off.

Then you start working but not committing, and then this is abandoned

transaction, so it doesn't do anything.

And if it holds this backend_xmin in pg_stat_activity, this is
what starts blocking autovacuum.

And in the first few hours, maybe, you won't notice.
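A sketch for hunting such abandoned sessions:

```sql
-- Sessions sitting "idle in transaction" while pinning the xmin horizon:
-- typical for abandoned human sessions (psql, GUIs with autocommit off).
SELECT pid, usename, application_name,
       now() - state_change AS idle_for,
       age(backend_xmin)    AS xmin_age
FROM pg_stat_activity
WHERE state = 'idle in transaction'
  AND backend_xmin IS NOT NULL
ORDER BY state_change;
```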

Michael: So yes, I understand the dangerous case.

I'm talking about the non-dangerous case.

The opening a transaction, begin, then a SELECT.

You talked about just doing a read query, not doing an UPDATE.

When would that happen?

Nikolay: In the same situation, it might happen as well.

I'm just like, I was trying to understand all the possible cases.

In this case, it's not.

Michael: Well, I think, yeah, but also, I just don't think it

happens in the same way.

Like, I understand doing begin, UPDATE, and then you've got the

option to ROLLBACK if you're scared or you're traumatized maybe

from doing an UPDATE once and then seeing you're expecting 1

row updated and then you see 30,000 rows updated or something

like that maybe You start using transactions explicitly so you

can ROLLBACK to...

Nikolay: I get what you mean, yeah.

Michael: But who's doing that for SELECT queries?

Like I don't...

I don't know.

Nikolay: Our purpose is to be complete, right?

To cover all possible...

Michael: No, I understand.

So this is why I'm asking, Is this an edge case?

Nikolay: Yeah, usually I think if I have a bunch of SELECTs and

I combine them to transaction, I would probably use repeatable

read, so all of them are consistent.

That's why I combine them, right?

This is like, basically like pg_dump does, right?

If I don't, if I use transaction, but I don't use repeatable

read, why do I use transaction in the first place?

It's just a bunch of SELECTs.

Maybe to...

Just like what comes to my mind.

Maybe to make sure schema won't be changed.

To block DDL.

Michael: Or to deal with the pooler maybe?

Like to make sure you stay with the same.

Nikolay: Yeah, maybe there might be some cases and yeah.

Another very common situation when we see, so there's this term,

xmin horizon, right?

You can see it in some places in the source code, but I think you
won't be able to find it in the documentation.

I checked long ago; I think it's a very unofficial term.

You can see it in some blog posts, but again, I feel it's not a
fully understood topic, right?

So what is xmin horizon?

And I was trying to understand
like all aspects of it, how to

properly measure.

And my question is like, why is
not properly measured by monitoring

tools?

Because it's not.

Check all popular monitoring tools, excluding the PostgreSQL
monitoring tool, right?

And you won't be able to see it; they just say, okay, there is a
long-running transaction lasting this amount of time.

Okay, have you excluded transactions
which are not harmful or

you included them?

What's the purpose of this chart?

Are we talking about danger related
to locking or are we talking

about danger related to vacuuming?

Unclear.

My point is, if we start distinguishing,
if we start thinking

about xmin horizon, we need to
understand what it is, and how

to measure it, how to properly
understand it.

So I came up with the idea, we
have 4 sources.

First is a pg_stat_activity,
backend_xmin, we just select

the oldest.

Actually, you can apply the function age() to backend_xmin, which
will give you the difference, just an integer difference from the
current transaction ID.

Yeah, this function understands that we have wraparound of the
space, right?

So it will give you a number of transactions, the age in terms of
number of transactions, not in terms of seconds.

The second very, very common situation,
super common.

You created a logical slot and
either haven't started using it

or abandoned or it's lagging.

So if you check pg_replication_slots, it also has a column called
xmin there.

And you also can use age.

So it's another very popular reason
to have something lagging.
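A sketch of the slot check (for logical slots, catalog_xmin matters as well):

```sql
SELECT slot_name, slot_type, active,
       age(xmin)         AS xmin_age,
       age(catalog_xmin) AS catalog_xmin_age
FROM pg_replication_slots;
```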

And we need to distinguish logical
replication and physical replication.

Because as I remember, if you have
physical replication slot

but haven't started using it, I
think it's not harmful.

I might be mistaken.

Logical replication slots are definitely
harmful.

Unused logical replication slots
are definitely harmful.

Michael: Did this change in a recent
version where I think there

might be some new setting where
by default they expire after

a certain age of not being used.

Nikolay: Yeah, there is such setting,
it's just a protection

mechanism and I think we should
talk about protection measures

separately.

Michael: Yes, okay, cool.

Nikolay: I'm just trying to say that physical and logical
replication, replication streams basically, are slightly different
here.

Definitely if you have logical,
if it's unused, it's harmful.

If it's lagging, it's also harmful.

Michael: Makes sense.

Nikolay: It starts harming you
right when it's created.

So you don't want logical replication
slot to be unused or lagging

too much.

Physical is slightly different, and I think, if I'm not mistaken,
if you just create a slot but nothing is connected to it, it won't
hold the xmin horizon.

Moreover, if you have physical
replication, but this replication

is not using hot standby feedback,
on replica it's set to off,

which is actually not common.

Usually people prefer to set it
to on.

Hot standby feedback sends, for
physical application, it sends

to the primary the information
which xmin horizon is needed on

replica, right, on standby.

If it's not sending, it's better,
right, because we don't care.

But this will give you a physical replica which will start lagging
every time it has a long-running transaction of its own.
While hot standby feedback will prevent replication lag, if there
is a long-running transaction on the replica, it will be reported
to the primary, and we will have the same effect as if we had the
long-running transaction on our primary; this dilemma makes it
really hard to have long-running transactions for analytical
purposes on regular physical nodes, right?

And you cannot offload it to replicas.

Only if you can afford replicas
which are lagging for hours,

for example.

Anyway, we have logical and physical
replication which can also

affect us.

And we have fourth component, which
is prepared transactions.

There is the pg_prepared_xacts system view you can check; I don't
remember the column name, maybe transaction or something.

But if you use 2PC, right, prepared transactions, in this case you
have them, right?

In this case, some of them may affect the xmin horizon.
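A sketch of the prepared-transactions check:

```sql
-- Abandoned two-phase-commit transactions also pin the horizon.
SELECT gid, prepared, owner, database,
       age(transaction) AS xact_age
FROM pg_prepared_xacts
ORDER BY prepared;
```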

So I found 4 sources for our analysis: we dig information from all
of them, take the ages, and then just take the greatest of the ages.

There is such a function, greatest(), right?

And this will be our true age of
our xmin horizon.
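A sketch of combining the sources into one number (hot standby feedback surfaces through physical slots and walsender backends in pg_stat_activity, so it is covered indirectly):

```sql
SELECT greatest(
  (SELECT max(age(backend_xmin)) FROM pg_stat_activity),
  (SELECT max(age(xmin))         FROM pg_replication_slots),
  (SELECT max(age(catalog_xmin)) FROM pg_replication_slots),
  (SELECT max(age(transaction))  FROM pg_prepared_xacts)
) AS xmin_horizon_age;
```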

What is big, what is small?

This is hard to understand.

When I think about it, what is
big?

Is 1,000 too much or 1 million
too much or 1 billion too much?

A billion is definitely too much, because we have 2.1 billion as
basically the whole capacity for XID wraparound, right?

But we also, there's also MultiXact
mechanism additionally,

but let's not go there.

Billion is definitely too much.

Is 1,000,000 too much?

In heavily loaded systems, I like to think about what tables we
have and how many rows they have.

And how many rows, how many dead tuples, dead row versions, we
consider too much.

Usually, for heavily loaded OLTP systems, we prefer to make
autovacuum very aggressive.

The default settings, autovacuum_vacuum_scale_factor and
autovacuum_analyze_scale_factor, are 20% and 10%, the 2 common
default numbers.

It's too much.

If we have hundreds of millions or billions of rows, 10% is way
too much.

So we usually set, okay, like 1%, so 0.01, or even half a percent.

Sometimes we go even lower and make it delete the dead tuples more
often.

Now you already have some hint: okay, I have like a 10 million row
table and I have a 1% scale factor.

It means that at 100,000 dead rows, autovacuum should start
cleaning up.
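A sketch of such tuning (the table name is illustrative; the values are examples, not recommendations for every system):

```sql
-- Globally: 1% instead of the default scale factors.
ALTER SYSTEM SET autovacuum_vacuum_scale_factor  = 0.01;
ALTER SYSTEM SET autovacuum_analyze_scale_factor = 0.01;
SELECT pg_reload_conf();

-- Or per table, for the biggest ones:
ALTER TABLE big_table SET (autovacuum_vacuum_scale_factor = 0.005);
```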

And then, I think, okay, xmin horizon
is lagging that much.

Okay.

Is it lagging 1,000?

It's okay.

We don't care.

Our basically actual horizon when
autovacuum starts, we expect

work from it, 100,000, roughly.

But the problem is that we have
different tables, small, big,

right?

So we need to think about core
tables and like nature of our

database to think how much of xmin horizon,
like xmin h is too

much for us, right?

Michael: I think there are 2 approaches; you're talking almost like
a bottom-up approach, right?

Working out how much bloat we can
tolerate.

Nikolay: But yeah, how many dead tuples we can tolerate; it's not
yet bloat.

As we remember, if we allow too many dead tuples, autovacuum is the
converter of dead tuples to bloat; it will come and convert our
millions of dead tuples to bloat, but it's not bloat yet.

At least in my terminology, I wish
we had official terminology

for bloat.

Like for dead tuples it's clear, but for bloat, we don't have it.

Michael: The more dead tuples we
leave around, the worse bloat

issues can get, like in terms of
like fragmented indexes for

example, in terms of hot updates,
if we've got dead tuples that

we

Nikolay: can't remove.

Hot updates might benefit from
bloat because they have space.

Michael: No, because it's not space
yet if you can't reclaim

it.

If it's not been processed by vacuum.

Nikolay: But then eventually a
vacuum will happen.

Michael: Eventually.

And it

Nikolay: will leave a lot of gaps
and these gaps might be reused.

Michael: Maybe, yeah.

Nikolay: There are effects in opposite directions here.

Michael: But the point I was trying
to get to was often people

are monitoring this in order to
prevent transaction ID wraparound

so the top-down approach could be: how much time do we have?

Okay, 1,000,000 might not be that bad in terms of getting towards
2,000,000,000, but if 2,000,000,000 could accumulate within an hour
or 2, if you've got a ridiculously high rate, then actually you
might want an earlier alert or a lower threshold.

Because as soon as you find out that it's going above a healthy
number, or above a not-normal number for your system, you're on a
stopwatch; you've only got 2 hours, you've got to fix this problem.

So it might be that you're thinking
in terms of top down as well.

Nikolay: Yeah, for example, if I see the age of this is 100 million,
it's already, wow, it's a lot; it's already like an emergency.

You

Michael: might only, yeah, exactly.

You might only have a couple of
hours at that point.

Nikolay: Yeah.

Speed is interesting.

I wish we had also good, again,
monitoring systems, like proper

XID consumption, XID growth
rate metric to be covered

properly.

If you think, okay, this is the xmin horizon, some numbers; okay, I
know it's in transactions, but you're right, it's hard to map it to
real time.

But here I see, okay, we consume 2 billion transactions per week.

It's quite a loaded database, in fact, right?

Real transaction IDs.

In this case, I see peaks during busy hours; I understand my XID
growth rate.

And this is helpful for me.

I understand, okay, this is, I
can already start playing with

this.

I think we could go even deeper.

I just don't see proper analysis
tools in this area developed.

These are my thoughts, and from people I work with and talk to, and
I see huge potential here.

So it could be much better visualized,
presented for analysis.

Michael: And as you say, separated
out in terms of things that

are risky in terms of locks and
things that are risky in terms

of...

Nikolay: Oh yeah, locks are completely
different.

Michael: I'm just going to keep
saying block.

Yeah.

But that's what I mean, you're
talking about different alerts,

different charts, different...

Yes.

Nikolay: So I would have alert,
and we actually do have alerts

for xmin horizon based on understanding
this particular system

with its growth rate and like table
sizes, right?

And the behavior of vacuum, autovacuum.

And we set it, for example, I think
100 million is a huge number.

We should set it lower and already
start investigating what blocks

our xmin horizon.

And this is the right direction
to understand.

And we also can check logs if you enable autovacuum logging.

We usually aim to log, if not all, then at least autovacuum runs
taking above 1 second.

So log_autovacuum_min_duration,
1 second.

Sometimes we set it to 0.

There is some risk that you might see a lot of entries and have an
observer effect, but usually at least 1 second is beneficial.
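The setting being discussed, as a sketch:

```sql
-- Log every autovacuum run taking 1 second or more; the log lines include
-- dead tuples that were "dead but not yet removable" due to the horizon.
ALTER SYSTEM SET log_autovacuum_min_duration = '1s';
SELECT pg_reload_conf();
```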

Again, I think the default was 10 minutes.

It's ridiculously huge.

Like you don't see a lot.

And I think it was, it was decreased
recently.

Right?

Oh, cool.

Yeah.

So what I'm talking about here: if you have logging for autovacuum
behavior, it reports the xmin horizon, right?

The transaction ID was this, the xmin horizon was that, it's this
number behind.

So you might extract information
from those logs.

And also it reports how many dead
tuples it would remove, but

couldn't because xmin horizon is
blocking.

So autovacuum messages and logs
are extremely helpful for this

type of risk analysis.

Michael: I think it might be just,
I think it's still 10 minutes.

Nikolay: Maybe, maybe.

Michael: Wow.

All right.

Nikolay: So this is about xmin, the vacuum behavior, the xmin
horizon, and this concept again; I think the documentation could
be fair that these concepts exist; maybe someone should propose to
define this, because everyone suffers from bloat.

And in my opinion, everyone working with Postgres deeply enough
needs to understand the concept of the xmin horizon, to start
distinguishing it from long-running transactions, right?

And the second problem, actually the first problem historically, is
lock acquisition; what to say about it?

Yeah,

Michael: I think we should point people.

I think we did a good episode on that specifically.

But yeah, maybe that's a good excuse to switch over to mitigations

though, because...

Nikolay: Mitigation.

Yeah, 1 point I wanted to mention, there are some exotic problems

also, which might happen from long-running transactions.

Michael: Oh yeah, sure.

Nikolay: I'll point to my subtransactions post explaining 4 problems
with subtransactions, and 1 of them happened to GitLab.

And we had an episode about it, I think, right?

It was happening when you have long-running transactions.

I think when you actually hold this xmin horizon on the primary,

hold snapshot, or maybe locks.

This is a question, actually.

I have some homework to do: to return to that material and
understand which of the 2 problems happened there.

So long-running transactions on the primary and high TPS replica

started to suffer at some point because of sub-transactions.

Once you close long-running transactions on the primary, suffering

ends.

So this is exotic already, exotic problems where I need to understand

is it lock-related or snapshot-related.

Michael: I also think replicas are an interesting thing in general

in terms of long running transactions.

So long running transactions on the primary and whether they

affect the replica and then also long running transactions on

the replicas and even when they can affect the primary.

Nikolay: ...a replica if hot standby feedback

is turned on, as we discussed.

Exactly.

Can report to...

They will report to the primary and can have the same effect.

Michael: Yeah.

So...

So I think that's not always something people are aware of.

But yeah, in terms of mitigations, it feels like timeouts are our
friends in a lot of these cases, and monitoring; we've brought up
monitoring a lot, right?

Monitoring, alerting, even in extreme
cases, killing certain

problematic long-running transactions.

Nikolay: What did you want

Michael: to talk about in terms
of mitigations?

Nikolay: So, yeah, of course if
you can get rid of all the long

running transactions, both dangers
don't affect your database

health, which is great.

But sometimes we need long-running
transactions, sometimes we

need to accept some lag in replication
slots.

By the way, what caused me to start distinguishing is that something
is not right.

We talk about long-running transactions, but the same problem,
vacuum being blocked; vacuum is not fully blocked, you understand,
but it cannot delete some dead data.

It's not only about long-running
transactions, it's also about

slots, it's also about prepared
transactions.

What's happening?

Okay, prepared transactions, they
are at least transactions,

but slots, right, or replication,
like anyway.

Hot standby feedback.

Okay, hot standby feedback is also
about transactions that just

happen on standby, but just unused
logical slot.

So it's a wrong explanation of
some problem.

Problem is xmin horizon is frozen
it's not progressing right?

Michael: Not progressing yeah.

Nikolay: Maybe it's progressing
but too slow also might happen.

Michael: Yeah so if it's progressing
slowly though it's just

blocked at a new point, right?

Like it's...

Nikolay: Yeah, if it's progressing
too slow, your horizon of

XID, not horizon, but actual XID
is already too far.

So basically what we should look at is the age of the xmin horizon,
the age of the oldest xmin needed.

Yeah, this is all we need.

And if it's too far in the past,
this is a problem.

And the reasons of this, as I said,
4 reasons.

And mitigation is simple.

Don't allow it to be too far in
the past.

How?

First of all, in OLTP, I very highly recommend being aggressive
with timeout settings globally.

I know the documentation says it's controversial and so on, but if
an HTTP server can afford 30 or 60 seconds as a maximum for OLTP
requests, why can't we, the database, afford at least the same
for all regular traffic?

Of course, for CREATE INDEX CONCURRENTLY,
REINDEX CONCURRENTLY,

we need to set it to 0.

What to set?

statement_timeout and 2 other things,
right?

If you think about a transaction, at any moment either a statement
is running or it's idle in transaction: statement running, then
idle in transaction.

So it's either statement or idle-in-transaction, and there is the
whole transaction.

So basically we have the whole thing and 2 phases which switch
after each other, right?

It means we can limit the whole transaction, or the statement, or
the idle-in-transaction state.

And for all 3, we have setting, right?

And transaction_timeout.

Michael: Only since recently though, only since version 17.

Nikolay: We had an episode about transaction_timeout because

it was my idea.

I was super surprised.

I was walking around thinking, why it doesn't exist?

Like, maybe I don't understand Postgres enough after 15 years

of experience.

Then I told it to Andrei Borodin and he implemented it.

Great.

In Postgres 17 it exists.

So in all new Postgres we have transaction_timeout so we can

finally limit the whole thing because before we could limit only

statement and idle_in_transaction_session_timeout.

Yes.

And if you have a series of brief statements and brief waiting

periods between statements, you could have hour or day-long transaction

easily.

Now you can limit the whole thing.

Limiting the whole thing is helpful.
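A sketch of the three limits together (the values are examples; sessions doing maintenance like CREATE INDEX CONCURRENTLY should override them locally):

```sql
ALTER SYSTEM SET statement_timeout = '30s';
ALTER SYSTEM SET idle_in_transaction_session_timeout = '30s';
ALTER SYSTEM SET transaction_timeout = '60s';  -- Postgres 17+
SELECT pg_reload_conf();

-- In a maintenance session:
SET statement_timeout = 0;
SET transaction_timeout = 0;
```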

Unfortunately, we cannot limit what we actually would like to
limit: the xmin horizon.

Maybe we should have one; this idea came to my mind just now.

Honestly, I need to think about it.

We could probably have an xmin horizon maximum, after which we start canceling transactions, abandoning prepared transactions, like deleting them, and maybe killing some applications.

Yeah, yeah.

Yeah, why not?

Like a 100 million limit, an ultimate protection mechanism.

And in this case, you don't need to map what's happening with your server in terms of seconds to the actual problem: how many transactions you are willing to lag, right, for your garbage collection.

Garbage collection is vacuum, right?

So maybe we should have those settings; maybe it's too many settings at the same time, but it would be flexible.

I would say 100 million is my maximum, or 10 million is my maximum,

regardless of cost.
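No such "xmin horizon maximum" setting exists in Postgres; the enforcement Nikolay is imagining could be sketched externally roughly like this. Everything here is hypothetical: the session data is simulated, and in reality it would come from pg_stat_activity (backend_xmin), pg_replication_slots, and pg_prepared_xacts:

```python
# Hypothetical external enforcement of an "xmin horizon maximum":
# find everything holding the horizon back by more than a chosen
# ceiling, measured in transactions. Simulated data, made-up numbers.

MAX_XMIN_LAG = 10_000_000  # our chosen ceiling, in transactions

current_xid = 905_000_000
holders = [
    {"kind": "backend", "pid": 4242, "xmin": 904_990_000},      # fine
    {"kind": "backend", "pid": 4243, "xmin": 894_000_000},      # ~11M behind
    {"kind": "prepared_xact", "gid": "tx1", "xmin": 700_000_000},  # abandoned 2PC
]

def offenders(current_xid, holders, max_lag=MAX_XMIN_LAG):
    """Return holders dragging the xmin horizon beyond the ceiling."""
    return [h for h in holders if current_xid - h["xmin"] > max_lag]

for h in offenders(current_xid, holders):
    # Real enforcement would be pg_terminate_backend(pid) for sessions
    # and ROLLBACK PREPARED for abandoned two-phase transactions.
    print("would cancel:", h)
```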

Michael: Even if Postgres shipped
with conservative defaults

there, it would be fine.

Even if it was like 100,000,000 as a default, it would be fine for most cases, wouldn't it?

Nikolay: I'm curious, maybe this
idea was discussed, maybe it's

not a good idea.

It just came from my mind.

It's interesting.

If you have an opinion while listening to us, leave it as a comment on YouTube, or somewhere like Twitter or LinkedIn, I don't know.

Anyway, this is interesting.

But what we can do, we can protect
with transaction_timeout,

statement_timeout, and idle_in_transaction_session_timeout.

What else?

We should, we must actually, monitor
lags of replication slots,

especially logical replication slots, but physical as well if we use hot_standby_feedback.

And we should not allow lagging
too much.

And finally, we should be very
careful with 2PC and abandoned

prepared transactions.

That's it.

So only these 4 areas.

Michael: Yeah, when you say monitoring,
you mean alerting as

well, right?

Nikolay: Yeah, so if some replication slot is unused, a logical replication slot, or it's lagging too much... Again, this is interesting.

It will also be either in bytes or in seconds.

We should think about seconds, and how we need to map them to transactions, understanding our XID growth rate.

So there are different metrics
in play here.

So to understand health properly,
it takes time.

And if you have cattle, not pets, you should have simple rules: if some slot is lagging too much...

By the way, a slot lagging too much, that's measured in bytes, right?

We cannot limit it in seconds of lag.

There are bytes, seconds, and then XID growth, the number of transactions.

So 3 metrics. There is a triangle here.

Interesting.
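The triangle of bytes, seconds, and transactions can be sketched as simple conversions, provided you know your workload's rates. All numbers below are made-up illustrations, not recommendations:

```python
# The three slot-lag metrics form a triangle: you can translate
# between them if you know your workload's growth rates.
# Hypothetical rates; measure your own from monitoring.

wal_bytes_per_sec = 5 * 1024 * 1024   # ~5 MiB/s of WAL generated
xids_per_sec = 2_000                  # transaction ID consumption rate

def lag_seconds_from_bytes(lag_bytes):
    return lag_bytes / wal_bytes_per_sec

def lag_xids_from_seconds(lag_seconds):
    return lag_seconds * xids_per_sec

# A slot lagging by 18 GiB of WAL:
lag_bytes = 18 * 1024**3
secs = lag_seconds_from_bytes(lag_bytes)   # about an hour of lag
xids = lag_xids_from_seconds(secs)         # millions of XIDs held back
print(round(secs), round(xids))
```

The same conversions work in reverse: pick the metric you want to alert on (say, a maximum of 10 million XIDs held back) and derive the equivalent byte and time thresholds for your rates.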

Anyway, this is to protect from
affecting vacuum.

You can also reactively just monitor
what the vacuum reports

in logs.

And it will start to complain clearly,
right?

So you can set threshold based
on logs and if you have some good

log analysis system, you could
set up alerts from there.

And as for locks, for me, it's a completely different topic.

Like yeah, long-running transactions contribute to it, but problems might happen very fast.

Whereas with the xmin horizon being held back, you might not notice it at all.

Okay, say you blocked it for many, many hours.

Maybe it was needed just that one time.

It won't affect you so much.

The negative effects are postponed and they grow; it's not a cliff.

It's slow degradation.

Unlike the situation with locks; with locks it might be very fast.

As I said, the classic example: you have the weakest heavyweight lock possible, an ACCESS SHARE lock on the table.

You block DDL, DDL blocks other SELECTs, boom.

What to do here?

We need to use proper DDL deployments.

We need a low lock_timeout and
retries.

This is number one thing I would
recommend.

And the number two thing is to craft schema changes properly to avoid long-running exclusive locks first of all, but also ACCESS SHARE locks at some point.

They are not harmful unless you have conflicting sessions, and detecting conflicts sooner is helpful.
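The low-lock_timeout-plus-retries pattern can be sketched like this. The DDL call here is simulated (a stand-in for running something like `SET lock_timeout = '100ms'; ALTER TABLE ...` through a real driver), and the contention behavior and timings are made up:

```python
import time

# Sketch of the "low lock_timeout plus retries" DDL deployment pattern.
# try_ddl simulates an ALTER TABLE attempt that hits lock contention
# on the first few tries; real code would use a Postgres driver.

def try_ddl(attempt, fail_until=3):
    # Simulated: the lock is contended for the first few attempts.
    if attempt < fail_until:
        raise TimeoutError("canceling statement due to lock timeout")
    return "ALTER TABLE done"

def deploy_with_retries(max_attempts=10, base_sleep=0.01):
    for attempt in range(max_attempts):
        try:
            return try_ddl(attempt)
        except TimeoutError:
            # A short lock_timeout means we fail fast instead of queueing
            # behind a long-held lock and blocking everyone behind us.
            time.sleep(base_sleep * (2 ** attempt))  # exponential backoff
    raise RuntimeError("gave up after max_attempts")

print(deploy_with_retries())  # ALTER TABLE done
```

The key design point is that each failed attempt releases the lock queue quickly, so concurrent SELECTs never pile up behind the waiting DDL.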

Another thing, which is actually a common problem with Postgres: a vacuum which blocks us; vacuum blocks us when we do DDL.

We cannot acquire an ACCESS EXCLUSIVE lock on the table to change its schema if vacuum is happening.

But fortunately, in most cases autovacuum is running in normal mode and it will cancel itself after, I think, a second.

Michael: Do you mean autovacuum
or do you mean anti-wraparound?

Autovacuum.

Nikolay: I'm talking about autovacuum, good correction.

But if it's running in transaction ID wraparound prevention mode, which happens often if you have high XID growth (related topics, right?).

If we insert a lot, not in batches but as single rows, for example, we consume XIDs very quickly.

Savepoints can also contribute a lot to XID consumption and the XID growth rate.

In this case we probably have a problem: if XID growth is high, autovacuum in transaction ID wraparound prevention mode happens more often.

It might happen, for example, every two days in a very busy system.

In this case, it happens every
two days, but we aim to have deployments

five times per day.

We are a very rapidly developing startup.

In this case, this might clash.

Like you have an autovacuum running in transaction ID wraparound prevention mode, and you try to deploy DDL.

Even if you have retries, and you must have a low lock_timeout and retries...

You try to deploy it, but autovacuum runs on your terabyte-size table and blocks you.

This is super annoying.

You cannot deploy something you
need to deploy right now.

What to do about it?

Question is open.

Michael: Well, you've talked a
lot about partitioning, so the

tables are smaller and the vacuums
are quicker.

But I also think (at the time I thought it was a good thing) that the threshold where an anti-wraparound vacuum kicks in is relatively low in the grand scheme of things, in terms of the risk.

I would seriously consider, if
it's happening every 2 days, that

seems like a bad smell to me.

That feels like something that
shouldn't be.

Like, let's make sure normal vacuum
happens enough so that it

doesn't have to run in anti-wraparound
mode.

That would be my approach.

Nikolay: Yes.

I'm not against this.

Like, it's good that it's happening
often.

And freezing is good when it's done once and pages stay frozen.

So partitioning is helpful not only because tables are smaller, but because more pages are all-frozen and we don't touch them anymore; they are like our archive.

This makes things much more efficient compared to a huge unpartitioned table, where at any time a page can receive a new tuple and it's no longer all-frozen in the visibility map, right?

The visibility map has 2 bits per page: all-frozen and all-visible.

And boom, autovacuum needs to take care of this page again.

So the problem, for me, is not that it's happening too often.

The problem is this threshold.

It's okay to have it low and make freezing very frequent.

Good partitioning is great, but sometimes I wish it were easier to raise this bar temporarily while I'm doing my changes and then lower it back.

But as far as I remember, and I might be mistaken here since I last checked this topic maybe a couple of years ago, it's not so easy.

You cannot change it globally so easily.

It requires restart, I think.

Maybe I'm mistaken again.

And at the table level, there is also some caveat.

I don't remember which, but I didn't see a good way to...

Okay, for each deployment, we want...

And also interesting, you know: you kill that autovacuum which runs in transaction ID wraparound prevention mode, and if you don't acquire the lock on the table fast enough, it will kick in again.

It might kick in especially quickly if you have many workers and the naptime is low, so it's an interesting problem as well.

So raising this threshold was not easy, as I remember.

So I came up with 2 ideas.

The first idea: if this is a big deployment, like a major upgrade or something, because it's a problem during major upgrades.

We had it during major upgrades, when an autovacuum running in transaction ID wraparound prevention mode was a problem for us, or during some major changes.

If it's a rare change, we probably should forecast when autovacuum will start freezing.

And do it before that.

Michael: I like that.

Nikolay: Yes, yes.

Do it before.
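The forecast is back-of-the-envelope arithmetic. A sketch with hypothetical numbers (in practice, table age comes from `age(relfrozenxid)` and the threshold from `autovacuum_freeze_max_age`, and the growth rate from your monitoring):

```python
# Forecasting when anti-wraparound autovacuum will kick in for a table,
# so VACUUM (FREEZE) can be run proactively before a deployment window.
# All inputs below are assumed/hypothetical.

autovacuum_freeze_max_age = 200_000_000  # Postgres default threshold
table_age_xids = 155_000_000             # current age(relfrozenxid)
xid_growth_per_hour = 1_500_000          # observed XID burn rate

hours_until_freeze = (
    (autovacuum_freeze_max_age - table_age_xids) / xid_growth_per_hour
)
print(f"anti-wraparound vacuum expected in ~{hours_until_freeze:.0f} hours")
# With these numbers it lands in ~30 hours, so freezing proactively now
# avoids the clash with tomorrow's DDL deployment.
```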

Another thing is that probably we should just sometimes kill

it ourselves.

But if you want to kill it, you need to do it properly.

You need to issue an attempt to acquire the lock on the table you want to change, and kill the autovacuum immediately.

You kill the autovacuum, your attempt to acquire the lock succeeds, and then good luck to autovacuum trying to prevent transaction ID wraparound while we're doing our job.

This is like...

Michael: Wait, is that so it doesn't start again?

Nikolay: Yeah, yeah.

So it won't start again.

That's it.

So the lock is mine and I'm doing my job.

It feels slightly dirty, right?

But there is no harm if you also monitor XID and MultiXID wraparound risks.

If you know you're only at 10% of capacity, at 200 million, it's fine to go to 300 or 400 million even during a big deployment.

We will be fine afterwards, right?

But this feels dirty, honestly, to do this trick.

I did this trick manually a few times in my career because I

needed to finish deployment and I knew that we are not in danger

zone.
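The "10% of capacity" sanity check is simple arithmetic worth writing down before trying this trick. A sketch with hypothetical numbers:

```python
# Sanity-checking wraparound headroom before deliberately delaying an
# anti-wraparound autovacuum during a deployment. Hypothetical numbers;
# real ones come from monitoring age(relfrozenxid) across the cluster.

XID_SPACE = 2_000_000_000        # roughly 2 billion XIDs before wraparound
current_max_age = 200_000_000    # oldest age(relfrozenxid) observed

pct_used = current_max_age / XID_SPACE * 100
print(f"{pct_used:.0f}% of XID capacity used")

# Letting it drift to 300-400 million during a big deployment keeps us
# around 15-20%, far from the danger zone, so the trick is acceptable.
worst_case_after_deploy = 400_000_000
assert worst_case_after_deploy / XID_SPACE < 0.25
```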

So yeah.

Michael: Yeah, that makes loads of sense to me.

My preference, I think, still would be avoid it happening as

often as that and actually trying to make sure things got frozen

before it was needed.

So it really is a failsafe rather than something that happens in normal operation.

But yeah, super interesting.

We started the episode talking about what we should call it and

I think maybe we should end it with: what do you think now, having talked about it?

Nikolay: Oh.

Yeah, but we try to keep titles very short, right?

We didn't have an xmin horizon episode.

I tend to think this should be called the xmin horizon episode, because this is where we spent most of our time, and it's less about locks.

And I think long-running transactions is the wrong concept to chase.

Michael: Yeah, let's call it xmin horizon, because I think that's gonna be...

Nikolay: But my concern is that people think in terms of long-running transactions and...

Michael: That's fine.

I think long-running transactions is interesting and I actually

don't mind that there's this edge case of long-running transactions

that aren't harmful.

Nikolay: Let's keep long-running transactions because it's easier

to understand what we're talking about.

This is what people expect.

Let's keep the name long-running transactions.

The xmin horizon concept is still not fully settled; it's clear in my head, but I want it to become official.

Maybe it should be explained in the documentation.

Michael: And then that opens us up so we can still do an episode

on that at some point.

Nikolay: Yeah.

Maybe I missed some parts of it, how to properly monitor.

I know some people who develop observability tooling listen to

our episodes.

Michael: Oh cool.

Nikolay: Yeah, I hope it will be improved.

Michael: Nice.

Nikolay: It's like clarity.

And I thank you for listening to this.

I feel I went quite deep.

We went quite deep in this episode.

Michael: Oh, I personally enjoyed it.

Yeah, I hope everyone else did too.

Nikolay: Great, it was great to chat to you as usual.

Michael: Likewise, Nik, take care and see you next time.
