Long-running transactions
Nikolay: Hello, hello, this is Postgres FM.
I'm Nik, PostgresAI.
And as usual, here, Michael, pgMustard.
Hi, Michael.
Michael: Hi, Nik.
Nikolay: I proposed this topic because I thought we have a gap
here.
Before we started recording, you said we have already episode
about transaction_timeout, but it's very related but it's not
the same.
So I wanted to dive deep into long-running transactions.
What do you think?
Michael: Yeah, I was surprised we didn't have an episode on it
specifically already.
We had a first ever episode, didn't we, on slow queries and I
guess long transactions as part of that.
But we hadn't done a specific episode on it.
And I guess as a long-running podcast maybe it's definitely time
to revisit.
Yeah,
Nikolay: sometimes idle in podcasting.
So I would like to challenge us and think by the end of this
recording, we record in advance as usual because some preparation
is needed, post-processing is needed, thank you for doing this
by the way.
So I would like to challenge us and think should this episode
be called long-running transactions or somehow else?
Because you already understand what I'm talking about, right?
Michael: Only because of our pre-discussion did I get a good hint
from you. But is it more the problems they cause that you
want to talk about, and therefore maybe it's only a subset of
long-running transactions that are an issue?
Nikolay: So listeners who listen to this already see the decision.
Did we keep the long-running transactions title, or did we decide
to change it?
But the problem in my head is simple.
I see in monitoring tools, in blog posts, and everywhere, like
everywhere actually, that long-running transactions are bad.
And at the same time, I remember, actually it was an interesting
moment, I had an experimental educational course where I explained
various concepts of database health, and we touched this topic,
how long-running transactions are harmful to autovacuum and I said
you just open transaction and it will be blocking autovacuum and
I said it's so easy to demonstrate let's open a couple of psql
sessions or something like yeah psql vacuum verbose and we will
see it has dead tuples but cannot clean up them because there
is a long-running transaction in progress.
And I just wrote begin, and it didn't work.
It cleaned up dead tuples perfectly.
I think I also tried again with begin and SELECT, doing some
select and keeping transaction
open, being in idle transaction
state according to pg_stat_activity.
And again, it didn't work.
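What was happening can be reproduced roughly like this (the table name is hypothetical); in the default READ COMMITTED level, an idle `BEGIN` holds no snapshot, so vacuum is not blocked:

```sql
-- Session 1:
BEGIN;          -- no snapshot, no XID allocated yet
SELECT 1;       -- read committed: the snapshot is released after the statement
-- ...left sitting in "idle in transaction" state

-- Session 2:
DELETE FROM my_table WHERE id < 1000;
VACUUM VERBOSE my_table;   -- dead tuples are removed: session 1 holds nothing back
```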
And I looked stupid.
And the reason I looked stupid is that we actually have
2 problems and we constantly mix them.
And saying long-running transactions block vacuum is wrong.
Not all of them block vacuum.
Some of them block vacuum.
And we have, in my mind, we should
improve tooling and materials
and everywhere understanding and
start distinguishing 2 problems.
1 is lock-related and 1 is vacuum-related.
We should change monitoring tools
and snippets, everything, to
hunt 2 problems separately.
Diagnose them separately.
Michael: Yeah, should we go back
to the basics though, in terms
of, maybe let's first cover the
long-running transactions that
are harmful, or like can be harmful,
and then reverse our way
backwards to the ones that aren't
and therefore trying to distinguish
between them.
Nikolay: Some people say long transactions,
which I think is
oversimplification because it may
be very long but fast, or it
may be very short but very long
running.
Michael: Wait, what does long mean
if it's not just about the
time?
Nikolay: A lot of text.
Michael: Oh, sure.
OK.
I didn't even think about that.
Nikolay: Like, I'm not a person who tries to be pedantic.
I gave up; for example, my scientific mentor distinguished
database systems from databases, saying you cannot say Postgres
is a database, you must say Postgres is a database system, or a
database management system.
I already gave up.
You cannot fight with old people.
They will say it's a good database, meaning database system.
So it's okay to say long transactions.
It's not okay for the texts I write and the products we build;
there we always pursue precision, and we say long-running transactions.
Right. But at a very high level, if you have long-running transactions,
there are at least 2 dangers; actually, there are maybe a few more,
but the 2 main ones.
One is that some locks are acquired.
And as we know, and this is a simple rule to remember, we talked
about this so many times: locks are released only at the end of
the transaction.
They cannot be released before ROLLBACK or COMMIT, that's it.
So it means that a transaction, if it lasts long, accumulates a
lot of locks, and even ACCESS SHARE locks can be harmful, right?
Sometimes.
Yeah,
Michael: even if the statement that acquired the lock was short,
if it's followed by statements that are slower, it's not...
Nikolay: The
Michael: lock is acquired, that's it.
Exactly.
Nikolay: The lock is held until the very end of the transaction.
It might block someone like DDL.
DDL might block other SELECTs, as we also discussed many times.
And this is a big surprise, and this two-step effect can actually
put the whole server down.
This is a big deal.
So if you just selected something and keep the transaction open
for long, then, for example, schema changes should be avoided.
Or you have a dump happening.
Very common situation.
There's a dump happening or something like initialization of
logical replica, which lasts a couple of hours, for example.
And at the same time, we have deployment trying to change the
DDL, trying to change schema, right?
And yeah, this is a very dangerous situation, because if you don't
have lock_timeout and retries, which we also talked about a lot,
in these cases it can cause downtime.
So lock acquisition, plus holding these locks for long.
This is danger number 1.
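A minimal reproduction of this two-step blocking effect, using a hypothetical table `my_table`, could look like this:

```sql
-- Session 1: an "innocent" read in an open transaction
BEGIN;
SELECT count(*) FROM my_table;   -- takes an ACCESS SHARE lock on my_table
-- ...transaction left open: the lock is held until COMMIT/ROLLBACK

-- Session 2: a deployment runs DDL, which needs ACCESS EXCLUSIVE
ALTER TABLE my_table ADD COLUMN note text;   -- waits behind session 1

-- Session 3: ordinary queries now queue behind the waiting DDL
SELECT * FROM my_table LIMIT 1;  -- blocked too; without lock_timeout
                                 -- and retries this can mean downtime
```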
And danger number 2 is about how MVCC in Postgres is organized.
MVCC is multi-version concurrency control, for the new folks;
I know we have new folks all the time.
Specifically, the vacuum problem.
So to clean up dead tuples promptly, the server should be in a good
state.
And if you keep the transaction long, in some cases, not in all,
as we will discuss now, your long-running transaction might block...
not might block, it will block vacuum.
If, for example, you inserted a row and just keep the transaction
open, this is a real transaction.
It has a real transaction ID, right?
This will definitely block vacuum, and vacuum will need to skip
the tuples which became dead after your transaction started.
Because this is global; this is the biggest surprise here for
people, as I see it, to realize this is global.
You inserted into 1 table and keep the transaction open, but
vacuum cannot clean any table.
It can clean only dead tuples which became dead before, because
of xmax.
This is a hidden column in every row; you can SELECT xmax
from any table.
Xmax for dead rows is filled with the ID of the transaction which
deleted this row version, with UPDATE or DELETE.
And if this xmax is in the past compared to your transaction
ID, which holds this snapshot, right, it will be possible to delete it.
But if it's later, Postgres thinks maybe your transaction will need it.
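These hidden system columns can be selected explicitly; for example, against any (here hypothetical) table:

```sql
SELECT xmin, xmax, ctid, * FROM my_table LIMIT 5;
-- xmin: XID of the transaction that created this row version
-- xmax: 0 for live rows; otherwise the XID of the transaction that
--       deleted this row version (DELETE, or UPDATE creating a new one)
```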
Michael: Yeah, it could still, if that transaction wants to read
that row, let's say the 1 that got updated or deleted, it needs
to return the old version.
Yeah.
According to its snapshot, so it's still needed, or theoretically,
still needed.
It's a feature, right?
Nikolay: It's a feature, but also we need to think about transaction
isolation levels here.
Michael: True.
Nikolay: Because, by default... so basically, in pg_stat_activity
you can see the column backend_xmin, which defines this horizon
that your transaction needs; it defines this snapshot.
And if you have repeatable read, which is not just "repeatable read",
it's already a higher isolation level.
It's stricter.
It's snapshot isolation, in fact.
And it's not the default, but it's used during things like dumping.
pg_dump uses repeatable read.
It opens a transaction in repeatable read.
And also logical replication initialization.
And sometimes in multiple threads, multiple connections.
So in this case, this snapshot needs to be held because this
transaction which is running in repeatable read needs this snapshot
for consistency.
All Queries will need to read from this very snapshot.
So all tables you read during dumping must be from the same moment,
right?
While in traditional, like, default transaction isolation, which
is read committed, it's shifting.
Snapshot is shifting, so it's not being held.
And if you, for example, open transaction, SELECT from 1 table,
SELECT from another table, SELECT from another table, you will
see that this backend_xmin is shifting, right?
Which means that our horizon is shifting, and we're not blocking
vacuum like we could if we were dumping or just using a
transaction at repeatable read.
Which means that not all, not all transaction, long-running transactions
are harmful to vacuum.
Right?
Does it make sense?
Michael: Yeah, at least from a Bloat perspective.
So the, actually, probably from everything in terms of vacuum,
I was thinking there might be an exception around like transaction
ID wraparound, for example, if we had a very long running transaction,
but if it didn't have a transaction ID, then actually that's
not gonna be an issue for a transaction
ID wraparound either.
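The difference is visible in pg_stat_activity; a rough sketch (t1 and t2 are hypothetical tables):

```sql
-- Observer session: watch other backends' horizons
SELECT pid, state, age(backend_xmin)
FROM pg_stat_activity
WHERE backend_xmin IS NOT NULL;

-- READ COMMITTED (default): backend_xmin shifts from statement to statement
BEGIN;
SELECT count(*) FROM t1;
SELECT count(*) FROM t2;   -- the observer sees backend_xmin advance
COMMIT;

-- REPEATABLE READ (what pg_dump uses): the snapshot is pinned
BEGIN ISOLATION LEVEL REPEATABLE READ;
SELECT count(*) FROM t1;
SELECT count(*) FROM t2;   -- backend_xmin stays fixed, holding the horizon
COMMIT;
```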
Nikolay: That's an important distinction,
because if we inserted something,
or we ran the txid_current function,
or pg_current_xact_id, there's a new name, I'm
constantly forgetting, I'm old...
then you have allocated a real XID.
It's 1 story, but if you're just
reading and you don't have,
your transaction doesn't have allocated
XID, it has only basically,
there is concept of virtual XID,
which is optimization to save
space because we have only 4 byte
transaction IDs, right?
So it's like, it's not a lot these
days.
And if most of your queries,
most of your transactions,
are reading transactions, it would
not be wise to make them
consume from this space, right?
Yeah.
So there is an optimization with virtual
transaction IDs, and purely
reading transactions consume only
from that space, right?
Virtual transaction IDs.
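Whether a real XID has been allocated can be checked from inside the transaction (the table name is hypothetical):

```sql
BEGIN;
SELECT pg_current_xact_id_if_assigned();  -- NULL: read-only so far
                                          -- (txid_current_if_assigned() before PG 13)
SELECT count(*) FROM my_table;
SELECT pg_current_xact_id_if_assigned();  -- still NULL: reads don't consume XIDs
INSERT INTO my_table DEFAULT VALUES;      -- the first write allocates a real XID
SELECT pg_current_xact_id_if_assigned();  -- now returns the assigned XID
ROLLBACK;
```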
So we don't have a real XID, which means
you need to distinguish
these cases, and my point is that
not all long-running transactions
are harmful.
But at the same time, if you're
just reading, but this reading
itself is still happening, say you
are reading from a huge table
and it lasts long, then definitely backend_xmin
will be some value
not changing in pg_stat_activity.
In this case, even a reading transaction
in read committed
will be harmful.
So I'm talking only about cases
when we read something And we
either read something else constantly,
small reads, right?
Or we read and sit in idle transaction
state, not doing anything
and not closing this transaction.
Michael: Yeah.
Are you talking, is this a bit
of an edge case?
Like who's actually doing this
a lot?
Like when does it happen?
Nikolay: What exactly?
Michael: Like a small read and
it doesn't get committed.
Nikolay: Oh it happens all the
time.
The number 1 reason is humans.
So they...
Michael: Like opening an editor
or something.
Nikolay: Yeah.
In psql, or even more often in some
graphical interface, right?
They open a transaction on production, read
something, and then maybe
they also change something, right?
Though usually, when you are
just reading, you are not opening a transaction.
Like when you're just exploring
stuff, you run some queries to
understand the state of the database.
Usually you don't use a begin-commit
transaction block.
But when you try to write, sometimes
you do.
And abandoned transactions from
humans are so common.
Very common.
Just abandoned, forgot to close.
In some cases, this comes from turning off auto-commit, right?
In some editors you can say, OK, I don't like auto-commit,
I will turn it off.
Then you start working but not committing, and then this is an abandoned
transaction, and it doesn't do anything.
And if it holds this backend_xmin in pg_stat_activity, this is
what starts blocking autovacuum.
And in first maybe few hours you won't notice.
Michael: So yes, I understand the dangerous case.
I'm talking about the non-dangerous case.
The opening a transaction, begin, then a SELECT.
You talked about just doing a read query, not doing an UPDATE.
When would that happen?
Nikolay: In the same situation, it might happen as well.
I'm just like, I was trying to understand all the possible cases.
In this case, it's not.
Michael: Well, I think, yeah, but also, I just don't think it
happens in the same way.
Like, I understand doing begin, UPDATE, and then you've got the
option to ROLLBACK if you're scared or you're traumatized maybe
from doing an UPDATE once and then seeing you're expecting 1
row updated and then you see 30,000 rows updated or something
like that maybe You start using transactions explicitly so you
can ROLLBACK to...
Nikolay: I get what you mean, yeah.
Michael: But who's doing that for SELECT queries?
Like I don't...
I don't know.
Nikolay: Our purpose is to be complete, right?
To cover all possible...
Michael: No, I understand.
So this is why I'm asking, Is this an edge case?
Nikolay: Yeah, usually I think if I have a bunch of SELECTs and
I combine them to transaction, I would probably use repeatable
read, so all of them are consistent.
That's why I combine them, right?
This is like, basically like pg_dump does, right?
If I don't, if I use transaction, but I don't use repeatable
read, why do I use transaction in the first place?
It's just a bunch of SELECTs.
Maybe to...
Just like what comes to my mind.
Maybe to make sure schema won't be changed.
To block DDL.
Michael: Or to deal with the pooler maybe?
Like to make sure you stay with the same.
Nikolay: Yeah, maybe there might be some cases and yeah.
Another very common situation... so, there's this term,
xmin horizon, right?
You can see it in some places
in the source code; I think you won't
be able to see it in the documentation.
I checked long ago; I think it's
a very unofficial term.
You can see it in some blog posts,
but again, I feel it's not
a fully understood topic, right?
So what is xmin horizon?
And I was trying to understand
like all aspects of it, how to
properly measure.
And my question is: why is
it not properly measured by monitoring
tools?
Because it's not.
Check all the popular monitoring
tools, excluding our own monitoring tool, right?
And you won't be able to see it.
They say, okay,
there is a long-running transaction
lasting this amount
of time.
Okay, have you excluded transactions
which are not harmful, or
did you include them?
What's the purpose of this chart?
Are we talking about danger related
to locking, or are we talking
about danger related to vacuuming?
Unclear.
My point is, if we start distinguishing,
if we start thinking
about xmin horizon, we need to
understand what it is, and how
to measure it, how to properly
understand it.
So I came up with the idea that we
have 4 sources.
First is pg_stat_activity,
backend_xmin; we just select
the oldest.
Actually, you can apply the function
age to backend_xmin, which will
give you the difference, an
integer difference between it and the
current transaction ID.
This function understands
that we have wraparound
of the space, right?
So it will give you a number of
transactions, the age in terms
of number of transactions, not
in terms of seconds.
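For source 1, a query along these lines:

```sql
-- Source 1: oldest snapshot held by any backend
SELECT pid, state, xact_start,
       age(backend_xmin) AS xmin_age
FROM pg_stat_activity
WHERE backend_xmin IS NOT NULL
ORDER BY age(backend_xmin) DESC
LIMIT 5;
```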
The second very common situation,
super common: you created a logical slot and
either haven't started using it,
or abandoned it, or it's lagging.
So if you check pg_replication_slots,
it also has a column
called xmin there.
And you can also use age.
So it's another very popular reason
to have the xmin horizon held back.
And we need to distinguish logical
replication and physical replication.
Because as I remember, if you have
physical replication slot
but haven't started using it, I
think it's not harmful.
I might be mistaken.
Logical replication slots are definitely
harmful.
Unused logical replication slots
are definitely harmful.
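These can be inspected directly; pg_replication_slots exposes both xmin and catalog_xmin:

```sql
-- Sources 2 and 3: slots (logical and physical) holding back the horizon
SELECT slot_name, slot_type, active,
       age(xmin)         AS xmin_age,
       age(catalog_xmin) AS catalog_xmin_age
FROM pg_replication_slots;
```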
Michael: Did this change in a recent
version?
I think there might be some new setting where,
by default, they expire after
a certain age of not being used.
Nikolay: Yeah, there is such setting,
it's just a protection
mechanism and I think we should
talk about protection measures
separately.
Michael: Yes, okay, cool.
Nikolay: I'm just trying to say
that physical and logical replications,
replication basically streams,
they are slightly different here.
Definitely if you have logical,
if it's unused, it's harmful.
If it's lagging, it's also harmful.
Michael: Makes sense.
Nikolay: It starts harming you
right when it's created.
So you don't want logical replication
slot to be unused or lagging
too much.
Physical is slightly different, and
I think, if I'm not mistaken,
if you just create a slot but nothing
is connected to it, it won't
hold the xmin horizon.
Moreover, if you have physical
replication, but this replica
is not using hot standby feedback,
on the replica it's set to off,
which is actually not common.
Usually people prefer to set it
to on.
Hot standby feedback, in
physical replication, sends
to the primary the information about
which xmin horizon is needed on the
replica, right, on the standby.
If it's not sending, it's better in this respect,
right, because we don't care.
But this will give you a physical
replica which starts
lagging every time it has a long-running
transaction of its own.
Hot standby feedback will
protect you from replication lag, but if
there is a long-running transaction
on the replica, it will be reported
to the primary, and we will have
the same effect as if we had the
long-running transaction on our
primary.
This dilemma explains why it's really hard to
have long-running transactions
for analytical purposes on regular
physical nodes, right?
And you cannot offload them to replicas,
unless you can afford replicas
which are lagging for hours,
for example.
Anyway, we have logical and physical
replication which can also
affect us.
And we have a fourth component, which
is prepared transactions.
There is a pg_prepared_xacts system view.
I don't remember the column name,
maybe transaction or something.
But if you use
2PC, right, prepared transactions,
then you have them, right?
And some of them might affect
the xmin horizon.
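For source 4, something like:

```sql
-- Source 4: abandoned prepared transactions (2PC)
SELECT gid, prepared, owner, database,
       age(transaction) AS xact_age
FROM pg_prepared_xacts
ORDER BY prepared;
```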
So I found 4 sources for our analysis.
We dig information from all of them,
take the ages, and then just take
the greatest of the ages; there is
such a function, greatest, right?
And this will be the true age of
our xmin horizon.
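Putting it together, a hedged sketch (hot standby feedback shows up through the walsender's horizon, visible in pg_stat_replication.backend_xmin; greatest() ignores NULLs):

```sql
SELECT greatest(
  (SELECT max(age(backend_xmin)) FROM pg_stat_activity),
  (SELECT max(age(backend_xmin)) FROM pg_stat_replication),  -- hot_standby_feedback
  (SELECT max(age(xmin))         FROM pg_replication_slots),
  (SELECT max(age(catalog_xmin)) FROM pg_replication_slots),
  (SELECT max(age(transaction))  FROM pg_prepared_xacts)
) AS xmin_horizon_age;
```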
What is big, what is small?
This is hard to understand.
When I think about it: is 1,000
too much, or 1 million, or 1 billion?
A billion is definitely too much,
because we have about 2.1 billion,
which is basically the whole capacity
before XID wraparound, right.
There's also the MultiXact mechanism
additionally, but let's not go there.
A billion is definitely too much.
Billion is definitely too much.
Is 1,000,000 too much?
In heavily loaded systems, I like
to think about it like thinking
about what tables we have and how
many rows they have.
And how many rows we consider,
how many dead tuples.
So dead row versions, how many
of them we consider too much.
Usually we prefer for OLTP heavy
loaded systems, we prefer
to make autovacuum be very aggressive.
The default settings, autovacuum_vacuum_scale_factor
and autovacuum_analyze_scale_factor,
are 20% and 10%, the two common default numbers.
It's too much.
If we have tens or hundreds of millions of rows,
10% is way too much.
So we usually set, okay, like 1%,
so 0.01, or even half a percent.
Sometimes we go even lower
and make it remove dead tuples
more often.
Now you already have some hint: okay, say I have
a 10 million row table and I
have a 1% scale factor.
It means that 100,000 dead rows is when
autovacuum should start cleaning
up.
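That arithmetic can be sketched as settings (the values are illustrative, not a recommendation for every system):

```sql
-- More aggressive autovacuum for a heavily loaded OLTP system:
ALTER SYSTEM SET autovacuum_vacuum_scale_factor = 0.01;   -- 1%, default is 0.2
ALTER SYSTEM SET autovacuum_analyze_scale_factor = 0.01;  -- default is 0.1
SELECT pg_reload_conf();
-- 10,000,000 rows × 0.01 = 100,000 dead tuples
-- before autovacuum starts on that table
```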
And then I think, okay, the xmin horizon
is lagging that much.
Okay.
Is it lagging by 1,000?
It's okay, we don't care.
Our actual threshold, when
autovacuum starts and we expect
work from it, is roughly 100,000.
But the problem is that we have
different tables, small and big,
right?
So we need to think about our core
tables and the nature of our
database to decide how much xmin horizon
lag is too much for us, right?
Michael: I think there's 2... you're
talking almost like a bottom-up
approach, right?
Working out how much bloat we can
tolerate.
Nikolay: But yeah, how many dead
tuples we can tolerate; it's not
yet bloat.
As we remember, we allow too many
dead tuples, and autovacuum is
a converter of dead tuples to bloat;
it will come and convert our
millions of dead tuples to bloat,
but it's not bloat yet.
At least in my terminology.
I wish we had official terminology
for bloat.
For dead tuples it's clear,
but for bloat we don't have it.
Michael: The more dead tuples we
leave around, the worse bloat
issues can get, in terms of
fragmented indexes, for
example, or in terms of HOT updates,
if we've got dead tuples that
we
Nikolay: can't remove.
HOT updates might benefit from
bloat because they have space.
Michael: No, because it's not usable space
yet if you can't reclaim
it.
If it hasn't been processed by
vacuum yet.
Nikolay: But then eventually a
vacuum will happen.
Michael: Eventually.
And it
Nikolay: will leave a lot of gaps
and these gaps might be reused.
Michael: Maybe, yeah.
Nikolay: There are effects in opposite
directions here.
Michael: But the point I was trying
to get to was, often people
are monitoring this in order to
prevent transaction ID wraparound,
so the top-down approach could
be: how much time do we have?
Okay, 1,000,000 might not be that bad in
terms of getting towards 2,000,000,000,
but 2,000,000,000 could accumulate
within an hour or 2 if you've
got a ridiculously busy system.
So actually you might want an earlier
alert, or a lower threshold, because
as soon as you find out that
it's going above a healthy number,
or above a not-normal
number for your system, you're
on a stopwatch; you've only got,
say, 2 hours to fix this problem.
So it might be that you're thinking
in terms of top-down as well.
Nikolay: Yeah, for example, if I
see the age of this is 100 million,
it's already, wow, a lot.
It's already an emergency.
You
Michael: might only, yeah, exactly.
You might only have a couple of
hours at that point.
Nikolay: Yeah.
Speed is interesting.
I wish monitoring systems also
covered XID consumption properly,
with a good XID growth rate metric.
If you think, okay, this is the xmin
horizon, some numbers... okay,
I know it's transactions, but you're
right, it's hard
to map it to real time.
But here I see, okay, we consume
2 billion transactions per week.
It's quite a loaded database, in
fact, right?
Real transaction IDs.
In this case, I see peaks during
busy hours, and I understand
my XID growth rate.
And this is helpful for me.
I understand, okay, this is, I
can already start playing with
this.
I think we could go even deeper.
I just don't see proper analysis
tools developed in this area.
These are my thoughts, and those
of people I work with
and talk to, and I see huge potential
here.
It could be much better visualized
and presented for analysis.
Michael: And as you say, separated
out in terms of things that
are risky in terms of locks and
things that are risky in terms
of...
Nikolay: Oh yeah, locks are completely
different.
Michael: I'm just going to keep
saying block.
Yeah.
But that's what I mean, you're
talking about different alerts,
different charts, different...
Yes.
Nikolay: So I would have an alert,
and we actually do have alerts
for the xmin horizon, based on understanding
of the particular system,
its growth rate, table
sizes, right?
And the behavior of autovacuum.
And we set it... for example, I think
100 million is a huge number.
We should set it lower and already
start investigating what is holding back
our xmin horizon.
And this is the right direction
to understand it.
And we can also check the logs if you
enable autovacuum logging.
We usually aim to log, if not
all autovacuum runs, at least those above 1 second.
So log_autovacuum_min_duration,
1 second.
Sometimes we set it to 0.
There is some risk that you might
see a lot of entries and have an observer
effect, but usually at least
1 second is beneficial.
So you start seeing it.
Again, I think the old default was 10 minutes.
It's ridiculously huge; you don't see a lot.
And I think it was decreased
recently, right?
Oh, cool.
Yeah.
So what I'm talking about here:
if you have logging of autovacuum
behavior, it reports the xmin horizon,
right?
The transaction ID was this, the xmin horizon
was that, it's this number
behind.
So you can extract information
from those logs.
And it also reports how many dead
tuples it would have removed but
couldn't, because the xmin horizon is
blocking it.
So autovacuum messages in the logs
are extremely helpful for this
type of risk analysis.
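Enabling this is a one-liner; the exact wording of the log lines varies by Postgres version, so the sample below is only indicative:

```sql
ALTER SYSTEM SET log_autovacuum_min_duration = '1s';  -- or 0 to log every run
SELECT pg_reload_conf();
-- Autovacuum log entries then include lines along these lines:
--   tuples: ... removed, ... remain, ... are dead but not yet removable
--   (newer versions also report the removable cutoff / oldest xmin)
```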
Michael: I think it might be just,
I think it's still 10 minutes.
Nikolay: Maybe, maybe.
Michael: Wow.
All right.
Nikolay: So this is about xmin, the vacuum behavior, the xmin horizon,
and this concept again. I think the documentation could be fair
that these concepts exist; maybe one should propose to define
them, because everyone suffers from bloat.
And in my opinion, everyone working with Postgres deeply enough
needs to understand the concept of xmin horizon, to start distinguishing
it from long-running transactions, right? And the second problem...
the first problem, historically, lock acquisition: what to say about
it? So it's... Yeah,
Michael: I think we should point people.
I think we did a good episode on that specifically.
But yeah, maybe that's a good excuse to switch over to mitigations
though, because...
Nikolay: Mitigation.
Yeah, 1 point I wanted to mention, there are some exotic problems
also, which might happen from long-running transactions.
Michael: Oh yeah, sure.
Nikolay: I'd point to my subtransactions post explaining 4 problems
with subtransactions, and 1 of them happened to GitLab.
And we had an episode about it, I think, right?
So it was the most interesting one.
It was happening when you have long-running transactions.
I think it happens when you actually hold this xmin horizon on the primary,
hold a snapshot, or maybe locks.
This is a question, actually; I have some homework to do, to return to that
material and understand which of the 2 problems happened there.
So: long-running transactions on the primary, and a high-TPS replica
started to suffer at some point because of subtransactions.
Once you close the long-running transactions on the primary, the suffering
ends.
So this is already an exotic problem, where I need to understand:
is it lock-related or snapshot-related?
Michael: I also think replicas are an interesting thing in general
in terms of long running transactions.
So long running transactions on the primary and whether they
affect the replica and then also long running transactions on
the replicas and even when they can affect the primary.
Nikolay: ...a replica, if hot standby feedback
is turned on, as we discussed.
Exactly.
They will report to the primary, and it can have the same effect.
Michael: Yeah.
So...
So I think that's not always something people are aware of.
But yeah, in terms of mitigations though, it feels like timeouts
are our friends in a lot of these cases, and monitoring;
we've brought up monitoring a lot, right?
Monitoring, alerting, even in extreme
cases, killing certain
problematic long-running transactions.
What did you want to talk about in terms
of mitigations?
Nikolay: So, yeah, of course, if
you can get rid of all the long-running
transactions, both dangers
don't affect your database
health, which is great.
But sometimes we need long-running
transactions, and sometimes we
need to accept some lag in replication
slots.
By the way, here is what caused me to start
distinguishing; something
is not right.
We talk about long-running transactions,
but the same problem,
vacuum being blocked...
vacuum is not fully blocked, you understand,
it just cannot remove
some dead data.
It's not only about long-running
transactions; it's also about
slots, it's also about prepared
transactions.
What's happening?
Okay, prepared transactions are
at least transactions,
but slots, right, or replication...
anyway, hot standby feedback.
Okay, hot standby feedback is also
about transactions, they just
happen on the standby, but an unused
logical slot?
So it's a wrong explanation of
the problem.
The problem is that the xmin horizon is frozen;
it's not progressing, right?
Michael: Not progressing yeah.
Nikolay: Maybe it's progressing
but too slow also might happen.
Michael: Yeah so if it's progressing
slowly though it's just
blocked at a new point, right?
Like it's...
Nikolay: Yeah, if it's progressing
too slowly, your horizon...
not horizon, but the oldest XID it needs
is already too far in the past.
So basically, what we should look
at is the age of the xmin horizon.
The age of the oldest xmin needed.
Yeah, this is all we need.
And if it's too far in the past,
this is a problem.
And the reasons of this, as I said,
4 reasons.
And mitigation is simple.
Don't allow it to be too far in
the past.
How?
First of all, in OLTP, I very
highly recommend being very aggressive
with timeout settings globally.
I know the documentation says
it's controversial and
so on, but if an HTTP
server can afford 30 seconds
or 60 seconds as a maximum for requests, why can't we, the database,
afford at least the same
for all regular traffic?
for all regular traffic.
Of course, for CREATE INDEX CONCURRENTLY,
REINDEX CONCURRENTLY,
we need to set it to 0.
What to set?
statement_timeout and 2 other things,
right?
If you think about a transaction: at any moment, either a statement
is running, or it's idle in transaction.
So it's either a statement or idle-in-transaction state, and there is
the whole transaction.
So basically we have the whole transaction and 2 phases which alternate,
right?
It means we can limit the whole transaction, or the statement, or the
idle-in-transaction state.
And for all 3 we have a setting, right?
And transaction_timeout.
Michael: Only since recently though, only since version 17.
Nikolay: We had an episode about transaction_timeout because
it was my idea.
I was super surprised.
I was walking around thinking, why it doesn't exist?
Like, maybe I don't understand Postgres enough after 15 years
of experience.
Then I told it to Andrei Borodin and he implemented.
Great.
In Postgres 17 it exists.
So in all new Postgres versions we have transaction_timeout, so we can
finally limit the whole thing, because before we could limit only
the statement and the idle time, via idle_in_transaction_session_timeout.
Yes.
And if you have a series of brief statements and brief waiting
periods between statements, you could easily have an hour-long or
day-long transaction.
Now you can limit the whole thing.
Limiting the whole thing is helpful.
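A sketch of aggressive global timeouts for OLTP; the values here are illustrative, and should be tuned to the workload:

```sql
ALTER SYSTEM SET statement_timeout = '30s';
ALTER SYSTEM SET idle_in_transaction_session_timeout = '30s';
ALTER SYSTEM SET transaction_timeout = '60s';   -- Postgres 17+
SELECT pg_reload_conf();

-- Relax per-session for known long operations:
SET statement_timeout = 0;   -- e.g., before CREATE INDEX CONCURRENTLY
```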
Unfortunately, we cannot limit what we actually would like to
limit, which is the xmin horizon.
Maybe we should have... this idea just came to my mind now;
honestly, I need to think about it.
We could probably have something like an xmin horizon maximum, after which
we start canceling transactions, abandoning
prepared transactions, like deleting them, and maybe killing
some sessions.
Yeah, yeah.
Yeah, why not?
Like 100 million, an ultimate protection mechanism.
And in this case, you don't need to map what's happening with
your server in terms of seconds to the actual problem: how
many transactions you can allow your
garbage collection to lag behind, right?
Garbage collection is vacuum, right?
So maybe we should have those settings; maybe it's too many settings
at the same time, but it would be flexible.
I would say 100 million is my maximum, or 10 million is my maximum,
regardless of cost.
Michael: Even if Postgres shipped with conservative defaults there,
it would be fine.
Even if it was like 100 million as a default, it would be fine
for most cases, wouldn't it?
Nikolay: I'm curious; maybe this idea was discussed already, maybe
it's not a good idea.
It just came to my mind.
It's interesting.
If you have an opinion while listening to us, leave it as a comment
under the YouTube video, or somewhere like Twitter or LinkedIn,
I don't know.
Anyway, this is interesting.
But here is what we can do: we can protect ourselves with
transaction_timeout, statement_timeout, and
idle_in_transaction_session_timeout.
What else?
We should, we must actually, monitor the lag of replication slots,
especially logical replication slots, but physical ones as well
if we use hot_standby_feedback.
And we should not allow them to lag too much.
And finally, we should be very careful with 2PC and abandoned
prepared transactions.
That's it.
So only these 4 areas.
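As a sketch of how to watch these areas in one place, a query like the following (assuming superuser or pg_monitor access; some monitoring tools also consider catalog_xmin of logical slots) shows what is currently holding the xmin horizon back:

```sql
-- Who is holding the xmin horizon back, and by how many transactions.
SELECT 'session' AS source, a.pid::text AS name,
       age(a.backend_xmin) AS xmin_age
FROM pg_stat_activity a
WHERE a.backend_xmin IS NOT NULL
UNION ALL
SELECT 'replication slot', s.slot_name::text, age(s.xmin)
FROM pg_replication_slots s
WHERE s.xmin IS NOT NULL
UNION ALL
SELECT 'prepared transaction', p.gid, age(p.transaction)
FROM pg_prepared_xacts p
ORDER BY xmin_age DESC;
```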
Michael: Yeah, when you say monitoring,
you mean alerting as
well, right?
Nikolay: Yeah, so if some logical replication slot is unused,
or it's lagging too much... Again, this is interesting: lag can
be measured either in bytes or in seconds.
We should think about seconds, and about how to map them to
transactions, understanding our XID growth rate.
So there are different metrics in play here.
To understand health properly takes time.
And if you have cattle, not pets, you should have simple rules.
If some slot is lagging too much... By the way, slot lag is measured
in bytes, right?
We cannot limit it in seconds of lag.
So there are bytes, seconds, and then XID growth, the number of
transactions.
3 metrics; there is a triangle here.
Interesting.
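For the bytes metric specifically, a sketch (run on the primary; restart_lsn can be NULL for slots that have never been used):

```sql
-- How much WAL each slot forces the server to retain, in bytes.
SELECT slot_name, slot_type, active,
       pg_size_pretty(
         pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)
       ) AS retained_wal
FROM pg_replication_slots
ORDER BY pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)
         DESC NULLS LAST;
```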
Anyway, this is how to protect vacuum from being affected.
You can also just reactively monitor what autovacuum reports
in the logs.
It will start to complain clearly, right?
So you can set thresholds based on the logs, and if you have a good
log analysis system, you can set up alerts from there.
And as for locks, for me it's a completely different topic.
Yeah, long-running transactions contribute to it, but lock problems
can happen very fast.
Unlike a blocked xmin horizon, which you might not notice at all.
Okay, say you blocked it for many, many hours; maybe it was a
one-time thing.
It won't affect you that much.
The negative effects are postponed and they grow; it's not a cliff,
it's slow degradation.
Unlike the situation with locks, where it can be very fast.
So, as I said, the classic example: you hold the weakest heavyweight
lock possible, an ACCESS SHARE lock on the table.
You block DDL, the DDL blocks other SELECTs, boom.
What to do here?
We need proper DDL deployments.
We need a low lock_timeout and retries.
This is the number one thing I would recommend.
And the number two thing is to craft schema changes properly to
avoid long-running exclusive locks first of all, but also long-held
ACCESS SHARE locks at some point.
They are not harmful unless you have conflicting sessions, and
detecting conflicts sooner is helpful.
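A minimal sketch of the lock_timeout-plus-retries pattern; my_table and the new column are placeholders, and the retry loop itself usually lives in the migration tool rather than in SQL:

```sql
-- Fail fast instead of queueing behind other sessions: if the
-- ACCESS EXCLUSIVE lock is not acquired within 2 seconds, the DDL
-- errors out and releases everyone it was blocking.
SET lock_timeout = '2s';
ALTER TABLE my_table ADD COLUMN note text;
-- On "canceling statement due to lock timeout", sleep briefly
-- (ideally with jitter) and retry the ALTER.
```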
Another issue, which is actually a common problem with Postgres:
vacuum can block us when we do DDL.
We cannot acquire an ACCESS EXCLUSIVE lock on the table to change
its schema if vacuum is processing it.
But fortunately, in most cases autovacuum runs in normal mode and
it will cancel itself after, I think, a second.
Michael: Do you mean regular autovacuum, or anti-wraparound
autovacuum?
Nikolay: I'm talking about autovacuum, good correction.
But if it's running in transaction ID wraparound prevention mode,
which happens often if you have high XID growth; these are related
topics, right?
If we insert a lot, not in batches but single rows, for example,
we consume XIDs quickly.
Savepoints can also contribute a lot to XID consumption and the
XID growth rate.
In this case we probably have a problem: if XID growth is high,
anti-wraparound autovacuum happens more often.
It might happen, for example, every two days on a very busy system.
So it happens every two days, but we aim to deploy five times per
day; we are a very rapidly developing startup.
In this case, these might clash.
You have an anti-wraparound autovacuum running, and you try to
deploy DDL.
Even with retries, and you must have a low lock_timeout and retries,
you try to deploy, but the vacuum runs on your terabyte-sized table
and blocks you.
This is super annoying.
You cannot deploy something you need to deploy right now.
What to do about it?
The question is open.
Michael: Well, you've talked a lot about partitioning, so the
tables are smaller and the vacuums are quicker.
But I actually also think, and at the time I thought it was a good
thing, that the threshold where an anti-wraparound vacuum kicks in
is relatively low in the grand scheme of things, in terms of the
risk.
If it's happening every 2 days, that seems like a bad smell to me.
That feels like something that shouldn't be.
Let's make sure normal vacuum happens enough so that it doesn't
have to run in anti-wraparound mode.
That would be my approach.
Nikolay: Yes.
I'm not against this.
It's good that it's happening often, and freezing is good.
So partitioning is helpful not only because the tables are smaller,
but because more pages are all-frozen and we don't touch them
anymore; they are like our archive.
This makes things much more efficient compared to a huge
unpartitioned table, where any page can receive a new tuple at any
time and then it's no longer marked all-frozen in the visibility
map, right?
The visibility map has 2 bits per page: all-frozen and all-visible.
And boom, autovacuum needs to take care of that page again.
So for me the problem is not that it's happening too often.
The problem is the threshold.
It's okay to have it low and make freezing very frequent.
Good partitioning is great, but sometimes I wish it were easier
to raise this bar temporarily while I'm doing my changes, and then
bring it back down.
But as far as I remember, and I might be mistaken here, I last
checked this topic maybe a couple of years ago, it's not so easy.
You cannot change it globally that easily.
It requires a restart, I think.
Maybe I'm mistaken again.
And at the table level there is also some caveat, I don't remember
which, but I didn't see a good way to...
Okay, for each deployment, we want...
And also, interestingly, you know, you kill that vacuum which runs
in transaction ID wraparound prevention mode, and if you don't
acquire a lock on the table fast enough, it will kick in again.
It might kick in especially quickly if you have many workers and
autovacuum_naptime is low, so it's an interesting problem as well.
So raising this threshold was not easy, as I remember.
So I came up with 2 ideas.
The first idea: if this is a big deployment, like a major upgrade
or something, because it's a problem during major upgrades; we hit
it during major upgrades, when an autovacuum running in transaction
ID wraparound prevention mode was a problem for us, or during some
other major change.
If it's a rare change, we probably should forecast when the vacuum
will start freezing, and do our change before that.
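One way to forecast it, as a sketch: compare each table's XID age to autovacuum_freeze_max_age, since the anti-wraparound autovacuum kicks in roughly when age(relfrozenxid) crosses that threshold (note that per-table reloptions can override the global setting):

```sql
-- How close tables are to triggering anti-wraparound autovacuum.
SELECT c.oid::regclass AS table_name,
       age(c.relfrozenxid) AS xid_age,
       round(100.0 * age(c.relfrozenxid)
             / current_setting('autovacuum_freeze_max_age')::bigint,
             1) AS pct_to_wraparound_vacuum
FROM pg_class c
WHERE c.relkind IN ('r', 't', 'm')  -- tables, TOAST tables, matviews
ORDER BY 2 DESC
LIMIT 20;
```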
Michael: I like that.
Nikolay: Yes, yes.
Do it before.
Another thing is that we should probably just sometimes kill it
ourselves.
But if you want to kill it, you need to do it properly.
You need to issue an attempt to acquire the lock on the table you
want to change, and kill the autovacuum immediately, so that your
lock acquisition succeeds; and then good luck to autovacuum trying
to prevent transaction ID wraparound while we're doing our job.
This is like...
Michael: Wait, is that so it doesn't start again?
Nikolay: Yeah, yeah, so it won't.
That's it.
The lock is mine and I'm doing my job.
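A sketch of this trick; my_table and the pid are placeholders, and it should only be done when you have confirmed you are far from the wraparound danger zone:

```sql
-- 1. Find the autovacuum worker processing the target table:
SELECT pid, query
FROM pg_stat_activity
WHERE backend_type = 'autovacuum worker'
  AND query ILIKE '%my_table%';

-- 2. Terminate it and grab the lock immediately, before a new
--    worker can start on the same table:
BEGIN;
SELECT pg_terminate_backend(12345);  -- placeholder pid from step 1
LOCK TABLE my_table IN ACCESS EXCLUSIVE MODE;
-- ... run the DDL here ...
COMMIT;
```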
It feels slightly dirty, right?
But there is no harm if you also monitor XID and MultiXact ID
wraparound risks.
If you know you're only at 10% of capacity, at 200 million, it's
fine to go to 300 or 400 million even during a big deployment.
We will be fine afterwards, right?
But this trick feels dirty, honestly.
I did it manually a few times in my career because I needed to
finish a deployment and I knew that we were not in the danger
zone.
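Checking that capacity, as a sketch (the hard wraparound limit is about 2 billion transactions; the 10% example corresponds to roughly 200 million here):

```sql
-- Database-wide XID and MultiXact age versus the ~2 billion limit.
SELECT datname,
       age(datfrozenxid)    AS xid_age,
       mxid_age(datminmxid) AS multixact_age
FROM pg_database
ORDER BY age(datfrozenxid) DESC;
```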
So yeah.
Michael: Yeah, that makes loads of sense to me.
My preference, I think, would still be to avoid it happening as
often as that, and to actually try to make sure things get frozen
before it's needed.
So it really is a failsafe rather than something that happens in
normal operation.
But yeah, super interesting.
We started the episode talking about what we should call it, and I
think maybe we should end it with: what do you think now, having
talked about it?
Nikolay: Oh.
Yeah, but we try to keep titles very short, right?
We haven't had an xmin horizon episode.
I tend to think this should be called the xmin horizon episode,
because this is where we spent most of our time, and it's less
about locks.
I think long-running transactions is the wrong concept to chase.
Michael: Yeah, let's call it xmin horizon, because I think that's
gonna be...
Nikolay: But my concern is that people think in terms of
long-running transactions and...
Michael: That's fine.
I think long-running transactions is interesting, and I actually
don't mind that there's this edge case of long-running transactions
that aren't harmful.
Nikolay: Let's keep long-running transactions, because it's easier
to understand what we're talking about.
This is what people expect.
Let's keep the name long-running transactions.
And explain the xmin horizon; this concept is still not fully
settled.
Like, it's clear in my head, but I want it to become official, in
the documentation.
Maybe it should be explained in the documentation.
Michael: And then that opens us up so we can still do an episode
on that at some point.
Nikolay: Yeah.
Maybe I missed some parts of it, like how to properly monitor it.
I know some people who develop observability tooling listen to
our episodes.
Michael: Oh cool.
Nikolay: Yeah, I hope it will be improved.
Michael: Nice.
Nikolay: It's about clarity.
And thank you for listening to this.
I feel we went quite deep in this episode.
Michael: Oh, I personally enjoyed it.
Yeah, I hope everyone else did too.
Nikolay: Great, it was great to chat with you as usual.
Michael: Likewise, Nik, take care and see you next time.