transaction_timeout
Nikolay: Hello, hello, this is
PostgresFM, a podcast about Postgres.
Hi, Michael.
Sorry for Mike last time.
Michael: That's all right, Nik.
I don't even notice normally.
Nikolay: Right.
Nik is fine.
Nik is fine.
So, episode number almost 90.
I don't remember exactly.
Like, Do you remember?
You always remember.
Michael: I remember, but you told
me not to remind you anymore
until we get to Milestone.
And you're the only one that ever
brings up the episode number.
Nikolay: Really?
Michael: I never mention it.
Nikolay: Okay.
Interesting.
So we are going to talk about what?
Michael: Oh yes, so this week was
your suggestion, and it's a
cool 1.
It's going to be a new feature
in Postgres 17, hopefully.
It's committed, but you never know
for sure.
And it's transaction timeout, a
new parameter that we'll have
hopefully in the next version.
Nikolay: I can show off, it was
my idea.
But most of the work, Andrey did,
Andrey Borodin did.
So it started during our PostgresTV
hacking Postgres session.
Michael: Which are back now, right?
After a long pause.
Nikolay: Well, yes, we'll try.
Nobody knows.
It's hard to arrange usually because
to do proper hacking, you
need a lot of time.
Last time we did restart it, we
talked about this transaction
timeout and Andrei tried to write
tests because former tests
needed to be removed.
By the way, I also wanted to say
thanks to Alexander Korotkov
who committed this patch.
Honestly, I even didn't ask.
So it was good to see that it's
committed.
And so session finished after 90
minutes by timeout because we
needed to go.
Session about timeout finished.
Okay, now you're smiling.
So, transaction timeout.
It took a few years for me to understand
something is wrong.
We don't have something.
Then, last year, like a year ago,
I started thinking, it should
be.
My idea was maybe I'm very wrong
or I'm very right.
So it's so strange in 2023 to think
about this basic setting.
But honestly, the last few years,
I already suggested to everyone
to consider the transaction timeout
to be applied on application level.
In 100% of cases, I saw a very
strange look, if it was a video
meeting.
It was a very strange look, like
Postgres cannot do it?
Yes, Postgres cannot do transaction
timeout, but you do need
it and we can discuss why.
So I always say, do it always on
the application side if you
have LTP, because, you know, like
anyway, HTTP has timeout.
A server, for example, NGINX, like
30 seconds, it's normal, right?
Nobody will wait many minutes,
unless it's a specific request.
Sometimes we can wait a few minutes,
but usually we have a limit.
And Postgres doesn't have this
transaction timeout.
It has only statement timeout and
idle transaction session timeout.
But no transaction timeout.
Michael: And also, so I looked
into when we got each of these,
and statement timeout was ages
ago, like 7.3, would you believe?
And then idle in transaction session
timeout was actually more
recent than I realized---
It's 9.6, which is probably only
about 7 or 8 years ago.
And then we also got more recently
idle session timeout.
Nikolay: Which is different.
It's outside of normal work.
It's for like, let's drop the connection
if it doesn't do anything.
Michael: Yeah, but that was a recent,
like, version 14 edition.
So still only a couple of years
ago.
Nikolay: Ah, you want to say that
it's also a very, very basic
thing and only added recently?
Yeah, I agree with that.
Usually people implemented it outside
Postgres as well, to drop
connections.
Michael: I've seen less need for
that one than I have seen for
this one, but like you, I didn't,
it's one of those things that I
kind of didn't realize I needed
because statement timeout can
cover a lot of the base, like a
lot of the use cases can be just
about coped with via statement
timeout, but one or two really can't.
And I think I saw your post and
I'll link it up the mailing list
thread that I think Andre started,
but your first reply to that
included some really good examples
of when you do need this in
addition to statement timeout or
like instead of statement timeout.
So is it worth like covering a
couple of those?
Yeah.
Nikolay: Yeah, let's discuss why
we use statement timeout in
OLTP like in regular the most common
cases like web and mobile
apps.
We had an episode about that and
I think it was the very first
episode maybe.
Speaking of counting.
Michael: Yeah, we've done a few
other episodes related to this
as well.
We did one on connections, one on
zero downtime migrations.
Nikolay: No, I'm talking about
the very first episode.
What was it?
Michael: Yeah, slow queries.
Nikolay: Slow queries, yes.
And this is about statement amount,
basically.
But people don't do SQL queries.
I mean, end users, web and mobile
app users, they don't do SQL
queries.
So they work at a higher level, HTTP
requests.
So why do we need statement timeout
at all?
Why?
What is the problem we are solving?
I have this honest question because
in my opinion, when we think
about what we try to solve with
statement timeout, it's not the
right tool usually.
For example, again, users work
at a high level and we know some
intermediate software settings,
middleware, which is usually
written in some Python, Java, Ruby,
anything.
These application servers and also
an HTTP server, like maybe front
end server, like an NGINX or something,
and VoIP, right?
They usually have some limits to
drop connections which last
too long, if a server is taking too
long to respond, or communication
taking too long, usually like 30
or 60 seconds.
This is what users will get if
something which should take 100
milliseconds or less, takes 30
seconds.
It's time to admit there is a problem
and tell the user that,
like, for 502 gateway timeout,
it's usually what NGINX was returning.
And then we, back to database,
we think, OK, we need to limit
as well.
But what exactly do we need to limit?
This is the question.
Is the statement timeout the right
tool for that?
Michael: I do think there's a like,
a lot of cases where it's
enough.
So if, for example, we're trying
to, in that case, let's say,
for example, the users or like
the app has already given up,
but the database carries on working.
Yeah, that's doing pointless kind
of, that's pointless work.
So avoiding pointless work would
be great.
But there are other cases too, where
even if the app didn't have
a timeout, is it okay for a few
users of a system to be using
all the resources when, you know,
like hogging those resources
by running some really, really
slow or really, really heavy queries.
Is that an acceptable trade-off
or do you want to limit that?
So I think there are other slightly
different cases.
Right, but
Nikolay: Work is not...
Statements are too small.
It's too atomic a piece, right?
Michael: Well, I think there's...
Before we move on to why transaction
timeout, there's also one
other case which I think or maybe
maybe this makes the point
perfectly. These long-running transactions,
even if they're a single
statement transaction, they can
block internal processes like
autovacuum and cause issues that we've,
I mean, we've spent multiple
episodes talking about the kind
of issues that long-running statements
could cause, never mind long-running
multi-statement transactions.
Nikolay: No, no, no, no, not statements
block it.
Transactions.
Michael: But as you've said multiple
times, you can't have a
statement without a transaction.
So like a long-running statement
is enough, but you're right
that...
Nikolay: You can have a statement
without transactions for example
creating an index concurrently. Three transactions
I mean, you're okay
without a transaction completely.
No, no dirty reads.
Michael: Exactly right,
Nikolay: Right, right.
So yeah, but like we want
to apply the right tool.
It means that it should suit for
all edge and corner cases.
And I'm saying statement timeout
is for a different hole to close.
Michael: You gave a really great
example that completely sold
me in that email thread I mentioned.
Nikolay: Email?
Michael: Yes.
And that was a case I've seen actually
quite recently, which
was...
...taratatata.
Nikolay: Yeah, I call it...
It's like from a machine gun.
Yeah, but - statements coming from
a machine gun.
Brief statements with brief pause.
Michael: And somebody setting a
transaction at the beginning,
so in the case I saw, it was deliberately
taking out, like doing
begin, then doing multiple updates
or like upsets in this case,
and then only at the end committing
it.
So any one of those statements was
really short, but because they
were doing tens of thousands or
even hundreds of thousands, I
think, in this case, the transaction
as a whole was longer than
you'd probably want on a OLTP
system.
So in that case, statement timeout
wouldn't have helped, or at
least any sane statement timeout
wouldn't have cancelled that.
But a transaction timeout...
Nikolay: An idle transaction session
timeout wouldn't cancel
this transaction because the breaks
are also short.
Michael: Yeah, it's not idle, exactly.
Nikolay: It's like from a machine
gun.
It can be select plus 1, by the
way, if it's wrapped in a transaction
block.
Select plus 1, this anti-pattern,
which...
Michael: Like n plus 1.
Nikolay: 0, sorry.
Yes, yes.
Yeah, yeah.
I mean, it's actually a terrible
name, but the idea is that it's
n selects in a loop.
So, we didn't notice that they
have a loop.
It can be short updates with the
idea that we want to make them
short because it's better, you
know, like to...
But if it's wrapped into a transaction,
you cannot release any
locks until the very end of the
transaction.
This is the rule.
All locks are released only if it's
a commit or rollback.
That's it.
And so it's a bad idea to split
into batches, updates or deletes.
Insert is almost never, I think,
an issue.
It makes sense to split.
An insert should be a single massive
insert, usually.
Like copy or insert.
But if it's updates and deletes,
people know there's a rule to
split, then they put it in a single
transaction and what's happening
here.
What do we do?
We're accumulating a lot of logs
and not releasing them.
So there are only 2 reasons, the
big reasons I see.
Well, the source of the transition
is 1 reason, but there are 2 problems
which are very bright.
They scream let's have transaction
timeout.
They did it for many years and
I always say, okay, this is a
problem we must solve on the application
side.
We cannot do it in PostgreSQL at
all.
Surprise.
Well, we can use pg-cron and put
maybe some, I call it terminator
snippet, working with pg_stat_activity,
checking the exact start, like
transaction start timestamp, and
then terminating.
So terminator script.
This name is derived from pg_terminate_backend.
So was it renamed or no?
I don't remember.
pg_start_backup was
renamed.
It's about naming and I support
it.
Well, I'm having 3 in my speech,
sorry.
So, pg_start_backup was renamed to
pg_basebackup start.
So the same should happen with
pg_terminate_backend, I think.
Now, pg_terminate_backend should be
pg_backend_terminate.
So, this terminate script is also
like a workaround.
It should be some simple solution.
Let's not allow too long-running
transactions.
Because we keep locks, this is
problem number 1 we just discussed.
And second problem you mentioned,
keeping xmin horizon frozen
basically, not shifting.
Which means that all, like xmin
horizon is the hidden xmin value
of a tuple which is the oldest
in the system and we usually cannot...
Autovacuum usually comes and deletes
it at that time, we delete
it but we cannot delete it if there
is some transaction which
still can read from this because
xmin is like current time not
past, not the past for this transaction
which means that if we
keep our transaction we block this
xmin frozen and that means
that Autovacuum cannot delete dead
tuples, more and more of them,
freshly dead tuples, I call them.
Or maybe it can be improved this
name.
So these 2 problems, locks and
xmin horizon frozen.
Locks are not released and they
both can hit and a lot of people
can notice and you can even be
down down server can be down in
some cases and you cannot limit
it inclusion in Postgres 16 I
was so happy to realize I'm not
mad.
It's obvious and I think it was
discussed maybe, so shouldn't
discuss many times, but we just
went ahead, many things to Andrej,
just went ahead and coded it.
Then many strange cases started
to appear.
For example, sub-transactions or
conflict with statement_timeout,
which should fire first, for example.
Michael: That's a good point.
Do you want to talk a tour about
in this future, like once Postgres
17 is out, if you're designing
a new OLTP system, are you even
using, like, do you normally use
global statement timeouts?
And if you did in the past would
you now replace that or would
you use both with different settings
how would you
Nikolay: do it?
Good question, good question.
So first of all in documentation
we can read that setting statement
timeout globally is not a good
idea and I think it's a very bad
idea to write this in documentation
because we discussed, even
in HTTP server, application server,
they have their own timeouts
database is usually the most like...
Data intensive work is usually
in database unless we talk about
some other stuff happening now
outside of web and mobile apps,
right?
I mean AI stuff like in GPU and
so on.
So we want to limit in all OLTP systems,
we definitely want to limit
our work to 30 seconds maybe.
It's a good limit.
This article and our very first
episode is about basics, which
are so like 30 years, like 40 years
since the beginning of the
Internet.
People don't like to wait more
than a second, in a normal case.
It means that we don't want to
limit globally and be protected.
Only specific users which need
it, always can change it, unless
it's restricted.
So you just set statement timeout
or anything and go.
Of course, if DDL is needed to
be executed, for example, create
index concurrently, of course,
you also should remove this limit.
But global setting is protection,
it should be there for OLTP
case.
And Postgres is like, OLTP is the
main case.
We don't usually think about analytical
queries.
So it means that I always recommend
setting this and documentation
is having bad advice in this case.
So 30 seconds, my recommendation.
Sometimes 15, it's better.
Michael: Oh, wow.
Nikolay: Yeah, well, even 10 maybe.
It depends.
It raises the bar requirement for
queries that this is about
quality actually.
So if your query is poorly designed
or you forgot to create a proper
index something like a bad plan,
be killed.
But now with transaction timeout,
I would say statement timeout
might be not needed at all.
Michael: Interesting.
I wondered if you were going to
say that.
Nikolay: So again, my logic is
applied at a higher level.
This is about each feature.
If we go down, in both cases, database
specific, Postgres specific
cases I mentioned, they also apply
to a transaction level at
higher levels.
We don't care about statement and
breaks between statements.
Michael: And do you think like
same order of magnitude?
Like you mentioned 30, I've seen
60 seconds quite often mentioned
as a sensible default, so 30 seconds
not too different.
Would you go similar, like for
any reason to double it or like
increase it, like add a little
bit of buffer there or just similar
number?
Nikolay: No, no, no, Why buffer?
Set it to 30, solve all problems
and then consider going even
down.
And if we talk about statement
timeout, why at all?
Like we need it, like, okay, we
put it, but what if we have transaction
timeout 30 seconds, for example,
which statement timeout would
I use?
30 seconds maybe as well.
But it's already solved.
Michael: I don't think there's
any, that wouldn't make sense
to me.
But I was wondering if you might
say you know statement timeout
30 seconds transaction timeout,
60 seconds and
Nikolay: Transaction? I don't... transaction
session timeout, 60 seconds.
But when transaction timeout? What?
Session timeout, you mean?
Michael: No, no, no, transaction.
So, transactions can consist
of multiple statements, right?
Right. So, there's no need to have
a statement timeout that is
bigger or equal to the transaction
timeout.
Nikolay: It doesn't make sense.
Michael: So the only thing that
could make sense would be a statement
timeout that's less than the transaction
timeout.
Nikolay: Right.
Michael: But it seems pretty neat.
Yeah.
Nikolay: But again, like if you
take transaction timeout, which
is already quite good, restrictive,
30 or 60 seconds, then I
think statement amount...
And I don't see logical to go down.
Okay, it's about resources.
If you go down, maybe you save
some resources by killing some statement
earlier than the transaction.
But it's not the main reason to
have transaction command for
me.
Well, of course. Maybe it's actually
super important.
Maybe I just forgot that we use
it.
Okay, I don't know.
I saw so many incidents where even
statement timeout was not achieved,
but resources were spent fully.
So I don't know.
I mean, we have very restrictive
statement timeout, very, very
low.
And still CPU 100% and everything
is down because a lot of work
is happening.
So statement timeout doesn't protect
you, even 15 seconds.
I mean, to start protecting, maybe
I would like to have it 1
second.
In this case, I think it will start
protecting.
Michael: Interesting.
Nikolay: But 30 seconds is too high.
Michael: Yeah, it's a tricky one,
right?
Because yeah, I think there's a
growing, and I know you mentioned
analytical isn't our primary case,
but there's a growing trend
that I'm seeing of kind of hybrid.
I've had it called HTAP databases
where people don't want to
spin up a second analytical, like
a warehouse or something, and they're
trying to run some small analytical
queries on their like OLTP
database, which means you do get
these kind of the odd several
second query or at least hundreds
of milliseconds running
on the same database.
So I can see how it could get tricky
for some folks.
But I think that the 30 seconds
and the higher feels like a sane
starting point, as you mentioned.
And if you start to see some errors
from those, look into them.
And if you don't, look into what
your slowest queries are and
whether you can reduce it.
Nikolay: In case of HTAP, I will
just have another database user
specifically for long-running
queries. I will adjust this setting
maybe setting it to a couple or maybe
5 minutes and allowing to
last longer, that's it.
Michael: Yeah, that's a good point
that I don't know if we mentioned
yet that these can be set at the
role level, at a session level,
and globally.
Nikolay: Well, I never tried, but
statement timeout definitely
can be set at user level.
I think lock_timeout should
work as well.
Worth checking by the way.
Michael: I thought the doc said
that, but I was checking the...
Because this is the development
doc.
Nikolay: Maybe even it's...
It should be
Michael: in the
Nikolay: testing tests maybe.
Maybe it's...
So I'm quite sure it should work,
yeah.
So you...
Anyway, we talk about statement
timeout right now.
We don't have lock_timeout
until 17.
So statement timeout can be set
for different users differently.
And I also had a case when people
had very low global statement
timeout, but then adjusted for people,
for human connections,
they adjusted it to set to 0.
Michael: How do you feel about
that?
Nikolay: I feel this is dangerous.
It's like someone can just run
something then drink some coffee
and then go for some call.
It's kind of dangerous.
By the
Michael: way, I've got bad news
for you.
The warning seems to have been
copied over to transaction timeout
documentation as well.
And it says in the docs, in the
new docs, setting transaction
timeout in postgresql.conf is not
recommended because it would
affect all sessions.
Nikolay: Yeah, actually it's a
good point.
I need to raise this again.
I don't agree with this.
It's a bad, very bad statement.
Michael: There's also 1 other note
in here that I had forgotten.
I guess it's only a minor note,
but prepared transactions are
not subject to this timeout.
Thought worth mentioning.
Nikolay: Yeah, it's so.
Michael: So, yeah.
Is there anything else?
I guess the story for how this
came about is pretty interesting.
The thread seems quite long, to
me at least.
Nikolay: As usual, a lot of unpredictable
problems and then people
seem to agree and then I think
okay, not 4.17.
But it was a good surprise last
week, right?
Or a couple of weeks ago.
Yeah, I just wanted to say, if
our world survives, right?
For folks who are listening to
this in 2025 or later, just start
thinking about limiting at transaction
level first and then go
down.
This is the main advice, but for
others, limit the application
side and limit statement amount
and idle transaction session
timeout, of course, both are good
to have limited to below a minute
maybe, at least 1 minute.
I also saw people are afraid of
an idle transaction session
timeout being low.
I don't understand this.
Put it very low, very low.
Like if someone is not doing anything,
bye bye.
OK, maybe for humans, you can limit
it to 1 or 2 minutes but
still you're probably holding some
very important exclusive locks
and blocking someone, it's not a
good idea so either commit or
do something already, right or
Michael: yeah the default
is no limit like there's absolutely
no limit at all.
So, any...
Yeah, even if you're really...
Even if you are concerned, what
would an hour limit look like?
Several hour limit look like?
Even that's going to be better
than none at all.
So yeah, I like that advice for
people to be checking.
Do you even have any of these limits
set?
Because the default is not to.
And even on a lot of cloud providers,
they don't set
Nikolay: defaults for you.
IGOR MINAROVICH-SHMAROVICH Yeah,
Nginx, if you work on web or
mobile apps, Nginx, for example,
has a default proxy timeout limit
of 60 seconds.
This is a good starting point to
think about it.
If 99% of your database work is
to serve HTTP requests, at least
60 seconds should be there.
And go down to details.
Also, one more thing.
Maybe some folks will be inspired.
I'm not a good drawer of some pictures,
but imagine if we had
a schema in documentation showing
the relationships between statement
and breaks between statement and
some big transaction and timeouts
will be visualized.
So for better understanding, this
is transaction timeout, between
them there are multiple statements,
they are also limited and
breaks are limited.
And also outside we have a session
timeout.
So idle session, how is it called?
Michael: Idle session timeout.
We have both.
Nikolay: No, session timeout, what
is it?
I mean-
Michael: We have idle in transaction
session timeout and we have
idle session timeout.
Nikolay: Idle session timeout.
This is what...
Like, it's outside of a transaction
also have limited.
So all periods of workflow are
covered now.
And to visualize this would be a
great idea.
Documentation already has a couple
of pictures.
It would be good picture to have.
I
Michael: think you'd be in the
first 10 still though if you got
this one in.
Nikolay: Right exactly, so maybe
someone will be inspired.
Good.
Michael: Nice.
Yeah.
Thanks so much Nikolay.
Nikolay: Catch you next week.
Bye.
Catch you
Michael: next week.
Nikolay: Bye.