Postgres FM | Transcript: transaction

March 8, 2024 • 27 Minutes

transaction_timeout

Nikolay: Hello, hello, this is
PostgresFM, a podcast about Postgres.

Hi, Michael.

Sorry for Mike last time.

Michael: That's all right, Nik.

I don't even notice normally.

Nikolay: Right.

Nik is fine.

So, episode number almost 90.

I don't remember exactly.

Like, Do you remember?

You always remember.

Michael: I remember, but you told
me not to remind you anymore

until we get to Milestone.

And you're the only one that ever
brings up the episode number.

Nikolay: Really?

Michael: I never mention it.

Nikolay: Okay.

Interesting.

So we are going to talk about what?

Michael: Oh yes, so this week was
your suggestion, and it's a

cool 1.

It's going to be a new feature
in Postgres 17, hopefully.

It's committed, but you never know
for sure.

And it's transaction timeout, a
new parameter that we'll have

hopefully in the next version.

Nikolay: I can show off, it was
my idea.

But most of the work, Andrey did,
Andrey Borodin did.

So it started during our PostgresTV
hacking Postgres session.

Michael: Which are back now, right?

After a long pause.

Nikolay: Well, yes, we'll try.

Nobody knows.

It's hard to arrange usually because
to do proper hacking, you

need a lot of time.

Last time we did restart it, we
talked about this transaction

timeout and Andrei tried to write
tests because former tests

needed to be removed.

By the way, I also wanted to say
thanks to Alexander Korotkov

who committed this patch.

Honestly, I even didn't ask.

So it was good to see that it's
committed.

And so session finished after 90
minutes by timeout because we

needed to go.

Session about timeout finished.

Okay, now you're smiling.

So, transaction timeout.

It took a few years for me to understand
something is wrong.

We don't have something.

Then, last year, like a year ago,
I started thinking, it should

be.

My idea was maybe I'm very wrong
or I'm very right.

So it's so strange in 2023 to think
about this basic setting.

But honestly, the last few years,
I already suggested to everyone

to consider the transaction timeout
to be applied on application level.

In 100% of cases, I saw a very
strange look, if it was a video

meeting.

It was a very strange look, like
Postgres cannot do it?

Yes, Postgres cannot do transaction
timeout, but you do need

it and we can discuss why.

So I always say, do it always on
the application side if you

have LTP, because, you know, like
anyway, HTTP has timeout.

A server, for example, NGINX, like
30 seconds, it's normal, right?

Nobody will wait many minutes,
unless it's a specific request.

Sometimes we can wait a few minutes,
but usually we have a limit.

And Postgres doesn't have this
transaction timeout.

It has only statement timeout and
idle transaction session timeout.

But no transaction timeout.

Michael: And also, so I looked
into when we got each of these,

and statement timeout was ages
ago, like 7.3, would you believe?

And then idle in transaction session
timeout was actually more

recent than I realized---

It's 9.6, which is probably only
about 7 or 8 years ago.

And then we also got more recently
idle session timeout.

Nikolay: Which is different.

It's outside of normal work.

It's for like, let's drop the connection
if it doesn't do anything.

Michael: Yeah, but that was a recent,
like, version 14 edition.

So still only a couple of years
ago.

Nikolay: Ah, you want to say that
it's also a very, very basic

thing and only added recently?

Yeah, I agree with that.

Usually people implemented it outside
Postgres as well, to drop

connections.

Michael: I've seen less need for
that one than I have seen for

this one, but like you, I didn't,
it's one of those things that I

kind of didn't realize I needed
because statement timeout can

cover a lot of the base, like a
lot of the use cases can be just

about coped with via statement
timeout, but one or two really can't.

And I think I saw your post and
I'll link it up the mailing list

thread that I think Andre started,
but your first reply to that

included some really good examples
of when you do need this in

addition to statement timeout or
like instead of statement timeout.

So is it worth like covering a
couple of those?

Yeah.

Nikolay: Yeah, let's discuss why
we use statement timeout in

OLTP like in regular the most common
cases like web and mobile

apps.

We had an episode about that and
I think it was the very first

episode maybe.

Speaking of counting.

Michael: Yeah, we've done a few
other episodes related to this

as well.

We did one on connections, one on
zero downtime migrations.

Nikolay: No, I'm talking about
the very first episode.

What was it?

Michael: Yeah, slow queries.

Nikolay: Slow queries, yes.

And this is about statement amount,
basically.

But people don't do SQL queries.

I mean, end users, web and mobile
app users, they don't do SQL

queries.

So they work at a higher level, HTTP
requests.

So why do we need statement timeout
at all?

Why?

What is the problem we are solving?

I have this honest question because
in my opinion, when we think

about what we try to solve with
statement timeout, it's not the

right tool usually.

For example, again, users work
at a high level and we know some

intermediate software settings,
middleware, which is usually

written in some Python, Java, Ruby,
anything.

These application servers and also
an HTTP server, like maybe front

end server, like an NGINX or something,
and VoIP, right?

They usually have some limits to
drop connections which last

too long, if a server is taking too
long to respond, or communication

taking too long, usually like 30
or 60 seconds.

This is what users will get if
something which should take 100

milliseconds or less, takes 30
seconds.

It's time to admit there is a problem
and tell the user that,

like, for 502 gateway timeout,
it's usually what NGINX was returning.

And then we, back to database,
we think, OK, we need to limit

as well.

But what exactly do we need to limit?

This is the question.

Is the statement timeout the right
tool for that?

Michael: I do think there's a like,
a lot of cases where it's

enough.

So if, for example, we're trying
to, in that case, let's say,

for example, the users or like
the app has already given up,

but the database carries on working.

Yeah, that's doing pointless kind
of, that's pointless work.

So avoiding pointless work would
be great.

But there are other cases too, where
even if the app didn't have

a timeout, is it okay for a few
users of a system to be using

all the resources when, you know,
like hogging those resources

by running some really, really
slow or really, really heavy queries.

Is that an acceptable trade-off
or do you want to limit that?

So I think there are other slightly
different cases.

Right, but

Nikolay: Work is not...

Statements are too small.

It's too atomic a piece, right?

Michael: Well, I think there's...

Before we move on to why transaction
timeout, there's also one

other case which I think or maybe
maybe this makes the point

perfectly. These long-running transactions,
even if they're a single

statement transaction, they can
block internal processes like

autovacuum and cause issues that we've,
I mean, we've spent multiple

episodes talking about the kind
of issues that long-running statements

could cause, never mind long-running
multi-statement transactions.

Nikolay: No, no, no, no, not statements
block it.

Transactions.

Michael: But as you've said multiple
times, you can't have a

statement without a transaction.

So like a long-running statement
is enough, but you're right

that...

Nikolay: You can have a statement
without transactions for example

creating an index concurrently. Three transactions
I mean, you're okay

without a transaction completely.

No, no dirty reads.

Michael: Exactly right,

Nikolay: Right, right.

So yeah, but like we want
to apply the right tool.

It means that it should suit for
all edge and corner cases.

And I'm saying statement timeout
is for a different hole to close.

Michael: You gave a really great
example that completely sold

me in that email thread I mentioned.

Nikolay: Email?

Michael: Yes.

And that was a case I've seen actually
quite recently, which

was...

...taratatata.

Nikolay: Yeah, I call it...

It's like from a machine gun.

Yeah, but - statements coming from
a machine gun.

Brief statements with brief pause.

Michael: And somebody setting a
transaction at the beginning,

so in the case I saw, it was deliberately
taking out, like doing

begin, then doing multiple updates
or like upsets in this case,

and then only at the end committing
it.

So any one of those statements was
really short, but because they

were doing tens of thousands or
even hundreds of thousands, I

think, in this case, the transaction
as a whole was longer than

you'd probably want on a OLTP
system.

So in that case, statement timeout
wouldn't have helped, or at

least any sane statement timeout
wouldn't have cancelled that.

But a transaction timeout...

Nikolay: An idle transaction session
timeout wouldn't cancel

this transaction because the breaks
are also short.

Michael: Yeah, it's not idle, exactly.

Nikolay: It's like from a machine
gun.

It can be select plus 1, by the
way, if it's wrapped in a transaction

block.

Select plus 1, this anti-pattern,
which...

Michael: Like n plus 1.

Nikolay: 0, sorry.

Yes, yes.

Yeah, yeah.

I mean, it's actually a terrible
name, but the idea is that it's

n selects in a loop.

So, we didn't notice that they
have a loop.

It can be short updates with the
idea that we want to make them

short because it's better, you
know, like to...

But if it's wrapped into a transaction,
you cannot release any

locks until the very end of the
transaction.

This is the rule.

All locks are released only if it's
a commit or rollback.

That's it.

And so it's a bad idea to split
into batches, updates or deletes.

Insert is almost never, I think,
an issue.

It makes sense to split.

An insert should be a single massive
insert, usually.

Like copy or insert.

But if it's updates and deletes,
people know there's a rule to

split, then they put it in a single
transaction and what's happening

here.

What do we do?

We're accumulating a lot of logs
and not releasing them.

So there are only 2 reasons, the
big reasons I see.

Well, the source of the transition
is 1 reason, but there are 2 problems

which are very bright.

They scream let's have transaction
timeout.

They did it for many years and
I always say, okay, this is a

problem we must solve on the application
side.

We cannot do it in PostgreSQL at
all.

Surprise.

Well, we can use pg-cron and put
maybe some, I call it terminator

snippet, working with pg_stat_activity,
checking the exact start, like

transaction start timestamp, and
then terminating.

So terminator script.

This name is derived from pg_terminate_backend.

So was it renamed or no?

I don't remember.

pg_start_backup was
renamed.

It's about naming and I support
it.

Well, I'm having 3 in my speech,
sorry.

So, pg_start_backup was renamed to
pg_basebackup start.

So the same should happen with
pg_terminate_backend, I think.

Now, pg_terminate_backend should be
pg_backend_terminate.

So, this terminate script is also
like a workaround.

It should be some simple solution.

Let's not allow too long-running
transactions.

Because we keep locks, this is
problem number 1 we just discussed.

And second problem you mentioned,
keeping xmin horizon frozen

basically, not shifting.

Which means that all, like xmin
horizon is the hidden xmin value

of a tuple which is the oldest
in the system and we usually cannot...

Autovacuum usually comes and deletes
it at that time, we delete

it but we cannot delete it if there
is some transaction which

still can read from this because
xmin is like current time not

past, not the past for this transaction
which means that if we

keep our transaction we block this
xmin frozen and that means

that Autovacuum cannot delete dead
tuples, more and more of them,

freshly dead tuples, I call them.

Or maybe it can be improved this
name.

So these 2 problems, locks and
xmin horizon frozen.

Locks are not released and they
both can hit and a lot of people

can notice and you can even be
down down server can be down in

some cases and you cannot limit
it inclusion in Postgres 16 I

was so happy to realize I'm not
mad.

It's obvious and I think it was
discussed maybe, so shouldn't

discuss many times, but we just
went ahead, many things to Andrej,

just went ahead and coded it.

Then many strange cases started
to appear.

For example, sub-transactions or
conflict with statement_timeout,

which should fire first, for example.

Michael: That's a good point.

Do you want to talk a tour about
in this future, like once Postgres

17 is out, if you're designing
a new OLTP system, are you even

using, like, do you normally use
global statement timeouts?

And if you did in the past would
you now replace that or would

you use both with different settings
how would you

Nikolay: do it?

Good question, good question.

So first of all in documentation
we can read that setting statement

timeout globally is not a good
idea and I think it's a very bad

idea to write this in documentation
because we discussed, even

in HTTP server, application server,
they have their own timeouts

database is usually the most like...

Data intensive work is usually
in database unless we talk about

some other stuff happening now
outside of web and mobile apps,

right?

I mean AI stuff like in GPU and
so on.

So we want to limit in all OLTP systems,
we definitely want to limit

our work to 30 seconds maybe.

It's a good limit.

This article and our very first
episode is about basics, which

are so like 30 years, like 40 years
since the beginning of the

Internet.

People don't like to wait more
than a second, in a normal case.

It means that we don't want to
limit globally and be protected.

Only specific users which need
it, always can change it, unless

it's restricted.

So you just set statement timeout
or anything and go.

Of course, if DDL is needed to
be executed, for example, create

index concurrently, of course,
you also should remove this limit.

But global setting is protection,
it should be there for OLTP

case.

And Postgres is like, OLTP is the
main case.

We don't usually think about analytical
queries.

So it means that I always recommend
setting this and documentation

is having bad advice in this case.

So 30 seconds, my recommendation.

Sometimes 15, it's better.

Michael: Oh, wow.

Nikolay: Yeah, well, even 10 maybe.

It depends.

It raises the bar requirement for
queries that this is about

quality actually.

So if your query is poorly designed
or you forgot to create a proper

index something like a bad plan,
be killed.

But now with transaction timeout,
I would say statement timeout

might be not needed at all.

Michael: Interesting.

I wondered if you were going to
say that.

Nikolay: So again, my logic is
applied at a higher level.

This is about each feature.

If we go down, in both cases, database
specific, Postgres specific

cases I mentioned, they also apply
to a transaction level at

higher levels.

We don't care about statement and
breaks between statements.

Michael: And do you think like
same order of magnitude?

Like you mentioned 30, I've seen
60 seconds quite often mentioned

as a sensible default, so 30 seconds
not too different.

Would you go similar, like for
any reason to double it or like

increase it, like add a little
bit of buffer there or just similar

number?

Nikolay: No, no, no, Why buffer?

Set it to 30, solve all problems
and then consider going even

down.

And if we talk about statement
timeout, why at all?

Like we need it, like, okay, we
put it, but what if we have transaction

timeout 30 seconds, for example,
which statement timeout would

I use?

30 seconds maybe as well.

But it's already solved.

Michael: I don't think there's
any, that wouldn't make sense

to me.

But I was wondering if you might
say you know statement timeout

30 seconds transaction timeout,
60 seconds and

Nikolay: Transaction? I don't... transaction
session timeout, 60 seconds.

But when transaction timeout? What?

Session timeout, you mean?

Michael: No, no, no, transaction.
So, transactions can consist

of multiple statements, right?

Right. So, there's no need to have
a statement timeout that is

bigger or equal to the transaction
timeout.

Nikolay: It doesn't make sense.

Michael: So the only thing that
could make sense would be a statement

timeout that's less than the transaction
timeout.

Nikolay: Right.

Michael: But it seems pretty neat.

Yeah.

Nikolay: But again, like if you
take transaction timeout, which

is already quite good, restrictive,
30 or 60 seconds, then I

think statement amount...

And I don't see logical to go down.

Okay, it's about resources.

If you go down, maybe you save
some resources by killing some statement

earlier than the transaction.

But it's not the main reason to
have transaction command for

me.

Well, of course. Maybe it's actually
super important.

Maybe I just forgot that we use
it.

Okay, I don't know.

I saw so many incidents where even
statement timeout was not achieved,

but resources were spent fully.

So I don't know.

I mean, we have very restrictive
statement timeout, very, very

low.

And still CPU 100% and everything
is down because a lot of work

is happening.

So statement timeout doesn't protect
you, even 15 seconds.

I mean, to start protecting, maybe
I would like to have it 1

second.

In this case, I think it will start
protecting.

Michael: Interesting.

Nikolay: But 30 seconds is too high.

Michael: Yeah, it's a tricky one,
right?

Because yeah, I think there's a
growing, and I know you mentioned

analytical isn't our primary case,
but there's a growing trend

that I'm seeing of kind of hybrid.

I've had it called HTAP databases
where people don't want to

spin up a second analytical, like
a warehouse or something, and they're

trying to run some small analytical
queries on their like OLTP

database, which means you do get
these kind of the odd several

second query or at least hundreds
of milliseconds running

on the same database.

So I can see how it could get tricky
for some folks.

But I think that the 30 seconds
and the higher feels like a sane

starting point, as you mentioned.

And if you start to see some errors
from those, look into them.

And if you don't, look into what
your slowest queries are and

whether you can reduce it.

Nikolay: In case of HTAP, I will
just have another database user

specifically for long-running
queries. I will adjust this setting

maybe setting it to a couple or maybe
5 minutes and allowing to

last longer, that's it.

Michael: Yeah, that's a good point
that I don't know if we mentioned

yet that these can be set at the
role level, at a session level,

and globally.

Nikolay: Well, I never tried, but
statement timeout definitely

can be set at user level.

I think lock_timeout should
work as well.

Worth checking by the way.

Michael: I thought the doc said
that, but I was checking the...

Because this is the development
doc.

Nikolay: Maybe even it's...

It should be

Michael: in the

Nikolay: testing tests maybe.

Maybe it's...

So I'm quite sure it should work,
yeah.

So you...

Anyway, we talk about statement
timeout right now.

We don't have lock_timeout
until 17.

So statement timeout can be set
for different users differently.

And I also had a case when people
had very low global statement

timeout, but then adjusted for people,
for human connections,

they adjusted it to set to 0.

Michael: How do you feel about
that?

Nikolay: I feel this is dangerous.

It's like someone can just run
something then drink some coffee

and then go for some call.

It's kind of dangerous.

By the

Michael: way, I've got bad news
for you.

The warning seems to have been
copied over to transaction timeout

documentation as well.

And it says in the docs, in the
new docs, setting transaction

timeout in postgresql.conf is not
recommended because it would

affect all sessions.

Nikolay: Yeah, actually it's a
good point.

I need to raise this again.

I don't agree with this.

It's a bad, very bad statement.

Michael: There's also 1 other note
in here that I had forgotten.

I guess it's only a minor note,
but prepared transactions are

not subject to this timeout.

Thought worth mentioning.

Nikolay: Yeah, it's so.

Michael: So, yeah.

Is there anything else?

I guess the story for how this
came about is pretty interesting.

The thread seems quite long, to
me at least.

Nikolay: As usual, a lot of unpredictable
problems and then people

seem to agree and then I think
okay, not 4.17.

But it was a good surprise last
week, right?

Or a couple of weeks ago.

Yeah, I just wanted to say, if
our world survives, right?

For folks who are listening to
this in 2025 or later, just start

thinking about limiting at transaction
level first and then go

down.

This is the main advice, but for
others, limit the application

side and limit statement amount
and idle transaction session

timeout, of course, both are good
to have limited to below a minute

maybe, at least 1 minute.

I also saw people are afraid of
an idle transaction session

timeout being low.

I don't understand this.

Put it very low, very low.

Like if someone is not doing anything,
bye bye.

OK, maybe for humans, you can limit
it to 1 or 2 minutes but

still you're probably holding some
very important exclusive locks

and blocking someone, it's not a
good idea so either commit or

do something already, right or

Michael: yeah the default
is no limit like there's absolutely

no limit at all.

So, any...

Yeah, even if you're really...

Even if you are concerned, what
would an hour limit look like?

Several hour limit look like?

Even that's going to be better
than none at all.

So yeah, I like that advice for
people to be checking.

Do you even have any of these limits
set?

Because the default is not to.

And even on a lot of cloud providers,
they don't set

Nikolay: defaults for you.

IGOR MINAROVICH-SHMAROVICH Yeah,
Nginx, if you work on web or

mobile apps, Nginx, for example,
has a default proxy timeout limit

of 60 seconds.

This is a good starting point to
think about it.

If 99% of your database work is
to serve HTTP requests, at least

60 seconds should be there.

And go down to details.

Also, one more thing.

Maybe some folks will be inspired.

I'm not a good drawer of some pictures,
but imagine if we had

a schema in documentation showing
the relationships between statement

and breaks between statement and
some big transaction and timeouts

will be visualized.

So for better understanding, this
is transaction timeout, between

them there are multiple statements,
they are also limited and

breaks are limited.

And also outside we have a session
timeout.

So idle session, how is it called?

Michael: Idle session timeout.

We have both.

Nikolay: No, session timeout, what
is it?

I mean-

Michael: We have idle in transaction
session timeout and we have

idle session timeout.

Nikolay: Idle session timeout.

This is what...

Like, it's outside of a transaction
also have limited.

So all periods of workflow are
covered now.

And to visualize this would be a
great idea.

Documentation already has a couple
of pictures.

It would be good picture to have.

Michael: think you'd be in the
first 10 still though if you got

this one in.

Nikolay: Right exactly, so maybe
someone will be inspired.

Good.

Michael: Nice.

Yeah.

Thanks so much Nikolay.

Nikolay: Catch you next week.

Bye.

Catch you

Michael: next week.

Nikolay: Bye.

Creators and Guests

Host

Michael Christofides

Founder of pgMustard

Host

Nikolay Samokhvalov

Founder of Postgres AI

transaction_timeout

Creators and Guests

Some kind things our listeners have said