
synchronous_commit
Michael: Hello and welcome to
Postgres.FM, a weekly show
about all things PostgreSQL.
I am Michael, founder of pgMustard
and this is Nikolay, founder
of Postgres.AI.
Hey Nikolay, how's it going?
Nikolay: Everything is alright,
how are you?
Michael: Yeah, good thank you.
It's been a nice sunny week here
in the UK.
Nikolay: Oh, this is who stole
our sun, okay.
Michael: So this week we got a
listener request, in fact we get
quite a lot of listener requests,
thank you everybody.
This one was from Shayon Mukherjee,
apologies for pronunciation,
and they asked us about
synchronous_commit and whether or when
it might make sense to ever turn
it off or use different settings
and I thought it's an interesting
topic full stop so I wanted
to get your thoughts as well on
this.
Nikolay: I have many thoughts on
this.
We need to start somewhere.
So
Michael: I wondered about starting, probably not in the most obvious place, but on a single-node setup.
I know this becomes more interesting,
multi-node, but on single
node, pure Postgres, there are
still some interesting discussion
points here, right?
Nikolay: Yeah, and even before that, there is single...
So the meaning of synchronous_commit is overloaded.
I know technically it defines the behavior of commit, right?
And there is local behavior, and if we have synchronous standbys, there is remote behavior.
But originally it was only about local.
Yeah.
And the decision then was, it was only on and off, two options.
This is how I remember it for more than a decade of being a Postgres user.
And then there is the overloaded meaning: remote_write and remote_apply, two more options.
So I think my first thought is
it's a huge mistake to mix things
here.
It causes pain for users like me.
All the time, I read it, I read it, I read it, and I still cannot remember this.
To remember this, you would need to deal only with this.
Michael: Yeah, or look it up. Well, if you can't remember it, you just look it up.
Nikolay: I just look it up. And every time you look it up, you try to understand the logic.
We have many places like this in the GUC system, the Postgres configuration system.
Like for example, you know, we
still do a lot of consulting.
We help our customers, many of
them are startup companies who
grow rapidly, very smart people.
They understand things very quickly,
but you need to explain
what's happening.
And one of the things, we deal with bloat all the time.
Bloat is one of the major concerns.
And we explain, okay, we need autovacuum tuning.
And there is autovacuum_work_mem, which is -1 by default.
It means you need to look at maintenance_work_mem.
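For reference, a quick way to see this fallback in a psql session (the values shown are just the defaults, not recommendations):

SHOW autovacuum_work_mem;    -- -1 by default, meaning "fall back to maintenance_work_mem"
SHOW maintenance_work_mem;   -- the value autovacuum workers actually use in that case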
Why was this done?
This is exactly the same thing.
Like why was this done to my brain?
Like somebody made some decision
to cause huge pain for ages.
It's a UX issue, a user experience issue, because instead of clarifying and separating things, we develop implicit dependencies and you need to be an expert.
And even if you are an expert, it's really hard to memorize all these things.
And the problem is, at some point we will use more and more AI, but explaining these things to AI is also difficult, because there are implicit dependencies.
So synchronous_commit can be local, but it also can be not local.
And local is on and off, and there is also the word local as a value, right?
Yeah.
So the meaning of on was rewritten.
Thank you, guys.
Thank you, whoever did this.
Thank you.
To explain, if you try to be a consultant and explain things to other very smart guys, you will see the pain.
It's really hard.
And why is this hard?
This is my first thought: why is this so hard?
Michael: As you say, it's because it's combining two things, right? And the behavior... I can see why they combined it, though.
Nikolay: I also can see why, no questions here.
Michael: And to their credit, the documentation is really nicely written, so people can be pointed to it.
Nikolay: It's nicely written, but it doesn't explain everything, so I hope we will touch on additional topics not covered by the documentation today.
Michael: Nice.
So let's talk about the single-node case: on or off, on by default, right?
So why would we want to turn it off?
What's happening here?
Nikolay: Yeah, if we forget about multi-node setups, we have just one node, right?
And we have write-heavy workloads, and we think write latency is not good, and eventually write throughput is not good, because if latency is not good, throughput won't be good either.
There is some dependency there, which is not straightforward.
And it means that, well, we know at commit time Postgres needs to make sure the information is written to WAL and it's present, synchronized, on disk.
Right.
So before that, Postgres cannot respond to the user that the commit was successful.
And when we say commit, it can be an explicit commit, or an implicit commit if we have a single-statement transaction, which is a super popular approach in web and mobile apps.
So we just have single-statement transactions all the time, separate inserts, separate updates and so on, and all of them have an implicit commit as well.
So commit can be successful only when WAL is already on disk.
And it means we need to wait for disk.
And the disk is not a RAM FS in 99.99% of cases, right?
If it's a regular disk, it can be quite slow.
And in this case, it matters most if our transactions are super fast, extremely fast, which is actually rare.
Usually they are not so fast, for various reasons, including, for example, index write amplification.
I'm in the mood to criticize Postgres today, you know. Yeah, you know me.
Michael: Well, yeah, so I feel like so far you've described the synchronous_commit equals on case, which is the default: we have to wait, and we're waiting for the flush to happen, so the writes to disk, before confirming back to the client that that commit has been successful.
But turning it off, we don't wait for the extra step of flushing to disk before confirming to the client, which sounds risky.
Well, the risk is that if there happens to be a crash on the node in the time between the commit being confirmed and the flush to disk, we lose any of that data.
So the risk here is some data loss on crash.
Nikolay: Yeah, the short tail of
WAL is lost.
Postgres will still reach consistency,
all good, but we lose
some part of data.
It's similar to any multi-node setup with asynchronous nodes when failover happens: if standbys are asynchronous, nobody guarantees that at failover everything will be present there, right?
Because it's asynchronous.
There can be lags, but there's consistency, all good.
And we can lose data; for example, by default in Patroni settings, as I remember, there's a setting, I always forget the name, but something like the maximum lag allowed for failover, and as I remember by default it's 1 megabyte, so up to 1 megabyte is okay to lose.
Here it's similar; there is no threshold, it's unlimited, but technically it cannot be huge.
Michael: I looked into it, and the default setting is the write-ahead log writer delay, wal_writer_delay, of 200 milliseconds.
And so I thought maybe only 200 milliseconds worth of writes could be lost, but in the documentation it said actually...
Yes, good point.
Actually, for implementation detail reasons, it says it's 3 times that.
So by default, it's 600 milliseconds that could be lost.
But you could reduce that.
This setting only makes sense to change if you're turning synchronous_commit off for any of your transactions.
But you can reduce it if you want.
So if you do decide to go for some use of synchronous_commit equals off, then this is a setting you could reduce to minimize data loss.
But at that point you might still be losing some data, so I'm not sure if it matters that much.
Nikolay: Right, right.
So maximum is 3 times of that wal_writer_delay, right.
And the question is, like, should we consider this?
My answer is if writes are super fast, maybe yes.
There is a trade-off here, obviously.
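For reference, a minimal sketch of what this combination could look like; the values are illustrative, not recommendations:

ALTER SYSTEM SET synchronous_commit = off;     -- commits stop waiting for the local WAL flush
ALTER SYSTEM SET wal_writer_delay = '50ms';    -- shrinks the window of unflushed WAL (up to 3x this) that a crash could lose
SELECT pg_reload_conf();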
Michael: Yeah, I was thinking about it in our last episode about time series data.
But you made a really good point: if you are writing lots and lots of little writes, often in time series workloads people suggest actually batching inserts...
Nikolay: So...
Michael: By that point, yeah, batching makes a load of sense as an optimization.
Nikolay: It's a good point.
So choosing between synchronous_commit off and batching, you
just choose batching.
Michael: Yeah, but then at that point your transactions are not that fast, because you're batching maybe a thousand or more of them together, and then the benefits of synchronous_commit equals off are much smaller.
Yeah, exactly, so I don't think you can get the benefit twice, if that makes sense; you don't get the benefit of doing both.
Nikolay: Single-row inserts versus one insert of 1,000 rows: 1,000 inserts versus a single 1,000-row insert.
Of course, you'll have just one commit.
It's great.
And it will be faster.
The same for updates, for example.
But of course you should avoid huge batches, because they also have downsides, like longer-lasting locks, or, in case such a batch fails, you lose more.
You need to retry more, right?
Instead of losing just one insert, you're losing all 1,000 inserts.
So as usual with batching, my rule of thumb is just to choose the batch size so it's roughly 1 or 2 seconds maximum, to avoid long-lasting locks and the risk of repeating too much.
In this case, it's great: just 1 commit per second, or maybe 10 commits per second, which is a very low number.
So the overhead of needing to wait for WAL to be synced to disk is super low; roughly, you can think of it as not more than 1 millisecond.
In that case, just forget about it; synchronous_commit off is not a good idea, that's it.
Yeah.
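As a rough illustration of the batching idea, with a hypothetical table, one multi-row INSERT replaces many single-row commits:

-- 1,000 separate single-row INSERTs mean 1,000 commits and 1,000 WAL flush waits.
-- One multi-row INSERT is a single commit:
INSERT INTO sensor_readings (sensor_id, ts, value)
VALUES
  (1, now(), 20.1),
  (2, now(), 19.8),
  (3, now(), 21.3);  -- ...and so on, sized to roughly a 1-2 second batch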
But usually things are much more complicated because we just
consider 1 type of workload.
Any project has many types of workload, right?
Michael: Yeah, and that's a good point that Shayon actually made in a follow-up post, which I think sometimes gets missed in this topic: it can be set on a per-transaction basis, or per-session or per-user basis.
So if you're doing a one-off migration or bulk load of data, and for some reason you can't batch, then maybe it's worth considering turning it off just for that user, or for that session, or per transaction.
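For reference, a sketch of the scoped variants being described; the role name is hypothetical:

-- Per transaction: only this transaction's commit is asynchronous.
BEGIN;
SET LOCAL synchronous_commit TO off;
-- ... bulk load statements ...
COMMIT;

-- Per session:
SET synchronous_commit TO off;

-- Per user/role:
ALTER ROLE bulk_loader SET synchronous_commit = off;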
Nikolay: What kind of situation could that be, where you bulk load but cannot batch?
Michael: I was struggling to think of one; maybe some streaming case, but it felt tenuous.
Nikolay: If technically the data comes from many, many sources, then instead of saying, okay, we need Kafka before it comes to Postgres and then we will batch, which is a heavy solution, maybe indeed this is exactly the case when it's worth considering off, right? To...
Michael: ...use off.
The other thing I was thinking about is whether this comes up
as a thought for people because of tools like pgbench that by
default, you get a lot of small transactions, a lot of small
very fast transactions.
So I wondered if it becomes interesting to people because they just think, let's quickly do a test with pgbench, and, oh wow, synchronous_commit, what's that?
Nikolay: Have you tried pgbench with synchronous_commit turned off?
I tried many times, because it's an obvious idea, and I don't remember seeing a huge advantage.
Because pgbench write transactions by default are not so simple; as I remember, it's an update plus 2 inserts, maybe a delete as well.
It's a multi-statement transaction, which is definitely not super lightweight.
Michael: Yeah, fair enough. I didn't try it.
Nikolay: Yeah. Well, I can imagine we can have
situations when the effect is
quite noticeable, like dozens of percent. But in general, from my practice, I stopped looking in this direction, to think about it often.
Michael: Yeah, fair enough.
Nikolay: And of course, data loss is not a good idea.
So, I don't know, for write-heavy workloads, maybe yes, but again, batching and all the proper tricks to reduce the number of actual commits happening, and that's it.
Michael: Yeah, and I think when people talk about data loss they quite often immediately jump to things like banking apps, you know, things where it's absolutely critical not to lose anything; you'd much rather have some downtime than have some data loss.
But I also think, user experience wise, even for just regular CRUD apps, losing data can be really confusing to people, even if it's only a not-that-important thing.
The cases I was thinking of are
much more like the time series
ones, where if we've got 1 sensor
reporting every minute, if
we lose 1 minute, if we lose a
couple of those sensor reports,
it's not that important.
We can still do a bunch of aggregates
on that data.
So I guess for monitoring and logging
type things, maybe it's
less important.
But there can be really important
logs, too, right?
So personally, I don't see too many use cases where this makes a lot of sense.
But there is a whole page on the
Postgres docs about, it's called
asynchronous commit.
So it must have been somewhat of
interest to quite a lot of people.
Nikolay: Right.
Yeah, yeah, yeah.
Michael: Should we switch to the next topic?
It feels like this gets more interesting when we start talking about synchronous...
Nikolay: Let's do it. Let's move on to multi-node.
First of all, on is overloaded when you have synchronous_standby_names non-empty, right?
We can have purely synchronous replication, or we can have so-called quorum commit.
And I just had a chat with one of our team members, and I agree, I heard criticism as well: quorum commit is a very misleading term here.
Because usually it's used like this: we have multiple standbys, they are synchronous by nature, but then we set synchronous_commit to, for example, remote_write.
And then we configure synchronous_standby_names to use ANY, right? We say ANY.
And for example, we have, say, 5 replicas, 5 standbys.
And we say any 1, right, any 1 of them.
And it means that commit happens on the primary and on any 1
additionally.
So on 2 nodes before a user receives success of commit.
It means 2 out of 6.
5 standbys, primary, 6 nodes.
So 2 out of 6 is not a quorum, because the usual definition of quorum means more than 50%.
Michael: So that would be 4 out of 6?
Nikolay: Usually, usually.
I think this is the criticism; it's conditional, right?
We may say, okay, in our case the definition of quorum is kind of rewritten, and we need 2 out of 6.
But it sounds a little bit strange.
So maybe, what should we call this situation, when we have not just one synchronous replica where every commit must happen on both, but multiple replicas and we say one more or two more must confirm?
How do we call it, semi-synchronous? I've heard this term as well, semi-synchronous.
Michael: Yeah, yeah, I actually kind of like it. It does imply...
Nikolay: Yeah, but officially...
Michael: It's confusing. Semi-synchronous is confusing enough that I have to look it up, which is probably helpful. I don't assume I know what it means, right?
But yeah, it doesn't fit into either category.
It's not synchronous and it's not asynchronous.
Yeah, I don't like it, though. I don't feel comfortable with it, because what's it saving us?
What's the benefit? Is the idea that if both crash at the same time...
Nikolay: Yeah, before we talk about it, let's finish with terminology.
Maybe... I'm looking up the meaning of the word quorum right now, and maybe it's not that bad, because in some cases we can say, okay, 25% is our quorum, right?
We can define these rules.
So the expectation that it should be more than 50%, maybe it's a false expectation.
Anyway, but you cannot say a percentage.
You say, like, ANY 1 or 2, right?
Configuring Postgres, in synchronous_standby_names you can use the ANY keyword and then... it's an interesting syntax, you saw it, right?
Michael: So- Yeah, I also saw you
could do, is it like priority?
Nikolay: Yeah, yeah, yeah.
Yeah, so yeah, there are some interesting
things there.
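For reference, a sketch of the syntax being discussed; standby names are hypothetical:

-- Quorum-style: wait for ANY 1 of the listed standbys to confirm.
ALTER SYSTEM SET synchronous_standby_names = 'ANY 1 (s1, s2, s3, s4, s5)';
-- Priority-based alternative: s1 is preferred, s2 takes over if s1 is unavailable.
-- ALTER SYSTEM SET synchronous_standby_names = 'FIRST 1 (s1, s2)';
SELECT pg_reload_conf();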
Nikolay: But anyway, the idea is they work together, synchronous_standby_names and synchronous_commit, in this case.
And if synchronous_standby_names is not empty, there is this magic happening.
If it's just a host name, it's purely synchronous replication, right?
But even purely synchronous, there are flavors we will discuss, based on this synchronous_commit setting.
If there is an expression there, like ANY or FIRST, blah, blah, blah, then it's already more complicated.
But it's interesting.
And the official documentation mentions it, it uses the term quorum commit, right?
Michael: I saw it in a Crunchy
Data blog post.
I didn't actually see it in the
documentation.
Nikolay: Okay, doesn't matter.
Let's talk about flavors.
Let's consider, for simplicity, that we have just one host name mentioned in synchronous_standby_names, no expressions, just one.
And if we don't change the synchronous_commit setting, the default is on, and here the meaning changes.
Right?
Which, like, this makes me... I started with this, right? It's overloaded.
The meaning changes, and what does it mean? It means that we need to wait until... what? Let's see.
Michael: So yeah, I think it's worth thinking about them progressively.
I think the order goes: remote_apply is the strongest.
So remote_apply will wait until it's not only been written to the write-ahead log and...
Nikolay: Flushed to disk...
Michael: ...but also available to queries, so read queries on the standby will be able to...
Nikolay: And here I think, okay, it's apply here, but why replay there?
In pg_stat_replication it's called replay_lsn, right?
Apply or replay, we consider them the same thing, right?
Yeah.
Right.
Michael: So that's the strongest.
I think that's the highest level
we can set, but it has additional
latency, obviously.
I think on is the next strongest.
I think we wait for flush but not for, what did you want to call
it?
Replay.
Nikolay: Replay or apply.
Michael: And then there's one level below on, which is the write.
Nikolay: There is no remote receive, there is remote write, right?
Michael: Yeah, but I think receive, they're saying, is the same as write.
Nikolay: Mmm, WAL data transmitted over the network and received, but not yet flushed to a file, right?
Yeah.
Write and receive are the same.
There is confusion because in different parts of Postgres we have different words, different verbs used, right?
But I agree, yeah.
So there are write, flush, and apply phases.
There are 3 steps here.
And there is no remote_flush, which, okay, overloading happens, but at least it would be good to have remote_flush and say, okay, on acts like flush; but there is no remote_flush.
There is only remote_write and remote_apply, and instead of flush we have on, saving on the number of words supported, right?
Michael: The number of settings, yeah. But they did add a local, which, according to the documentation, they're saying is only for completeness.
So it's interesting that they did add one.
Nikolay: It's not only for completeness.
And we will go there in a few minutes.
So on means remote flush, the medium setting, right?
WAL is written, yes, like it's happening on the primary.
On the primary, at commit time, it's flushed to disk.
We know that in the case of a crash, this WAL data is already on disk, but the data pages will need to be adjusted during recovery, right?
So it's not applied yet.
Right.
Good, good, good.
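A rough summary of the levels as discussed here, assuming synchronous_standby_names is non-empty (exact wait points are in the Postgres docs):

-- off          : don't wait, not even for the local WAL flush
-- local        : wait only for the local WAL flush on the primary
-- remote_write : additionally wait until the standby has received/written the WAL, but not flushed it
-- on           : additionally wait until the standby has flushed the WAL to disk (the missing "remote_flush")
-- remote_apply : additionally wait until the standby has replayed the WAL, so it's visible to reads
SET synchronous_commit = 'remote_apply';  -- can be set system-wide, per session, or per transaction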
So, how do you feel about this?
It's synchronous replication, which means that if we lose the primary, the new primary will have everything.
Which is great, right?
It will have everything, and it will just need to be applied, so, recovery.
But we don't lose anything.
It's great.
Agreed?
Michael: Yeah, with a little bit of latency cost on each, right?
Nikolay: Little bit.
Michael: Or a lot.
Nikolay: Yeah, this latency depends
on the network as well, a
lot.
Michael: Of course.
Nikolay: And usually we prefer to move standbys
to different availability
zones.
And this increases latency, of course.
And this means that if we keep this as is, synchronous_commit on and synchronous_standby_names with one host name, we will have a significant latency overhead.
This is why it's hard to use in heavily loaded projects, OLTP-like projects.
But it's quite a strict mode, right?
Michael: Yeah, and you can have
an HA setup that's within the
same availability zone, or at least
the same region, I'm guessing,
would massively reduce that.
Nikolay: Well, I wouldn't put it that way.
For me, one availability zone can disappear.
If you think about HA, we need to have a replica in a different availability zone.
Of course, there is a trade-off here: we need to weigh what we gain and what we lose, what's worse and what's best for us.
For me, HA means the replica should be in a different AZ.
And if it's in a different AZ, commit in this strict mode, remote flush, or rather on, since there's no remote flush, means we will need to wait for both synchronization with disk on the primary and on the standby.
And the network will add overhead here.
And I expect we won't be able to scale these writes very well.
So if it's thousands of writes per second, it will be a noticeable issue.
That's why in most cases I see people go to...
Well, by the way, remote_apply is interesting.
It's even stricter; we also need to wait for the WAL to be applied.
It kills performance even more, right?
What do you think about this mode?
It's extreme.
Michael: I actually lost you.
I think our network connection
is not great.
Nikolay: Okay, when I discussed
network issues, we had network
issues.
So I had moved us on to considering remote_apply.
Michael: Well, yeah, so actually a question here then.
This is no longer, I don't think, about the risk of data loss, because if we've flushed, then we're good.
So is this for read replicas?
This only makes sense if we're sending read traffic there.
Okay, great.
Nikolay: So this is about how quickly data becomes visible there.
We know that by default, on is acting like remote flush, which again doesn't exist as a named value.
Changes are not applied yet, so any reading session on the standby, SELECTs, reads on the standby, won't see this data yet.
And the...
Michael: Unless we're using remote
apply, in which case it would
always be there.
Nikolay: It will be there right after commit.
It's visible.
It's great.
So the delay between when it's written and when it's visible becomes zero.
Once it's committed, it's immediately visible on the standby.
Great.
Yeah.
And when do we need this? The question is when do we need it; I don't know.
I don't work with such products, so I'm...
Michael: Right, because by the point you need read replicas, the
you need read replicas, the
point of having read replicas is
because you have so much read
traffic that your primary can't
serve it all, right?
So we're talking about extremely
high, at least extremely high
reads.
Maybe cases where we have a ton more reads than writes, and write latency is not that important; maybe social media again.
Nikolay: Yeah, in this case this replica can be in the same availability zone, because it's not for HA purposes, right?
It's just for read distribution needs, maybe.
I honestly don't know.
We usually already have mechanisms implemented everywhere to deal with replication lag: stick to the primary right after writes, and within the same user session for some time, these kinds of things, right? So...
Michael: Yeah, you mentioned sticky reads before, yeah.
Nikolay: In this case, slowing down commits for the sake of reducing this delay, this lag, I don't know.
I would prefer to deal with small lags but not slow down commits.
Yeah.
Michael: So that's, I guess, remote apply, and the default
is on.
Do you ever... do you see use cases for remote write?
Nikolay: Remote write is the most popular.
Michael: Oh really?
Interesting.
Nikolay: Of course. In most setups where synchronous_standby_names is non-empty, I see they use remote apply... most of them, remote write, remote write.
Michael: Right, that makes more
sense.
Nikolay: So the list is complete, right?
And here we have issues, because this is a super popular option.
And, oh, by the way, I wanted to mention that yesterday on Hacker News there was the article from Aphyr, right?
Jepsen.
Yeah, Jepsen.
It was discussing RDS multi-AZ clusters, which is interesting, it's close to our topic today, but to understand the article, you need time.
So, great, we have remote write being very popular.
What does it mean?
At commit, the standby node has received all the data, but hasn't flushed it to disk yet.
It means that if all nodes suddenly shut down, only the primary will have this data.
After commit, we know the primary flushed it, right?
We don't have settings which let us control the flush behavior on standby nodes; with remote write, on the standby it's not flushed yet.
But if we have remote write, on, or remote apply, in any of these cases the disk on the primary has all the data from this transaction, right?
So it's already not off at all; it's not off.
And imagine we lose the primary.
We know the primary has this data, but the standby, since it's remote write, is not guaranteed to have flushed it; it might miss this data, and this transaction might be lost.
And this is okay.
Okay.
But in case we lose the primary completely, we lose data, right?
We lose data here.
Yeah.
But we lose data only if all nodes are shut down.
Usually when we lose the primary, standby nodes are not shut down.
Unless something like...
Yeah, yeah, yeah.
So let's say the loss of the primary is our, like, corner case; not corner, edge case.
But losing the primary while standby nodes are also shut down, not lost but shut down, is a very, very corner case, several edges, several problems simultaneously.
So usually they're still running.
That's why remote write is so popular, because it says, okay, we lost the primary, we need to perform auto failover or manual failover, usually auto failover.
And the standbys, 1 or 2 or however many you configure of those standby nodes, per the setting, synchronous_commit equals remote_write, they have it in memory.
And they're still running, so they will flush it.
So the risk is understood here.
Like, loss of the primary and everyone shut down suddenly.
But it's so low that we can accept it.
And it still feels like a synchronous commit to me, because I doubt that when we lose the primary, the standby node which received but not yet flushed this WAL data will suddenly be off too.
It will be running, right?
And we don't lose this data.
This is great.
This is why it's so popular.
But it's great.
Except it's not.
Because when auto failover needs to happen... It's rare, but sometimes people allocate, like on CrunchyBridge, or on RDS, not a multi-AZ cluster but a multi-AZ standby, there are 2 options there now.
And multi-AZ standby means there is a synchronous standby serving HA purposes only; you cannot send queries to it.
Crunchy Bridge has the same, well, not quite the same.
They use Postgres replication to solve this; exactly, they configure synchronous_standby_names.
And they set synchronous_commit to a very weird value, we will discuss it.
But the idea is the HA replica doesn't receive traffic, it serves only HA purposes.
Great.
But in many other cases, people use, they say, this quorum commit: 1 or 2 additional nodes must say the data is received.
Remote write, right?
And in this case, if we lose the primary, auto-failover will happen to where?
To one of these nodes which received it, right?
But the problem is, for example, if it's Patroni, Patroni cannot choose this node, because there is no such capability to find which one...
Like, again, we have 5 replicas, for example, 5 standbys.
We know one of them received the latest changes.
How do we find it?
Michael: At least, yeah.
Nikolay: There is no such capability, it doesn't exist.
Because there is pg_last_wal_receive_lsn, but it has the wrong name, because it shows... it shows the flush LSN.
It's a bug.
Oh, a bug or a feature, I don't know.
So this is a problem.
And if Patroni uses that, I think it uses that, it chooses the node with the maximum flushed LSN, not the maximum received LSN, and these might be different.
Yeah.
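For reference, a sketch of where these positions can be seen (Postgres 10+ names); the gap Nikolay describes is that the received-but-not-yet-flushed position isn't exposed separately on the standby:

-- On the primary: per-standby WAL positions from replication feedback.
SELECT application_name, sent_lsn, write_lsn, flush_lsn, replay_lsn
FROM pg_stat_replication;

-- On a standby: despite the name, this returns WAL received *and synced to disk* by the walreceiver.
SELECT pg_last_wal_receive_lsn();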
Michael: So with that in mind, I guess it's then an acceptable risk, or that's why it can have some data loss.
But why not use on in that case?
Nikolay: Because it will slow down
commits.
Michael: OK, that's, yes, the trade-off.
Nikolay: We distribute our nodes
among various, multiple AZs,
availability zones, right?
And there is network overhead.
Michael: Yeah, well, interestingly, I checked our own Google Cloud SQL setup before coming on, and I wondered what our setting was, and it was on.
So either they haven't...
I'm pretty sure...
But yeah, it's interesting that they could have chosen to set it to remote write.
Nikolay: And synchronous_standby_names
is non-empty?
Michael: I didn't check.
It's a good question.
Nikolay: If it's empty, you have pure asynchronous replication.
This is it.
Like, this is it.
Honestly, it concerns me a lot how overloaded the values and names are here.
On means different things depending on other settings.
Why?
Maybe we are trying to save on the number of configuration settings, GUC names, not to expand it too much.
Okay, 300, but let's not expand it to 500, right?
But this causes pain in brains, for people who are using this.
So if it's on, okay.
Is it?
Michael: No, you're right.
Nikolay: It's asynchronous.
Michael: If there are no synchronous_standby_names...
Nikolay: Yeah, yeah.
So it waits for flush only on the
primary.
Michael: Yeah, so that's basically
just local.
Nikolay: Yeah, yeah, yeah.
Good.
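For reference, a quick check you can run on any setup, managed or not, to see what the discussion above means for you:

SHOW synchronous_standby_names;  -- empty means replication is purely asynchronous
SHOW synchronous_commit;         -- on / local / off / remote_write / remote_apply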
Yeah, so there's this problem, it's unsolved.
It should be solved in Postgres, and I think it's not super difficult to solve: we need a function which will show the write LSN or receive LSN.
By the way, receive and write may also be a little bit different.
You receive, but okay, it doesn't matter, because the WAL receiver is a receiver, right?
Anyway, a couple more things, deeper things, I wanted to mention.
One of them is that even if we have remote write, there is an issue in the code that causes flushing periodically.
So we have remote write, but sometimes we wait until it's flushed.
And I guess this issue is known; it reveals itself only under very heavy loads.
And yeah, it also could be improved, maybe, right?
Because remote write is not purely remote write: at some point there's a flush, and all transactions currently being committed wait for that additional flush.
And then it's again write, write, write, and then again flush.
And it can be seen in the code that there are things there, fdatasync, right?
Michael: Roughly, yeah.
I mean, is this also the wal_writer_delay, or a different kind of delay?
Nikolay: No, no, no. It's not documented.
Michael: Okay, interesting.
Nikolay: So, I mean, if you use remote write, the behavior is not purely remote write.
It still involves disk sometimes, periodically.
I guess some buffer gets full and needs to be written.
But this causes delays of commits.
Michael: So.
Makes sense.
Nikolay: Yeah. And another thing, probably off topic, but CrunchyBridge, let's discuss what we observed.
We just created a fresh cluster and saw a few things.
One of them is that synchronous_standby_names is filled in.
I created an HA cluster, so the HA cluster was all good.
And synchronous_standby_names has this host, which is the HA replica.
It's not visible to us; we cannot send queries to it.
And synchronous_commit is set to local.
Michael: Yeah, so the first thing
you mentioned, it being non-empty
means they're using Postgres replication,
right?
What's the significance of that
being non-empty?
Nikolay: If it's not empty, it's supposed to be...
If synchronous_commit was the default, on, it would mean we have synchronous replication.
Flush, as you said; on means remote flush in this sense, the non-existing remote flush, right?
But synchronous_commit is set to local, which means we don't look at the state of what's happening on the standby.
So the commit behavior depends only on the primary.
Michael: Like my setup, right?
Like on Google Cloud SQL for me, even though it's on, it's kind of like local, because they're not using synchronous_standby_names.
Nikolay: Right, but they still have the replica.
Michael: Yeah, Yeah, yeah.
Nikolay: Strange, right?
Michael: But local is confusing, because, yeah, as you said, I would have expected it to be on, or, as you say, the more popular one.
Nikolay: If it was on, this replica would be synchronous.
If it's local...
Michael: I guess it's unsynchronous?
Yeah.
Or asynchronous, yeah.
Nikolay: Asynchronous, right.
Although it's mentioned in synchronous_standby_names.
Michael: Yeah, good point.
What would you prefer it to be? Remote write, I guess, rather than on?
Nikolay: No, no, no.
I think, like, I'm not sure it's a mistake.
Michael: Well, it's not the default, right, so it's definitely deliberate.
Nikolay: There are some thoughts behind this, I think, and I'm trying to guess what they are, because obviously making it synchronous would cause slow commits, right?
Performance would be not great.
Okay, we go to local, but why...
Michael: ...do people do that?
Nikolay: Yeah, we keep it local, so this HA replica becomes asynchronous.
Huh.
Michael: Do you think it's to look good in benchmarking?
Nikolay: Why can't we configure an asynchronous replica in a normal way?
Why should we put it into synchronous_standby_names, right?
Because it could be a regular asynchronous replica, present in a slot, but not available for read-only transactions, right?
But it's there.
And the only idea, it's not my idea, we discussed this situation in the team, and the idea that came up in one great mind was: probably they want WAL to be written by backends, not by the WAL writer, because there's this effect that when synchronous_standby_names is not empty, backends are doing the writing.
And this can increase throughput.
It's a guess, a hypothesis.
If Crunchy is listening to us,
and we know they're listening
because they changed the random
page cost to 1.1, right?
Michael: Oh yeah.
Nikolay: Actually, it's great.
I also wanted to thank them because
since then I stopped saying
1 and I say 1.1 because Crunchy
benchmarked it and decided to
use 1.1 for random page cost.
Great.
Thank you so much.
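For reference, the change being referred to; 1.1 is the value mentioned on the show, not a universal recommendation:

ALTER SYSTEM SET random_page_cost = 1.1;
SELECT pg_reload_conf();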
But if you guys are listening, please let us know what's happening with these settings.
It's really interesting.
Maybe it should be documented, because it's interesting.
And so, overall impression about synchronous_commit, all those options and synchronous replication: I honestly see more purely asynchronous setups.
But I know serious guys, and I see clusters with synchronous, like quorum commit, configured.
My impression is still that there are things to be fixed and improved.
This is what I observe.
And one more thing, let's name it.
When we created this cluster with HA, with the HA checkbox, an HA replica was also created.
We see it in synchronous_standby_names.
And then we decided, using our superuser, which Crunchy provides, and it's great, I think it's the only managed service, one of maybe two actually, which provides superuser access, which is great.
So we decided to change shared_preload_libraries and restart.
And there was a question: is this change propagated to the standby node or not?
And what happened: a Postgres restart, not a cluster restart, a Postgres restart caused a failover.
This is not all right.
This is not all right.
A freshly created cluster, and we did it 3 times; 2 out of 3 times, a failover happened, and the other time it didn't.
So I guess some race condition is present there, right?
And we saw that the change was not propagated to the standby.
So it's actually data loss.
If I have superuser, it's my right, right?
And I run ALTER SYSTEM, it writes the change, we can see it in pg_settings, and it's not yet applied, pending_restart is true, right?
And then we have a failover and we don't see it anymore.
What is it?
It's data loss.
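For reference, a sketch of what was being checked; the shared_preload_libraries value is illustrative:

ALTER SYSTEM SET shared_preload_libraries = 'pg_stat_statements';  -- written to postgresql.auto.conf
SELECT pg_reload_conf();
-- Until a restart, pg_settings shows the change as pending:
SELECT name, setting, pending_restart
FROM pg_settings
WHERE name = 'shared_preload_libraries';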
Michael: Or it's unsupported behavior.
Nikolay: No, no, no, it's data loss.
I wrote something.
Using alter system, I wrote something and I lost it.
Michael: So you can't have it both ways. Like, I see where you're coming from, 100%.
But on one hand you want superuser access, and on the other hand you're going to call it data loss if you change anything that doesn't get propagated.
Nikolay: Let me contradict myself a bit; maybe I'm too harsh.
I know ALTER SYSTEM is actually local: we can change settings on each standby separately, independently, so it's not guaranteed, it's not propagated by replication.
I just expected maybe some automation exists which would propagate it.
If this replica is an HA replica, maybe we should match settings.
Because this setting required a restart, but what happens to settings that don't require a restart?
If I change them using ALTER SYSTEM, then what, do I lose them?
This is not good, right?
Michael: Yeah.
But equally, expectations of automation around this stuff could be a nice excuse for other providers not to add superuser access.
So it's, yeah...
Nikolay: No, no, no. Well, I think it's no, no, no.
Michael: I'm just saying, be careful what you wish for.
Nikolay: Well, it could happen, like I change something using
UI and...
Michael: No, because if you use the UI, they change it somewhere
else and it does get propagated.
Nikolay: Maybe.
But I'm rooting for Crunchy here, actually.
I think this problem can be solved.
I don't want them to close superuser access.
It's a great feature, a super great feature; I think everyone should have it.
Based on our previous week's topic, my position is that RDS and others are taking too much from us.
It stops feeling like open source.
Michael: We don't need to make this point again.
Nikolay: Yeah, okay.
Well, I'm going to repeat it many times, right?
So it's not all right.
And Crunchy are doing a great job here providing superuser access, and I think this can be solved, right?
And one more thing, if you're listening: when I restart Postgres, I see a cluster restart, right?
Michael: I feel like I need to chop this bit up and send it to them as a support thing.
Nikolay: Good, yeah, it was good.
Michael: Nice one.
Nikolay: Thank you.
Michael: Yeah, thanks. Thanks so much, Nikolay, and thanks for the request, and yeah, catch you next week.
Nikolay: See you.
Bye.