
synchronous_commit
Michael: Hello and welcome to
Postgres.FM, a weekly show
about all things PostgreSQL.
I am Michael, founder of pgMustard
and this is Nikolay, founder
of Postgres.AI.
Hey Nikolay, how's it going?
Nikolay: Everything is alright,
how are you?
Michael: Yeah, good thank you.
It's been a nice sunny week here
in the UK.
Nikolay: Oh, this is who stole
our sun, okay.
Michael: So this week we got a
listener request, in fact we get
quite a lot of listener requests,
thank you everybody.
This one was from Shayon Mukherjee,
apologies for pronunciation,
and they asked us about
synchronous_commit and whether or when
it might make sense to ever turn
it off or use different settings
and I thought it's an interesting
topic full stop so I wanted
to get your thoughts as well on
this.
Nikolay: I have many thoughts on
this.
We need to start somewhere.
So
Michael: I wondered about starting, probably not in the most obvious place, but on a single-node setup.
I know this becomes more interesting,
multi-node, but on single
node, pure Postgres, there are
still some interesting discussion
points here, right?
Nikolay: Yeah, and even before that, there is single...
So the meaning of synchronous_commit is overloaded.
I know technically it defines the behavior of commit, right?
And there is local behavior, and if we have synchronous standbys, there is remote behavior.
But originally it was only about local.
Yeah.
And the decision then was, it was only on and off, two options.
This is how I remember it for more than a decade of being a Postgres user.
And then there is the overloaded meaning: remote_write and remote_apply, two more options.
So I think my first thought is
it's a huge mistake to mix things
here.
It causes pain for users like me.
All the time, I read it, I read it, I read it, and I still cannot remember this.
To remember this, you would need to deal only with this.
Michael: Yeah, or look it up. Well, if you can't remember it, you just look it up.
Nikolay: I just look it up. And every time you look it up, you try to understand the logic.
We have many places like this in the GUC system, the Postgres configuration system.
Like for example, you know, we
still do a lot of consulting.
We help our customers, many of
them are startup companies who
grow rapidly, very smart people.
They understand things very quickly,
but you need to explain
what's happening.
And one of the things, we deal with bloat all the time.
Bloat is one of the major concerns.
And we explain, okay, we need autovacuum tuning.
And there is autovacuum_work_mem, which is -1 by default.
It means you need to look at maintenance_work_mem.
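For reference, a quick way to see this fallback in a psql session (the values shown are just the defaults, not recommendations):

SHOW autovacuum_work_mem;    -- -1 by default, meaning "fall back to maintenance_work_mem"
SHOW maintenance_work_mem;   -- the value autovacuum workers actually use in that case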
Why was this done?
This is exactly the same thing.
Like why was this done to my brain?
Like somebody made some decision
to cause huge pain for ages.
It's a UX issue, a user experience issue, because instead of clarifying and separating things, we develop implicit dependencies and you need to be an expert.
And even if you are an expert, it's really hard to memorize all these things.
And the problem is, at some point we will use more and more AI, but explaining these things to AI is also difficult, because there are implicit dependencies.
So synchronous_commit can be local, but it also can be not local.
And local is on and off, and there is also the word local as a value, right?
Yeah.
So the meaning of on was rewritten.
Thank you, guys.
Thank you, whoever did this.
Thank you.
To explain, if you try to be a consultant and explain things to other very smart guys, you will see the pain.
It's really hard.
And why is this hard?
This is my first thought: why is this so hard?
Michael: As you say, it's because it's combining two things, right? And the behavior... I can see why they combined it, though.
Nikolay: I also can see why, no questions here.
Michael: And to their credit, the documentation is really nicely written, so people can be pointed to it.
Nikolay: It's nicely written, but it doesn't explain everything, so I hope we will touch on additional topics not covered by the documentation today.
Michael: Nice.
So let's talk about the single-node case: on or off, on by default, right?
So why would we want to turn it off?
What's happening here?
Nikolay: Yeah, if we forget about multi-node setups, we have just one node, right?
And we have write-heavy workloads, and we think write latency is not good, and eventually write throughput is not good, because if latency is not good, throughput won't be good either.
There is some dependency there, which is not straightforward.
And it means that, well, we know at commit time Postgres needs to make sure the information is written to WAL and it's present, synchronized, on disk.
Right.
So before that, Postgres cannot respond to the user that the commit was successful.
And when we say commit, it can be an explicit commit, or an implicit commit if we have a single-statement transaction, which is a super popular approach in web and mobile apps.
So we just have single-statement transactions all the time, separate inserts, separate updates and so on, and all of them have an implicit commit as well.
So commit can be successful only when WAL is already on disk.
And it means we need to wait for disk.
And the disk is not a RAM FS in 99.99% of cases, right?
If it's a regular disk, it can be quite slow.
And in this case, it matters most if our transactions are super fast, extremely fast, which is actually rare.
Usually they are not so fast, for various reasons, including, for example, index write amplification.
I'm in the mood to criticize Postgres today, you know. Yeah, you know me.
Michael: Well, yeah, so I feel like so far you've described the synchronous_commit equals on case, which is the default: we have to wait, and we're waiting for the flush to happen, so the writes to disk, before confirming back to the client that that commit has been successful.
But turning it off, we don't wait for the extra step of flushing to disk before confirming to the client, which sounds risky.
Well, the risk is that if there happens to be a crash on the node in the time between the commit being confirmed and the flush to disk, we lose any of that data.
So the risk here is some data loss on crash.
Nikolay: Yeah, the short tail of
WAL is lost.
Postgres will still reach consistency,
all good, but we lose
some part of data.
It's similar to any multi-node setup with asynchronous nodes when failover happens: if standbys are asynchronous, nobody guarantees that at failover everything will be present there, right?
Because it's asynchronous.
There can be lags, but there's consistency, all good.
And we can lose data; for example, by default in Patroni settings, as I remember, there's a setting, I always forget the name, but something like the maximum lag allowed for failover, and as I remember by default it's 1 megabyte, so up to 1 megabyte is okay to lose.
Here it's similar; there is no threshold, it's unlimited, but technically it cannot be huge.
Michael: I looked into it, and the default setting is the write-ahead log writer delay, wal_writer_delay, of 200 milliseconds.
And so I thought maybe only 200 milliseconds worth of writes could be lost, but in the documentation it said actually...
Yes, good point.
Actually, for implementation detail reasons, it says it's 3 times that.
So by default, it's 600 milliseconds that could be lost.
But you could reduce that.
This setting only makes sense to change if you're turning synchronous_commit off for any of your transactions.
But you can reduce it if you want.
So if you do decide to go for some use of synchronous_commit equals off, then this is a setting you could reduce to minimize data loss.
But at that point you might still be losing some data, so I'm not sure if it matters that much.
Nikolay: Right, right.
So maximum is 3 times of that wal_writer_delay, right.
And the question is, like, should we consider this?
My answer is if writes are super fast, maybe yes.
There is a trade-off here, obviously.
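For reference, a minimal sketch of what this combination could look like; the values are illustrative, not recommendations:

ALTER SYSTEM SET synchronous_commit = off;     -- commits stop waiting for the local WAL flush
ALTER SYSTEM SET wal_writer_delay = '50ms';    -- shrinks the window of unflushed WAL (up to 3x this) that a crash could lose
SELECT pg_reload_conf();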
Michael: Yeah, I was thinking about it in our last episode about time series data.
But you made a really good point: if you are writing lots and lots of little writes, often in time series workloads people suggest actually batching inserts...
Nikolay: So...
Michael: By that point, yeah, batching makes a load of sense as an optimization.
Nikolay: It's a good point.
So choosing between synchronous_commit off and batching, you
just choose batching.
Michael: Yeah, but then at that point your transactions are not that fast, because you're batching maybe a thousand or more of them together, and then the benefits of synchronous_commit equals off are much smaller.
Yeah, exactly, so I don't think you can get the benefit twice, if that makes sense; you don't get the benefit of doing both.
Nikolay: Single-row inserts versus one insert of 1,000 rows: 1,000 inserts versus a single 1,000-row insert.
Of course, you'll have just one commit.
It's great.
And it will be faster.
The same for updates, for example.
But of course you should avoid huge batches, because they also have downsides, like longer-lasting locks, or, in case such a batch fails, you lose more.
You need to retry more, right?
Instead of losing just one insert, you're losing all 1,000 inserts.
So as usual with batching, my rule of thumb is just to choose the batch size so it's roughly 1 or 2 seconds maximum, to avoid long-lasting locks and the risk of repeating too much.
In this case, it's great: just 1 commit per second, or maybe 10 commits per second, which is a very low number.
So the overhead of needing to wait for WAL to be synced to disk is super low; roughly, you can think of it as not more than 1 millisecond.
In that case, just forget about it; synchronous_commit off is not a good idea, that's it.
Yeah.
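As a rough illustration of the batching idea, with a hypothetical table, one multi-row INSERT replaces many single-row commits:

-- 1,000 separate single-row INSERTs mean 1,000 commits and 1,000 WAL flush waits.
-- One multi-row INSERT is a single commit:
INSERT INTO sensor_readings (sensor_id, ts, value)
VALUES
  (1, now(), 20.1),
  (2, now(), 19.8),
  (3, now(), 21.3);  -- ...and so on, sized to roughly a 1-2 second batch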
But usually things are much more complicated because we just
consider 1 type of workload.
Any project has many types of workload, right?
Michael: Yeah, and that's a good point that Shayon actually made in a follow-up post, which I think sometimes gets missed in this topic: it can be set on a per-transaction basis, or per-session or per-user basis.
So if you're doing a one-off migration or bulk load of data, and for some reason you can't batch, then maybe it's worth considering turning it off just for that user, or for that session, or per transaction.
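For reference, a sketch of the scoped variants being described; the role name is hypothetical:

-- Per transaction: only this transaction's commit is asynchronous.
BEGIN;
SET LOCAL synchronous_commit TO off;
-- ... bulk load statements ...
COMMIT;

-- Per session:
SET synchronous_commit TO off;

-- Per user/role:
ALTER ROLE bulk_loader SET synchronous_commit = off;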
Nikolay: What kind of situation could that be, where you bulk load but cannot batch?
Michael: I was struggling to think of one; maybe some streaming case, but it felt tenuous.
Nikolay: If technically the data comes from many, many sources, then instead of saying, okay, we need Kafka before it comes to Postgres and then we will batch, which is a heavy solution, maybe indeed this is exactly the case when it's worth considering off, right? To...
Michael: ...use off.
The other thing I was thinking about is whether this comes up
as a thought for people because of tools like pgbench that by
default, you get a lot of small transactions, a lot of small
very fast transactions.
So I wondered if it becomes interesting to people because they just think, let's quickly do a test with pgbench, and, oh wow, synchronous_commit, what's that?
Nikolay: Have you tried pgbench with synchronous_commit turned off?
I tried many times, because it's an obvious idea, and I don't remember seeing a huge advantage.
Because pgbench write transactions by default are not so simple; as I remember, it's an update plus 2 inserts, maybe a delete as well.
It's a multi-statement transaction, which is definitely not super lightweight.
Michael: Yeah, fair enough. I didn't try it.
Nikolay: Yeah. Well, I can imagine we can have
situations when the effect is
quite noticeable, like dozens of percent. But in general, from my practice, I stopped looking in this direction, to think about it often.
Michael: Yeah, fair enough.
Nikolay: And of course, data loss is not a good idea.
So, I don't know, for write-heavy workloads, maybe yes, but again, batching and all the proper tricks to reduce the number of actual commits happening, and that's it.
Michael: Yeah, and I think when people talk about data loss they quite often immediately jump to things like banking apps, you know, things where it's absolutely critical not to lose anything; you'd much rather have some downtime than have some data loss.
But I also think, user experience wise, even for just regular CRUD apps, losing data can be really confusing to people, even if it's only a not-that-important thing.
The cases I was thinking of are
much more like the time series
ones, where if we've got 1 sensor
reporting every minute, if
we lose 1 minute, if we lose a
couple of those sensor reports,
it's not that important.
We can still do a bunch of aggregates
on that data.
So I guess for monitoring and logging
type things, maybe it's
less important.
But there can be really important
logs, too, right?
So personally, I don't see too many use cases where this makes a lot of sense.
But there is a whole page on the
Postgres docs about, it's called
asynchronous commit.
So it must have been somewhat of
interest to quite a lot of people.
Nikolay: Right.
Yeah, yeah, yeah.
Michael: Should we switch to the next topic?
It feels like this gets more interesting when we start talking about synchronous...
Nikolay: Let's do it. Let's move on to multi-node.
First of all, on is overloaded when you have synchronous_standby_names non-empty, right?
We can have purely synchronous replication, or we can have so-called quorum commit.
And I just had a chat with one of our team members, and I agree, I heard criticism as well: quorum commit is a very misleading term here.
Because usually it's used like this: we have multiple standbys, they are synchronous by nature, but then we set synchronous_commit to, for example, remote_write.
And then we configure synchronous_standby_names to use ANY, right? We say ANY.
And for example, we have, say, 5 replicas, 5 standbys.
And we say any 1, right, any 1 of them.
And it means that commit happens on the primary and on any 1
additionally.
So on 2 nodes before a user receives success of commit.
It means 2 out of 6.
5 standbys, primary, 6 nodes.
So 2 out of 6 is not a quorum, because the usual definition of quorum means more than 50%.
Michael: So that would be 4 out of 6?
Nikolay: Usually, usually.
I think this is the criticism; it's conditional, right?
We may say, okay, in our case the definition of quorum is kind of rewritten, and we need 2 out of 6.
But it sounds a little bit strange.
So maybe, what should we call this situation, when we have not just one synchronous replica where every commit must happen on both, but multiple replicas and we say one more or two more must confirm?
How do we call it, semi-synchronous? I've heard this term as well, semi-synchronous.
Michael: Yeah, yeah, I actually kind of like it. It does imply...
Nikolay: Yeah, but officially...
Michael: It's confusing. Semi-synchronous is confusing enough that I have to look it up, which is probably helpful. I don't assume I know what it means, right?
But yeah, it doesn't fit into either category.
It's not synchronous and it's not asynchronous.
Yeah, I don't like it, though. I don't feel comfortable with it, because what's it saving us?
What's the benefit? Is the idea that if both crash at the same time...
Nikolay: Yeah, before we talk about it, let's finish with terminology.
Maybe... I'm looking up the meaning of the word quorum right now, and maybe it's not that bad, because in some cases we can say, okay, 25% is our quorum, right?
We can define these rules.
So the expectation that it should be more than 50%, maybe it's a false expectation.
Anyway, but you cannot say a percentage.
You say, like, ANY 1 or 2, right?
Configuring Postgres, in synchronous_standby_names you can use the ANY keyword and then... it's an interesting syntax, you saw it, right?
Michael: So- Yeah, I also saw you
could do, is it like priority?
Nikolay: Yeah, yeah, yeah.
Yeah, so yeah, there are some interesting
things there.
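For reference, a sketch of the syntax being discussed; standby names are hypothetical:

-- Quorum-style: wait for ANY 1 of the listed standbys to confirm.
ALTER SYSTEM SET synchronous_standby_names = 'ANY 1 (s1, s2, s3, s4, s5)';
-- Priority-based alternative: s1 is preferred, s2 takes over if s1 is unavailable.
-- ALTER SYSTEM SET synchronous_standby_names = 'FIRST 1 (s1, s2)';
SELECT pg_reload_conf();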
Nikolay: But anyway, the idea is they work together, synchronous_standby_names and synchronous_commit, in this case.
And if synchronous_standby_names is not empty, there is this magic happening.
If it's just a host name, it's purely synchronous replication, right?
But even purely synchronous, there are flavors we will discuss, based on this synchronous_commit setting.
If there is an expression there, like ANY or FIRST, blah, blah, blah, then it's already more complicated.
But it's interesting.
And the official documentation mentions it, it uses the term quorum commit, right?
Michael: I saw it in a Crunchy
Data blog post.
I didn't actually see it in the
documentation.
Nikolay: Okay, doesn't matter.
Let's talk about flavors.
Let's consider, for simplicity, that we have just one host name mentioned in synchronous_standby_names, no expressions, just one.
And if we don't change the synchronous_commit setting, the default is on, and here the meaning changes.
Right?
Which, like, this makes me... I started with this, right? It's overloaded.
The meaning changes, and what does it mean? It means that we need to wait until... what? Let's see.
Michael: So yeah, I think it's worth thinking about them progressively.
I think the order goes: remote_apply is the strongest.
So remote_apply will wait until it's not only been written to the write-ahead log and...
Nikolay: Flushed to disk...
Michael: ...but also available to queries, so read queries on the standby will be able to...
Nikolay: And here I think, okay, it's apply here, but why replay there?
In pg_stat_replication it's called replay_lsn, right?
Apply or replay, we consider them the same thing, right?
Yeah.
Right.
Michael: So that's the strongest.
I think that's the highest level
we can set, but it has additional
latency, obviously.
I think on is the next strongest.
I think we wait for flush but not for, what did you want to call
it?
Replay.
Nikolay: Replay or apply.
Michael: And then there's one level below on, which is the write.
Nikolay: There is no remote receive, there is remote write, right?
Michael: Yeah, but I think receive, they're saying, is the same as write.
Nikolay: Mmm, WAL data transmitted over the network and received, but not yet flushed to a file, right?
Yeah.
Write and receive are the same.
There is confusion because in different parts of Postgres we have different words, different verbs used, right?
But I agree, yeah.
So there are write, flush, and apply phases.
There are 3 steps here.
And there is no remote_flush, which, okay, overloading happens, but at least it would be good to have remote_flush and say, okay, on acts like flush; but there is no remote_flush.
There is only remote_write and remote_apply, and instead of flush we have on, saving on the number of words supported, right?
Michael: The number of settings, yeah. But they did add a local, which, according to the documentation, they're saying is only for completeness.
So it's interesting that they did add one.
Nikolay: It's not only for completeness.
And we will go there in a few minutes.
So on means remote flush, the medium setting, right?
WAL is written, yes, like it's happening on the primary.
On the primary, at commit time, it's flushed to disk.
We know that in the case of a crash, this WAL data is already on disk, but the data pages will need to be adjusted during recovery, right?
So it's not applied yet.
Right.
Good, good, good.
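A rough summary of the levels as discussed here, assuming synchronous_standby_names is non-empty (exact wait points are in the Postgres docs):

-- off          : don't wait, not even for the local WAL flush
-- local        : wait only for the local WAL flush on the primary
-- remote_write : additionally wait until the standby has received/written the WAL, but not flushed it
-- on           : additionally wait until the standby has flushed the WAL to disk (the missing "remote_flush")
-- remote_apply : additionally wait until the standby has replayed the WAL, so it's visible to reads
SET synchronous_commit = 'remote_apply';  -- can be set system-wide, per session, or per transaction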
So, how do you feel about this?
It's synchronous replication, which means that if we lose the primary, the new primary will have everything.
Which is great, right?
It will have everything, and it will just need to be applied, so, recovery.
But we don't lose anything.
It's great.
Agreed?
Michael: Yeah, with a little bit of latency cost on each, right?
Nikolay: Little bit.
Michael: Or a lot.
Nikolay: Yeah, this latency depends
on the network as well, a
lot.
Michael: Of course.
Nikolay: And usually we prefer to move standbys
to different availability
zones.
And this increases latency, of course.
And this means that if we keep this as is, synchronous_commit on and synchronous_standby_names with one host name, we will have a significant latency overhead.
This is why it's hard to use in heavily loaded projects, OLTP-like projects.
But it's quite a strict mode, right?
Michael: Yeah, and you can have
an HA setup that's within the
same availability zone, or at least
the same region, I'm guessing,
would massively reduce that.
Nikolay: Well, I wouldn't put it that way.
For me, one availability zone can disappear.
If you think about HA, we need to have a replica in a different availability zone.
Of course, there is a trade-off here: we need to weigh what we gain and what we lose, what's worse and what's best for us.
For me, HA means the replica should be in a different AZ.
And if it's in a different AZ, commit in this strict mode, remote flush, or rather on, since there's no remote flush, means we will need to wait for both synchronization with disk on the primary and on the standby.
And the network will add overhead here.
And I expect we won't be able to scale these writes very well.
So if it's thousands of writes per second, it will be a noticeable issue.
That's why in most cases I see people go to...
Well, by the way, remote_apply is interesting.
It's even stricter; we also need to wait for the WAL to be applied.
It kills performance even more, right?
What do you think about this mode?
It's extreme.
Michael: I actually lost you.
I think our network connection
is not great.
Nikolay: Okay, when I discussed
network issues, we had network
issues.
So I had moved us on to considering remote_apply.
Michael: Well, yeah, so actually a question here then.
This is no longer, I don't think, about the risk of data loss, because if we've flushed, then we're good.
So is this for read replicas?
This only makes sense if we're sending read traffic there.
Okay, great.
Nikolay: So this is about how quickly data becomes visible there.
We know that by default, on is acting like remote flush, which again doesn't exist as a named value.
Changes are not applied yet, so any reading session on the standby, SELECTs, reads on the standby, won't see this data yet.
And the...
Michael: Unless we're using remote
apply, in which case it would
always be there.
Nikolay: It will be there right after commit.
It's visible.
It's great.
So the delay between when it's written and when it's visible becomes zero.
Once it's committed, it's immediately visible on the standby.
Great.
Yeah.
And when do we need this? The question is when do we need it; I don't know.
I don't work with such products, so I'm...
Michael: Right, because by the point you need read replicas, the
you need read replicas, the
point of having read replicas is
because you have so much read
traffic that your primary can't
serve it all, right?
So we're talking about extremely
high, at least extremely high
reads.
Maybe cases where we have a ton more reads than writes, and write latency is not that important; maybe social media again.
Nikolay: Yeah, in this case this replica can be in the same availability zone, because it's not for HA purposes, right?
It's just for read distribution needs, maybe.
I honestly don't know.
We usually already have mechanisms implemented everywhere to deal with replication lag: stick to the primary right after writes, and within the same user session for some time, these kinds of things, right? So...
Michael: Yeah, you mentioned sticky reads before, yeah.
Nikolay: In this case, slowing down commits for the sake of reducing this delay, this lag, I don't know.
I would prefer to deal with small lags but not slow down commits.
Yeah.
Michael: So that's, I guess, remote apply, and the default
is on.
Do you ever... do you see use cases for remote write?
Nikolay: Remote write is the most popular.
Michael: Oh really?
Interesting.
Nikolay: Of course. In most setups where synchronous_standby_names is non-empty, I see they use remote apply... most of them, remote write, remote write.
Michael: Right, that makes more
sense.
Nikolay: So the list is complete, right?
And here we have issues, because this is a super popular option.
And, oh, by the way, I wanted to mention that yesterday on Hacker News there was the article from Aphyr, right?
Jepsen.
Yeah, Jepsen.
It was discussing RDS multi-AZ clusters, which is interesting, it's close to our topic today, but to understand the article, you need time.
So, great, we have remote write being very popular.
What does it mean?
At commit, the standby node has received all the data, but hasn't flushed it to disk yet.
It means that if all nodes suddenly shut down, only the primary will have this data.
After commit, we know the primary flushed it, right?
We don't have settings which let us control the flush behavior on standby nodes; with remote write, on the standby it's not flushed yet.
But if we have remote write, on, or remote apply, in any of these cases the disk on the primary has all the data from this transaction, right?
So it's already not off at all; it's not off.
And imagine we lose the primary.
We know the primary has this data, but the standby, since it's remote write, is not guaranteed to have flushed it; it might miss this data, and this transaction might be lost.
And this is okay.
Okay.
But in case we lose the primary completely, we lose data, right?
We lose data here.
Yeah.
But we lose data only if all nodes are shut down.
Usually when we lose the primary, standby nodes are not shut down.
Unless something like...
Yeah, yeah, yeah.
So let's say the loss of the primary is our, like, corner case; not corner, edge case.
But losing the primary while standby nodes are also shut down, not lost but shut down, is a very, very corner case, several edges, several problems simultaneously.
So usually they're still running.
That's why remote write is so popular, because it says, okay, we lost the primary, we need to perform auto failover or manual failover, usually auto failover.
And the standbys, 1 or 2 or however many you configure of those standby nodes, per the setting, synchronous_commit equals remote_write, they have it in memory.
And they're still running, so they will flush it.
So the risk is understood here.
Like, loss of the primary and everyone shut down suddenly.
But it's so low that we can accept it.
And it still feels like a synchronous commit to me, because I doubt that when we lose the primary, the standby node which received but not yet flushed this WAL data will suddenly be off too.
It will be running, right?
And we don't lose this data.
This is great.
This is why it's so popular.
But it's great.
Except it's not.
Because when auto failover needs to happen... It's rare, but sometimes people allocate, like on CrunchyBridge, or on RDS, not a multi-AZ cluster but a multi-AZ standby, there are 2 options there now.
And multi-AZ standby means there is a synchronous standby serving HA purposes only; you cannot send queries to it.
Crunchy Bridge has the same, well, not quite the same.
They use Postgres replication to solve this; exactly, they configure synchronous_standby_names.
And they set synchronous_commit to a very weird value, we will discuss it.
But the idea is the HA replica doesn't receive traffic, it serves only HA purposes.
Great.
But in many other cases, people use, they say, this quorum commit: 1 or 2 additional nodes must say the data is received.
Remote write, right?
And in this case, if we lose the primary, auto-failover will happen to where?
To one of these nodes which received it, right?
But the problem is, for example, if it's Patroni, Patroni cannot choose this node, because there is no such capability to find which one...
Like, again, we have 5 replicas, for example, 5 standbys.
We know one of them received the latest changes.
How do we find it?
Michael: At least, yeah.
Nikolay: There is no such capability, it doesn't exist.
Because there is pg_last_wal_receive_lsn, but it has the wrong name, because it shows... it shows the flush LSN.
It's a bug.
Oh, a bug or a feature, I don't know.
So this is a problem.
And if Patroni uses that, I think it uses that, it chooses the node with the maximum flushed LSN, not the maximum received LSN, and these might be different.
Yeah.
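For reference, a sketch of where these positions can be seen (Postgres 10+ names); the gap Nikolay describes is that the received-but-not-yet-flushed position isn't exposed separately on the standby:

-- On the primary: per-standby WAL positions from replication feedback.
SELECT application_name, sent_lsn, write_lsn, flush_lsn, replay_lsn
FROM pg_stat_replication;

-- On a standby: despite the name, this returns WAL received *and synced to disk* by the walreceiver.
SELECT pg_last_wal_receive_lsn();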
Michael: So with that in mind, I guess it's then an acceptable risk, or that's why it can have some data loss.
But why not use on in that case?
Nikolay: Because it will slow down
commits.
Michael: OK, that's, yes, the trade-off.
Nikolay: We distribute our nodes
among various, multiple AZs,
availability zones, right?
And there is network overhead.
Michael: Yeah, well, interestingly, I checked our own Google Cloud SQL setup before coming on, and I wondered what our setting was, and it was on.
So either they haven't...
I'm pretty sure...
But yeah, it's interesting that they could have chosen to set it to remote write.
Nikolay: And synchronous_standby_names
is non-empty?
Michael: I didn't check.
It's a good question.
Nikolay: If it's empty, you have pure asynchronous replication.
This is it.
Like, this is it.
Honestly, it concerns me a lot how overloaded the values and names are here.
On means different things depending on other settings.
Why?
Maybe we are trying to save on the number of configuration settings, GUC names, not to expand it too much.
Okay, 300, but let's not expand it to 500, right?
But this causes pain in brains, for people who are using this.
So if it's on, okay.
Is it?
Michael: No, you're right.
Nikolay: It's asynchronous.
Michael: If there are no synchronous_standby_names...
Nikolay: Yeah, yeah.
So it waits for flush only on the
primary.
Michael: Yeah, so that's basically
just local.
Nikolay: Yeah, yeah, yeah.
Good.
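For reference, a quick check you can run on any setup, managed or not, to see what the discussion above means for you:

SHOW synchronous_standby_names;  -- empty means replication is purely asynchronous
SHOW synchronous_commit;         -- on / local / off / remote_write / remote_apply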
Yeah, so there's this problem, it's unsolved.
It should be solved in Postgres, and I think it's not super difficult to solve: we need a function which will show the write LSN or receive LSN.
By the way, receive and write may also be a little bit different.
You receive, but okay, it doesn't matter, because the WAL receiver is a receiver, right?
Anyway, a couple more things, deeper things, I wanted to mention.
One of them is that even if we have remote write, there is an issue in the code that causes flushing periodically.
So we have remote write, but sometimes we wait until it's flushed.
And I guess this issue is known; it reveals itself only under very heavy loads.
And yeah, it also could be improved, maybe, right?
Because remote write is not purely remote write: at some point there's a flush, and all transactions currently being committed wait for that additional flush.
And then it's again write, write, write, and then again flush.
And it can be seen in the code that there are things there, fdatasync, right?
Michael: Roughly, yeah.
I mean, is this also the wal_writer_delay, or a different kind of delay?
Nikolay: No, no, no. It's not documented.
Michael: Okay, interesting.
Nikolay: So, I mean, if you use remote write, the behavior is not purely remote write.
It still involves disk sometimes, periodically.
I guess some buffer gets full and needs to be written.
But this causes delays of commits.
Michael: So.
Makes sense.
Nikolay: Yeah. And another thing, probably off topic, but CrunchyBridge, let's discuss what we observed.
We just created a fresh cluster and saw a few things.
One of them is that synchronous_standby_names is filled in.
I created an HA cluster, so the HA cluster was all good.
And synchronous_standby_names has this host, which is the HA replica.
It's not visible to us; we cannot send queries to it.
And synchronous_commit is set to local.
Michael: Yeah, so the first thing
you mentioned, it being non-empty
means they're using Postgres replication,
right?
What's the significance of that
being non-empty?
Nikolay: If it's not empty, it's supposed to be...
If synchronous_commit was the default, on, it would mean we have synchronous replication.
Flush, as you said; on means remote flush in this sense, the non-existing remote flush, right?
But synchronous_commit is set to local, which means we don't look at the state of what's happening on the standby.
So the commit behavior depends only on the primary.
Michael: Like my setup, right?
Like on Google Cloud SQL for me, even though it's on, it's kind of like local, because they're not using synchronous_standby_names.
Nikolay: Right, but they still have the replica.
Michael: Yeah, Yeah, yeah.
Nikolay: Strange, right?
Michael: But local is confusing, because, yeah, as you said, I would have expected it to be on, or, as you say, the more popular one.
Nikolay: If it was on, this replica would be synchronous.
If it's local...
Michael: I guess it's unsynchronous?
Yeah.
Or asynchronous, yeah.
Nikolay: Asynchronous, right.
Although it's mentioned in synchronous_standby_names.
Michael: Yeah, good point.
What would you prefer it to be? Remote write, I guess, rather than on?
Nikolay: No, no, no.
I think, like, I'm not sure it's a mistake.
Michael: Well, it's not the default, right, so it's definitely deliberate.
Nikolay: There are some thoughts behind this, I think, and I'm trying to guess what they are, because obviously making it synchronous would cause slow commits, right?
Performance would be not great.
Okay, we go to local, but why...
Michael: ...do people do that?
Nikolay: Yeah, we keep it local, so this HA replica becomes asynchronous.
Huh.
Michael: Do you think it's to look good in benchmarking?
Nikolay: Why can't we configure an asynchronous replica in a normal way?
Why should we put it into synchronous_standby_names, right?
Because it could be a regular asynchronous replica, present in a slot, but not available for read-only transactions, right?
But it's there.
And the only idea, it's not my idea, we discussed this situation in the team, and the idea that came up in one great mind was: probably they want WAL to be written by backends, not by the WAL writer, because there's this effect that when synchronous_standby_names is not empty, backends are doing the writing.
And this can increase throughput.
It's a guess, a hypothesis.
If Crunchy is listening to us,
and we know they're listening
because they changed the random
page cost to 1.1, right?
Michael: Oh yeah.
Nikolay: Actually, it's great.
I also wanted to thank them because
since then I stopped saying
1 and I say 1.1 because Crunchy
benchmarked it and decided to
use 1.1 for random page cost.
Great.
Thank you so much.
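For reference, the change being referred to; 1.1 is the value mentioned on the show, not a universal recommendation:

ALTER SYSTEM SET random_page_cost = 1.1;
SELECT pg_reload_conf();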
But if you guys are listening, please let us know what's happening with these settings.
It's really interesting.
Maybe it should be documented, because it's interesting.
And so, overall impression about synchronous_commit, all those options and synchronous replication: I honestly see more purely asynchronous setups.
But I know serious guys, and I see clusters with synchronous, like quorum commit, configured.
My impression is still that there are things to be fixed and improved.
This is what I observe.
And one more thing, let's name it.
When we created this cluster with HA, with the HA checkbox, an HA replica was also created.
We see it in synchronous_standby_names.
And then we decided, using our superuser, which Crunchy provides, and it's great, I think it's the only managed service, one of maybe two actually, which provides superuser access, which is great.
So we decided to change shared_preload_libraries and restart.
And there was a question: is this change propagated to the standby node or not?
And what happened: a Postgres restart, not a cluster restart, a Postgres restart caused a failover.
This is not all right.
This is not all right.
A freshly created cluster, and we did it 3 times; 2 out of 3 times, a failover happened, and the other time it didn't.
So I guess some race condition is present there, right?
And we saw that the change was not propagated to the standby.
So it's actually data loss.
If I have superuser, it's my right, right?
And I run ALTER SYSTEM, it writes the change, we can see it in pg_settings, and it's not yet applied, pending_restart is true, right?
And then we have a failover and we don't see it anymore.
What is it?
It's data loss.
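For reference, a sketch of what was being checked; the shared_preload_libraries value is illustrative:

ALTER SYSTEM SET shared_preload_libraries = 'pg_stat_statements';  -- written to postgresql.auto.conf
SELECT pg_reload_conf();
-- Until a restart, pg_settings shows the change as pending:
SELECT name, setting, pending_restart
FROM pg_settings
WHERE name = 'shared_preload_libraries';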
Michael: Or it's unsupported behavior.
Nikolay: No, no, no, it's data loss.
I wrote something.
Using alter system, I wrote something and I lost it.
Michael: So you can't have it both ways. Like, I see where you're coming from, 100%.
But on one hand you want superuser access, and on the other hand you're going to call it data loss if you change anything that doesn't get propagated.
Nikolay: Let me contradict myself a bit; maybe I'm too harsh.
I know ALTER SYSTEM is actually local: we can change settings on each standby separately, independently, so it's not guaranteed, it's not propagated by replication.
I just expected maybe some automation exists which would propagate it.
If this replica is an HA replica, maybe we should match settings.
Because this setting required a restart, but what happens to settings that don't require a restart?
If I change them using ALTER SYSTEM, then what, do I lose them?
This is not good, right?
Michael: Yeah.
But equally, expectations of automation around this stuff could be a nice excuse for other providers not to add superuser access.
So it's, yeah...
Nikolay: No, no, no. Well, I think it's no, no, no.
Michael: I'm just saying, be careful what you wish for.
Nikolay: Well, it could happen, like I change something using
UI and...
Michael: No, because if you use the UI, they change it somewhere
else and it does get propagated.
Nikolay: Maybe.
But I'm rooting for Crunchy here, actually.
I think this problem can be solved.
I don't want them to close superuser access.
It's a great feature, a super great feature; I think everyone should have it.
Based on our previous week's topic, my position is that RDS and others are taking too much from us.
It stops feeling like open source.
Michael: We don't need to make this point again.
Nikolay: Yeah, okay.
Well, I'm going to repeat it many times, right?
So it's not all right.
And Crunchy are doing a great job here providing superuser access, and I think this can be solved, right?
And one more thing, if you're listening: when I restart Postgres, I see a cluster restart, right?
Michael: I feel like I need to chop this bit up and send it to them as a support thing.
Nikolay: Good, yeah, it was good.
Michael: Nice one.
Nikolay: Thank you.
Michael: Yeah, thanks. Thanks so much, Nikolay, and thanks for the request, and yeah, catch you next week.
Nikolay: See you.
Bye.