Nikolay: Hello, hello, this is
Postgres.FM.

I'm Nik, Postgres.AI, and as usual,
my co-host is Michael, pgMustard.

Hi, Michael, how are you doing?

Michael: I'm good, how are you?

Nikolay: Great, everything's all
right.

A lot of bugs to fix and incidents
to troubleshoot, to perform

root cause analysis as we say,
RCA.

Michael: It sounds related to our
topic today maybe.

Nikolay: Oh yeah, maybe yes.

So the topic I chose is — I'm still not 100% sure how to name

it properly, so let's decide together.

But the situation is simple for
me, relatively.

So we help a lot of startups, and
at some point I decided to

focus only on startups.

Having raised a few startups myself and helped many others,

I know how it feels to choose technology
and grow, grow, grow

until some problems start hitting
you.

This is the typical path: usually people choose RDS or CloudSQL or Supabase, anything.

And they don't need to hire DBAs,
DBREs, and they grow quite

well until a few terabytes of data
or 10,000 TPS, that kind

of scale.

And then problems pop up here and
there, and sometimes they come

in batches, you know, like not
just 1 problem, but several.

And here usually, for us it's good,
they come to us.

I mean, Postgres.AI — we still have a consulting wing, quite strong and growing.

And we helped more than 20 startups
over the last year, which

I'm very proud of.

And I collected a lot of case studies,
so to speak.

And I decided to build some classification of problems that feel bad at a very high level — for example, CTO level or even CEO level — when they might start thinking: is Postgres the right choice, or is it giving us too much headache?

And it's not about like, oh, out
of disk space suddenly, or major

upgrades requiring some maintenance
window.

Although this also can cause some
headache.

But it's more about problems like
where you don't know what to

do, or you see it requires a lot
of effort to solve it properly.

Michael: Yeah, I've had a sneak
peek of your list, so I like

how you've described it.

I also like the thought process
of whether it hits the CTO or

the CEO, and I was thinking, let's
say you have a non-technical

CEO, if they start hearing the
word Postgres too often it's probably

a bad sign.

Ideally you might mention it once
every few years when you do

a major version upgrade, but then
nothing bad happens and they

don't hear it again for a few years.

But if they're hearing, you know,
if it's getting to the CEO

that Postgres is causing problems
over and over again, the natural

question is going to be, is there
an alternative?

What could we do instead?

Or, you know, is this a big problem?

So I guess it's these kind of dangers,
not just to the startup,

but also to Postgres' continued
use at that startup.

Nikolay: Yeah, I like the word
dangerous here because when you

deal with some of these problems,
it might feel dangerous to

have Postgres for them.

It's bad.

Like I would like if things were
better.

So I have a list of 10 items, and we can discuss them. The list is unordered, and I'm going to post it to my social networks so folks can discuss.

And I sincerely think that this list is useful.

If you're a startup, it's great to just use this checklist to see how your cluster or clusters are doing and whether you are ready.

So Postgres growth readiness checklist.

And it's interesting that I didn't include vertical and horizontal scaling there.

I did, indirectly — we will touch on it.

But obviously, like this is the
most discussed topic, the biggest

danger, how Postgres scales, like
cluster, single primary, and

multiple standbys, how far we can
go.

We know we can go very far, very,
very far on a single cluster.

At some point, microservices, or
maybe sharding, it's great.

But we had a great episode with Lev Kokotov of PgDog, and 1 of the items I have today resonates with what he said during that episode.

So anyway, let's exclude vertical
and horizontal scaling and

talk about stuff which kind of
sounds boring.

My first item is heavy lock contention.

This is very popular.

Maybe 50% of companies that come
to us have this issue.

Somehow.

So at some point I decided to start telling everyone: if you have queue-like workloads, and/or if you don't know how dangerous it is to change the schema in Postgres — just adding a column can be a problem, right?

We discussed it, I think, many
times.

You are not ready to grow, and
at some point, sooner or later,

it will hit you.

And it will hit you as a spike
of active sessions.

And we know some managed Postgres platforms provoke you into having a huge max_connections.

Michael: Max connections,

Nikolay: yeah.

RDS, like 5,000 or 2,500.

Why do they do this?

Easier for them.

But it's dangerous because it additionally creates a kind of performance cliff.

Michael: Yeah, it's another version of these cliffs, isn't it?

We had another good episode recently.

Nikolay: Yeah, I plan to research this a little bit, probably

we will publish something in this area to prove that it's not

good.

It's still not good even if you have Postgres 14 plus, which has great optimizations for a large number of idle connections — it's still not good.
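
A quick way to see how many of those connections are actually doing work, using only pg_stat_activity (no assumptions beyond a standard setup):

    -- count client connections by state; a large "idle" number hints the
    -- max_connections headroom is mostly unused sessions
    SELECT state, count(*)
    FROM pg_stat_activity
    WHERE backend_type = 'client backend'
    GROUP BY state
    ORDER BY count(*) DESC;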

Michael: And there have been some improvements. I know a very good engineer who took Postgres down by adding a column with a default, I think it was.

But that was many years ago — there have been improvements in recent years that make some DDL changes less dangerous than they were.

Nikolay: Yeah, there are several levels.

Michael: Yes, yeah, of course.

And if they get stuck — if they don't have a lock_timeout, for example. In fact, we're probably going to be pointing to episodes on every single 1 of these bullet points, but we had 1 on zero-downtime migrations, which is probably the best for that, and we had a separate 1 on queues actually, didn't we?

Nikolay: So yeah.

Yeah.

So definitely there are solutions here and you just need to proactively

deploy.

It's interesting that I see some companies grow quite far without noticing this problem, for example with DDL.

It's like going to a casino: you can win, you can win.

Then sometimes, boom, you lose.

Because if you deploy some DDL and you get blocked, you can block

others and it can be a disaster.

We discussed it several times.

And if you had a hundred deployments successfully, it doesn't

mean you will keep winning, right?

So it's better to have protection in place.

And it concerns me — I have a feeling we should implement this in Postgres itself.

Something like ALTER TABLE CONCURRENTLY, which would itself perform these retries with a low lock_timeout.
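
A minimal sketch of that retry pattern as it is usually done manually today — the table and column here are hypothetical, and the deploy tool is expected to re-run the statement if it fails on a lock timeout:

    -- give up quickly instead of queueing behind other lock holders
    SET lock_timeout = '100ms';
    ALTER TABLE orders ADD COLUMN note text;
    -- on "canceling statement due to lock timeout": wait briefly,
    -- then retry the ALTER in a fresh attempt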

Michael: Yeah, it's tricky, isn't it?

But I agree — though people would still need to know it exists to actually use it, because I think the main issue here is people not realizing that it can be a problem.

And the fact that it probably hits users.

Let's say you've got a statement_timeout.

When are you actually going to notice that users have been waiting

for it?

Are you going to notice that spike in your monitoring?

I'm not sure — it depends how many users actually got stuck waiting behind it and had slow queries.

And it's going to be hard to reproduce; you might not know why it was that.

So...

Nikolay: log_lock_waits is off by default, so you don't see who blocked you.

And it might be autovacuum running in its aggressive, anti-wraparound mode, or it can be another session — a long-running transaction holding an ACCESS SHARE lock on the table — and you cannot alter it.

And boom, you block others.

So this is like a chain reaction.

And yeah, it's not good.
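
Two things that help here, assuming you can change settings and query the instance directly:

    -- log any lock wait longer than deadlock_timeout (1s by default),
    -- so the blocker shows up in the logs
    ALTER SYSTEM SET log_lock_waits = on;
    SELECT pg_reload_conf();

    -- live view: sessions that are currently blocked, and by whom
    SELECT pid, pg_blocking_pids(pid) AS blocked_by, state, query
    FROM pg_stat_activity
    WHERE cardinality(pg_blocking_pids(pid)) > 0;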

And queue-like workloads are the same: at smaller scale, you don't see problems at all.

Then you occasionally experience
them.

But if you grow very fast, you
will start hitting these problems

very badly.

And they look like spikes of heavy lock contention — heavy lock and lock in Postgres terminology are the same, so just lock contention — and yeah, it doesn't look good.

And the suggestion is so simple.

It's funny that we talk a lot, and people who come to us actually mention they watch the podcast, and I say: okay, queue-like workload?

Just take care of indexes, take care of bloat, maybe partitioning, but most importantly, SKIP LOCKED.

That's it.

This is the solution, but we spend hours discussing the details.
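
A minimal sketch of the SKIP LOCKED pattern for a queue, assuming a hypothetical jobs table with a status column:

    -- each worker claims a batch without waiting on rows other workers hold
    WITH next_jobs AS (
      SELECT id
      FROM jobs
      WHERE status = 'pending'
      ORDER BY id
      LIMIT 10
      FOR UPDATE SKIP LOCKED
    )
    UPDATE jobs
    SET status = 'processing'
    WHERE id IN (SELECT id FROM next_jobs)
    RETURNING id;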

Because when you go to reality,
it's not easy to learn this,

like there are objections sometimes,
but this is what we do,

like we work with those objections
and help to implement, right?

So yeah, for everything we had an episode.

There are episodes for everything.

So this was number 1, heavy lock contention.

And I chose the most popular reasons.

Of course there are other reasons.

But in my view, DDL and queue-like workloads are the 2 biggest ones.

Okay, next it's boring, super boring.

Bloat control and index management.

We had episodes about it, maybe
several actually.

But since, again, managed Postgres platforms don't give you the tools.

For example, RDS: they did a great job in autovacuum tuning, but only half of it.
only half of it.

They made it very aggressive in terms of resources and throttling — they gave it a lot of resources.

But they don't adjust scale factors.

So autovacuum visits your tables not often enough for OLTP.

So bloat can accumulate and so on, and they don't give you the tools to understand the reasons for bloat.

I'm thinking about it, and I think it's tricky. It's also a problem of the Postgres documentation, because it lacks clarity on how to troubleshoot the reasons for bloat — we always say "long-running transaction", but not every transaction is harmful.

For example, at the default transaction isolation level, Read Committed, a transaction is not that harmful if it consists of many small queries.

If it's a single query, it holds a snapshot — that's harmful.

So I guess with observability we should fully shift from "long-running transaction" language to "xmin horizon" language, and discuss that.
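
A rough sketch of what "xmin horizon language" can look like in practice — checking what is currently holding the horizon back, using standard system views:

    -- age, in transactions, of the oldest xmin held by each kind of holder
    SELECT 'active backend' AS holder, max(age(backend_xmin)) AS xmin_age
    FROM pg_stat_activity WHERE backend_xmin IS NOT NULL
    UNION ALL
    SELECT 'replication slot', max(age(xmin))
    FROM pg_replication_slots WHERE xmin IS NOT NULL
    UNION ALL
    SELECT 'prepared transaction', max(age(transaction))
    FROM pg_prepared_xacts;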

Anyway, I can easily imagine — and I observe — how people think: oh, MongoDB doesn't have this stuff.

Or some other database system — they don't have the problem with bloat.

Or indexes, indexes, oh.

Actually with indexes, my true
belief is that degradation of

index health is happening in other
systems as well.

We also discussed it.

So they need to be rebuilt.

Michael: I was listening to a SQL
Server podcast just for fun

the other day and they had the
exact same problem.

But in the episode where we talked
about index maintenance, I

think it came up that even if you're
really on top of autovacuum,

even if you have it configured
really nicely, there can still

be occasions where you get some
bloat.

If you have a spike, or a large deletion — there are a few cases where you can end up with sparsely populated indexes that can't self-heal. For example, you've got an index, even a UUIDv7 index, and then you have something that deletes some old data and the table isn't partitioned — then you've got a gap in your index.

So there's a bunch of reasons why
they can get bloated anyway,

even if you're on top of autovacuum.

So I think this is 1 of those ones
that, yes, autovacuum fixes

most of the problems, but you probably
still want to have some

plan for index maintenance anyway.

Nikolay: Yeah, so there are certain things that are not automated by Postgres itself, or by Kubernetes operators, or by managed service providers — well, some of them automated some things, but not everyone, and not everything.

Even upgrades — there's a lack of automation there too.

We can mention the lack of automation of ANALYZE, but fortunately future Postgres versions will definitely be fine, because dump and restore of statistics is finally implemented and goes into Postgres 18, which is super great news.

Anyway, lack of automation might
feel like, oh, this is a constant

headache, but it's solvable.

It's solvable.

Fortunately, it requires some effort,
but it's solvable.

Okay.

Next thing — let's talk about lightweight lock contention.

So we talked about heavy lock contention
or just lock contention.

Lightweight lock contention also feels like pain, of various kinds.

Lightweight locks can be called latches; they're in memory.

So there are lightweight locks Postgres needs to acquire when, for example, some operations with the buffer pool happen, or when working with WAL or various other data structures.

We can also mention LockManager.

So things like LWLock:LockManager, or buffer mapping, or subtrans SLRU, multixact SLRU.

When I hear these terms, imagine them written in a bloody font, you know — red blood, drops of blood — because I know so many projects that suffered big pain, big incidents.

So for me, these are bloody terms, you know, because it was a lot of pain sometimes.

For example, you know I'm a big
fan of sub-transactions, right?

Just my natural advice is just
to eliminate them all.

Well, over time, I've become softer.

I say, okay, you just need to understand them and use them very carefully.

But LockManager — a couple of years ago, remember, Jeremy Schneider posted—

Michael: Yeah, great post.

Nikolay: Horror stories, and we discussed it as well.

So this kind of contention might hit you, and it usually feels like a performance cliff: all good, all good, boom.

Michael: Right — and is it changing in 18? It was a hard-coded limit, right?

Nikolay: 16, you mean, for fast path?

Also SLRU sizes are now configurable, I think in 17 already.

Well, nice, good, but not always
enough.

Because okay, you can buy some
time, but still there is a cliff

and if you're not far from it,
again, boom.

Or this, I recently saw it, like,
remember we discussed 4 million

transactions per second.

And we discussed that first we found pg_stat_kcache was an issue — it was fixed — and then pg_stat_statements.

Okay, pg_stat_statements: if the transactions are super fast, it brings an observer effect.

And we see it in newer Postgres versions as LWLock:pg_stat_statements, because the code is finally covered by proper instrumentation — it's wrapped, and it's visible in wait event analysis when observing that activity.

So I saw it recently at 1 customer — some share of lightweight lock waits related to pg_stat_statements — so we need to discuss what's happening.

It happens only when you have a
lot of very fast queries, but

it can be a problem as well.

But yeah, performance cliffs — it requires some practice to understand where they are.

It's hard, because you need to understand how to measure usage and how to understand the risks of the situation.

This requires some practice.

Michael: I think this is 1 of the
hardest ones.

I think this is 1 of the hardest
ones to see coming.

Nikolay: After all our stories with LWLock:LockManager, every time I see some query exceed 1,000 QPS, queries per second, I'm already thinking: okay, this patient is developing some chronic disease, you know.

Michael: Okay, that's 1 I haven't heard — we've done several rules of thumb before, but that's another good 1: a thousand queries per second for a single query, check.

Nikolay: It's also very relative to how many vCPUs we have.

If we have less, it can hit faster.

Although we couldn't reproduce
exactly the same nature as we

see on huge machines like 196 cores,
we couldn't reproduce that

nature on 8 core machines at all.

So yeah, it's for big boys only,
you know.

This is, yeah, this is, like, or
maybe for adults.

So young projects don't experience
these problems.

Michael: That's a good point, actually — the startups that have hit this, that you've written about, have tended to be further along in their journeys, huge, but still growing quickly, and it's an even bigger problem at that point.

But yeah, good point — should we move on?

Nikolay: Yeah, so the next 1 is our usual suspect, right: wraparound of the 4-byte transaction ID and multixact ID.

So many words have already been said about this.

It just bothers me that monitoring usually doesn't cover, for example, multixact IDs.

And people still don't have alerts
and so on.

So it's sad.

Yeah, it's easy to create these
days.
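
A basic check that such an alert can be built on, covering both transaction IDs and multixact IDs:

    -- how far each database has advanced toward wraparound;
    -- alert long before these approach ~2 billion
    SELECT datname,
           age(datfrozenxid)    AS xid_age,
           mxid_age(datminmxid) AS multixact_age
    FROM pg_database
    ORDER BY xid_age DESC;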

Michael: I get the impression, though — I mean, there were a few high-profile incidents that got blogged about.

I think, yes, yeah, exactly.

And I feel like I haven't seen 1 in a long while, and I know there are a lot of projects that are having to, you know — I think Adyen have spoken about how, if they weren't on top of this, it would only be a matter of hours before they'd hit wraparound; it's that kind of volume.

So they're really having to monitor and stay on top of it all the time.

But I haven't heard of anybody
actually hitting this for quite

a while.

Do you think — I wondered if, for example, there were some changes to autovacuum, like it kicks in to do an anti-wraparound vacuum differently, or it might be a lighter type of vacuum that it runs now. I think I remember Peter Geoghegan posting about it, something like that.

Do you remember a change in that area?

Nikolay: I don't remember, honestly.

I just know this is still a problem.

Again, at CTO level, it feels like: how come Postgres still has 4-byte transaction IDs, and what kind of risks do I need to take into account?

But you are right, managed Postgres
providers do quite a good

job here.

They take care.

I had a guest on Postgres TV, Hannu Krosing, who talked about how to escape from it in a non-traditional — and in his opinion, and actually my opinion as well, better — way, without single-user mode.

And since he is part of the CloudSQL team, it also shows how much effort managed Postgres providers put into this area, realizing this is a huge risk.

Michael: Yeah, and it's not even,
even if it's a small risk,

the impact when it happens is not
small.

So it's 1 of those ones where—

Nikolay: Absolutely — good correction.

It's low risk, high impact, exactly.

Michael: Yes, yes.

So I think the cases that were
blogged about were hours and hours

possibly even getting, was it even
a day or 2 of downtime for

those organizations?

And that was, that is then, I mean,
you're talking about dangers,

right?

Nikolay: That's it — global downtime, the whole database is down.

Michael: Exactly.

People, you're gonna lose some
customers over that, right?

Nikolay: Yeah, unlike the next
item, a 4-byte integer primary

key is still a thing, you know.

I was surprised to have recently
this case, which was overlooked

by our tooling.

Michael: Oh, really?

Nikolay: I couldn't like, how come?

Yeah, because it was a non-traditional
way to have this.

Michael: Go on.

Nikolay: Well, first of all, it was a sequence which was used by multiple tables.

Yeah, 1 for all of them.

And somehow it was defined in a way that our report in postgres-checkup didn't see it, so when it came I was like: how come? This is an old friend — or old enemy, not friend, enemy.

Michael: Old enemy.

Nikolay: I haven't seen you for so many years.

And you look different, you know, because of the multiple tables — but still, it's not fun.

And this causes partial downtime
because some part of workload

cannot work anymore.

You cannot INSERT.

Michael: Yeah.

Nikolay: Yeah.

So, by the way, I also learned that if you just do an in-place ALTER TABLE on a huge table, it's not as dumb as I thought.

I checked source code, I was impressed.

And this code is from 9 point something,
maybe even before.

So if you do ALTER TABLE ... ALTER COLUMN to change from int4 to int8, it actually performs a job similar to VACUUM FULL.
Recreating indexes.

And you don't have bloat.

I expected like 50% bloat, you
know.

Michael: Oh, why?

Nikolay: Because I thought it will
be, it will rewrite the whole

table.

I was mistaken.

It's quite smart.

Yeah, of course it's a blocking operation, it causes downtime to perform it, but you end up with quite a clean state of the table and indexes.

Well, not exactly "clean" — it's fresh.

Michael: Yeah, so that is a table
rewrite, no?

Nikolay: Yes, well — you are right.

I was thinking about a table rewrite as a very dumb thing, like creating new tuples and deleting the old tuples.

Michael: Got it, got it, got it.

Nikolay: But there is a mechanism
of table rewrite in the code.

Now I saw it finally, I'm still
learning, you know, sorry.

Michael: You might end up with
some padding issues if you had

it optimized well before, but yeah.
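
For reference, a sketch of the operation being described, with a hypothetical table — it's blocking, so it needs a maintenance window:

    -- rewrites the whole table and rebuilds its indexes;
    -- holds an ACCESS EXCLUSIVE lock for the duration
    ALTER TABLE orders ALTER COLUMN id TYPE bigint;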

Nikolay: Yeah, it also feels like Postgres could eventually implement some kind of reshape, because the building blocks are already in the code — I see them. First in offline style, to change column order, and then, if you want it, fully online style, if pg_squeeze goes to core. Right? It would be great.

I'm just connecting paths here, and it could be very powerful in like 3 to 5 years maybe. But it's a lot of work, additionally.

So all those who are involved in
moving huge building blocks,

I have huge respect.

So okay.

Michael: And I think this 1, if you know what you're doing, is easier to recover from.

I assume like with the sequence,
for example, you can handle

it multiple ways, but you can set
the sequence to like negative

2 billion and normally you've got
a good start.

Nikolay: Everyone thinks they're smart, and this is the first thing I always hear when we discuss this.

It came up in the past too: let's use negative values.

Of course, if you can use negative values, do it.

Because we know Postgres integers are signed.

So we use only half of the 4-byte capacity — half of it is 2.1 billion, roughly.

So you have 2.1 billion more, but it's not always possible to use them.
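
If negative values are acceptable, the emergency move looks roughly like this (the sequence name is hypothetical):

    -- start handing out the unused negative half of the int4 range
    ALTER SEQUENCE orders_id_seq
      MINVALUE -2147483648
      RESTART WITH -2147483648;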

But this is an old, old story, still making some people nervous, and I think it's good to check in advance.

Michael: So much better, so much
better to have alerts when you're

getting...

Nikolay: I saved several companies, big ones, from this, just by raising it.

And I know in some companies, it
was like 1 year or a few years

work to fix it.

Michael: So what was the problem
before?

Was it looking at columns instead
of looking at sequences?

Or what was the-

Nikolay: No, sequences are always 8 bytes.

It was always so.

The problem was with the report — I don't remember, honestly.

There was some problem with report.

It was not the standard way — not just CREATE TABLE where you have a default with a sequence and you see it.

Something else, some function, I don't remember exactly.

Okay.

But usually our report catches
such things.

Or you can just check yourself: if you have 4-byte primary keys, it's time to move to 8 bytes, or to UUID version 7, right?

Maybe.
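
A simple catalog query as a starting point for that check (then compare each table's current maximum value against 2,147,483,647):

    -- list primary key columns that are still int4
    SELECT c.relname AS table_name, a.attname AS column_name
    FROM pg_index i
    JOIN pg_class c ON c.oid = i.indrelid
    JOIN pg_attribute a ON a.attrelid = c.oid AND a.attnum = ANY (i.indkey)
    WHERE i.indisprimary
      AND a.atttypid = 'int4'::regtype
    ORDER BY 1;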

Okay, that's it about this.

Then let's heat up the situation,
replication limits.

So in the beginning I mentioned
vertical and horizontal scaling

and usually people say there's
not enough CPU or something and

disk I.O.

And we need to scale and you can
scale read only workloads having

more read only standbys.

But it's hard to scale writes,
and this is true.

But it's also true that at some point — and Lev mentioned it in our PgDog episode — at Instacart they had it, right?

At something like 200 megabytes per second of WAL generation, or over 100.

I don't remember exactly what he mentioned, but somewhere in that area.

Michael: It was a lot.

Nikolay: Well, just 10 WAL files per second, for example — at 16 megabytes each, that gives you 160 megabytes per second.

Well, RDS needs fewer files because they have 64-megabyte WAL files; they raised the size 4 times.

Anyway, 100 or 200, 300 megabytes
per second, you start hitting

the problems with single-threaded
processes.

Actually, I wonder why we don't monitor this — any Postgres monitoring with low-level access should have it.

RDS doesn't have it, but with low-level access, we should see CPU usage for every important single-threaded Postgres process.

Right?

WAL sender, WAL receiver, Logical
Replication Worker, Checkpointer

maybe as well.

Michael: That would be helpful.

Nikolay: Of course, because it's
dangerous to grow at scale when

you hit 100% of a single vCPU.

And then you need to either vertically
or horizontally scale

or start saving on WAL generation.
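
Where you do have host access, the PIDs to watch with top or pidstat can be taken straight from pg_stat_activity:

    -- background processes that are single-threaded and can become CPU bottlenecks
    SELECT pid, backend_type
    FROM pg_stat_activity
    WHERE backend_type IN ('walsender', 'walwriter', 'checkpointer',
                           'background writer', 'logical replication worker');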

Fortunately, in pg_stat_statements,
we have WAL metrics, 3

columns since Postgres 13.

But unfortunately, this is query
level.

What we also need is table level, to understand which tables are responsible for a lot of WAL.

And pg_stat_activity lacks it.

I think it's a good idea to implement.

If someone wants hacking, this
is a great idea.

Add 3 more columns — WAL-related metrics — to pg_stat_all_tables, pg_stat_sys_tables, and pg_stat_user_tables.

It would be great.

Also maybe pg_stat_database — a global view of things, how much WAL.
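
The query-level picture mentioned above is already available, for example:

    -- top WAL producers by query (columns available since Postgres 13)
    SELECT queryid, calls, wal_bytes, wal_fpi, wal_records
    FROM pg_stat_statements
    ORDER BY wal_bytes DESC
    LIMIT 10;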

Michael: Yeah, I've got a vague memory there were a couple of WAL-related new views, new system views introduced.

Nikolay: Oh yes, but it's about... Yeah, it's about...

Are you talking about pg_stat_io?

No.

The WAL-related metrics in Postgres 13 went to EXPLAIN and to pg_stat_statements.

This is what happened.

Anyway, this is not an easy problem
to solve.

It's easy to check if you have
access.

Unfortunately, if you're on managed
Postgres, you don't have

access.

They need to check what's happening, especially on standbys.

And also, it makes sense to tune
compression properly because

compression can eat your CPU.

Remember we discussed WAL compression.

I always said, let's turn it on.

Now I think: let's turn it on, unless you have this problem — in this case, with the WAL sender, you need to check how much of that CPU is WAL compression.

And also we have new compression algorithms implemented in fresh Postgres versions, like zstd.

So we can choose a better 1.

For example, LZ4 should provide something similar — I remember I saw some benchmarks, I didn't do it myself yet.

It should be similar to pglz, the default compression, in terms of compression ratio, but it takes much less CPU, like 2 to 3 times less.

So it's worth choosing that.
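
On Postgres 15 and newer, wal_compression takes an algorithm name rather than just on/off, assuming the server was built with the corresponding library:

    ALTER SYSTEM SET wal_compression = 'lz4';  -- or 'zstd'; plain 'on' means pglz
    SELECT pg_reload_conf();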

Michael: I looked at it briefly, just for our own use, for plan storage.

And we even got better performance as well.

So it won on both counts for us.

Nikolay: Well, you probably saw
what I saw, Small Datum blog,

right?

No?

OK, maybe not.

Let's move on, because there are
more things to discuss.

Michael: Nice.

Nikolay: Design limits.

So some people already think what
they will do when their table

reaches 32 terabytes.

Michael: Yeah, I guess this and
the last 1 both feel like adult

problems again, right?

Like, There aren't too many small
startups hitting the Instacart

level of WAL generation or 32
terabytes of data.

Nikolay: So yeah, it's really big clusters.

But we can start thinking about
them earlier and be better prepared.

Maybe sometimes without spending too much time — because, you know, if you spend too much time thinking about how you will bring statistics to the new cluster after a major upgrade... it's already implemented in 18, so just a couple more years and everyone forgets about this problem.

Michael: It's the kind of thing where it's useful, if you've got a to-do list you're going to be using in a couple of years' time, or a calendar that you know you're going to get an alert from, to just put a reminder in for a year or two's time — just check.

Nikolay: We also mentioned this a few times during the latest episodes — it bothers me a lot.

First I learned CloudSQL has this limit, then I learned RDS also has this limit: 64 terabytes for the whole database.

What's happening here? These days that's already not a huge database.

Michael: But again, in a couple of years, who knows, they might have increased that.

Nikolay: Yeah, well, I think it's solvable, of course. I guess this is the limit of a single EBS volume, or a disk on Google Cloud — PD SSD, or whatever they call it.

So yeah, it's solvable, I think, right?

With some tricks.

But these days, it doesn't feel huge.

64 terabytes feels like a big database, right?

But when we say—

Michael: To most of us.

Nikolay: But we have stories of 100-plus-terabyte databases, all of them self-managed.

I think everyone has

Michael: all of them were sharded.

All of those were sharded.

Nikolay: Yeah, 100%, yeah, we had
a great episode, 100 terabytes,

where it was Adyen, who else?

Michael: We had Notion and we had
Figma.

Nikolay: Figma, right.

And Notion and Figma, they are on RDS, but sharded — a single cluster of that size is impossible on RDS, right?

And I think Adyen has 100-plus terabytes — no, they do have 100-plus terabytes, but it's self-managed.

Yes.

Because on RDS it's impossible,
not supported.

Michael: Well, and they shard it
also.

Nikolay: Yeah, yeah, yeah.

But yeah, large companies like
that, they're always like some

parts are sharded, some are not.

So anyway, when you have 10 to 20 terabytes, if you are on RDS or CloudSQL, it's time to think about how you will grow 5x.

Because if you're at 20 terabytes and grow 5x, to 100, it's already not possible with a single cluster, right?

Another reason to think about splitting
somehow.

So, okay.

Then a few more items — data loss.

Data loss is a big deal if you poorly designed your backups or HA solution.

Yeah, it can be.

Let's join this with poor HA choices leading to failures like split brain.

So data loss, split brain.

Actually, there is an ongoing discussion in the project called CloudNativePG, where I raised the topic of split-brain and demonstrated how to do it a couple of weeks ago.

And the good news, as I see it: they decided to implement something, to move in a direction similar to Patroni, because when a network partition happens and the primary is basically alone, it's bad that it remains active.

And as I demonstrated, some parts of the application might still talk to it.

This is classical split-brain.

And based on the discussion, it triggered something I'd never thought about deeply, actually: is split-brain just a variant of data loss?

Michael: Well, I guess you technically
might not have lost the

data.

It's still there, you just have
2 versions.

Which 1 is correct?

Nikolay: It's worse than data loss.

It's worse than data loss.

It's worse.

Because now you have 2 versions
of reality.

And it's bad.

With data loss, you can apologize and ask: please bring your data back again.

And in some cases we allow some data loss.

Of course, data loss is really
a sad thing to have.

But sometimes we have it like...

Officially, some data loss might happen — the risk is very low, and at maximum this many bytes, for example — but it might happen.

With split-brain it's worse.

You need to spend a lot of effort to merge the realities back into 1.

Michael: Most cases of data loss I've seen have tended to be at least 2 things gone wrong — like user error, or some way that the node's gone down — but quite often it's user error: an accidental DELETE without a WHERE clause, or dropping a table in 1 environment.

Nikolay: Well, this is like higher
level data loss.

Michael: Well, but that can cause the low-level data loss if you then also don't have tested backups and it turns out you didn't have a good one — so it's often the combination of the 2 things.

Nikolay: For me it's still — yeah, if backups are missing, it's bad, you cannot recover.

But classical, lower-level data loss for me is: the database said commit successful, and then my data is gone.

Michael: So yeah.

So that's scary and dangerous as
like a CTO.

Nikolay: It undermines trust in Postgres again, right?

Yeah, yeah.

If procedures lead to data loss and also split brain.

Michael: Is it actually happening
often or is it more the CTO

thinking, I don't want to choose
this technology because it could

happen?

Nikolay: Depends on the project and the level of control.

I'm pretty confident that in many, many web and mobile app, OLTP-style projects — unless they are financial or something; for social networks, social media, maybe even e-commerce and so on — data loss sometimes happens unnoticed.

Yeah.

If you have asynchronous replicas and a failover happened — in the process of failover, Patroni by default with asynchronous replicas allows up to 1 megabyte, it's written in the config. Up to 1 megabyte of data loss, officially — maybe it's in bytes.

Right, so 1 megabyte is possible to lose.

And who will complain if it's social media comments or something — we store comments, we lost some comments, maybe nobody noticed.

But if it's a serious project,
it's better not to do it.

Or split brain.

Yeah, anyway, this is mostly not about Postgres — Postgres itself does quite well here.

It's a question mostly for everything around Postgres, the infrastructure — and if it's managed Postgres, for their infrastructure: how do they guarantee there is no data loss?

And last time we discussed the problem with synchronous_commit, and we discussed in detail how right now Postgres doesn't do a good job of revealing proper LSNs on standbys, right?

So even Patroni can have data loss in the case of synchronous_commit with remote_write.

We discussed it.
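
The trade-off being referenced, in configuration terms (the standby name pattern is an assumption, adjust to the actual setup):

    -- wait for at least 1 standby; 'on' waits for the standby to flush WAL,
    -- 'remote_write' only waits for it to be written (not flushed) there,
    -- which leaves a small window for loss
    ALTER SYSTEM SET synchronous_standby_names = 'ANY 1 (*)';
    ALTER SYSTEM SET synchronous_commit = 'on';  -- or 'remote_write'
    SELECT pg_reload_conf();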

Okay, anyway, this feels like something
for improvement definitely

here.

Okay — corruption.

My general feeling, people don't
realize how many types of corruption

might happen.

And it remains unnoticed in so
many cases.

When you start talking, people's
reaction sometimes is, wait,

what?

So yeah, corruption at various
levels.

Michael: So actually, in terms of the list where we started — the point of the topic was kind of dangers, right?

Is this 1 of those ones where, if it silently happened for a while, it suddenly becomes a complete loss of trust in the underlying system?

Nikolay: Yeah, as usual, I can rant a little about how Postgres defaults are outdated.

We know data checksums were only recently enabled by default.

Yeah, great change — like a month or 2 ago, right?

It will only be in 18.

It should have been done 5 years ago.

We saw

Michael: evidence.

Some,

Nikolay: yeah.

Yeah, many managed Postgres providers
did it, like RDS.

That's great.

Michael: Which is great.

And that kind of is then the default
for a lot of people.

Nikolay: Yeah, but it also doesn't
guarantee that you don't have

corruption.

You need to read all the pages
from time to time, right?

And do they offer some anti-corruption tooling?

They don't — nobody does.

Okay, they enabled checksums — so what?

This is just a small piece of the whole puzzle.

Michael: Yeah, and amcheck improving
as well.

I think, is it in 18?

Nikolay: Yeah, that's great.

Michael: In indexes as well.

Nikolay: Yeah, big.

Michael: In fact, I think we, did
we have an episode on amcheck?

I feel like it came up at least
once or twice, maybe index maintenance.

Nikolay: I cannot remember all
our episodes, so many.

Michael: Yeah, me neither.

Nikolay: But it's great to see the progress in the area of amcheck, but again, it's a small piece of the puzzle, and it has settings, options to choose from.

So it's not trivial to choose among the options.
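
A minimal amcheck pass over B-tree indexes, roughly following the documented usage (heapallindexed is the slower, more thorough of those options):

    CREATE EXTENSION IF NOT EXISTS amcheck;

    SELECT bt_index_check(index => c.oid, heapallindexed => true)
    FROM pg_index i
    JOIN pg_class c ON c.oid = i.indexrelid
    JOIN pg_am am ON am.oid = c.relam
    WHERE am.amname = 'btree';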

And also again, like a couple of
episodes ago, we discussed support

and how difficult it is to sometimes
understand what's happening.

Right.

Yeah.

And right now I have a case where a very big platform doesn't help to investigate corruption.

And nobody has the ability to investigate it but them, and they are not helping.

So it was 1 of the reasons I was provoked to talk about that in that episode.

So it's bad.

It looks really bad.

Like, you guys are responsible
for that corruption case, and

you don't do a great job.

And I think it's a problem of industry,
and we discussed it already,

so let's not repeat.

But in general, I think if you are a CTO or a leader who decides priorities, my big advice is to take this list and evaluate the situation for yourself.

Better to let us do it, of course, but you can do it yourself.

And then plan some proactive measures, because corruption testing can be done proactively, even on RDS.

If it happens, you need support, of course, because sometimes it's low level and you don't have access, but at least you will feel you have control over it, right?

Control over corruption.

So, anti-corruption tooling is needed.

This is what I feel.

That's it.

That's all my list.

I'm sure it's lacking something — security, more security-related stuff, for example.

As usual, I tend to like it. What do you think — was it good?

Michael: Yeah, you see a lot more of these things than I do, obviously, but I think it's a really good list. And with checklists, right, of course you could go on forever with things to be scared of, but this feels like: if you ticked all of these off, you'd be in such a good position versus most.

Obviously things can still go wrong, but these are — even if they're not the most common — some of the things that could cause the biggest issues, the things most likely to land on the CEO's desk or in the inbox.

So yeah, this feels like, if you're on top of all of these things, you're going to go a long, long way before you hit issues.

Nikolay: I have a question for you.

Guess among these 10 items, which item I never had in my production

life?

Michael: Oh, maybe...

Wait, give me a second.

Nikolay: It's tricky, right?

Michael: Transaction ID wraparound?

Nikolay: Exactly, how, yeah, I never had it.

I only-

Michael: It's rare.

Nikolay: Well, it's rare, yeah.

Let's cross it off, no problem.

Yeah, I only found a way to emulate it, right, which we did multiple

times, but never had it in reality in production.

So yeah, everything else I had.

Yeah.

Not once.

Michael: I nearly guessed too quickly that it was going to be

Split Brain, but then I was like, no wait, read the whole list.

But I'm guessing you had Split Brains like a couple of times,

maybe maximum?

Nikolay: Yeah, Replication Manager, Split Brain as a service,

yes.

Michael: Okay, yeah, pre-Patroni days.

Yeah, makes sense.

All right, nice 1.

Thanks so much, Nikolay.

Nikolay: Thank you.

Some kind things our listeners have said