
Top ten dangerous issues
Nikolay: Hello, hello, this is
Postgres.FM.
I'm Nik, Postgres.AI, and as usual,
my co-host is Michael, pgMustard.
Hi, Michael, how are you doing?
Michael: I'm good, how are you?
Nikolay: Great, everything's all
right.
A lot of bugs to fix and incidents
to troubleshoot, to perform
root cause analysis as we say,
RCA.
Michael: It sounds related to our
topic today maybe.
Nikolay: Oh yeah, maybe yes.
So the topic I chose is... I'm still not 100% sure how to name it properly, so let's decide together.
But the situation is simple for
me, relatively.
So we help a lot of startups, and
at some point I decided to
focus only on startups.
Having raised a few startups myself and helped many startups,
I know how it feels to choose technology
and grow, grow, grow
until some problems start hitting
you.
This is exactly the usual path: people choose RDS or CloudSQL or Supabase, anything.
And they don't need to hire DBAs,
DBREs, and they grow quite
well until a few terabytes of data
or 10,000 TPS, that kind
of scale.
And then problems pop up here and
there, and sometimes they come
in batches, you know, like not
just 1 problem, but several.
And here usually, for us it's good,
they come to us.
I mean, at Postgres.AI we still have a consulting wing, quite strong and growing.
And we helped more than 20 startups
over the last year, which
I'm very proud of.
And I collected a lot of case studies,
so to speak.
And I decided to build some classification of problems that feel not good at a very high level, for example at CTO level or even CEO level, where they might start thinking: is Postgres the right choice, or is it giving us too much headache?
And it's not about like, oh, out
of disk space suddenly, or major
upgrades requiring some maintenance
window.
Although this also can cause some
headache.
But it's more about problems like
where you don't know what to
do, or you see it requires a lot
of effort to solve it properly.
Michael: Yeah, I've had a sneak
peek of your list, so I like
how you've described it.
I also like the thought process
of whether it hits the CTO or
the CEO, and I was thinking, let's
say you have a non-technical
CEO, if they start hearing the
word Postgres too often it's probably
a bad sign.
Ideally you might mention it once
every few years when you do
a major version upgrade, but then
nothing bad happens and they
don't hear it again for a few years.
But if they're hearing, you know,
if it's getting to the CEO
that Postgres is causing problems
over and over again, the natural
question is going to be, is there
an alternative?
What could we do instead?
Or, you know, is this a big problem?
So I guess it's these kind of dangers,
not just to the startup,
but also to Postgres' continued
use at that startup.
Nikolay: Yeah, I like the word
dangerous here because when you
deal with some of these problems,
it might feel dangerous to
have Postgres for them.
It's bad.
Like I would like if things were
better.
So I have a list of 10 items we can discuss. The list is unordered, and I'm going to post it to my social networks so folks can discuss. And I sincerely think this list is useful.
If you're a startup, it's great to just use this checklist to see how your cluster or clusters are doing and whether you are ready. So, a Postgres growth readiness checklist.
And it's interesting that I didn't include vertical and horizontal scaling there. I did include it indirectly; we will touch on it.
But obviously this is the most discussed topic, the biggest danger: how Postgres scales, as a cluster, single primary and multiple standbys, and how far we can go. We know we can go very far, very, very far on a single cluster.
At some point, microservices, or maybe sharding; it's great. But we had a great episode with Lev Kokotov of PgDog, and one of the items I have today resonates with what he said during that episode.
So anyway, let's exclude vertical
and horizontal scaling and
talk about stuff which kind of
sounds boring.
My first item is heavy lock contention.
This is very popular.
Maybe 50% of companies that come
to us have this issue.
Somehow.
So at some point I decided to start telling everyone: if you have queue-like workloads, and/or if you don't know how dangerous it is to change schema in Postgres (just adding a column can be a problem, right? We discussed it, I think, many times), you are not ready to grow, and sooner or later it will hit you.
And it will hit you as a spike
of active sessions.
And we know some managed Postgres platforms provoke you to have a huge max_connections.
Michael: Max connections,
Nikolay: yeah.
On RDS, like 5,000 or 2,500.
Why do they do this?
Easier for them.
But it's dangerous because it additionally creates a kind of performance cliff.
Michael: Yeah, it's another version of these cliffs isn't it?
Nikolay: We
Michael: had another good episode recently.
Nikolay: Yeah, I plan to research this a little bit; probably we will publish something in this area to prove that it's not good. It's still not good even if you have Postgres 14 plus, which has great optimizations for a large number of idle connections; it's still not good.
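For reference, a quick way to see how many of those connections are actually doing work versus sitting idle (a sketch; on a pooled setup you would expect mostly idle sessions on the server side):

```sql
-- Count client connections by state; compare the total against max_connections.
SELECT state, count(*)
FROM pg_stat_activity
WHERE backend_type = 'client backend'
GROUP BY state
ORDER BY count(*) DESC;

SHOW max_connections;
```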
Michael: And there have been some improvements. Like, I know a very good engineer who took Postgres down by adding a column with a default, I think it was. But that was many years ago; there have been improvements in recent years, and some DDL changes are less dangerous than they were.
Nikolay: Yeah, there are several levels.
Michael: Yes, yeah, of course.
And if they get stuck, if they don't have a lock_timeout, for example. In fact, we're probably going to be pointing to episodes on every single one of these bullet points, but we had one on zero-downtime migrations, I think that's probably the best for this, and we had a separate one on queues actually, didn't we?
Nikolay: So yeah.
Yeah.
So definitely there are solutions here, and you just need to deploy them proactively. It's interesting that I see some companies grow quite far without noticing this problem, for example with DDL. It's like going to a casino: you can win, you can win, and then sometimes, boom, you lose. Because if you deploy some DDL and you get blocked, you can block others and it can be a disaster. We discussed it several times. And if you had a hundred successful deployments, it doesn't mean you will keep winning, right? So it's better to have protection in place. And it concerns me; I have a feeling we should implement this in Postgres itself. Like ALTER TABLE CONCURRENTLY or something like this, which would itself perform these retries with a low lock_timeout.
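A minimal sketch of that retry pattern, assuming a hypothetical table and column; real migration tools add logging, backoff, and more careful error handling:

```sql
-- Try the DDL with a low lock_timeout and retry a few times instead of
-- queueing behind a long-running transaction and blocking everyone else.
DO $$
BEGIN
  FOR i IN 1..5 LOOP
    BEGIN
      SET LOCAL lock_timeout = '100ms';
      EXECUTE 'ALTER TABLE orders ADD COLUMN note text';
      RETURN;  -- success
    EXCEPTION WHEN lock_not_available THEN
      PERFORM pg_sleep(1);  -- someone held the lock; wait and retry
    END;
  END LOOP;
  RAISE EXCEPTION 'could not acquire lock after 5 attempts';
END $$;
```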
Michael: Yeah, it's tricky, isn't it? But I agree. Then again, people still need to know that it exists to actually use it, because I think the main issue here is people not realizing that it can be a problem. And the fact that it probably hits users. Let's say you've got a statement_timeout: when are you actually going to notice that users have been waiting for it? Are you going to notice that spike on your monitoring? I'm not sure; it depends how many users actually got stuck waiting behind it and had slow queries. And it's going to be hard to reproduce; you might not know why it was that.
Nikolay: log_lock_waits is off by default, so you don't see who blocked you. And it might be autovacuum running in its aggressive mode, or it can be another session's long-running transaction which holds an ACCESS SHARE lock on the table, so you cannot alter it. And boom, you block others. So this is like a chain reaction. And yeah, it's not good.
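Two things that help here, as a sketch: turning on log_lock_waits, and checking blockers live with pg_blocking_pids():

```sql
-- Log any session that waits on a lock longer than deadlock_timeout (1s by default).
ALTER SYSTEM SET log_lock_waits = on;
SELECT pg_reload_conf();

-- Who is blocked right now, and by which PIDs?
SELECT pid, pg_blocking_pids(pid) AS blocked_by, wait_event_type, state, query
FROM pg_stat_activity
WHERE cardinality(pg_blocking_pids(pid)) > 0;
```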
And queue-like workloads are the same: at some smaller scale, you don't see problems at all. Then you occasionally experience them. But if you grow very fast, you will start hitting these problems very badly. And they look like spikes of heavy lock contention, or just lock contention ("heavy lock" and "lock" in Postgres terminology are the same), and it doesn't look good. And the suggestion is so simple. It's funny that we talk a lot, and people who come to us actually mention they watch the podcast, and I say, okay, you have queue-like workloads? Just take care of indexes, take care of bloat, maybe partitioning, but most importantly, SKIP LOCKED.
That's it.
This is the solution, but we spend hours discussing details. Because when you get to reality, it's not easy to learn this; there are objections sometimes, but this is what we do: we work with those objections and help to implement it, right? So yeah, for everything we had an episode. There are episodes for everything.
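A minimal queue-consumer sketch, with a hypothetical jobs table, showing the FOR UPDATE SKIP LOCKED pattern mentioned above:

```sql
-- Each worker grabs a small batch of pending jobs without blocking other workers.
WITH next_jobs AS (
    SELECT id
    FROM jobs
    WHERE status = 'pending'
    ORDER BY created_at
    LIMIT 10
    FOR UPDATE SKIP LOCKED
)
UPDATE jobs
SET status = 'processing'
FROM next_jobs
WHERE jobs.id = next_jobs.id
RETURNING jobs.id;
```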
So this was number one: heavy lock contention. And I chose the most popular reasons; of course there are other reasons. But in my view, DDL and queue-like workloads are the biggest ones.
Okay, the next one is boring, super boring: bloat control and index management.
We had episodes about it, maybe
several actually.
But, again, managed Postgres platforms don't give you tools. For example, RDS: they did a great job in autovacuum tuning, but only half of it. They made it very aggressive in terms of resources, like throttling: they gave it a lot of resources. But they don't adjust scale factors. So autovacuum visits your tables not often enough for OLTP. So bloat can accumulate and so on, and they don't give you the tools to understand the reasons for the bloat.
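A sketch of the missing half of that tuning, per-table scale factors (the table name and values are illustrative, not a universal recommendation):

```sql
-- Make autovacuum visit a hot table after ~1% of rows change instead of the
-- default 20% (autovacuum_vacuum_scale_factor = 0.2).
ALTER TABLE orders SET (
    autovacuum_vacuum_scale_factor  = 0.01,
    autovacuum_analyze_scale_factor = 0.01
);
```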
I'm thinking about it, and I think it's tricky, and it's also a problem of the Postgres documentation, because it lacks clarity on how to troubleshoot the reasons for bloat. We always say "long-running transaction", but not every transaction is harmful. For example, at the default transaction isolation level, Read Committed, a transaction is not that harmful if it consists of many small queries. If it's a single long query, it holds a snapshot, and that's harmful. So I guess with observability we should shift fully from long-running-transaction language to xmin-horizon language, and discuss that.
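A rough sketch of looking at it in xmin-horizon terms: what is holding the horizon back, whether snapshots, replication slots, or prepared transactions:

```sql
-- Oldest snapshots held by sessions (long single queries count here too).
SELECT pid, age(backend_xmin) AS xmin_age, xact_start, state, left(query, 60) AS query
FROM pg_stat_activity
WHERE backend_xmin IS NOT NULL
ORDER BY age(backend_xmin) DESC
LIMIT 5;

-- Replication slots and prepared transactions can hold the horizon back as well.
SELECT slot_name, xmin, catalog_xmin FROM pg_replication_slots;
SELECT gid, prepared FROM pg_prepared_xacts;
```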
Anyway, I can easily imagine, and I observe, how people think: oh, MongoDB doesn't have this stuff. Or some other database system, they don't have the problem with bloat. Or indexes, indexes, oh. Actually, with indexes, my true belief is that degradation of index health happens in other systems as well. We also discussed it. So they need to be rebuilt.
Michael: I was listening to a SQL
Server podcast just for fun
the other day and they had the
exact same problem.
But in the episode where we talked
about index maintenance, I
think it came up that even if you're
really on top of autovacuum,
even if you have it configured
really nicely, there can still
be occasions where you get some
bloat.
If you have a spike, or you have a large deletion, there are a few cases where you can end up with sparsely populated indexes that can't self-heal. For example, if you've got an ordered key, even a UUIDv7 index, and then you have a session that deletes some old data and it's not partitioned, then you've got a gap in your index.
So there's a bunch of reasons why
they can get bloated anyway,
even if you're on top of autovacuum.
So I think this is 1 of those ones
that, yes, autovacuum fixes
most of the problems, but you probably
still want to have some
plan for index maintenance anyway.
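A sketch of what such an index maintenance plan boils down to (the index name is hypothetical; REINDEX ... CONCURRENTLY is available since Postgres 12):

```sql
-- Rebuild a bloated index without blocking reads or writes.
-- Decide which indexes need it using a bloat estimate or pgstattuple first.
REINDEX INDEX CONCURRENTLY orders_created_at_idx;
```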
Nikolay: Yeah, so there are certain things that are not automated by Postgres itself, or by Kubernetes operators, or by managed service providers; well, some of them automated some things, but not everyone, not everything. Even upgrades: there's a lack of automation there too. We can mention the lack of automation around ANALYZE, but fortunately future Postgres versions will definitely be fine, because dump and restore of statistics is finally implemented and goes into Postgres 18, which is super great news. Anyway, lack of automation might feel like, oh, this is a constant headache, but it's solvable. It's solvable. It requires some effort, but it's solvable.
Okay. Next thing: let's talk about lightweight lock contention. So we talked about heavy lock contention, or just lock contention. Lightweight lock contention also feels like pain, of various kinds. Lightweight locks can be called latches; they're in-memory. When some operations on the buffer pool happen, for example, Postgres needs to acquire lightweight locks, or when working with WAL or various shared data structures. I can also mention the lock manager. So things like LWLock:LockManager, or buffer mapping, or the subtransaction SLRU and multixact SLRU.
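A crude way to see whether these wait events dominate, as a sketch; sampling pg_stat_activity repeatedly (or using an extension like pg_wait_sampling) does the job:

```sql
-- Snapshot of what active sessions are waiting on right now.
SELECT wait_event_type, wait_event, count(*)
FROM pg_stat_activity
WHERE state = 'active' AND pid <> pg_backend_pid()
GROUP BY wait_event_type, wait_event
ORDER BY count(*) DESC;
```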
When you hear these terms, for me, imagine them written in a bloody font, you know, with red drops of blood, because I know so many projects that suffered big pain, big incidents. So for me, these are bloody terms, you know, because it was a lot of pain sometimes.
For example, you know I'm a big fan of subtransactions, right? My natural advice is just to eliminate them all. Well, over time I've become softer; I say, okay, you just need to understand them and use them very carefully. But LockManager, a couple of years ago, remember Jeremy Schneider posted...
Michael: Yeah, great post.
Nikolay: Horror stories, and we discussed it as well. So this kind of contention might hit you, and it usually feels like a performance cliff: all good, all good, boom.
Michael: Right, it's... what is, or was, is it changing in 18? It was a hard-coded limit, right?
Nikolay: 16, you mean, for fast path? Also, SLRU sizes are now configurable, I think in 17 already. Well, nice, good, but not always enough. Because okay, you can buy some time, but still there is a cliff, and if you're not far from it, again, boom.
Or this, I recently saw it: remember we discussed 4 million transactions per second? And we discussed that first we found pg_stat_kcache was an issue, it was fixed, and then pg_stat_statements. Okay, pg_stat_statements: if the transactions are super fast, it brings an observer effect. And we see it in newer Postgres versions as LWLock:pg_stat_statements, because finally the code is properly instrumented; it's wrapped and it's visible in wait event analysis when observing just that activity.
So I saw it recently at one customer, some share of lightweight lock waits around pg_stat_statements, so we needed to discuss what was happening. It happens only when you have a lot of very fast queries, but it can be a problem as well.
But yeah, performance cliffs: it requires some practice to understand where they are. It's hard because you need to understand how to measure usage, how to understand the risks of the situation. This requires some practice.
Michael: I think this is 1 of the
hardest ones.
I think this is 1 of the hardest
ones to see coming.
Nikolay: After all our stories with LWLock:LockManager, every time I see some query exceed 1,000 QPS, queries per second, I'm already thinking, okay, this patient is developing some chronic disease, you know.
Michael: Okay, that's another one I haven't heard. We've done several rules of thumb before, but that's another good one: a thousand queries per second for a single query, check.
Nikolay: It's also very relative to how many vCPUs we have. If we have fewer, it can hit faster. Although we couldn't reproduce exactly the same behavior as we see on huge machines, like 196 cores; we couldn't reproduce that behavior on 8-core machines at all. So yeah, it's for the big boys only, you know, or maybe for adults. So young projects don't experience these problems.
Michael: That's a good point, actually. The startups that have hit this, that you've written about, have tended to be further along in their journeys: huge, but still growing quickly, and it's an even bigger problem at that point. But yeah, good point. Should we move on?
Nikolay: Yeah, so the next one is our usual suspect, right? It's wraparound of the 4-byte transaction ID and multixact ID. So many words have already been said about this. It just bothers me that monitoring usually doesn't cover, for example, multixact IDs. And people still don't have alerts and so on. So it's sad. Yeah, they're easy to create these days.
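A sketch of the alert being described, covering both transaction IDs and multixact IDs; the point is to alert well before the ~2 billion limit:

```sql
-- Wraparound headroom per database: how old are the oldest unfrozen XIDs/multixacts?
SELECT datname,
       age(datfrozenxid)    AS xid_age,
       mxid_age(datminmxid) AS multixact_age
FROM pg_database
ORDER BY age(datfrozenxid) DESC;
```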
Michael: I get the impression, though... I mean, there were a few high-profile incidents that got blogged about. I think, yes, yeah, exactly. And I feel like I haven't seen one in a long while, and I know there are a lot of projects that are having to, you know... I think Adyen have spoken about how, if they weren't on top of this, it would only be a matter of hours before they'd hit wraparound; it's that kind of volume. So they're really having to monitor and stay on top of it all the time. But I haven't heard of anybody actually hitting this for quite a while.
Do you think... I wondered if, for example, there were some changes to, I think it was, autovacuum: I think it kicks in to do an anti-wraparound vacuum differently now, or it might be a lighter type of vacuum that it runs. I think I remember Peter Geoghegan posting about it, something like that. Do you remember a change in that area?
Nikolay: I don't remember, honestly.
I just know this is still a problem.
Again, at CTO level, it feels like,
how come Postgres still has
4 byte transaction IDs and what
kind of risks I need to take
into account.
But you are right, managed Postgres
providers do quite a good
job here.
They take care.
I had a guest at Postgres TV, Hannu Krosing, who talked about how to escape from it in a non-traditional and, in his opinion (and actually my opinion as well), better way, without single-user mode. And since he is part of the CloudSQL team, it also shows how much effort managed Postgres providers put into this area, realizing this is a huge risk.
Michael: Yeah, and it's not even,
even if it's a small risk,
the impact when it happens is not
small.
So it's one of those ones where...
Nikolay: Absolutely, good correction. It's low risk, high impact, exactly.
Michael: Yes, yes.
So I think the cases that were
blogged about were hours and hours
possibly even getting, was it even
a day or 2 of downtime for
those organizations?
And that was, that is then, I mean,
you're talking about dangers,
right?
Nikolay: That's... global downtime, the whole database is down.
Michael: Exactly.
People, you're gonna lose some
customers over that, right?
Nikolay: Yeah. Unlike that, the next item, a 4-byte integer primary key, is still a thing, you know. I was surprised to have this case recently, and it was overlooked by our tooling.
Michael: Oh, really?
Nikolay: I couldn't believe it, like, how come? Yeah, because it was a non-traditional way to have this.
Michael: Go on.
Nikolay: Well, first of all, it was a sequence which was used by multiple tables. Yeah, one for all of them. And somehow it was defined in a way that our report in postgres-checkup didn't see it, so when it came I was like, how come? This is an old friend, or old enemy; not friend, enemy.
Michael: Old enemy.
Nikolay: I haven't seen you for so many years. And you look different, you know, because of the multiple tables, but still, it's not fun. And this causes partial downtime, because some part of the workload cannot work anymore. You cannot INSERT.
Michael: Yeah.
Nikolay: Yeah.
So, by the way, I also learned
that if you just do in place ALTER
TABLE for huge table, it not so
dumb as I thought.
I checked source code, I was impressed.
And this code is from 9 point something,
maybe even before.
So if you ALTER TABLE, ALTER COLUMN
to change from int4 to int8,
it actually performs a job like
similar to VACUUM FULL.
Recreating indexes.
And you don't have bloat.
I expected like 50% bloat, you
know.
Michael: Oh, why?
Nikolay: Because I thought it would just rewrite the whole table. I was mistaken; it's quite smart. Of course it's a blocking operation, it causes downtime to perform it, but you end up with quite a clean state of the table and indexes. Not just quite clean; it's fresh.
Michael: Yeah, so that is a table
rewrite, no?
Nikolay: Yes, well, you are right. I was thinking about a table rewrite as a very dumb thing, like creating new tuples and deleting the other tuples.
Michael: Got it, got it, got it.
Nikolay: But there is a proper mechanism of table rewrite in the code. Now I've finally seen it; I'm still learning, you know, sorry.
Michael: You might end up with
some padding issues if you had
it optimized well before, but yeah.
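For reference, the rewrite being discussed looks like this (hypothetical names); it takes an ACCESS EXCLUSIVE lock and rewrites the table and its indexes, so it means downtime, but it leaves a freshly packed table behind:

```sql
-- Blocking int4 -> int8 conversion; the table and all its indexes are rewritten.
ALTER TABLE orders
    ALTER COLUMN id TYPE bigint;
```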
Nikolay: Yeah, it also feels like Postgres could eventually implement some kind of reshape, because there are building blocks in the code already. I see them: first in an offline style, to change column order, and then, if you want it, in a fully online style, if pg_squeeze goes to core, right? Yeah, it would be great. I'm just connecting the dots here, and it could be very powerful in like 3 to 5 years maybe. But it's a lot of work, additionally. So to all those who are involved in moving these huge building blocks, I have huge respect. So okay.
Michael: And I think, if you know what you're doing, this is one that's easier to recover from.
I assume like with the sequence,
for example, you can handle
it multiple ways, but you can set
the sequence to like negative
2 billion and normally you've got
a good start.
Nikolay: Everyone thinks they're smart, and this is the first thing I always hear when we discuss this. It was the same in the past: let's use negative values. Of course, if you can use negative values, do it. Because we know Postgres integers are signed integers, so we use only half of the 4-byte capacity; half of it is 2.1 billion, roughly. So you have 2.1 billion more, but it's not always possible to use them.
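Where negative values are acceptable, the emergency move being discussed is just restarting the sequence at the bottom of the int4 range (the sequence name is hypothetical):

```sql
-- Buys roughly another 2.1 billion values on a signed 4-byte column.
ALTER SEQUENCE orders_id_seq MINVALUE -2147483648 RESTART WITH -2147483648;
```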
But this is an old, old story that still makes some people nervous, and I think it's good to check in advance.
Michael: So much better, so much
better to have alerts when you're
getting...
Nikolay: I saved several companies, big ones, from this, just by raising it. And I know in some companies it was like one year or a few years of work to fix it.
Michael: So what was the problem
before?
Was it looking at columns instead
of looking at sequences?
Or what was the-
Nikolay: No, sequences are always 8 bytes; it was always so. The problem was with the report, I honestly don't remember. There was some problem with the report. It was not the standard way, not just CREATE TABLE where you have a default with a sequence and you see it. Something else, some function, I don't remember exactly.
Okay.
But usually our report catches such things. Or you can just check yourself: if you have 4-byte primary keys, it's time to move to 8 bytes, or to UUID version 7, right? Maybe. Okay, that's it about this.
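A rough catalog query for that check, as a sketch; postgres-checkup does a more thorough job, including how sequences are attached:

```sql
-- Find primary key columns that are still 4-byte integers.
SELECT c.relname AS table_name, a.attname AS column_name
FROM pg_index i
JOIN pg_class c     ON c.oid = i.indrelid
JOIN pg_attribute a ON a.attrelid = c.oid AND a.attnum = ANY (i.indkey)
JOIN pg_type t      ON t.oid = a.atttypid
WHERE i.indisprimary
  AND t.typname = 'int4'
ORDER BY 1;
```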
Then let's heat up the situation: replication limits. So in the beginning I mentioned vertical and horizontal scaling, and usually people say there's not enough CPU or disk I/O and we need to scale, and you can scale read-only workloads by having more read-only standbys. But it's hard to scale writes, and this is true. But it's also true that at some point... and Lev mentioned it in our PgDog episode, he mentioned that at Instacart they had it, right? At something like 200 megabytes per second of WAL generation, or already over 100; I don't remember exactly what he mentioned, but somewhere in that area.
Michael: It was a lot.
Nikolay: Well, just 10 WAL files per second, for example, gives you 160 megabytes. Well, on RDS it's fewer files, because they have 64-megabyte WALs; they raised the size 4 times. Anyway, at 100 or 200, 300 megabytes per second, you start hitting problems with single-threaded processes.
Actually, it surprises me that we don't monitor this. Any Postgres monitoring with low-level access should have it; RDS doesn't have it, but with low-level access we should see CPU usage for every important single-threaded Postgres process, right? WAL sender, WAL receiver, logical replication worker, maybe checkpointer as well.
That would
Michael: be helpful.
Nikolay: Of course, because it's
dangerous to grow at scale when
you hit 100% of a single vCPU.
And then you need to either vertically
or horizontally scale
or start saving on WAL generation.
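A rough psql sketch for measuring the WAL generation rate being discussed, by diffing two LSN samples:

```sql
-- Sample the current WAL position, wait a minute, and see how much was written.
SELECT pg_current_wal_lsn() AS lsn_before \gset
SELECT pg_sleep(60);
SELECT pg_size_pretty(
         pg_wal_lsn_diff(pg_current_wal_lsn(), :'lsn_before')
       ) AS wal_per_minute;
```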
Fortunately, in pg_stat_statements,
we have WAL metrics, 3
columns since Postgres 13.
But unfortunately, this is query
level.
What we need, we need also table
level to understand which tables
are responsible for a lot of WAL.
And pg_stat_activity lacks it.
I think it's a good idea to implement.
If someone wants hacking, this
is a great idea.
Add 3 more columns, WAL-based,
WAL-related metrics to pg_stat_all_tables,
pg_stat_sys_tables, and
user tables.
It would be great.
Also, maybe pg_stat_databases, like
a global view of things, how
much WAL.
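Meanwhile, the query-level view that does exist (Postgres 13+, with pg_stat_statements installed) looks like this, as a sketch:

```sql
-- Top WAL producers by query.
SELECT queryid, left(query, 60) AS query, wal_bytes, wal_fpi, wal_records
FROM pg_stat_statements
ORDER BY wal_bytes DESC
LIMIT 10;
```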
Michael: Yeah, I've got a vague memory there were a couple of WAL-related new views, new system views, introduced.
Nikolay: Oh yes, but it's about... Are you talking about pg_stat_io? No. The WAL-related stuff in Postgres 13 went into EXPLAIN and into pg_stat_statements. This is what happened.
Anyway, this is not an easy problem to solve. It's easy to check if you have access; unfortunately, if you're on managed Postgres, you don't have access, so they need to check what's happening, especially on standbys. And also, it makes sense to tune compression properly, because compression can eat your CPU. Remember we discussed WAL compression: I always said, let's turn it on. Now I think: let's turn it on unless you have this problem, in this case with the WAL sender. Yeah, you need to check how much of that CPU is WAL compression. And also we have new compression algorithms implemented in fresh Postgres versions, lz4 and zstd, so we can choose a better one. For example, lz4 should provide, as I remember from some benchmarks I saw (I didn't do it myself yet), a compression ratio similar to pglz, the default compression, but it takes much less CPU, like 2 to 3 times less. So it's worth choosing that.
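A sketch of that choice; the lz4 and zstd values require Postgres 15+ and a build with the corresponding libraries:

```sql
ALTER SYSTEM SET wal_compression = 'lz4';  -- or 'zstd', 'pglz', 'on', 'off'
SELECT pg_reload_conf();
```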
Michael: I looked at it briefly, just for our own use, for plan storage. And we even got better performance as well. So it won on both counts for us.
Nikolay: Well, you probably saw
what I saw, Small Datum blog,
right?
No?
OK, maybe not.
Let's move on, because there are
more things to discuss.
Michael: Nice.
Nikolay: Design limits.
So some people already think what
they will do when their table
reaches 32 terabytes.
Michael: Yeah, I guess this and
the last 1 both feel like adult
problems again, right?
Like, There aren't too many small
startups hitting the Instacart
level of WAL generation or 32
terabytes of data.
Nikolay: So yeah, it's really big clusters. But we can start thinking about them earlier and be better prepared. Maybe sometimes without spending too much time, because, you know, if you spent too much time thinking about how you will bring statistics to the new cluster after a major upgrade, well, it's already implemented in 18, so just a couple of years more and everyone forgets about this problem.
Michael: It's the kind of thing where it's useful, if you've got a to-do list you're going to be using in a couple of years' time, or a calendar you know you're going to get an alert from, to just put a reminder in for like a year or two's time, just to check.
Nikolay: We also mentioned a few times during the latest episodes that this bothers me a lot. First I learned CloudSQL has this limit, then I learned RDS also has this limit: 64 terabytes per whole database. What's happening here? These days that's already not a huge database.
Michael: but again in a couple
of years who knows they might
have increased that
Nikolay: Yeah, well, I think it's solvable, of course. I guess this is the limit of a single EBS volume, or of the disk on Google Cloud, PD-SSD or whatever they call it. So yeah, it's solvable, I think, right? With some tricks. But these days it doesn't feel huge. 64 feels like a big database, right? But when we say...
Michael: to most of us,
Nikolay: But we have stories of 100-plus-terabyte databases, and all of them are self-managed.
I think everyone has
Michael: all of them were sharded.
All of those were sharded.
Nikolay: Yeah, 100%, yeah, we had
a great episode, 100 terabytes,
where it was Adyen, who else?
Michael: We had Notion and we had
Figma.
Nikolay: Figma, right. And Notion and Figma, they are on RDS, but it's sharded; a single cluster that size is impossible on RDS, right? And I think Adyen has 100-plus terabytes. Well, they have 100-plus terabytes, but it's self-managed. Yes. Because on RDS it's impossible, not supported.
Michael: Well, and they shard it
also.
Nikolay: Yeah, yeah, yeah.
But yeah, large companies like
that, they're always like some
parts are sharded, some are not.
So anyway, when you have 10 to
20 terabytes, it's time to think
if you are on RDS or CloudSQL,
is it like how you will grow
5X?
Because if you're 20 terabytes
to grow 5X, 100, it's already
not possible with a single cluster,
right?
Another reason to think about splitting
somehow.
So, okay.
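A trivial sketch of the check implied here, comparing current size against the provider's ceiling and the expected growth:

```sql
-- Current database sizes; multiply by the expected growth factor and compare
-- with the platform's per-instance limit (e.g. 64 TB).
SELECT datname, pg_size_pretty(pg_database_size(datname)) AS size
FROM pg_database
ORDER BY pg_database_size(datname) DESC;
```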
Then a few more items. So, data loss. Data loss is a big deal if you poorly designed backups or HA solutions; yeah, it can be. Let's join this with poor HA choices leading to failures like split-brain. So: data loss, split-brain.
Actually, I thought we had a discussion about this. There is an ongoing discussion in the project called CloudNativePG, where I raised the topic of split-brain and demonstrated how to get it, a couple of weeks ago. And the good news, as I see it, is that they decided to implement something and move in a direction similar to Patroni, because when a network partition happens and the primary is basically alone, it's bad, because it remains active. And as I demonstrated, some parts of the application might still talk to it. This is classical split-brain.
And based on that discussion, it triggered something I had never actually thought deeply about: is split-brain just a variant of data loss?
Michael: Well, I guess you technically might not have lost the data. It's still there, you just have 2 versions. Which one is correct?
Nikolay: It's worse than data loss. It's worse than data loss. It's worse. Because now you have 2 versions of reality. And it's bad.
With data loss, you can apologize and ask: bring your data back again, please. And in some cases we allow some data loss. Of course, data loss is a really sad thing to have. But sometimes we have it, like, officially: some data loss might happen, the risk is very low, and at maximum this number of bytes, for example, but it might happen. With split-brain it's worse. You need to spend a lot of effort to merge the realities into one.
Michael: Most cases of data loss I've seen have tended to be at least 2 things gone wrong, like user error, or some way that a node has gone down; but quite often it's user error, an accidental DELETE without a WHERE clause, or dropping a table in one environment.
Nikolay: Well, this is like higher
level data loss.
Michael: Well, but that can cause the low-level data loss if you then also don't have tested backups, and it turns out you didn't have a good one; so it's often the combination of the 2 things.
Nikolay: For me it's still... yeah, well, yes, if backups are missing, it's bad, you cannot recover. But data loss for me, classical, lower-level, is: the database said commit successful, and then my data is gone.
Michael: So yeah. That's scary and dangerous for a CTO.
Nikolay: It undermines trust in Postgres again, right? Yeah, yeah. If procedures are leading to data loss and also split-brains.
Michael: Is it actually happening
often or is it more the CTO
thinking, I don't want to choose
this technology because it could
happen?
Nikolay: It depends on the project and the level of control. I'm pretty confident that in many, many, many web and mobile app, OLTP-style projects, unless they are financial or something: in social networks, social media, maybe even e-commerce and so on, data loss sometimes happens unnoticed. Yeah. If you have asynchronous replicas and a failover happened, well, in the process of failover, Patroni by default with asynchronous replicas allows up to 1 megabyte; it's written in the config, up to 1 megabyte of data loss, officially. Maybe it's in bytes.
Right, so 1 megabyte is possible to lose. And who will complain if it's social media comments or something; we store comments, we lost some comments, maybe nobody noticed. But if it's a serious project, it's better not to allow it. Or split-brain. Yeah, anyway, this is mostly not about Postgres itself; Postgres does quite well here. It's a question mostly for everything around Postgres, the infrastructure, and if it's managed Postgres, for their infrastructure: how they guarantee there is no data loss.
And last time we discussed the problem with synchronous_commit, and we discussed in detail how right now Postgres doesn't do a good job of revealing proper LSNs on standbys, right? So even Patroni can have data loss in the case of synchronous_commit with remote_write. We discussed it. Okay, anyway, this definitely feels like something for improvement here.
Good. Corruption. My general feeling is that people don't realize how many types of corruption might happen. And it remains unnoticed in so many cases. When you start talking about it, people's reaction sometimes is: wait, what? So yeah, corruption, at various levels.
Michael: So actually, maybe this, in terms of the list where we started... the point of the topic was kind of dangers, right? Is this one of those ones where, if it silently happened for a while, it suddenly becomes a complete loss of trust in the underlying system?
Nikolay: Yeah, as usual, I can rant a little about how Postgres defaults are outdated. We know data checksums were only recently enabled by default. Yeah, great change, like a month or 2 ago, right? It will only be in 18. It should have been done 5 years ago.
We saw evidence.
Michael: Some, yeah.
Nikolay: Yeah, many managed Postgres providers did it, like RDS. That's great.
Michael: Which is great.
And that kind of is then the default
for a lot of people.
Nikolay: Yeah, but it also doesn't guarantee that you don't have corruption. You need to read all the pages from time to time, right? And do they offer something, some anti-corruption tooling? They don't, nobody does. Okay, they enabled it, so what? This is just a small piece of the whole puzzle.
Michael: Yeah, and amcheck improving
as well.
I think, is it in 18?
Nikolay: Yeah, that's great.
Michael: In indexes as well.
Nikolay: Yeah, big.
Michael: In fact, I think we, did
we have an episode on amcheck?
I feel like it came up at least
once or twice, maybe index maintenance.
Nikolay: I cannot remember all
our episodes, so many.
Michael: Yeah, me neither.
Nikolay: But it's great to see the progress in the area of amcheck. But again, it's a small piece of the puzzle, and it has settings, options to choose from. So it's not trivial to choose among the options.
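A sketch of proactive checking with amcheck (object names are hypothetical; verify_heapam requires Postgres 14+, and these checks read a lot of data, so schedule them carefully, ideally on a standby or clone):

```sql
CREATE EXTENSION IF NOT EXISTS amcheck;

-- Verify a B-tree index; heapallindexed = true also checks every heap tuple is indexed.
SELECT bt_index_check('orders_pkey'::regclass, true);

-- Scan a table's heap for corrupted tuples (Postgres 14+).
SELECT * FROM verify_heapam('orders');
```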
And also again, like a couple of
episodes ago, we discussed support
and how difficult it is to sometimes
understand what's happening.
Right.
Yeah.
And right now I have a case where a very big platform doesn't help to investigate corruption. And nobody has the ability to investigate it but them, and they are not helping. So it was one of the reasons I was provoked to talk about that in that episode. It's bad. It looks really bad. Like: you guys are responsible for that corruption case, and you're not doing a great job. And I think it's a problem of the industry, and we discussed it already, so let's not repeat it.
But in general, I think if you are a CTO, or a leader who decides priorities, my big advice is to take this list and check it, evaluate the situation for yourself. Better to let us do it, of course, but you can do it yourself. And then plan some proactive measures, because corruption testing can be done proactively even on RDS. If it happens, you need support, of course, because sometimes it's low-level and you don't have access, but at least you will feel in control of it, right? Of corruption. So, anti-corruption tooling is needed. This is what I feel. That's it.
That's all my list.
I'm sure it's lacking something, like security, more security-related stuff, for example, as usual. What do you think, was it good?
Michael: Yeah, you see a lot more of these things than I do, obviously, but I think it's a really good list. And it's like that with checklists, right? Of course you could go on forever with things to be scared of, but this feels like, if you ticked all of these off, you'd be in such a good position versus most. And obviously things can still go wrong, but these are, even if they're not the most common, some of the things that could cause the biggest issues, the things that are most likely to get onto the CEO's desk or into the inbox. So yeah, this feels like, if you're on top of all of these things, you're going to go a long, long way before you hit issues.
Nikolay: I have a question for you. Guess: among these 10 items, which one have I never had in my production life?
Michael: Oh, maybe...
Wait, give me a second.
Nikolay: It's tricky, right?
Michael: Transaction ID wraparound?
Nikolay: Exactly, how, yeah, I never had it.
I only-
Michael: It's rare.
Nikolay: Well, it's rare, yeah. Let's cross it off, no problem. Yeah, I only found a way to emulate it, which we did multiple times, but I never had it in reality, in production. So yeah, everything else I have had. Yeah. And not just once.
Michael: I nearly guessed too quickly that it was going to be
Split Brain, but then I was like, no wait, read the whole list.
But I'm guessing you had Split Brains like a couple of times,
maybe maximum?
Nikolay: Yeah, Replication Manager, Split Brain as a service,
yes.
Michael: Okay, yeah, pre-Patroni days.
Yeah, makes sense.
All right, nice 1.
Thanks so much, Nikolay.
Nikolay: Thank you.