Michael: Hello and welcome to Postgres.FM,
a weekly show about

all things PostgreSQL.

I'm Michael, founder of pgMustard,
I'm joined as usual by

Nikolay, founder of Postgres.AI.

Hey Nikolay.

Nikolay: Hi Michael.

Michael: And today we are delighted to be
joined by Alexander Kukushkin,

a Postgres contributor currently
working at Microsoft and most

famously, maintainer of Patroni.

We had a listener request to discuss
Patroni, so we're delighted

you agreed to join us for an episode.

Alexander.

Alexander: Yeah, hello Michael,
hello Nikolay.

Thank you for inviting me.

I'm really excited to talk about
my favorite project.

Michael: Us too.

Perhaps as a starting point, could
you give us an introduction?

I think most people will have heard of Patroni and know what it is, but for anybody who doesn't, could you give an introduction to what it is and why it's important?

Alexander: Yeah, so Patroni, in simple words, is a failover manager for Postgres.

It solves the problem of availability of the primary.

In Postgres, we don't use non-inclusive words like master anymore. That's why we call it primary, and Patroni actually recently got rid of these non-inclusive words completely.

And the way Patroni does it: it makes sure that we are running just a single primary at a time, and at the same time Patroni helps you manage as many read-only replicas as you like to have, keeping those replicas ready to become primary in case the primary has failed.

At the same time, Patroni helps
to automate usual DBA tasks like

switchover, configuration management,
stuff like that.

Nikolay: Node provisioning also,
right?

Alexander: Node provisioning, not
really.

Node provisioning is a task for
DBA.

DBA has to start Patroni, and Patroni
will take care of bootstrapping

this node.

In case it's a totally new cluster,
Patroni will start as a primary.

In case the node joins an existing
cluster, the replica node

will take a pg_basebackup by default
from the running primary

and start the replica.

And the most interesting part: let's say we bring back a node which was previously running as a primary. Patroni does everything to convert this failed primary into a new standby, to join the cluster and be prepared for the next unforeseen event.

Nikolay: At least you agree that it does part of node provisioning, because otherwise we wouldn't have situations when the old data directory, the old PGDATA, was copied, a new one is created, and we are suddenly out of disk space.

And if you don't expect Patroni
to participate in node provisioning,

then you think, what's happening?

Why am I out of disk space?

Right?

It happens sometimes.

Alexander: It used to happen, I think, with bootstrap mode. I don't remember up until which version, but when Patroni tries to create a new cluster, it usually does it by using initdb, but in some cases you can configure Patroni to create the cluster from an existing backup, like from a base backup.

And if something goes wrong, Patroni does not remove this data directory, but renames it. And it used to apply the current timestamp to the file name.

And therefore, after the first failure, it gives up, waits a little bit and does the next attempt.

Nikolay: To directory, right?

Alexander: Yeah, it uses yet another base backup, creates a new data directory, fails, and renames it.

Now it doesn't work like this: it just renames PGDATA to PGDATA-old, something like this, and that's why you will not end up with an infinite number of directories. And having just one is enough to investigate the failure.

Nikolay: But in the worst case, if we expected our data directory to fill 70% of the disk, we still might end up out of disk space.

Alexander: Yeah, that's unfortunate, but the other option is to just drop it, and at the same time all the evidence of what failed and why it failed is also gone. You have nothing to investigate.

Nikolay: Okay.

To me it still sounds like Patroni
participates in node provisioning.

Yes, it doesn't bring you resources
like disk and virtual machine

and so on, but it brings data,
like the most important part of

Postgres node provisioning, right?

Okay, I just wanted to be right
a little bit.

Okay.

Alexander: Okay.

Nikolay: It's a joke.

Okay.

Michael: I think diving deep quickly
is great.

It'd be good to discuss complex
topics but I think something

simple would also be good.

I would love to hear a little bit
almost about the history of

Patroni.

Like the early days, what were
you doing before Patroni to solve

this kind of issue and why was
it built?

What problems were there with the
existing setups?

Alexander: To be honest, while working for my previous company, we didn't have any automatic failover solution in place. What we relied on was just a good monitoring system that sent you a message, or some on-call engineer just calls you in the night: the database failed.

There were a lot of false positives,
unfortunately, but it still

felt more reliable than using solutions
like replication manager,

repmgr.

Nikolay: Yeah, I remember this
very well.

Like people constantly saying we
don't need autofailover, it's

evil because it can switch over
suddenly, failover suddenly,

and it's a mess.

Let's rely on manual actions.

I remember this time very well.

Alexander: Yeah, in our defense, the number of databases, like database clusters, that we ran wasn't so high, I think a few dozen, and it was running on-prem, didn't fail so often, and therefore it was manageable.

A bit later, we started moving to the cloud, and suddenly, not suddenly, but luckily for us, we found a project named Governor, which basically brought an idea of how to implement autofailover in a very nice manner, without so many false positives and without the risk of running into a split-brain.

Nikolay: Was it an abandoned project already?

Alexander: No, no, so it was not
really abandoned, but it wasn't

also very active.

So we started applying it, found
some problems, reported problems

to the maintainer of the Governor,
got no reaction unfortunately,

started fixing those problems on
our own, and at some moment

a number of fixes and some new
nice features accumulated and

we decided just to fork it and
give a new name to the project.

So this is how Patroni was born.

Nikolay: A Georgian name, right?

Alexander: Right.

Michael: What does it mean? Governor in Georgian?

Nikolay: Oh, governor. I think so.

Michael: Yeah,

Alexander: Almost, almost. Very close, but I'm not a good person to explain or to translate from Georgian, because I don't...

Nikolay: Name?

Alexander: I know yet another word in Georgian, and it's Spilo. Yeah, it translates from Georgian as elephant.

Nikolay: And the name was chosen, I guess, by Valentin Gogichashvili, right?

Alexander: Yes, he was... no, at that time he wasn't my boss anymore, but we still worked closely together, and I really appreciate his creativity in inventing good names for projects.

Michael: Yeah, great names.

And is this a good time to bring
up Spilo?

Like, what is Spilo and how is
that relevant?

Alexander: Spilo, as I said, translates from Georgian as elephant.

When we started playing with Governor, we were already targeting to deploy everything in the cloud. We had no other choice but to build a Docker image and provision Postgres in a container.

And we called this Docker image Spilo. Basically, we packaged Governor, Postgres, a few Postgres extensions, and I think it was WAL-E back then as a backup and point-in-time recovery solution.

Michael: And it still exists to
this day as Spilo, but now with

Patroni?

Alexander: Yeah, of course.

Now there is a Patroni inside,
and now Spilo includes plenty

of Postgres major versions, which
may be an anti-pattern, but

it allows you to run major upgrades,
like in-place major upgrades.

It also includes WAL-G nowadays, as a modern replacement of WAL-E.

And it's used not only by...

Operator, right?

Not really part of Operator.

Spilo is a product on its own.

I know that some people run Postgres
on Kubernetes or even just

on virtual machines with Spilo,
without using the Operator.

Nikolay: But using Docker for example?

Alexander: Yeah, of course.

Michael: But that is a good opportunity
to discuss Postgres-operator.

Postgres-operator was Zalando's...

Was that one of the first operators
of its type?

I know we've got lots these days.

Alexander: Well, maybe it was,
but at the same time, the same

name was used by Crunchy for their
Operator.

They were developed in parallel
and back then Crunchy wasn't

relying on Patroni yet.

As I said, we started moving things to the cloud, and at some point the vector moved a little bit and we started running plenty of workloads on Kubernetes, including Postgres.

Since deploying everything manually,
and more importantly, managing

so many Postgres clusters manually
was really a nightmare, we

started building Postgres-Operator.

Back then, I don't think a very nice Go library to implement the Operator pattern existed, and therefore people had to invent everything from scratch, and there was a lot of boilerplate code that was copied over and so on.

Nikolay: Was it only the move to the cloud that mattered here, or maybe also the move to microservices, splitting everything into microservices? Because I remember from Valentin, for example...

Alexander: Microservices, of course, played a big role. And probably... not probably: microservices were really the driving force to move to the cloud, because with the scale of the organization, it wasn't possible to keep the monolith.

And the idea was: let's split everything into microservices, and every microservice usually requires its own database.

Nikolay: Right.

Alexander: Sometimes a sharded database, like we used application-level sharding. In certain cases, the same database is used by multiple microservices, but that's a different story.

But really, the number of database clusters that we had to support exploded. From dozens to hundreds and then to thousands.

Nikolay: Yeah.

And this is already when you cannot
rely on humans to perform

a failover, right?

Alexander: Even when you run a few hundred database clusters, it's better not to rely on humans to do maintenance, in my opinion.

Nikolay: Right, so that's interesting, and maybe it's also the right time to discuss why Postgres doesn't have failover built in.

I remember discussions about replication when we relied on Slony and Londiste, and some people resisted bringing replication inside Postgres, but somehow it was resolved eventually.

And Postgres has good replication, physical and logical; sometimes not good, but that's a different story. In general, it's very good and improving with every release.

Just last week Michael and I discussed the improvements to logical replication in 17, and maybe it will resonate a little bit with today's topic, Patroni. But it doesn't happen for autofailover at all, right?

Why so?

Alexander: I can only guess. To do it correctly, we cannot just have 2 nodes, which is what most people run, like primary and standby, because there are many different factors involved.

One of the most critical ones is the network between those nodes. With just 2 machines, you cannot distinguish between a failure of the network and a failure of the primary.

If you just run a health check from a standby and make a decision based on that health check, you may get a false positive.

Basically, the network just experienced some short glitch, which could last even a few seconds, sometimes a few minutes, but at the same time the old primary is still there.

If we promote a standby, we get into a split-brain situation, with 2 primaries and it not being clear to which one transactions are going.

In the worst case, you end up with applications connecting to both of them. Good luck with assembling all these changes together.

Nikolay: This is what tools like
repmgr do.

So I ended up calling
repmgr a split-brain solution.

Because I observed it many, many
times.

Alexander: As a mitigation, what is maybe possible to do: the primary can also run a health check, and in case the standby is not available, just stop accepting writes, either by restarting in read-only mode or maybe by implementing some other mechanism.

But it also means that we lose availability without a good reason.

And in this scenario, when we promote a standby, technically, if the standby cannot access anyone else, it shouldn't be accepting writes either, like in a network split.

Basically, we come close to a setup with what repmgr calls a witness node.

Nikolay: Witness node, yes exactly.

Alexander: Witness node, basically you need to have more than 2 nodes. And the witness node should help in making decisions.

Let's say we have a witness node in some third failure domain: the primary can see the witness node, therefore it can continue to run as a primary. And a standby shouldn't be allowed to promote if it cannot access the witness node.

And this already resembles systems like etcd that implement a consensus algorithm, where a write is possible when it is accepted by a majority of nodes.

Nikolay: This wheel is already invented, right?

Alexander: Yeah, so this is already invented, and it is what Patroni is really relying on to implement autofailover reliably.

I can guess that at some moment it will be added in Postgres, and we already have plenty of such components in Postgres. We have the write-ahead log with LSNs, which are always incremented. We have timelines, which are very similar to terms in etcd.

So basically, in the end, we will just need to have more than 2 nodes, better 3, so that we don't stop writes while one node is temporarily down.

It will give the possibility to implement autofailover without even doing pg_rewind, let's say. Because when the primary writes to the write-ahead log, it will first be confirmed by standby nodes, and only after that...

So effectively, this is what we already have, but it's not enough, unfortunately.

Nikolay: So do you think at some point Patroni will not be needed and everything will be inside Postgres, or no?

Alexander: I hope so, really.

Nikolay: I hope so.

Alexander: No, no, no, not because I'm tired of maintaining Patroni, but because this is what people really want to have:

to deploy highly available Postgres without the necessity to research and learn a lot of external tools like Patroni, solutions for backup and point...

Nikolay: And upgrade them sometimes, because we're always lagging with these...

Alexander: Upgrades, yeah.

But at the same time, let's imagine that it happens in a couple of years. With a five-year support cycle, there will still be a lot of setups running not-so-recent Postgres versions, and they will still need to use something external, like Patroni.

Nikolay: Yeah, I'm actually looking right now at the commits of repmgr. It looks like the project has been almost inactive for more than a year. A few commits, that's it. It seems to be winding down.

Alexander: I probably have some insight into it: not about repmgr, but I know that EnterpriseDB has been contributing some features and bug fixes to Patroni, so they officially support Patroni.

Nikolay: So it sounds interesting, right? Patroni is a winner, obviously. It's used by many Kubernetes operators, and not only Kubernetes, of course, and it's winning; some projects were abandoned, not only repmgr, we know some others, right?

But you're thinking that one day everything will be in core and maybe Patroni will be abandoned, right? And you think it's maybe for the good.

Alexander: So every project has
its own life cycle.

At some moment, the project is
abandoned and not used by anyone.

We are not there yet.

Nikolay: Right, right.

While we're in this area, I wanted to ask what you think about this: Kubernetes also relies on a consensus algorithm, right? It has it itself. Why do some operators choose to use Patroni, while others, like CloudNativePG, decide to rely on Kubernetes-native mechanisms and avoid using Patroni?

Alexander: To be honest, I don't know what is driving the people that build CloudNativePG.

Nikolay: But what's better in general?

What are pros and cons?

How to compare?

What would you do?

Alexander: In a sense, in CloudNativePG there is a component that tries to manage all Postgres clusters, decide whether some primary has failed, and promote one of the standbys.

I'm not sure how they implement the fencing of the failed primary, because if you don't correctly implement fencing and you promote the standby to primary, you again end up in a split-brain situation.

And let's imagine that one Kubernetes node is isolated in the network.

Nikolay: Network partition.

Alexander: Yeah.

And it automatically means that you will not be able to stop the pods or containers that are running on this node. At the same time, applications that are running on this node will still use Kubernetes services to connect to the isolated primary.

Nikolay: Right, yeah.

Alexander: So Patroni detects such scenarios very easily, because the Patroni component runs in the same pod as Postgres, and in case it cannot write to the Kubernetes API, it just does self-fencing: it puts Postgres into read-only mode.

Nikolay: It's simple, by the way,
right?

Alexander: Yeah, so I don't know
if they do something similar.

In case if they don't, it's dangerous.

Michael: We should do a whole separate episode on CloudNativePG, actually. I think that would be a good one.

Alexander: Yeah, I'm not saying that CloudNativePG is bad or does something wrong.

Nikolay: I'm just raising questions.

Alexander: Raising my concerns.

Michael: Of course, right. Back to Patroni: it worked like this from the beginning? But it feels like...

Alexander: In version 10, which has been end-of-life for a couple of years, by the way.

From the very beginning, we wanted to support this feature, but what was stopping us was the promise Patroni makes with synchronous replication: that we want to promote a node that was synchronous at the time when the primary failed.

If we just have a single name in synchronous_standby_names, like a single node, it's very easy to say: okay, this node was synchronous, and therefore we can just promote it.

When there is more than one node and we require all of them to be synchronous, we can promote any of them.

But with quorum-based replication, you can have something like any 1 from a list of, let's say, 3 nodes. Which one was synchronous when the primary failed?

I'm not demanding that you answer this question, so I will just explain how it works in Patroni, as of the last major release.

The information about the current value of synchronous_standby_names is also stored in etcd. Therefore, the 3 nodes that are listed in synchronous_standby_names know that they are listed as quorum nodes, and during the leader race they need to access each other and get some number of votes.

If there are 3 nodes, it means that every node, to become a new primary, like a new candidate, needs to access at least the 2 remaining nodes and get confirmation that they are not ahead of the current node's LSN.
Is it clear?

I should elaborate a little bit
more.

Michael: Let me ask the stupid question: if a node checks and finds that it is ahead of the current candidate to be leader, then it's a bad decision to promote that leader, because a different one would...

Alexander: So just for your understanding, in Patroni there is no central component that decides which node to promote. Every node makes a decision on its own.

Therefore, every standby node listed in synchronous_standby_names goes through the cycle of health checks. It accesses the remaining nodes from synchronous_standby_names and checks at what LSN they are.

And if they're on the same LSN or behind, we can assume that this node is the healthiest one. And the same procedure happens on the remaining nodes.

Basically, this way we can find: okay, this node is eligible to become a new primary.

In case we have something like any 2 of 3 nodes, we can make a decision by asking just a single node. Because we know that 2 nodes will have the latest commits, the latest commits that were reported to the client, and it will be enough to just ask a single node.

Although it will ask all nodes from synchronous_standby_names, in case one of them, let's say, failed together with the primary, it is still enough to make a decision by asking the remaining one.

Nice.

And the tricky part comes when we need to change synchronous_standby_names and the values that we store in etcd.

Let's say we want to increase the number of synchronous nodes from 1 to 2. What should we change first, the synchronous_standby_names GUC or the value in etcd, so that we can still make a correct decision?

If we change the value in etcd first, the leader race will assume that, okay, it's enough to ask just a single node to make a decision, although only one node is 100% guaranteed to have the latest commits, and in fact we need to ask 2.

Therefore, when we increase this from 1 to 2, first we need to update synchronous_standby_names, and only after that the value in etcd.

And there are almost a dozen rules that one needs to follow to make such changes in the correct order. Because it's not only about changing the replication factor; it's also about adding new nodes to synchronous_standby_names, or removing nodes that are gone, and so on.

And I don't think any other failover
solution implements a general

algorithm to do such changes.
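(A minimal sketch, not Patroni's actual code, of the leader-race rule just described. The node names, connection strings, and the use of plain SQL in place of the Patroni REST API are illustrative assumptions.)

```python
# Illustrative sketch only -- not Patroni's implementation. It shows the
# leader-race rule described above, using plain SQL where Patroni would use
# its REST API. Node names and DSNs are made up.
import psycopg2

# Quorum commit on the primary, as discussed:
#   synchronous_standby_names = 'ANY 1 (node2, node3)'
MEMBERS = {
    "node2": "host=node2 dbname=postgres user=postgres",
    "node3": "host=node3 dbname=postgres user=postgres",
}

def replay_lsn(dsn):
    """How far this standby has replayed the WAL."""
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        cur.execute("SELECT pg_last_wal_replay_lsn()")
        return cur.fetchone()[0]

def is_ahead(dsn, lsn_a, lsn_b):
    """True if lsn_a is ahead of lsn_b (pg_lsn supports comparison operators)."""
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        cur.execute("SELECT %s::pg_lsn > %s::pg_lsn", (lsn_a, lsn_b))
        return cur.fetchone()[0]

def may_promote(candidate):
    """A quorum member may promote only if no other member is ahead of it."""
    my_lsn = replay_lsn(MEMBERS[candidate])
    return not any(
        is_ahead(MEMBERS[candidate], replay_lsn(dsn), my_lsn)
        for name, dsn in MEMBERS.items()
        if name != candidate
    )
```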

Nikolay: How much time did you
spend to develop this?

Alexander: Originally this feature
was implemented by Ants Aasma,

he's working for CYBERTEC, it
happened in 2018.

I made a few attempts to understand the logic of this algorithm. And finally, almost 5 years later, I was able to get enough time to fully focus on the problem.

And even after that I spent, I
don't know, a couple of months

implementing and fixing some bugs
and corner cases and implementing

all possible unit tests to cover
all such transitions.

Nikolay: There is no book which
describes this, that you could

follow.

This is something really new that
needs to be invented, right?

Alexander: Well, the idea of how to do it, or what to do, was obvious, but implementing it correctly and proving that it really works correctly, that's really a challenge.

Nikolay: Finding all the edge cases,
right.

There is another thing I would like to discuss a little bit. It appeared in Patroni 3, version 3.0: DCS failsafe mode. DCS is the distributed configuration store.

And actually we just experienced a couple of outages, because we are in Google Cloud and they're running the Zalando operator, with Patroni of course.

And I just checked the version of Patroni, and it seems to have it.

But we...

Alexander: But I don't think it
is enabled by default.

Nikolay: Exactly, this is my second
question, actually, why it's

not enabled.

So, first question: what is it? How do you solve this problem when etcd or Consul is temporarily out?

Alexander: Let's start from the problem statement. The promise of Patroni is that a node will run as a primary only when it can write to the distributed configuration store, like etcd.

If it cannot write to etcd, it means that maybe something is wrong with etcd, or maybe this node is isolated, and therefore writes are failing.

When the node is isolated, it's working as designed: Patroni cannot write to etcd, so it will restart Postgres in read-only mode. But etcd could also be totally down, because of some human mistake, so that you cannot access any single node of etcd.

And in this case, Patroni also stops the primary and starts it in read-only mode, to protect against the case where, let's say, some standby nodes can access the DCS at the same time and promote one of the nodes.

So people were really annoyed by this behavior and were asking why we are demoting the primary. So far the answer was always: we cannot determine the state, and therefore we demote to be on the safe side.

The idea of how to improve on that came at one of the Postgres conferences, after talking with other Patroni users. So, how is it improved with the failsafe mode?

When the primary determines that none of the etcd nodes are accessible, it will try to access all Patroni nodes in the cluster using the Patroni REST API. And if the primary can get a response from all nodes in the Patroni cluster, in failsafe mode it will continue to run as a primary.

In this sense, it's a much stronger requirement than quorum or consensus. It is not expecting to get responses from, let's say, a majority; it really wants to get responses from all standby nodes to continue running as a primary.
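(A minimal sketch of the failsafe decision just described, not Patroni's implementation: the member URLs and the /health endpoint stand in for the real exchange, which uses a dedicated internal endpoint of the Patroni REST API.)

```python
# Illustrative sketch only -- not Patroni's code. With failsafe mode, a primary
# that has lost etcd keeps its role only while *every* other member answers.
# Member URLs and the endpoint are assumptions for the example.
import requests

MEMBERS = ["http://node2:8008", "http://node3:8008"]

def all_members_reachable(timeout=2.0):
    """True only if every other cluster member responds over the REST API."""
    for url in MEMBERS:
        try:
            requests.get(url + "/health", timeout=timeout).raise_for_status()
        except requests.RequestException:
            return False
    return True

def keep_primary_role(can_write_to_dcs):
    """Stay primary if etcd is writable, or if all members confirm we may."""
    return can_write_to_dcs or all_members_reachable()
```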

This feature was introduced in
Patroni version 3, but it is

not enabled by default, because
I think there are some side effects

when you enable this mode in certain
environments.

Probably it is related to environments
where your node may respond

with a different name.

Nikolay: I need to think about
it.

Alexander: This behavior is documented.

Nikolay: Yeah, we will explore
this.

Thank you so much for it.

But it sounds like...

Alexander: On Kubernetes, it is safe to enable it.

Nikolay: Yeah, we should start
using this, this is what I think

as well.

Yeah, definitely we'll explore,
thanks.

Alexander: Like pods always have
the same name, just different

IP addresses.

Nikolay: I just got help for it.

And as usual, I just wanted to publicly thank you for all the help you give me, and actually many companies, over many years. It's huge. Thank you so much.

So another thing I wanted to discuss is probably replication slots.

And I remember a few years ago you implemented support for failover of logical slots. Now we have it in Postgres, right? So one more thing, finally, was basically removed, I guess, from Patroni, right? Or do you still keep this functionality?
Or you still keep this functionality?

Alexander: No, we still keep it, and we didn't do anything special for Postgres 17.

Nikolay: It was, I think it was
16 even, no?

Alexander: Failover of, ah.

Nikolay: Or 17.

Well, the ability to use a logical slot on physical standbys was in 16, but failover came in 17; we just discussed it.

Alexander: Yes, exactly, exactly.

I confused you.

That's why I'm saying we didn't
do anything special.

Although I did make some tweaks so that this feature works with Patroni, because it requires having your database name in primary_conninfo. Patroni wasn't putting the DB name into primary_conninfo, because for physical replication it's not useful.
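(For illustration only, not from the episode: as I understand the 17 feature, the tweak amounts to the standby's primary_conninfo carrying a dbname, plus enabling slot synchronisation on the standby. Host and user values here are placeholders.)

```python
# Illustration only: what the dbname tweak looks like on a Postgres 17 standby.
# Connection strings are placeholders; exact values depend on your setup.
import psycopg2

conn = psycopg2.connect("host=standby1 dbname=postgres user=postgres")
conn.autocommit = True  # ALTER SYSTEM cannot run inside a transaction block
with conn.cursor() as cur:
    cur.execute(
        "ALTER SYSTEM SET primary_conninfo = "
        "'host=primary1 port=5432 user=replicator dbname=postgres'"
    )
    # As I understand the 17 feature, this enables automatic synchronisation of
    # failover-enabled logical slots from the primary to this standby.
    cur.execute("ALTER SYSTEM SET sync_replication_slots = on")
    cur.execute("SELECT pg_reload_conf()")
conn.close()
```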

Nikolay: But I wonder...

Alexander: How does it do it?

Nikolay: I wonder, in my head: of course we create the slot on the primary, that's clear, but Patroni's main task is to keep the primary alive, to take care of high availability for the primary.

Okay, but if we have multiple replicas, multiple standby nodes, and one of them, or maybe a few, but at least one of them, is used to logically replicate to some Postgres or Snowflake or Kafka or something. From a standby, because it's good, we like it: fewer risks on the primary, the WAL sender is not using CPU there, and no out-of-disk risks.

So now we have this standby and suddenly it's dead. It's not the job of Patroni to take care of it, right? Because we need some mechanism to fail over the standby now.

Alexander: Well, you mean to keep the logical replication slot on a new standby where you would like to connect. In theory, Patroni maybe could take care of it, since it's possible to do logical replication from standby nodes since Postgres 16.

So, how logical failover slots are currently implemented in Patroni: it creates logical slots on standby nodes and uses pg_replication_slot_advance() to move the slot to the same LSN as it currently is on the primary. So basically the assumption is that logical replication happens on the primary.
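(A minimal sketch of the trick just described, assuming a made-up slot name and connection strings; it is not Patroni's actual implementation.)

```python
# Illustrative sketch -- not Patroni's code. Read how far the logical slot has
# advanced on the primary, then move the standby's copy to the same LSN.
# Slot name and DSNs are made up for the example.
import psycopg2

PRIMARY = "host=primary1 dbname=postgres user=postgres"
STANDBY = "host=standby1 dbname=postgres user=postgres"
SLOT = "cdc_slot"

with psycopg2.connect(PRIMARY) as conn, conn.cursor() as cur:
    cur.execute(
        "SELECT confirmed_flush_lsn FROM pg_replication_slots WHERE slot_name = %s",
        (SLOT,),
    )
    target_lsn = cur.fetchone()[0]

with psycopg2.connect(STANDBY) as conn, conn.cursor() as cur:
    # Decodes WAL up to target_lsn, which is why it can take about as long as
    # consuming the slot itself -- the downside mentioned later in the episode.
    cur.execute(
        "SELECT pg_replication_slot_advance(%s, %s::pg_lsn)", (SLOT, target_lsn)
    )
```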

In theory, there is no reason why it cannot be done for standby nodes. Let's say we create logical slots on all standby nodes with the same name; Patroni can watch which one is active and publish this information to etcd, and Patroni on the remaining standby nodes will again use pg_replication_slot_advance() to move the LSN on those standby nodes.

So in theory it could work, but I don't know if I would have time to work on it.

Nikolay: I'm just trying to understand,
This is a relatively

new feature since 16 to be able
to logically replicate from physical

standbys, but...

Alexander: But please keep in mind that it still affects primary.

Nikolay: Right.

Alexander: So, maybe pg_wal will not bloat, but pg_catalog certainly will.

Nikolay: Yeah, this for sure.

I was referring to the need to preserve WAL files on the primary.

This risk has gone if we do this.

But I cannot imagine how we can start using logical slots on

physical standbys in serious projects without HA ideas.

Because right now I don't understand how we solve HA for this.

Alexander: Yeah, and unfortunately, this hack that Patroni implements with pg_replication_slot_advance() has its downsides. It literally takes as much time to move the position of the logical slot as it takes to consume from the slot. That's unfortunate.

And the way it's solved in Postgres 17, it basically does not need to parse and decode the whole WAL; it just literally overwrites some values in the replication slot, because it knows the exact locations, and it does that safely. Patroni cannot do that.

Although, probably, pg_failover_slots can also do the same, for older versions.

Nikolay: Okay, that's one more area for me to explore deeper, because I like understanding many of these places. Good pieces of advice as well, thank you so much.

Anything else, Michael, that you wanted to discuss? Obviously, one of the biggest features was Citus support, right? But I'm not using Citus actively, so I don't know.

If you want to discuss it, let's discuss.

Alexander: I know that some people certainly do, because from time to time I get questions about Citus with Patroni on Slack, or maybe not Citus-specific questions, but according to the output of patronictl list, they are running a Citus cluster.

There is certainly a demand, and I believe that Patroni implementing Citus support improved the quality of life of some organizations and people that want to run sharded setups.

Nikolay: Is there anything specific you needed to solve to support

this or like technical details?

Alexander: To support Citus? I wouldn't say that it was very hard, but it wasn't very easy either.

Citus has the notion of a coordinator. Originally you're supposed to use the coordinator for everything: to do DDL, to run the transactional workload and so on. And on the coordinator there is a metadata table where you register all worker nodes. And the worker nodes, this is where you keep the actual data, the shards.

And what I had to implement in Patroni is automatically registering worker nodes in this metadata. And in case a failover happens on a worker node, we need to update the metadata and put in the new IPs or hostnames, whatever.

Basically, when you want to scale out your Citus cluster, you just start more worker nodes. Every worker node, in fact, is another small Patroni cluster. So technically, in patronictl, it looks like just a single cluster, but in fact it's one cluster for the coordinator, one cluster for every worker node, and on each of them there is its own failover happening.

If you start worker nodes in a different group, like in a new one, they join the existing Citus cluster, and Patroni registers the new worker nodes on the coordinator. But what Patroni will not do is redistribute existing data to the new workers.

This is something that you will have to do manually afterwards, and it has to be your own decision how to scale your data and replicate it to other nodes. Although nowadays it's possible to do it without downtime, because all the enterprise features of Citus are included since Citus version 10. So everything that was enterprise is now open source.
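(As an illustration, not from the episode: the manual step described here, after Patroni has registered a new worker group with the coordinator, is the shard rebalance, roughly like this; the connection string is a placeholder.)

```python
# Illustration only: list the workers Patroni has registered on the coordinator
# and rebalance existing shards onto them. Connection string is a placeholder.
import psycopg2

conn = psycopg2.connect("host=coordinator dbname=postgres user=postgres")
conn.autocommit = True  # don't run the rebalance inside an open transaction
with conn.cursor() as cur:
    cur.execute("SELECT * FROM citus_get_active_worker_nodes()")
    print(cur.fetchall())  # worker nodes registered in the coordinator metadata
    # Moves shards onto the new workers; uses logical replication, so no downtime.
    cur.execute("SELECT rebalance_table_shards()")
conn.close()
```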

Nikolay: That's cool.

Michael: I saw Alexander has a
good demo of this, of Citus and

Patroni working together, including
rebalancing.

I think it was Citus Con last year?

Alexander: Yeah, it was Citus Con.

Michael: Nice, I'll include that
video in the show notes.

Nikolay: I wish I had all this
a few years ago.

Alexander: When I...

Yeah, of course, there was a little bit more work under the hood. In case you run the write workload via the coordinator, it's possible, like Patroni can do some tricks, to avoid client connection termination while a switchover of worker nodes is happening. This is what I did during the demo.

There are certain tricks, but unfortunately it works only on the coordinator and only for write workloads.

For read-only workloads, your connection
will be broken.

That's unfortunate.

Maybe one day it will be fixed in Citus, and maybe one day the same stuff will also work on worker nodes. And by the way, with Citus you can run a transactional workload by connecting to any worker node; only DDL must happen via the coordinator.

Michael: Nice.

Speaking of improvements in the
future, do you have anything

lined up that you still want to
improve in Patroni?

Alexander: That's a very good question.

Usually some nice improvements are coming out of nothing.

You don't plan anything, but you talk to people and they say,

it would be nice to have this improvement or this feature.

And you start thinking about it, wow, yeah, it's a very nice

idea and it's great to have it.

But I rarely plan some big features from the ground up, let's

say.

So what I had in mind, for example, is failover to a standby cluster in Patroni. Right now it's possible to run a standby cluster which is not really aware of the source it replicates from; it could be replicating from another Patroni cluster.

And what people ask for: we have a primary Patroni cluster, we have standby Patroni clusters, but there is no mechanism to automatically promote the standby cluster, because it's running in a different region and it is using a completely separate etcd. So they simply don't know about each other.

It would be nice to have, but again I cannot promise when I can

start working on it and whether it will happen.

I know that people from CYBERTEC did some experiments and have some proof-of-concept solutions that seem to work, but for some reason they're also not happy with the solution they implemented.

Michael: Yeah, sounds tricky.

Alexander: Distributed systems are always tricky.

Michael: Yeah,

get that on a t-shirt.

Nikolay: Thank you for coming. As usual, I use the podcast and all the events I participate in and organize just for my personal education and daily work as well. I just thank you so much for the help. Again.

Alexander: Yes, thank you for inviting me. It's a nice job that you are doing. I know that many people listen to your podcast and are very happy about it. They learn a lot of great stuff and also make a big list of to-do items, like what to check and what to learn.

I cannot say the same about myself, that I watch every single episode, but sometimes I do.

Nikolay: Cool, thank you.

Michael: Thanks so much Alexander.

Cheers Nikolay.

Creators and Guests

Alexander Kukushkin (Guest)
Principal Software Engineer at Microsoft and Citus Data. Maintainer of the Patroni HA tool.
