
How to move off RDS
Michael: Hello and welcome to Postgres.FM, a weekly show about
all things PostgreSQL.
I am Michael, founder of pgMustard, and I'm joined by Nik, as
usual, from Postgres.AI.
Hey Nik, how's it going?
Nikolay: Everything is fantastic.
How are you, Michael?
Michael: Yeah, good.
Good also.
What are we talking about this week?
Nikolay: So today we're talking about how to move off RDS, or put the name of your own managed Postgres provider there. And how to do it.
Michael: Yeah, so you mentioned how to move off RDS, but I was going to ask how specific you wanted to be, because there are even multiple flavors of RDS that are Postgres compatible.
Nikolay: Yeah, it's great. And we have a lot of managed service providers, and that's super good. All of them have their pros and cons. Some of them are already dead, we must say. And you know, there's a saying that a human is not only mortal, but can be suddenly mortal. This is from Russian literature.
Michael: Okay, okay,
Nikolay: Bulgakov. The Master and Margarita. So a managed Postgres service can also be suddenly dead. We know several examples, and what annoys me is how short the notice they sometimes give is.
I don't expect this from RDS, but who knows?
Michael: Right?
No, definitely not.
Nikolay: RDS is huge, right?
But we know Google can kill services.
Not with 30 days short notice, right, or 2 weeks.
But among recent examples, I remember Bit.io, which was an interesting
idea, drop CSV file and it gives you Postgres and you can start
querying.
I remember, I think Tembo is a recent example, right?
Michael: Yeah, but how much notice are they giving?
Nikolay: Bit.io gave, I think 30 day notice only.
Michael: It's not,
Nikolay: it's not, it's not all right for databases.
We know how difficult it is to do any action.
If you exceed, for example, 1 terabyte, that's already a significant size where any move needs to be planned carefully, practiced, tested, and then performed.
Yeah, well, Tembo, right?
Tembo, I think they completely pivoted to a different area, and PostgresML is another recent example. One of the PostgresML founders is the author of PgCat. He was here long ago.
Michael: Yeah, Lev.
Nikolay: Right.
But PostgresML is closed down, and others as well. So it happens.
So one of the reasons to migrate out is lack of control. Have we already discussed reasons, or should we?
Michael: Yeah, let's do it a little bit. I hear people talking a little bit about costs, but the reasons I see more often seem to be low-level access and extensions: if you've got a specific extension you want supported, you might migrate, I guess sometimes between providers rather than off. But yeah, migrating off could also include migrating on to a different one.
Nikolay: Yeah, yeah, definitely, because some of them are much more open. Not many, but some, like Crunchy Bridge, right?
Michael: Yeah, or they support different extensions. Maybe the extension you want is on one managed service but not another.
Nikolay: You want TimescaleDB, so you need to migrate to Timescale Cloud?
Michael: Or even, I think I saw a conversation you were having about, is it PL/Proxy?
Nikolay: Oh, Citus. Yeah. Because of how Citus works there, this is my theory, that's why PGQ and SkyTools are there, basically, right? And PL/Proxy as well.
So yeah, and the others, not actually all of them, but RDS doesn't even have PgBouncer, which also originated from Skype almost 20 years ago.
Michael: I didn't know that.
Nikolay: Yeah, PgBouncer is the
most popular product created
inside Skype 20 years ago.
Others are PGQ, Londiste and
PL/Proxy.
I think there are more, but these
I used and remember very well.
Michael: Nice.
Nikolay: So yeah, if we talk about reasons, there are many reasons, and the services are different. Some of them are open,
Michael: some of
Nikolay: them are fully based on open source, which is great, like Supabase, and some are not strictly based on open source. Some of them provide superuser, like Citus, or access to physical backups.
That's a miracle among managed Postgres providers, but most of them are telling you: we are protecting you, give us almost twice as much money and we will limit your access to your own things, and we will take care of backups, failover, HA and DR, and some other features. And all of these features are already very well established in the open source ecosystem.
So I think anger lives inside me and grows towards managed service providers, because, first of all, they make much more money than they should on open source, not contributing back enough. Not everyone, but many. But the bigger problem, I think, is that I truly believe they stole the openness from us, the open source which we loved, always comparing it to SQL Server, Oracle, or Linux versus Microsoft Windows.
The people who create managed service providers, among them are great engineers, and I admire and have huge respect for many of them. But in the grand scheme, this is a new Oracle: vendor lock-in, everything closed, bad processes, inability to fix problems or understand what's happening, and so on. All the things which open source was intended to solve.
Like, for example, why did I stop using Oracle 8i in 2000? Because I spent 3 days trying to fix something. It worked on one machine and didn't work on our client's machine. And then I read somewhere, OK, the internet was already a thing, that if the hostname has a closing parenthesis, you cannot connect to Oracle. And no errors, nothing.
And I eventually switched to open source, where you can troubleshoot such problems easily, if you have advanced skills. And many people are not dumb; many people are quite smart and can learn how to dig into the source and troubleshoot. But managed service providers stole this openness and the ability to see what's happening. They just packaged open source in the form of a service, charged a lot of money, and also closed off important features like access to physical backups or the ability to make a physical replication connection, basically implementing vendor lock-in.
Michael: So, well, but they didn't steal it, right? It's licensed that way deliberately; Postgres is very permissive. So there's a...
Nikolay: It's Postgres, right?
It's open source.
It's the same as Postgres.
We almost didn't change it.
We use it as it is.
It's Postgres.
And it's like enterprise level.
Michael: You mean it's the calling it Postgres that you're...
Nikolay: Yeah, they call it Postgres and benefit from Postgres popularity.
Actually, it's a two-sided question,
because I also think RDS
contributed to the rise of Postgres
popularity around 2015.
I think JSON was a big contributor,
but if you check carefully,
RDS probably was a bigger contributor
to the growth of Postgres
popularity, because everyone hated
installing Postgres.
This is true.
But things are changing.
It's much much easier these days
to install.
Michael: And when I started, when
I got more and more involved
in the Postgres community, maybe
6 years ago, most Postgres committers
were, well, I didn't look at the
numbers, but I think from memory,
most committers and contributors
to Postgres were working at
consulting firms who did a lot
of community work but also I guess
implementing features for their
clients.
These days, and I know the lines are blurred between consulting firms and managed service providers these days, I definitely know that lots and lots of committers are moving to the hyperscalers and other managed service providers.
Nikolay: Of course.
Michael: But what I mean is there
is some contributing back from
them.
But I do take your point that there's
an interesting dynamic
here.
And I would be interested in your
take on what proportion of
people are moving for those kind
of ideological reasons versus
cost reasons.
Because you mentioned it's like
double the price for less access.
Which of those is more important
to people?
Double the cost or the less access?
Nikolay: That's a good question. First of all, again, the question is not one-sided. I admit all the things RDS and others have brought,
Michael: it's
Nikolay: for sure.
But I also just feel it: like working with Oracle and SQL Server, not understanding what's happening, always needing to reach out to support and waiting months to get any small response. It's hard. Then, working with open source, you just see how it works, and you can even fix it if needed, or you can troubleshoot. Documentation always lacks details, so the source code is the best documentation, plus the ability to debug. This is a huge part of the open source concept for me. And this is lacking there, right?
So for me, it's a big problem.
But I understand I'm in the minority here. I'm not afraid to be in the minority, though. 20 years ago, Postgres itself was deeply in the minority, deeply. It was MySQL everywhere, Oracle in the enterprise. What even is Postgres, right?
So I truly believe the RDS rise started roughly 10 years ago, maybe 12, around that time. It was like a storm. Heroku started first, and I think Heroku was historically the first good managed Postgres, right? Then RDS and the others. Everyone has it now. Even IBM or Huawei has it, and Oracle has it. It's insane, absolutely insane.
So I truly believe this is the peak: we should already be going down, or soon we'll start going down, and something new has to be created. This is what I feel. Maybe it will take another 10 years or so, but it should not stay this way, because the whole philosophical point of open source is deeply disturbed. Right? It sucks to troubleshoot a problem on RDS, trying to reproduce it on regular Postgres. You try to guess things all the time. You don't see; you are blind, right? Some database is suffering, and their support is slow or inexperienced.
Michael: We're repeating a conversation
we've had several times.
Nikolay: Yeah, yeah.
Well, the problem is, for me, this feeling is deep, it's strong, but I also understand I'm among very few people who feel it, I think.
Michael: So why bring this 1 up
about how to move off?
Is it because you want people to
move or is it because you're
seeing that they're moving and
they need advice?
Nikolay: I want to start this discussion. I also know that probably more people will move because of budgets, especially for economic reasons. At some point you realize you spend a lot on RDS and other services, but RDS specifically can be huge; sometimes it's between 20 and 50% of the whole cloud budget. And I think there are interesting pieces around how you get support. As I understand it, to get support for RDS, the whole account's support plan has to be upgraded; you cannot upgrade only the RDS part. But I'm not sure about this.
What I am sure of is that it costs a lot, and it pays for something that can be achieved with open source and a little bit of support from some guys. You can hire those guys, or you can use services, but it will cost a fraction of what you pay on AWS.
And so sometimes companies are in trouble. They check their spending and see a big bill, say $10,000 per month for some small company. It's significant money, right? You could hire a person for that. And you check what you can achieve on low-cost providers like Hetzner or OVH or something, and you see you can drop it 5 times. Why not? Or maybe it's $100,000, and $20,000 after migration. And as I said, between 20 and 50 percent of costs are databases, right? We're talking about companies that are not raising funding but need to survive. There are such companies, and sometimes they need help as well.
Michael: Yeah, so the cynic in
me is thinking, oh, Nikolay wants
more customers.
Nikolay: No, unfortunately, in this case I can help one time, with the move, but we don't have a solution yet. Others might have solutions for this, and sometimes it's Kubernetes, sometimes it's not Kubernetes. I would prefer not Kubernetes in this case. If you're very familiar with Kubernetes, okay, good. There are options, but I would just prefer maybe an old-school solution without additional layers if it's a small project; it's maybe easier. And yeah, we can help, but I'm not looking for customers, because this is actually not a super huge customer.
We had such cases, but usually it's a small project for us to migrate. I just wanted to share some pieces of advice and spark this conversation about why we use RDS so much while backups, HA, and DR are already solved in open source, fully solved.
Take Patroni, and pgBackRest or WAL-G,
that's it.
And just find some packaging or
build your own and that's it.
Yeah.
So not convincing or what?
Michael: Yeah, well, I mean, I think you're right at a certain scale, but I think there are a lot of smaller scales where it just makes sense to stay. Like, it doesn't make as much sense to try and save that amount of money for the...
Nikolay: How much? Like, let's take some company: how much is the total cost, and what percentage of it is databases?
Michael: Yeah, good question. But I think, even up to a certain spend... Well, definitely, if you're spending less than $1,000 a month on the database, why bother?
Nikolay: Not noticeable, right.
But if it's
Michael: Do you see what I mean? That's quite a big number; a lot of very small startups won't be spending that. And I think there are a lot of those.
Nikolay: Well, it depends on what we call small, because if it's still a small team but they've accumulated a lot of data, the budget to keep that data in RDS Postgres will be high. And I think we touched on an interesting question. There is some threshold where you can estimate how much effort, in terms of engineering resources, it will take. Below that threshold, it's not reasonable to move out of RDS. But above it, it might be reasonable, especially considering that the quality of Patroni, WAL-G, and pgBackRest is good enough. They have been battle-proven for many years already.
And we see in RDS, I don't know if they implemented it using Patroni, but this so-called multi-AZ cluster, a 3-node cluster, in addition to the multi-AZ instance, already looks like a classic Patroni setup, right? And if you compare costs, it's interesting.
But also, you might move out of AWS, out of the cloud entirely, or to a cheaper cloud like Hetzner Cloud, which has regions in the US and, not a lot, some regions
Michael: in Europe.
As of recently, yeah.
Nikolay: And they also, since I
think last year, have S3 compatible
object storage for backups, but
only in Europe.
That's a fact.
Michael: Oh, really?
Nikolay: That's a fact.
I hope they will add it to US regions
as well soon.
And I actually really like their dedicated offering. I used it many years ago when I was bootstrapping.
And I know many people use it.
And sometimes customers come to
us and they use Hetzner for bootstrapping,
and it makes sense.
And it's even more cost saving.
But it's also not available in
the US, unfortunately.
So anyway, if you compare the price of instances, EC2 versus a virtual machine in Hetzner Cloud, it's already 5x. But on top of that, there's the premium AWS adds for running backups and failover and the other services they provide. That's an additional 60 to 70%, right?
Michael: Yeah.
Well, but okay.
So I think we've covered enough
on like why.
I think people,
Nikolay: yeah, control and money,
money and control.
Yeah.
Michael: So then I think the trickier question is how, and especially how to do it with minimal or no downtime.
Nikolay: Yeah, well, again, this is not only about RDS. Maybe you want to migrate off others, like Cloud SQL or Crunchy Bridge or something, despite their achievements. First thing: let's talk about versions, compatibility, extensions. If you move to a different cloud, it's of course important to compare all the extensions and capabilities, the versions supported, and how fast the service delivers minor upgrades. This is a big deal. If they lag, it's a flag for me: we need to make sure that if bugs are fixed in Postgres, the fixes will be delivered to my setup quickly.
Also, how are major upgrades done, right? What kind of control do I have there? But if it's in your own hands, it's easier, because it's open source: you can take any extension, except the extensions that AWS, for example, created for RDS and didn't publish as open source. They have things like the AWS Lambda extension, and extensions for plan management in Aurora, to freeze plans, to mitigate or avoid plan flips. These extensions, unfortunately, are not available in open source, so you need to find a replacement.
But this is rare, actually. It's either observability-related capabilities, for which you can find a replacement, or something people rarely use. So most of the RDS offering is based on open source pieces. And you can bring even more extensions if everything is in your own hands, right? You can have... make
Michael: your own?
Nikolay: Yeah, make your own extension. It's easier; you compile everything yourself.
Michael: So- P-Y-O-A, yeah.
Nikolay: Yeah, yeah.
So yeah, there was a mini conference inside a recent conference in Montreal, right? It was organized by Yurii; it was Postgres Extensions Day or something. I know several people who presented talks there. It was interesting. Actually, part of my lack of love for extensions over the last 5 to 10 years is because of RDS and these guys, because I understand that if I create an extension, it will take ages for them to adopt it, right? So if you move out, you have control and can use more extensions. What?
Michael: Yeah, so now we're talking about... Of course you need to do a bit of research, a bit of prep, on where you're going, but I'm assuming most people who are thinking "how do I move off" have done that step. I think the tricky part is often that you're not able to, for example, set up logical replication. An easy-ish way to do this would be: set up a logical replica somewhere else, set up replication, decide a cutover point, pause writes, cut over, send writes to the new primary. Great, easy, done, episode over. But I believe that's not possible in most cases.
Nikolay: Well, it depends on the
scale.
So before we go there, 1 more thing
related to extensions and
compatibility.
So if you want to think about how
not to lose anything, it's
important to think about observability
bits.
And in this case, pg_wait_sampling
and pg_stat_kcache and pg_stat_statements
and auto_explain, they bring a lot
of observability bits.
And there are several monitoring
solutions available that can
be used on top of them.
So the idea is that everyone who works with RDS, literally everyone, uses Performance Insights, and it's not a good idea to migrate off and lose that. Fortunately, with pg_wait_sampling, especially with recent fixes, it's possible to have similar charts in your monitoring. So it's not only about extension functionality, but also about the additional observability and management extensions.
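For a self-managed target, a minimal sketch of wiring up those observability extensions might look like this (assuming the pg_wait_sampling and pg_stat_kcache packages are installed on the host; shared_preload_libraries needs a restart):

    -- load at server start (requires a restart); pg_stat_kcache depends on pg_stat_statements
    ALTER SYSTEM SET shared_preload_libraries =
      'pg_stat_statements, pg_wait_sampling, pg_stat_kcache, auto_explain';
    -- log plans of statements slower than 1s via auto_explain
    ALTER SYSTEM SET auto_explain.log_min_duration = '1s';

    -- after the restart, in the target database:
    CREATE EXTENSION IF NOT EXISTS pg_stat_statements;
    CREATE EXTENSION IF NOT EXISTS pg_wait_sampling;
    CREATE EXTENSION IF NOT EXISTS pg_stat_kcache;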
As for cutover and so on, logical
replication is usually available
everywhere, on RDS included.
You can provision logical replicas.
The tricky part is how to do it
if your database exceeds, say,
a few terabytes or 10 terabytes.
It's really not easy.
Michael: As in that first initialization,
you just never catch
up.
Nikolay: Yeah, yeah.
Well, yeah, never catch up.
You can use multiple slots to catch
up, right?
Michael: Yeah, but, okay, you said it's not easy. Why? Is it because of the time between setting it up and getting it to work?
Nikolay: To create a logical replica, we need to do 2 things. First is initialization, and next is switching to CDC and catching up within the CDC. I don't think catching up in terms of CDC is a huge deal if you use multiple slots. Usually with 1 slot you can handle workloads of around 1,000 tuple writes per second. Check pg_stat_user_tables, the n_tup_del, n_tup_upd, and n_tup_ins counters, and see how many of those you have per second.
That's on modern hardware, Intel, AMD, maybe ARM, though there you might need to lower this threshold. A thousand tuple writes per second is, of course, a relative number, because it depends on the content of those tuples as well. But at a thousand tuples per second, you should catch up fine with just a single slot.
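A rough way to measure that rate, along the lines Nikolay describes (sample the cumulative counters twice and divide by the interval):

    -- totals across all user tables; run once, wait 60 seconds, run again
    SELECT sum(n_tup_ins + n_tup_upd + n_tup_del) AS total_tuple_writes
    FROM pg_stat_user_tables;
    -- (second_total - first_total) / 60 = tuple writes per second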
If you have more, just use multiple slots. The only problem is that when you use logical replication with multiple slots, foreign key violations happen temporarily; it's only eventually consistent on the subscriber, right? So you cannot use it while it's still a replica. But this is not a big problem.
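A sketch of the multiple-slots idea (the table names here are made up; each subscription gets its own slot and apply worker, and the subscriber is only eventually consistent across them until cutover):

    -- on the source: split tables across several publications
    CREATE PUBLICATION pub_batch_1 FOR TABLE orders, order_items;
    CREATE PUBLICATION pub_batch_2 FOR TABLE events, audit_log;

    -- on the target: one subscription (and therefore one slot) per publication,
    -- so the initial copy and catch-up run in parallel
    CREATE SUBSCRIPTION sub_batch_1
      CONNECTION 'host=source.example.com dbname=app user=repl'
      PUBLICATION pub_batch_1;
    CREATE SUBSCRIPTION sub_batch_2
      CONNECTION 'host=source.example.com dbname=app user=repl'
      PUBLICATION pub_batch_2;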
The problem is the initialization with the standard approach. There are several tricks here. First, there is a binary mode. The traditional way is to create a replica with copy_data set to true, but you can also set binary to true; I think it's supported since Postgres 15 or 16. In this case it transfers data in binary form, and it's somewhat faster.
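For example, something like this (names and connection details are placeholders; binary = true needs a reasonably recent Postgres on both sides):

    -- on the target: initial copy plus binary transfer
    CREATE SUBSCRIPTION sub_migration
      CONNECTION 'host=source.example.com dbname=app user=repl'
      PUBLICATION pub_migration
      WITH (copy_data = true, binary = true);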
It's not the same as creating a physical replica and converting it to a logical replica, which, by the way, became official in Postgres 17. I missed this. There's a new
Michael: CLI tool.
Nikolay: There's a new CLI tool, but it's only available from Postgres 17, so in the future.
Right now we usually have an older version in production. Although, if you can upgrade to 17 first, then this tool is available. But these providers don't give access to physical backups, besides Crunchy Bridge.
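For completeness, on a self-managed setup where you do control a physical standby, the Postgres 17 tool looks roughly like this (paths and connection strings are placeholders; it is run on the standby host with the standby shut down):

    # convert a stopped physical standby into a logical subscriber
    pg_createsubscriber \
      -D /var/lib/postgresql/17/main \
      -d app \
      -P "host=primary.example.com dbname=app user=repl"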
Michael: I was going to say, so
you wouldn't be able to use this
to migrate off RDS.
Nikolay: But this flag I'm talking about, binary set to true, is for regular logical replica provisioning. It's not the physical-to-logical conversion; it's something else. And it should speed things up.
But additionally, you can speed things up by doing it manually. If you create the slot, open a transaction, export a snapshot, and then dump everything with many workers, pg_dump supports this: you can pass -j, the number of jobs, and you can also specify the exact snapshot to work with. As long as that session is still open and the snapshot is being held in a repeatable read transaction, snapshot isolation, you can use multiple pg_dump workers to export data faster.
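A sketch of that manual flow (names and the snapshot ID are placeholders; the session that created the slot must stay open until the dump finishes, otherwise the exported snapshot disappears):

    # 1. On a replication connection, create the slot and export a snapshot;
    #    keep this psql session open until the dump below completes:
    #      psql "host=source.example.com dbname=app replication=database"
    #      CREATE_REPLICATION_SLOT migration_slot LOGICAL pgoutput EXPORT_SNAPSHOT;
    #    note the snapshot_name it returns, e.g. 00000003-000001AB-1

    # 2. In another terminal, dump using exactly that snapshot, in parallel
    #    (parallel dump requires the directory format, -Fd)
    pg_dump "host=source.example.com dbname=app" \
      --snapshot=00000003-000001AB-1 -Fd -j 8 -f /backup/app.dir

    # 3. Restore into the target, also in parallel
    pg_restore -d "host=target.example.com dbname=app" -j 8 /backup/app.dir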
There is even more that can be done here. If you have huge tables, a single worker will be working alone: a single-threaded worker processing a huge table, like a billion rows, for example. In this case, it's good to think about custom dumping and restoring using multiple workers and some logical partitioning, like ID ranges, for example, if it's integer, bigint, or UUIDv7. Actually, even if it's UUIDv4 you can have ranges, because it's randomly distributed.
Michael: so no longer using pg_dump
but like copying to CSV or
something.
Nikolay: Yeah, yeah, something like this. You specify ranges and export data from a huge table using multiple workers. This is to speed up the process.
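A sketch of the per-range export (table, column, and range are made up; to stay consistent with the slot, each worker pins itself to the snapshot exported when the slot was created):

    -- each worker runs this in its own session on the source, for its own id
    -- range, and pipes the output into "COPY big_table FROM STDIN" on the target
    BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
    SET TRANSACTION SNAPSHOT '00000003-000001AB-1';
    COPY (SELECT * FROM big_table WHERE id >= 0 AND id < 100000000) TO STDOUT;
    COMMIT;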
And the more we speed up, the less
we need to catch up.
Obviously.
But the problem with speeding up too much is that you're doing it from the primary. Only recently, in 16 or 17, did it become possible to do logical replication from physical standbys. Before that, you had to deal with the primary. And that's not fun.
Michael: But also, I don't understand
how you...
Can you on RDS set up logical replicas
from a standby?
Nikolay: Never tried, but it should
work in 17, right?
Michael: Oh, I mean, the features,
they're in Postgres.
I just didn't know it was available
in RDS.
Nikolay: I also don't know.
But actually, you can think about it, and it's interesting, right? We create a slot on the primary and we have a physical replica; theoretically, the replica could be lagging a little bit. We create the slot, and then we use recovery_target_lsn to catch up and synchronize with the slot position using physical replication. Then we pause the replica, and we can grab data from the physical replica, knowing it will correspond exactly to the slot position, which is waiting for us, right?
But the problem is RDS doesn't support recovery_target_lsn; you cannot change it. So my question is: can we pause the replica and, instead of recovery_target_lsn, advance the slot? There is a function, pg_replication_slot_advance or something like this. Knowing the position of your physical standby, you could advance the slot so the two are synchronized again. Then you grab your data from the physical standby, knowing it will correspond to the slot position.
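A rough sketch of that idea (untested, as Nikolay says, and the LSN is a placeholder; note the slot can only be advanced forward, so the standby has to have replayed past the point where the slot was created before you pause it):

    -- on the primary: create the slot, with no consumer attached yet
    SELECT pg_create_logical_replication_slot('migration_slot', 'pgoutput');

    -- on the physical standby, once it has replayed past the slot's position:
    SELECT pg_wal_replay_pause();
    SELECT pg_last_wal_replay_lsn();   -- say it returns 5A/3F000110

    -- back on the primary: move the slot up to the standby's replay position,
    -- so data copied from the paused standby matches the slot exactly
    SELECT pg_replication_slot_advance('migration_slot', '5A/3F000110');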
Michael: But in reality, when you do a project to move people off, what do you actually end up doing normally?
Nikolay: Well, recently we decided to do it with downtime, honestly, because that project could allocate a 3-hour maintenance window, no problem. And the scale was, I think, a couple of terabytes. It was not huge.
But at this scale, actually, I would probably use traditional logical replication provisioning. Still, what I'm describing here is interesting, because again, we want to move as fast as possible through initialization so we have less to catch up on later, but we also don't want to overload the primary if it's still being used. So there's a trade-off between competing concerns, right? That's why it's a good idea to move this disk I/O, the read I/O, off the primary to a standby. This is what I was just trying to elaborate.
Honestly, I never tried it on RDS. I only tried it from self-managed setups.
Michael: That's why I wanted to bring us back a little bit to how to move, because it feels like, even though these things might be possible, they start to get quite complicated. And sometimes complexity in these situations is scary: it's hard to test, it's hard to be sure it's going to work well. And because hopefully you won't be migrating every year or two, it's a fairly infrequent task. I think sometimes people can arrange a couple of hours of downtime, you know, over a weekend, announced really far in advance. Obviously some can't, but I think plenty can, and then the complexity just drops.
Nikolay: There are open source tools, and additionally proprietary tools, which help with logical replication. For example...
Michael: Like CDC type.
Nikolay: Yeah, DMS from AWS, and Google also has its own tooling. And there are third-party tools like Qlik; I wouldn't recommend Qlik, they suck at Postgres for sure. But there's Fivetran; Fivetran is a good proprietary tool. They promise to work very reliably on very big databases, so you can use them to migrate out, and that's it.
Or there are open source tools like pgcopydb, which I think works fully at the logical level, right? So it should be compatible with RDS. It's by Dimitri Fontaine, right? Again, I only know of it; I've never used it myself. There's also something from Xata, I think? No? There should be.
Michael: I don't know.
Nikolay: The problem is support for DDL, because with logical replication the problem is always DDL, right? If you need to run DDL often, and sometimes DDL creation is part of user activity, that's not a good position to be in, because DDL is not propagated with logical replication. If you can pause it, great, but if you cannot... Yeah, so...
Michael: Another reason to keep
the window closed.
Nikolay: Yeah, I agree with you.
Complexity can grow, but it's already
not rocket science.
Michael: Yeah, that's fair.
Nikolay: And you can evaluate this complexity and how much engineering it needs, and compare that with your savings.
Michael: Yeah, it's a good point though, because if we're talking about projects that are already spending a few thousand a month or more, they're going to be of a certain scale. I think you mentioned a couple of terabytes, and it was about a couple of hours of downtime or something.
Nikolay: Sometimes people spend millions, and we know there are cases of quite big companies, in terms of scale, that migrated out of the cloud entirely, you know, like the posts by DHH and 37signals. That was, I think, a couple of years ago already, right?
Michael: Oh yeah, all I mean is the bigger you are, the more that complexity makes sense, because, well, unless I'm mistaken, I was always thinking the downtime you'd need to take would be somewhat proportional to the amount of data, because of the dump and restore time. So if you've got a small database, the amount of downtime you'd need to cut over would be much lower.
Nikolay: Yeah, yeah, that's
Michael: fair.
But you're not gonna have a small
database and be moving for
cost reasons.
That doesn't make sense to me.
Nikolay: By the way, there is a good approach. We don't want a long-lasting initialization, because then we lag a lot, and we don't want to risk the health of the primary in case we cancel the migration. Because if we cancel, the health of the primary will be disturbed by the accumulation of dead tuples and, eventually, bloat, right?
And there is an approach, which I think some of the tools I just mentioned implement, where you start consuming from CDC immediately when the slot is created and put the changes into some intermediate place like Kafka, or object storage. In this case the xmin horizon is already advancing, right? You're already using the slot, just not from the final Postgres destination, but via some intermediate consumer that promises to deliver all the changes later, when the final destination is ready. So this is also a reasonable approach, but it adds even more complexity, because now you need to manage Kafka, and that's a whole other story. Or something else: files, files on object storage.
Michael: Well, like, is that what Debezium is used for?
Nikolay: Yeah, well, but I haven't heard anything from Debezium in quite a long time. I don't know what's happening. If you're listening and you know, can you leave a comment somewhere? I'm curious what's happening with that project.
Michael: So, but you raised a good point about testing, like
what if you need to go back, or I mean, I think there's a valid
question around how do you even test this kind of thing?
How do you do a dry run?
Nikolay: So the trickiest part is the switchover, because it's hard to undo; it's a jump. But provisioning of the logical replica can be tested in production. You start with non-production, then you go to production, and you are very careful. There are 2 big risks with logical replication. One is running out of disk space, and there is a newer setting, I keep forgetting its name, that can mitigate this: a maximum size of the lag, which you can control to say, better kill my slot if you reach this threshold. So you can set, I don't know, 10 or 100 gigabytes there to avoid that risk.
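The setting Nikolay can't recall here is most likely max_slot_wal_keep_size (Postgres 13 and later): it caps how much WAL a lagging slot can retain and invalidates the slot instead of filling the disk. For example:

    ALTER SYSTEM SET max_slot_wal_keep_size = '100GB';
    SELECT pg_reload_conf();   -- a reload is enough for this setting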
And the second danger is affecting the health of the primary, because the xmin horizon is not advancing. This you can simply monitor, and define some threshold where you say: stop, we're killing the slot, let it go. Besides these 2 risks, users won't notice anything. Well, disk I/O as well: if you provision right from the primary, disk I/O can be significant if you use multiple workers. So it's better to keep that under control.
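A simple way to keep an eye on both risks at once, run on the source:

    -- how much WAL each slot is retaining, and the xmin horizons it pins
    SELECT slot_name,
           active,
           pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal,
           xmin,
           catalog_xmin
    FROM pg_replication_slots;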
Michael: Yeah.
You also can't be doing migrations
during that time.
Like you can't be doing DDL.
Nikolay: You can, you can. If it's a...
Michael: good point,
Nikolay: Yeah, it will just put your replication on pause. And if it's not a lot, or if it's a test, what I did was just check the logs and see the DDL. It's good to have log_statement set to 'ddl'; that's good advice for any setup anyway, just to keep track of all the changes.
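In config terms that is simply:

    ALTER SYSTEM SET log_statement = 'ddl';   -- log every CREATE / ALTER / DROP
    SELECT pg_reload_conf();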
And you see: this is when the DDL happened, and now our replication is stuck because of it. Okay, we have up to 1 minute while I manually propagate it. It's a test, right?
But a good test is a production test, because otherwise it's really hard. I recommend starting with non-production tests, but eventually, due to the complexity, I would definitely recommend production tests here as well. Kept under control, the risk is not high. If you see lag, okay, propagate the change. If DDL happens, say a new partition is created every hour, you just propagate it manually and understand why that spike was there. That's it. It doesn't affect anything, and the good thing is that users don't notice, right? So it gives you the benefit of performing tests in the field, where the real battle happens.
Switchover is different. Unless you have the switchover backed by PgBouncer, it won't be fully zero-downtime. You will need to accept up to a minute of loss, for example. Loss of availability, not of data; no data should be lost, it's just the switch, right? And PgBouncer, if you want pause/resume, should probably be installed on separate nodes, in Kubernetes or however, and under your control. Because if it's the PgBouncer provided by the service, you need to switch out of that service too, right? It's quite rare for people to control their own PgBouncer, I mean among users of managed Postgres. But it happens. It happened recently with us. People...
Michael: So might that be a sensible
initial thing to consider
migrating?
Like run your own proxy?
Or migrate to running your own?
Nikolay: Oh yeah. Well, you can start with that, actually. This is what happened recently. It was a managed service with PgBouncer; the managed service provided PgBouncer, but they didn't provide important bits of it, some of the control. I think pause/resume was not supported, I don't remember the details, and definitely not all the monitoring details were exposed, so you could not export those bits to a custom monitoring solution. So the first step was to migrate those PgBouncers and get control over them. Then you can have proper pause/resume. And that's good.
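With your own PgBouncer in place, the cutover itself can then be driven from its admin console, roughly like this (host and port are placeholders):

    -- connect to the admin console, e.g.:
    --   psql "host=pgbouncer.internal port=6432 dbname=pgbouncer user=pgbouncer"
    PAUSE;     -- let active queries finish, hold new ones
    -- repoint the database definition in pgbouncer.ini to the new primary, then:
    RELOAD;
    RESUME;    -- released clients now hit the new primary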
Oh, an important thing I forgot that I also wanted to mention. When you do all this and migrate out, you need to think first, you need to plan. And within planning, it's super important to understand the topology and the round trip times between the nodes. The distance: is it the same region? If, for example, you're migrating out of RDS but staying inside AWS, you can stay in the same region. That's good, because where will your users be, I mean the application nodes? If they are far away, that's bad. Sometimes people migrate to, say, Hetzner or something, and in that case it's better to be closer. Some AWS regions have just a couple of milliseconds of round trip time to a few Hetzner regions; I think there are only 2 or 3, I don't remember. So maybe you first need to migrate your application nodes to a different region, not the database yet, to be closer to that Hetzner region. Geography matters here: how many miles or kilometers are between them? And you need to test latency.
How do you test latency? An absolutely simple test: you have Postgres, you connect to it, you type backslash timing in psql, and just run "SELECT;" multiple times. You'll see it right away. It's not scientific, you should do something better, but it's super easy, because ping usually doesn't work and you need some way to measure round trip time at the TCP/IP level.
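In psql, that quick-and-dirty check looks like this:

    \timing on
    SELECT;   -- an empty SELECT; the reported Time is roughly one network round trip
    SELECT;   -- repeat a few times and look at the typical value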
I think there are tools we used in the past, I don't remember which, but this is the easiest way to check the round trip time to Postgres and choose the better option, because 1 millisecond is actually already noticeable, right? Or 2 milliseconds, if you have many queries. At 10 milliseconds I would already start hesitating. You remember our first episode, right?
Michael: Yeah, I was going to say, that's what you're aiming
for, to have most queries be less than 10 milliseconds, right?
Nikolay: Yeah, because one HTTP request can involve many queries, for example 10, and 100 milliseconds for HTTP is already noticeable. Okay, maybe not noticeable to users until 200, but it should be noticeable to engineers. Okay, that's it.
I think we covered many specific areas. I just wanted to say it maybe feels scary, but it should not be scary. And I think in the future we will have more mature products, purely open source, delivering HA and DR in packaged form. Not only in Kubernetes, although Kubernetes already has that: you can choose among multiple Kubernetes operators, fully open source, and they have, or at least promise to have, everything. But if you don't like Kubernetes, and actually, more and more, I don't like Kubernetes for databases, then I think more products will arise and help you stop worrying about backups and availability.
Michael: Yeah, I think it's good to have options as well. It's good to have a second-best option when you're negotiating, too.
Nikolay: If you
Michael: Because if costs are the main concern, step 1 might be to try to negotiate a lower price. And at that point it helps to be able to say: oh, we could actually migrate, we've looked into it, here's our plan, we can move to something that will cost us this much less. You might get the best of both worlds and just get the cost reduced without having to migrate.
But yeah, great.
So we covered a bit of planning, we covered the how in technical terms, and we covered things you have to make sure you don't forget about, a few bits there.
Anything else before we wrap
Nikolay: it up?
I just wanted to add a comment that I understand that this discussion
maybe is a little bit early.
So let's see what happens in the next, say, 5 years.
And am I right to predict that the rise of new products around
Postgres should happen?
We'll see.
Michael: Yeah.
Well, I'm going to predict that we might actually go slightly in the other direction. I think we're gonna see more and more managed services, and I think they're gonna be... Because of AI. Maybe because of AI, but also, some of the companies that are spinning up thousands of these, like Supabase, seem to be going from strength to strength, good for them. I know Neon just got acquired, but the number of new databases on their platform is significant.
They're both
Nikolay: because of AI builders,
so to speak, right?
Michael: Yeah, but not only.
I think there was...
Nikolay: This is what I read in
social media.
Michael: I think the recent continued
trend is definitely that.
But Supabase...
I know a lot of people building small businesses, and a lot of them are doing it on Supabase instead of Firebase like in the old days. So I think there are a lot of non-AI projects using that kind of service as well. I'm seeing a lot of those, maybe not the enterprise or larger-scale projects, but it'll be interesting to see in a few years' time.
Hopefully we'll have both.
Hopefully we'll have both ends
of the spectrum.
Nikolay: What I remember is the saying, not a saying, the claim, that more and more, maybe more than half, of new databases created come from integrations and automation, like some AI coding tool or something: they need a database and it's just created, basically fully automatically, right? Yeah, interesting.
Michael: Yeah, and what's the other half then?
Nikolay: I don't know the exact numbers; I just put in a random number, 50.
Michael: Yeah.
Nikolay: Yeah, so...
Michael: And scale matters too, right? Like, whilst that's a lot of databases, I don't imagine it's a lot of huge databases. We're probably not talking about the ones that are going to be migrating to RDS anytime soon, most of them.
Nikolay: Yeah, well. It's interesting. I would love to see a report from good, trustworthy sources on the landscape of databases in terms of sizes, budgets, and so on, about open source, or maybe only Postgres. I remember some Gartner report from 2018, which was already many years ago, saying that the open source database market had already exceeded a billion dollars, which was great. But what's the distribution? The databases that are created automatically are small, and they are definitely below the threshold we discussed, right? Where it doesn't make sense. And...
Michael: Also, what counts? What counts as open source? Like, is Gartner including Aurora Postgres as open source?
Nikolay: Well the basis is open.
Michael: I understand that. I just think it's a blurred line at this point.
Nikolay: Yeah, it's hard to say, there are so many flavors. I agree.
Michael: Yeah, all right. Anyway, always a pleasure speaking with you, and catch you next week.
Nikolay: Good. Good.