
Self-driving Postgres
Michael: Hello and welcome to Postgres.FM, a weekly show about
all things PostgreSQL.
I am Michael, founder of pgMustard, and this is Nik, founder
of Postgres.AI.
Hey Nik, how's it going?
Nikolay: Going great.
I'm very glad to see you.
How are you?
Michael: Likewise.
I'm good, thank you.
And you chose the topic this week.
What are we talking about?
Nikolay: Yeah, I think it's very interesting to discuss the level of automation we have. You know my position against managed Postgres, but in this case it will probably be the opposite: saying that what we have is not enough, in terms of managed Postgres and also in terms of Kubernetes operators and the other automation projects the Postgres ecosystem has right now.
So, why was I thinking about it? Imagine 2011, when Heroku Postgres was started. Then in 2013, RDS Postgres was released, in November I think, near the end of the year, right?
So that was a foundation of the growth of interest, I think. Some people say it's because of JSON or something, and I agree with those arguments, but I think the central reason Postgres started to grow in 2014, 2015, and up to now is that before that, backend engineers and developers were always complaining how difficult it is to set up Postgres, configure it, handle backups and replication, and they just didn't want to deal with it. RDS, and Heroku before that, brought automation for the basic things, right? And this, I think, simplified the lives of a lot of engineers.
That's great.
In 2020, Supabase was launched, and I think a new wave of audience was brought to the Postgres ecosystem, front-end people actually. Because now it's not only Postgres that's automated, and automated very well; other components are very well automated too, like the REST API, the real-time component, the authentication component, right? So you immediately start working on the front end, forgetting about the backend. I admire Supabase for bringing a lot of front-end folks to Postgres. That's great. But at some point they need to learn SQL, I'm pretty sure. So this is fine.
And now we have AI builders, and this is another wave of users, and sometimes basically not humans anymore. We hear from Supabase and Neon that a lot of the clusters created these days are created by request from Cursor or something, vibe coding, right? So many, many clusters. Many of them are small, and maybe they won't go anywhere because it's just experiments, prototyping and so on. But some clusters grow, and they lack attention.
So with RDS, there was a big shift. We talked about startup teams who don't have a DBA, and they are fine with it until some point, and this is where Postgres.AI professional services catch them quite often, right? But now we're talking about even the complete lack of a backend engineering team, for example with Supabase, or somehow even the complete lack of an engineering team. Only folks who understand the product and try to vibe code it, sometimes with security breaches. Like recently, some app was storing data in Firebase, right, Google Firebase, and it was not secure at all. 5 million users registered. That was a big scandal. So yeah, anyway, that's part of the topic: security.
What I feel is a demand for much higher automation than just RDS or Supabase. Some new level of automation should be present. And if you look at the enterprise sector, there is Oracle with this idea of an autonomous database, a self-driving database, for many years, right? On one hand. On the other hand, there are academic papers, like one from Carnegie Mellon University by Andy Pavlo, from 2017, which discusses what self-driving database management systems are.
And there's a question. If you think about zillions of Postgres clusters, they should be highly automated, and when experts look at them, everything should already be transparent and obvious: how to fix things and move on. So what is this? What is self-driving Postgres? I was thinking about it. And to answer that, I performed several waves of research, of course with deep research from Claude and OpenAI's ChatGPT, right? The latest models; I've paid all of them a lot of bucks already. So I was thinking, what could it be for Postgres? To answer that, I performed research looking at Oracle, first of all.
And I just asked. You know, deep research is when they perform Google or Bing searches, analyze hundreds of sources, and then write some kind of report, like a student would. The report might have issues, of course, but it gives you a lot of links at least, and some summaries. So my question was: after all those years of building autonomous Oracle, what do people really like, and what do they like less?
Michael: What did you find? What did they say?
Nikolay: Yeah, and one more comment. In 2013 or 2014, I think, I was attending an autonomous Oracle webinar, and I was completely shocked. They promised autonomous Oracle, but they only talked about clustering logs, organizing better log analysis from hundreds or thousands of sources, and not only database ones. And I was like, where is autonomous Oracle here? And for more than 10 years since then, I was thinking it's stupid. Now I've changed my mind, and I hope you and our audience will understand why.
So what I found is that people really appreciate self-patching. With minor releases, if a new release comes out with security patches, it's not a headache at all. And this is automated on RDS: you can just define a maintenance window, yeah, with some caveats.
Michael: Minor versions only, I think? Or have they done major versions now?
Nikolay: Major versions, yeah, automation of major version upgrades, and here we definitely have something to discuss. I mean, we discussed it already, and my team had very good recent cases where our customers had zero-downtime upgrades. We're very happy; I hope some blog posts are coming.
Michael: Zero downtime is very different from fully autonomous, though. Very, very different.
Nikolay: From fully autonomous...
Michael: Or self-driving. A major upgrade is very different.
Nikolay: Let's take one more step back. What is a self-driving car?
Michael: Yeah, great.
Nikolay: There are six levels defined by SAE, a kind of standard, from 0 to 5. Level 0 means not autonomous at all: a manual, regular car. And 5 is fully autonomous. And looking at the first few levels, I realized an interesting thing: they talk not about each feature individually, but about combinations of features. For example, level 1 could be either adaptive cruise control, so maintaining speed, but safely, right? Or maintaining lane. But not both. If it's both, it's already level 2, right? And there are several levels, and, for example, this Carnegie Mellon paper, Andy Pavlo's paper from 2017, discusses how to map this to database management systems, right?
Michael: A little bit. Not much, in my opinion, but a little bit, yeah, for sure. I looked at this paper in advance, and you shared these car levels; I'll link them up in the show notes as well. It struck me that level 4 to level 5, the last step, is a huge jump. It's like: here are loads of features in the car that will help the driver, and then at level 5, suddenly the driver does nothing. That feels to me like a potentially huge chasm; maybe there are a hundred more levels between 4 and 5 that we're going to need to break down at some point. It felt like a very hand-wavy way of saying: we already have level 4 features, so we're very close to having level 5. And I was unclear, for example, whether in a level 5 car a human could still take control if needed, or whether there was absolutely no way of doing that. That feels to me like a level that wasn't defined, and I believe there'll be more.
Nikolay: Yeah, yeah, let me explain how I see it. I think if you ask several people, they will answer differently. I also heard the Kubernetes ecosystem tries to map it as well, and some operators claim they have very high automation, but many people say they don't, and so on. So in terms of cars, let me walk through it. Level 2 is both of these options combined, and this is what Tesla Autopilot does, for example. I use it a lot. You just turn it on, but you must sit there, and officially you must keep your hands on the wheel and be ready to take control any second, basically. But it's still great. It maintains lane and speed; you just relax and spend much less effort. And I think we can think about this in databases as well. The next level is level 3, where everything is automated, but you still need to be ready to take control. And this is what, for example, Tesla Full Self-Driving is. Well, not everything is automated; it works under your supervision.
Michael: I'm just pulling it up. No, no, it's not quite that. Level 3, it says here, is for example a traffic jam chauffeur. So it can handle basic traffic conditions, like a traffic jam, but it doesn't account for all weather conditions, for example, and there's a bunch of other potential things. So it's not like...
Nikolay: There are limitations.
Michael: Exactly. And...
Nikolay: You still need to take control if needed. Basically, you need to be ready to take control. This is level 3, I agree. There are limits, but it can bring you fully automatically from point to point. This is what Tesla Full Self-Driving does. You sit in the driver's seat, and from point to point you can basically enjoy full automation of the whole ride, but if some bad condition occurs, then you need to take control and fix things.
For example, we can map this to a full major upgrade. By the way, when we think about autonomy, we also bring in additional features, like zero downtime. An upgrade could be autonomous but in-place and with downtime; somehow our mind wants some good features in addition to autonomy. It's a natural desire to have good stuff.
But we can imagine, for example, the whole thing automated. Under many circumstances it will work, but in some edge cases it won't, and you will need to take control and make some decisions before proceeding, or even postpone the whole procedure. This is very similar to Tesla Full Self-Driving, and I've experienced it. It really can drive you from point to point. But, for example, inside my property I have some roads, and it won't be able to drive there at all, because it's like: oh, this is not a proper road, you know.
Then there is level 4. It's also conditional, but there (again, my perception might be wrong) you can go to the back seat and sit there. You are allowed to relax completely, fully, but again, it will work only under some conditions. For example, there is Waymo. I tried it multiple times in San Francisco. It's amazing. You get into its Jaguar, you sit in the back seat, and everything is fine. But we've seen these YouTube videos where multiple Waymo cars just create a traffic jam among themselves, basically a deadlock, right?
Michael: So like if they sense
that they can't drive or they
spot something they can't deal
with, they'll stop as a safety
precaution.
And then what do you do?
You have to get out?
Nikolay: Yeah. In this case, yes. In the case of Waymo, you're a passenger, yes. So you can imagine, if we map it to major upgrades, there is a vendor who can intervene and take control sometimes, if allowed. But those who asked to perform the full major upgrade are passengers; they cannot make decisions, and this is good for them, because their mind is spent on product development, for example, right? And thanks to the millions of miles of experiments and real-life experience these cars already have, if something goes wrong, safety first: it will just abandon the trip, I mean postpone it and cancel, and another car will come later, right? So this is the approach. The whole thing is encapsulated, like a black box for you. You don't go down and make decisions according to some decision diagram. But it also has limitations.
And I think Waymo is the perfect example of level 4, because it works in San Francisco, in some areas, but you cannot drive to San Jose, which is usually slightly more than an hour's drive, because it's outside of coverage. And this can happen here as well, with major upgrades: if there are some extensions, for example, it's outside of coverage. We don't support this kind of upgrade, because this extension, I don't know, Citus or TimescaleDB, requires an additional approach, and we don't have it covered here, right? So this is what I think: we can map it, and why not?
But my main insight was from the deep research on feedback from Oracle users, DBAs and engineers. They say upgrades are great, both minor and major. Security is great: security controls, automatic procedures to level up security. These kinds of maintenance things are great. But when we talk about smart things, like index advising and so on, it's a hit-and-miss situation.
And here I had an aha moment, because I was thinking: actually, if you look at what people try to do, they try to invent automated configuration tuning, with machine learning and AI. This is what Andy Pavlo was doing with OtterTune. OtterTune was eventually closed, but by the end of OtterTune's life I had already noticed the shift, which had happened at Postgres.AI earlier: attention to query tuning and optimization, like the creation of indexes. And I see pganalyze is doing a great job, not resisting LLMs, which is also great. And some teams go inside Postgres and try to make the planner smarter.
With configuration, it's pretty straightforward for me: it's the Pareto principle. Take pgtune.leopard.in.ua, a very simple configuration, 80% of the job done in one go, really fast. It's just a heuristic-based, rule-based approach for OLTP or anything else. And it's good enough for many cases; we don't need any machine learning and so on.
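As a sketch of what that rule-based style looks like (the values assume a hypothetical dedicated 16 GB OLTP server and are illustrative, not recommendations):

```sql
-- Rule-of-thumb tuning in the spirit of pgtune; numbers assume
-- a dedicated 16 GB OLTP box and are for illustration only.
ALTER SYSTEM SET shared_buffers = '4GB';          -- ~25% of RAM (needs a restart)
ALTER SYSTEM SET effective_cache_size = '12GB';   -- ~75% of RAM
ALTER SYSTEM SET maintenance_work_mem = '1GB';
ALTER SYSTEM SET work_mem = '32MB';               -- per sort/hash; watch max_connections
ALTER SYSTEM SET random_page_cost = 1.1;          -- assumes SSD storage
SELECT pg_reload_conf();                          -- applies what doesn't need a restart
```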
Michael: And even when you say "for many", I think it's also about time. So for a while it will be good, and for a long time it will be good enough. So it's like...
Nikolay: Yes, I agree. But then you need to tune. My way of tuning is to conduct experiments. You know this very well: how to make experiments faster, cheaper, reproducible and so on. These kinds of things.
Michael: Is it worth... because I think many people would count level 4 as self-driving, but there are still enough caveats that you lose a lot of the benefits of self-driving cars. Let's go back to cars briefly. I'm really excited and optimistic about self-driving cars. I love the idea of being able to get on with something else while something drives me. I don't mind that it might not be the fastest way, or that it might not drive completely optimally.
Nikolay: Not the safest. What about safest?
Michael: Probably the safest, but not the fastest, sorry. Probably safer than me, but maybe a bit more gentle; you know, maybe it wouldn't take the yellow light when a human driver would, that kind of thing.
But I love that you can just watch a movie, or chat to a friend easily, or play a board game in the back. You do whatever you want, you get so much time back, especially in America, where a lot of time is spent driving. It makes so much sense to me that self-driving cars are a huge unlock for a lot of people. But largely only at stages 4 and 5, at the highest levels. Cruise control is great, but I still have to concentrate. I still have to be watching the road. I don't actually gain that much. And if we go back to Postgres, I feel like a lot of the automation features are great.
Nikolay: Yes.
Michael: But we still have to concentrate. We still need the person. We still need the DBA.
Nikolay: The driver.
Michael: And as long as we still need the driver, and as long as we still need fail-safes down to humans, all I see is this gradual change. Maybe this is where the driving analogy breaks down a little bit, but maybe fewer humans per server: maybe the DBA team for a company will be smaller on average compared to how it was in the past, and I think we've already seen that over time. But I'm struggling with that last step until we get to those levels, which feels to me like a long way off, especially given the experience you're describing with Oracle, and the experience we saw with very smart people trying to automate a lot of this stuff with a lot of AI, but not even LLM stuff, right? A lot of the research in this area has been machine learning and other longer-researched AI methodologies that have lots of real-world use cases, and even there we've seen mixed results, and the models really struggled. In the experience I've had talking to customers that used OtterTune, for example, I feel the constraints were not as clear as in driving, or the use cases were slightly different, or the performance trade-offs that different people have in different cases are subtle enough that you can't set the exact same guardrails for everybody. And at that point it breaks down. And one more addition: performance cliffs are so real that if you change one thing, and it looks like it's going to be great, as soon as you hit a cliff it's a disaster, and then recovering from those disasters is actually a real problem.
Nikolay: And troubleshooting a disaster is also a problem. Root cause analysis is a problem.
Michael: And arguably they get harder when you involve automation, because the more it's automated, the less people actually know what was changed, when, and why. So I think it can get tricky.
Nikolay: I disagree with you here, because you must be an expert, but if you have a high level of automation, you move much faster. Take Cursor, give it a lot of pieces together, and explain how you approach the methodology of analysis; this is what the expert needs to bring, and then you move much faster. This is how I moved into this area completely. Right.
Michael: Yeah.
Okay.
Nikolay: So people say that in Oracle, this works and that doesn't. What works? Quite simple things. Like I said: upgrades, maintenance, security stuff. Well, not simple, but boring, you know. And of course replication and backups. For me, HA and DR are like auto-steering, maintaining speed, and maintaining lane. Basics. Cars must do this, so a database must have good HA and DR. If we look closer, there are actually issues with both HA and DR which will prevent us from reaching a very high level of automation, but we can dive into this later. Anyway, the boring stuff lacks automation. And remember, I mentioned that levels 1 and 2 talk about combinations of features. So if we start analyzing each feature individually, we cannot apply the same classification, because the classification talks about combinations of features.
Michael: Yeah, sure.
Coming back, actually, just to make sure I understand: what have Oracle done in terms of security? Can you give some example automation features?
Nikolay: Well, I only know a little. I would rather say what we should do in Postgres, and this is the topic I'm least ready to discuss now, because it's on the roadmap, but right now we're focusing on different areas. I can just speculate: identify potential threats, check permissions and roles. For example, I know organizations which use the same superuser for all human DB engineers, and this is quite easy to identify in any organization. I had a couple of them on consulting contracts which went through the IPO process. Before that, you have an audit, right? And during the audit, they ask specific questions. Some of them are quite silly, I would say, but some are quite good. And you just inspect your pg_hba.conf, you inspect the user model, you inspect how multi-tenancy is organized. We had an episode about that, right? So these kinds of things can be analyzed automatically, and so on. I don't know the details about Oracle.
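As an illustration, audit-style checks of this kind are easy to express in SQL; these two queries are a sketch, not Oracle's tooling or Postgres.AI's actual checks:

```sql
-- Which roles are superusers? A single shared superuser used by all
-- engineers shows up immediately here.
SELECT rolname FROM pg_roles WHERE rolsuper;

-- Overly permissive pg_hba.conf entries, inspected without leaving SQL
-- (the pg_hba_file_rules view is available since Postgres 10):
SELECT line_number, type, database, user_name, address, auth_method
FROM pg_hba_file_rules
WHERE auth_method = 'trust';
```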
I just saw feedback that engineers really appreciate this stuff, and that they appreciate automated configuration and automated index creation less.
Michael: Appreciated less, or it gets it wrong? When it gets it wrong, it's more painful. Do you see what I mean?
Nikolay: Mixed results, mixed results. I don't know; there's a lack of trust in some minds.
Michael: For example, I catch up with customers from time to time just to hear what they're doing, what they like about products, what they don't. And I was speaking to a customer of mine that did try to use OtterTune for a while. It'd be interesting; maybe we should invite somebody from that team on to discuss what happened there. Why did it shut down?
Nikolay: I can tell you why: people don't need configuration tuning that much.
Michael: Okay. Well, I also think there might be other issues. So there's definitely not needing...
Nikolay: I tell you, the big need for configuration tuning exists only if you have, say, 10,000-plus clusters. Then you can say: okay, we are going to save 5 to 10% of the money just with tuning, or the workload will be reduced. And we know very well that a really bad plan can screw up all the efforts of configuration tuning.
Michael: Well, yes, but I think it's worse than that. In addition to it not being needed that often, and therefore subscription-based models not working that well, this customer was telling me they moved from a mental state of "when something goes wrong, let's dive into what happened" straight to "when something goes wrong now, what did OtterTune change?" And that was a real shift. It became like a...
Nikolay: Trust.
Michael: Trust, but also that must be based on the fact that it made changes in the past that made things worse. So it's not distrust for no reason. Every now and again, if you change something, you hit a performance cliff. And it's unexpected. Or maybe it's not even always a performance cliff; maybe it has another unintended consequence that you care about more. Probably not in a lot of these cases, but I think they talked about, for example, making the mistake of letting it configure some parameters that even affected durability and things like that. So depending on what you allow it to change, there might be unintended side effects. And putting guardrails around what you will and won't let it do is actually harder than it sounds, I think. It's very hard.
Nikolay: Performing changes the enterprise way is extremely complex. I'm very grateful that seven years ago I was working with Chewy. They were preparing for an IPO, and I remember the CTO was ex-Oracle, and the discussions we had, and the resistance they had to any change I proposed. It taught me this enterprise approach. I'm very grateful; it was a great experience for me. And I realized: actually, if you want to be serious about changes, any small change should be very thoroughly tested, with experiments, all risks must be analyzed, and then there should be a plan to mitigate if a known risk occurs, right? And AI almost doesn't help here, you know; this is a framework you need to build without AI first.
Michael: Yeah. I don't necessarily agree.
I think AI could really help with
these things.
When I say AI, I'm including machine
learning and not just the
latest LLM stuff.
I just think we need to define the constraints really clearly, define what we care about really clearly, and make it really clear that we care more about reliability and durability than we do about performance.
Nikolay: Yes, yes.
Michael: And that's almost always true. And I think that might be more core to the reason these performance tuning tools haven't yet succeeded: we haven't yet nailed the reliability and durability stuff. That would be my theory as to why they didn't succeed. Because even if they did help performance almost all the time, if they ever hurt reliability, that's not a trade-off most organizations are willing to make. And that's a difficult thing to tell a tool that's trying to optimize for better performance.
Nikolay: Yeah, I agree. And durability has issues. There is a good article from Sugu, just published on his blog, about synchronous replication issues, and we also know the very good talk by Alexander Kukushkin about issues with synchronous replication. So durability is a must-have, and I agree with you about targets: durability, availability, reliability must be number one, before performance.
By the way, I also remember from that research that people appreciate automated analysis and control of costs. And this can start quite simple. I remember talking to one huge organization; their database director told me: you know what, it's cool stuff what you're showing in terms of experiments and performance tuning, like query tuning experiments with Database Lab and so on, but the number one problem we have is abandoned instances, and how to stop losing a lot of money on them. In big organizations it's a big problem. And cloud providers still don't offer good tools; it still takes a lot of effort to understand the structure of spending, right? And to do it proactively; usually it's a little late. They are not interested.
Yeah, right. So anyway, back to my aha moment.
Michael: Yeah, sure.
Nikolay: People from academia, really great people, right? Great minds. They try to build really cool stuff: automated parameter tuning, automated indexing, or even going inside the planner to create an adaptive query optimizer.
Michael: I saw even more extreme things. In the paper, they were talking about choosing whether tables should be row-oriented or column-oriented based on the workload they're observing. So they're trying to attack really cool areas.
Nikolay: This is great. And it has always been so: academia attacks things that are really detached from reality.
Meanwhile, I realized we had already implemented automated indexing with multiple teams, and this is what people really need. And lately I realized that in consulting we almost always say you need automated re-indexing, but we didn't have a polished solution. We have multiple ways to do it. I always said: take this tool someone from our team developed as a side project, and then polish it. And it's only about B-tree indexes, and this and that. Also, the bloat estimates might be off; we know very well they are just estimates, not real numbers from, say, a reindex on a clone, right? And I realized this problem is actually not solved. And it's a boring problem. And we can solve it.
That's why it's the number one thing we are going to release right now; it's about to be released. We call it pg_index_pilot. And there is a good roadmap inside this project. It's going to be really simple. A guy who works part-time with me right now, Maxim Boguk, one of the most experienced Postgres DBAs I know, much more experienced than I am, created what I basically consider a prototype. I said we'd fork it, and then we started iterating on it.
The idea is simple. I call it the Boguk number: you take the index size and divide it by the number of live tuples in the table. Let's forget about partial indexes for a while. So we have some ratio, right? When we have just created the index, let's consider this ratio perfect. And let's consider tables which exceed, say, a million rows. Checking this number costs nothing; you can put it into monitoring for all indexes. It's super fast to get the number of...
Michael: Because these aggregates are stored, right?
Nikolay: Yeah, yeah. It's from system catalogs, system views. Immediately you get it.
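A minimal sketch of such a check, using only catalog data; the column choices and threshold are illustrative, not pg_index_pilot's actual implementation:

```sql
-- Index size per live tuple (the "Boguk number"), computed from catalogs
-- only, so it's cheap enough to poll from monitoring.
SELECT
  i.indexrelid::regclass AS index_name,
  i.indrelid::regclass   AS table_name,
  pg_relation_size(i.indexrelid) AS index_bytes,
  c.reltuples            AS live_tuples_estimate,  -- -1 if never analyzed (PG 14+)
  pg_relation_size(i.indexrelid) / nullif(c.reltuples, 0) AS bytes_per_live_tuple
FROM pg_index i
JOIN pg_class c ON c.oid = i.indrelid
WHERE c.reltuples > 1e6   -- only tables over ~1 million rows
  AND i.indpred IS NULL;  -- skip partial indexes, as discussed
```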
And then, over time, you see this number degrade. Why? Because some pages inside the index are not full; they're half empty, sparse. So at some point you say: oh, it's time to re-index. There are pros and cons of this approach compared to, say, the traditional one based on the bloat estimates everyone uses. One of the big cons: it required some effort on our side to get away from superuser. We did it.
And oh, one more thing. On purpose, we said this component is going to live inside the database: self-driving inside the database. We don't need external means, like something installed on an EC2 instance, or Lambdas. We don't need anything; it will be inside. This means it's running as PL/pgSQL code. We know that since Postgres 11, stored procedures have had transaction control, right? So we could go that way, but we need to reindex concurrently, right? And REINDEX CONCURRENTLY cannot be wrapped inside a transaction block.
So unfortunately we need something like dblink, right? And it's a challenge to do it properly on RDS, for example, because with dblink you need to expose a password, and you don't want to store it in plain text. So I remembered a very old trick I used many years ago: dblink over postgres_fdw. And this is how we do it right now.
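A rough sketch of that trick; the server, role, and index names here are hypothetical:

```sql
-- dblink can connect through a postgres_fdw foreign server, so the
-- password lives in a user mapping (in the catalog) rather than in a
-- plain-text connection string inside the code.
CREATE EXTENSION IF NOT EXISTS postgres_fdw;
CREATE EXTENSION IF NOT EXISTS dblink;

CREATE SERVER index_pilot_loopback
  FOREIGN DATA WRAPPER postgres_fdw
  OPTIONS (host 'localhost', dbname 'app');

CREATE USER MAPPING FOR index_pilot
  SERVER index_pilot_loopback
  OPTIONS (user 'index_pilot', password 'secret');

-- REINDEX CONCURRENTLY can't run inside a transaction block, so it is
-- shipped out through this autonomous connection instead:
SELECT dblink_exec('index_pilot_loopback',
                   'REINDEX INDEX CONCURRENTLY my_bloated_index');
```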
And there is another kind of limitation. For this thing to start working, you need a baseline. Baseline means you need to reindex everything first, or to bring this data from a clone; this is an idea I'm going to implement very soon, to avoid a full reindex, because sometimes our customers have 10-plus-terabyte databases, and it's not cool to ask them to reindex everything: it will take forever, and there's also a big impact when you reindex quite fast.
But a good benefit of this approach: it's not only about B-tree. You can take care of GiST, SP-GiST, and even HNSW and others if they degrade. Basically, we measure a kind of storage efficiency for an index. It's super trivial, and I believe in this simple approach. I think we are going to have it. I also talked about this a couple of weeks ago with Andrey and Kirk on Postgres Hacking, on Postgres TV.
And we started doing something mind-blowing. Andrey just said: let's just implement merge. Because, you know, the B-tree implementation in Postgres only has split; it cannot merge pages. And since Andrey's PhD work is in the area of indexes, it was great to be able to start this work, and I'm looking forward to it, like everyone.
Michael: Wait. So if, for example, an area of the index starts to get sparse because we've deleted some historic data, and we're not going to insert historic data back into that part of the index, it could proactively... it's kind of self-healing. Wow, cool.
Nikolay: So our project would be archived at some point, I hope. I'm not an expert in Oracle; I've never been an expert in Oracle, and not a user either. The last version I used was in 2001 or 2002; it was 8i. But people say Oracle doesn't have this need for re-indexing. SQL Server does have this need. Over time, index health declines. We talked about it so much; I said it like a mantra: everyone needs re-indexing at some point. And we know the work by Peter Geoghegan and Anastasia Lubennikova: in Postgres 13 and 14 they implemented deduplication and other improvements. That's great. But still, there is no merge, so pages cannot be merged...
Michael: Automatically.
And there was also some work, I think from Peter Geoghegan and others, to help avoid page splits in more cases: not just the deduplication work, but the bottom-up deletion. So avoiding them getting into this state was one thing; helping them heal when they do is another.
It also strikes me that a lot of this stuff could live in core, right? We already have some automation features. We already have autovacuum, which does about three or four different jobs. So we have some groundwork here already. We have tools like pg_repack and pg_squeeze, not in core, but there are ideas about moving more of their features into core. So we have some of this in core, and some of it in popular extensions that are supported by many clouds. It feels like the project is already naturally going in this direction, maybe slowly, and maybe not in all the areas you want it to. But what does the end goal look like here? Is being in core ideal?
Nikolay: In my opinion, we just need to solve very complex problems. I know that in some big companies' managed Postgres services, sometimes one experienced DBA is responsible for a thousand clusters. A thousand. It's insane. But we need to be prepared to be responsible for a million clusters, because AI builders will bring us a lot of clusters. The times are changing really fast, and Postgres needs not to lose this game. By the way, right now, if you check Hacker News trends: it was growing, and not only job postings, as I usually mention, but everything, all discussions on Hacker News, until the beginning of last year, 2024. And then we have a slight decline. I think Postgres has a huge challenge right now: it needs to be much more automated.
If things need to be in core, they should be in core. But not everything should be in core; we know autofailover is still not in core, right? Patroni. And I just asked Kukushkin whether he has considered automation of restarts and switchovers without downtime, because this is what people want and expect from a highly automated system. He said no, and I started joking: I said, Patroni is outside of Postgres because it's not the job of Postgres to do HA or failover, and now I expect you, the Patroni maintainer, to tell me that an automatic zero-downtime switchover is not the job of Patroni, so I need to implement another layer on top of it, right?
This is actually what we already do. For zero-downtime upgrades, we do this. And by the way, I can share this now: on our consulting page we mentioned GitLab, but also Supabase and Gadget.dev. These companies took tech from us, and I have official quotes from their managers, so I can share this news. I think we developed really great automation for zero-downtime upgrades, which are not only zero-downtime but also, of course, zero-data-loss, and reversible, reversible without data loss as well. These are the perfect properties, but it requires a lot of orchestration. Fortunately, since Postgres 17, things are improving, and fewer pieces need to be automated.
So let me go back and finally explain my humble vision of what I have right now. I realized that the boring stuff needs a much higher level of automation. This is one. Second, I realized that for the Postgres.AI team this is exactly what we are doing, because through consulting, people bring us these topics. And I also realized that in every area, if we think about automating a feature, we can apply a simplified approach to classification.
So, if every single step must be executed manually by an engineer, a CLI call like pg_dump or pg_upgrade, or SQL snippets, this is manual. Then there can be bigger pieces: say, the whole procedure consists of two or three big pieces which are combined, and the engineer only makes choices between pieces, to proceed or not. For example, our major upgrade consists of two huge pieces: the physical-to-logical conversion plus the upgrade itself, bundled for some specific reasons, and then the switchover. Two big steps, and inside them there is a high level of automation: you just call a playbook and it executes, right? In this case, it can already be considered level 1, say. Level 1, or maybe 2, I don't know.
And then, if we can fully relax and go to the passenger seat and just say: okay, I approve that you need to do everything, and you will do it yourself; I mean, Postgres or an additional system will do everything itself, like a full major upgrade with switchover. In this case, say it's the next level. You're in passenger mode, but there are limitations: if it encounters some problem, it will stop, revert, postpone, and you need to approve things. And the highest level is when the system decides: oh, it's time to upgrade, and then schedules it, like: oh, we have a low-activity window on the weekend. You're merely notified about it. You can probably block it, but it moves on by itself. This is the highest level of automation.
And back to re-indexing, which I chose as the lowest-hanging fruit. Everyone needs it. Maybe after our episode, people from RDS and Cloud SQL (I know some of them are listening to us) will rush into implementing this. I just see that everyone needs it, and nobody offers it in managed Postgres. So I'm aiming at the highest level of automation for pg_index_pilot right away. It will decide when to re-index, it will re-index, it will control the activity level so as not to saturate disks, and so on. You can check the roadmap; I explained it in the README of this project. And this is open source.
Because I actually truly believe that at some point this project won't be needed. If Andrey's idea for merge works, maybe, I don't know. This is a dream, right? Forget about index bloat. I guess it will be super hard. I tried to research why it hasn't been done, and I didn't see why. Maybe it's scary: this is the basics, the foundation of the whole of Postgres, of any cluster, right? B-tree.
Michael: Yeah, well, I guess: when would it be done? Would it be done by a background worker, like autovacuum? Or at what stage does it make sense to do it?
Nikolay: Well, it should be synchronous. Oh, this is a good question. I'm going to ask Andrey about this. Yeah. Because split is synchronous.
Michael: Yeah.
I also think the thing you brought up just at the end there is super interesting: controlling the rate so as not to saturate disks. This is an interesting thing with trade-offs again. If you're on Aurora Postgres and you're paying some amount per IO, you probably don't want your indexes rebuilt constantly, all the time. What you really want is to fix the root cause. It's probably an application issue: why are you updating the same row thousands of times a second?
Nikolay: What's the root cause of the bloat?
Michael: So, we just had a...
Nikolay: A case of a queue-like workload updating the same rows multiple times. This can be identified. Yes, yes.
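As an illustration, a first pass at spotting such update-heavy, bloat-prone tables can come straight from the standard statistics views (the ordering and limit are arbitrary choices):

```sql
-- Update-heavy tables are bloat candidates; a queue-like workload
-- shows enormous update counts relative to the live row count.
SELECT relname,
       n_live_tup,
       n_dead_tup,
       n_tup_upd,
       n_tup_hot_upd   -- HOT updates are cheaper; a low share hints at index churn
FROM pg_stat_user_tables
ORDER BY n_tup_upd DESC
LIMIT 10;
```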
Michael: Yeah. So I'm excited to see what you build. I think you're right that more automation in Postgres would be good, and more automation in extensions around Postgres would be good. But a lot of the issues I see are application-side issues. Even if we make Postgres completely self-driving, there is still this kind of application-level issue that will be...
Nikolay: A mistake there can bring all your efforts to the ground, yeah.
Michael: Yeah, exactly. I agree.
Nikolay: That's why, in my blog post published during our launch week, I identified 25 areas. Maybe we will adjust those areas, of course; it's not a final list. And we are targeting three right now. I hope we will expand this list to five or six next year.
Once we have quite good building blocks, we'll think about the whole system. But the central part of all these building blocks is a new monitoring system, which is good both for AI and for humans. This is what we're already actively building. We replaced all the observability tooling used in our consulting with the new monitoring system, which is called Postgres AI monitoring; we talked about it separately. This is going to be the source of insights into why things are wrong. We cannot self-drive the application yet, although in some cases it might be possible, because, for example, Supabase controls not only Postgres but also the REST API level, right? So some things might be handled down there, but in the general case, we cannot; we can only advise and so on. But eventually, things like serverless workloads, I can see living together with the database. In that case, we can discuss full self-driving of something. But we are very far from that, I think.
And we have limited resources. I want to admit that I'm targeting very narrow topics right now: very boring, but really doable, and we see demand. Things like partitioning, help with partitioning. It's a nightmare for an arbitrary engineer. We say: oh, declarative partitioning. But, for example, partitions are not created automatically. Okay, there's pg_partman, but you need to combine these pieces somehow. The level of automation there is terrible.
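To illustrate the gap: the declarative part is easy, but nothing in core creates future partitions for you (the table and ranges below are made up for the example):

```sql
-- Declarative partitioning: the parent is one statement...
CREATE TABLE events (
  id bigserial,
  created_at timestamptz NOT NULL
) PARTITION BY RANGE (created_at);

-- ...but every partition is a manual (or externally automated) step:
CREATE TABLE events_2025_01 PARTITION OF events
  FOR VALUES FROM ('2025-01-01') TO ('2025-02-01');

-- Without a partition covering the row's range, the insert simply fails:
-- INSERT INTO events (created_at) VALUES (now());
-- ERROR:  no partition of relation "events" found for row
```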
Michael: Yeah, and these are the kinds of things... a couple of things. I'd love to see more of this go into core; I think there have been improvements to partitioning in core, many over the last few years, and continuing. And the other thing is, I think there's a lot of focus on fewer people. With this automation, a lot of the focus is on doing more with fewer humans. I'd love to see more effort into what would happen if we didn't try to reduce the humans, but instead tried to increase the reliability. You know what I mean? There's a lot of talk about how we cut costs.
Nikolay: We already have, I think, five companies that are very fast-growing AI companies, very well known already. And I see a very different approach to Postgres.
For them it's a natural choice, like: let's start with it. But then: oh, we need to partition terabyte-size tables. Let's estimate the work. Oh, that much work? Okay, maybe we should migrate to a different database system. They are moving really fast, and they are not attached to Postgres, basically. So I'm scared for Postgres. That's why I'm talking about automation: there should be a massive shift right now, and the resources of my team are not enough. That's why I'm talking about it.
I'm going to put my whole focus on it. In consulting, we also choose only paths that are highly, fully automated. And I'm looking for more use cases, and maybe for partners who think similarly, right? This is a call to the Postgres ecosystem: it's not enough right now. Postgres might lose this game. Postgres won multiple challenges before, like object-oriented and NoSQL, but with AI, everyone thinks only about pgvector and storing vectors. It's not that. Not everyone needs vectors. People build some app, it needs a database, maybe with vectors, maybe without, but what they expect is a higher level of automation and easier scaling.
So I'm super happy to see new innovations in the area of sharding, right?
Michael: Yeah, I do think that's actually more important, partly just for the marketing story.
Nikolay: Than partitioning?
Michael: For those, yeah.
Because, for example, when we had our hundredth episode and spoke to Notion, Figma and Adyen: Notion skipped partitioning and went straight to sharding. They decided that, with all of the partitioning downsides, and there are quite a few limitations, for their multi-tenant architecture it actually made more sense to shard instead of partition. It didn't feel like one of those cases where they were moving super quick and made a rash decision; it felt like a really considered engineering decision, and I actually think it was the right call. And if more of these companies come in and build fairly seamless sharding that doesn't add any complexity at the application level, I could see people tempted into that, even if not for good engineering reasons, just for the marketing reason of: I don't have to think about scaling this, and I don't have to deal with terabyte tables.
Nikolay: You know this normal distribution meme, right? Yeah, yeah. Usually what's on the right side is unexpected. I don't know which camp I'm in. I was thinking one Postgres node is cool enough; then I was thinking it's not cool enough because of performance cliffs. Now I'm thinking maybe it depends, because in some cases it's much safer to avoid all the performance cliffs rather than allowing 100-plus thousand transactions per second on one node. And how is it called? Resiliency, right? If one node is down, it's just one shard, only part of the system. Maybe yes. But at the same time, it's so cool to see projects which require only one node. They are isolated; they don't need a whole bunch of shards and clusters, a cluster of clusters; the term "cluster" is heavily overloaded, right? And I think there's real power in having just one single node, sometimes without replicas. I see projects without replicas, because times change, right? And I'm open, you know; my perception changes over the years, sometimes it's a pendulum. So it depends: the regular, normal answer from consultants, it depends.
But sharding: it's really great to see that it's coming from multiple teams; there will be competition. So yeah, the future looks bright. But who will be helping to choose a proper sharding schema?
Michael: Well...
Nikolay: Yeah. And to rebalance properly. I think those who build this think about it as well: automation, and the problem of choosing the time for rebalancing, fully automated.
Michael: Well, we talked to Sugu; he said it's inevitable that you're going to have to change your sharding scheme at some point, so designing for that upfront seems really important. Vitess handles it. So yeah, interesting times ahead. Personally, I'm a bit worried about the complexity of these systems.
Nikolay: Let's start with simple things.
Michael: Yeah, starting simple. But I also like the idea that if I'm a Postgres user, I can still understand the system, roughly what's going on, even if I'm a full-stack developer, even if I do the front end and the back end. I know it's already very complex. I know there are already a lot of internals that you kind of need to know about to make a performant system. But I hope we can hold on to that for a while.
Nikolay: Well, I truly believe that what my team does is going to help, because we observe many problems. And every time I'm saying: guys, we need to write down a how-to, and show it with an experiment, so users understand what's happening. Write a how-to for next time. And when writing how-tos recently, we write them both for humans and for AI. So next time some RCA, root cause analysis, is happening, if you have our how-tos injected into your Cursor, for example, it's going to take into account situations which are written down: how to reproduce them, how to troubleshoot them, right? So, something else is coming. I'm not going to spoil it, but RCA and troubleshooting is one thing we will attack early. We are preparing pieces for this, and one of the key indicators of success here is that non-experts understand what's happening.
Michael: Yeah, well, that's the area that I specialize in. You know, I'm actually betting a little bit on humans staying in the loop for quite a long time, and that there will, well, not always, but for a long time, be categories of issue that we still need somebody to look into. And I kind of feel like the median level of Postgres knowledge of the people having to look into those issues is probably going to go down, based on all the trends you're talking about. It depends: if we only have a few experts shared between lots of companies, maybe that's not true. But if we do have a lot of individuals starting with vibe coding, or doing full-stack, single-person companies running the whole product start to finish, those folks have to know a lot about a lot of things. They can't know that much.
Nikolay: Have you seen the numbers Supabase shared, how many clusters were registered and how fast it's growing? Can you imagine the average knowledge of Postgres there?
But my vision is that we collect knowledge pieces, we experiment, with automatic experimentation and so on. But then, and this is what I see with our customers, some human is needed to explain things properly to other humans, to answer questions properly, you know, to build trust and confidence and so on.
Michael: Yeah, but I guess the question then is: where do the tools live? Can the end user use a tool to get help? Or does the Supabase team use a tool to get help? Or the consultant that the Supabase team employs?
Nikolay: My answer is: who is the tool for?
Michael: I'm just saying it depends on who you count as the user, in terms of how much Postgres knowledge they have. I'm talking about that end user.
Nikolay: Yeah. For example, with DBLab we already went down this path. We moved from a couple of guys answering all the questions backend engineers have, like how a plan works, what to do about it, which index to create, to backend engineers experimenting themselves with DBLab. And only if something is unclear do they call an expert for help. About 90% of the questions are answered by backend engineers without involving Postgres experts, and the expertise in backend engineers' minds grows as well. I'm just thinking that this approach we had for query optimization can be applied, following the same kind of scheme, to many other areas.
Michael: All right.
Probably enough for today.
Nikolay: Thank you so much.
We went much deeper into specific areas than I expected, and I enjoyed it a lot.
Thank you so much.
Michael: Nice.