
Self-driving Postgres
Michael: Hello and welcome to Postgres.FM, a weekly show about
all things PostgreSQL.
I am Michael, founder of pgMustard, and this is Nik, founder
of Postgres.AI.
Hey Nik, how's it going?
Nikolay: Going great.
I'm very glad to see you.
How are you?
Michael: Likewise.
I'm good, thank you.
And you chose the topic this week.
What are we talking about?
Nikolay: Yeah, I think it's very interesting to discuss the level of automation we have. You know my position against managed Postgres, but in this case it will probably be the opposite: saying that what we have is not enough, in terms of managed Postgres and also in terms of Kubernetes operators and the other automation projects the Postgres ecosystem has right now.
So, why was I thinking about it? Imagine 2011, when Heroku Postgres was started. Then in 2013, RDS Postgres was released, in November I think, near the end of the year, right?
So that was a foundation of the growth of interest, I think. Some people say it's because of JSON or something, and I agree with those arguments, but I think the central reason Postgres started to grow in 2014, 2015, and up to now is that before that, backend engineers and developers were always complaining how difficult it is to set up Postgres, configure it, handle backups and replication, and they just didn't want to deal with it. RDS, and Heroku before that, brought automation for the basic things, right? And this, I think, simplified the lives of a lot of engineers.
That's great.
In 2020, Supabase was launched, and I think a new wave of audience was brought to the Postgres ecosystem, front-end people actually. Because now it's not only Postgres that's automated, and automated very well; other components are very well automated too, like the REST API, the real-time component, the authentication component, right? So you immediately start working on the front end, forgetting about the backend. I admire Supabase for bringing a lot of front-end folks to Postgres. That's great. But at some point they need to learn SQL, I'm pretty sure. So this is fine.
And now we have AI builders, and this is another wave of users, and sometimes basically not humans anymore. We hear from Supabase and Neon that a lot of the clusters created these days are created by request from Cursor or something, vibe coding, right? So many, many clusters. Many of them are small, and maybe they won't go anywhere because it's just experiments, prototyping and so on. But some clusters grow, and they lack attention.
So with RDS, there was a big shift. We talked about startup teams who don't have a DBA, and they are fine with it until some point, and this is where Postgres.AI professional services catch them quite often, right? But now we're talking about even the complete lack of a backend engineering team, for example with Supabase, or somehow even the complete lack of an engineering team. Only folks who understand the product and try to vibe code it, sometimes with security breaches. Like recently, some app was storing data in Firebase, right, Google Firebase, and it was not secure at all. 5 million users registered. That was a big scandal. So yeah, anyway, that's part of the topic: security.
What I feel is a demand for much higher automation than just RDS or Supabase. Some new level of automation should be present. And if you look at the enterprise sector, there is Oracle with this idea of an autonomous database, a self-driving database, for many years, right? On one hand. On the other hand, there are academic papers, like one from Carnegie Mellon University by Andy Pavlo, from 2017, which discusses what self-driving database management systems are.
And there's a question. If you think about zillions of Postgres clusters, they should be highly automated, and when experts look at them, everything should already be transparent and obvious: how to fix things and move on. So what is this? What is self-driving Postgres? I was thinking about it. And to answer that, I performed several waves of research, of course with deep research from Claude and OpenAI's ChatGPT, right? The latest models; I've paid all of them a lot of bucks already. So I was thinking, what could it be for Postgres? To answer that, I performed research looking at Oracle, first of all.
And I just asked. You know, deep research is when they perform Google or Bing searches, analyze hundreds of sources, and then write some kind of report, like a student would. The report might have issues, of course, but it gives you a lot of links at least, and some summaries. So my question was: after all those years of building autonomous Oracle, what do people really like, and what do they like less?
Michael: What did you find? What did they say?
Nikolay: Yeah, and one more comment. In 2013 or 2014, I think, I was attending an autonomous Oracle webinar, and I was completely shocked. They promised autonomous Oracle, but they only talked about clustering logs, organizing better log analysis from hundreds or thousands of sources, and not only database ones. And I was like, where is autonomous Oracle here? And for more than 10 years since then, I was thinking it's stupid. Now I've changed my mind, and I hope you and our audience will understand why.
So what I found is that people really appreciate self-patching. With minor releases, if a new release comes out with security patches, it's not a headache at all. And this is automated on RDS: you can just define a maintenance window, yeah, with some caveats.
Michael: Minor versions only, I think? Or have they done major versions now?
Nikolay: Major versions, yeah, automation of major version upgrades, and here we definitely have something to discuss. I mean, we discussed it already, and my team had very good recent cases where our customers had zero-downtime upgrades. We're very happy; I hope some blog posts are coming.
Michael: Zero downtime is very different from fully autonomous, though. Very, very different.
Nikolay: From fully autonomous...
Michael: Or self-driving. A major upgrade is very different.
Nikolay: Let's take one more step back. What is a self-driving car?
Michael: Yeah, great.
Nikolay: There are six levels defined by SAE, a kind of standard, from 0 to 5. Level 0 means not autonomous at all: a manual, regular car. And 5 is fully autonomous. And looking at the first few levels, I realized an interesting thing: they talk not about each feature individually, but about combinations of features. For example, level 1 could be either adaptive cruise control, so maintaining speed, but safely, right? Or maintaining lane. But not both. If it's both, it's already level 2, right? And there are several levels, and, for example, this Carnegie Mellon paper, Andy Pavlo's paper from 2017, discusses how to map this to database management systems, right?
Michael: A little bit. Not much, in my opinion, but a little bit, yeah, for sure. I looked at this paper in advance, and you shared these car levels; I'll link them up in the show notes as well. It struck me that level 4 to level 5, the last step, is a huge jump. It's like: here are loads of features in the car that will help the driver, and then at level 5, suddenly the driver does nothing. That feels to me like a potentially huge chasm; maybe there are a hundred more levels between 4 and 5 that we're going to need to break down at some point. It felt like a very hand-wavy way of saying: we already have level 4 features, so we're very close to having level 5. And I was unclear, for example, whether in a level 5 car a human could still take control if needed, or whether there was absolutely no way of doing that. That feels to me like a level that wasn't defined, and I believe there'll be more.
Nikolay: Yeah, yeah, let me explain how I see it. I think if you ask several people, they will answer differently. I also heard the Kubernetes ecosystem tries to map it as well, and some operators claim they have very high automation, but many people say they don't, and so on. So in terms of cars, let me walk through it. Level 2 is both of these options combined, and this is what Tesla Autopilot does, for example. I use it a lot. You just turn it on, but you must sit there, and officially you must keep your hands on the wheel and be ready to take control any second, basically. But it's still great. It maintains lane and speed; you just relax and spend much less effort. And I think we can think about this in databases as well. The next level is level 3, where everything is automated, but you still need to be ready to take control. And this is what, for example, Tesla Full Self-Driving is. Well, not everything is automated; it works under your supervision.
Michael: I'm just pulling it up. No, no, it's not quite that. Level 3, it says here, is for example a traffic jam chauffeur. So it can handle basic traffic conditions, like a traffic jam, but it doesn't account for all weather conditions, for example, and there's a bunch of other potential things. So it's not like...
Nikolay: There are limitations.
Michael: Exactly. And...
Nikolay: You still need to take control if needed. Basically, you need to be ready to take control. This is level 3, I agree. There are limits, but it can bring you fully automatically from point to point. This is what Tesla Full Self-Driving does. You sit in the driver's seat, and from point to point you can basically enjoy full automation of the whole ride, but if some bad condition occurs, then you need to take control and fix things.
For example, we can map this to a full major upgrade. By the way, when we think about autonomy, we also bring in additional features, like zero downtime. An upgrade could be autonomous but in-place and with downtime; somehow our mind wants some good features in addition to autonomy. It's a natural desire to have good stuff.
But we can imagine, for example, the whole thing automated. Under many circumstances it will work, but in some edge cases it won't, and you will need to take control and make some decisions before proceeding, or even postpone the whole procedure. This is very similar to Tesla Full Self-Driving, and I've experienced it. It really can drive you from point to point. But, for example, inside my property I have some roads, and it won't be able to drive there at all, because it's like: oh, this is not a proper road, you know.
Then there is level 4. It's also conditional, but there (again, my perception might be wrong) you can go to the back seat and sit there. You are allowed to relax completely, fully, but again, it will work only under some conditions. For example, there is Waymo. I tried it multiple times in San Francisco. It's amazing. You get into its Jaguar, you sit in the back seat, and everything is fine. But we've seen these YouTube videos where multiple Waymo cars just create a traffic jam among themselves, basically a deadlock, right?
Michael: So like if they sense
that they can't drive or they
spot something they can't deal
with, they'll stop as a safety
precaution.
And then what do you do?
You have to get out?
Nikolay: Yeah. In this case, yes. In the case of Waymo, you're a passenger, yes. So you can imagine, if we map it to major upgrades, there is a vendor who can intervene and take control sometimes, if allowed. But those who asked to perform the full major upgrade are passengers; they cannot make decisions, and this is good for them, because their mind is spent on product development, for example, right? And thanks to the millions of miles of experiments and real-life experience these cars already have, if something goes wrong, safety first: it will just abandon the trip, I mean postpone it and cancel, and another car will come later, right? So this is the approach. The whole thing is encapsulated, like a black box for you. You don't go down and make decisions according to some decision diagram. But it also has limitations.
And I think Waymo is the perfect example of level 4, because it works in San Francisco, in some areas, but you cannot drive to San Jose, which is usually slightly more than an hour's drive, because it's outside of coverage. And this can happen here as well, with major upgrades: if there are some extensions, for example, it's outside of coverage. We don't support this kind of upgrade, because this extension, I don't know, Citus or TimescaleDB, requires an additional approach, and we don't have it covered here, right? So this is what I think: we can map it, and why not?
But my main insight was from the deep research on feedback from Oracle users, DBAs and engineers. They say upgrades are great, both minor and major. Security is great: security controls, automatic procedures to level up security. These kinds of maintenance things are great. But when we talk about smart things, like index advising and so on, it's a hit-and-miss situation.
And here I had an aha moment, because I was thinking: actually, if you look at what people try to do, they try to invent automated configuration tuning, with machine learning and AI. This is what Andy Pavlo was doing with OtterTune. OtterTune was eventually closed, but by the end of OtterTune's life I had already noticed the shift, which had happened at Postgres.AI earlier: attention to query tuning and optimization, like the creation of indexes. And I see pganalyze is doing a great job, not resisting LLMs, which is also great. And some teams go inside Postgres and try to make the planner smarter.
With configuration, it's pretty straightforward for me: it's the Pareto principle. Take pgtune.leopard.in.ua, a very simple configuration, 80% of the job done in one go, really fast. It's just a heuristic-based, rule-based approach for OLTP or anything else. And it's good enough for many cases; we don't need any machine learning and so on.
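As a sketch of what that rule-based style looks like (the values assume a hypothetical dedicated 16 GB OLTP server and are illustrative, not recommendations):

```sql
-- Rule-of-thumb tuning in the spirit of pgtune; numbers assume
-- a dedicated 16 GB OLTP box and are for illustration only.
ALTER SYSTEM SET shared_buffers = '4GB';          -- ~25% of RAM (needs a restart)
ALTER SYSTEM SET effective_cache_size = '12GB';   -- ~75% of RAM
ALTER SYSTEM SET maintenance_work_mem = '1GB';
ALTER SYSTEM SET work_mem = '32MB';               -- per sort/hash; watch max_connections
ALTER SYSTEM SET random_page_cost = 1.1;          -- assumes SSD storage
SELECT pg_reload_conf();                          -- applies what doesn't need a restart
```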
Michael: And even when you say "for many", I think it's also about time. So for a while it will be good, and for a long time it will be good enough. So it's like...
Nikolay: Yes, I agree. But then you need to tune. My way of tuning is to conduct experiments. You know this very well: how to make experiments faster, cheaper, reproducible and so on. These kinds of things.
Michael: Is it worth... because I think many people would count level 4 as self-driving, but there are still enough caveats that you lose a lot of the benefits of self-driving cars. Let's go back to cars briefly. I'm really excited and optimistic about self-driving cars. I love the idea of being able to get on with something else while something drives me. I don't mind that it might not be the fastest way, or that it might not drive completely optimally.
Nikolay: Not the safest. What about safest?
Michael: Probably the safest, but not the fastest, sorry. Probably safer than me, but maybe a bit more gentle; you know, maybe it wouldn't take the yellow light when a human driver would, that kind of thing.
But I love that you can just watch a movie, or chat to a friend easily, or play a board game in the back. You do whatever you want, you get so much time back, especially in America, where a lot of time is spent driving. It makes so much sense to me that self-driving cars are a huge unlock for a lot of people. But largely only at stages 4 and 5, at the highest levels. Cruise control is great, but I still have to concentrate. I still have to be watching the road. I don't actually gain that much. And if we go back to Postgres, I feel like a lot of the automation features are great.
Nikolay: Yes.
Michael: But we still have to concentrate. We still need the person. We still need the DBA.
Nikolay: The driver.
Michael: And as long as we still need the driver, and as long as we still need fail-safes down to humans, all I see is this gradual change. Maybe this is where the driving analogy breaks down a little bit, but maybe fewer humans per server: maybe the DBA team for a company will be smaller on average compared to how it was in the past, and I think we've already seen that over time. But I'm struggling with that last step until we get to those levels, which feels to me like a long way off, especially given the experience you're describing with Oracle, and the experience we saw with very smart people trying to automate a lot of this stuff with a lot of AI, but not even LLM stuff, right? A lot of the research in this area has been machine learning and other longer-researched AI methodologies that have lots of real-world use cases, and even there we've seen mixed results, and the models really struggled. In the experience I've had talking to customers that used OtterTune, for example, I feel the constraints were not as clear as in driving, or the use cases were slightly different, or the performance trade-offs that different people have in different cases are subtle enough that you can't set the exact same guardrails for everybody. And at that point it breaks down. And one more addition: performance cliffs are so real that if you change one thing, and it looks like it's going to be great, as soon as you hit a cliff it's a disaster, and then recovering from those disasters is actually a real problem.
Nikolay: And troubleshooting a disaster is also a problem. Root cause analysis is a problem.
Michael: And arguably they get harder when you involve automation, because the more it's automated, the less people actually know what was changed, when, and why. So I think it can get tricky.
Nikolay: I disagree with you here, because you must be an expert, but if you have a high level of automation, you move much faster. Take Cursor, give it a lot of pieces together, and explain how you approach the methodology of analysis; this is what the expert needs to bring, and then you move much faster. This is how I moved into this area completely. Right.
Michael: Yeah.
Okay.
Nikolay: So people say that in Oracle, this works and that doesn't. What works? Quite simple things. Like I said: upgrades, maintenance, security stuff. Well, not simple, but boring, you know. And of course replication and backups. For me, HA and DR are like auto-steering, maintaining speed, and maintaining lane. Basics. Cars must do this, so a database must have good HA and DR. If we look closer, there are actually issues with both HA and DR which will prevent us from reaching a very high level of automation, but we can dive into this later. Anyway, the boring stuff lacks automation. And remember, I mentioned that levels 1 and 2 talk about combinations of features. So if we start analyzing each feature individually, we cannot apply the same classification, because the classification talks about combinations of features.
Michael: Yeah, sure.
Coming back, actually, just to make sure I understand: what have Oracle done in terms of security? Can you give some example automation features?
Nikolay: Well, I only know a little. I would rather say what we should do in Postgres, and this is the topic I'm least ready to discuss now, because it's on the roadmap, but right now we're focusing on different areas. I can just speculate: identify potential threats, check permissions and roles. For example, I know organizations which use the same superuser for all human DB engineers, and this is quite easy to identify in any organization. I had a couple of them on consulting contracts which went through the IPO process. Before that, you have an audit, right? And during the audit, they ask specific questions. Some of them are quite silly, I would say, but some are quite good. And you just inspect your pg_hba.conf, you inspect the user model, you inspect how multi-tenancy is organized. We had an episode about that, right? So these kinds of things can be analyzed automatically, and so on. I don't know the details about Oracle.
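As an illustration, audit-style checks of this kind are easy to express in SQL; these two queries are a sketch, not Oracle's tooling or Postgres.AI's actual checks:

```sql
-- Which roles are superusers? A single shared superuser used by all
-- engineers shows up immediately here.
SELECT rolname FROM pg_roles WHERE rolsuper;

-- Overly permissive pg_hba.conf entries, inspected without leaving SQL
-- (the pg_hba_file_rules view is available since Postgres 10):
SELECT line_number, type, database, user_name, address, auth_method
FROM pg_hba_file_rules
WHERE auth_method = 'trust';
```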
I just saw feedback that engineers really appreciate this stuff, and that they appreciate automated configuration and automated index creation less.
Michael: Appreciated less, or it gets it wrong? When it gets it wrong, it's more painful. Do you see what I mean?
Nikolay: Mixed results, mixed results. I don't know; there's a lack of trust in some minds.
Michael: For example, I catch up with customers from time to time just to hear what they're doing, what they like about products, what they don't. And I was speaking to a customer of mine that did try to use OtterTune for a while. It'd be interesting; maybe we should invite somebody from that team on to discuss what happened there. Why did it shut down?
Nikolay: I can tell you why: people don't need configuration tuning that much.
Michael: Okay. Well, I also think there might be other issues. So there's definitely not needing...
Nikolay: I tell you, the big need for configuration tuning exists only if you have, say, 10,000-plus clusters. Then you can say: okay, we are going to save 5 to 10% of the money just with tuning, or the workload will be reduced. And we know very well that a really bad plan can screw up all the efforts of configuration tuning.
Michael: Well, yes, but I think it's worse than that. In addition to it not being needed that often, and therefore subscription-based models not working that well, this customer was telling me they moved from a mental state of "when something goes wrong, let's dive into what happened" straight to "when something goes wrong now, what did OtterTune change?" And that was a real shift. It became like a...
Nikolay: Trust.
Michael: Trust, but also that must be based on the fact that it made changes in the past that made things worse. So it's not distrust for no reason. Every now and again, if you change something, you hit a performance cliff. And it's unexpected. Or maybe it's not even always a performance cliff; maybe it has another unintended consequence that you care about more. Probably not in a lot of these cases, but I think they talked about, for example, making the mistake of letting it configure some parameters that even affected durability and things like that. So depending on what you allow it to change, there might be unintended side effects. And putting guardrails around what you will and won't let it do is actually harder than it sounds, I think. It's very hard.
Nikolay: Performing changes the enterprise way is extremely complex. I'm very grateful that seven years ago I was working with Chewy. They were preparing for an IPO, and I remember the CTO was ex-Oracle, and the discussions we had, and the resistance they had to any change I proposed. It taught me this enterprise approach. I'm very grateful; it was a great experience for me. And I realized: actually, if you want to be serious about changes, any small change should be very thoroughly tested, with experiments, all risks must be analyzed, and then there should be a plan to mitigate if a known risk occurs, right? And AI almost doesn't help here, you know; this is a framework you need to build without AI first.
Michael: Yeah. I don't necessarily agree.
I think AI could really help with
these things.
When I say AI, I'm including machine
learning and not just the
latest LLM stuff.
I just think we need to define the constraints really clearly, define what we care about really clearly, and make it really clear that we care more about reliability and durability than we do about performance.
Nikolay: Yes, yes.
Michael: And that's almost always true. And I think that might be more core to the reason these performance tuning tools haven't yet succeeded: we haven't yet nailed the reliability and durability stuff. That would be my theory as to why they didn't succeed. Because even if they did help performance almost all the time, if they ever hurt reliability, that's not a trade-off most organizations are willing to make. And that's a difficult thing to tell a tool that's trying to optimize for better performance.
Nikolay: Yeah, I agree. And durability has issues. There is a good article from Sugu, just published on his blog, about synchronous replication issues, and we also know the very good talk by Alexander Kukushkin about issues with synchronous replication. So durability is a must-have, and I agree with you about targets: durability, availability, reliability must be number one, before performance.
By the way, I also remember from that research that people appreciate automated analysis and control of costs. And this can start quite simple. I remember talking to one huge organization; their database director told me: you know what, it's cool stuff what you're showing in terms of experiments and performance tuning, like query tuning experiments with Database Lab and so on, but the number one problem we have is abandoned instances, and how to stop losing a lot of money on them. In big organizations it's a big problem. And cloud providers still don't offer good tools; it still takes a lot of effort to understand the structure of spending, right? And to do it proactively; usually it's a little late. They are not interested.
Yeah, right. So anyway, back to my aha moment.
Michael: Yeah, sure.
Nikolay: People from academia, really great people, right? Great minds. They try to build really cool stuff: automated parameter tuning, automated indexing, or even going inside the planner to create an adaptive query optimizer.
Michael: I saw even more extreme things. In the paper, they were talking about choosing whether tables should be row-oriented or column-oriented based on the workload they're observing. So they're trying to attack really cool areas.
Nikolay: This is great. And it has always been so: academia attacks things that are really detached from reality.
Meanwhile, I realized we had already implemented automated indexing with multiple teams, and this is what people really need. And lately I realized that in consulting we almost always say you need automated re-indexing, but we didn't have a polished solution. We have multiple ways to do it. I always said: take this tool someone from our team developed as a side project, and then polish it. And it's only about B-tree indexes, and this and that. Also, the bloat estimates might be off; we know very well they are just estimates, not real numbers from, say, a reindex on a clone, right? And I realized this problem is actually not solved. And it's a boring problem. And we can solve it.
That's why it's the number one thing we are going to release right now; it's about to be released. We call it pg_index_pilot. And there is a good roadmap inside this project. It's going to be really simple. A guy who works part-time with me right now, Maxim Boguk, one of the most experienced Postgres DBAs I know, much more experienced than I am, created what I basically consider a prototype. I said we'd fork it, and then we started iterating on it.
The idea is simple. I call it the Boguk number: you take the index size and divide it by the number of live tuples in the table. Let's forget about partial indexes for a while. So we have some ratio, right? When we have just created the index, let's consider this ratio perfect. And let's consider tables which exceed, say, a million rows. Checking this number costs nothing; you can put it into monitoring for all indexes. It's super fast to get the number of...
Michael: Because these aggregates are stored, right?
Nikolay: Yeah, yeah. It's from system catalogs, system views. Immediately you get it.
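A minimal sketch of such a check, using only catalog data; the column choices and threshold are illustrative, not pg_index_pilot's actual implementation:

```sql
-- Index size per live tuple (the "Boguk number"), computed from catalogs
-- only, so it's cheap enough to poll from monitoring.
SELECT
  i.indexrelid::regclass AS index_name,
  i.indrelid::regclass   AS table_name,
  pg_relation_size(i.indexrelid) AS index_bytes,
  c.reltuples            AS live_tuples_estimate,  -- -1 if never analyzed (PG 14+)
  pg_relation_size(i.indexrelid) / nullif(c.reltuples, 0) AS bytes_per_live_tuple
FROM pg_index i
JOIN pg_class c ON c.oid = i.indrelid
WHERE c.reltuples > 1e6   -- only tables over ~1 million rows
  AND i.indpred IS NULL;  -- skip partial indexes, as discussed
```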
And then, over time, you see this number degrade. Why? Because some pages inside the index are not full; they're half empty, sparse. So at some point you say: oh, it's time to re-index. There are pros and cons of this approach compared to, say, the traditional one based on the bloat estimates everyone uses. One of the big cons: it required some effort on our side to get away from superuser. We did it.
And oh, one more thing. On purpose, we said this component is going to live inside the database: self-driving inside the database. We don't need external means, like something installed on an EC2 instance, or Lambdas. We don't need anything; it will be inside. This means it's running as PL/pgSQL code. We know that since Postgres 11, stored procedures have had transaction control, right? So we could go that way, but we need to reindex concurrently, right? And REINDEX CONCURRENTLY cannot be wrapped inside a transaction block.
So unfortunately we need something like dblink, right? And it's a challenge to do it properly on RDS, for example, because with dblink you need to expose a password, and you don't want to store it in plain text. So I remembered a very old trick I used many years ago: dblink over postgres_fdw. And this is how we do it right now.
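A rough sketch of that trick; the server, role, and index names here are hypothetical:

```sql
-- dblink can connect through a postgres_fdw foreign server, so the
-- password lives in a user mapping (in the catalog) rather than in a
-- plain-text connection string inside the code.
CREATE EXTENSION IF NOT EXISTS postgres_fdw;
CREATE EXTENSION IF NOT EXISTS dblink;

CREATE SERVER index_pilot_loopback
  FOREIGN DATA WRAPPER postgres_fdw
  OPTIONS (host 'localhost', dbname 'app');

CREATE USER MAPPING FOR index_pilot
  SERVER index_pilot_loopback
  OPTIONS (user 'index_pilot', password 'secret');

-- REINDEX CONCURRENTLY can't run inside a transaction block, so it is
-- shipped out through this autonomous connection instead:
SELECT dblink_exec('index_pilot_loopback',
                   'REINDEX INDEX CONCURRENTLY my_bloated_index');
```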
And there is another kind of limitation. For this thing to start working, you need a baseline. Baseline means you need to reindex everything first, or to bring this data from a clone; this is an idea I'm going to implement very soon, to avoid a full reindex, because sometimes our customers have 10-plus-terabyte databases, and it's not cool to ask them to reindex everything: it will take forever, and there's also a big impact when you reindex quite fast.
But a good benefit of this approach: it's not only about B-tree. You can take care of GiST, SP-GiST, and even HNSW and others if they degrade. Basically, we measure a kind of storage efficiency for an index. It's super trivial, and I believe in this simple approach. I think we are going to have it. I also talked about this a couple of weeks ago with Andrey and Kirk on Postgres Hacking, on Postgres TV.
And we started doing something mind-blowing. Andrey just said: let's just implement merge. Because, you know, the B-tree implementation in Postgres only has split; it cannot merge pages. And since Andrey's PhD work is in the area of indexes, it was great to be able to start this work, and I'm looking forward to it, like everyone.
Michael: Wait. So if, for example, an area of the index starts to get sparse because we've deleted some historic data, and we're not going to insert historic data back into that part of the index, it could proactively... it's kind of self-healing. Wow, cool.
Nikolay: So our project would be archived at some point, I hope. I'm not an expert in Oracle; I've never been an expert in Oracle, and not a user either. The last version I used was in 2001 or 2002; it was 8i. But people say Oracle doesn't have this need for re-indexing. SQL Server does have this need. Over time, index health declines. We talked about it so much; I said it like a mantra: everyone needs re-indexing at some point. And we know the work by Peter Geoghegan and Anastasia Lubennikova: in Postgres 13 and 14 they implemented deduplication and other improvements. That's great. But still, there is no merge, so pages cannot be merged...
Michael: Automatically.
And there was also some work, I think from Peter Geoghegan and others, to help avoid page splits in more cases: not just the deduplication work, but the bottom-up deletion. So avoiding them getting into this state was one thing; helping them heal when they do is another.
It also strikes me that a lot of this stuff could live in core, right? We already have some automation features. We already have autovacuum, which does about three or four different jobs. So we have some groundwork here already. We have tools like pg_repack and pg_squeeze, not in core, but there are ideas about moving more of their features into core. So we have some of this in core, and some of it in popular extensions that are supported by many clouds. It feels like the project is already naturally going in this direction, maybe slowly, and maybe not in all the areas you want it to. But what does the end goal look like here? Is being in core ideal?
Nikolay: In my opinion, we just need to solve very complex problems. I know that in some big companies' managed Postgres services, sometimes one experienced DBA is responsible for a thousand clusters. A thousand. It's insane. But we need to be prepared to be responsible for a million clusters, because AI builders will bring us a lot of clusters. The times are changing really fast, and Postgres needs not to lose this game. By the way, right now, if you check Hacker News trends: it was growing, and not only job postings, as I usually mention, but everything, all discussions on Hacker News, until the beginning of last year, 2024. And then we have a slight decline. I think Postgres has a huge challenge right now: it needs to be much more automated.
If things need to be in core, they should be in core. But not everything should be in core; we know autofailover is still not in core, right? Patroni. And I just asked Kukushkin whether he has considered automation of restarts and switchovers without downtime, because this is what people want and expect from a highly automated system. He said no, and I started joking: I said, Patroni is outside of Postgres because it's not the job of Postgres to do HA or failover, and now I expect you, the Patroni maintainer, to tell me that an automatic zero-downtime switchover is not the job of Patroni, so I need to implement another layer on top of it, right?
This is actually what we already do. For zero-downtime upgrades, we do this. And by the way, I can share this now: on our consulting page we mentioned GitLab, but also Supabase and Gadget.dev. These companies took tech from us, and I have official quotes from their managers, so I can share this news. I think we developed really great automation for zero-downtime upgrades, which are not only zero-downtime but also, of course, zero-data-loss, and reversible, reversible without data loss as well. These are the perfect properties, but it requires a lot of orchestration. Fortunately, since Postgres 17, things are improving, and fewer pieces need to be automated.
So let me go back and finally explain my humble vision of what I have right now. I realized that the boring stuff needs a much higher level of automation. This is one. Second, I realized that for the Postgres.AI team this is exactly what we are doing, because through consulting, people bring us these topics. And I also realized that in every area, if we think about automating a feature, we can apply a simplified approach to classification.
So, if every single step must be executed manually by an engineer, a CLI call like pg_dump or pg_upgrade, or SQL snippets, this is manual. Then there can be bigger pieces: say, the whole procedure consists of two or three big pieces which are combined, and the engineer only makes choices between pieces, to proceed or not. For example, our major upgrade consists of two huge pieces: the physical-to-logical conversion plus the upgrade itself, bundled for some specific reasons, and then the switchover. Two big steps, and inside them there is a high level of automation: you just call a playbook and it executes, right? In this case, it can already be considered level 1, say. Level 1, or maybe 2, I don't know.
And then, if we can fully relax and go to the passenger seat and just say: okay, I approve that you need to do everything, and you will do it yourself; I mean, Postgres or an additional system will do everything itself, like a full major upgrade with switchover. In this case, say it's the next level. You're in passenger mode, but there are limitations: if it encounters some problem, it will stop, revert, postpone, and you need to approve things. And the highest level is when the system decides: oh, it's time to upgrade, and then schedules it, like: oh, we have a low-activity window on the weekend. You're merely notified about it. You can probably block it, but it moves on by itself. This is the highest level of automation.
And back to re-indexing, which I chose as the lowest-hanging fruit. Everyone needs it. Maybe after our episode, people from RDS and Cloud SQL (I know some of them are listening to us) will rush into implementing this. I just see that everyone needs it, and nobody offers it in managed Postgres. So I'm aiming at the highest level of automation for pg_index_pilot right away. It will decide when to re-index, it will re-index, it will control the activity level so as not to saturate disks, and so on. You can check the roadmap; I explained it in the README of this project. And this is open source.
Because I actually truly believe that at some point this project won't be needed. If Andrey's idea for merge works, maybe, I don't know. This is a dream, right? Forget about index bloat. I guess it will be super hard. I tried to research why it hasn't been done, and I didn't see why. Maybe it's scary: this is the basics, the foundation of the whole of Postgres, of any cluster, right? B-tree.
Michael: Yeah, well, I guess: when would it be done? Would it be done by a background worker, like autovacuum? Or at what stage does it make sense to do it?
Nikolay: Well, it should be synchronous. Oh, this is a good question. I'm going to ask Andrey about this. Yeah. Because split is synchronous.
Michael: Yeah.
I also think the thing you brought up just at the end there is super interesting: controlling the rate so as not to saturate disks. This is an interesting thing with trade-offs again. If you're on Aurora Postgres and you're paying some amount per IO, you probably don't want your indexes rebuilt constantly, all the time. What you really want is to fix the root cause. It's probably an application issue: why are you updating the same row thousands of times a second?
Nikolay: What's the root cause of the bloat?
Michael: So, we just had a...
Nikolay: A case of a queue-like workload updating the same rows multiple times. This can be identified. Yes, yes.
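As an illustration, a first pass at spotting such update-heavy, bloat-prone tables can come straight from the standard statistics views (the ordering and limit are arbitrary choices):

```sql
-- Update-heavy tables are bloat candidates; a queue-like workload
-- shows enormous update counts relative to the live row count.
SELECT relname,
       n_live_tup,
       n_dead_tup,
       n_tup_upd,
       n_tup_hot_upd   -- HOT updates are cheaper; a low share hints at index churn
FROM pg_stat_user_tables
ORDER BY n_tup_upd DESC
LIMIT 10;
```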
Michael: Yeah. So I'm excited to see what you build. I think you're right that more automation in Postgres would be good, and more automation in extensions around Postgres would be good. But a lot of the issues I see are application-side issues. Even if we make Postgres completely self-driving, there is still this kind of application-level issue that will be...
Nikolay: A mistake there can bring all your efforts to the ground, yeah.
Michael: Yeah, exactly. I agree.
Nikolay: That's why, in my blog post published during our launch week, I identified 25 areas. Maybe we will adjust those areas, of course; it's not a final list. And we are targeting three right now. I hope we will expand this list to five or six next year.
Once we have quite good building blocks, we'll think about the whole system. But the central part of all these building blocks is a new monitoring system, which is good both for AI and for humans. This is what we're already actively building. We replaced all the observability tooling used in our consulting with the new monitoring system, which is called Postgres AI monitoring; we talked about it separately. This is going to be the source of insights into why things are wrong. We cannot self-drive the application yet, although in some cases it might be possible, because, for example, Supabase controls not only Postgres but also the REST API level, right? So some things might be handled down there, but in the general case, we cannot; we can only advise and so on. But eventually, things like serverless workloads, I can see living together with the database. In that case, we can discuss full self-driving of something. But we are very far from that, I think.
And we have limited resources. I want to admit that I'm targeting very narrow topics right now: very boring, but really doable, and we see demand. Things like partitioning, help with partitioning. It's a nightmare for an arbitrary engineer. We say: oh, declarative partitioning. But, for example, partitions are not created automatically. Okay, there's pg_partman, but you need to combine these pieces somehow. The level of automation there is terrible.
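To illustrate the gap: the declarative part is easy, but nothing in core creates future partitions for you (the table and ranges below are made up for the example):

```sql
-- Declarative partitioning: the parent is one statement...
CREATE TABLE events (
  id bigserial,
  created_at timestamptz NOT NULL
) PARTITION BY RANGE (created_at);

-- ...but every partition is a manual (or externally automated) step:
CREATE TABLE events_2025_01 PARTITION OF events
  FOR VALUES FROM ('2025-01-01') TO ('2025-02-01');

-- Without a partition covering the row's range, the insert simply fails:
-- INSERT INTO events (created_at) VALUES (now());
-- ERROR:  no partition of relation "events" found for row
```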
Michael: Yeah, and these are the kinds of things... a couple of things. I'd love to see more of this go into core; I think there have been improvements to partitioning in core, many over the last few years, and continuing. And the other thing is, I think there's a lot of focus on fewer people. With this automation, a lot of the focus is on doing more with fewer humans. I'd love to see more effort into what would happen if we didn't try to reduce the humans, but instead tried to increase the reliability. You know what I mean? There's a lot of talk about how we cut costs.
Nikolay: We already have, I think, five companies that are very fast-growing AI companies, very well known already. And I see a very different approach to Postgres.
For them it's a natural choice, like: let's start with it. But then: oh, we need to partition terabyte-size tables. Let's estimate the work. Oh, that much work? Okay, maybe we should migrate to a different database system. They are moving really fast, and they are not attached to Postgres, basically. So I'm scared for Postgres. That's why I'm talking about automation: there should be a massive shift right now, and the resources of my team are not enough. That's why I'm talking about it.
I'm going to put my whole focus on it. In consulting, we also choose only paths that are highly, fully automated. And I'm looking for more use cases, and maybe for partners who think similarly, right? This is a call to the Postgres ecosystem: it's not enough right now. Postgres might lose this game. Postgres won multiple challenges before, like object-oriented and NoSQL, but with AI, everyone thinks only about pgvector and storing vectors. It's not that. Not everyone needs vectors. People build some app, it needs a database, maybe with vectors, maybe without, but what they expect is a higher level of automation and easier scaling.
So I'm super happy to see new innovations in the area of sharding, right?
Michael: Yeah, I do think that's actually more important, partly just for the marketing story.
Nikolay: Than partitioning?
Michael: For those, yeah.
Because, for example, when we had our hundredth episode and spoke to Notion, Figma and Adyen: Notion skipped partitioning and went straight to sharding. They decided that, with all of the partitioning downsides, and there are quite a few limitations, for their multi-tenant architecture it actually made more sense to shard instead of partition. It didn't feel like one of those cases where they were moving super quick and made a rash decision; it felt like a really considered engineering decision, and I actually think it was the right call. And if more of these companies come in and build fairly seamless sharding that doesn't add any complexity at the application level, I could see people tempted into that, even if not for good engineering reasons, just for the marketing reason of: I don't have to think about scaling this, and I don't have to deal with terabyte tables.
Nikolay: You know this normal distribution meme, right? Yeah, yeah. Usually what's on the right side is unexpected. I don't know which camp I'm in. I was thinking one Postgres node is cool enough; then I was thinking it's not cool enough because of performance cliffs. Now I'm thinking maybe it depends, because in some cases it's much safer to avoid all the performance cliffs rather than allowing 100-plus thousand transactions per second on one node. And how is it called? Resiliency, right? If one node is down, it's just one shard, only part of the system. Maybe yes. But at the same time, it's so cool to see projects which require only one node. They are isolated; they don't need a whole bunch of shards and clusters, a cluster of clusters; the term "cluster" is heavily overloaded, right? And I think there's real power in having just one single node, sometimes without replicas. I see projects without replicas, because times change, right? And I'm open, you know; my perception changes over the years, sometimes it's a pendulum. So it depends: the regular, normal answer from consultants, it depends.
But sharding: it's really great to see that it's coming from multiple teams; there will be competition. So yeah, the future looks bright. But who will be helping to choose a proper sharding schema?
Michael: Well...
Nikolay: Yeah. And to rebalance properly. I think those who build this think about it as well: automation, and the problem of choosing the time for rebalancing, fully automated.
Michael: Well, we talked to Sugu; he said it's inevitable that you're going to have to change your sharding scheme at some point, so designing for that upfront seems really important. Vitess handles it. So yeah, interesting times ahead. Personally, I'm a bit worried about the complexity of these systems.
Nikolay: Let's start with simple things.
Michael: Yeah, starting simple. But I also like the idea that if I'm a Postgres user, I can still understand the system, roughly what's going on, even if I'm a full-stack developer, even if I do the front end and the back end. I know it's already very complex. I know there are already a lot of internals that you kind of need to know about to make a performant system. But I hope we can hold on to that for a while.
Nikolay: Well, I truly believe that what my team does is going to help, because we observe many problems. And every time I'm saying: guys, we need to write down a how-to, and show it with an experiment, so users understand what's happening. Write a how-to for next time. And when writing how-tos recently, we write them both for humans and for AI. So next time some RCA, root cause analysis, is happening, if you have our how-tos injected into your Cursor, for example, it's going to take into account situations which are written down: how to reproduce them, how to troubleshoot them, right? So, something else is coming. I'm not going to spoil it, but RCA and troubleshooting is one thing we will attack early. We are preparing pieces for this, and one of the key indicators of success here is that non-experts understand what's happening.
Michael: Yeah, well, that's the area that I specialize in. You know, I'm actually betting a little bit on humans staying in the loop for quite a long time, and that there will, well, not always, but for a long time, be categories of issue that we still need somebody to look into. And I kind of feel like the median level of Postgres knowledge of the people having to look into those issues is probably going to go down, based on all the trends you're talking about. It depends: if we only have a few experts shared between lots of companies, maybe that's not true. But if we do have a lot of individuals starting with vibe coding, or doing full-stack, single-person companies running the whole product start to finish, those folks have to know a lot about a lot of things. They can't know that much.
Nikolay: Have you seen the numbers Supabase shared, how many clusters were registered and how fast it's growing? Can you imagine the average knowledge of Postgres there?
But my vision is that we collect knowledge pieces, we experiment, with automatic experimentation and so on. But then, and this is what I see with our customers, some human is needed to explain things properly to other humans, to answer questions properly, you know, to build trust and confidence and so on.
Michael: Yeah, but I guess the question then is: where do the tools live? Can the end user use a tool to get help? Or does the Supabase team use a tool to get help? Or the consultant that the Supabase team employs?
Nikolay: My answer is: who is the tool for?
Michael: I'm just saying it depends on who you count as the user, in terms of how much Postgres knowledge they have. I'm talking about that end user.
Nikolay: Yeah. For example, with DBLab we already went down this path. We moved from a couple of guys answering all the questions backend engineers have, like how a plan works, what to do about it, which index to create, to backend engineers experimenting themselves with DBLab. And only if something is unclear do they call an expert for help. About 90% of the questions are answered by backend engineers without involving Postgres experts, and the expertise in backend engineers' minds grows as well. I'm just thinking that this approach we had for query optimization can be applied, following the same kind of scheme, to many other areas.
Michael: All right.
Probably enough for today.
Nikolay: Thank you so much.
We went much deeper into specific areas than I expected, and I enjoyed it a lot.
Thank you so much.
Michael: Nice.