
Multi-tenant options
Michael: Hello and welcome to Postgres.FM, a weekly show about
all things PostgreSQL.
I am Michael, founder of pgMustard, and I'm joined as usual
by my co-host Nikolay, founder of Postgres.AI.
Hey Nikolay.
Nikolay: Hi Michael, hi.
Michael: And today we are joined by special guest Gwen Shapira,
co-founder and chief product officer at Nile, to talk all things
multi-tenancy.
Hello Gwen, thank you for joining us.
Gwen: Thank you for having me, it's very exciting for me to be
in your show.
Michael: Oh, we're excited to have you, thank you.
So to start perhaps you could give us a little bit of background
and what got you interested in this topic of multi-tenancy
in general?
Gwen: As with many things, it started with an incident, but this time not actually one of mine. When my co-founder and I started Nile, we actually started with a very different idea.
After about 9 months, we were like, this idea is not working
very well.
We developed some things, we're not finding the market we hoped
to find.
We were sitting in the Hacker Space over in Mountain View, and
we're like, what did we learn?
We talked to 200 companies all doing SaaS.
What have we found out as a result?
And the thing that called to us is that very early on, basically when the first lines of code are getting written, you have to choose a multi-tenancy model.
And then about 2, 3 or 4 years later, depending on how fast you're
growing, you have to change it.
And we heard a lot of stories on what caused people to change
it and whether they regret earlier choices or they're like, we
didn't know better and how things went for them.
And then we started looking in different blogs and we found so
many by very famous companies with either incidents where something
that was done to a single tenant caused a whole chain of events
that took down their entire system, sometimes for days.
And also a lot of slightly better stories, how we sharded our
highly multi-tenant database.
And we found story after story after story on how people had
to re-architect their entire database, which is, as you guys
know, extremely painful to do after you're a successful company
3 years in.
And we're like, this is a good problem.
So many people have it.
It is so common.
My past was in databases, not as much Postgres, more Oracle and
MySQL.
But I've seen this problem again
and again in all kinds of companies.
My co-founder has seen this problem again and again in all kinds of companies.
This is such a good problem.
Everyone has it and nobody's working
on it.
Why is nobody working on it?
And that's how we got into it.
Michael: Yeah, nice.
Should we go back then in terms
of how do you like to describe
the different models or the different
options that people have
in the early stages?
Gwen: Yeah, so everyone basically
starts with 1 out of 2.
And I'm using AWS terminology,
even though there is other terminology
that people apply to it.
AWS calls it the pooled model versus
isolated model.
And in a pooled model, you basically
create your tables as normal,
and then add a tenant ID Column
to each and every 1 of your tables,
pretty much.
Some, maybe not everyone, some
have shared data, but most of
your tables are going to end up
with a tenant ID Column that
tells you which tenant this Row
belongs to.
Very easy when you start out, and all you have to do is sprinkle some WHERE clauses every now and then, and you're pretty much good to go.
How hard can it possibly be?
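For listeners who want to picture it, the pooled model might look something like this (table and column names here are just illustrative):

```sql
-- Pooled model: one shared set of tables, each row tagged with its tenant.
CREATE TABLE invoices (
    id         bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    tenant_id  uuid        NOT NULL,
    amount     numeric     NOT NULL,
    created_at timestamptz NOT NULL DEFAULT now()
);

-- Every query has to carry the extra WHERE clause:
SELECT sum(amount)
FROM invoices
WHERE tenant_id = $1  -- forget this condition and you leak another tenant's data
  AND created_at > now() - interval '30 days';
```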
The place where it gets you is that you have no isolation. Everyone is in 1 big pool.
And if you have a problem tenant: if someone grows really, really large and suddenly Queries start getting slow for them, or if you need to do an upgrade and 1 Customer absolutely refuses to accept changes or needs their own time window.
There is a lot of different ways
you may discover that by putting
all your Customers in 1 big pool,
you save a lot of effort, you
save a lot of money.
This is by far the cheapest option
you're going to have.
It's shared resources in 1 Database.
But you are not allowing yourself
to do anything specific for
any 1 Customer should they need
it.
The other approach is basically
the reverse.
It gives every Customer its own
Database or sometimes its own
schema.
This still counts as isolated,
even though It's not all that
isolated.
You share quite a lot in that scenario.
But the schema is separate.
And it's quite a bit easier to
move them out if needed, if the
schema is separated.
And In this scenario, first of
all, there is a nice benefit that
you get help from a lot of popular
frameworks.
I think Ruby has one; I think it's called the Apartment plugin, for having this kind of multi-tenancy model.
Django has something.
So a lot of very popular frameworks
have something that helps
you in that model.
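The schema-per-tenant flavor of the isolated model might be sketched like this (tenant and table names are illustrative):

```sql
-- Isolated model: each tenant gets its own schema with its own copy of the tables.
CREATE SCHEMA tenant_acme;
CREATE TABLE tenant_acme.invoices (
    id     bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    amount numeric NOT NULL
);

-- The application selects the tenant by switching the search path,
-- so queries need no tenant_id column at all:
SET search_path TO tenant_acme;
SELECT sum(amount) FROM invoices;
```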
But if you accidentally grow to
a large number of databases,
it starts being very painful.
Obviously, a database with a large
number of objects is no longer
as easy to work with.
Suddenly, you start learning how
much space and memory the catalog
can really take when you have connections.
Suddenly you start learning that
pg_dump can take a very long
time.
If you actually have a database
for each tenant, then doing any
kind of maintenance on 100 databases
is already not fun.
If you end up going into a thousand
databases, it's really not
fun.
And if you think about it, a lot
of SaaS have customers in the
hundreds of thousands, not just
thousand.
So it becomes very painful exactly
when you grow.
Nikolay: Yeah.
For example, an upgrade includes dumping and restoring the schema. And we've had cases with 250,000 tables, 2 million indexes. Well, the indexes themselves are not dumped, but the tables are.
Gwen: Exactly.
This is worse than no dump, right?
If only we could dump indexes.
Nikolay: Yeah, and then you need
to update statistics after upgrade
for all of those tables.
It's a nightmare, honestly.
Gwen: Yeah.
On the other hand, at least you
get to do it to a customer at
a time.
Imagine that everyone is in 1 really
big database and now you
have to upgrade all of them together.
Nikolay: Yeah.
Yeah.
It's painful, all this.
Gwen: Yeah.
And then there is the mixed model, which I think pretty much everyone ends up with, where you start with the pooled model, and as you grow, you shard it. But you're actually pretty smart, and you realize that not all customers are the same size, and you can have some dedicated shards for your biggest or most sensitive or most demanding customers.
And this is, I think, if you look
5 years into the life of a
company, I would say that this
is the dominant model.
Some variation of: we have shared databases with a pool, and then some dedicated databases for specific customers.
Nikolay: What is the problem we're
trying to solve?
Is it security or performance or
both?
And if we go back: when some customers share this pool, it affects the security goal, right? So we don't achieve it.
Gwen: Absolutely.
So, first of all, this is 1 driver for people actually starting with the isolated model.
They know that they're going into
a sensitive area.
They're focusing on SaaS for healthcare,
SaaS for finance.
Those companies definitely start
with isolated model and try
to figure out how to manage large
number of databases.
A lot of times those companies
don't become huge.
There are only so many hospitals in the United States, but they still have to build all the tooling to manage a large number of isolated databases.
For other companies, it's more
complicated.
I would say maybe 70% of the time,
the reason for eventually
moving customers out and sharding
and isolating would be performance.
It's amazing how many performance
problems can be solved by just
having less data in each database.
On the other side, there are the stories where 2 years in, suddenly a very sensitive customer shows up, or you want to sell into... like, you thought you were building a normal CRM or some kind of a RAG database, but then a healthcare company shows up, a bank shows up, or even worse, a government shows up, and they show up with a list of demands, and since they usually have good amounts of money to back those demands, there is a lot of incentive to figure out a solution for them.
Michael: Nice.
We have 1 kind of ugly duckling
in the Postgres world that I'm
not sure quite fits either of these
models.
I wonder if it's worth discussing
row level security briefly,
because if I was to bucket it based
on those definitions, it's
kind of the pooled model in a way
because all of the data is
together but there is some isolation
Gwen: between tenants
Michael: absolutely yeah
Gwen: yeah I have a love-hate relationship
with RLS I think a
lot of people do.
Because you're right, on 1 hand,
it's absolutely a lifesaver
in the pooled model.
Developers make mistakes as joins and conditions get more complicated. It's very easy to misplace a WHERE clause and actually leak data that you don't want to.
So RLS will prevent you from doing
it if you do it right.
It turns out that a lot of times
the rules could get complicated
and then it leads to bugs.
It also turns out that a lot of
times the rules get complicated
and it leads to terrible performance.
And 1 thing that developers really
don't realize, I'd say almost
no developer realizes it until
they run into it.
The WHERE conditions that RLS introduces are not optimized like the WHERE conditions that you introduce yourself.
Because Postgres, thankfully, is
very good about security, It
treats the work conditions in RLS
differently.
I think they have, they call it
security conditions or something
like that.
And they are very, very conservative
on how they optimize it
and how they plan for it.
This has benefits.
You get very strong security guarantees,
very few bugs as a result.
This is fantastic.
On the other hand, the plan would
be sometimes significantly
worse than what you would come
up with if you were to look at
it really hard and do it yourself.
And with RLS, there is basically
no way to force the plan you
want.
You cannot set enable/disable planner settings to force a different plan, because, again, the main overriding rule is that it's very conservative on how those RLS conditions are optimized.
So some people call RLS a performance killer. I wouldn't necessarily go that far, but you can definitely run into gotchas, and you need to be aware that it's not normal WHERE conditions that you're looking at.
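A minimal sketch of the RLS setup being discussed (policy and setting names are illustrative, not anyone's production code):

```sql
-- Pooled table protected by row-level security.
ALTER TABLE invoices ENABLE ROW LEVEL SECURITY;

-- Only rows belonging to the current tenant are visible; the condition
-- becomes a security qual that the planner treats conservatively.
CREATE POLICY tenant_isolation ON invoices
    USING (tenant_id = current_setting('app.tenant_id')::uuid);

-- The application declares the tenant per session or transaction:
SET app.tenant_id = '...';          -- the tenant's UUID goes here
SELECT sum(amount) FROM invoices;   -- rows are filtered automatically
```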
Michael: Yeah.
Nice.
Nikolay: So what does Nile offer
today and what is the ideal
solution to all this in Postgres
context of course?
Gwen: Yeah, so basically we wanted
to do maybe 3 things.
First of all, give isolation while
not degrading the developer
experience.
So for example, we partition data by tenant out of the box for you, completely transparently, because we know that a bit later on, you're going to want it, and it's going to be a pain in the ass to add.
We shard it transparently. Basically, your database may be spread across multiple different shards.
We will route the queries for you and make sure that they are working as they should.
So, in a way, you get the model
you will have anyway in 4 to
5 years, but you're getting it
from the get-go and without doing
a lot of the work because we are
doing a lot of the management
for you.
The other thing is that we have
done some work to basically bypass
RLS and still give you isolation.
So for the queries, you kind of do the same as with RLS: you set the tenant ID equal to something. We use that to actually direct queries. We rewrite the queries immediately to the partitions that we know have the data for that tenant.
So we have a small extension that
kind of replaces table names
with partition names in the query
itself.
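Conceptually (this is not Nile's actual code, just a sketch of the idea), the rewrite turns a query against the shared table into a query against the one partition that holds the tenant:

```sql
-- Assume a table list-partitioned by tenant (illustrative DDL):
CREATE TABLE invoices (
    tenant_id uuid    NOT NULL,
    amount    numeric NOT NULL
) PARTITION BY LIST (tenant_id);

CREATE TABLE invoices_t42 PARTITION OF invoices
    FOR VALUES IN ('...');  -- one tenant's UUID

-- What the application writes, after declaring its tenant:
SELECT sum(amount) FROM invoices;

-- What effectively runs once the tenant is known: the partition is
-- addressed directly, skipping RLS checks and run-time partition pruning.
SELECT sum(amount) FROM invoices_t42;
```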
And this vastly improves performance
in the majority of cases.
I mean, we've seen it in a bunch
of cases, especially if you
have slightly weird indexes that
RLS may conservatively not use.
The improvement is quite stark,
depending, obviously, on table
sizes.
You can get a benchmark that proves
anything, so I don't want
to throw numbers out there.
But obviously, if you break down a table with a million rows into a thousand tenants with a thousand rows each, then you can show quite a... you can see where I'm going with that.
The other thing we did, and that
was probably the most work,
and this is still work in progress,
is allow moving tenants around.
Because 1 of the biggest problems is that a tenant gets large or noisy, and you want to give it its own machine, but moving it usually means a long downtime if you catch it after the tenant is already large.
By doing the compute-storage separation, we can basically make it transparent. It's a latency spike while we're holding off some queries, while we're moving things like setting up sequence IDs and pointing a different compute at the same part of the storage. But it's essentially a no-downtime operation.
So we think it's a huge deal because
again, it's just a problem
that we keep seeing again and again.
Nikolay: Yeah, I'm curious, is
it all open source what you build
or only parts of it?
Gwen: Right now it's mostly hidden. We have started releasing parts of it under an Apache 2.0 license.
So yeah, the goal is to open source
it and we already publicly
declared that it's going to be
open source.
We have not a date, but a point
of completion where we plan to
open it.
Nikolay: Yeah.
So I've heard about this; several interesting things here. So 1 is the extension for this. I guess in your documentation it's called RLS virtualization, or how is it called?
Gwen: We call it tenant virtualization.
The extension itself, I think we
called it Karnak.
We call everything after stuff
in Egypt and Karnak is a famous
temple.
Nikolay: So data is stored in separate tables in the same database, but the extension rewrites queries to basically route the query to the proper table, right? Is this based on...?
Gwen: It's in separate partitions, and we rewrite queries to go to the correct partition.
We basically bypass RLS, we bypass
the planner trying to make
those calls.
We found out that with a large
number of partitions, this is
significantly more efficient.
Nikolay: I see.
So it's Postgres partitioning.
I see.
And another thing you mentioned here is sharding, right?
Gwen: Yeah.
So we use foreign data wrappers
to allow...
So we have 2 things in the architecture.
First of all, we have a proxy,
it's a routing proxy.
So it keeps track of every connection,
which tenant is the current
tenant and it routes it to the
shard that has the correct tenant
in it.
And then we also have some cases
where some developers want to
write queries that touch multiple
tenants.
Those are not going to be as fast,
but we do allow them by use
of foreign data wrappers.
And mixing partitions, with our own partitioning rules, with foreign data wrappers while still keeping things efficient: we didn't want the planner on any machine to be aware of all the partitions in all the other shards, because it just explodes the planning time in ways that we saw as unacceptable.
So what we did is represent each
shard with a table and then
put hierarchical table inheritance
on top of it.
And the end result is basically
a union all between the table
with the table inheritance that
points to all those other shards,
and the table with the partitions.
Now this gives us basically predicate pushdown, because the planner will push the plan to all those different shards. The other shards know that they have partitions, which the source planner didn't know. They will plan correctly with all the partitions, but they only know about a subset of partitions.
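A rough sketch of the shape being described, with all names and details guessed rather than taken from Nile's actual setup: each remote shard is represented locally by a single foreign table, plain inheritance groups the remote shards under one parent, and a view unions that parent with the local partitioned table:

```sql
CREATE EXTENSION IF NOT EXISTS postgres_fdw;

-- Local shard: a partitioned table holding this shard's tenants.
CREATE TABLE invoices_local (
    tenant_id uuid    NOT NULL,
    amount    numeric NOT NULL
) PARTITION BY LIST (tenant_id);

-- Parent for remote shards; each shard is ONE foreign table,
-- so the local planner never sees the remote partition lists.
CREATE TABLE invoices_remote (
    tenant_id uuid    NOT NULL,
    amount    numeric NOT NULL
);

CREATE SERVER shard2 FOREIGN DATA WRAPPER postgres_fdw
    OPTIONS (host 'shard2.internal', dbname 'app');
CREATE FOREIGN TABLE invoices_shard2 (
    tenant_id uuid NOT NULL,
    amount    numeric NOT NULL
) INHERITS (invoices_remote)
  SERVER shard2 OPTIONS (table_name 'invoices');

-- The union all: predicates on this view are pushed down to each
-- shard, which re-plans them against its own local partitions.
CREATE VIEW invoices_all AS
    SELECT * FROM invoices_local
    UNION ALL
    SELECT * FROM invoices_remote;
```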
So we see it's a bit hacky and
it's a bit weird to explain.
And we think we can do better with
some modifications to Postgres,
which we have not done yet, but
this does give us predicate pushdown,
fairly fast planning, and the ability
to do queries that cross
tenants in situations where this
is required, essentially.
Nikolay: Yeah, if you don't involve 2PC, just rely on foreign data wrappers, no two-phase commit.
I'm curious what kind of anomalies
can happen there.
Gwen: Yes, and we prevent a lot
of things that could cause anomalies.
So we do have a transaction coordinator,
but in order to not
overload and also not overcomplicate
our architecture, we limit
some things.
So, DDL has to be done on a single
tenant, and you cannot mix
cross-tenant queries.
Sorry, DML, INSERTs and UPDATEs have to be done on a single tenant,
and you cannot mix in a Transaction cross, so the moment you
start a Transaction, you have to know what tenant you're working
on.
And then we route it to the correct shard, which has the correct
table, and everything has the absolutely correct guarantees.
If you need to do something cross-tenant, you do not involve
it with any kind of UPDATE.
You could still be exposed to some anomalies, I agree, because
it could be ongoing Transactions from other people in other places.
So you get the basic Read Committed guarantees that Postgres
gives you, I believe, but not anything more than that.
But again, we believe that cross tenant Queries are rare and
mostly done in analytical cases, where you do reporting where
it's slightly less critical to have those.
Nikolay: So you forbid the writing to 2 shards in 1 Transaction?
Gwen: If you want to write to a shard, it's fantastic.
You tell us what tenant you're writing data into, and we will
direct you to the correct shard.
Nikolay: I mean, if there is a Transaction which needs to write to 2 different shards, this is a big problem, because without 2PC...
Gwen: Yes, exactly.
And we don't let you do that, essentially, in order to avoid
anomalies.
Nikolay: I see.
Another question here: have you considered the approach used in Vitess? As I understand it, maybe I'm wrong, where, mostly for analytical Queries maybe, to avoid distributed Transactions, data is brought asynchronously from 1 shard to another, and we have it locally, basically a kind of materialized view on top of logical replication, for example, or something. And it has an eventual consistency approach, of course, but you can just join it in 1 Postgres, in 1 shard, right? Have you considered this approach?
Gwen: Yeah.
We have considered it.
I think maybe CitusDB has something similar, if I remember correctly.
I'm not 100% sure.
But yeah, it's something that we were like, yeah, this is a good
idea that we may examine in the future.
It's definitely, we are trying to build something useful gradually,
and we understand that early on, it's almost safer to have a
bunch of limitations that over time will resolve, rather than
allow people to do something unsafe and also build a kitchen
sink.
Nikolay: It sounds to me like Postgres versus MySQL approaches
because MySQL approach, you remember MyISAM?
Maybe you don't remember, but it was like quite bad.
You need to run a repair table
all the time because it's not
ACID and so on.
It allowed too much.
Gwen: Exactly, yeah.
And MyISAM had a lot of issues.
I mean, there is a reason why InnoDB
became extremely popular.
Nikolay: Right, right.
Michael: But also, with a multi-tenancy
use case, I think you're
quite right that, well, we, I mean,
you'll find out soon enough,
right, if lots of people want these
cross-tenant or cross-shard
queries which are by definition
cross-tenant queries and if they
don't, if you don't need to worry
about it, you save a bunch
of effort having to even implement
that.
So yeah, I like that a lot.
Nikolay: Yeah, my last comment here is, I'm excited to see that finally the Postgres ecosystem receives some attention in the area of sharding. I guess the time has just come, and more and more databases became too large to be handled.
Gwen: I'm almost surprised.
I'm honestly surprised it took
that long.
I mean, again, if you look at MySQL.
Nikolay: It's just unfortunate. The first time I touched this topic was 2006, immediately when we started working with Postgres, honestly. And there was PL/Proxy from Skype at that time already, but it required you to write everything in functions.
Gwen: Partitioning didn't exist
back then.
Nikolay: It existed. It was based on inheritance. It required much more manual work. It was fun, actually. You understood it better, you know? But yeah, it was not convenient, 100%. Not super convenient.
It's just, I see that it's just
unfortunate how it turned out
in Postgres ecosystem.
And now definitely there is huge
pressure.
Many companies need partitioning...
Need sharding.
And it seems...
Gwen: Like, how would history work?
Imagine that YouTube picked Postgres
and not MySQL as their first
database.
Nikolay: Yeah, Google or Facebook, they both chose MySQL somehow, yeah.
Michael: Vitess would be for Postgres first, right?
Gwen: Exactly.
It could have turned out so differently.
Michael: Yeah.
Funny, isn't it?
Changing topics slightly: LLMs are quite top of mind at the moment. Are you seeing people's initial choices change as a result of asking for advice earlier from our robot friends?
Gwen: Oh my god, we're seeing so
many weird things, it's just
unbelievable how much things are
changing.
First of all, we're having people
show up on our Discord and
say things like, I'm using Nile
because my LLM thought it's a
good idea.
And I don't really know Postgres,
so I need some help, but my
LLM assured me that this is still
a good idea.
Like, a lot of people are very much beginners. Maybe I can say: when I started developing, the first time I had to use a database, my company sent me to a 3-week database class.
I think it was Oracle about 20
years back.
And I came back a lot more confident
that I know how to use Oracle
and not to leave transactions open
for too long because people
will yell at me.
These days people don't do the
three-week class before they try
using a database.
So you see a lot more, people are
using a database earlier on
and they do expect more hand-holding
from the vendors.
Like the LLMs give them advice
up to a certain point, but eventually,
if things are slow, they will come
to you and say, hey, why is
my query slow?
I'm sure you guys have seen your
share of that.
I'm also seeing people use Postgres
for their LLMs in different
ways.
And this is really exciting to
me.
People use Postgres via MCPs, people
use Postgres with vectors,
people building AI applications
on top of Postgres.
We're seeing a lot of that.
And, I mean, personally, I'm really
excited that people can program
with LLMs, not knowing a lot about
Postgres, not knowing even
a lot about software engineering
at all, and still get reasonable
security guarantees.
You don't need to know to ask your
LLM about RLS or about, Are
you sure this query actually properly
isolates tenants?
And it's also interesting how much
the results differ when people
use different LLMs.
Like, I would say thinking models do fairly well, iterating in order to get good code and checking their own results, again, given access to a database via MCP.
I would say that if you use ChatGPT 4o, you will get a lot of random hallucinatory stuff still in your code.
Nikolay: Yeah, it's funny.
I already told Michael, we had these cases doing consulting, maybe already almost a year ago. I started noticing that people send us something like: we are building this part of the database, can you review it? We review it, we use different LLMs supporting this review, and then we have a call, and I'm curious: the code looks great, I mean, the schema looks great, but something is off, right? And then we have a call, and I see they have tabs open with ChatGPT and Claude there as well, so I realize they used an LLM to create the schema and then sent it to us for review, and we use LLMs to review it. So there's like a 4-party process, you know. It leads to a good place, but someone needs to jump in with proper expertise and say, this is not a good approach.
Gwen: This is hilarious.
Do you think at some point you
and your customer can just step
out and let the LLMs figure it
out between themselves?
Nikolay: Well, there's a problem here, because, like, it's great, but it does 80% of the job with 1%, not even 20%, of the effort, very quickly. But there is the remaining 20% of problems, which, again, like, I don't know, maybe in the next few years it will change. I think it will change. But right now I feel my internal LLM is trained much better than ChatGPT.
Gwen: You trained your own LLM,
right?
Nikolay: Yeah, no, I mean my own.
Gwen: Oh yeah, no, but I think
you actually trained your own.
Nikolay: Yeah, we experiment. We do some stuff, we experiment. We started with fine-tuning, moving to our own LLM, but we're not there yet.
So yeah, there is a lot of stuff that can be done there properly. It's hard to compete with Claude, and they have a very high pace. With the Claude 4 release, I see, wow, it's really great. But it's still missing many things you learn from practice, which were not discussed publicly. That's why they don't bring it.
Many problems were not discussed
yet.
And you explore them if you have
a lot of data and heavy workloads.
Gwen: So you're saying that even
training on the Postgres mailing
list doesn't have all the information
in it, essentially?
Nikolay: Well, yes.
For example, random problem, we
recently touched it.
There is a buffer pool in Postgres, and its mapping is split into basically 128 partitions, so you can have 128 locks. And if you have a huge buffer pool, this becomes a bottleneck.
And some people say, let's maybe
make it tunable, configurable.
But when you start researching this topic, you end up finding a recent, well, recent as of last August and September, conversation on the hackers mailing list, which is open-ended. It's not complete, because somebody needs to run benchmarks and prove that this is worth a new setting.
And that's it.
There's a patch proposed, but that's
it.
No experiments yet.
Gwen: This is, by the way, 1 of the reasons I was so excited about your LLM approach. Because if you think about what the bottleneck is for doing a lot of the database improvement things, and I am feeling it very personally: running benchmarks is hard.
Properly planning a benchmark is
hard.
LLMs can help with that, but again,
they sometimes just go off
the rails.
And even if they help plan that,
I don't know LLMs that actually
run benchmarks to the point where
they provision the machines
in AWS and know that you have to
provision a separate machine
as a database and a separate machine,
maybe a few of them, to
drive the workload and they both
need to have appropriate resources.
Like all those and then don't get
me started on analyzing the
results, which is kind of 99.9%
of the work.
So the fact that you actually kind of started your LLM work from "I have an LLM that can actually do benchmarks" is just... I think this will be the biggest breakthrough in both people tuning their own Postgres, and also the Postgres community being able to advance the state of the art.
Nikolay: Let me share sad news.
Like, I'm not giving up, but it's
a roller coaster.
We spent, like, a team of maybe 5 engineers spent more than 1 year trying to achieve that.
We achieved many things, but first
of all, we chose Gemini because
they gave us credits.
And I think it was a huge mistake.
Gemini has a lot of problems.
Suddenly you have 500 errors; there are so many problems. It's just not a mature product, Gemini, and it has hallucinations all the time.
It's good, for example, for working with JSON, because when we needed to run experiments, we decided to choose JSON as the config format, and it writes it much better than GPT-4, 4o and so on.
But many things, it just hallucinates.
It invents all the time some things.
It just makes up results all the
time.
And we have a system to control
it but it bypasses all the time.
Like it's silly hard.
So then, yeah, we experimented additionally with DeepSeek, Llama, and we've fine-tuned a lot.
All versions of GPT, all modern
fresh versions, we also bring
them all the time.
And Claude, Claude is much better,
we just added it to this system
we have.
But after 1 year, I decided: you know what, benchmarks is an extremely hard topic. And we cannot trust it anymore.
I mean, we cannot trust LLM to
create precise configuration and
process results fully.
So we decided that the LLM is just more of a connecting thing. When you engineer a benchmark, an expert needs to engineer it.
I don't trust any LLM for now,
because any experiment, we plan
to publish maybe 15 to 20 experiments
in our blog last year.
And if you open our blog, you see
just 1 experiment.
And even there we screwed up and
someone on Twitter said this
is not right and we quickly corrected,
which is good.
And we had achievements like interesting
things, bottlenecks
popped up here and there.
It's really fun to iterate with
LLM.
But once you allow it to think, to design the experiment and to interpret the results, 99% of the time we have wrong results, wrong conclusions and so on.
So for now we are thinking: okay, this is just an accelerator for performing experiments, but design and understanding of results should be in the human brain for now. 100%.
Gwen: I love the fact that Benchmarks
is also hard for robots,
to be honest.
It's so hard for humans, right?
Nikolay: Yeah, and we collect so
many artifacts, But somehow
it's super hard still.
So you always think where is the
bottleneck?
And simple question, right?
But for now, it's extremely hard
to let LLM find bottleneck and
draw proper conclusions.
Gwen: And honestly, if you just had an LLM that always said it's the network, it would be correct about 80% of the time.
Nikolay: What if it's local? If there's no network, everything goes through Unix sockets, and we had these cases as well.
So I agree with you in production.
Like in production, yes, but in
experiments when we learn Postgres
behavior on single machine running
pgbench locally, we don't
care sometimes.
So there's no network there sometimes.
So it's hard.
So I had many moments of frustration.
But it's so good, like I still
believe that iterations are great.
So if you say this is great benchmark,
just check it on new version.
Just changing 1 thing.
This is good.
We have automation, we have interface,
and it repeats the process
of analysis again.
This is where LLM helps a lot.
Because without an LLM, you just have some form, and oh, we don't have this parameter programmed, it's not exposed in the interface. That's bad.
With LLM, you have freedom to change
things and iterate based
on existing good benchmarks.
So, yeah, I cannot say we are there
yet.
With this project, right now we are thinking about the next level of it, where I think we will give the LLM less freedom, you know, that's the key, and control more by the human brain.
Gwen: Human in the loop kind of
thing.
Nikolay: Yeah, yeah, exactly.
Exactly.
So design and first analysis, only
human should be there.
But once you have confidence that
you're moving in the right
direction and you just need to
iterate and expand, for example,
to different versions, platforms,
everything, this is where you
can relax.
You already verified results.
You can say just repeat, but on
different something.
This is where LLM already can bring
you.
It can just speed up everything.
You can throw it to this benchmarking
process and have like 10
experiments running in 10 different
versions or something.
Gwen: And I think this is also
kind of almost the general directions
that agents are taking shape.
I mean, 2025 was supposed to be the year of the agent. I think it's almost becoming the year of the human in the loop with the agent.
Yeah.
Like all the successful products
I see are you tell the agent
to do some stuff.
You ask it, please plan something.
You give it feedback.
You then say, okay, now that we
have a good plan, go and execute
on it.
You come back an hour later, you
had your coffee.
Okay, let's see what you've got.
Here's some feedback.
Go fix some stuff.
I think it's always every successful
product is a bit like this.
Nikolay: Right.
But sometimes humans start using
different LLM when reviewing
things, right?
Being lazy, right?
Gwen: Yes.
Nikolay: That's interesting.
So I'm not sure how successful it will be, but my gut tells me that we need to move, move, move in this direction anyway and have some, I don't know, more experiments and so on.
I hope we will have more to publish soon, I mean, and start iterating.
But yeah, last year was a rollercoaster, so now we are rebuilding stuff.
We will see how it works.
And for example, one of the experiments we must do, I think, is to conduct various benchmarks for RLS.
Because it's obviously...
Gwen: That would be a fantastic example.
Yes.
And I mean, this is something that humans with experience are pretty good at: finding the cases where you're like, is RLS actually going to be an issue, and having the stories that the LLM can then go implement and test.
Nikolay: Yeah, we had several cases, and Supabase also has public materials, blog posts about this.
Obviously, there is already a quite well-known case when you have the current_setting function inside an RLS expression.
And, yeah, if you select a count over a million rows, it's terrible.
And it's actually quite easy to fix.
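[Editor's note: the fix referred to here is the one Supabase describes in its RLS performance advice: wrap the per-row function call in a scalar subquery so the planner evaluates it once per statement, as an InitPlan, instead of once per row. A minimal sketch, with a hypothetical table, column, and setting name:]

```sql
-- Slow: current_setting() may be re-evaluated for every row the scan visits.
CREATE POLICY tenant_isolation ON documents
  USING (tenant_id = current_setting('app.current_tenant')::uuid);

-- Faster: the scalar subquery runs once per statement, so a
-- SELECT count(*) over a million rows compares against a cached value.
CREATE POLICY tenant_isolation_fast ON documents
  USING (tenant_id = (SELECT current_setting('app.current_tenant')::uuid));
```

[The two policies filter the same rows; only the plan shape changes.]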
But yeah, the idea is to collect these kinds of experiments and see how it goes.
Actually, I think my goal with this experiment will be to prove that these things are not a problem if you do it right.
Gwen: Interesting.
I would contribute: I think there was a recent post on the bug tracker where basically the optimizer, the planner, refused to use, I think it was a GiST index or a GIN index, because it considered the optimization unsafe under RLS.
I can look it up and send it to you.
Yeah, it's interesting.
But that can also be interesting: if you have a fix for Postgres, it will obviously be nice to showcase it.
Nikolay: Right.
I think also, so you don't use RLS, you said you bypass it, right?
Gwen: We bypass it, yeah.
Nikolay: Yeah, that's interesting.
I'm curious whether, in this mixed scheme where we have partitions or shards, it makes sense to additionally involve RLS locally if some of the shards have a mixed pool of customers.
Gwen: Okay, one of the questions that I have in mind, and this is something that we're trying to help figure out for our users: often on top of the tenant, you still have permissions for specific users in the tenant.
Like you have an admin that can do anything, and then you may have someone who is not allowed to see some rows at all because they're too sensitive, all this kind of stuff.
And our users ask us: can they do it with RLS inside those partitions, or can they do it in their application?
There are a lot of application-level tools, or middleware kind of tools, that will do it for them.
Whether it's better to do it in the app layer or in Postgres with RLS is a good question that I don't have an immediate answer for.
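[Editor's note: for teams that do choose RLS for this inner layer, Postgres lets the two layers stack: permissive policies on a table are ORed together, while policies created AS RESTRICTIVE are ANDed on top of them. A rough sketch, with hypothetical table, column, and setting names, combining tenant isolation with an admin-only sensitivity rule:]

```sql
-- Layer 1 (permissive): rows must belong to the current tenant.
CREATE POLICY tenant_rows ON records
  USING (tenant_id = (SELECT current_setting('app.tenant_id')::uuid));

-- Layer 2 (restrictive, ANDed with the above): non-admin users in the
-- tenant never see rows flagged as sensitive.
CREATE POLICY hide_sensitive ON records
  AS RESTRICTIVE
  USING (NOT is_sensitive
         OR (SELECT current_setting('app.user_role')) = 'admin');
```

[Whether this beats an application-layer check is, as Gwen says, an open question that depends on the workload.]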
Nikolay: Right.
So there are several layers of multi-tenancy, basically.
Yeah, not multi-tenancy, but different layers.
And if your customer is a tenant for you, they might have tenants of their own, but also inside each tenant they might have additional clusters or segments of users.
So it's apartments, and rooms inside apartments.
Gwen: Nice.
I like that.
Yeah.
This is one of the things that makes multi-tenancy confusing, right?
Because it's almost like a Matryoshka kind of scenario.
Nikolay: Yes.
This term, by the way, has been used in the Postgres ecosystem multiple times.
Starting with, you mentioned GiST: the original paper by Hellerstein mentions the RD-tree, the Russian doll tree, so basically a Matryoshka tree.
Gwen: That's true.
And also in embeddings, there are the Matryoshka-type embeddings, where you can make them any size.
Nikolay: Yeah, I read about this
as well.
Yeah, it's funny.
So yeah, great.
And what are your plans for AI?
I saw MCP servers already with
some integration.
Gwen: We basically have three directions.
One is the MCP server: making it public, giving it authentication.
Right now, it's open source, you can run it on your own, but we're not hosting it, so we should start hosting it at some point.
So, things that just make it easier for people who use LLMs to use it.
The other thing we've done, which we think is very useful for LLMs, and this we already have: just making it zero time to create new databases.
Because for LLMs it should be zero time and zero cost, because they love creating a lot of them.
Every time something goes wrong, it's an okay-let's-try-from-scratch-with-a-new-database kind of situation.
So we're making it fast and cheap.
The other thing that we're still working on is really making our documentation more LLM-friendly.
llms.txt is absolutely not enough.
It actually creates a very large file, and the LLMs tend to get lost in it.
We need to figure out how to make it better.
And then also, everyone is kind of thinking, can we have our
own agents?
Can we do something around that?
We're kind of thinking about that.
Something we already have that is useful is just that, with the multi-tenant model, the embedding indexes, the vector indexes, are much smaller.
And this is a huge deal for people building those agents and LLMs.
Nikolay: Yeah.
And actually, regarding this zero-startup goal, you mentioned separation of compute and data.
How is that achieved?
Using what approach?
Can you elaborate?
Gwen: Oh, yeah.
Sorry.
This is more than a two-minute answer, but the short one is that you patch Postgres, and then you find a better way to do your storage and basically wrap every Postgres storage function with an equivalent using your storage.
And then you also need to apply the WAL continuously to the storage layer.
Nikolay: Great job doing this in, like, 20 seconds actually.
I understood it very well.
Yeah.
So, and do you have plans to make this open source as well?
Gwen: Yes, absolutely.
I mean, first of all, as you know, WAL readers have to be registered, ours already is, and with an open-source license.
And then, yeah, we are planning to open source our storage layer, our extension.
I think at this point it's about seven different patches we've made on Postgres.
I don't think we'll want to open source it as a Postgres fork, because the number of patches is quite small and we are maintaining it.
I think everything from Postgres 12 to 18 at this point, or maybe
13 to 18, something along those lines.
Nikolay: All supported versions, I guess.
Yeah.
That's great.
Yeah.
Well, looking forward to checking it out.
And actually, one more question from me, maybe the last one.
Are you open to some benchmarks we will probably do?
We plan to run some benchmarks on various platforms.
It all started after the acquisition of Neon and a discussion on LinkedIn.
So I thought about trying some benchmarks on different platforms.
How do you think, like what are
your thoughts about it?
Gwen: Oh my God, this is very scary.
We are benchmarking ourselves all
the time.
So I'm keenly aware of exactly which benchmarks make us look good and which benchmarks make us look bad.
And I think that's also how we react to benchmarks on social media.
Unless it's a benchmark that shows something in Postgres itself and is clearly attempting to educate and help people.
You can design a benchmark to make anyone in the world look bad.
Nikolay: Benchmarketing, it's called.
Michael: You can design a benchmark to...
Gwen: ...make everyone look good.
Exactly.
So, at any given time I can publish my benchmarks that make Nile look fantastic, and you can run benchmarks that you decided how to build, and they may be very realistic and may even expose a problem that we have that I didn't know about.
But, yeah, in general, I love benchmarks.
I just have opinions on how benchmarks
are used by marketing
people.
Michael: Very good answer.
For anybody interested in a bit more about Nile's architecture, Gwen gave a really good talk at PGConf.dev recently and the video just went up on YouTube, so I will put that in the show notes for anybody that wants a deeper dive there.
There was also a good talk I saw at a PgDay Paris event by Pierre Ducroquet, I'm not sure if I'm pronouncing that anywhere near correctly, all about multi-tenant database design, especially focusing on something we didn't focus much on today, which was the downsides of schema-per-tenant design, including things like observability and monitoring, which I thought was really fascinating.
So anybody considering going down that route, definitely check out that video, and I'll put that in the show notes as well.
So yeah, thank you so much, Gwen, I think we're out of time, it's been a real pleasure.
Gwen: It's been a pleasure.
Thank you for having me on.