Schema design checklist
Nikolay: Hello, hello, this is
PostgresFM.
My name is Nik, PostgresAI, and
as usual with me, Michael, pgMustard.
Hi, Michael.
Michael: Hi, Nik.
How are you doing?
Nikolay: So the topic...
I'm doing great, how are
you?
Michael: I'm good, thank you.
What's the topic?
Nikolay: What's the topic?
The topic I chose, and everyone
already saw it on cover you created,
right?
So the topic is 5 or so, maybe 6, things you should check when you start a new schema, or when you already have some schema, your project is growing, and you want to be better prepared for bigger scale, performance-wise. Especially if you use AI to design and improve your schema. So what should you look at to check your schema health?
Michael: Makes sense.
Just on the topic in general, I do think there are some things that are better suited to asking AI for help than others. Do you think schema design is 1 of them, or is it 1 of those we should still do a little bit more ourselves at the moment?
Nikolay: Well as you know I'm very
very very intensive user of
AI.
Yeah.
Honestly, like I do everything
through AI right now.
And I'm 100% schema design is a
topic where AI can help a lot.
But it also matters which questions
you ask.
If you just say, I want this and
don't pay attention, you will
have low quality schema in many
cases still.
That's why I said, let's have this episode, because it might be quite basic for many advanced users of Postgres. But this checklist we are going to present today can be a good thing to keep in mind whenever you design a schema, just to ask the right questions and revisit.
Also, I'm not sure how exactly
you use AI.
I use it always like I work with
people, developers.
When you just ask something and
think those engineers, humans
or AI, will solve all your problems
and make all the decisions,
you're offloading full responsibility onto their shoulders, and then you always get big negative surprises, right?
So it's better to keep requests
as detailed as possible when
you work with AI, and also make
some iterations to review and
improve.
And when making iterations: first, I want some schema, I don't know, a LinkedIn-like social media schema or something, right?
It will give you immediately something,
but then you want some
details to be checked.
And this is exactly when this checklist can be very helpful.
So you can revisit and think and imagine when you will have gigabytes
and terabytes of data, what will happen.
So to avoid painful refactoring in the future and outages maybe
or performance degradation, the price you can pay in the beginning
is much lower than the price you pay later.
So I never work with AI in 1 shot, I always make iterations to
improve in any question and to revisit and sometimes combine
multiple LLMs.
This is exactly when you can use this list of questions.
So first topic is the data type of primary key.
Michael: Yeah, and I feel like the answer to this has changed,
or my opinion on what the best default for this has changed,
actually, in the last few years, just off the back of learning
more about timestamp ordered UUIDs, right?
This used to be an age-old debate, right, UUID versus integer
or big integer really.
And most of the kind of blog posts I would see are like trying
to just encourage people to use BIGINT over int.
But now that we have first-class support for time-ordered UUIDs,
I don't see many reasons not to go with that.
What do you think?
Nikolay: Well, the size is double: your ID is 16 bytes instead of 8, so it's quite a price to pay. Each row needs 8 more bytes.
It's not that much, is it?
In many cases, it's fine.
Well, indexes will be also bigger.
Indexes double, yeah.
It might make sense if volumes are huge.
But anyway, to keep things simple, a rule to remember, red flags: if you are preparing for really large volumes of data, it must not be int4, it should be int8. And in terms of the UUID path, the data type is always uuid, for both version 4 and version 7. But you should check the default, how it's generated. And in modern Postgres there is a function for it, I think only in 18, the latest version.
Michael: I think so.
It didn't quite make 17, did it?
Nikolay: So if you're on Postgres 18, it's better to use UUIDv7, but also remember the trade-offs, because there is a timestamp in there, and the values will be ordered, like with integer or bigint. So there is a nuance here. But performance-wise it's much better than UUIDv4.
There are many articles about it and you just need to check,
okay, are we going to go with integer?
Then it's int8 or BIGINT, right?
Are we going to go with UUID?
Then default generation, default
should be version 7 for Postgres
18.
For older versions, that's also not an excuse to stick with version 4. You can write your own function. Or you can ask AI to write it for you.
Michael: Yes, or generate them
outside of the database.
Nikolay: I like this option slightly less, because if you have multiple application nodes, who knows what will happen with their clocks? And since UUIDv7 includes a timestamp inside it, the order might be off, I don't know. I prefer to leave it on the database's shoulders, to have a single source of truth on the primary. On the other side, UUID by design serves the purpose of working in distributed systems, right? So maybe it's not a big deal to generate it on the client side.
Michael: All I mean is if you're
on an old Postgres and it saves
you creating your own function
or whatever.
Nikolay: Postgres waited a couple of years to bring UUIDv7 officially. Although, again, you can create even a plain SQL function, not even PL/pgSQL, a simple SQL function. And if you ask your AI to search for an example, it will probably find my page on how to do it, or other people's pages. And libraries, from Google, from Node.js, everywhere, have had it for a couple of years already, longer even. They implemented it before the standard was finalized.
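For reference, the kind of plain SQL function Nikolay describes might look like this widely shared sketch (the function name is arbitrary): it overlays the current Unix-epoch millisecond timestamp onto a random UUID and flips the version bits to 7.

```sql
-- Sketch of a UUIDv7 generator for Postgres before 18.
-- The 48-bit millisecond timestamp replaces the first 6 bytes of a
-- random v4 UUID (same variant bits); set_bit calls set version 0111.
create or replace function uuid_generate_v7()
returns uuid
language sql
volatile
as $$
  select encode(
    set_bit(
      set_bit(
        overlay(uuid_send(gen_random_uuid())
                placing substring(int8send(floor(extract(epoch from clock_timestamp()) * 1000)::bigint) from 3)
                from 1 for 6),
        52, 1),
      53, 1),
    'hex')::uuid
$$;

-- Usage, e.g. as a column default:
-- create table users (id uuid primary key default uuid_generate_v7());
```

On Postgres 18 and later, the built-in `uuidv7()` function makes this unnecessary.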
Michael: Yeah, so huge red flags to look out for: if it's using the integer data type, which is the same as int4, or serial. I think AI sometimes will define it as serial. So if you see that word without the big, as in bigserial, you might hit that limit at 2 billion or whatever, I can't remember the exact upper limit.
Nikolay: Yeah, and avoid UUID version
4 if you want good performance
on large volumes of data.
Yeah, that's it actually.
Let's keep it simple because we
could discuss more advanced situations.
Michael: I think we should do a whole episode someday on migrations, like for people that are getting close to that 2 billion number. That would be a good topic.
Nikolay: I can share, I had a lot
of experience with different
approaches.
Let's do that.
Good.
So how to define primary key?
I actually have an article about it...
Michael: Cool. Let's do that.
Nikolay: Covering multiple paths so we could talk about this.
Number 2 is constraints.
Topic is constraints.
Michael: Other constraints.
Nikolay: Yeah, yeah.
Yeah, other.
Beyond primary key.
Yeah.
So I would just ask, first thing: are we using constraints well? To myself, or AI, and so on. Because Postgres has 6 types of constraints, it's quite rich tooling, and it's much more convenient to have them early. Not because adding them later is a big pain: right now we have quite good support for two-step constraint definitions. I think in Postgres 19 it will also be possible for NOT NULL constraints, which is interesting. But anyway, check constraints, foreign keys, unique constraints, it's possible to create them first in an almost-finished state and then finalize, to avoid long-lasting locks.
But this is not the reason why I prefer to think about it in
the very beginning.
The main reason is the quality of data.
This is why constraints exist in the first place, right?
So if you introduce constraints early in your project, data quality
is good.
Introducing them later probably will...
Michael: Yeah, and database side as well.
I mean, it just...
Nikolay: Oh yeah, yeah.
I don't trust constraints outside the database. With age, my points of view became less radical. So I just observe, without fighting, backend engineers who try to implement constraints on the application side. I just note that they are not reliable, because who knows which other layers or types of applications will appear. Especially these days with AI, it's so easy to add something else, to implement it in a different framework or even a different language, it's super easy right now.
And you will need to reimplement those constraints, or you will
just forget and miss them and data quality will suffer.
Michael: So specifically, something I'd like to see more of, or at least something I see a lot of people not doing: columns that you're not expecting to put any null values in, or don't want any null values in, actually explicitly making them NOT NULL. If you don't specify, columns are nullable.
And I think most people, or at
least most AIs I see, are creating,
like by default, they don't put
NOT NULL on every column.
Nikolay: Yeah, there's an opinion
that NOT NULL should be default.
Anyway, it's hard to change the default, but inside your project you can say: okay, unless explicitly specified otherwise, let's make columns NOT NULL.
Exactly.
And avoid troubles with NULL we
talked about in a special episode
about NULLs, right?
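A sketch of that convention (the table and columns here are hypothetical): write NOT NULL explicitly everywhere, and make a nullable column a visible, deliberate exception.

```sql
-- "NOT NULL unless stated otherwise" as a project convention:
create table profiles (
  id         bigint generated always as identity primary key,
  user_id    bigint      not null,
  email      text        not null,
  bio        text,                           -- deliberately nullable: optional field
  created_at timestamptz not null default now()
);
```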
Michael: Yeah, and I guess in this specific case there is also... well, with most constraints we're doing it for quality reasons, but there is a tiny amount of performance optimization that comes out of them. A bit more for unique constraints: if the database knows for sure this is unique, it can do some optimizations. So anyway, good point, I like that about NOT NULL.
Nikolay: Yeah, yeah, unique is
interesting stuff.
And also, I don't think foreign keys are underused in the Postgres ecosystem. The days when everyone suggested that foreign keys are super bad, for me that's 20 years ago. I remember these topics: let's avoid foreign keys completely. Although we know foreign keys can bite badly, we had an episode with the Metronome company about, basically, MultiXact-related issues which can be caused when you use foreign keys a lot. But anyway, I still think foreign keys are a good thing to have, for data quality.
But I think what is usually underused
is check constraints.
They are great.
They could be used much more often.
And a classic example: when I see the ENUM data type in a schema, I think, okay, later, if we want to change this list of possible values, how will our migration look?
Michael: Especially removing a value, yeah.
Nikolay: So check constraints are slightly more flexible. Well, removing is also a topic, right?
But a check constraint you can introduce in 2 stages. Again, you can drop the old 1, and you can first create a new 1 in NOT VALID state and then validate it in a separate transaction. I just see it as a quite useful, quite convenient tool, and you see it in the schema right away: this column should be, for example, in this range, or in this list, or something. So I wish it was used more often.
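The 2-stage approach, sketched on a hypothetical `orders` table: add the constraint as NOT VALID (cheap, only new writes are checked), then validate the existing rows in a separate transaction, which takes only a weak lock while scanning.

```sql
-- Step 1: instant; only newly written rows are checked from now on.
alter table orders
  add constraint orders_status_check
  check (status in ('pending', 'paid', 'shipped', 'cancelled'))
  not valid;

-- Step 2: later, in a separate transaction; scans the table to verify
-- existing rows without a long-lasting heavy lock.
alter table orders validate constraint orders_status_check;
```

Compared to an ENUM, changing the allowed list is then just dropping the old constraint and repeating these 2 steps with a new one.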
Michael: Yeah.
What do you see in... Going back to foreign keys quickly: do you see a lot of the LLMs doing things like ON UPDATE CASCADE, ON DELETE SET NULL? Do you see any cascade stuff popping in there by default, or missing when you think it should be there?
Nikolay: It's a good question.
I actually don't remember having issues with this question.
I know deferred constraints might be an issue.
I remember a case when pg_repack, it was a company called Miro,
maybe you know it, right?
They had issues with pg_repack and deferred constraints.
There is an article, old article from them about this.
But I don't remember particular issues here, and I think this
leads us actually to the question for our particular projects,
what's better?
And if I had any doubts, with AI I would just make some experiments. Based on the product we are developing, I would say: okay, in 1 year, how many entities do we expect in each table? Then let's fill all the tables with fake data, run VACUUM ANALYZE to have a good state, and then just explore plans, with pgMustard, for example.
Let's think what kind of workload we might have according to
user stories in our specification of the product and so on.
And then collect plans and see, right?
And then include also modifying queries and collect those plans
and think how it will work.
It's so easy these days to make these experiments before we finalize
all the decisions and then you will see it in action and it's
great.
So collect all the plans, find like weak spots where we have
suboptimal plans and how we can deal with this.
If, for example, we have 1 entity and another, a one-to-many relationship, and the many side is really many, right? And then we want delete to be supported in the UI, where people, as we know, expect 200 milliseconds, less than 1 second, let's say. And when this delete of 1 row triggers, propagating through foreign key triggers, the deletion of a million rows, it cannot meet our requirements, right? Then we need to think about asynchronous deletion, maybe using some event queue system or something.
Yeah.
And this will pop up if you prepare a good experiment here, right?
Without experiment, it's hard to say.
You need to answer questions.
How big are the tables, and what will this relationship between tables on foreign keys look like in the worst case? Can a user create, like, a million entities, and then we need to delete this user? This is a great question, and with AI you can iterate so fast these days. So, great.
Collect plans and then think and decide.
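A minimal version of that experiment could look like this (table names and row counts are made up): create the schema, fill it with fake data via `generate_series`, analyze, then collect the plan for the cascading delete.

```sql
create table users  (id bigint primary key);
create table events (
  id      bigserial primary key,
  user_id bigint not null references users (id) on delete cascade
);
create index on events (user_id);  -- the referencing side needs its own index

-- Fake data: 1,000 users, 1,000,000 events spread across them.
insert into users select g from generate_series(1, 1000) g;
insert into events (user_id)
  select 1 + (random() * 999)::int from generate_series(1, 1000000);

vacuum analyze;

-- How expensive is deleting 1 user whose events cascade?
-- (This really deletes; it's a lab experiment, not production.)
explain (analyze, buffers) delete from users where id = 42;
```

If the plan shows the cascade touching far too many rows to fit an interactive latency budget, that's the signal to move the deletion to a background job.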
Okay, it's a good question, but I don't see it as a simple question.
Michael: No, no, no.
It was more, I just asked in case you'd seen that LLMs were throwing it in on every foreign key and that was maybe a bad deal. But it sounds like they're not.
Nikolay: I'm maybe the wrong person to ask about this, because these days I have only a few applications designed and created from scratch with AI. I mostly deal with things which other people already created, and I just see how AI can help.
So I'm not sure how exactly this propagation of deletion, for example, will be implemented. But it's definitely worth thinking about in the context of latencies: what should be offloaded, and what tools we'll be using to offload it, to background jobs, for example. Instead of thinking while looking at code, I would think looking at dynamic results, experiments.
Michael: Yeah, and I'm thinking, if we're talking specifically about schema design when you expect to hit scale, I might default to not using ON DELETE CASCADE, because at scale it can become problematic. So if you had to come up with a best practice for designing at scale, maybe lean in that direction.
Nikolay: That's interesting, because this was my position for a long time, and it's not about AI building or something. This was my position: it's dangerous, because who knows how many rows we will need to delete. It's an unbounded delete, right? An unlimited delete, this is how it feels. Let's design without it.
But then I see in my consulting practice, I see cases where really
large projects have it and it's just not 1 project.
Many projects I see and they keep it and somehow live with it.
They at some point implement non-cascaded, like asynchronous
deletion, but by default they still rely on cascaded deletes
somehow and they are fine.
This is interesting.
So reality shows it's not that bad.
But I agree with your way of thinking and this is my default
way of thinking as well.
Postgres can survive very terrible things these days much better
than 20 years ago.
Including the revolution
of 100 years...
Michael: The only
time I see that being
a horrendous issue, even in small
to medium-sized projects, is if
people haven't indexed the
Nikolay: The second part of the foreign key.
Michael: Exactly, the referencing
column.
Nikolay: There are 2 ends of a foreign key. 1 is always indexed, because it's the primary key; the other 1, it's your job to index, right? To not forget. Yeah, yeah, yeah.
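The fix for the other end is a one-liner; on a hypothetical `events` table referencing `users`:

```sql
-- users.id is covered by its primary key index; the referencing column
-- needs its own index, or referential checks and cascaded deletes will
-- scan the whole child table.
create index concurrently if not exists events_user_id_idx
  on events (user_id);
```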
We now have a checkup, an old checkup, actually. We have a special report just for this case, yeah. Good. These are all good questions, but I feel we could spend a lot of time in the constraints area. I think we had a constraints episode.
We did
Michael: have an episode, yeah.
Let's link that up and move on.
Nikolay: Yeah, let's move on.
An entertaining topic: column Tetris. Sometimes it might bring some quite interesting savings. In many cases it won't, but I see people enjoy this topic, and I decided to include it.
So the idea is, when you design a new schema, the price to pay to reorder columns is very low. You can reorder them easily. The problem is that for Postgres, physically, the column order matters. Due to alignment padding, if int2, for example, is followed by int8, you will see a gap of 6 bytes of zeros in every row. And this is bad. It's a waste of storage and also memory: it's not only disk, those zeros will pollute memory as well. And sometimes you can be so unlucky; I saw cases where, like, 30% of each row was zeros. And Postgres doesn't have automatic reordering of columns. It could, actually, but it doesn't.
Michael: So the order you define it in, let's say you do CREATE TABLE, the order you put those columns in, that's the order they end up in on disk. And every new column you add goes at the end. That makes sense, right? But even if it could be slotted in, let's say you add a new Boolean type, and it could, I don't actually know if that would work, but it always goes on the end.
Nikolay: And there's no way to slot it in. A Boolean is 1 byte.
Michael: Yeah, but I wonder if
that would work.
Nikolay: In a Boolean, 7 bits are wasted already. If you zoom in, like, under a microscope...
Michael: Let's put Boolean, Boolean, Boolean next to each other, right?
Nikolay: No, no, no. Boolean...
Michael: Really?
Nikolay: ...is 1 bit.
Michael: Oh, bit.
Yeah, good point, good point. So, yeah...
Nikolay: But it's stored as 1 byte, 7 bits already wasted. But then, if you have a Boolean and then a timestamp, for example... For example, the first column, as we all agree, should be id, the primary key. I'm joking, it's not always, but it's very popular. Id is the first column, int8, occupying a whole 8-byte word. And then we have a Boolean and then a timestamp. Not only did you waste 7 bits inside the Boolean, that's already by design, Postgres doesn't have anything smaller than 1 byte. But then you waste 7 more bytes: so 7 bits first, and then 7 bytes because of this alignment padding. And your Boolean, which by meaning is only 1 bit, suddenly occupies 8 bytes. It's a huge waste of resources, right?
But the bigger picture here is
that people usually realize that
waste is significant only when
it's a huge table, like 100 million
rows, right?
And then: oh, so much space wasted because we didn't play this column Tetris initially.
But this is like chicken or egg
problem, maybe inverted.
Anyway, so you realize it later,
but you could do it earlier,
but you didn't think about it because
when you have only 1,000
rows, for example, it costs you
nothing.
But again, with AI, it's so easy.
You just say, let's play Column
Tetris in Postgres.
AI should know it.
I haven't tried, but I'm pretty
sure it should know it, because
there are many articles about it.
Michael: Oh, and we did an episode.
Nikolay: Yeah, starting from Stack
Overflow, where this topic
was very well described.
So AI should know it, this topic.
When you create a new table and you think many rows will be stored, just say: let's apply column Tetris to this. Boom, it will reorder the columns. You never have to think about it again.
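The effect is easy to see with `pg_column_size()` over row values, a trick from the Stack Overflow discussion of this topic. On a typical 64-bit build, interleaving 4- and 8-byte columns pads each int4 out to the next 8-byte boundary:

```sql
-- Same 4 values, 2 column orders; the interleaved one carries padding.
select pg_column_size(row(1::int4, 2::int8, 3::int4, 4::int8)) as interleaved,
       pg_column_size(row(2::int8, 4::int8, 1::int4, 3::int4)) as tetrised;
-- interleaved comes out 8 bytes larger: 4+pad(4)+8 + 4+pad(4)+8
-- versus 8+8+4+4 for the "8-byte columns first" order.
```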
Michael: Yeah, what do you think? I feel like there's a... I don't know, I'd still want to do it myself, just because I feel like there's also a natural order for columns. So I would want to put all the 8-byte ones together, and the 16-byte ones, to make sure they're all with similar ones, and then all of the...
Yeah, exactly.
Nikolay: Ah, similar.
Michael: But equally, imagine it's, like, a users table: I could put certain information about that user that makes sense together, and then all the IDs that are maybe foreign keys, like their team ID or their organization, all those kinds of referencing columns, put those together.
So I know the order doesn't matter.
I know you could just specify column
order, but you know when
you do SELECT * or you just...
Nikolay: This is what I was trying
to say, like, are you trying
to tell me that you're using SELECT *?
And here we go.
So I'm joking, actually.
I'm using SELECT * all the time.
Also, all the time I see articles about how bad it is to use SELECT *. Right? It's convenient, I know, yeah. But also, if you work manually, the reordered columns are inconvenient, and SELECT * doesn't work well with that.
But applications shouldn't use
SELECT *.
Michael: You're right.
And I'm even thinking about examples
like the system views in
Postgres.
Like you do SELECT * FROM pg_stat_statements or something.
But that's not even the table,
right?
That's the view.
And then they can define the order.
So it doesn't actually matter.
Nikolay: Yeah, it's an interesting
topic.
I played Column Tetris a few times
heavily, and then it was really
inconvenient.
Oh, this is an ugly table.
It looks ugly.
Like, this order is ugly.
But at the same time, it's your problem, because you use SELECT *, which is considered not good practice. It's a practice for exploring and so on.
Okay, explore with AI, ask to provide
better order or something.
I don't know.
But anyway, if you focus on performance: again, I saw cases where the waste was significant.
Michael: 1 of the first articles I saw was by Braintree, and I think they reported an average of about 10% on some of their largest tables.
Nikolay: It depends.
Michael: Obviously it completely
depends but that was quite surprising
to me.
I know you mentioned an extreme
case where you saw 30%, but I
was quite surprised that in an
actual production case, it was
as high as 10%.
And when you're talking about hundreds
of gigabytes or terabytes,
That's actually a decent amount.
Nikolay: Yeah, yeah, exactly.
The problem also, of course: in existing projects which were developed over many years, usually it's not so simple, because the schema has been evolving over many years and you cannot insert your new column in the middle. It's always, as you said, at the end. So sometimes we do need, for example, to change the order using some redesign.
By the way, have you heard that Postgres 19 will have REPACK, and also REPACK CONCURRENTLY, which is basically VACUUM FULL CONCURRENTLY? We had an episode with the author of pg_squeeze. It's VACUUM FULL CONCURRENTLY, exactly, not VACUUM CONCURRENTLY. But the decision was made to name it REPACK CONCURRENTLY.
Michael: I think that's sensible,
getting away from people thinking
it's vacuum related, yeah.
Nikolay: I haven't looked into
details, but naturally this is
the point where you could reorder
columns, because it's based
on logical replication.
Michael: Yeah, I suspect if they
do that, it will be a different
keyword again.
Nikolay: Who knows?
Again, I haven't looked into details.
There are long discussions on pgsql-hackers
mailing list.
It just feels natural, because it's based on logical replication.
This is where you can reorder columns
and play Tetris for existing
large tables.
I hope it will be implemented at
some point.
Maybe it's already partially implemented
or something.
I don't know.
You basically, before you start
filling the new table, you can
reorder it, because it's logical,
right?
It should be possible.
Anyway, that was an entertaining topic. Let's move on to a more serious topic called indexes: index chaos. If you just use AI, and if you don't spend enough time on things like, as I said, experimenting with real-data benchmarks, you might end up with an under-indexed or over-indexed situation. And usually people are scared of being under-indexed. In my practice, I see over-indexed very often. Sometimes to the extreme.
Michael: I guess for new projects, I would be surprised to see them over-indexed. But for...
Nikolay: It's possible, it's possible still.
Michael: Is that like as a result
of AI or something else?
Nikolay: Well, the problem here is that, again, everything depends on your prompts, depends on the angles. If you say we will need to order by an arbitrary column, right? Remember our episode with the CTO of Gadget, Harry? So we can order by any column. Naturally, AI might decide to put an index on every column, right? But is it really
Michael: what you want?
But that's what Gadget decided
to do as well, right?
Nikolay: Right, but with an understanding of the consequences. But in reality, will it all be used, if 1 year later you see that 90% of those indexes are unused? And Gadget is a special case: they have many applications to support. If you design your own, only 1 application, this is where I would just stop and think, with AI, again with some experiments. I would ask questions. It all depends on workload, right? Which queries will we have? And I would just think which queries we really need and which we won't need, and then choose a proper minimal set of indexes.
Michael: I like that.
I think there's also some rules of thumb we can give people, in terms of how many is where it starts to get dangerous for other reasons. But the 1 place where it's not even an "it depends" is overlapping. And I see it more and more where people are adding indexes based on AI suggestions: they're adding indexes they've effectively already got, or ones that overlap with indexes they've already got, and therefore only 1 of them is needed. And by the way, it won't always show up in your stats as an unused index, because it is still using the smaller index.
Nikolay: It will show up in stats, because it's called, in our case, a redundant index, in Postgres monitoring systems and checkup. I never heard the term overlapping. I see people talk about duplicate indexes, which is the trivial case, the exact same definition. But we call these redundant indexes. It's not as easy a topic as 1 might think, but we solved it in many parts. Over the years, we have developed quite a good approach to reliably detect redundant indexes.
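A trivial case of a redundant (not just duplicate) index, with hypothetical names: a single-column index that is a leading prefix of a composite one.

```sql
create index events_user_id_idx            on events (user_id);
create index events_user_id_created_at_idx on events (user_id, created_at);
-- The first index is redundant: the composite index starts with user_id,
-- so the planner can serve user_id-only filters from it as well.
-- Note the single-column index may still show non-zero scans in
-- pg_stat_user_indexes, which is why "unused" checks alone won't flag it.
```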
And as you say, this is true: in many cases all of those indexes are used. And there is a big fear: oh, this is used, should I really drop it? Yeah, but the reports we have are quite reliable. And of course, I wish we had a native, standard mechanism to disable indexes for some time, without involving HypoPG and hypothetical indexes, which currently supports this.
Michael: Our last episode was on
what's missing in Postgres,
and I think we got a nice comment,
I've forgotten who from, but
saying 1 of their features would
be make this
Nikolay: invisible.
Temporarily hide an index from the planner. There are extensions for it, HypoPG supports it, but we want some native mechanism, beyond setting the index's indisvalid to false. We want this. Yeah, I agree. And this is probably a good thing to work on for those who want to start hacking on Postgres, a good thing to officially implement.
Anyway, the questions to ask all start from understanding workload. We cannot think about indexes without understanding workload. So first we need to generate some fake data and then start thinking about usage patterns. Without AI it's really hard work, a lot of time. With AI it becomes easier. If you have user stories defined already, you understand what kind of patterns we will have, and you can start imagining. It will not be 100% accurate, but it will be good enough, much better than without it or with manual work. So you can iterate here and see. And then, with queries in hand and some lab already developed, I see it should take 1 or 2 hours for, like, a medium-sized project, I mean in terms of complexity. And then you can start iterating, collecting plans, and thinking which indexes you can have, playing with it.
Michael: But also, whilst this might be the most important thing for performance that we've talked about so far, it's also the easiest to change later, right? We can set up a certain set of indexes, and later decide we want a slightly different set, and that's not super painful in most cases, especially if there's no partitioning involved. So for me it feels like, yeah, it's really important, but if we don't get it right on day 1, it's much, much easier to fix than the primary key being the wrong data type, or not having the right constraints in place, that kind of thing. Or even the column order, yeah.
Nikolay: So the point is, you should build, like, a wind tunnel for your project, right? A test lab, yeah. A database lab. And then you should apply this workload, like wind, and see how your database behaves under this wind.
Michael: Yeah, a lot of people do that with production.
They just do it with their slow queries.
Nikolay: But with AI it's easier to have massive experiments
before you finalize all the decisions.
So my point is do it.
And then the questions to ask: are there any indexes we don't need, because they are redundant or unused? What indexes are missing, because we have bad plans? And this, again, pgMustard will visualize, and so on; and you have an API already, right?
Michael: Yeah, yeah, it's quite nice.
Nikolay: You can connect your AI to pgMustard, for example, and ask it to use pgMustard to explain what's happening and find problems and missing indexes in this lab environment. Right?
Michael: Yeah, OK.
Going to the extreme: how many indexes on a table would make you automatically think, oh wow, that's a lot? Maybe it's justified, but I'm already thinking that's too many.
Nikolay: These rules of thumb are quite weak in my opinion, but we have them anyway. For example, if the volume of indexes exceeds the volume of data, some bell is already ringing. Or, before Postgres 18, this nasty lightweight lock manager problem means I don't want more than 15 indexes per table.
Michael: Because of the 16 relation limit.
Nikolay: So a simple primary key lookup, which quite likely will have quite high QPS, will suffer because of locking. So there are a couple of these kinds of rules, right? But they are not strict, and again, I consider them weak, but maybe...
Michael: But I think it's still helpful, right, if you've got
a schema designed by AI and it's got 20 indexes on 1 table, maybe
consider is that smart?
Nikolay: But it happens. Index data exceeding heap data, it happens.
Michael: Especially with some, like if you're doing GIN indexes.
Nikolay: Oh yes.
Michael: If it's just B-trees, yeah.
Nikolay: Yeah, okay, great.
So the final topic, we couldn't decide on which 1 to choose.
We had 2 ideas.
1 is RLS and another is partitioning.
Let's touch both maybe.
Michael: Yeah I think so.
So from an AI perspective, like when you're asking it for a new
schema, do you see partitioning come up?
Does it over partition?
Nikolay: It won't come up unless
you start saying: I'm going to
have a lot of data here, I want
good performance, I want billions
of rows stored in this table, and it
should be maintainable.
Then it will naturally come to the idea
of partitioning.
Partitioning will be interesting
to combine with UUID version 7 or
8, it doesn't matter.
It's possible; again, I have a
recipe for this in my set of how-tos.
And there can also be the question...
like, you cannot develop partitioning
without understanding the workload,
again.
It's similar to indexes: there
will be something, but how will
it behave? Who knows.
Maybe you'll end up with queries
that need to scan all partitions,
which is terrible in most cases.
And then planner behavior.
And then plan behavior.
Anyway, the question to ask is,
like you said, requirements: I
want a lot of rows and I want good
maintainability for this table.
Help me.
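As a minimal sketch of what such a request might produce, here is declarative range partitioning by month (table and partition names are illustrative; in practice a tool like pg_partman usually manages partition creation, and with time-ordered UUIDv7 keys the same idea can be applied to the id column itself):

```sql
-- Parent table; on partitioned tables the primary key
-- must include the partition key.
CREATE TABLE events (
    id         uuid        NOT NULL,
    created_at timestamptz NOT NULL,
    payload    jsonb,
    PRIMARY KEY (id, created_at)
) PARTITION BY RANGE (created_at);

-- Monthly partitions; new ones must be created ahead of time
-- (manually, by cron, or by a partition-management extension).
CREATE TABLE events_2025_01 PARTITION OF events
    FOR VALUES FROM ('2025-01-01') TO ('2025-02-01');
CREATE TABLE events_2025_02 PARTITION OF events
    FOR VALUES FROM ('2025-02-01') TO ('2025-03-01');
```

The maintainability win discussed below comes from each partition being a normal, smaller table: vacuum, reindexing, and dropping old data all operate per partition.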
Michael: Like deleting old data,
are you thinking?
Nikolay: No, creating indexes,
for example, or reindexing.
Michael: OK, yeah, sure.
Vacuum.
Nikolay: Vacuum, yeah, yeah.
These are the problems biting us much
more often, and more badly, than just...
I don't know. We talk about
this a lot as well, right?
Michael: I know, I know.
Nikolay: Yeah.
Direct performance benefits from
partitioning are good, but also
the problem when creation of index
or full table vacuum, it will
take hours or half a day, it's
already so painful.
Especially if it's to prevent transaction
ID wraparound. Or if,
again, indexing means that the
xmin horizon...
Like, anyway, let's not go too deep.
The question to ask: will we survive
n number of rows, because
we expect that much data?
I want-
Michael: And just out of interest,
is it pretty much always doing
range partitioning based on a timestamp
type?
Is that like-
Nikolay: Again, I'm the wrong
person to ask.
Michael: Okay, yeah, fair enough.
Nikolay: In many cases, I still
tend to use TimescaleDB if it's time-based.
Michael: Makes sense.
Nikolay: And I'm happy because
there is compression there, it's
great.
But I also saw cases when it was
decided, not by AI, to use
this kind of partitioning
with full control over partitions,
special tables created.
It was also working well and served
the needs. But on the surface: you
should think about the workload, and
then push your AI or someone
to think about how it will behave
if you have that much data.
Partitioning is inevitable there.
And then, experiments again.
Even an experiment with 1 billion
rows won't take a lot; just
allocate a machine and do it.
It's just so easy, right?
Half an hour of waiting to generate
it, or more. Even if it's
1 hour, I don't understand why
people don't do it all the
time. They come to
us with questions, for which I'm grateful,
thank you, with money and so on.
But it's so easy just to experiment
more often these days.
Michael: Yeah, you can even
leave it running overnight.
Nikolay: Yeah
Michael: Yeah, I'm still
on the fence. I love partitioning
for the reasons you mentioned,
but I think so many projects won't
ever need it that I understand
not doing it by default with a
new project. Unless, you know,
it's like a new feature for
an existing system that you know
is going to get heavy amounts
of data really quickly.
There are just so many trade-offs
that it's still a little
bit like...
Nikolay: Which trade-offs?
Michael: So, for example, indexes:
being able to create indexes
concurrently, or drop indexes
concurrently.
Nikolay: You can concurrently create an
index on each partition,
and then, once it exists on each
partition, you can attach it.
These things can be automated, and
of course I'd like life to be
much better automated there; there
is big potential here. But
also, every Postgres version
gets a lot of improvements
in this area, for the last maybe 10 years
already.
So it's not like it's frozen.
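The non-blocking recipe Nikolay is referring to here, as I understand it, looks roughly like this (names are illustrative, assuming a parent table `events` partitioned with an `events_2025_01` partition):

```sql
-- 1. Create the index on the parent with ON ONLY: it is created
--    as invalid, and no partition indexes are built yet.
CREATE INDEX idx_events_user ON ONLY events (user_id);

-- 2. Build the matching index on each partition without holding
--    long locks; repeat for every partition.
CREATE INDEX CONCURRENTLY idx_events_2025_01_user
    ON events_2025_01 (user_id);

-- 3. Attach each partition's index to the parent index; once every
--    partition has one attached, the parent index becomes valid.
ALTER INDEX idx_events_user
    ATTACH PARTITION idx_events_2025_01_user;
```

The attach step is metadata-only, so the whole sequence avoids the long blocking build that a plain `CREATE INDEX` on the parent would do, but it does need scripting across all partitions.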
Yeah, but also, like-
Michael: Just being extra careful
that every high-frequency query
you have contains the partition
key and is getting pruned at
planning time. Like, there are a
few gotchas-
Nikolay: that hit you worse.
Michael: Yeah there are,
Nikolay: I have a series of articles
about this, and I went quite
deep there; it was crazy, yeah.
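The pruning point Michael raises is easy to verify with EXPLAIN; a sketch, assuming an `events` table range-partitioned on `created_at`:

```sql
-- Literal partition-key predicate: pruning happens at planning
-- time, and the plan lists only the surviving partitions.
EXPLAIN
SELECT *
FROM events
WHERE created_at >= '2025-01-15'
  AND created_at <  '2025-01-20';

-- With parameters or stable functions (e.g. now()), pruning may
-- move to execution time; in EXPLAIN (ANALYZE) look for
-- "Subplans Removed" rather than a short partition list.
```

If the plan for a high-frequency query shows scans over all partitions, that query is missing the partition key and will get slower as partitions are added.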
Michael: But those are
my hesitations.
The creation of foreign keys,
for example.
Nikolay: That's why I'm hesitant,
for example, working with
TimescaleDB; often
we just abandoned the idea of having
foreign keys. Yeah, it's
important. If you do partitioning
yourself, and you have large tables,
you will deal with adding
foreign keys between them.
If partitioning is involved, it's
an art, I would say.
And GitLab mastered it.
They have migration_helpers.rb.
This is a great source of the experience
of many people involved,
and I can't stop recommending it,
because it's open
source.
It's a great place to look at, and
they also have great documentation
about it.
But anyway, it's a thing to decide.
If you don't expect billions of
rows, don't do it, of course.
At the same time, we also have
very sad cases when people come
to us for consulting: we identify
problems, and we say partitioning
is needed.
Sometimes we go and spend a lot
of effort, because the project
is huge or there are many databases
to take care of.
And then finally it's implemented.
Great.
There are many hidden dangers to
explore.
Midjourney, for example: they
came to us, we helped, forgot about
lightweight locks... well, not forgot,
we didn't know about that
at that time, and they hit it.
The lightweight LockManager problem
was hit badly, and there are articles
about it.
But then some projects just evaluate
how much effort it is to deal
with partitioning when tables are huge,
and start talking: okay, maybe
we should migrate to MongoDB.
Michael: Yeah, or even sharding
instead, right?
I think when we talked
to Notion, Arka mentioned
that because they keep their shards
relatively small, they've
opted to never partition at all.
Nikolay: Yeah, the need for partitioning
diminishes.
I agree with this, but it doesn't
vanish completely.
Michael: What, maybe?
Nikolay: Not maybe.
Michael: Well, you mentioned it depends
how small you keep your shards, right?
Nikolay: Yeah, yeah.
Michael: Like Sugu, for example,
with Multigres; he talked about
having lots of smaller databases
being much easier.
Nikolay: Yeah, yeah, yeah, yeah.
Maybe you're right.
Michael: Okay.
What about RLS?
You mentioned you wanted to touch
on that.
Nikolay: Yeah, finally, RLS.
For example, I just noticed... it's
not only Supabase that is very heavily
promoting RLS; they have a lot of
articles, including how to avoid problems.
But if you just use AI, I noticed
it often adds RLS.
Michael: Oh, really?
Nikolay: Yeah, I just...
Interesting.
This is how it works.
I just noticed I had a couple of
cases where it suddenly brought up:
we're going to use RLS,
it's multi-tenancy here, let's
do it.
Okay, let's do it, but let's also
pause and think about performance
and avoid problems like current_setting
inside a loop, where a
SELECT count over a million rows suddenly
becomes super slow.
So the question to ask AI, again:
a benchmark is ideal here.
With RLS, without RLS, what is
the overhead of having RLS in
our particular case?
Let's benchmark it.
And if you see it's not okay, then
let's optimize it, because
there are tricks for optimizing
it.
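One of those tricks is the current_setting problem mentioned above: wrapping the call in a scalar subquery lets the planner evaluate it once as an InitPlan instead of per row. A hedged sketch, with illustrative table, column, and setting names:

```sql
-- Slow variant: current_setting() may be re-evaluated
-- for every row the policy filters.
CREATE POLICY tenant_isolation ON documents
    USING (tenant_id = current_setting('app.tenant_id')::uuid);

-- Faster alternative (use one or the other, not both):
-- the scalar subquery runs once per query, and the result
-- is then compared like a constant.
CREATE POLICY tenant_isolation ON documents
    USING (tenant_id =
           (SELECT current_setting('app.tenant_id')::uuid));
```

Benchmarking both variants on a realistic row count, as suggested here, is the only way to know whether the overhead matters for a particular workload.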
So if I notice an AI-generated
schema with RLS, I would ask a question
like: what will the performance
impact be?
Let's benchmark it, and if it's
bad, let's optimize.
That's it.
Yeah.
Great.
Great.
That's it, actually.
Michael: I think that's a useful
checklist, actually.
Nikolay: 5 or 6 things to ask your
AI when you design a schema.
Michael: Yeah, or your colleague,
even.
Nikolay: Yeah, also. Or your own
schema, which you've already had
for 10 years, why not.
Michael: Yeah, exactly. Or even a
new table, if you're doing a new
feature; all of these still
make sense.
Nikolay: I agree.
Cool, thank you for listening.
Yeah, see you soon.