Schema design checklist
Nikolay: Hello, hello, this is
PostgresFM.
My name is Nik, PostgresAI, and
as usual with me, Michael, pgMustard.
Hi, Michael.
Michael: Hi, Nik.
How are you doing?
Nikolay: So the topic...
I'm doing great, how are
you?
Michael: I'm good, thank you.
What's the topic?
Nikolay: What's the topic?
The topic I chose, and everyone
already saw it on cover you created,
right?
So the topic is 5 or so, maybe 6, things you should check when you start a new schema, or when you already have some schema, your project is growing, and you want to be better prepared for bigger scale, performance-wise. Especially if you use AI to design and improve your schema. So what should you look at to check your schema health?
Michael: Makes sense.
Just on the topic in general, I do think there are some things that are better suited to asking AI for help than others. Do you think schema design is 1 of them, or is it 1 of those we should still do a little bit more ourselves at the moment?
Nikolay: Well as you know I'm very
very very intensive user of
AI.
Yeah.
Honestly, like I do everything
through AI right now.
And I'm 100% schema design is a
topic where AI can help a lot.
But it also matters which questions
you ask.
If you just say, I want this and
don't pay attention, you will
have low quality schema in many
cases still.
That's why I said, let's have this episode, because it might be quite basic for many advanced users of Postgres. But this checklist we are going to present today can be a good thing to keep in mind whenever you design a schema, just to ask the right questions and revisit.
Also, I'm not sure how exactly
you use AI.
I use it always like I work with
people, developers.
When you just ask something and
think those engineers, humans
or AI, will solve all your problems
and make all the decisions,
you're offloading full responsibility onto their shoulders, and then you always get big negative surprises, right?
So it's better to keep requests
as detailed as possible when
you work with AI, and also make
some iterations to review and
improve.
And when making iterations: first, I want some schema, I don't know, a LinkedIn-like social media schema or something, right?
It will give you immediately something,
but then you want some
details to be checked.
And this is exactly when this checklist can be very helpful.
So you can revisit and think and imagine when you will have gigabytes
and terabytes of data, what will happen.
So to avoid painful refactoring in the future and outages maybe
or performance degradation, the price you can pay in the beginning
is much lower than the price you pay later.
So I never work with AI in 1 shot, I always make iterations to
improve in any question and to revisit and sometimes combine
multiple LLMs.
This is exactly when you can use this list of questions.
So first topic is the data type of primary key.
Michael: Yeah, and I feel like the answer to this has changed,
or my opinion on what the best default for this has changed,
actually, in the last few years, just off the back of learning
more about timestamp ordered UUIDs, right?
This used to be an age-old debate, right, UUID versus integer
or big integer really.
And most of the kind of blog posts I would see are like trying
to just encourage people to use BIGINT over int.
But now that we have first-class support for time-ordered UUIDs,
I don't see many reasons not to go with that.
What do you think?
Nikolay: Well, the size is double: your ID is 16 bytes instead of 8, so it's quite a price to pay. Each row needs 8 more bytes.
It's not that much, is it?
In many cases, it's fine.
Well, indexes will be also bigger.
Indexes double, yeah.
It might make sense if volumes are huge.
But anyway, to keep things simple, a rule to remember, red flags: if you are preparing for really large volumes of data, it must not be int4, it should be int8. And in terms of the UUID path, the data type is always uuid, for both version 4 and version 7. But you should check the default, how it's generated. And in modern Postgres there is a function for it, I think only in 18, the latest version.
Michael: I think so.
It didn't quite make 17, did it?
Nikolay: So if you're on Postgres 18, it's better to use UUIDv7, but also remember the trade-offs, because there is a timestamp in there, and the values will be ordered, like with integer or bigint. So there is a nuance here. But performance-wise it's much better than UUIDv4.
There are many articles about it and you just need to check,
okay, are we going to go with integer?
Then it's int8 or BIGINT, right?
Are we going to go with UUID?
Then default generation, default
should be version 7 for Postgres
18.
For older versions, that's also not an excuse to stick with version 4. You can write your own function. Or you can ask AI to write it for you.
Michael: Yes, or generate them
outside of the database.
Nikolay: I like this option slightly less, because if you have multiple application nodes, who knows what will happen with their clocks? And since UUIDv7 includes a timestamp inside it, the order might be off, I don't know. I prefer to leave it on the database's shoulders, to have a single source of truth on the primary. On the other side, UUID by design serves the purpose of working in distributed systems, right? So maybe it's not a big deal to generate it on the client side.
Michael: All I mean is if you're
on an old Postgres and it saves
you creating your own function
or whatever.
Nikolay: Postgres waited a couple of years to bring UUIDv7 officially. Although, again, you can create even a plain SQL function, not even PL/pgSQL, a simple SQL function. And if you ask your AI to search for an example, it will probably find my page on how to do it, or other people's pages. And libraries, from Google, from Node.js, everywhere, have had it for a couple of years already, longer even. They implemented it before the standard was finalized.
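For reference, the kind of plain SQL function Nikolay describes might look like this widely shared sketch (the function name is arbitrary): it overlays the current Unix-epoch millisecond timestamp onto a random UUID and flips the version bits to 7.

```sql
-- Sketch of a UUIDv7 generator for Postgres before 18.
-- The 48-bit millisecond timestamp replaces the first 6 bytes of a
-- random v4 UUID (same variant bits); set_bit calls set version 0111.
create or replace function uuid_generate_v7()
returns uuid
language sql
volatile
as $$
  select encode(
    set_bit(
      set_bit(
        overlay(uuid_send(gen_random_uuid())
                placing substring(int8send(floor(extract(epoch from clock_timestamp()) * 1000)::bigint) from 3)
                from 1 for 6),
        52, 1),
      53, 1),
    'hex')::uuid
$$;

-- Usage, e.g. as a column default:
-- create table users (id uuid primary key default uuid_generate_v7());
```

On Postgres 18 and later, the built-in `uuidv7()` function makes this unnecessary.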
Michael: Yeah, so huge red flags to look out for: if it's using the integer data type, which is the same as int4, or serial. I think AI sometimes will define it as serial. So if you see that word without the big, as in bigserial, you might hit that limit at 2 billion or whatever, I can't remember the exact upper limit.
Nikolay: Yeah, and avoid UUID version
4 if you want good performance
on large volumes of data.
Yeah, that's it actually.
Let's keep it simple because we
could discuss more advanced situations.
Michael: I think we should do a whole episode someday on migrations, like for people that are getting close to that 2 billion number. That would be a good topic.
Nikolay: I can share, I had a lot
of experience with different
approaches.
Let's do that.
Good.
So how to define primary key?
I actually have an article about it...
Michael: Cool. Let's do that.
Nikolay: Covering multiple paths so we could talk about this.
Number 2 is constraints.
Topic is constraints.
Michael: Other constraints.
Nikolay: Yeah, yeah.
Yeah, other.
Beyond primary key.
Yeah.
So I would just ask, first thing: are we using constraints well? To myself, or AI, and so on. Because Postgres has 6 types of constraints, it's quite rich tooling, and it's much more convenient to have them early. Not because adding them later is a big pain: right now we have quite good support for two-step constraint definitions. I think in Postgres 19 it will also be possible for NOT NULL constraints, which is interesting. But anyway, check constraints, foreign keys, unique constraints, it's possible to create them first in an almost-finished state and then finalize, to avoid long-lasting locks.
But this is not the reason why I prefer to think about it in
the very beginning.
The main reason is the quality of data.
This is why constraints exist in the first place, right?
So if you introduce constraints early in your project, data quality
is good.
Introducing them later probably will...
Michael: Yeah, and database side as well.
I mean, it just...
Nikolay: Oh yeah, yeah.
I don't trust constraints outside the database. With age, my points of view became less radical. So I just observe, without fighting, backend engineers who try to implement constraints on the application side. I just note that they are not reliable, because who knows which other layers or types of applications will appear. Especially these days with AI, it's so easy to add something else, to implement it in a different framework or even a different language, it's super easy right now.
And you will need to reimplement those constraints, or you will
just forget and miss them and data quality will suffer.
Michael: So specifically, something I'd like to see more of, or at least something I see a lot of people not doing: columns that you're not expecting to put any null values in, or don't want any null values in, actually explicitly making them NOT NULL. If you don't specify, columns are nullable.
And I think most people, or at
least most AIs I see, are creating,
like by default, they don't put
NOT NULL on every column.
Nikolay: Yeah, there's an opinion
that NOT NULL should be default.
Anyway, it's hard to change the default, but inside your project you can say: okay, unless explicitly specified otherwise, let's make columns NOT NULL.
Exactly.
And avoid troubles with NULL we
talked about in a special episode
about NULLs, right?
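A sketch of that convention (the table and columns here are hypothetical): write NOT NULL explicitly everywhere, and make a nullable column a visible, deliberate exception.

```sql
-- "NOT NULL unless stated otherwise" as a project convention:
create table profiles (
  id         bigint generated always as identity primary key,
  user_id    bigint      not null,
  email      text        not null,
  bio        text,                           -- deliberately nullable: optional field
  created_at timestamptz not null default now()
);
```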
Michael: Yeah, and I guess in this specific case there is also... well, with most constraints we're doing it for quality reasons, but there is a tiny amount of performance optimization that comes out of them. A bit more for unique constraints: if the database knows for sure this is unique, it can do some optimizations. So anyway, good point, I like that about NOT NULL.
Nikolay: Yeah, yeah, unique is
interesting stuff.
And also, I don't think foreign keys are underused in the Postgres ecosystem. The days when everyone suggested that foreign keys are super bad, for me that's 20 years ago. I remember these topics: let's avoid foreign keys completely. Although we know foreign keys can bite badly, we had an episode with the Metronome company about, basically, MultiXact-related issues which can be caused when you use foreign keys a lot. But anyway, I still think foreign keys are a good thing to have, for data quality.
But I think what is usually underused
is check constraints.
They are great.
They could be used much more often.
And a classic example: when I see the ENUM data type in a schema, I think, okay, later, if we want to change this list of possible values, how will our migration look?
Michael: Especially removing a value, yeah.
Nikolay: So check constraints are slightly more flexible. Well, removing is also a topic, right?
But a check constraint you can introduce in 2 stages. Again, you can drop the old 1, and you can first create a new 1 in NOT VALID state and then validate it in a separate transaction. I just see it as a quite useful, quite convenient tool, and you see it in the schema right away: this column should be, for example, in this range, or in this list, or something. So I wish it was used more often.
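The 2-stage approach, sketched on a hypothetical `orders` table: add the constraint as NOT VALID (cheap, only new writes are checked), then validate the existing rows in a separate transaction, which takes only a weak lock while scanning.

```sql
-- Step 1: instant; only newly written rows are checked from now on.
alter table orders
  add constraint orders_status_check
  check (status in ('pending', 'paid', 'shipped', 'cancelled'))
  not valid;

-- Step 2: later, in a separate transaction; scans the table to verify
-- existing rows without a long-lasting heavy lock.
alter table orders validate constraint orders_status_check;
```

Compared to an ENUM, changing the allowed list is then just dropping the old constraint and repeating these 2 steps with a new one.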
Michael: Yeah.
What do you see in... Going back to foreign keys quickly: do you see a lot of the LLMs doing things like ON UPDATE CASCADE, ON DELETE SET NULL? Do you see any cascade stuff popping in there by default, or missing when you think it should be there?
Nikolay: It's a good question.
I actually don't remember having issues with this question.
I know deferred constraints might be an issue.
I remember a case when pg_repack, it was a company called Miro,
maybe you know it, right?
They had issues with pg_repack and deferred constraints.
There is an article, old article from them about this.
But I don't remember particular issues here, and I think this
leads us actually to the question for our particular projects,
what's better?
And if I had any doubts, with AI I would just make some experiments. Based on the product we are developing, I would say: okay, in 1 year, how many entities do we expect in each table? Then let's fill all the tables with fake data, run VACUUM ANALYZE to have a good state, and then just explore plans, with pgMustard, for example.
Let's think what kind of workload we might have according to
user stories in our specification of the product and so on.
And then collect plans and see, right?
And then include also modifying queries and collect those plans
and think how it will work.
It's so easy these days to make these experiments before we finalize
all the decisions and then you will see it in action and it's
great.
So collect all the plans, find like weak spots where we have
suboptimal plans and how we can deal with this.
If, for example, we have 1 entity and another, a one-to-many relationship, and the many side is really many, right? And then we want delete to be supported in the UI, where people, as we know, expect 200 milliseconds, less than 1 second, let's say. And when this delete of 1 row triggers, propagating through foreign key triggers, the deletion of a million rows, it cannot meet our requirements, right? Then we need to think about asynchronous deletion, maybe using some event queue system or something.
Yeah.
And this will pop up if you prepare a good experiment here, right?
Without experiment, it's hard to say.
You need to answer questions.
How big are the tables, and what will this relationship between tables on foreign keys look like in the worst case? Can a user create, like, a million entities, and then we need to delete this user? This is a great question, and with AI you can iterate so fast these days. So, great.
Collect plans and then think and decide.
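A minimal version of that experiment could look like this (table names and row counts are made up): create the schema, fill it with fake data via `generate_series`, analyze, then collect the plan for the cascading delete.

```sql
create table users  (id bigint primary key);
create table events (
  id      bigserial primary key,
  user_id bigint not null references users (id) on delete cascade
);
create index on events (user_id);  -- the referencing side needs its own index

-- Fake data: 1,000 users, 1,000,000 events spread across them.
insert into users select g from generate_series(1, 1000) g;
insert into events (user_id)
  select 1 + (random() * 999)::int from generate_series(1, 1000000);

vacuum analyze;

-- How expensive is deleting 1 user whose events cascade?
-- (This really deletes; it's a lab experiment, not production.)
explain (analyze, buffers) delete from users where id = 42;
```

If the plan shows the cascade touching far too many rows to fit an interactive latency budget, that's the signal to move the deletion to a background job.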
Okay, it's a good question, but I don't see it as a simple question.
Michael: No, no, no.
It was more, I just asked in case you'd seen that LLMs were throwing it in on every foreign key and that was maybe a bad deal. But it sounds like they're not.
Nikolay: I'm maybe the wrong person to ask about this, because these days I have only a few applications designed and created from scratch with AI. I mostly deal with things which other people already created, and I just see how AI can help.
So I'm not sure how exactly this propagation of deletion, for example, will be implemented. But it's definitely worth thinking about in the context of latencies: what should be offloaded, and what tools we'll be using to offload it, to background jobs, for example. Instead of thinking while looking at code, I would think looking at dynamic results, experiments.
Michael: Yeah, and I'm thinking, if we're talking specifically about schema design when you expect to hit scale, I might default to not using ON DELETE CASCADE, because at scale it can become problematic. So if you had to come up with a best practice for designing at scale, maybe lean in that direction.
Nikolay: That's interesting, because this was my position for a long time, and it's not about AI building or something. This was my position: it's dangerous, because who knows how many rows we will need to delete. It's an unbounded delete, right? An unlimited delete, this is how it feels. Let's design without it.
But then I see in my consulting practice, I see cases where really
large projects have it and it's just not 1 project.
Many projects I see and they keep it and somehow live with it.
They at some point implement non-cascaded, like asynchronous
deletion, but by default they still rely on cascaded deletes
somehow and they are fine.
This is interesting.
So reality shows it's not that bad.
But I agree with your way of thinking and this is my default
way of thinking as well.
Postgres can survive very terrible things these days much better
than 20 years ago.
Including the revolution
of 100 years...
Michael: The only
time I see that being
a horrendous issue, even in small
to medium-sized projects, is if
people haven't indexed the
Nikolay: The second part of the foreign key.
Michael: Exactly, the referencing
column.
Nikolay: There are 2 ends of a foreign key. 1 is always indexed, because it's the primary key; the other 1, it's your job to index, right? To not forget. Yeah, yeah, yeah.
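The fix for the other end is a one-liner; on a hypothetical `events` table referencing `users`:

```sql
-- users.id is covered by its primary key index; the referencing column
-- needs its own index, or referential checks and cascaded deletes will
-- scan the whole child table.
create index concurrently if not exists events_user_id_idx
  on events (user_id);
```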
We now have a checkup, an old checkup, actually. We have a special report just for this case, yeah. Good. These are all good questions, but I feel we could spend a lot of time in the constraints area. I think we had a constraints episode.
We did
Michael: have an episode, yeah.
Let's link that up and move on.
Nikolay: Yeah, let's move on.
An entertaining topic: column Tetris. Sometimes it might bring some quite interesting savings. In many cases it won't, but I see people enjoy this topic, and I decided to include it.
So the idea is, when you design a new schema, the price to pay to reorder columns is very low. You can reorder them easily. The problem is that for Postgres, physically, the column order matters. Due to alignment padding, if int2, for example, is followed by int8, you will see a gap of 6 bytes of zeros in every row. And this is bad. It's a waste of storage and also memory: it's not only disk, those zeros will pollute memory as well. And sometimes you can be so unlucky; I saw cases where, like, 30% of each row was zeros. And Postgres doesn't have automatic reordering of columns. It could, actually, but it doesn't.
Michael: So the order you define it in, let's say you do CREATE TABLE, the order you put those columns in, that's the order they end up in on disk. And every new column you add goes at the end. That makes sense, right? But even if it could be slotted in, let's say you add a new Boolean type, and it could, I don't actually know if that would work, but it always goes on the end.
Nikolay: And there's no way to slot it in. A Boolean is 1 byte.
Michael: Yeah, but I wonder if
that would work.
Nikolay: In a Boolean, 7 bits are wasted already. If you zoom in, like, under a microscope...
Michael: Let's put Boolean, Boolean, Boolean next to each other, right?
Nikolay: No, no, no. Boolean...
Michael: Really?
Nikolay: ...is 1 bit.
Michael: Oh, bit.
Yeah, good point, good point. So, yeah...
Nikolay: But it's stored as 1 byte, 7 bits already wasted. But then, if you have a Boolean and then a timestamp, for example... For example, the first column, as we all agree, should be id, the primary key. I'm joking, it's not always, but it's very popular. Id is the first column, int8, occupying a whole 8-byte word. And then we have a Boolean and then a timestamp. Not only did you waste 7 bits inside the Boolean, that's already by design, Postgres doesn't have anything smaller than 1 byte. But then you waste 7 more bytes: so 7 bits first, and then 7 bytes because of this alignment padding. And your Boolean, which by meaning is only 1 bit, suddenly occupies 8 bytes. It's a huge waste of resources, right?
But the bigger picture here is
that people usually realize that
waste is significant only when
it's a huge table, like 100 million
rows, right?
And then: oh, so much space wasted because we didn't play this column Tetris initially.
But this is like chicken or egg
problem, maybe inverted.
Anyway, so you realize it later,
but you could do it earlier,
but you didn't think about it because
when you have only 1,000
rows, for example, it costs you
nothing.
But again, with AI, it's so easy.
You just say, let's play Column
Tetris in Postgres.
AI should know it.
I haven't tried, but I'm pretty
sure it should know it, because
there are many articles about it.
Michael: Oh, and we did an episode.
Nikolay: Yeah, starting from Stack
Overflow, where this topic
was very well described.
So AI should know it, this topic.
When you create a new table and you think many rows will be stored, just say: let's apply column Tetris to this. Boom, it will reorder the columns. You never have to think about it again.
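The effect is easy to see with `pg_column_size()` over row values, a trick from the Stack Overflow discussion of this topic. On a typical 64-bit build, interleaving 4- and 8-byte columns pads each int4 out to the next 8-byte boundary:

```sql
-- Same 4 values, 2 column orders; the interleaved one carries padding.
select pg_column_size(row(1::int4, 2::int8, 3::int4, 4::int8)) as interleaved,
       pg_column_size(row(2::int8, 4::int8, 1::int4, 3::int4)) as tetrised;
-- interleaved comes out 8 bytes larger: 4+pad(4)+8 + 4+pad(4)+8
-- versus 8+8+4+4 for the "8-byte columns first" order.
```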
Michael: Yeah, what do you think? I feel like there's a... I don't know, I'd still want to do it myself, just because I feel like there's also a natural order for columns. So I would want to put all the 8-byte ones together, and the 16-byte ones, to make sure they're all with similar ones, and then all of the...
Yeah, exactly.
Nikolay: Ah, similar.
Michael: But equally, imagine it's, like, a users table: I could put certain information about that user that makes sense together, and then all the IDs that are maybe foreign keys, like their team ID or their organization, all those kinds of referencing columns, put those together.
So I know the order doesn't matter.
I know you could just specify column
order, but you know when
you do SELECT * or you just...
Nikolay: This is what I was trying
to say, like, are you trying
to tell me that you're using SELECT *?
And here we go.
So I'm joking, actually.
I'm using SELECT * all the time.
Also, all the time I see articles about how bad it is to use SELECT *. Right? It's convenient, I know, yeah. But also, if you work manually, the reordered columns are inconvenient, and SELECT * doesn't work well with that.
But applications shouldn't use
SELECT *.
Michael: You're right.
And I'm even thinking about examples
like the system views in
Postgres.
Like you do SELECT * FROM pg_stat_statements or something.
But that's not even the table,
right?
That's the view.
And then they can define the order.
So it doesn't actually matter.
Nikolay: Yeah, it's an interesting
topic.
I played Column Tetris a few times
heavily, and then it was really
inconvenient.
Oh, this is an ugly table.
It looks ugly.
Like, this order is ugly.
But at the same time, it's your problem, because you use SELECT *, which is considered not good practice. It's a practice for exploring and so on.
Okay, explore with AI, ask to provide
better order or something.
I don't know.
But anyway, if you focus on performance: again, I saw cases where the waste was significant.
Michael: 1 of the first articles I saw was by Braintree, and I think they reported an average of about 10% on some of their largest tables.
Nikolay: It depends.
Michael: Obviously it completely
depends but that was quite surprising
to me.
I know you mentioned an extreme
case where you saw 30%, but I
was quite surprised that in an
actual production case, it was
as high as 10%.
And when you're talking about hundreds
of gigabytes or terabytes,
That's actually a decent amount.
Nikolay: Yeah, yeah, exactly.
The problem also, of course: in existing projects which were developed over many years, usually it's not so simple, because the schema has been evolving over many years and you cannot insert your new column in the middle. It's always, as you said, at the end. So sometimes we do need, for example, to change the order using some redesign.
By the way, have you heard that Postgres 19 will have REPACK, and also REPACK CONCURRENTLY, which is basically VACUUM FULL CONCURRENTLY? We had an episode with the author of pg_squeeze. It's VACUUM FULL CONCURRENTLY, exactly, not VACUUM CONCURRENTLY. But the decision was made to name it REPACK CONCURRENTLY.
Michael: I think that's sensible,
getting away from people thinking
it's vacuum related, yeah.
Nikolay: I haven't looked into
details, but naturally this is
the point where you could reorder
columns, because it's based
on logical replication.
Michael: Yeah, I suspect if they
do that, it will be a different
keyword again.
Nikolay: Who knows?
Again, I haven't looked into details.
There are long discussions on pgsql-hackers
mailing list.
It just feels natural, because it's based on logical replication.
This is where you can reorder columns
and play Tetris for existing
large tables.
I hope it will be implemented at
some point.
Maybe it's already partially implemented
or something.
I don't know.
You basically, before you start
filling the new table, you can
reorder it, because it's logical,
right?
It should be possible.
Anyway, that was an entertaining topic. Let's move on to a more serious topic called indexes: index chaos. If you just use AI, and if you don't spend enough time on things like, as I said, experimenting with real-data benchmarks, you might end up with an under-indexed or over-indexed situation. And usually people are scared of being under-indexed. In my practice, I see over-indexed very often. Sometimes to the extreme.
Michael: I guess for new projects, I would be surprised to see them over-indexed. But for...
Nikolay: It's possible, it's possible still.
Michael: Is that like as a result
of AI or something else?
Nikolay: Well, the problem here is that, again, everything depends on your prompts, depends on the angles. If you say we will need to order by an arbitrary column, right? Remember our episode with the CTO of Gadget, Harry? So we can order by any column. Naturally, AI might decide to put an index on every column, right? But is it really
Michael: what you want?
But that's what Gadget decided
to do as well, right?
Nikolay: Right, but with an understanding of the consequences. But in reality, will it all be used, if 1 year later you see that 90% of those indexes are unused? And Gadget is a special case: they have many applications to support. If you design your own, only 1 application, this is where I would just stop and think, with AI, again with some experiments. I would ask questions. It all depends on workload, right? Which queries will we have? And I would just think which queries we really need and which we won't need, and then choose a proper minimal set of indexes.
Michael: I like that.
I think there's also some rules of thumb we can give people, in terms of how many is where it starts to get dangerous for other reasons. But the 1 place where it's not even an "it depends" is overlapping. And I see it more and more where people are adding indexes based on AI suggestions: they're adding indexes they've effectively already got, or ones that overlap with indexes they've already got, and therefore only 1 of them is needed. And by the way, it won't always show up in your stats as an unused index, because it is still using the smaller index.
Nikolay: It will show up in stats, because it's called, in our case, a redundant index, in Postgres monitoring systems and checkup. I never heard the term overlapping. I see people talk about duplicate indexes, which is the trivial case, the exact same definition. But we call these redundant indexes. It's not as easy a topic as 1 might think, but we solved it in many parts. Over the years, we have developed quite a good approach to reliably detect redundant indexes.
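A trivial case of a redundant (not just duplicate) index, with hypothetical names: a single-column index that is a leading prefix of a composite one.

```sql
create index events_user_id_idx            on events (user_id);
create index events_user_id_created_at_idx on events (user_id, created_at);
-- The first index is redundant: the composite index starts with user_id,
-- so the planner can serve user_id-only filters from it as well.
-- Note the single-column index may still show non-zero scans in
-- pg_stat_user_indexes, which is why "unused" checks alone won't flag it.
```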
And as you say, this is true: in many cases all of those indexes are used. And there is a big fear: oh, this is used, should I really drop it? Yeah, but the reports we have are quite reliable. And of course, I wish we had a native, standard mechanism to disable indexes for some time, without involving HypoPG and hypothetical indexes, which currently supports this.
Michael: Our last episode was on
what's missing in Postgres,
and I think we got a nice comment,
I've forgotten who from, but
saying 1 of their features would
be make this
Nikolay: invisible.
Temporarily hide an index from the planner. There are extensions for it, HypoPG supports it, but we want some native mechanism, beyond setting the index's indisvalid to false. We want this. Yeah, I agree. And this is probably a good thing to work on for those who want to start hacking on Postgres, a good thing to officially implement.
Anyway, the questions to ask all start from understanding workload. We cannot think about indexes without understanding workload. So first we need to generate some fake data and then start thinking about usage patterns. Without AI it's really hard work, a lot of time. With AI it becomes easier. If you have user stories defined already, you understand what kind of patterns we will have, and you can start imagining. It will not be 100% accurate, but it will be good enough, much better than without it or with manual work. So you can iterate here and see. And then, with queries in hand and some lab already developed, I see it should take 1 or 2 hours for, like, a medium-sized project, I mean in terms of complexity. And then you can start iterating, collecting plans, and thinking which indexes you can have, playing with it.
Michael: But also, whilst this might be the most important thing for performance that we've talked about so far, it's also the easiest to change later, right? We can set up a certain set of indexes, and later decide we want a slightly different set, and that's not super painful in most cases, especially if there's no partitioning involved. So for me it feels like, yeah, it's really important, but if we don't get it right on day 1, it's much, much easier to fix than the primary key being the wrong data type, or not having the right constraints in place, that kind of thing. Or even the column order, yeah.
Nikolay: So the point is, you should build, like, a wind tunnel for your project, right? A test lab, yeah. A database lab. And then you should apply this workload, like wind, and see how your database behaves under this wind.
Michael: Yeah, a lot of people do that with production.
They just do it with their slow queries.
Nikolay: But with AI it's easier to have massive experiments
before you finalize all the decisions.
So my point is do it.
And then the questions to ask: are there any indexes we don't need, because they are redundant or unused? What indexes are missing, because we have bad plans? And this, again, pgMustard will visualize, and so on; and you have an API already, right?
Michael: Yeah, yeah, it's quite nice.
Nikolay: You can connect your AI to pgMustard, for example, and ask it to use pgMustard to explain what's happening and find problems and missing indexes in this lab environment. Right?
Michael: Yeah, OK.
Going to the extreme: how many indexes on a table would make you automatically think, oh wow, that's a lot? Maybe it's justified, but I'm already thinking that's too many.
Nikolay: These rules of thumb are quite weak in my opinion, but we have them anyway. For example, if the volume of indexes exceeds the volume of data, some bell is already ringing. Or, before Postgres 18, this nasty lightweight lock manager problem means I don't want more than 15 indexes per table.
Michael: Because of the 16 relation limit.
Nikolay: So a simple primary key lookup, which quite likely will have quite high QPS, will suffer because of locking. So there are a couple of these kinds of rules, right? But they are not strict, and again, I consider them weak, but maybe...
Michael: But I think it's still helpful, right, if you've got
a schema designed by AI and it's got 20 indexes on 1 table, maybe
consider is that smart?
Nikolay: But it happens. Index data exceeding heap data, it happens.
Michael: Especially with some, like if you're doing GIN indexes.
Nikolay: Oh yes.
Michael: If it's just B-trees, yeah.
Nikolay: Yeah, okay, great.
So the final topic, we couldn't decide on which 1 to choose.
We had 2 ideas.
1 is RLS and another is partitioning.
Let's touch both maybe.
Michael: Yeah I think so.
So from an AI perspective, like when you're asking it for a new
schema, do you see partitioning come up?
Does it over partition?
Nikolay: It won't come up unless
you start saying: I'm going to
have a lot of data here, I want
good performance, I want billions
of rows stored in this table, and it
should be maintainable.
Then it will naturally come to the idea
of partitioning.
Partitioning will be interesting
to combine with UUID version 7 or
8, it doesn't matter.
It's possible; again, I have a
recipe for this in my set of how-tos.
And there can also be the question...
like, you cannot develop partitioning
without understanding the workload,
again.
It's similar to indexes: there
will be something, but how will
it behave? Who knows.
Maybe you'll end up with queries
that need to scan all partitions,
which is terrible in most cases.
And then planner behavior.
And then plan behavior.
Anyway, the question to ask is,
like you said, requirements: I
want a lot of rows and I want good
maintainability for this table.
Help me.
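As a minimal sketch of what such a request might produce, here is declarative range partitioning by month (table and partition names are illustrative; in practice a tool like pg_partman usually manages partition creation, and with time-ordered UUIDv7 keys the same idea can be applied to the id column itself):

```sql
-- Parent table; on partitioned tables the primary key
-- must include the partition key.
CREATE TABLE events (
    id         uuid        NOT NULL,
    created_at timestamptz NOT NULL,
    payload    jsonb,
    PRIMARY KEY (id, created_at)
) PARTITION BY RANGE (created_at);

-- Monthly partitions; new ones must be created ahead of time
-- (manually, by cron, or by a partition-management extension).
CREATE TABLE events_2025_01 PARTITION OF events
    FOR VALUES FROM ('2025-01-01') TO ('2025-02-01');
CREATE TABLE events_2025_02 PARTITION OF events
    FOR VALUES FROM ('2025-02-01') TO ('2025-03-01');
```

The maintainability win discussed below comes from each partition being a normal, smaller table: vacuum, reindexing, and dropping old data all operate per partition.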
Michael: Like deleting old data,
are you thinking?
Nikolay: No, creating indexes,
for example, or reindexing.
Michael: OK, yeah, sure.
Vacuum.
Nikolay: Vacuum, yeah, yeah.
These are the problems biting us much
more often, and more badly, than just...
I don't know. We talk about
this a lot as well, right?
Michael: I know, I know.
Nikolay: Yeah.
Direct performance benefits from
partitioning are good, but also
the problem when creation of index
or full table vacuum, it will
take hours or half a day, it's
already so painful.
Especially if it's to prevent transaction
ID wraparound. Or if,
again, indexing means that the
xmin horizon...
Like, anyway, let's not go too deep.
The question to ask: will we survive
n number of rows, because
we expect that much data?
I want-
Michael: And just out of interest,
is it pretty much always doing
range partitioning based on a timestamp
type?
Is that like-
Nikolay: Again, I'm the wrong
person to ask.
Michael: Okay, yeah, fair enough.
Nikolay: In many cases, I still
tend to use TimescaleDB if it's time-based.
Michael: Makes sense.
Nikolay: And I'm happy because
there is compression there, it's
great.
But I also saw cases when it was
decided, not by AI, to use
this kind of partitioning
with full control over partitions,
special tables created.
It was also working well and served
the needs. But on the surface: you
should think about the workload, and
then push your AI or someone
to think about how it will behave
if you have that much data.
Partitioning is inevitable there.
And then, experiments again.
Even an experiment with 1 billion
rows won't take a lot; just
allocate a machine and do it.
It's just so easy, right?
Half an hour of waiting to generate
it, or more. Even if it's
1 hour, I don't understand why
people don't do it all the
time. They come to
us with questions, for which I'm grateful,
thank you, with money and so on.
But it's so easy just to experiment
more often these days.
Michael: Yeah, you can even
leave it running overnight.
Nikolay: Yeah
Michael: Yeah, I'm still
on the fence. I love partitioning
for the reasons you mentioned,
but I think so many projects won't
ever need it that I understand
not doing it by default with a
new project. Unless, you know,
it's like a new feature for
an existing system that you know
is going to get heavy amounts
of data really quickly.
There are just so many trade-offs
that it's still a little
bit like...
Nikolay: Which trade-offs?
Michael: So, for example, indexes:
being able to create indexes
concurrently, or drop indexes
concurrently.
Nikolay: You can concurrently create an
index on each partition,
and then, once it exists on each
partition, you can attach it.
These things can be automated, and
of course I'd like life to be
much better automated there; there
is big potential here. But
also, every Postgres version
gets a lot of improvements
in this area, for the last maybe 10 years
already.
So it's not like it's frozen.
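The non-blocking recipe Nikolay is referring to here, as I understand it, looks roughly like this (names are illustrative, assuming a parent table `events` partitioned with an `events_2025_01` partition):

```sql
-- 1. Create the index on the parent with ON ONLY: it is created
--    as invalid, and no partition indexes are built yet.
CREATE INDEX idx_events_user ON ONLY events (user_id);

-- 2. Build the matching index on each partition without holding
--    long locks; repeat for every partition.
CREATE INDEX CONCURRENTLY idx_events_2025_01_user
    ON events_2025_01 (user_id);

-- 3. Attach each partition's index to the parent index; once every
--    partition has one attached, the parent index becomes valid.
ALTER INDEX idx_events_user
    ATTACH PARTITION idx_events_2025_01_user;
```

The attach step is metadata-only, so the whole sequence avoids the long blocking build that a plain `CREATE INDEX` on the parent would do, but it does need scripting across all partitions.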
Yeah, but also, like-
Michael: Just being extra careful
that every high-frequency query
you have contains the partition
key and is getting pruned at
planning time. Like, there are a
few gotchas-
Nikolay: that hit you worse.
Michael: Yeah there are,
Nikolay: I have a series of articles
about this, and I went quite
deep there; it was crazy, yeah.
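The pruning point Michael raises is easy to verify with EXPLAIN; a sketch, assuming an `events` table range-partitioned on `created_at`:

```sql
-- Literal partition-key predicate: pruning happens at planning
-- time, and the plan lists only the surviving partitions.
EXPLAIN
SELECT *
FROM events
WHERE created_at >= '2025-01-15'
  AND created_at <  '2025-01-20';

-- With parameters or stable functions (e.g. now()), pruning may
-- move to execution time; in EXPLAIN (ANALYZE) look for
-- "Subplans Removed" rather than a short partition list.
```

If the plan for a high-frequency query shows scans over all partitions, that query is missing the partition key and will get slower as partitions are added.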
Michael: But those are
my hesitations.
The creation of foreign keys,
for example.
Nikolay: That's why I'm hesitant,
for example, working with
TimescaleDB; often
we just abandoned the idea of having
foreign keys. Yeah, it's
important. If you do partitioning
yourself, and you have large tables,
you will deal with adding
foreign keys between them.
If partitioning is involved, it's
an art, I would say.
And GitLab mastered it.
They have migration_helpers.rb.
This is a great source of the experience
of many people involved,
and I can't stop recommending it,
because it's open
source.
It's a great place to look at, and
they also have great documentation
about it.
But anyway, it's a thing to decide.
If you don't expect billions of
rows, don't do it, of course.
At the same time, we also have
very sad cases when people come
to us for consulting: we identify
problems, and we say partitioning
is needed.
Sometimes we go and spend a lot
of effort, because the project
is huge or there are many databases
to take care of.
And then finally it's implemented.
Great.
There are many hidden dangers to
explore.
Midjourney, for example: they
came to us, we helped, forgot about
lightweight locks... well, not forgot,
we didn't know about that
at that time, and they hit it.
The lightweight LockManager problem
was hit badly, and there are articles
about it.
But then some projects just evaluate
how much effort it is to deal
with partitioning when tables are huge,
and start talking: okay, maybe
we should migrate to MongoDB.
Michael: Yeah, or even sharding
instead, right?
I think when we talked
to Notion, Arka mentioned
that because they keep their shards
relatively small, they've
opted to never partition at all.
Nikolay: Yeah, the need for partitioning
diminishes.
I agree with this, but it doesn't
vanish completely.
Michael: What, maybe?
Nikolay: Not maybe.
Michael: Well, you mentioned it depends
how small you keep your shards, right?
Nikolay: Yeah, yeah.
Michael: Like Sugu, for example,
with Multigres; he talked about
having lots of smaller databases
being much easier.
Nikolay: Yeah, yeah, yeah, yeah.
Maybe you're right.
Michael: Okay.
What about RLS?
You mentioned you wanted to touch
on that.
Nikolay: Yeah, finally, RLS.
For example, I just noticed... it's
not only Supabase that is very heavily
promoting RLS; they have a lot of
articles, including how to avoid problems.
But if you just use AI, I noticed
it often adds RLS.
Michael: Oh, really?
Nikolay: Yeah, I just...
Interesting.
This is how it works.
I just noticed I had a couple of
cases where it suddenly brought up:
we're going to use RLS,
it's multi-tenancy here, let's
do it.
Okay, let's do it, but let's also
pause and think about performance
and avoid problems like current_setting
inside a loop, where a
SELECT count over a million rows suddenly
becomes super slow.
So the question to ask AI, again:
a benchmark is ideal here.
With RLS, without RLS, what is
the overhead of having RLS in
our particular case?
Let's benchmark it.
And if you see it's not okay, then
let's optimize it, because
there are tricks for optimizing
it.
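One of those tricks is the current_setting problem mentioned above: wrapping the call in a scalar subquery lets the planner evaluate it once as an InitPlan instead of per row. A hedged sketch, with illustrative table, column, and setting names:

```sql
-- Slow variant: current_setting() may be re-evaluated
-- for every row the policy filters.
CREATE POLICY tenant_isolation ON documents
    USING (tenant_id = current_setting('app.tenant_id')::uuid);

-- Faster alternative (use one or the other, not both):
-- the scalar subquery runs once per query, and the result
-- is then compared like a constant.
CREATE POLICY tenant_isolation ON documents
    USING (tenant_id =
           (SELECT current_setting('app.tenant_id')::uuid));
```

Benchmarking both variants on a realistic row count, as suggested here, is the only way to know whether the overhead matters for a particular workload.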
So if I notice an AI-generated
schema with RLS, I would ask a question
like: what will the performance
impact be?
Let's benchmark it, and if it's
bad, let's optimize.
That's it.
Yeah.
Great.
Great.
That's it, actually.
Michael: I think that's a useful
checklist, actually.
Nikolay: 5 or 6 things to ask your
AI when you design a schema.
Michael: Yeah, or your colleague,
even.
Nikolay: Yeah, also. Or your own
schema, which you've already had
for 10 years, why not.
Michael: Yeah, exactly. Or even a
new table, if you're doing a new
feature; all of these still
make sense.
Nikolay: I agree.
Cool, thank you for listening.
Yeah, see you soon.