
Multi-column indexes
Nikolay: Hello, hello, this is Postgres.FM.
My name is Nik, Postgres.AI and as usual, my co-host is Michael,
pgMustard.
Hi, Michael.
Michael: Hello, Nik.
How are you doing?
Nikolay: I'm doing great.
How are you?
Michael: I'm good, thank you.
Nikolay: I was about to say "I thought", but actually it was you who
thought there aren't enough episodes starting with the word "multi".
So what is the topic today?
Tell us.
Michael: Yeah, I chose multi-column indexes and I know we've
talked about them a bunch.
We've done episodes on things like index-only scans, for example.
But it felt to me looking back at those that we didn't really
get into some of the details around multi-column indexes that
I find particularly interesting.
I think also this is still the area I see people, especially
fairly experienced full stack and backend engineers, struggling
with in terms of performance, not realizing how big a difference
multi-column indexes can make or how helpful they can be. A lot of
the workloads I see are web applications with lots of reads, maybe
80 or 90% reads, and a lot of effort goes into optimizing the
performance of those reads, especially the odd aggregate-type query.
And index-only scans help a bunch, but only once you understand
multi-column indexes.
And I think there's some detail in there that I'd like to discuss
with you.
So yeah, I thought it was about time we actually had one dedicated
to this.
Nikolay: Yeah.
Great.
I guess we will be going slightly beyond just multi-column indexes.
Michael: Right.
Possibly.
What do you mean?
Nikolay: Well,
maybe the question is how we can compare them, for example,
having just indexes on a single column, but many of them.
And sometimes I have a case where people intentionally understanding
consequences chose to cover every column with index, a single
column, I guess.
And yeah, we usually say it's an anti-pattern, right?
Michael: Well, I mean, yeah, I guess it depends.
I've not seen a workload, like with a set of access patterns
where that would make the most
sense.
Personally I tend to see tables
that have some columns that get
queried a lot more than others
and the groups of columns that
get queried very often together.
Lots of equality searches amongst,
for example, account ID, User
ID, but there may be a range search
in terms of date.
And like, that's the kind of Query
that ends up being like so
much more efficient with a multi-column
index than with a series
of single column indexes.
So yeah, the kind of access patterns
I see, I can't imagine living
without them, But I'm sure that,
well, I'm looking forward to
hearing about the case where it
made most sense to...
Nikolay: Well, it was multi-tenant
application where the freedom
of Queries is unpredictable.
So every tenant can choose different
types of filtering and ordering,
and we don't know in advance.
And they are not engineers, those
tenants, right?
So we need to give them some support,
at least somehow.
But index-write amplification,
I guess, is a killer in this use
case if data size grows.
Michael: Yeah.
And given it's a multi-tenant application,
are we talking single
Database per...
Nikolay: Separate tables.
Michael: That makes way more sense.
So that starts to make some sense
then at least.
Nikolay: Yeah, there are smaller
Tables of course.
And there's isolation, complete isolation.
So all blocks, and all tuples in
these blocks, buffers or pages,
however you name them, belong
only to a specific tenant,
so all problems are localized.
Anyway, if we think of columns A and
B and we have 2 separate indexes,
one on each column, when does it make
sense to consider a 2-column index
instead?
This is the question, I guess.
Michael: Yeah, yeah, I like that
a lot.
And I guess we probably should
go 1 step back and say probably
we're going to be talking about
B-tree indexes for most of this.
Nikolay: How about btree_gin indexes?
Michael: Yeah, well, there's a few
the docs mention, and it's quite
funny, the word "only" is still in
there, I think from the initial
commit.
So it says: currently, only the B-tree,
GiST, GIN and BRIN index
types support multiple-key-column
indexes.
And "key" is the important
word there, because that
means key columns, not INCLUDE
columns.
So there is a subtle difference
there.
Nikolay: Well, GIN indexes can't
be multi-column?
Michael: GIN, So GIN can, BRIN
can, GiST can and B-tree can.
So when it says only, I think the
only inbuilt index type that
doesn't support multi-column indexes
is hash.
So I think only is a bit of a dubious
word there.
Nikolay: Is that it?
Is that it?
Michael: That's it?
I can't think of a sixth 1.
Nikolay: Okay.
Michael: Or maybe SP-GiST, but like
I think that's a subcategory
of GiST.
Anyway, basically I think we're
probably going to be talking about
B-trees, since the vast majority
of indexes are B-trees, and the
vast majority of multi-column indexes
I see are B-trees.
Have you got any different examples
other than, well, yeah, maybe
btree_gin?
Nikolay: Yeah, there's a problem
with GIN.
I always like to talk about this:
if you have a scalar column,
a column of scalar values like timestamp
or integer or bigint, and
you need to order by it and limit,
like, 125, then it can be
inefficient, because the planner
has to choose to use only the
B-tree index or only the GIN index.
And this is where btree_gin can help,
combining them into a multi-column
index, right?
So we can talk about hybrid indexes
when 1 layer is GIN and another
is B-tree, for example.
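A sketch of what such a hybrid btree_gin index can look like, using a hypothetical posts table (the table, column names, and query here are illustrative, not from the episode):

```sql
-- btree_gin provides GIN operator classes for scalar types,
-- so a scalar column can sit inside a GIN index.
CREATE EXTENSION IF NOT EXISTS btree_gin;

CREATE TABLE posts (
    id         bigserial PRIMARY KEY,
    user_id    bigint NOT NULL,
    tags       text[] NOT NULL DEFAULT '{}',
    created_at timestamptz NOT NULL DEFAULT now()
);

-- One hybrid index covering both the array and the scalar column:
CREATE INDEX posts_tags_user_gin ON posts USING gin (tags, user_id);

-- Both conditions can then be served by a single scan of this index:
SELECT id FROM posts
WHERE tags @> ARRAY['postgres'] AND user_id = 42;
```

One caveat worth noting: btree_gin supports only equality on the scalar column, and GIN scans don't return rows in index order, so this combines filters rather than serving ORDER BY ... LIMIT directly.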
Michael: Yeah, well, and I think
that kind of starts to get to
why we need or want multi-column
indexes in the first place,
because I think a large part of
it is ordering or clustering.
B-tree is a simple example:
beautifully, it has a single order
through the entire structure,
right?
Nikolay: Single.
Michael: Yes, exactly.
So the order is really useful for
things like enforcing unique
constraints.
Like things can be, things are
placed in an order and you can
see, is there 1 of this already
in place?
And multi-column just gives you
an order by the first of
these columns first, and then, for
any that are equal, order by
the second column next. I don't
think the docs make this super
clear. I think they give a good
example, but it could be clearer
that it's a single order: order by A,
then by B, then C, then D,
however many you list, and that's
if they're key columns.
INCLUDE is a bit different: it
doesn't affect the ordering,
and that has some benefits.
Well, we can discuss whether it
does, I guess, in a bit.
That's 1 of the main things I wanted
to discuss with you.
Actually, I think that's a particularly
interesting part.
But that ordering thing is really,
really important in terms
of then choosing which order to
index things in in the first
place.
So I've never, I've never come up with or come across a guideline
that I'm totally happy with in terms of...
Nikolay: Well, it's straightforward.
We discussed this 20 years ago in some circles.
So, by the way, it's not only ordering but also filtering, if your
filters involve range comparisons or the BETWEEN operator, right?
Or in the case of GiST, it can be something else.
So anyway, it can be not only strict ordering, but also questions
like: is it inside the range or not?
Greater-than or less-than operators.
But as for the rule, ah, let me this time also take a step back
and say, super roughly, at a high level, forgetting about parallel
operations and so on: we have 3 types of access to data.
Sequential scan on one side: if you have a lot of data, it's the slowest.
On the other side, index scan and index-only scan, let's combine
them in this case: a single index scan, and especially a single
index-only scan, is the fastest if you have a lot of data.
And between them there is a bitmap scan, which is a combination
of a bitmap index scan and a bitmap heap scan.
And this is my proposed explanation of why we need a multi-column index.
In the case of 2 columns and filtering or ordering on them, if you
have 2 separate single-column indexes, you will likely end up
using either one of them, or maybe both of them, but in the form
of a bitmap scan, which is not the most efficient.
And a multi-column index can bring you the most efficient plan,
either an index scan or even an index-only scan, right?
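A minimal way to see this difference for yourself, assuming a hypothetical events table (names and values are illustrative):

```sql
-- Hypothetical table to compare the access paths described above.
CREATE TABLE events (a int, b int, payload text);
CREATE INDEX events_a ON events (a);
CREATE INDEX events_b ON events (b);

-- With two single-column indexes, a query on both columns typically
-- becomes a BitmapAnd of two bitmap index scans plus a bitmap heap scan
-- (with enough rows loaded; on an empty table it's just a seq scan):
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM events WHERE a = 1 AND b = 2;

-- A multi-column index lets the planner use a plain index scan instead,
-- usually touching far fewer buffers:
CREATE INDEX events_a_b ON events (a, b);
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM events WHERE a = 1 AND b = 2;
```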
Michael: Yeah, and I think There are exceptions or times where
they're pretty much equal in efficiency.
Nikolay: Of course.
Michael: But it's very, very easy to demonstrate with non-overlapping
datasets.
You quite often use the social media example, right?
If we had an index on user ID and a separate index on created_at,
or some kind of timestamp, and we're looking at all of a user's
posts from a certain time range: even though we can do a bitmap
index scan on user ID just to find all of that user's posts, and
then a bitmap index scan on all of the posts in that time range,
it's very easy to see how we're having to scan a ton more data
that way.
The bitmap itself will be relatively small if that user doesn't
have many posts, and the eventual heap scan will be relatively
small and relatively efficient.
But the initial index scans to get the right data are huge compared
to having a multi-column index with all of the user's posts next
to each other with their created_at timestamps.
So it's, yeah, in terms of, we
often talk about buffer reads,
right?
Nikolay: We could create extended statistics,
right?
And then the planner would know that
this user was recently active,
so we should follow the index
on created_at.
And another user is not recently active,
so it's better to fetch all
entries for this user and then
order them in memory, a top-N
sort in memory.
Michael: But with any sufficiently
active system, both would
lose massively in terms of buffer
reads to the index-only
scan, or even an index scan, on a
single multi-column index.
Nikolay: Yeah.
Yeah.
Anyway, what also matters here
is random_page_cost, right?
Because in reality an index scan
might still be better, or vice
versa, but the planner might choose
a bitmap scan over an index scan
because random_page_cost is 4
by default, for example.
Anyway, let's return...
Michael: While we're on the topic
of outdated things, I actually
think there's a relevant part of
the docs that I have an issue
with that I wanted your opinion
on.
It's in this area.
It says multi-column indexes should
be used sparingly.
In most situations, an index on
a single column is sufficient
and saves space and time.
Indexes with more than 3 columns
are unlikely to be helpful unless
the usage of the table is extremely
stylized. Those sentences are still
there in the latest versions and,
to be fair, they've
existed pretty much since 7.2.
Nikolay: I understand.
In reality, it depends on the application.
In some applications, there are
absolutely critical queries which
must have an index scan, and sometimes
even an index-only scan, with
autovacuum tuned aggressively and
the visibility map maintained very
well so the index-only scan doesn't
degrade.
So I think this is like, I can
see why this was written, but
I cannot agree with it in all cases
because definitely we have
3 column indexes sometimes because
we choose to have the most
efficient path specifically to
support specific queries, which
are most important for us.
They receive a lot of traffic,
so we cannot allow degradation
here.
Michael: Yeah.
What about 4 or 5 columns?
I guess it's rarer.
I don't see many, but I've seen
a few.
Nikolay: Well, let's postpone this
question a little bit, because
I know you also wanted to touch
on covering indexes, right?
Because maybe that's where we will discuss this. But let me
return to your previous question.
Yeah.
How to choose the order.
And I remember the rules. There are 2 of them,
and you need to spend time understanding them.
I'm pretty sure the most experienced listeners know this very
well, but I know our podcast also has a lot of less experienced
listeners.
So, 2 rules. Rule number one, and this is maybe the most popular
in terms of mentions: an index on columns (A, B) makes a separate
single-column index on column A not needed, and you can drop it.
Right, because this 2-column index can support queries that
need only an index on column A.
And this means that if you have queries dealing with both
A and B, in terms of ordering and filtering, and you also have
queries that deal with only A, you should put A in the first
place.
This is rule number one.
And there are nuances, of course.
And the second rule is for when you cannot choose by the first
rule: imagine you have only one query, which involves both
columns.
How to choose the order?
The rule says: choose the column which gives you the
highest selectivity and put it in the first place, so the
scope of your scan narrows down faster.
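The first rule can be sketched in DDL, assuming a hypothetical table t1(a, b):

```sql
-- Hypothetical table illustrating rule number one (the prefix rule).
CREATE TABLE t1 (a int, b int);

-- An index on (a, b) also supports queries filtering on a alone...
CREATE INDEX t1_a_b ON t1 (a, b);

-- ...so a separate single-column index on a is usually redundant:
-- CREATE INDEX t1_a ON t1 (a);   -- can typically be dropped

-- Rule two applies when only one query uses both columns: put the
-- more selective column first so the scan narrows down faster.
```

Note the rule is asymmetric: the (a, b) index does not efficiently serve queries filtering on b alone.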
Michael: Yeah, I like it.
And I think for a single query, there's another
rule I've heard: put your equality conditions leftmost.
And then if you have a range condition or an inequality condition,
like a greater-than or less-than, that needs to come last.
So you want all of your equality conditions first, then a range
condition like one on created_at.
Nikolay: So with the previous rule, imagine...
Michael: Well, but it's really important. Let's say you
have a query that has WHERE A = 1, B = 2, C = 3,
and then D is between 17 and 1001.
It's way more efficient, depending a little bit on selectivity,
and I'm sure you can come up with examples where it isn't true,
but having A, B and C in some order and then D last is really
important to making sure the index scan reads as few buffers
as possible.
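Michael's example can be sketched like this, assuming a hypothetical table t2(a, b, c, d):

```sql
-- Hypothetical table: equality columns first, the range column last.
CREATE TABLE t2 (a int, b int, c int, d int);
CREATE INDEX t2_eq_then_range ON t2 (a, b, c, d);

-- The equality conditions pin down one contiguous region of the index,
-- and the range on d is then a single continuous slice of that region:
SELECT * FROM t2
WHERE a = 1 AND b = 2 AND c = 3
  AND d BETWEEN 17 AND 1001;

-- With d placed first instead, the scan would have to walk the whole
-- d range and re-check a, b, c inside it, reading far more buffers.
```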
Nikolay: But imagine we have a query, forgetting about ORDER BY
and LIMIT, which selects with 2 filters.
One is some boolean column and the other is a time range.
So what should go first?
Michael: Well, think about it though. Let's
say it's 50/50, that's the simplest case, right?
We've got 50% true, 50% false.
If we have a different distribution, it's trickier.
Nikolay: It should be a partial index, actually.
But yeah.
Michael: Maybe depending if we need to like query the opposite
1 sometimes.
Nikolay: Maybe it should be, maybe it should be 2 partial indexes.
Michael: Or I don't like, I don't think so.
I don't think 2 partial indexes are more efficient than…
Nikolay: But do you agree that a filter on date, if, for
example, we know that we usually select narrow ranges, like
a day or at most a month, and we have a lot of data for many years,
do you agree that the filter on the timestamp gives you better
selectivity in the general case?
Michael: I think it has to, right?
What do you mean?
Nikolay: Well, of course, if a boolean column
has 2 values and most rows have one value, but you're
selecting the other one, it's a different story.
Like WHERE is_deleted,
and we know most rows are not soft-deleted, right?
Or something like this.
In this case, the boolean filter can give you better selectivity,
but it's less common.
Michael: But I think a scan, let's say, let's think about like
the buffer reads, if you, if it's true and between these dates
or between these timestamps, our index scan can be perfectly
efficient if we are indexed by boolean first then timestamp because
we only have to look at the true values between a certain date
range.
If we mix that, if we do date range then boolean values, we discard
half the values we read.
We read all of the date range and have to discard half the Boolean
values.
Nikolay: Why would we read all of the date range?
I don't get it.
Why?
Michael: We should test it.
I reckon there'll be double the buffer reads.
Nikolay: I forgot to mention my main rule, always doubt and test
in reality.
Michael: Based on my understanding, you'll get double the buffer
reads for putting the date first and the Boolean second.
All of the rules of thumb are good, but they're not perfect.
And actually they're slightly contradictory.
And I would say even "most selective first": think about
the shared multi-tenancy thing.
I prefer account ID before user ID, even though user
ID is definitely more selective.
Partly because data locality is one thing, but also we're very
rarely serving a single query with a single index.
Normally we're serving more than one query.
But then the rules start to become a little bit more of an art
form and you're trying to think what will they be in future?
What's most likely, what, what might we want to query together?
Like it's quite likely we want to query users in the same account
together at some point. Maybe we want to look at all posts by
this account or this team, and having that data closer together
in the index has benefits compared to ordering by user ID.
And no matter how you're assigning user IDs, maybe with a bigint
they'll be clustered close together, but they won't all be.
You might add a team member many years after setting up your
account, that kind of thing.
So I prefer some breaking of those rules.
Nikolay: There is one more rule, which is foundational:
data locality. By the way, I would like to think more about it.
And also partitioning: if we have it, of course we don't want
to scan all partitions, right?
And actually partitioning serves as kind of an index.
Michael: Kind of leading, like partition keys, kind of the leading
column or leading columns.
Nikolay: I remember I was trying to create an index on the
partitioning key in each partition.
It was so strange, because each partition had only one value.
I was trying to include it as an additional column, and somebody
said, we have partitioning, why do we do this?
And I realized, oh, partitioning eliminates the need for an
additional column in some indexes.
So what I was trying to say: an additional rule is to remember
the order of execution of the whole query.
So let's keep it simple.
Order by goes last.
Right.
So if we have ORDER BY created_at DESC LIMIT 25, it means created_at
should be the last column in the multi-column index.
Right?
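This can be sketched with the hypothetical posts(user_id, created_at) table from earlier (names are illustrative):

```sql
-- Equality filter first, the ORDER BY column last:
CREATE INDEX posts_user_created ON posts (user_id, created_at DESC);

-- The index delivers rows for user 42 already sorted by created_at DESC,
-- so Postgres can stop after 25 rows, with no separate sort step:
SELECT * FROM posts
WHERE user_id = 42
ORDER BY created_at DESC
LIMIT 25;
```

With created_at first instead, entries for user 42 would be scattered through the whole index, and the scan could not simply stop at the limit.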
Michael: Yeah.
I never thought about that.
But yeah, that fits with the range
thing I was talking about.
Nikolay: Yeah, but because of the
order of execution, if you
put created_at in the first place,
it won't be helpful.
It's a simple rule.
Intuition should tell you this,
but maybe not.
But I agree with you that rules
can be confusing a little bit,
and it's better to test and learn
and sometimes doubt and test
again.
And testing, we should pay attention
to buffers.
Michael: Yeah.
This, well, and this is going to
get better in Postgres 18, right?
People are going to get it by default.
Yeah, me neither.
Nikolay: We already have customers
migrating to 17.
So next year I expect buffers will
become the default, and I hope
we will see a growing number of
plans with buffers involved,
because right now it's difficult.
Michael: Yeah, we definitely will.
Nikolay: Cool.
What about covering indexes?
Michael: Well, yeah, when I saw
the commit for adding INCLUDE
to CREATE INDEX...
So CREATE INDEX added this option
to include certain columns
in the leaf pages, in the tuples
that are stored in the leaf
pages, so you can return more data.
For example, maybe a query wants
to filter by user ID equals this,
order by this date, and also return
the name. The name can be
a payload: it doesn't have to be
in the index ordering, but it can
be there in the index leaf
pages.
To achieve index-only scans.
Yeah, very, very important.
That's, I think, the only use case
for this, but it's a very important
one.
Important enough to get added.
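A minimal sketch of the INCLUDE syntax, assuming a hypothetical users(email, name) table:

```sql
-- email is the key column (searchable, ordered); name rides along
-- in the leaf pages purely as a payload:
CREATE INDEX users_email_inc_name ON users (email) INCLUDE (name);

-- This query can become an index-only scan: both email and name come
-- straight from the index leaf pages, with no heap fetches
-- (assuming the visibility map is well maintained):
SELECT name FROM users WHERE email = 'a@example.com';
```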
Nikolay: The whole purpose is to
avoid heap fetches, but under
heavy load that works only if you
tuned autovacuum properly and
avoid xmin horizon blockers.
Michael: Yes, yes.
Otherwise your index-only scans
will end up just regular.
Nikolay: Slowly degrade to index
scans basically.
Michael: Yeah.
So yeah, but "covering index":
the word covering for me is
slightly loaded, because for me
an index is covering for a query
if you can serve that query with an
index-only scan.
In my opinion, we had covering
indexes before this, before includes,
because if you had a multi-column
index on all of the columns
that you needed, that was a covering
index for that query.
Nikolay: My favorite
topic, criticizing the choice
of terminology.
I agree with you.
Michael: But I'm
only reiterating this because
I think it's important for people
to understand.
Like I think it could be confusing
if you hear that covering
indexes only exist when
you use the INCLUDE keyword, because it
is confusing.
So yeah, it got added in Postgres
11, 2018.
And I'm interested in the
benefits: why not just add it
as another key column?
What are the downsides to doing
that?
Nikolay: When they were added in 2018,
I spent some time thinking
about it, and I came to the conclusion
that the key difference is the
semantics of uniqueness.
If you add additional key columns to
your multi-column indexes, you
change the uniqueness semantics, right?
Michael: That's a great point,
yeah.
Nikolay: A regular index with an extra
column can work well, but you
cannot do that if it's a unique index.
This is where you should use INCLUDE.
But it's quite a rare
case, actually.
Michael: So let's say we want an
index on email, back to last
week's episode, or whenever it was.
We want it to be unique on email,
but we want to include the
name as a payload, while still
letting people have the same name
as each other.
But wait, because it's unique on
email already, wouldn't it always
be unique on email and name?
Oh no, because if they had a different name,
they could use the same email.
Yeah.
That's a problem.
Nikolay: Right, we don't want to have
uniqueness on pairs of values.
We want uniqueness only on email,
and name is, as you said, a payload.
That's a good way to look at it:
you're just putting it there as a
passenger.
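The distinction Michael and Nikolay describe can be sketched like this, again assuming a hypothetical users(email, name) table:

```sql
-- Uniqueness is enforced on email only; name rides along as a payload
-- available to index-only scans:
CREATE UNIQUE INDEX users_email_uq ON users (email) INCLUDE (name);

-- Compare: making name a key column changes the uniqueness semantics.
-- This would (wrongly, for this use case) allow the same email to
-- appear twice as long as the names differ:
-- CREATE UNIQUE INDEX users_email_name_uq ON users (email, name);
```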
Michael: Yeah, that makes perfect
sense.
Nikolay: Yeah.
I also think, do you remember,
is there any size difference?
Michael: Well, yeah, I was going to bring up the exact
same thing.
I wondered if it also got done
partly for that reason, because
of the structure of a B-tree:
once it gets sufficiently large,
you've got levels, kind of like
an org chart, layers of
management almost.
If you have a multi-column
index with key columns, where
they're part of the ordering, you
need to preserve that information
at the different levels, for
ordering purposes. But because
INCLUDE columns aren't needed
for ordering, they're not relevant
there: they only live in the leaf pages,
at the bottom level of the
B-tree.
So size-wise there should be a
difference, but I think this
has changed recently: deduplication
changes this. Not
for unique indexes, but in cases
where you can have duplicates,
which is quite a lot of indexes,
I actually think deduplication
will normally beat the benefit
of not having to store the data
at multiple levels.
Obviously it depends a little bit
on how big this data is, and
you're probably not putting huge
payloads in there.
But yeah, I think there was a size
benefit, and I think it's smaller
now that we've got the B-tree
deduplication work.
Nikolay: Yeah, and obviously, the
columns in the INCLUDE part
of CREATE INDEX cannot be used
in filtering and ordering.
They are like passengers; they
are only beneficial for
index-only scans.
But since they are sitting
basically in the leaves, do you remember
this?
There is a problem with B-tree on text:
there is a limitation on key
size, something like 2,000-something
bytes.
I guess in this case, for columns
used in INCLUDE, there
is no such limitation.
Right?
Nikolay: I'm not...
Michael: There must be a limit,
right?
Because it's still going into normal
pages.
Nikolay: I'm not sure.
I'm not sure.
Let's not jump to...
Michael: Wait, yeah, but I read
this in documentation and I haven't
made note of it, but it is in the
docs, I'll find it.
Nikolay: But I must admit, this
is a rare beast to meet in the
wild.
I don't see the INCLUDE keyword
often.
I see it, but not often at all.
Michael: Oh yeah, I was going to
ask.
So you, so generally you see people
with multi-column indexes,
but they're all keys.
Is that fair?
Nikolay: Yeah.
Well, this is what my experience
tells me.
INCLUDE is still kind of exotic.
It's good to have it, but it's not so
obvious when exactly you should
use it.
Well, we discussed it, but still,
in normal practice you say: ah,
I will just add it as a key column.
If it's not a unique index, why should
I think about it?
If there are no key size limitations
to be met, if it's just some
timestamp or number, it's easier
to add it as a key.
So I wanted to touch on one more topic
here.
When should we completely avoid
putting another column into a multi-column
index, or into this type of index we
just discussed, which is called
covering, though you don't like
that word?
When should we avoid adding such
columns, even when we see an
obvious benefit?
Michael: So when you say benefit,
like benefit to a specific
set of read queries, what could
be faster?
Nikolay: Imagine we have some simple
case: a table, some
filter with an order by, and it's
already working well. But we
see there is also a column status,
it has several values, and most
of the time we select 70% of
everything, like status
equals active or something.
But we think, actually we could
shave something off, because we see
some rows are filtered out dynamically.
This is what pgMustard always points out
immediately: the index works
inefficiently, because some
rows are excluded later, right?
Rows removed by filter, or something like
that.
What is it called?
Michael: Well, yeah, there's a
couple of ways.
It can be either a filter or an
index recheck.
Some types of index and sometimes,
for example, with a bitmap
scan.
I was looking at this today actually,
because Postgres 18 changes
this a little bit, but exact heap
blocks and lossy heap blocks,
so that can end up with a recheck.
Nikolay: But I'm talking about
rows removed, because we have
an additional part of the work,
Michael: filter, yeah, filter
Nikolay: and they are filtered out
dynamically and we think,
oh, it's not efficient.
Let's go fully efficient.
And we start adding, for example,
status as an additional
column in our multi-column index.
Or it was a single-column index,
but it becomes 2 columns, right?
Or we decide to use it as a condition
and make the index partial,
so we say WHERE status = 'active',
because we know most of
the time it's active, or something
like this.
It doesn't matter: the index starts
using this column somehow,
in any of the 3 ways, as another key
column, in INCLUDE, or in WHERE.
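The three variants Nikolay lists can be sketched in DDL, assuming a hypothetical tasks(user_id, status) table:

```sql
-- 1. status as another key column:
CREATE INDEX tasks_user_status ON tasks (user_id, status);

-- 2. status as an INCLUDE payload, useful only for index-only scans:
CREATE INDEX tasks_user_inc_status ON tasks (user_id) INCLUDE (status);

-- 3. status as a partial-index condition:
CREATE INDEX tasks_user_active ON tasks (user_id) WHERE status = 'active';
```

In all three forms the index definition now references status, which matters for the update behavior discussed next.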
Michael: I know what you're going to talk about.
Are we going into HOT update territory?
Nikolay: Exactly.
Exactly.
So this is what happened to me.
I was like enthusiastic optimizing some project.
It was, I think, I remember first time it was very long ago.
It was DocSend.
Later it was acquired by Dropbox.
It was in San Francisco.
And I was like, oh, we optimized additionally, like shaved off
like 20-30% of buffer hits and reads and it's great.
It's small optimization, but it's helpful.
And we can pay this price, okay, we need to maintain this index
anyway, it's slightly bigger now, but it's still worth it.
And then later I see degradation of updates.
And you know why, right?
Because we had HOT updates, because basically most updates
updated the status.
Yeah.
And now it's participating in the index definition, so it makes
HOT updates impossible.
And HOT updates, just a reminder, since we usually have new
listeners: heap-only tuple updates.
This is an update which changes only the table pages, the pages
of table data, and doesn't touch indexes.
Because a regular update needs to touch every single index, which
produces more dirty pages that the checkpointer, or the background
writer, or even a backend needs to flush to disk.
And it generates more WAL, and it slows down updates: regular
updates are much slower than HOT updates.
So just putting another column into your multi-column index might
affect updates.
This is worth remembering.
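One way to check whether a change like this hurt you is to watch the share of HOT updates per table in pg_stat_user_tables (a real view; the query itself is just a sketch, and both counters are cumulative since the last stats reset):

```sql
-- Share of heap-only tuple (HOT) updates per table, busiest tables first:
SELECT relname,
       n_tup_upd,
       n_tup_hot_upd,
       round(100.0 * n_tup_hot_upd / nullif(n_tup_upd, 0), 1) AS hot_pct
FROM pg_stat_user_tables
ORDER BY n_tup_upd DESC
LIMIT 10;
```

A hot_pct that drops sharply after adding a column to an index is the symptom Nikolay describes.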
Michael: If it wasn't indexed already.
Like if it was indexed already in some other way.
Fair enough.
It's a really good point. And I think those index efficiency tips
are the ones people are most surprised by. I'm talking about
experienced engineers, but not experienced Postgres folk: they look
through a query plan, see index scan, index scan, index scan.
Maybe there are a lot of loops, maybe there is "rows removed by
filter", but maybe it only says 1, and it's like 20,000 loops, and
they don't realize that that's the most inefficient part of their
query.
Anyway, so I love those, but normally
when index efficiency is good
we're talking about a small percentage
removed, like 0.1% or less.
Great.
If it's 70%, we score it much worse.
Well,
Nikolay: It's a combination of
factors, this choice.
Yeah.
But one particular case is very common,
and when I see it,
I immediately recognize the problem.
If I see SELECT blah blah blah
with some filters and then ORDER BY
updated_at DESC LIMIT 100,
definitely there are multiple indexes
which include updated_at,
and there is a trigger, or
just some Ruby or Django application,
it doesn't matter, which
always updates updated_at.
So this table doesn't see
HOT updates at all.
And how to fix that issue without losing
the mechanics of the updated_at
timestamp, I don't
know.
I sometimes try to avoid indexing
this column.
But if you present values to your users,
you really need ORDER BY
updated_at DESC LIMIT something.
You need it, right?
I try to say, okay, updated_at
is usually very close to created_at,
let's use created_at instead. And then
somehow it doesn't work
well.
Michael: So it doesn't, I would
have thought in most cases it
would actually, but yeah.
Nikolay: Depends of course.
It depends.
Yeah.
Sometimes it's good.
Like you basically already have
almost everything and on the
fly.
Yeah.
Sometimes there is a very good
correlation.
But sometimes, if it's a partitioned
table, updates happen all
the time and it's a mess.
And updates might happen a month
later, for example.
This is like, screw this.
Michael: I will, for newer listeners,
I will link up our episodes
on HOT updates and we did 1 on
limit that's really good for this.
Nikolay: Yeah, yeah, yeah.
I just wanted to say this is a
warning.
Michael: Yeah, yeah.
Nikolay: We have multi-column indexes,
everyone needs them,
100%.
But we should be careful if we
need fast updates and fewer
bytes written to WAL.
Because WAL volume can be an issue.
And they put pressure on both backups
and replication systems.
Michael: Yes, and this is the old
read versus write thing, isn't
it?
And also, like if you've got a
write-heavy application or if
you're not doing many reads at
all, you might not want many indexes
at all.
Like It is a trade-off in general,
not just multi-column, right?
Nikolay: Sometimes, you know, sometimes
here it makes sense to
split and have a one-to-one relationship,
2 tables. In one you
update a lot, and you give up on
indexing it, but it has only 2 or 3
columns, and that's it.
Right.
And the other one has everything else,
search and so on.
But in this case, you end
up with the problem that we cannot
create an index across 2 tables.
For some queries, we will
lose the single index scan.
We will need to join.
Yeah.
So it's
Michael: back to denormalization
topic again.
Yeah.
Nikolay: Yeah.
Yeah.
Michael: I had one more on INCLUDE.
I wondered if it might end up with
less bloat in some workloads.
When we're updating a column
that's been included as a
payload, for example, the tuple
doesn't need to move in the index.
Whereas in the case where we've got
both as key columns, we're most
likely having to move the tuple to
a different page.
If it's ordered by email,
then name, for example, and the
name changes, probably the entry
will move page and you could
end up with a page split.
So I was just thinking, in terms
of updates, I wonder if we'd
end up with less index bloat if
we used INCLUDE instead of multiple
keys.
It seems like a minor advantage,
but I wondered if that could
help as well.
Nikolay: Yeah it's an interesting
point I should think about
it.
Michael: Yeah, but it feels to me
like INCLUDE is a scalpel,
a very specific tool that helps.
Nikolay: Including, yeah, the word
is including.
Are you sure?
No, I'm not sure.
Include?
Yeah, it's include.
It was a bug in my mind, sorry. Yeah,
include.
Michael: It's not even includes,
it's include, singular.
Nikolay: Yes, yeah.
Michael: Yeah, but that shows how
commonly you see it.
Nikolay: Yeah.
Super rare.
Michael: Including would probably make more sense, but it's just
include, yeah.
Nikolay: So, okay.
Michael: Good factor naming.
I think we're done.
Nikolay: Yeah, it was a grab bag of pieces of advice.
Michael: And I think we'll come back to some of the other index
types at some point.
By the way, I found an interesting tidbit: for GIN
and BRIN, because of their structures, the ordering of
the columns doesn't actually matter, which I found fascinating.
Nikolay: Say that again?
Michael: GIN indexes and BRIN indexes, the order of the columns
doesn't matter.
Nikolay: Yeah.
Interesting.
Okay.
Michael: I mean mostly we're worried about B-tree so it's not
too important.
Nikolay: Yeah, 90 plus percent.
Michael: Yeah.
Alright, nice one, Nikolay.
Nikolay: Thank you so much.
Michael: Catch you next week.
Nikolay: Bye-bye.