
GIN Indexes
Nikolay: Hello, hello, this is
Postgres.FM.
As usual, I don't remember the
episode number.
Michael: 137?
Nikolay: 137.
My name is Nikolay, Postgres.AI,
and as usual, my co-host is Michael,
pgMustard.
Hi, Michael, how are you doing
today?
Michael: Hello, Nikolay.
I am good, thank you.
How are you doing?
Nikolay: Fantastic, honestly.
Good.
A lot of things are happening,
but all good.
I mean, like, things are changing
and a lot of things are changing
in a constructive way, which I
like.
So, yeah, let's discuss GIN.
It's one of the index types many
people still use, and I think it's
not going to disappear, despite
the rise of vector-related index
types and approximate k-nearest-neighbor
indexes.
GIN is a very interesting type
of index developed by the same
folks who brought WAL support
to GiST.
GiST was the only option for complex
data types before GIN, but then,
with the rise of JSON and JSONB,
GIN became very popular, finding
a lot of applications.
Not only full-text search, which
is what it was originally created
for, I think.
Michael: Yeah.
Yeah, I looked at the history.
I was shocked to see that it was
added back in 8.2, so 2006.
Nearly 20 years ago.
Nikolay: Yeah, this is exactly
when I started to be involved
with Postgres.
And 8.2 was when I worked on XML
as well, and I thought about how
to provide good indexing capabilities
for XML.
But it was too big a task for me,
being quite young then.
We discussed these ideas with the
people who worked on GIN, of course.
And of course I miss those days,
and I feel very sad about what's
happening in the world and how
disconnected all of us became.
Michael: Because these guys are
all Russian folks.
But also, I was looking at the
domains: there's an article mentioned
in the Postgres docs on GIN indexes
that references the initial release
and mentions who sponsored the
work, which I think was a German
company.
But one of the domains it's hosted
on is .su, which I didn't even
know what it stood for.
It turns out it's USSR, or Soviet
Union, which makes sense.
Nikolay: Yeah I wanted to say that
originally the primary purpose
was to support better performance
for full-text search.
And originally, full-text search
was powered by GiST.
And this version was called RD-tree,
Russian Doll Tree, where a tree-like
structure was created for arrays,
for sets, basically.
And it supported operations like,
instead of less-than and greater-than:
"is contained in" (one array is
contained in another), "contains",
and also "overlaps".
Right?
And it can support full-text search,
but it's slow if you have large
volumes of data and a lot of different
words, when the actually used vocabulary
is huge.
So GIN is an inverted index.
And the name, Generalized Inverted
Index, was patterned after GiST,
the Generalized Search Tree.
I like this so much.
Folks who are interested in databases
and want to learn something cool
should read the article by Joseph
Hellerstein from Berkeley about
jumping from the single-dimensional
B-tree to the two-dimensional R-tree,
and then generalizing this idea
to GiST, the generalized search
tree.
It's so cool.
And then, additionally, there
is the RD-tree, which is for sets.
And of course, you can have multiple
dimensions, like many dimensions.
Not so many as for vectors, unfortunately,
like not 2,000, but
for a few dimensions, GiST works
well.
And there was even a history with
R-tree: GiST behaved much better
than the original, native implementation
of R-tree, so it was replaced.
The original implementation of
R-tree for multiple dimensions
was replaced by GiST, though not
B-tree.
But there is an extension, btree_gist,
for various multi-column indexes
and so on, when you have columns
of different natures.
And trees are great, but for full-text
search, they are not so
great.
So they jumped from idea of working
with trees to inverted index,
which is a more common idea for
search engines.
And compared to a tree-like structure,
a very simple explanation of an
inverted index can be: okay, we
have a list of words, and for each
word we have a list of references
to the places where this word was
noticed, right?
Michael: Like, where is...
Nikolay: Links to tuples?
Michael: Yeah.
Nikolay: Row versions.
So, a list of words, normalized,
of course.
And then, for each word, a list
of places where it can be found.
It's very, very similar to the
index at the end of a book.
Michael: Yeah.
I read this.
And for so many years, I've been
using and reading about, you know,
indexes in the back of books, or
in recipe books.
It's a really famous example, a
really easy one for understanding
how indexing speeds things up.
But it never occurred to me that
they're not a B-tree structure;
they are a list of words.
Yeah, exactly, it's an inverted
index.
And I think the thing that clicked
for me as to why the word "inverted"
is used: in a standard, non-inverted
index, we have an entry per row
or row version, and in an inverted
one, each row could be referenced
multiple times, assuming...
Nikolay: Exactly.
Michael: In a full text search case for example, each document
is likely to have lots of words in it.
So that's where the word inverted comes from.
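The book-index analogy the two of them land on can be sketched in a few lines. This is a toy illustration only: the function name and the lowercase whitespace tokenizer are assumptions for the sketch, and real GIN stores tuple pointers (TIDs) rather than document numbers:

```python
# A toy inverted index: each normalized word maps to the list of
# documents containing it -- many entries can point at the same row,
# which is the "inverted" part discussed above.
from collections import defaultdict

def build_inverted_index(documents):
    """Map each lowercased word to a sorted list of matching doc ids."""
    index = defaultdict(list)
    for doc_id, text in documents.items():
        # set() so a repeated word in one document yields one entry
        for word in set(text.lower().split()):
            index[word].append(doc_id)
    return {word: sorted(ids) for word, ids in index.items()}

docs = {
    1: "Postgres supports GIN indexes",
    2: "GIN is an inverted index",
    3: "B-tree is the default index",
}
index = build_inverted_index(docs)
# index["gin"] -> [1, 2]; index["index"] -> [2, 3]
```

A lookup is then a single dictionary access followed by a walk of the posting list, instead of a tree descent per probe.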
Nikolay: Yeah, in a tree-like
structure, in a tree: in B-tree,
R-tree, RD-tree, GiST in general,
we perform a descent.
We say we need to find this value,
for example, and we jump from one
node to another, down the index,
towards the leaf nodes, right?
Leaves.
Performing some operations to understand
where to go, which child node to
choose.
Not only a direction, because each
internal node has multiple, many
children, unlike a binary tree.
And my favorite example, favorite
question when hiring an engineer,
and not just a database engineer.
For a database engineer, this question
is mandatory.
But for a software engineer, a
backend engineer, you can ask what
a B-tree is, and if you hear "binary",
this is game over.
And unfortunately, I've had situations
with very, very good software engineers
where I had this feeling: do I
want to ask that question?
Because it's kind of like flipping
a coin.
And then I hear "binary", and okay,
my internal principles say this
is a no-go.
So, yeah.
So, a B-tree is not a binary tree.
Unlike a binary tree, where two
children is the maximum, in a B-tree
we have many children for each
internal node, maintaining some
rules.
Michael: In case anybody's wondering, balanced?
Nikolay: Well, there are several
theories about the name, as I remember,
but I don't remember them.
But okay, yeah, balanced.
Although actually it's almost balanced,
meaning that, conceptually, the
path from the root to each leaf
can be n or n plus 1.
This helps with balancing.
So, almost balanced.
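The practical consequence of the many-children-per-node point is a very shallow tree. A back-of-the-envelope sketch (the branching factor of 300 here is an illustrative assumption, not a Postgres constant; real fan-out depends on page size and key width):

```python
# Why B-trees stay shallow: height grows as log base b of N, where b is
# the branching factor, not log base 2 as in a binary tree.
import math

def tree_height(n_keys, branching_factor):
    """Approximate number of levels needed to index n_keys entries."""
    return max(1, math.ceil(math.log(n_keys, branching_factor)))

# A billion keys: a binary tree needs ~30 levels, while a B-tree with
# ~300 children per node needs only ~4.
binary_levels = tree_height(10**9, 2)
btree_levels = tree_height(10**9, 300)
```

Each level is roughly one page read, so the difference between 30 probes and 4 probes is why the fan-out matters so much.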
Okay.
Again, when we work with a tree,
we have several operations until
we reach a leaf.
And the leaf has a pointer to the
tuple we are looking for.
With an inverted index, okay, we
find our word.
And then we find many places for
this word, right?
And this also means there are several
challenges here.
I remember originally, as I understand
it, and my perception might be
wrong, because it was long ago.
How long, 20 years?
Almost.
I remember GIN has some internal
B-tree structures; it needs B-trees
twice.
First, to find the word: because
if the number of different words
is huge, we need an additional
mechanism to perform the search
faster, and it can be a B-tree,
right?
And second, inside posting lists,
we also have B-trees.
So inside GIN, there are two types
of B-tree structures, as I remember.
And I saw Oleg Bartunov's articles
and the book about internals.
And this book has, of course, a
whole section about GIN indexes.
And I saw a visualization for GIN;
I think the documentation also
has it, right?
It's maybe the first illustration
that was added to the Postgres
documentation, if I'm not mistaken.
My memory might trick me a little
bit, but I remember it was rotated
90 degrees.
So instead of, as in a book, having
words and then page numbers to
the right of each word, it was
rotated 90 degrees so everything
points down.
So it seems like we are going down
all the time, as in the case of
GiST or B-tree.
So, starting from the search for
our word, we use the internal B-tree;
then we go into the posting list,
where we also use a B-tree, also
descending, right?
And I honestly feel confusion
here.
I like the book-like visualization,
because it distinguishes GIN from
tree structures.
And also, we have three internal
trees here, and it only adds confusion,
in my opinion.
But, well, this is the choice.
I think the folks who created the
visualization and wrote about it,
Egor Rogov and others, are great
in terms of education.
And I definitely admire their work
in education.
But I personally like the original
representation.
We could draw GIN the book-like
way instead, highlighting that
it's a kind of different beast
in terms of indexes, right?
Michael: Yeah, good.
Nikolay: Yeah, sorry, this was
my introduction: some history and
a simple explanation.
Michael: Yeah, I really liked
it, and I really like diagrams
in the docs.
A lot of effort goes into making
diagrams that are easy to understand;
unfortunately, we don't have that
many in the Postgres documentation.
But I also find that I personally
don't learn best visually, I don't
think, and I found the description
in the Postgres docs of how GIN
works particularly useful.
It's really well written, like
most of the documentation, and
I found it really helpful.
So I'll link to that in the show
notes as well.
I think it's an incredibly good
description, with a few examples.
Obviously full-text search is the
main use case, but it's not the
only one.
Nikolay: I think right now it's
not full-text search; full-text
search was the original main use
case, and right now the main use
case is JSONB.
Michael: Yeah, interesting.
Well, honestly, I don't see that,
I'd say.
I'd be interested in the results
of your poll; I think you've put
a poll on Twitter.
But I do think GIN is probably
the second most popular index type
after B-tree, though by some margin.
Order-of-magnitude-wise, I'd say
it's fewer than 1 in 10, maybe,
but probably more than 1 in 100,
so somewhere in between those.
So it's still not super common
for me to see them.
Nikolay: Yeah, almost half of
folks voted that they use two index
types.
And I guess it's either B-tree
and GIN, or B-tree and vector types
like HNSW.
Interesting.
Who knows?
Michael: I do see the occasional
BRIN and the occasional hash, but
often those are ones people have
left lying around after an experiment,
after trying them out, and they're
not necessarily getting that much
usage.
Nikolay: Postgres should have telemetry enabled by default.
I'm joking, I'm joking.
To get usage stats.
Michael: Well, actually, I know
this is a real tangent now, but
I do think that the usage stats
can be misleading for people as
well, because the last person I
came across who had a hash index
had it on the same column as a
regular B-tree index.
And I said, you know, that's redundant;
the B-tree could...
Nikolay: You already pay the price
for the B-tree anyway.
So...
Michael: Exactly, but the hash
index was still getting used, because
it's a bit smaller.
So it was still getting usage in
the meantime.
Nikolay: We talked about hash indexes,
remember?
Michael: Yeah, but all I mean is
the telemetry isn't enough to know
that you can drop that hash index,
because it's still getting usage.
Nikolay: I'm joking about telemetry.
I cannot imagine Postgres adding
some telemetry stuff enabled by
default.
It's completely opposite to how
things are usually done in Postgres.
But I'm very curious, of course.
I think big platforms might have
some stats, but who knows.
Anyway, It definitely looks like
B-tree is number 1, but GIN may
be number 2 or number 3.
Michael: 0, I'm sorry, I understand
what you mean by telemetry.
I thought you meant like for the
users to see what they were
using.
Nikolay: No, no, imagine Postgres
had telemetry, sending usage stats
to some server.
Michael: The cloud providers could
do it, suitably anonymized.
Nikolay: Yeah, if they can, because
it's also a question of how much
they can do; schema analysis usually
should be possible.
But yeah, okay.
But it's better for cloud providers
to focus on different tasks.
We can discuss this another time.
This is not the number 1 priority,
I'm pretty sure.
No.
So, okay.
Michael: How about you?
What do you see?
Like, how often do you see GIN
being used in the wild?
Nikolay: Well, it's used for full-text
search, but in larger companies
it often loses to Elasticsearch.
In smaller projects, it's great.
But with larger companies, we should
discuss performance gains and overhead
next, right?
But yeah, what I see, it's great
until it's not, in terms of
full-text search.
For JSONB, it's kind of similar,
but there we can tune it: we can
use a smaller index, just preserving
the information about keys and
values.
These additional operator classes,
you can tweak it, right?
Yeah, path ops: jsonb_path_ops.
So you can say: I don't need everything
to be indexed, all the knowledge
about the structure; I just need
keys and values, and support for
only the part of the search operations
I have.
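The difference between the two GIN operator classes for JSONB can be sketched conceptually: jsonb_ops extracts every key and every value as a separate entry, while jsonb_path_ops extracts one (hashed) entry per path to each value. The toy model below only illustrates the idea; it is not the actual on-disk format, the function names are made up, and nested arrays are treated as scalar leaves for brevity:

```python
# Conceptual sketch of what the two JSONB opclasses index.

def jsonb_ops_entries(obj, entries=None):
    """jsonb_ops style: every key and every scalar value is its own entry."""
    if entries is None:
        entries = set()
    if isinstance(obj, dict):
        for key, value in obj.items():
            entries.add(("key", key))
            jsonb_ops_entries(value, entries)
    else:
        entries.add(("value", obj))
    return entries

def jsonb_path_ops_entries(obj, path=(), entries=None):
    """jsonb_path_ops style: one entry per full path down to a value."""
    if entries is None:
        entries = set()
    if isinstance(obj, dict):
        for key, value in obj.items():
            jsonb_path_ops_entries(value, path + (key,), entries)
    else:
        entries.add(path + (obj,))
    return entries

doc = {"user": {"name": "alice", "age": 30}}
# jsonb_ops: 5 entries (3 keys + 2 values); jsonb_path_ops: 2 path entries
```

Fewer, more specific entries is where the smaller index and the restriction to containment-style searches both come from.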
And for JSONB, for sure, I see
it, but at the same time, in many
cases JSON or JSONB values are
left unindexed, because they're
just storage for some metadata,
for example, and we don't need
to search in them.
Or we decide to search only using
specific paths inside the JSON
value.
And in this case, for both JSON
and JSONB, and even for the XML
data type, you can have a B-tree
index, because it becomes a scalar
value when you apply an expression
to get the value at some path.
That's it.
Michael: It's just an expression
index, yeah.
Nikolay: Yeah, exactly.
So I also see quite a lot of trigram
indexes, pg_trgm, being used, sometimes
with some issues, but yeah, I see
this as well.
Michael: But again, normally related
to full-text search, right?
I say "trigram", by the way, though
maybe it's pronounced differently,
I don't know.
Nikolay: I think JSONB is number
1, in my head.
I don't have analytics, so it's
a subjective perception.
Michael: It makes sense, right?
It makes sense if you think: I
might want to do some search on
this.
It's the perfect...
I mean, it came 10 years before
JSONB, but it's the perfect index
type for it, if you don't know
in advance what you're going to
want to search for.
Nikolay: There is also full-text
search for JSONB, additionally,
a specific case.
I remember some support was added
to perform full-text search inside
JSON and JSONB values, also powered
by GIN, I guess.
Anyway, this index is great.
Well, for arrays as well; I've
used it many times for arrays.
If I know I need to store arrays:
text arrays, integer arrays, doesn't
matter.
I like how Postgres has supported
them for decades.
Array support in Postgres is great,
except for one thing.
You know it, right?
Michael: Nulls?
Nikolay: No, no, similar, but no.
Michael: Then I don't know it,
no.
Nikolay: It's indexes: they start
with 1.
It's very confusing.
Michael: Oh, you confused me saying
indexes, I was thinking...
Nikolay: No, yeah, it's different.
I mean, to access the first element,
everywhere else outside Postgres
it's 0, right, but in Postgres
it's 1, and you need to switch
your mind all the time.
Well, not everywhere else, but
still.
But yeah, for arrays, it's great.
For example, I remember my very
first talk in the US, in 2008,
was about how cool Postgres is
for Web 2.0.
And I talked about the EAV structure,
and how it's not good performance-wise
for larger data volumes.
And I talked about tagging, tags,
right?
For social media, we need tags.
And putting tags into separate
tables following EAV, the
entity-attribute-value structure,
is going to hit performance.
So instead, we can put them as
arrays in the same table.
And then to search, we can just
use GIN, right?
And the Russian Doll Tree idea
will be effectively used, supported
by GIN implicitly, right?
So in this case you can quickly
find: give me the rows where we
have this tag.
Although there will be a problem
with ordering, right?
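The tags-as-arrays pattern can be sketched in miniature. This is a toy model of what a GIN index over an array column provides: the column, tag names, and function names are illustrative, and the real index stores tuple pointers rather than row ids:

```python
# Tags stored as an array per row, plus an inverted index from tag to
# the set of rows containing it -- the role GIN plays for array columns.
from collections import defaultdict

rows = {
    1: {"title": "post one", "tags": ["postgres", "gin"]},
    2: {"title": "post two", "tags": ["postgres", "btree"]},
    3: {"title": "post three", "tags": ["gin"]},
}

tag_index = defaultdict(set)
for row_id, row in rows.items():
    for tag in row["tags"]:
        tag_index[tag].add(row_id)

def rows_with_all_tags(tags):
    """Rough analogue of `WHERE tags @> ARRAY[...]`: intersect posting sets."""
    result = set(rows)
    for tag in tags:
        result &= tag_index.get(tag, set())
    return sorted(result)

# rows_with_all_tags(["postgres", "gin"]) -> [1]
```

The intersection of per-tag posting sets is the fast part; what this sketch deliberately does not give you is any ordering of the result, which is exactly the problem raised above.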
This is, like... before we move
on to discussing this problem:
I think this is the biggest problem
with full-text search and GIN in
general.
People suffer from it, but we will
discuss it.
Before that, let's discuss performance
gains and losses.
Gain is obvious.
Search speed is much better for
inverted index than for tree-like
structure.
So for larger volumes of data,
search is good, SELECT is good,
right?
Michael: One comment here: only
if we talk about using GIN alone,
right?
And only searching by one dimension.
Nikolay: Yeah, well, one word
or a few words; it also has kind
of an additional internal query
language, you could say, with AND,
OR, phrase search, a lot of features
there.
But also, when we say search speed
is better, we need to think about
edge cases as well.
For example, if we have a very,
very popular word which was not
excluded by putting it into a stop-word
dictionary.
If we didn't put it there, we're
going to have poor selectivity
and a high cardinality of results,
and performance...
I don't know.
The limit works, right?
But then there are edge cases
where it's quite a popular word,
and we use GIN.
And there is a big problem with
the need to combine GIN-based search
with something else, like, for
example, an additional filter on
some scalar like a timestamp, or
ordering.
Right.
And this is a huge problem.
Like, I want to find everything
which includes these words, or
these values, doesn't matter, but
I also need to get the 25 latest
items from there.
And this is somehow...
Michael: Wait, like, you might
want to weight it based on... the
more recent, the more you want
to weight it, but if it's got,
you know, the word many, many times
over... I get it.
I think we're probably venturing
a bit too far into full-text search
stuff rather than GIN stuff, but
you're right.
Nikolay: It's about GIN, it's
not about full-text search.
For trigrams, don't you want to
have the most popular, or most
used, or the latest?
Or for JSONB search: I want to
find everything which includes
these things, but I want it very
fresh.
I want to order somehow.
And in most cases in the modern
world... when it was created, the
idea was that we are going to order
by ranking.
It was all full-text search, right?
Ranking, like most relevant.
But in social media, we usually
don't want that.
We usually want the freshest information.
Fresh, yeah.
So creation time, or ID if it's
numeric, or UUID version 7, would
work.
So we need to order by some scalar.
And this does not work well with
GIN.
Michael: I think you're right,
though.
I think maybe at scale it's always
going to be a problem.
There is a feature that was added
from the beginning with GIN, at
least for the stop-word issue,
and that's gin_fuzzy_search_limit,
which you can set, and which is
recommended in the thousands.
If your search would have returned
more than roughly that number,
it will limit the results, but
it will limit them randomly, so
that plays into your question of
which ones should be returned.
But the idea behind that feature
is: if you're returning thousands
of options anyway, how good is
that search in the first place?
So, yeah, you're right.
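Conceptually, gin_fuzzy_search_limit behaves something like the sketch below: an over-large result set gets cut down to roughly the limit by dropping matches at random. This is a loose model of the behavior only, not the actual algorithm, which drops items probabilistically during the index scan; the function name and seed are my own:

```python
# Toy model of a fuzzy result limit: keep at most `limit` matches,
# chosen randomly, trading completeness for bounded result size.
import random

def fuzzy_limit(matches, limit, rng=None):
    """Return at most `limit` matches; selection is random when over limit."""
    rng = rng or random.Random(0)  # seeded so the sketch is repeatable
    if len(matches) <= limit:
        return list(matches)
    return rng.sample(matches, limit)

hits = list(range(10_000))
limited = fuzzy_limit(hits, 100)  # exactly 100 survivors, chosen at random
```

The randomness is the catch Michael points out: there is no guarantee that the "right" rows, for example the freshest ones, are among the survivors.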
Nikolay: There was also statistics-related
work; over time it was improved,
but I remember still having issues
with lossiness, for the planner,
right?
Like, we need to recheck.
I don't know the details right
now, actually.
I've been using GIN blindly lately,
in terms of performance.
But for me, the key problem is
the inability of GIN to combine
search with additional scalar-based
filters or ordering.
Ordering is the number 1 problem
for me.
It led the same folks to create
RUM indexes.
Michael: Yeah, which confused me because when they launched GIN
they said you should think of it as a genie not as an alcoholic
drink but then they come out with RUM and it's definitely about
alcoholic drinks.
So I don't believe them.
Yeah.
Nikolay: Ambush, right?
Yeah.
Yeah.
Well, RUM, unfortunately, in my
case didn't work well, because
it was huge and so on.
Maybe it was because of what I
tried to put in it...
The idea of RUM is: let's put additional
columns into the structure, right?
Michael: Like a timestamp.
So that you can do the thing you
wanted.
Nikolay: Right, exactly.
And creation time, how big is it,
8 bytes or 16?
I keep forgetting.
It was huge.
I guess I could reduce it, for
example by putting in 4-byte integers
just to support ordering.
There are ideas for how to improve
it, but still, this is an extension,
and it's not available on many
platforms, unfortunately.
But there is another thing: there
is a standard contrib module, an
extension which is available everywhere,
called btree_gin.
At least this can help you combine
different filters.
For example, if you search in JSONB,
or use full-text search or a trigram
index, and you want to combine
it with additional filtering, for
example on price, right?
Or age, depending on the data.
This is possible; this is worth
considering.
So, to have a multi-column index.
And maybe for ordering it can also
be helpful.
But if you don't do anything, what
usually happens is that the planner
needs to decide: should it use,
for example, a primary key, or
a created_at timestamp index, to
order by, and then apply the filter?
Instead of GIN, it will then just
apply the filter to all the data
it finds.
And it can be very...
The alternative is to perform GIN
filtering, find everything, and
then sort in memory.
And it's hit and miss.
Sometimes we see that the planner
is very optimistic, thinking we
will quickly find fresh items just
by following the B-tree on created_at
in reverse order, dynamically checking:
does this row match our filter,
which was supposed to go through
GIN?
Well, it's not matching, not matching,
not matching.
And if it's a miss, it might skip
many, many, many rows until it
finds our 25 target rows, just
following the B-tree, not involving
GIN at all.
And this is a terrible situation.
I saw many incidents because of
that.
Or, vice versa: GIN, again with
some popular word, finds millions
of entries, then tries to sort
them in memory to find the top
25.
Maybe these edge cases hit less
often in terms of performance,
but it's also not fun at all to
have this situation.
Right.
Yeah, and this is the number 1
problem.
Michael: That's where the fuzzy
limit could actually maybe help,
depending on how important it is
to not have false negatives, like
things that should show up not
showing up, depending on the use
case.
If you just want to show something,
let's say you're showing a social
media feed, trying to show engaging
things to people, it doesn't matter
if there are a few really good
ones that don't show.
But if you've got a different use
case, then it matters a lot.
Like with full-text search, if
you don't show the most relevant
thing, that's an issue.
Nikolay: Or, for example, a user
just added something, goes to search,
expects it to be there, but it's
not there; it's not good.
It feels like a bug.
So in many cases it's unacceptable,
unfortunately.
And also there is a problem with
the pending list, right?
The pending list.
Michael: So this is about inserts
rather than SELECTs.
Nikolay: Yeah, not posting lists.
Posting lists are where, for each
word, we basically have references
to tuples.
That's a posting list.
And then there is the pending list,
which was the trick to...
Actually, we forgot to mention
the main problem with GIN.
The key problem is slow writes,
right?
Because search is fast, with all
the things we discussed, but writes
are slower than writes to tree-like
structures: to trees, to B-tree,
to GiST.
Even with rebalancing, I guess
it's slower.
And what was created?
The fastupdate option.
Michael: It's an unfair comparison,
right?
It's almost by design that it has
to be slower, because documents
contain lots of words.
One row being updated or added
creates multiple writes to the
index.
So it's not a fair fight, not a
fair comparison.
Nikolay: I agree.
So, to mitigate this, the fastupdate
technique was created, and there's
a fastupdate option when you create
an index.
And I guess it's on by default,
right?
Yeah.
So fastupdate means that changes
will go to a temporary buffer,
a kind of location inside the structure
called...
Michael: Pending list.
Nikolay: Pending list, yes.
And then at some point it reaches,
by default, 4 megabytes.
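The fastupdate mechanism can be sketched as a write buffer with a flush threshold. This is a toy model (a count-based limit stands in for GIN's 4 MB byte limit, and the class and method names are made up), but it shows both effects discussed here: the occasional write that pays the full flush cost, and searches having to scan the pending buffer too:

```python
# Toy model of GIN's fastupdate: appends go to an unsorted pending
# buffer; once the buffer exceeds a threshold, one insert pays to merge
# everything into the main word -> posting-list structure.

class ToyGin:
    def __init__(self, pending_limit=3):
        self.main = {}       # word -> posting list
        self.pending = []    # unmerged (word, row_id) pairs
        self.pending_limit = pending_limit

    def insert(self, word, row_id):
        # Cheap append most of the time; the insert that crosses the
        # threshold triggers the flush -- the unpredictable-latency
        # trade-off discussed above.
        self.pending.append((word, row_id))
        if len(self.pending) >= self.pending_limit:
            self.flush()

    def flush(self):
        # In Postgres this merge is also done asynchronously by vacuum.
        for word, row_id in self.pending:
            self.main.setdefault(word, []).append(row_id)
        self.pending.clear()

    def search(self, word):
        # Searches must scan the pending buffer as well, which is why a
        # large pending list also slows SELECTs down.
        merged = list(self.main.get(word, []))
        merged += [rid for w, rid in self.pending if w == word]
        return merged
```

With `pending_limit=3`, the first two inserts are cheap appends; the third one triggers the flush and absorbs the merge cost.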
Michael: Yeah, there are a few
things that can trigger it.
That's one.
Nikolay: There's like a catch there.
What happens?
Michael: It batch-processes them,
right?
Nikolay: Yeah, and it can happen
synchronously.
So some write, some UPDATE or INSERT,
might suddenly be very slow.
So slow it might hit your statement_timeout,
which is, like, 30 seconds, for
example, or 60, or 10, depending.
And this is unpredictable performance
for writes.
Or vacuum can do it asynchronously.
Wow.
Yeah, vacuum.
What I saw from practice: vacuum
is good and mitigates this, not
allowing this job to be done in
a synchronous manner.
But it works only if the duration
of vacuum is reasonable, which
points in the direction that we
do need partitioning.
Because if we have a huge table,
vacuum is super slow.
It cannot catch up with our writes.
And the pending list problem hits
our writes.
And users notice it.
At some point, it might be that
we decide, okay, we need to switch
off fastupdate for some indexes,
for large tables, because we want
predictable, maybe not good, but
predictable performance for writes.
It depends.
There is a trade-off, and I see
that both cases are not ideal if
you have really large volumes of
data and a heavy workload.
So partitioning can help here,
to keep fastupdate on, in my opinion.
Michael: Well, yeah, have you
tried that?
I'm just thinking, then you've
got to have the partition key in
the query, and then suddenly you've
got another dimension there.
Sounds tricky.
Nikolay: And so we need a different
index in this case, right?
Michael: Well, it just might not
use the GIN index then.
I've had trouble convincing the
planner to use GIN indexes in the
past, when they've got another
choice.
Maybe I'm doing something wrong,
but I wanted to try to get a bitmap
scan ANDing the indexes, with a
GIN index as one bitmap index scan
and, let's say, a B-tree index
as another.
Nikolay: Sounds like a case for
btree_gin to me.
Michael: Well, a multi-column
GIN index with B-tree columns in
it could have helped, but it wouldn't
have been structured quite as optimally,
in my opinion.
I was hoping to get this AND list
really short and then a really
quick scan, and I didn't have luck
getting a multi-column btree_gin
index to be as performant as I
thought we could get with a bitmap
scan.
But the partitioning thing, and
I would love to be wrong, sounds
like you've now got a GIN index
per partition, right?
So you've got to get to the right
partition, so then again, if it's
time...
Nikolay: Partitioning should work;
partition pruning should be in
good shape, of course, but this
is a general rule for partitioning.
Michael: Yeah, but let's say
you're partitioning by time; it
means we've got the time filter
on, right?
Nikolay: Right.
Michael: Is that the case?
Like for full text search, for
example, are we going to be filtering
by time?
Nikolay: Well, if we're ordering
by time...
For social media, again, we need
fresh information.
In most cases, this is the rule,
compared to the previous era, Web
1.0.
Michael: Well, unless there isn't
fresh information, then you
need to go back and back and back,
right?
Nikolay: Maybe, yes.
Well, it depends.
So my question is: how do we limit,
how do we order?
If we don't limit and don't order,
this is the fundamental question
for this query: why do we request
unlimited results?
Right.
I raise this question very often
during consulting practice; very
often I see no LIMIT or no ORDER
BY, and this is the question.
In most cases people say: oh, actually,
yeah, it makes sense to have pagination
here, for example.
Or pagination with GIN.
Yeah.
It's interesting.
Michael: It's difficult, right?
Nikolay: Yeah.
Well, again, if we order by ID
or timestamp by scalar value,
it's like key set pagination works,
and that's it.
And GIN is just additional filtering
in this case.
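The keyset pagination pattern can be sketched in miniature. The table, column names, and page size below are illustrative assumptions, and a Python list comprehension stands in for the index scan:

```python
# Keyset pagination: order by a scalar id descending, use the last seen
# id as the cursor, and apply the containment check as an extra filter.

rows = [
    {"id": i, "tags": ["gin"] if i % 2 == 0 else ["btree"]}
    for i in range(1, 11)
]

def fetch_page(tag, after_id=None, limit=3):
    """Rough analogue of:
    WHERE tags @> ARRAY[tag] AND id < after_id ORDER BY id DESC LIMIT n
    """
    matching = [
        r for r in rows
        if tag in r["tags"] and (after_id is None or r["id"] < after_id)
    ]
    matching.sort(key=lambda r: r["id"], reverse=True)
    return matching[:limit]

page1 = fetch_page("gin")                            # ids 10, 8, 6
page2 = fetch_page("gin", after_id=page1[-1]["id"])  # ids 4, 2
```

The cursor (`after_id`) keeps each page cheap regardless of how deep you paginate; the open question from the discussion is which index the planner uses to serve the combined filter-plus-order efficiently.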
Michael: And we're kind of back
to where we started, with your
original comment that it works
until it doesn't.
At a small scale, it's fine to
not have the LIMIT and the ORDER
BY, because your data set is small
enough that your query response
times are pretty good.
Nikolay: But we design for the
future, not only for now, right?
So it's always worth asking yourself
how much data you expect in the
future, and how much data you need
to return with this query to your
application or frontend.
Do we need everything?
I saw an interesting case when
people intentionally decided to
return everything and then do a
lot of work on the frontend.
I saw it many times, actually,
with folks who decided that their
frontend should be very thick,
with a lot of logic there, and
that it's better to cache everything.
Like: let's load, and then work.
Well, in my opinion, it's a dead
end.
Because if your project grows,
it's a dead end.
You will experience poor performance.
And another case was folks who
decided to perform analytics with
R, so they also needed to load
everything, and then things like
GROUP BY were performed only in
memory using R, which for me looks
like: why do you do this?
Do it inside your database.
Let me not forget: when the pending
list grows toward the limit, can
it exceed the limit, right?
Or no?
Michael: Is it a soft limit or
not?
I'm not sure.
Nikolay: This is a good question.
But anyway, maybe it was a bug
when it exceeded it.
I don't remember.
Some parts of my memory tell me
that it happened in the past.
I'm not sure if it was a bug or
not.
But obviously, SELECT performance
will also be affected if the pending
list is huge.
For example, imagine that for specific
filters we are unlucky and we need
to process the whole pending list;
we need to deal with it, and SELECTs
become slower.
Michael: Yeah.
So it's 4 megabytes, right, the
default?
Nikolay: The default, yes.
Michael: I suspect it's not too
much of a penalty if you leave
it at the 4MB, but a tempting thing
to do might be to increase it,
so that you hopefully have autovacuum
kick in and do the work in the
background, instead of having a
random insert pay the penalty.
Nikolay: Right.
Michael: But then once you've
increased the limit, each SELECT
has to look through more pending
entries, especially towards the
end, and then you're paying the
price on SELECTs as well.
So, yeah.
Actually, we probably should mention
there's a really good write-up
about these kinds of issues by
Lukas Fittl of pganalyze, who also
covered the GitLab case.
Nikolay: I think Lukas does a
fantastic job analyzing various
Postgres cases.
I think that's why his company
is called pganalyze, but this is
your joke, not mine.
Michael: You're stealing my jokes
now.
Nikolay: Yes.
Well, yeah, good job with pganalyze.
And yeah, in GitLab's case, I
was only slightly involved, and
I remember that case from a few
years ago; the result was a decision,
for specific indexes, to switch
off fastupdate, because the table
was huge and so on, right?
Michael: Yeah.
Anyway, switching off fastupdate
means we're not using the pending
list anymore, and every insert
and update pays the price at insertion
time to update the index.
And I think that's a really interesting
trade-off: to make each insert
a bit slower, but not have occasional
very slow ones.
I like that a lot in terms of trade-offs.
Nikolay: Anyway, kudos to GitLab
for openness, as usual, because
all the details are visible to
the public, to the community, and
it's super helpful for the general
development of Postgres and so
on.
And kudos to Lukas for a very good
analysis of this Postgres case.
Good.
I think we touched a little bit
on some deeper waters, but not
super deep.
If someone wants to read about
internals, Egor Rogov's articles
and books are really great in this
area.
And yeah, I think that's it, right?
Or anything else?
Michael: Yeah, one last tiny tip
that I think is worth mentioning:
you mentioned jsonb_path_ops briefly.
I think for most other data types
the default is the only operator
class, but for JSONB the default
is actually jsonb_ops, not jsonb_path_ops,
and you can get a performance boost
with jsonb_path_ops if you don't
need the key-exists operators,
if you're just using the usual
contains operators.
Nikolay: Final words: GIN is a
very important index type.
It's not going to disappear; it's
going to be used heavily.
A rich set of index types is one
of the strengths of Postgres.
Yeah, for sure.
Good.
Okay, So thank you.
See you next time.
Michael: Thanks so much, Nikolay.
Catch you next week.
Bye.