
GIN Indexes
Nikolay: Hello, hello, this is
Postgres.FM.
As usual, I don't remember the
episode number.
Michael: 137?
Nikolay: 137.
My name is Nikolay, Postgres.AI,
and as usual, my co-host is Michael,
pgMustard.
Hi, Michael, how are you doing
today?
Michael: Hello, Nikolay.
I am good, thank you.
How are you doing?
Nikolay: Fantastic, honestly.
Good.
A lot of things are happening,
but all good.
I mean, like, things are changing
and a lot of things are changing
in a constructive way, which I
like.
So, yeah, let's discuss GIN.
It's one of the index types many
people still use, and I think it's
not going to disappear, despite
the rise of vector-related index
types and approximate k-nearest-neighbor
indexes.
GIN is a very interesting type
of index developed by the same
folks who brought WAL support
to GiST.
GiST was the only option for complex
data types before GIN, but then,
with the rise of JSON and JSONB,
GIN became very popular, finding
a lot of applications.
Not only full-text search, which
is what it was originally created
for, I think.
Michael: Yeah.
Yeah, I looked at the history.
I was shocked to see that it was
added back in 8.2, so 2006.
Nearly 20 years ago.
Nikolay: Yeah, this is exactly
when I started to be involved
with Postgres.
And 8.2 was when I worked on XML
as well, and I thought about how
to provide good indexing capabilities
for XML.
But it was too big a task for me,
being quite young then.
We discussed these ideas with the
people who worked on GIN, of course.
And of course I miss those days,
and I feel very sad about what's
happening in the world and how
disconnected all of us became.
Michael: Because these guys are
all Russian folks.
But also, I was looking at the
domains: there's an article mentioned
in the Postgres docs on GIN indexes
that references the initial release
and mentions who sponsored the
work, which I think was a German
company.
But one of the domains it's hosted
on is .su, which I didn't even
know what it stood for.
It turns out it's USSR, or Soviet
Union, which makes sense.
Nikolay: Yeah I wanted to say that
originally the primary purpose
was to support better performance
for full-text search.
And originally, full-text search
was powered by GiST.
And this version was called RD-tree,
Russian Doll Tree, where a tree-like
structure was created for arrays,
for sets, basically.
And it supported operations like,
instead of less-than and greater-than:
"is contained in" (one array is
contained in another), "contains",
and also "overlaps".
Right?
And it can support full-text search,
but it's slow if you have large
volumes of data and a lot of different
words, when the actually used vocabulary
is huge.
So GIN is an inverted index.
And the name, Generalized Inverted
Index, was patterned after GiST,
the Generalized Search Tree.
I like this so much.
Folks who are interested in databases
and want to learn something cool
should read the article by Joseph
Hellerstein from Berkeley about
jumping from the single-dimensional
B-tree to the two-dimensional R-tree,
and then generalizing this idea
to GiST, the generalized search
tree.
It's so cool.
And then, additionally, there
is the RD-tree, which is for sets.
And of course, you can have multiple
dimensions, like many dimensions.
Not so many as for vectors, unfortunately,
like not 2,000, but
for a few dimensions, GiST works
well.
And there was even a history with
R-tree: GiST behaved much better
than the original, native implementation
of R-tree, so it was replaced.
The original implementation of
R-tree for multiple dimensions
was replaced by GiST, though not
B-tree.
But there is an extension, btree_gist,
for various multi-column indexes
and so on, when you have columns
of different natures.
And trees are great, but for full-text
search, they are not so
great.
So they jumped from idea of working
with trees to inverted index,
which is a more common idea for
search engines.
And compared to a tree-like structure,
a very simple explanation of an
inverted index can be: okay, we
have a list of words, and for each
word we have a list of references
to the places where this word was
noticed, right?
Michael: Like, where is...
Nikolay: Links to tuples?
Michael: Yeah.
Nikolay: Row versions.
So, a list of words, normalized,
of course.
And then, for each word, a list
of places where it can be found.
It's very, very similar to the
index at the end of a book.
Michael: Yeah.
I read this.
And for so many years, I've been
using and reading about, you know,
indexes in the back of books, or
in recipe books.
It's a really famous example, a
really easy one for understanding
how indexing speeds things up.
But it never occurred to me that
they're not a B-tree structure;
they are a list of words.
Yeah, exactly, it's an inverted
index.
And I think the thing that clicked
for me as to why the word "inverted"
is used: in a standard, non-inverted
index, we have an entry per row
or row version, and in an inverted
one, each row could be referenced
multiple times, assuming...
Nikolay: Exactly.
Michael: In a full text search case for example, each document
is likely to have lots of words in it.
So that's where the word inverted comes from.
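The book-index analogy the two of them land on can be sketched in a few lines. This is a toy illustration only: the function name and the lowercase whitespace tokenizer are assumptions for the sketch, and real GIN stores tuple pointers (TIDs) rather than document numbers:

```python
# A toy inverted index: each normalized word maps to the list of
# documents containing it -- many entries can point at the same row,
# which is the "inverted" part discussed above.
from collections import defaultdict

def build_inverted_index(documents):
    """Map each lowercased word to a sorted list of matching doc ids."""
    index = defaultdict(list)
    for doc_id, text in documents.items():
        # set() so a repeated word in one document yields one entry
        for word in set(text.lower().split()):
            index[word].append(doc_id)
    return {word: sorted(ids) for word, ids in index.items()}

docs = {
    1: "Postgres supports GIN indexes",
    2: "GIN is an inverted index",
    3: "B-tree is the default index",
}
index = build_inverted_index(docs)
# index["gin"] -> [1, 2]; index["index"] -> [2, 3]
```

A lookup is then a single dictionary access followed by a walk of the posting list, instead of a tree descent per probe.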
Nikolay: Yeah, in a tree-like
structure, in a tree: in B-tree,
R-tree, RD-tree, GiST in general,
we perform a descent.
We say we need to find this value,
for example, and we jump from one
node to another, down the index,
towards the leaf nodes, right?
Leaves.
Performing some operations to understand
where to go, which child node to
choose.
Not only a direction, because each
internal node has multiple, many
children, unlike a binary tree.
And my favorite example, favorite
question when hiring an engineer,
and not just a database engineer.
For a database engineer, this question
is mandatory.
But for a software engineer, a
backend engineer, you can ask what
a B-tree is, and if you hear "binary",
this is game over.
And unfortunately, I've had situations
with very, very good software engineers
where I had this feeling: do I
want to ask that question?
Because it's kind of like flipping
a coin.
And then I hear "binary", and okay,
my internal principles say this
is a no-go.
So, yeah.
So, a B-tree is not a binary tree.
Unlike a binary tree, where two
children is the maximum, in a B-tree
we have many children for each
internal node, maintaining some
rules.
Michael: In case anybody's wondering, balanced?
Nikolay: Well, there are several
theories about the name, as I remember,
but I don't remember them.
But okay, yeah, balanced.
Although actually it's almost balanced,
meaning that, conceptually, the
path from the root to each leaf
can be n or n plus 1.
This helps with balancing.
So, almost balanced.
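The practical consequence of the many-children-per-node point is a very shallow tree. A back-of-the-envelope sketch (the branching factor of 300 here is an illustrative assumption, not a Postgres constant; real fan-out depends on page size and key width):

```python
# Why B-trees stay shallow: height grows as log base b of N, where b is
# the branching factor, not log base 2 as in a binary tree.
import math

def tree_height(n_keys, branching_factor):
    """Approximate number of levels needed to index n_keys entries."""
    return max(1, math.ceil(math.log(n_keys, branching_factor)))

# A billion keys: a binary tree needs ~30 levels, while a B-tree with
# ~300 children per node needs only ~4.
binary_levels = tree_height(10**9, 2)
btree_levels = tree_height(10**9, 300)
```

Each level is roughly one page read, so the difference between 30 probes and 4 probes is why the fan-out matters so much.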
Okay.
Again, when we work with a tree,
we have several operations until
we reach a leaf.
And the leaf has a pointer to the
tuple we are looking for.
With an inverted index, okay, we
find our word.
And then we find many places for
this word, right?
And this also means there are several
challenges here.
I remember originally, as I understand
it, and my perception might be
wrong, because it was long ago.
How long, 20 years?
Almost.
I remember GIN has some internal
B-tree structures; it needs B-trees
twice.
First, to find the word: because
if the number of different words
is huge, we need an additional
mechanism to perform the search
faster, and it can be a B-tree,
right?
And second, inside posting lists,
we also have B-trees.
So inside GIN, there are two types
of B-tree structures, as I remember.
And I saw Oleg Bartunov's articles
and the book about internals.
And this book has, of course, a
whole section about GIN indexes.
And I saw a visualization for GIN;
I think the documentation also
has it, right?
It's maybe the first illustration
that was added to the Postgres
documentation, if I'm not mistaken.
My memory might trick me a little
bit, but I remember it was rotated
90 degrees.
So instead of, as in a book, having
words and then page numbers to
the right of each word, it was
rotated 90 degrees so everything
points down.
So it seems like we are going down
all the time, as in the case of
GiST or B-tree.
So, starting from the search for
our word, we use the internal B-tree;
then we go into the posting list,
where we also use a B-tree, also
descending, right?
And I honestly feel confusion
here.
I like the book-like visualization,
because it distinguishes GIN from
tree structures.
And also, we have three internal
trees here, and it only adds confusion,
in my opinion.
But, well, this is the choice.
I think the folks who created the
visualization and wrote about it,
Egor Rogov and others, are great
in terms of education.
And I definitely admire their work
in education.
But I personally like the original
representation.
We could draw GIN the book-like
way instead, highlighting that
it's a kind of different beast
in terms of indexes, right?
Michael: Yeah, good.
Nikolay: Yeah, sorry, this was
my introduction: some history and
a simple explanation.
Michael: Yeah, I really liked
it, and I really like diagrams
in the docs.
A lot of effort goes into making
diagrams that are easy to understand;
unfortunately, we don't have that
many in the Postgres documentation.
But I also find that I personally
don't learn best visually, I don't
think, and I found the description
in the Postgres docs of how GIN
works particularly useful.
It's really well written, like
most of the documentation, and
I found it really helpful.
So I'll link to that in the show
notes as well.
I think it's an incredibly good
description, with a few examples.
Obviously full-text search is the
main use case, but it's not the
only one.
Nikolay: I think right now it's
not full-text search; full-text
search was the original main use
case, and right now the main use
case is JSONB.
Michael: Yeah, interesting.
Well, honestly, I don't see that,
I'd say.
I'd be interested in the results
of your poll; I think you've put
a poll on Twitter.
But I do think GIN is probably
the second most popular index type
after B-tree, though by some margin.
Order-of-magnitude-wise, I'd say
it's fewer than 1 in 10, maybe,
but probably more than 1 in 100,
so somewhere in between those.
So it's still not super common
for me to see them.
Nikolay: Yeah, almost half of
folks voted that they use two index
types.
And I guess it's either B-tree
and GIN, or B-tree and vector types
like HNSW.
Interesting.
Who knows?
Michael: I do see the occasional
BRIN and the occasional hash, but
often those are ones people have
left lying around after an experiment,
after trying them out, and they're
not necessarily getting that much
usage.
Nikolay: Postgres should have telemetry enabled by default.
I'm joking, I'm joking.
To get usage stats.
Michael: Well, actually, I know
this is a real tangent now, but
I do think that the usage stats
can be misleading for people as
well, because the last person I
came across who had a hash index
had it on the same column as a
regular B-tree index.
And I said, you know, that's redundant;
the B-tree could...
Nikolay: You already pay the price
for the B-tree anyway.
So...
Michael: Exactly, but the hash
index was still getting used, because
it's a bit smaller.
So it was still getting usage in
the meantime.
Nikolay: We talked about hash indexes,
remember?
Michael: Yeah, but all I mean is
the telemetry isn't enough to know
that you can drop that hash index,
because it's still getting usage.
Nikolay: I'm joking about telemetry.
I cannot imagine Postgres adding
some telemetry stuff enabled by
default.
It's completely opposite to how
things are usually done in Postgres.
But I'm very curious, of course.
I think big platforms might have
some stats, but who knows.
Anyway, It definitely looks like
B-tree is number 1, but GIN may
be number 2 or number 3.
Michael: 0, I'm sorry, I understand
what you mean by telemetry.
I thought you meant like for the
users to see what they were
using.
Nikolay: No, no, imagine Postgres
had telemetry, sending usage stats
to some server.
Michael: The cloud providers could
do it, suitably anonymized.
Nikolay: Yeah, if they can, because
it's also a question of how much
they can do; schema analysis usually
should be possible.
But yeah, okay.
But it's better for cloud providers
to focus on different tasks.
We can discuss this another time.
This is not the number 1 priority,
I'm pretty sure.
No.
So, okay.
Michael: How about you?
What do you see?
Like, how often do you see GIN
being used in the wild?
Nikolay: Well, it's used for full-text
search, but in larger companies
it often loses to Elasticsearch.
In smaller projects, it's great.
But with larger companies, we should
discuss performance gains and overhead
next, right?
But yeah, what I see, it's great
until it's not, in terms of
full-text search.
For JSONB, it's kind of similar,
but there we can tune it: we can
use a smaller index, just preserving
the information about keys and
values.
These additional operator classes,
you can tweak it, right?
Yeah, path ops: jsonb_path_ops.
So you can say: I don't need everything
to be indexed, all the knowledge
about the structure; I just need
keys and values, and support for
only the part of the search operations
I have.
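The difference between the two GIN operator classes for JSONB can be sketched conceptually: jsonb_ops extracts every key and every value as a separate entry, while jsonb_path_ops extracts one (hashed) entry per path to each value. The toy model below only illustrates the idea; it is not the actual on-disk format, the function names are made up, and nested arrays are treated as scalar leaves for brevity:

```python
# Conceptual sketch of what the two JSONB opclasses index.

def jsonb_ops_entries(obj, entries=None):
    """jsonb_ops style: every key and every scalar value is its own entry."""
    if entries is None:
        entries = set()
    if isinstance(obj, dict):
        for key, value in obj.items():
            entries.add(("key", key))
            jsonb_ops_entries(value, entries)
    else:
        entries.add(("value", obj))
    return entries

def jsonb_path_ops_entries(obj, path=(), entries=None):
    """jsonb_path_ops style: one entry per full path down to a value."""
    if entries is None:
        entries = set()
    if isinstance(obj, dict):
        for key, value in obj.items():
            jsonb_path_ops_entries(value, path + (key,), entries)
    else:
        entries.add(path + (obj,))
    return entries

doc = {"user": {"name": "alice", "age": 30}}
# jsonb_ops: 5 entries (3 keys + 2 values); jsonb_path_ops: 2 path entries
```

Fewer, more specific entries is where the smaller index and the restriction to containment-style searches both come from.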
And for JSONB, for sure, I see
it, but at the same time, in many
cases JSON or JSONB values are
left unindexed, because they're
just storage for some metadata,
for example, and we don't need
to search in them.
Or we decide to search only using
specific paths inside the JSON
value.
And in this case, for both JSON
and JSONB, and even for the XML
data type, you can have a B-tree
index, because it becomes a scalar
value when you apply an expression
to get the value at some path.
That's it.
Michael: It's just an expression
index, yeah.
Nikolay: Yeah, exactly.
So I also see quite a lot of trigram
indexes, pg_trgm, being used, sometimes
with some issues, but yeah, I see
this as well.
Michael: But again, normally related
to full-text search, right?
I say "trigram", by the way, though
maybe it's pronounced differently,
I don't know.
Nikolay: I think JSONB is number
1, in my head.
I don't have analytics, so it's
a subjective perception.
Michael: It makes sense, right?
It makes sense if you think: I
might want to do some search on
this.
It's the perfect...
I mean, it came 10 years before
JSONB, but it's the perfect index
type for it, if you don't know
in advance what you're going to
want to search for.
Nikolay: There is also full-text
search for JSONB, additionally,
a specific case.
I remember some support was added
to perform full-text search inside
JSON and JSONB values, also powered
by GIN, I guess.
Anyway, this index is great.
Well, for arrays as well; I've
used it many times for arrays.
If I know I need to store arrays:
text arrays, integer arrays, doesn't
matter.
I like how Postgres has supported
them for decades.
Array support in Postgres is great,
except for one thing.
You know it, right?
Michael: Nulls?
Nikolay: No, no, similar, but no.
Michael: Then I don't know it,
no.
Nikolay: It's indexes: they start
with 1.
It's very confusing.
Michael: Oh, you confused me saying
indexes, I was thinking...
Nikolay: No, yeah, it's different.
I mean, to access the first element,
everywhere else outside Postgres
it's 0, right, but in Postgres
it's 1, and you need to switch
your mind all the time.
Well, not everywhere else, but
still.
But yeah, for arrays, it's great.
For example, I remember my very
first talk in the US, in 2008,
was about how cool Postgres is
for Web 2.0.
And I talked about the EAV structure,
and how it's not good performance-wise
for larger data volumes.
And I talked about tagging, tags,
right?
For social media, we need tags.
And putting tags into separate
tables following EAV, the
entity-attribute-value structure,
is going to hit performance.
So instead, we can put them as
arrays in the same table.
And then to search, we can just
use GIN, right?
And the Russian Doll Tree idea
will be effectively used, supported
by GIN implicitly, right?
So in this case you can quickly
find: give me the rows where we
have this tag.
Although there will be a problem
with ordering, right?
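The tags-as-arrays pattern can be sketched in miniature. This is a toy model of what a GIN index over an array column provides: the column, tag names, and function names are illustrative, and the real index stores tuple pointers rather than row ids:

```python
# Tags stored as an array per row, plus an inverted index from tag to
# the set of rows containing it -- the role GIN plays for array columns.
from collections import defaultdict

rows = {
    1: {"title": "post one", "tags": ["postgres", "gin"]},
    2: {"title": "post two", "tags": ["postgres", "btree"]},
    3: {"title": "post three", "tags": ["gin"]},
}

tag_index = defaultdict(set)
for row_id, row in rows.items():
    for tag in row["tags"]:
        tag_index[tag].add(row_id)

def rows_with_all_tags(tags):
    """Rough analogue of `WHERE tags @> ARRAY[...]`: intersect posting sets."""
    result = set(rows)
    for tag in tags:
        result &= tag_index.get(tag, set())
    return sorted(result)

# rows_with_all_tags(["postgres", "gin"]) -> [1]
```

The intersection of per-tag posting sets is the fast part; what this sketch deliberately does not give you is any ordering of the result, which is exactly the problem raised above.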
This is, like... before we move
on to discussing this problem:
I think this is the biggest problem
with full-text search and GIN in
general.
People suffer from it, but we will
discuss it.
Before that, let's discuss performance
gains and losses.
Gain is obvious.
Search speed is much better for
inverted index than for tree-like
structure.
So for larger volumes of data,
search is good, SELECT is good,
right?
Michael: One comment here: only
if we talk about using GIN alone,
right?
And only searching by one dimension.
Nikolay: Yeah, well, one word
or a few words; it also has kind
of an additional internal query
language, you could say, with AND,
OR, phrase search, a lot of features
there.
But also, when we say search speed
is better, we need to think about
edge cases as well.
For example, if we have a very,
very popular word which was not
excluded by putting it into a stop-word
dictionary.
If we didn't put it there, we're
going to have poor selectivity
and a high cardinality of results,
and performance...
I don't know.
The limit works, right?
But then there are edge cases
where it's quite a popular word,
and we use GIN.
And there is a big problem with
the need to combine GIN-based search
with something else, like, for
example, an additional filter on
some scalar like a timestamp, or
ordering.
Right.
And this is a huge problem.
Like, I want to find everything
which includes these words, or
these values, doesn't matter, but
I also need to get the 25 latest
items from there.
And this is somehow...
Michael: Wait, like, you might
want to weight it based on... the
more recent, the more you want
to weight it, but if it's got,
you know, the word many, many times
over... I get it.
I think we're probably venturing
a bit too far into full-text search
stuff rather than GIN stuff, but
you're right.
Nikolay: It's about GIN, it's
not about full-text search.
For trigrams, don't you want to
have the most popular, or most
used, or the latest?
Or for JSONB search: I want to
find everything which includes
these things, but I want it very
fresh.
I want to order somehow.
And in most cases in the modern
world... when it was created, the
idea was that we are going to order
by ranking.
It was all full-text search, right?
Ranking, like most relevant.
But in social media, we usually
don't want that.
We usually want the freshest information.
Fresh, yeah.
So creation time, or ID if it's
numeric, or UUID version 7, would
work.
So we need to order by some scalar.
And this does not work well with
GIN.
Michael: I think you're right,
though.
I think maybe at scale it's always
going to be a problem.
There is a feature that was added
from the beginning with GIN, at
least for the stop-word issue,
and that's gin_fuzzy_search_limit,
which you can set, and which is
recommended in the thousands.
If your search would have returned
more than roughly that number,
it will limit the results, but
it will limit them randomly, so
that plays into your question of
which ones should be returned.
But the idea behind that feature
is: if you're returning thousands
of options anyway, how good is
that search in the first place?
So, yeah, you're right.
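Conceptually, gin_fuzzy_search_limit behaves something like the sketch below: an over-large result set gets cut down to roughly the limit by dropping matches at random. This is a loose model of the behavior only, not the actual algorithm, which drops items probabilistically during the index scan; the function name and seed are my own:

```python
# Toy model of a fuzzy result limit: keep at most `limit` matches,
# chosen randomly, trading completeness for bounded result size.
import random

def fuzzy_limit(matches, limit, rng=None):
    """Return at most `limit` matches; selection is random when over limit."""
    rng = rng or random.Random(0)  # seeded so the sketch is repeatable
    if len(matches) <= limit:
        return list(matches)
    return rng.sample(matches, limit)

hits = list(range(10_000))
limited = fuzzy_limit(hits, 100)  # exactly 100 survivors, chosen at random
```

The randomness is the catch Michael points out: there is no guarantee that the "right" rows, for example the freshest ones, are among the survivors.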
Nikolay: There was also statistics-related
work; over time it was improved,
but I remember still having issues
with lossiness, for the planner,
right?
Like, we need to recheck.
I don't know the details right
now, actually.
I've been using GIN blindly lately,
in terms of performance.
But for me, the key problem is
the inability of GIN to combine
search with additional scalar-based
filters or ordering.
Ordering is the number 1 problem
for me.
It led the same folks to create
RUM indexes.
Michael: Yeah, which confused me because when they launched GIN
they said you should think of it as a genie not as an alcoholic
drink but then they come out with RUM and it's definitely about
alcoholic drinks.
So I don't believe them.
Yeah.
Nikolay: Ambush, right?
Yeah.
Yeah.
Well, RUM, unfortunately, in my
case didn't work well, because
it was huge and so on.
Maybe it was because of what I
tried to put in it...
The idea of RUM is: let's put additional
columns into the structure, right?
Michael: Like a timestamp.
So that you can do the thing you
wanted.
Nikolay: Right, exactly.
And creation time, how big is it,
8 bytes or 16?
I keep forgetting.
It was huge.
I guess I could reduce it, for
example by putting in 4-byte integers
just to support ordering.
There are ideas for how to improve
it, but still, this is an extension,
and it's not available on many
platforms, unfortunately.
But there is another thing: there
is a standard contrib module, an
extension which is available everywhere,
called btree_gin.
At least this can help you combine
different filters.
For example, if you search in JSONB,
or use full-text search or a trigram
index, and you want to combine
it with additional filtering, for
example on price, right?
Or age, depending on the data.
This is possible; this is worth
considering.
So, to have a multi-column index.
And maybe for ordering it can also
be helpful.
But if you don't do anything, what
usually happens is that the planner
needs to decide: should it use,
for example, a primary key, or
a created_at timestamp index, to
order by, and then apply the filter?
Instead of GIN, it will then just
apply the filter to all the data
it finds.
And it can be very...
The alternative is to perform GIN
filtering, find everything, and
then sort in memory.
And it's hit and miss.
Sometimes we see that the planner
is very optimistic, thinking we
will quickly find fresh items just
by following the B-tree on created_at
in reverse order, dynamically checking:
does this row match our filter,
which was supposed to go through
GIN?
Well, it's not matching, not matching,
not matching.
And if it's a miss, it might skip
many, many, many rows until it
finds our 25 target rows, just
following the B-tree, not involving
GIN at all.
And this is a terrible situation.
I saw many incidents because of
that.
Or, vice versa: GIN, again with
some popular word, finds millions
of entries, then tries to sort
them in memory to find the top
25.
Maybe these edge cases hit less
often in terms of performance,
but it's also not fun at all to
have this situation.
Right.
Yeah, and this is the number 1
problem.
Michael: That's where the fuzzy
limit could actually maybe help,
depending on how important it is
to not have false negatives, like
things that should show up not
showing up, depending on the use
case.
If you just want to show something,
let's say you're showing a social
media feed, trying to show engaging
things to people, it doesn't matter
if there are a few really good
ones that don't show.
But if you've got a different use
case, then it matters a lot.
Like with full-text search, if
you don't show the most relevant
thing, that's an issue.
Nikolay: Or, for example, a user
just added something, goes to search,
expects it to be there, but it's
not there; it's not good.
It feels like a bug.
So in many cases it's unacceptable,
unfortunately.
And also there is a problem with
the pending list, right?
The pending list.
Michael: So this is about inserts
rather than SELECTs.
Nikolay: Yeah, not posting lists.
Posting lists are where, for each
word, we basically have references
to tuples.
That's a posting list.
And then there is the pending list,
which was the trick to...
Actually, we forgot to mention
the main problem with GIN.
The key problem is slow writes,
right?
Because search is fast, with all
the things we discussed, but writes
are slower than writes to tree-like
structures: to trees, to B-tree,
to GiST.
Even with rebalancing, I guess
it's slower.
And what was created?
The fastupdate option.
Michael: It's an unfair comparison,
right?
It's almost by design that it has
to be slower, because documents
contain lots of words.
One row being updated or added
creates multiple writes to the
index.
So it's not a fair fight, not a
fair comparison.
Nikolay: I agree.
So, to mitigate this, the fastupdate
technique was created, and there's
a fastupdate option when you create
an index.
And I guess it's on by default,
right?
Yeah.
So fastupdate means that changes
will go to a temporary buffer,
a kind of location inside the structure
called...
Michael: Pending list.
Nikolay: Pending list, yes.
And then at some point it reaches,
by default, 4 megabytes.
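The fastupdate mechanism can be sketched as a write buffer with a flush threshold. This is a toy model (a count-based limit stands in for GIN's 4 MB byte limit, and the class and method names are made up), but it shows both effects discussed here: the occasional write that pays the full flush cost, and searches having to scan the pending buffer too:

```python
# Toy model of GIN's fastupdate: appends go to an unsorted pending
# buffer; once the buffer exceeds a threshold, one insert pays to merge
# everything into the main word -> posting-list structure.

class ToyGin:
    def __init__(self, pending_limit=3):
        self.main = {}       # word -> posting list
        self.pending = []    # unmerged (word, row_id) pairs
        self.pending_limit = pending_limit

    def insert(self, word, row_id):
        # Cheap append most of the time; the insert that crosses the
        # threshold triggers the flush -- the unpredictable-latency
        # trade-off discussed above.
        self.pending.append((word, row_id))
        if len(self.pending) >= self.pending_limit:
            self.flush()

    def flush(self):
        # In Postgres this merge is also done asynchronously by vacuum.
        for word, row_id in self.pending:
            self.main.setdefault(word, []).append(row_id)
        self.pending.clear()

    def search(self, word):
        # Searches must scan the pending buffer as well, which is why a
        # large pending list also slows SELECTs down.
        merged = list(self.main.get(word, []))
        merged += [rid for w, rid in self.pending if w == word]
        return merged
```

With `pending_limit=3`, the first two inserts are cheap appends; the third one triggers the flush and absorbs the merge cost.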
Michael: Yeah, there are a few
things that can trigger it.
That's one.
Nikolay: There's like a catch there.
What happens?
Michael: It batch-processes them,
right?
Nikolay: Yeah, and it can happen
synchronously.
So some write, some UPDATE or INSERT,
might suddenly be very slow.
So slow it might hit your statement_timeout,
which is, like, 30 seconds, for
example, or 60, or 10, depending.
And this is unpredictable performance
for writes.
Or vacuum can do it asynchronously.
Wow.
Yeah, vacuum.
What I saw from practice: vacuum
is good and mitigates this, not
allowing this job to be done in
a synchronous manner.
But it works only if the duration
of vacuum is reasonable, which
points in the direction that we
do need partitioning.
Because if we have a huge table,
vacuum is super slow.
It cannot catch up with our writes.
And the pending list problem hits
our writes.
And users notice it.
At some point, it might be that
we decide, okay, we need to switch
off fastupdate for some indexes,
for large tables, because we want
predictable, maybe not good, but
predictable performance for writes.
It depends.
There is a trade-off, and I see
that both cases are not ideal if
you have really large volumes of
data and a heavy workload.
So partitioning can help here,
to keep fastupdate on, in my opinion.
Michael: Well, yeah, have you
tried that?
I'm just thinking, then you've
got to have the partition key in
the query, and then suddenly you've
got another dimension there.
Sounds tricky.
Nikolay: And so we need a different
index in this case, right?
Michael: Well, it just might not
use the GIN index then.
I've had trouble convincing the
planner to use GIN indexes in the
past, when they've got another
choice.
Maybe I'm doing something wrong,
but I wanted to try to get a bitmap
scan ANDing the indexes, with a
GIN index as one bitmap index scan
and, let's say, a B-tree index
as another.
Nikolay: Sounds like a case for
btree_gin to me.
Michael: Well, a multi-column
GIN index with B-tree columns in
it could have helped, but it wouldn't
have been structured quite as optimally,
in my opinion.
I was hoping to get this AND list
really short and then a really
quick scan, and I didn't have luck
getting a multi-column btree_gin
index to be as performant as I
thought we could get with a bitmap
scan.
But the partitioning thing, and
I would love to be wrong, sounds
like you've now got a GIN index
per partition, right?
So you've got to get to the right
partition, so then again, if it's
time...
Nikolay: Partitioning should work;
partition pruning should be in
good shape, of course, but this
is a general rule for partitioning.
Michael: Yeah, but let's say
you're partitioning by time; it
means we've got the time filter
on, right?
Nikolay: Right.
Michael: Is that the case?
Like for full text search, for
example, are we going to be filtering
by time?
Nikolay: Well, if we're ordering
by time...
For social media, again, we need
fresh information.
In most cases, this is the rule,
compared to the previous era, Web
1.0.
Michael: Well, unless there isn't
fresh information, then you
need to go back and back and back,
right?
Nikolay: Maybe, yes.
Well, it depends.
So my question is: how do we limit,
how do we order?
If we don't limit and don't order,
this is the fundamental question
for this query: why do we request
unlimited results?
Right.
I raise this question very often
during consulting practice; very
often I see no LIMIT or no ORDER
BY, and this is the question.
In most cases people say: oh, actually,
yeah, it makes sense to have pagination
here, for example.
Or pagination with GIN.
Yeah.
It's interesting.
Michael: It's difficult, right?
Nikolay: Yeah.
Well, again, if we order by ID
or timestamp by scalar value,
it's like key set pagination works,
and that's it.
And GIN is just additional filtering
in this case.
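The keyset pagination pattern can be sketched in miniature. The table, column names, and page size below are illustrative assumptions, and a Python list comprehension stands in for the index scan:

```python
# Keyset pagination: order by a scalar id descending, use the last seen
# id as the cursor, and apply the containment check as an extra filter.

rows = [
    {"id": i, "tags": ["gin"] if i % 2 == 0 else ["btree"]}
    for i in range(1, 11)
]

def fetch_page(tag, after_id=None, limit=3):
    """Rough analogue of:
    WHERE tags @> ARRAY[tag] AND id < after_id ORDER BY id DESC LIMIT n
    """
    matching = [
        r for r in rows
        if tag in r["tags"] and (after_id is None or r["id"] < after_id)
    ]
    matching.sort(key=lambda r: r["id"], reverse=True)
    return matching[:limit]

page1 = fetch_page("gin")                            # ids 10, 8, 6
page2 = fetch_page("gin", after_id=page1[-1]["id"])  # ids 4, 2
```

The cursor (`after_id`) keeps each page cheap regardless of how deep you paginate; the open question from the discussion is which index the planner uses to serve the combined filter-plus-order efficiently.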
Michael: And we're kind of back
to where we started, with your
original comment that it works
until it doesn't.
At a small scale, it's fine to
not have the LIMIT and the ORDER
BY, because your data set is small
enough that your query response
times are pretty good.
Nikolay: But we design for the
future, not only for now, right?
So it's always worth asking yourself
how much data you expect in the
future, and how much data you need
to return with this query to your
application or frontend.
Do we need everything?
I saw an interesting case when
people intentionally decided to
return everything and then do a
lot of work on the frontend.
I saw it many times, actually,
with folks who decided that their
frontend should be very thick,
with a lot of logic there, and
that it's better to cache everything.
Like: let's load, and then work.
Well, in my opinion, it's a dead
end.
Because if your project grows,
it's a dead end.
You will experience poor performance.
And another case was folks who
decided to perform analytics with
R, so they also needed to load
everything, and then things like
GROUP BY were performed only in
memory using R, which for me looks
like: why do you do this?
Do it inside your database.
Let me not forget: when the pending
list grows toward the limit, can
it exceed the limit, right?
Or no?
Michael: Is it a soft limit or
not?
I'm not sure.
Nikolay: This is a good question.
But anyway, maybe it was a bug
when it exceeded it.
I don't remember.
Some parts of my memory tell me
that it happened in the past.
I'm not sure if it was a bug or
not.
But obviously, SELECT performance
will also be affected if the pending
list is huge.
For example, imagine that for specific
filters we are unlucky and we need
to process the whole pending list;
we need to deal with it, and SELECTs
become slower.
Michael: Yeah.
So it's 4 megabytes, right, the
default?
Nikolay: The default, yes.
Michael: I suspect it's not too
much of a penalty if you leave
it at the 4MB, but a tempting thing
to do might be to increase it,
so that you hopefully have autovacuum
kick in and do the work in the
background, instead of having a
random insert pay the penalty.
Nikolay: Right.
Michael: But then once you've
increased the limit, each SELECT
has to look through more pending
entries, especially towards the
end, and then you're paying the
price on SELECTs as well.
So, yeah.
Actually, we probably should mention
there's a really good write-up
about these kinds of issues by
Lukas Fittl of pganalyze, who also
covered the GitLab case.
Nikolay: I think Lukas does a
fantastic job analyzing various
Postgres cases.
I think that's why his company
is called pganalyze, but this is
your joke, not mine.
Michael: You're stealing my jokes
now.
Nikolay: Yes.
Well, yeah, good job with pganalyze.
And yeah, in GitLab's case, I
was only slightly involved, and
I remember that case from a few
years ago; the result was a decision,
for specific indexes, to switch
off fastupdate, because the table
was huge and so on, right?
Michael: Yeah.
Anyway, switching off fastupdate
means we're not using the pending
list anymore, and every insert
and update pays the price at insertion
time to update the index.
And I think that's a really interesting
trade-off: to make each insert
a bit slower, but not have occasional
very slow ones.
I like that a lot in terms of trade-offs.
Nikolay: Anyway, kudos to GitLab
for openness, as usual, because
all the details are visible to
the public, to the community, and
it's super helpful for the general
development of Postgres and so
on.
And kudos to Lukas for a very good
analysis of this Postgres case.
Good.
I think we touched a little bit
on some deeper waters, but not
super deep.
If someone wants to read about
internals, Egor Rogov's articles
and books are really great in this
area.
And yeah, I think that's it, right?
Or anything else?
Michael: Yeah, one last tiny tip
that I think is worth mentioning:
you mentioned jsonb_path_ops briefly.
I think for most other data types
the default is the only operator
class, but for JSONB the default
is actually jsonb_ops, not jsonb_path_ops,
and you can get a performance boost
with jsonb_path_ops if you don't
need the key-exists operators,
if you're just using the usual
contains operators.
Nikolay: Final words: GIN is a
very important index type.
It's not going to disappear; it's
going to be used heavily.
A rich set of index types is one
of the strengths of Postgres.
Yeah, for sure.
Good.
Okay, So thank you.
See you next time.
Michael: Thanks so much, Nikolay.
Catch you next week.
Bye.