Skip scan
Michael: Hello and welcome to Postgres.FM, a weekly show about
all things PostgreSQL.
I am Michael, founder of pgMustard and I'm joined as usual by
Nikolay, founder of Postgres.AI.
Hello Nikolay.
Nikolay: Hi Michael, how are you?
Let's not miss the fact that we have a guest today.
Michael: We didn't miss it.
You throw things off by saying, how are you?
And I'm British, so I have to reply, and how are you?
Nikolay: Well, let's blame the US.
This is where I learned to ask.
So I'm super glad we have Peter today.
Hi, Peter.
Peter: Hi.
Good to be here.
Thanks for having me on the pod.
Michael: We are delighted to have you.
And yeah, it's Peter Geoghegan.
Hopefully everybody knows of Peter, but in case you don't: he's been a major contributor and committer to Postgres for many years, the last three or so of them at AWS.
So today we're going to be talking about some of Pete's recent
and upcoming work on Postgres related to skip scan implementation,
but it's far more than that.
And honestly, the more I learn about this work, the more excited
I am about it.
I thought I knew quite a lot, and I've been reading up in the
last couple of days, and it's even better.
I didn't realise, and I'm probably cheating or skipping ahead
far too much, but I want people to understand how big this is.
This could affect our indexing strategies going forwards.
We could be choosing to index Postgres differently.
I guess it's obvious in hindsight, but there are some real major
differences.
So yeah, I'm super excited about this work.
I want to thank you for having started it and continuing with
it.
I'd be interested to hear, though, what attracted you to this
work, and did you realise how powerful it would be when you first
started on it?
Peter: Actually not at first, no.
I last did a similar interview to this.
I think it was a predecessor podcast, technically, not this one.
At that time, I happened to talk about how, for whatever reason,
I kind of like finding as many different ways as possible of
looking at a thing.
At the time, I think I said something like, well, there's a bunch
of different perspectives.
There's the data structure perspective, there's the concurrency
control perspective, and on and
on down the list.
And it was that sort of general
mentality that brought me to
this work, because I realised,
hey, I'm missing something here.
With this tendency I have to sort of collect different ways of looking at the same familiar old topic, I was for a long time missing this extra bit, which I think is embodied by the work I'm doing now and its source material, which is a pretty obscure 1995 paper.
It happens to be called Efficient Search of Multidimensional B-Trees.
You probably haven't heard of it, but hopefully that's no indication of its real-world relevance.
It has been something that I believe
has influenced a bunch of
other systems.
So it's, in that sense, I think,
pretty standard stuff.
But the implications may be, as
you were saying, Michael, are
not fully appreciated.
So maybe we'll start with a general
overview for those that don't
know what skip scan is.
It's a feature that certainly Oracle
and in recent versions MySQL
and also SQLite have.
They all call it skip scans.
I think in DB2 it's called jump
scans or something like that.
So the basic idea is to extend
the standard B-tree index access
method to support new types of
querying without changing anything
about the on-disk contents, without
fundamentally altering anything.
This is possible with multi-column
indexes, composite indexes.
The simplest case, and the one that
gets the most attention, is
where you have an omitted prefix
column in your query.
So, for example, you do a SELECT from a table WHERE b equals, let's say, 5. Here we have a composite index on (a, b), and of course a has been omitted from the query predicate.
So traditionally, in Postgres at
least, you wouldn't be able
to really use the index.
You might do a very inefficient full index scan, where you sort of have to grovel through the entire leaf level of the index. That works, but obviously it's not very efficient.
What skip scan will allow you to do here, assuming that there are not terribly many distinct values in column a (as a rule of thumb, I would say hundreds of distinct values total, or perhaps thousands, is about the upper limit), is to effectively use that composite index as a collection of logical sub-indexes, as they're sometimes called.
So, for example, the first value in column a might be 1. So we'll do a = 1 AND b = 5, then a = 2 AND b = 5, then a = 3 AND b = 5, and so on, exhaustively: multiple small index scans that are effectively pretending to be one continuous index scan.
And that will be certainly much
more efficient than a sequential
scan in most cases.
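To sketch that in SQL, with made-up names for the table and columns, it's roughly:

CREATE INDEX ON t (a, b);     -- composite index; the query omits a
SELECT * FROM t WHERE b = 5;  -- no predicate on the leading column
-- a skip scan behaves as if this were many tiny scans:
--   ... WHERE a = 1 AND b = 5
--   ... WHERE a = 2 AND b = 5
--   ... WHERE a = 3 AND b = 5, and so on for every distinct a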
Of course, it's true that you'd be much better off having an index on b directly, but who says that's going to be possible?
You might just not think to do
it, or in some cases, at least,
it might not really be feasible,
which depends on the requirements
for the user.
This is particularly likely with things like data warehousing use cases, the fact table, where a lot of the querying is sort of ad hoc, and it's not really feasible to have all the indexes you might possibly require ahead of time.
So here, it's making a composite
index, which is more useful
for a wider variety of queries.
If you only require adequate performance
here, sub-second performance
is attainable.
It's not going to give you sub-millisecond
performance, although
that does sometimes happen.
But it's still, for these kinds
of use cases, almost as good.
And yet it allows you to avoid
creating a bunch of quasi-redundant
indexes.
So in that sense, as you were saying, Michael, it at least requires that some of the conventional advice be updated. I wouldn't say it's a radical shift, but there is a sense in which, OK, in light of this new capability, we're going to have to rethink some of the rules of thumb we have for indexing in general.
I think that's likely to work out
that way.
I would say I'm excited to see
how people end up using the feature
in practice.
Many things are possible.
So that example I gave is a sort
of very traditional simple example.
But also worth noting that the
techniques that I'm working on
include similar, though slightly
different, cases where, for
example, you have a range.
So suppose the query is SELECT
* from sales, where sales date
between the 1st of August and the
10th of August.
Then it could be that you're not
enumerating every possible date
in the entire table, but just those
dates from that range.
But it's the exact same idea still. So for the 1st of August, plus whatever actually appeared in your predicate in the query, let's say, again, b = 5; then the same for the 2nd of August, for the 3rd of August, and so on.
So from the query perspective,
you haven't omitted the column.
The column appears in this between
range, but it is effectively
the same idea.
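Again with made-up names, that case looks roughly like:

SELECT * FROM sales
WHERE sale_date BETWEEN '2024-08-01' AND '2024-08-10'  -- dates/year illustrative
  AND b = 5;
-- with an index on (sale_date, b), the scan steps through
-- sale_date = '2024-08-01' AND b = 5, then '2024-08-02' AND b = 5,
-- and so on, enumerating only the dates inside the range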
It also should be pointed out that
these are all very general
techniques.
They can be combined in all kinds
of different ways.
And that's where it becomes super
interesting.
Like, you have very, very complicated
combinations of these things.
The example query that I used in my initial email to the hackers list, to sell the idea to people, was taken from the original source paper.
And in that instance, it was, I
believe...
So you have an initial omitted
prefix column, which is on a column
called Apartments.
Then you have a range, then you have an IN, and then you have another IN. So it's all kind of working together.
You see it's like this sort of...
And that's where I think it can
make a huge difference when you
combine all these techniques together,
which you will be able
to do freely.
So there, effectively, you have
this query that would conventionally
require a huge and inefficient index
scan, which is just kind of
a non-starter.
What you actually would have to
do is have a bunch of single-column
indexes and use some kind of bitmap
or something like that.
But then you give up the ability
to have the sort order from
the index.
You can't do a limit 5 or something
like that.
You can't use a group aggregate
directly, etc.
Whereas this way, you can pretty well convert what you would think of as 1 big query into what you would think of as maybe thousands of tiny queries that each land on one leaf page, again and again. And it's sort of a lot of small queries pretending to be 1 big one.
It's not apparent to you, looking at EXPLAIN ANALYZE, that this is going on, but it is.
So effectively, you're rewriting
the query as if it was just
a bunch of tiny queries in a way
that is fairly intuitively obvious.
It wouldn't be very practical to
do this, but you could do it
by procedurally generating a bunch
of these small queries yourself,
in principle, at least.
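In principle, a hand-rolled equivalent, purely as a sketch you would never actually want to maintain, might look like:

SELECT * FROM t WHERE a = 1 AND b = 5
UNION ALL
SELECT * FROM t WHERE a = 2 AND b = 5
UNION ALL
SELECT * FROM t WHERE a = 3 AND b = 5;
-- ...one branch per distinct value of a; skip scan gets the same
-- access pattern without you writing any of this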
And that is the one core clever idea here. It's surprising how far this one core idea will get you, because it is a fairly simple idea in the end.
But there's a bunch of things that
I haven't even considered
yet that the paper also contemplates.
For example, it has a part about NOT IN predicates. You would think it would be impossible to use a NOT IN predicate with an index.
But again, it's the same idea.
It's like skip-scan, except rather
than enumerating every possible
value in the index, it's enumerating
every possible value in
the index except these ones.
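Sketched the same way (and to be clear, this part is the paper's idea, something I haven't implemented yet):

SELECT * FROM t WHERE a NOT IN (3, 7) AND b = 5;
-- conceptually: a = 1 AND b = 5, a = 2 AND b = 5, a = 4 AND b = 5, ...
-- enumerating every distinct a except the listed values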
So it's basically the same thing
again and again.
So it just buys you an awful lot,
one core simple idea, which I
find kind of cool.
So that's the big picture.
Yeah, something that I'm expecting
to get into Postgres 18, and
something that builds on work that
started with Postgres 17.
Michael: So yeah, that's a great
explanation.
I think the multi-column thing, the complex example you mentioned, is the part that blows my mind and starts to make me rethink all this advice we've given. You generally advise people against huge multi-column indexes because they often then only target a very, very niche set of queries. And you end up accepting that you're going to do a large index scan with a bunch of rows being filtered, which we know is inefficient, but it's currently the best solution. You lose...
Nikolay: HOT updates, also. I'm going to continue advising against too many columns in indexes.
Michael: Well, okay.
But I can see the flip side.
Once you only need fewer indexes, you don't have to support as many. Let's say you've got 5 columns that are important here: sometimes I've seen people index all 5 in 3 different orders to support different versions of... And I feel like that kind of thing will become less important, if we even need it at all anymore.
So that's mind-blowing.
But just to go back, your thought process is super interesting here as well.
So did you come across this paper
as part of widening your horizons
and then think of all these implementation
ideas?
Or was it, you thought, this is
an area, this is a blind spot
for me, I should seek out papers
in this area?
Peter: Well, there's surprisingly little written about the topic, actually, even though, as I say, it's implemented in a bunch of other systems, unlike Postgres.
I also think it's probably true
that they don't exploit it to
its full extent.
The actual paper was written by people that were working on a system called NonStop SQL in the 1990s, which hasn't quite completely died yet, but it certainly has far fewer users.
It was a very influential system,
though.
So I think that it's likely that
the full generality of it hasn't
been exploited for whatever reason
in other systems.
My hope is that my implementation
will do just that.
It's kind of weird.
The example complicated query I
gave, what I noticed about that
is, most of the time, if I changed
the order of the columns,
not always, but most of the time,
it made hardly any difference.
Wow.
Right, which is sort of like... That wouldn't always be true, but just given the particulars of the query that I'm talking about, it kind of didn't matter if I swapped them. It made maybe a small difference one way or the other. But that is a bit mind-blowing, sure, that that could ever be true.
Again, it's going to be most compelling
in cases where you don't
know what the requirements are
ahead of time.
It's very good for ad hoc indexing.
And you can always beat it by having
a dedicated index if you
want to pay for that.
But that's perhaps impractical
for a lot of users.
And again, I think in OLTP apps,
it is still useful, but it's
more for avoiding the worst case,
for something that you maybe
should have anticipated but didn't
for whatever reason.
It's good for that more so than
something that you would ideally
plan on using.
But it is relevant to both.
As for why, I don't know.
I think it entered my awareness
about 5 or 6 years ago, but I
didn't fully understand it.
And it took me a little while to
figure out the right way of
doing it.
So having explained all that, it should also be pointed out that whether or not you actually want to rewrite your query in this manner, it all depends, doesn't it?
It could be, for example, that
the old way of doing it happens
to be the most efficient one, because,
yes, you could, in principle,
rewrite it into 100 different small
queries, but those small
queries would all be hitting the
same few leaf pages anyway,
so you actually would want to do
maybe not 100, maybe 1 or 2
at most small queries there.
How do you know ahead of time which
it is?
Well, you don't, right?
But you don't have to.
So I've had this intuition that
if I solve that problem first,
if I solve it for in-lists, then
after that, I would be able
to move on to this.
That would be a useful foundation.
So in Postgres 17, this has been implemented. It'll be available, I think, in 3 weeks when we release 17, if all goes as planned.
So that took the existing capability for indexing in-lists and made it so that if you have 50 integers, say, and they're contiguous integers like 1, 2, 3, 4, 5, through to 50, and suppose they're all co-located on the same leaf page, then you're going to do one index scan and read them all at once.
If, on the other hand, it turns out that they're spread out all over the index, maybe because it's low cardinality, or maybe it's low cardinality just in that part of the index... there's no rule that says it has to be one or the other. It could be low cardinality at some points in the key space while others are high.
So making a dynamic choice is very
useful, I think, because you
don't have to know.
That's what makes it safe to aggressively apply these optimizations: if it doesn't work out, the B-tree scan code is able to combine the would-be accesses into one on the fly. So you'll get the same behavior as in previous versions, if and only if that makes sense.
So that sort of flexibility, that
dynamic behavior, seemed like
an essential prerequisite.
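As a sketch of the 17 behaviour, with an illustrative table:

SELECT * FROM t WHERE id IN (1, 2, 3, 4, 5, 6, 7, 8, 9, 10);
-- Postgres 17: if all ten matches sit on one leaf page, this is one
-- descent and one leaf page read; if they're spread out, the scan
-- descends the tree once per clump of values, decided on the fly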
So it's not literally true, but it is conceptually the case, that you're rewriting the big query into lots of little ones and then combining the results, allowing them to pretend that they are one big index scan, even though they're not.
But the actual degree to which
you do that, how much atomizing
actually happens, is determined
on the fly as the scan progresses.
So it's kind of indeterminate how many actual index scans you're doing, but that doesn't matter, if you get what I mean.
Michael: Yeah, it's so cool.
It doesn't matter to the user; it's one of those things that, as a user, I don't have to worry about or think about at all.
As somebody who's implementing
it though, it feels like it makes
it much more complex for you to
build.
Is that fair?
Peter: Yes and no.
I think there's two ways of looking
at it.
There's what you just said, which
is, in an immediate sense,
true.
It is more complicated.
But in another sense, it isn't, because... complicated to whom?
all these costs, right?
And it turns out that's hard.
And it turns out that statistics
are...
Well, they're statistics.
And if you're talking about, like, all these secret hidden correlations... these are multi-column indexes, of course, so we have statistics for each column, but unless you go out of your way to create them, there are no correlated statistics across columns.
There's a generic assumption that
these are independent, even
though very often they're not.
So having the access path be very forgiving, so that when the optimizer gets things quite wrong it actually doesn't matter, or doesn't matter very much at least, is way easier, at least from the point of view of the query optimizer.
And on balance, that probably makes
it much, much easier to actually,
as a practical matter, make it
work.
I think that that will probably
turn out to be really important,
although at this point, that's
pretty speculative.
Michael: I had this on my list of things that might come up, because the cost model is super interesting. It seems to me, and I haven't checked for sure, but from looking at some examples, that you might not have changed the cost estimations, at least in 17.
You might consider that because
you're making a bunch of things
more efficient in a bunch of cases,
you might try and reduce
the costs for those to encourage
those scans when they make sense.
But I guess we're already doing...
Anyway, I guess my question is,
have you changed the cost model?
And if not, why not?
Peter: Yes, but I'm not surprised
that you didn't notice it because
the main change that I made to
the cost model was that now it
will saturate.
Because, a very simple observation,
the most expensive possible
index scan ought to be a full index
scan.
You know, very simple. If you scan every single page... There's a pretty tangible maximum cost here, at least in this world where this stuff is available, as it will be in 17, where it is physically impossible to scan the same leaf page a second time, no matter how the constants have been specified in the query itself, because that just cannot happen. It's a physical impossibility. So that's something we can take to the bank. And so the cost model saturates.
There's an awful lot of uncertainty,
except when it comes to
the worst case.
We know for absolute sure that
in the very worst case we'll
scan every leaf page once.
So that's where you notice it more.
So you can have an absurdly long
in-list, and at a certain point,
it'll just stop adding additional
cost, since it is physically
impossible for the cost to keep
going up even as you continue
to add these.
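In sketch form, with illustrative names:

EXPLAIN SELECT * FROM t
WHERE id IN (1, 2, 3 /* ... imagine thousands more elements ... */);
-- past a certain list length the estimated cost stops growing,
-- because the worst case is bounded by one full scan of the index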
So that sort of plays into what
I was saying about having some
general understanding of what the
worst possible case can be
when everything goes wrong, when
the statistics are garbage.
That assurance, in the end, makes
everything easier rather than
making it harder, I feel.
It's also just useful.
Michael: I like that. It seems pessimistic but also safe.
Peter: That's exactly it, sort of. Statistics are kind of inherently just very uncertain; it's amazing that optimizers work as well as they do, I'd say. Like, people often don't realise, I feel, when something goes wrong, that it was next to a miracle that everything was working as well as it was.
And probably things were not technically working optimally, but you didn't know that, because you were happy with adequate performance. You didn't consider that it was technically 10 times slower than what could, in principle, have been attained with the same index and such.
So, like, I think that if we're
honest, we have to admit that
there's an awful lot of uncertainty
about the cost models, and
that doesn't necessarily matter.
There's a growing idea amongst people on the hackers list that it's better to have a kind of... I wouldn't even call it insurance, that's too strong a word, but to account for variation dynamically at runtime rather than doing so statically in the query plan.
That just seems like, to me, something
that has very little downside
and potentially quite a lot of
upside.
It is a more general idea.
It doesn't just apply here.
It applies, for example, with...
I don't know.
You can do some similar things
with hash joins.
It does get more speculative if
you keep going down the line
of things that are like that.
But why would you trust an uncertain
selectivity estimate if
you didn't have to?
There's no reason why you can't have some kind of way of recovering from a misestimation at runtime, at least in certain cases.
I guess it is related to that whole way of thinking about it. The projects I'm working on are sort of in that vein, I feel.
Nikolay: Do you mind if I break
the structure a little bit as
usual?
Peter: Go ahead.
Nikolay: Honestly, I didn't read
the whole discussion.
So I'm going to ask some newbie
questions.
First of all, I see this potential
of...
There is loose index scan as well,
right?
And this is a very different thing
for distinct values.
And we have a wiki page, and we're
constantly using this technique.
And we also mentioned, like you
did, other database systems have
it implemented, Postgres doesn't.
But here we discuss skip scan, and this is very different, I see.
But I just googled a little bit
and I found people use skip scan
in that context as well, right?
So if I'm right, some people reaching
this point will still not
fully understand what we'll discuss,
right?
Peter: It's possible.
Nikolay: Right?
So we discussed the problem. There is a problem: we have an index on columns (a, b), but the filter is only on b. The first column, a, is not constrained, and usually that means this index cannot be applied.
Peter: Well.
It's worth pointing out that what I'm talking about only applies when you have at least 2 index columns, whereas I don't believe that's true with loose index scan. You can still usefully skip there if there's only 1 column.
Nikolay: Because there we need distinct values; it's a different task. Here we need all values, but we have an index which usually we would say is the wrong index, right? But your optimization will make it good, only if the column doesn't have many distinct values, right? If it has a lot of values, maybe it won't be good, right?
Peter: It's certainly not going
to be able to help because,
of course, technically you could
say, okay, there's a million
logically distinct subindexes,
but that degenerates to a full
index scan, so it's the same.
By the way, that might still be
the best plan, all things considered.
We can't rule that out either.
Sometimes that is actually the
best available plan.
But overall, yeah, you're probably
not going to want to use it
unless you have hundreds, perhaps
low thousands, or occasionally
maybe tens of thousands, but after
that it's unlikely.
Nikolay: Will the planner decide, will it have some threshold to decide when skip scan should be applied and when it doesn't make sense anymore?
Peter: Not as such, because as
I sort of touched on already,
there is no special skip scan index
node.
I think that's a bad idea because
it's all dynamic, right?
It's an additive thing.
Nikolay: Yeah, now I'm catching
up.
Peter: Sometimes, for all the usual reasons, you might want to have an index-only scan that happens to use a skip scan, or you might want a bitmap index scan, or a parallel index scan, etc.
It is useful, I think, to not have a distinct kind of index path: more choices means more wrong choices from the point of view of the optimizer.
So I feel like that's a distinct
advantage of the approach I've
taken.
Of course, I still have to figure
out a way of making it not
matter when that's not the right
thing to do.
And that's not very easy, but I'm
almost...
I think I've got that part figured
out.
We'll see.
Michael: Do you mean to like minimize
regressions at times where...
Peter: Yeah, OK.
Because again, occasionally the intrinsically best thing is to just scan the index without trying to skip at all.
So you don't want to waste a lot
of CPU cycles on a useless enterprise
that is skipping when you shouldn't
be skipping.
You really do want to fall back
on essentially the same behaviour
as previous releases there.
Now, mind you, overall, hopefully
the optimizer just wouldn't
pick that way of doing it, because
it would be wrong.
It wouldn't be the fastest available
plan, given the available
indexes.
So that only applies at all when
you're already doing something
that's basically pretty slow.
But overall, yeah, I think we're
going to have to make sure that
you don't notice it when it isn't
able to do what you would hope
it would.
That is sort of complicated, for
sure.
Michael: Just to make sure I've
understood, if we could do a
scan without skips, if we had the
perfect index for a very efficient,
let's say, index-only scan, that
would have been the index that
would have been chosen in the first
place because its cost estimate
would have been lowest.
So already, when we are skipping, you're doing a scan that involves some skips because the perfect index wasn't there. We haven't had that available before, and therefore, yeah, okay.
That's really cool.
Peter: Yeah, in general, I think
it makes more sense to think
about these sorts of queries as
being on a continuum.
One extreme is where there are, let's say, no more than 5 or 10 distinct values.
And then the fact that you're doing
5 different descents of the
index is hardly noticeable.
It's maybe less than a millisecond
in the end anyway.
So that would be a very favourable
case for skipping.
And then you have less favourable
cases, I suppose, where cardinality
goes up.
And eventually, you get to the
point where you can't do any skipping
at all, and that's where you don't
want to pay a penalty for
being open to the possibility of
skipping, if you like.
Also, as I sort of touched on earlier with skipping, it could be, for example, that you have an index where, in respect of the leading column, 90% of the pages have 3 values total, but nevertheless the remaining 10% are all perfectly unique. So now you can't really average the two. It's just two different things. You can't think of it as the average of the 2, really.
So the fact that it's kind of continuous
just makes sense.
You wouldn't want to have it be...
If the planner were to make a static choice, then how would you account for that variation, where you legitimately need to change your strategy for the same index scan in real time? It's totally possible that that would be the best thing to do. So I like not having an opinion until you really have to have an opinion, at the last minute. You do get the best of both worlds. That's the thinking, anyway.
Michael: I think it's super smart.
And those distributions are quite
common.
We see them all the time.
And it's the time where statistics
most obviously falls flat
on its face because of averages.
I know it does some things with most common values.
But anyway, I completely agree
that makes loads of sense.
Peter: Sorry, you were going to say? I was going to give another example of what you're saying with statistics. I mentioned briefly earlier, there's this... I think it's called the attribute value independence assumption.
It's this generic sort of assumption
made by all cost-based optimizers
that there's no correlation between
different columns, which
is demonstrably false all the time,
but somehow mostly doesn't
break that badly.
But for something like this, you've
got to think, there could
be a very strong correlation between
columns, or there could
be a totally different, completely
independent relationship between
the 2.
And it's really hard to model that.
So if you can just not do that
to much of an extent at all and
still get useful query plans, that
seems ideal.
You know, there are just different types of indexes. For example, in the paper that I talk about, it's data warehousing, so you explicitly have different dimensions of a business that are obviously quite independent.
But then you could also have something like, I don't know, supplier ID as the first column and discount code as the second column. So it might be that different suppliers use the same discount code, coincidentally, but even then, those 2 discount codes are completely distinct entities. So it just doesn't make sense to think of them as being the same at all.
You wouldn't expect a query to just supply a predicate that doesn't have a supplier ID. You would expect it either to have a supplier ID or to have both. You wouldn't expect it to be omitted, because it just wouldn't make sense.
So it's impossible or very difficult,
in any case, for the optimizer
to understand those kinds of distinctions,
even though they're
obviously very, very common.
So again, ideally, it wouldn't
have to.
Ideally, it would just be able
to fall back on this sort of idea
of, hey, we'll try to work it out
at runtime to the best of our
ability.
As long as we get the fastest query
plan, everyone's happy.
It doesn't necessarily have to be that exact in the cost model; in fact, it usually isn't. So that's kind of the overarching philosophy of the thing.
Michael: Yeah, as usual with your work, it's so simple once you understand it, but I'm super impressed with the thought process to have got there.
It feels like yet another feature
where, as users, we don't have
to think about how it's working,
why it's working that much.
Maybe with strategies later in
terms of indexing, we can take
advantage of that.
But we could carry on indexing
exactly the way we are, and we'll
only benefit as a result still.
Peter: Unless I mess up, of course,
but hopefully I won't do that.
Yeah.
I'm not exactly playing to my strengths
when I'm talking about
the optimizer, you understand,
but nevertheless, that is kind
of part of the philosophy.
I'm not really an optimizer person.
I like being successful, that's
probably why.
But here, I think that undeniably,
there is some thought given
to those aspects, because it kind
of has to be.
Michael: What do you mean successful?
Peter: Well, it's just really hard
to do anything in the optimizer.
That's why so few people do. And thank goodness someone is willing to do it.
But it's a very difficult area
to work in, I feel, which does
seem to put people off more than
perhaps it should.
Perhaps that will change at some
point in the future, but you
have to ask somebody else about
that.
Michael: Yes.
That somebody else, I believe, is going on another podcast, which I'm excited to…
Peter: Oh, great.
I think I know what you're talking
about.
Michael: Yes, for everybody else,
this is Tom Lane going on the
Talking Postgres podcast, which
is scheduled to be live, I think,
next month.
So I'll share a link to that.
Exciting times.
So we've talked about it briefly already, but another cool thing you've done here is split this work into some useful stuff on its own for version 17, coming out soon.
I'll be interested to hear how and when you realized you could split it up into the in-list optimizations, which I think are going to be very useful for a bunch of generated queries that people don't have that much control over; relatively long in-lists seem to pop up a lot, at least from what I've seen, normally from ORMs. How did you work that out?
Peter: I'm not sure exactly when. I think there probably wasn't any 1 moment where I realised it. I remember talking about another patch, actually 1 for loose index scan, and just being very confused about that whole area myself, being very confused about what the difference is, where to draw the boundaries architecturally. And I recall asking, to nobody in particular at the time, how does this relate to in-lists, or what we would call internally ScalarArrayOpExpr, = ANY (array) expressions?
There must be some relationship
here.
At that time, I was aware of the
paper, the source material,
and that strongly suggested as
much.
But I just didn't...
I kind of thought, you can't really have an opinion about one without having an opinion about the other.
You want to be able to combine
these things arbitrarily.
So that was when the initial awareness
sort of sunk in.
Then I had some work on Vacuum
over a couple of releases, and
then I came back to B-tree again.
And I think at that point I realised,
OK, this is probably the
most useful thing I could do.
And then I got a little bit lucky
as well with the project for
17.
I worked out a couple of things
on the optimizer side that I
wasn't necessarily counting on.
That's more subtle.
It's stuff that's related to whether
or not we know that the
scan will return ordered results.
So I realized the relationship
between that question...
So in previous versions, you would do a query like SELECT FROM a table WHERE a = 5 AND b IN a long list, ORDER BY a, b LIMIT something, and it wouldn't be able to terminate the scan early, because, for reasons that are complicated and not that interesting, the implementation couldn't trust the B-tree code to correctly return things in index order.
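The shape of the query in question, roughly, with illustrative names:

SELECT * FROM t
WHERE a = 5
  AND b IN (1, 2, 3 /* ... long list ... */)
ORDER BY a, b
LIMIT 10;
-- before 17, a LIMIT like this couldn't terminate the index scan early,
-- because the executor couldn't trust that rows came back in (a, b) order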
So I kind of had this idea that
that was important, but I didn't
know how important it was to everything
else.
So I kind of got a bit lucky in
figuring that out in about 2023.
Before that, I was already pretty sure that I would do the skip scan thing, but there was some serendipity: there was actually a user complaint, I think it was on Slack.
Michael: Oh, yeah, Benoit.
I'll link that up as well.
Peter: Yeah, Benoit had a complaint. Basically, because of the limitation I just mentioned, the query plan for such a query with a limit couldn't do it without filtering the non-matching rows in a way that required heap access.
So for reasons that are not really fundamental or worth explaining, it did way more heap accesses just to exclude non-matching rows, when in principle it could have done that at the index level, before accessing the heap at all.
So for very roundabout, hard to
explain, and actually not that
relevant reasons, that made a huge
difference for his use case.
I wasn't even thinking about it,
because I'm a B-tree fellow.
I wasn't really thinking about
avoiding heap accesses.
But then I found, actually, that's
by far the main reason to
do that.
I was thinking initially about
just the simple fact that, hey,
do you want to do the index descent
1 time or a hundred times?
Obviously, you want to do it 1
time if that's feasible.
That's where I started, but then I realized, not much later, but somewhat later, with some help from Benoit: oh, actually, yeah, you can in some cases at least completely... Once you sort out all this hairy business with restrictions in the planner, you can, in some cases, like the one he showed, completely eliminate a huge number of heap accesses.
And that was by far the best, the
most flattering case for the
B-tree work, where I think it was
like 8, 20 times faster, something
like that.
It doesn't really matter what the
number is, it's way, way faster,
because you're fundamentally doing
way less I/O.
So I didn't necessarily count on
that happening at all.
And then it kind of did without
my initially realizing it, which
was nice.
Serendipity did play a role, as
it often does.
Michael: Very cool.
Nikolay, do you have any questions?
Nikolay: Of course, I have comments
and questions.
All of them are silly, maybe.
Once again, we should avoid this
confusion.
Loose index scan is not skip scan.
Because I saw posts mixing these names, but I double-checked Oracle and SQLite, and they both have skip scan with the same meaning, so we should stick to this.
Peter: I guess so, yeah.
I mean, I wouldn't mind calling it something else entirely, but I certainly do think the distinction between loose index scan and skip scan matters. I was confused on that point myself; I mentioned this already.
So loose index scan involves an index scan that knows specifically that it's feeding into a group aggregate. So it's only correct because you don't actually logically need to return all of the rows from the index, or indeed from the heap at all. It's only correct because of that high-level context that you're applying in the low-level code.
Whereas, in contrast, skip scan isn't like that. Skip scan is just exploiting information that's readily available to the scan. It's doing index scan things in a more clever way, without really understanding how that information is being consumed 1 level up or 2 levels up.
So that's how I think of it.
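In rough SQL terms, with an illustrative table t indexed on (a, b):

-- loose index scan territory: the scan feeds a DISTINCT / group aggregate,
-- so it may legitimately skip rows it never needs to return
SELECT DISTINCT a FROM t;

-- skip scan territory: every matching row must be returned; the scan
-- just iterates the distinct values of a internally to find them all
SELECT * FROM t WHERE b = 5;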
Nikolay: Yeah, you mentioned work on loose index scan also happening. What do you think about the future of that work?
Peter: I think it's a perfectly
fine idea.
There's nothing wrong with it.
I did briefly look at that some
years back.
My recollection of exactly how
it was at the time is a little
bit vague.
I remember noticing that it had lots of bit rot, which was probably a little bit of something that discouraged my involvement, because, again, I like being successful.
Right, yeah.
But fundamentally, it's a perfectly
sound idea.
It's obviously a good idea that
could be implemented with enough
effort.
And I think that the patch didn't
have any major problems, or
at least didn't have any major
problems that could not be addressed
in the fullness of time.
I recall Tom, who was sort of asked by me to take a look at it at the time, saying it was maybe doing things the wrong way in the planner, but that's kind of above my pay grade.
But from an indexing point of view,
obviously it makes sense.
It's like, just thinking about it very sensibly: without it, you're doing potentially an enormous amount of extra I/O that you don't logically need to do. Yeah, I mean, it's inarguable that it's a perfectly good idea.
If I haven't started there or I
haven't paid too much attention
to it, it's only because of competing
priorities.
I think it's totally reasonable.
Nikolay: Yeah, we write WITH RECURSIVE queries all the time to implement this, basically at the application level.
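The well-known shape from the wiki page, roughly, for a distinct scan on column a of table t:

WITH RECURSIVE distinct_vals AS (
  (SELECT a FROM t ORDER BY a LIMIT 1)
  UNION ALL
  SELECT (SELECT t.a FROM t WHERE t.a > d.a ORDER BY t.a LIMIT 1)
  FROM distinct_vals d
  WHERE d.a IS NOT NULL
)
SELECT a FROM distinct_vals WHERE a IS NOT NULL;
-- each step does one cheap index descent to find the next distinct a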
I'm curious if there are any specific benchmarks which could be helpful for checking both of these patches, like, what kind of workloads?
Peter: Well, not really, because as is often the case with things that have some compiler component to them, it's perfectly obvious that it's not a quantitative improvement, it's a qualitative improvement. So you name a number, and I'll give you a test case that is faster by that amount. Just make up a number; it's no problem, I'll get you a test case that will do that. I just have to come up with something that has sufficiently large amounts of things that you will now be skipping. So coming up with a number is not really very meaningful. You're now doing it the efficient way, the obvious, sensible way, having previously not done that, so there's really no practical limit on how much faster it could be.
And it's hard to make generalizations
in the real world.
That's the annoying answer to your
question.
Nikolay: By the way, I wanted to
ask you about propagation of
this idea from B-tree to other
types of indexes, GIN first of
all.
Is it a valid idea at all?
Maybe in combination with btree_gin?
Peter: It definitely could be, but the way GIN in particular does multicolumn indexes is kind of weird. The way that actually works is, rather than having a B-tree that has a and b stored together, it has a B-tree, if and only if you have a multi-column index, where attribute 1's entries are stored as (1, whatever the value for a is), and attribute 2's entries as (2, whatever the value for b is). So the column number, in the B-tree itself where the entries live, is the most significant key, which is very different.
Now, if you said to me, what about GiST, that's very different, because GiST is much more like B-tree, particularly here. I think with GiST it's totally possible. The way that GiST works... so the source paper I mentioned, Efficient Search of Multidimensional B-Trees, the title might be confusing. You might think, well, multidimensional like GiST? And the answer is, yes, sort of.
I'm doing work that's purely enhancing
B-trees, but some of the
ideas with dimensionality are at
least at a high level related
to things like multidimensional indexing using GiST.
So I think it's entirely possible
to do that.
GiST is loosely ordered, so it's
kind of inherently doing something
a bit like that anyway, where it's
not so much finding the exact
match and the exact place where
it must be, but rather looking
in all the places it could be exhaustively,
which is usually
only a fairly small number of pages.
It doesn't have the same propensity as B-tree to go to exactly the right place, the only right place, for the value that you're looking for directly. Because of the loose ordering of the index, it's not quite doing it that way.
There's a bounding box in the internal
pages which describes
what the leaf pages contain.
So you might have to do 2 or 3,
even for a point lookup, you
might have to do 2 or 3 leaf page
accesses, as opposed to just
1, which would be very often the
case with B-tree.
So it's almost kind of doing that
anyway.
So it seems entirely plausible
that with GiST you could do something
similar.
I don't think that's probably possible
in the case of GIN, though,
because what does that even mean
in light of the scheme with
GIN storing two different attributes
in completely different places?
And the tree doesn't really seem
compatible with that.
I don't know, though.
I mean, anything's possible
Nikolay: What if it's btree_gin with a scalar first column, like organization ID or category or something, and the second column is JSONB or something?
Peter: Maybe.
I mean, I think it's probably easier to talk about it in the context of a use case.
So there, I guess, intrinsically,
you have a kind of nested structure
to JSON.
And so having that capability could
be useful if you had some
kind of partitioning scheme or
something like that, I suppose.
Nikolay: I mean, the JSON nested structure doesn't matter. It's the second column.
Peter: I know, I know.
Nikolay: Right? We put a scalar value, like some number or timestamp, in the first place.
Peter: Yeah, I mean, I think that
could be possible.
I haven't thought about it too
much, though.
You have to talk about it on two levels as well, though. It's like, what would be practically achievable within a year?
Nikolay: Let me tell you why I care so much about this. I started to reflect on one problem recently, related to full-text search in Postgres and why people move to Elastic. Why did I? Because I saw a similar situation happening with pgvector.
The most popular issue in the pgvector repository right now on GitHub is additional filtering.
So people need like global index
with vectors, but they also
need to be able to sometimes search
locally for a particular project,
organization, I don't know, anything.
And this is usually done with what Elastic and others call metadata search. Basically, it's additional filters, usually scalars, like numbers or timestamps, something like that.
And for vector search, it's not
solved yet.
That's why people can still say: PostgreSQL vector search with pgvector doesn't work well for us.
Because the usual solution is,
let's just create partial indexes.
But if you have hundreds or thousands
of categories, it doesn't
work well.
Then I realized that for full-text search, we have had a similar problem for years. Usually it's solved by the btree_gin extension.
Then you are able to create a multi-column
index, but this is
a barrier for many people.
Although this is a contrib module,
it's present everywhere, it's
some kind of a barrier.
And I saw cases where people even created this extension, but somehow still weren't using it well. They still have this problem, the B-tree vs. GIN dilemma for the planner, right?
So I'm thinking, for example, if we used btree_gin more often, we usually would put the scalar value in the first position, because it's more selective, by default, maybe. We don't know. And if it has a low number of distinct values, this skip scan approach would make total sense there as well. Right?
Peter: Maybe.
I don't have a lot of domain expertise
here, so forgive me, I
need to ask questions to clarify
what you meant.
Are you talking about pushing down
a limit for each grouping
or something like that?
Nikolay: Limit is like...
I'm not sure about limit and ordering,
I'm just thinking...
Peter: But not a limit for the entire query, but rather for each individual grouping or whatever. Where you want to... Is that what you meant by the question?
Nikolay: Imagine we have full-text search: a tsvector, a GIN index, and then we have additional things, like, for example, creation time. And we want to search within some window only. By default, full-text search in Postgres doesn't support that itself. It will be a different B-tree index on the timestamp, like a created_at timestamp, and we have a global, whole-table GIN. And then the planner decides what is cheaper: B-tree and then filter on the fly, or use full-text search and then filter or order by on the timestamp.
And the btree_gin extension gives you the opportunity to combine them, to have a 2-column index.
And then the goal is to find everything,
no limits, right?
But if we put the scalar value in the first position, queries which don't have this filter will not work well with this index. So we put the timestamp as our column a and the tsvector as our column b, using btree_gin to get a 2-column index. And then I would like to have skip scan for this, because if my query doesn't have the additional filter, and we have a low number of distinct values for column a... maybe timestamp was not a good example, because it will have a lot of distinct values; let's say it's some category ID or, I don't know, group ID or something. In this case, probably skip scan would help there as well for such queries. Or am I forced to put this column in the second position and keep the full-text vector in the first position? Okay, this is something to think about additionally. I'm just very curious.
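Something like this, I mean, with made-up names:

CREATE EXTENSION IF NOT EXISTS btree_gin;
CREATE INDEX ON docs USING gin (category_id, tsv);  -- scalar + tsvector
-- served directly by the one index:
SELECT * FROM docs
WHERE category_id = 42
  AND tsv @@ to_tsquery('english', 'skip & scan');
-- the open question is the query that omits category_id entirely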
Peter: I don't know is the simple
answer.
In principle, you could...
I mean, after all, what I've described
is, in the end, there's
a bunch of details, but it is basically
quite simple.
You have to be willing, at least in the simple case where you fully omitted a prefix column, to look at every single individual partition, every distinct value, which adds up quite quickly if it's something like a timestamp.
Whereas if you had some kind of
bucket by day or something like
that, then it would be more feasible,
because then you would
have relatively few buckets, and
especially if you had not completely
omitted the date column, if you
had a range in between, then
it would be, I think, feasible
from a data structure point of
view, not necessarily in GIN, but
in something, to do something
like that.
I mean, I don't know what the basis of comparison is here; what is Elastic doing? I just don't know.
I just don't know.
I can't speak to that.
Nikolay: I don't know the internals either.
They just claim if you have additional
filtering, it's not slower
at all, and usually it's faster.
This is what their documentation
says.
I'm very curious about the internals as well, because I just feel this: even from a user perspective, even the need to create the btree_gin extension is already a barrier.
Some people don't do it, and they think, okay, full-text search is weak. And they are going to have the same impression about vector search. And this whole topic of combining these powerful indexes with B-tree indexes, it's an interesting direction, I think, and it can be improved as well. And with skip scan, maybe it can be improved further.
This is my basic idea. Maybe it's not well thought out; I promised to ask silly questions. So...
Peter: It's okay.
There are no silly questions.
Well, I've got an even sillier answer, because I have, I think, a very, very Fisher-Price-level understanding of how pgvector works, which is not going to do much good here. So, you know, I'm not going to say something about a thing I know so little about. Maybe I'll talk to Jonathan Katz about it next time I see him; he's my go-to expert on these things.
Nikolay: I'm sure he already knows this problem very well because, as I said, this is the most active discussion among issues on the GitHub repository, and basically all people need this. They need a combination of B-tree and vector search somehow. Maybe not just one B-tree column, not only one scalar, but a bunch of them, because they need to filter additionally.
This is a challenge.
I also know one project where they built some kind of quite universal solution for others. When they first came to us, they said: we are forced to index every column, for flexibility. We know this is an anti-pattern, but we are forced to. I'm sure they will be happy when Postgres 18 is out with this optimization. This will be a super relief for them, because they are struggling with this problem. It's good to have, as Michael said, only one index on multiple columns and not think too much about the order.
Peter: Yeah.
Especially since, as I say, the important distinction is between getting very slow, totally unacceptable performance, typically a sequential scan of a very large table, versus adequate: sub-second, but not sub-millisecond.
Like, a lot of people are in that
zone, and if they have a lot
of ad hoc queries, they can't really
predict the indexing needs
ahead of time.
I think it makes the biggest difference
there.
Much less so in OLTP apps, but
still to some degree, it's going
to get you acceptable performance
where you didn't know that
you really ought to have had a
certain kind of index, and then
you get adequate-ish performance
there too, which could make
all the difference in a pinch.
Nikolay: Yeah, well, actually,
interesting additional question
I have.
With loose index scan, we have
this wiki page describing this
technique.
We apply this technique using a recursive CTE, and there is an expectation that in the future we will stop doing this, that it will be fully automated. While we are still not there, not on Postgres 18, and it's not even committed: does it make sense to try to implement a similar approach at a higher level ourselves, somehow splitting the query into pieces, like different index scans?
Peter: It might. I would just say: if you're able to upgrade to Postgres 17, then, having done that, you're able to also...
Suppose you have 10 distinct values
in your leading column.
If you can write an in-list that has those 10 distinct values, and you are satisfied that it will give you the correct answer, not only when you write the query but going forward, if it's sort of static, then you can totally do that.
And in fact, I would expect the
actual access patterns to be
identical to what you would have
gotten had you had the skip
scan feature.
It is basically doing the same
thing.
It's just rather than having an
array with these constants that
come from a query, you will instead
have a magical sort of an
array that the B-tree scan code
uses almost the same way.
It's just that it generates its
array elements procedurally and
on demand, but it is essentially
the same thing at quite a low
level.
So it would be not just similar,
but identical, if you could
make it work within those constraints.
And in particular, there'd be no
possible downside in terms of
whether or not it made sense to
do a full index scan.
Depending on the distribution of
values, it's going to be adaptive
in that same way I described.
With a big caveat, of course, you
kind of need to know that,
at least for those leading values,
that it's static and predictable,
and it won't change and break your
query.
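So, concretely, something like this, with illustrative names again:

SELECT * FROM t
WHERE a IN (1, 2, 3, 4, 5, 6, 7, 8, 9, 10)  -- the 10 known values of a
  AND b = 5;
-- on 17, with an index on (a, b), this gets essentially the access
-- pattern a skip scan would produce, for as long as the list stays correct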
Nikolay: So I guess our approaches will change with this. Usually we analyze the workload and decide on indexes; we decide what to put in the first position, the second position. Here it seems like sometimes we will decide to put columns with a lower number of distinct values first, or something.
Peter: That'll start to make way more sense than it did. I don't want to overstate it. It's not like a revolution by any means, but there are specific rules of thumb that will be weakened by it, for sure. One that you already said yourself, Nikolay, was: just put the highest-cardinality column first. That's already kind of suspect, actually, but it really is suspect in this world, with this capability.
Nikolay: Interesting.
Peter: I think it's mostly confined to certain cases. That'll be most true by far with your data warehousing use cases, where there is a built-in requirement for a lot of ad hoc querying that you can't exactly predict ahead of time, where looking at the details of something rather than just a summary really matters; then yes, absolutely, that will be true. With OLTP, less so.
Nikolay: Cool.
Yeah, this is an interesting direction
to explore, even if the patch
is not yet committed.
Yeah, thank you. Good to have something on the to-do list.
Michael: Yeah.
Is there anything else you wanted
to make sure you mentioned,
Peter, or anything anyone can help
you with?
Peter: Well, I always welcome input
into the work I'm doing.
It would be kind of nice if I could get some feedback on the stuff we were just discussing: hey, as a practical matter, how does this alter the general advice that we give to people about how to index things?
I'm not entirely sure how, in practice,
that will tend to work
out anyway.
It is nice that this is a very
general thing.
I think that will tend to make
it possible to use it more often,
because in a certain sense, the
optimizer only has to have the
right rough idea, because if it
has that, then it will pick an
index scan, and then the index
scan itself will figure out what
to do.
So I think that's probably going
to make it more compelling than
it appears to be in other systems,
where, as far as I can tell,
I might be mistaken, they require
the optimizer to generate a
distinct index node type.
So it's kind of either-or, which, it seems to me anyway, could be something that limits the applicability of the optimizations.
So I feel like it would be nice
if I had a better idea of how,
practically speaking, this will
change things and where.
I don't think I've got a complete
understanding of that myself.
Like, what kind of applications,
what kind of usage patterns.
That would be nice to have more
detail on.
So if anyone wants to help me,
that would be an obvious place
to start, I would say.
Michael: Nice, thanks so much,
Peter.
It's been an honor having you on.
Great chat.
Peter: Thank you both.
Thanks, Michael.
I really enjoyed it.
Nikolay: Thank you.