Nikolay: Hello, hello.

This is PostgresFM, episode number
98, right?

98.

Michael: Yes.

Nikolay: And this is Nikolay Samokhvalov,

founder of Postgres.AI.

And my co-host is Michael Christofides,
founder of pgMustard.

Hi, Michael.

Michael: Wow.

That's the first time you've attempted
my last name and you did

great.

Nikolay: Okay.

Thank you so much.

I practiced all these years.

So what's the topic today?

Michael: Yeah, So this was your
suggestion.

I think it's a great one.

We've been meaning to come back
to it for a while now.

It's full-text search.

Why did you want to talk about
it?

Nikolay: I think it was on our
list for a very long time, maybe

from the very beginning, because
it's a huge topic.

And when we built that list, I
think pgvector already existed,

but only a few people paid attention
to it, unlike today.

We didn't manage to discuss
full-text search before

this AI boom started last year,

so I think we should maybe compare
the two a little bit.

Since I used full-text search a
lot in the past, I still remember

many features it offers in Postgres.

I think it's interesting to discuss
not only the usual comparison,

built-in full-text search in Postgres
versus Elastic,

but also built-in full-text search
in Postgres versus the pgvector
extension and similarity

search, or maybe not versus, maybe
together.

And this is interesting.

But I think times are changing and
the evolution is very interesting.

And today people probably pay less
attention to full-text search,

but they at least should know what
capabilities it can offer.

And let's maybe start from use
cases and discuss the functions and
features FTS, full-text search, provides
in Postgres.

What do you think?

Or maybe we can talk about some
historical aspects.

I also can provide some overview
of what was happening in Postgres

and not only in Postgres.

What do you think?

Features, use cases, history, what's
better?

Michael: Actually, I don't know
the history that well.

That would be awesome to hear a
tiny bit about that.

And then I'd love to jump onto
use cases where you see it used

most commonly, that kind of thing.

Nikolay: Yeah, actually we can combine
them.

I now see how we can do it.

So if, for example, we roll back
to the first web era, Web 1.0, so

to speak, like 25 years ago, when
Google was created.

Because Google is originally a
search engine,

right?

And it solves the problem of finding
proper texts, most relevant

texts, right?

And at that time, before Google,
already other search engines

were created, obviously.

Full-text search capabilities in
Postgres were originally called
tsearch2.

Actually, I remember seeing this contrib
module around 2005, maybe,

when I first started working
with Postgres. It was created by

Oleg Bartunov and Teodor Sigaev, and
then Alexander Korotkov joined

them.

They created it because they were
building a search engine originally.

They helped build quite a big
search engine.

Then they incorporated many of
their works into Postgres, and

this became the tsearch2 contrib module,
which was later merged

into the core, so you don't need
to install an extension; it's

already in core, right?

The interesting thing is that it was
before social networks and

so on. But if we look at how it's
built, it has a lot of things.

For example, you can search, you
can, for example, exclude some

words.

By default, it excludes some words.

In Postgres full-text search
terminology, a dictionary is a program

that can process input, transforming
it somehow.

So stop words, it's just the exclusion
of various articles, for example,

and so on, which probably don't
have any meaning, so we can

just exclude them, that's it.

Then we have Snowball, which is
a stemmer, so it removes the

endings.

It's super fast and super simple.

For example, runs and
run, it's kind of the same

word, so the s should be removed, right?
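[As an illustration of the stop-word removal and stemming just described, a minimal sketch, assuming the built-in english configuration:]

```sql
-- Stop words ("the", "are") are dropped, and endings are stemmed:
-- "cats" becomes the lexeme "cat", "running" becomes "run".
SELECT to_tsvector('english', 'The cats are running');
-- 'cat':2 'run':4
```

[Note that the original word positions (2 and 4) are kept even though the stop words themselves are gone.]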

And then, as a more full-fledged thing,
we can use

some real dictionary, usually Ispell,
which is also used for a thesaurus.

It can give a lot of things,
but it's usually slower and requires

some memory.

And all these things are configurable
in full-text search, and

they were configurable in tsearch2
originally, and it's great.

But if you think about the task,
the use case we want

to solve: we want to present the
user the information which

is most relevant, right?

What is most relevant?

The user has some query.

I want a recipe with, I don't know,
some eggplant, for example.

The user puts in eggplant, if we have,
for example, a food recipe

website, or a knowledge base
system.

And then, having removed stop words
and normalized words, we can just filter,

right?

But filtering is just 1 thing.

We also need to somehow order,
maybe, right?

And present only the most relevant.

What is the most relevant?

Maybe these words are most frequently
used in these documents,

right?

Like repeated many times, for example.

And this is quite a basic idea.

If the word is used many times,
it's considered most relevant.

Now, confession, I never used this.

Why?

Because I'm from a different era.

I'm from web 2.0, where most relevant
meant something else because

it was social networks.

Also, time matters a
lot there.

Like if it's fresh document, like
fresh blog post or something,

it matters a lot.

But Postgres full-text search lacks
this; it's hard to combine them.

There is a special index type, RUM; I mentioned
it many times already.

Maybe I need to revisit it finally
and think, maybe it still

can be used, because if you're
on self-managed Postgres, this

type of index can be installed
as an additional extension.

But the original full-text search can
order only by this ts_rank,

right?

It's a function called ts_rank.

And there is another function, ts_rank_cd,
which also considers positions

and the distance between words.

It's cover density: the closer
the words are, the better the ranking

score is, right?
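[The two ranking functions can be compared side by side; a sketch, assuming a hypothetical recipes table with a tsvector column tsv:]

```sql
-- Hypothetical table: recipes(id, title text, tsv tsvector).
SELECT title,
       ts_rank(tsv, q)    AS rank,     -- based on term frequency
       ts_rank_cd(tsv, q) AS rank_cd   -- cover density: closer terms score higher
FROM recipes,
     to_tsquery('english', 'eggplant & curry') AS q
WHERE tsv @@ q
ORDER BY rank_cd DESC
LIMIT 10;
```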

But for me, it was always nonsense
to use only these word positions

and so on.

I couldn't imagine how I can use
it because I always needed to

take into account time.

And also sometimes things like
likes, reactions, comments, if

this post is very heavily discussed,
it's very important, right?

But back to history, we know there
is BM25 and also there is

this TF-IDF.

The idea: TF is term frequency,
IDF is inverse document frequency,

right?

If some document has this term
like word mentioned many times,

it's good for this document, right?

And if, across our whole dataset,
this term is mentioned

not very frequently, it's
also good for our document, because

it means that it's exceptional
in terms of frequency, right?

So that's why inverse document
frequency works.

So it means that for such thing
we need to analyze the whole

dataset.

Postgres full text search doesn't
do it.

It's quite hard to maintain this
type of thing, but I think

there are some attempts to have
it in Postgres; maybe some extensions

exist.

This is what Elastic uses, right?

But there is another approach.

Back 25 years ago, why did Google
win?

Because it had PageRank.

If we think deeply about this,
it's already from Web 2.0 idea.

Because PageRank takes into account
social network of documents,

I would say.

It takes into account links, right?

So interactions.

We are not alone.

And this idea is super powerful.

I'm curious, are there any extensions
which implement this in

Postgres and how it could be implemented?

This is super interesting.

But this is a big power.

If we have links to a
document, we value it more.

I think there are good algorithms
for solving this with billions

of documents, of course.

There are papers written; anyone
who studied search engine theory

knows them, of course.

There is a course from Stanford,
Mining Massive Datasets, very

good material, where I studied this
PageRank and other algorithms

closer to machine learning on big
data sets.

So the idea is links, right?

You multiply by some matrices and
you get a good index that

you can use.

Again, Postgres doesn't provide
this.

Elastic doesn't provide PageRank,
I guess.

But it's good because we start
taking into account not only word

positions.

Full-text search in Postgres considers
only word positions within a single document,
and that's it.

But since full-text search in Postgres
is inside relational database

system, we can have other columns,
right?

And we can have indexes on them.

So for example, likes or something
like timestamp when document

was created.

We can construct SQL, which takes
into account all these things.

It will be just slow, right?

Because the idea is that the
fastest query is always an

index scan, or an index-only scan is
the best, right?

But when we have full text search,
we have other columns and

we have B-trees probably on them.

Combination is a problem, right?

This is the problem.

But there are ways you can
solve it; let's maybe discuss them

a little bit later, I mean btree_gist
and other extensions.

But originally, the combination
of different factors in the ranking

task, this is the problem, right?

Later, Google and others, they
started taking into account social

network activities.

If some document is mentioned on
Twitter, for example, or Facebook.

It happened like 10 or 15 years ago.

Everyone who dealt
with search engine optimization

knows that if you have a lot of
likes in social networks, your

document goes up in positions.

But if we deal with our system,
we can take into account interactions

like that because it's just some
data we store and that's it.

Today it's usually not that bad
if you don't have tens of billions

of records, just millions or
tens or hundreds of millions.

It's not a huge volume of data.

So you can rely on multiple indexes
and let Postgres decide what

to choose based on statistics.

This is very similar to what we
discussed, remember, with...

Yeah?

Yeah, very similar.

Because it relates to the problems
with full-text search I had when

trying to incorporate it
into social media projects.
Because you have 2 indexes, which
to choose?

Because you cannot choose them
together.

Well, you can, you can have a bitmap
index scan, but it's already

slower.

But most likely Postgres will choose
just 1 index, for

example, the one over time, or the full-text
search one, based on statistics.

And then on the fly, it will apply
either sorting in memory or

filtering in memory, right?
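[A sketch of the kind of multi-factor query being described, assuming a hypothetical posts table with a tsvector column tsv, a created_at timestamp, and a likes counter; the decay constant here is purely illustrative:]

```sql
-- Combine text relevance, freshness, and social signals in one score.
-- Postgres will typically use one index (GIN on tsv, or B-tree on
-- created_at) and then sort or filter the rest in memory.
SELECT id,
       ts_rank(tsv, q)
         * exp(-extract(epoch FROM now() - created_at) / 86400.0 / 30.0)
         * (1 + ln(1 + likes)) AS score  -- illustrative weighting only
FROM posts,
     websearch_to_tsquery('english', 'eggplant recipe') AS q
WHERE tsv @@ q
ORDER BY score DESC
LIMIT 20;
```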

Michael: Yeah, exactly.

That's where the inefficiency comes
in.

Nikolay: Yeah, yeah.

So this is like historical overview.

We had like simple stop words, stemming,
dictionary.

It's not simple, by the way.

It's a lot of functionality.

But ranking based on just our opinion
about how this document

is relevant to your query, not
considering relations or other

things.

The second thing is, I think, PageRank,
and Postgres doesn't have

it.

And the third thing is taking into
account various factors like

time and likes and so on.

But also there is similarity search
now, and we know search engines

have used it for a long time, right?

So...

Michael: Well, actually, I want
to go back to...

I think your example is remarkably
useful.

So the example you gave of searching
a site of recipes for eggplant,

or, let's say maybe we're not
super sophisticated, we type

"an eggplant".

So we can, for example, say that
"an" would count as a stop word.

We don't want recipes that say
"an" a lot to rank highly.

So that's getting rid of the stop
words.

We might want, for example, to
do something slightly more.

We might want to rank recipes that
list eggplant in the title.

We might want those to score higher
than ones where it's listed

in the document more times.

So that's like an interesting additional
challenge.

Secondly, it might be, let's say,
a user-generated recipe site,

and we might want to factor in
how many likes a recipe has got

or how recent it is or something
like that.

So there's like all these, even
in the simplest example you can

think of these days, it can get...

Oh, I've got 1 more.

You might also, if it's user-generated
content, you could have

British people or like people using
aubergine instead of eggplant.

And you also want those to rank
highly.

So yes, exactly.

Nikolay: So synonyms already, that's
a bridge to discussing

full-text search versus similarity.

Let's keep it as the last item,
right?

But that was a great attempt to zoom in.

This is good.

Let's dive in deeper.

So let's slightly mention how full-text
search works in Postgres.

So after this pre-processing, we
just discussed like removal

of stop words and you can control
it.

The good thing about full-text
search in Postgres is a lot of

customization options.

A lot.

You, for example, can control the stop
word list and stop considering

"an" as a stop word.

Or you can, for example, remove
accents.

There's a special extension, unaccent,
right?

Additional pre-processing.
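[A minimal sketch of the unaccent extension just mentioned:]

```sql
-- Strip diacritics before further text search processing.
CREATE EXTENSION IF NOT EXISTS unaccent;

SELECT unaccent('crème brûlée');
-- creme brulee
```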

Michael: So like in French, for
example, with lots of...

Nikolay: There's also support for
multiple languages, which is

also good.

And you can start considering additional
stop words, for example,

removing words if they're mentioned too
often; if almost all documents

have them, let's remove them because
it doesn't make sense to use

them, right?

Then stemming or dictionary
processing, Ispell dictionary

processing.

So in the end we have a set of
normalized words, right?

And then we build an array.

It's called tsvector; again, also a
vector, by the way.

So it's a set of words with positions,
already normalized, in

a normalized form.

But basically it's just an array
of texts, right?

Michael: Yeah, like words or lexemes
or whatever.

Yeah.

Nikolay: For example, I forgot
also to mention hyphened words.

I think you also can control it.

As I remember, from many years
ago, I think you can choose

either to consider them as a single
word or to split them into

multiple words, like 2 words basically,
first and second, or keep them

together.

You can put both the pair of separate
words and the whole hyphenated word,

right?

A total of 3 entries already, right?

Michael: So that if people search
1 or the other, they still

get that result back.

Nikolay: Right.

For example, shift-left testing.

Shift-left, if it's hyphenated, you
can put "shift", "left", and

"shift-left" as 3 separate entries
in your

Michael: tsvector.

Nikolay: You can control how Postgres
will do it when building
tsvectors.

Also, you can concatenate
tsvectors, so data can come from

different sources.

For example, we have title and we
have body; subject and body,

for example, if it's email, or if
it's a blog post, title and content,

or text, however you name it, it doesn't
matter.

And you can concatenate 2 tsvectors,
or you can concatenate

the texts before that with an additional
space and then build the tsvector.

So a tsvector is just an array of
texts, and you can explicitly

store it, consuming space.

Or you can rely on a function, because
then we need to create an index

on it.

Originally it was GiST.

GiST, I think, was the work of
Joseph Hellerstein in Berkeley very

long ago, in the 90s.

It was not finished, but the guys
who I mentioned, Bartunov and

Sigaev, just saw this directory
in the PostgreSQL source code, as

they say, and understood this was
exactly what they needed.

There's also an additional thing.

GiST is a generalized search tree.

It's similar to B-tree, but it's
abstract.

You can use it for many data types
and data operators.

For a B-tree, it will be numbers
and the operators less than, more

than, and equals.

And it's just 1 axis, right?

For an R-tree it's 2 axes, and then you can
have it for arrays as well.

In this case we talk about operators
like intersects, contains,

is contained by, and overlaps.

The match operator is @@, usually.

You can actually redefine it and
use your own symbols.

Actually, first you define a
function, then an operator, since

Postgres is very extensible.

But historically it's @@, or <@, or
@>.
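[The define-a-function-then-an-operator pattern can be sketched with a toy example; the operator name +@+ and the function are invented for illustration:]

```sql
-- First define the function...
CREATE FUNCTION int4_add_abs(a int, b int) RETURNS int
    AS $$ SELECT abs(a) + abs(b) $$
    LANGUAGE sql IMMUTABLE;

-- ...then bind a symbol to it.
CREATE OPERATOR +@+ (
    LEFTARG  = int,
    RIGHTARG = int,
    FUNCTION = int4_add_abs
);

SELECT -3 +@+ 4;
-- 7
```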

So these operators, you can define
how to deal with them in terms

of this B-tree-like index structure.

So it's balanced with many, many
children on each node.

And balancing is automated, everything
is automated.

So when these folks found Hellerstein's
work, it was not

finished, and WAL was not
supported.

They worked on that and added WAL support.

Then there's an additional thing; we won't
go deep because of lack of time,

and I also forgot everything. But
there's also a thing that's

called a signature tree.

To put our tsvectors into this
structure, there's the additional

concept of a signature; you
need to build a signature.

It's also defined.

So these signatures are placed
into the leaves of this tree.

And then search is very B-tree-like,
but with additional exclusion,

which makes the GiST option
not popular at all today.

A recheck is needed, because this
search is not exact.

Michael: I've heard it's called
lossy index types.

Nikolay: Right, right.

So if you check, this is what you
do in pgMustard, right?

Yeah, yeah.

EXPLAIN, ANALYZE, BUFFERS, right?

You check and see in the plan that a
recheck happened.

If GiST was used, a recheck happened.

So let me close a gap.

The query is also processed to build a
vector, right?

But it's called a tsquery, a different
data type, but it's also similar.

It's also a bunch of words, also with
pre-processing.

You can have different pre-processing;
you decide how to pre-process,

right?

But usually it's the same as for the
tsvector: for the tsquery, the same pre-processing

also removes stop words, normalizes
words, and so on.

And then our question is: let's
find everything which matches;

we need to find documents
which include everything we asked

for in our query, right?

Michael: And rank them, but yeah.

Nikolay: Yeah, and rank also additionally.

So there is the approach of putting
such values into trees, into the

leaves.

And then it searches similarly to a
B-tree, but instead of greater

than or less than,
it checks if this vector

is contained in that vector.

I mean, this array is contained
in that array, and that's how

you decide to go left or right, something
like this.

So it's called the Russian doll tree,
the RD-tree, because it's like a Russian doll,

right?

Somehow.

By the way, this term RD-tree is also
in the Berkeley papers, somehow.

So I think it was invented before
Bartunov and Korotkov.

Probably.

Maybe they influenced, I don't
know.

But then, obviously it's slow in
terms of search for large volumes

of data because of this recheck.

This is not how Google and others
worked, in terms of...

Not even Google.

Google, we know, PageRank is a
bigger thing.

But eventually, Bartunov, Sigaev,
and then Korotkov, created

GIN.

GIN is a generalized inverted index,
where we have a list of terms,

our words, and for each, a list of...

Michael: Documents.

Nikolay: Right, all tuple IDs,
tuple CTIDs, I think, of where this

is stored. But not plain lists; there
are B-trees in there, actually,

for faster search.

So it means that GiST is good only
at 1 thing, update speed.

Michael: Faster than GIN?

Nikolay: Right, right.

But GIN was improved; there
is also the fastupdate option.

But anyway, the default option for
us is GIN, right?

And for comparing a tsquery to a tsvector,
GIN is good, search is fast,

but not ORDER BY ts_rank, which is
probably not good.
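[A minimal sketch of the GIN setup being described, assuming a hypothetical documents table with a tsvector column tsv:]

```sql
-- GIN is the default choice for full-text search filtering.
CREATE INDEX ON documents USING gin (tsv);

-- Filtering is fast via the index; ranking still means computing
-- ts_rank for every matching row, which the index cannot help with.
SELECT id, ts_rank(tsv, q) AS rank
FROM documents,
     to_tsquery('english', 'eggplant') AS q
WHERE tsv @@ q
ORDER BY rank DESC
LIMIT 10;
```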

So back to the comparison, back
to these use cases you started

mentioning.

Let's think about them.

First, you said, or at least second,
let's distinguish subject

and body, right, for example.

Michael: Yeah, or should we rank, should the
presence of an ingredient in the title...

Nikolay: Or...

Michael: Yeah. Because we know that people
aren't going to be searching

in our search bar for random
things.

Like maybe there's like a few different
things they might search

by.

The most common is probably ingredient.

Should recipes that list that in
their title rank higher, or

be weighted higher, even if they
only mention it a couple

of times later on? I feel like,
personally, I'd be expecting those

to rank higher than ones where
it was a small ingredient, or

I only needed a small amount of
it, maybe.

Maybe the amount of the ingredient
matters a lot.

Maybe that's easier.

Nikolay: Right, right.

So it's definitely a good thing
to want, right?

If a word is in the title, it means
that maybe the whole article

is about it, right?

If eggplant is in the title, we pay
more attention to it.
Makes sense.

So for this, there are 2 bits which
you can use.

2 bits means 4 options, right?

So there is a function, setweight.

And when you combine information
from multiple sources, from

title column and content column,
for example, body column, you

can say set weight A to first 1,
set weight B to second 1.

And they are only A, B, C, D, uppercase,
because, again, only 2 bits

are used for this.

And then later in your query, you
can also say, I pay attention

to this or to that.

Actually, you can use it for filtering
as well.

And this is what I did.

But originally it was created for
ranking.

I don't remember the details, but you
need to search for the setweight function

in the documentation; it should be
there, and you will find out how

to use it.

It's a very strange concept.

Why only 4?

Because of 2 bits only.

They had only 2 bits to spend for
it.

Maybe there should be more.

But I used it sometimes, like when
we search only in the subject, right?

And I say, okay, only category A.

It's called a category, maybe.

It's embedded inside the tsvector.

So you can skip using it.

It's there, but you can just ignore
it in search.

But in different time, you can
say, I want to search only category

A, it means only title search.

You don't need to build 2 tsvectors
separately.

However, you could, right?
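[The setweight approach can be sketched like this; the sample strings are invented, and the exact lexemes depend on the english configuration:]

```sql
-- Weight title lexemes as A and body lexemes as B in one tsvector.
SELECT setweight(to_tsvector('english', 'Eggplant curry'), 'A')
    || setweight(to_tsvector('english', 'Roast the eggplant, then serve'), 'B');
-- Each lexeme carries its weight label, e.g. 'eggplant':1A,5B
```

[At query time, ts_rank accepts an optional weights array, {D, C, B, A}, so title matches can be made to count more than body matches.]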

Michael: Got it.

So like, if, for example, going
back to our example, I search

what's clearly like an author name,
and we can maybe on the application

side, we're doing a little quick
check. Maybe we're doing something

first to try and categorize what
people are searching for.

If it's an author name, I could
then send that through to Postgres

as like, let's only look in this
category.

Nikolay: Right, right.

But again, it originally was created
for ranking.

I just didn't use it.

So I cannot explain how; the documentation,
of course, explains

how.

Right.

A good thing to mention also:
now we have the GENERATED ALWAYS thing,

right?

Generated columns.

They are stored, but Postgres maintains
them automatically.

So, I think generating the tsvector
is probably a good way to

use that functionality.

So you have the author, title, and body
of a blog post, and then the tsvector

can be generated based on some
expression, which assigns different

categories to different sources
of data, again up to four, and generates

and puts the tsvector there, and then
you have an index on it, right?
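[A sketch of the generated-column approach just described (PostgreSQL 12+); the table and column names are invented for illustration:]

```sql
-- tsvector maintained automatically by Postgres on every write.
-- to_tsvector with an explicit configuration is immutable, which is
-- required for generated columns.
CREATE TABLE blog_posts (
    id    bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    title text NOT NULL,
    body  text NOT NULL,
    tsv   tsvector GENERATED ALWAYS AS (
              setweight(to_tsvector('english', title), 'A')
           || setweight(to_tsvector('english', body),  'B')
          ) STORED
);

CREATE INDEX ON blog_posts USING gin (tsv);
```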

Or you can define a trigger.

All my life I defined triggers
for this.

And also there is another option:
not to store it at all and just

use an index on an expression, GIN on
your big expression, maybe

with this setweight we just discussed.

In this case, it's good in
terms of storage.

Fewer things to keep in memory.

I don't know.

Sometimes you need to deal with
the tsvector directly.

If you have only an expression index,
then you need to construct

this expression once again to deal
with it.

I don't like this approach somehow,
because it limits in terms

of what you can do with such records.

So, if you don't store it, you
need to build it on the fly to

deal with it, to additionally somehow
analyze it or so.

So I always prefer to store them,
although they can consume a

lot of space, with TOAST and so
on.

Sometimes I just put them into a separate
table, understanding that

I will then need to join.

For me it's easier to create a
trigger and use a regular column,

right?

But if you're deciding whether to store
it or not, it's good to

think about what you will do with it.

If you want to deal with the tsvector
in different ways, then

keeping it stored only in the index
itself

is maybe not enough, and maybe
it's good to put it into the table.

But to put it to the same table
as a column and rely on TOASTing

and so on, or maybe to allocate
a different table and like one-to-one

relationship?

It's a good question.

Again, depending on workloads.

So okay, this we covered.

Can you remind me of the other use
cases you mentioned?

Michael: Well, I consider this all
kind of one use case, but I guess

there are all the complexities
that you may or may not want to

support.

1 that I didn't mention, but probably
we care about is what if

somebody spells eggplant with 1
g?

Nikolay: Yeah, typos.

Michael: Yes.

Nikolay: For this, there is pg_trgm,
trigrams.

An extension which actually also
uses the RD-tree.

GIN, actually.

GIN these days.

The RD-tree, forget about it.

Forget about GiST, it's very rarely
used.

So GIN, right?

So we have some text, right?

Or some...

Basically, trigrams work
on words, right?

We have a word, and we suspect maybe
there is a typo in it.

So it pads with 1 space on 1 side
and 2 spaces on the other side,

and then just splits it into chunks of
3 letters, 3 letters, 3 letters,

right?

And then we have array again, right?

Vector.

Vector, right.

And then the question is which
is the closest.

Closest means the most overlapping:
most of the members of the arrays are the

same.
Might be not all of them.

And if it's just 1 letter typo,
it means 3 members of this vector

will be different.
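[A minimal sketch of the pg_trgm behavior being described:]

```sql
CREATE EXTENSION IF NOT EXISTS pg_trgm;

-- The padded word "  eggplant " yields 9 trigrams:
--   '  e', ' eg', 'egg', 'ggp', 'gpl', 'pla', 'lan', 'ant', 'nt '
SELECT show_trgm('eggplant');

-- A one-letter typo shifts only a few trigrams, so the
-- similarity score stays high (1.0 would be identical).
SELECT similarity('eggplant', 'egplant');
```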

Michael: Ah, nice.

Yeah.

Well, it wouldn't include EGG,
it wouldn't include GGP, and it...

Is that...

They're the only 2?

In this case, I think, possibly,
just they're the only 2.

Nikolay: So with trigrams, we're just
shifting, shifting, shifting,

right?

We don't just split; we're shifting.

So the start position first is,
for example, a space, then 2 letters;

then the first letter, second
letter, third letter; shift,

shift, shift, and we construct an
array out of it.

And then if, for example, you just
missed, forgot 1 letter, the overlap

will be huge.

The distance is very small.

Distance low, similarity
high.

If you, for example, mixed up the positions
of 2 letters, just swapped

them, the overlap is also huge.

This is an interesting idea behind
3-grams.

And again, we use either the RD-tree, which
nobody does, or GIN.

We use GIN so we can find the
arrays which are closest, where the overlap

is higher.

That's how we find words which are
actually present in our document

dataset.

And then they say, you probably
thought about this word.

1 thing: in the past, I remember
we maintained a list of words,

with usage counts probably, using
some statistics provided by

full-text search.

I don't remember details, but you
can build a list of words with

stats, and then you can store it
in a table, and then use trigrams

on top of it.

You need triggers to maintain this,
or you need to refresh it

periodically.

For example, if you just imported
a lot of documents which use

new words, this list of words is
already outdated, you need to

rebuild it.

And then this trigram approach
worked: you have input words,

you check the table, and it works.
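[The statistics function half-remembered here is likely ts_stat; a sketch of the word-list approach, assuming a hypothetical documents table with a tsvector column tsv and the pg_trgm extension installed:]

```sql
-- Build a word list with document and occurrence counts from all
-- tsvectors; refresh periodically (e.g. after bulk imports).
CREATE MATERIALIZED VIEW search_words AS
SELECT word, ndoc, nentry
FROM ts_stat('SELECT tsv FROM documents');

CREATE INDEX ON search_words USING gin (word gin_trgm_ops);

-- Suggest corrections for a possibly misspelled input word.
-- The <-> distance ordering runs without index support here;
-- a GiST trigram index would be needed to accelerate it.
SELECT word
FROM search_words
WHERE word % 'egplant'        -- pg_trgm similarity operator
ORDER BY word <-> 'egplant'   -- trigram distance, closest first
LIMIT 3;
```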

Now I think we don't need it, right?

This list of words gets maintained
automatically.

Or we just don't care, and build
an index and then take the whole

input.

Michael: That's how I've seen it
used, yeah.

Nikolay: So there are several ways
to use it, and maintaining

this additional table probably
still makes sense, right?

List of words.

Michael: Do you think it would
then speed up queries, or what

do you think the benefit is?

Speed?

Nikolay: You lose speed here probably,
right?

But accuracy is good.

Michael: Accuracy, yeah.

This is great, but for the
conversation you mentioned earlier,

the comparison of full-text
search versus semantic search

in the current climate, something
I wanted to introduce was

whether the use case suits false
positives or suits false negatives

better. I think it's really
difficult to come up with

a solution that does neither.

But often you see solutions slightly
biased one way or the other.

Take Google, for example:
what PageRank was really good

at was making sure it didn't miss
a website that was clearly

very relevant to your search term,
and so it was very, very good

at avoiding false negatives. But
you did often, especially in the

earlier days, get quite a lot of
false positives.

You'd get articles that didn't
match your intention, at least,

even if it did match the words.

But for the recipe example, maybe we
don't care so much: if

you miss 1 great recipe but
you get 17 perfectly matched

ones, that's still a good result.

Maybe you'd rather weigh the trade-off
that way around.

I don't know how you see that and
whether that's relevant.

Nikolay: I see this as many different
small use cases.

First of all, I
remember your different use cases.

This is what I mentioned as well:
we want to take into account

likes, comments, timestamps,
everything.

It gives us the ability to build a very
big query, and then we think,

okay, but our full-text search,
a single GIN index, is not enough,

right?

Sometimes we can use btree_gin,
for example, right?

Which allows us combining these
arrays of text and regular numbers

or timestamps in a single index.

And this is good.

We can have single index scan.

Also there is btree_gist.
Also interesting thing.

Also there is RUM, which you need
to install as a separate extension,

but these 2, btree_gist and btree_gin,
are included with the regular

contrib modules, so they're available
with any Postgres.
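[A minimal sketch of the btree_gin combination just described, assuming a hypothetical posts table with a tsvector column tsv and a scalar category_id; note this is most useful for equality filters on the scalar column:]

```sql
CREATE EXTENSION IF NOT EXISTS btree_gin;

-- One GIN index covering both the tsvector and a scalar column,
-- enabling a single index scan for combined conditions.
CREATE INDEX ON posts USING gin (tsv, category_id);

SELECT id
FROM posts
WHERE tsv @@ to_tsquery('english', 'eggplant')
  AND category_id = 42;  -- illustrative value
```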

But in some cases we want more. Like,
I forgot to mention that Postgres

also supports it.

I think the function is websearch_to_tsquery,
or something like

this.

It supports a small language of
phrase search and inverted

search; you can exclude, the user can
say, I want everything, but

this word should not be there.

Just a minus, right?

Or put in double quotes for exact
match.
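[A minimal sketch of that Google-style mini-language (available since PostgreSQL 11); the sample input is invented:]

```sql
-- Double quotes become a phrase match (<-> in the tsquery),
-- and a leading minus becomes a negation (!).
SELECT websearch_to_tsquery('english', '"eggplant curry" -frozen');
```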

All this means that the user
can control it in the regular manner;

Google also supports things like
that, right?

So if you want, for
example, to see an exact phrase mentioned,

you know it's there, and you
just can do it.

But then similarity search, 2 big
use cases.

For example, you mentioned synonyms,
right?

Someone doesn't use the word eggplant,
there's another word meaning

the same thing.

Michael: Aubergine.

Honestly, that's what we call
them.

Aubergine.

A-U-B-E-R.

I'm not going to try the rest.

Nikolay: I hadn't heard it.

Cool.

So if you want a synonym search,
full-text search supports it.

But you need to maintain it.

You need to maintain the dictionary
of synonyms.

And the normalization process handles it
automatically: all synonyms

will be folded, in our tsvector
and tsquery, into the

1 word we chose.

So we can maintain synonyms.

It's easy.

Not easy.

I mean, it's straightforward.

It's not easy because it requires
effort of maintaining synonyms,

right?
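[A sketch of the synonym dictionary setup being described; it assumes a file my_synonyms.syn placed in the installation's tsearch_data directory, containing lines like "aubergine eggplant":]

```sql
-- Dictionary based on the built-in synonym template; SYNONYMS names
-- the my_synonyms.syn file (an assumption for this example).
CREATE TEXT SEARCH DICTIONARY my_syn (
    TEMPLATE = synonym,
    SYNONYMS = my_synonyms
);

-- Clone the english configuration and put the synonym dictionary
-- in front of the stemmer for plain words.
CREATE TEXT SEARCH CONFIGURATION my_english (COPY = english);

ALTER TEXT SEARCH CONFIGURATION my_english
    ALTER MAPPING FOR asciiword WITH my_syn, english_stem;
```

[With this in place, both "aubergine" and "eggplant" normalize to the same lexeme in tsvectors and tsqueries built with my_english.]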

But on the other hand, if we use
pgvector, it probably puts both

words very close in this high-dimensional
space, right?

Because the meaning is the same.

Right, interesting.

So maybe semantic search provided
by pgvector is better here,

right?

Okay.

Another thing is, for example,
something is not working versus

something is working.

Not is a stop word, right?

What do we try to search?

Something is not working.

For example, my car is not working.

Maybe I should put "not working"
in double quotes, so "not" is definitely

there, for an exact phrase search.

Or maybe I just need to use pgvector,
because it definitely will

distinguish semantically that "not
working" and "working" are far apart

in this high-dimensional space,
so similarity is low,

right?

The distance is large.

And if we are trying to find "my car is not working", we won't get the opposite documents; they will not be ranked high in terms of similarity.

But also pgVector...

Michael: I think that depends, though.
though.

I think that's contextually important
and interesting because

let's say this was a car forum,
how many posts are you going

to find about cars that are working great?

Like realistically in the

Nikolay: data set?

Michael: Well, it's a bad

Nikolay: example. It was a bad example, but in general, given how full-text search works in Postgres, it will remove "not" as a stop word.

And it's bad sometimes, right?

Because it's the opposite meaning,
right?
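This is easy to see directly: the built-in `english` configuration drops "not" along with the other stop words, so the two opposite sentences normalize to the same lexemes:

```sql
SELECT to_tsvector('english', 'my car is not working');
-- 'car':2 'work':5

SELECT to_tsvector('english', 'my car is working');
-- 'car':2 'work':4
```

Only the positions differ; the lexemes are identical, so both sentences match the same queries.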

Michael: Yeah, makes sense.

I didn't realize that.

Nikolay: We might think about similar
examples where the removal

of the stop word basically leads to
very bad ranking or filtering,

also.

Michael: Yeah, like one that comes
up quite often is like film

titles or like band titles.

If you put the in front of it,
it could be a completely different

band.

Nikolay: Or, well, English has a lot of such examples.

For example, "go on". "Go" alone is just "go", like "goes", right?

"On" is just some preposition...

But "go on" means continue, right?

Meaning is very different.

If you just look at words separately,
you don't get this meaning

at all.

And there are many such phrases,
right?

In many languages.

And semantic search will capture
this, the meaning, and put it

to vector, right?

And then we use approximate nearest
neighbor search, ANN search.

But it lacks many capabilities,
like for example, exact phrase

search or inverted word search,
or this categorization I mentioned.

What do you think?

It's interesting.

Should we use only semantic search
these days?

Michael: I don't think so, but only because I'm thinking of some quite boring use cases.

For example, I quite often think about software-as-a-service applications.

Let's say you might be looking up a customer, and you might want to look them up by email address, or by name; there are a few different things you might want to look them up by.

And you might want to provide your users with a single search bar to do that.

Nikolay: Or you can have tags, for example.

Based on what I just described, with these tsvectors, or arrays of text, you can use GIN for text search, putting all the text for a document into a single value, a single column, right?

All the text, denormalized.

I first did it, by the way, using
GiST in 2008.

It was the topic of my first talk
in the U.S.

I presented in 2008.

Can you imagine?

And we were building this (it was social media), and all tags were stored in a single column, instead of the EAV approach, where you store the words (tags, terms, they can be phrases) separately: 1 table for documents, another table for tags, and between them a relationship table, right?

And then you need 3-table joins, like 2 joins, all the time.

In terms of storage and speed of search it was terrible, and it remains terrible, EAV, the Entity-Attribute-Value approach.

But here you can put all the tags in, and full-text search can give you exact search by tags.
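A minimal sketch of that denormalized approach (table and column names here are made up): tags live in one array column with a GIN index, so exact tag search needs no joins at all:

```sql
-- Hypothetical schema: tags denormalized into one array column
CREATE TABLE documents (
    id   bigserial PRIMARY KEY,
    body text,
    tags text[]
);

-- GIN on the array supports exact containment search
CREATE INDEX ON documents USING gin (tags);

-- Find documents having a given tag, no EAV-style joins needed
SELECT id FROM documents WHERE tags @> ARRAY['postgres'];
```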

Versus if we put everything into a vector, how would we even do it?

We probably would just take author, title, body, and then, okay, we have tags.

So we append something like "tags: this, this, this", right?

And rely on our vector search to use that somehow, but we lose control, right?

Whereas with full-text search, or with multiple indexes, we can reliably find the proper documents, the ones which definitely contain some text.

Michael: This might be my answer.

I think if I care a lot about reducing
false negatives, like

it would be really bad if you search
the exact name of a customer

in my application and we didn't
show it back to you.

If we couldn't find it.

So sometimes the problem with the index types for vector search is that they can have false negatives: results that should be in the correct answer turn out not to be returned.

Nikolay: And also, we remember that vector search, since dimensionality is super high, uses approximate nearest neighbors.

Michael: So that's what I'm trying
to explain.

Nikolay: But full-text search is exact; matching, or something like filtering, is exact.

Yes.

Precise.

So let me rephrase you.

If we want to present user just
single input with a button, fine,

right?

Probably similarity search is enough,
right?

Or if we want to have some of this language, like excluding a word, or exact phrase search in double quotes, or we have an advanced search form, like choosing the author, a date range, or something.

Similarity search is not enough
and probably we will need not

only full text search but also
faceted search with filtering,

additional things.

Some things we can put to full text
search index, as I described,

but not all of them.

And it's interesting that now we
have similarity search, now

we have full text search, typo
correction basically, but it requires

effort, as I said.

Other indexes, like B-tree, don't go away at all, right?

And then we can probably build
some system which can be very

powerful in terms of what you can
do with it.

And if we consider a particular
example, for example, I know

you used PostgreSQL.org and Google
search, right?

2 search engines, right?

And PostgreSQL.org search is remarkably
weak.

Is

Michael: it true?

Yeah, it's not great, is it?

Nikolay: It seems it doesn't use full-text search, or maybe it uses it in some strange form, because I don't see all the power full-text search provides.

Maybe we could look into details
because I guess source code

should be available.

But in general it can be improved, and just using PostgreSQL's own full-text search would improve it, and we could define more things, right?

But if pgVector were installed, it would be even better, right?

Because we could combine semantic search and full-text phrase search.

I think, for me, full-text search
is good in terms of these capabilities

like this language, like phrase
search and negative exclusion,

and also categories.

Yes.

So, as someone who built a lot of systems using full-text search, and recently built 1 system using pgVector and semantic similarity, I think it's probably the proper time to start adding full-text search capabilities also, and it will be tricky and interesting how to properly combine them.

I saw simple examples, trivial
examples, like how to combine

them.

Let's find 50 documents using semantic
search, 50 documents using

full-text search, that's it.

Or let's find 1,000 using full
text search, then re-rank them

using similarity and leave only
100.

And then define some additional
ranking and leave only 10.
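That second pattern, prefilter with full-text search and re-rank with pgVector, could be sketched roughly like this, assuming the pgvector extension is installed and a `documents` table has an `embedding` column (all names and the query vector are placeholders):

```sql
-- Step 1: fetch up to 1,000 candidates with exact full-text matching;
-- step 2: re-rank them by cosine distance (pgvector's <=> operator).
WITH candidates AS (
    SELECT id, embedding
    FROM documents
    WHERE to_tsvector('english', body)
          @@ websearch_to_tsquery('english', 'replication "wal level"')
    LIMIT 1000
)
SELECT id
FROM candidates
ORDER BY embedding <=> '[0.1, 0.2, 0.3]'::vector  -- query embedding placeholder
LIMIT 100;
```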

Quite weak examples because I have
no idea how pagination will

work, right?

And also, what about speed?

So for me, it remains an open question
how to look back at old

functionality provided by full-text
search and bring them to

similarity search.

So it's a super interesting topic.

Michael: It depends on the use
case, right?

So I'm thinking about your bot
for Postgres AI, you don't need

to paginate, right?

Like if you're returning the results
in a chat-based interface.

Nikolay: Yeah, there, yes.

Michael: So it's interesting: depending on the use case, maybe we don't always have the same constraints we're used to having via web-search-type interfaces.

Nikolay: Well, I think yes and no.

We don't need it now, but building the bot, I already observe my own behavior when using Google.

Michael: And there

Nikolay: I definitely sometimes
go to page number 2, page number

3.

Michael: But not page 20.

Like pagination maybe is not that
big a deal if you're looking

at the first 3 pages.

Nikolay: But speed will suffer.

But anyway, I think I should start combining with full-text search, because sometimes I want an exact search: I know the exact phrase or the function name.

Sometimes I want to exclude something.

I mean, with the bot I already want to exclude something.

It happens.

And also I think about two-phase search, where we first search for more items.

Maybe indeed pagination is not
needed.

I think about the idea of searching for a lot of items, like 1,000 or 2,000, and then analyzing the snippets, like in Google, using, again, automated analysis with an LLM, and leaving only a few.

For those, we need to expand, because I always open every relevant-looking snippet: I just click it, open a new tab, and consume the whole document.

But we cannot consume whole documents for 10 or 20 items.

It's interesting there.

But full text search, I think,
is about to return.

Because again, sometimes I want
to search by title reliably.

I know it was mentioned in the
title or something.

Or authors.

Also author.

Yeah.

Michael: Makes sense.

Exciting times.

Nikolay: Yeah, maybe it was slightly messy in terms of content structure this time.

But I still have questions and
I don't like examples like just

50 semantics, 50 full-text search.

Michael: Do you want to hear from people?

Should people let us know how they're currently doing this?

Nikolay: Yeah, maybe there are
better ideas already.

I think it's a very evolving area,
right?

And maybe there will be more systems
where capabilities will

be combined, because we know that Google, for example, combines a lot, including similarity and full-text techniques.

Michael: Yeah, so if you're doing
this in Postgres already combining

the 2 techniques, let us know.

Nikolay: Yeah, how exactly, what
are the details?

It's super interesting.

Good.

Michael: Nice one.

Cheers, Nikolay.
