Nikolay: Hello, hello, this is
Postgres.FM.

As usual, I'm Nik, Postgres AI,
and as usual, my co-host is

Michael, pgMustard.

Hi, Michael.

Michael: Hello, Nikolay.

Nikolay: How are you doing today?

Michael: I'm good, thanks.

How are you?

Nikolay: Great.

Very good, very good.

Thank you for asking.

So today we have a very big topic.

I'm joking.

But Yeah, I hope for this podcast
episode should be the shortest

because we are going to discuss
very narrow topic.

Although this topic hits almost
every project, as I noticed.

So every system has usually some
table like users.

It has some column like email and
in email we usually expect

the search to not to distinguish
registry of characters look

case it should be case-insensitive
search right So if it's all

uppercase email or lowercase or
mixed, all those values must

be considered the same.

However, All people made this mistake,
including myself, many,

many times.

You usually start using varchar
or something or text for email

column.

And then you see that multiple
accounts, multiple users' records

have basically the same record,
just written in different cases.

Right?

So here, we usually, what do we
do?

We just introduce additional unique
index, over expression, lower

email.

So we convert email value to lowercase
or uppercase, doesn't

matter, and build an index, B-tree index with unique option,

create unique index concurrently
on lower or upper function from

that value and this gives us uniqueness,
case-sensitive uniqueness,

right?

But that's not it, we then need
to fix our search queries.

Michael: And make sure that every
future use of email uses the

function as well in order to be
able to use the index or have

2 indexes if you want to be able
to, you know, there's a few

kind of like gotchas around this
approach.

Nikolay: Honestly, like roughly
half of the cases, you know,

I, we do a lot of health check
of our startups, like usually

like they are grown to some like
terabyte or so.

And we see some tables and not
every time we look into the, the,

the, the table structure, but oftentimes
we do.

And I just see some indexes, like
index on raw email value and

on lower email, I see it a lot.

And some queries use 1 index.

So we hunt for unused indexes,
which is part of our health check,

and we discussed it on our podcast
in terms of how to maintain

good health of indexes.

Extra indexes is not what you want
to have.

And in this case, if we have index
on raw value and on lower

of email, often we also see both
of them are used, which means

there is a mess.

So all of this was introduction
of our today's topic, which is

called citext.

Michael: Yeah, so this was, I know
it's a very narrow topic,

but this was my choice because
I was looking through the contrib

modules at what was in there that
was both popular and that we

hadn't discussed yet.

And we've got to enough episodes
now that that was not a long

list of ones that, I mean, there
was, there were definitely some

other extensions we have, some
other contrib modules that we

haven't discussed yet, but the
ones on that list are not ones

I hear discussed often, not ones
I see in use.

Whereas citext, I feel like at
least was quite popular.

It did get quite a lot of use,
or at least a lot of people were

interested in the use case.

So yeah, I was interested kind
of in your, in your perspective

on this.

And I also, after suggesting it
as a topic, started looking into

it more and found the topic more
interesting than I was expecting

to.

So yeah I'm looking forward to
getting your thoughts on this

and what people should be doing.

But just in terms of those drawbacks
of using the function approach,

there's a couple more mentioned
by the documentation I thought

were good.

1 is less important but there is
the just the verbosity of every

query and just having to have functions
in them.

But a more important 1, potentially,
depending on other opinions

you might have, is that you also
then can't use it as a primary

key if it's just text because,
or at least you'd have to have

another unique index on it to guarantee
uniqueness, to make sure

somebody couldn't sign up with
the same email address, but with

some case different because it's
the same email address.

But yeah, email, email's an important
1.

I don't know if you'd actually
want email as a primary key anywhere,

but also username.

Like a lot of services allow you
to set a username that is

case-insensitive, like social media
handles and things.

Those will often be, if not
case-insensitive, definitely unique

depending on case in a lot of cases.

So like Twitter handles, for example.

Nikolay: Yeah, and the question
is why it's an extension not

in core, right?

Michael: Well, there's a couple
of questions, right?

Like, could this be a data type?

Yeah, could it be a core data type?

Nikolay: Well, it is data type,
but

Michael: it's

Nikolay: living outside of the
core.

Michael: But I actually think so.

Well, I don't know how quickly
you want to jump ahead, but the

reason I found this super interesting
is In the docs for citext,

the contract module, it actually
suggests not using either citext

or the lower Function approach,
but instead creating a case-insensitive

collation and using that for it.

And that was so elegant.

But it is something you have to
do at the User level.

And we could, yeah, we could ship,
like, I don't see why Postgres

couldn't come with 1 of
those prebuilt that you could

just use.

And maybe even, maybe it wouldn't
be a case-insensitive field.

I could imagine a case.

Sorry, I'll keep this.

So I need to stop using the word
case to mean like argument.

I can imagine an argument for having
email as a datatype that

could be...

Nikolay: Oh, that's interesting.

I remember for Postgres, some huge
snippet to check that it's

really email.

It's like this RFC, right?

And it's not trivial to ensure
some valid email address or not.

It's not, it's not really like
so trivial as 1 might think.

Michael: Well, so I've seen 2 cases
for this 1 is the really

complicated 1, which actually I
think suits the argument that

it should be pre-built because
you don't want everybody rolling

their own version of that and The
other version is no just ship

it right back.

All you need to do is It doesn't
contain an at It maybe is it

all lowercase or can like cast
it to lowercase?

And then that's it.

Maybe a period after the at, or
a full stop after the at.

But I think anything more complicated
than that runs into potentially

not allowing things that should
be allowed, right?

Nikolay: So I will tell you what
I think.

First of all, I agree that collation
is an elegant solution.

And actually, collation, it's the
job of collations, even in

SQL standard, is to define rules
how we compare values, right?

And case sensitive or case-insensitive,
this is comparison.

So it's exactly where it should
be.

The only thing is, I remember there
should be some, like you

need to distinguish comparison
at the byte level, like logical

level.

It's like deterministic or something,
right?

Yes.

Yeah.

So honestly, I never use it myself.

Never.

I just read about this and I have no idea.

My own experience is, I was like, when I first discovered citext,

probably it happened 19 years ago or so.

Like on my second year of Postgres use, or Maybe in the first

year even.

Michael: I have the history of it here, by the way.

Nikolay: Yeah, it's a very old contrib module.

Michael: First released to the public 17th of February 2003,

but it wasn't contrib module at that point.

That only happened a few years later.

And it was a kind of like a redesign at that point.

But yeah, so you were pretty early.

Nikolay: Well, yeah, I started using Postgres in 2005 or 6.

And of course, it was social media.

So we definitely had the table users.

At the time, I tried to avoid plural.

So it was maybe something, something else like person or something.

Yeah, Then I was convinced by the influence of Ruby on Rails,

and Django and others, like, okay, it should be plural.

And then I quickly became a fan of citext.

And then I quickly became not a fan because like maybe 1 or 2

years later I stopped being a fan because like it's extra effort

to install extension and then you need to describe everyone what

like oh this special data type and I found it easier to keep

using text or varchar and these days I think just text but well

it's not a topic text or varchar and with limit or no for length.

Michael: But did you use collations or no?

Nikolay: No, no, no, no, no, no.

Michael: I just interesting.

Nikolay: It's just functions.

Yeah, function index with unique option if we need uniqueness,

that's it.

And then we need to adjust all the Queries.

And I was in my own projects, I was hunting to if we if we made

decision that it should be case-insensitive search, then probably

it should be global.

So we should get rid of index on raw value.

So that's it.

And fix it and all Queries and avoid sequential scans.

And that's it.

So this is like, you either want a sensitive search, so index

should be on the raw value.

Or you want case and sensitive search.

In this case, you must have index on lower or upper.

You need to choose in advance and make it a rule, like part of

your code style, probably.

And then use it usually on both sides of comparison, so input

value also get lowercase or uppercase,
and the value of Column,

and we know that we rely on Index
on expression here.

Michael: Yeah, well this is an
interesting thing that the original

author, so it was originally, the
citext module was originally

authored by Donald Fraser and in
the initial email about this,

1 of the use cases was multi-Column
functional indexes.

So if you wanted to have, so if
you want an Index on a single,

like just on email, fine, you can
do a Function on that.

But at least at the time, you couldn't
have it on 2, let's say

you wanted it on email, comma,
username, which may have been

more common back then before they
had like the includes option.

If you wanted it to be on the lower
of both of those, you'd have

to have like a kind of a parent
Function that did both of those.

So it got a bit messy in terms
of multi-Column indexes.

Whereas if you have them as their
own data type, you can just

do that at will.

Nikolay: Multi-Column expression
on 2 expressions, so lower email,

lower something else.

I think right now it's definitely
possible.

Michael: Okay, great.

I didn't realize that.

But back then it wasn't, so that
was like...

Nikolay: Yeah, that's interesting.

I don't remember this nuance.

Yeah, okay.

But yeah, so it's difficult to
name, to use the term multi-Column

here because it's already not Column
but expression yeah multi-layered

or something I don't know so multi-dimension
multi-dimensional

maybe Index on expressions.

It should work these days.

Not a problem, but also there is
a like there is in inconvenience.

If you go this route and use like
lower email or something you

it's easier to forget that we need
to when Index is created we

need to run Analyze on the Table
because it lacks statistics

on expression

Michael: yes yeah good point

Nikolay: If you build the regular
Index on raw, a Column value,

statistics usually already there.

But for expressional indexes, it's
not there.

But yeah, since then, I made this
decision in favor of expression

indexes and I never went back.

So why should I use citext in
2025?

Michael: I don't think you should,
but I also don't think you

should be having to do lower Functions.

Nikolay: Both paths are bad, collation
path is good.

Michael: That's what I think is
the solution now.

So it's like super interesting.

I actually came across just by
chance while looking into this

a blog post by Adam Johnson who's
worked on Django for many years.

It seems like Django actually removed
support or mostly removed

support for citext in version 5.1
of Django.

So they actually deprecated this
and he blogged about what you

should do instead and wrote up
about the how to set up a collation

and what we kind of define all
the kind of ICU I don't even know

what you call them like codes or
it's called a locale isn't it

But it's like made up of a bunch
of different, complete gibberish

in my opinion.

But you can look up the spec and
kind of choose which language

do you want it to be and do you
want it to be case sensitive

or not?

Do you want it to be sensitive
to accents or not?

So you

Nikolay: can.

Yeah, that's cool.

And it's much more flexibility.

And it's also SQL standard approach,
I think.

Michael: Yes, it is.

The last time

Nikolay: I looked into SQL standard,
it was more than 10 years

ago.

So my memory might make tricks
with me, but I remember something

about collations.

This came up,

Michael: well, in 2003 when it
was first proposed, I think I

read a post from Tom Lane in that
thread that said basically

the SQL standard way of doing this
is via collations, but I can't

see us doing that anytime soon.

So this makes sense as a hack in
the meantime.

Nikolay: Oh, that's great.

Yeah.

So, so that's great.

So it's very hard.

How are we going to name this episode?

I thought it will be citext now
it's shifted.

Michael: I was going to suggest
maybe case-insensitive

Nikolay: text.

Don't cut this.

Because it shows confusion, you
know, like it's like case-insensitive

comparison.

Yeah.

Michael: Because that's what we
want, isn't it?

Nikolay: But it goes beyond just
cases.

You mentioned excellence.

That's great.

Because some people might use my
like some names, for example,

like Thomas.

Maybe.

Michael: But as you said, I think
the 2 main use cases for this

are email addresses and usernames.

That's where I've seen it.

And email addresses are unique,
no matter the case, but I don't

know if they're unique.

Nikolay: No, they are not.

They are not.

But we might decide, for example,
if username in our system is

allowing like variants of Latin
characters, We might say, okay,

we allow them, but we don't want
collisions.

Like 1 user used pure Latin, another
user used like native version,

like from another language.

Many languages have it, right?

But we say we allow them, but first
person who use it should

be like, should be protected from
collisions.

In this case, we might decide to
use collisions and define unique

indexes.

That's a great idea, I think.

Yeah.

Michael: Yeah, good point.

In fact, I know it's not quite
related to that, but reading about

this, I came across somebody giving
an example of lower not always

being ideal, or lower and upper
being subtly different, because

in some languages, you can get
an uppercase character that can

have 2 alternative lowercase versions
of it.

I

Nikolay: didn't know about this.

Michael: Nor me.

But that, I mean, you can imagine
the kind of hassle that's going

to end up causing.

So yeah, it's...

Nikolay: Wow.

Michael: Collations are where this
stuff is defined properly,

right?

Yeah, yeah.

So it gives

Nikolay: you big freedom.

But what are downsides of using
collations?

Michael: Well, I did wonder because
do you remember, you know,

if glibc changes and we can end
up with corrupted indexes.

Nikolay: I don't, I, how can I
like,

Michael: Well, of course you remember?

Nikolay: Yes, of course I remember.

We had multiple cases with multiple
clients.

Michael: Could that be, Could that
be an issue here?

I

Nikolay: don't know.

Well, if you use ICU, it should
not be so.

I don't

Michael: see how.

Right, yeah, right.

Nikolay: Yeah.

And I think in this case, we definitely
want to specify a provider

ICU, right?

Yes.

Good point.

Yeah, and this comparison, I don't
see issue here.

And also there we have issues not
with like comparison but with

order right?

True.

Michael: So isn't that the same
thing?

Nikolay: No Who is bigger who is
who's upper who is lower basically

in the order it doesn't it's not
the same As who equals who I

know

Michael: it Except if you're looking
for like is there a clash

is this Is this unique?

You might traverse it only to the
point where you're expecting

it to be, but it's actually lowered
down.

So you say, oh no, I've already
got past that point in the index,

you can insert this and you can
end up with duplicate values

in the unique index.

Nikolay: Yeah, what I don't know
is if, so here we use deterministic

false, right?

Michael: Yes, yes.

Nikolay: And it means that at byte
level, the characters are

different, but we logically, we
think about them as equal.

But if you think about as equal
in the order by, they should

go together, right?

But do we have deterministic ordering
of them always?

This is, I don't know, like should
be, right?

Or there's some other mystic thing.

If so, that might cause this corruption.

Or yeah, If it's not unique index,
if it's unique index, it's

just 1 value.

I don't see corruption risks.

But if it's non-unique index and
like these values will come

in groups, is there like deterministic
structure?

I mean, not a rule, how they are
ordered.

So we avoid the switch of order
when glibc version changes.

But again, here we talk about ICU,
so glibc is not involved,

right?

So everything should be fine.

The only thing, just not to forget,
ICU should be enabled at

compile time.

I just, yeah, which is default,
I think right now.

Is it default?

I just remember Cursor and I were
fighting a few weeks ago because

my laptop didn't have ICU installed,
so I always reminded that

I created even Cursor a rule without
ICU.

If you run configure without this,
Then I surrendered and installed

ICUs just because somehow Cursor
kept losing this rule somehow.

I don't know why.

So yeah, so it should it should
be present.

It might be not present and not
be available on the system.

Michael: I think lots of people
use managed services these days,

I think it tends to be on those.

But yeah, really good point.

But yeah, this seems to me like
the way to do it now.

And just to complete that, I'm
going to read from Adam's blog

post.

The syntax is create collation,
and then you can call it whatever

you want.

For example, case-insensitive.

And then you set provider and he
goes provider equals ICU.

Then locale and then that's the
string that you can set all the

various settings we mentioned earlier.

And then you choose whether it's,
you have to choose in this

case, deterministic equals false.

So that's the byte-wise thing that
you're talking about.

Nikolay: Right.

Right.

Michael: So yeah, in fact, just,
is it worth going through that

gobbledygook in the locale quickly?

So yeah, he's gone with und, so
UND, I thought it was German

at first, but it actually just
means undetermined language.

So you could set it to like German
or French or something if

you really wanted to.

U specifies Unicode attributes,
and then KS, level 2.

So apparently that's collation
strength and then level 2 doesn't

include case in comparisons only
letters and accents.

So letters and accents can be considered
different, but case

can't.

So yeah, so that's what you'd want
in email address use case,

I think.

Nikolay: Yeah, so yeah, that's
interesting.

Anything else?

We have chances to have this episode,
as I promised, very short.

Or there are some additional aspects
here.

Michael: So no, I think that's
it.

Nikolay: So it looks like both
of us, if we were building a new

system, we probably would try to
use this approach with collations.

Sounds

Michael: like 100%

Nikolay: most flexible, most powerful
these days, right?

Michael: Yeah.

Nikolay: Yeah.

You know what I'm
checking right now and I see

cases about, collation
corruption.

Yeah.

Yeah.

I see, Daniel Vérité or how...

Michael: Oh yeah.

Is he psycopg?

Nikolay: Not sure.

I remember this name.

I read some blog posts in the past, but definitely collation

versioning problem with ICU 73.

So it's interesting to think about possible corruption when something

changes under the hood.

Yeah, and if we go this route, what to expect?

I lack personal experience with it, but I would definitely study

this topic closer because I thought if we use ICU collations,

well, glibc change is not a problem anymore, So we can upgrade

OS, right?

But it looks like an ICU library upgrade can be a problem.

So we need to check this, right?

We should be very careful considering what happened in the past

with glibc changes.

I remember it started from not knowing about it at all, then

thinking only one change was a problem, and then adjusting vision

like, you know what, every glibc version upgrade should be very

carefully tested because things change quite often actually there.

So I would bring this experience to here and then double check

what's happening.

Michael: It's probably a case for re-indexing at those times,

like having maybe a prioritized list of indexes that you want

to make sure, like maybe unique indexes at least.

Nikolay: It's only about unique indexes.

If we talk about corruption, we, again, order what matters.

And we don't want some values to be returned in different order

or search is not working as expected and so on.

Michael: I was thinking about the limitation of only being able

to concurrently create or like re-index one at a time.

You probably want to prioritize lists, so which ones would you

do first?

Yeah, maybe your unique ones first.

Nikolay: Yeah, during OS upgrades we usually look at them and

using amcheck to identify indexes which are going to be corrupted

and then we plan to just rebuild them or we rebuild them on if

it's this upgrade involves logical replication.

We rebuild them before we switch.

This logical replication gives us flexibility here.

But amcheck, back in the days, could only check B-tree indexes.

There was a multi-year work to bring it to GiST and GIN.

Yeah.

I don't remember.

There was some progress there recently.

So Postgres 18 probably will bring something, but I don't remember

from top of my head.

So that's why you probably want to rebuild all GIN and GiST indexes

just for safety, because corruption might happen there as well.

Michael: So nice.

Nikolay: Yeah.

Okay.

Yeah.

I hope this was helpful.

I learned something from you.

Thank you so much.

And next, next time I need this,
I probably will look into collations

based on ICU and deterministic
equals false.

Michael: Yeah, nice.

Nikolay: That's a good idea, yeah.

And probably we'll think about
accents next time.

Yeah, it's a good thing to keep
in mind.

Michael: For sure.

All right.

Nice one, Nikolay.

Take care.

Nikolay: Thank you.

See you next time.

Some kind things our listeners have said