
caSe-inSENsiTive
Nikolay: Hello, hello, this is
Postgres.FM.
As usual, I'm Nik, Postgres AI,
and as usual, my co-host is
Michael, pgMustard.
Hi, Michael.
Michael: Hello, Nikolay.
Nikolay: How are you doing today?
Michael: I'm good, thanks.
How are you?
Nikolay: Great.
Very good, very good.
Thank you for asking.
So today we have a very big topic.
I'm joking.
But Yeah, I hope for this podcast
episode should be the shortest
because we are going to discuss
very narrow topic.
Although this topic hits almost
every project, as I noticed.
So every system has usually some
table like users.
It has some column like email and
in email we usually expect
the search to not to distinguish
registry of characters look
case it should be case-insensitive
search right So if it's all
uppercase email or lowercase or
mixed, all those values must
be considered the same.
However, All people made this mistake,
including myself, many,
many times.
You usually start using varchar
or something or text for email
column.
And then you see that multiple
accounts, multiple users' records
have basically the same record,
just written in different cases.
Right?
So here, we usually, what do we
do?
We just introduce additional unique
index, over expression, lower
email.
So we convert email value to lowercase
or uppercase, doesn't
matter, and build an index, B-tree index with unique option,
create unique index concurrently
on lower or upper function from
that value and this gives us uniqueness,
case-sensitive uniqueness,
right?
But that's not it, we then need
to fix our search queries.
Michael: And make sure that every
future use of email uses the
function as well in order to be
able to use the index or have
2 indexes if you want to be able
to, you know, there's a few
kind of like gotchas around this
approach.
Nikolay: Honestly, like roughly
half of the cases, you know,
I, we do a lot of health check
of our startups, like usually
like they are grown to some like
terabyte or so.
And we see some tables and not
every time we look into the, the,
the, the table structure, but oftentimes
we do.
And I just see some indexes, like
index on raw email value and
on lower email, I see it a lot.
And some queries use 1 index.
So we hunt for unused indexes,
which is part of our health check,
and we discussed it on our podcast
in terms of how to maintain
good health of indexes.
Extra indexes is not what you want
to have.
And in this case, if we have index
on raw value and on lower
of email, often we also see both
of them are used, which means
there is a mess.
So all of this was introduction
of our today's topic, which is
called citext.
Michael: Yeah, so this was, I know
it's a very narrow topic,
but this was my choice because
I was looking through the contrib
modules at what was in there that
was both popular and that we
hadn't discussed yet.
And we've got to enough episodes
now that that was not a long
list of ones that, I mean, there
was, there were definitely some
other extensions we have, some
other contrib modules that we
haven't discussed yet, but the
ones on that list are not ones
I hear discussed often, not ones
I see in use.
Whereas citext, I feel like at
least was quite popular.
It did get quite a lot of use,
or at least a lot of people were
interested in the use case.
So yeah, I was interested kind
of in your, in your perspective
on this.
And I also, after suggesting it
as a topic, started looking into
it more and found the topic more
interesting than I was expecting
to.
So yeah I'm looking forward to
getting your thoughts on this
and what people should be doing.
But just in terms of those drawbacks
of using the function approach,
there's a couple more mentioned
by the documentation I thought
were good.
1 is less important but there is
the just the verbosity of every
query and just having to have functions
in them.
But a more important 1, potentially,
depending on other opinions
you might have, is that you also
then can't use it as a primary
key if it's just text because,
or at least you'd have to have
another unique index on it to guarantee
uniqueness, to make sure
somebody couldn't sign up with
the same email address, but with
some case different because it's
the same email address.
But yeah, email, email's an important
1.
I don't know if you'd actually
want email as a primary key anywhere,
but also username.
Like a lot of services allow you
to set a username that is
case-insensitive, like social media
handles and things.
Those will often be, if not
case-insensitive, definitely unique
depending on case in a lot of cases.
So like Twitter handles, for example.
Nikolay: Yeah, and the question
is why it's an extension not
in core, right?
Michael: Well, there's a couple
of questions, right?
Like, could this be a data type?
Yeah, could it be a core data type?
Nikolay: Well, it is data type,
but
Michael: it's
Nikolay: living outside of the
core.
Michael: But I actually think so.
Well, I don't know how quickly
you want to jump ahead, but the
reason I found this super interesting
is In the docs for citext,
the contract module, it actually
suggests not using either citext
or the lower Function approach,
but instead creating a case-insensitive
collation and using that for it.
And that was so elegant.
But it is something you have to
do at the User level.
And we could, yeah, we could ship,
like, I don't see why Postgres
couldn't come with 1 of
those prebuilt that you could
just use.
And maybe even, maybe it wouldn't
be a case-insensitive field.
I could imagine a case.
Sorry, I'll keep this.
So I need to stop using the word
case to mean like argument.
I can imagine an argument for having
email as a datatype that
could be...
Nikolay: Oh, that's interesting.
I remember for Postgres, some huge
snippet to check that it's
really email.
It's like this RFC, right?
And it's not trivial to ensure
some valid email address or not.
It's not, it's not really like
so trivial as 1 might think.
Michael: Well, so I've seen 2 cases
for this 1 is the really
complicated 1, which actually I
think suits the argument that
it should be pre-built because
you don't want everybody rolling
their own version of that and The
other version is no just ship
it right back.
All you need to do is It doesn't
contain an at It maybe is it
all lowercase or can like cast
it to lowercase?
And then that's it.
Maybe a period after the at, or
a full stop after the at.
But I think anything more complicated
than that runs into potentially
not allowing things that should
be allowed, right?
Nikolay: So I will tell you what
I think.
First of all, I agree that collation
is an elegant solution.
And actually, collation, it's the
job of collations, even in
SQL standard, is to define rules
how we compare values, right?
And case sensitive or case-insensitive,
this is comparison.
So it's exactly where it should
be.
The only thing is, I remember there
should be some, like you
need to distinguish comparison
at the byte level, like logical
level.
It's like deterministic or something,
right?
Yes.
Yeah.
So honestly, I never use it myself.
Never.
I just read about this and I have no idea.
My own experience is, I was like, when I first discovered citext,
probably it happened 19 years ago or so.
Like on my second year of Postgres use, or Maybe in the first
year even.
Michael: I have the history of it here, by the way.
Nikolay: Yeah, it's a very old contrib module.
Michael: First released to the public 17th of February 2003,
but it wasn't contrib module at that point.
That only happened a few years later.
And it was a kind of like a redesign at that point.
But yeah, so you were pretty early.
Nikolay: Well, yeah, I started using Postgres in 2005 or 6.
And of course, it was social media.
So we definitely had the table users.
At the time, I tried to avoid plural.
So it was maybe something, something else like person or something.
Yeah, Then I was convinced by the influence of Ruby on Rails,
and Django and others, like, okay, it should be plural.
And then I quickly became a fan of citext.
And then I quickly became not a fan because like maybe 1 or 2
years later I stopped being a fan because like it's extra effort
to install extension and then you need to describe everyone what
like oh this special data type and I found it easier to keep
using text or varchar and these days I think just text but well
it's not a topic text or varchar and with limit or no for length.
Michael: But did you use collations or no?
Nikolay: No, no, no, no, no, no.
Michael: I just interesting.
Nikolay: It's just functions.
Yeah, function index with unique option if we need uniqueness,
that's it.
And then we need to adjust all the Queries.
And I was in my own projects, I was hunting to if we if we made
decision that it should be case-insensitive search, then probably
it should be global.
So we should get rid of index on raw value.
So that's it.
And fix it and all Queries and avoid sequential scans.
And that's it.
So this is like, you either want a sensitive search, so index
should be on the raw value.
Or you want case and sensitive search.
In this case, you must have index on lower or upper.
You need to choose in advance and make it a rule, like part of
your code style, probably.
And then use it usually on both sides of comparison, so input
value also get lowercase or uppercase,
and the value of Column,
and we know that we rely on Index
on expression here.
Michael: Yeah, well this is an
interesting thing that the original
author, so it was originally, the
citext module was originally
authored by Donald Fraser and in
the initial email about this,
1 of the use cases was multi-Column
functional indexes.
So if you wanted to have, so if
you want an Index on a single,
like just on email, fine, you can
do a Function on that.
But at least at the time, you couldn't
have it on 2, let's say
you wanted it on email, comma,
username, which may have been
more common back then before they
had like the includes option.
If you wanted it to be on the lower
of both of those, you'd have
to have like a kind of a parent
Function that did both of those.
So it got a bit messy in terms
of multi-Column indexes.
Whereas if you have them as their
own data type, you can just
do that at will.
Nikolay: Multi-Column expression
on 2 expressions, so lower email,
lower something else.
I think right now it's definitely
possible.
Michael: Okay, great.
I didn't realize that.
But back then it wasn't, so that
was like...
Nikolay: Yeah, that's interesting.
I don't remember this nuance.
Yeah, okay.
But yeah, so it's difficult to
name, to use the term multi-Column
here because it's already not Column
but expression yeah multi-layered
or something I don't know so multi-dimension
multi-dimensional
maybe Index on expressions.
It should work these days.
Not a problem, but also there is
a like there is in inconvenience.
If you go this route and use like
lower email or something you
it's easier to forget that we need
to when Index is created we
need to run Analyze on the Table
because it lacks statistics
on expression
Michael: yes yeah good point
Nikolay: If you build the regular
Index on raw, a Column value,
statistics usually already there.
But for expressional indexes, it's
not there.
But yeah, since then, I made this
decision in favor of expression
indexes and I never went back.
So why should I use citext in
2025?
Michael: I don't think you should,
but I also don't think you
should be having to do lower Functions.
Nikolay: Both paths are bad, collation
path is good.
Michael: That's what I think is
the solution now.
So it's like super interesting.
I actually came across just by
chance while looking into this
a blog post by Adam Johnson who's
worked on Django for many years.
It seems like Django actually removed
support or mostly removed
support for citext in version 5.1
of Django.
So they actually deprecated this
and he blogged about what you
should do instead and wrote up
about the how to set up a collation
and what we kind of define all
the kind of ICU I don't even know
what you call them like codes or
it's called a locale isn't it
But it's like made up of a bunch
of different, complete gibberish
in my opinion.
But you can look up the spec and
kind of choose which language
do you want it to be and do you
want it to be case sensitive
or not?
Do you want it to be sensitive
to accents or not?
So you
Nikolay: can.
Yeah, that's cool.
And it's much more flexibility.
And it's also SQL standard approach,
I think.
Michael: Yes, it is.
The last time
Nikolay: I looked into SQL standard,
it was more than 10 years
ago.
So my memory might make tricks
with me, but I remember something
about collations.
This came up,
Michael: well, in 2003 when it
was first proposed, I think I
read a post from Tom Lane in that
thread that said basically
the SQL standard way of doing this
is via collations, but I can't
see us doing that anytime soon.
So this makes sense as a hack in
the meantime.
Nikolay: Oh, that's great.
Yeah.
So, so that's great.
So it's very hard.
How are we going to name this episode?
I thought it will be citext now
it's shifted.
Michael: I was going to suggest
maybe case-insensitive
Nikolay: text.
Don't cut this.
Because it shows confusion, you
know, like it's like case-insensitive
comparison.
Yeah.
Michael: Because that's what we
want, isn't it?
Nikolay: But it goes beyond just
cases.
You mentioned excellence.
That's great.
Because some people might use my
like some names, for example,
like Thomas.
Maybe.
Michael: But as you said, I think
the 2 main use cases for this
are email addresses and usernames.
That's where I've seen it.
And email addresses are unique,
no matter the case, but I don't
know if they're unique.
Nikolay: No, they are not.
They are not.
But we might decide, for example,
if username in our system is
allowing like variants of Latin
characters, We might say, okay,
we allow them, but we don't want
collisions.
Like 1 user used pure Latin, another
user used like native version,
like from another language.
Many languages have it, right?
But we say we allow them, but first
person who use it should
be like, should be protected from
collisions.
In this case, we might decide to
use collisions and define unique
indexes.
That's a great idea, I think.
Yeah.
Michael: Yeah, good point.
In fact, I know it's not quite
related to that, but reading about
this, I came across somebody giving
an example of lower not always
being ideal, or lower and upper
being subtly different, because
in some languages, you can get
an uppercase character that can
have 2 alternative lowercase versions
of it.
I
Nikolay: didn't know about this.
Michael: Nor me.
But that, I mean, you can imagine
the kind of hassle that's going
to end up causing.
So yeah, it's...
Nikolay: Wow.
Michael: Collations are where this
stuff is defined properly,
right?
Yeah, yeah.
So it gives
Nikolay: you big freedom.
But what are downsides of using
collations?
Michael: Well, I did wonder because
do you remember, you know,
if glibc changes and we can end
up with corrupted indexes.
Nikolay: I don't, I, how can I
like,
Michael: Well, of course you remember?
Nikolay: Yes, of course I remember.
We had multiple cases with multiple
clients.
Michael: Could that be, Could that
be an issue here?
I
Nikolay: don't know.
Well, if you use ICU, it should
not be so.
I don't
Michael: see how.
Right, yeah, right.
Nikolay: Yeah.
And I think in this case, we definitely
want to specify a provider
ICU, right?
Yes.
Good point.
Yeah, and this comparison, I don't
see issue here.
And also there we have issues not
with like comparison but with
order right?
True.
Michael: So isn't that the same
thing?
Nikolay: No Who is bigger who is
who's upper who is lower basically
in the order it doesn't it's not
the same As who equals who I
know
Michael: it Except if you're looking
for like is there a clash
is this Is this unique?
You might traverse it only to the
point where you're expecting
it to be, but it's actually lowered
down.
So you say, oh no, I've already
got past that point in the index,
you can insert this and you can
end up with duplicate values
in the unique index.
Nikolay: Yeah, what I don't know
is if, so here we use deterministic
false, right?
Michael: Yes, yes.
Nikolay: And it means that at byte
level, the characters are
different, but we logically, we
think about them as equal.
But if you think about as equal
in the order by, they should
go together, right?
But do we have deterministic ordering
of them always?
This is, I don't know, like should
be, right?
Or there's some other mystic thing.
If so, that might cause this corruption.
Or yeah, If it's not unique index,
if it's unique index, it's
just 1 value.
I don't see corruption risks.
But if it's non-unique index and
like these values will come
in groups, is there like deterministic
structure?
I mean, not a rule, how they are
ordered.
So we avoid the switch of order
when glibc version changes.
But again, here we talk about ICU,
so glibc is not involved,
right?
So everything should be fine.
The only thing, just not to forget,
ICU should be enabled at
compile time.
I just, yeah, which is default,
I think right now.
Is it default?
I just remember Cursor and I were
fighting a few weeks ago because
my laptop didn't have ICU installed,
so I always reminded that
I created even Cursor a rule without
ICU.
If you run configure without this,
Then I surrendered and installed
ICUs just because somehow Cursor
kept losing this rule somehow.
I don't know why.
So yeah, so it should it should
be present.
It might be not present and not
be available on the system.
Michael: I think lots of people
use managed services these days,
I think it tends to be on those.
But yeah, really good point.
But yeah, this seems to me like
the way to do it now.
And just to complete that, I'm
going to read from Adam's blog
post.
The syntax is create collation,
and then you can call it whatever
you want.
For example, case-insensitive.
And then you set provider and he
goes provider equals ICU.
Then locale and then that's the
string that you can set all the
various settings we mentioned earlier.
And then you choose whether it's,
you have to choose in this
case, deterministic equals false.
So that's the byte-wise thing that
you're talking about.
Nikolay: Right.
Right.
Michael: So yeah, in fact, just,
is it worth going through that
gobbledygook in the locale quickly?
So yeah, he's gone with und, so
UND, I thought it was German
at first, but it actually just
means undetermined language.
So you could set it to like German
or French or something if
you really wanted to.
U specifies Unicode attributes,
and then KS, level 2.
So apparently that's collation
strength and then level 2 doesn't
include case in comparisons only
letters and accents.
So letters and accents can be considered
different, but case
can't.
So yeah, so that's what you'd want
in email address use case,
I think.
Nikolay: Yeah, so yeah, that's
interesting.
Anything else?
We have chances to have this episode,
as I promised, very short.
Or there are some additional aspects
here.
Michael: So no, I think that's
it.
Nikolay: So it looks like both
of us, if we were building a new
system, we probably would try to
use this approach with collations.
Sounds
Michael: like 100%
Nikolay: most flexible, most powerful
these days, right?
Michael: Yeah.
Nikolay: Yeah.
You know what I'm
checking right now and I see
cases about, collation
corruption.
Yeah.
Yeah.
I see, Daniel Vérité or how...
Michael: Oh yeah.
Is he psycopg?
Nikolay: Not sure.
I remember this name.
I read some blog posts in the past, but definitely collation
versioning problem with ICU 73.
So it's interesting to think about possible corruption when something
changes under the hood.
Yeah, and if we go this route, what to expect?
I lack personal experience with it, but I would definitely study
this topic closer because I thought if we use ICU collations,
well, glibc change is not a problem anymore, So we can upgrade
OS, right?
But it looks like an ICU library upgrade can be a problem.
So we need to check this, right?
We should be very careful considering what happened in the past
with glibc changes.
I remember it started from not knowing about it at all, then
thinking only one change was a problem, and then adjusting vision
like, you know what, every glibc version upgrade should be very
carefully tested because things change quite often actually there.
So I would bring this experience to here and then double check
what's happening.
Michael: It's probably a case for re-indexing at those times,
like having maybe a prioritized list of indexes that you want
to make sure, like maybe unique indexes at least.
Nikolay: It's only about unique indexes.
If we talk about corruption, we, again, order what matters.
And we don't want some values to be returned in different order
or search is not working as expected and so on.
Michael: I was thinking about the limitation of only being able
to concurrently create or like re-index one at a time.
You probably want to prioritize lists, so which ones would you
do first?
Yeah, maybe your unique ones first.
Nikolay: Yeah, during OS upgrades we usually look at them and
using amcheck to identify indexes which are going to be corrupted
and then we plan to just rebuild them or we rebuild them on if
it's this upgrade involves logical replication.
We rebuild them before we switch.
This logical replication gives us flexibility here.
But amcheck, back in the days, could only check B-tree indexes.
There was a multi-year work to bring it to GiST and GIN.
Yeah.
I don't remember.
There was some progress there recently.
So Postgres 18 probably will bring something, but I don't remember
from top of my head.
So that's why you probably want to rebuild all GIN and GiST indexes
just for safety, because corruption might happen there as well.
Michael: So nice.
Nikolay: Yeah.
Okay.
Yeah.
I hope this was helpful.
I learned something from you.
Thank you so much.
And next, next time I need this,
I probably will look into collations
based on ICU and deterministic
equals false.
Michael: Yeah, nice.
Nikolay: That's a good idea, yeah.
And probably we'll think about
accents next time.
Yeah, it's a good thing to keep
in mind.
Michael: For sure.
All right.
Nice one, Nikolay.
Take care.
Nikolay: Thank you.
See you next time.