
Extended Statistics
Nikolay: Hello, hello, this is
Postgres.FM.
My name is Nikolay, founder of
Postgres.AI and as usual my co-host
is Michael, pgMustard.
Hi Michael.
Michael: Hello Nikolay.
Nikolay: How are you doing?
How is your service doing?
I saw a newsletter.
Michael: Oh yes, all good.
Ticking along.
Nikolay: Improvements.
Michael: How about you?
Yeah, also like bug fixes, you
know.
Michael: Yeah.
Nikolay: And improvements as well.
We are preparing some big update
probably next month.
So I wanted to, actually you wanted
to discuss this.
It's your topic.
Michael: Yeah, very much my topic,
it sounds like.
Nikolay: Yeah, and I think it's
a useful topic, but I also want
to be honest and, as usual, I want
to be honest and discuss how
often can we use that feature of
Postgres and what's wrong.
Not wrong, maybe, but let's just
go like extended statistics
is the topic.
So it's what you can achieve running
SQL command, CREATE STATISTICS,
giving it a name.
It's very similar to index creation,
right?
Similar, right?
You create something
Michael: With a name, it's...
Nikolay: Named, for one table only.
You cannot do it for multiple tables.
You choose columns, and there are
several parameters you can
specify.
And it's very similar to index
creation, but instead of building
the whole structure additionally
to speed up search of values,
of entries, we just tell Postgres
that it needs to maintain more
statistics than usual, because
by default it maintains statistics
for every column individually.
Right?
And also, not only every column
individually — it also maintains
expression statistics (not multi-column,
but expression statistics)
for expression indexes,
like functional indexes.
For example, a very simple common
case is when we have a column
named email and we use text, not
citext.
Nobody uses citext.
This extension is not used at all.
I don't know why it exists.
It has existed for ages.
citext is an extension.
It's actually a contrib module, so it's always available in any
Postgres setup.
And it's case-insensitive text.
Because by default, if you create a unique index on the column
named email, it won't guarantee uniqueness in a case-insensitive
manner.
So uppercase email, lowercase email, they will be considered
different values.
This is not what we want, because in reality emails are
case-insensitive.
Case doesn't matter.
So you need to use the citext contrib module, but nobody does it.
I never see it in reality.
Well, it's so exotic.
I used it last time 10 years ago.
So we usually end up having an index on lower(email).
And what I wanted to say is that by default, implicitly, Postgres
will maintain statistics for this expression, lower(email).
Right?
Yeah.
So expression statistics.
It's automatic.
But CREATE STATISTICS allows you to maintain additional statistics,
like additional expressions or multi-column statistics, without
creation of an index.
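As a rough sketch of both forms — table, column and statistics names here are hypothetical, not from the episode:

    -- Multi-column statistics on two correlated columns:
    CREATE STATISTICS orders_org_team_stats (ndistinct, dependencies, mcv)
        ON org_id, team_id FROM orders;

    -- Expression statistics without creating an index (Postgres 14+):
    CREATE STATISTICS users_lower_email_stats ON (lower(email)) FROM users;

    -- Either way, the statistics are only populated by the next ANALYZE:
    ANALYZE orders;
    ANALYZE users;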
Michael: Yes, although until reading a lot about it earlier today,
I'd only ever really thought of the
multi-column use cases.
The individual-column case is interesting if you
want to be able to use the statistics on an expression
but don't want the overhead of maintaining an index on it.
Nikolay: Which
Michael: is interesting but I've not seen that case before.
Nikolay: So for example we have some expression on a Column,
a single Column, we want statistics but we don't have it, we
don't want an Index.
Can you imagine such case?
Michael: I haven't worked out one yet, but it made it into
the documentation.
So I imagine somebody came with that case.
Nikolay: Yeah, I want an example.
Michael: But I do think it's worth us going back to basics
a little bit, like why do we want or need these statistics?
I think it's really important.
And I was going to just say...
Nikolay: Actually, this is what you should see working with individual
query plans in pgMustard — you obviously have cases, and as we
discussed before recording, you mentioned that you have some
logic in pgMustard developed to diagnose cases when something
is off and probably you need to
consider statistics creation.
Michael: Well yeah, this is the
tricky thing.
It's not super intelligent though.
We look out for bad row estimates,
and they can be a real
problem for query plans because
the beauty of the Postgres optimizer
is that it's choosing between various
plans and it's doing that
work for you.
You don't have to specify how it
should get the data, you just
ask your query and it works out,
hopefully the fastest way to
do that, but it's making a bunch
of decisions about join order,
and maybe that's a bad example
for this topic, but a scan type,
different join algorithms, all
of these decisions are based largely
on the number of rows being returned
at each stage and therefore
the selectivity of various things
and calculating those selectivities
is tricky, like it's a hard problem
and it's the reason why Postgres
maintains statistics in the first
place.
It does that on a per column basis.
The reason it doesn't do it on
combinations of columns
by default is because the
number of those explodes so quickly.
As soon as you've got a few
columns in one table, the number
of combinations — is it
n choose 2? — just
explodes.
Nikolay: right
Michael: and So it would be super
expensive to do it for everything.
But in certain extreme cases, which
are actually quite common
in real world applications, some
columns are heavily correlated
to other columns and popular ones
and important ones often get
queried together.
So an example I see really often
in, like, software-as-a-service applications
is things like organization ID
and team ID or something like
that. You know, a team ID is always
going to be associated with
an organization ID, always, always,
always.
Nikolay: or country and language
another example
Michael: Country and language is
a common example, as is
car and manufacturer, and things
like that. But
the one reason I mentioned org ID
and team ID is because often
the data is then partitioned based
on that, or the primary key
is some combination of those things.
So they're really often, always,
being queried together.
And Postgres's default, which is
a really sensible default, is
to assume that different conditions
are independent of one another.
And in a where condition, that
means it's really going to underestimate
how many rows are going to come
back, which can lead to bad things
like nested loops being
Nikolay: used.
It just multiplies estimates.
Michael: But it also has the opposite
problem in GROUP BY: if
you group by multiple things and
they're actually very related,
you're going to get a massive
overestimate instead of an underestimate,
so it can have
issues in
Nikolay: both directions.
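To make that concrete, a hedged sketch with a hypothetical tasks(org_id, team_id) table; the object names are illustrative, not from the episode:

    -- "dependencies" helps WHERE clauses with equality filters on both columns
    -- (the planner stops multiplying the two selectivities as if independent);
    -- "ndistinct" helps GROUP BY estimates over the same combination.
    CREATE STATISTICS tasks_org_team (dependencies, ndistinct)
        ON org_id, team_id FROM tasks;
    ANALYZE tasks;

    -- Queries whose row estimates may improve:
    SELECT count(*) FROM tasks WHERE org_id = 42 AND team_id = 7;
    SELECT org_id, team_id, count(*) FROM tasks GROUP BY org_id, team_id;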
Michael: Yeah, so as the maker of that
software, you probably understand very well which of your
columns are going to be very...
I think the technical language is functionally dependent on others.
Nikolay: How?
Michael: Well, like orgid and teamid, you know that.
Nikolay: I don't like this answer, You know that.
I want the answer which could be written for humans and maybe
not only humans.
Some answer, like some step-by-step how to recipe algorithm or
something, right?
Look here, check this, and then decide.
This is what I think is missing here, because as with indexes,
CREATE STATISTICS is something that is not working by default.
It requires an effort from the engineer.
And I think, as I told you before recording, I never used it
for real in production to solve any problems — like, 0 cases I
have.
It was added in Postgres 10, 7 years ago.
During these 7 years, several times I considered it.
And you know, like when something, like we see, okay, the row
estimate is completely off, like thousand times, million times,
completely off.
What to do?
What to do?
Of course, analyze, analyze, recollect the stats, see if it helps.
If it helps, no, it doesn't help.
What to do?
Then, the old-school approach: okay, probably 100 buckets is
not enough.
default_statistics_target is a global setting, but you can control
it on an individual column level, yeah, and you say, okay, I
want 1,000 buckets here.
And actually, I remember when Tomáš Vondra was a guest on Postgres
TV we discussed this, and Tomáš's opinion was that a hundred buckets
is quite a lot, so it's not a good idea to always fix things by
increasing this number. But I see it sometimes.
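The per-column version of that knob, as a minimal sketch (table and column names are hypothetical):

    -- Raise the statistics target for one column only, instead of globally:
    ALTER TABLE orders ALTER COLUMN status SET STATISTICS 1000;
    -- Takes effect at the next ANALYZE of the table:
    ANALYZE orders;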
Michael: yeah I think that might also sometimes help because
it not only increases the number of buckets but also increases
the length of the MCV the most common values
Nikolay: But if we don't use standard statistics...
Michael: With regular, yeah, with the regular statistics target on
a column, you also get a bump in the number of MCVs tracked.
Nikolay: Ah, okay, this I understand, yeah, yeah, yeah.
Michael: Which is, I think, might also be sometimes helpful for...
It might sometimes solve the problem.
So I can understand why throwing more sampling at the problem
could be helpful there.
Nikolay: And I'm okay to spend more disk space and memory for
this.
The problem, usually, if you raise
default_statistics_target to
1000, for example, is that ANALYZE
takes longer.
Which is acceptable in many cases.
I wanted to mention very big news
which occurred last week, like
you know it, you name it.
Michael: Well, the commit to Postgres
18 — which unfortunately won't
include what we're talking about
today, extended statistics —
but it's still huge and really
positive news.
Although I've seen that there are
some performance issues.
Nikolay: So in Postgres 18 it is
expected that during the upgrade
Michael: yes
Nikolay: We all know that ANALYZE
is on the user's shoulders, basically,
on DBA or DBRE shoulders.
pg_upgrade doesn't run ANALYZE.
If it's an in-place upgrade, it's a
very fast upgrade, taking only
a couple of minutes.
It's pg_upgrade --link, or -k.
So hard links are used, it's very
fast, and it also dumps and restores
the schema, because the new version
has different features, so the
schema needs to be dumped explicitly
and restored.
This is automated.
But then it says, okay, now it's
time to run ANALYZE, and there
is analyze-in-stages, which I
don't like.
We talked about it a few times.
But basically, it's outside
of the scope of pg_upgrade
to run ANALYZE.
I also criticized managed Postgres
providers like RDS, Cloud
SQL, Azure, because they don't automate
it either and leave it on the
shoulders of DBREs.
And people keep forgetting it,
and after an upgrade, they don't
have statistics at all.
So, the big news is: in Postgres 18 —
it was just committed and hopefully
won't be reverted —
statistics will be dumped and restored.
Michael: Yes, and that's even better
than, in my opinion, that's
even better than analyze being
automated.
Nikolay: Of course, it's better.
This is the answer I got when
I raised this on Twitter, on
X, many months ago, saying it's
not all right that managed service
providers don't do it.
Somebody said, like, let's wait — and
it was the right idea. So if
now statistics are dumped and restored,
we have the old statistics,
and it's good.
Michael: Yeah, and my understanding
is this is a pg_dump feature,
therefore it should work...
Let's say it does end up in 18,
it should work for the next upgrade
you do, no matter what version
you're coming from.
Nikolay: Yeah, I also thought about
it.
It's interesting.
I'm not 100% sure, but of course,
definitely you can run, you
can use new pg_dump against some
old Postgres server, and hopefully
this feature will work.
It means that we will be able to
upgrade old servers to new without
need to run ANALYZE explicitly.
Yeah, and
Michael: it means less downtime.
I consider the time you're running
ANALYZE — you could still effectively
be down, even
if you try to restore connections,
just because queries could be much,
much slower than they should
be.
So, yeah, I think for me this could
drastically reduce effective
downtime for a lot of projects
during upgrades.
Nikolay: Yeah, I agree, because
in my opinion, analyze-in-stages
is a wrong feature, completely
wrong.
And I prefer keeping ANALYZE
inside the maintenance window.
And of course, if you have
default_statistics_target elevated
to some high number, thousands,
in this case it takes longer,
especially if you have — like, some
people have a lot of database objects,
a lot.
I just recently had the pleasure
to deal a little bit with a
case with 200,000 tables and more
than 1 million indexes and
if you decided to raise
default_statistics_target there, ANALYZE
will take ages, even if it's
parallelized. So keeping
statistics is good.
Michael: Let's get back to the topic
though: you might still want
to run ANALYZE after a major version
upgrade.
There are a couple of caveats,
even in the 18 commit, as to things
that won't get pulled across.
And one of them is extended statistics.
So if you are using extended statistics,
and they're
important for your query performance,
you will still need to
run ANALYZE to actually get back
up and running.
Nikolay: I always like running
ANALYZE.
For example, you know, if you have
a partitioned table, autovacuum
doesn't maintain statistics on
the parent table.
But if you run ANALYZE explicitly,
you have your statistics,
so I just like running ANALYZE.
Michael: I think there's even an
improvement, maybe it was in
17 that you can run ANALYZE only
on the parents so you don't even
have to analyze all of the partitions,
which is quite cool.
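If that feature is available on your version, it looks roughly like this — the partitioned table name is hypothetical, and the ONLY variant exists only in newer releases:

    ANALYZE measurements;        -- parent plus all of its partitions
    ANALYZE ONLY measurements;   -- parent-level statistics only (newer Postgres versions)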
Nikolay: Yeah, so statistics
on partitioned tables — probably
it's another topic in
itself. But back to extended statistics.
So my point is since this is a
feature which requires you to
run some things explicitly and
make decisions, there should be
recipes.
For indexes, we have a recipe.
We have a sequential scan, we see
some filter or ORDER BY — okay,
we know which index should help.
We verify it, ideally on full-size
clones.
We see it helps, we deploy it to
production.
People who don't have fast and
cheap clones, they check it right
on production and see that it works.
Okay.
So that's it.
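For comparison, that index recipe in SQL form — purely illustrative, with hypothetical names:

    -- A Seq Scan with a Filter and a Sort in this plan...
    EXPLAIN (ANALYZE, BUFFERS)
    SELECT * FROM orders
    WHERE customer_id = 42
    ORDER BY created_at DESC LIMIT 20;

    -- ...usually points to an index covering the filter and the sort:
    CREATE INDEX CONCURRENTLY orders_customer_created_idx
        ON orders (customer_id, created_at DESC);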
As for extended statistics,
which was added,
as we discussed, in Postgres 10,
7 years ago —
during these 7 years, several times
I said, okay, maybe we need
to create statistics here.
Let's try.
I remember 0 cases when this was
the solution.
We always ended up having a different
solution, like creation of an
index, or maybe redesigning the schema
sometimes — to the question of
the previous discussion with Frank,
right?
I mean, redesigning the schema, denormalization
and so on.
So I don't remember a conclusion like:
okay, in this case, CREATE STATISTICS
is our solution to the
problem, and that saves me.
Honestly, it builds some weird
feeling that I'm missing something.
Everyone is using something
I'm not using.
Michael: Well, interestingly, one
of the main reasons
I wanted to bring this up is I
think extended statistics might
be underused globally. I think a
lot of people don't know that
it exists.
Nikolay: Because there are no recipes —
this is my point.
We need recipes. Very concrete
recipes for how to decide it's
worth using it right here.
Michael: Yeah, interesting.
But I do have something as a chance.
Nikolay: Give me a schema-only
dump.
I will give this dump to some LLM, a good
one, and ask it, just
based on column names, to find
some ideas about what can have correlation.
Then we build statistics and hope
it will help someday.
But it's a weak solution.
Michael: I also think it's flawed,
because I think if it's not
actually causing query performance
issues, then you're paying
that penalty for Analyze being
slower.
Well, again...
For no benefit.
And actually, that's a problem.
Nikolay: I didn't finish my thought.
I'm okay to pay some penalty to
Analyze more, keep more statistics.
It's not a problem.
Unless we have hundreds of thousands
or millions of objects.
In this case, I will be very careful.
But if it's only, like, a few thousand
objects, I mean indexes
and tables, it takes 1 minute
or 3 minutes to run ANALYZE
on this database — it can be many,
many, many terabytes — 1 or 3 minutes,
I don't care. And in terms of storage
it's nothing, and memory,
it's not a lot. Maybe I'm wrong,
but I think so.
Michael: I think you're right, and
I think that's important for
people to realize: it's not
dependent on your data size, because
we're sampling at this point.
Because it's a sampling thing,
it doesn't scale linearly with
how much data you have.
Nikolay: Also, an important disclaimer:
my mindset is heavily targeted at
startups, because I'm building
my fourth startup, and our
clients — consulting clients and
clients for products — are mostly
startups. That means rapid development,
really rapid,
sometimes a couple of deployments
per day or more.
It means that maybe the statistics
we decided to keep, maybe
it's not needed today, but maybe
it will be needed tomorrow, because
we build some new features and so
on, right? And maybe it's
better to have it, because we know
there is correlation between
these columns, so let's have extended
statistics over all these
columns. But we lack recipes — this
is the point. So the lack of recipes
leads to underuse of this feature.
Michael: I think they might also...
I think historically you've also
mentioned having a bias towards
OLTP systems in production, and
I think there's also a chance that,
because often you're hyper-optimizing
queries — and that's a good thing —
going for index-only scans and,
you know, very, very limited data
retrieval, a very small number of
rows, very precise lookups,
you're probably not hitting
these bad plans in a lot of those
cases; you're skipping
straight to a really good
Nikolay: that's interesting for
Michael: ...one that's really hard for
the planner to get wrong, because
you're giving it the ideal index
to use for the important access
patterns and it's possible that
this comes up a little bit more
when people are having to serve
more user-defined queries, maybe
analytical ones that have any combination
of parameters.
Nikolay: Right, like OLAP queries
and so
Michael: on.
Yeah, exactly, and I think in those
cases, maybe sometimes just
not knowing what's going to come,
you can't create indexes for
every case — if you do, then this
slows
Nikolay: down writes, unlike statistics,
so this is a super big
difference.
Michael: Yes, so I think there's
a chance that the bias towards OLTP,
and also being able to often add
the indexes you need for a smaller
number of access patterns, or,
like, a lot of queries per second,
might mean that this just isn't as
important in those cases.
But in terms of recipes, I think
the big one is bad row estimates
on important and slow queries.
So if a query is important and
a lot slower than you need it
to be, and there's a really bad
row estimate in an important
part of the plan.
So like, not in a part of the plan
that is fast, but in a part
of the plan that is slow.
The really famous case is a nested
loop where the
planner estimates it's going to return
very, very few rows per loop,
so maybe like 0 or 1.
So therefore it's thinking, well,
I only need to do this a relatively
small number of times, therefore
it's quickest and easiest to
get started with a nested loop
and I'll be done really quickly.
But in reality it returns maybe
a thousand rows or more per
iteration.
Actually, in those cases a hash
join or merge join would have
been so much faster.
So those cases I think can benefit.
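A hedged illustration of how that shows up in a plan — the query and schema are made up for the example:

    EXPLAIN (ANALYZE, BUFFERS)
    SELECT t.*
    FROM tasks t
    JOIN users u ON u.id = t.assignee_id
    WHERE t.org_id = 42 AND t.team_id = 7;

    -- In the output, look for a node whose estimate (e.g. "rows=1") is orders of
    -- magnitude below reality ("actual ... rows=1000 loops=..."); a Nested Loop
    -- driven by such a node is the classic symptom described here.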
But actually, there's a blog
post I was thinking of, where someone
blogged about a time where extended
statistics helped them a
lot. I put the plan through our
tool and noticed that, yeah, a bad
row estimate was as highly scored
as it could be, but so was
an index efficiency issue.
And I think, looking back, they
could have added a better index
and solved their performance issue
just as well.
So they solved it with CREATE
STATISTICS, but they could have
solved it with a better index, and
probably the query plan they
got in the end could still be improved
further by a better index.
Nikolay: So I have an idea for
myself, and whoever is listening is
welcome to join.
Next month I will have a challenge
for myself.
So every time I deal with some
query optimization, I will put
consideration of extended statistics
in first place.
I will think, can it be helpful
here?
And so on.
Just prioritize it and think about
it more.
And after 1 month of this — 30 days,
or a month; February is a shorter
month, right?
So 30 days.
After it, I will make my conclusion:
should I keep this as one
of the primary tools for query optimization,
or is it just a secondary
tool?
How does it sound?
Michael: I would love to hear that,
and from anybody else that
does try it.
One thing I'd encourage doing is
also benchmarking the performance
hit on ANALYZE of that table.
Like, if you do add it to a table,
how long does ANALYZE take
before and how long does ANALYZE
take after? And I think it would
depend on which — there's like a
few types of extended statistics
that you can add, and you can add
all of them for
Nikolay: expressions or you mean
Michael: There's
like 3 parameters
you can give it.
So: ndistinct, dependencies, and
mcv.
And I think mcv will be by far
the most expensive in terms of
ANALYZE performance.
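A minimal sketch of specifying all three kinds and then inspecting what ANALYZE collected — the object names are hypothetical; in psql, \timing on makes it easy to compare ANALYZE duration before and after:

    CREATE STATISTICS tasks_org_team_full (ndistinct, dependencies, mcv)
        ON org_id, team_id FROM tasks;
    ANALYZE tasks;

    -- Inspect the collected extended statistics:
    SELECT statistics_name, n_distinct, dependencies, most_common_vals
    FROM pg_stats_ext
    WHERE statistics_name = 'tasks_org_team_full';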
Oh, by the way, another thing that
the docs mentioned can be
penalized is query planning time
and maybe query planning buffers
for that matter.
Nikolay: If you have a lot of extended
statistics created by
us, planning can take longer?
Michael: Yeah, well it makes sense,
right?
It's considering more information
up front.
Nikolay: Right, makes sense, yeah.
Michael: I don't know how much, but
I'd be interested in benchmarks
of how much longer, how much more.
Nikolay: Yeah, but in this case,
2 things.
First is, well, both benchmarks
sound to me like single-session
benchmarks.
So it's not like we run a lot of —
like, we can take 1 session
and check just the speed, maybe multiple
times for good statistics,
right?
So the first thing sounds to me like
checking ANALYZE speed.
And I think we need to remember
that if we talk about a single
table, it's OK.
I mean, no problem.
But if we talk about many tables,
it's worth remembering that
it can be parallelized.
And the recommendation is to use
vacuumdb with the option
--analyze and specifying the number
of jobs — the number of processes,
backends basically, right?
Which will execute this.
And if we are alone on the server,
we can take, for example,
the number of vCPUs we have on
the virtual machine, for example.
This is 1 thing.
But if it's a single table, it
cannot be parallelized.
It will be a single thread at any
rate.
Michael: No.
And there's actually 1 more thing
that I think is worth trying just
before we wrap up: if it is a
huge penalty but looks valuable
to query performance — which I'm
not sure you're going to come
across, but we'll see —
I've wondered about reducing the statistics
target for those columns,
so you could reduce it and then
add the multi-column statistics.
It just is an interesting idea
for getting back some of that ANALYZE
time.
Nikolay: I know a very good SQL
optimization expert who recommended
reducing default_statistics_target,
and I'm going to communicate
with him very soon.
Yeah, great.
But this battle in my mind was
clear like 10 years ago already.
Like, reducing default_statistics_target
or the per-column parameter
was considered not a good idea in
my mind, compared to increasing
it somehow.
I don't know, like this is from
various cases, But I'm going
to use this as well as a part of
my challenge just to think about
this as well, because I will talk
to him soon.
Michael: So.
I've got an idea for you.
Nikolay: Let me comment on the
second one.
I already started forgetting what
you said about...
What was the second one?
The first was, let's benchmark — you said
benchmark.
I don't...
I'm very careful.
ANALYZE?
Michael: Yeah.
And also query planning time or
query planning buffers.
Nikolay: Query Planning time, this
is good.
Yeah, buffers as well, that's great.
So increasing default_statistics_target
globally definitely should
affect this as well, right?
Globally,
Michael: yeah.
Nikolay: Here, if somebody is going
to do it:
don't make the mistake my team and
I did a few months ago when
we published an article.
So when you just connect to Postgres
and you do something, planning
time can be much bigger because
of the lack of relcache, and it gets
loaded, right?
The second time you run this, it's
very different.
So planning time — first planning
time and second planning time —
they are very different.
If you have a connection pooler,
most people do,
it means it's
not a good idea to make conclusions
from observations of the first
round. Yeah, yeah, so...
Michael: right point.
Nikolay: Yeah, so even if
extended statistics affects
planning time —
I think it should be so — the question is,
is it only for the first round
or the second round as well?
Who knows, right?
Because if it's only the first round, it
means, like, relcache...
Well, it's not relcache here,
right?
Well, it's an interesting point,
just to check.
Michael: Yeah, based on the notes
in the documentation, I got
the impression it would be for
every
Nikolay: planned query.
Because it's statistics, it's not
relcache.
So anyway, I would check multiple
times.
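A simple way to do that check in one session (illustrative query; plain EXPLAIN does not execute it):

    -- SUMMARY prints "Planning Time" even without ANALYZE; repeat it several
    -- times in the same session to separate the first-run caching effect from
    -- the steady-state planning time.
    EXPLAIN (SUMMARY)
    SELECT count(*) FROM tasks WHERE org_id = 42 AND team_id = 7;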
Michael: One more idea for you, especially
talking to that person
you mentioned: I wonder if you could
come up with a recipe for
which columns to reduce it on. I
reckon there's a whole bunch
of really boring columns, like unique
IDs, you know, primary-key-type
columns. Do we really
need high statistics targets
on those, or could we get away with
much, much lower?
Do you see what I mean — like, ones that we know are unique, and
we're keeping statistics on the most common values?
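That idea, as a hedged sketch on a hypothetical primary-key column:

    -- A unique column gains little from a large histogram and MCV list,
    -- so shrink its per-column target to reduce ANALYZE work:
    ALTER TABLE tasks ALTER COLUMN id SET STATISTICS 10;
    ANALYZE tasks;
    -- (-1 reverts the column to default_statistics_target.)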
Nikolay: Well, I agree with you, but what's the point of optimizing
in this area?
What are our savings?
Michael: Reducing the time ANALYZE takes.
Nikolay: As I said, I don't really bother
about a longer time.
Okay, it's 3 minutes against 1, so what?
Michael: That is quite a big deal, isn't it, in terms of downtime?
Nikolay: I don't know.
We have a recipe for 0 downtime.
We can afford a couple of additional minutes being spent, and
nobody notices it, because it's a target cluster running on logical
replication.
We can do a very long ANALYZE there.
So even 1 hour, nobody notices.
It doesn't affect downtime at all.
So I don't know.
I don't know.
Like I can imagine this will save something.
We can find many places like that, right?
For example, in some cases we can start tuning autovacuum at
the table level.
And so many times I saw teams, various teams, go down this path.
And then we pulled them back, saying, let's stick to defaults.
Because table-level settings often cause more harm than good,
because they add complexity, a lot of complexity. It's hard to
maintain.
Then you go microservices, you have a lot of clusters, and this
table-level stuff — well, it's so hard to maintain.
And you talk about column-level settings.
Michael: All I meant is, if you can find a really simple rule,
yeah exactly, if it's a unique column, reduce the statistics
target to 1.
Nikolay: This is good if it goes to hackers and the core itself — an implementation
which doesn't require effort from the user.
This I agree with.
But if it's...
Again, my main concern, my main comment about extended statistics,
is that it requires effort, it's not clear, there are no clear recipes
for how to apply it, so it ends up not being actively used, unfortunately.
Maybe in the next 30 days we will have some recipes, right?
Michael: Sounds... I'm looking forward to it.
Nikolay: Sounds good.
Okay.
Thank you so much.
Michael: Thanks so much Nikolay.