Denormalization
Nikolay: Hello, hello, this is
Postgres.FM.
I'm Nikolay from Postgres.AI and
as usual, my co-host is Michael
from pgMustard.
Hi Michael, how are you doing?
Michael: Hello, Nikolay.
I am good.
Thank you.
How are you?
Nikolay: I'm very good.
So, our topic today: we are going
to discuss denormalization, but
maybe at a high level, the basics,
and the pros and cons of applying it as a methodology.
So we already had a couple of episodes.
We discussed materialized views.
It's a very close topic to this
one.
What else?
We discussed schema design.
Michael: Yeah, we called it data
model trade-offs.
I'll link up those two.
Yeah, I think data model trade-offs
was probably the closest
to this because naturally it came
up as a big topic.
And then the third one it came
up in: I had a good chat
with Markus Winand, and he made
some great points around normalization
and denormalization, well, at least
normalization, so I'll link that as well.
Maybe I need to re-listen to that
one.
Nikolay: Yeah, it was my choice
for this week, and I'm just
thinking that it's important to
get back to this topic from some
slightly different angles,
because I keep seeing issues
with this and cases of this, and
I have two different customers
who have consulting contracts
with Postgres.AI, and I just
observed the need for denormalization,
or at least to consider
applying denormalization to solve
some performance issues.
Before we discuss performance and
reasons, let me say this.
If you don't understand normalization,
don't use denormalization.
You first need to learn normalization,
feel it, and use it.
Because otherwise, with relational
databases, it can be tricky
to understand particular cases
and limitations and consequences
of applying denormalization.
I'm not talking about non-relational
databases, but we can maybe
mention the work by Stonebraker
on the EDB blog, discussing
MongoDB particularly, or maybe
DynamoDB as well, I don't remember exactly:
the problems that happen with MongoDB
and, I think, document databases and big JSONs in general,
the tendency to have a denormalized
design by default, and the issues
that come later, both flexibility-wise and performance-wise.
Michael: Was that the blog post
with Álvaro?
Nikolay: Yeah, yeah.
So it also had the word harmful,
which I inherited and then got
a big backlash for on my subtransactions-related
blog post on Hacker
News.
Yeah, so let's leave that aside and consider that we're talking only
about Postgres, and maybe slightly wider, relational databases.
The first thing you need to understand is normalization: normal forms,
at least third, fourth, Boyce-Codd, and so on, right?
Functional dependencies are a very interesting thing to understand.
It's not only theory. It feels like theory, and it always felt like theory to me,
but when you design your schema, if you understand these things,
if you've learned and studied them, you feel much better thinking
about how to design the schema: tables,
columns, and foreign keys between them, right?
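A concrete, made-up illustration of a functional dependency and the normalization step that removes it:

```sql
-- Denormalized: product_name depends only on product_id (a functional dependency),
-- so it repeats on every order line and can drift out of sync.
CREATE TABLE order_items (
    order_id     bigint NOT NULL,
    product_id   bigint NOT NULL,
    product_name text   NOT NULL,
    quantity     int    NOT NULL,
    PRIMARY KEY (order_id, product_id)
);

-- Normalized: the dependency moves into its own table.
CREATE TABLE products (
    product_id   bigint PRIMARY KEY,
    product_name text NOT NULL
);

CREATE TABLE order_items_normalized (
    order_id   bigint NOT NULL,
    product_id bigint NOT NULL REFERENCES products (product_id),
    quantity   int    NOT NULL,
    PRIMARY KEY (order_id, product_id)
);
```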
Michael: Yeah, but in the world, at least in the last 3 to 5 years,
I think the vast majority of systems I've come across have had a
pretty good understanding of normalization and pretty good hygiene
in that area.
I think it's one of those database things that back-end engineers
and full-stack engineers generally do read up on.
They know it's important, and they do some thinking and up-front
design work, get second opinions and things.
So I actually haven't come up...
I have come across some cases of, you know, EAV,
entity-attribute-value, nonsense, or just things that
haven't been fully thought through, but they've been really rare
compared to people who have done things pretty well from the
start.
Nikolay: EAV is radical normalization, right?
Which leads to bad performance, usually.
Yeah, that's a good point.
But I think many back-end engineers look at this topic, normalization
versus denormalization in general, through the lens of an ORM, right?
They have different terminology, so sometimes it's hard to
understand each other when talking.
Also, I think today the most powerful LLMs do a great job at suggesting designs.
For example, we integrated one and it produces Mermaid diagrams.
So if you ask it to design, I don't know, a Twitter-like, Facebook-like,
YouTube-like, or e-commerce schema, and then you iterate, adding
tables, it generally tries to apply common sense
in terms of normalization as well.
Michael: That makes sense.
Nikolay: But again, before starting to use denormalization,
it's good to understand normalization first, in its several forms.
Then, let's say, sometimes it's convenient
to have some denormalization.
I maybe cannot think of a particular case right now, but it
happens.
I know it happens.
Sometimes I just see there is a functional dependency here,
but I don't go and normalize and
move this data to a separate
table, because I know
this functional
dependency might vanish in the
future.
You know, it might change.
I just know my field, and this
functional dependency only holds
for now, it feels so, but it might
change.
In this case it might be beneficial,
and provide more flexibility,
if you don't apply the next step of
normalization in this situation.
Yeah, it sounds abstract, and I
cannot come up with a particular example right now.
Michael: I can think of one example.
Imagine you're an early-stage startup
without product-market
fit and you're adding features
quite quickly.
A simple example would be: you're
a blogging platform and you think
adding tags would be a good idea.
So instead of adding a whole
extra schema around that, you
could have a single column where
you let people add a few tags.
You know that might not be
how you want it to be eventually,
but you also might remove that
feature in a couple of weeks if
people hate it or something.
So for convenience you might
go with one column.
Well, you don't like that.
Nikolay: Well, for this particular case,
I remember my talk in Maryland
in 2008, the first time I came to the
US to speak, about how good Postgres
is for Web 2.0.
It's like 16 years ago, right?
And this was my example: tags.
And I talked about hstore.
Or today we can talk about arrays.
Back at that time, there was no
GIN at all.
GIN was not created yet.
We used only GiST and R-tree, or ltree in some cases, it depends.
But in the case of tags, it's natural
to think: okay, I'm going
to put these in 2 different tables,
and then you need an additional
table for the many-to-many relationship.
And this is EAV, exactly.
This is what is going to have bad
performance if you have a lot
of entries in the main table of
your objects, and also a
lot of tags and a lot of relationships
between them.
This join can be problematic sometimes.
But we will probably return to
this 3-table schema later.
What I wanted to emphasize,
I cannot not comment
here, is performance.
These days I would consider arrays
of tags, or maybe arrays
of IDs.
We know foreign keys don't support
this: one row where you have
tag IDs, and then, in effect,
multiple foreign keys to
different entries in the tags table.
But we can live without
the foreign key here.
For the sake of good performance, we can just have these
IDs, or just the text as is, a text array, right?
So an int8 array, a bigint array, or a text array.
In this case, using GIN should be good.
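A minimal sketch of what this could look like; all table and column names here are made up:

```sql
-- Denormalized variant: tags stored as a text array directly on the post.
CREATE TABLE posts (
    id    bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    title text   NOT NULL,
    tags  text[] NOT NULL DEFAULT '{}'
);

-- GIN supports fast containment queries on the array.
CREATE INDEX posts_tags_gin ON posts USING gin (tags);

-- "All posts tagged postgres":
SELECT id, title
FROM posts
WHERE tags @> ARRAY['postgres'];
```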
Michael: Or just JSONB.
Nikolay: JSONB, I feel, would add an additional layer of something
here which we don't really need, because we just need IDs, or a
bunch of phrases, texts, right?
So I don't know.
Do you think this is denormalization?
Michael: I think strictly speaking, yeah.
Nikolay: Well, let's not just say we are breaking the first normal form
here. Because, of course, the first normal form is just saying
that each value, basically each cell of a table, should be atomic, right?
It should not be a table, for example, or, in this case, an array or JSON.
We are breaking it, strictly speaking.
In this sense, every time we use an array, we are already breaking
that normal form.
But in what sense is it denormalization?
If we are repeating values, for example, if later we find a typo
in some tag and need to adjust it, that's a functional dependency,
of course, and we will need to update a lot of rows here.
Of course, it's better to store IDs and have the tag values in a
separate table.
In that case, we can even join in an interesting way, maybe
with a lateral join or unnest here.
It's interesting, but it's a separate topic.
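A rough sketch of that kind of join, assuming the ID-array variant: a hypothetical posts table with a tag_ids bigint array and a separate tags(id, name) table:

```sql
-- Resolve tag IDs stored in an array back to their names.
SELECT p.id,
       p.title,
       array_agg(t.name ORDER BY t.name) AS tag_names
FROM posts p
CROSS JOIN LATERAL unnest(p.tag_ids) AS tid(tag_id)
JOIN tags t ON t.id = tid.tag_id
GROUP BY p.id, p.title;
```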
So if we don't store the text values, I don't see a denormalization
aspect here.
It's just that we treat the first normal form in an interesting
way.
This discussion happened 20 years ago, I think.
I remember some works from Chris Date, right?
Works asking: are we breaking the first normal form in such
cases or not?
But if we store just IDs, well, it's quite well normalized, and
we don't have EAV, so we can use GIN and search much faster.
If we need some tags, this query is going to
be much more efficient, in my opinion.
But we lack the foreign key, unfortunately.
This is what we lose with such an approach.
So, flexibility is one thing.
A different thing, and this is the most interesting of course, is performance.
We already started discussing performance.
I would highlight 2 particular cases.
These are exactly the 2 particular situations I observed with
2 particular customers over the last couple of weeks.
Let's start maybe with the simpler one.
It's slow count, right?
Did we have an episode about it?
I'm sure we
Michael: did, yeah?
Nikolay: So, slow count, or in a broader
view, slow aggregates.
There are approaches to solve
it using index-only scans.
We also had an episode about that.
I think we are covering more and
more things, which is great.
But if we see that it's not enough,
what else?
We need to start denormalizing
in some form.
And obviously, we could use a materialized
view which has all the
counts, but it's very asynchronous.
Michael: And the...
Like pre-calculated.
Nikolay: Yeah, pre-calculated.
You can do it, but you cannot do it
often.
For example, we have
a huge table, like 10 billion
rows, and we insert
one more row.
And we say, okay, our materialized
view is just CREATE MATERIALIZED
VIEW, blah, blah, blah, name AS
SELECT count(*) FROM the table.
Interesting, right?
Is it a good idea to refresh it often?
Well, maybe no.
We should not do it often.
In this case, it will be lagging
and showing not quite relevant information,
kind of eventual consistency.
Yes, we will reflect the next insert,
unless it gets deleted,
right, but it will be lagging, and
some people can notice it.
Not good.
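A rough sketch of this pre-calculated approach (names are made up):

```sql
-- Pre-calculated count, refreshed on a schedule rather than on every write.
CREATE MATERIALIZED VIEW posts_count AS
SELECT 1 AS id, count(*) AS total
FROM posts;

-- A unique index is required for REFRESH ... CONCURRENTLY, which avoids blocking readers.
CREATE UNIQUE INDEX ON posts_count (id);

-- Run periodically (cron, pg_cron, etc.); the count lags between refreshes.
REFRESH MATERIALIZED VIEW CONCURRENTLY posts_count;
```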
So what else? We can consider synchronous
approaches.
Also, when you refresh a materialized view, it
recalculates everything; ideally it could
skip some old data, understanding
it hasn't changed, and
count only the fresh part of the
table.
Maybe we have partitioning there,
I don't know.
But Postgres doesn't support incremental refresh,
only additional extensions do, like pg_ivm,
or, what was the name of that other
extension? I forgot, I just
found it recently.
Michael: The one you just mentioned,
was it denorm?
Nikolay: Yeah, denorm.
It was interesting.
I think it's Python and external
to Postgres, whereas pg_ivm
is an extension, so obviously you
don't have it on many managed
Postgres offerings.
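For reference, pg_ivm maintains such a view incrementally on each write; if I recall its interface correctly, the hypothetical count example above would look roughly like this:

```sql
CREATE EXTENSION pg_ivm;

-- create_immv builds an incrementally maintained materialized view
-- (really a table plus triggers), so the count stays up to date
-- without manual refreshes.
SELECT create_immv('posts_count_immv',
                   'SELECT count(*) AS total FROM posts');

SELECT total FROM posts_count_immv;
```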
But if it's external, or, for example,
you implement it yourself,
it's quite possible, using maybe
triggers or something, maybe
some queue mechanism involved externally,
to make it asynchronous
and incrementally updated.
In this case it can be good and
maybe more resilient, updating
faster and having more up-to-date
data.
The problem with materialized views
is that they are heavy.
It's a very rough and heavy tool.
In most cases where I deal with them,
I say let's try to avoid them,
get rid of them, right?
In this particular case as well.
By the way, we could also build
our own kind of materialization
mechanism using logical decoding,
right?
Logical replication, even.
For example, accumulating events,
it can also be an external queue
mechanism, but also using logical replication,
if you batch and then update not
on each insert, but after every 100
inserts, updates, and deletes.
And then you reflect this change,
and nothing is lost, because
it's quite a Postgres-centric approach,
right?
Postgres guarantees nothing
is lost.
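A minimal sketch of the logical decoding side of such a home-grown mechanism, assuming a worker process polls a slot and folds the decoded changes into a counters table in batches:

```sql
-- Requires wal_level = logical. test_decoding is the built-in demo output plugin;
-- a real implementation would more likely use pgoutput or wal2json.
SELECT pg_create_logical_replication_slot('counts_slot', 'test_decoding');

-- The worker periodically consumes up to 1000 changes, aggregates the
-- inserts/updates/deletes per table, and applies one summary update.
SELECT lsn, xid, data
FROM pg_logical_slot_get_changes('counts_slot', NULL, 1000);
```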
What do you think?
Michael: Well, I like it.
I remember reading a good blog
post by Timescale on how
continuous aggregates work,
and I remember thinking I
wouldn't want to implement that
myself.
Keeping it synchronous,
or synchronous enough,
is just quite painful, so I do
admire people who have done
that.
And yeah, I can see the argument
for it, but I also think
at this point, once you're considering
that, you probably should also
be considering the synchronous
approaches, like triggers,
or some other way of keeping things
actually in sync.
Nikolay: Yeah, well, I agree, and
I like this.
I have a sense that there is a lack of
some additional materialization mechanism
in Postgres, not yet developed,
which could support some
kind of asynchronous way to update
things.
Michael: Oh, I'd love it if it was
in core. That would be amazing.
Nikolay: Yeah, in core, exactly, this
would be great, but it's maybe
a hard task.
Michael: And also not that high
up the priority list for most
people I think.
Nikolay: Yeah, but it would give
a lot of interesting capabilities
to develop interesting, well-performing
and responsive... I mean not responsive, but data
is coming in and you reflect it
within a few seconds. It's great.
By the way...
Michael: When you mentioned performance,
I looked up the definition
of denormalization on Wikipedia,
and it said denormalization
is a strategy used on a previously normalized
database to increase
performance.
So they're explicitly saying that...
Nikolay: Right, but we can assume
that sometimes we normalize
in our head, then move back
and denormalize, and then
deploy it right away in production.
This would...
Michael: Oh yeah, obviously
there is the...
There's an interesting part there
that says it has to be previously
normalized, but I thought it was
also interesting that the only
reason they gave was performance.
Nikolay: Well, maybe I'm wrong
in thinking it can be not only for
performance reasons.
Of course, in my experience I did it for
performance reasons.
Let's be honest.
Let's be honest here, I just
wanted to cover the cases
which can happen.
Michael: You mentioned a couple
of cases that came up recently.
Nikolay: Yeah, well, before we
move to the next one, I wanted to
emphasize: if we think, okay,
we are not going to use a materialized
view, what else?
We have a table, we insert, but
we need the count very fast.
Imagine we have a trigger and we update
that count on each insert, or
update, and, or, delete, the...
Michael: update and delete.
Nikolay: Yeah, delete.
Update, maybe not, it depends, of course.
If you have a soft delete approach,
then you need to reflect updates,
because they might be doing soft
deletion, right?
Michael: Well, and if it's only
a count, yeah, sure, but if it's
a different kind of aggregate function,
then you might have to worry about...
Nikolay: ...averages, sums, and so on.
Yeah, min and max are probably fast
enough thanks to an index anyway.
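A minimal sketch of such a trigger-maintained counter; all names are made up, and this is the naive single-row version whose downsides come up next:

```sql
CREATE TABLE posts_counter (
    id    int    PRIMARY KEY DEFAULT 1,
    total bigint NOT NULL DEFAULT 0
);
INSERT INTO posts_counter DEFAULT VALUES;

CREATE FUNCTION posts_count_trg() RETURNS trigger AS $$
BEGIN
    IF tg_op = 'INSERT' THEN
        UPDATE posts_counter SET total = total + 1 WHERE id = 1;
    ELSIF tg_op = 'DELETE' THEN
        UPDATE posts_counter SET total = total - 1 WHERE id = 1;
    END IF;
    RETURN NULL;  -- return value of an AFTER row trigger is ignored
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER posts_count
AFTER INSERT OR DELETE ON posts
FOR EACH ROW EXECUTE FUNCTION posts_count_trg();
```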
Yeah, by the way, of course, there are also
ongoing discussions about having
a column store in Postgres
all the time, which might be good
here, but maybe not.
It depends on the particular case,
because if the main part
of your data needs to be row store,
you still need to deal with some
replication inside your engine
or outside of it.
Replication meaning maybe
triggers, maybe logical
replication involved, who knows.
But yeah, it can be interesting.
But again, we have a main table
and a trigger which, on each insert,
increments this counter in this additional
table.
What do you think will happen?
Is it good?
Michael: Like hot spots, we've
talked about that, yeah.
Nikolay: Hot spots.
Michael: Yeah.
At sufficient scale.
And I guess all of this is only
relevant at sufficient scale,
isn't it?
Nikolay: Yeah, so this synchronous
approach can be super dangerous
because of locking, heavy lock contention,
and even without heavy locks,
if we have foreign keys there, for example,
it's going to have multixact IDs
involved, and even SELECTs can degrade.
So...
Michael: I think you can spread out
the hotspot.
You don't have to have a single
record that gets updated.
You could have a series of them, and
that series could grow as your
concurrency grows.
Nikolay: Right, right.
That's like, yeah, batches
in place, right?
And then you just sum all
of it, and this is your total count,
right?
That's good.
Yeah.
This is actually what we did a
few times.
We just had batches.
I don't remember exactly how.
Basically, some kind of partitioning
of your count inside one
table, right?
You partition those counts into
buckets, or batches, however you call them.
Then you increment them, sometimes
collapsing them, and so on.
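A rough sketch of this bucketing idea, reusing the hypothetical posts table; the trigger spreads increments over 16 rows and the total is the sum of all buckets:

```sql
-- Deliberately no indexes on this table, so the frequent updates can stay HOT.
CREATE TABLE posts_counter_buckets (
    bucket int    NOT NULL,
    delta  bigint NOT NULL DEFAULT 0
);
INSERT INTO posts_counter_buckets (bucket)
SELECT g FROM generate_series(0, 15) AS g;

CREATE FUNCTION posts_count_bucketed_trg() RETURNS trigger AS $$
BEGIN
    -- Pick a bucket by backend PID so concurrent sessions tend to touch different rows.
    UPDATE posts_counter_buckets
    SET delta = delta + CASE tg_op WHEN 'INSERT' THEN 1 ELSE -1 END
    WHERE bucket = pg_backend_pid() % 16;
    RETURN NULL;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER posts_count_bucketed
AFTER INSERT OR DELETE ON posts
FOR EACH ROW EXECUTE FUNCTION posts_count_bucketed_trg();

-- The total count is the sum of all buckets.
SELECT sum(delta) AS total FROM posts_counter_buckets;
```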
Yeah, also vacuum is an issue here.
If you update that table very often,
one row can become super heavy
in terms of real disk space
because of dead tuples, right?
Yeah, so Postgres MVCC is not good
in this particular case.
Michael: Well, I think you'd ideally
be aiming for those to be
HOT.
There's no reason, I think, to
index that column.
So I think you'd be aiming for
those to be HOT updates.
Nikolay: Good point.
Yeah, good point.
Michael: And then hopefully avoid
vacuum issues altogether.
Nikolay: It's great.
Like an index-less table.
Yeah, actually, this is maybe the
case when a primary key is definitely
against our goals, good performance against...
Michael: the primary key anyway.
But yeah, I understand.
Nikolay: Yeah, I mean, this is
a concept from theory as well.
At the same time, when we learn
relational theory and normalization,
we also learn that tables without
primary keys are basically
also breaking the rules, right?
So in this particular case, we
don't want a primary key.
It's interesting.
Michael: Isn't there a saying,
something like: first you
have to learn all the rules, so
that you know how and
when it's safe to break them?
Nikolay: Yeah, yeah.
Well, that's the whole denormalization idea:
you need to learn how to do
it right, and then how to break
it right, to have good performance.
So, we don't have a lot of
time, so let's discuss the
most interesting case, maybe.
Back to EAV maybe, but not exactly.
In my projects, I dealt with it several
times.
Imagine you have social media,
for example.
You have a lot of users, say a
few million, and you have
a lot of objects, say dozens of
millions of objects.
It can be posts or comments, anything.
And maybe it's not only posts; there is
also some concept, like a kind of blog,
organization, or data source where posts
exist.
And people can relate to those
channels, right?
Channels, let's say channels.
For example, they can subscribe,
or they can have permissions
to view them or not to view them,
different things.
And then you just need to display
the last, the freshest, 100 entries,
posts or something, right?
Michael: relevant to them
Nikolay: Relevant to this person,
yeah, only from subscriptions,
or only the data which they are
allowed to view, to access,
to edit.
Or even worse, if you
want to show the hundred last updated,
or last accessed, or last changed,
different kinds of timestamps can
come into play here.
And it's interesting, because some
timestamps belong to the objects
themselves, for posts, for example,
creation time or modification
time.
But some timestamps can belong
to this particular relationship
between users and those channels
or the posts themselves.
For example, last accessed.
Right, it's definitely different
for each person with respect to a
particular post.
One user accessed it one day, another
user the same day but at a
different time, so the timestamp is
different.
So this particular task is very
common.
This is a common pattern.
Well, not a pattern, pattern maybe should
be applied to solutions, right?
But usually people just do SELECTs,
joins, and that's it.
And at some point, performance
becomes terrible.
It happened first in one of my social
media projects, and I was super upset.
I thought it's really similar to
graph issues, like working
with graphs, for example: return
the first circle of connections,
the second circle of connections, and
try to find some connections,
like in LinkedIn, right?
I remember, many, many years ago,
we were trying to solve that in
a relational database.
It's quite difficult.
We didn't have recursive CTEs
at the time, or lateral joins,
and so on.
So it's a hard problem to solve.
The query has several joins, filtering,
and an ORDER BY, but some filter
columns exist in one
table and other columns exist in different
tables.
When the filters and the ORDER BY are on
columns in different tables,
it means you cannot have the
ideal situation: a single index
scan, or even an index-only scan, which
is very good.
Then you need to rely on one of the 3
join algorithms Postgres implements.
I can assure you there will be
edge cases where the statistics and
the planner have no idea how
to execute this well.
So
Michael: this is
Nikolay: bad performance.
Michael: The age-old issue of which
is more selective.
Is the planner going to be better
off going through an index
on the ORDER BY column until it finds
enough posts, or whatever the
example is, to satisfy the limit?
Or is it going to be cheaper to
look at all posts that satisfy
the other conditions and then order
those, assuming they're a
smaller set?
Unless the statistics
are good, you could
end up doing the wrong one,
the more expensive one, and
have a very slow query if it's
a huge index.
Nikolay: Exactly.
Imagine we order by creation
time: users subscribe
to channels, and channels have posts,
so we have a users table, we
join with channels, we have subscriptions
and posts, and then
we order by creation time of posts
and limit 100, right?
ORDER BY created_at DESC, LIMIT 100.
So in this case, indeed, as you
say, Postgres needs to choose
what to do.
Is it a good idea to extract all
subscriptions and then order by
in memory and find 100?
Or is it better to use an index
on created_at, go backwards on
this index, and try to find posts
which can be displayed for
this particular user, meaning that
they are among the subscriptions?
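As a concrete, made-up version of that query, assuming subscriptions(user_id, channel_id) and posts(id, channel_id, created_at, ...):

```sql
SELECT p.*
FROM posts p
JOIN subscriptions s ON s.channel_id = p.channel_id
WHERE s.user_id = 42
ORDER BY p.created_at DESC
LIMIT 100;
-- The filter (s.user_id) and the ORDER BY (p.created_at) live in different tables,
-- so no single index can serve both, and the planner must pick one of the two
-- risky strategies described above.
```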
Actually, both paths are bad.
They might be good in some cases,
but at a larger scale there
will definitely be cases where
both paths perform really badly,
dealing with a lot of buffer operations
and bad timing, bad
execution time.
So how would you solve this particular problem?
How?
What was...
Michael: the solution?
Nikolay: Yeah, yeah, well,
denormalization is one of the ways,
right?
Michael: So like storing a duplicate
of this data in a materialized
view?
Nikolay: Yeah.
For example, we take the creation time and we just
propagate it.
For example, if it's a relationship
to particular items, posts,
we can just propagate it to the table
which represents this relationship.
And then we can even have the ideal
situation, a single index
scan.
Not an index-only scan, because we
usually need to join and
bring the actual data from the posts table,
maybe it's not in the index, but it might
be an index-only scan for this particular
relationship table.
This is one of the ways.
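A hedged sketch of that propagation: a hypothetical user_posts relationship table (permissions, or a fanned-out feed) with the post's creation time copied into it, maintained by a trigger or by application code:

```sql
CREATE TABLE user_posts (
    user_id         bigint      NOT NULL,
    post_id         bigint      NOT NULL,
    post_created_at timestamptz NOT NULL,  -- duplicated from posts.created_at
    PRIMARY KEY (user_id, post_id)
);

CREATE INDEX user_posts_feed
    ON user_posts (user_id, post_created_at DESC, post_id);

-- The feed query becomes a single index scan on user_posts,
-- plus a cheap join to fetch the 100 posts themselves.
SELECT p.*
FROM user_posts up
JOIN posts p ON p.id = up.post_id
WHERE up.user_id = 42
ORDER BY up.post_created_at DESC
LIMIT 100;
```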
Of course, if you have channels,
well, it's more difficult, because
we order by creation time of posts,
not channels.
So, yeah.
Michael: Yeah, I think some of
these cases, like the one you mentioned
where different users have different
timestamps, for example
if it's last viewed, I think
that gets really complicated
in terms of...
Nikolay: No, vice versa, I think
it's good, because if we order
by last access, for example, the last
access timestamp on a channel,
we just deal with channels.
We can propagate it to the posts
themselves...
Well, we cannot.
Yeah, I agree with you.
It really becomes complicated.
So there are cases where it's really
hard to apply denormalization,
but there are cases where it's
easy.
If we forget about channels and,
for example, think about the relationship
between users and these objects,
it can be, for example, permissions,
or it can be, I don't know,
last access timestamps, which should
be stored neither in the users table nor
in the objects table.
They should be stored in a table
in between, right?
Like a many-to-many relationship.
So in this case, usually, by default,
it's good.
We just use this table, we have
an index, we quickly find what we
need: 100 for this particular user,
100 objects that should be
displayed.
But in real life we usually
need additional filtering, involving
data inside the objects table,
right?
It can be, well, soft delete, for
example, we mentioned soft
delete.
So, a deleted_at timestamp.
If it's filled, then the object
should not be shown in this result
set, right?
But when we delete, we can propagate
it to all the entries this object has,
for all users; we can propagate it
to this relationship table.
Of course, it can be a lot of rows,
it depends on the situation.
I would not propagate it synchronously
if it's more than 1,000
rows, because the delete will be super
heavy.
Soft delete is an update, actually,
right?
But if we know only a limited
number of users can access each
object, like not more than 1,000,
say, or 10,000, we can do that.
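A hedged sketch of the synchronous variant, reusing the hypothetical posts and user_posts tables and assuming both have a deleted_at column:

```sql
-- When a post is soft-deleted, copy the flag to every related user_posts row,
-- so the feed query can filter on user_posts alone.
CREATE FUNCTION propagate_soft_delete() RETURNS trigger AS $$
BEGIN
    IF new.deleted_at IS NOT NULL AND old.deleted_at IS NULL THEN
        UPDATE user_posts
        SET deleted_at = new.deleted_at
        WHERE post_id = new.id;
    END IF;
    RETURN NULL;  -- AFTER trigger, return value ignored
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER posts_soft_delete
AFTER UPDATE OF deleted_at ON posts
FOR EACH ROW EXECUTE FUNCTION propagate_soft_delete();
```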
Or we can do it asynchronously.
Again, there is a need here, I would say,
not for asynchronous, incrementally
maintained materialized views, but maybe
for asynchronous triggers.
In Oracle there is PRAGMA AUTONOMOUS_TRANSACTION;
it's not exactly the same, but
sometimes it's close.
There are folks who are using
PgQ, for example, in Cloud SQL.
It's available there, right?
One of the good things about Cloud SQL
is the availability of PgQ,
so you can implement asynchronous
triggers there yourself.
If some object gets updated to be
soft-deleted, we propagate that to
all rows of this last-accessed
or last-viewed or whatever table.
In this case, again, we can have
a single index-only
scan.
But it might be more difficult
than that.
There is another way, and it's not
about denormalization.
This is how we solved it.
Not only us: I know GitLab also
has to-do lists, and they apply
the same recipe, which I think one
day we will blog about.
Forget about denormalization and just
use a huge recursive CTE with
some tricks, some algorithm actually
implementing this newsfeed
pattern, to solve it in a very interesting
way.
Michael: Oh, is it like the skip
scan recursive CTE?
Nikolay: It's not a skip scan, it's
like an advanced skip scan.
So you know you need to return
100 rows.
You have 100 slots to fill, and
then you start working with channels,
filling these slots.
Each time you work with the next channel,
you think: okay, this channel has
a fresh post, a fresh object,
and this is the place where we replace
some object we keep in memory
and put the new one into the slot.
At some point you finish and
return, but it requires some
effort to implement.
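The full recursive CTE is too involved to reproduce here, but a simpler lateral-join sketch of the same idea, take only the freshest posts per subscribed channel and then merge, using the made-up tables from before, looks roughly like this:

```sql
SELECT p.*
FROM subscriptions s
CROSS JOIN LATERAL (
    -- For each subscribed channel, take only its 100 newest posts,
    -- ideally via an index on posts (channel_id, created_at DESC).
    SELECT *
    FROM posts
    WHERE channel_id = s.channel_id
    ORDER BY created_at DESC
    LIMIT 100
) p
WHERE s.user_id = 42
ORDER BY p.created_at DESC
LIMIT 100;
```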
Michael: Yeah, I can imagine.
That sounds really cool.
It reminds me a little bit of one
of the latest features
in pgvector, have you
seen this? In, I think, 0.8
it added being able to continue
a scan.
Like if you...
Nikolay: Didn't help us.
We checked yesterday.
Michael: Oh, no.
But the idea is the same, right?
If you don't get enough results
in your limit, go back and continue
the scan.
Nikolay: It's good if you have
quite a uniform database.
In our case, speaking about
Postgres.AI, for example,
we have more than 1 million documents,
and more than 90% of them
are mailing lists, mailing list
entries, emails, right?
And when, for example, you ask
for some source category in this huge set...
this approach doesn't help.
It still encounters so many
mailing list entries that it cannot
find the sources.
So basically we are considering
either... I think we are currently
already using partial indexes there,
and I'm seriously thinking
we should move to separate tables,
because we have only a limited
number of categories, and in our use cases
we very often deal with only one
source.
Michael: You could maybe partition
on source.
Nikolay: Yeah, actually it's also
a good idea maybe, yeah.
So we deviated here from the original
denormalization topic, and
I think it's another story how
to implement a newsfeed better,
right, without denormalization
but with a huge recursive CTE.
But my summary is that, first
of all, denormalization should
be applied only if you know normalization,
and it can bring new
performance dangers if you don't
think about concurrency and
various issues with Postgres MVCC,
locking, the LockManager, for
example, and so on.
It's an interesting thing, still
quite useful to consider in
many cases, but it should be tested
well, as usual.
Michael: I think it's one of those
sharp tools, right?
Like, sometimes advanced craftsmen
need sharp tools, but be careful,
they're sharp.
Nikolay: Yeah, yeah.
Something like this.
Agree.
Michael: Wonderful.
Thanks so much, Nikolay.
Catch you next week.
Nikolay: Sounds good, have a good
week, you too.
Bye bye.