[00:00:00] Nikolay: Hello, hello, this is episode number 70. My name is Nikolay, and together with me is Michael. Hello, Michael.

[00:00:09] Michael: Hello, Nikolay.

[00:00:10] Nikolay: What do you think we should discuss today?

[00:00:13] Michael: Well, this week was your choice, right? I think you've picked a nice topic that was a listener suggestion. So I think we've simplified it to data model trade-offs.

Is that a fair summary?

[00:00:27] Nikolay: Right, right, purely relational. Let's emphasize that we're probably going to keep JSON and JSONB columns as a separate topic; we had that one already.

[00:00:38] Michael: With respect to PostgreSQL internals, almost. That's quite a nice way of looking at it: what could come up as a result of Postgres-specific implementation details that should make us consider data model decisions?

[00:00:56] Nikolay: Alright. So, Postgres allows a lot of columns to be present in a single table. How many? 256?

[00:01:05] Michael: Yeah. I'm not actually sure. I think it's more.

[00:01:07] Nikolay: There is some limit. Oh, okay, poor preparation. Okay, anyway, you can have hundreds of columns in a single table, and you can have indexes on those columns, or you can go and implement Boyce-Codd normal form, right?

And so on. I mean, there are many directions to move in, and, of course, everyone wants very good performance and compact storage and so on. Is it so? Maybe not everyone.

[00:01:42] Michael: Well, I think most people care about performance on some level, but it's easy to list three or four things that almost everybody cares about more, like durability, for example, or everything in ACID. A reason people pick Postgres, and trust older databases more than some of the newer ones, is that they care about those things more than performance, I would argue.

[00:02:10] Nikolay: Well, right, but if the question is related to internals, I guess we should think about performance first of all. And performance in terms of various aspects of it: read time, write time, various overheads we can have. So, to me, if we simplify the question a lot, it's a question of how wide our table should be, how many columns we can afford. Should we normalize versus denormalize? Should we have a lot of columns in one table, or fewer columns and more joins when needed? These are not exactly the same questions, because denormalization is done on purpose, intentionally: it means we extend the same table, putting more columns there, although those columns should be somewhere else, for example.

But sometimes it's a very different thing: even if we just follow the normal theoretical approach, normalizing our model, we can easily end up having a lot of columns in a single table. Because if you think about some social media or mobile application, usually if it's a big, growing project, we have some entities with their attributes

which are central to our model. For example, a users table, or if you have organizations or projects or something, these tables are going to be central. A lot of other tables will have foreign keys pointing to the central table. And eventually this central table is going to grow in terms of width.

You will add more and more columns because you need to track something. Sometimes just because of denormalization, to avoid an extra join and maybe aggregates, like count, right? Sometimes just because the complexity grows and you need to keep more properties. And each property, in a pure relational approach, is just a single column.

New property, new column, right? Attribute, field.

[00:04:34] Michael: So I think the obvious question is: why is that a problem? Why is a wide table a problem?

[00:04:42] Nikolay: Good question. Well, obviously there are strong trade-offs here. On one hand, extending our table in terms of width is not a problem. We know how to do it with zero downtime very nicely. Postgres 11 brought us defaults for existing rows without a table rewrite, a "virtual default", as I call it. It's not the official name, but this is actually what happens.

Instead of rewriting the whole table when you add a column with a default, the default is stored virtually for all old records, and your ALTER is very fast. So we know how to deploy changes with zero downtime, with proper testing, all good. Then we create indexes, and we create them concurrently, because obviously more columns means more indexes.
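
A rough sketch of what that looks like in practice (the users table and the column names here are made up for illustration):

```sql
-- Postgres 11+: adding a column with a default is a metadata-only change;
-- old rows get the "virtual" default and the table is not rewritten
ALTER TABLE users
    ADD COLUMN is_verified boolean NOT NULL DEFAULT false;

-- indexes are then built without blocking writes
CREATE INDEX CONCURRENTLY users_is_verified_idx ON users (is_verified);
```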

Some columns you want to be present in indexes because you filter or order by them, and so on. But the trade-off is: on one hand, it's good to have a lot of things in one table, because you can achieve a single index scan, or a single index-only scan, for many queries. If you need to join data from multiple tables, you cannot have a single index scan. Index...

[00:06:15] Michael: ...scan, full stop. Yeah.

[00:06:17] Nikolay: Yeah, because an index cannot be defined on multiple tables, and the only way to do it is to define a materialized view and deal with all the issues it brings, because it will need maintenance and proper use. So, if you have multiple tables... let's consider a simple example, a one-to-one relationship. Somehow you ended up having, for example, a users table and a users extra properties table, and maybe it's not a good idea, because you cannot, for example, filter or order within a single scan using columns from both tables.

For example, you want to search by email, lowercased, or with the case-insensitive citext data type. And at the same time, you want to use some extra field in this search, or order by it. In this case, Postgres will need to choose one of the three join algorithms, nested loop, hash, or merge join; if it's an email lookup, probably it will be a nested loop or something like that, right?

So, it won't look super cool, because super cool is an index scan or index-only scan. And I'm not saying joins are bad. Joins in Postgres are quite optimized, three algorithms, and in many cases they work really well, but you cannot beat the performance of an index scan or, even more important, an index-only scan, right?

So, to have that, you need to bring those columns into the main table, users, for example. And in this case, you can define a multi-column index, and your query will be super fast. You will have just a few buffer hits to find that one or those few rows, and it will be below one millisecond, for sure. Right? And that's great. That's it. End of story. Let's have a lot of...

[00:08:22] Michael: it.
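
A minimal sketch of the single-table, single-index-scan idea described above, assuming hypothetical users columns (email, last_seen_at):

```sql
CREATE INDEX CONCURRENTLY users_email_ci_last_seen_idx
    ON users (lower(email), last_seen_at);

-- both the filter and the ORDER BY are served by one index; if only indexed
-- columns are selected (and the visibility map is mostly set), this can be
-- an index-only scan
EXPLAIN (ANALYZE, BUFFERS)
SELECT last_seen_at
FROM users
WHERE lower(email) = 'alice@example.com'
ORDER BY last_seen_at;
```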

[00:08:24] Nikolay: No, no, no. Sometimes, even if we perform normalization like we learned from theory, at least to third normal form, which is a very good situation already, I say, okay, maybe we should go back and denormalize just to have index-only scans and avoid joins. But this is just one part of the story. The other part of the story is that there are many issues you will have when you have a lot of columns and indexes. And this correlates a lot with previous episodes we had recently: overindexing, two weeks ago, right?

[00:08:59] Michael: Is there an even more fundamental problem? By the design of Postgres, we store data as rows, in tuples, in pages, right? And because we're doing that, if we've got really wide rows, we reduce the number of tuples per page that we can store.

So we might get that index-only scan, but if we're frequently reading from only a couple of those columns, and they're really narrow columns, having those on a separate table from the wide ones we're infrequently accessing is an argument against the wide table for read performance. So even if we only focus on read performance, I think there are some arguments to have splits like that.

[00:09:46] Nikolay: Yeah, makes sense, right? But again, for your particular case, you can conduct a few experiments and compare exactly, with EXPLAIN (ANALYZE, BUFFERS): how many buffers each approach gives you. And I suspect for reads, not in all cases, but in many cases,

a wide single table will still be good. Even if it's TOASTed, right? If we exceed roughly 2 kilobytes for a row, it will be TOASTed, and so on. That means it's actually an implicit join, because to find those chunks, Postgres will need to perform extra work. But buffer hits or reads will show this cost.
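
A sketch of such a comparison, with hypothetical users / user_extras tables; the point is to compare the Buffers numbers, not the exact plans:

```sql
-- design A: one wide table
EXPLAIN (ANALYZE, BUFFERS)
SELECT id, post_count FROM users WHERE id = 42;

-- design B: the same data split into two tables, joined back together
EXPLAIN (ANALYZE, BUFFERS)
SELECT u.id, e.post_count
FROM users u
JOIN user_extras e ON e.user_id = u.id
WHERE u.id = 42;
```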

The second side of this coin is that by extending the number of columns in the table, we get two big issues. They may not come in an acute form; they can come slowly, increasing over time and accumulating. So, two problems. The first problem is how writes work, I mean updates, actual updates. An update in Postgres means that the whole tuple is copied, right? And this means that if we have a lot of columns which, like you said, we don't access often, during an update we still need to copy them. So this is the first part of the first problem. The second part of the first problem is that more columns usually means more indexes, because to work with those columns you need additional indexes. And if you update a row, or also if you insert a row (it doesn't affect deletes, though again, I've made this mistake a few times)...

Deletes don't matter, but inserts and updates will need to change all the indexes you have. If you have many columns, you have many indexes, and to update just one column, or to insert a row, you need to update all these indexes. There is the heap-only tuple (HOT) update optimization, but it doesn't always work.

It requires two conditions to be met (no indexed column is changed, and there is room on the same page), and that's not always the case, right? So, if we have a lot of columns, we need to write more, and we need to update indexes often. This makes our writes very heavy, and it gives a lot of work to autovacuum.
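
One way to see how often HOT actually kicks in is to compare total versus HOT updates per table in pg_stat_user_tables (a rough monitoring query, not tied to any particular schema):

```sql
SELECT relname,
       n_tup_upd,
       n_tup_hot_upd,
       round(100.0 * n_tup_hot_upd / nullif(n_tup_upd, 0), 1) AS hot_update_pct
FROM pg_stat_user_tables
ORDER BY n_tup_upd DESC
LIMIT 10;
```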

A simple example: you decided to put a counter somewhere to avoid slow aggregates. For example, we talked about users. Okay, how many comments has this user left in various...

I'm just talking about some abstract social media application. If you put a post count as a column in the users table, you will need to update it quite often, all columns will be copied every time, and all indexes will need to be updated if this update is not HOT, or if it's an insert. But okay, here we talk only about updates: if the update is not HOT, all indexes will be updated. It's called index write amplification, one of the reasons Uber decided to go back to MySQL. So what does it mean? It means that we probably need to think carefully, especially if this is a central table which will be used in many, many queries. And if it's a central table, it will also often be used with just a

single index scan, just to find the current user, for example. Who is working right now? We need to find the current user. Of course, you can cache it, but still, you need to find this user, right? And then you need to update it. So, probably, it's wise to keep this counter in a separate table, a one-to-one relationship, like user extra data or something, and the count is there. It will be a narrow table, only some numbers, and a foreign key to the main table, which is also unique to maintain this one-to-one relationship, maybe it's the primary key as well. So you just update this counter, and maybe there are some other counters, or timestamps. Another

example here is "last something at": last updated at, last commented at, last posted at. Some timestamp. A timestamp, with time zone or without (usually you should prefer with time zone), is 8 bytes. Keeping it separately and just having an extra join when needed may be a wise approach, because the main table remains untouched even though you need to update this value often, right? And the index write amplification is lower, because there are only a few indexes there, probably just one index, actually. You probably don't need to index those extra columns, like counters and timestamps. Sometimes you do, of course, but in many cases you don't, and in that case it's just one or a few indexes to be updated.

So this write amplification is not a big problem anymore.
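
A sketch of that layout, with hypothetical names: the frequently updated counters live in a narrow one-to-one table, so the wide users table and its many indexes stay untouched.

```sql
CREATE TABLE user_counters (
    user_id        bigint PRIMARY KEY REFERENCES users (id),  -- one-to-one with users
    post_count     integer NOT NULL DEFAULT 0,
    last_posted_at timestamptz
);

-- a hot-path update now touches only this narrow table and its single (PK) index
UPDATE user_counters
SET post_count = post_count + 1,
    last_posted_at = now()
WHERE user_id = 42;
```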

[00:15:44] Michael: And we avoid a lot of bloat on the main table, right? Or a lot of...

[00:15:49] Nikolay: right?

[00:15:50] Michael: ...additional work on the main table, as well as the write amplification. We also don't have to free up or reuse that space in the heap. And presumably there's an impact on the write-ahead log, the amount of work we're doing in the write-ahead log as well?

[00:16:07] Nikolay: Well, yes, it is so, but we also have full-page writes, right? A full-page write means we need to write the full page anyway, but for subsequent writes until the next checkpoint, if you need to update a very narrow table, the tuples are small. In that case, of course, the amount of data written to WAL becomes smaller. So you are right, it's another good reason: if we write less WAL data, it means less pressure on backups and replication systems.
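
On Postgres 13 and newer you can see the WAL generated by a single statement directly, which makes the narrow-versus-wide comparison concrete (table and column names are again just examples):

```sql
BEGIN;

-- split design: a narrow one-to-one counters table
EXPLAIN (ANALYZE, WAL)
UPDATE user_counters SET post_count = post_count + 1 WHERE user_id = 42;

-- wide design: the same counter kept directly on the wide users table
EXPLAIN (ANALYZE, WAL)
UPDATE users SET post_count = post_count + 1 WHERE id = 42;

ROLLBACK;  -- this was just a measurement
```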

[00:16:44] Michael: exactly.

[00:16:45] Nikolay: But again, you need to join. So we end up with roughly this approach: if this is something like a counter or a timestamp, something which is very often updated but which we don't need an index on, it's good to keep it in a separate table.

[00:17:03] Michael: And one other condition: if it's not highly correlated to something else, if we don't need statistics on it (is it called multivariate statistics?), if we don't need to define those. I guess this applies to all the single-table features we have as well.

[00:17:22] Nikolay: Four tuples.

[00:17:22] Michael: Oh, and I thought of one other. I say I thought of it; really, I remember reading a really good blog post by Haki Benita on this topic.

So we've got the frequently updated columns that we should consider, or that it might make sense, to move out into a separate table. And then we have large or medium-sized texts. Large ones, more than the two-kilobyte number you already mentioned, Postgres already handles for us using TOAST.

Awesome. But it kind of shows that there is a point where it becomes beneficial to do this join, and we get some other features in there as well. But there's this weird case below two kilobytes, where there's an argument that medium-sized texts should be considered for moving out as well.

If you've got comments or something that are, like, one kilobyte on average, having those in your pages means you're only going to have a maximum of four... well, what would it be, eight tuples per page? Yeah, exactly.

[00:18:23] Nikolay: four.

[00:18:24] Michael: I keep saying that wrong.

[00:18:25] Nikolay: Two, two. Well, I'm also switching back and forth, tuples or tuples. If a page is eight kilobytes and a row is two kilobytes, it's roughly four per page. Yeah. So, it's an interesting point. And another interesting point: it's vice versa if some column is very rarely used, but it needs an index. Then it probably should also go to this extra table.

For example, a token, an activation token.

[00:18:58] Michael: Yeah.

[00:18:58] Nikolay: If you put an activation token in the users table, it becomes like a forever passenger. It's not doing useful work anymore at all, but you keep carrying it through all updates, copying it into all row versions and tuples. Why?

[00:19:19] Michael: and returning that data

[00:19:20] Nikolay: And reads, it's bloating reads. And yeah, you do need it at some point, and a token usually requires an index, because using the token you find the row, you find who...

this token belongs to, right, the owner of this token. So you need the index. But you need it just once per lifetime of a user, or maybe when a user resets a password or something. It's so rare. So keeping it in a separate table, even with an index, is a good idea. And yeah, it compacts the storage and memory representation: fewer bytes used in shared buffers, because you probably deal with it once, then you forget about it and never read it into shared buffers again until the password reset is required one more time.
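
A sketch of offloading such a rarely used but indexed column, assuming a hypothetical schema:

```sql
CREATE TABLE user_activation_tokens (
    user_id bigint PRIMARY KEY REFERENCES users (id),
    token   uuid NOT NULL UNIQUE   -- indexed, but looked up very rarely
);

-- the rare activation lookup still uses the unique index on token
SELECT user_id
FROM user_activation_tokens
WHERE token = 'a0eebc99-9c0b-4ef8-bb6d-6bb9bd380a11';
```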

[00:20:15] Michael: That's a really great point. I've got another question for you. If you've got something really large, if you're, like, chucking some HTML into Postgres for some reason, and that makes sense for your use case, maybe it's potentially megabytes of data. Now, we could rely on TOAST for this, but we could proactively put it in a separate table.

And it would still use TOAST, right? But it's almost a developer-experience thing: if somebody accidentally runs a select star, or isn't aware of the impact of fetching that data and does a select star, if you have to support those kinds of queries, not having it be possible to fetch megabytes and megabytes of data from the server by mistake when they probably don't need it.

Might be an interesting idea.

[00:21:05] Nikolay: Oh, definitely. If it's a project which was developed over many years, and you look at these central tables, especially users, like I said, or projects, organizations, and so on, they are used a lot. And of course, you can check the TOAST level with various queries, it's easy.

And if they started to be TOASTed, of course, selecting all columns includes those which you don't actually need. Usually you need this table often, but you need only a few columns, to check that this user is still active, right? That this user is not banned, for example, or that the project is still active, is visible.

It's just the few most important columns which are required often. Others can be offloaded, but again, there is a trade-off here. If they are offloaded, but with indexes, you need to join, and the Postgres planner, choosing the form of join and the indexes, will sometimes be less performant than just a single index scan, right?

So it's an art. When you build something, you need to predict the future, and this comes with some experience. You need to think: okay, these columns, will we need them often? Will we need indexes on them? And based on the answers, we can decide whether to offload them to this extra table. So, the second problem is overindexing. We discussed it. If you have a lot of columns, you have many indexes. You quite often want a single index scan or index-only scan; this is the perfect plan you always want to see, because it's usually either sub-millisecond or a few milliseconds only, depending on the number of rows and the hardware. So, if you have this...

You start having a lot of indexes, and not only is there the index write amplification problem, there is also the problem we discussed of exceeding 16 locks, or 15 indexes, actually. Even if your query deals only with a single table, if it has many indexes and you don't have prepared statements, you're planning each time, and this planning will be slower. And once you exceed this number of locks (the table plus 15 indexes is already 16), the 17th relation will be locked with fastpath false, so without fast path in the lock manager, and that is much slower than with fast path. It means that if it's a central table, and the QPS for queries involving this table is high, and simple queries can be very frequent, contention for simple queries might hit performance a lot. So you'll see the lightweight lock for the lock manager in wait events, in pg_stat_activity, or in RDS Performance Insights; Cloud SQL has it as well, and Timescale, I think, has it as well. Wait event analysis will show you waiting on the lightweight lock for the lock manager.
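
A simple way to do that wait event analysis yourself on a busy system is to aggregate pg_stat_activity; lock manager contention shows up as LWLock waits on the lock manager:

```sql
SELECT wait_event_type, wait_event, count(*)
FROM pg_stat_activity
WHERE state = 'active'
GROUP BY wait_event_type, wait_event
ORDER BY count(*) DESC;
```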

And this will happen with very simple queries, primary key lookups, or some index scans finding a single row, just because you have a lot of indexes and the queries are very frequent. So offloading might help, again. If you think, okay, this table is going to be central in my system, these columns I do need, most of them will be indexed, but I won't exceed this magic number of 16 (for indexes it's 15), then I will be able to have thousands,

maybe dozens of thousands of queries per second to this table, because I need it, right? Not everyone needs such performance, of course. But designing for such performance means you need to avoid having more than 15 indexes per table. This should be a rule for highly frequented tables. Or you can just use prepared statements. That won't eliminate the problem completely, but it decreases it, because you eliminate planning time, and for primary key lookups planning time might be even bigger than execution time. So you shave off a lot of milliseconds, or microseconds, from this duration, and contention

can also happen, but at higher QPS, so you postpone this problem if you do need more than 15 indexes. But having fewer indexes is good, because the lock manager won't have contention if you deal only with a single table.
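
You can also see the fast path behavior directly: inside a transaction, a simple select takes AccessShareLock on the table and all of its indexes, and pg_locks shows which of those locks got the fast path (the users table here is hypothetical):

```sql
BEGIN;
SELECT * FROM users WHERE id = 42;  -- locks the table plus every index on it

SELECT relation::regclass AS relation, mode, fastpath
FROM pg_locks
WHERE pid = pg_backend_pid()
  AND locktype = 'relation';
-- with more than ~16 relation locks per backend, some rows show fastpath = false
COMMIT;
```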

[00:26:04] Michael: And we can refer people to the overindexing episode. I've got a theoretical question around that, though; not something I've ever seen become an issue as a result of this, but if we follow this pattern in a few places in our database, and we've split things that are logically one table into multiple tables, we're going to have more joins.

And there are a couple of limits, is it from_collapse_limit and join_collapse_limit? And I think it's eight? Yeah, I knew it was eight. So we could quite quickly reach eight relations if we're joining like this. So if you're doing this, you might also want to check that any of your complex queries that involve a lot of joins don't flip plans, or that you're not getting something strange. I think it switches to, is it called the genetic query optimizer, something like that?

[00:26:57] Nikolay: Genetic query optimizer, yeah, right. I'm checking... 12, I always keep forgetting. The genetic query optimizer threshold is 12. So if you reach 12 relations... by the way, here we mean tables, not indexes, because previously we counted both tables and indexes when we talked about locking.

Here we talk about tables, and yes, if you join 12 or more tables, genetic query optimization starts happening. It's hard to understand how this thing works, right? Maybe you're fine, maybe you're not. So, in general, I of course aim to have good performance, and joins are fine, but usually I try to avoid joining too many tables.

And how to do it? Denormalization. In this case, yes, we need wide tables. It's worth saying that what I just described is about very high TPS, high QPS. If you have just hundreds of queries per second and good hardware, you probably will never see this lock manager contention. But in this case, of course, denormalization is one way to reduce the number of relations participating in a query.

Another way: sometimes we just read something, and then do something else in a different query. That's also possible. Or, you know, if we use the WITH ... MATERIALIZED keyword, meaning we intentionally put an optimization fence there, these steps are performed separately, so this threshold is considered separately, right?

So if the first step was six tables, the second step is another six tables, plus the previous result, of course.
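
A sketch of that optimization fence, with made-up table names; MATERIALIZED (Postgres 12+) forces the CTE to be computed as a separate step:

```sql
WITH recent_posts AS MATERIALIZED (
    SELECT p.id, p.author_id
    FROM posts p
    JOIN projects pr ON pr.id = p.project_id
    WHERE p.created_at > now() - interval '1 day'
)
SELECT u.id, rp.id AS post_id
FROM recent_posts rp
JOIN users u ON u.id = rp.author_id;
```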

[00:29:08] Michael: I don't know what happens with join_collapse_limit there.

[00:29:10] Nikolay: I think it's considered separately. These subqueries, even if it's a single statement, are physically separate things, right? Because we have a materialization step, and in memory we have a temporary table, so to speak. It's not a table, just some data, but it's used like a table: we can join with it. In this case, splitting into multiple steps using CTEs might help as well.

Actually, I see some overload in terms of the number of tables used in queries quite often, but usually we find some ways to simplify them. Sometimes a lot of tables participate almost for nothing, you know, it was just convenience: let's join this, let's join that. But then you ask, why? If you want good performance, maybe you should just make the query lighter, not so heavy.

And in this case, you just reduce the number and it quickly drops below 12. But I don't know, maybe the genetic optimizer is a good thing. I think only a few people in the world know exactly whether it's good or not. It's like some mystery, right?

[00:30:35] Michael: Yeah. Well, one thing to bear in mind is that unlike previous topics we've had, this is a configurable limit. You can change this number, with the caveat that you keep an eye on planning time; that's basically the reason these limits are in place.

[00:30:50] Nikolay: And you can do it dynamically. You can change it for one query and change it back before another query. Same with the join and from collapse limits. But it's not the same with the threshold I described related to fast path locking, sixteen. That's a hard-coded constant.
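
For reference, the settings mentioned here, and the per-transaction way to change the configurable ones:

```sql
SHOW geqo_threshold;       -- 12 by default
SHOW join_collapse_limit;  -- 8 by default
SHOW from_collapse_limit;  -- 8 by default

BEGIN;
SET LOCAL join_collapse_limit = 12;  -- applies only to this transaction
-- run the many-join query here and watch planning time with EXPLAIN (ANALYZE)
COMMIT;
```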

[00:31:12] Michael: Is there anything else you want to cover? I've got one more thing on the...

[00:31:15] Nikolay: Yeah, I wanted to mention that for what I described, we have recent benchmarks. Thanks to my co-workers, especially Vitaly and Dmitry, they did good benchmarks. Let's put links in the description. They're very detailed, and everyone can repeat them, as usual.

So, these benchmarks show the effect of having extra indexes, for example. And until recently, I never thought that having extra indexes affects planning time significantly, like 5 or 10 percent of overhead in terms of time. And I also never thought it was related to the topic we discussed today.

More columns means more indexes; more indexes means we reach this 16 threshold faster. So, it's interesting. I must admit it's new to me. This is what I wanted to say. Yeah.

[00:32:19] Michael: On the last question I had: the case I see most commonly for denormalization is around, for example, a software-as-a-service company wanting to partition a large table by account ID, or your equivalent. You need to have that column on the table in order to be able to do so, right?

So it could be not only for read performance, but also in preparation for something like partitioning.

[00:32:50] Nikolay: Right. Partition key propagation, right? We need to propagate this partitioning approach to dependent tables. This is quite obvious, I think. And again, if we partition, we need to make sure all our queries, or at least most of them, don't deal with more than one or just a couple of partitions.

Otherwise, we can have another amplification. It's not index write amplification, but a planning-time and locking-related amplification: if you have partitioning and your query deals with multiple partitions, and each partition has, for example, five indexes, then if you deal with four partitions, you already have 20 locks.

Plus the tables, so 24 locks to be acquired, and that definitely goes above this threshold of 16, so fast path won't be there for some locks, and locking will be slow. So you do need prepared statements at least. Or you need to make sure constraint exclusion is working, and your plans target just one or a couple of partitions, always. Better just one. And that's it.
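
A quick sanity check that pruning works, assuming a hypothetical events table partitioned by account_id; the plan should mention only one partition:

```sql
SHOW enable_partition_pruning;  -- should be on (the default)

EXPLAIN (COSTS OFF)
SELECT count(*)
FROM events
WHERE account_id = 42;
-- expect a scan of a single partition (e.g. events_p7), not an Append over all of them
```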

[00:34:07] Michael: Of course. Awesome.

[00:34:09] Nikolay: The same if it's a join, also. Of course, if you join, often the main table has many indexes and the extra table has many indexes. Even when you just select from it, the planner needs to lock all of them. I mean, at planning time we also need to lock all of them, actually, with the lowest possible heavyweight lock.

AccessShareLock, just to avoid DDL happening while we are reading. But all indexes, even those which are not needed, will be locked. This is the key; we discussed it. So, if you split into two tables but need to join them often, and this join query needs to be very frequent, it's worthless to split. This is what I want to say.

[00:35:01] Michael: We, we've

[00:35:01] Nikolay: It's worse.

[00:35:02] Michael: a relation. Yeah.

[00:35:04] Nikolay: Yeah, we added a relation, plus one, okay, and we also need to perform a join now. So, it's worse. So, if you offload something, you need to be very careful. You need to offload columns which are rarely used, or columns which need to be very frequently updated; columns which shouldn't be indexed, or which are indexed but rarely used.

I guess there are some design patterns here worth developing. What we discussed is kind of raw material; we need to polish it, I guess, before saying this is a...

[00:35:43] Michael: And the whole point of this episode was that it's a trade-off, right? There are going to be cases where one of these makes sense, and cases where the opposite makes sense.

[00:35:53] Nikolay: If it's a serious system that will need to handle a lot of queries per second, it's worth checking, prototyping, filling it with millions of rows, hundreds of millions of rows, and just testing, testing, testing various scenarios,

[00:36:13] Michael: with concurrent selects. That's really important, right? That was your...

[00:36:17] Nikolay: Well, it depends. You might have only single-session experiments and already understand a lot.

If you use BUFFERS, you see the I/O, and that's it. But if you need to check the behavior of the buffer pool or the lock manager, then of course you need full-fledged benchmarking, multi-session experiments, and you need to be alone on that machine. We also had an episode about experimenting and benchmarking. So I see value in both approaches, just different value.
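
A sketch of such a single-session experiment: fill a prototype table with synthetic rows and look at buffers (all names and the data shape are made up):

```sql
INSERT INTO users (email, country_code, created_at)
SELECT 'user' || g || '@example.com',
       (ARRAY['us', 'de', 'fr', 'br'])[1 + g % 4],
       now() - g * interval '1 second'
FROM generate_series(1, 1000000) AS g;

VACUUM ANALYZE users;

EXPLAIN (ANALYZE, BUFFERS)
SELECT id FROM users WHERE lower(email) = 'user42@example.com';
```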

[00:36:52] Michael: Wonderful. Was there anything else you wanted to make sure we talked about?

[00:36:57] Nikolay: I guess that's it. It's not polished, far from being polished. But I encourage everyone to think about their particular case with experiments. I don't call them benchmarks, because EXPLAIN (ANALYZE, BUFFERS) is also a performance-related experiment, you can check plans, but it doesn't feel like a benchmark, right?

But just from this single thing you can do many good optimizations, right? So experimenting is the key to understanding your particular case. I hope we highlighted the directions of analysis; this is probably the main value of this episode. Okay, if you liked it, please like, subscribe, and share. I'm talking to our listeners.

[00:37:47] Michael: Oh, I thought you were talking to me. Uh, thanks so much Nikolay.

[00:37:51] Nikolay: I've noticed when you upload to YouTube, you don't "like" it. I usually fix it myself, and I like our video. Okay. Thank you so much.

[00:38:01] Michael: Take care. Bye.

[00:38:03] Nikolay: Bye.
