Postgres FM | Transcript: Companion databases

November 17, 2023 • 46 Minutes

Companion databases

[00:00:00] Nikolay: Hello, hello, this is Postgres. fm, your favorite one of many. And my name is Nikolay, and, As usual, I have Michael here. Hi, Michael.

[00:00:10] Michael: Hi, Nikolay.

[00:00:11] Nikolay: So today we chose, you chose actually, one of the topics suggested by our users, so we, we can blame them, not us. And tell us which topic are we going to discuss today.

[00:00:26] Michael: Yeah, this is nice. I like picking user suggestions, or listener suggestions, Because I know at least one person will be interested in it. they suggested discussing companion databases, which is a term I hadn't heard before, but made complete sense. So this is the concept of having a different database management system for a particular workload.

So like, they've mentioned Redis, but we've discussed in the past, queuing. Uh, we've, we've discussed analytics workloads. There's loads more that we haven't discussed. Like Vector databases are all the rage these days, aren't they? Document databases, even search specific, databases. So yeah, I thought this was an interesting topic.

I think there's some, probably some generic pros and cons, but also some details in the specifics that, that might be of interest, uh, but I was, yeah, super keen to hear your thoughts and experiences with when some of these things make sense when they don't, and whether we can come up with some rules of thumb.

[00:01:24] Nikolay: Well, first of all, rules of thumb for me is if you can avoid it, avoid it.

But I'm okay with other systems as well. And obviously pros and cons in each case. Sometimes it's easy to see, for example, the Victoria matrix is much better in terms of compression and working with time series and Postgres and even timescale if you check benchmarks, right?

It's obvious, uh, like, we had a discussion with their CTO, how many, uh, gigabytes, one billion of rows should take. And my impression usually for Postgres, like, row store, uh, with very light compression only for toast, uh, it's like kind of terabyte. For them, it's gigabytes. So, like, several orders of magnitude.

I'm not sure gigabytes, maybe dozens of gigabytes, but definitely not a terabyte. And only this already highlights a big difference. Of course, timescale is also good at compression, but not as good as the Victoria Metrics, as I understand. I'm not going to blame timescale, it's great technology.

so, obviously, like, this just, this single...

Example says, like, sometimes you probably want to consider different databases, for example, Greek house. Obviously, uh, it has great possibilities, capabilities to, to handle workloads, so like log like or like stream of some events and so on at a very large scale. You can do it with Postgres, uh, but you will hit some difficulties.

, Obviously, again, if you can avoid it, I would prefer avoiding it if you already chose Postgres because the main problem will be synchronization and like two, two big problems, synchronization and Management of additional tuning and so on, of additional system, requiring some expertise.

[00:03:25] Michael: Would you count, yeah, the expertise, would you count that in the latter? Or like, I'm thinking there's like a certain amount of educational overhead as well, you know, team expertise, that can make sense in larger companies or larger teams, but, yeah, do you count that in the management overhead? I guess it is.

[00:03:42] Nikolay: Right, yeah, well, I say management, it's not only, like, you can find maybe a managed version of some system, but that's how you won't be worrying too much about backups, although I think you should even for a managed system. Uh, and, uh, various stuff like provisioning, replication, failover, HA and so on. But, , it's still, most likely it will be different language, uh, dialect of SQL or maybe not SQL at all. It will add complexity to application, definitely. So, , management probably is not a good thing. Word here, like dealing with something else requires additional effort, not in terms of management, but in terms of like just working with new system.

[00:04:30] Michael: Yeah,

[00:04:30] Nikolay: this is serious. If you add additional system, you need to add a lot of stuff additionally.

[00:04:37] Michael: I like it. I think you're already starting to extract some of the things I was hoping to in terms of the trade offs again. so we've already touched on the analytics one. I think it's particularly interesting because I think there are quite strict trade offs there.

[00:04:50] Nikolay: Mm

[00:04:51] Michael: some of the specialized systems contain features we just don't have even with the Postgres extensions, but yeah, you mentioned we do have some.

Quite deep Postgres extensions with Timescale and Citus that offer, compression. And if you look at some, like I was looking for examples and it's really interesting to see, for example, Heap Analytics, dedicated analytics product, I believe still using Postgres with Citus and making it work even at extreme scale.

And yet I also read very similar, right up from. Posthog, also an analytics product, who, I much newer than Heap, so I'm guessing smaller but maybe not, and already bit the bullet and moved their analytics workloads to Clickhouse, for the benefits. But when you're an analytics product, it can make a tonne of sense, right, to, Pay that migration cost, but the, I think the interesting learning from me from that was they did scale to a certain point using just Postgres. So I think that I like your rule

[00:05:48] Nikolay: Without side ops, you mean?

[00:05:50] Michael: I don't know for sure, but my guess is without yes.

[00:05:54] Nikolay: I remember they used ZFS and ARC too, it was interesting. So, trade offs is synchronization of data and, uh, additional requirements to your, like, development and administration activities. Like, so how about, discussing several particular workloads, and checking, like, these trade offs and...

And so on, like starting from analytical workloads, maybe. What do you think?

[00:06:22] Michael: Good. Sounds good. I've got, so you mentioned those as trade offs. I, I saw they're kind of like downsides, aren't they? They're like the cons of, having a separate companion database. But I think we haven't mentioned some of the pros as much. So like some of them, we mentioned more advanced features.

We mentioned performance, like they can be better performance wise in certain cases, I think there's also a couple of others, like. They can be easier to scale out, well, I think with Postgres, we have really good scale up options, but we don't necessarily have, as easy scale out options where some of these dedicated databases do, and then there's one more, which you brought up a few times in previous episodes, which is reducing the load on the If we've got a Postgres primary that's already fairly maxed out, or we're pushing the, the limits of our managed service provider, for example, we probably don't want to be adding additional workloads to it.

So, or like anything we can do to split out services could be beneficial as well. So those feel to me like the, the pros of going down this path of, of adding a companion database, but then the cons are. As you said, like learning another system, another system to maintain and those, yeah. Synchronization or, or managing transactions, I guess, in some sense

[00:07:37] Nikolay: right. So, yeah, if, if you mentioned, Citus, then we should mention Hydra as well, because like it's, it's a new thing that probably will, we probably will soon hear some use cases, I hope. And I would distinguish, talking about the analytical workloads, I would distinguish two big, uh, different cases.

First is when you have an LTP system as a, like, main system you have, for example, a web or mobile app. And additionally, you need analytics, for example, e commerce, e commerce, and you need analytical system. To analyze what's happening with your e commerce and a lot of stuff. Also, maybe like for BI and so on, like a lot of stuff. You need a lot of stuff there. And, , you can technically build it on top of Postgres. There are systems, I don't know, like Odoo and Some ERP systems also, they can be considered as analytical, uh, built on top of Pentaho from the, like, some, some names from the past.

But probably you, you would prefer, in this case, this companion system like Snowflake or something. In the past it was a Vertica or something. and there is very different case when analytical workload is your primary workload and you are just building some analytical tool and, this is what you deliver.

In this case, probably Postgres. It can be your main system or it can be, like, I don't know, like, non existent at all, right, in this, in this approach. So, speaking of the first thing, when there is all TP in the center and analytical is, like, additional thing, , obviously, by default, growing projects might choose running analytical workloads just on replicas, right?

And this, we discussed this as a quite bad idea. I mean, it's not a bad idea for very beginning because you don't see a lot of TPS and data volumes, uh, a lot of bytes. So, but eventually you will hit into different trade off which doesn't have good solution, which is called host by feedback should be on or off,

[00:09:42] Michael: Yeah.

[00:09:42] Nikolay: right? And as a reminder, if it's on, then running long running queries on statements, uh, transactions actually on replicas with by feedback turned on. It will affect the primary workloads and Autovacuum won't be able to delete freshly dead tuples. If it's off, then it makes your replica a single user replica because it accumulates lag while you run your long transaction and other people are asking, like, why lag, right?

[00:10:14] Michael: Well, and it, it depends a little bit if you need it with, uh, with analytics, there, there can be cases where you don't need like completely real time analytics data. It's okay to

[00:10:24] Nikolay: Fair, yes, fair, but, then, like, it's still... Not fully healthy when lag has accumulated, how to monitor it, is it, like, caused by normal workload, somebody running, like, 30 minutes report or something, or it's already an incident, and then we need to catch up, and then, anyway, other users, from my experience, would expect, they will expect, , this replica not being relevant at all. Being le legging. And, uh, good solution to this problem. So both on and off for host are bad, but the good solution is branching. For example, if you have branching on replica, you can create your branch and analyze everything there and destroy this branch and so on, or cloning, right?

But, uh, it's not, . Everywhere available. So this is the problem number one, and the problem, number two is, uh, what we mentioned already. Postgres, compression options, and especially not only compression, throw store, lack of column store. I don't, I,

[00:11:28] Michael: yeah,

[00:11:29] Nikolay: I don't even think, uh, sharding should be like immediate. maybe, of course, like, for example, Clickhouse, by the way, it's interesting, like, analytical workloads and time series workloads sometimes are quite close to each other, right? So, do you consider Clickhouse as an analytical database system or time series database system?

[00:11:49] Michael: I always think of it as analytics.

[00:11:51] Nikolay: But originally it was developed for, uh, So, analytical, web analytics, system similar to Google Analytics, but, the primary type of, like, the data is time series, series of events, well, it's, is it called Clickhouse? It's a series of clicks, web clicks, with a lot of characteristics of each click. Click means like request, HTTP request or something. So it's time series, basically, also, right?

[00:12:18] Michael: yeah, I see what you mean, but equally, like, I think they also have like vector features now. It's like, just cause, just cause it has, they go, how do you define Postgres? Is Postgres an analytics database as well? Like, I think it's difficult to categorize these things in one category, right? I've seen Clickhouse used for analytics stuff. And yeah, it's mostly time series data. So it's like so much data in this world. How much,

[00:12:44] Nikolay: Yeah, well,

[00:12:45] Michael: collect that isn't time

[00:12:46] Nikolay: We know timescales, point of view, everything is time serious. We know this, right? But I, yeah, even, I think even a human is time serious. It has a birthday. So all humans are time serious data. And everything is, like, everything you have, it has, like, created at timestamp and maybe updated and deleted.

I don't know. Okay, so, uh, Then we should also consider a time like not cyto, hydra, and also time scale DB as, uh, your options to stay, , with Postgres only. But in this case, even in, in this case, if you have strong L TP system and, analytical workloads, which you know will be, handling long grinding queries or transactions, I would suggest having two clusters separately and, uh, connecting them through logical replication.

To avoid this hotstand by feedback problem.

[00:13:42] Michael: And, and if we take it to the extreme, like the extreme at the other end, the small side of things, I think you can start even without any of those. Like it, even with partitioning, we get a lot of the benefits, straight away. And I think with, with a little bit of, uh, I think I've seen, um, people use roll up tables quite effectively. So while, if you can aggregate the data. In a separate table, for example, that can massively improve, uh, very limited analytics workloads, or like, you know, you mentioned the case that you have an RTP workload and you need to provide some analytics over it. If you can do, use rollup tables, you don't even need any of those extensions for quite a while.

[00:14:22] Nikolay: So what, I guess what we try to say is that it's not only like single Postgres cluster versus Postgres and companion Victoria Metrics, for example, or I don't know, like Snowflake. Quite expensive option, but very popular, obviously. We can consider a lot of options with staying inside Postgres ecosystem.

And for example, it. And if you can, reach, good query performance. probably it's fine to run it on a regular, on a single cluster replicas. It will be, this is, I think, what we should call an HTAP case, when we have LTP workloads and analytical queries which don't take 30 minutes. I think... Yeah, several seconds. Well, up to one minute. It's okay for Autovacuum not being able to delete that tuples during some time, some like minutes, maybe 10 minutes. It's okay. It doesn't delete them immediately anyway. It will come back later. And in this case, Hydra is looking interesting because you can define, specific Tables as ColumnStore and run analytical workloads on them.

You may be having some data stored twice inside the same cluster, like originally RowStore and additionally as ColumnStore. And, uh, it's interesting, right?

[00:15:47] Michael: Yeah. For sure.

[00:15:48] Nikolay: yeah, yeah. Also AlloyDB interesting, but it's different. As I understand, this store is raw store, the main storage. But in memory, they also built, like, 90 degrees, how is it called?

I know only in Russian, sorry.

[00:16:05] Michael: Transformed or?

[00:16:06] Nikolay: transformed table, so it becomes additionally column store, and you can run a lot of stuff with them. Aggregate is much faster, but it's not open source, Hydra is open source, so I would root more for Hydra in this case. So interesting, and new options to explore before you think, okay, enough, I'm going to touch Snowflake.

But sometimes you have business requirements, for example, if it's a large company, they build a whole department for analytics, and these guys know how to work. With particular analytical system, and they say, okay, we need data there. Snowflake, that's it. In this case, no options, right? We don't say, let's, let's try Postgres. So it depends on many, many questions here.

[00:16:52] Michael: I would still encourage people to consider it as an option and to ask, to mention it to the team that it might be an option. But if you've already got the expertise that does, that does limit the, the cons, you know, if you, if, if you don't have some of those downsides that we discussed, then it is an easier trade off to make, especially if you have some of the pros, especially if you're looking for, if you know about a feature in Snowflake that you get out of the box that you just don't get out of the box in Postgres or would take additional engineering effort.

And I think a lot of the cases, it's very easy for us to brush over those. I, I don't know these systems intimately, so I'm sure there are features I've never heard of that some people find extremely useful in some cases. Like we haven't talked about, we've, we've got episodes on, uh, queuing on analytics already that I'll link up in the show notes for anybody that missed those, but we don't have one on search, like full text search yet.

I think that will be an interesting episode someday, but for like Elasticsearch is a huge product and has so many features and I'm sure support

[00:17:53] Nikolay: Before we go to this topic, let's finish with analytics. And maybe time series as well. I think, , full text search is definitely a new category to discuss, but, speaking of, analytical workloads, if you already, like, decided to have, uh, additional database, , or it's a requirement, for example, a lot of budgets allocated for Snowflake and you cannot beat that, right?

So they're already allocated. They should be spent, you know, to Snowflake. So, and people hired. So, okay. And there will be question of synchronization. And, , this is interesting, because you, you, you will need to deal with probably logical decoding or logical replication, and probably you will need additional tool.

I mean, it's possible to reach, synchronization yourself, but, If, the company already spends money on Snowflake, probably they will decide to spend money on tools like, I don't know, Fivetran, for example., because, a lot of stuff to solve there in synchronization. And, uh, over time, uh, it's related to logical replication and decoding.

And this, in this case, it's logical replication between two different systems. So, problems of avoiding initial, uh, full resynchronization. Is no joke there. And, , if tool is not right, for example, if you choose click, it will produce, , duplicates on the analytical database side, because usual analytical database, such as Snowflake, BigQuery, what else, like Redshift or something, they don't have unique keys.

And this is by design. And in this case, if a synchronization tool which you built on top of logical decoding, this tool is implemented in the wrong... Way, it will, uh, you will have some duplicates when, replication switches from initial full sync of snapshot of data to CDC, Change Data Capture.

And this is a problem. It's a solvable problem. The tool just needs to be fixed if you have duplicates. Obviously, like the tool might be relying on the presence of unique keys. On the subscriber side, but it should not do because that such tools primary goal are pipes from OLTP to analytical database systems.

Analytical database systems don't have unique keys. If this tool allows duplicates, it's a bad tool, don't use it. Or tell them, uh, fix it right now, or we won't be paying you, or so. so, duplicate problem is solvable, just with careful, uh, organization, orchestration of this process of switching from initial load to CDC.

But actually, that's it. I think today we have quite good capabilities on Postgres side, so data can be moved from OLTP system to analytical system with very, very small lag. And it's amazing. I mean, you can have lag below one minute, definitely, and with a lot of TPS, if everything is done right. Great. Also, if you use some cloud replication tools like Fivetran, I had a bad experience.

It was kind of my fault. So, it was Postgres database RDS, which obviously allows logical replication to anywhere. Logical replication slot is possible to create and logical replication connection from outside as possible. And, uh, I was checking Fivetran. From RDS to Snowflake, Snowflake was also on AWS, but at that time it was, I think it was early days of 5.

3, now they are much, much better. So, like, I was seeing very bad performance on lags and so on in my synthetic tests. You take pgBench and you see the lag is accumulated very... Very fast. And it turned out that 5trend servers were in Google Cloud. So latency, it doesn't make sense. You need, these, like, software as a service, platform as a service, anything as a service, you need to make sure network connectivity is good, I mean, not distant, latencies, throughput, everything you need to check.

And right now, I know, I know 5trend have. Servers in AWS, so you just need to choose proper settings, and in this case, everything should be in the same region, right? In this case, well, unless you need a multi region setup. In this case, performance should be good, and delays, lags of replication should be very low, and I think it's good if your company needs Snowflake, for example, it's good. Or Clickhouse, managed Clickhouse, I don't know, so there is such a thing, right? So probably you need to manage Clickhouse and you need to stream changes from Postgres to Clickhouse. It's possible with some open source tooling, but I think it's great. I mean... Why not? system.

[00:23:07] Michael: also, think about the progress. Like, even 5 10 years ago, it feels like this would have been a dream for many people. Like, the most things I heard about were... Extract, transform, load, kind of nightly processes to a, to a analytics database. So your data was like at, at, you know, a day old normally. Yeah. Right.

But that was, that was the norm even just a few, well, maybe 10 years ago now, maybe I'm a bit older than I realized,

[00:23:36] Nikolay: Many banks still use this nightly load approach, nightly refresh. It's a terrible approach, it should be redesigned. But of course the cost of changes are high, so they keep old systems working. If you wake up at night and try to use something, application says, I'm down for a couple of hours, it's insane.

It should not be so. Yeah, but I would like to emphasize like maybe banking systems, some banks use Postgres, right? So transactions, reliable, open source, quite good, robust, maybe like a cost of ownership is low because no licensing from like Oracle level, but then you need additional systems like banking system or maybe e commerce.

E commerce might be a good example here because obviously they need good analytics. Systems. So maybe this is, this is where a combination of Postgres for, order processing and so on. And, uh, something like Snowflake or Vertica or something for, um, analysis is a way to go.

[00:24:43] Michael: Cool. I would be keen to get your thoughts on some of the other, use cases for.

[00:24:48] Nikolay: Yeah, let's discuss, uh, what, uh, full text search,

[00:24:52] Michael: Yeah. I think that's a great, yeah. A good, good starting point.

[00:24:55] Nikolay: yeah, I think, uh, there are a lot of benchmarks saying that, uh, Elastic is better than Postgres at full text search. Right? But, uh, unlike analytical workloads, which usually, , reorganize data in some form for analysis and so on, and usually you synchronize some data there and analyze, in the case of full text search, you, I think, want the search to be a part of your main RLTP workload, right?

Because... User, for example, works with some pages or, I don't know, buttons, and then the user decides to search something. And for user, it's a part of everything on this OLTP mobile or web app, right? So it's a part of, and this is a big difference from analytical, from analytical system. So from analytical case we just considered, this is the main difference, because full text search is a, should, usually is a part of. The main workload, all GPU, like.

[00:25:58] Michael: but I think it can really depend, but yeah, I think, I think you're probably, well, you're probably right, but equally user facing analytics, people might expect to be up to date immediately as well. So I think there is, there are cases for both of these to be like, depends how, how much that data is updating, like how much is it user generated data versus like system data, or I do think there's a.

There's interesting cases in both, on both sides, but the, the, the full text stuff I, I've seen, I, I don't trust most benchmarks I see, I think it's, I see a lot of benchmarks showing whichever vendor is doing the benchmark, uh, hap, just miraculously comes out on top, yeah, exactly, and I do think Postgres has, until Until relatively recently suffered a little bit from not having that marketing machine behind it. People super incentivized to do,

[00:26:51] Nikolay: Yeah, yeah,

[00:26:52] Michael: the marketing efforts, but now we do have, now we do have a lot of commercial entities doing benchmarks that show, you know, timescale doing various things and super base doing various things. I do think those two make a good effort to. To make things fair, and most, I'm sure most companies try and make things fair, but they know their system way better than they know their competitors systems.

So there's always going to be bias in those. So I'd really, really, I think maybe if we, if I could only encourage one thing for people from this, if you are comparing Postgres to a companion database, don't trust the benchmarks from those two systems. Do your own, do your own benchmarking.

Test your own kind of workload.

[00:27:34] Nikolay: So, so, actually, you are right. And in this case, probably Postgres full text search is good, but it really requires also a lot of effort. I mean, in this case... If it's a part of the main workload LTP, then you have two options. For example, we deal with Elastic, and it also will require some effort to use.

Obviously, this system is targeting only full text search. Great. Now also VectorSearch, I'm sure they have, right?

Of course, of everyone. But if you choose to use full text search, it has a lot of capabilities, but it will require also an effort to set up properly the dictionaries, various ways, manipulative things like how to transform text to, Test vector, for example, and query to a test query properly, and the query language probably it will be not as powerful as at Elastic, but it will be possible to combine it in a single query with other Postgres capabilities.

This is the power, right? And data is... Here, in the same database. So, this is super big pro for, to use full text search. And I must say it's quite advanced. You will require many days to understand all details which are possible. But looking at the benchmarks, I just looked at before our podcast. I personally didn't touch full text search probably few years, but I know it's not progressing.

fast now, because, you know, the same people who were driving progress of full text search from Russia, they were, they founded Postgres Pro and had different focus, different challenges. So, I think, uh, I think full text search is, like, it would be good have a new wave of attention to be developed further, many, many, like, things to improve, because it's a very wide topic, very wide.

But before, like, a few years ago, I touched it a lot and implemented a lot of things myself on using full text search. And I must say, of course, I was trying to avoid Elastic all the time because of these Data is here, I don't need to synchronize it, no dealing with lags. No transformations except to test query to test vector, and it's great, but obviously, still, like, Elastic offers a lot of stuff.

Some people say I want to benefit from both, and there is project ZomboDB, right,

[00:30:09] Michael: Yeah. Really impressive.

[00:30:12] Nikolay: Yes, and I'm impressed with techno... Ah, I mentioned the benchmark this morning, Elastic versus Postgres. I usually, when I have a few minutes, I usually quickly find in any benchmark, I quickly find downsides in Postgres.

You remember recently some people said they are going to beat full text search Postgres. It was, how's it called?

[00:30:38] Michael: Parade DB, I

[00:30:40] Nikolay: Paradb, and they just forgot to create index, gene index. It's not a good idea to compare index less, obviously. It's like, okay, it was, it was Clear from the curves they published. So, but, uh, the benchmark, Postgres versus Elastic, I've checked right before our podcast.

I'm trying to quickly find problems with Postgres stuff, and I didn't find problems with Postgres stuff. They, they, they're two different problems, and it was named like, uh, Postgres full text search, of course, loses to Elastic in terms of performance. So, Maybe still there is an opportunity to tune Postgres there, but maybe in many cases it's quite like a valid statement, maybe Elastic handles full text search workloads better.

Okay, so if you decide to synchronize, look at ZomboDB, right?

[00:31:32] Michael: Well, yes. So, so Zombo is really interesting. I don't know exactly how it works, but my understanding was that it. It's backed by Elasticsearch, but it's using, like, you query it through Postgres, so it's like an index access method, I believe, so you're using the Elasticsearch index from within Postgres, so you get all those same benefits of being able to Join it with other data in your system and it handles some of that transaction boundary stuff for you, which sounds like it would be a nightmare. So the, yeah, it's, it's Eric Ridge, isn't it? And his team are behind that. And it's open source as well, right? I think even Apache license.

[00:32:10] Nikolay: yeah, that's, that's great. And, this reminds me actually that in the previous category we forgot, I forgot to mention PeerDB, which is a new company, YC backed. There are many Postgres related companies which are YC backed now. And, uh, PeerDB solves the problem of, uh, synchronizing data.

From Postgres to Data Warehouse, Analytical Workloads, and so on. Analytical systems. And, they promise to do it much better than others. So, also worth looking at them, but they just started recently. So, very young company. I've met with founder recently here at San Diego. And, it was very good discussion.

I enjoyed it. And also, people which attack particular area, they have a lot of great ideas. So I'm glad to see the Postgres ecosystem is expanding in terms of various additional tools people need. Alright, so okay, so Postgres to analytical system, like I mentioned many names, but also look at PureDB, which is a new player, and Postgres to Elastic, ZomboDB, right?

Great. So, I would try to convince everyone to stay within Postgres only to solve full text search problems. Also, uh, 3 gram search. But I, like, I know how Gene sometimes behaves. Maybe you won't reach those levels of workloads, but I don't know, like, tuning, it also requires effort. So maybe two systems here is not a bad idea. Maybe,

[00:33:45] Michael: And, and again, maybe in some cases, right? Like maybe there's a, it's a scale thing for ages, you might be fine on just Postgres. And then at some point in scale, you want to. Switch. It might be one of those trade-offs again.

[00:34:00] Nikolay: but still, like, to me, being at Postgres, Guy, purely, of course, full text search in Postgres is very great, a lot of options, uh, my biggest concern is actually not related to full text search itself, but the lack of ability to have a single index scan which will solve, which will work on full text search, but also order by time or ID in reverse order, yeah, yeah, so it was the

[00:34:23] Michael: index thing.

[00:34:24] Nikolay: The idea behind RAM indexes, which, like, still, like, I think this will be good, great to improve and solve, but this is a hard topic.

And I'm sure, I'm not sure Elastic solves this problem well.

[00:34:38] Michael: Yeah. I don't know enough about, like, it feels to me like a di an extra dimension of data. It feels to me like this might be where some of the, uh, work around dimensionality indexing around that thing might be good. But again, maybe with the downside of some missing date, like maybe accidentally mi missing some posts 'cause of the, the

[00:34:57] Nikolay: Yeah, but this is a good thing to solve. And it's not only about full text search, it's also about semantic search

[00:35:05] Michael: Mm-Hmm.

[00:35:05] Nikolay: that pgVector can provide. For example, I want fresh data to come first, which... Meets my requirements in terms of full text search or vector search, but I want fresh first. It's natural desire, you know, from social media applications, for example.

[00:35:22] Michael: Yeah. Cool.

[00:35:23] Nikolay: yeah. So, what else? Uh, like, Q like workloads, we discussed them quite well, uh, some time ago, and I would like to... Emphasize that. I don't, I don't see people use logical decoding or replication for like from Postgres to, I don't know, some Kafka. Oh, why I don't see it? I see it, Debezium, right? So Debezium project, open source, it's exactly what it's needed for.

But, now we have, this new pgMQ system, right? From tembo. io, a new company, which is worth looking at. I haven't tested myself yet. And also here I would like to mention this problem of synchronization, which by default is So, synchronization should not be done with ListenNotify because of, a lot of restrictions they have, this mechanism has. Logical decoding is great for synchronization. It will require some effort. But also there is a approach maybe for Q like workloads which is more standard when you... have in Postgres something and then you want to, insert or update it somewhere else, there is two phase commit, which probably you want to avoid because it's slow. But there is, uh, from microservices, they invented a lot of patterns, like object oriented programmers did, a few decades ago. So there is so called, transactional outbox pattern. Which I find quite good to learn, and it's simple, but it helps you avoid problems. So, for Q like workloads, if you offload workloads from Postgres, maybe this is the way to go.

For example, Rails, sometimes, Rails people prefer Sidekiq, for example. Some, some rail people prefer. In podcast there are many libraries like delay drops, right? But some people prefer to offload tasks to sidekick. And Sidekick is

[00:37:23] Michael: Redis.

[00:37:24] Nikolay: Exactly, this is what I wanted because original request was about Radius, right?

[00:37:28] Michael: Yeah, true.

[00:37:30] Nikolay: So this is, I, I know we're about to finish because like we don't have a lot of time, but I needed to To lead us to this point, you know, like, okay, so Postgres, Rails, code, and Sidekiq with Redis behind. It's like some kind of popular combination, I think. But, I think many people maybe, maybe have issues with synchronization.

Probably you're not noticing if they do it strictly from rails, not involving any two, two point com, like two phase commit and so on. You inserted the posts but didn't insert, to sidekick you. You updated like, I mean, you need to see how synchronization works. And this transactional out outlook is quite simple.

It's not. Rocket science at all. You just have a special table, like you have a propagated symbol there, then consumers like offload it to this Sidekiq reliably, and in Sidekiq you already have this processing. It's not already our problem, it's Redis problem to process those events, right? Not to lose them, and so on and so on.

But I think in this particular case, Redis, right, Probably, what is needed for such kind of foreclosures, like event processing. Let's look at Postgres. Usually, if you look, like, look deeper. Some systems, two things to check, as we discussed in Q, Q episodes, like, too long, too long don't read approach.

First is check how, like, if they delete or update intensely, it's bad. I mean, how old events are cleaned up? It should be either truncate or drop partition, to avoid bloat and autovacuum issues, vacuum issues. And second, if you need insane scale, I mean, multiple workers, it should have SelectForUpdate. SkipLocked. Sorry. . This is it. These are two things for proper performance long term. I mean, not degrading performance, which is not degrading over time. Good. That's it.

[00:39:38] Michael: I think so. Oh yeah. Last shout out to the one that's probably the most in flux at the moment, which is the vector database, specialism. I, I, I actually think this might be a case where we don't know how the dust is going to settle yet. It feels so, there's so much progress, and there's so many, so much money involved, so many companies, very well funded,

[00:39:58] Nikolay: is high.

[00:39:59] Michael: yeah, and, and also there seem like there's still huge leap forwards.

The most recent release of pgVector the numbers that they released in their 0. 5 update, because they added the HNSW Uh, indexed, indexed type or how algorithm, I don't even know how to describe it, but that was such a huge leap forward in a couple of ways that you might want to optimize things that who knows what's in it, like who knows what progress there is down the, down the line there.

[00:40:30] Nikolay: Yeah. So, I want to say the important thing, uh, if you, like For example, first of all, I use it daily, pgVector, but I'm not, I haven't reached that scale, we haven't reached that scale where we need good index because we only have less than 100, 000 documents yet, we will have a lot more in this case, I mean, I haven't I haven't done my own benchmarks, or my team haven't done it. Hasn't done it yet. We will, because eventually we'll have many, many millions, I hope, may, maybe dozens of millions of documents. It's by the way, Postgres related, but of course we use this HNSW, uh, which is, stands, stands for.

Hierarchical, navigable small worlds. Weird name, by the way. I think some, some guy from, or a bunch of guys from Russia and from my own university invented it like five years ago or so. Maybe it's not there. This term is not from them, but the algorithm, everyone tries to implement this from there. I was surprised.

So, I want to say that there is a very popular benchmark which compares various vector databases versus Like, Postgres and Elastic, and so on. Every, every other, like, uh, already existing systems, which just added some, kind of plugins, extensions capability. And I must say this, , comparison still lacks HNSW, capabilities from pgVector, because it benchmarked an older version of pgVector.

That's why pgVector loses. drastically there. So, don't trust those benchmarks, it became better, and don't rush into using, for example, Superbase had an article, let's touch it, Paul just also wrote me, that pgVector with this HNSW example. Thank you. Indexes, it's a lot faster than Pinecone, so look out, look, look at their blog post, Superbase blog post, uh, with additional benchmark, and I hope that very popular vector benchmark will be also updated soon, right, so to, because it's not fair, you know. So Postgres is good here, actually. And

[00:42:46] Michael: it, yeah, it looks, it looks really good the, but please, again, don't trust any benchmarks people like check for yourself. I do trust the Superbase people. I've read that exact, I've read that exact post and it's, it seems great. They even spend more on the Pinecone instance than they're

[00:43:02] Nikolay: You know,

[00:43:03] Michael: Superbase instances. It seems like as fair as you can get, but how are we to know, like, that they're, they're great at tuning Pinecone it's really difficult.

[00:43:13] Nikolay: yeah, I cannot keep it, so a small spoiler, our bot will be capable of running benchmarks, conducting benchmarks, and I think we should do it as soon as possible. This particular case, let's, let's check pgVector with various options, various indexes, we will do it. And it will be super easy for everyone to, to reproduce. We're just chatting with both.

[00:43:35] Michael: on Postgres, but not like for

[00:43:37] Nikolay: Yeah, just Postgres. We focus only on Postgres, but you, you can get all the same, like, , metrics, , and compare and see TPS latencies, all the details, a lot of details. So then think how to improve, and not only you will be thinking, but... Okay, enough. So, so vector workloads is good at Postgres, just...

Try to use them, don't trust others, because others want to build new systems. They will be saying, Postgres is bad, it's obvious, right? But it's not, if I trust Victoria Metrics CTO and I see numbers myself or Clickhouse, I know this technology is super good at both, or Elastic. Here I don't trust it yet, let's, let's give Postgres a good chance

[00:44:24] Michael: do you know what, that's a, that's a really good point. And something I forgot to list in the, pros for keeping a workload on Postgres is, and it is mentioned, I've still the wording exactly from the Superbase blog post, but they've said Postgres is battle tested and robust, whereas most specialized database haven't had time to demonstrate the reliability doesn't mean they're not going to be in time, super reliable, but a

[00:44:44] Nikolay: takes 10 years at least. Exactly. 10

[00:44:47] Michael: It can, yeah, so, yeah, really good note to, and, and obviously.

[00:44:52] Nikolay: bugs, ACID, a lot of stuff. Yeah, it takes a lot of time. Even if you change the part of Postgres, you need a lot of time to be stable. So, there are systems each time I touch that demonstrate issues very quickly. So, but

[00:45:09] Michael: Well, um, like there's those, this, uh, that group that do is, um, Jetson, like the, like the doing serious testing of, of systems,

[00:45:20] Nikolay: systems particularly.

[00:45:21] Michael: well, yeah, but even like finding bugs in non distributed systems, transaction isolation levels and things like that, some, some really impressive people doing some really impressive work, but they won't have done that on the, on newer databases.

And they wouldn't like, people wouldn't have had a chance to fix some of those issues yet. So. Yeah, it's a really good point. And I don't mind us being a little

[00:45:40] Nikolay: but as we see, but as we see from full text search and analytical use cases, it's possible to beat Postgres for particular workloads for sure,

[00:45:48] Michael: Yeah. Yeah. Good point. Cool. Thanks so much, Nikolay. Thanks everybody and keep the suggestions coming.

[00:45:55] Nikolay: Yeah, thank you. Bye bye.

Creators and Guests

Host

Michael Christofides

Founder of pgMustard

Host

Nikolay Samokhvalov

Founder of Postgres AI

Companion databases

Creators and Guests

Some kind things our listeners have said