
When not to use Postgres
Nikolay: Hello, hello, this is
Postgres.FM.
As usual, I'm Nik, PostgresAI,
and as usual, my co-host is
Michael, pgMustard.
Hi, Michael, how are you?
Michael: Hello, Nik, I'm good,
how are you?
Nikolay: Very good.
So, the topic you chose is to talk
about beyond Postgres, when
we should avoid using Postgres,
right?
Michael: Yeah, you put a shout-out on a few social networks asking people what kind of questions they'd like us to answer, and we had lots of good suggestions, as we've had for many years now. One of them was particularly good and, I thought, worth a whole episode, which was, yeah, when not to use Postgres. I think there's a growing trend, or a few popular blog posts, of people saying you should consider Postgres for most workloads these days, and I think it's still an interesting topic to discuss: are there cases where it doesn't make sense, and if so, what are those? When does it make sense not to use Postgres? I was interested in your take on some of these as well.
Nikolay: Yeah, well, classic example
is analytics.
Michael: Do you want to list a few and then discuss them in a bit more detail?
Nikolay: Yeah, let's create a frame
of this episode.
So analytics, embedded databases.
Michael: Yeah, so analytics, embedded, and I think storing large objects; there are some cases where
Nikolay: it makes sense.
Pictures in a database.
Michael: Exactly, especially larger
ones like videos, like very
large objects, 100%.
Nikolay: And let's just agree that in every area we can discuss the pros and cons of using Postgres, because some people definitely have the opinion that it's not an excuse to avoid Postgres. With this in mind, let me then add a topic like ML datasets and pipelines. Right? Yeah. Machine learning and big data.
Michael: I think anything where
there's specialized databases
like search, vector databases.
Nikolay: Vectors, exactly.
Let's talk about vectors separately.
It's worth it.
Michael: And then one more, well, actually, I had two more on my list. One was potentially controversial: I wondered if there's a case where you're at extreme OLTP, write-heavy, very, very write-heavy, and, let's say, you've got institutional experience with Vitess. I think sticking with that for the moment makes a lot of sense. So, for when not to use Postgres, I wondered whether, if a new project came along, starting with Vitess while we get these sharded Postgres, like OLTP sharding Postgres, solutions up and running, maybe at the moment it still makes sense to not use Postgres there.
Nikolay: Let's add then time series.
Yeah.
Let's discuss this area, time series,
and also data that can
be compressed really well.
This topic is close to analytics.
True,
Michael: yeah.
And then I have 1 more, but I guess
it's kind of what I said
just now.
I think If you or your organization
have tons of experience with
another database, the argument
for using Postgres for your next
project is weaker.
I'm not sure it's like when not
to use Postgres.
I think I could think of lots of
counter examples.
Nikolay: Well, this is an orthogonal discussion. You could say, if we already have a huge contract with Oracle, already signed for the next 5 years, it's not wise to start using Postgres, and that money would be spent for nothing, right? There are many such reasons.
Michael: Should we stick to technical
ones then?
Nikolay: Yeah, like areas, types of usage. Queue-like workloads, I would add.
Michael: Yeah, interesting.
Nikolay: Yeah. That last one is kind of Kafka territory. Or there are others, of course.
Michael: Yeah.
All right, should we start with analytics then? I feel like... I know we did a whole episode, kind of a whole episode, on this.
Nikolay: Yeah, so a row store is not good for analytics, and SELECT COUNT will always be slow; you need denormalization or estimates, and estimates are, well, not slow, but too rough and sometimes too wrong. Yeah, it sucks.
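For context, a minimal sketch of the estimates idea mentioned here, assuming a hypothetical posts table: the planner's catalog statistics give an instant approximate count, only as fresh as the last ANALYZE or autovacuum run.

```sql
-- Fast estimated row count from planner statistics, instead of a full
-- SELECT COUNT(*) scan; 'posts' is an illustrative table name.
SELECT reltuples::bigint AS estimated_rows
FROM pg_class
WHERE relname = 'posts';
```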
Michael: I think there's a scale,
I think we've talked about
this before, but there's a scale
up to which you'll be fine on
Postgres.
You could achieve better performance
elsewhere.
But if you have hybrid, a lot of
systems now are hybrid, right?
Like they have to be transactional,
but they have to provide
some analytics dashboard for users
or something like it, but
they still want real-time data.
They still want transactional processing; maybe 90 to 95, maybe even 99% of the workload is transactional, and there are only a few analytical queries from time to time.
I still think those make a ton of sense on Postgres.
Nikolay: Everyone needs counts, right? Pure OLTP applications need to show counts, or handle pagination and still have counts. If you think about social media, you will need to show how many likes, comments, reposts, and so on. And to implement that purely in Postgres, a good approach to denormalization would be needed. Also, there are so many rakes lying around here; you can easily step on them and end up with a hotspot. You know, the classic hotspot in an accounting system: tracking balances, where, let's say, we have one whole balance and all transactions update a single row. Yeah. So this is where it can be an issue.
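A minimal sketch of that hotspot and one common mitigation, with illustrative names: instead of every transaction updating one row, split the counter into N slots and sum them on read, spreading row-lock contention.

```sql
-- Split one hot counter into 16 slots; writers pick a random slot.
CREATE TABLE like_counts (
    post_id bigint NOT NULL,
    slot    int    NOT NULL,              -- 0..15
    cnt     bigint NOT NULL DEFAULT 0,
    PRIMARY KEY (post_id, slot)
);

-- Each increment touches a random slot instead of one contended row.
INSERT INTO like_counts AS lc (post_id, slot, cnt)
VALUES (42, floor(random() * 16)::int, 1)
ON CONFLICT (post_id, slot) DO UPDATE SET cnt = lc.cnt + 1;

-- Readers aggregate the slots.
SELECT sum(cnt) AS likes FROM like_counts WHERE post_id = 42;
```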
And yeah, at the same time, there are many attempts to do better, and some attempts led to companies being acquired, right? I mean, Crunchy. There are many new ways of aiming to solve this problem, better analytics for Postgres. I see two big trends right now. One trend is how Sid, founder of GitLab, recently had a post saying that Postgres and ClickHouse are a great couple of database systems. I don't remember the exact title of that post, but the idea is that they go together very well. We also had Sai, founder of PeerDB, which was acquired by ClickHouse. I met with him last week, and we talked about, again, the same idea, that ClickHouse and Postgres are great together. This is one direction. Another direction says that ClickHouse is very different. Not even just maintaining it; maintaining is absolutely different, but using it also requires a different mindset and skills. So it's better to choose, for example, DuckDB: do everything inside Postgres, but in a smarter way. This is what multiple companies worked on recently, and one of them, Crunchy, was acquired by Snowflake.
Michael: And we did an episode on pg_duckdb as well, so a slightly different approach to that. But yeah, the Crunchy one's interesting, because all the queries go through Postgres, but a lot of the data is stored in Iceberg or some other file format. Yeah, exactly. It's on the columnar side. Yeah. So yeah, this definitely feels like one of those ones.
There's also a third option, by the way: these massively parallel databases. Well, you spoke to Sai last week; I was at a UK event this week, and there was a presentation from the successors of the Greenplum project, the kind of open source successors, which is called Cloudberry. It looks like really interesting work. But that's another way of doing some analytics from within Postgres, kind of.
Nikolay: Yeah.
And from previous experience, from the past, I remember cases where Postgres and Greenplum were combined in one project, and it was great; it was some bank, even quite a big one. But somehow I haven't looked at Greenplum for quite a long time already, I don't know.
There are also, of course, commercial databases.
I remember Vertica, there is Snowflake, super popular, it's a
major player in this area.
By the way, I would distinguish two areas of analytics. One is internal needs: for a company, we need to understand how the business is doing and a lot of other stuff; we need a lot of reports. And the other need is to show our users some counts, like I said, social media. So two big areas, I think.
Michael: Yeah, good point.
Nikolay: In the first case, users are internal.
In the second case, users are external.
I'm pretty sure there are a lot of mixed cases, additional cases.
But I personally like these 2 directions.
Of course there are others; there is also Redshift on AWS.
Michael: Also originally based on Postgres, yeah.
Nikolay: So there are many options here.
Michael: So yeah, is the short version that, at sufficient scale, it probably doesn't make sense to be using Postgres for analytics, but that level is quite high?
Nikolay: Yeah, but I also see cases where companies go to Snowflake and then try to escape it.
Michael: Come back, yeah.
Nikolay: Okay.
So going to Snowflake, it's like going to Oracle, in my opinion.
Michael: You mean like in terms of financial?
Nikolay: In terms of vendor lock-in
Michael: and so on.
Nikolay: Because it's just purely commercial offering.
There are of course many tempting things there, features, performance,
integrations.
Michael: Nice product to use as well, yeah.
Developer friendly.
Nikolay: Yeah, well, users love it.
I agree.
But if we try to remain more open source, with less vendor lock-in, then it should be excluded. Even ClickHouse... well, ClickHouse is open source itself.
Michael: Yeah.
Right.
Well, you mentioned time series
being quite close to this.
I feel like we should jump to that
next.
What do you reckon?
Nikolay: Well, TimescaleDB is great, but it's also kind of overlooked, because it's not open. Yeah.
Michael: Yeah.
So because of their license, other
cloud providers can't provide
Timescale as a service easily.
Or at least not the version with
lots of nice features.
Nikolay: Yeah, in Timescale Cloud I had a recent case where we ran into limitations again, very badly. Like, CREATE DATABASE doesn't work. And moreover, a lack of observability tooling. Again, as I keep promoting on this podcast: if the folks who build platforms are listening to us, you must add pg_wait_sampling. Unless you are RDS, okay.
But even in the case of RDS, we've talked about this: it's great to have it in a SQL context and be able to combine wait event analysis with regular pg_stat_statements analysis. And pg_stat_kcache, an additional very good observability point.
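A sketch of the combination being described, assuming pg_wait_sampling is installed with per-query profiling enabled alongside pg_stat_statements; column sets vary by extension version.

```sql
-- Join wait-event samples to query statistics by queryid to see which
-- queries spend time waiting, and on what.
SELECT p.queryid,
       left(s.query, 60) AS query,
       p.event_type,
       p.event,
       p.count
FROM pg_wait_sampling_profile p
JOIN pg_stat_statements s USING (queryid)
ORDER BY p.count DESC
LIMIT 10;
```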
Because I had a case where people just compared everything, saw worse performance, and worked closely with Timescale. In the case of RDS, you see Performance Insights and understand where we wait, right? In the case of Timescale, only rare collection of samples from pg_stat_activity is possible. It's sometimes good enough, but it's quite a rough tool for analyzing performance. So yes, such things are lacking.
And unfortunately, more and more I come to the conclusion that when I recommend TimescaleDB to customers, it contradicts their wish to stay on a managed service.
Michael: Yeah.
Yeah.
Because they're down to a single
choice.
Yeah.
Nikolay: Yeah.
That being said, Timescale Cloud even offered me a bounty if I convinced someone to go to them. And this is great, I love loyalty programs, but I need to be fair: some pieces, big pieces, are missing, unfortunately.
Michael: Yeah.
And again, Postgres, even without Timescale, can be used for good time series workloads up to a certain point. We're just talking about very high scale, where you need all the features like compression, continuous aggregates, automatic partitioning.
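The features being referenced, sketched with TimescaleDB's API; table and column names are illustrative, and exact signatures vary across versions.

```sql
CREATE TABLE metrics (
    ts        timestamptz NOT NULL,
    device_id int         NOT NULL,
    value     float8
);

SELECT create_hypertable('metrics', 'ts');    -- automatic partitioning by time

ALTER TABLE metrics SET (
    timescaledb.compress,
    timescaledb.compress_segmentby = 'device_id'
);
SELECT add_compression_policy('metrics', INTERVAL '7 days');  -- compress old chunks
```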
Nikolay: Straight to the point. By the way, for time series, ClickHouse is also still a good option, and there is also VictoriaMetrics, right?
Michael: Yeah.
Well, I learned just yesterday that even Cloudberry has incrementally updated materialized views. I need to look into it, but that's quite cool. And maybe that would be...
Nikolay: What do you think about this? Wouldn't it be great to have in Postgres something like UPDATE MATERIALIZED VIEW ... WHERE, where you just define the scope? And also concurrently.
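For contrast, a sketch of what stock Postgres offers today: only a full refresh, though it can run without blocking readers if the materialized view has a unique index. Names are illustrative.

```sql
CREATE TABLE events (
    id         bigserial PRIMARY KEY,
    created_at timestamptz NOT NULL DEFAULT now()
);

CREATE MATERIALIZED VIEW daily_counts AS
SELECT date_trunc('day', created_at) AS day, count(*) AS n
FROM events
GROUP BY 1;

-- Required for REFRESH ... CONCURRENTLY:
CREATE UNIQUE INDEX ON daily_counts (day);

-- Recomputes everything, but readers are not blocked while it runs.
REFRESH MATERIALIZED VIEW CONCURRENTLY daily_counts;
```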
Michael: We should do a whole episode. I think there are several projects that have started to look into incrementally updating materialized views, and I think they're more complicated than I ever realized. It's one of those topics where the more you learn about it, the harder you realize it is.
Nikolay: Right now I'm in a position where most, not all, but most of our customers are on managed Postgres, so it's really hard for me to look at extensions which are not available on RDS, Cloud SQL, and others.
Michael: I understand. I'm just thinking it's worth learning from the extensions as to what would be needed in core: what did they try, what was difficult about that? And it's not just extensions, right; there are whole companies that have been built on this premise. Is it called Materialize?
Nikolay: Yeah, yeah, yeah. I haven't heard from them for a few years. I'm curious what's happening there. Lack of autonomous transactions will be an issue,
Michael: right?
Yeah.
Nikolay: Or a queue-like tool inside Postgres, so updates would be propagated asynchronously through a queue. If everyone had PgQ, like Cloud SQL has, from Skype, developed 20 years ago, then implementing incrementally, asynchronously updated materialized views would be easier.
Michael: Well, yeah, async and
sync is...
Anyway, this is an
Nikolay: interesting topic.
Yeah, yeah, yeah.
And basically we just touched queue-like workloads. It's still hard. Bloat is an issue, right? We discussed it, I think.
Michael: Well, we've discussed that there are solutions, right? I actually think queues is one of the ones I was going to fight you hardest on. I think there are ways to do it badly within Postgres, and at extreme scale especially, I think it wouldn't be smart to put it in.
Nikolay: I don't agree.
Skype had extreme scale.
Michael: Yeah, well, yeah.
Okay.
Nikolay: 20 years ago, 1 billion
users was a target.
Michael: Good point.
So maybe, of all the ones we added to potentially be on the list, that would be one where, if you manage it well, like PgQ at Skype did with partitions, I actually think it's not an excuse to not use Postgres, if that's the title.
Nikolay: Yeah, the benefits are serious. The problem is, when I recommend keeping queue-like workloads inside Postgres, I say it requires understanding the whole complexity and the hidden issues; you need a high level of understanding of what's happening. But if you have that, it will be great. It will be great. It's just not a small project, unfortunately, usually.
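A minimal sketch of the standard building block for such projects, with an illustrative schema: FOR UPDATE SKIP LOCKED lets competing workers claim batches without blocking each other, though vacuum and bloat still need the careful attention described above.

```sql
CREATE TABLE job_queue (
    id      bigserial PRIMARY KEY,
    payload jsonb NOT NULL
);

-- A worker claims and removes a batch in one statement; rows locked by
-- other workers are skipped instead of waited on.
DELETE FROM job_queue
WHERE id IN (
    SELECT id
    FROM job_queue
    ORDER BY id
    LIMIT 10
    FOR UPDATE SKIP LOCKED
)
RETURNING id, payload;
```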
Michael: Yeah, good point.
Nikolay: And there's this recent case with NOTIFY as well, because it's also sometimes used in such workloads. As a reminder, NOTIFY takes an exclusive lock at the database level, serializing all NOTIFYs, which makes it basically not scalable.
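For context, minimal LISTEN/NOTIFY usage; the scalability concern raised here is that committing transactions which issued NOTIFY serialize on a global lock.

```sql
LISTEN job_events;                        -- consumer session
NOTIFY job_events, 'job 42 finished';     -- producer; delivered at commit
```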
Yeah, all right. Anyway, what's the answer? If you needed to create a project and had to think about analytics, like, okay, we will have terabytes of data very soon, fast growing: what do we choose for analytics? What do we choose for queue-like workloads, for time series? What are the choices you would make?
Michael: I think it depends a lot, as you already said. With analytics, are we talking about a bank that is doing nightly loads of data and only cares about internal reporting, or are we talking about a user-facing web app, or a social media app, that has to do counts and various aggregations that are user-facing?
Nikolay: Well, you need both. You need to think about both, and about what architecture choice you would make at the beginning. Yeah. For a target of terabytes of data in one year, for example, 10 terabytes in one year: what would you do? Would you choose to stay in Postgres?
Michael: I'm a big fan of simplicity. I think I would stick with Postgres for as long as possible, until it was painful.
Nikolay: Okay, and re-engineer then.
Michael: Yeah, and I know that would be painful, but that would be my preferred option. Then, I think it's quite new at the moment, but I am tempted by the work that Crunchy started on moving data out to Iceberg format and still querying it from Postgres. I like that I can keep queries going from the same place.
Nikolay: But is it possible on Cloud SQL or RDS?
Michael: Not yet, right? But I think it's quite early. If I was starting a project today, I would hope that they'd catch up in time. And if not, then ClickHouse, with whatever PeerDB is called now within ClickHouse; having that go out to an analytical system like ClickHouse makes a load of sense to me. What do you think?
Nikolay: I would choose self-managed Postgres, 100%, and I would use full-fledged TimescaleDB.
Michael: Got it, OK, yeah.
Nikolay: And then I would additionally consider the DuckDB path at some point. And queue workloads, queue-like workloads, I would engineer carefully and squeeze as much as Postgres can give. Yeah. And what else did we touch on?
Michael: So again, all of these are actually sticking with Postgres, and then just at some point in the future you're going to have to think about sharding if you
Nikolay: can't, in the future. Only one reason would make me do it.
So in these three areas we just discussed, I wouldn't go away from Postgres. Although I understand very well the customers who have, and why they say they need ClickHouse or something. In my company, I would add ClickHouse only if there were a strong team which needs it; then it's their choice, I delegate, and I'm not involved in that decision. But while the choice is still mine, I would stick to Postgres and just make it work better. It can scale until IPO. I saw it several times with several companies.
Michael: Cool, makes sense.
So I wonder if this next one's going to be the first one where we both wouldn't use Postgres, which is storing large objects, like large files.
Nikolay: Well, definitely, yeah. The last time I tried to store a picture inside Postgres was probably 2006 or 2007, when I was just exploring, you know, oh, this is working, okay. But no, I don't even know what would happen; this piece of Postgres I touch super rarely, you know.
Michael: Yeah.
I think the one exception is text-based stuff, stuff that you might want to query, but even then you probably want to be doing... PDFs.
Nikolay: PDFs, yeah.
Michael: But you know, it's like
some representation of the PDF,
not the actual PDF, but the text
extracted from it.
Oh, yes.
Nikolay: It's going to be... Or it can be Markdown, and then we have Pandoc or something which converts it to both HTML and PDF. This is what we do with our checkups; originally they're in Markdown.
Michael: Yeah, and by the way, another possible exception, though I think it's almost not worth discussing, is if you only need to store 5 pictures, maybe, you know. But I just don't see many cases like that.
Nikolay: Yeah yeah yeah.
Still cool.
Michael: All right.
Nikolay: Yeah, we recently implemented attachments in our own system, for pictures and various archives of logs and such, which customers upload or we upload; PDFs as well sometimes. Of course we store them in GCS in a secure manner, not in Postgres, 100%, now. Yeah. I don't even think about it anymore. It's just an exercise for those who probably don't have other tasks.
Michael: Yeah, all right.
Another one that I think is maybe in between these two: advanced search. I still think there are search use cases where people that I respect and trust would still choose Elasticsearch over Postgres, even though they know that...
Nikolay: By the way, sorry for interrupting; I just realized you talked about blobs. We also can store those. And this can be an issue with upgrades, I've heard, right? Major upgrades. I don't see them at all recently. So, I mean, you can try to store them in Postgres, but then you additionally have some operational issues. Because not only do I not see them often; the people who design procedures and tools also don't see them often. And it's kind of exotic to have blobs.
Michael: Yeah, so when you say major upgrades, are we talking about the speed of the initial sync? Or are we talking about pg_dump? If we go the dump-restore route, then actually it all has to get dumped out.
Nikolay: Yeah, yeah.
I just remember some notes about it that I saw, maybe on RDS, specifically about large objects. Maybe I'm wrong, actually.
Michael: I just remembered I'm actually using it, not for super large objects, but for files of several kilobytes. We store query plans in JSON format or text format. Text format ones don't tend to get that massive, but JSON format ones can get big; we've actually seen a couple that were tens of megabytes, and I think one or two in the hundreds. But...
Nikolay: We need to amend this part of the episode. You're talking about varlena types. Yeah. There is a special thing, the large object facility; a special chapter in the docs.
Michael: Yes, sorry, that's a different thing, yeah.
Nikolay: Yeah, with varlena, everyone uses JSONs, large texts, everyone. Even byteas.
Michael: Okay, you mean there's
a specific issue with the thing
called large objects?
Nikolay: I cannot say I don't touch large JSONs. Of course I touch them a lot; we have them a lot. And yeah, we've talked about how they get TOASTed in Postgres.
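One way to see TOAST at work, with an illustrative table name: compare the size including TOAST and other forks against the main heap alone.

```sql
SELECT pg_size_pretty(pg_table_size('rag_chunks'))    AS with_toast_and_forks,
       pg_size_pretty(pg_relation_size('rag_chunks')) AS main_heap_only;
```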
Michael: Yeah, yeah.
Nikolay: Large JSONs, large XMLs sometimes, right? Texts, of course, large texts, everything. For example, our RAG system for the AI assistant has really large texts: chunks of source code, many of these discussions, kilobytes each.
Michael: And you put all of those
in Postgres?
Nikolay: Of course, because we need to parse them, and also full-text search and vectors; that's all in Postgres right now.
Michael: Well, you could store
the vectors in Postgres without
storing the text in Postgres, but
the full text search makes
a lot of sense.
Nikolay: Yeah, I understand you,
but we do everything in Postgres,
even vectorization.
Yeah, okay, cool.
It doesn't scale well if you need
to deal with billions of vectors,
but millions is fine.
Michael: Yeah, makes sense.
Nikolay: So what I was talking about is the lo_create function, these things.
Michael: Yeah, I've not used that
and you're saying you don't
see it.
Nikolay: Yeah, this is what I don't
see.
Michael: Hopefully everyone got
the memo and no one's using it.
Yeah,
Nikolay: lo_from_bytea... So yeah, I don't use those. And again, the last time I touched that was so long ago. Actual blobs: lo_put, lo_get, these functions.
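The server-side large-object facility being discussed, distinct from ordinary TOASTed text/bytea/jsonb columns; the OID value here is illustrative. pg_dump handles these objects separately from table data, which is where upgrade caveats tend to appear.

```sql
SELECT lo_from_bytea(0, 'binary payload'::bytea);  -- create; returns a new OID
SELECT lo_get(16385);                              -- read back by OID
SELECT lo_unlink(16385);                           -- delete
```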
I have no idea, and I suspect something will be broken if you start using them, some operations like upgrades, maybe. Not broken, but you will need to take care of them. Some side effects, like tablespaces can have, you know. Those are also not used often recently; in a cloud context we don't often use tablespaces. But tablespaces might be a headache when you do some migrations, move your database from place to place, or upgrade, and so on.
Michael: Yeah, yeah.
Nikolay: Okay.
Yeah, good one.
What else?
Michael: We haven't talked about embedded databases yet, at the kind of tiny scale of things.
Nikolay: Yeah, I'm not an expert in embedded databases. I thought SQLite is good.
Michael: Yeah, in this category we do now have PGlite, which looks like a very interesting project. But I think at the moment, unless I was doing some syncing with Postgres, unless I had a really good reason, I'd probably still default to SQLite, however the community pronounces it.
But actually, I was going to include in this topic even browser local storage, for example. If you're wanting to do stuff client-side, in a browser app or web app, it still makes sense to use the local storage there, right? IndexedDB or whatever. So there are a few embedded cases where I don't think it makes sense to use Postgres, or if you're going to try, maybe PGlite.
Nikolay: And can you remind me, PGlite, what does it do? It's related somehow to WebAssembly, I think, yes, right?
Michael: I think it must be, right? But I don't know enough about it.
Nikolay: Yeah, I'm checking: it's a complete WASM build of Postgres that's under 3 megabytes gzipped. That's impressive.
Michael: Yeah, it's a cool project.
But I guess there's an argument
to say it's not actually Postgres.
Like it talks and behaves like
Postgres, but it's kind of its
own thing.
Nikolay: Well, we can talk about
many Postgres variants like
this, including Aurora and so on.
Some of them more Postgres, some
of them less.
Michael: Yeah.
But if the topic is like when not
to use Postgres, yeah I guess
Aurora.
I don't know if I'd count Aurora
as that or not.
Nikolay: I'm not sure I understand
what you mean.
Michael: If the only solution was to use Aurora, let's say there was one of these cases where it turned out Aurora was the absolute best, way better than native Postgres, I think I would count that as when not to use Postgres. Because it's kind of Postgres-compatible, like Cockroach or any of these compatible...
Nikolay: you know
Michael: Yeah, so you're right, it's kind of a scale; it's hard to draw a line where it is and isn't Postgres.
Nikolay: It's a spectrum of Postgres, yeah.
Michael: Yeah, cool. Okay, but it feels like that's an easy one, right? If you've got little devices or, you know, little sensors.
Nikolay: Yeah, the default choice is SQLite already. And I like the idea of PGlite, and I know Supabase used it for the database.build project, which I like a lot. Merging this with AI, right in the browser, you can create pet projects and explore; it's a very creative tool for bootstrapping some new project and thinking about how it could look. It has the ER diagram and you can iterate with the AI. It's great.
And there, PGlite works really well. And I'm sure they've already created the ability to deploy, to sync what you build to real Postgres in Supabase somewhere, right?
Michael: Well, I think that was the main aim of the company behind PGlite, called ElectricSQL or something.
Nikolay: Replication.
Michael: Yeah, exactly.
So the whole premise was local-first; if you've heard of local-first development, the idea is that apps like Linear, the task management tool, are lightning fast because they do everything locally and then sync.
Nikolay: A very thick client. Very thick client, yes. Basically like Git: when you type git clone and execute it, you get basically the whole repository, and it can live on your machine. Yep. In a distributed fashion, right?
Michael: Well, and it has to handle merges.
Nikolay: Yeah. You can work there. It's great... until there's a conflict.
I'm curious whether we could explore branching in that area, because we're already very close to implementing synchronization between two DBLab Engines. But that's a different story. I like the idea. So PGlite might be a foundation for more apps which will live in the browser but then be synchronized with real Postgres.
Michael: Yeah, or even desktop apps; it doesn't have to be browser-based.
Nikolay: Well, I guess desktop apps are built on top of Electron, right?
Michael: Oh no, yeah, good point.
Nikolay: And then if you don't have an internet connection, you can still work? Yeah. Like offline mode? That's great. I like the idea, actually. So you have a Postgres mirror. Well, it reminds me of multi-master replication. This is complexity.
Michael: All the same problems,
like with merging and conflicts.
Nikolay: But at the same time, recent Postgres has the ability to create a subscription and avoid loops of replication.
Michael: Yeah, true.
Nikolay: So with the origin option you can say: I want to replicate only the data which doesn't have an origin, which means it was born here. Local origin, basically, but it's expressed as no origin there, somehow. The terminology is a little strange, as usual in Postgres, right? But it's a great ability to break the loops, infinite loops.
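A sketch of the loop-breaking option described, available since Postgres 16, with illustrative names: node B subscribes to node A but applies only rows with no replication origin, i.e. rows born on A.

```sql
-- On node A:
CREATE PUBLICATION pub_items FOR TABLE items;

-- On node B:
CREATE SUBSCRIPTION sub_from_a
    CONNECTION 'host=node-a dbname=app user=replicator'
    PUBLICATION pub_items
    WITH (origin = none, copy_data = false);
```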
Yeah, it's
Michael: interesting. Yeah, it fixes one of the problems, but it doesn't...
Nikolay: Fix all of them, no.
Michael: Yeah, exactly. And, like, last-write-wins type things, yeah.
Nikolay: But if you need to have a very good server-side application and so on, you choose Postgres; but then you have these very thick clients, and you need to choose a database for them, and you choose SQLite, and then you need to synchronize between them somehow. It's maybe even worse.
Michael: Even harder. I think that's
Nikolay: the point. Different data types and models. Yeah.
Michael: Yeah.
Cool.
What about the specialized workloads
like vectors?
And I was going to bring up search
as well.
I think search is slightly easier.
Nikolay: Let's bring search.
Michael: I haven't written an app or been involved in an application that is heavily reliant on very advanced search features. But the people I speak to who have swear by how good Elasticsearch is.
Nikolay: This is also what I see. I usually touch Elastic only when working with logs, application logs, Postgres logs, through Kibana, so ELK, or whatever this stack is called. But I also see many customers use Elastic and like it, and shift full-text search to it from Postgres. Okay, their choice, right? And I know the limitations of Postgres native full-text search. I also don't understand ParadeDB, and I haven't seen benchmarks. The benchmark I saw was only at the beginning, when they didn't create an index on the tsvector column.
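The kind of setup that benchmark reportedly lacked, as a minimal sketch with an illustrative schema: an expression index over to_tsvector so full-text queries can use GIN.

```sql
CREATE TABLE articles (
    id   bigserial PRIMARY KEY,
    body text NOT NULL
);

CREATE INDEX ON articles USING gin (to_tsvector('english', body));

SELECT id
FROM articles
WHERE to_tsvector('english', body)
      @@ websearch_to_tsquery('english', 'postgres search');
```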
Michael: They made a really interesting hire recently. I saw, do you remember ZomboDB?
Nikolay: Hires, yes.
Michael: Do you remember? Several hires.
Nikolay: Because they raised money, yeah.
But where are the benchmarks? I don't understand what's there, because I don't see benchmarks. I tried recently, because their CEO and founder, Philip, approached me. Nice. Asking, I guess, to spread the word; but I cannot spread the word if I don't see numbers. If it's a company about performance, show numbers. I might be missing something, because... yeah.
Michael: The other product in this space for Postgres was an extension called ZomboDB, which kept an Elasticsearch index maintained, with the data coming from Postgres originally. I thought that was a really fascinating way of having both. A bit like when we talked about analytics: having the interface be Postgres, but the actual query being run on something that isn't Postgres. So that was fascinating, and it was the founder of ZomboDB who recently joined ParadeDB. So that seems interesting.
Nikolay: This whole story seems interesting. I don't understand it, because I cannot find numbers, while at the same time I see everyone mentions them: a lot of blog posts, a lot of GitHub stars, a lot of noise. But where are the numbers and benchmarks? They removed them after the initial ones.
Michael: Yeah, so if you were to
try and stay within Postgres,
they seem like the obvious thing
to try.
But I still see people choosing
ElasticSearch, and I'm not sure
why.
Nikolay: Yeah, yeah. Please, if someone listening to us can share benchmarks showing how ParadeDB behaves under load, some number of rows, some number of queries, latencies, ideally buffers used, right? I would appreciate it, because I'm still stuck on the question: what is this?
Michael: Cool.
What about vectors?
Nikolay: Vectors... I have a picture for you that I saw yesterday near my house. These guys definitely... yeah, those who can see YouTube, please check this out. So these guys are definitely experts in vector storage.
Michael: This is Nikolay's joke; it's a removal company called Vector Moving and Storage.
Nikolay: I saw "storage" as well; I thought it was funny. So, we have turbopuffer, right?
Michael: Well, again, not Postgres,
right?
Nikolay: Not Postgres at all, and not open source at all.
Michael: And not free at all,
Nikolay: even though freemium. And the data is in S3, and a new type of index is being used there. I already forgot the name. But yeah, it's not HNSW.
Michael: Oh, interesting.
Nikolay: Yeah, yeah, yeah.
Michael: I didn't know.
Nikolay: So HNSW doesn't scale beyond a few million rows. There's DiskANN, and we had the Timescale, now TigerData, guys on, who developed an advanced version of that. My perception: I don't see anything that scales to a billion rows at all. And turbopuffer says they scale to a billion rows, but as I understand it, it's a multi-tenant approach, so it's not one set of a billion vectors.
I also don't fully understand that, but I see some development: PlanetScale, for MySQL, implemented the same index. And that development started, I think, at Microsoft, and maybe in China actually, Microsoft in China, this is what I saw. And interestingly, they chose Postgres for prototyping.
So this area is worth additional
research.
I started it and didn't have time,
unfortunately, but it's a
very interesting direction, what's
happening with vector search.
Because I think Postgres is losing
right now.
Michael: Well you say losing, I
think...
Nikolay: It's losing, 100%.
Michael: Well, bear with me.
I think there are a lot of use
cases that don't need the scale
that you're talking about.
And a lot of those are fine on
Postgres with pgvector.
But you're probably talking about the ones that then succeed and do really well and scale; they hit a limit relatively quickly, like within the first couple of years.
Nikolay: It's really hard to maintain a huge HNSW index, and latency-wise it's not good. turbopuffer... I'm not fully sold on the idea of storing everything on S3.
Speaking of S3: a few weeks ago, AWS released S3 Vectors, and this might become mainstream. So S3 itself right now supports vector indexes. Have you heard about this? No? I think this might become mainstream.
But if a big change doesn't happen in the Postgres ecosystem, it will be worse than the case with full-text search and Elastic. It will be worse. What's the word... quite the alarmist I am today, right?
Michael: Well, this is the point
of the episode, right?
It's like, it's almost by design
that we're talking about the
weaknesses.
Nikolay: I was feeling so good saying I would choose Postgres for this, this, and this; I can rely on it. But here... we have 1.1 or 1.2 million vectors in our RAG system for Postgres knowledge.
Michael: Is that mostly because
of the mailing list?
Yeah.
Nikolay: The mailing lists, I think, are 70%. But we also have a lot of pieces of source code of various versions, and not only Postgres: PgBouncer and so on, and documentation; it's a lot of stuff, also blog posts. And I feel uneasy thinking about how to add more. And we are going to add more; we are going to do 10x at some point. Of course we will check what TigerData has, but at the same time, I'm not feeling great about it.
Michael: In terms of... is the main issue latency? Is it query latency? Or what's the main... okay, yeah.
Nikolay: Latency, index size, index build time, all these things. Interesting. And the ability to have an additional filter, which is something HNSW still lacks, right? Yeah. Maybe.
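A pgvector sketch of the filtering pain point, using toy 3-dimensional vectors for brevity: the WHERE clause is applied to candidates coming back from the HNSW scan, which can starve results for selective filters.

```sql
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE docs (
    id        bigserial PRIMARY KEY,
    category  text,
    embedding vector(3)
);

CREATE INDEX ON docs USING hnsw (embedding vector_cosine_ops);

SELECT id
FROM docs
WHERE category = 'postgres'               -- filtered after the ANN scan
ORDER BY embedding <=> '[0.1, 0.2, 0.3]'
LIMIT 10;
```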
Michael: I do remember seeing an update on the pgvector repo, but I can't remember what. I feel like they had something to address this, but I can't remember the details.
Nikolay: I haven't touched this
topic for several months.
I might be already lagging in terms
of updates.
It's a very hot topic, of course,
very young as well, right?
Michael: Yeah, and not something
I'm experiencing again.
It's more something I'm observing.
So you're definitely way ahead
of me on this.
Nikolay: Yeah.
So I know just companies use, and
We have several customers who
are in Postgres, but they chose
turbopuffer additionally.
And Linear you mentioned, for example.
Cursor and Linear, they also chose
turbopuffer.
Notion chose turbopuffer to store
vectors.
I'm just checking.
They have
Michael: Some cool customers on this list, yeah.
Nikolay: Yeah. And several more companies which are not mentioned there; they are also our customers in terms of Postgres consulting. And they were super surprised to see that something like a massive migration of vectors is happening; some moving company called turbopuffer helped them move their vectors to S3. But yeah, it's interesting, and they use a different index, which is a younger idea. It's based on clustering and centroids of vectors. So ANN is implemented differently, not graph-like as in HNSW: basically, quickly figure out which cluster centroids are closest to our vector, and then work within those clusters. Quite a simple idea, actually, but I guess there are many complexities in the implementation.
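The centroid idea described here resembles pgvector's IVFFlat, sketched below against the docs table from the previous example: vectors are clustered at index build time, and a query probes only the nearest lists.

```sql
CREATE INDEX ON docs USING ivfflat (embedding vector_cosine_ops)
    WITH (lists = 100);

SET ivfflat.probes = 10;  -- clusters scanned per query: recall vs. speed

SELECT id
FROM docs
ORDER BY embedding <=> '[0.1, 0.2, 0.3]'
LIMIT 10;
```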
Michael: Yeah.
Well, and it's cheap, right? Being on S3.
Nikolay: Oh, yes. And slow. But turbopuffer, I guess, has an additional layer to cache on regular disks closer to the database. So there's a caching layer, of course, but it's much cheaper, much. Actually, this is another area: if you have hundreds of terabytes of data, tiering of storage in Postgres is still not a fully solved problem.
Right.
Michael: Well, unless you shard, right? And this is like
Nikolay: the next generation. It's too expensive to shard and keep everything on disks, especially if there is some archive data which you touch very rarely; I would prefer to have that on S3. And Timescale Cloud, TigerData, solved it in their product. We also had an attempt to solve it from Tembo, which is not a Postgres company anymore. Yeah. pg_tier, right, it was called. But I think this will be more and more needed over time.
Michael: Well, and it's a side effect of the Crunchy Data approach of putting things in Iceberg; that also solves the problem, right? You can archive the data from Postgres at that point. So it's a similar solution, isn't it?
Nikolay: So I guess these days I would explore S3 Vectors, if I needed to. Maybe I will, actually.
Michael: Well, you are going to need to, it sounds like.
Nikolay: Well, yeah, yeah, yeah.
Our PostgreSQL infrastructure is mostly on Cloud SQL... not Cloud SQL, Google Cloud. Not Cloud SQL, no, no, no, that was wrong. Google Cloud, so...
Michael: One level up.
Nikolay: Yeah. But S3 is AWS, so... Still, it's really interesting. It should be cheap, and interesting to explore.
And it's a big challenge to the Postgres ecosystem. Or maybe an opportunity, if somebody creates a foreign data wrapper for it. Actually, why not? It should be a good project, by the way, right? So, the interface... There is a foreign data wrapper for S3 already, right?
Michael: I think so. I think super... I don't know, I'll check.
Nikolay: It should just be extended to have vector functions, in my opinion.
Okay, enough.
It's brainstorm mode already.
Thank you so much.
See you next week.
Michael: Thanks, and catch you next week.
Bye-bye.
Nikolay: See you next week.