Materialized views
Michael: Hello and welcome to Postgres FM, a weekly show about all things PostgreSQL.
I'm Michael, founder of pgMustard.
This is my cohost Nikolay, founder of Postgres.ai.
Hey, Nikolay, what are we talking about today?
Nikolay: Hi Michael.
Let's talk about materialized views, skipping regular
views. Or do you want to talk about them both?
Michael: Well, I was most keen to talk about materialized views,
but it's difficult to define them, I guess, without mentioning views at all.
Nikolay: Right.
And the rule system,
Michael: Yeah,
Nikolay: right?
Regular views are also interesting; they have interesting things to discuss too, right?
So yeah, let's focus on materialized views, but briefly touch on regular views.
And as usual, let's start from very simple things and definitions,
for those who are not really familiar with views and the view system in Postgres.
And then try to dive into deeper topics, and maybe the future a little.
Michael: Sounds good.
So I guess let's go back; let's
think about tables, relations in general.
Nikolay: Yeah.
Yeah.
Michael: We have ways of storing data in Postgres.
We have tables, which hopefully everybody's familiar
with, and then views, which are kind of a virtual table.
So we can set up a query that queries tables; let's
stick to that simple version and call that a view.
But a view doesn't store anything; it's a view into the data.
So if we're querying a lot of data in the tables, we're still
going to be querying a lot of data via the views, whereas
materialized views...
Nikolay: Let me rephrase a little bit.
This whole area is based on the fact that any query returns kind of a table.
What is a table?
It's a good question, because in theory all rows in a
relation, all tuples in a relation, should be distinguishable.
In Postgres that's not quite the case: a primary key is not
required, so rows can be indistinguishable.
But if you also remember the hidden columns, ctid,
xmax, xmin, which we discussed a couple of times,
rows are distinguishable again; however, result sets
of queries don't have those invisible columns.
So there are a lot of nuances here.
But from a simplified point of view, any query returns a table.
Right, which means we could substitute some
table in some query with another query, and vice versa:
we could take some subquery, give it a name,
store it as another query, and just use an alias here.
Right.
And this is actually a view. In Postgres, views are
implemented using the rule system, which is a very, very old
thing that the original POSTGRES, and maybe Ingres before it, had many decades ago.
This rule system allows users to define rules
saying what to do at the planning stage, not
at the execution stage like triggers.
So: instead of this, do that.
We can say: when you see SELECT FROM view1,
substitute the sub-select from the definition, and then do what you want.
And this rule system allows you to define views.
And also a good thing, by the way (it was my
bachelor's thesis very long ago): updatable views.
If views are quite simple, Postgres can propagate changes.
So you can not only select from them; you
can insert into them, update, delete, and it will work.
And by the way, some intermediate software, like PostgREST (for
example, if you're a Supabase user, you are using PostgREST),
encourages you to define your API using views.
And it relies on the fact that you can change data
there, not only read data. It's quite interesting.
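As a rough illustration of the automatically updatable views being described here, a minimal sketch (table and column names are hypothetical):

```sql
-- A simple single-table view: Postgres can propagate writes through it.
CREATE TABLE users (id bigint PRIMARY KEY, email text, active boolean);

CREATE VIEW active_users AS
SELECT id, email FROM users WHERE active;

-- All of these modify the underlying users table:
INSERT INTO active_users (id, email) VALUES (1, 'a@example.com');
-- note: the inserted row has active = NULL, so it won't be visible
-- via the view itself unless active is set
UPDATE active_users SET email = 'b@example.com' WHERE id = 1;
DELETE FROM active_users WHERE id = 1;
```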
This is views, right?
Based on the rule system, not touching query execution at all:
everything is done at the planning stage.
So quite good, quite interesting.
What's the main problem of views?
Dependencies.
If you need to change an underlying table, or underlying views if you have
a hierarchy with multiple layers, you usually need to recreate your views.
So it can be quite painful, especially if
your system is under constant high load.
Michael: And I think the benefits of them are
extremely different to the benefits of materialized views.
So, where and when do you see views used most?
Nikolay: Well, I remember from my bachelor's thesis, so 20-plus years ago:
originally, one of the key ideas was that views are helpful to control access.
You can restrict the set of columns you expose via views.
You can also restrict the set of rows, so
you can do it in both dimensions of a table.
And then you can have very different access: you can provide, for
example, writable access via a view, but no direct access to the table at all.
And this is actually what PostgREST is doing:
you're supposed to have a separate schema where you define views.
Users don't work directly with tables; they
work with them via views. It's helpful to manage access.
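A hedged sketch of that access-control pattern (schema, table, and role names are invented for illustration):

```sql
-- Expose only some columns and rows via a view in a dedicated schema,
-- and grant the application role access to the view only.
CREATE SCHEMA api;

CREATE VIEW api.users AS
SELECT id, display_name          -- email, password_hash, etc. stay hidden
FROM public.users
WHERE deleted_at IS NULL;        -- soft-deleted rows stay hidden

REVOKE ALL ON public.users FROM app_role;
GRANT SELECT, INSERT, UPDATE, DELETE ON api.users TO app_role;
```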
Views are also sometimes helpful just as shortcuts, aliases for your queries.
Why not?
Michael: Yeah, usability, right?
Better names: you can give objects different names.
If you regret a name and don't want to actually do a big
migration, I've seen views used for that as well.
But yeah, the main one seems to be security.
Nikolay: They add complexity to the system.
Definitely.
And especially if you have several layers of views: if your schema is
changing often, you will notice the overhead of maintaining your views constantly.
But where else are views helpful?
Reporting, actually: you have reports,
so you just store the query, and that's it.
But it doesn't help with speed anyhow.
Michael: Exactly.
Nikolay: Yeah, it doesn't store anything, doesn't have its own statistics.
It doesn't help you there.
You cannot create an index on a view, nothing like that.
Because it's virtual: it's just a query stored under some name.
That's it.
This is a view.
Michael: Can it ever hurt performance, in terms of
push-down of predicates and things like that?
Nikolay: I'm not so sure; in some cases, of course, if you have a
view, you indirectly have less flexible ways to optimize your query.
But in most cases Postgres's planner can choose what to do.
There is this join collapse setting, and the other collapse setting,
Michael: from_collapse_limit?
Nikolay: Yeah, yeah, the from_collapse_limit setting, alongside join_collapse_limit.
Basically, Postgres compares different paths easily,
and views are already substituted at planning time.
The query which defines the view is already there, right?
So basically I don't see a big problem; for me, from my experience,
the main problem is only about maintenance under schema changes.
I might be wrong, of course; as usual, my
experience is not the only one, right?
Michael: Yeah, get in touch if you've had other experiences.
Nikolay: Right.
So materialized views are another idea.
I would say it's the opposite idea: instead of just storing the
view definition, let's store the data, and the view definition as well, right?
Michael: Yeah, importantly.
So it's storing the data, and it's at a point in time:
when the materialized view was created, the result of that query at that time.
And I probably want to catch myself before I go into the benefits too quickly.
But the idea is you can run an expensive computation of data,
do some work once, and then read from it a lot
of times, instead of having to do that expensive work multiple times.
Nikolay: Right.
Actually, you can define just the view, but
not have data in it originally.
There is an additional command to populate it.
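That is the WITH NO DATA clause plus a later REFRESH; a minimal sketch (names are hypothetical):

```sql
-- Define the materialized view without populating it:
CREATE MATERIALIZED VIEW report_mv AS
SELECT customer_id, count(*) AS orders
FROM orders
GROUP BY customer_id
WITH NO DATA;              -- definition only; selecting from it errors out

-- The separate command that populates it later:
REFRESH MATERIALIZED VIEW report_mv;
```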
Michael: What's the benefit of that?
I couldn't work it out.
Nikolay: Well, I don't know; we can invent some cases synthetically.
For example, you restore from a dump and you don't need it right away.
You've already restored the original data, and a materialized view is
derived data: you can always refresh it and have an up-to-date version.
So you restore from the dump and you don't want the materialized view's
data there yet, just so you can consider the dump fully restored
once all the original data is there.
Michael: Yeah.
So you can speed up your restore.
Nikolay: Right, and later, in a separate step, it can be done using
pg_dump and pg_restore, that pair of utilities.
pg_restore has the uppercase -L and lowercase -l pair of options.
One option lists the objects in the dump (if the dump is in custom
or directory format), and the other option allows you to filter out
everything you don't want, or filter in what you want to restore.
So you can restore the materialized view definition but not its data,
and do that later, separately.
This is actually what I did, because I had issues with the
restore order; by the way, maybe the issue is still there.
If you have multiple materialized views, and also
multiple layers of them here and there, right?
In some cases, pg_restore doesn't work,
because it doesn't know the order.
Michael: Yeah.
Nikolay: And it says: I cannot populate data for this materialized view
yet, because I depend on another materialized view, and it's
empty, or maybe not even created at all.
I don't remember the details there.
But I remember I wrote a script, an infinite loop:
if a refresh of a materialized view fails, it's just
skipped, and another attempt is made later.
So it was a brute-force, repetitive refresh of all
materialized views to fill all the layers.
It worked very well.
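A sketch of such a brute-force retry loop (a reconstruction, not the original script; it assumes it's acceptable to spin until every refresh succeeds):

```sql
DO $$
DECLARE
  mv record;
  failures int;
BEGIN
  LOOP
    failures := 0;
    FOR mv IN SELECT schemaname, matviewname FROM pg_matviews LOOP
      BEGIN
        EXECUTE format('REFRESH MATERIALIZED VIEW %I.%I',
                       mv.schemaname, mv.matviewname);
      EXCEPTION WHEN others THEN
        failures := failures + 1;  -- dependency probably not populated yet; retry next pass
      END;
    END LOOP;
    EXIT WHEN failures = 0;  -- all layers populated
    -- caution: a permanently failing view would make this loop forever
  END LOOP;
END $$;
```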
So, a materialized view stores data.
I agree with you that it's helpful.
Aggregates are one of the examples of what to avoid recomputing, because
we all know that counting is quite slow in Postgres: it's a row store,
not a column store, unlike some analytical database systems.
And of course, if you have a billion rows, count will be slow.
Even with an index-only scan, it will still be slow.
So you can pre-calculate counts, for particular groups for example, and
have a quite small materialized view, and you can build indexes on it.
Any custom indexes.
Right.
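For instance, a minimal sketch of pre-aggregating a slow count (table and column names are hypothetical):

```sql
CREATE MATERIALIZED VIEW daily_totals AS
SELECT sold_on AS day, count(*) AS orders, sum(amount) AS total
FROM sales
GROUP BY sold_on;

-- It's a real table underneath, so any index is allowed;
-- a unique one also enables REFRESH ... CONCURRENTLY later.
CREATE UNIQUE INDEX ON daily_totals (day);
```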
Michael: Yeah, I think a lot of people don't
realize that; it's really cool.
Nikolay: Yeah.
It's because it's a table, actually.
And I would like everyone to understand that instead of a materialized
view, you have a very good, powerful tool in Postgres:
CREATE TABLE AS SELECT. Consider it a one-time materialized view.
Without the refresh capability, it's the same.
Michael: Yes, but with refresh capability,
I think it becomes even more powerful.
Right?
Nikolay: I would argue with that; I mean, I agree
in many cases, but in some cases I don't agree.
And I will tell you why in a moment.
So, CREATE TABLE AS SELECT gives you the powerful ability to create some
derived table based on one or several other tables, or materialized views actually.
And then create indexes and work with it.
Very good for experimenting.
It's also very good internally: it doesn't spoil the buffer pool,
because a ring buffer is used there.
So only a relatively small number of buffers is used;
I always forget the exact size of it.
Of course, the operating system's page cache doesn't have
knowledge of what data you are getting from disk,
so it will be affected by this action.
But the buffer pool will remain good.
I mean, it won't evict useful buffers from your shared buffers, which is great.
So: CREATE TABLE AS SELECT, and you can create indexes.
And I would like to point out a very important, very
simple thing I realized not very long ago.
I like to say silly things: by the way, I want to apologize for the last
episode, because I said inserts cannot be HOT, heap-only tuples,
and deletes cannot be HOT.
Well, 'inserts cannot be HOT' just doesn't make sense, but deletes
don't change indexes at all,
because indexes don't have visibility information.
So it was a wrong idea to even raise it for deletes.
As we also discussed, a DELETE just puts the value of the transaction ID into
xmax in the heap, in the table pages, and doesn't do anything with indexes at all.
So it cannot be HOT, because it doesn't need to be HOT at all:
plain DELETE statements never change indexes.
And of course, index entries are cleaned up later by
vacuum activity, but that's asynchronous, which is good.
So, back to our topic.
The idea is that you can CREATE TABLE AS SELECT, or create a materialized
view, which for now we can consider a kind of equivalent.
We will see the differences a little bit later.
You can create indexes, and this is super cool, because if a join
is involved, you can effectively create an index over two tables at once.
An index always belongs to some table, but
sometimes, for example, you want to order
by two values which are present in two different tables.
Or you want to order by and filter, and you could do
it using a single index scan: both ORDER BY with LIMIT and a WHERE clause,
where the WHERE clause uses one column and the ORDER BY with LIMIT uses another column.
We could combine them in one index, a two-column index, for example.
But we cannot, because these columns are in different tables, right?
So we cannot create an index for two tables.
But when you join, via CREATE TABLE AS SELECT with a join, or a materialized view,
you finally can have this index, even if you
don't do anything with the data except joining it.
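A hedged sketch of that trick (all names invented): materialize the join, then build one index covering columns that originally lived in two different tables:

```sql
CREATE TABLE orders_denorm AS
SELECT o.id, o.created_at, c.country
FROM orders o
JOIN customers c ON c.id = o.customer_id;

-- WHERE on a customers column, ORDER BY/LIMIT on an orders column,
-- now served by a single index scan:
CREATE INDEX ON orders_denorm (country, created_at);

SELECT id
FROM orders_denorm
WHERE country = 'NL'
ORDER BY created_at DESC
LIMIT 10;
```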
Michael: Yeah.
I mean, that's awesome.
I've not seen anyone doing that.
That's such a cool idea.
Nikolay: I think everyone is doing it, just implicitly.
Michael: Oh, interesting.
So this actually takes us quite quickly to one of the downsides.
That sounds great, but, and it's a minor downside I guess for most people,
it is taking up space on disk, in the cache, and things like that.
Nikolay: Either way it has to take space; CREATE TABLE AS SELECT will take space too.
Right.
Michael: Yeah.
So it is a minor downside for that benefit, but it's the reason why...
Nikolay: But the speed benefit can be many, many orders of magnitude,
because when you avoid a join, you can have a single index scan,
and maybe even an index-only scan, compared to two index scans on two
different tables and then one of three algorithms to join the results.
It could be a game changer in terms of performance.
Michael: Yep, I completely agree.
The other thing, while I'm talking about downsides,
the main one that people normally...
Nikolay: ...and this can justify the size.
Like, you have some goal to do something with data, you have a query,
and you want to improve its speed.
Okay: basically, what is this CREATE TABLE AS SELECT, or materialized view?
Instead of considering it a query cache, you could, for example,
store more: have more indexes on this derived table or materialized
view, and support more queries with it in the future.
Cache not the final result; cache the data, right?
Because, as we know, a query cache is good until it's not.
For example, MySQL removed the query cache from its engine at some point.
Michael: It feels like one of those things where
there are use cases that it can hugely benefit.
Like if your data's not changing much, or if you have past data that...
Nikolay: You're pushing us too fast toward the main problem with this approach.
Which is also one of the two biggest problems
in computer science, as we know, right?
Cache invalidation: how to maintain the state of it.
Of course, when we say CREATE TABLE AS SELECT,
the data we store there is frozen, right?
Michael: Yeah.
But I mean, some people do have analytics databases
built from static data that isn't changing at all.
And for that...
Nikolay: I doubt it; well, maybe in some cases, right?
But I mostly observe cases where people want to maintain
the state of an analytical database with a very small lag.
Michael: Yeah.
Nikolay: Or they even want to have it in the same
system, in Postgres, and so on.
So, an HTAP approach.
Michael: Hybrid, right?
Hybrid transactional/analytical processing.
Nikolay: The main problem is how to update it, and here is what Postgres offers.
If you forget about my idea of CREATE TABLE AS SELECT, which I still think is
quite good and which can beat materialized views in some cases (I will explain why;
I'm keeping the intrigue), we have REFRESH MATERIALIZED VIEW.
Right.
REFRESH MATERIALIZED VIEW.
Michael: Yeah.
And importantly, I didn't realize this was added so
early, but REFRESH MATERIALIZED VIEW CONCURRENTLY as well.
Nikolay: Right.
What's the difference?
The difference is that one will lead to bloat, and the other will not.
Right.
Michael: So the big difference...
Nikolay: From the developer's point of view, it's so good: our queries keep working.
From the DBA's point of view: bloat. Who knew that could happen as well?
Michael: So if we refresh a materialized view without CONCURRENTLY,
we block SELECTs on the materialized view while it's refreshing.
Right.
And bear in mind we're normally doing this with a slightly
slow query, because that's the benefit of...
Nikolay: It's like VACUUM FULL, basically.
Right, but the data will be fresh; our goal is achieved.
Michael: Yes, but with CONCURRENTLY it's like, I
guess the equivalent would be pg_repack or something.
Or CREATE INDEX CONCURRENTLY; or,
sorry, REINDEX CONCURRENTLY would be the equivalent.
So with CONCURRENTLY, we don't block SELECTs on the past state of
the materialized view, and it replaces it once it's finished.
Nikolay: Right.
Michael: Is that right?
Nikolay: Yeah, that's right.
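Using the hypothetical matview from the earlier sketch, the two variants look like this:

```sql
-- Blocks SELECTs on daily_totals while it runs:
REFRESH MATERIALIZED VIEW daily_totals;

-- Readers keep seeing the old contents; requires a unique index
-- on the materialized view (created earlier):
REFRESH MATERIALIZED VIEW CONCURRENTLY daily_totals;
```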
But the main point to remember here is that we can refresh
it only fully; we cannot refresh part of it, unfortunately,
if we use regular materialized views as original Postgres
provides them, with or without CONCURRENTLY.
There are no options to say: I know 90% of my data hasn't changed,
because it's very old data, so I could probably do this much faster.
So you refresh everything, and it can take a lot of time;
concurrent SELECTs keep working, which is good, but we accumulate bloat,
in the data and in the indexes.
If we follow my approach (let's define it with some underlying,
maybe raw, data and support more kinds of queries),
it means that you probably have multiple indexes there.
Oh, we forgot to say that:
to support concurrent refresh, we need a unique index.
Michael: Exactly, on one of the columns.
Nikolay: It's mandatory.
So already one index is there, but if you follow my
approach, having raw data and supporting more queries,
you probably have multiple indexes there,
and their health will degrade as well.
So all the problems we have with tables which receive significant
numbers of updates and deletes, we will have here as well; so pg_repack or something.
And here's the idea: why not just recreate it, if we
still need to repopulate it fully anyway?
Right.
Why not?
Maybe we can just recreate the materialized view under a different name, and then quickly
swap it in a single transaction, or just use CREATE TABLE AS SELECT.
If it's recreated every time we want to refresh,
what's the point of having a materialized view?
Maybe just convenience, because the database system remembers the definition for you.
That's a bit better; maybe that's it.
In some cases, I would seriously consider recreating it every time,
just to not deal with bloat in the data part of it and in the indexes.
So we have that option.
Right.
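A sketch of the recreate-and-swap pattern being suggested, using the CREATE TABLE AS flavor of the idea (names hypothetical; locking details, grants, and dependent objects are left out):

```sql
BEGIN;

CREATE TABLE daily_totals_new AS
SELECT sold_on AS day, count(*) AS orders, sum(amount) AS total
FROM sales
GROUP BY sold_on;

CREATE UNIQUE INDEX ON daily_totals_new (day);

-- Swap names atomically, then drop the old, bloat-free of history:
ALTER TABLE daily_totals RENAME TO daily_totals_old;
ALTER TABLE daily_totals_new RENAME TO daily_totals;
DROP TABLE daily_totals_old;

COMMIT;
```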
But of course, if we do it every minute, for example, I would check the
bloat of system catalogs as well, because if you have a lot of DDL,
if DDL happens very, very often,
you might end up with a very bloated pg_class and so on.
So it's also worth remembering.
And pg_repack cannot repack system catalogs,
so you need VACUUM FULL for those.
Fortunately, it's usually very fast, but you don't
want to end up with 99% bloat or something,
because it will affect basically all queries:
planning will be slowed down, and so many other things.
So, bottom line: REFRESH MATERIALIZED VIEW is good,
but it's a very, very rough tool.
One of our customers enjoyed using materialized views until some point, and then said
it's like having a huge hammer and applying it to very small things.
It's a very rough tool.
Very rough.
So it saves you sometimes.
But if only small portions of your data are changing,
maybe you will not be satisfied with this approach of always refreshing fully.
And here also, by the way: could a materialized view be updatable?
Michael: Do you mean incrementally?
Like, what do you mean?
Nikolay: Updatable in the same sense we discussed for regular views:
you can insert into it.
Michael: I don't see how that would make sense.
Nikolay: Well, it might make sense at some point, but maybe it's too exotic.
I suspect other systems have it, but I don't remember.
Let's keep it as just a question, an entertaining question.
But as for the main topic: what we would need is the ability
to update only parts of the materialized view.
And there is some theory, not a big one, described on a Postgres wiki page.
And the discussion has been happening for many, many years,
maybe decades, at least one decade I would say: incrementally maintained views,
incrementally refreshed or maintained views.
So we want to avoid a full refresh.
Right.
We want to adjust only part of it.
And I already brought up this idea: it's like CREATE TABLE AS SELECT,
but with the maintenance idea inherited from the view system.
Just remember that you can use CREATE TABLE AS SELECT and then maintain the state
using triggers or some other things, and have incremental updates of the data.
If you understand the logic, it's probably a good way to go.
Right, just maintain it from your application.
The questions for this approach will be: do you lose data, and
do you slow down your writes on the base tables?
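A hedged sketch of that do-it-yourself incremental maintenance, using a trigger to keep a counts table current (all names invented; UPDATE handling and concurrency details are left out):

```sql
CREATE TABLE comment_counts (
  user_id bigint PRIMARY KEY,
  cnt     bigint NOT NULL DEFAULT 0
);

CREATE FUNCTION maintain_comment_counts() RETURNS trigger
LANGUAGE plpgsql AS $$
BEGIN
  IF tg_op = 'INSERT' THEN
    INSERT INTO comment_counts AS c (user_id, cnt)
    VALUES (new.user_id, 1)
    ON CONFLICT (user_id) DO UPDATE SET cnt = c.cnt + 1;
  ELSIF tg_op = 'DELETE' THEN
    UPDATE comment_counts SET cnt = cnt - 1 WHERE user_id = old.user_id;
  END IF;
  RETURN NULL;  -- AFTER row trigger; return value is ignored
END $$;

CREATE TRIGGER comments_counts_trg
AFTER INSERT OR DELETE ON comments
FOR EACH ROW EXECUTE FUNCTION maintain_comment_counts();
```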
Michael: Exactly, what are the trade-offs?
It's actually quite an interesting time to talk about this.
We could talk about the future and incrementally updated
materialized views in Postgres, but I think it's also worth
talking about a couple of existing tools and systems that have
partially tackled it, or tackled it for their own specific case.
so thank you to, uh, the person who requested it.
But there was also an announcement, not in the Postgres space, but
by a company called Planet Scale that we've talked about briefly here
before that announced a, a boost feature that that looks very similar
to a materialized views with, but with incremental updates and,
Nikolay: And what Boost achieves: if you had a join, or
multiple joins, with various kinds of index scans
(I'm talking in Postgres terms),
and then suddenly you have a single index scan, an index-only scan,
of course it will be like a 10,000x speed improvement, or a hundred thousand even.
Michael: Yeah.
They definitely went for a clickbaity title on it, but
it's kind of what people sometimes want, right?
They've got a slow query and they want a quick fix.
Nikolay: I'm usually not concerned about the
additional space, though sometimes I am.
But the price of maintenance: how is the maintenance organized?
Do we get bloat issues in Postgres if we do this?
Michael: The other one I wanted to talk
about as an existing implementation: Timescale
have their continuous aggregates feature as well.
And we don't need to discuss them in depth here, but there's a good
blog post that they've done, actually covering some of the topics we've
just discussed about views, materialized views, and continuous aggregates.
And they have a solution that's designed for time-series workloads,
where you're probably not going to have loads of data changing in the past.
But it can cope with that, and it's being improved.
So that's an interesting take, if that's
something you want and need, and you're okay with
the trade-offs. It exists now, and it's in this area,
this topic, of incrementally updated materialized views.
Nikolay: If you're a Timescale user, you should go and check out
continuous aggregates, a hundred percent.
It's one of their key features, I would say.
But there is also another project, and there is a discussion that has
been happening in pgsql-hackers for almost four years already.
Unfortunately, I don't know all the details, but I
see it: 'Implementing Incremental View Maintenance',
started in December 2018 by Yugo Nagata. And the same author created
the pg_ivm project, which is
an extension.
An open-source extension.
It's quite good.
Interesting work.
And this is an attempt to follow this approach: let's have
incremental, automatic, partial updates of a view.
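For reference, the extension's entry point is a function that creates an "incrementally maintained materialized view"; a sketch along the lines of its README (the exact API may differ by version, and the defining query here is invented):

```sql
CREATE EXTENSION pg_ivm;

SELECT create_immv(
  'sales_by_day',
  'SELECT sold_on, sum(amount) AS total
     FROM sales
    GROUP BY sold_on'
);

-- From here on, INSERT/UPDATE/DELETE on sales updates sales_by_day
-- immediately, within the same transaction.
```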
But as I understand (I've spent some time understanding how pg_ivm
works), first of all, there are restrictions, of course, on the
queries that can be used to define such views.
And also, as I understand, there are two approaches.
The first is synchronous propagation of data changes, which
will slow down our original INSERT, UPDATE, or DELETE.
Right.
Still, it's good; it's good to have it automated.
Michael: Yeah, I think of that as an immediately updated materialized view.
So as soon as there's a change, it's...
Nikolay: Right.
It means that at commit time, our transactions which
changed the original table will slow down, of course, but it's
good that we don't need to write a trigger ourselves to do it.
And the second way is to have a deferred refresh.
That part I didn't fully get: maybe it's still a full refresh, like
before, or not. I didn't get that part.
Michael: I think they use triggers.
Nikolay: Right.
Okay.
So if I updated only one row out of a billion, the incremental option will
propagate the change very fast, compared to a full REFRESH MATERIALIZED VIEW.
Michael: I didn't realize pg_ivm had two options.
I thought they might be talking more philosophically
about there being two options.
That was my understanding, but I definitely might have that wrong.
There's a good video by Lukas Fittl on this as well.
We mentioned his channel last time.
I'll link up that video.
Nikolay: Yeah.
Keep mentioning his channel, and like us, our podcast.
His videos are only five minutes, always much faster than us:
good for those poor dog owners who walk their dogs.
Only five minutes!
Michael: Yeah.
Maybe ones with little legs.
Yeah.
Nikolay: Right.
So, well, still, I think this is great, but I have two
big questions about this direction as a whole.
First: can we do it asynchronously? Changes propagated
always, like in the first option, but asynchronously:
I don't want my original transaction to slow down.
And second: this is usually needed when we have a lot of data.
And when we have a lot of data, we should always use partitioning.
I want my materialized view to be partitioned too, maybe, right?
So, what about this?
For example, we partition by time.
And we have data from five years ago,
January five years ago, for example.
It still can change, but very infrequently.
So I would refresh it much less often. I don't want my
original tables to be partitioned while the materialized view is a single thing.
It means that if I refresh it fully, it's a very heavy query, and it will
affect vacuuming and the whole database, leading to bloat in all tables.
If I update it partially, well, I'm actually maybe
fine, although vacuum will also need to work here as well.
Right.
And if it's a single huge table, all the problems we have with huge
tables which are not partitioned will hit this materialized view as well,
even if we have partial, incremental view maintenance.
Michael: Yeah.
So if I understand you, we're kind of talking about
two extremes, and whether there's a middle ground.
There's the extreme of having to refresh everything
each time, which we currently have in Postgres.
There's the other extreme of synchronous, immediate,
trigger-based updates, so that our materialized view is
updated in each transaction, but that comes with write overhead.
Is there a middle-ground option where...
Nikolay: There is a perfect solution.
It's not a middle ground.
It's a perfect solution.
The perfect solution should propagate data asynchronously, not
slowing down the original transaction and not losing anything,
with some small lag.
Ideally, the lag is also observable.
So I would put it into some monitoring.
And second, the result would be partitioned.
Michael: This is where you lost me a little bit.
My understanding of materialized views is that normally
they're hugely smaller than the original data.
So I mean, how big are these materialized views?
Nikolay: I don't agree.
I've seen very big ones.
And again, the idea of a big result comes from my idea:
let's not create one for each query; let's create one to avoid joins.
Right?
And then run a lot of various queries on top of it.
Quite a good idea.
It's like denormalization, automated denormalization, basically.
But even when people just use it for a final query, just to
cache the result of that query, I still saw huge ones.
Michael: Interesting.
Well, great.
Nikolay: And if it's huge, it should maybe
be partitioned, to benefit from partitioning.
But the main key point I would highlight here:
Postgres unfortunately doesn't have what Oracle has, PRAGMA AUTONOMOUS_TRANSACTION,
the ability to initiate a transaction which is detached, from a trigger for example.
So I want to do my insert, and the database will guarantee
the follow-up work will be finished.
But it's not my problem anymore.
I finish my transaction, and another
transaction happens in a different backend.
You can emulate it with dblink or something,
but that doesn't look like a clean solution.
So currently, if I imagine the ideal system I would build: I would use tables;
I would store the definition, maybe not in Postgres,
maybe on the application side; and I would maybe partition the result.
But the first thing I would solve: I would propagate changes
in an incremental fashion, not by refreshing the whole view.
And by the way, there are articles explaining how to do it.
And there are also tools for various languages which help you do it.
And I would propagate changes asynchronously,
probably through something like Kafka.
I don't know.
I can implement it in Postgres, but it requires additional knowledge.
You need to know how to deal with bloat,
because it's basically a queue inside Postgres.
Maybe we should cover that topic separately.
It's possible, but if it's in Kafka, that's also
good, or some other system which doesn't lose data,
allows me to control the lag, and just delivers this
signal of change separately, so I can update my target.
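A very rough sketch of that asynchronous propagation done inside Postgres (all names invented, reusing the hypothetical comment_counts table from the earlier sketch): a cheap trigger enqueues deltas, and a separate worker drains the queue and applies them, so the writing transaction barely slows down:

```sql
CREATE TABLE mv_queue (
  id      bigserial PRIMARY KEY,
  user_id bigint NOT NULL,
  delta   int    NOT NULL
);

-- Inside an AFTER trigger on the base table, just enqueue:
--   INSERT INTO mv_queue (user_id, delta)
--   VALUES (coalesce(new.user_id, old.user_id),
--           CASE tg_op WHEN 'INSERT' THEN 1 ELSE -1 END);

-- A worker runs this periodically, draining a batch and applying it;
-- this queue table is exactly the part that needs bloat-aware design:
WITH batch AS (
  DELETE FROM mv_queue
  WHERE id IN (SELECT id FROM mv_queue
               ORDER BY id LIMIT 1000
               FOR UPDATE SKIP LOCKED)
  RETURNING user_id, delta
)
INSERT INTO comment_counts AS c (user_id, cnt)
SELECT user_id, sum(delta) FROM batch GROUP BY user_id
ON CONFLICT (user_id) DO UPDATE SET cnt = c.cnt + excluded.cnt;
```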
And the project I sent you the link to, this company called Materialize:
they solve it as a commercial solution, and
as I understand, they use Kafka and Debezium.
Everyone can do it, but maintaining Kafka
requires additional effort.
So that's why maybe I would do it in Postgres itself.
It depends on the system.
If you have many, many databases, having Kafka as
something additional to Postgres would be justified, right?
But in this case, imagine an ideal world.
The target is good.
I mean, it doesn't suffer from having huge
tables and very, very long vacuuming; indexes are usually
smaller, because it's partitioned, and index maintenance
there can be applied in a regular fashion. And changes are propagated
atomically: every small change is
propagated, guaranteed, with a small delay.
And, for example, we have a lot of posts and comments,
like social media, or for example Twitter.
There's Mastodon; people migrated to it,
and it's running on Postgres.
Michael: It is.
Nikolay: We have tweets, for example, answers, replies, likes, everything.
And we could have counters working very, very well here, using this approach.
So I wonder why development goes in this direction,
which will not work well for very large volumes of data.
Well, I see the use cases where this incremental view maintenance will be
useful, but why are the two big problems ignored?
First of all, this ability to propagate changes asynchronously.
I think this problem should be solved first, to allow people
to define triggers which will be asynchronous, for example, or something like that.
Or maybe it should even be done outside of the database.
It's not a closed topic for me.
I don't know the ideal answer here.
Maybe Kafka should be here, or maybe it should be done on the Postgres side.
I don't know.
Michael: Well, if anybody out there has done a lot of thinking
and reading about this, let us know what your thoughts are.
Also, I'm really grateful to the people
that are working on this, on the pg_ivm
thing, because they are looking to contribute this to Postgres.
They are looking to get it committed upstream.
And I think that's a really nice use of the extension
model, where you write an extension as a proof of concept.
Well, what do you think?
Nikolay: Yeah, I agree.
I agree.
I don't want to discourage, or just to criticize.
I just see problems that should be solved as well;
I've solved this a couple of times in different cases, just
not as open source. And it was good.
But I also remember discussions about implementing
PRAGMA AUTONOMOUS_TRANSACTION in Postgres; unfortunately,
that also didn't get finished.
So the need is there.
Incremental view maintenance will probably be good for
mid-size projects, but for larger scale, like dozens of terabytes and
hundreds of thousands of transactions per second, multiple standby
nodes and so on, we do need this additional thing to be solved as well.
I feel it's currently solvable with Kafka or something.
Maybe it's the usual discussion:
should it be inside the engine, or outside?
I don't have an exact answer for this.
But it's a super interesting topic.
I think, in general, development of incrementally maintained views can be
helpful for many, many people, because data volumes are growing,
and people want aggregates, and want them faster, and so on.
Michael: Yeah, exactly.
More and more products need to have a dashboard;
a lot of even transactional products are
expected to do a little bit of analytics now as well, aren't they?
So it makes a lot of sense.
Nikolay: Yeah.
Well, actually, maybe it's not that bad.
Right now, I would consider pg_ivm for a particular project.
I would just measure, benchmark, how much my writes will slow
down because of this synchronous propagation approach.
Also, it requires an index, right?
So an additional index, which will also slightly slow down our
writes. And if it's acceptable, and if it will remain
acceptable in a few years, when my data
volumes grow, then I would just use it.
Why not?
And it avoids the problem of the very rough approach,
the 'refresh it as a whole' approach: it updates only part of it.
That's good.
Michael: Yeah.
And it keeps everything in Postgres at the start, too.
Well, good luck to them, and thank you.
Did we cover everything you wanted to cover?
Nikolay: Yeah, maybe some bottom line: check, benchmark, and consider
building your own solution based on just CREATE TABLE AS SELECT, some triggers,
maybe Kafka if you have it, or some asynchronous queue mechanism you already have.
If it's reliable, of course.
Then you can propagate changes and support your own data structures and queries.
I mean, if existing solutions don't suit your needs well,
maybe you just need to write your own thing.
It's not that difficult, actually.
You have rules, you have triggers, so just do it.
That's it.
Triggers, yeah.
I keep thinking about triggers, but an asynchronous
nature would be good to have here.
Yep.
And listen, just to warn: REFRESH MATERIALIZED VIEW
is good as a quick-and-dirty solution.
It will be dirty because it will hurt
when data grows and everything grows.
It will be problematic for large volumes of data, even...
Michael: But it can get people a long way, right at the beginning.
Nikolay: Well, yeah; again, I'm talking about dozens of terabytes
and dozens, or even hundreds, of thousands of transactions per second.
Really big systems.
Also, Materialize, the company: they do interesting things.
So it's commercial, but interesting.
And there are a couple of talks from them.
I've checked them. Very interesting, just to
understand how it's implemented, the vision, and so on.
Let's also link their materials.
Michael: Will do. Materials.
Nikolay: Materialize materials.
Michael: Yeah.
Well, I hope a few people got some ideas of things
they can use these for as well, even in the current state.
And yeah, I look forward to speaking with you next week.
Nikolay: Good.
See you later.
Bye.
Michael: Bye.