Append-only tables
Nikolay: Hello, hello, this is
Postgres.FM.
I'm Nikolay, Postgres.AI, and my
co-host is Michael, pgMustard.
Hi Michael.
Michael: Hello Nikolay.
Nikolay: And guess what we are
going to discuss today?
Michael: I'll guess.
Is it append-only tables?
Nikolay: Exactly.
Michael: Ah, got it right.
Nikolay: I was surprised to hear we haven't discussed it in the past. I'm sure we touched on it many times, right?
Michael: Yeah, for sure, it's come up in episodes. But me too, when I saw it in our listener-suggested topics I did a few searches on our site just to make sure we hadn't talked about it already as a whole episode, and yeah, agreed. But it's not something I see all the time. It is relatively common to have an events table, but even then I mostly see append-mostly tables rather than
Nikolay: Oh yeah, I was thinking we are going to discuss both. Yeah. Do you feel we need a separate episode for append-mostly tables? No?
Michael: I don't, well, actually, I don't think this is super complex. I think it's quite nice, because almost by definition, if we do accept that it's append-only, we've got quite a narrow scope and there are only a few things to consider. Maybe it gets a bit more complicated with append-mostly. But yeah, we can definitely cover that today. I think it's still not that complicated.
Nikolay: Yeah, where do we start? Like, a definition?
Michael: Go for it, yeah, why not.
Nikolay: Well, if we just insert, that's append-only, that's it.
Michael: Insert and select only, I would say. Yeah, no updates, no deletes.
Nikolay: Select is allowed, right. Yeah, insert and select, these are the only 2 things we can allow ourselves from DML, data manipulation language, and that's it. We just select and insert, this is append-only. If we have occasional deletes and updates, it's append-mostly, or insert-mostly, I don't know how to say it.
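For illustration, one way to make the "insert and select only" rule explicit is at the privilege level. A minimal sketch, with hypothetical table and role names:

```sql
-- Hypothetical append-only events table.
CREATE TABLE events (
    id          bigint GENERATED ALWAYS AS IDENTITY,
    occurred_at timestamptz NOT NULL DEFAULT now(),
    payload     jsonb
);

-- The application role may only INSERT and SELECT -- no UPDATE, no DELETE.
REVOKE ALL ON events FROM app_writer;
GRANT INSERT, SELECT ON events TO app_writer;
```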
Michael: Yeah, I like that.
Nikolay: And why do we care about
this particular use case?
Because it has characteristics,
right?
It usually has...
If it's append-only, for example,
we don't care about dead tuples
anymore.
No bloat, right?
It's good.
And usually we talk about huge
volumes of data.
And at some point we think, oh, we need to compress it, we need
to offload it maybe to cheaper storage, or just clean up, because
all data is not needed anymore in raw form.
Sometimes it's aggregated, and in raw form we can just remove it from the database.
Or we just need bottomless storage.
And usually we want inserts to happen very fast, because the volume is huge.
So we need to make sure the performance of inserts is good.
Did I miss anything?
Michael: No, I think those things aren't necessarily always true for append-only tables, but they correlate: a lot of the use cases are for very, very fast-growing data, and by definition append-only means it's never going to decrease in size. It's only going to keep getting larger
Nikolay: and larger.
Unless you clean it up.
Well, there are specific cases.
For example, we discussed many times the topic of slow count.
And if you can allow asynchronous calculation of the count, maybe it's like materialized or something, I don't know.
So the idea is, instead of updating the count somewhere on each insert into the original table, you can aggregate operations in an intermediate table, and then it's append-only.
So you register events in some table, and then you process a chunk and reflect this in the count in the final storage.
And then you can delete it, or better, drop the partition, or truncate, or something.
Right, in this case, it's append-only, but it grows, grows, and
then the size drops.
It happens also, right?
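A minimal sketch of this pattern, with hypothetical table names; the periodic job folds accumulated events into the final counter and clears them (at larger scale you would rotate and drop partitions instead, as discussed):

```sql
CREATE TABLE counter_deltas (        -- append-only intermediate table
    counter_name text NOT NULL,
    delta        int  NOT NULL,
    created_at   timestamptz NOT NULL DEFAULT now()
);

CREATE TABLE counters (              -- final storage; rows assumed to pre-exist
    counter_name text PRIMARY KEY,
    total        bigint NOT NULL DEFAULT 0
);

-- Periodic job: consume accumulated deltas and apply them to the counters.
WITH consumed AS (
    DELETE FROM counter_deltas RETURNING counter_name, delta
), summed AS (
    SELECT counter_name, sum(delta) AS sum_delta
    FROM consumed GROUP BY counter_name
)
UPDATE counters c
SET    total = c.total + s.sum_delta
FROM   summed s
WHERE  c.counter_name = s.counter_name;
```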
Michael: Yeah, I think dropping partitions definitely pushes
the definition of append-only, but it's the thing that makes
sense to do in most, or a lot of cases, at huge scale.
But yeah, is it still append-only if we're dropping partitions?
Nikolay: Yes, this is how we should do it.
Michael: No, I know, but do you see what I mean? You mentioned deleting data, but...
Nikolay: Well, again, it's append-only: we drop a partition, we never delete, we never update, it's append-only. But if we don't need, like, last year's data, we already processed it somehow, made all the calculations we need, we can get rid of the raw data. We just drop the partition. It's the best we can do instead of cleaning up somehow using deletes.
I think we need to discuss it because I did it many times and participated in huge projects in very large companies.
The idea: let's offload all archive data. It was e-commerce. Old orders, let's offload them to cheaper storage for longer-term keeping, and then we need to delete them in the original place, the original database, and it was not partitioned.
The deletes were a project for a couple of months, because downtime is not acceptable, it costs a lot of dollars. E-commerce guys know very well, they can calculate how much each second of downtime costs the company.
So if I had a partitioned table there, it would be magic.
And it's append-only.
That particular table was not append-only,
right?
But it can happen with append-only.
For example, we have an audit log.
Some actions are stored in some append-only table, but we have a policy to store only 2 years of data.
Then I would prefer to drop the partition with the old data, that's it.
So cleaning up is a very important
topic for append-only tables,
this is what I was trying to say.
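A hedged sketch of that setup (table names and dates are illustrative): a monthly-partitioned audit log where retention means dropping whole partitions rather than running DELETEs.

```sql
CREATE TABLE audit_log (
    id          bigint GENERATED ALWAYS AS IDENTITY,
    happened_at timestamptz NOT NULL,
    actor       text,
    action      text,
    details     jsonb
) PARTITION BY RANGE (happened_at);

-- New partitions are created as time moves forward.
CREATE TABLE audit_log_2025_01 PARTITION OF audit_log
    FOR VALUES FROM ('2025-01-01') TO ('2025-02-01');

-- Retention job: once a partition (created earlier) falls outside the
-- 2-year window, detach and drop it. DETACH ... CONCURRENTLY needs 14+.
ALTER TABLE audit_log DETACH PARTITION audit_log_2023_01;
DROP TABLE audit_log_2023_01;
```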
Michael: Yeah, I completely agree.
I think there are other benefits
to partitioning with append-only,
or append-mostly as well, due to,
like, if we do have the occasional
update or delete by having
Nikolay: partitions yeah
Michael: Well, partitioning helps with that as well, right. So let's zoom back out, maybe. We've got inserts and SELECTs, so if we're talking about very, very high volume, we might have to worry about insert performance and SELECT performance.
Nikolay: We can also have, sorry for interrupting, we can also have COPY.
Yeah, sure.
In both ways.
And
Michael: I guess that's about to
come up if we're talking about
optimizing.
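For the bulk-loading side, a small sketch (the file path and column list are hypothetical):

```sql
-- Server-side COPY from a CSV file on the database host.
COPY events (occurred_at, payload)
FROM '/path/to/events.csv' WITH (FORMAT csv, HEADER true);

-- Or client-side, via psql:
-- \copy events (occurred_at, payload) FROM 'events.csv' CSV HEADER
```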
Nikolay: Also, common table, you
know, like table, which reads
everything.
Yeah, but it's kind of SELECT.
So yeah, inserts or SELECTs, no
UPDATEs, DELETEs and so on.
So what use cases?
You wanted to discuss use cases,
right?
Michael: Or even, I was actually
thinking of diving straight
to performance of, like, I think
there's a few things that we
don't have to worry about, and
a few things that we can then
optimize for.
Like, if we're having to insert
at extremely high volumes, which
sometimes these use cases do lend
themselves towards.
You know, take IoT for example, the Internet of Things: if sensors are sending information and we're logging every second from thousands or tens of thousands of sensors, that can end up being a lot of data. So inserting can be a bottleneck, and you might make design decisions for those tables that you wouldn't make if you had a different type of table, a different type of data. So there's that side of things, but then there's also the read side of things, and I think those things maybe sometimes play off against each other. But the fact that we've got append-only gives us some benefits: index-only scans, for example, become even better. I know you often talk about always trying to get index-only scans, but in a table where the data is often changing, that can be a losing battle. It can be a battle that's not always worth fighting, or it's maybe not always worth including as many columns in the index, for example. There are different trade-offs for append-only versus...
Nikolay: Let's unwrap everything here.
You mentioned so many things in just a minute, right?
So first, let's talk about the performance of inserts.
I would say the ideal situation is we don't have indexes and we don't have triggers, including foreign key triggers, because a foreign key in Postgres is internally implemented via a system trigger.
This trigger is going to slow down inserts, especially if you need to insert a lot of rows.
If you have just a few foreign keys, it can multiply the duration of this massive insert.
So, ideally, we should get rid of foreign keys and keep as few indexes as possible for this particular case.
I remember in some cases I decided to go without primary keys, you know, breaking the relational model and so on.
Well, there's no true relational model in Postgres or in any relational database which implements the SQL data model, which has NULL and breaks the relational model completely anyway. But this is a different topic. Side note.
Anyway, it's not good to be without primary keys, but sometimes you think, oh, I just need to dump these rows to some table reliably.
We have ACID, so Postgres guarantees it's stored, it's saved, it's replicated, it's backed up.
But even 1 index, sometimes you think, oh, it slows me down.
And I remember I decided to live without a primary key.
It was a weird case, but it was some archive, maybe just for audit purposes.
I decided to use BRIN at that time.
BRIN is actually a good idea to consider if we have append-only, because of the physical layout: rows don't move.
If we have a row, it's a tuple, it's saved in some block, and it stays there, right?
So this is exactly when BRIN indexes work well.
And we had an episode, 1 of our first episodes, I remember.
It was.
BRIN indexes.
BRIN is a block range index, right?
Yeah.
So it's very lightweight.
It speeds up select performance, not as good as other indexes,
especially B-tree, but it's still good, right?
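As a rough illustration (table and column names are hypothetical), a BRIN index on the natural insertion-order column:

```sql
-- BRIN stores per-block-range summaries (min/max values), so it stays tiny
-- and works well when rows are appended in roughly increasing order and never move.
CREATE INDEX events_occurred_at_brin
    ON events USING brin (occurred_at)
    WITH (pages_per_range = 128);   -- 128 is the default; smaller = finer-grained
```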
Or we might consider hash indexes also, right, because they might
be more lightweight than B-tree sometimes.
They're smaller,
Michael: for example, right?
Well, I think, Yeah, but when it comes to append-only, I think
you make a really good point.
Each index we have slows down the inserts.
So the fewer the better, possibly none if we aren't, let's say
it's a table we're never reading from or it's an audit log that
we only ever have to read from extremely rarely we might consider
1 or even 0 indexes on that maybe not an audit log because maybe
that's not 1 you would actually be writing an insane volume to
but I've read a timescale but sometimes they have to worry about
this kind of thing that they have whilst they've designed for
these kind of time series workloads They've written a good blog
post on optimizing inserts and they list all the same things
as you and go further.
So they, as well as foreign key constraints, basically other
constraints can add overhead as well.
So for example...
Nikolay: Of course. Checks.
Michael: A unique constraint... yeah, check constraints, but unique constraints...
Nikolay: Yeah, an index plus an additional check, for sure.
Michael: Yeah, so basically deciding for each constraint if you really need it, or what value it's adding by having it, and making that call.
Nikolay: Yes. That being said, I must say, in most cases I prefer having a primary key, because it's like the center of data consistency, right?
So it's good to have. But it depends.
It's good that you mentioned Timescale, but I think we will return to Timescale.
My question to you is a tricky question, but I think you already know, and I must admit, when we started the podcast 2 years ago, I didn't realize it fully.
Now I realize it much better.
So we have an index.
What operations does it slow down?
You said it slows down inserts.
This is for sure.
Does it slow down updates?
Well, yes. And there's a mechanism, HOT updates, which deals with it in a limited number of cases.
Does it slow down deletes?
Well, maybe not, because during a delete the index is not updated.
Postgres only updates xmax, as we discussed a couple of times.
Does it slow down selects?
What do you think?
Michael: So we've talked about how having a lot of them can.
Nikolay: Yeah. Yeah.
It slows down the selects, especially if we have a lot of them and a high frequency of selects, and this is about planning time and lock manager locks: during planning, all indexes are locked.
It's some overhead to keep in mind in very heavily loaded systems.
But in general, I would minimize the number of indexes and try not to use foreign keys.
Foreign keys, in many cases we can imagine they exist, maybe have routine checks that referential integrity is fine, but drop them intentionally, because in this case we want, for example, good insert performance.
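A hedged sketch of that approach (constraint and table names are hypothetical): drop the foreign key for insert speed, and run an occasional query to verify referential integrity still holds.

```sql
-- Drop the FK constraint (and its enforcement trigger) on the hot table.
ALTER TABLE events DROP CONSTRAINT IF EXISTS events_device_id_fkey;

-- Periodic check instead of per-row enforcement: any orphaned rows?
SELECT e.device_id, count(*)
FROM   events e
LEFT   JOIN devices d ON d.id = e.device_id
WHERE  d.id IS NULL
GROUP  BY e.device_id;
```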
And as usual, I would like to remind you that when I say all this, in many cases when I deal with a new system I have some of these principles, but I never trust myself, I always check. Checking takes time, sometimes you spend time there, right, but it's worth doing experiments.
Michael: yeah I well and I would
say we're talking about extremely
high volumes here if if you can
I would much rather normally
have primary key have some foreign
keys if they make sense and
have a unique key if I need it
and then test if like can I get
better discs if I need to?
Are there other ways I can improve,
like I can cope with higher
write performance instead of...
Nikolay: Perform checkpoint tuning if you expect huge volumes of data to be written.
Michael: Yeah, so maybe pay for
it in other ways.
It's only at
Nikolay: the apps.
Bigger buffer pool.
Exactly.
Make sure backends don't write
all the time.
It depends, right?
So checkpointer is not crazy, it's
not too frequent, and so on.
Yeah, yeah.
And there's a lot of stuff here.
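Illustrative starting points only; the right values depend on hardware and workload and should be verified by experiment, as discussed above:

```sql
ALTER SYSTEM SET max_wal_size = '16GB';              -- fewer, less frantic checkpoints
ALTER SYSTEM SET checkpoint_timeout = '30min';
ALTER SYSTEM SET checkpoint_completion_target = 0.9;
ALTER SYSTEM SET shared_buffers = '8GB';             -- "bigger buffer pool"; needs a restart
SELECT pg_reload_conf();                             -- the other settings apply on reload
```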
And if we think about SELECTs now, what's the number 1 problem usually?
I'm still wondering how come we lived so many years until, I think, Postgres 12, when autovacuum_vacuum_insert_scale_factor was added.
I think Darafei initiated it.
Darafei.
Michael: Version 13, I looked it up, yeah.
Nikolay: Okay, it's very recent compared to, like, my experience with Postgres. So strange.
What does it add? Originally, autovacuum, which also maintains Postgres statistics, which are important for good query performance, including selects, was triggered only after, say, 10 or 20% of rows by default were changed.
There is some formula, not very complex, but some formula. Roughly, after 10 or 20% of rows changed, and changed means deleted or updated, it triggers, but not after inserts.
And only in Postgres 13 a specific parameter was added, I think by default it's also 20% or 10, which tells autovacuum to run and process a table after 10 or 20% of rows were added.
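Roughly, the insert-driven trigger condition added in Postgres 13 looks like this (a simplification, with the defaults shown):

```sql
-- Autovacuum is triggered for inserts when, approximately:
--   tuples inserted since last vacuum >
--       autovacuum_vacuum_insert_threshold                   (default 1000)
--     + autovacuum_vacuum_insert_scale_factor * reltuples    (default 0.2)
-- So with the defaults, a 1-billion-row table only gets an insert-triggered
-- vacuum after roughly 200 million new rows.
SHOW autovacuum_vacuum_insert_threshold;
SHOW autovacuum_vacuum_insert_scale_factor;
```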
Michael: Yeah, and I looked this up, and it's because there are 3 jobs, right, of autovacuum. There's, well, roughly, removing dead tuples.
Nikolay: 4 jobs actually.
Michael: Freezing, and analyze, statistics.
Nikolay: Removing dead tuples, maintaining visibility maps.
Michael: Maintaining visibility maps, of course, yeah. 4 goals.
Nikolay: Maybe actually more, but these 4 come to mind quickly.
Michael: Yeah, and if you're only doing inserts you don't need the removing of dead tuples. But that isn't the only thing vacuum's doing. So this then enables, though, the visibility map and the freezing to happen.
Nikolay: Well, freezing will happen regardless of inserts.
It will happen... Well, we can insert into a different table.
Michael: Yeah, okay, yeah, good point.
Nikolay: And autovacuum will see that xmin, right, is very, very far in the past, we have a risk of wraparound, so it's time to freeze this table.
We can have 0 operations, in terms of, like, the table can be left unchanged for a long time, but at some point we can decide, okay, it's time to freeze.
Michael: But you're right, visibility
map would never be...
Nikolay: The visibility map is huge.
You mentioned index-only scans, the performance of aggregates, counts, right?
So we do want to keep it up to date.
I think the default is not enough, as usual with autovacuum.
We must tune it, and even cloud providers' defaults are not enough.
We must tune it and go down to 1% or smaller, and make sure autovacuum maintains statistics and visibility maps more often, so the performance of SELECTs, including index-only scans, is good, right?
Yes.
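For example, per-table settings along these lines (the table name is hypothetical; the exact values are a judgment call):

```sql
ALTER TABLE events SET (
    autovacuum_vacuum_insert_scale_factor = 0.01,  -- vacuum after ~1% of rows inserted
    autovacuum_vacuum_scale_factor        = 0.01,
    autovacuum_analyze_scale_factor       = 0.01   -- keep planner statistics fresher too
);
```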
Michael: Yeah, another reason to partition as well, so you can keep those... yeah, that makes sense. I was gonna say, it is 20%, so it is quite high still, as you say. Would you ever switch...
Nikolay: To the same? I cannot imagine any OLTP system, any website, any mobile app which would be okay with Postgres autovacuum defaults. That's it, like, I don't know why they are the way they are. They are like that for what? We have so many beautiful websites working, huge systems working with Postgres. It's so cool to see that Postgres handles such big workloads, but these defaults...
Michael: Well, and the strange thing with this 1, for example: if we did reduce it to 1%, it would add overhead on small systems. Sure, if you've only got 100 rows, it runs vacuum after every row for a while, you know. But who's running a small system that can't handle a vacuum of a 100-row table after every row? Like, that's fine. And also...
Nikolay: With append-only specifically, when some page is already processed, it's marked all visible, all frozen, or whatever. Vacuum just skips it.
Michael: Yeah, so it wouldn't even be
Nikolay: much overhead. There were many optimizations in this area, so as not to do work which can be skipped. So it's doing a good job skipping, and it has been for many years already.
So I think, like, I never saw any system, and I saw maybe already hundreds of them, different sizes, websites, OLTP, right, where we decided, oh, you know what, we need to increase the scale factor.
I don't remember this at all.
We can throttle it if we like, we can balance work among many workers and so on, but deciding let's make autovacuum's work less frequent: 0 cases I had.
Maybe my experience is not enough,
maybe 1 day I will see such
a system.
Michael: I've not seen 1 either.
Nikolay: Enough rage about defaults, my usual fun I have with Postgres.
Let's talk about partitioning, maybe, right? Why do we want it?
I see several ideas here, and TimescaleDB is definitely for append-only tables, so it's a good thing to have in many senses.
But unfortunately, it's not available in managed offerings except their own Timescale Cloud, right?
And some others, but those others usually choose the Apache 2.0 version, which doesn't have compression.
Right?
Michael: So...
Doesn't have a lot of their good
features, yeah.
Nikolay: Yeah, so partitioning is good.
Again, there's some empirical rule, many people say it, not only I: let's consider partitioning if a table exceeds 100 gigabytes or has chances to exceed 100 GB.
Partitioning adds complexity. It's not as well automated as in Oracle, but it's a very important tool to consider.
Many factors here. First, for example, you might say, okay, I have a partition where I insert, and then many partitions which are like my archive.
And as we decided, we want a very low number of indexes in the main partition, which is receiving inserts, and of constraints like foreign keys and so on.
But there is no such problem in the archive partitions, right? We might have more indexes there, and constraints, and so on.
This is 1 thing.
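A hedged sketch of that idea (names and dates are illustrative): the partition receiving inserts stays lean, while archive partitions, which are no longer written to, can afford richer indexing.

```sql
CREATE TABLE measurements (
    device_id   int              NOT NULL,
    recorded_at timestamptz      NOT NULL,
    value       double precision
) PARTITION BY RANGE (recorded_at);

-- "Hot" partition: receives all inserts, minimal indexing.
CREATE TABLE measurements_2025_06 PARTITION OF measurements
    FOR VALUES FROM ('2025-06-01') TO ('2025-07-01');

-- Archive partition: read-only in practice, so extra indexes are cheap to keep.
CREATE TABLE measurements_2025_05 PARTITION OF measurements
    FOR VALUES FROM ('2025-05-01') TO ('2025-06-01');
CREATE INDEX ON measurements_2025_05 (device_id, recorded_at);
```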
The second thing is autovacuum.
If occasional deletes or updates are happening, the block which contains the row basically falls out of visibility: it's marked not all-visible anymore and not all-frozen anymore, so vacuum needs to process it.
And it's good to have data locality, so archive data is in some partitions and fresh data is in particular partitions.
Then autovacuum is focusing on fresh data in fresh partitions.
It reduces the number of blocks it needs to deal with, right?
Because old data is rarely touched, autovacuum visits it very rarely, right?
This is another reason.
Cleanup is another reason as well, right?
Michael: I think cleanup's the biggest reason. I think maintenance... Partitioning helps so much with maintenance. It does have other benefits for sure, but it helps so much with maintenance that I can't help but feel like that's the biggest 1. And I've started to say, I think I must have stolen this from somebody else because it's too clever for me, but partitioning based on how you want to eventually delete data makes sense. So if you want to eventually delete old data, partitioning based on time makes sense. But for example, if you're a B2B SaaS and you eventually want to delete data based on a customer quitting the service, you probably want to partition based on...
Nikolay: Or both, another level of partitioning.
Michael: Yeah, exactly, but that being, like, a guiding principle for how you partition, because it makes that deletion or dropping so easy.
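A small sketch of that principle (names are hypothetical): time-based retention suggests RANGE partitioning on a timestamp, while per-customer deletion suggests LIST partitioning on the customer, so a departing customer becomes a DROP TABLE.

```sql
CREATE TABLE documents (
    customer_id bigint      NOT NULL,
    created_at  timestamptz NOT NULL,
    body        text
) PARTITION BY LIST (customer_id);

CREATE TABLE documents_customer_42 PARTITION OF documents
    FOR VALUES IN (42);

-- Customer 42 leaves: drop their partition instead of a long, painful DELETE.
DROP TABLE documents_customer_42;
```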
Nikolay: What will you do with the data? And as I said, I participated in projects where delete was a big issue, and of course with partitioning it's very different, and it's good.
Deletes can be a problem.
Postgres deletes, like if you have a terabyte or 10-terabyte table and you need to delete 20% of it, are a big headache, because you need to make sure vacuum will be okay, that autovacuum will catch up all the time.
You need, again, to pay attention to the checkpointer, and you need to find a way to delete so the deletes don't degrade.
This was my problem. I created beautiful queries, but they degraded over time because of dead tuple accumulation, and bloat accumulation as well.
So I needed to adjust them and so on.
So there are many problems with deletes, and it takes time to delete millions of rows.
If you rush it you can bring the system down or have degradation of performance.
Michael: Well, yeah, and it can really affect even your SELECT performance. You mentioned BRIN; that's probably the 1 that used to get affected the most, at least with the default way of creating a BRIN index. If you have a row inserted way back in an old block, if you don't have partitioning, if it goes miles away and you get some really scattered data, BRIN performance can end up effectively looking like sequential scans.
Nikolay: And all indexes degrade, B-tree degrades very quickly if you perform deletes and updates, and you need to rebuild it.
And rebuilding is better with partitioning, because the smaller the partitions are, the faster rebuilding is, and the xmin horizon is not held for long, right?
So autovacuum is not affected in the whole database, right?
Michael: Yeah, so, yeah.
Nikolay: Building and rebuilding indexes, vacuum itself, maintenance tasks are good if you have smaller physical tables, so partitioning is great.
Michael: Right, yes, on the BRIN thing I'll link up the old episode we did, but the min-max multi opclass I think makes a big difference, especially, well, it handles loads of outliers, so I do think that's easier. And if you are able to keep on top of autovacuum, I guess the B-tree stuff doesn't degrade that quickly. So I feel like these things aren't as big a problem anymore.
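The multi-minmax opclass mentioned here is available from Postgres 14; a small sketch (names are hypothetical):

```sql
-- minmax_multi keeps several value ranges per block range, so occasional
-- out-of-order or outlier values hurt much less than with plain minmax.
CREATE INDEX events_occurred_at_brin_multi
    ON events USING brin (occurred_at timestamptz_minmax_multi_ops);
```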
But yeah, often in these cases,
if you're dealing with high volume,
like many, many thousands of queries
per second, like just extreme
volume, anything you can do to
help fight on the performance
front will be helpful.
Nikolay: Yeah, and as usual when we touch partitioning, there's the state of caches and the buffer pool. For example, if you have archived data which is touched rarely, those blocks are evicted from the buffer pool, and cache efficiency might grow, the hit rate might be better.
But yeah, I agree with you.
So partitioning is good in many senses.
It comes with a price of overhead and maintenance as well, but it's worth having.
But imagine, like all this said,
we moved slightly from append-only
to append-mostly, right?
But let's move back to append-only.
Imagine we have many partitions
where data is not changed.
Archive.
Indexes created.
All frozen, all visible.
It's a beautiful state of data,
right?
So all index-only scans are working
well.
And that's it.
Maintenance not needed.
Autovacuum not needed there, and
so on.
However, what if we have 100 terabytes of data, and this is a heavily loaded cluster with many replicas?
The data is not changed, that's good. It's evicted from the buffer pool, but we still need to keep it on the main storage, right?
And at some point we think, oh, we pay a big price, because this data is replicated, it increases the volume of backups, full backups if we consider them, right?
So this legacy, it's a lot.
And if we have, for example, 5 replicas and 1 primary, we need to store the same data 6 times and nobody is using it. Like, people read it occasionally.
At some point you think it's not efficient.
And you think, I would rather store it somewhere else, not on the main disks, not on the fast SSDs or NVMes I have, or cloud storage, which is also expensive, right?
So this leads to 2 ideas.
The first idea is it would be good to compress it.
Again, the full version of TimescaleDB is doing a great job here, and their blog posts about compression are great.
I especially like 1, I remember, the first big 1, which explained the algorithms: basically a kind of column compression inside rows, although we still have row storage, it's great.
And the second topic here, which opens up naturally, is what I know Aurora now offers, right, and Neon and Timescale as well, in cloud only: the bottomless approach, where old partitions are offloaded to S3 or object storage, GCS on Google Cloud, or Blob Storage on Azure, or whatever it's called, I don't remember.
And now even Hetzner has it.
They just recently released it, which is big news.
I like it because I like their prices, and I have worked with them since, I think, 2006 or so, in a few companies.
When you bootstrap and you have a small startup, Hetzner is like number 1 in terms of budget and the hardware they can offer.
So they just recently released S3-compatible object storage, right?
So we can have normal backups and so on.
But what to do with old partitions? It's a natural way of thinking: we don't want to keep them on these expensive disks we have, having multiple copies of them.
So offloading them somehow, implicitly, transparently, in the background, to S3 or S3-compatible MinIO or something, if you have self-managed Postgres, would be great.
So we have it in Timescale Cloud, and I think Neon also does it, right?
Michael: I don't know.
Nikolay: Bottomless, bottomless. Like Neon, they store data on S3 originally anyway. So the idea is we want to have a petabyte-size cluster, but not pay for lots of disks and the headache that comes with them.
And for append-only, it's very natural to decide, okay, we want to store data forever, not clean up, but we want cheap storage here.
So S3 is a good idea to consider,
and it has tiers also, right?
It can be slow to retrieve, but
it's okay because it's rare,
right?
Michael: Well, and it depends what you mean by slow; I think there can be performance advantages. When some of this data is fresh we might want to retrieve it row by row, like if you're looking at some audit logs you might want to look at some recent ones, and you might want all the information about them. But if you're looking at data from 2 years ago there's probably a higher chance that you're looking at it in aggregate, you know, on average how many audits of this type were we having in 2022 versus 2023. And I think actually the types of queries that happen on older data tend to be these aggregate ones that often perform better once it's column-store compressed; you know, these file formats often suit that kind of query. So I... I don't even think I know what you mean by slow.
Nikolay: It likes compression, good compression, it likes it, yeah. And TimescaleDB compression, I have seen how good it is, it's gonna be like 20 times smaller. And indeed, they even support data changes for compressed data, which is great.
Michael: I have seen a project or 2 come up lately, I think, open sourcing some of this stuff, or at least putting it under the Postgres license. Is it pg_parquet that allows you to...
Nikolay: Yeah, but it's different. It's for analytics.
And actually for analytics, we also might want to consider append-only tables, obviously.
But there is a new wave of this, and I know many people, many companies, look at it.
pg_duckdb... not pg_duckdb.
Michael: Yeah.
Nikolay: DuckDB.
Right.
And the idea is, let's marry it with Postgres.
There are a few projects; I looked at a few of them recently.
And 1 of them was just released maybe last week.
I remember they use a logical copy from the original, regular tables to these tables, which are stored, I think, in Parquet format on object storage, and then DuckDB is used as the processing engine for analytics.
But I remember, I think Álvaro commented on Twitter that he's not going to consider it until it works with, basically, CDC, logical replication or something, because right now it's only a full refresh, so it's not serious, but they will do it, I think.
Also, I think the Neon guys looked at DuckDB, I saw some activities, and Hydra, right?
They also looked at that.
Michael: Yeah.
But I understand that most of the
marketing at the moment is
around analytics use cases, but
I don't see why it couldn't work
for append-only
Nikolay: Data types, I wouldn't be so sure.
I looked at a couple of extensions because I have a couple of customers with such a need, to offload old data, and Parquet, this format... the mapping of data types might be tricky if you have some complex data types, as I remember.
And when I looked at some extensions, it didn't work well.
And right now I have plans to look at Tembo's extension, which is called pg_tier, for tiered storage.
The idea is, with this extension, we can have old partitions on object storage.
It's a great idea. So if it works, it's great. I just haven't looked at it yet.
If somebody from Tembo is listening to us, please let us know how this project is doing already.
Is it already at the production or beta stage?
Is it already in production or
beta stage?
I'm very curious.
And is it worth trying?
And what are the limitations, for
example?
Maybe we should actually have a
separate episode, because I think
this extension might bring bottomless
idea to the Postgres ecosystem
for everyone.
It's great.
It's open source, unlike what other companies do.
So kudos to Tembo for making this an open source extension.
Michael: And is it transparent? Do I have to... Because I think some of the DuckDB stuff, I would have to be... I'm not sure if I have to write...
Nikolay: I don't know.
It might be semi-transparent if
you need to make some transitions
with old partitions, I'm okay with
that.
And I can even create new partitions in the background and move all the data from 1 old partition to this kind of new old partition which already has different storage.
This is doable. It's not a big deal, as I see it.
I think what, for example, Timescale Cloud has is transparent, and they have something interesting, as I remember in a very rough form: they have something with the planner and some algorithms to decide when to bring this data into caches and so on, right? So it's interesting.
But the idea is, we want partitioning to evict blocks with old data from memory, that's 1 thing, but then we think we want to evict them from our disks too, because disks are also expensive, right?
Let's evict it.
And this is an alternative idea to cleanup and to compression, or compression goes well with offloading as well, I don't know.
The Parquet format is definitely good at compressing data, in terms of column storage, right?
So if it's time series, it's good.
So there are interesting new directions of development in the Postgres ecosystem here.
And I think we mentioned a few
projects, both commercial and
open source, which is great.
So if someone wants to store petabyte
of data, is preparing for
it, doesn't want to delete everything,
I think there are ideas
to consider.
Well, we didn't mention that with partitioning you can also have some kind of foreign data wrapper and store it somewhere very cheap, also Postgres.
It can be Postgres, it can be not Postgres as well, right?
But for example, we can consider
a cheap Postgres cluster with
very slow disks, HDD, for a good
price, and we can have foreign
data wrappers and offload it from
the main cluster to that cluster
and live with it.
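A hedged sketch of that setup (server, host, and table names are all hypothetical): an old partition lives on a cheaper Postgres cluster and is attached locally as a foreign table.

```sql
CREATE EXTENSION IF NOT EXISTS postgres_fdw;

CREATE SERVER archive_pg FOREIGN DATA WRAPPER postgres_fdw
    OPTIONS (host 'archive.internal', dbname 'archive');

CREATE USER MAPPING FOR CURRENT_USER SERVER archive_pg
    OPTIONS (user 'archive_reader', password 'secret');

-- Attach the remote table as an old partition of the local partitioned table.
CREATE FOREIGN TABLE measurements_2023 PARTITION OF measurements
    FOR VALUES FROM ('2023-01-01') TO ('2024-01-01')
    SERVER archive_pg OPTIONS (schema_name 'public', table_name 'measurements_2023');
```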
There will be some corner cases, maybe, with transactions failing sometimes because of, I don't know, because of...
I inspected the foreign data wrapper, postgres_fdw, code in terms of how it works with...
Basically, you need 2PC, right? You need a two-phase commit to have a reliable commit, because it's a distributed system already, and without 2PC there are risks of inconsistency, for example.
Michael: But not if you're, if
it's append-only, you're never
going to change that data by the
time you're pushing it to a
different server.
Nikolay: I remember I inspected
code and found some kind of edge
cases, maybe even corner cases,
but for inserts it's also possible.
You wrote to 1 place, you're trying
to write another place, this
1 is already committed, here it's
not committed.
But I remember code was quite smart
to reduce the probability
of some cases, but it's not 100%.
As I remember, it was like 5 years
ago.
So I haven't revisited it since.
Michael: Tell me if I'm wrong,
but you're saying new partitions
would be on the local Postgres,
and it would be old ones that
we would move to the second.
Nikolay: Yeah, if we don't have inserting transactions which deal with multiple partitions, there is no problem at all.
Yeah, and old ones we can move with postgres_fdw to a different cluster.
So you're right. No inserts happening, no problem.
Cool. Select only.
Michael: I never considered using FDW for partitioning. Or, like, you know, for the old partitions.
Nikolay: It's natural. Yeah, it's natural.
This is how many folks were going to have clustered Postgres, right?
I mean, or sharding. Yeah, sharding.
Yeah, but this path has issues. It's a different story.
Michael: Here be dragons, is it? Or what's that old phrase?
Nikolay: And I don't know.
So, okay, we discussed many things.
I feel we might be missing something,
right?
But for a good overview, I think
it's enough.
For a general overview of what's
happening with append-only tables.
So I think it's an interesting
topic, actually.
Many people need it.
Many companies store logs and so
on in Postgres.
And I'm looking forward to the future where Postgres at some point will have better compression. Better compression and a bottomless feature.
And as I like to say, imagine TimescaleDB was under the Postgres license.
I think this should happen at some point. I know there are people who don't like this idea, but they developed such good stuff that Postgres would benefit from it.
I know some people would not be happy with these words, I know.
But it feels natural to have these features in Postgres itself.
Michael: Some of them for sure,
but I do think some of it's tricky,
like the bottomless stuff for example,
where would it go?
Nikolay: There is an extension already, right?
S3 compatibility is standard. Even Google Cloud GCS is also S3 compatible. Everyone does it.
So I think, no, no, no, I don't see why not here.
I mean, I see it, but it's manageable.
The biggest question is business-wise, I guess.
So, TimescaleDB, license, yeah.
I understand that business-wise it's not going to happen in the near future, but it would be so great to have good compression and bottomless in Postgres itself.
Even if it's an extension, I'm okay with extensions.
I don't like extensions in general, because I'd like all people to have these features, but in this case it's okay to have extensions.
Cool.
Extensions are great sometimes, yeah.
Good, okay. Have a good week.
I know we have more than 100 suggestions in our doc. Yeah. We read them. Keep posting.
This was 1 of the suggestions, right? Yeah. So you chose it. Yeah, good.
Thank you to our audience for the patience to reach this point, you know.
Okay, see you next time.
Michael: Absolutely, take care, see you next week, bye.