Append-only tables
Nikolay: Hello, hello, this is
Postgres.FM.
I'm Nikolay, Postgres.AI, and my
co-host is Michael, pgMustard.
Hi Michael.
Michael: Hello Nikolay.
Nikolay: And guess what we are
going to discuss today?
Michael: I'll guess.
Is it append-only tables?
Nikolay: Exactly.
Michael: Ah, got it right.
Nikolay: I was surprised to hear we haven't discussed it in the past. I'm sure we touched on it many times, right?
Michael: Yeah, for sure, it's come up in episodes. But me too, when I saw it in our listener-suggested topics I did a few searches on our site just to make sure we hadn't talked about it already as a whole episode, and yeah, agreed. But it's not something I see all the time. It is relatively common to have an events table, but even then I mostly see append-mostly tables rather than
Nikolay: Oh yeah, I was thinking we are going to discuss both. Yeah. Do you feel we need a separate episode for append-mostly tables? No?
Michael: I don't, well, actually, I don't think this is super complex. I think it's quite nice, because almost by definition, if we do accept that it's append-only, we've got quite a narrow scope and there are only a few things to consider. Maybe it gets a bit more complicated with append-mostly. But yeah, we can definitely cover that today. I think it's still not that complicated.
Nikolay: Yeah, where do we start? Like, a definition?
Michael: Go for it, yeah, why not.
Nikolay: Well, if we just insert, that's append-only, that's it.
Michael: Insert and select only, I would say. Yeah, no updates, no deletes.
Nikolay: Select is allowed, right. Yeah, insert and select, these are the only 2 things we can allow ourselves from DML, data manipulation language, and that's it. We just select and insert, this is append-only. If we have occasional deletes and updates, it's append-mostly, or insert-mostly, I don't know how to say it.
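For illustration, one way to make the "insert and select only" rule explicit is at the privilege level. A minimal sketch, with hypothetical table and role names:

```sql
-- Hypothetical append-only events table.
CREATE TABLE events (
    id          bigint GENERATED ALWAYS AS IDENTITY,
    occurred_at timestamptz NOT NULL DEFAULT now(),
    payload     jsonb
);

-- The application role may only INSERT and SELECT -- no UPDATE, no DELETE.
REVOKE ALL ON events FROM app_writer;
GRANT INSERT, SELECT ON events TO app_writer;
```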
Michael: Yeah, I like that.
Nikolay: And why do we care about
this particular use case?
Because it has characteristics,
right?
It usually has...
If it's append-only, for example,
we don't care about dead tuples
anymore.
No bloat, right?
It's good.
And usually we talk about huge
volumes of data.
And at some point we think, oh, we need to compress it, we need
to offload it maybe to cheaper storage, or just clean up, because
all data is not needed anymore in raw form.
Sometimes it's aggregated, and in raw form we can just remove it from the database.
Or we just need bottomless storage.
And usually we want inserts to happen very fast, because the volume is huge.
So we need to make sure the performance of inserts is good.
Did I miss anything?
Michael: No, I think those things aren't necessarily always true for append-only tables, but they correlate: a lot of the use cases are for very, very fast-growing data, and by definition append-only means it's never going to decrease in size. It's only going to keep getting larger
Nikolay: and larger.
Unless you clean it up.
Well, there are specific cases.
For example, we discussed many times the topic of slow count.
And if you can allow asynchronous calculation of the count, maybe it's like materialized or something, I don't know.
So the idea is, instead of updating the count somewhere on each insert into the original table, you can aggregate operations in an intermediate table, and then it's append-only.
So you register events in some table, and then you process a chunk and reflect this in the count in the final storage.
And then you can delete it, or better, drop the partition, or truncate, or something.
Right, in this case, it's append-only, but it grows, grows, and
then the size drops.
It happens also, right?
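A minimal sketch of this pattern, with hypothetical table names; the periodic job folds accumulated events into the final counter and clears them (at larger scale you would rotate and drop partitions instead, as discussed):

```sql
CREATE TABLE counter_deltas (        -- append-only intermediate table
    counter_name text NOT NULL,
    delta        int  NOT NULL,
    created_at   timestamptz NOT NULL DEFAULT now()
);

CREATE TABLE counters (              -- final storage; rows assumed to pre-exist
    counter_name text PRIMARY KEY,
    total        bigint NOT NULL DEFAULT 0
);

-- Periodic job: consume accumulated deltas and apply them to the counters.
WITH consumed AS (
    DELETE FROM counter_deltas RETURNING counter_name, delta
), summed AS (
    SELECT counter_name, sum(delta) AS sum_delta
    FROM consumed GROUP BY counter_name
)
UPDATE counters c
SET    total = c.total + s.sum_delta
FROM   summed s
WHERE  c.counter_name = s.counter_name;
```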
Michael: Yeah, I think dropping partitions definitely pushes
the definition of append-only, but it's the thing that makes
sense to do in most, or a lot of cases, at huge scale.
But yeah, is it still append-only if we're dropping partitions?
Nikolay: Yes, this is how we should do it.
Michael: No, I know, but do you see what I mean? You mentioned deleting data, but...
Nikolay: Well, again, it's append-only: we drop a partition, we never delete, we never update, it's append-only. But if we don't need, like, last year's data, we already processed it somehow, made all the calculations we need, we can get rid of the raw data. We just drop the partition. It's the best we can do instead of cleaning up somehow using deletes.
I think we need to discuss it because I did it many times and participated in huge projects in very large companies.
The idea: let's offload all archive data. It was e-commerce. Old orders, let's offload them to cheaper storage for longer-term keeping, and then we need to delete them in the original place, the original database, and it was not partitioned.
The deletes were a project for a couple of months, because downtime is not acceptable, it costs a lot of dollars. E-commerce guys know very well, they can calculate how much each second of downtime costs the company.
So if I had a partitioned table there, it would be magic.
And it's append-only.
That particular table was not append-only,
right?
But it can happen with append-only.
For example, we have an audit log.
Some actions are stored in some append-only table, but we have a policy to store only 2 years of data.
Then I would prefer to drop the partition with the old data, that's it.
So cleaning up is a very important
topic for append-only tables,
this is what I was trying to say.
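A hedged sketch of that setup (table names and dates are illustrative): a monthly-partitioned audit log where retention means dropping whole partitions rather than running DELETEs.

```sql
CREATE TABLE audit_log (
    id          bigint GENERATED ALWAYS AS IDENTITY,
    happened_at timestamptz NOT NULL,
    actor       text,
    action      text,
    details     jsonb
) PARTITION BY RANGE (happened_at);

-- New partitions are created as time moves forward.
CREATE TABLE audit_log_2025_01 PARTITION OF audit_log
    FOR VALUES FROM ('2025-01-01') TO ('2025-02-01');

-- Retention job: once a partition (created earlier) falls outside the
-- 2-year window, detach and drop it. DETACH ... CONCURRENTLY needs 14+.
ALTER TABLE audit_log DETACH PARTITION audit_log_2023_01;
DROP TABLE audit_log_2023_01;
```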
Michael: Yeah, I completely agree.
I think there are other benefits
to partitioning with append-only,
or append-mostly as well, due to,
like, if we do have the occasional
update or delete by having
Nikolay: partitions yeah
Michael: Well, partitioning helps with that as well, right. So let's zoom back out, maybe. We've got inserts and SELECTs, so if we're talking about very, very high volume, we might have to worry about insert performance and SELECT performance.
Nikolay: We can also have, sorry for interrupting, we can also have COPY.
Yeah, sure.
In both ways.
And
Michael: I guess that's about to
come up if we're talking about
optimizing.
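For the bulk-loading side, a small sketch (the file path and column list are hypothetical):

```sql
-- Server-side COPY from a CSV file on the database host.
COPY events (occurred_at, payload)
FROM '/path/to/events.csv' WITH (FORMAT csv, HEADER true);

-- Or client-side, via psql:
-- \copy events (occurred_at, payload) FROM 'events.csv' CSV HEADER
```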
Nikolay: Also, common table, you
know, like table, which reads
everything.
Yeah, but it's kind of SELECT.
So yeah, inserts or SELECTs, no
UPDATEs, DELETEs and so on.
So what use cases?
You wanted to discuss use cases,
right?
Michael: Or even, I was actually
thinking of diving straight
to performance of, like, I think
there's a few things that we
don't have to worry about, and
a few things that we can then
optimize for.
Like, if we're having to insert
at extremely high volumes, which
sometimes these use cases do lend
themselves towards.
You know, take IoT for example, the Internet of Things: if sensors are sending information and we're logging every second from thousands or tens of thousands of sensors, that can end up being a lot of data. So inserting can be a bottleneck, and you might make design decisions for those tables that you wouldn't make if you had a different type of table, a different type of data. So there's that side of things, but then there's also the read side of things, and I think those things maybe sometimes play off against each other. But the fact that we've got append-only gives us some benefits: index-only scans, for example, become even better. I know you often talk about always trying to get index-only scans, but in a table where the data is often changing, that can be a losing battle. It can be a battle that's not always worth fighting, or it's maybe not always worth including as many columns in the index, for example. There are different trade-offs for append-only versus...
Nikolay: Let's unwrap everything here.
You mentioned so many things in just a minute, right?
So first, let's talk about the performance of inserts.
I would say the ideal situation is we don't have indexes and we don't have triggers, including foreign key triggers, because a foreign key in Postgres is internally implemented via a system trigger.
This trigger is going to slow down inserts, especially if you need to insert a lot of rows.
If you have just a few foreign keys, it can multiply the duration of this massive insert.
So, ideally, we should get rid of foreign keys and keep as few indexes as possible for this particular case.
I remember in some cases I decided to go without primary keys, you know, breaking the relational model and so on.
Well, there's no true relational model in Postgres or in any relational database which implements the SQL data model, which has NULL and breaks the relational model completely anyway. But this is a different topic. Side note.
Anyway, it's not good to be without primary keys, but sometimes you think, oh, I just need to dump these rows to some table reliably.
We have ACID, so Postgres guarantees it's stored, it's saved, it's replicated, it's backed up.
But even 1 index, sometimes you think, oh, it slows me down.
And I remember I decided to live without a primary key.
It was a weird case, but it was some archive, maybe just for audit purposes.
I decided to use BRIN at that time.
BRIN is actually a good idea to consider if we have append-only, because of the physical layout: rows don't move.
If we have a row, it's a tuple, it's saved in some block, and it stays there, right?
So this is exactly when BRIN indexes work well.
And we had an episode, 1 of our first episodes, I remember.
It was.
BRIN indexes.
BRIN is a block range index, right?
Yeah.
So it's very lightweight.
It speeds up select performance, not as good as other indexes,
especially B-tree, but it's still good, right?
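As a rough illustration (table and column names are hypothetical), a BRIN index on the natural insertion-order column:

```sql
-- BRIN stores per-block-range summaries (min/max values), so it stays tiny
-- and works well when rows are appended in roughly increasing order and never move.
CREATE INDEX events_occurred_at_brin
    ON events USING brin (occurred_at)
    WITH (pages_per_range = 128);   -- 128 is the default; smaller = finer-grained
```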
Or we might consider hash indexes also, right, because they might
be more lightweight than B-tree sometimes.
They're smaller,
Michael: for example, right?
Well, I think, Yeah, but when it comes to append-only, I think
you make a really good point.
Each index we have slows down the inserts.
So the fewer the better, possibly none if we aren't, let's say
it's a table we're never reading from or it's an audit log that
we only ever have to read from extremely rarely we might consider
1 or even 0 indexes on that maybe not an audit log because maybe
that's not 1 you would actually be writing an insane volume to
but I've read a timescale but sometimes they have to worry about
this kind of thing that they have whilst they've designed for
these kind of time series workloads They've written a good blog
post on optimizing inserts and they list all the same things
as you and go further.
So they, as well as foreign key constraints, basically other
constraints can add overhead as well.
So for example...
Nikolay: Of course. Checks.
Michael: A unique constraint... yeah, check constraints, but unique constraints...
Nikolay: Yeah, an index plus an additional check, for sure.
Michael: Yeah, so basically deciding for each constraint if you really need it, or what value it's adding by having it, and making that call.
Nikolay: Yes. That being said, I must say, in most cases I prefer having a primary key, because it's like the center of data consistency, right?
So it's good to have. But it depends.
It's good that you mentioned Timescale, but I think we will return to Timescale.
My question to you is a tricky question, but I think you already know, and I must admit, when we started the podcast 2 years ago, I didn't realize it fully.
Now I realize it much better.
So we have an index.
What operations does it slow down?
You said it slows down inserts.
This is for sure.
Does it slow down updates?
Well, yes. And there's a mechanism, HOT updates, which deals with it in a limited number of cases.
Does it slow down deletes?
Well, maybe not, because during a delete the index is not updated.
Postgres only updates xmax, as we discussed a couple of times.
Does it slow down selects?
What do you think?
Michael: So we've talked about how having a lot of them can.
Nikolay: Yeah. Yeah.
It slows down the selects, especially if we have a lot of them and a high frequency of selects, and this is about planning time and lock manager locks: during planning, all indexes are locked.
It's some overhead to keep in mind in very heavily loaded systems.
But in general, I would minimize the number of indexes and try not to use foreign keys.
Foreign keys, in many cases we can imagine they exist, maybe have routine checks that referential integrity is fine, but drop them intentionally, because in this case we want, for example, good insert performance.
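A hedged sketch of that approach (constraint and table names are hypothetical): drop the foreign key for insert speed, and run an occasional query to verify referential integrity still holds.

```sql
-- Drop the FK constraint (and its enforcement trigger) on the hot table.
ALTER TABLE events DROP CONSTRAINT IF EXISTS events_device_id_fkey;

-- Periodic check instead of per-row enforcement: any orphaned rows?
SELECT e.device_id, count(*)
FROM   events e
LEFT   JOIN devices d ON d.id = e.device_id
WHERE  d.id IS NULL
GROUP  BY e.device_id;
```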
And as usual, I would like to remind you that when I say all this, in many cases when I deal with a new system I have some of these principles, but I never trust myself, I always check. Checking takes time, sometimes you spend time there, right, but it's worth doing experiments.
Michael: yeah I well and I would
say we're talking about extremely
high volumes here if if you can
I would much rather normally
have primary key have some foreign
keys if they make sense and
have a unique key if I need it
and then test if like can I get
better discs if I need to?
Are there other ways I can improve,
like I can cope with higher
write performance instead of...
Nikolay: Perform checkpoint tuning if you expect huge volumes of data to be written.
Michael: Yeah, so maybe pay for
it in other ways.
It's only at
Nikolay: the apps.
Bigger buffer pool.
Exactly.
Make sure backends don't write
all the time.
It depends, right?
So checkpointer is not crazy, it's
not too frequent, and so on.
Yeah, yeah.
And there's a lot of stuff here.
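Illustrative starting points only; the right values depend on hardware and workload and should be verified by experiment, as discussed above:

```sql
ALTER SYSTEM SET max_wal_size = '16GB';              -- fewer, less frantic checkpoints
ALTER SYSTEM SET checkpoint_timeout = '30min';
ALTER SYSTEM SET checkpoint_completion_target = 0.9;
ALTER SYSTEM SET shared_buffers = '8GB';             -- "bigger buffer pool"; needs a restart
SELECT pg_reload_conf();                             -- the other settings apply on reload
```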
And if we think about SELECTs now, what's the number 1 problem usually?
I'm still wondering how come we lived so many years until, I think, Postgres 12, when autovacuum_vacuum_insert_scale_factor was added.
I think Darafei initiated it.
Darafei.
Michael: Version 13, I looked it up, yeah.
Nikolay: Okay, it's very recent compared to, like, my experience with Postgres. So strange.
What does it add? Originally, autovacuum, which also maintains Postgres statistics, which are important for good query performance, including selects, was triggered only after, say, 10 or 20% of rows by default were changed.
There is some formula, not very complex, but some formula. Roughly, after 10 or 20% of rows changed, and changed means deleted or updated, it triggers, but not after inserts.
And only in Postgres 13 a specific parameter was added, I think by default it's also 20% or 10, which tells autovacuum to run and process a table after 10 or 20% of rows were added.
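Roughly, the insert-driven trigger condition added in Postgres 13 looks like this (a simplification, with the defaults shown):

```sql
-- Autovacuum is triggered for inserts when, approximately:
--   tuples inserted since last vacuum >
--       autovacuum_vacuum_insert_threshold                   (default 1000)
--     + autovacuum_vacuum_insert_scale_factor * reltuples    (default 0.2)
-- So with the defaults, a 1-billion-row table only gets an insert-triggered
-- vacuum after roughly 200 million new rows.
SHOW autovacuum_vacuum_insert_threshold;
SHOW autovacuum_vacuum_insert_scale_factor;
```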
Michael: Yeah, and I looked this up, and it's because there are 3 jobs, right, of autovacuum. There's, well, roughly, removing dead tuples.
Nikolay: 4 jobs actually.
Michael: Freezing, and analyze, statistics.
Nikolay: Removing dead tuples, maintaining visibility maps.
Michael: Maintaining visibility maps, of course, yeah. 4 goals.
Nikolay: Maybe actually more, but these 4 come to mind quickly.
Michael: Yeah, and if you're only doing inserts you don't need the removing of dead tuples. But that isn't the only thing vacuum's doing. So this then enables, though, the visibility map and the freezing to happen.
Nikolay: Well, freezing will happen regardless of inserts.
It will happen... Well, we can insert into a different table.
Michael: Yeah, okay, yeah, good point.
Nikolay: And autovacuum will see that xmin, right, is very, very far in the past, we have a risk of wraparound, so it's time to freeze this table.
We can have 0 operations, in terms of, like, the table can be left unchanged for a long time, but at some point we can decide, okay, it's time to freeze.
Michael: But you're right, visibility
map would never be...
Nikolay: The visibility map is huge.
You mentioned index-only scans, the performance of aggregates, counts, right?
So we do want to keep it up to date.
I think the default is not enough, as usual with autovacuum.
We must tune it, and even cloud providers' defaults are not enough.
We must tune it and go down to 1% or smaller, and make sure autovacuum maintains statistics and visibility maps more often, so the performance of SELECTs, including index-only scans, is good, right?
Yes.
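For example, per-table settings along these lines (the table name is hypothetical; the exact values are a judgment call):

```sql
ALTER TABLE events SET (
    autovacuum_vacuum_insert_scale_factor = 0.01,  -- vacuum after ~1% of rows inserted
    autovacuum_vacuum_scale_factor        = 0.01,
    autovacuum_analyze_scale_factor       = 0.01   -- keep planner statistics fresher too
);
```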
Michael: Yeah, another reason to partition as well, so you can keep those... yeah, that makes sense. I was gonna say, it is 20%, so it is quite high still, as you say. Would you ever switch...
Nikolay: To the same? I cannot imagine any OLTP system, any website, any mobile app which would be okay with Postgres autovacuum defaults. That's it, like, I don't know why they are the way they are. They are like that for what? We have so many beautiful websites working, huge systems working with Postgres. It's so cool to see that Postgres handles such big workloads, but these defaults...
Michael: Well, and the strange thing with this 1, for example: if we did reduce it to 1%, it would add overhead on small systems. Sure, if you've only got 100 rows, it runs vacuum after every row for a while, you know. But who's running a small system that can't handle a vacuum of a 100-row table after every row? Like, that's fine. And also...
Nikolay: With append-only specifically, when some page is already processed, it's marked all visible, all frozen, or whatever. Vacuum just skips it.
Michael: Yeah, so it wouldn't even be
Nikolay: much overhead. There were many optimizations in this area, so as not to do work which can be skipped. So it's doing a good job skipping, and it has been for many years already.
So I think, like, I never saw any system, and I saw maybe already hundreds of them, different sizes, websites, OLTP, right, where we decided, oh, you know what, we need to increase the scale factor.
I don't remember this at all.
We can throttle it if we like, we can balance work among many workers and so on, but deciding let's make autovacuum's work less frequent: 0 cases I had.
Maybe my experience is not enough,
maybe 1 day I will see such
a system.
Michael: I've not seen 1 either.
Nikolay: Enough rage about defaults, my usual fun I have with Postgres.
Let's talk about partitioning, maybe, right? Why do we want it?
I see several ideas here, and TimescaleDB is definitely for append-only tables, so it's a good thing to have in many senses.
But unfortunately, it's not available in managed offerings except their own Timescale Cloud, right?
And some others, but those others usually choose the Apache 2.0 version, which doesn't have compression.
Right?
Michael: So...
Doesn't have a lot of their good
features, yeah.
Nikolay: Yeah, so partitioning is good.
Again, there's some empirical rule, many people say it, not only I: let's consider partitioning if a table exceeds 100 gigabytes or has chances to exceed 100 GB.
Partitioning adds complexity. It's not as well automated as in Oracle, but it's a very important tool to consider.
Many factors here. First, for example, you might say, okay, I have a partition where I insert, and then many partitions which are like my archive.
And as we decided, we want a very low number of indexes in the main partition, which is receiving inserts, and of constraints like foreign keys and so on.
But there is no such problem in the archive partitions, right? We might have more indexes there, and constraints, and so on.
This is 1 thing.
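A hedged sketch of that idea (names and dates are illustrative): the partition receiving inserts stays lean, while archive partitions, which are no longer written to, can afford richer indexing.

```sql
CREATE TABLE measurements (
    device_id   int              NOT NULL,
    recorded_at timestamptz      NOT NULL,
    value       double precision
) PARTITION BY RANGE (recorded_at);

-- "Hot" partition: receives all inserts, minimal indexing.
CREATE TABLE measurements_2025_06 PARTITION OF measurements
    FOR VALUES FROM ('2025-06-01') TO ('2025-07-01');

-- Archive partition: read-only in practice, so extra indexes are cheap to keep.
CREATE TABLE measurements_2025_05 PARTITION OF measurements
    FOR VALUES FROM ('2025-05-01') TO ('2025-06-01');
CREATE INDEX ON measurements_2025_05 (device_id, recorded_at);
```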
The second thing is autovacuum.
If occasional deletes or updates are happening, the block which contains the row basically falls out of visibility: it's marked not all-visible anymore and not all-frozen anymore, so vacuum needs to process it.
And it's good to have data locality, so archive data is in some partitions and fresh data is in particular partitions.
Then autovacuum is focusing on fresh data in fresh partitions.
It reduces the number of blocks it needs to deal with, right?
Because old data is rarely touched, autovacuum visits it very rarely, right?
This is another reason.
Cleanup is another reason as well, right?
Michael: I think cleanup's the biggest reason. I think maintenance... Partitioning helps so much with maintenance. It does have other benefits for sure, but it helps so much with maintenance that I can't help but feel like that's the biggest 1. And I've started to say, I think I must have stolen this from somebody else because it's too clever for me, but partitioning based on how you want to eventually delete data makes sense. So if you want to eventually delete old data, partitioning based on time makes sense. But for example, if you're a B2B SaaS and you eventually want to delete data based on a customer quitting the service, you probably want to partition based on...
Nikolay: Or both, another level of partitioning.
Michael: Yeah, exactly, but that being, like, a guiding principle for how you partition, because it makes that deletion or dropping so easy.
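A small sketch of that principle (names are hypothetical): time-based retention suggests RANGE partitioning on a timestamp, while per-customer deletion suggests LIST partitioning on the customer, so a departing customer becomes a DROP TABLE.

```sql
CREATE TABLE documents (
    customer_id bigint      NOT NULL,
    created_at  timestamptz NOT NULL,
    body        text
) PARTITION BY LIST (customer_id);

CREATE TABLE documents_customer_42 PARTITION OF documents
    FOR VALUES IN (42);

-- Customer 42 leaves: drop their partition instead of a long, painful DELETE.
DROP TABLE documents_customer_42;
```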
Nikolay: What will you do with the data? And as I said, I participated in projects where delete was a big issue, and of course with partitioning it's very different, and it's good.
Deletes can be a problem.
Postgres deletes, like if you have a terabyte or 10-terabyte table and you need to delete 20% of it, are a big headache, because you need to make sure vacuum will be okay, that autovacuum will catch up all the time.
You need, again, to pay attention to the checkpointer, and you need to find a way to delete so the deletes don't degrade.
This was my problem. I created beautiful queries, but they degraded over time because of dead tuple accumulation, and bloat accumulation as well.
So I needed to adjust them and so on.
So there are many problems with deletes, and it takes time to delete millions of rows.
If you rush it you can bring the system down or have degradation of performance.
Michael: Well, yeah, and it can really affect even your SELECT performance. You mentioned BRIN; that's probably the 1 that used to get affected the most, at least with the default way of creating a BRIN index. If you have a row inserted way back in an old block, if you don't have partitioning, if it goes miles away and you get some really scattered data, BRIN performance can end up effectively looking like sequential scans.
Nikolay: And all indexes degrade, B-tree degrades very quickly if you perform deletes and updates, and you need to rebuild it.
And rebuilding is better with partitioning, because the smaller the partitions are, the faster rebuilding is, and the xmin horizon is not held for long, right?
So autovacuum is not affected in the whole database, right?
Michael: Yeah, so, yeah.
Nikolay: Building and rebuilding indexes, vacuum itself, maintenance tasks are good if you have smaller physical tables, so partitioning is great.
Michael: Right, yes, on the BRIN thing I'll link up the old episode we did, but the min-max multi opclass I think makes a big difference, especially, well, it handles loads of outliers, so I do think that's easier. And if you are able to keep on top of autovacuum, I guess the B-tree stuff doesn't degrade that quickly. So I feel like these things aren't as big a problem anymore.
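The multi-minmax opclass mentioned here is available from Postgres 14; a small sketch (names are hypothetical):

```sql
-- minmax_multi keeps several value ranges per block range, so occasional
-- out-of-order or outlier values hurt much less than with plain minmax.
CREATE INDEX events_occurred_at_brin_multi
    ON events USING brin (occurred_at timestamptz_minmax_multi_ops);
```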
But yeah, often in these cases,
if you're dealing with high volume,
like many, many thousands of queries
per second, like just extreme
volume, anything you can do to
help fight on the performance
front will be helpful.
Nikolay: Yeah, and as usual when we touch partitioning, there's the state of caches and the buffer pool. For example, if you have archived data which is touched rarely, those blocks are evicted from the buffer pool, and cache efficiency might grow, the hit rate might be better.
But yeah, I agree with you.
So partitioning is good in many senses.
It comes with a price of overhead and maintenance as well, but it's worth having.
But imagine, like all this said,
we moved slightly from append-only
to append-mostly, right?
But let's move back to append-only.
Imagine we have many partitions
where data is not changed.
Archive.
Indexes created.
All frozen, all visible.
It's a beautiful state of data,
right?
So all index-only scans are working
well.
And that's it.
Maintenance not needed.
Autovacuum not needed there, and
so on.
However, what if we have 100 terabytes of data, and this is a heavily loaded cluster with many replicas?
The data is not changed, that's good. It's evicted from the buffer pool, but we still need to keep it on the main storage, right?
And at some point we think, oh, we pay a big price, because this data is replicated, it increases the volume of backups, full backups if we consider them, right?
So this legacy, it's a lot.
And if we have, for example, 5 replicas and 1 primary, we need to store the same data 6 times and nobody is using it. Like, people read it occasionally.
At some point you think it's not efficient.
And you think, I would rather store it somewhere else, not on the main disks, not on the fast SSDs or NVMes I have, or cloud storage, which is also expensive, right?
So this leads to 2 ideas.
The first idea is it would be good to compress it.
Again, the full version of TimescaleDB is doing a great job here, and their blog posts about compression are great.
I especially like 1, I remember, the first big 1, which explained the algorithms: basically a kind of column compression inside rows, although we still have row storage, it's great.
And the second topic here, which opens up naturally, is what I know Aurora now offers, right, and Neon and Timescale as well, in cloud only: the bottomless approach, where old partitions are offloaded to S3 or object storage, GCS on Google Cloud, or Blob Storage on Azure, or whatever it's called, I don't remember.
And now even Hetzner has it.
They just recently released it, which is big news.
I like it because I like their prices, and I have worked with them since, I think, 2006 or so, in a few companies.
When you bootstrap and you have a small startup, Hetzner is like number 1 in terms of budget and the hardware they can offer.
So they just recently released S3-compatible object storage, right?
So we can have normal backups and so on.
But what to do with old partitions? It's a natural way of thinking: we don't want to keep them on these expensive disks we have, having multiple copies of them.
So offloading them somehow, implicitly, transparently, in the background, to S3 or S3-compatible MinIO or something, if you have self-managed Postgres, would be great.
So we have it in Timescale Cloud, and I think Neon also does it, right?
Michael: I don't know.
Nikolay: Bottomless, bottomless. Like Neon, they store data on S3 originally anyway. So the idea is we want to have a petabyte-size cluster, but not pay for lots of disks and the headache that comes with them.
And for append-only, it's very natural to decide, okay, we want to store data forever, not clean up, but we want cheap storage here.
So S3 is a good idea to consider,
and it has tiers also, right?
It can be slow to retrieve, but
it's okay because it's rare,
right?
Michael: Well, and it depends what you mean by slow; I think there can be performance advantages. When some of this data is fresh we might want to retrieve it row by row, like if you're looking at some audit logs you might want to look at some recent ones, and you might want all the information about them. But if you're looking at data from 2 years ago there's probably a higher chance that you're looking at it in aggregate, you know, on average how many audits of this type were we having in 2022 versus 2023. And I think actually the types of queries that happen on older data tend to be these aggregate ones that often perform better once it's column-store compressed; you know, these file formats often suit that kind of query. So I... I don't even think I know what you mean by slow.
Nikolay: It likes compression, good compression, it likes it, yeah. And TimescaleDB compression, I have seen how good it is, it's gonna be like 20 times smaller. And indeed, they even support data changes for compressed data, which is great.
Michael: I have seen a project or 2 come up lately, I think, open sourcing some of this stuff, or at least putting it under the Postgres license. Is it pg_parquet that allows you to...
Nikolay: Yeah, but it's different. It's for analytics.
And actually for analytics, we also might want to consider append-only tables, obviously.
But there is a new wave of this, and I know many people, many companies, look at it.
pg_duckdb... not pg_duckdb.
Michael: Yeah.
Nikolay: DuckDB.
Right.
And the idea is, let's marry it with Postgres.
There are a few projects; I looked at a few of them recently.
And 1 of them was just released maybe last week.
I remember they use a logical copy from the original, regular tables to these tables, which are stored, I think, in Parquet format on object storage, and then DuckDB is used as the processing engine for analytics.
But I remember, I think Álvaro commented on Twitter that he's not going to consider it until it works with, basically, CDC, logical replication or something, because right now it's only a full refresh, so it's not serious, but they will do it, I think.
Also, I think the Neon guys looked at DuckDB, I saw some activities, and Hydra, right?
They also looked at that.
Michael: Yeah.
But I understand that most of the
marketing at the moment is
around analytics use cases, but
I don't see why it couldn't work
for append-only
Nikolay: Data types, I wouldn't be so sure.
I looked at a couple of extensions because I have a couple of customers with such a need, to offload old data, and Parquet, this format... the mapping of data types might be tricky if you have some complex data types, as I remember.
And when I looked at some extensions, it didn't work well.
And right now I have plans to look at Tembo's extension, which is called pg_tier, for tiered storage.
The idea is, with this extension, we can have old partitions on object storage.
It's a great idea. So if it works, it's great. I just haven't looked at it yet.
If somebody from Tembo is listening to us, please let us know how this project is doing already.
Is it already at the production or beta stage?
Is it already in production or
beta stage?
I'm very curious.
And is it worth trying?
And what are the limitations, for
example?
Maybe we should actually have a
separate episode, because I think
this extension might bring bottomless
idea to the Postgres ecosystem
for everyone.
It's great.
It's open source, unlike what other companies do.
So kudos to Tembo for making this an open source extension.
Michael: And is it transparent? Do I have to... Because I think some of the DuckDB stuff, I would have to be... I'm not sure if I have to write...
Nikolay: I don't know.
It might be semi-transparent if
you need to make some transitions
with old partitions, I'm okay with
that.
And I can even create new partitions in the background and move all the data from 1 old partition to this kind of new old partition which already has different storage.
This is doable. It's not a big deal, as I see it.
I think what, for example, Timescale Cloud has is transparent, and they have something interesting, as I remember in a very rough form: they have something with the planner and some algorithms to decide when to bring this data into caches and so on, right? So it's interesting.
But the idea is, we want partitioning to evict blocks with old data from memory, that's 1 thing, but then we think we want to evict them from our disks too, because disks are also expensive, right?
Let's evict it.
And this is an alternative idea to cleanup and to compression, or compression goes well with offloading as well, I don't know.
The Parquet format is definitely good at compressing data, in terms of column storage, right?
So if it's time series, it's good.
So there are interesting new directions of development in the Postgres ecosystem here.
And I think we mentioned a few
projects, both commercial and
open source, which is great.
So if someone wants to store petabyte
of data, is preparing for
it, doesn't want to delete everything,
I think there are ideas
to consider.
Well, we didn't mention that with partitioning you can also have some kind of foreign data wrapper and store it somewhere very cheap, also Postgres.
It can be Postgres, it can be not Postgres as well, right?
But for example, we can consider
a cheap Postgres cluster with
very slow disks, HDD, for a good
price, and we can have foreign
data wrappers and offload it from
the main cluster to that cluster
and live with it.
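A hedged sketch of that setup (server, host, and table names are all hypothetical): an old partition lives on a cheaper Postgres cluster and is attached locally as a foreign table.

```sql
CREATE EXTENSION IF NOT EXISTS postgres_fdw;

CREATE SERVER archive_pg FOREIGN DATA WRAPPER postgres_fdw
    OPTIONS (host 'archive.internal', dbname 'archive');

CREATE USER MAPPING FOR CURRENT_USER SERVER archive_pg
    OPTIONS (user 'archive_reader', password 'secret');

-- Attach the remote table as an old partition of the local partitioned table.
CREATE FOREIGN TABLE measurements_2023 PARTITION OF measurements
    FOR VALUES FROM ('2023-01-01') TO ('2024-01-01')
    SERVER archive_pg OPTIONS (schema_name 'public', table_name 'measurements_2023');
```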
There will be some corner cases, maybe, with transactions failing sometimes because of, I don't know, because of...
I inspected the foreign data wrapper, postgres_fdw, code in terms of how it works with...
Basically, you need 2PC, right? You need a two-phase commit to have a reliable commit, because it's a distributed system already, and without 2PC there are risks of inconsistency, for example.
Michael: But not if you're, if
it's append-only, you're never
going to change that data by the
time you're pushing it to a
different server.
Nikolay: I remember I inspected
code and found some kind of edge
cases, maybe even corner cases,
but for inserts it's also possible.
You wrote to 1 place, you're trying
to write another place, this
1 is already committed, here it's
not committed.
But I remember code was quite smart
to reduce the probability
of some cases, but it's not 100%.
As I remember, it was like 5 years
ago.
So I haven't revisited it since.
Michael: Tell me if I'm wrong,
but you're saying new partitions
would be on the local Postgres,
and it would be old ones that
we would move to the second.
Nikolay: Yeah, if we don't have inserting transactions which deal with multiple partitions, there is no problem at all.
Yeah, and old ones we can move with postgres_fdw to a different cluster.
So you're right. No inserts happening, no problem.
Cool. Select only.
Michael: I never considered using FDW for partitioning. Or, like, you know, for the old partitions.
Nikolay: It's natural. Yeah, it's natural.
This is how many folks were going to have clustered Postgres, right?
I mean, or sharding. Yeah, sharding.
Yeah, but this path has issues. It's a different story.
Michael: Here be dragons, is it? Or what's that old phrase?
Nikolay: And I don't know.
So, okay, we discussed many things.
I feel we might be missing something,
right?
But for a good overview, I think
it's enough.
For a general overview of what's
happening with append-only tables.
So I think it's an interesting
topic, actually.
Many people need it.
Many companies store logs and so
on in Postgres.
And I'm looking forward to the future where Postgres at some point will have better compression. Better compression and a bottomless feature.
And as I like to say, imagine TimescaleDB was under the Postgres license.
I think this should happen at some point. I know there are people who don't like this idea, but they developed such good stuff that Postgres would benefit from it.
I know some people would not be happy with these words, I know.
But it feels natural to have these features in Postgres itself.
Michael: Some of them for sure,
but I do think some of it's tricky,
like the bottomless stuff for example,
where would it go?
Nikolay: There is an extension already, right?
S3 compatibility is standard. Even Google Cloud GCS is also S3 compatible. Everyone does it.
So I think, no, no, no, I don't see why not here.
I mean, I see it, but it's manageable.
The biggest question is business-wise, I guess.
So, TimescaleDB, license, yeah.
I understand that business-wise it's not going to happen in the near future, but it would be so great to have good compression and bottomless in Postgres itself.
Even if it's an extension, I'm okay with extensions.
I don't like extensions in general, because I'd like all people to have these features, but in this case it's okay to have extensions.
Cool.
Extensions are great sometimes, yeah.
Good, okay. Have a good week.
I know we have more than 100 suggestions in our doc. Yeah. We read them. Keep posting.
This was 1 of the suggestions, right? Yeah. So you chose it. Yeah, good.
Thank you to our audience for the patience to reach this point, you know.
Okay, see you next time.
Michael: Absolutely, take care, see you next week, bye.