Get or Create
Michael: Hello and welcome to Postgres.FM, a weekly show about
all things PostgreSQL.
I am Michael, founder of pgMustard.
This is my co-host, Nikolay, founder of Postgres.AI.
Hey Nikolay.
Nikolay: Hi Michael.
How are you?
Michael: Good.
How are
you?
Nikolay: Very good.
Michael: Well, yeah, me too.
Haki: Thank you for asking.
Nikolay: Thank you for asking.
Haki: Like I'm here too.
If you just want to keep chat among yourselves, I'll just wait
here.
Nikolay: Look, look, we have a guest today.
Haki: Yeah.
Hi.
Michael: Yeah.
We are delighted to have a guest today and it is the one and only
Haki Benita, technical lead and database enthusiast who also
writes an excellent blog over at hakibenita.com that we've mentioned
many times on the podcast before.
He also gives popular talks and training sessions.
Thank you so much for joining us.
Haki: Thank you
for having me.
Big fan, very excited.
Nikolay: The blog is great.
If someone didn't read it yet, like it's, it's a must-read
for, I think, and long reads, right?
So, It's long reads, but not very frequent.
I guess it's not possible to make them frequent, right?
Haki: Yeah, it's a lot of work.
Nikolay: Funny story is we agreed that our bot will have your
blog post in our knowledge base.
Haki: Yeah, that's right.
Nikolay: And thank you for that.
And like a few days ago, I told the team like, you know, like
this part of our knowledge base was not updated since January,
since alpha release.
And now we're approaching beta release.
And they said, no problem, we'll update and pay attention to
this blog, I said, and then they came and said, we have updated
only one article added since January, but it makes sense because
long reads require a lot of time.
I know it very well.
Right.
Haki: Yeah.
It takes a very long time to produce
these types of articles.
Yeah, I know.
I've been writing for a very, very
long time and I've been very
surprised about this article.
I'll be honest with you, you know,
after you write for some time,
you start to develop like a sense
of which ones are going to
do well and be very accepted.
And which ones are going to be
like, you know, just one for the,
for the archives.
So, you know, when I wrote stuff
like comparing pandas to, to
SQL, I knew that it's going to
be a hit because people like pandas.
And if you say pandas is slow,
then people are going to hate
you.
And when I wrote this, me and Michael
wrote something about hash
indexes a few years back, and I
think it's a top result for Postgres
hash index on just about every
search engine.
So that one, you know, you, you released
that one, you know, that
it's going to do well, but then
I released this one.
I said, man, it's so long, such
a niche problem.
No, no chance anybody's actually
going to read all of that.
And I was very surprised by the
amount of responses that I received
to this article.
And I think the most surprising
comment that I received on this
article is that a lot of Postgres
people didn't know about merge.
Now I came from, I came from Oracle,
so I knew all about merge.
And actually when I came to Postgres,
I wrote a lot of ETL processes
and I really wanted merge and I
didn't have merge, so I had to
learn about INSERT ON CONFLICT.
So a lot of people told me this
is the first time I heard about
merge.
Such a cool command.
Nikolay: Because it's new, it's
still new.
Haki: I know.
Nikolay: It was 15, right?
15.
So it's super new.
Haki: And also to be honest, INSERT
ON CONFLICT is like what
you want.
99% of the cases.
And while merge is nice, I think
that ON CONFLICT is, you seem
pissed.
Why, why did I say?
Nikolay: Well, it was you describing
your article, why it's bad,
right?
Let's talk about why it's bad.
These gaps in sequences, like a
BLOAT created.
Yeah.
This is not what people want.
Right.
I know.
Let's step, let's step back.
Like our, Michael likes us to pronounce
the topic.
Haki: I have to tell you something.
I have to tell you something.
It's funny.
Okay.
I have to tell you something.
So when I talked with Michael about
doing this show and I asked
him, okay, let's do it.
Can you give me like a rough outline
of how it's going to go?
And he told me, I can tell you
how it's going to go, but to be
honest, if Nikolay is going to
be on the show, then I can't promise
anything.
Michael: I can't promise any structure.
Nikolay: Well, I see you are the
same type of person, actually.
Haki: I don't know.
You delivered like on the first
5 minutes.
Nikolay: Don't follow the agenda,
right?
So you said that you were surprised
this topic would be popular,
but how come? It's super popular.
It's like in computer science,
particularly in databases, find
or create, this is like a super
common pattern, right?
And the fact that in Postgres,
the solutions have so many pros
and cons, like it's super interesting
because everyone needs
it actually.
Yeah.
You build something, you want this,
right?
Haki: Yeah.
I'm going to tell you a secret.
After I published this article,
I actually discovered that there
are still situations where you
get very unpredictable results.
Okay.
And, and, and, and I've done some
experimenting in the past 2
weeks based on comments that I
received and I haven't cracked
it yet.
Yeah.
It gets a lot more complicated.
And I'll even tell you another
secret.
There's a very, in my opinion,
an unexpected difference between
the way merge and insert on conflict
behave under different circumstances.
But, you know, we promised Michael
that we'll describe the topic
before we actually dive into the
details.
Nikolay: Right.
Let's name the topic.
Yeah.
Yeah.
Michael: Well, I was interested
in your thoughts on this actually,
because I feel like you deliberately
called your blog post get
or create.
Whereas from the database side
of it, I've always thought of
it as like INSERT or SELECT.
And I think Nikolay's called it
that, but in the past in like
a how-to guide and things.
So we'll link up the blog post
and we'll link up Nikolay's how
to, and there's like a few other
things.
I think one of the things you're
referencing about how it's become
even more complicated in the last
few weeks was like you linked
me to a great answer by Erwin Brandstetter
on Stack Overflow
that discusses this as well, but
it's a surprise.
It sounds really simple, right?
SELECT or INSERT, get or create.
It seems such a simple piece of
functionality.
And in small, low traffic systems,
it can be like you can, like
any newbie developer could come
up with a solution to this problem.
It just gets complicated when
you have to factor in concurrency
and MVCC implementation and things.
So, yeah, I loved your post.
I found it really interesting how
complicated this could get,
But yeah, maybe we should start
simple and like, it'd be great
to hear a little bit about like
you, you both said it's quite
a common thing that you come, like
you have to implement it quite
often.
I haven't had to that often.
Like I've got a couple of occasions,
but not maybe not as often
as you.
So I'd be interested to hear like
where you're coming across
this.
Haki: Okay.
So my use case was very similar
to what I actually implemented
in the article because I had this
organization table and we had
members and we wanted to have users
be able to tag members with
their own tags and we wanted to
encourage reuse of tags.
Otherwise, they don't make much
sense.
So we've set up this very simple
HTML form.
Now HTML forms are not SPAs.
They're not, they're not very complicated.
You can send JSON.
You just send a list of names, like
from a radio selection or like
an autocomplete, whatever.
So at the end of the day, the server
receives a list of strings
and it wants to set these tags
on the member.
Now, if the tag exists, you want
to reuse the tag.
And if the tag does not exist,
you want to create a new tag and
get the ID.
Because the next step would be
to get all the IDs and put them in
the table associating tags with
members, right?
So that's the first part where
you have a list of tag names and
you want to get or create tags.
This is where I came to this idea.
Now, the thing that surprised me
is that now this is not a very
high traffic part of the application.
I could have done the brute force
approach would have been just
fine.
But you know, I wrote some unit
testing and 1 of the tests was,
let's see what happens when you
just have an existing tag, you
want to make sure that it's reused.
So I use INSERT ON CONFLICT DO
NOTHING with RETURNING *.
So I expected that if I insert,
for example, 2 tags and 1 of
them already exists, I expected
to get 2 tags in return.
But in fact, what I got was just
1 tag.
So this was very strange to me.
So at this point I started investigating,
you know, and starting
to explore why this is happening.
And in fact, the first thing that
I thought about was let's do
a meaningless UPDATE.
Like instead of doing ON CONFLICT
DO NOTHING, I did ON CONFLICT
SET ID = EXCLUDED.ID.
Like let's fool the database into
thinking that this tag was
modified.
So I get, get that in return, but
it really bugged me because
it's very like, it's a very ugly
solution.
Why would I want to update something
for no reason just to get
it back?
So this is where, you know, all
the, the different scenarios
started to uncover.
And as I tested farther and farther,
I came to the conclusion
like, Hell man, why is this so
complicated?
I mean, this, this is what database
and applications do.
This should be so simple.
Why is this so complicated?
And then I started digging and
1 of the places that I eventually
arrived was Django.
Django is a web framework and there's
a, an implementation of
get_or_create().
And what Django is doing, they
try to fetch the record.
If they find it, they return it.
If they don't find it, they try
to insert.
Right.
But then, and that's like the funny
part, the non-intuitive part,
then they actually handle an integrity,
unique constraint violation,
and then they try to get it again.
This is the select, insert, select,
but then it gets even more
complicated because if you have
a unique constraint violation
inside a transaction, it aborts
the transaction, right?
You need to wrap the second insert
in another transaction.
Right.
Nikolay: Or use subtransactions.
I think Django by default uses
subtransactions, which is a very
bad idea, actually, but we will
talk about it later.
Haki: The reason that he uses subtransactions,
and I know you're
a very big fan of subtransaction,
you use them all the time.
You encourage everybody to use
subtransactions.
You wrote a very long blog post
about why they are absolutely
great and you should use them all
the time.
But the reason that you do that
in Django, the reason that they're
doing that is because if you happen
to call this function inside
a transaction of your own, and
if you reach a unique constraint
violation, it gets your transaction
aborted.
So that's a problem.
So the only way that Django can
make sure that they don't get
your transaction aborted is to
execute this part in a subtransaction
of its own.
Now There's also another very interesting
thing that happens
here.
And this is also something that
I mentioned in the article.
Python encourages asking for forgiveness.
So in Python, the idiom says that
you should try something and
then handle all the possible exceptions.
So trying to insert a record and
then handling the unique constraint
violation is actually consistent
with how you're expected to
do things in Python.
But it kind of conflicts with the
way Postgres handles this situation.
Because in Postgres, when you have
an exception inside a transaction,
it aborts the transaction, which
is not a very good thing.
So the approaches between how you
would do things in Postgres
and how you would do things in,
in Python kind of clash in
this case.
I thought it was very interesting.
So, you know, I explored different
approaches, like what would
be the equivalent of asking for
forgiveness in Postgres, what
would be the equivalent of looking
before you leap, check all
the preconditions in advance.
So, yeah, it turned out to be way
more complicated than I anticipated.
Nikolay: Yeah, that's interesting.
And you explored several properties,
right?
Bloat, concurrency, constraint
and idempotency, right?
Haki: Idempotency, yeah.
Nikolay: So I'm very curious.
I saw some parts of benchmarks,
but you decided not to include
the performance to the resulting
table when you compared methods.
Why so?
Performance is also interesting, right?
Haki: Performance is interesting.
And I've seen some articles
that do this from a performance
point of view.
In my situation, performance was
not a concern.
I was more focused on getting the
functional attributes that
I wanted, like idempotency, for
example, was a lot more important
to me than making it fast.
Because at the end of the day,
you don't UPDATE lots of tags.
You probably set no more than 5.
So performance was not a concern.
I did want to focus on the functional
attributes.
Also, it got pretty long.
So at some point you need to let
something go, you know.
Nikolay: Right.
Yeah.
Let's maybe talk about your favorite
method right now and for
future.
For future, I guess it's MERGE,
right?
Because in Postgres 17 it will
have RETURNING clause.
Actually, I didn't know.
And when I read about it in your
blog post, I immediately thought,
oh, this is a candidate to be one
of my favorite additions to Postgres
17.
I somehow overlooked it.
So it's great.
MERGE was added to Postgres 15.
It was being developed for more
than 10 years, I think.
It was a huge story.
But it lacks RETURNING in 2015
and 2016.
Now in 2017, to be honest, it's
receiving RETURNING.
It looks like all your checkboxes
are green, right?
You will choose this if you already
was on, we are on Postgres
17, but before 17, what's your
favorite method?
Haki: Well, INSERT ON CONFLICT,
DO NOTHING with a UNION is currently
my way to go.
And I expect it to remain my go-to
approach, mostly because of
habit.
And also because I experimented
with the new MERGE under different
circumstances in Postgres 17.
And it turns out that MERGE can
actually fail with unique constraint
violation.
So I found it unexpected.
I don't know, because I don't know
much about how these two are
different in the underlying implementation,
but I'm guessing
that they are not implemented the
same way.
Nikolay: And we talk about it commitment
level.
Haki: Yes, of course.
Nikolay: I think it also matters,
right?
It's the lowest.
Haki: Yeah.
So, so after I published this article,
I, some reader reached
out to me and he said, I really
liked your posts and everything
is very comprehensive, blah, blah,
blah, But there is one race
condition that can still cause
inconsistent results.
So basically, if you have two sessions
inserting the exact same
tag at the exact same time, then
you can end up with a situation
where using INSERT ON CONFLICT,
you would not get idempotency.
meaning the first statement would
return the tag, but the second
would return nothing.
And the reason for that is when
you start the first transaction,
you INSERT the tag A.
So tag A does not exist and INSERT,
INSERT the tag, right?
And then you get that in return,
but then you don't COMMIT.
Now another session starts, the
transaction begins, and now you
do the exact same thing.
You do INSERT tag A ON CONFLICT
DO NOTHING, RETURNING *, and
then you SELECT from the table
and SELECT from the RETURNING.
What's happening now is interesting,
because right now, the row
is LOCKED.
So, the second session hangs.
So now you COMMIT the first session.
So the second session at this point
is going to return nothing.
And the reason is INSERT ON CONFLICT
LOCKED.
It encountered a unique constraint
violation, so the row was
not inserted.
But then when you tried to SELECT
from the table, it found nothing
because the snapshot is older than
the other transaction.
So this is a scenario where, where
you get, you don't get idempotency.
You expect to get the same output
in both cases, but you don't.
And the more interesting thing
is if you do the exact same exercise
with MERGE, you get different results.
What do you get?
Can you guess?
I found it to be surprising.
If you do the exact same experiment
with MERGE, you are going
to get unique constraint violation.
Even though you can say in the
MERGE clause, you can say ON CONFLICT,
DO NOTHING WHEN MATCHED, DO NOTHING.
You are still going to get unique
constraint violation.
So this raises a question of whether
MERGE is really, or I would,
I wouldn't want to say something
like, is it safe in concurrent,
highly concurrent reloads, but
it doesn't do what it promises.
Better understanding of how MERGE
handles these types of situations,
you know.
Nikolay: Yeah.
It shouldn't produce unique constraint
violation in read committed,
but it does.
But it should be easy to check
based on what you say, like if
you just do pgbench multiple sessions
in parallel doing the same
thing, right?
Haki: It's very easily reproducible.
You just need 2 terminal windows
and you can immediately replicate
it.
I, the thing I found to be, I would
say disturbing.
Yeah.
And I'm using air quotes for those
of you just listening is that
when you do MERGE and you set ON
MATCH, DO NOTHING, you don't
expect to get Unique constraint
violation.
Right.
If I wrote this statement in my
code, I would not handle any
Constraint violation.
Okay.
Because if I'm inserting into a
row and I know that it can raise,
I would handle the situation and
do whatever I need to do, but
I would not expect to get integrity
error when I explicitly handle
the case where there is a match.
So I found this to be surprising.
So to answer your previous question,
what would I use now?
And that's a, that's a tough choice
between you don't get the
expected result to you get an exception.
You know, when I'm thinking about
right now, I think that it
kind of makes sense to get the
exception.
Right?
Nikolay: Yeah.
Well, I guess we're moving from
the end of article backwards,
right?
So much, this, the problem you
describe, is it discussed somewhere?
Like, do you know, like, like this
behavior was discussed in
mailing lists or somewhere, no?
Haki: I don't know.
Maybe I'm guessing that it might
have, but I haven't seen any.
And it also, as you said, it's
kind of new in Postgres.
So it's possible that some of the
people that worked on it didn't
have enough time to actually document
it.
I wouldn't say document, but maybe
write these types of articles
on it, analyzing how it behaves
under different circumstances.
Nikolay: Yeah.
I'm curious.
I think this should be documented.
Maybe, maybe it's actually documented.
We should check, But it's super
interesting.
I never used MERGE in production,
never, ever.
But yeah, so what...
Well, honestly, most production
systems are still behind a lot
in terms of major version.
But moving backwards, in SERP,
on conflict, do nothing or do
UPDATE.
This is interesting because we
said our goal is find or create
or get or create, right?
But when you say, let's also UPDATE,
it's like a third layer
of logic, right?
Find or create, but if you found
also UPDATE, like it's, it's
more complicated.
I like it, but it's already a different
task, right?
Haki: Yeah.
Well, you know, when I wrote lots
of ETL, I don't know, 15 years
ago in, in Oracle, I used MERGE
all the time.
It put like a third hand every
time I needed to do anything,
I would use MERGE because when
you do ETL processes, you're basically
taking data from one place and you
want to sync it to some kind
of target.
So you use MERGE all the time.
In Postgres, I kind of gotten used
to not using it.
And also some may also claim that
the whole ETL, the way that
we do ETL now is kind of different
than the way we used to do
ETL 15 years ago.
A lot of it is the same, but some
of it is still kind of different.
Yeah, I think it's interesting.
And I, and hopefully when more
people adopt MERGE, I know that
it was a very desirable feature,
right?
A lot of people wanted that.
Right.
Nikolay: People
Haki: were very excited when it
came out.
It was all in there.
What I anticipate, what I expect,
most in version, you know,
the MERGE command.
So hopefully we're going to see
some people doing interesting
stuff with MERGE.
Nikolay: Yeah.
And I was surprised to see, so
there is a difference between, in
certain ON CONFLICT DO NOTHING and
DO UPDATE in the table you provided
in the end of the article and DO
UPDATE, there is a like red
cross in the column of bloat, meaning
that it has bloat issues,
so obviously update, okay.
But also you have a green checkbox
for INSERT ON CONFLICT DO NOTHING.
The thing is that when we created
our bot, we also used INSERT
ON CONFLICT DO NOTHING for the knowledge
base.
For example, if your article gets
inserted, it searches if this
URL is known already.
And if it's known already, DO NOTHING.
We decided not to apply update
yet.
So I was surprised to see you concluded
that there is no bloat
issues because we had a lot of
activity, a lot of data to insert
and it suffers from bloat.
We have gaps.
So if you have a collision, right?
So you insert something, you try
to insert and DO NOTHING.
It actually generates that tuple,
I think.
No,
Haki: No, I think that, first of
all, gaps,
Nikolay: gaps for sure.
I just
Haki: checked it.
Nikolay: Gaps for sure.
Haki: Yeah.
Gaps for sure.
And okay.
I know what you're talking about.
You, you, you're talking about
the, the issue of bloat, right?
So this is also a very interesting
topic, which I briefly mentioned,
but I think that this is also something
that some people commented
about that they weren't aware of
the fact that it causes bloat.
So apparently in Postgres, when
you have unique constraint violation,
it can cause bloat because the
way it works is that Postgres
tries to, it basically inserts
the row into the table and then
it checks that there are no unique
constraint violations, right?
And if there is a unique constraint
violation, this row is marked
as dead, which causes bloat.
So this is what's happening if
you rely, heavily rely on catching
unique constraint violation.
However, however, however, and
this is also in the original RDS
article, which is how I found out
about it.
It was also very unexpected that
unique constraint violation
would cause bloat.
But according to the article and
according to the benchmarks
that I've done, when you use INSERT
ON CONFLICT, it uses a different
mechanism that can check in advance
for collisions, which prevents
bloat.
So I'm pretty confident that INSERT
ON CONFLICT DO NOTHING cause
bloat, which is a big plus for
DO NOTHING.
And also, you know, we, we talked
before about the difference
between how you do things in Python,
like asking for forgiveness
versus look before you leap.
So in Python, the pattern that
is encouraged, basically try to
do something and then handle
the exception can end up causing
lots of bloat, which is a problem.
So if you have a process with lots
of collisions where you actually
rely on unique constraint violation,
this is something that you
need to think about.
Nikolay: Right.
So yeah, I double-checked.
You're right.
No bloat, but the sequence gaps are happening.
We have a sequence gap.
Haki: You know, there was a period
where I liked to ask in a job
interviews, how would you implement
gapless sequences?
Well, have you ever talked about
this?
Michael: Evil question.
Haki: Evil question.
Yeah.
Nikolay: Yeah.
It has many depths to
go into, right?
Yeah.
Deeper, deeper, deeper.
Yeah.
Michael: There's a company that
a few of my friends and former
colleagues work at called incident.io.
They do like incident management
within things like Slack and
teams.
And they had this issue where they,
well, they blogged about
it, but they wanted gapless sequences
for their incident numbers.
They didn't, The customers got
confused by the idea of the incident
number jumping by like 5 when they'd
only had, you know, they've
had 1 incident, then a second incident,
a third incident, suddenly
they're on incident 6.
What happened?
So there's a really nice blog post
there.
We can explain how that was.
But I'm curious, why are you bothered
by sequence gaps, Nikolay?
Why is that an issue?
Nikolay: Well, it's not a big issue.
It might be an issue for some cases,
but actually I don't care.
It gave me a false impression that
there is bloat.
But now I double-checked, I'm absolutely
right.
No bloat.
So INSERT ON CONFLICT DO NOTHING
probably is the winner in Postgres
16 or older.
Michael: I've got a question here,
because you mentioned the
kind of trade-off between the issues
in the highly concurrent
cases or like the potential issue
of INSERT ON CONFLICT DO NOTHING
returning null if you happen to
have this case with the 2 concurrent
sessions inserting the exact same
tag at the same time, wouldn't
INSERT ON CONFLICT DO UPDATE avoid
that issue at the cost of
some bloat?
Haki: Well, that's a different
issue.
The issue I describe is that when
you have 2 transactions trying
to insert a new tag at the same
time, the same new tag at the
same time, then the second ON CONFLICT
would not do anything
and it would also not return anything.
This is the unexpected part because
one of my requirements is that
the operation would be idempotent.
So if I give it tag a, I expect
to get an item returned.
So in this case, there is a scenario
where I provide it with
a list of N tags and I get less
than N tags in return.
Michael: But if they both updated,
imagine if both concurrent
sessions were doing INSERT ON
CONFLICT UPDATE, they both get
to the update stage.
You get 2 updates, but you still
get back all the tags that you
wanted to insert.
Haki: First of all, in my scenario,
I don't update.
This is a very, a lot of people
got this very confused.
This is why I added the comment
and some people made like a very,
let's call them funny comments
about it.
Okay.
But there's a difference between
upsert where you want to update
or insert, which is also a very,
a very popular pattern.
In this case, by the way, it's
simpler because if you actually
update the row, then it's going
to get returned by RETURNING
STAR.
Michael: Returning.
Haki: Yeah.
So that's like the easy case.
In my case, I don't want to touch
the row, but I still want
to get it back.
This is why get_or_create() is like
a more difficult variation
of UPSERT.
If you will.
Okay.
So, there's a tricky part.
So apparently there's no good solution.
By the way, the post by Erwin on Stack Overflow lands
on a brute force solution where
you basically loop until you
get what you expect.
I don't know if I would have, if
I, maybe this is the way to
go.
Nikolay: Loop where in, in a different
language or in PL/pgSQL?
Haki: He ended up writing a function
where you essentially loop
and you need constraint violations.
Yeah.
Nikolay: But you, you need a subtransaction.
So this is no go for me.
Definitely.
Right.
This doesn't scale.
I mean, this,
Haki: Nikolay, you are a man with
a mission.
Nikolay: Well, it's, it's no joke.
I, I would have clients literally
last week I spent like half
1 hour, and people went and switched
off it in Django.
Because it hurts, it hurts constantly
people, like subtransaction
SLRU, wait a minute.
It happens all the time.
The fact that Django by default
uses them and people don't realize
and then come to us with subtransaction
SLRU, it's good for my income,
but it's bad in general.
By the way, I wanted to highlight
1 thing.
If you do INSERT ON CONFLICT DO
NOTHING, RETURNING STAR, it won't
return anything if a row already
exists.
I just wanted to highlight that
you have a trick.
You use CTE, right? And then SELECT
for such case, right.
Yeah.
Additionally, with UNION, right?
Haki: Yeah, that's right.
Nikolay: UNION or what?
Haki: Yeah.
This is the only way that he can
actually communicate between
parts of a query.
Otherwise, they all see the same
snapshot and you don't get.
Nikolay: This is a trick also not
straightforward at all.
Honestly.
Yeah.
Like some people can move away
from INSERT ON CONFLICT DO NOTHING
just because of that.
Michael: Yeah.
Right.
Yeah.
One of many cool tricks in that book.
Haki: I went through this process
where I try to figure out,
wait, I just inserted this row
while, when I SELECT outside the
CTE, I
Nikolay: don't get it.
Michael: Yeah.
Yeah.
I understand there's some costs
to UPSERT, but given the complexity
we're running into here, is it,
would you both see it as acceptable
to pay that extra cost, the kind
of heavier, I know you mentioned
the double hit of the table
being annoying, but like
those updates, even though they'd
be setting the tag name to
the exact same tag name, potentially
over and over again.
Again.
I just see the guarantees
that provides is so attractive.
Haki: I'll be honest with you.
I understand what you're saying
and you're probably correct.
This is like the practical approach,
but I would not do it.
It just bugs me.
I would not do it.
I would not do a meaningless update
just to save, I would just
do two queries.
I would INSERT, and then I would
SELECT.
Separately.
Nikolay: Like an INSERT fallback,
INSERT fallback, you end up
having a lot of bloat depending
on the concrete situation.
So this, so subtransactions and
bloat, huge limitations of performance
here.
Right?
Haki: I know.
I think this is a scenario.
This is a case where you kind of
understand the limitation and
restrictions of the database and
you kind of end up solving it
at the application level.
Now, you mentioned at the beginning,
I know that I, well,
get_or_create() is, is useful, but I
haven't had a chance to implement
it as much.
So yeah, for the rare occasion,
what I actually need to do get
or create, I would just do it at
the application level because
as it turns out, it's very complicated
at the database level.
Nikolay: Right.
In applications, it becomes simpler.
You like, you don't deal with,
probably don't deal with subtransactions
and bloat, but latency like there
are trip times between attempts
to do one step, another step, it
increases chances of failures,
right?
I mean, collisions and so on.
Yeah.
Haki: But you know, if, if my main
business is to get or create,
I would come up with a very innovative
solution, put a lot of
engineering work into it.
But if it's just updating tags
for members in an organization,
I would go.
Yeah.
But I understand what you're saying.
I spent a lot of time doing unnecessary
optimizations on, you
know, weird places in the code
just for fun.
I do it all the time.
Yeah.
Like 90% of my blog posts are inspired
by these strange endeavors
where I try to optimize things.
So yeah, I, this is interesting.
And I think that this very simple
problem surfaced a lot of issues.
I also learned a lot from writing
this.
It got me interested in, in speculative
insertion, the mechanism
used by INSERT ON CONFLICT.
It, I think, broadened my understanding
of how the interaction
between different queries and a
common table expressions work
within the same query.
No, lots of things that I'm now
aware of them.
So at least I'm better equipped
to debug issues I might have
and don't even know about.
Michael: Yeah.
Yeah.
I really liked it as a journey.
I would encourage people to read
this, even if they don't have
this problem, just as a, almost
as just to watch somebody go
through this problem.
And you know, I feel like you've
included things as well.
I like there's 1 section that's
I took a long time as well, like
that.
And that's so helpful to share
those things with people.
Cause otherwise you can read some
of these blog posts and they
just sound like the person knew,
like just got it right straight
away.
And it's quite frustrating as somebody
that quite often goes
down long-term for me reading those.
So I appreciated that.
Nikolay: I just wish you had
also some warning about subtransactions
anyway.
Haki: You know what, let me just
change the banner.
The top of the website, instead
of the about page, I would just
place like this huge warning, They
don't need subtransactions.
Nikolay: Let me put right here.
If you see a PL/pgSQL with BEGIN,
EXCEPTION
WHEN or something.
So anyway, nested BEGIN... END blocks,
this is subtransaction.
Haki: Yeah.
But in this case, it's warranted
because otherwise, it cancels
any calling transaction.
So that would be the responsible
thing to do in this case.
But you know what?
Maybe it needs a comment saying
this is a good subtransaction.
Nikolay: It's not a good subtransaction
because if you have
high...
Haki: This is the least worst type
of subtransaction.
Nikolay: I cannot agree with you
here because you talk about,
you explore clashes and concurrency
issues.
Yeah.
It means you think that there will
be a lot of sessions issuing
this query.
It means eventually the project
will bump into this wall.
Good subtransactions is, for example,
DDL, which happens not
often in 1 session.
This is good.
And even there, it can become bad.
But here I cannot agree because
we should consider multiple sessions
fighting and this is where things
can go wrong.
Haki: What you're saying reinforces
what I'm saying that in this
situation, because it's so complicated
in the database, I would
just elevate that to the application
level and try to find a
solution there to avoid all the
complexity.
But it was a nice thought exercise,
understanding different types
of transactions.
Nikolay: Just to try regular, regular
transaction, just to try
regular transaction maybe.
Right.
Yeah.
It should not be so complex,
right?
Haki: I know.
I was surprised that it was so
complicated.
Nikolay: Why is it so?
Michael: What do you, like, what
would you both like as a, is
it that you would like MERGE to
work as expected?
Like what is the nice solution
on the database side for this?
Haki: Well, the dream syntax would
be for SELECT DO NOTHING RETURNING
STAR to just return everything.
That would be like the dream syntax.
Nikolay: I think part of the problem
is that RETURNING STAR is
quite like not super popular thing,
which was added later to
INSERT and UPDATE and DELETE.
It was not originally there.
So it's not a super natural thing.
Michael: Also, once you've defined
that behavior, it's so difficult
to like, you can't change the behavior
for existing applications.
Haki: You can't.
And I can even give you an example
because you usually use INSERT
ON CONFLICT.
With, you know, there is no MERGE
command in Postgres.
So if you want to sync data, you
use INSERT ON CONFLICT and then
when matched do nothing.
OK.
And more often what you want to
do is you want to avoid the unnecessary
updates.
So you do when matched, do UPDATE
where something actually changed.
OK.
And then some rows end up not
being affected.
So at this point you do RETURNING
star.
And then usually what I like to
do in my ETL processes is I count
the rows so I can log how many
rows were affected.
Right?
So now the question is, if I expect
to get the number of affected
rows, we go for a circle here.
I know.
So if I'm just expecting that the
affected rows, that if I'm
going to get rows that were not
affected, also very strange,
right?
Nikolay: Right.
Haki: Yeah.
So what's the conclusion?
Come on, guys.
I mean, what's the conclusion?
Just do it in the application?
Nikolay: Don't go there.
The main conclusion is don't go
there.
Don't think about it.
Haki: Yeah.
Nikolay: If an error happens, it only
confirms that application
developers should implement proper
retry logic.
Always.
If a transaction fails, code should
be ready to retry it.
That's it.
Universal solution to everything.
Haki: We should rename this to
just do it in the application
FM.
Well, it's complicated FM.
Michael: It's complicated.
Haki: No, it is complicated.
No, no jokes aside.
It is.
I don't know why it's so complicated,
but it is.
Michael: Turns out concurrency
is hard.
Haki: Yeah.
You know, one of the things
that I remember is that
when I read this RDS article about
a unique constraint violation
causing bloat, I was shocked.
I was shocked because coming from
Python, I'm like encouraged
to trigger exceptions.
This is how I do things.
I fail and then I, I adjust.
So, that was very strange for me.
I have this entire lecture about
how to handle concurrency with
a short URL system and everything.
Like there's a problem of how you
generate short random IDs.
How do you do it?
You need to, if you know, like
a short URL system, you need to
generate keys, very small keys.
So the space is limited.
So you want to generate random
unique keys.
So how do you do it?
So you end up in a solution where
you try to INSERT and then
you loop until you don't get unique
constraint violations anymore.
Right.
And now all I'm thinking is I may
have caused lots of bloat without
intention.
Maybe you have a better solution.
By the way, it's very similar to
the question about the gap,
the sequences.
Michael: Yeah.
I get that bloat is bad, right?
Like I understand, I see lots of
slow queries that are caused by
bloat, but once you're aware of it,
it's like something that can be managed.
And you've written a great blog
post about how to manage it at
Haki as well, that I'll include.
But it feels to me like it might
be one of those pains worth paying
sometimes when you're aware, like
every single PostgreSQL table
is going to have some bloat in
it, in a natural work life.
You're not going to get it down
to zero, even when you do some of
these maintenance tasks.
So we're always managing some amount.
It's just like, what's a, what's
a healthy amount and can, can
we keep it within those boundaries?
It's going to be more painful with
certain workflows.
Like maybe once we're getting near
the end of that, like once
we get past maybe halfway full
of that space, we're starting
like an average of 2 tries per
unique URL might start to get
annoying.
So maybe that's the point where
you start expanding the space.
by an extra digit or something.
I don't, I imagine that once you're
aware of the problems, you
can design a solution that isn't
that bad and then manage the
bloat.
Haki: Well, it all boils down to
how many collisions you actually
expect to get.
Nikolay: I think the fact that
rollback inserts cause bloat makes
me think that only undo people
can save us someday.
Michael: We had a quote last week.
Nikolay: The new term, undo people.
Because indeed, if you think
Michael: about it,
Nikolay: this different approach
would place a new row in the
same position on the page, right?
And if it's committed, it's there.
If not, like, I mean, the different
approach would not cause
bloat if an insert is rolled back.
Right.
So this is just the design of PostgreSQL's
MVCC we should blame.
Right.
Yeah.
It's very unexpected for sure.
Haki: Yeah.
Well, at least we can run very
long queries without worrying
about undo segments running out.
Nikolay: Yeah.
Michael: I've been there.
And it's great to have an old group,
like a former ORACLE person
understanding the downsides of
undo too.
Right.
Haki, any, any last things that
we should have mentioned that
we didn't?
Nikolay: And
Haki: no, I think that we zigzagged
across this article very,
very nicely.
Nikolay definitely delivered on
his promise.
Nikolay: Yeah.
Well, are you planning part 2 with
benchmarks?
Haki: I don't know if I still don't
care much about performance
in this case, but I think that
I might add an update on some
of the findings about how merge
and insert can end up eventually
causing a unique constraint violation
and non-idempotent results,
I hope that you implemented upsert
so you can identify where
blog post is updated.
So your AI can answer this correctly.
Nikolay: Oh, this is good.
Good goal.
Actually, we don't have.
Haki: Yeah.
Misconceptions about blog posts.
They never update.
Yeah.
I just solved your get or create
problem.
You can just update all the time.
Nikolay: It's not that easy because
we have many pre-processing
stuff for really long articles
as yours, because you know, the
LLM has usually has some limitation
OpenAI has like 8K tokens
as input, and I'm sure your article
exceeds it a lot.
So.
Haki: I try to write shorter ones.
Nikolay: Yeah.
Probably you should have 2 versions.
No, like for people who have issues
with long reads.
Yeah.
I do actually.
Haki: You like blog posts in parts.
Nikolay: I like short versions
till they are, but with some details.
Michael: I'm a big fan of long
blog posts.
I can, I know there's a lot of
others out there too?
So thank you for continuing to
work for us.
Nikolay: Yeah.
It's something that I I'm going
to cite for sure.
Yeah.
Links will be used to this
article.
Yeah.
Thank you for this.
Haki: I need an idea for the next
1.
So if you have 1 ping me, I have
a working title in my mind.
It's called how to take a dump
like a pro.
I couldn't resist the title.
Nikolay: That's a huge topic actually.
Haki: Yeah.
I've actually had some experience
with it lately and I think
that it's Interesting to see
how you can optimize pg_dump with
the different compression algorithms,
how they affect the CPU.
And when dumping in parallel is
actually not useful at all.
Nikolay: Unpartitioned tables.
Haki: No, if you have 1 big table.
Nikolay: Right, it's possible,
but you need to engineer yourself
with snapshots and repeatable transactions.
Haki: So that's common.
Nikolay: And use ranges of IDs.
This is what Patroni, for
example, does for logical replication
initialization.
So it's possible, for sure.
Haki: Yeah, but I do have another
consideration that you didn't
think about.
Because 1 of my restrictions was
that I wanted to stream the
output directly to S3 and not to
the local file system.
Imagine that you run this in a
Lambda.
In this case, using parallel is
problematic, but using a single
file, you can just stream it directly
and skip the file system.
And also if you want to pass the
dump through some kind of encryption,
then also doing things in parallel
makes it a bit difficult.
So yeah, different restrictions,
different solution.
Nikolay: When you do this, like
you, you find some like lacking
feature.
Do you have sometimes idea to implement
something?
Haki: Implement?
Nikolay: To
become a hacker.
To become a hacker.
Haki: Oh, what, like contribute
to Postgres?
Nikolay: Because if you, well,
dump lacks a lot of stuff all
the time, like, but it's super
complicated to add some things
there, but definitely it likes
a lot of stuff.
So I'm curious if you've had such
feeling, like idea, like, Oh,
I'm going to try to implement this.
No?
Haki: I've looked at the Postgres
code many, many times.
I think that the documentation
in the source itself is absolutely
excellent.
I remember when we worked on this,
hash indexes article, we looked
at the actual source file.
There's lots of very, very useful
information.
By the way, this information was
a lot more useful than anything
I could find online, including the
documentation itself.
So it was absolutely excellent.
And I think that the code is very
well written.
I'll be honest with you.
I don't think I'm capable enough
to contribute to Postgres itself.
But, you know, I do from time to
time think about my contribution,
my type of contribution to the
Postgres community.
And I know that the most straightforward
way to contribute to
Postgres is to contribute code,
but I feel like I contribute
in my way by advocating for Postgres
and educating people how
to use Postgres correctly because
I believe that if people would
have good systems, they'll be happy
with Postgres, they'll continue
using it, they'll recommend it
to their friends, and this is
how you build a community.
So I think that I'm more capable
in telling stories, educate
people about Postgres, working
on ETL processes and applications
that I do working on, you know,
the tooling, the internal stuff.
There are people much smarter than
me that can do this stuff.
I'll write about them.
Nikolay: Well, yeah.
Postgres documentation lacks how-to
recipes, and this is what
you do.
You describe some recipes and pros
and cons.
This is an excellent thing.
Haki: Yeah.
War stories.
Michael: You absolutely do all
those things, and we really appreciate
it, Haki.
Thank you so much for coming on
as well.
It's been a pleasure having you
here.
Haki: Okay.
Thank you, Nikolay, Michael.
It was a pleasure.
This is the first time for me in
a podcast.
Michael: But you did great.
Haki: Thank you so much.
Bye.
Nikolay: Thank you. Have a good week.
Michael: Take care.
Haki: Thanks.