Don't do this
Michael: Hello and welcome to Postgres
FM, a weekly show about
all things PostgreSQL.
I am Michael, founder of pgMustard.
And this is my co-host Nikolay, founder
of Postgres.AI.
Hey Nikolay, what are we talking
about today?
Nikolay: Let's talk about something
we shouldn't do.
Michael: All the things that we
shouldn't do with respect to
Postgres, right?
Nikolay: Of course, we are not talking
about alcohol or sugar,
right?
Michael: Neither of which you've
ever done on the podcast, yeah.
Nikolay: Well, I was drinking a
lot of Red Bull on the podcast and
it has a lot of sugar, so.
But at least it's not alcohol.
Michael: True.
So there's a fairly famous wiki
page, which I was actually surprised
is only about 6 years old.
It's probably the wiki page I link
to the most in the PostgreSQL
wiki.
And it's titled "Don't do this".
It's a list of a bunch of things
that are largely inadvisable.
But there are also a couple of other
sources: you've written a how-to
that includes a lot of don't-do-this
advice that isn't on the wiki.
And there was a really good page
that you sent me that I've looked
at as well by Yandex who also have
some additional don't do this
advice.
So we're going to pick some of
our favorites from those and help
explain why not and what you can
do instead.
Nikolay: Right, right, right.
So the wiki list is quite large.
It's not huge, but quite large.
And I like almost everything.
Not everything, not everything.
For example, BETWEEN. I sometimes
use BETWEEN; you just need
to understand how it works.
It includes both sides
of the range, both boundaries,
right?
I don't see anything wrong if you
use it with an integer, for
example.
Michael: But that's what they say,
isn't it?
I think that's what the wiki is
really good at doing.
Nikolay: Well, if this list is
supposed to be simple to memorize,
and if something is included, like,
don't use between, and you
think, okay, I should avoid it.
Maybe actually you should avoid
it.
But if we take between, SQL is
such a rich language, we have
many things to exclude then as
well.
For example, I don't know, like
various things with join.
Like, this can bring us to the
area of, for example, always use
AS, don't use aliases without AS,
right?
And so on, these kinds of things,
because they can lead to confusion.
Or don't use ORDER BY 1, 2, 3, or
GROUP BY 1, 2, 3.
Which is actually good advice in
general, but most experienced
folks use it anyway because it's
just short, right?
And these kinds of items in this list
I don't really like; in my opinion,
they are quite shallow.
Michael: So I'm going to push back.
I think this isn't a list to be
memorized and not looked at.
I think one of the benefits of having
it written down with some
extra points, like for example,
in the title it says don't use
BETWEEN, and in parentheses, especially
with timestamps.
And then it has a why-not section
that explains that it includes the boundaries,
whereas you probably don't want
that if you're dealing with continuous
ranges like timestamps.
And then it says, when should you?
So it says, between is safe for
discrete quantities like integers
or dates, as long as you remember
that both ends of the range
are included in the result.
And then they say, but it's a bad
habit to get into.
And I actually think that contains
all of the subject you were
just talking about.
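
As a minimal sketch of the boundary issue described here, assuming a hypothetical events table with a created_at timestamptz column:

SELECT count(*) FROM events
WHERE created_at BETWEEN '2025-01-01' AND '2025-01-31';
-- both boundaries included: rows at exactly 2025-01-31 00:00:00 are counted,
-- but the rest of January 31 is silently left out

SELECT count(*) FROM events
WHERE created_at >= '2025-01-01' AND created_at < '2025-02-01';
-- half-open range: covers the whole month, no ambiguity at the edges
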
Nikolay: Well, let's maybe edit
the wiki and change the title to
don't use BETWEEN for timestamps,
to make it clear.
Because for integers I don't see
the problem at all.
Michael: So you could say like
0 to 9 and 10 to 19 or you know,
yeah.
Nikolay: With timestamps it's tricky,
because if people don't spend time
thinking about what exactly
they include and what they don't,
they can create quite bad problems.
They can have trouble even without
BETWEEN, using just simple
comparisons.
Also, grouping by truncated timestamps,
like with date_trunc, can be tricky as well.
You just need to understand that,
depending on your data,
the edge can be tricky.
Sometimes it includes a lot of
rows: midnight can have a lot of rows,
and you need to decide where they
should go, to the previous day or the
next day, right?
Or sometimes there can be 0 rows
exactly on the edge, because you always
have milliseconds.
Michael: Or it's even more precise
than that, right?
Nikolay: Yeah, you need to think
about edges and where to include
them.
And of course, if you use between,
it's hidden a little bit.
I agree with this.
But with the blanket don't use BETWEEN,
especially with timestamps,
I cannot agree.
Michael: Okay, well, yeah.
I guess it's a wiki for a reason
and edits are allowed.
Right.
But should we focus on some of
the ones we agree with more?
Nikolay: Yeah, sure.
Choose anything.
Michael: Well, I like the NOT IN one.
The wiki starts with a bunch that
I don't see as being that useful;
they don't seem to trip that many
people up.
But a few sections in, it gets to
SQL constructs, and the first of
those is don't use NOT IN.
Nikolay: Yeah, I would recommend
reading the jOOQ blog post Lukas
wrote about this; it explains very
well why it's bad. And if you remember,
I once mentioned the very nasty problem
I had with one of my startups in
the past, when we couldn't launch
properly for several months, losing
traction, and then I found
an issue related to NULLs.
It was exactly this.
It was in the episode about NULLs,
one of our very first episodes.
And yeah, NULLs are in general
very tricky, and this is one of
the traps where they can catch you.
And you need to study three-valued logic,
but for some people, myself for
example, it will always
be a trap.
I mean, I know I can, even with
20 years of experience, I can
be caught there.
That's why, like, you just need
to practise avoiding these traps.
So basically, if you have NOT IN
and one of the values inside it
is NULL, the result will always be
unknown, so it will not return
what you expect.
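
A minimal sketch of the trap, with made-up values:

SELECT 1 WHERE 2 NOT IN (1, NULL);
-- returns no rows: 2 <> NULL is unknown, so the whole predicate is unknown, not true

SELECT 1 WHERE NOT EXISTS (
  SELECT 1 FROM (VALUES (1), (NULL)) AS t(x) WHERE t.x = 2
);
-- returns one row: NOT EXISTS behaves the way most people expect here
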
Michael: Yeah, and I think the
reason this trips people up the
most is that IN and EXISTS can be
used, say, interchangeably, but
that's not true for NOT EXISTS and
NOT IN.
And that's confusing.
Nikolay: That's confusing, yeah.
NULL is confusing in general.
Three-valued logic is confusing
in general.
So NULL is a big trap itself.
It has just many faces, and NOT IN
is one of these faces.
Michael: Yeah, well, I'll link
to the episode we did on NULLs
in the show notes, for anybody
who wasn't a listener back then
or missed that one.
That also reminds me, in the jOOQ
list there was some schema advice,
some general don't-do-this for
your design.
And one of the things was around
constraints on columns in the
first place.
Like, the default, in Postgres at least,
and I think in all databases I've used,
or at least the ones I remember,
is that columns can contain nulls.
Nikolay: Can accept NULLs.
Michael: Can accept NULLs, yeah.
And it's an interesting point that
you make, that maybe that
should be flipped and NOT NULL
should be the default.
Or at least when we're designing
schemas, it could make sense
to do it that way.
Nikolay: Yeah, yeah, maybe.
At least when you design something,
it's a good habit to consider
everything as not null.
If needed, the next step is to
think about the default value
and only then accept nulls.
Yeah, just reduce the number of
cases where nulls can appear.
And that's a great, I would say,
style.
This is about style.
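
As a small sketch of that habit, with hypothetical table and column names:

CREATE TABLE orders (
  id          bigint NOT NULL PRIMARY KEY,
  customer_id bigint NOT NULL,                -- NOT NULL unless proven otherwise
  status      text   NOT NULL DEFAULT 'new',  -- next step: think about a default value
  note        text                            -- only this one genuinely needs to accept NULLs
);
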
Michael: It does have performance
implications, that 1 as well,
because if the database can know
that there are no nulls in that
column, that can allow certain
optimizations as well.
Nikolay: Well, yes and no.
Well, it depends.
Nulls are stored differently.
If you put some default value,
it will be stored as is, right?
Unless you're adding a column to
an existing table: in that case, since
Postgres 11, we have a kind of virtual
default, stored virtually, not physically.
But for all new rows, the real default
value will actually be stored.
And NULLs are just stored in a bitmap,
right?
But I agree that in general, with
indexes and so on, NULLs sometimes
present some challenges in terms
of performance.
Michael: I think also it can rewrite
certain queries if it knows
for sure that a column cannot contain
nulls.
If there's like a constraint on it.
If you've written, and I guess this
is a contrived example, but if
you've written WHERE x IS NOT NULL,
it can just ignore that completely.
It can just ignore certain filters
or certain operations.
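
A quick way to check this on a given Postgres version, with a hypothetical one-column table; whether the redundant filter disappears depends on the version:

CREATE TABLE t (x int NOT NULL);
EXPLAIN SELECT * FROM t WHERE x IS NOT NULL;
-- on newer versions the planner can drop the Filter: (x IS NOT NULL) line entirely,
-- since the constraint already guarantees it is true
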
Nikolay: Good point.
Michael: But yeah, this is not
important in the grand scheme
of things.
Nikolay: Just not in is dangerous.
That's it.
Not in is dangerous.
I think the wiki page has it as well,
right?
Don't use NOT IN.
Agreed.
With this statement, I agree.
Michael: What do you think of the
next one on the wiki which
says don't use uppercase table
or column names?
Nikolay: Again, it's a matter of
style.
Right here.
It's a matter of style.
So I don't like uppercase in general,
not only for identifiers.
I don't like it for keywords like
select.
I prefer lowercase.
But of course, if you don't want
to be required to use double
quotes, then don't use uppercase.
Or, if you do, you can still avoid
using double quotes, but then you
need to always be consistent.
So, basically, if you don't use
double quotes, names are effectively
case-insensitive.
Michael: They're lowercased, right?
Nikolay: They're internally lowercase,
but you can use mixed
case and it will be the same as
lowercase, right?
So, no big problem if you just
type all table names uppercase,
until you start using double quotes,
or some of your software
starts using them, and then you're
in trouble.
So better to lowercase everything.
But sometimes we have migrated
systems from other database systems
and in this case we do see table
names like in camel style for
example, right?
In this case it's just legacy we
need to deal with.
I personally like to lowercase
everything for, like, economical
reasons: fewer actions when you type,
right?
That's it.
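
A small sketch of the case-folding behavior being described, with hypothetical names:

CREATE TABLE Users (ID int);        -- unquoted: stored as users / id
SELECT id FROM USERS;               -- works; any mix of case folds to lowercase
CREATE TABLE "Users" ("ID" int);    -- quoted: a separate table, case preserved
SELECT "ID" FROM "Users";           -- and now it needs double quotes forever
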
Michael: Yeah.
And underscores between words in
object names?
Nikolay: Underscore, yeah.
Well, maybe sometimes, yeah.
So like in Python and Shell, snake
style, right?
Michael: Snake case, yeah.
Nikolay: Snake case.
Michael: Cool.
So we're into the meat of it.
We've discussed the timestamp stuff,
and we've got a whole episode on
times and timestamps, so we can
link that as well.
Nikolay: In general, I wanted to
mention it in the beginning
but I forgot.
This episode is like kind of basic.
We like basic episodes, right?
Because we know many of the people
who listen to us are not database
experts, and it's good not to dive
into very narrow, deep problems,
but sometimes just to remind people
about some basics.
And it's fine if there's nothing new
in this episode for you; in that case
it's also good, but I hope for
some folks it will be useful.
Michael: I do think it's also useful
sometimes as people that
have got a lot more experience
to remember what it was like being
a beginner.
Chances are you've got people on
your team, or people you have
to help, who are new to this stuff,
and having a few more resources
you can point people at as to
why something is a bad idea,
instead of having to explain everything
again from first principles,
I find that somewhat helpful.
Hence why I link people to the
wiki so often.
But yeah, also worth considering:
should some defaults change?
Can we make the wiki clearer?
Like, if you do know all this stuff
already, it'd be cool if
you could help us make these resources
clearer or give better
advice.
Is there anything that you advise
people never to do in Postgres
that isn't on the wiki yet?
Could you update it?
So I'd love even more advanced
people to think about how they
can make things easier for new
folks coming along.
Should we make defaults different?
No beginner is going to be able
to dive into the Postgres codebase
and argue on the mailing list for
making a default different,
I don't think.
Nikolay: Right.
Yeah, many defaults are outdated,
you know my opinion about this.
Michael: But this is a bit different,
right?
We're often talking about GUC settings
when we're talking about
those defaults.
This is more of a design goal.
Nikolay: Yeah, here as well.
Well, again, I don't see how
NOT NULL can become a global standard,
the global default.
But I can see how in a particular
project it can easily become a code
style or schema design standard.
Michael: Yeah, it's an interesting
point.
Obviously, because we have to support
backward compatibility,
it's impossible now, I think.
But at the beginning, you could.
Nikolay: Also, I guess, there may
be some standard things here.
Anyway, I don't know.
Maybe there is a standard regarding
defaults, maybe not.
Let's proceed to some bigger
problems, maybe.
Or, what about...
Let's finish with the small problems first.
Don't use money.
Don't use money.
Michael: The data type.
I know you find this funny every
time.
Nikolay: Well, the wiki says don't
use money.
They don't say don't use data type
money.
They say don't use money.
Some communists here, I think.
So, yeah, money is a bad data type.
Why is it still there?
Maybe it's standard as well?
I don't think so.
I don't know.
So the implementation is terrible.
You will start losing cents if
you use it.
Michael: So rounding's an issue,
partially.
There's quite a lot of pricing
these days, if we're paying for
compute and paying for APIs, like
tokens or access to things, where
we're often paying fractions of a cent
per image, per minute, per whatever
we're paying for.
So fractions of...
Nikolay: A cent, yeah.
If you're processing a lot of transactions,
you will start losing some money,
so don't do it.
Use numeric, but check the performance,
obviously.
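
A tiny illustration of the fraction-of-a-cent problem, with made-up values; money's formatting also depends on lc_monetary, so the exact output varies:

SELECT 0.015::money;            -- rounds to two decimal places, e.g. $0.02 with a US locale
SELECT 0.015::numeric(12, 6);   -- numeric keeps 0.015000
-- another common option is storing integer cents, or micro-cents, and shifting the decimal point in the application
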
Michael: Yeah, I've also seen people
use integers, with, like...
Nikolay: Okay, but what about fractions
of cents, for example, again?
Michael: Well, that's the idea: you
can just move the decimal place
if you store it as an integer.
Nikolay: Well, okay, if you define
the precision in advance and then
follow this rule, okay, why not?
Michael: I guess performance reasons.
I don't know, I didn't ask.
Nikolay: Right, right. Okay, good,
simple rule: don't use money.
What's next?
Let's choose some bigger topic.
What do you think?
Michael: Well, I think the primary
key...
Yeah, let's go to...
Let's jump to your list, your how-to
guide that we'll link up
as well.
Nikolay: Let's do it.
So yeah, primary keys are quite a simple
thing: just don't use int4
primary keys.
Consider them in the same manner
as int2 primary keys:
do it only when you fully understand
the consequences and reasons.
I mean, int4 has a capacity
of 2.1 billion values;
the maximum is roughly 2.1 billion.
And sometimes we reach this.
Yeah, positive and negative.
Well, it's interesting that we
can use negative values, but usually,
for surrogate keys, people
don't use negative values.
There is an idea, well, like to...
I saw it many times, people say,
okay, we are approaching 2.1
billion soon, let's just start
using negative values.
Okay, good luck with that in URLs
and so on.
It's like, I think it's definitely
possible, but the amount of
work is not trivial.
Again, it's better to convert to
int8, which is not trivial
if you already have a billion rows,
or 4 billion rows; at that point
I would say it's probably
already an emergency.
But our checkup tool, I think,
starts raising big flags,
yellow or even red, after
50% of the capacity is used.
By the way, it's an interesting
problem: how, in one query, checking
pg_class and the other system catalogs,
to find all tables which have a
surrogate int4 primary key, or maybe
a multi-column primary key where
one of the parts is int4, and then
to check the current value in the
sequence.
It's not trivial.
Unfortunately, I remember trying
to solve it with pure SQL; it wasn't
possible, so I downgraded to PL/pgSQL
and have a snippet.
It's an anonymous DO block, or maybe
a function, anyway.
And then in this case, it can scan
your schema and then check
values in sequences and report
the capacity for each int4
primary key case.
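
For reference, a rough single-query approximation of that check, leaning on the pg_sequences view available since Postgres 10; it assumes single-sequence-backed primary key columns, and Nikolay's PL/pgSQL snippet handles more edge cases:

SELECT c.relname   AS table_name,
       a.attname   AS pk_column,
       s.last_value,
       round(100.0 * coalesce(s.last_value, 0) / 2147483647, 1) AS pct_of_int4_capacity
FROM pg_index i
JOIN pg_class c      ON c.oid = i.indrelid
JOIN pg_attribute a  ON a.attrelid = c.oid AND a.attnum = ANY (i.indkey)
JOIN pg_depend d     ON d.classid = 'pg_class'::regclass
                     AND d.refobjid = c.oid
                     AND d.refobjsubid = a.attnum
                     AND d.deptype IN ('a', 'i')      -- sequence owned by the column
JOIN pg_class seqc   ON seqc.oid = d.objid AND seqc.relkind = 'S'
JOIN pg_sequences s  ON s.schemaname = seqc.relnamespace::regnamespace::text
                     AND s.sequencename = seqc.relname
WHERE i.indisprimary
  AND a.atttypid = 'int4'::regtype
ORDER BY pct_of_int4_capacity DESC;
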
And again, my approach is very
simple.
Using an int4 primary key is similar
to using an int2 primary key.
Because the data volumes we deal
with usually are already so
big.
So, it only makes sense if you do know
what column tetris and alignment
padding are, if you know how the
storage is organized.
You always...
Michael: Yeah, you jumped to this
before, like, I think you're
right to mention it.
But I think the larger argument
is that there's almost no cost:
even when there is a cost
to using bigint over int,
even when you have considered
column tetris, the cost
is still so small compared to the
long-term advantages.
Nikolay: 4 bytes for each row.
Michael: Yeah, it's trivial.
Nikolay: But this is why people
usually choose int4.
They say, okay, we will save 4
bytes.
In this table, we won't reach
2 billion, and we will
be saving 4 bytes for each row.
It's good for us and they choose
it.
Michael: What do you think is...
Nikolay: I'm talking about a conscious choice.
Michael: I don't, yeah, I think
most choices are unconscious.
I think most people are just taking
the default.
It's in there, like, maybe an ORM,
maybe a tutorial they're following.
Maybe, I want an integer primary
key, why not choose the type
that's called integer?
So I think most of the time this
choice is not conscious.
Nikolay: I agree with you.
Yeah, behavior of humans is that
they just choose defaults and
go because they don't have time
to understand.
But why am I talking about this
so precisely?
Because I saw it many times: people
are already approaching 2 billion,
they have problems, it's hard to
migrate without downtime, though
possible.
And then we tell them: you know,
guys, not only do you have a
big problem and you'll be down
soon, I mean partially down, not
accepting inserts anymore for this
table, but also look at this:
id and created_at.
4 bytes, 8 bytes, so you don't
even have any benefit from int4.
It's emotionally big, understanding
that you have 4 zeroed padding bytes
in every row.
So it's like, why did we do this?
Why did we do this?
This is an emotion; that's why I'm
talking about it so much.
I saw it in many teams.
I'm just sharing experience here.
Michael: Yeah, and you're totally
right to mention it, that a
lot of the time due to alignment
padding we don't even get the
benefit of the saved 4 bytes.
But what I meant was more that,
even without the fact that there's
sometimes zero benefit, and I think
that will shock people,
I think it's still worth it.
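
A quick way to see the padding effect for yourself, with two hypothetical two-column tables:

CREATE TABLE t_int4 (id int4, created_at timestamptz);
CREATE TABLE t_int8 (id int8, created_at timestamptz);
INSERT INTO t_int4 VALUES (1, now());
INSERT INTO t_int8 VALUES (1, now());
SELECT pg_column_size(t) FROM t_int4 AS t;  -- reports the same size as...
SELECT pg_column_size(t) FROM t_int8 AS t;  -- ...this: the int4 saving is eaten by 8-byte alignment padding
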
Nikolay: Right.
Well, at the end of the day, 4 bytes
for each row is not a huge saving.
For a billion rows, it's just 4 billion
bytes, roughly 4 gigabytes.
Noticeable, probably, but not a
lot.
Michael: Yeah, and even if you
add in the fact that it's
probably duplicated in some indexes,
you probably have several
indexes that involve the
primary key, maybe some
multi-column indexes,
maybe you've worried about
memory, but yeah, it's
not big.
And the time you will realize that
it wasn't worth it is that
one project.
Let's say you have a hundred tables and
you've used int4 for all of them
instead of bigint.
And only one of those tables has
to be converted to int8 later.
That project you go through
to do that conversion, you're
going to realize we'd have been
better off going with int8
for all of them.
You only have to go through one of
these projects once to realize
it's just not worth it.
The time investment alone.
Never mind the stress, like, under
pressure.
Nikolay: Interesting that the wiki
list doesn't have this at
all; it doesn't talk about primary
keys at all.
It does say don't use serial, but
that's not about the number of bytes;
it's that serial in general is not
a good thing.
So I guess bigserial is also a bad
thing in the opinion of the people
who wrote it.
Interesting, right?
I mean, it's quite common.
We could edit that, yeah.
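
For reference, what the wiki recommends instead of serial, on Postgres 10 and later, is an identity column; a minimal sketch with a hypothetical table:

-- instead of:  id bigserial PRIMARY KEY
CREATE TABLE events (
  id bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY
);
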
Next, choose anything.
What do you think?
Michael: From your list, we've
talked about NULLs a bit already.
I think transactional DDL and DELETE.
DELETE is a great one: don't delete
a lot of rows with one command.
Nikolay: Massive DELETE is massive, yeah.
I had incidents.
It was on weaker storage, but they
were very painful incidents, when
just deleting 10 million rows led
to more than a 10-minute outage
and a lot of money lost.
So yeah, it's not easy to delete
rows in non-partitioned tables,
in large tables.
So it's a big task actually if
you need to clean up.
And maybe I told you, one day
I went to the VLDB conference, which
is probably the oldest big conference
in the area of databases,
mostly with academic people.
It was in Los Angeles, and one of
the keynotes was saying that data volumes
grow so massively that we need
to study how to delete data, not
how to store it, or how to
find the things we need to delete.
But with Postgres, knowing how
MVCC is organized and how DELETE
works, you need to understand that,
to do it without downtime and without
big stress, DELETEs should be batched.
And the batch size should be...
When I was young, it was difficult
for me to understand how to
find the proper batch size.
If you take a batch size of one,
there's probably too much transaction
overhead, and it will affect throughput
as well: you will be deleting fewer
rows per minute or hour in your
background jobs, right?
Michael: Yeah.
Well, I tend to see, like, low thousands.
I guess it depends, but what do
you tend to do?
Nikolay: I have a simple approach.
Remember our very first episode.
I actually already described this
there, and you see, reminding people
of things is also useful sometimes.
So I have a very simple approach.
It's based on what our final goal
is.
Our final goal with OLTP systems
is that systems should be working
fast in general.
For humans, fast means below 100
or 200 milliseconds, right?
Michael: Nice,
Nikolay: yeah.
1 second is quite slow, 10 seconds
is very slow.
Deletes can block some people,
some other transactions.
Plus they can lead to I/O effects
and so on.
So I prefer batches to last not
more than 1, 2, 3 seconds.
In this case, we know the negative
effects won't last more than a second
or a few seconds.
And even in the worst case, we
will be blocking, for example,
someone for just 1 second, which is
quite slow, but not terribly
slow.
Yeah.
If it's already 10 seconds, that's
way too slow for people, and
we might start losing traffic if
we're talking about web projects,
or people start complaining and
leaving us, right?
If it's 100 milliseconds, it's
good, but probably too small.
So batches like half a second,
1 second, in my opinion, are perfect.
But sometimes it's hard to understand
the duration in advance
for all batches.
You try a few batches, they are
fast, but over time they can
degrade.
So you need to keep in mind that
if you're batching and deleting
according to some ORDER BY, for
example by timestamp, deleting the
oldest data first, then newer and
newer, you need to control vacuum,
because your delete, trying to find
the next batch, will probably be
scanning too many dead tuples.
I mean, the index will have a lot of
pointers to dead tuples, and the
performance of a single batch delete
will degrade over time.
To avoid that, you need to control
vacuum behavior and maybe run vacuum
yourself from time to time, cleaning
up dead tuples and keeping the
index fresh.
Or, an additional thing, sometimes
I found myself... maybe we need an
episode about DELETE actually, or
about massive operations, if we
haven't had one.
Sometimes I just decide, okay, I
like stateless queries which delete
batch after batch, where we don't
remember the state.
But sometimes, instead of dealing
with vacuum myself, I just delegate
it: I say, okay, I will make my
script or program memorize the state,
just knowing the latest ID or timestamp
deleted, and start from there.
It's basically similar to keyset
pagination.
Michael: Yeah.
Nikolay: Actually it's not just similar,
it is keyset pagination.
In this case, you don't depend on
vacuum behavior and can
go faster and so on.
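
A minimal sketch of that pattern, with a hypothetical events table; the retention window and batch size are made up, and the driving script keeps the last deleted id between batches:

WITH victims AS (
  SELECT id
  FROM events
  WHERE created_at < now() - interval '90 days'
    AND id > :last_deleted_id          -- keyset-style state remembered by the script
  ORDER BY id
  LIMIT 5000
)
DELETE FROM events e
USING victims v
WHERE e.id = v.id
RETURNING e.id;
-- the script stores max(id) from the returned rows as the next :last_deleted_id,
-- one batch per transaction, repeated until nothing comes back
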
Michael: The other time I've seen
this cause real issues is when
people don't see how many deletes
will happen.
For example, let's say it's a SaaS
account and they're deleting an
account, and that account has users,
and those users have events.
The number of rows a single delete
can cascade to can be very different:
deleting a large account is a very
different amount of deletes than
deleting a small account.
So I'm guessing this is where you'd
avoid ON DELETE CASCADE.
Nikolay: Yeah, that's interesting.
And in the systems that were built
fully under my control, I always
tried to avoid cascade deletes.
But surprisingly, I saw quite big
systems which use it.
Yeah.
With many dependent objects deleted
and so on.
And I don't know, I think it depends.
You need to think about it in advance,
of course, and maybe just
rely on it.
There might be a hybrid approach:
if we know the object is not huge,
we delete it relying on DELETE CASCADE;
but if we know the object is huge,
then before the final delete of the
main object, we clean up its dependants
asynchronously first, and only then
delete it.
Michael: In batches.
Nikolay: It depends.
Right, right, right.
It depends on the particular system,
but the surprise to me was
to see that there are good systems
relying on cascaded deletes
with many, many dependent objects,
like thousands, tens of thousands,
and it's kind of okay.
As always, I would say it's worth
thinking about performance in advance
and testing.
Just test your system, your environment,
your hardware; imagine the worst case,
test it, see how it works, and then
follow this rule: like, 1 second is
good, half a second is good, right?
Michael: Yeah, I like that.
I like that a lot.
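
A rough sketch of that hybrid idea, with hypothetical tables; the ids and batch size are made up:

CREATE TABLE accounts (id bigint PRIMARY KEY);
CREATE TABLE users (
  id bigint PRIMARY KEY,
  account_id bigint NOT NULL REFERENCES accounts (id) ON DELETE CASCADE
);

-- small account: delete the parent and let the cascade handle the rest
DELETE FROM accounts WHERE id = 7;

-- huge account: pre-clean the children in batches first...
DELETE FROM users
WHERE id IN (SELECT id FROM users WHERE account_id = 42 ORDER BY id LIMIT 5000);
-- ...repeat until 0 rows, then the final cascading delete has almost nothing left to do:
DELETE FROM accounts WHERE id = 42;
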
Nikolay: Yeah, but actually, since
this is kind of an episode about
basics, of course, we forgot the
elephant in the room, right?
Deleting without any conditions,
without a WHERE clause.
On a big table it will be slow, but
it's also interesting to see that
the space is not immediately available;
it won't become free disk space
right away.
Because a delete consists of two processes:
first your DELETE, which is synchronous,
and then vacuum, which physically
removes the dead tuples.
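
A small way to watch those two phases, with a hypothetical table:

DELETE FROM events;    -- every tuple is only marked dead; the table file keeps its size
SELECT n_dead_tup FROM pg_stat_user_tables WHERE relname = 'events';
VACUUM events;         -- reclaims the dead tuples for reuse (not necessarily returning space to the OS)
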
Michael: That isn't where I thought
you were going with that.
I used to work on a tool for SQL
Server, Microsoft SQL Server; it
was like a plug-in for the IDE that
Microsoft has.
And one of our favorite features,
one of the features people loved
the most, was a warning to say,
did you mean to...?
It was for DELETE and UPDATE.
If you try to run a DELETE without
a WHERE clause, it would warn
you before running it.
If anybody's out there who's writing
Postgres IDEs, please add
that feature.
It's so helpful.
Nikolay: Actually, yeah, there
is an extension to prohibit this.
And actually, it was funny, I think
the very first thing we developed
with Andrei was this exact patch
for Postgres, but it was rejected
by
Michael: Oh, for Postgres?
Nikolay: By hackers, yeah, yeah.
But, like, we just wanted a warning
to be produced, or maybe the
ability to forbid WHERE-less deletes.
So yeah, let's maybe
continue.
Michael: But IDEs don't have to
worry about the core Postgres.
It can be implemented on a case-by-case
basis there, at least.
Nikolay: This is a big mistake,
if you forget the WHERE and execute
it.
And sometimes we don't even need
a semicolon at the end.
For example, in psql, if it's -c
and you write something,
it will execute it; of course, you
need quotes.
Sometimes we don't need a semicolon
if it's a single query execution.
In that case, it can be terribly
bad: you've deleted everything.
Right, right, right.
Okay, but in general, yeah, DELETE
is tricky.
Michael: Oh, I have one more
beginner-friendly one from the quick
list: don't use SELECT *, or at
least not in application code.
I think it's useful for ad hoc
queries and exploration.
How do you feel about this one?
Nikolay: Yeah, but do you remember
what should be used instead?
Because I don't like 50 columns
being listed either, and then you
realize that that's all of them.
Right?
Michael: Well, but...
So I think there's a few reasons
for this.
Firstly, do you really need 50
columns?
Like, what are you doing that requires
50 columns?
Nikolay: There are so many cases.
This advice cannot be generic.
Because, for example, if I define
a view which should consist
of all columns of the underlying table,
I will definitely use star,
then later if I need to add a column,
I redefine the view, and I don't
need to rewrite this query, right?
And I know everything is included.
There are many different situations.
Of course, it's a minimalistic
approach.
Don't take too much, right?
In general.
But instead of listing all columns,
if I need all of them, I would
use the star, the asterisk.
Michael: So in application code,
I think it can break in unexpected
ways.
Like, if you then add some columns
to that table, can your code handle
more columns coming back in the future
than it currently expects?
Nikolay: It depends.
If I put the star, of course, I think
about future changes of the schema.
I do think about it.
Michael: Cool.
Well, okay.
Interesting.
Nikolay: You know, in Postgres, you
can say SELECT table1 FROM table1.
Michael: Or just table1.
Nikolay: No, no, no, it's different.
I mean, TABLE table1 just selects
everything, right?
Michael: You can do, like...
Nikolay: SELECT table1 FROM table1.
It will give you just one column,
but of a record type, like a composite,
kind of virtual data type.
And everything will be collapsed,
and you can unwrap it later
in your queries if it's a subquery.
But this is powerful; you don't
need to list all columns.
And it's better than star because
sometimes you just need...
This is the magic of Postgres,
where maybe First Normal Form
is broken, actually, because you
basically wrap everything into
a single column.
Okay.
Michael: The whole table.
Nikolay: The whole table, which is great.
And then, if you're inside a PL/pgSQL
context, for example, it's perfect
to do this.
You just define a record variable,
and you can put a whole row, with all
its columns, into that variable.
Later, if there is evolution of
schema and more columns, this
code will still work.
But depending on your language
and driver, I don't know what
will happen if you run SELECT table1
FROM table1 LIMIT 1, for example,
with some WHERE clause.
I don't know.
It's worth checking, but it's interesting.
This relates to what you said about
getting back more columns than are
needed, right?
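
A small sketch of the whole-row trick, with a hypothetical users table (the email column is made up); whether your driver copes with a composite column is, as said, worth checking:

SELECT u FROM users u WHERE u.id = 1;          -- one column, of the composite type users
SELECT (u).email FROM users u WHERE u.id = 1;  -- unwrap a single field later if needed

DO $$
DECLARE
  r users%ROWTYPE;
BEGIN
  r := (SELECT u FROM users u WHERE u.id = 1);
  RAISE NOTICE 'whole row: %', r;   -- keeps working if columns are added later
END $$;
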
Michael: Yeah. Well, the time I see
this used the most, and the reason
I think this advice is good, is that
I so often see, mostly as a result
of ORMs, people selecting every
column when they only need two.
Nikolay: Yeah, yeah, yeah.
This minimalism approach, I get
this.
But I have a student right now
who writes Python code.
And sharing experience, I just
see the code which is written
and I always say like, oh, you
define this function but you use
it just once.
Why do you need the function here?
You define the variable which you
assign to a different variable
and then you just use it once.
Why do you do this?
You had some constant you defined
and used it once.
Just like, why do you need these
things?
And here I see a similar pattern.
If we select column names, what
if we rename these column names,
right?
We need to go there and rewrite
it as well, something like this.
Well, it depends on the situation,
of course, but I don't like
the idea that we will list all
column names many times if we
know we need the whole row, for example.
It's just like observations trying
to...
There is minimalism in star as
well.
This is what I'm trying to say.
A different kind of minimalism.
So there is a trade-off here.
Michael: So it's definitely not always
don't do this, but I'd say don't do
this without thinking, it sounds like.
Nikolay: Right, right.
You need to consider the code and
the future evolution, and so
on.
Michael: Nice.
Do you want to end on any that
you actually do think are universal?
Nikolay: I don't know.
It's hard, actually.
Michael: It always depends, right?
Nikolay: Yeah, maybe about transactional
DDL.
Postgres is cool.
It has transactional DDL until
you cannot use it.
And people running really heavily
loaded systems understand that
transactional DDL is some kind of myth.
I mean, you need it, but to really
apply schema changes without
downtime, you need to break it.
Create index concurrently is non-transactional,
right?
We discussed it, you mentioned
that it has 2 transactions.
If it's interrupted, it's not fully
rolled back, so it's not
a transaction already.
You will have an invalid index
left, leftovers of your actions.
If you want to change the schema in
any aspect, and you want to do it
without downtime, you usually
need multiple transactions.
For example, if you want to define
some foreign key or some check
constraint, and you already have a
big table loaded, you will always
need to think about multiple
steps, definitely, and
be ready to roll back one of them
and retry.
So, transactional DDL is not easy.
This topic is not easy.
It's not like, oh, we have transactional
DDL, all problems solved.
No.
In a highly concurrent environment,
it will actually lead you to
downtime if you just blindly use
transactional DDL.
And create index concurrently is
a perfect example.
It's not transactional.
If you use create index, which
is transactional, you have downtime.
Partial downtime.
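
A hedged sketch of the kind of multi-step dance being described, with hypothetical table and constraint names:

SET lock_timeout = '2s';  -- fail fast instead of queueing behind long transactions; retry on timeout

-- step 1: add the constraint without validating existing rows (brief lock, metadata only)
ALTER TABLE orders
  ADD CONSTRAINT orders_amount_positive CHECK (amount > 0) NOT VALID;

-- step 2: validate separately; this scans the table but holds a much weaker lock
ALTER TABLE orders VALIDATE CONSTRAINT orders_amount_positive;

-- and indexes are built outside a transaction block entirely:
CREATE INDEX CONCURRENTLY idx_orders_amount ON orders (amount);
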
Michael: I'm trying to think if
drop index concurrently is transactional.
Nikolay: It needs to acquire an
exclusive lock.
It's similar to dropping a column,
but Postgres doesn't offer
any tools for zero-downtime DROP COLUMN.
DROP COLUMN cannot be zero downtime
unless you cook it properly
with a low lock_timeout and retries.
DROP INDEX CONCURRENTLY is shipped
with Postgres; this tool exists, great,
but it's not transactional either,
because it might fail as well.
Although, in terms of leftovers, no
leftovers.
Michael: No, no, no.
But it might fail, that's the good
thing.
Nikolay: It just needs to acquire a
lock gracefully, right?
So as not to block selects or other
queries that come after you started.
Michael: You've worded this quite
well in your how-to, but the
idea here is don't assume that
Postgres having transactional
DDL will make all your problems
go away.
You're still going to have some
complex...
Nikolay: I mean, it has it, but
in many cases you cannot use
it and in heavily loaded systems
you absolutely cannot use it
as is.
You need to cook it properly and
have a lot of things around.
A lot of dances need to be learned,
right?
Michael: Because of the heavy locks?
Because you normally...
Nikolay: Yeah, yeah, because of
locking issues basically, exactly.
Cool, right.
Michael: Nice one.
Well, I think I'll link all of these
up in the show notes.
Maybe the wiki will have changed
by the time this episode goes
out.
It'd be cool to hear if anybody
thinks there should be additional
ones that should be in there.
Nikolay: If we say don't do it:
don't use transactional DDL.
I mean, use it, but only partially,
in pieces.
You cannot use it for 100% of everything,
like all schema changes.
You cannot use create index, which
is transactional.
Don't use transactional DDL.
I hope people understand when I
say this.
Don't use transactional DDL.
Michael: Yeah, I guess CREATE INDEX
is the one that will catch people
out if they don't know about...
Nikolay: Locks.
You cannot drop a column without
a proper lock_timeout and retries,
and that is already beyond a single
transaction.
Michael: Well, and in the past,
even adding columns...
Nikolay: You cannot just add a column,
even without a default.
It's the same as dropping a column:
you need an exclusive lock, and
if there is an ongoing long transaction,
you're blocked and you have trouble.
It means you need some additional
orchestration.
Michael: Nice, I think you'll find
it hard to word that for the
wiki, but I think it would be a
good addition.
Nikolay: Don't use transactional
DDL.
Okay.
Michael: You'd have a lot of caveats.
Nikolay: Right, well it's very
similar to many items we discussed
today.
Michael: True, true.
Nikolay: Right, this advice.
Okay?
Michael: Yeah, thanks so much,
Nikolay.
Thanks everyone for listening.
Catch you next week.