Don't do this
Michael: Hello and welcome to Postgres
FM, a weekly show about
all things PostgreSQL.
I am Michael, founder of pgMustard.
And this is my co-host Nikolay, founder
of Postgres.AI.
Hey Nikolay, what are we talking
about today?
Nikolay: Let's talk about something
we shouldn't do.
Michael: All the things that we
shouldn't do with respect to
Postgres, right?
Nikolay: Of course, we are not talking
about alcohol or sugar,
right?
Michael: Neither of which you've
ever done on the podcast, yeah.
Nikolay: Well, I was drinking a
lot of Red Bull on the podcast and
it has a lot of sugar, so.
But at least it's not alcohol.
Michael: True.
So there's a fairly famous wiki
page, which I was actually surprised
is only about 6 years old.
It's probably the wiki page I link
to the most in the PostgreSQL
wiki.
And it's titled "Don't do this".
It's a list of a bunch of things
that are largely inadvisable.
But there are also a couple of other
sources: you've written a how-to
that includes a lot of don't-do-this
advice that isn't on the wiki.
And there was a really good page
that you sent me that I've looked
at as well by Yandex who also have
some additional don't do this
advice.
So we're going to pick some of
our favorites from those and help
explain why not and what you can
do instead.
Nikolay: Right, right, right.
So the wiki list is quite large.
It's not huge, but quite large.
And I like almost everything.
Not everything, not everything.
For example, BETWEEN. I sometimes
use BETWEEN; you just need
to understand how it works.
It includes both sides
of the range, both boundaries,
right?
I don't see anything wrong if you
use it with an integer, for
example.
Michael: But that's what they say,
isn't it?
I think that's what the wiki is
really good at doing.
Nikolay: Well, if this list is
supposed to be simple to memorize,
and if something is included, like,
don't use between, and you
think, okay, I should avoid it.
Maybe actually you should avoid
it.
But if we take between, SQL is
such a rich language, we have
many things to exclude then as
well.
For example, I don't know, like
various things with join.
Like, this can bring us to the
area of, for example, always use
AS, don't use aliases without AS,
right?
And so on, these kinds of things,
because they can lead to confusion.
Or don't use ORDER BY 1, 2, 3, or
GROUP BY 1, 2, 3.
Which is actually good advice in
general, but most experienced
folks use it anyway because it's
just short, right?
And these kinds of items in this list
I don't really like; in my opinion,
they are quite shallow.
Michael: So I'm going to push back.
I think this isn't a list to be
memorized and not looked at.
I think one of the benefits of having
it written down with some
extra points, like for example,
in the title it says don't use
BETWEEN, and in parentheses, especially
with timestamps.
And then it has a why-not section
that explains that it includes the boundaries,
whereas you probably don't want
that if you're dealing with continuous
ranges like timestamps.
And then it says, when should you?
So it says, between is safe for
discrete quantities like integers
or dates, as long as you remember
that both ends of the range
are included in the result.
And then they say, but it's a bad
habit to get into.
And I actually think that contains
all of the subject you were
just talking about.
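
As a minimal sketch of the boundary issue described here, assuming a hypothetical events table with a created_at timestamptz column:

SELECT count(*) FROM events
WHERE created_at BETWEEN '2025-01-01' AND '2025-01-31';
-- both boundaries included: rows at exactly 2025-01-31 00:00:00 are counted,
-- but the rest of January 31 is silently left out

SELECT count(*) FROM events
WHERE created_at >= '2025-01-01' AND created_at < '2025-02-01';
-- half-open range: covers the whole month, no ambiguity at the edges
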
Nikolay: Well, let's maybe edit
the wiki and change the title to
don't use BETWEEN for timestamps,
to make it clear.
Because for integers I don't see
the problem at all.
Michael: So you could say like
0 to 9 and 10 to 19 or you know,
yeah.
Nikolay: With timestamps it's tricky,
because if people don't spend time
thinking about what exactly
they include and what they don't,
they can create quite bad problems.
They can have trouble even without
BETWEEN, using just simple
comparisons.
Also, grouping by truncated timestamps,
like with date_trunc, can be tricky as well.
You just need to understand that,
depending on your data,
the edge can be tricky.
Sometimes it includes a lot of
rows: midnight can have a lot of rows,
and you need to decide where they
should go, to the previous day or the
next day, right?
Or sometimes there can be 0 rows
exactly on the edge, because you always
have milliseconds.
Michael: Or it's even more precise
than that, right?
Nikolay: Yeah, you need to think
about edges and where to include
them.
And of course, if you use between,
it's hidden a little bit.
I agree with this.
But with the blanket don't use BETWEEN,
especially with timestamps,
I cannot agree.
Michael: Okay, well, yeah.
I guess it's a wiki for a reason
and edits are allowed.
Right.
But should we focus on some of
the ones we agree with more?
Nikolay: Yeah, sure.
Choose anything.
Michael: Well, I like the NOT IN one.
The wiki starts with a bunch that
I don't see as being that useful;
they don't seem to trip that many
people up.
But a few sections in, it gets to
SQL constructs, and the first of
those is don't use NOT IN.
Nikolay: Yeah, I would recommend
reading the jOOQ blog post Lukas
wrote about this; it explains very
well why it's bad. And if you remember,
I once mentioned the very nasty problem
I had with one of my startups in
the past, when we couldn't launch
properly for several months, losing
traction, and then I found
an issue related to NULLs.
It was exactly this.
It was in the episode about NULLs,
one of our very first episodes.
And yeah, NULLs are in general
very tricky, and this is one of
the traps where they can catch you.
And you need to study three-valued logic,
but for some people, myself for
example, it will always
be a trap.
I mean, I know I can, even with
20 years of experience, I can
be caught there.
That's why, like, you just need
to practise avoiding these traps.
So basically, if you have NOT IN
and one of the values inside it
is NULL, the result will always be
unknown, so it will not return
what you expect.
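
A minimal sketch of the trap, with made-up values:

SELECT 1 WHERE 2 NOT IN (1, NULL);
-- returns no rows: 2 <> NULL is unknown, so the whole predicate is unknown, not true

SELECT 1 WHERE NOT EXISTS (
  SELECT 1 FROM (VALUES (1), (NULL)) AS t(x) WHERE t.x = 2
);
-- returns one row: NOT EXISTS behaves the way most people expect here
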
Michael: Yeah, and I think the
reason this trips people up the
most is that IN and EXISTS can be
used, say, interchangeably, but
that's not true for NOT EXISTS and
NOT IN.
And that's confusing.
Nikolay: That's confusing, yeah.
NULL is confusing in general.
Three-valued logic is confusing
in general.
So NULL is a big trap itself.
It has just many faces, and NOT IN
is one of these faces.
Michael: Yeah, well, I'll link
to the episode we did on NULLs
in the show notes, for anybody
who wasn't a listener back then
or missed that one.
That also reminds me, in the jOOQ
list there was some schema advice,
some general don't-do-this for
your design.
And one of the things was around
constraints on columns in the
first place.
Like, the default, in Postgres at least,
and I think in all databases I've used,
or at least the ones I remember,
is that columns can contain nulls.
Nikolay: Can accept NULLs.
Michael: Can accept NULLs, yeah.
And it's an interesting point that
you make, that maybe that
should be flipped and NOT NULL
should be the default.
Or at least when we're designing
schemas, it could make sense
to do it that way.
Nikolay: Yeah, yeah, maybe.
At least when you design something,
it's a good habit to consider
everything as not null.
If needed, the next step is to
think about the default value
and only then accept nulls.
Yeah, just reduce the number of
cases where nulls can appear.
And that's a great, I would say,
style.
This is about style.
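
As a small sketch of that habit, with hypothetical table and column names:

CREATE TABLE orders (
  id          bigint NOT NULL PRIMARY KEY,
  customer_id bigint NOT NULL,                -- NOT NULL unless proven otherwise
  status      text   NOT NULL DEFAULT 'new',  -- next step: think about a default value
  note        text                            -- only this one genuinely needs to accept NULLs
);
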
Michael: It does have performance
implications, that 1 as well,
because if the database can know
that there are no nulls in that
column, that can allow certain
optimizations as well.
Nikolay: Well, yes and no.
Well, it depends.
Nulls are stored differently.
If you put some default value,
it will be stored as is, right?
Unless you're adding a column to
an existing table: in that case, since
Postgres 11, we have a kind of virtual
default, stored virtually, not physically.
But for all new rows, the real default
value will actually be stored.
And NULLs are just stored in a bitmap,
right?
But I agree that in general, with
indexes and so on, NULLs sometimes
present some challenges in terms
of performance.
Michael: I think also it can rewrite
certain queries if it knows
for sure that a column cannot contain
nulls.
If there's like a constraint on it.
If you've written, and I guess this
is a contrived example, but if
you've written WHERE x IS NOT NULL,
it can just ignore that completely.
It can just ignore certain filters
or certain operations.
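
A quick way to check this on a given Postgres version, with a hypothetical one-column table; whether the redundant filter disappears depends on the version:

CREATE TABLE t (x int NOT NULL);
EXPLAIN SELECT * FROM t WHERE x IS NOT NULL;
-- on newer versions the planner can drop the Filter: (x IS NOT NULL) line entirely,
-- since the constraint already guarantees it is true
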
Nikolay: Good point.
Michael: But yeah, this is not
important in the grand scheme
of things.
Nikolay: Just not in is dangerous.
That's it.
Not in is dangerous.
I think the wiki page has it as well,
right?
Don't use NOT IN.
Agreed.
With this statement, I agree.
Michael: What do you think of the
next one on the wiki which
says don't use uppercase table
or column names?
Nikolay: Again, it's a matter of
style.
Right here.
It's a matter of style.
So I don't like uppercase in general,
not only for identifiers.
I don't like it for keywords like
select.
I prefer lowercase.
But of course, if you don't want
to be required to use double
quotes, then don't use uppercase.
Or, if you do, you can still avoid
using double quotes, but then you
need to always be consistent.
So, basically, if you don't use
double quotes, names are effectively
case-insensitive.
Michael: They're lowercased, right?
Nikolay: They're internally lowercase,
but you can use mixed
case and it will be the same as
lowercase, right?
So, no big problem if you just
type all table names uppercase,
until you start using double quotes,
or some of your software
starts using them, and then you're
in trouble.
So better to lowercase everything.
But sometimes we have migrated
systems from other database systems
and in this case we do see table
names like in camel style for
example, right?
In this case it's just legacy we
need to deal with.
I personally like to lowercase
everything for, like, economical
reasons: fewer actions when you type,
right?
That's it.
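
A small sketch of the case-folding behavior being described, with hypothetical names:

CREATE TABLE Users (ID int);        -- unquoted: stored as users / id
SELECT id FROM USERS;               -- works; any mix of case folds to lowercase
CREATE TABLE "Users" ("ID" int);    -- quoted: a separate table, case preserved
SELECT "ID" FROM "Users";           -- and now it needs double quotes forever
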
Michael: Yeah.
And underscores between words in
object names?
Nikolay: Underscore, yeah.
Well, maybe sometimes, yeah.
So like in Python and Shell, snake
style, right?
Michael: Snake case, yeah.
Nikolay: Snake case.
Michael: Cool.
So we're into the meat of it.
We've discussed the timestamp stuff,
and we've got a whole episode on
times and timestamps, so we can
link that as well.
Nikolay: In general, I wanted to
mention it in the beginning
but I forgot.
This episode is like kind of basic.
We like basic episodes, right?
Because we know many of the people
who listen to us are not database
experts, and it's good not to dive
into very narrow, deep problems,
but sometimes just to remind people
about some basics.
And it's fine if there's nothing new
in this episode for you; in that case
it's also good, but I hope for
some folks it will be useful.
Michael: I do think it's also useful
sometimes as people that
have got a lot more experience
to remember what it was like being
a beginner.
Chances are you've got people on
your team, or people you have
to help, who are new to this stuff,
and having a few more resources
you can point people at as to
why something is a bad idea,
instead of having to explain everything
again from first principles,
I find that somewhat helpful.
Hence why I link people to the
wiki so often.
But yeah, also worth considering:
should some defaults change?
Can we make the wiki clearer?
Like, if you do know all this stuff
already, it'd be cool if
you could help us make these resources
clearer or give better
advice.
Is there anything that you advise
people never to do in Postgres
that isn't on the wiki yet?
Could you update it?
So I'd love even more advanced
people to think about how they
can make things easier for new
folks coming along.
Should we make defaults different?
No beginner is going to be able
to dive into the Postgres codebase
and argue on the mailing list for
making a default different,
I don't think.
Nikolay: Right.
Yeah, many defaults are outdated,
you know my opinion about this.
Michael: But this is a bit different,
right?
We're often talking about GUC settings
when we're talking about
those defaults.
This is more of a design goal.
Nikolay: Yeah, here as well.
Well, again, I don't see how
NOT NULL can become a global standard,
the global default.
But I can see how in a particular
project it can easily become a code
style or schema design standard.
Michael: Yeah, it's an interesting
point.
Obviously, because we have to support
backward compatibility,
it's impossible now, I think.
But at the beginning, you could.
Nikolay: Also, I guess, there may
be some standard things here.
Anyway, I don't know.
Maybe there is a standard regarding
defaults, maybe not.
Let's proceed to some bigger
problems, maybe.
Or, what about...
Let's finish with the small problems first.
Don't use money.
Don't use money.
Michael: The data type.
I know you find this funny every
time.
Nikolay: Well, the wiki says don't
use money.
They don't say don't use data type
money.
They say don't use money.
Some communists here, I think.
So, yeah, money is a bad data type.
Why is it still there?
Maybe it's standard as well?
I don't think so.
I don't know.
So the implementation is terrible.
You will start losing cents if
you use it.
Michael: So rounding's an issue,
partially.
There's quite a lot of pricing
these days, if we're paying for
compute and paying for APIs, like
tokens or access to things, where
we're often paying fractions of a cent
per image, per minute, per whatever
we're paying for.
So fractions of...
Nikolay: A cent, yeah.
If you're processing a lot of transactions,
you will start losing some money,
so don't do it.
Use numeric, but check the performance,
obviously.
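
A tiny illustration of the fraction-of-a-cent problem, with made-up values; money's formatting also depends on lc_monetary, so the exact output varies:

SELECT 0.015::money;            -- rounds to two decimal places, e.g. $0.02 with a US locale
SELECT 0.015::numeric(12, 6);   -- numeric keeps 0.015000
-- another common option is storing integer cents, or micro-cents, and shifting the decimal point in the application
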
Michael: Yeah, I've also seen people
use integers, with, like...
Nikolay: Okay, but what about fractions
of cents, for example, again?
Michael: Well, that's the idea: you
can just move the decimal place
if you store it as an integer.
Nikolay: Well, okay, if you define
the precision in advance and then
follow this rule, okay, why not?
Michael: I guess performance reasons.
I don't know, I didn't ask.
Nikolay: Right, right. Okay, good,
simple rule: don't use money.
What's next?
Let's choose some bigger topic.
What do you think?
Michael: Well, I think the primary
key...
Yeah, let's go to...
Let's jump to your list, your how-to
guide that we'll link up
as well.
Nikolay: Let's do it.
So yeah, primary keys are quite a simple
thing: just don't use int4
primary keys.
Consider them in the same manner
as int2 primary keys:
do it only when you fully understand
the consequences and reasons.
I mean, int4 has a capacity
of 2.1 billion values;
the maximum is roughly 2.1 billion.
And sometimes we reach this.
Yeah, positive and negative.
Well, it's interesting that we
can use negative values, but usually,
for surrogate keys, people
don't use negative values.
There is an idea, well, like to...
I saw it many times, people say,
okay, we are approaching 2.1
billion soon, let's just start
using negative values.
Okay, good luck with that in URLs
and so on.
It's like, I think it's definitely
possible, but the amount of
work is not trivial.
Again, it's better to convert to
int8, which is not trivial
if you already have a billion rows,
or 4 billion rows; at that point
I would say it's probably
already an emergency.
But our checkup tool, I think,
starts raising big flags,
yellow or even red, after
50% of the capacity is used.
By the way, it's an interesting
problem: how, in one query, checking
pg_class and the other system catalogs,
to find all tables which have a
surrogate int4 primary key, or maybe
a multi-column primary key where
one of the parts is int4, and then
to check the current value in the
sequence.
It's not trivial.
Unfortunately, I remember trying
to solve it with pure SQL; it wasn't
possible, so I downgraded to PL/pgSQL
and have a snippet.
It's an anonymous DO block, or maybe
a function, anyway.
And then in this case, it can scan
your schema and then check
values in sequences and report
the capacity for each int4
primary key case.
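
For reference, a rough single-query approximation of that check, leaning on the pg_sequences view available since Postgres 10; it assumes single-sequence-backed primary key columns, and Nikolay's PL/pgSQL snippet handles more edge cases:

SELECT c.relname   AS table_name,
       a.attname   AS pk_column,
       s.last_value,
       round(100.0 * coalesce(s.last_value, 0) / 2147483647, 1) AS pct_of_int4_capacity
FROM pg_index i
JOIN pg_class c      ON c.oid = i.indrelid
JOIN pg_attribute a  ON a.attrelid = c.oid AND a.attnum = ANY (i.indkey)
JOIN pg_depend d     ON d.classid = 'pg_class'::regclass
                     AND d.refobjid = c.oid
                     AND d.refobjsubid = a.attnum
                     AND d.deptype IN ('a', 'i')      -- sequence owned by the column
JOIN pg_class seqc   ON seqc.oid = d.objid AND seqc.relkind = 'S'
JOIN pg_sequences s  ON s.schemaname = seqc.relnamespace::regnamespace::text
                     AND s.sequencename = seqc.relname
WHERE i.indisprimary
  AND a.atttypid = 'int4'::regtype
ORDER BY pct_of_int4_capacity DESC;
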
And again, my approach is very
simple.
Using an int4 primary key is similar
to using an int2 primary key.
Because the data volumes we deal
with usually are already so
big.
So, it only makes sense if you do know
what column tetris and alignment
padding are, if you know how the
storage is organized.
You always...
Michael: Yeah, you jumped to this
before, like, I think you're
right to mention it.
But I think the larger argument
is that there's almost no cost:
even when there is a cost
to using bigint over int,
even when you have considered
column tetris, the cost
is still so small compared to the
long-term advantages.
Nikolay: 4 bytes for each row.
Michael: Yeah, it's trivial.
Nikolay: But this is why people
usually choose int4.
They say, okay, we will save 4
bytes.
In this table, we won't reach
2 billion, and we will
be saving 4 bytes for each row.
It's good for us and they choose
it.
Michael: What do you think is...
Nikolay: I'm talking about a conscious choice.
Michael: I don't, yeah, I think
most choices are unconscious.
I think most people are just taking
the default.
It's in there, like, maybe an ORM,
maybe a tutorial they're following.
Maybe, I want an integer primary
key, why not choose the type
that's called integer?
So I think most of the time this
choice is not conscious.
Nikolay: I agree with you.
Yeah, behavior of humans is that
they just choose defaults and
go because they don't have time
to understand.
But why am I talking about this
so precisely?
Because I saw it many times: people
are already approaching 2 billion,
they have problems, it's hard to
migrate without downtime, though
possible.
And then we tell them: you know,
guys, not only do you have a
big problem and you'll be down
soon, I mean partially down, not
accepting inserts anymore for this
table, but also look at this:
id and created_at.
4 bytes, 8 bytes, so you don't
even have any benefit from int4.
It's emotionally big, understanding
that you have 4 zeroed padding bytes
in every row.
So it's like, why did we do this?
Why did we do this?
This is an emotion; that's why I'm
talking about it so much.
I saw it in many teams.
I'm just sharing experience here.
Michael: Yeah, and you're totally
right to mention it, that a
lot of the time due to alignment
padding we don't even get the
benefit of the saved 4 bytes.
But what I meant was more that,
even without the fact that there's
sometimes zero benefit, and I think
that will shock people,
I think it's still worth it.
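
A quick way to see the padding effect for yourself, with two hypothetical two-column tables:

CREATE TABLE t_int4 (id int4, created_at timestamptz);
CREATE TABLE t_int8 (id int8, created_at timestamptz);
INSERT INTO t_int4 VALUES (1, now());
INSERT INTO t_int8 VALUES (1, now());
SELECT pg_column_size(t) FROM t_int4 AS t;  -- reports the same size as...
SELECT pg_column_size(t) FROM t_int8 AS t;  -- ...this: the int4 saving is eaten by 8-byte alignment padding
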
Nikolay: Right.
Well, at the end of the day, 4 bytes
for each row is not a huge saving.
For a billion rows, it's just 4 billion
bytes, roughly 4 gigabytes.
Noticeable, probably, but not a
lot.
Michael: Yeah, and even if you
add in the fact that it's
probably duplicated in some indexes,
you probably have several
indexes that involve the
primary key, maybe some
multi-column indexes,
maybe you've worried about
memory, but yeah, it's
not big.
And the time you will realize that
it wasn't worth it is that
one project.
Let's say you have a hundred tables and
you've used int4 for all of them
instead of bigint.
And only one of those tables has
to be converted to int8 later.
That project you go through
to do that conversion, you're
going to realize we'd have been
better off going with int8
for all of them.
You only have to go through one of
these projects once to realize
it's just not worth it.
The time investment alone.
Never mind the stress, like, under
pressure.
Nikolay: Interesting that the wiki
list doesn't have this at
all; it doesn't talk about primary
keys at all.
It does say don't use serial, but
that's not about the number of bytes;
it's that serial in general is not
a good thing.
So I guess bigserial is also a bad
thing in the opinion of the people
who wrote it.
Interesting, right?
I mean, it's quite common.
We could edit that, yeah.
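
For reference, what the wiki recommends instead of serial, on Postgres 10 and later, is an identity column; a minimal sketch with a hypothetical table:

-- instead of:  id bigserial PRIMARY KEY
CREATE TABLE events (
  id bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY
);
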
Next, choose anything.
What do you think?
Michael: From your list, we've
talked about NULLs a bit already.
I think transactional DDL and DELETE.
DELETE is a great one: don't delete
a lot of rows with one command.
Nikolay: Massive DELETE is massive, yeah.
I had incidents.
It was on weaker storage, but they
were very painful incidents, when
just deleting 10 million rows led
to more than a 10-minute outage
and a lot of money lost.
So yeah, it's not easy to delete
rows in non-partitioned tables,
in large tables.
So it's a big task actually if
you need to clean up.
And maybe I told you, one day
I went to the VLDB conference, which
is probably the oldest big conference
in the area of databases,
mostly with academic people.
It was in Los Angeles, and one of
the keynotes was saying that data volumes
grow so massively that we need
to study how to delete data, not
how to store it, or how to
find the things we need to delete.
But with Postgres, knowing how
MVCC is organized and how DELETE
works, you need to understand that,
to do it without downtime and without
big stress, DELETEs should be batched.
And the batch size should be...
When I was young, it was difficult
for me to understand how to
find the proper batch size.
If you take a batch size of one,
there's probably too much transaction
overhead, and it will affect throughput
as well: you will be deleting fewer
rows per minute or hour in your
background jobs, right?
Michael: Yeah.
Well, I tend to see, like, low thousands.
I guess it depends, but what do
you tend to do?
Nikolay: I have a simple approach.
Remember our very first episode.
I actually already described this
there, and you see, reminding people
of things is also useful sometimes.
So I have a very simple approach.
It's based on what our final goal
is.
Our final goal with OLTP systems
is that systems should be working
fast in general.
For humans, fast means below 100
or 200 milliseconds, right?
Michael: Nice,
Nikolay: yeah.
1 second is quite slow, 10 seconds
is very slow.
Deletes can block some people,
some other transactions.
Plus they can lead to I/O effects
and so on.
So I prefer batches to last not
more than 1, 2, 3 seconds.
In this case, we know the negative
effects won't last more than a second
or a few seconds.
And even in the worst case, we
will be blocking, for example,
someone for just 1 second, which is
quite slow, but not terribly
slow.
Yeah.
If it's already 10 seconds, that's
way too slow for people, and
we might start losing traffic if
we're talking about web projects,
or people start complaining and
leaving us, right?
If it's 100 milliseconds, it's
good, but probably too small.
So batches like half a second,
1 second, in my opinion, are perfect.
But sometimes it's hard to understand
the duration in advance
for all batches.
You try a few batches, they are
fast, but over time they can
degrade.
So you need to keep in mind that
if you're batching and deleting
according to some ORDER BY, for
example by timestamp, deleting the
oldest data first, then newer and
newer, you need to control vacuum,
because your delete, trying to find
the next batch, will probably be
scanning too many dead tuples.
I mean, the index will have a lot of
pointers to dead tuples, and the
performance of a single batch delete
will degrade over time.
To avoid that, you need to control
vacuum behavior and maybe run vacuum
yourself from time to time, cleaning
up dead tuples and keeping the
index fresh.
Or, an additional thing, sometimes
I found myself... maybe we need an
episode about DELETE actually, or
about massive operations, if we
haven't had one.
Sometimes I just decide, okay, I
like stateless queries which delete
batch after batch, where we don't
remember the state.
But sometimes, instead of dealing
with vacuum myself, I just delegate
it: I say, okay, I will make my
script or program memorize the state,
just knowing the latest ID or timestamp
deleted, and start from there.
It's basically similar to keyset
pagination.
Michael: Yeah.
Nikolay: Actually it's not just similar,
it is keyset pagination.
In this case, you don't depend on
vacuum behavior and can
go faster and so on.
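
A minimal sketch of that pattern, with a hypothetical events table; the retention window and batch size are made up, and the driving script keeps the last deleted id between batches:

WITH victims AS (
  SELECT id
  FROM events
  WHERE created_at < now() - interval '90 days'
    AND id > :last_deleted_id          -- keyset-style state remembered by the script
  ORDER BY id
  LIMIT 5000
)
DELETE FROM events e
USING victims v
WHERE e.id = v.id
RETURNING e.id;
-- the script stores max(id) from the returned rows as the next :last_deleted_id,
-- one batch per transaction, repeated until nothing comes back
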
Michael: The other time I've seen
this cause real issues is when
people don't see how many deletes
will happen.
For example, let's say it's a SaaS
account and they're deleting an
account, and that account has users,
and those users have events.
The number of rows a single delete
can cascade to can be very different:
deleting a large account is a very
different amount of deletes than
deleting a small account.
So I'm guessing this is where you'd
avoid ON DELETE CASCADE.
Nikolay: Yeah, that's interesting.
And in the systems that were built
fully under my control, I always
tried to avoid cascade deletes.
But surprisingly, I saw quite big
systems which use it.
Yeah.
With many dependent objects deleted
and so on.
And I don't know, I think it depends.
You need to think about it in advance,
of course, and maybe just
rely on it.
There might be a hybrid approach:
if we know the object is not huge,
we delete it relying on DELETE CASCADE;
but if we know the object is huge,
then before the final delete of the
main object, we clean up its dependants
asynchronously first, and only then
delete it.
Michael: In batches.
Nikolay: It depends.
Right, right, right.
It depends on the particular system,
but the surprise to me was
to see that there are good systems
relying on cascaded deletes
with many, many dependent objects,
like thousands, tens of thousands,
and it's kind of okay.
As always, I would say it's worth
thinking about performance in advance
and testing.
Just test your system, your environment,
your hardware; imagine the worst case,
test it, see how it works, and then
follow this rule: like, 1 second is
good, half a second is good, right?
Michael: Yeah, I like that.
I like that a lot.
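
A rough sketch of that hybrid idea, with hypothetical tables; the ids and batch size are made up:

CREATE TABLE accounts (id bigint PRIMARY KEY);
CREATE TABLE users (
  id bigint PRIMARY KEY,
  account_id bigint NOT NULL REFERENCES accounts (id) ON DELETE CASCADE
);

-- small account: delete the parent and let the cascade handle the rest
DELETE FROM accounts WHERE id = 7;

-- huge account: pre-clean the children in batches first...
DELETE FROM users
WHERE id IN (SELECT id FROM users WHERE account_id = 42 ORDER BY id LIMIT 5000);
-- ...repeat until 0 rows, then the final cascading delete has almost nothing left to do:
DELETE FROM accounts WHERE id = 42;
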
Nikolay: Yeah, but actually, since
this is kind of an episode about
basics, of course, we forgot the
elephant in the room, right?
Deleting without any conditions,
without a WHERE clause.
On a big table it will be slow, but
it's also interesting to see that
the space is not immediately available;
it won't become free disk space
right away.
Because a delete consists of two processes:
first your DELETE, which is synchronous,
and then vacuum, which physically
removes the dead tuples.
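
A small way to watch those two phases, with a hypothetical table:

DELETE FROM events;    -- every tuple is only marked dead; the table file keeps its size
SELECT n_dead_tup FROM pg_stat_user_tables WHERE relname = 'events';
VACUUM events;         -- reclaims the dead tuples for reuse (not necessarily returning space to the OS)
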
Michael: That isn't where I thought
you were going with that.
I used to work on a tool for SQL
Server, Microsoft SQL Server; it
was like a plug-in for the IDE that
Microsoft has.
And one of our favorite features,
one of the features people loved
the most, was a warning to say,
did you mean to...?
It was for DELETE and UPDATE.
If you try to run a DELETE without
a WHERE clause, it would warn
you before running it.
If anybody's out there who's writing
Postgres IDEs, please add
that feature.
It's so helpful.
Nikolay: Actually, yeah, there
is an extension to prohibit this.
And actually, it was funny, I think
the very first thing we developed
with Andrei was this exact patch
for Postgres, but it was rejected
by
Michael: Oh, for Postgres?
Nikolay: By hackers, yeah, yeah.
But, like, we just wanted a warning
to be produced, or maybe the
ability to forbid WHERE-less deletes.
So yeah, let's maybe
continue.
Michael: But IDEs don't have to
worry about the core Postgres.
It can be implemented on a case-by-case
basis there, at least.
Nikolay: This is a big mistake,
if you forget the WHERE and execute
it.
And sometimes we don't even need
a semicolon at the end.
For example, in psql, if it's -c
and you write something,
it will execute it; of course, you
need quotes.
Sometimes we don't need a semicolon
if it's a single query execution.
In that case, it can be terribly
bad: you've deleted everything.
Right, right, right.
Okay, but in general, yeah, DELETE
is tricky.
Michael: Oh, I have one more
beginner-friendly one from the quick
list: don't use SELECT *, or at
least not in application code.
I think it's useful for ad hoc
queries and exploration.
How do you feel about this one?
Nikolay: Yeah, but do you remember
what should be used instead?
Because I don't like 50 columns
being listed either, and then you
realize that that's all of them.
Right?
Michael: Well, but...
So I think there's a few reasons
for this.
Firstly, do you really need 50
columns?
Like, what are you doing that requires
50 columns?
Nikolay: There are so many cases.
This advice cannot be generic.
Because, for example, if I define
a view which should consist
of all columns of the underlying table,
I will definitely use star,
then later if I need to add a column,
I redefine the view, and I don't
need to rewrite this query, right?
And I know everything is included.
There are many different situations.
Of course, it's a minimalistic
approach.
Don't take too much, right?
In general.
But instead of listing all columns,
if I need all of them, I would
use the star, the asterisk.
Michael: So in application code,
I think it can break in unexpected
ways.
Like, if you then add some columns
to that table, can your code handle
more columns coming back in the future
than it currently expects?
Nikolay: It depends.
If I put the star, of course, I think
about future changes of the schema.
I do think about it.
Michael: Cool.
Well, okay.
Interesting.
Nikolay: You know, in Postgres, you
can say SELECT table1 FROM table1.
Michael: Or just table1.
Nikolay: No, no, no, it's different.
I mean, TABLE table1 just selects
everything, right?
Michael: You can do, like...
Nikolay: SELECT table1 FROM table1.
It will give you just one column,
but of a record type, like a composite,
kind of virtual data type.
And everything will be collapsed,
and you can unwrap it later
in your queries if it's a subquery.
But this is powerful; you don't
need to list all columns.
And it's better than star because
sometimes you just need...
This is the magic of Postgres,
where maybe First Normal Form
is broken, actually, because you
basically wrap everything into
a single column.
Okay.
Michael: The whole table.
Nikolay: The whole table, which is great.
And then, if you're inside a PL/pgSQL
context, for example, it's perfect
to do this.
You just define a record variable,
and you can put a whole row, with all
its columns, into that variable.
Later, if there is evolution of
schema and more columns, this
code will still work.
But depending on your language
and driver, I don't know what
will happen if you run SELECT table1
FROM table1 LIMIT 1, for example,
with some WHERE clause.
I don't know.
It's worth checking, but it's interesting.
This relates to what you said about
getting back more columns than are
needed, right?
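
A small sketch of the whole-row trick, with a hypothetical users table (the email column is made up); whether your driver copes with a composite column is, as said, worth checking:

SELECT u FROM users u WHERE u.id = 1;          -- one column, of the composite type users
SELECT (u).email FROM users u WHERE u.id = 1;  -- unwrap a single field later if needed

DO $$
DECLARE
  r users%ROWTYPE;
BEGIN
  r := (SELECT u FROM users u WHERE u.id = 1);
  RAISE NOTICE 'whole row: %', r;   -- keeps working if columns are added later
END $$;
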
Michael: Yeah. Well, the time I see
this used the most, and the reason
I think this advice is good, is that
I so often see, mostly as a result
of ORMs, people selecting every
column when they only need two.
Nikolay: Yeah, yeah, yeah.
This minimalism approach, I get
this.
But I have a student right now
who writes Python code.
And sharing experience, I just
see the code which is written
and I always say like, oh, you
define this function but you use
it just once.
Why do you need the function here?
You define the variable which you
assign to a different variable
and then you just use it once.
Why do you do this?
You had some constant you defined
and used it once.
Just like, why do you need these
things?
And here I see a similar pattern.
If we select column names, what
if we rename these column names,
right?
We need to go there and rewrite
it as well, something like this.
Well, it depends on the situation,
of course, but I don't like
the idea that we will list all
column names many times if we
know we need the whole row, for example.
It's just like observations trying
to...
There is minimalism in star as
well.
This is what I'm trying to say.
A different kind of minimalism.
So there is a trade-off here.
Michael: So it's definitely not always
don't do this, but I'd say don't do
this without thinking, it sounds like.
Nikolay: Right, right.
You need to consider the code and
the future evolution, and so
on.
Michael: Nice.
Do you want to end on any that
you actually do think are universal?
Nikolay: I don't know.
It's hard, actually.
Michael: It always depends, right?
Nikolay: Yeah, maybe about transactional
DDL.
Postgres is cool.
It has transactional DDL until
you cannot use it.
And people running really heavily
loaded systems understand that
transactional DDL is some kind of myth.
I mean, you need it, but to really
apply schema changes without
downtime, you need to break it.
Create index concurrently is non-transactional,
right?
We discussed it, you mentioned
that it has 2 transactions.
If it's interrupted, it's not fully
rolled back, so it's not
a transaction already.
You will have an invalid index
left, leftovers of your actions.
If you want to change the schema in
any aspect, and you want to do it
without downtime, you usually
need multiple transactions.
For example, if you want to define
some foreign key or some check
constraint, and you already have a
big table loaded, you will always
need to think about multiple
steps, definitely, and
be ready to roll back one of them
and retry.
So, transactional DDL is not easy.
This topic is not easy.
It's not like, oh, we have transactional
DDL, all problems solved.
No.
In a highly concurrent environment,
it will actually lead you to
downtime if you just blindly use
transactional DDL.
And create index concurrently is
a perfect example.
It's not transactional.
If you use create index, which
is transactional, you have downtime.
Partial downtime.
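
A hedged sketch of the kind of multi-step dance being described, with hypothetical table and constraint names:

SET lock_timeout = '2s';  -- fail fast instead of queueing behind long transactions; retry on timeout

-- step 1: add the constraint without validating existing rows (brief lock, metadata only)
ALTER TABLE orders
  ADD CONSTRAINT orders_amount_positive CHECK (amount > 0) NOT VALID;

-- step 2: validate separately; this scans the table but holds a much weaker lock
ALTER TABLE orders VALIDATE CONSTRAINT orders_amount_positive;

-- and indexes are built outside a transaction block entirely:
CREATE INDEX CONCURRENTLY idx_orders_amount ON orders (amount);
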
Michael: I'm trying to think if
drop index concurrently is transactional.
Nikolay: It needs to acquire an
exclusive lock.
It's similar to dropping a column,
but Postgres doesn't offer
any tools for zero-downtime DROP COLUMN.
DROP COLUMN cannot be zero downtime
unless you cook it properly
with a low lock_timeout and retries.
DROP INDEX CONCURRENTLY is shipped
with Postgres; this tool exists, great,
but it's not transactional either,
because it might fail as well.
Although, in terms of leftovers, no
leftovers.
Michael: No, no, no.
But it might fail, that's the good
thing.
Nikolay: It just needs to acquire a
lock gracefully, right?
So as not to block selects or other
queries that come after you started.
Michael: You've worded this quite
well in your how-to, but the
idea here is don't assume that
Postgres having transactional
DDL will make all your problems
go away.
You're still going to have some
complex...
Nikolay: I mean, it has it, but
in many cases you cannot use
it and in heavily loaded systems
you absolutely cannot use it
as is.
You need to cook it properly and
have a lot of things around.
A lot of dances need to be learned,
right?
Michael: Because of the heavy locks?
Because you normally...
Nikolay: Yeah, yeah, because of
locking issues basically, exactly.
Cool, right.
Michael: Nice one.
Well, I think I'll link all of these
up in the show notes.
Maybe the wiki will have changed
by the time this episode goes
out.
It'd be cool to hear if anybody
thinks there should be additional
ones that should be in there.
Nikolay: If we say don't do it:
don't use transactional DDL.
I mean, use it, but only partially,
in pieces.
You cannot use it for 100% of everything,
like all schema changes.
You cannot use create index, which
is transactional.
Don't use transactional DDL.
I hope people understand when I
say this.
Don't use transactional DDL.
Michael: Yeah, I guess CREATE INDEX
is the one that will catch people
out if they don't know about...
Nikolay: Locks.
You cannot drop a column without
a proper lock_timeout and retries,
and that is already beyond a single
transaction.
Michael: Well, and in the past,
even adding columns...
Nikolay: You cannot just add a column,
even without a default.
It's the same as dropping a column:
you need an exclusive lock, and
if there is an ongoing long transaction,
you're blocked and you have trouble.
It means you need some additional
orchestration.
Michael: Nice, I think you'll find
it hard to word that for the
wiki, but I think it would be a
good addition.
Nikolay: Don't use transactional
DDL.
Okay.
Michael: You'd have a lot of caveats.
Nikolay: Right, well it's very
similar to many items we discussed
today.
Michael: True, true.
Nikolay: Right, this advice.
Okay?
Michael: Yeah, thanks so much,
Nikolay.
Thanks everyone for listening.
Catch you next week.