Version control for databases
Michael: Hello and welcome to Postgres fm,
a weekly show about all things Postgres qr.
I am Michael, founder of PG Mustard, and this is
my co-host Nicolai, founder of Postgres Ai Heyi.
What are we talking about today?
Nikolay: Hi Michael.
I have no idea how to call it.
Let's you do it.
Michael: Well, let's start with calling it database version control.
So source control scheme.
Yeah.
Nikolay: database scheme version control.
I don.
Michael: Yeah.
Nikolay: If, if you say just database version control,
someone can think that it's also about data changes
and like time travel for data and it's very different.
We, we mostly talk about schema, but.
Talking about schema, we should not forget about a big data
warnings when we, you need to adjust your data if you need to
make some schema change, which requires data changes, right?
For example, data type change of a, of a column.
So, yeah.
Michael: absolutely.
Well, I think probably in the application world, this is a well covered topic.
there's been various generations of tools that have improved.
Yeah.
Nikolay: I've, I've recently read articles about it.
There are, there are three generations of code version control
or code change management, or how to source code management.
Thirst was very naive and lock based.
Second was CVS Aversion and others per force.
It was better, but it was very centralized.
It was better in
Michael: I remember Subversion team, Foundation server.
Yeah.
Nikolay: Right.
Right.
And the third generation is GI and Mercury, which are also tier, centralized
Michael: Yeah.
And I think,
Nikolay: and like lock free, decentralized.
Michael: I don't know anybody using
mercurial anymore, so it seems like get one
Nikolay: Several years ago.
I know for sure Facebook used it, but I'm not sure about now.
Yes.
Michael: So, yeah, so on the application
side, it feels like this is pretty standard.
It's very rare that I meet anybody, not
source, controlling their application code.
But I do still meet people who don't have the same.
Systems for their database.
And it's super interesting cause this is a problem I was working on back in
20, even as early as 2010, I was working at a company called Redgate that
we're bringing out a source control product for a database for SQL Server.
So it's super interesting that still, what, 12 years later, still
pretty common for me to come across teams that don't have it.
And even the ones that
Nikolay: Yeah.
For database you mean?
Right?
Michael: For date.
Yeah, exactly.
For da.
Yes.
Nikolay: Mm-hmm.
Michael: And even the ones that do, don't normally have it hooked
up to they're using it for kind of communicating changes between
developers, but not always for deployments or continuous integration
Nikolay: Yeah, let's, let's discuss the goals.
Let's start for with regular code.
Which goals such tool like gi.
Is aiming to solve.
For example, if you are just a sole developer, it still makes
sense for you to, to track changes and to have history, right?
Like, so you can time travel basically in terms of source code.
But do you need branches and do you need multiple versions of your software?
Michael: Well, this is super.
Do you mean for databases or do you mean for
Nikolay: For, for, for regular applications.
Simple like single page application.
I dunno.
Something simple,
Michael: If you're on your own, I think
branches have minimal additional benefit.
I know some people love them for kind of exploring different
paths if they want to try something experimental and then
quickly come back and fix a bug for a customer, that's, it's
like nice to not have been developing on the main branch.
So I can see benefits to that.
And also I think it teaches good discipline if you, when
you do have collaborators, but I don't see massive benefits.
How about you?
Nikolay: Well, yeah.
This is interesting.
So branches can be used in various ways.
And if, if I'm alone and I have solid developer, I still have some benefit
of them because I can, exactly like you said, I can experiment with
some path of development and then abandon it if, if I consider it wrong.
And with branches, it's good that they are very quickly to create them.
It, it costs nothing very quickly.
Boom.
And I have a branch.
I can experiment and I can uh, Multiple paths of
development, and then I choose which one to take.
Maybe I will combine multiple ones.
This gives flexibility and freedom, right?
But also branches sometimes are used not as for free.
It's like for feature development or back fixes.
They are used to track different versions of software and Postgres.
Doesn't post this project doesn't use feature branches, right?
It uses major version branches like stable release number 15.
We have a branch and all.
Bug fixes are happening inside this branch, so you have
multiple lines of your tree of development and it, I can easily
imagine if I'm so developer, I develop some popular software.
I still can benefit from branches.
Like in this particular case, I maintain some old version.
I already re refactored that a lot.
I already have new version, but I still maintain the old one.
to back fix it sometimes.
So branches are needed for this as well.
But maybe we went too far from databases already, right?
Maybe not.
Maybe yes.
Michael: I think this is all useful, but I would say that
this is where the next major benefit that I was gonna mention
is obviously in teams being able to work in isolation and
not have to worry about stepping on each other's toes.
And that's where it gets super relevant for databases.
So when I was at Redgate, we actually supported two
different models of source controlling your database.
One we called the shared model because most teams at the
time and probably still now tended to have, even if they had
databases locally for developers, they would tend to have one.
Big database, so I know this is where your product comes in key.
But they would have one large database where you could do
proper testing, proper checking on decent data volumes, and
they had to be really careful of stepping on each other's toes.
For example, if you, you and somebody else happened to be working on the same
stool procedure or function at the same time, you get the case where the.
Last one to change, it wins.
And that's a, that's a real problem in terms of testing.
If you don't understand why, why your feature's not
working and you then you realize somebody else changed it.
So we did support that model and we would, we
actually ended up implementing it via locks.
So, or triggers, actually we called it locks, but you could
lock an object so other people couldn't change it, for example.
So, There was like some super interesting
features around that, but that was a stop gap.
I, I saw it as a stop gap because the better
version was we called it the dedicated model.
So where everybody had their own dedicated version of the database
and they could make changes in isolation and then commit those back
to the repository for, with, or the source code and, and maybe push
those to a central system, like deploy those maybe to an environment.
Nikolay: but okay.
But back to gi as being third generation of source
code management systems there's also some small things.
Which delivers some value.
For example, you can easily check diff before you push.
So you already committed.
You can review your own changes and with additional um,
additions in, in form of GitHub or GitLab, you can also
have very, very powerful process of like a review process.
Pool requests, merge requests, and so mul there are
multiple levels of quality control, I would say.
Right.
And this, this is great.
And like also also forking, So if you have open source
project, it's easy to just fully clone it for it and develop.
And if you see some value push back changes.
Propose suggest in form of pull request or merger request,
suggest these changes to the up like upstream project.
But benefits are not like in the same area.
There are like different in the different areas, but all of
them allow to build very powerful and very good workflow.
And there are not many workflows.
Not one which are proven to be successful in various teams.
But back to databases, we know that currently I would say we
have like also some generation of schema change management tools.
Some people's name them, I believe it's came from Ruby.
I'm not sure.
Database migration tools.
Right?
Because Ruben Rail's active record has part of it called database
migrations, and it's not migrating from Oracle or to pores.
It's about chemo changes basically,
Michael: Yeah, migrating from one version of
your schema to another version of your schema,
and it.
this is really important.
This comes back to your diff point from earlier in code
in applications we can d and it's, it's simple text files
and we can deploy it by overriding one with the other.
And that's fine because there's no data, there's no, Yeah.
Whereas in databases we need, like if we are going to, we can't
just replace if we, if we rename a column for example, we can't
just delete that column and, and add a new one with the new name.
We
Nikolay: in.
If you roll those data, yes.
Michael: Yeah, of course.
So that's, that's a very simple example, but there are loads of
these examples, and I think this is where at Redgate we had a huge
advantage because we had a schema comparison tool that did these DIFs.
It was, that was the starting point,
Nikolay: this schema comparison hap That doesn't happen without context.
Like what happens with this example?
Michael: This is where it gets really complicated, and
I think it's mostly unviable in Postgres at the moment.
I don't think anybody's doing this in Postgres at the moment.
You need to have a live database.
Source code files.
So these, these files are create statements.
They create statement per object and you need to be able to do comparisons.
But of course if you're comparing two static files, how do you know if it's a
rename or dropping one column and creating another one through static files?
You can't tell.
So
you need to.
You need to add migrations or, or what,
what rails cause migrations at some level.
So they ended up kind of backing into that and ended up also acquiring
a, a tool called Flyaway DB because that was very migration based.
So I think just.
Just quickly, I wanted to say, I think there
are two two d completely different models here.
One of which is kind of nice in principle, but I don't think works in
practice, at least not yet, which is these static create statements.
And the other one, which is a, a single dump of your schema
from a point in time and then migrations between versions
thereafter, which is what all of the frameworks like Rails
and Jango are, are doing, and the other tools do as well.
Nikolay: Right.
And if you have some live database, even if it's not
production you can distinguish drop create versus alter column.
If you check catalog system catalogs,
Michael: Like object IDs.
Yeah, exactly.
Nikolay: Not, Yeah, object or, or, or in pos It's P attribute has
at nu, so like numerical or, or number where column is, and you
can say, Oh, this is the same column, like, just rename so you
can distinguish this case or versus it's a completely new column.
Interesting.
So you mentioned Flyway and now I know Redgate.
Few years ago purchased this project.
There are other projects like Ski.
I like ski because it's like very, quite powerful, although
it's written in Pearl, so it's hard to contribute for me.
It's very powerful and works with many, many databases as well.
Also, there is very popular, like Liquid base, right?
Liquid base is
very jungle.
Also like a frame being framework.
Jungle has that device migration capabilities similar to Rails, but
all of them, correct me if I'm wrong, all of them leave on top of gi.
They rely on GI for version control.
Right.
Okay.
So what's their value?
Why do we, why do we need them at all?
Michael: Is it to do with deployments?
Is it to do with
Nikolay: Okay.
Michael: actually applying the changes, picking them
up and, and tying them to, to code changes as well.
For example, if I have database change that needs to go
alongside an application change, that's my understanding.
Nikolay: Yeah.
If for example, if we talk about, well of obviously framework in Rails and,
and John Go, they are also, they support both frameworks are also ORMs.
So migration parts of those rms also support like upper.
Like higher language, not sequel.
Right.
So because that's all I am, so we can change chemo also
using our native language, avoiding trying to avoid sequel.
But for quite advanced changes, you still need sequel.
For example, if you want trigger.
Probably you need to write some PL purchase
scale code and some SQL code and so on.
If you need some advanced ct, you also
need or define some view with advanced ct.
You need, you need some sequel.
But generally they try to Of to also
support native language like Ruby or Python.
And also, for example, in Rails, I see that people start,
I observe it in multiple projects on the GI lab as well.
They start with not dumping schema.
So they, we support steps and also dumping schema for additional control.
This is our like, current baseline.
And we shifted.
Shifted, but usually projects start with.
Schema, rb, I think in Rails.
And then they switched to Tructure sql, which is a scale regular dump,
and it's easier for Postgres guys to compare it and deal with it.
So yeah, I remember GitLab did it when they decided to stop supporting my.
Before before they supported both SQL and my uh, pogs
and it was not possible to have a structure skill,
which is, which should be like single database only.
But my question is to this like systems, okay,
You provide some, like usually like forward.
And backward.
You can define forward and backward steps.
Sometimes you can define also testing step.
they allow you to deploy.
They allow you sometimes like to have some dry deploy, not doing anything but.
Checking correctness of your changes.
But why don't you care about database data?
Database is data.
It's not tiny, unlike source code.
Sometimes we have millions of code, of course, but
usually it's relatively small compared to database.
But database is usually big.
And if you need to change something without blocking,
not blocking others, you need to have some DML as well.
Not only ddl, and this DML should be split to batches.
Why don't these all tools supported at all?
Michael: I think it's difficult and I think if we look, if we talked
about those three stages of application progress or application,
source control, progress, I think we are in database land.
We're kind of in step one.
Still.
I think the tools are still quite immature for
Nikolay: Generation one.
You, you, you, like.
We
Michael: Maybe, yeah.
Maybe that's a bit harsh.
Maybe, maybe we are like just sneaking into number
two with some of these tools becoming better.
But it's, if, if you're not on rails or Jango, I think it's
actually quite difficult to, to get this stuff up and running.
So, two.
Well one, one issue, for example, even before you worry about
online changes with big data, but you know, actually worrying about
zero downtime deployments even before then, What about merging?
What about if I make some changes on a branch and add a
migration and you make some changes on a branch and add
a migration, but we want to deploy those together like.
If they
Nikolay: Conflict resolution.
It's called
conflict
Michael: Yeah.
so exactly.
Conflict resolution is difficult.
Nikolay: Yeah.
So two, questions.
We erase zero downtime migrations, for
example, Actually basic diff capability.
If we rely only on gi it'll be very weak.
We, for schema, we need diff and not as pja and others.
very naive or negro or others.
So they have very naive diff like for example,
they say create index not mentioning concurrently.
So if we have substantially beacon and substantially loaded database, we will.
Block them and nobody, no.
Only tiny projects can afford using such approach.
So create index should be concurrently alter table changing column data type.
Also, this is harder.
This should be in batches and so diff.
You, you say conflict resolution, I also say deployment
accord in accordance with this zero down time approach.
So you, we need batch management background
like workers, sometimes lasting days.
if you need, for example, to change data, type in
a colon in the table, which has a billion of rows.
You need to, to be very patient and wait
one or two days and split it to batches.
Batch size should be properly adjusted so they don't care about it.
Why Hard problems, or I, I'm, I'm mistaken and it's not needed at all.
Michael: I think it's a hard problem personally, I think
Nikolay: Three problems already.
Already.
Three
problems, ,right?
Michael: And I've actually been, so I've
got a friend working in this area, kind of.
Somebody I used to work with got called Neil Anderson working on a tool
called Postgres Compare, and I've been trying to encourage him because,
because as you say, the diff tools in most the diff tools that have been
added to most IDs and the open source ones aren't super intelligent.
They're good.
I'm sure a lot of people have put time into them,
but I think he's put a lot more time in the charges.
It's a commercial tool and he's been able to put a lot more time into it.
So it's, it's
Nikolay: It doesn't support zero downtime migrations.
Michael: I'm not sure.
It's mostly for
live schema to, Yeah, it's mostly for like live schema
to schema comparisons and I don't know if there, like
Nikolay: scheme of the without data.
Right.
Michael: No, he does do data as well, actually.
It's in, in the same
Nikolay: Oh, okay.
Okay.
Michael: Yeah.
Nikolay: but it's kind of different features, right?
So
Michael: yeah, I think data, like I was gonna mention data here
actually, cuz I think data can sometimes be an important part of
your schema, which again, some of these tools forget and neglect.
If you've got a lookup table, for example, of
what the, the famous one is country codes or Yeah.
Ex anything where you, you need that
Nikolay: People put it to migrations.
I see it.
Although quite often in Flyway, for example, just in short.
Michael: Yeah.
Nikolay: And sometimes some teams consider this as a
bad practice and try to avoid, but some teams do it.
Michael: I don't know what the alternative is
though, if you're gonna, If that's the migration.
Nikolay: good question.
I don't know either.
So.
Michael: So we, we had to worry about that because
we were having create statements in version control.
So if, if you've, if you are trying to do the alternative, which
is have a create statement per object and manage DIFs, you do need
to c you have to choose which tables need to also go in data wise.
So it's, it is a com I think it's a hard problem and complicated.
So I do have.
empathy with the people making these products and making these services?
I do think databases are fundamentally harder to do this for than
applications, but I do also think there's a lot we can do to make them better.
the create index concurrently as a perfect example of quite
a simple change that probably should be in most of them.
Nikolay: Yeah.
Also, you know, you know what?
I think if somebody developed open source tool, which supports zero
downtime, diff, I think it would solve the problem of merging and conflict
is adoption because we would see this deep and this is our emerging process.
We just.
Maybe not.
actually If we merge to, for example, main
branch, it's our deployment to production.
What do you think?
Michael: Yeah, I think mergings trickier, but we
might be able to lean on, get for some of that.
So for example, if we are changing the same function, , the function's gonna
look very similar between our two, but if we're ordering a table that's,
maybe we can apply one after the other, but I'm not sure when order matters.
It's a tricky one.
I haven't thought about this for a few years, to be honest.
Nikolay: if like, if you consider a functions triggers as a part of.
Schema, and I think we should this also, we probably, yeah, we
probably already, in previous episodes, we already mentioned the
problem of include in all these tools, current generation 1.5, right?
Tools, if we include different file and keep functional budget there
for convenience, like for example, each function, it, it's a non.
Easier to find it, navigate and so on.
Do we still want to support long sequences of deployments?
For example, we have several installations of
our product and one is already very up to date.
And another one is letting few months of changes.
And the problem is when you include you need the current
version of file, but when you already function changed many.
And you try to replay old migration file it, it tries to
include the very fresh, very new function body and it's,
it's not what it is expected like couple of months ago.
Right.
And you have a problem.
. So you need to somehow fetch, you can use GI history and
fetch older version, which corresponds to this migration file.
But why Nobody cares about this.
I needed to implement it myself.
And like I thought about how to contribute it
to ski and it was like, Quite hard actually.
I raised the topic about it and my linguist and I didn't see support.
Things like it.
It looked to me like I'm alone with this problem.
Like nobody cares, right?
Maybe I'm really alone.
How people do this or from your practice is it the problem?
Or people just don't need to replay along sequences of change.
Michael: I've seen more people kind of like try migrating forwards.
I d I actually see quite a few people not even
worrying about backwards, like moving backwards.
And it's it's an interesting, potentially.
Nikolay: back backwards, it's about also if you have multiple,
you have multiple sit ups and one of them is legging quite
much, they need to replay, say 10 database migrations
and they fetched the very fresh version of our code.
But this code already has a much newer function file.
And then when they try to replay some old
migration, which tries to include this file.
It ex it expected very different function or view, for example.
So problem with includes including different files.
It's a problem.
You always need to snapshot whole function value
to migration, which like blows your code, right?
But you cannot keep a function in a separate file.
This is the problem with all these.
Michael: Yeah, I'm, If we only have migrations from
a, from a baseline, I don't think this is a problem.
Nikolay: Problem with starts when you try to use, including,
include some file like with ski, it's very ski feeds files to P sql.
So you can use backslash.
I
Michael: Oh, okay.
Yeah.
Okay.
Nikolay: this is a problem.
And if you use besla shy, and this is like two month old
migration, this file already changed and you, you need to re.
And, uh, this is the problem.
Mm-hmm.
.
Otherwise you need to, every time to, to put a whole
function body to migration and bloods, everything is.
Very huge.
But I mean the, this, if you, if you put every time you say create
a replace function or drop function and create it again and put
everything and in reward as well, like whole function, again, it blows
it so much that I, I'm starting to ask why, like, do we have GI still?
No, Nobody cares about, We have good here.
This is why it was created for Right . Why
we copy paste functions, function versions.
Where is version control at all?
Michael: this is what I liked about the single create files per Object,
but it has such big limitations right now that I don't think it's viable.
So I think we have to accept the putting the full object back in.
Maybe you could argue
Nikolay: But how I, I don't understand.
If you keep object in separate file, how you replace sequences of changes.
So like, You still have versions, right?
Like our first version, then like maybe timestamp, you take day or something.
Timestamp.
This is new version.
New version, new version.
If you keep object in separate file, you need to include it.
And if you try to replay very old migration with
very new state of code, you'll have problem.
Michael: I think you need to have a database aware diff functionality.
And it also probably needs another folder of overrides.
So if you have, for example um, easy migrations, like adding, if we are
always adding, adding columns, adding indexes, adding tables, easy stuff.
We don't need to have any special cases.
We can just The diffs are easy.
If we do something more tricky, let's say from I know
GIT doesn't work like this, but let's say from version
15 to version 16, we need a more complex migration.
We can say, we can put in a folder somewhere.
for going between 15 and 16, which includes going from
14 to 18 or any other way past that boundary, we can say.
Don't use what you plan to use, use this
uh, override instead, but it gets comp.
That's kind of like a hybrid between the
two models we've been discussing and I,
Nikolay: Well, there is a big difference between like objects with data.
And very lightweight.
Objects.
Objects with data is is tables.
It's called tables.
That's it.
Only tables have data.
If we forget about marginalized use, which
is derivative for, for from tables, right?
View views, triggers, functions.
They are light that don't have data, so we always
Michael: you
Nikolay: them, throw them,
away and electrify.
Well, yeah, Yeah.
But in sense it's like part of it's, it's like part of tables
probably, right?
Because they cannot live without tables.
It's like, okay, it's addition to tables, add
onto tables, but functions, triggers and views.
Regular views.
Maybe we trust use as well, because they
like have only derivative data as well.
You, you always can redefine them and you can keep them in files.
That's why
Michael: Wow, that's a really interesting point.
I haven't seen a system that those two.
Nikolay: I built It a couple of times,
Michael: Okay.
Cool.
So you had one system for
Nikolay: Yeah, remember we discussed the approach when functions are
kept in separate ches and you can keep old version of this is smart
and good, and I think it's possible to build without copy based seal,
like to rely on geek for version control and have like version one
schema functions version one, functions version two, and being able
to even keep them several versions at the same time and different
application notes can address different versions of functions.
This is very smart and this is exactly when we
probably need to functions in separate files as well.
Right.
And I, I should mention that this idea was like came
to mind of Valentine kri long, long ago from Solano and
then propagated to Russia, to ATO and others and so on.
So it's great idea.
Still relevant.
I think I'm not sure about triggers because triggers,
they consist of trigger function plus trigger itself.
So trigger itself, it's a bit easier that it also requires.
Things.
Oh, by the way, big question number four or five probably.
Why do all these guys don't care about it?
Like about simple problem of looking attempts?
If you try to alter table at com, you always must have
small, locked time out and not be ready to try otherwise.
Well, you, you don't notice the problem if
you have not big load, not high lot, right?
Not high lot.
But if you have noticeable load like thousand gps, sooner or later
you will bump into this problem very badly as many did already.
So you need be graceful.
Otherwise you try to alter somebody looks and
you cannot acquire a look and you have huge.
Line, including selects after you waiting, so you block everyone.
They don't care about it.
Why
Michael: Well,
Nikolay: Because it's generation
1.5, generation 1.5.
That's why.
Michael: It's a really good point, but that
also lives in a slightly different place, right?
That lives in the, in the config file.
And that's, you know, not necessarily well, is
that in version control, ho, like hopefully.
But it's not in the same version control for most people.
So it's deployed in a different way, updating in a different way.
So it's very, very interesting.
Yeah, really good point.
Maybe, maybe that should be in there as well.
Nikolay: I hope they will develop in the right directions.
So people need these directions.
People need zero downtime, migrations,
conflict resolution graceful deployments.
Right?
And by the way, I want to advertise.
Uh, GitLab has open source and background migrations
with, it's called Migration helper helpers do B file.
So it's open source.
Anyone can check.
And if you are Ruby user developer, you can benefit from
learning their experiences, an enormous volume of experiences
already coded there to support zero downtime aggressions.
Also, actually GitLab has many installations and some of them
may leg and need to, This is exactly the problem I described.
I'm not sure if they keep some objects in separate files, probably not.
But they have this problem to deployments
should be without issues on many, many setup.
So it's, it's,
interesting to check.
I hope that one day they will open source it separately as a
separate jam or like library to be great for Ruby developers.
Michael: That'd be really cool.
there's a really good blog post, about zero
downtime migrations from the team at Braintree.
That's excellent.
That I think I'll include as well.
Nikolay: Don't forget to include my lifetime
out post about this red rice problem.
Michael: Will do.
And yeah, I would love to hear from anybody listening to
this that's screaming at us and saying that you've got this
solved or you've got a really good process on your side.
It would be great to hear how people
Nikolay: Yeah, maybe we are wrong like this.
This episode is full of criticism, but it's
opportunity to improve for our tooling.
So maybe someone can say there is some
very magic tool which solves all problems.
Right.
Michael: Yeah, let us know.
Wonderful.
Nikolay: Okay.
Michael: Well, thanks Nicola.
Thanks everybody for listening.
appreciate everybody that's got in touch recently with feedback
and yeah, looking forward to speaking to you next week.
Nikolay: Great.
Thank you.
Bye bye.