Michael: Hello and welcome to Postgres FM, a weekly show about all things PostgreSQL. I'm Michael, founder of pgMustard, and this is my co-host Nikolay, founder of Postgres.ai. Hey, Nikolay, what are we talking about today?
Nikolay: Hi Michael. Last time we discussed versioning of database schemas, or database migration management; I already forgot the proper name for it. We can continue in this field and discuss database branching today, and how it is different from, say, snapshotting, or from schema version control, because it's an adjacent area which is not yet well developed. We don't have good tools yet, by the way. So we will probably discuss some ideas and concepts, what to expect in the future, what various companies are doing, and also what my company does, because it looks like we are going in this direction: we are developing database branching right now.
Michael: Yeah, exactly. Depending on exactly how you define it, it seems like there are quite a few database companies at the moment talking about branching, but each one means something slightly different as you dig into it. I've seen your conversation on Twitter with a few people, trying to understand what they mean by it and trying to get some definitions down, so I'm looking forward to hearing your thoughts around that. Is it worth us discussing Database Lab Engine first, maybe in terms of what it does, what you are calling branching, and what some others are calling branching?
Nikolay: Well, I think a discussion of Database Lab Engine as a whole is maybe a separate episode, because there are many things it can do and many different use cases where it is useful. But briefly, I think it's a good idea, so let me do a quick overview.
So, Database Lab Engine, which we develop at Postgres.ai. It was born when we needed to experiment very quickly: to check, first of all, SQL optimization ideas, not on production but in some non-production environment where Postgres behaves identically to production. And we needed to isolate the experiments of different people. We also needed to iterate: to reset quickly, to throw out bad ideas and switch to new ideas quickly. But when you build indexes for hours, or you have already changed your schema heavily, changing data (sometimes you need to spend many hours converting some column), and then you hit a dead end, you need to start from scratch. It's usually quite difficult to have another environment provisioned quickly. So we solved this, originally for query optimization, using thin cloning: thin provisioning based on either ZFS or LVM, although other options are also possible to implement.
Without going into big detail: you can run a single server with dozens of logically independent Postgres instances, where the database is the same everywhere, but each is writable. So you can deviate, you can create your own index, and the planner behaves exactly as on production. This is the trick, and the creation of a clone takes only a few seconds, regardless of database size.
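The copy-on-write mechanics behind this can be sketched in a few lines. This is a toy model of the general CoW idea used by ZFS and LVM thin provisioning, not Database Lab Engine's actual implementation; the `ThinClone` class and block layout are invented purely for illustration.

```python
# Toy model of copy-on-write (CoW) thin cloning: a clone starts as a
# pointer to the parent's blocks and only copies a block when that block
# is first written. (Illustrative only, not DLE's real implementation.)

class ThinClone:
    def __init__(self, parent_blocks):
        self.parent = parent_blocks   # shared, read-only view of the origin
        self.local = {}               # private copies, created on write

    def read(self, block_id):
        # Prefer the private copy; fall back to the shared parent block.
        return self.local.get(block_id, self.parent[block_id])

    def write(self, block_id, data):
        # Copy-on-write: only now does this block consume extra space.
        self.local[block_id] = data

origin = {0: "pg_catalog", 1: "orders v1"}   # the single "thick" copy
a, b = ThinClone(origin), ThinClone(origin)  # created instantly, any DB size

a.write(1, "orders v1 + new index")          # clone a deviates; b unaffected
print(a.read(1))     # orders v1 + new index
print(b.read(1))     # orders v1
print(len(b.local))  # 0, so b has consumed no extra storage
```

Creating a clone is just constructing a pointer, which is why it takes seconds regardless of database size; only the blocks a clone actually changes cost extra space.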
Sounds like magic, but this magic is going to kill another kind of magic. Many people call Postgres DBA knowledge and skills an area of black magic: you spend 15 years, and then you can quickly say "this will work, this won't work", and people say "oh, you are a black magic guy". Black magic means something is hidden. Our magic is white magic: nothing is hidden, and any developer, any engineer can see the behavior, experiment, fail, and so on, without disturbing others.
So this is what we did: many clones running on a single machine, so you pay for one machine and have dozens of clones. And we extended this to the area of testing in CI/CD pipelines. It's a whole new world as well, and again a whole big topic: what can be tested in CI/CD pipelines in terms of the database, in terms of Postgres?
Our idea is that we finally have a real database in CI: not a tiny one-gigabyte something, just generated or, I don't know, pulled via Docker, but the whole database. It can be hundreds of gigabytes or terabytes; it doesn't matter for us. We can set it up and make pipelines work, provisioned very quickly, in a few seconds. So testing is another area. People do various things there; for example, some just test pg_upgrade inside our containers, which is also possible. And the key here is that we have already been doing this for a couple of years, maybe three.
We always said this is thin cloning; we used the term "clone". It's very natural for cloud engineers, for DBAs, for SREs, because "clone" is a term used in the cloud: you can clone your EBS volume, also from snapshots, so it's copy-on-write there as well. It's also thin provisioning, but of a different kind, because you pay for each volume separately. Still, "cloning" is used there. Or take RDS clones: Aurora also has thin clones, with a single storage but multiple instances, all of them using that same storage, and you can have multiple writers, so you can do writes independently. Similar, but again, you need to pay for each compute node separately. That's why neither RDS clones nor Aurora clones are good for testing in CI/CD pipelines: you want a constant price, a constant cost. Also, by the way, you need to wait many minutes to provision them, in both cases.
Michael: I guess thin cloning is specifically named as opposed to thick cloning, where you take a full copy. That's what a lot of systems have offered for a long time, and this is obviously a step above that in terms of speed, but also in terms of not having to have that extra storage.
Nikolay: Thick cloning. Actually, we could devote a whole episode to it, and maybe we should, because it's also an interesting question: how do you clone a large database using regular tools? Do you copy the PGDATA directory, or use pg_basebackup, at the physical level? How do you do it live, without interruptions? Of course, any experienced Postgres DBA knows the answer: pg_start_backup, or just use pg_basebackup; by default it will be okay, and you can do it live. Or you can do it at the logical level, using pg_dump and pg_restore. There are questions about how to speed it up and so on, but roughly, we consider both of these thick cloning; we just distinguish the physical and logical levels. So dump and restore is also cloning, but at the logical level.
But you can choose which objects to clone there, right? And you can speed it up using -j, multiple jobs. In this case, though, you need additional space, because you cannot use -j and do it on the fly. That problem is solved by another tool, which Dimitri Fontaine is developing: pgcopydb. Maybe I'm wrong with the name of the tool, but it's quite new, and it exactly provides the ability, at the logical level, to use multiple threads and avoid the intermediate backup file, so you can do it on the fly. That's interesting. But it also raises the question of how long your transactions on the source are. There are many, many things in the area of thick cloning.
I know it very well, because to provision Database Lab Engine, we first need to get the data in a regular way, either logically or physically.
Michael: Like one thick clone that you can base the thin clones off, but that's where I...
Nikolay: And we also need to maintain it, either continuously or by doing a full refresh on a schedule.
Michael: Nightly or something. But that's where this becomes really useful, I think, for the branching discussion, because suddenly, if we can do thin clones or something like them, we get the concept of branches that aren't just empty, that aren't just the schema. They can have real data behind them as well.
Nikolay: So what happened with the term "branching"? First of all, I didn't realize it in the past, but now I see it very well: "cloning" is very much infrastructure language. It's not friendly to developers, because in Git there is "clone", but it's kind of different: you clone the whole repository, and there you have revisions, commit numbers, and branches. Cloning is the language of SREs and DBAs, infrastructure people. Of course, any engineer knows "clone" in its various senses, but developers still prefer "branching". And suddenly, some time ago, PlanetScale, which originally provided sharded MySQL with Vitess (they develop Vitess; the founders are the same people who created Vitess), from my perspective suddenly turned to this problem: they added schema management capabilities and called it database branching, with zero-downtime deployments for schema changes, hassle-free.
So, great. On the front page (this was maybe last year, in 2021) I saw that it was branching. But when I clicked into the documentation I was curious, because it's our area. I felt like: okay, do they play with thin clones, with thin provisioning, or what? But inside you can already see the term "database schema branching", a slightly different term. And you realize that they clone only the schema. Then you can change it, then they produce a diff (one of the topics of our previous episode), and you can see this diff, you can approve it, other people can approve it. So there is a flow, and then it's deployed in a zero-downtime fashion.
And nothing about data.
Michael: So that raises the question about testing: how do you test on the performance side of things? And how do multiple people work together? That kind of thing.
Nikolay: Well, first of all, this is also a viable approach, I should admit, because zero-downtime migrations are a hard problem. By the way, last time we didn't mention my article, something like "18 mistakes when making schema changes in Postgres". In that article, out of the many types of mistakes you can make, I selected some and discussed them in detail; it's on our website, Postgres.ai. So it's a good problem to solve, and actually a very hard one, because most of the diff tools we see (maybe I'm wrong, but Liquibase has a diff tool, as we mentioned, and somebody in the YouTube comments raised this, thank you so much), Liquibase has diff, other tools have diff, and there are separate projects like migra. But all of them show a simple diff: CREATE INDEX without the word CONCURRENTLY, not discussing the problem of how to change a data type in a one-billion-row table. So they solve a hard problem.
But there is a harder problem: how to generate and apply the diff in a zero-downtime fashion. As far as I understand, PlanetScale shows the diff in regular form, but when they apply changes, they perform something similar to the pg_repack approach.
You create a full copy of the table, recording all changes in some delta table, a change log, right? Then, in a single transaction or in multiple transactions (that's also an interesting topic), in steps, you apply all the changes, and then you switch to the new table. Of course, this approach requires some disk space, and it's kind of too heavy for small changes. Sometimes; it depends. But it's interesting that they have full automation.
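The copy-and-replay approach just described can be sketched roughly as follows. This is a simplified, hypothetical model of a pg_repack-style online rewrite (the real tools do this with triggers and careful locking inside Postgres); `online_rewrite` and its arguments are invented names for illustration.

```python
# Simplified sketch of a pg_repack-style online table rewrite:
# 1) copy existing rows into a new table, 2) record writes that happen
# during the copy in a change log (delta table), 3) replay the log,
# 4) switch to the new table. Illustrative only.

def online_rewrite(old_table, transform, incoming_writes):
    new_table = {}
    change_log = []

    # Phase 1: long-running bulk copy; concurrent writes go to the log.
    for key, row in old_table.items():
        new_table[key] = transform(row)
    change_log.extend(incoming_writes)  # writes captured during the copy

    # Phase 2: replay the delta; after this the switch is a quick step.
    for key, row in change_log:
        new_table[key] = transform(row)
    return new_table

old = {1: "10", 2: "20"}                      # column stored as text
writes_during_copy = [(3, "30"), (2, "25")]   # rows changed mid-copy
new = online_rewrite(old, int, writes_during_copy)  # change type: text -> int
print(new)  # {1: 10, 2: 25, 3: 30}
```

The extra disk space Nikolay mentions corresponds to `new_table` coexisting with `old_table` until the switch, which is why this is heavy for small changes.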
But again, they don't care about data in this case. Their CEO said in a Twitter discussion that they are working on data branching. I'm very curious how they will solve the terminology problem. You know the two biggest problems in computer science, right? Naming, and cache invalidation.
Michael: And off-by-one errors.
Nikolay: So obviously they have a naming issue, because they already use "database branching" for schema only. Then, earlier this year, I saw that Supabase has database branching on their roadmap, and Neon also appeared. And Neon said: we are an open-source Aurora, right?
Aurora has thin cloning, which is, in my opinion, not good for testing. It's too expensive and too slow: it's thin cloning, yes, but you need to wait minutes, and you need to pay for compute for each clone separately. So if you use big O notation in terms of money, it's O(N) in the number of clones for compute power. Fortunately not for storage, but of course Aurora also charges you for I/O. For testing it's also not very pleasant, but I guess big enterprises are okay with this.
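The "big O in terms of money" point can be illustrated with a trivial cost model. All numbers and function names here are made up for illustration; only the shape of the two curves matters.

```python
# Back-of-envelope cost model from the discussion: with Aurora/RDS-style
# clones you pay compute per clone (O(N) in money), while a single
# thin-cloning machine costs the same regardless of clone count (O(1)).
# All prices are invented for illustration.

def per_clone_compute_cost(n_clones, price_per_node=100):
    return n_clones * price_per_node   # grows linearly with clones

def single_machine_cost(n_clones, machine_price=300):
    return machine_price               # clones share one box

for n in (1, 10, 50):
    print(n, per_clone_compute_cost(n), single_machine_cost(n))
# 1 100 300
# 10 1000 300
# 50 5000 300
```

This is why per-clone-billed services are fine for occasional heavy experiments but awkward for running a clone on every pull request.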
Michael: It's better than nothing as well, right? It's better than not having it.
Nikolay: Of course. Yeah, you can test some heavy changes this way, but this is not something you will use for each pull request or merge request of backend code changes. It's too much.
But also, by the way, observing this area, I found that heavy clones, where we can use all CPUs and so on, are needed only infrequently, by infrastructure teams: for example, for upgrades, or big migrations to new hardware or a new operating system. But developers these days make changes many times per day sometimes, right? It's very frequent. And there we need very, very cheap and fast cloning.
So, okay, back to database branching. Neon, at the very beginning, said: we are also going to be a very good database for CI/CD pipelines, and we have database branching, without discussing in detail what it actually means, or how it is different from cloning, for example, or snapshotting, these infrastructure terms. And some other projects also said: we have a Git-like approach for databases, for Postgres. Then I spent some time trying to understand how branching could behave for a Postgres database to solve the problems of development and testing. And finally I realized that branches are very different from our clones.
Because clones consume memory. In Git, I can have thousands of branches and I don't pay extra for them, apart from some small storage penalty; if all branches are identical, I don't pay at all. But when you run a thin clone in Database Lab Engine, it consumes some memory, because it has shared buffers; it has a Postgres instance running, so it has some cost. We have some limit, of course, defined by the size of the memory on the server and by shared_buffers, for example, right? We can adjust that and run more clones, but still, we have some limit. For branches, we don't want to have a limit, right?
This is one thing. The name is also a thing. And there is another thing, as in Git. It's very good, as we discussed, that Git is decentralized, and one of its properties is that it allows you several stages of review. You can review yourself before you push your commits; you can ask your colleagues to review if you have pull requests or merge requests in GitHub or GitLab, to see the difference between branches before you merge your development branch into your main branch.
But with our clones we cannot do that, and we want to. We want to say: this is our state; and then have multiple colleagues or multiple CI pipelines test it, check it, and continue working with it. So obviously we realized we need it. It was on our roadmap for a while, but then we realized that branching looks like snapshotting on demand for your clones.
So you have your clone, you change something, you create a snapshot via API, CLI, or UI (we have all of them), and then you say: okay, this is a snapshot, this is a commit; continue working with it. So a snapshot started to look like branching, right? Kind of a named snapshot.
And then, you know what I did? I just opened the Git documentation and started to read it from scratch.
And they have, by the way, slightly conflicting definitions of branching. There is no good definition right at the beginning; you need to go deeper. But I found a good definition: a branch is a named pointer to a commit. Well, there are issues with this term, but it kind of works for us too. We say: okay, we have a named pointer to a commit. It can shift: if a new commit is created, it shifts automatically. This is a branch, and that's it.
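That Git definition, a branch as a named pointer to a commit that shifts automatically on new commits, fits in a few lines of code. This is a toy model with invented names, not Database Lab Engine's implementation.

```python
# Toy model of "a branch is a named pointer to a commit": committing on
# a branch appends a snapshot to history and moves the branch pointer.
# Illustrative only.

class Repo:
    def __init__(self):
        self.commits = []    # commit history (snapshot ids)
        self.branches = {}   # branch name -> index into commits

    def commit(self, branch, snapshot):
        self.commits.append(snapshot)
        self.branches[branch] = len(self.commits) - 1  # pointer shifts

    def head(self, branch):
        return self.commits[self.branches[branch]]

repo = Repo()
repo.commit("main", "schema v1")
repo.commit("main", "schema v2")         # "main" now points at v2
repo.commit("dev", "schema v2 + index")  # "dev" gets its own pointer
print(repo.head("main"))  # schema v2
print(repo.head("dev"))   # schema v2 + index
```

The key property Nikolay describes is visible here: branches themselves cost almost nothing (a name and an index), unlike running clones, which each need shared buffers and a Postgres process.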
And we have already developed a prototype; I think it will be in Database Lab Engine 4.0 when it's released. We are not in a hurry: we want everything to work very smoothly and to be properly tested by many teams. But we already have a prototype, and it's working, at the CLI level so far, not the UI. So if you are listening a month or so later, we should have the UI as well.
So you say: I want a branch. You start to deviate: you run a clone for this branch, and others can run their clones. A clone is like your working directory: you just grabbed the content of your code base, opened some IDE or editor, and started to change it. So clones are a means to change the state.
Michael: Well, it's like a running application, isn't it? The source code isn't the same as a running application.
Nikolay: Right, but usually a running application doesn't change the schema; at least, I usually avoid that. Changing the schema should happen during deployment, not during normal work. Some people do it: some evolution of the schema initiated by your users. Temporary tables are also a part of that, but I consider it a very questionable practice, leading to management issues from the DBA point of view. So I would say the normal run of an application should not change your schema; you should try to avoid it. The developer opens the code, changes it, and then commits: git commit, git push.
Michael: What I meant, more, is that I really like the snapshot analogy. The code at a specific commit is kind of like a snapshot of the application, but it's not the application running; and in the same way, a clone is a running database, right? One that you can create from a snapshot, maybe, or... I don't know quite how you're defining these things. But yeah, we don't need them running all the time, just like a developer doesn't need a version of the application running all the time. Just while we are debugging something, just while we are actually trying to test it. Makes a load of sense.
Nikolay: I also wanted to emphasize that we consider these snapshots as a whole, with data. It can be production data, if you can afford it, if there are no issues with PII, GDPR, and so on. But we focus on schema changes. So data: we also snapshot it, and we provide it to clones, branches, snapshots, and so on. But what is most meaningful is schema changes, right? Because these should be deployed at some point. Of course, data also gets deployed, but we want a Git-like approach, with data present, but applied to schema only. We don't want to have data comparison on deploy. Maybe we will want it as well, because we have the data right there; we could do something there too.
But so far, looking at the problems we have, we just want to mirror the capabilities of Git and bring branches, database branches, to let you build non-production environments effortlessly, matching your code. So you have a development branch of your code, and you have a development branch of your database. You can quickly take this code somewhere, on your laptop or on some non-production machine in the cloud, and then you can request a clone for this branch, at the latest snapshot. This branch will be used, we have Postgres running, they go together, and you can start testing and developing your application and see how it works with a lot of data, similar to production. This is what we do.
But if you make some data changes, we think they are not so relevant, because a tester can do many weird things with the data there, and then we just need to throw it away. So when we merge, we merge fully, but we look mostly at the schema.
So far we have relied on people using one of the tools we criticized last week: Sqitch, Liquibase, Flyway, Rails migrations. We see that people already use them, so we are not going to compete there; there is already something to help. We discussed the problems they have; I'm sure a new generation will be born in the near future. But what's really not solved is how to test changes properly with a lot of data. Here we have this branching.
I just described it, maybe not very well, because it's still not very precise; it's already clear, but not super clear. What I'm trying to do here is define what database branching is at its best, and this is how we see it. We are already developing in this direction. I'm curious what other companies think, actually.
I think it would be good to synchronize our thoughts and to move in similar directions, because in that case everyone wins, and developers have similar concepts in various products, right? Actually, in source control management systems, snapshots, clones, branches and so on also have different meanings if you compare them in detail; they have differences. So probably it will happen here as well: database branching can have different meanings in different tools. Sorry, I'm talking too much.
Michael: This is great. And I think you're working on a document, right? Are you gonna share that on Twitter when you're ready?
Nikolay: Yeah, I have a draft RFC in this area, discussing goals and anti-goals. For example, we do not want to deal with data diffs and so on; we focused on schema changes mostly, because production probably has different data. For example, we don't want to release data patches. And also there is a problem: you created a column... I'm very sorry, I feel sorry all the time, because we probably have a delay and you are trying to interrupt me, but I have already switched topics. You wanted to ask something, sorry.
Michael: I was just gonna add, on the data front: I think I mentioned it last week, but when I was doing this before, we found that sometimes data is schema, like lookup tables. So you might have to worry about that at some point. But I think we only added it in version two or something, so you can definitely get away with not having it in there at first. It does feel like it is important, though.
Nikolay: I spent some time trying to think about how to merge. Merge is basically deploy, if you consider that main is what should be done and should be present on our production environments. So merge means you need to have a diff, and then you need to bring this diff to production and roll it out.
Michael: A diff is slightly different, isn't it? A diff is: compare these two things. But then I need a script in order to actually make one of them the same as the other, and that's not the same as a diff. But yeah, I agree: the first step is a diff, and the second step is the script.
Nikolay: Well diff can be seen as a series of DDL comments.
In this case, it's, it's same, but the problem is that you see it
these dd well diff tools for post schema, they show alter comment.
created these comments.
Michael: Yeah, but I consider that a second feature, right? The first feature is: compare these two schemas and highlight the differences.
Nikolay: Well, maybe. So we have two approaches to diff, I agree. But the problem with deployment: if you apply this generated command to production as-is, you will block people. So you need an advanced diff, and we spent some time prototyping this as well. Then I realized it's already kind of solved. Not solved well, but we have a zoo of various tools for migration management, like Sqitch, Liquibase, Rails migrations, and they all ignore the fact that data should be changed in batches. Again, I'm advertising GitLab's migration helpers, which solve this very well for Ruby. And it's hard for us: either we need to choose something, or we need some abstraction. And then I realized: okay, people already solve this somehow; let's just avoid this problem.
Currently I consider that an anti-goal, and we just take care of conflicts: if someone already changed the schema in the branch you are trying to change... In other words, we have something like CVS or Subversion, very centralized. Before you push your changes, you need to update, and only then can you push, resolving conflicts and maybe replaying your changes on top of the other person's changes. So we just take care of conflicts in a quite simple way.
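The centralized, update-before-push rule described here can be sketched as follows. `try_push` and the snapshot names are hypothetical; the point is only that a push based on a stale snapshot is rejected and must be replayed on top of the latest one.

```python
# Sketch of Subversion-style centralized conflict handling for schema
# branches: a push is accepted only if it is based on the branch's
# current head snapshot; otherwise, update and replay first.
# Hypothetical names, for illustration.

def try_push(branch_head, base_snapshot, new_snapshot):
    if base_snapshot != branch_head:
        # Someone pushed since we branched off: caller must update,
        # replay their changes on top of the new head, then retry.
        return branch_head, "conflict: update first, then replay changes"
    return new_snapshot, "ok"

head = "snap-3"
head, status = try_push(head, "snap-3", "snap-4")  # based on latest: accepted
print(head, status)                                # snap-4 ok
head, status = try_push(head, "snap-3", "snap-5")  # stale base: rejected
print(status)
```

This sidesteps true Git-style merging entirely, which matches the "avoid the merge problem for now" stance in the discussion.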
And we don't solve the problem of merge fully; maybe we are postponing it. But here is what we have among our goals. For example, imagine you created a column which is empty. Everything else is filled, you have a full database, but this one column is new and empty. How do you test it? You need something there to test with. So it looks like we need to think about the ability to provide some data for it. I don't like fixtures and seed databases with fake data, but here we need it: we need to fill new columns, new tables, and developers should decide how to do it, should provide some means for testing. So we have everything, but somehow we need to fill the new columns.
Michael: Yeah, like data generation.
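One way to fill a freshly added, empty column for testing is deterministic synthetic data derived from the primary key, so every developer's clone sees the same fake values. The helper names here are invented for illustration; this is just one possible approach, not a Database Lab feature.

```python
# Sketch of filling a new, empty column with deterministic fake data:
# values are derived from the primary key, so regenerating a clone
# always produces the same test data. Hypothetical helper names.

import hashlib

def fill_new_column(rows, column, generator):
    for pk, row in rows.items():
        if row.get(column) is None:   # only the new, empty column
            row[column] = generator(pk)

def fake_email(pk):
    # Deterministic, obviously fake, and unique per primary key.
    digest = hashlib.sha256(str(pk).encode()).hexdigest()[:8]
    return f"user_{digest}@example.test"

rows = {
    1: {"name": "a", "email": None},  # "email" is the newly added column
    2: {"name": "b", "email": None},
}
fill_new_column(rows, "email", fake_email)
print(rows[1]["email"])  # user_6b86b273@example.test
```

Because the generator is a pure function of the key, two colleagues branching from the same snapshot can compare results without sharing a seed file.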
Nikolay: Maybe. And then it gets snapshotted: consider this our test data. We have a deviation from our production, from the main branch, but we already have test data. That's good: other engineers can work with it, test it, play with it, and explore how this feature behaves with many rows.
Michael: Feels like a whole other topic.
Nikolay: Well, testing is a whole other topic; there are major areas we could discuss there as well.
So what excites me here is: any direction I go, I feel, how come this is still not developed? How can we live without it? Well, I see how we live: we test on production, and every time I have this thought. There is a not-very-beautiful term for it: shift-left testing. We want developers to test first; testing should be done at the very beginning, shifted to the very left on that infinite DevOps loop, you know.
Michael: Yeah, I know what you mean, but even if it's not done in production, it's often done in something like a staging environment.
Nikolay: But staging is often very different from production, so we end up testing on production.
Michael: I can see what you mean.
Nikolay: Really testing, really testing. We sometimes pretend, we mark checkboxes: it was tested in lower environments, in staging, everywhere. But if you think about it, was this testing real? It was fake testing. And this is what we want to fix in our development processes.
Michael: Yeah, that'd be great. Any last thoughts or things you wanted to share with people?
Nikolay: Well, just keep an eye on what we are doing. Stay tuned, and any feedback, any ideas are welcome; I'm always ready to discuss this topic with everyone. I think it's one of the hottest topics in the area of databases right now. I mean, many problems are being solved, Kubernetes and so on, but this problem needs to be solved as well, and the majority of development teams will benefit immediately from better, new-generation tooling for building non-production environments. I think we spend too much time thinking about production; but to solve problems on production, we need to start from non-production. And this is interesting. So I'm ready to talk with everyone; just reach out to me on Twitter, email, anywhere. And our regular mantra: thank you everyone for the feedback, subscriptions, likes, and topic ideas. Next time, I think we should choose one of the topics proposed by our audience; we appreciate the ideas. And that's it. Please share in your social networks and working groups: Slack, Discord, Mastodon, right?
Michael: Yeah, of course. Well, thank you, Nikolay.
Nikolay: Thank you, Michael.