[00:00:00] Michael: Hello and welcome to PostgresFM, a weekly show about all things PostgreSQL. I am Michael, founder of pgMustard. This is my co-host Nikolay, founder of PostgresAI. Hello Nikolay, what are we talking about today?

[00:00:11] Nikolay: Hello, Michael. The RDS team just released a blog post about blue green deployments, and I thought it's a good opportunity to discuss this topic in general and maybe RDS implementation in particular, although... I haven't used it myself, I've just read the blog post, but I know some issues and problems this topic has, so I thought it's a good moment to discuss those problems.

[00:00:36] Michael: Yeah, awesome. Even if we look at the basics, I think it's interesting to most people. Everyone has to make changes to their database. Everyone needs to deploy those changes. Most people want to do that in as safe a way as possible, with as little downtime as possible. So I think it's a good topic to revisit in general, and it looks interesting.

[00:00:57] Nikolay: Right, right. So, I think, in general, it's a great direction of development of methodologies, technologies, and the ecosystem, like various tools, and so on. Because bigger projects need it, and not only the biggest projects, some smaller projects also need it, especially those who change things very often.

But before we continue, I would like to split this topic into two subtopics. First is, uh, not-frequent changes we do when we, for example, upgrade, perform a major upgrade of Postgres, or we switch to a new operating system if we have self-managed Postgres, with a glibc version switch, right? Or, for example, we switch hardware, I don't know, like something like big, big changes.

[00:01:49] Michael: Major version upgrade.

[00:01:51] Nikolay: Right, or maybe we try to enable data checksums, maybe this is also one of them. It's generally possible with a rolling upgrade approach, when you just change it on one replica and then another, like, you know, rolling upgrade. But maybe this idea of blue green, which came from the stateless world...

So, this is a big class of changes which is usually performed by the infrastructure teams. And it's not very often, right? A few times per year, usually, right? Versus a different, very different category of problem, which is changing our application code, maybe several times per day, trying to react to the market needs and our changes, trying to move forward, like, go-to-market strategy, and so on.

So, continuous deployment, schema changes, various stuff. So, obviously, it's interesting that the original idea described by Martin Fowler is about the second thing, schema changes and so on, like application changes, which is done probably not by the infrastructure team, but by the engineering team or development team, which is usually bigger in size and needs changes more often, but each one of those changes...

is lighter, much more, like, it's not as heavy as major Postgres upgrades. But it needs to be done very often, and probably in a fully automated fashion, like through CI/CD pipelines, the continuous integration approach, right? So we just change it, a lot of automated testing, and we just...

Approve, merge, and it's already in production, right? So, the original idea by Martin Fowler, and I think we need to start discussing it already, right? It's about the second problem, for developers. While what the RDS team developed is for the infrastructure team and major upgrades. It's a very different class of tasks to solve, right?

Do you agree?

[00:04:03] Michael: Yeah, I do. And I'm probably jumping the gun a little bit here, but I feel like they might be slightly misusing the phrase blue-green deployment for this, for the description of this feature. And I, I really liked this feature. If I was on RDS, I think I would use it, especially for major version upgrades. I think it makes that process really simple, and lower downtime than most other options smaller database users have. But yeah, I completely agree that this is not at all appropriate for application teams wanting to roll out new features, add a column to a table, add an index. It just doesn't make sense.

[00:04:44] Nikolay: Because logical replication doesn't support DDL yet, right? That's, that's why this is like stop, full stop, hard stop.

[00:04:50] Michael: Oh, and even if it did, I think the, the way that this is done wouldn't be appropriate.

[00:04:58] Nikolay: Here I would agree with you, but let's do it later. Like, I'll just, just let, let's make a mark that I have, uh, I have multiple opinions here, no final opinion. So, I have different thoughts. Let's discuss it slightly later. So. Okay, let's talk about the original idea of BlueGreen.

First of all, why such weird naming? It reminds me of red-black trees, from algorithms and data structures in computer science, basically. Binary trees, then the next idea is red-black trees, and so on. So, why, why, why this name? You, you, you've read about it, right?

[00:05:37] Michael: Yeah. I saw in an old Martin Fowler blog post that I'll link up that they had, I suspect, I didn't actually look at the timelines, but I suspect it was back from when they were consulting, I think probably at ThoughtWorks. That seems to be where a lot of these things have come from. And they had some clients who were difficult to convince that they wanted to increase the deployment frequency.

But people were scared of risk, as always, and they had this idea that, well, I mean, it's kind of standard now, but I guess back in the day it wasn't as standard, that staging needed to be as close to production as possible, so that you could do some testing in it, and deploy the changes to production with as little risk as possible. And then they took that a bit further and said, well, what if staging was production, but with only the change we wanted to make different?

And instead of making that change on production, we instead switched traffic to what we would previously have called the staging environment. And they talked about naming for this, I don't even know what you'd call it, methodology, I guess. And they thought about calling it A/B deployments, which makes a lot of sense, but

[00:06:52] Nikolay: Well, I, I... A/B means we split our traffic, maybe only read-only traffic in the case of databases, and we compare two paths, right, for this traffic,

[00:07:08] Michael: Well, the main objection Martin had with that naming is that they were scared the client would feel that there's a hierarchy there, and if we talked about there being a problem and we were on the B instance instead of the A instance, the question is: why were you on the B one when the A was available?

And I think that's... I'm not sure, I think you're quite right that A/B testing might have already been a loaded term at the time. But it also is a good counterexample, where most people understand that in an A/B test we're not assuming a hierarchy between A and B.

[00:07:41] Nikolay: Right, but also this approach says there will be a second cluster, like a secondary cluster, which follows... okay, I'm thinking about databases only, right? Let's, let's switch, like, since we discuss ideas, we should talk only about stateless parts of our system, and the database we should touch only a little, right?

So, okay, stateless. For example, we have a lot of application nodes, and some of them are our production, some others are not production. And what I'm trying to say is it's not only about hierarchy and which is, like, higher. So, yeah, by the way, I remember a similar naming problem, okay, I'm a database guy: if you give the hostname primary to your primary, but after failover you don't switch names, it's a stupid idea, because this replica now has the hostname primary. It's similar here, right? So we need to distinguish them, but not permanently say this is the main one, because we want them interchangeable, symmetric, right? So we switch there, then we switch back, back and forth.

And, uh, always one of the parts, like one cluster, or one set of nodes, is our real production. And the other is considered as, like, a kind of powerful staging, right? But the key question is not only about hierarchy, but how exactly testing is done. In one case, we can consider, okay, this is our staging, and we send only test workload there, which is done, for example, from our QA test sets, from pipelines. Or we consider this secondary cluster, secondary node set, as a part of production and put, for example, 1 percent of the whole traffic there.

These are very different testing strategies, right? So, two different strategies. I think in the original idea it was like: it's staging, all production traffic goes to the main node set, blue or green, depending on the current state, and that's it, right? So, we cannot say it's A/B, because in A/B we need to split 50/50 or 20/80 and then compare.

[00:10:15] Michael: Yeah, sometimes, sometimes, like in marketing, I've heard people talk about A/B testing, which is concurrently testing two things at the same time. And then sometimes they call what this might be cohort testing. They say we're going to test this month, uh, the timelines will be different. But if you wanted to switch from blue to green in one go and send all traffic to the new one, that would be considered... it's not A/B because it's not concurrent, but you might say this cohort is going to this new one and this cohort...

[00:10:48] Nikolay: I would say that they are both A/B, in my opinion, because they both use production traffic to test. So, this is exactly, by the way, the idea: we can switch there for one hour, then switch back. And then, during the next week, study the results, for example, right? It makes sense to me. Or the next hour, I don't know.

I don't really care if it's concurrent or sequential, but the idea is we use real production traffic. It's a very powerful idea. Not only is the data... the application nodes are configured exactly like on production, because they are production sometimes, right, we switch them.

But also we use real traffic to test. I think the original idea was we don't do it; this secondary node set is used as a lab environment. It's still production data, right? Or production. It talks to the production database. But we generate traffic ourselves, like special traffic, special workloads, under control.

This is the idea, the original idea, like we do with staging. But we know this is our final testing. It's very powerful. It uses the same database, first of all. So we should be careful not to send emails, not to call external APIs, and also to convince various auditors that it's fine, because they always say if you do production testing, maybe it's not a good idea.

Who knows? So, but, uh, it's very powerful testing, right? But it's not done with production workloads.

[00:12:22] Michael: Yeah, interesting. And have you heard the phrase testing in production? This feels like a, uh, it's like...

[00:12:29] Nikolay: Many times. I do it all the time.

[00:12:31] Michael: But this is...

[00:12:32] Nikolay: Yeah.

[00:12:33] Michael: Yeah, well it kind of feels like that partly when we're switching over as well, because as much testing as we've possibly done, most of us with a bit of experience know that you can do all the testing in the world and production is just different.

Like real users are just different. They will use it or break it in ways you didn't imagine, or have access patterns you just didn't imagine. So we kind of are testing. And I think that's one of the big promises of blue-green deployments in the theoretical, or at least in the stateless world, you...

[00:13:04] Nikolay: Blue Green. Let's, let's introduce this term, stateless Blue Green. Yeah.

[00:13:08] Michael: is that you can switch back if there's a problem. That feels to me like a real core premise. And why it's so valuable is if something goes wrong, if you notice an issue really quickly, you can go back to the previous one. It's still alive and there are no ill effects of moving backwards. And I think that's a tricky concept in the database world, but we can get to that later.

[00:13:31] Nikolay: Yes, and this is exactly... let's continue with this. I think we already covered the major parts of the original idea, of the stateless idea. We can switch to stateful ideas. And this is the first part where the RDS blue-green deployment implementation radically differs from the original stateless ideas.

I noticed that from the very beginning of reading the article, they say, this is our blue, this is our green, and, like, they distinguish them.

It's a different approach. It's not what Martin Fowler talked about. Very different. So, and obviously, reading from this article, obviously the reverse replication is not supported, but it could be supported.

It's possible, and actually we already implemented it a couple of times, and I hope soon we will have good materials to be shared. But in general, why not, when you perform switchover, why not create reverse replication and consider the old cluster as, like, a kind of staging now? Not losing anything.
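For illustration, a minimal sketch of what such reverse replication could look like right after switchover, assuming the schemas on both clusters already match; the publication and subscription names, host, and user below are hypothetical:

```sql
-- On the new primary (now serving production): publish all tables.
CREATE PUBLICATION pub_reverse FOR ALL TABLES;

-- On the old cluster (now acting as staging): subscribe to the new primary.
-- copy_data = false because the old cluster already holds the data up to the
-- switchover point; only changes made after the switchover need to flow back.
CREATE SUBSCRIPTION sub_reverse
  CONNECTION 'host=new-primary.example.com dbname=appdb user=replicator'
  PUBLICATION pub_reverse
  WITH (copy_data = false);
```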

And without this idea, it's a one-way ticket, and this is not an enterprise solution, sorry. It's...

[00:14:49] Michael: Well,

[00:14:49] Nikolay: It's not an enterprise solution.

[00:14:51] Michael: It's definitely not, well, it's not blue-green either, I don't think. Um, but it's an interesting point about scale. So if I'm just a small business with a small database, and I'm doing a major version upgrade, and I want to be able to go backwards, it would be tricky to do with this, I think.

[00:15:09] Nikolay: Yeah, so, it's not tricky, it's tricky if you don't want data loss. You can go back, right, but you lose a lot of writes. So...

[00:15:20] Michael: But I can't do, let's say, if it's not a major version upgrade, if it's something like maybe changing a configuration parameter... I could do what Amazon calls blue to green: change the parameter in the green one, switch to it, and then I can do the same process again, switching it back,

[00:15:38] Nikolay: But you lose data, new writes will not be replicated backwards,

[00:15:43] Michael: Not back. So, so let's say I go blue to green, change the parameter on green, switch over. Now I've realized there's a problem, and I set up a new one, a new green, as they call it.

[00:15:56] Nikolay: Oh,

[00:15:56] Michael: Uh. And switch again.

[00:16:00] Nikolay: Well, okay, okay, in this case we deal with a very basic example, which probably, like, doesn't require such a heavy solution, because, depending on which parameter you want to test, maybe you should just test it on one of the replicas, maybe you should just test it on the same Postgres, just minimizing downtime when you restart. It would be easier to just do that, especially because the second consideration reading this article is that the downtime is not zero.

[00:16:33] Michael: Yeah, that's a

[00:16:34] Nikolay: So a restart is not zero, and here it is also not zero. I don't remember if they mentioned the issue of an explicit checkpoint to minimize downtime. I think no,

right?
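As a side note, a minimal sketch of that trick: an explicit checkpoint issued right before a planned restart, so the shutdown checkpoint has very little left to write and the restart window stays short (the data directory path is just an example):

```sql
CHECKPOINT;  -- flush dirty buffers now, while the server is still serving traffic
-- then restart immediately, e.g.:
--   pg_ctl restart -D /var/lib/postgresql/16/main -m fast
```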

[00:16:45] Michael: that alone is probably enough for you in your books to say it's not enterprise ready, right? And to their credit, they do say it. They do say low downtime switchover, right? They're not trying to claim that it

[00:16:57] Nikolay: Right. If this is our characteristic, that it's not zero downtime, it means that this solution competes with regular restarts. That's it. So why should I need this to try a different parameter? I can do it with just a restart, right? And not losing data, not paying for extra machines. But for major upgrades, it's a different story: you cannot downgrade. Unfortunately, there is no pg_downgrade tool, uh, yet. So you just need to use reverse logical replication and set it up properly, and it's possible, 100%.

And this would mean

[00:17:42] Michael: not through Amazon right now. Like,

[00:17:44] Nikolay: Yeah, it's not implemented, but it's solvable, and I think everyone can implement it. It's not easy. I know a lot of details. It's not easy, but it's definitely doable, so...

[00:17:58] Michael: What, what were the tricky parts?

[00:18:00] Nikolay: Tricky parts are, like, if you need to deal with... Aha! We had a whole episode about logical replication, right?

So... the tricky part, the main tricky part, is always not only, like, sequences or DDL replication. These are very well known limitations of current logical replication. Hopefully, they will be solved; there is good work in progress for both problems. There are a few additional problems which are observed not in every cluster, but these two are usually observed in any cluster, because everyone uses sequences, even if they use, uh, this new, like, syntax, generated always as identity, I don't remember, I still use bigserial.
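For context, a hedged sketch of the sequence limitation: logical replication does not carry sequence values, so before switching writes to the other cluster you would typically generate setval() calls from the publisher and replay them on the subscriber (the +1000 safety margin here is arbitrary):

```sql
-- Run on the publisher; execute the generated statements on the subscriber.
SELECT format('SELECT setval(%L, %s);',
              schemaname || '.' || sequencename,
              last_value + 1000)   -- arbitrary safety margin
FROM pg_sequences
WHERE last_value IS NOT NULL;
```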

[00:18:44] Michael: as identity, I

[00:18:45] Nikolay: Yes, yes, but, uh, behind the scenes, it's also sequences, actually. And everyone is usually doing schema changes. So these two problems are big limitations of current logical replication. But the trickiest parts are performance and lags. So, two capacity limitations, on both sides. On the publisher it's the WAL sender,

like, we discussed it, right?

[00:19:13] Michael: Yeah. We can link up that episode.

[00:19:15] Nikolay: Yes, yes. So, WAL sender limitation and logical replication worker limitation, and you can have multiple logical replication workers. And interestingly, this actually shows that the article needs some polishing, because they say max_logical_replication_worker, like, I'm reading max_logical_replication_worker and I don't see the s, because the setting is plural, and I'm saying, hmm, an inaccuracy here.

And then the whole sentence is saying: when you have a lot of tables in the database, this needs to be higher. And I'm thinking, oh, do you actually use multiple publications and multiple slots automatically if I have a lot of tables? This is super interesting, because if you do, as we discussed in our logical replication episode, you have big issues with foreign key violations on the logical replica side, right, on the subscriber side, because by default foreign key consistency is not guaranteed when replicating tables using multiple pub/sub streams, right? And this is a huge problem if you want to use such a replica for some testing, even if it's not production traffic. You will see, like, okay, this row exists, but the parent row is not, it's not created yet.

Foreign key violated. And it's normal for a logical replica which is replicated by multiple slots and publication/subscriptions. So not discussing this problem means that probably there are limitations also at large scale. If you have a lot of tables, it's not a problem, actually. The biggest problem is how many tuple writes you have per second.

This is the biggest problem. Roughly like a thousand or a couple of thousands of tuple writes per second on modern hardware with a lot of vCPUs, like 64, 128, or 96.

I'm talking Intel numbers, usual Intel numbers. You will see a single logical replication worker hit 100 percent CPU, and that's a nasty problem. That's a huge problem, because you switch to multiple workers, but now your foreign keys are broken. It's a hard-to-solve problem for testing. So, I mean, if you use multiple workers, you need to pause sometimes to wait until consistency is reached, and then test in frozen mode.

This is okay, but it adds complexity. But if your traffic is below, like, 1000 tuple writes per second, roughly, depending also on whether it's Intel... By the way, it doesn't matter how many cores, because I'm talking here about the limitation of a single core. It matters only whether it's a modern core or quite outdated. It depends, if you talk about AWS, on the family of EC2 instances or RDS instances you try to use.

So, this single-core limitation on the subscriber side is quite interesting. But if you are below 1000 writes per second (inserts, updates, deletes), probably you're fine. Yeah, so this is interesting to check. And this lagging is, I think, the biggest problem, because when you switch over, when you install reverse logical replication, you also need to make sure you catch up, and this defines the downtime, actually.
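A minimal sketch of that catch-up check, comparing the current WAL position with what the logical slot has confirmed, run on the publisher (the slot name is hypothetical):

```sql
SELECT slot_name,
       pg_size_pretty(
         pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn)
       ) AS replication_lag
FROM pg_replication_slots
WHERE slot_name = 'sub_reverse';   -- switch over only when this is close to 0
```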

[00:22:48] Michael: Yeah. Because we can't switch back until, we can't switch or switch back until the other one has caught up.

[00:22:54] Nikolay: Right, because we prioritize avoidance of data loss over HA here, over high availability. And the RDS blog post says, you know, it's not zero downtime, they have additional overhead. But if you have, for example, PgBouncer, and you are going to use pause/resume to achieve real zero downtime, then you need the lag to be close to zero.
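For illustration, a hedged sketch of that pause/resume dance, issued against PgBouncer's admin console (the database name is hypothetical, and the repointing step depends on your setup):

```sql
-- psql -p 6432 -U pgbouncer pgbouncer
PAUSE appdb;    -- hold new queries, let in-flight ones finish
-- wait for logical replication lag to reach zero,
-- repoint PgBouncer (or DNS) at the new primary, then RELOAD;
RESUME appdb;   -- queued connections continue against the new primary
```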

And, uh, the limitations of the logical replication worker will be the number one problem. Another problem: long-running transactions, which until Postgres 16, I think, or 15, cannot be parallelized. So, if you have a long transaction, you have a big logical replication lag.

So you need to wait, wait until you have a good opportunity to, like, to switch over with lower downtime.
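A small sketch of how one might check for that good opportunity, looking for long-running transactions on the source before initiating switchover (the 5-minute threshold is arbitrary):

```sql
SELECT pid, now() - xact_start AS xact_age, state, left(query, 60) AS query
FROM pg_stat_activity
WHERE xact_start IS NOT NULL
  AND now() - xact_start > interval '5 minutes'
ORDER BY xact_start;
```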

[00:23:51] Michael: That's one thing I think I do want to give them some credit for. This does catch some of those. So, for example, if you do have long-running transactions, they'll prevent you from switching over. Equally, there's a few other cases where they'll stop you from causing yourself issues, which is quite nice. And I wanted to give a shout-out to the, like, Postgres core team and everybody working on features to improve logical replication; it has enabled cloud providers to start to provide features like this, and that's really cool. It feels like the good features going into Postgres core are enabling cloud providers to work at the level they should be working at, to add additional functionality.

So it's quite a, like, success. Not necessarily that we're there yet, and as logical replication improves, so can this improve, but they are checking for things like long-running transactions, which is cool.

[00:24:42] Nikolay: Yeah, and definitely Amit Kapila and others who work on logical replication, kudos to them, 100%. And also the RDS team. I'm criticizing a lot, but you know, it's hard to criticize someone who is doing nothing, right? You cannot criticize guys who don't do anything. So the fact that they move in this direction is super cool. A lot of problems, right, but these problems are solvable, and eventually we might have a real blue-green... Like, the question is: is the blue-green deployment terminology going to stay in the area of databases, and the Postgres ecosystem particularly? What do you think?

Because this is a sign that probably yes. It should be reworked a lot, I think, but in general, maybe yes. What do you think?

[00:25:34] Michael: I, yeah, I don't know. Obviously, predicting the future is difficult, but I do think that badly naming things in the early days makes it less likely. Like calling this blue-green when it's not, actually, I think, reduces people's trust in using blue-green later in the future, when it is more like that.

But you, you've got more experience with this than me, for example in the category of database branching: taking these developer terms that people have a lot of prior assumptions about, and then using them in a database context that they don't 100 percent apply to, or where they're much more difficult, I think is, is dangerous. But equally, what choice do we have?

Like, how, how else would you describe this kind of thing? Like, is there, maybe it's a marketing thing? I'm not sure.

[00:26:25] Nikolay: That's a cool direction of thinking. So, let me show you some analogy. Until some time ago, not many years ago, I thought, as many others, that to change something we need to perform full-fledged benchmarking. Like, for example, if we drop some index, we need to check that all our queries are OK. In this case, okay, we can do it with pgbench, sysbench, JMeter, anything, or simulate workload with our own application using multiple application nodes, a lot of sessions, running at, like, 50 percent CPU. And this is just a test, an attempt to drop indexes; it sounds like overkill. I mean, nobody is doing it, actually, because it's too expensive, but people think in this direction, like, it would be good to test holistically. But, actually, there is another approach, lean benchmarking: single session, EXPLAIN (ANALYZE, BUFFERS), focus on buffers, I/O, and so on. Similar here. And the first class of testing is needed for infra tasks, mostly upgrades and so on, to compare the whole system: lock manager, buffer pool behavior, everything, file system, disk, everything.
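As a tiny illustration of that lean approach, checking one query's plan and buffer numbers before and after a change, with hypothetical table and column names:

```sql
EXPLAIN (ANALYZE, BUFFERS)
SELECT *
FROM orders
WHERE customer_id = 42
ORDER BY created_at DESC
LIMIT 10;
```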

But it's needed only, as I said, like once per quarter, for example. Of course, if you have thousands of clusters, you need to run such benchmarks almost daily, I think, right? And similarly these upgrades, major upgrades, and so on. These tasks go together, usually.

You need upgrades, so you need to do a benchmark. But small schema changes you do every day, multiple times, maybe. Okay, once per week, maybe, depending on the project. You release often, you develop your application quickly. You don't need full-fledged benchmarks, and you also probably don't need full-fledged blue-green deployments, right?

But maybe you need them, I don't know, maybe you still need them. This is where I said I have open-ended questions, like, what should we use for better testing? Because, if we are okay to pay two times more, we could have two clusters with one-way replication, but when we perform switchover, zero-downtime switchover, immediately we set up reverse replication.

So, a real blue-green approach. In this case, probably we could use them for DDL as well. Of course, DDL should be solved, but we can solve it by applying DDL manually on both sides, actually. This unblocks logical replication. We just need to control DDL additionally, not just alter. We need to alter there and alter here.
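A hedged sketch of that idea, with the deployment tooling running the identical DDL on both clusters so that replication of subsequent row changes keeps working (the table is hypothetical):

```sql
-- Run on the blue (publisher) cluster:
ALTER TABLE orders ADD COLUMN note text;
-- Run the identical statement on the green (subscriber) cluster:
ALTER TABLE orders ADD COLUMN note text;
-- After both are applied, INSERT/UPDATE/DELETE continue to replicate normally.
```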

In this case, probably it would be a great tool to test everything, and then... if we slightly diverge from the blue-green deployments idea and use the A/B testing idea, we point, like, 1 percent of traffic to this cluster, read-only traffic only. I'm not going to work with, like, an active-active schema, like, multi-master, no, no, no.

So then we can test at least read-only traffic for the changed schema. But again, there will be a problem with schema replication, because logical replication is going to be blocked. We need to deploy the schema change on both sides. It's not only about the lack of logical replication of DDL; it's also that, even if DDL were replicated, if you deploy it only on one side and don't deploy it on the other side...

Logical replication is not working, or it replicates it, right? So, I'm not quite sure. Ah, actually, we can drop an index on the subscriber, or we can add a column on the subscriber, and logical replication will still be working. But certain cases of DDL will be hard to test in this approach. But still, imagine such an approach.

It will be a full-fledged blue-green deployment with a simple, like, symmetric schema, simple switching back and forth. Reliable. I don't know, maybe it's a good way to handle all changes in general. We just pay two times more, but for some people it's fine, if the costs of errors and risks, the costs of problems, are higher than this.

What do you think? Yeah, you

[00:31:06] Michael: Yeah, this is a tricky one. The first database-related company I worked for did a lot of work in the schema change management tooling area, not for Postgres, but for other databases. And it gets really complicated fast, just trying to manage deployments between versions while maintaining data.

And the concept of rolling back is a really strange one. Like, going backwards: let's say you've deployed a simple change, you've added a column for a new feature, you've gathered some data. Does rolling back, like, maybe temporarily, involve dropping that column? I don't think so, because then you destroy that data.

But then it's now in the old version as well. And there's this weird third version. I often talked about, in the past, rolling forwards rather than rolling back. And I think that's gained quite a lot of esteem in the past few years. The idea that you can't, with data, can you actually roll back? Because do you really want to drop that data?

[00:32:15] Nikolay: You know, dropping a column doesn't remove data, you know it, right? That's why it's fast. But that's not the story. Well, this approach with Reshape, and now this new tool, whatever it's called, that handles DDL in the Reshape model, it's similar to what PlanetScale does with MySQL. The whole table is recreated additionally, so you need two times more storage, and you have a view which masks this machinery, right? And then, in chunks, we just update something. There, you have the ability to roll back.

[00:32:59] Michael: Because it's maintaining it in both places?

[00:33:01] Nikolay: Because for some period of time, you have both versions working and synchronized inside one cluster. But the price is quite high, and, uh, views have limitations, right? And here, if we talk about replicating the whole cluster, oh, the price is even higher.
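To make the earlier description concrete, a very rough, hypothetical sketch of the general pattern being described (two schema versions exposed through views over one physical table; this is not pgroll's or Reshape's actual implementation, and all names are made up):

```sql
ALTER TABLE orders ADD COLUMN status_v2 text;                      -- expand
CREATE VIEW orders_old_api AS SELECT id, status FROM orders;       -- old app shape
CREATE VIEW orders_new_api AS SELECT id, status_v2 AS status FROM orders;
-- backfill status_v2 in small batches, keep both views until cut-over,
-- then drop the old view and the old column (contract)
```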

[00:33:20] Michael: Yeah, and the complexity is even higher, I think.

Managing it within one database feels complex,

[00:33:27] Nikolay: pgroll. It's called pgroll. A new tool which is a further development of the idea of the Reshape tool, which is not developed anymore, as far as I know, because the creator of that tool, Reshape, went to work at some bigger company, not a Postgres user, unfortunately. So, uh, I don't know, like...

The problem exists. People want to simplify schema changes and be able to revert them easily. Right now we do it the hard way, I mean, hard in terms of physical implementation: if we say revert, we definitely revert. But dropping columns is usually considered a non-revertible step, and usually it's quite well known.

In larger projects, people usually design it so that, like, first the application stops using the column, and a month later you drop it, and then already you

[00:34:38] Michael: So I'm, I'm actually talking about adding a column, which is way more common. I'm talking about adding a column because if you need to support rolling back, that becomes dropping a column.

[00:34:49] Nikolay: Okay, so what's the problem? Data loss if you do rollback? Or what? Oh, you want to move forth, then back, then forth again without data loss?

[00:35:01] Michael: Possibly.

[00:35:02] Nikolay: You want too much. Yeah.

[00:35:06] Michael: so, I think that's... Like we talked about blue green deployments, right? Let's say part of what you're doing is rolling out a new feature, and you roll it out for a few hours, and some of your customers start using that feature, but then there's a major, it's causing a major issue in the rest of your system, so you want to roll back.

Does that use of that feature... are we willing to scrap those users' work in order to fix the rest of the system? I think people would want to retain that data.

[00:35:36] Nikolay: Yeah, well, let's discuss it in detail. First of all, on the subscriber we can add a column. If it has a default, logical replication won't be broken, because it will be just inserting, updating.
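A small sketch of that point, assuming hypothetical names; logical replication tolerates extra columns on the subscriber side, so this can be tried on the green cluster first:

```sql
-- On the subscriber (green) only:
ALTER TABLE orders ADD COLUMN discount numeric DEFAULT 0;
-- Rows replicated from the publisher still apply; the extra column simply
-- receives its default value.
```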

[00:35:50] Michael: mm hmm, mm

[00:35:51] Nikolay: Okay, we have extra columns, so what? Not a problem, right? But when we switch forth... Setup of reverse replication will be hard because we have now extra columns and our old cluster doesn't have it. So, we cannot replicate this table. Uh, in

[00:36:12] Michael: hmm. Unless we replicate DDL, which, if we start replicating DDL backwards, then we're kind of reverting to our existing state.

[00:36:20] Nikolay: right.

[00:36:20] Michael: is strange,

[00:36:21] Nikolay: This is one option, yes. And another option is... I know there is an ability to filter the rows and columns, I guess, right? So, you can replicate only specific columns, right? I never did it myself, but I

think there is this, yeah, yeah, yeah. So, if you replicate only a limited set of columns, you're fine, but in this case, moving forth again means, like, you lose this data.
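For reference, a hedged sketch of that column-list capability in publications (available since Postgres 15; the table and columns are hypothetical), replicating only the columns that exist on the other side:

```sql
CREATE PUBLICATION orders_pub
  FOR TABLE orders (id, customer_id, created_at);
```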

And it's similar to what you do with Flyway, Sqitch, Liquibase, or Rails migrations. Usually, you define up and down, or, like, upgrade and downgrade steps. In this case, you create a column: alter table, add column; then alter table, drop column. And if you went back, of course, you lost the data which was inserted already, and it's considered normal, actually, usually,
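A minimal sketch of such an up/down pair, where the down step accepts losing whatever was written into the new column (names are hypothetical):

```sql
-- up
ALTER TABLE orders ADD COLUMN promo_code text;
-- down
ALTER TABLE orders DROP COLUMN promo_code;
```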

[00:37:11] Michael: Well yeah, so, but that's my background as well: people often wouldn't end up actually using the rollback scripts. What they would do is roll forwards. They would end up with an old version of the application, but the column and the data are still there in the, in the...

[00:37:27] Nikolay: You talk about people who closely... like, you talk about companies who are both developers and users of the system. But imagine some system which is developed and, like, for example, installed in many places.

[00:37:43] Michael: hmm.

[00:37:43] Nikolay: Some software. They definitely need a downgrade procedure to be automated, even with data loss, because it's more important, usually, to fix the system and make it alive again.

And people, users in this case, don't necessarily understand the details, because they are not developers, and it's okay to lose this data and downgrade, but make the system healthy again. Right? In this case, we're okay with this data loss.

[00:38:13] Michael: Well, yeah, but I guess going back to the original topic, you asked whether I think blue-green deployments will take off in the database world. And I think it's the switching back that's tricky. But I don't want to diminish this work that's been done here, regardless of what we call it, because I think it will make more people able to do major version upgrades, specifically, with less downtime than they would have been able to previously, even though it will still be a little bit.

[00:38:46] Nikolay: Yeah, I don't know, maybe we need to develop the idea further and consider this blue-green concept as some, like, intermediate step. It reminds me of red-black trees, right? Like, binary tree, red-black tree, then AVL tree and so on, and then finally B-tree, and this is, like, the development of indexing approaches, algorithms, and data structures. So maybe, like, closer to self-balancing, then a lot of children for each node. Maybe here also, it's a very distant analogy, of course, because we talk about architectures here. But maybe these blue-green deployments... or green-blue deployments, I think we should start mixing this,

[00:39:36] Michael: Yeah.

[00:39:37] Nikolay: to emphasize that they are balanced, right, and symmetric. And also, like, to tell the RDS guys that it's not fair to consider one of them as always the source and the other as always the target. We need to balance them. So, I think there should be some new concepts developed too. It's interesting to me; I don't know how the future will look. Also, let me tell you a story about naming. In the systems we developed, we chose... we know, like, master/slave in the past, and primary/secondary, or primary/standby, the official Postgres terminology right now.

Writer/reader in Aurora terminology. Also leader/follower, Patroni terminology. Then logical replication terminology: publisher, subscriber. Here we have blue/green, right? In our development, we chose source and target clusters. And it was definitely fine in every way, in monitoring, in all testing; everyone understands: this is our source cluster, this is our target cluster.

But then we implemented reverse logical replication to support moving back. And the source/target clusters naming immediately showed it's a wrong idea, right? So I started to think. In our particular case, we set up these clusters temporarily. Temporarily might mean multiple days, but not persistent, not forever.

In the original blue-green deployment, as I understand Fowler, if I understand correctly, it's forever. Right, this is production, this is staging, then we switch. So, I chose new naming: old cluster, new cluster, right? But if it's persistent, that's also bad naming.

[00:41:30] Michael: Yeah.

[00:41:31] Nikolay: Blue green is, is okay, green blue, blue green, but, uh, definitely...

[00:41:36] Michael: why don't you use the Excel, uh, naming convention with the final, final V2 at the end. This is the final server.

[00:41:45] Nikolay: don't know.

[00:41:45] Michael: final, final server.

[00:41:48] Nikolay: So, naming is hard, we know it.

[00:41:52] Michael: Wonderful. I think

[00:41:53] Nikolay: Okay, it was good, I enjoyed this, thank you so much.

[00:41:56] Michael: Thank you. Thank you everybody. And catch you next week.

[00:42:00] Nikolay: Yeah, don't forget to like, share, subscribe, share is the most important, I think, or like is the most important, what is it? Like, we don't

[00:42:07] Michael: comments, comments are the most important to me anyway. Like, let us know what you think in the, in YouTube, comments maybe or on Twitter or on, on Mastodon,

[00:42:16] Nikolay: You know, I wanted to take a moment and emphasize that we continue working on subtitles. Subtitles, they are great. They're high quality. Yesterday I asked in a Russian-speaking Telegram channel, where 11,000 people talk about Postgres, I asked them to check YouTube, because we have good-quality English subtitles; they understand the terms.

We have 240 terms in our glossary. We feed it to our, like, AI-based pipeline to generate subtitles. And I wanted to say thank you to my son, who is helping with this, actually, uh, who is still, like, a teenager, in school, but also learning Python and so on. So, uh, YouTube can translate them to any language. So, to me, the most important is sharing, because this delivers our content to more people.

And if those people cannot understand English very well, especially... we have two very weird accents, British and Russian, right? Yes, so it's good, on YouTube, to just switch to automatically generated subtitles in any language, and people say it's quite good and understandable. So,

share and tell them that even if they don't understand English, they can consume this content, and then maybe if they have ideas, they can write us.

[00:43:40] Michael: Perfect.

[00:43:40] Nikolay: this is, this is the way. Thank you so much. Bye bye.
