[00:00:00] Nikolay: Hello, hello. This is episode number 56. I'm Nikolay. This is my co-host, Michael. Hi, Michael.

[00:00:06] Michael: Hello, Nikolay.

[00:00:08] Nikolay: So we are going to talk today about, ha, a very funny topic. If you write it three times, you will have "ha ha ha", right?

[00:00:17] Michael: Yep. Uh, I never thought of it like that.

[00:00:20] Nikolay: Yeah. Ha, very funny topic. Actually, many people fight for achievements in this area, but it's, uh, high availability, the opposite of low availability, which is downtime and incidents and so on.

[00:00:33] Michael: Yeah, or the number of nines. That's how it was popular to describe how many nines you were a few years ago, wasn't it?

[00:00:41] Nikolay: Right? How many nines do you think is good?

[00:00:43] Michael: Uh, well, I mean, I remember when loads of the major providers were claiming five nines, and then when you did the math and realized like quite how little downtime that is,

[00:00:53] Nikolay: It is like five minutes per year

[00:00:54] Michael: Yeah. Nearly impossible to guarantee that.

[00:00:57] Nikolay: It's called a downtime budget.

[00:01:02] Michael: So, yeah. What do you think? I mean, I think four is still pretty good, but what would you say?

[00:01:06] Nikolay: Yeah, five nines is excellent, I think: during the whole year you are allowed only about five minutes of downtime, and I think it's quite good. Of course, six nines means like half a minute of downtime per year, which I think is absolutely excellent. But I would take three nines as well.

Three nines is more than four hours per year. Well, for many it's okay, not for all. Actually, I skipped one, the four nines. It's about 52 minutes per year, and it's okay. Well, it's already quite a good goal, four nines,

[00:01:38] Michael: Yeah.

[00:01:38] Nikolay: right? But of course, five nines, I think, is maybe like an A mark, and six nines an A-plus mark, if you can achieve it.

And I mentioned budget because some people have management for HA, for downtime, a so-called budget. For example, if you plan some work which involves planned downtime, or you have high risks of some downtime, if you, for example, upgrade the OS or do a major Postgres upgrade or something, and you expect some things can go wrong.

So in this case, if you already had downtime minutes or seconds which lowered your HA characteristics this month, you probably want to postpone your work till next month, right? This is a regular approach to managing this HA budget, downtime budget.
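
For reference, a minimal sketch of the arithmetic behind these nines and a monthly downtime budget (plain Python; the numbers are approximate):

```python
# Rough downtime budgets for "number of nines" availability targets.
MINUTES_PER_YEAR = 365.25 * 24 * 60

def downtime_budget_minutes(nines: int) -> float:
    """Allowed downtime per year for, e.g., 99.99% availability (nines=4)."""
    availability = 1 - 10 ** (-nines)
    return MINUTES_PER_YEAR * (1 - availability)

for n in range(2, 7):
    per_year = downtime_budget_minutes(n)
    print(f"{n} nines: ~{per_year:8.1f} min/year (~{per_year / 12:6.2f} min/month)")

# Three nines is ~526 min (~8.8 hours) per year, four nines ~53 minutes,
# five nines ~5.3 minutes, six nines ~32 seconds.
```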

[00:02:29] Michael: And presumably because some services do credits, don't they, for their customers if they don't achieve certain goals. So that might avoid them paying out lots in terms of penalties.

I had a question for you as well, though. We normally talk about extremely high throughput environments, but if you've got quite a low one, let's say you might even provide an internal application for a team or something.

Have you ever heard that phrase, or the kind of philosophical question: if a tree falls in a forest and there's no one around to hear it, does it make a sound? If a service is unavailable but it receives no requests in that time, is it down?

[00:03:07] Nikolay: Uh-huh. Well, I usually consider a service down as observed externally. It should be observed from outside, from various points, for example from various regions in the world: the service cannot be reached, and some simple checks like the front page are not responding at all.

Or it's returning 5xx errors, meaning some problems on the server side, or some simple API calls, like, I don't know, health checks, are not responding from outside. This is downtime. Internally, probably we can have everything working and just some network incident happening. Some small issue in network configuration can lead to a huge downtime, a huge outage for many customers. For example, sometimes we see AWS or GCP has issues with some reconfiguration of network devices or routing and so on, and this leads to huge incidents.

[00:04:11] Michael: It's always DNS, right?

[00:04:13] Nikolay: Not always actually, but very often. Yes.

[00:04:16] Michael: Uh, that was the joke. Sorry. Just for listeners, I have awful internet today, so this is the first day that we can't see each other. So Nikolay couldn't see that I was grinning, which is my normal giveaway that I'm telling a joke.

[00:04:30] Nikolay: Episode number 56, and I still haven't learned how to recognize your smiling just from audio.

[00:04:38] Michael: yeah,

[00:04:39] Nikolay: Okay. So, HA is about uptime. Uptime is the opposite of downtime.

And what should we discuss in this area? Where do you want to start?

[00:04:52] Michael: Well, I think there might be a lot of people out there like me that conflate HA and high availability with just switchover and failover tools and processes and runbooks. And you make a really good point that it's not about those. Those are a couple of techniques we have to achieve higher availability, but it's really about avoiding downtime.

And when you made that point to me beforehand, while we were preparing, it made sense to then go and list all the ways we can cause ourselves downtime, or what the main causes are for, let's say, Postgres-specific downtime, and then maybe once we've gone through those, or maybe as we go through those, discuss some of the things that we can then do about it.

Does that sound sensible?

[00:05:42] Nikolay: Yeah, it does. And you are also right, it depends on your point of view. If, for example, you are responsible only for infrastructure (and remember, we can distinguish infrastructure DBA and application DBA), if you are purely an infrastructure DBA, at some point you probably should consider everything engineers and developers do with application changes, for example schema changes and so on.

Probably you should consider that a black box and care only about failures related to infrastructure. For example, as you mentioned, auto failover, meaning that some node is down due to a number of reasons, or PgBouncer or any pooler being a single point of failure, an SPOF, right?

Or some networking issues, something like this. So purely infrastructure, and you can treat everything the application does as a black box. But if you are an application DBA, or if you take care of both parts, well, actually the boundary here is not so strict, right? So I prefer to consider everything. If you are a database expert, or at least part of your work is related to databases, in this case I prefer taking both infrastructure things like auto failover and purely application things.

For example, we created a primary key as int4, and soon we have 2 billion rows, and soon we have a huge outage. Or we forgot to implement a low lock_timeout and retries logic, and we are releasing some small schema change while autovacuum is running in transaction ID wraparound prevention mode, and we're going to have a queue of queries which are waiting for us while we are waiting for this autovacuum. So we have at least a partial outage as well, sometimes global. I mean, sometimes everything goes down because of this simple mistake, which is easy to make because it's the default behavior. If you don't do anything, at some point you will have it.

So I prefer to care about both worlds and include application stuff in the set of measures to prevent downtime. So, for example, for me, an int4 primary key when we know we will have a lot of rows inserted is also definitely related to HA.

[00:08:35] Michael: Yeah, good point. I mean, you've mentioned the DBA roles, but really it's development as well, isn't it? Everybody who has responsibility for any of the design is involved in this as well. And even things that we can't avoid, right? We've had episodes already about major version upgrades and discussed quite how difficult it is to do those, potentially even impossible to do those with absolutely zero downtime.

Um, so it, yeah. Oh yeah. Are you? Yeah. Well,

[00:09:04] Nikolay: We discussed it, it's possible.

I provided a recipe. By the way, we mentioned this new tool to orchestrate major upgrades, the tool called, uh, pg_easy_upgrade... pg_easy_replicate, sorry, pg_easy_replicate, which supports major upgrades.

It's written in Ruby, as I remember. And we mentioned that I haven't tried it, but I criticized it a little bit. This tool got attention on Hacker News, and yesterday the author contacted me directly on Twitter. So we discussed improvements, and obviously it's possible to achieve practically zero downtime.

I mean like one or two seconds of latency spike, at least if you don't have huge data volumes and big TPS numbers. For example, if you have less than 10 terabytes and less than 10,000 TPS on the primary, it's definitely possible to achieve basically a zero-downtime upgrade, when you will have just a spike of latency. But this needs a component besides Postgres. You need PgBouncer or any pooler (we had an episode about poolers) which supports pause and resume, with the ability to switch the primary to a different machine.

[00:10:16] Michael: Yeah. Or presumably some of the load balancing tools and operators support that kind of thing as well.

[00:10:22] Nikolay: Um, pause/resume is not very popular, I think, but I might be mistaken. It's an interesting topic. I think it's underdeveloped; it hasn't received a lot of attention so far, this topic, because, you know, we have bigger problems, we have bigger points where we can have downtime.

For example, even a minor upgrade can lead to downtime of up to one minute. And there are many risks where you can put everything down for many minutes.

[00:10:53] Michael: Yeah. I know we're not gonna have time to discuss each one in depth, but should we run through a few more?

[00:10:58] Nikolay: You know, let's discuss auto failover first, because of course it's a very important component to achieve HA. And I remember discussions 10 years ago or so, discussions that it should live outside Postgres, that this should not be inside Postgres itself. So here we are: we have Patroni and alternatives, and inside Postgres we almost don't have anything.

Actually, we have something. In recent Postgres versions, we have improvements in libpq. This is the library which is used in many, many drivers, which are used by application code to connect to Postgres. And already for many years, you could use multiple hosts in the connection string.

So you can say this host, or that host, or that host, and automatically, at the driver level, it would try the first one, and if it's down, immediately try the second one. And this is already convenient to achieve lower downtime, right?
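
For illustration, here is a minimal sketch of the multi-host feature described above, using psycopg2 (which hands the connection string to libpq); the host names and credentials are placeholders. libpq tries the hosts in order and uses the first one it can reach:

```python
import psycopg2  # psycopg2 passes the DSN straight through to libpq

# Hypothetical hosts: if node1 is unreachable, libpq tries node2, then node3.
dsn = (
    "host=node1,node2,node3 "
    "port=5432,5432,5432 "
    "dbname=appdb user=app password=secret "
    "connect_timeout=2"
)

conn = psycopg2.connect(dsn)
print(conn.info.host)  # which host the connection actually landed on
```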

[00:11:59] Michael: Yeah. Or higher availability.

[00:12:03] Nikolay: Or, like, higher uptime, or how to sell it: better uptime.

And in Postgres 15 or 16, I don't remember exactly, additional improvements happened there. I don't remember the details, but I remember previously the order was hard-coded, and load balancing went in there. So load balancing is supported at the libpq level right now in very new Postgres versions.

So maybe in 16, actually, which is not released yet and is in beta as we speak. But this is a good addition. And it means that some things go into Postgres core, basically, because libpq is part of Postgres. But also there is target_session_attrs. I don't remember exactly, but you can say: now I need a read-write connection, and now I need only a read-only connection.

And when establishing a connection, libpq knows what you need, and it can automatically switch to the primary if it knows you will need writes. And this is also an interesting thing, you can build interesting stuff on top of it. But of course it's not auto failover, and auto failover is needed.
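
And a sketch of the target_session_attrs behavior mentioned here, again with placeholder host names; note that load_balance_hosts requires a libpq from Postgres 16 or newer, so treat that part as an assumption about your client library version:

```python
import psycopg2

# Ask libpq for a node that accepts writes: with several hosts listed,
# it skips read-only standbys and settles on the current primary.
rw = psycopg2.connect(
    "host=node1,node2,node3 dbname=appdb user=app password=secret "
    "target_session_attrs=read-write"
)

# For read-only traffic, prefer standbys; load_balance_hosts=random
# (libpq 16+) spreads new connections across the listed hosts.
ro = psycopg2.connect(
    "host=node1,node2,node3 dbname=appdb user=app password=secret "
    "target_session_attrs=prefer-standby load_balance_hosts=random"
)
```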

I remember when we had like a couple of servers, or three servers, and many people, many debates considered auto failover a very bad thing, because who knows what it'll do, right? It was quite a popular opinion that you should do it yourself: it's better to be paged, wake up at 1:00 AM and perform the failover properly, and it's okay to be down for a few minutes, but not to allow some tool, because who knows who developed it and what it'll do.

Maybe it'll lead to split-brain or something. So it was a matter of trust. But then tooling started to improve. As usual, I should warn: don't use replication manager, repmgr, and other tools which are not based on consensus algorithms like Raft or Paxos, right?

So use Patroni or a similar tool which is based on a consensus algorithm, and a lot of work was invested to make it work properly. With replication manager I personally had several cases of split-brain; with Patroni I never had it. I had different issues, and usually Kukushkin, the maintainer of Patroni, fixes them quite quickly. So it's quite well maintained, and I think right now it's the de facto standard. Many systems, many operators include it as a component, and now nobody thinks anymore that you should use manual failover, right? It's already nonsense.

[00:14:48] Michael: Yeah, I can't imagine somebody still believing that. But for anybody that isn't familiar with split-brain, the time I've come across that most is... well, in fact, definition-wise, is that just when you can end up with one node believing one thing and another node believing another in terms of data?

[00:15:07] Nikolay: So we have two primaries, two masters (the old name for primary is master). So we have two primary nodes, and some writes go to one node, other writes go to another node, and nobody knows how to merge them back, and solving it is very hard in the general case. In some cases it's possible, for example if we had a case of split-brain where writes on one node happened to particular tables and on another node to different tables.

In this case, probably it's quite easy to merge, but in the general case it's an unsolvable problem, so you should avoid it. This is the biggest problem of auto failover, basically. So this risk of split-brain is the biggest problem, and those tools which do not use a consensus algorithm are basically bad in this area, even if they have a witness node and a lot of heuristics coded.

It's not good. And I remember I spent some time analyzing the repmgr C code, and I found places exactly where split-brain can occur. It's like an unsolved issue, so just don't use it. Maybe I sound impolite, but it's based on many years of experience and observation of troubles I saw with my customers. So Patroni is battle-proven and the de facto standard right now, but there are a few other alternatives which are also based on Raft or Paxos.

[00:16:42] Michael: Well, I think a lot of people these days are using cloud provider defaults. You know, you can just click a button that says high availability and it gives you your replicas and auto failover as a feature, without having to worry about exactly how that's implemented. And some people are using...

I believe it's a big feature of the Kubernetes operators (still not sure how to pronounce that), the operators that are coming up left, right, and center. So I think there are newer options where you don't have to worry about the exact underlying technology. Or maybe I should be more worried about that.

[00:17:18] Nikolay: Yeah, yeah. Okay. So split-brain is one risk if you use auto failover, and we agreed we should use auto failover to achieve better HA characteristics. Another risk is data loss. By default, if we use asynchronous standby nodes and auto failover happens, we might lose some data which was written on the primary but none of the standbys received it.

In this case, auto failover happens and a different node becomes the primary, and it lacks a tail of writes. So some portion is not propagated there yet because, for example, of replication lag, which all standbys had, since they are allowed to have lag because they're asynchronous by default.

Physical standbys are asynchronous, and Patroni, for example, has a special knob in its configuration which says: lags of more than, I think 10 megabytes by default, are not allowed for auto failover to happen. So if all asynchronous standby nodes have lag above this threshold, auto failover won't happen.

So we need to prioritize. We have a trade-off here. What to choose:

[00:18:36] Michael: Data loss or availability.

[00:18:39] Nikolay: Data loss versus downtime. And HA goes down, I mean HA characteristics, so uptime worsens. And usually we have two solutions. We can tune this threshold and say, okay, even one megabyte is too much for us,

but we understand that it increases the risk of being down, because auto failover doesn't happen: the cluster exists, but in read-only mode, because the primary has gone due to a number of reasons, some hardware failure for example, and then manual intervention is needed. But, by the way, a story about auto failover and how it works properly. So, Postgres.ai (it's not a big database, I mean our website) is running on Kubernetes, the Zalando operator, with Patroni inside, and at some point I noticed that the timeline exceeds 100.

[00:19:36] Michael: What do you mean?

[00:19:37] Nikolay: 100. Well, when we have a failover or a switchover, when we change the primary node, the timeline increments by one.

So we start: we created the cluster, the timeline is one. Then we have a switchover or a failover, we change the primary. So a switchover is a planned failover, basically, right? A failover is an unplanned change of primary, a switchover is planned: for example, we want to upgrade or change something, we perform a switchover, and then we, for example, remove the old node, the old primary.

And I noticed that our timeline exceeded a hundred, meaning that we had had so many auto failovers and nobody noticed.

[00:20:18] Michael: Is that a good thing or a bad thing?

[00:20:22] Nikolay: It's both, because in Kubernetes it's a regular problem to have the out-of-memory killer killing Postgres. It's a big problem, and people have solutions to it, so we needed to adjust some settings,

add more memory, and so we did. But the main problem here was that our monitoring didn't tell us about these cases, and we lived with it for some months, maybe actually a year or maybe more. So we had the primary node changing all the time, and it just happened and happened and happened, and nobody complained, which is a good thing.

But a lack of proper monitoring and alerting is a bad thing. Of course, we should have noticed it earlier, but this case demonstrates how auto failover can be helpful, to the point of not noticing some issues, right? Meaning it can hide those issues. So it worked well, Patroni worked well.

What did not work well was monitoring in this case, and it's of course our fault. We should configure it properly and receive all alerts and so on. Every time a failover happens, we should receive an alert, of course, and investigate and try to prevent it, because of course it's a stress. Again, I mentioned the risk of split-brain, which I can say Patroni almost certainly shouldn't have, and the risk of data loss. So this is not good, but in these cases data loss didn't happen because, okay, it was a shutdown. But we have two streams of data propagation: first, streaming replication, and actually I'm not sure, with OOM killer issues, is it guaranteed that the WAL sender will send all bytes to the standbys, right? But at least we also have archiving, and archiving goes to object storage (it was WAL-G), and then we restore. So data loss at least was not noticed there, but it's a bad story actually. I feel embarrassed about it right now.

So.
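
As a side note, a minimal sketch of the kind of check that would have caught this earlier: the timeline reported by pg_control_checkpoint() increments on every promotion, so comparing it against the last value you acknowledged flags unnoticed failovers (the expected value and the alerting hook are placeholders):

```python
import psycopg2

EXPECTED_TIMELINE = 3  # hypothetical: the last timeline you acknowledged

conn = psycopg2.connect("host=node1 dbname=postgres user=monitor password=secret")
with conn, conn.cursor() as cur:
    # timeline_id increments by one on every failover or switchover
    cur.execute("SELECT timeline_id FROM pg_control_checkpoint()")
    (timeline,) = cur.fetchone()

if timeline != EXPECTED_TIMELINE:
    # wire this into real alerting instead of printing
    print(f"ALERT: timeline is {timeline}, expected {EXPECTED_TIMELINE}; "
          "an unnoticed failover or switchover may have happened")
```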

[00:22:28] Michael: Yeah. Do as I say, not as I do, right?

[00:22:30] Nikolay: Yeah. Don't, don't do this.

[00:22:32] Michael: Yeah, or learn from my mistakes. It feels like this is a good time to go on to some of the other things. You've mentioned alerting already; a few other things that we've mentioned in previous episodes that you can alert on are at certain points of running out of transaction IDs, so making sure...

[00:22:48] Nikolay: So hold on. Sorry, hold on.

Let me finish with the second solution. The second solution is using synchronous replicas, and this is what serious setups should have. You have a synchronous replica, or better, you have quorum commit. So you have a semi-synchronous approach where you say: okay, I have five replicas, five standby nodes,

and when a commit happens on the primary, it should be received or applied by at least one of the nodes. And modern Postgres allows very fine-tuned configuration here, so you can choose and decide. Of course, if you do it, you increase the latency of commits for writing transactions.

[00:23:28] Michael: We talked about that in detail, I think, in our replication episode.

[00:23:32] Nikolay: Right. So this is what you should have, you should use it. In this case, ideally, if auto failover happens, you will have zero data loss, because the data always already exists somewhere else; it's guaranteed to be on at least one other node. And this is the serious approach, this is what you should do.
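
For reference, a minimal sketch of the quorum-commit configuration described here (standby names are placeholders and must match each standby's application_name; the exact synchronous_commit level is your latency-versus-durability choice):

```python
import psycopg2

conn = psycopg2.connect("host=primary dbname=postgres user=postgres password=secret")
conn.autocommit = True  # ALTER SYSTEM cannot run inside a transaction block
with conn.cursor() as cur:
    # Quorum commit: each commit must be confirmed by ANY 1 of the listed standbys.
    cur.execute(
        "ALTER SYSTEM SET synchronous_standby_names = "
        "'ANY 1 (node2, node3, node4, node5, node6)'"
    )
    # 'on', 'remote_write', or 'remote_apply': stronger guarantees, higher latency
    cur.execute("ALTER SYSTEM SET synchronous_commit = 'on'")
    cur.execute("SELECT pg_reload_conf()")
```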

And that's it. Also, I wanted to finish about switchover. A switchover by default will lead to downtime because of the shutdown checkpoint. We discussed that a few times as well, but if you just decide to perform a switchover, Patroni needs to shut down the primary. I don't remember if Patroni issues an explicit checkpoint; it probably should, but it's worth checking. The recipe for a good, fast switchover is: make sure you issue an explicit CHECKPOINT. This helps the shutdown process perform the shutdown checkpoint much faster, because for a clean shutdown Postgres needs to write all dirty buffers from the buffer pool to disk.

And if we issued an explicit checkpoint, the shutdown checkpoint will have only a little to write. This is the recipe.
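
A minimal sketch of that recipe, assuming a Patroni-managed cluster (the cluster and candidate names in the patronictl call are placeholders):

```python
import subprocess
import psycopg2

# Step 1: explicit CHECKPOINT on the current primary, so the shutdown
# checkpoint has almost nothing left to write.
conn = psycopg2.connect("host=primary dbname=postgres user=postgres password=secret")
conn.autocommit = True
with conn.cursor() as cur:
    cur.execute("CHECKPOINT")
conn.close()

# Step 2: only then trigger the switchover (illustrative patronictl invocation).
subprocess.run(
    ["patronictl", "switchover", "my-cluster", "--candidate", "node2", "--force"],
    check=True,
)
```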

[00:24:37] Michael: Yep.

[00:24:38] Nikolay: In this case, when you perform a shutdown, during shutdown the WAL senders make sure that everything is sent to the standbys. So with a clean shutdown, standby nodes have received all the data from the primary, so no lags. And also archive_command:

everything is backed up. And this is a big problem I noticed with Patroni, actually. We had an incident when, again due to poor monitoring (monitoring is super important, of course), we didn't notice that archive_command was lagging a lot, meaning that a lot of WALs were not backed up, so the primary accumulated many, let's say a thousand, WALs that were not backed up.

But when you try to perform a clean shutdown, and Patroni tried to do it, it tries to wait for it. This is Postgres behavior: Postgres waits while archive_command is working, and it can work for a very long time because maybe there is some reason it cannot work properly, so it waits and waits, and Patroni waits, and so on.

Now I know it was fixed in Patroni, and now it doesn't wait long and performs a failover. Also a trade-off here, right, because we want to back up everything, but Patroni's choice is to perform a failover in this case and let the backup process keep working additionally. This is a super interesting topic, again worth checking the details, but many people fail to understand (and I also was failing to understand) that when we shut down Postgres, it waits while archive_command is working.

And this can lead to downtime, actually, unarchived WALs.
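
A minimal sketch of an archiving-lag check along these lines (the threshold is a placeholder; pg_ls_archive_statusdir() needs Postgres 12+ and appropriate privileges):

```python
import psycopg2

READY_THRESHOLD = 100  # hypothetical: alert if this many WALs await archiving

conn = psycopg2.connect("host=primary dbname=postgres user=monitor password=secret")
with conn, conn.cursor() as cur:
    # .ready files are WAL segments archive_command has not shipped yet
    cur.execute(
        "SELECT count(*) FROM pg_ls_archive_statusdir() WHERE name LIKE '%.ready'"
    )
    (pending,) = cur.fetchone()
    cur.execute("SELECT failed_count, last_failed_wal FROM pg_stat_archiver")
    failed_count, last_failed_wal = cur.fetchone()

if pending > READY_THRESHOLD:
    print(f"ALERT: {pending} WAL segments waiting for archive_command "
          f"(archiver failures so far: {failed_count}, last: {last_failed_wal})")
```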

[00:26:20] Michael: I feel like we could do a whole episode solely on Patroni. That might be a good one in the future.

[00:26:25] Nikolay: Yeah, actually a good idea. I can refresh my memory on some interesting cases and so on, and yeah, we can do it. I think it's a good idea. Patroni is an interesting tool, developed at a good pace and receiving a lot of good features. But I must say, it's mostly on the shoulders of a single maintainer, Alexander Kukushkin, and definitely it's Python.

So I think more contributors could help this project, because more people work with Python than with C, right? And for some people it's easier to read and write Python code. So I advertise: help Alexander with better bug reports, testing, and sometimes a contribution in the form of a pull request to the GitHub repository of Patroni. It's worth doing.

Right. Okay, I think we covered both failover and switchover and how they can contribute to downtime and affect uptime, or HA characteristics. Let's switch to something else.

[00:27:29] Michael: Well, I don't think we've actually covered a bunch of them. I've written a list, in preparation, of all the things that I've seen cause downtime in the past. I guess we're running a bit short of time, but we could run through a few of those. But in terms of solutions,

it's not just those, right? It's not just failover, and you made this point early on. We've also got those alerts that we can set up. We've got logging we can do, not only to help avoid this in the first place, but also to help us learn, when it does happen, exactly what happened, so we can prevent it from happening again.

Some of it's just learning, right? Like it's good to be notified or alerted when you're running out of integers in an int4 column, but it's even better to not have int4 columns in the first place. I read a really good post by Ryan from Rustproof Labs,

just saying why not default to int8: it's only a few extra bytes, and you avoid so much pain in the future if you ever scale. So yeah, that makes a lot of sense to me. So some of it's learning, I think some of it's helping educate your team, and some of it I've seen you and others do, or suggest people do:

annual health checks, annual capacity planning, things like that where you do look ahead a bit. A lot of these issues are scaling issues, right? It's not too surprising that you can cause yourself downtime through scaling. We've all read the stories, normally on Hacker News, or a lot of good blog posts over the years, of people causing themselves downtime via lots of traffic or just general scaling issues.

[00:29:01] Nikolay: Or just mistakes

[00:29:02] Michael: Yeah,

[00:29:04] Nikolay: which are not mistakes if people don't know about them. That's why it's good to learn and read about other people's failures.

[00:29:12] Michael: Yeah, exactly.

[00:29:13] Nikolay: Yeah, so let me emphasize one thing here. I'm naturally working in the area of prevention of issues, and I tend to pay too little attention to monitoring and reactive components.

They are very important. Every time something occurs, it should lead to an alert. I think it's even more important than prevention as the first step of building a good HA system. First, you need to cover yourself with good alerts, and they can be different: for example, sometimes it's an alert about a failover, about running out of disk space, an incident about some locking, deadlocks, and so on.

Many things. But then you cover yourself with alerts, and I must say, of the current Postgres monitoring systems, none of them have good alerts for Postgres. None of them; many have parts of it. We can also play this game: give me some monitoring system, and I'll name you 10 alerts

it's lacking. This is the problem we have: we have no mature monitoring systems designed yet or provided on the market. It's a big, underdeveloped area. There are many systems, they are all good in their own sense, but there are no systems which I would give even a B-plus, and I'm not speaking about an A-mark grade, right?

So no system would get even a B-plus from me. So we need to cover all these things with alerts: an incident happened, some problem happened. Then we should already try to think about prevention and cover it with some alerts about thresholds,


[00:30:52] Nikolay: like 80% of something.

For example, you mentioned int4 overflow. We know when it happens: at 2.1 billion, if some sequence reaches a 2.1 billion value. Okay, we should define some alert. In this case, I would define it maybe at 40 or 50% even, because it takes a lot of time to develop a proper solution and so on. Or also transaction ID wraparound prevention.

By the way, in these two cases a failover won't help, because everything you have on the primary, the standbys also have, right? So you need to work at the application level, with the schema, adjusting something, or in the case of transaction ID wraparound, at the DBA level: you need to fix vacuuming, you need to make sure autovacuum or your explicit vacuum did its job properly, also monitoring long transactions. A lot of things, and a lot of alerts can happen. Then we have like 80%, 50% thresholds for alerting on dangerous behavior, and we also need some forecasting component, because it's usually not enough to say, okay, we reached 50% of int4 capacity.

Usually people immediately ask the next question: when? When will doomsday happen, right? So they need some forecast, and you can apply some machine learning techniques here or use some libraries. Yeah.
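
For reference, a minimal sketch of the two threshold checks mentioned here, with the 50% alerting levels as placeholders (pg_sequences needs Postgres 10+):

```python
import psycopg2

INT4_MAX = 2_147_483_647
SEQ_THRESHOLD = 0.5  # alert when a sequence passes 50% of the int4 range
XID_THRESHOLD = 0.5  # fraction of the ~2 billion wraparound limit

conn = psycopg2.connect("host=primary dbname=appdb user=monitor password=secret")
with conn, conn.cursor() as cur:
    # Sequences that have burned through half of the int4 range
    cur.execute(
        "SELECT schemaname, sequencename, last_value FROM pg_sequences "
        "WHERE last_value IS NOT NULL AND last_value > %s",
        (int(INT4_MAX * SEQ_THRESHOLD),),
    )
    for schema, seq, last_value in cur.fetchall():
        print(f"ALERT: {schema}.{seq} at {last_value} "
              f"({last_value / INT4_MAX:.0%} of int4 max)")

    # Databases approaching transaction ID wraparound: check vacuuming
    cur.execute(
        "SELECT datname, age(datfrozenxid) FROM pg_database "
        "WHERE age(datfrozenxid) > %s",
        (int(2_000_000_000 * XID_THRESHOLD),),
    )
    for datname, xid_age in cur.fetchall():
        print(f"ALERT: database {datname} has xid age {xid_age}")
```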

[00:32:18] Michael: Or just basic statistics or some rough estimates. It doesn't have to be too...

[00:32:24] Nikolay: Yeah, Yeah.

[00:32:25] Michael: yeah.

[00:32:25] Nikolay: Yeah. So, in this case, this is already about prevention. And, yeah,

health checks can help. And this proactive, not reactive but proactive, management of your Postgres can help find problems. Do you want to mention some other issues we wanted to mention?

[00:32:48] Michael: Should we do a quick-fire round, and you tell me how common they've been in your experience?

[00:32:52] Nikolay: Let's do it.

[00:32:53] Michael: Hardware failure.

[00:32:54] Nikolay: Well, hardware failure actually improves over time. I remember like 15 years ago it was a nightmare. Right now, even in a cloud environment, where hardware failure should be considered quite a common thing, I feel it has improved. I mean, cloud providers and many providers improved the characteristics of hardware, and it's becoming less common actually.

So of course, if you manage many thousands of servers, you will have multiple incidents per week, definitely; maybe almost every day you will have incidents, it's normal. But if you manage only like a dozen servers, it's quite stable already. If you have an incident per week with hardware while managing just a dozen or a few dozen virtual machines or physical machines, it's time to think about moving somewhere else.

[00:33:45] Michael: Change provider, yeah. How about operating system failures?

[00:33:50] Nikolay: Operating system failures also improve. Unless you went with some fancy kernel settings, it's not happening often. Usually here we have failures of a different kind; it's not like everything goes fine and then suddenly something breaks. Usually we need tuning, of course, some kind of tuning, and we need to take care of the behavior of the page cache in Linux and so on,

especially if you have a lot of writes. But usually when we upgrade the operating system, this is where we might have problems, you know, related to, first of all, glibc

[00:34:26] Michael: Yeah.

[00:34:27] Nikolay: and collection changes

[00:34:29] Michael: Not normally downtime, but corruption. Right,

[00:34:31] Nikolay: It's not downtime, I agree. Yeah, you're right.

However, Postgres 15 started to complain about, um, collation version changes. And I don't remember, is it a warning or an error?

[00:34:46] Michael: I think it's a log warning. I'm not sure.

[00:34:48] Nikolay: It's good if it's just a warning, right? I don't remember

[00:34:51] Michael: Yeah, actually I'm not sure at all. I've got a couple more, three more I think: pooler or load balancer issues.

[00:34:58] Nikolay: Poolers we discussed recently.

[00:34:59] Michael: Mm-hmm.

[00:35:01] Nikolay: If you grow, grow, grow... First of all, a pooler, for example PgBouncer, can be a single point of failure, and if that virtual machine is down, it's not good. You need to have multiple ones and balance between them. And the second big problem,

it happens all the time with many people: PgBouncer is still a single-threaded process, so it utilizes only one core. You can have dozens of cores on the machine, but it'll take only one core. So you need to run multiple PgBouncers, otherwise you'll hit this situation and it'll cause downtime.

Definitely somewhere between like 10 or 20,000 TPS roughly, depending on the particular CPU you are using. This experience is based on Intel processors mostly; if you are, for example, on an ARM processor, it can happen earlier. So, I mean, 10,000 TPS is already not a huge load today.

So a single PgBouncer is a big bomb that can trigger at some point, and this is not

[00:36:08] Michael: Yeah,

[00:36:08] Nikolay: good.

[00:36:09] Michael: Second to last one: running out of disk space.

[00:36:14] Nikolay: Very common. Of course, if you don't have monitoring and your project is growing, it's very common. It's a good thing that one of the reasons was improved. I mean, previously, with replication slots (basically, mainly logical replication slots), if you created one but no logical replica is using it, it leads to accumulation of WALs on the primary, and eventually you are out of disk space.

But now you can specify a threshold after which Postgres will say: enough waiting, I give up on this slot in favor of not being down because of disk space.

It can be improved in Postgres, I think. People complain about the behavior when Postgres runs out of disk space: you need to perform some dance to fix it, and it can be improved. I remember recent discussions; I don't remember the details, but I was personally in this position many times.

Usually I escape rather quickly, but it's very annoying when you are out of disk space and you need to escape. Yeah.
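
For reference, a minimal sketch of the safeguard described above: capping how much WAL a stuck slot may retain (Postgres 13+; the 100GB value is a placeholder) and checking slot health:

```python
import psycopg2

conn = psycopg2.connect("host=primary dbname=postgres user=postgres password=secret")
conn.autocommit = True
with conn.cursor() as cur:
    # Beyond this much retained WAL, the slot is invalidated instead of
    # letting the primary run out of disk space.
    cur.execute("ALTER SYSTEM SET max_slot_wal_keep_size = '100GB'")
    cur.execute("SELECT pg_reload_conf()")

    # wal_status / safe_wal_size show which slots are close to invalidation
    cur.execute(
        "SELECT slot_name, slot_type, active, wal_status, safe_wal_size "
        "FROM pg_replication_slots"
    )
    for row in cur.fetchall():
        print(row)
```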

[00:37:15] Michael: Yeah, I thought you were gonna say this is less common now with bigger disks and cloud providers, that some of them let you kind of scale this, don't they?

[00:37:23] Nikolay: Yeah, most of them support zero-downtime scaling. I mean, if usage reaches

90%, you scale up. Yeah, not down, yeah, good point. And if you have 90%... I remember some issues on Azure, but I don't like Azure anyway, so it's not my natural choice. They didn't support online change until recently, and even now with some limitations.

But GCP and Azure and others support online change. So you can just say, I want like a terabyte more, and then, depending on your file system, you need to issue a few commands in the terminal, and then Postgres is already fine, right? So it's easy.

[00:38:05] Michael: For what it's worth, I'm warming to Azure. I think there's some cool Postgres stuff going on there, but I haven't tried it myself.

[00:38:12] Nikolay: Well, that's a different level; I call it the second layer of the cloud world, right, managed services. Of course, we should have it fully automated. If your managed cloud provider doesn't offer scaling up of disk space, it's not good at all. This is a feature all of them should have already.

[00:38:30] Michael: Final one: denial of service, either from an external actor or (I've seen this internally as well) where someone's accidentally hammering their own database. Have you seen this? I guess it isn't

necessarily a full outage, but yeah, at least a partial outage normally.

[00:38:48] Nikolay: Yeah, first a slowdown and then an outage, especially if you have a high max_connections. It's very common: if some mistake happened in application code and it started to issue some queries at a huge rate, a lot of queries per second of some kind, it can put you down.

Definitely. Of course.
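
A minimal sketch of a connection-saturation check for this scenario (the 80% threshold is a placeholder):

```python
import psycopg2

USAGE_THRESHOLD = 0.8  # hypothetical: alert at 80% of max_connections

conn = psycopg2.connect("host=primary dbname=postgres user=monitor password=secret")
with conn, conn.cursor() as cur:
    cur.execute("SHOW max_connections")
    (max_connections,) = cur.fetchone()
    cur.execute(
        "SELECT count(*) FROM pg_stat_activity WHERE pid <> pg_backend_pid()"
    )
    (in_use,) = cur.fetchone()

ratio = in_use / int(max_connections)
if ratio > USAGE_THRESHOLD:
    print(f"ALERT: {in_use}/{max_connections} backends in use ({ratio:.0%})")
```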

[00:39:10] Michael: Yeah. Awesome. That's all I...

[00:39:13] Nikolay: Unfortunately. Good. I think it was a good overview of some of them, maybe not comprehensive, because this topic is huge actually, especially if you include, as I prefer, application-level stuff in it. HA is a huge topic, and big departments work to achieve good HA: SREs, DBAs, DBREs, and so on. But I hope we performed both an overview and dived into a few particular areas, of course not into all of them. Thank you so much.

[00:39:50] Michael: Yeah. Awesome.

[00:39:50] Nikolay: Oh, yeah, I forgot to mention the importance of likes and subscriptions. Please do it, because, yeah, I know we received a lot of attention recently with recent episodes and also our anniversary and so on. And every time you like it, please make sure you leave a like or a comment or something, which is super important for us.

This is our fuel, and it also helps to grow our channels. We have multiple channels, and this helps them grow, and more people can benefit from what we discuss here. So please don't underestimate these social media likes and comments and so on, and share with colleagues as usual.

Thank you everyone for everything.

[00:40:37] Michael: Yeah, we appreciate it. Take care.

Some kind things our listeners have said