Infra cost optimization

Nikolay and Michael discuss some options for reducing costs — from big-effort-big-reward items, to some smaller things that could also be very impactful.

030 Infra cost optimization

Michael: [00:00:00] Hello, and welcome to Postgres FM, a show about all things PostgreSQL. I'm Michael, founder of pgMustard, and this is my co-host Nikolay, founder of Postgres.ai. Hey Nikolay, what are we talking about today?

Nikolay: Let's talk about budget optimization, first of all for infrastructure costs. If you run a database and it grows, it obviously increases your bill: usually a cloud bill, though sometimes not cloud, of course. And there's a question which becomes more and more acute: how do you pay less for it while the company continues to grow? How do you reduce the costs?

I think this topic can be interesting for both infrastructure and engineering teams. And it's becoming more and more important because we are in a crisis already. Unfortunately, a lot of people are being fired. So this is one of the questions: if you can optimize infrastructure costs, [00:01:00] it's probably a good place to optimize, and maybe you can save some jobs, right?

Michael: Yeah. It feels like we've gone through a boom period where a lot of people were throwing money at problems, maybe upgrading, and taking on a lot of costs that they don't necessarily need longer term. So hopefully, in some bigger environments, there's big room for improvement.

But I think for smaller companies there are often ways of reducing costs as well. So yeah, I'm looking forward to this one.

Nikolay: Right. So let's start with clouds, maybe. There is a good article from Andreessen Horowitz, maybe two years old already, called "The Cost of Cloud, a Trillion Dollar Paradox." Obviously, clouds are a super interesting, revolutionary concept which changed engineering drastically.

Right? But the observation is that with all those additional services, when [00:02:00] companies start, it's great because it helps them focus on product development and find product-market fit faster. But then the bill grows significantly. Of course, companies like AWS, Google Cloud, Microsoft Azure, and others win a lot.

And the market is growing, growing, growing. But do a simple exercise, which I did recently: take some medium-sized server, for example an Intel or AMD server with 16 cores and 128 gigabytes of memory, and compare the cost of EC2 instances or GCP instances with the price of the server if you buy it.

Right? Of course, it's extreme. Right now everyone wants to just make an API call or run [00:03:00] a Terraform task and provision instances. It's so good. Nobody wants to deal with ordering servers and then replacing disks; it's a nightmare from the past. But compare those costs.

There are also low-cost providers like OVH or Hetzner, where you can rent servers and they will replace failed disks, CPUs, and so on. You will wonder why the cloud price is so high. So my first question when we talk about database infrastructure cost optimization is: where do we run it? The extreme optimization is to go with rented servers, or with your own data centers if you are a big company.

You also have to compare maintenance costs, ownership costs, and so on, of course, because with clouds those are different. But it might make total sense [00:04:00] to go with rented servers or your own data center: rented servers for smaller teams, and for larger teams, building your own data center.

But of course there are many, many questions here. It's definitely not a database-only topic, but I would like to mention one important point: with Kubernetes and a lot of automation available, you can run Postgres with Kubernetes, or you can run it with Ansible easily. There are many options.

For example, postgresql_cluster, which Vitaliy, a person from my team, maintains. It's a great open source product: you can run it from Ansible, and it has backups, Patroni for autofailover, everything. So you can manage Postgres, with or without Kubernetes, with full-fledged automation and achieve a lot of good results

at [00:05:00] low cost.

Michael: Did you see the recent blog post by the team at Basecamp, or at least by DHH? I think they're a MySQL setup, but yeah, they were talking about exactly this and

Nikolay: It's insane how big the bill is right now. If you grow, you pay a lot. Of course, at this scale you should already negotiate large discounts with the cloud provider, a hundred percent. And by large I mean like 70%. Not 20 or 30, but a 70 or 80% discount.

You should do it if you pay more than 1 million. I think you already do it,

Michael: Probably. That's the kind of thing people are already doing. But it does feel like people are starting to question: are we getting the benefits that the cloud promised? Are we still having to employ people to do a lot of the infrastructure work, still having to worry about the lower-level things?

If we are, then are we really happy paying those premiums? But [00:06:00] equally, I think that's a very, very high-effort way of potentially saving a lot of money. And in the rest of the list we're probably going to cover some things that are lower effort.

Maybe they don't save quite as much money on an individual item basis, but you might be able to do them sooner.

Nikolay: Right. I wanted to emphasize that this is an exercise you should do regularly, from time to time. I mean, if you are a CTO or a database lead in your company. I'm not saying go bare metal with Kubernetes, but it should be considered and analyzed when you do capacity planning.

And I must also say, clouds and managed services like RDS are awesome. Every cloud is awesome: Aiven, Timescale Cloud, Supabase, Neon. All of them are great. All of them provide a lot of value. But you need to deal with pricing. You need to analyze costs and consider alternatives. And we have helped [00:07:00] our customers make this decision.

And it's never an easy decision. To go to RDS or not to go to RDS, to go to Aurora or RDS, or to go with Kubernetes, where there are options as well. Maybe that's another topic, right? It's always quite a hard decision, because you think: okay, I already have a table comparing all aspects of pricing and all my budgets, but maybe I forgot something,

right? They do provide value. They have good numbers in terms of uptime and durability. Can our infrastructure team handle it? If it's small, maybe not. If it's large, okay. Maybe now it can, and maybe later it can't. So it's a good question. But still, my advice: if you are too small or too big, definitely consider bare metal

or rented servers. I mean, if you're small, consider rented servers. It can save you a lot, like a lot. For [00:08:00] example, instead of paying 1000, you can easily pay 100 for infra. For the work, yes, it will require the use of some tools. Yes, it will maybe require a subscription to some services for support. But overall, okay, you'll be paying $200 instead of 100, but before you paid 1000 to the cloud, right?
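The round numbers in this example can be written out as a quick calculation. This is only a sketch of the arithmetic: the figures are the illustrative ones from the conversation, the variable names are made up, and engineering time is deliberately left out.

```python
# Illustrative monthly figures (USD), taken from the round numbers above.
cloud_bill = 1000         # what was paid to the cloud provider before
rented_servers = 100      # rented hardware
tools_and_support = 100   # assumed extra spend on tooling and support subscriptions

self_managed_total = rented_servers + tools_and_support
savings = cloud_bill - self_managed_total
savings_pct = 100 * savings / cloud_bill

print(f"self-managed: ${self_managed_total}/month, saving ${savings} ({savings_pct:.0f}%)")
```

A real comparison would add a line for the staff time needed to run the servers, which is the cost the speakers flag as the hard part of the decision.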

So RDS is great. Without RDS, for example, I think I wouldn't have come to the idea of creating thin cloning and Database Lab Engine, because first I worked with RDS clones. I understood how convenient it is to provision temporary machines to experiment and develop better, and then delete them.

Then, as the extreme case, thin clones were implemented. I'm sure many will remain on cloud providers, and that will continue. But in a crisis, it's one of the biggest questions: should you go with self-managed Postgres, or Postgres managed by some other company?

Michael: Yeah. And we do have a whole episode on this as well, right? One of the earliest [00:09:00] ones. If any new listeners are making this decision at the moment, that might be worth a listen. But yeah, it feels like a huge decision, with lots of work if you have to migrate. And maybe you don't have to go all or nothing, right?

There are some services that are clearly running on very thin margins, like the storage services on the clouds. But the databases do feel like they have quite a high margin added to them, versus simple storage solutions like S3.

Nikolay: Right. Also, if you go with cloud, many optimization techniques remain there. Maybe we could spend a whole episode discussing only this. For example, being in the cloud can mean self-managed, or it can mean an additional service like RDS, right? Everywhere there are trade-offs and decisions to make.

If you run self-managed in the cloud, of course you can benefit from various techniques to optimize costs. [00:10:00] For example, it's an extreme technique, but some people use spot instances even to run databases. In AWS it works better than in GCP, for example.

I'm not sure about Azure. I somehow don't have enough experience with Azure; our clients are mostly not on Azure. Also, if you work on self-managed, there are a few more options to cut the bill. If you go with reserved instances, for example, you can buy a contract for bigger instance types, and there are also convertible instances on AWS.

So you can then dynamically split large instances into pieces when you need smaller instances. It's quite an interesting technique. Also, I learned recently that there is a secondary market where you can sell those contracts, and so on. There are companies who focus on this cost optimization. For example, I use the service ec2instances.info. I don't know why AWS [00:11:00] doesn't provide a normal table with all prices on one page, but these guys do a very good job.

And the company behind it right now is Vantage, vantage.sh. They provide an automated service to optimize cloud costs for all big three providers, which is quite interesting. I hope we will have a Postgres TV episode with them soon to discuss this. So there are many options to optimize there, in terms of both compute and storage. I wanted to highlight one thing.

There is the serverless approach, which is very interesting, but you should analyze its pros and cons very carefully. In general it's great: you pay only for what you need. But you need to understand your workload pattern. If it's spiky, of course, it's great.

But if it's not spiky, there is some level at which the overhead you pay for the serverless service itself is already too [00:12:00] high. So there are trade-offs here as well. You need to do a lot of work, filling cells in some table, maybe with a calculator, and sourcing information from various places.
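The "cells in some table" exercise for serverless can be sketched in a few lines of Python. Everything here is an assumption for illustration: the prices, the capacity units, and the two daily load profiles are invented, and real providers bill in their own units with their own minimums.

```python
def provisioned_cost(hourly_price: float, hours: int) -> float:
    """Cost of an always-on instance sized for peak load."""
    return hourly_price * hours


def serverless_cost(unit_hour_price: float, hourly_units: list[float]) -> float:
    """Cost when billed per capacity unit actually used in each hour."""
    return unit_hour_price * sum(hourly_units)


def break_even_utilization(hourly_price: float, unit_hour_price: float, peak_units: float) -> float:
    """Average fraction of peak capacity at which both options cost the same."""
    return hourly_price / (unit_hour_price * peak_units)


# Hypothetical prices: an instance covering 8 units costs 0.50 per hour,
# while serverless charges 0.12 per unit-hour.
HOURLY_PRICE, UNIT_PRICE, PEAK_UNITS = 0.50, 0.12, 8

spiky = [8] * 2 + [1] * 22   # two busy hours a day, nearly idle otherwise
steady = [6] * 24            # flat load close to peak

for name, day in (("spiky", spiky), ("steady", steady)):
    fixed = provisioned_cost(HOURLY_PRICE, len(day))
    elastic = serverless_cost(UNIT_PRICE, day)
    print(f"{name}: provisioned={fixed:.2f} serverless={elastic:.2f}")

print(f"break-even at {break_even_utilization(HOURLY_PRICE, UNIT_PRICE, PEAK_UNITS):.0%} of peak")
```

With these made-up numbers the spiky day is cheaper on serverless and the steady day is cheaper on a provisioned instance, which is the trade-off described above: past a certain average utilization, the per-unit premium outweighs the benefit of paying only for what is used.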

Michael: And maybe some testing, right? Sometimes paying a little bit to test some things is much better. I saw a blog post not that long ago, and I can't remember who it was from, so I'm going to find it and put it in the notes. It was from a company who switched from regular RDS to, I think, Aurora in this case,

and were really shocked by an increase in pricing when they were expecting a decrease. So it's,

Nikolay: Aha, Aurora. That makes a loop through the topics. I wanted to discuss clouds versus not clouds first, and then, within clouds, various types of contracts: spot instances, reserved, convertible, many things. For example, I don't know why RDS doesn't have spot instances. Well, okay, I understand, it's stateful. But

people actually do it. If you lose compute, the storage is still there [00:13:00] and Postgres will recover. For non-production it's good, but they don't provide it. But then you mentioned Aurora. One interesting aspect of Aurora concerns query optimization. When we prepared for this episode, you naturally raised query optimization as the first topic, and I said, no, no, no.

Based on what I saw in larger companies, it goes maybe last. But with Aurora, maybe not last, because on Aurora you need to pay for I/O. If your database doesn't fit in the buffer pool, shared buffers, and your queries are not optimized, they do a lot of work with disk. In this case, your unoptimized queries hit your bill.

So if you optimize them, you reduce your spending, which is good. Query optimization on Aurora directly leads to cost optimization, which is an interesting case, because this is how the pricing is organized.
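To make the link between query plans and an I/O-based bill concrete, here is a rough Python sketch. It assumes, as a simplification, that every block a query reads from storage counts as one billed I/O request; the price, the call count, and the function name `monthly_io_cost` are all hypothetical, and a provider's real I/O accounting may differ.

```python
def monthly_io_cost(blocks_read_per_call: int, calls_per_month: int, price_per_million_io: float) -> float:
    """Rough I/O bill for one query, assuming one billed request per block read from storage."""
    io_requests = blocks_read_per_call * calls_per_month
    return io_requests / 1_000_000 * price_per_million_io


PRICE = 0.20         # hypothetical USD per million I/O requests
CALLS = 50_000_000   # hypothetical monthly call count for one query

# Before: the query reads 400 blocks from storage per call (for example, a poor plan).
before = monthly_io_cost(blocks_read_per_call=400, calls_per_month=CALLS, price_per_million_io=PRICE)
# After: an index brings it down to 4 blocks per call.
after = monthly_io_cost(blocks_read_per_call=4, calls_per_month=CALLS, price_per_million_io=PRICE)

print(f"before: ${before:,.2f}  after: ${after:,.2f}")
```

The per-call block counts would in practice come from something like `EXPLAIN (ANALYZE, BUFFERS)` or the block counters in pg_stat_statements; only reads that miss the cache reach storage, which is why a working set that fits in shared buffers changes the picture.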

Michael: [00:14:00] Yeah. And actually we should probably go back to some of your bigger-ticket items. You said that when you go into some of these bigger companies and suggest query optimization, they come back and tell you that's a very fine-grained solution, and that the bigger picture is that they've got loads of, I think you mentioned, even unused instances.

Of course, that's going to be a huge saving if you can just make sure you're catching those early. Do you have any, uh,

Nikolay: Normally, a lot of unused instances are left after some experimenting, so you need to control those. If your organization is growing and you have, for example, hundreds or thousands of nodes, then of course it's not good if some team provisions a node and forgets to shut it down.

There are services which monitor cloud resources and help you find those which are not used enough. But I wanted to mention that in non-production it's better to use thin cloning and branching, so Database Lab Engine fits here. I need to advertise my service once again. By the way, [00:15:00] since we are entering a crisis, I understood that there is a real need for infrastructure cost optimization.

So the consulting wing of our Postgres.ai team is also working right now on structuring all the aspects of database cost optimization, our topic today, so we can help with this as well. But in terms of tooling, Database Lab Engine gives you, on one machine,

dozens of independent clones, and this basically keeps your bill constant. Even when you need to run many clones, your bill for non-production databases stays constant. For benchmarking it's slightly different. If it's heavy benchmarking, you need to run heavy tests, utilizing all cores and so on.

Then of course you need to provision full-size clones in the cloud, and it's a good question whether you can do it temporarily and then shut them down. But the big thing is that you [00:16:00] must not forget to collect all the artifacts: logs, monitoring. There are approaches that can be used. For example, I can recommend Netdata, because you can export a dashboard there and then remove the machine.

So you need centralized log accumulation. And you also need your monitoring to be able to remember metrics for instances which no longer exist, and to keep them longer. Netdata has the export capability; unfortunately, it's manual for now because it's a front-end feature.

It's great, automation is great, and so on. So it can be a big cost saver if you start with non-production and keep the bill sane there. As for production, several things. First, interestingly, my opinion right now is that AMD is better than Intel these days, if you [00:17:00] need a beefy server. For smaller servers, maybe it doesn't matter that much, but if you need

100 or 200 vCPUs, if you need many hundreds of gigabytes of RAM, AMD is better for handling OLTP workloads, because it provides many more vCPUs for a lower price. If you compare benchmarks, you'll see that EPYC Rome and EPYC Milan are beating many Intel processors.

And of course, a big cost saver is ARM, like Graviton2 on AWS. For smaller servers it's good, but it's limited in the number of cores. So for really large, heavily loaded databases, a maximum of 64 cores might not be enough, right? I don't remember off the top of my head, but it's limited. It's not like the 96 or 244 vCPUs

others already provide. And if you need a lot of [00:18:00] vCPUs to handle a lot of workload, for example a hundred thousand TPS on one node, AMD EPYC is a good way to go. And it's cheaper in terms of power versus money. But if you have smaller requirements, maybe you should consider ARM, and even on RDS, ARM is good, right?

So it's already been there for a couple of years. Maybe

Michael: How much do you see people saving, going from one to the other?

Nikolay: It can be tens of percent. Of course, sometimes it's hard to compare. For example, if you run microbenchmarks: sysbench is good for microbenchmarks to check CPU and RAM, and then fio for I/O. So sysbench and fio are my two tools for microbenchmarking. From those you will probably conclude that AMD is good if you need a lot of cores. But if you

go to database-level benchmarking, the question is how, right? [00:19:00] How to benchmark is a different question, maybe for another episode. But in general, your benchmark should be good enough to draw conclusions from, because if you're just running

pgbench, it's not enough. It won't be representative; it will be very far from your real workload. So you need to take care of proper benchmarking. As for ARM, there is also a certain saving there. I don't remember the numbers. We tested it for one customer when those instances appeared, quite long ago, and they started to use it,

but for those instances which handle smaller workloads, not a hundred thousand TPS.

Michael: Yeah. You say smaller workloads? I say normal workloads like the, for the

Nikolay: Normal, okay. Less than a thousand TPS is normal, right? Okay.

Michael: You

Nikolay: Maybe it's good. Maybe you have a thousand microservices and cut everything into small pieces. And [00:20:00] maybe the system is more resilient if you cut it into pieces, but

Michael: a good

Nikolay: if you have a huge monolith, it can be hard.

Michael: You mentioned microservices, and one thing I didn't have in mind, but I wonder if you see, is whether with the push to microservices and having lots of databases, maybe a database per application, there's room for savings if people combine some of those, you know,

Nikolay: Well, a general rule applies here: if you have strict boundaries between resources, and these boundaries are not elastic and cannot shift one way or the other, you'll start paying more. It's like with disks. If you need several volumes on one disk, you need to control free disk space on each of them,

right? That's already some overhead in terms of managing it. [00:21:00] And it will probably be less efficient, because on one volume free disk space can be 40% and on another 70%. But the space can be united elastically, for example with ZFS: if you use datasets, you have one single number for free disk space, you just make sure it always stays above 20, 30, or 40%, and all datasets use what they need to use.

That's it. The same applies to splitting into microservices. Of course, if you put each database on a separate instance and provide standby nodes for each, some will be underutilized and some will need to grow. This is how the idea of serverless and autoscaling appears.

It's great that these ideas exist, but they also carry some extra overhead, and they won't work in your own data center if you follow my extreme advice to consider [00:22:00] a data center, right? So this is an interesting topic, of course. If your microservices are small, you might consider splitting them into separate instances in the future.

But in the beginning, you might combine them in a single Postgres cluster as logical databases, right? In this case, they share a single instance and its resources, for example RAM and CPU. Right now one of them is using more CPU, and that's okay. Then another is using more CPU, and that's okay too. On average we have this number.

We control only one metric here, also in terms of SLA and so on. That's good. It is elasticity, right? In terms of CPU, and in terms of RAM as well: only one shared buffers, one page cache, and so on. A spike in one of them is okay. But if you split them physically into separate instances, you have more metrics to control, and cost efficiency probably drops, of course.
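The elasticity argument can be illustrated with a small Python sketch. The service names and demand figures are invented; the point is only that separate instances must each be sized for their own peak, while a shared cluster is sized for the peak of the combined load.

```python
# Hypothetical CPU demand (in cores) for three small services across four time slots.
demand = {
    "billing":  [1, 4, 1, 1],
    "search":   [3, 1, 1, 2],
    "accounts": [1, 1, 4, 1],
}

# Separate instances: each one must be sized for its own peak.
isolated_cores = sum(max(series) for series in demand.values())

# One shared cluster: size for the peak of the combined load.
combined = [sum(slot) for slot in zip(*demand.values())]
pooled_cores = max(combined)

print(f"isolated: {isolated_cores} cores, pooled: {pooled_cores} cores")
```

The gap between the two numbers exists only because the peaks fall in different time slots. If all services peak at once, pooling saves nothing, and a shared cluster also means the services can affect each other, which is part of why splitting later may still make sense.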

Michael: Yeah, makes sense. [00:23:00] Are there any other big ticket items you had that you know,

Nikolay: Well, of course, from the point of view of any DBA or database expert, query optimization is probably the top of the hill, right? It's a great topic, and sometimes it saves a lot. My advice is, first of all, to monitor pg_stat_statements, along with, of course, the regular metrics for CPU, RAM, disk, everything, to see how far from saturation we are.

But pg_stat_statements already provides an interesting metric: each second, how many seconds are spent on query processing? It's quite simple, but interesting. Take this metric, seconds per second. Say it's two, for example.

It means that we need roughly two cores, very roughly. [00:24:00] Of course, there are context switches and other nuances, but we need roughly two cores. But in this case, we'd already be kind of saturated, right? If we give it four cores, we are about 50% utilized. And of course, you don't want to be 50% utilized unless it's at peak hours,

Michael: Yep.

Nikolay: right? So this is top-level query optimization. Then you dive into details, understand which queries consume the most time, and try to optimize them. And if the goal of the optimization is to save resources, you need to order by total time, in my opinion.

Then there are many details inside. We had an episode about query analysis, macro analysis from top to bottom. And then tools like pgMustard can help to explain individual queries and understand how to optimize them.
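The seconds-per-second estimate and the ordering by total time can be sketched from two pg_stat_statements snapshots. The sketch below assumes the snapshots were already fetched as plain dictionaries mapping a query id to its cumulative `total_exec_time` in milliseconds (that column exists under this name from Postgres 13; earlier versions call it `total_time`). The function name `load_report` and the sample numbers are hypothetical.

```python
def load_report(before: dict, after: dict, elapsed_s: float, cores: int, top: int = 3):
    """Compare two snapshots of {queryid: cumulative execution time in ms}."""
    deltas = {q: after[q] - before.get(q, 0.0) for q in after}
    busy_seconds = sum(deltas.values()) / 1000.0
    load = busy_seconds / elapsed_s   # seconds of query work per wall-clock second
    utilization = load / cores        # very rough: ignores I/O waits and context switches
    ranked = sorted(deltas.items(), key=lambda kv: kv[1], reverse=True)[:top]
    return load, utilization, ranked


# Hypothetical snapshots taken 60 seconds apart.
snap_1 = {"q1": 1_000_000.0, "q2": 250_000.0, "q3": 40_000.0}
snap_2 = {"q1": 1_090_000.0, "q2": 275_000.0, "q3": 45_000.0}

load, utilization, ranked = load_report(snap_1, snap_2, elapsed_s=60, cores=4)
print(f"load: {load:.1f} s/s, utilization on 4 cores: {utilization:.0%}")
for queryid, ms in ranked:
    print(queryid, f"{ms / 1000:.0f}s")
```

The sample data reproduces the case from the conversation: two seconds of query time per second on four cores, so about 50% utilized. A statistics reset between snapshots would produce negative deltas, so a real script should detect that and discard the interval.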

Michael: Yeah, so total [00:25:00] time, I completely agree. I haven't put it in a blog post yet, but I've now written a query to do the same for total buffers as well. So if you're on Aurora, you might prefer to start by looking at total buffers instead of total time. I thought that was an interesting, different angle you could take.

But yeah, that makes sense. And I think something people don't even consider, and that my tool doesn't help with, is: do you even need to be running that query? From an application level, how often do you need to run these things? Do you have things that could be materialized, or things that could be done less frequently?

So there's a whole host of optimization at the macro level that you can

Nikolay: Well, if you go by total time, you'll find some queries with very high frequency: calls is very high, but timing is very low. In this case, it's better to apply not a query optimization technique but an application optimization technique. Maybe call it less often, maybe add a cache, and so on.
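That triage rule can be written as a tiny Python helper. The thresholds and the function name `suggest_approach` are arbitrary assumptions chosen for illustration; sensible cut-offs depend on the workload.

```python
def suggest_approach(calls: int, mean_ms: float, many_calls: int = 1_000_000, fast_ms: float = 1.0) -> str:
    """Very rough triage for a query that ranks high by total time."""
    if calls >= many_calls and mean_ms <= fast_ms:
        # Each call is already cheap; the cost comes from how often it runs.
        return "application-level: call less often or cache"
    return "query-level: inspect the plan"


print(suggest_approach(calls=80_000_000, mean_ms=0.05))
print(suggest_approach(calls=1_200, mean_ms=3_500.0))
```

The first example is the high-frequency, low-latency case described above, where tuning the query itself has little left to give; the second is a slow, rarely run query where plan analysis is the natural next step.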

Michael: And I think this is most relevant for folks who are [00:26:00] getting close to a limit. Let's say you're on a cloud provider, on RDS, and you're getting close to a boundary where you think you might need to upsize the instance. I've seen people avoid doing so with a whole host of query optimization, but I haven't seen many folks go down an instance size.

That should be possible, right? If you're underutilized, why can't you? I just haven't seen it much in the last few

Nikolay: I saw it many times. For example, there is some issue with performance, we come in and help optimize queries, and then they go down in terms of instance size. This is normal

Michael: Nice. Yeah. Great. That feels like it should be possible. And so if in the past few years you've thrown money at a

Nikolay: Or, for example, even more: they remove a couple of standby nodes because they're not needed anymore, because the queries are so well optimized. It

Michael: Nice. So reducing like read replicas, that kind of thing?

Nikolay: Yeah. Because the replicas [00:27:00] mirror everything; it's redundant storage. It's good for HA, but in terms of performance, it's probably not good, because you cannot access another host's memory, right? So it's a way to scale queries redundantly, but in general, in terms of resource usage, it's actually not a very optimal way to scale

Michael: Yeah. Super interesting. If people are still listening from really small companies, I think it's also worth mentioning that a lot of people don't realize quite how many credits are available for startups. There are huge amounts of cloud credits, which could be thousands or tens of thousands, that you haven't used or haven't even applied for. But yeah, I didn't have much else. Anything else on your list?

Nikolay: I could continue, but I need to drop off for another call. I'm sorry. Let's wrap up here.

Michael: Well, thank you everybody for listening. Thank you, Nikolay, and see you next week.

Nikolay: See you. Bye-bye. Thank you.

Michael: Bye.
