036 Wait events
Michael: [00:00:00] Hello, and welcome to Postgres , a weekly show about all things Postgres Cure. I am Michael, founder of PG Mustard, and this is Michael Whot, Nikolai, founder of Postgres ai. Hey, Nikola, what are we talking about today?
Nikolay: Hi Michael. Finally, it's my choice, and I chose very interesting topic weight events and weight event types. We have two column just activity, I think it's still very, uh, not very appreciated by backend engineers, first of all, because tooling is not perfect still in many cases. most monitoring systems, uh, don't touch this area, but they should actually.
And, uh, I think it's worth knowing that, this thing exists and how to use it, So weather ends.
Michael: I like it. So every time I pick the topic, it's boring. Every time you pick the topic, it, it's exciting and interesting. I see how it, I see how it,
Nikolay: but boring topics also needed, you know, like, I cannot, say it's useless. Sometimes boring is even more useful, but in this case, we have a perfect combination of entertaining and useful topic with.
Michael: [00:01:00] yeah. So should we go into what they are first? maybe why they were added, times they can be
useful, that kind of thing.
Nikolay: Yeah, so I've just learned from you that it was added in 96. I thought it was in 9.5. It means 2016, right? Or maybe 15. But, I observed this, like, since my POGS experience is quite, made sure it's 18 years old already, I can tell you that, most of, my pocus life, we didn't have it.
I, I personally, POCUS more and more and stopped using Oracle in 2004 five before I used it for a couple of years, uh, when I was working in some companies. and I, I just thought it's fine. But then in 2014, 15, some events happened. Interesting events. And actually, I'm not sure like how these events were connected, but I observed it, from.
Being in the US and, uh, visiting Russia, back, I saw that, [00:02:00] somehow it's some, like, it was connected actually to politics as well. And there was a big move in Russia started to let's, uh, get rid of Sickle Server and Oracle and, uh, use pogs It was after Crimea, first, invasion invasive, Actions from, uh, aggressive Russia against Ukraine.
there was obvious thought that let's, uh, use poss and Poss Pro was founded and at the same time also at Yandex. I remember I was surprised at. already for a couple of years there was ongoing work to move to migrate from Oracle to pogo, but for economical reasons, they said actually two reasons, economical reasons.
because if you want more machines, you need to pay a lot to Oracle and also support and how they react to issues. Uh, it's very slow. And if you take pauses, you can fix things your. . This was second reason for them, but economical reasons, were also very important. So we had a combination POGS Pros [00:03:00] founded in the, with the idea, let's, grow pogs inside Russia and was related to politics.
Definitely they have clients, for example, I personally cannot work, with Postgres Pro since last February. Like, I, I just don't. To them actually because of their, clients actually, and connections and, whole idea of, uh, I I will skip it. Uh, like I, I, I don't want to say, let's skip this, podcast out of politics because this is also a mistake, I think, to say, I, let's, but I, I also don't want to spend the whole episode for this, but so what happened?
A lot of, interest. Companies migrated to from Oracle Deposits
Nikolay: was one of them,
Michael: to. I think this is worth mentioning, and the reason this is related to weight events is that, uh, I was actually working on Oracle tooling, around that time a bit before. And even, it wasn't even performance related Oracle tooling and I was aware of how much Oracle users loved weight events and a product called [00:04:00] Ash and things.
Yeah, exactly. So even I was aware of how much people love this feature
in Oracle and the idea of
Nikolay: ashes to ashes. Right.
Michael: well moving in. Moving to a system that doesn't have some features that you absolutely love is a tough, like sometimes you're willing to do it still for cost reasons or political reasons, but if you can get that feature added, I can imagine that was a big driver for, let's add the feature so that we can migrate or to make the migration less painful.
Or maybe after migrating because it's so painful not being
able to have them. Is that, does that
Nikolay: Right, exactly. So people with good experience of Oracle moving to poss, uh, started to say, oh, I, I, I remember because I was, actually doing. Quite successful meetups in Moscow, in St. Petersburg. That time I stole some ideas from George Bergers observing it in San Francisco and brought uh, it to Moscow.
email@example.com we quickly, started to be on second place below New York, but above San Francisco. [00:05:00] I was very proud of that. But, so people coming from Oracle experience, they started to say, you know, POGS is great, but why don't you have. , right? They, they named several things. Among of them was, declarative partitioning.
So like, we cannot work without, without declarative partitioning. We, you have some partitioning with, inheritance, but it's so. To be at this point. Let's have declarative partitioning and weight events. We cannot analyze performance well. Our statements is not enough. A local analysis is not enough. We need to have weight events at the same time.
We di we discussed that JSON maybe was, uh, contributed a lot like Crimee happened, right? And pogs Pro. So Russia decided to, to use more pogs. It's important, but at the same time, J was added and also rds. Pogs created like around the same years, like 14, 13, 14, 15 these years. And. observing all monitoring systems, I like this exercise to,[00:06:00] to observe various monitoring systems.
rds implemented performance society, making the bet that this is default tool for query performance analysis in rds performance insights and performance insight is Oracle. Active session history, basically these colorful graphs, which, uh, work with these weight events from pocus. But to work with it, you need to put it to POCUS first.
And this was done if, uh, if I'm not mistaken by Amit Cap. And was working at POGS Pro at that time, and it was released 9.6 in 2016. Right.
Michael: is such a
Nikolay: It was reaction to criticism. We should have it in pogs. this was influenced by former or current Oracle users who started to work with pogs, declarative partitioning, and this,
Michael: Yeah, and I
think some performance features as well, like
Nikolay: oh, many other things as well. I just want to name big things, very
Michael: at the Amazon Insights fascinating because Amazon, of course, are also a famous Oracle user themselves that
migrated to Postgres. So maybe
Nikolay: To work to AUR version podcast, but yes. So Aurora
Michael: with Performance Insights. and still with weight events, of course. So it's, yeah. Really, really interesting. Intertwining. So actually we, I don't think we've actually covered what weight events are and how
Nikolay: Yeah, let's do it. Uh, we have, in documentation, in pause, it's called cumulative statistics. but, some parts of it, it's not about cumulative, metrics. Activity is not cumulative. It has timestamps and some current, metrics. So, for example, process Id query. Sometimes stamps and it has two columns right now.
Uh, uh, weight event type and wet event. Wet event type is, it's more high level thing only roughly 10 types exist. Nine maybe. And weight event is like more fine grained, thing, in which type. We have many, many, [00:08:00] many actual events, And what cannot give to you, it can give to you, the knowledge.
Okay. This process idea, this post goes backend, is working with this query right now, but what is it doing? Is it doing some IO or some CPU work, or is it waiting to acquire a lock on some. Row. For example, oration, like heavy lock. It's called lock, but it's heavy lock or it's, waiting for some lightweight lock acquisition, which is actually latches, but it's, in memory, physical lock, right?
So we have, we talk about, locking some, uh, pages in memory, buffers. Buffers again, right? in this case you will, you will see in weight event type column, l w lock, lightweight lock, and this, this can be quite, sometimes quite, difficult to optimize and so on. I remember in sign 96 actually also.
I think it's maybe also connected to this, uh, observability improvement. there was a big, um, improvements in terms of how the buffer pool is [00:09:00] working. I remember a post from Kara Cough also together with my SQL Fox. Like, let's achieve hundred million TPSs on select only workloads on one machine. And to achieve that, they need to improve the buffer pool behavior.
To get rid of long-lasting l l w lock in. We type like you around, you see a lot of light welock, so they have this raw data in ed activity and it's already cool, but to work with it, for end users, you need additional tools right now because despite of the fact that this chapter is called cumulative statistics, POCUS doesn't provide you.
Metrics for this type of information, only current view. So you need to use some sampling to analyze it.
Michael: Yeah. You mentioned actually quite early on that a lot of monitoring tools don't offer this. I was reading a really good blog post by the team at PG Analyze who do, who added this a
few years ago, and I'll [00:10:00] share that as well. Cause I think it's quite useful, even if you're not.
Uh, using that product, to get an understanding of the kinds of things this can, do. Because once, once you're sampling it, you can see those cumulative numbers and you can see what your standard workload looks like. And then if you've got any, anomalies, you can start to dig into what are causing those and you start
to, see patterns.
Nikolay: Yeah, in general, I, I usually not fully satisfied with any monitoring because I think, only few implement very good query analysis. In general, it's, it should be done systematically with many metrics involved, and, uh, you should be able to, analyze many dimensions of query analysis, including like total time IO numbers, buffer numbers, and so on.
Averages total numbers and, and many things. And the, uh, weight event analysis is very, I. Part of query analysis. It's very like, or fogal, to what the P statements or local analysis might provide. And, Only if you exactly implemented RDS is good example. I don't remember implementation of ize. I need to revisit.
Maybe it's also [00:11:00] interesting, but I would like to mention back to times before 96 when we didn't have, weight events in PS activity. What happened? Okay. We have some, performance issue. We identified some queries which are concerning. For example, we see they are taking. Three seconds like at night when, for example, Chrome drop is running during daytime, we check this queries manually, right on production because we don't have database lab installed.
Okay. And we see three milliseconds like thousand times faster. What's happening? We include buffers. We see only a few buffers. Okay? What's happening at night? We enable, maybe explain and off to.
Michael: Auto explain.
Nikolay: Or to explain with buffers enabled to understand I your numbers. Okay, we see three seconds. I, I'm, I'm describing recent, uh, real case actually, if you already maybe guessed so we understand.
Okay. Three seconds and only a few buffers, like 200 buffers or maybe like a hundred buffers involved. What's happening? Right. [00:12:00] We don't understand. We suspect that maybe at night we have some issues with that tacos or something, and like, I would expect a lot of. buffers involved. Check, no. If we have database lab engine installed, we can point in time recovery to proper point and try to explain, analyze buffers, uh, look at it, under some angles, but without concurrent workload, you probably won't notice a real problem using these tooling.
Right. Auto explain producer statements doesn't explain what, where those like, okay, we can enable tracko.
Nikolay: And at least was it related to IO or not io, which is good. We should do it with, we double checking. I don't remember the tool. PO Binaries has a tool to check timing overhead on concrete machine.
Michael: PG test timing, I think.
Nikolay: yeah, yeah. So usually it's, uh, you can afford it and, uh, usually we recommend to our clients with, even with very heavy loads, to enable tracko timing [00:13:00] to have it in PK statements and explain everywhere. But, uh, still it, it only can tell you is it IO related or not. It's just segments only. We, we need to dive deeper and understand what happened.
Maybe we had some, Lock waiting, like heavy lock waiting. Or maybe we had some contention in the buffer pool. Or maybe something else it, maybe clients are very slow and we are waiting for communication with clients. And this is exactly where Weight Event helps. It extends query analysis from, uh, only PTA statements, cumulative statistics, metrics and log analysis like auto explain, which is quite advanced technique as well.
And I'm glad Pier a log analysis tool. Exists developed and it supports how to explain, actually, it's good. We move from them to different dimension or fogal to others and We look inside. We could look inside using perf, for example. And also answer questions [00:14:00] where does Bcan spend time executing this query?
But it's, , hard to have it all the time, right? Actually there, there are, uh, attempts to have using epf F or so and, and so on. There are attempts to have continuous observability, perf like flame graph, like, and so on. there is code root, uh, new monitoring.
Michael: Yeah. That's cool.
Nikolay: Old friend of mine is developing, actually, they developed, uh, OK Meter before that.
AKI meter, in my opinion, was the best implementation of query analysis for Postgres now. So Coru, check it out. It's interesting. And they, they recently integrated with some tool which provides flame graph, like, , uh, analysis continuously. In this case, maybe we even without weight events, actually you can do it with Epf f as well and try to understand which part of POGS called, where did you spend time for execution, this query.
But with weight event analysis, you have official way, but you need extra layer of. To have cumulative [00:15:00] metrics and, to understand historically, for example, at night, what happened In this case, if you don't have it, my advice is to implement some like, poor mans tool. You just sample it every, I don't know, 200 milliseconds or half a second
Michael: Or even a second, right? Like you're getting more information than
nothing at that point.
Nikolay: Yeah. Sometimes people implement it in the application. If you have rub python, you can do it there and you can even combine it with additional dimensions, like, which, uh, module of our application was that, or url, or part of url. And so you can easily localize this problem and improve. So, but in general, you need some sampling if you have original positives, if it's not rds, which has performance insights.
If your monitoring doesn't. You need some sampling or you can install p Watch post edition, which I still think we will probably bring to new P watch version, which isn't [00:16:00] under development as I know. And in this case, uh, you will ah, and you need to install extension p p g wide sampling. So there is extension sampling, developed by Alexander K.
when he was working at pogs Pro, so this extension again from pogs Pro, or you can use also PG Sentinel. There is additional extension, from, uh, uh, engineer who, and as I know works at, at edibles team, I might be mistaken. Sorry, I forgot name.
Sorry. we actually interact on, on Twitter and so on quite, often, but I, I'm very better with names. Sorry. So there are, you can use either position, annel or p sampling to have better sampling inside pocus. If you can install extensions, that's great because, for example, PWA sampling by. We'll sample every 10 milliseconds relating everything inside you won't miss anything actually with this approach, right? Because if you do infrequent sampling, you can miss, important formation. If your queries very fast, you have fast spikes, very acute, [00:17:00] you can miss them.
Michael: I'm guessing the overhead's
pretty low of this kind of
Nikolay: Oh, speaking of overhead, you can, you need to take care of proper, uh, cleanup for p
Nikolay: it can be very detailed information. per process, ID even, but you usually don't need it. You need a per query id, to be able to draw a a rough pie chart using weight event type or detailed pie chart, fine grained using weight event and understand, okay, this query usually spans this, oh, it also has history, so you can understand what happened at night.
Finally, with this, . and if you install PWA Sampling, uh, sampling Plus PWA two po edition, you will have colorful graph similar to Performance Insight, and everything is open sourced and free right now. So it allows you to build, uh, your, like, it can be your second monitoring, for example.
some of our clients, they have Datadog and. In, terms of consulting, we tell them it's okay to have second [00:18:00] monitoring, sometimes even third, because with Datadog you don't have it, and Postgres is improving. Database monitoring is improving inside Datadog, but still very far from being il. So it's, it doesn't hurt actually to have this monitoring probably using pool approach.
And so PWA sampling collects samples every 10 milliseconds and then your. being outside the, of your post machines, pulls this data every LA laser like every 10 seconds or even every minute. But since we have internal sampling, very frequent, we observe all queries quite well, right? So this can be improved in CSUs.
I think these metrics should be probably start to be cumulative. There should be some efforts maybe ongoing. Do.
Michael: I, no, I don't, I think I got excited briefly while I was preparing for this, but I was getting confused with the post 16 recent edition of PG stat io. So I think
f I know, I know it's, I know it's different, but, uh, [00:19:00] briefly got confused. So it's, it feels like people are investing in these system views though for better observability.
So hopefully at some
Nikolay: Wal Warf mentioning that this is not, like from zero to one. And that's it. when wait events appear, uh, this number of wait events, tracked. We are much smaller than right now. Some POCUS versions, had improvements since then and, it's growing, area.
Michael: Yeah, we were just looking at the docs, weren't we? And I think there's, like, well over a
different, uh, types,
Nikolay: more than a hundred? Definitely, yes. So a lot of them.
Michael: So if we, if we do start monitoring this ourselves, like are we gonna even notice any overhead? Is it gonna be like super low?
Nikolay: it depends on how you do it. With perge sampling, you can have some overhead you need to take care of, uh, cleaning up, uh, not to exceed some, uh, uh, memory allocated for it. Shared memory
allocated. So I, I don't remember [00:20:00] details and, but you. Pay attention to this area, but if pro, if it's properly tuned, overhead is very low.
It's just something that collects, uh, data from activity. That's it. So it's not a big deal if you do it yourself, like pull man's implementation on your application code like every second or so. Well, it's just some query to Petri start activity. I just must warn you that you cannot use now. if you use now and apply it to, calculate intervals.
So we, for example, query, query start or exact start when transactions started or something like this. You sometimes will have negative failures because, because while you are reading previous activity, previous activity is, it's not a table, not a regular table. you don't, you're not dealing with some consistent snapshot. if you have a lot of, uh, backends, a lot of entries in producer activity while you are reading them, your time is shifting and first timestamps and last timestamps. They are from [00:21:00] different point points in time of reality. So you need to use clock timestamp.
So every row you'll have new now, uh, function now in pos, it provides you timestamp for the beginning of transaction, your transaction. So it's always some constant single value for all, but for if you use clock timestamp, you have it for each row, sep, new value. Each time there is overhead of calculating new timestamp, but it's quite low also.
Michael: Cool. Good tip. I'm thinking so we can, we can monitor this and then look back historically, we can look at active connections and see, like diagnosed issues And, the other big use case I saw for this, uh, see people using this for like diagnosing complex issues.
Like if you've got, uh, well locking issues, I guess, uh, dead, like dead
Michael: that. Yeah.
Nikolay: Well, yeah, if, if you see for example, io uh, weight of Type io and you can understand reading or [00:22:00] writing, You already understand that you need to pay attention to your disk subsystem, maybe to upgrade your disks, give more power, more iops, more throughput.
Michael: Or more cash like incre,
reducing the load on
Nikolay: exactly. Good point. You, you can, uh, attack it from this angle. So you allocate more memory, you increase shared buffers of value to have bigger buffer pool, and in this case, less IO is needed. So also good approach. Sometimes you need to combine them, uh, if you see lock, uh, with and type and check which, uh, lock event is.
It's there. So in this case, you just need to to, to redesign your workload actually, right? So to get rid of, long log waiting, for example, very good trick is to use, uh, select, uh, for update. Either skip locked or no wait. So either you fail if in case of collision on some, [00:23:00] uh, rock acquisition for some role, for example, or you.
Skip those roles and for processing, get next to, uh, available, not yet locked. This is good for paralyzation of some batch processing. For example, if you see, lightweight lock, this is sometimes maybe the trickiest part. There are several lightweight locks. , it's about memory work with memory buffer, pool and so on.
And sometimes, uh, you need to optimize positives itself. Sometimes, as I've said in the 96, uh, big improvements happened, but I would like to mention an interesting case I recently had with, lightweight lock spikes. Still not solved. Dark is ongoing. But, what was there like we observed, lightweight. Particularly, uh, lock manager. It's called Lock Manager, and I just like had no idea originally what to do. And here I must, say big thanks to the AWS team because they have very, very good documentation, [00:24:00] which POCUS itself doesn't have POCUS documentation. This area is very, , right? So very limited, very like reference style.
This very small description. Of course, you can go to source code and understand more always, even if you don't understand C, you can read some comments and it's very helpful if you search for a particular, uh, log, weight event. I mean, no, sorry, weight event you can find.
Or lightweight lock as well. You can find interesting, uh, information.
But, aws, rds and Aurora documentation, I originally referenced only Aurora documentation about we events. But Aurora is more, much more heavily modified pos and uh, Jeremy Schneider from AWS team, uh, mentioned that RDS documentation also has it similarly and it's closer to positive, so definitely.
So you can check there what is written there, and they have more like how to and explanation style documents there for each lock. Maybe [00:25:00] not all of them, but many of them. And if you check Lock Manager for example, they provide some example how to actually have it in synthetic environment. They also discuss what's happening.
They, discuss some, uh, constant OUS code, which, leads to it. They discussed fast, path in just activity there, a com, in locks fast path. And what is. Why it is and so on. And they provide some very, very useful tips how to get rid of this type of issue, this type of spikes. So they have more like practical knowledge there.
And I wish POGS documentation original one also be improved in this area. There's a big opportunity here.
Michael: Yeah, I saw your conversation with Jeremy. I thought it was great, and I'm gonna share, in fact, I already shared the link in my newsletter just earlier today, but, um, I'm also gonna share it in the, in the notes here. It's worth mentioning that the. Individual weight event types are clickable and you can dive
into them [00:26:00] for those like deeper,
Nikolay: Oh yeah. I, I, I usually underestimate this. I like spending so much, so many years in, uh, the area of uxl like, social media and so on. Like, I still, sometimes miss this. Yes, indeed. it's not obvious that, they are clickable and, uh, there are gems inside.
So, so yes, you, you can click, uh, any of those, weight events and see a lot of info inside.
But usually Google provides good, does good job. Uh, when you search for some weight event, you can find AWS stocks.
Michael: Good point.
Nikolay: Yeah, and Ralph mentioning, uh, recently, in Seattle, they had meet up on this topic. And Jeremy, as I understand, Jeremy was presented, uh, with event and performance, uh, improvements around this.
Michael: need to get him on Postgres tv. Open
talks unless it was recorded.
Nikolay: that's a good idea. Like I, I will double check the, I think it would be interesting to watch this talk if it's available online. And if it's not available, I should ask. so just [00:27:00] little bit kudos to AWS team for this wonderful documentation. And, uh, if you don't use AWS still, you can benefit from it as well.
Michael: Yeah, absolutely. Anything else you
wanted to make sure we covered?
Nikolay: Well, maybe that's it. So sampling, man's approach works. Just sampling in, in the loop to some, small ba in your code and your Python, Ruby, Java in your code. sampling, extension if you can afford it, or PG Sentinel, uh, if you can afford it. And some. Monitoring. If you use just extension, you still need maybe some sampling or you need some scripts to analyze.
Its, its content. It's good to visualize it. So you need to either put it to your existing monitoring, which can be a big task, or you need to use P Watch to post edition or you mentioned Page Analyze also PPG Sentinel. I think it's used in some, I don't remember name of monitoring. We will attach it to third monitoring option.
To have it, or if you are rds or a broad user, you already, [00:28:00] I'm a hundred percent sure, you know, performance inside because they put it on the first page of everything. Like it's where, query analysis starts for RDS users, but it's not so for regular post users yet, I think, should be also improved.
So PTA statements are great, but they don't answer all questions. So,
Michael: No. Including these ones.
Nikolay: Actually vice versa. Weight event analysis quite often does answer all questions as well.
Michael: Yeah. It's the combination. Yeah.
Nikolay: one more thing I if you like, uh, adhoc tools. There is also Patch Viewer available on GitHub, uh, also from some Russian developer. so there is a, it's a Java. You can just download it and connect to your database and start having these performance insights like Graph with some details.
You can check and see exactly which event wait went time, which event, which queries are, involved, and, but [00:29:00] you need to keep it open so it's like ad hoc tool. And finally, PG Center by another Russian developer OV. , this, tool is more like console style, but it also has sampling for we events, but more for folks which prefer console terminal over ui, graphical ui.
So there are some at hoc tools as well, which can be
helpful, but without sampling. Good luck with, uh, with catching this, this, this knowledge from PPG p set activity. So I hope, uh, future
Postgres versions will have some really cumulative metrics. I don't know if it'll be a table, additional table with hundred plus columns or something else.
And overhead is also a question, but I hope, it'll be created and inside it'll become standard. Fors fors, uh, cases. Okay.
Michael: Awesome. Thank you, Nicola. Thank
Nikolay: I hope [00:30:00] I told something useful for someone.
I learn a lot as well.
Nikolay: Yeah. Usual. Thanks to our listeners, we see feedback. We appreciate it a lot. We also appreciate when you put likes to videos or to podcast systems, you. and also please share with your colleagues who need to optimize pauses. They had very interesting topic, I think, in my opinion.
Uh, related to query optimization. We had a query optimization topic in the past and monitoring, and now we just had some, kind of deep dive to one of the areas. I think, if you deal with positives, perform. You do need to understand this topic. So some interesting topic, I think.
Okay. I, I hope I entertained also a little bit. At least I find this topic very useful. So please share with your colleagues, uh, who you think, uh, need it.
Michael: Nice and please vote on which, which episode you find the most
Nikolay: We need to have some some charts like most boring. Less [00:31:00] boring. Yeah. Okay. We already have how many?
Michael: I think this is episode 36.
Nikolay: 36. Wow. Cool. Cool. Without any weeks, missing.
Michael: not yet.
Nikolay: Okay. Good.
Okay. Thank you for that as particularly, that's, it's it always great to chat with you. Until next time. Bye.