pg_wait_tracer
Nikolay: Hello, hello, this is
Postgres.FM.
My name is Nik, PostgresAI, and
as usual, my co-host is Michael
of pgMustard.
Hi, Michael.
Michael: Hi, Nik.
Nikolay: And we have a great guest
today, Dmitry, who is a very
experienced DBA, originally Oracle,
now Postgres, already many
years.
Hi, Dmitry.
Dmitry: Hi, thanks for inviting.
Nikolay: Thank you for coming.
So for me it was a surprise when you built this tool; I didn't expect it to come to the Postgres ecosystem.
But obviously it's something new, and in an area that has been among my personal key interests lately: active session history analysis, wait event analysis.
But you did it very differently compared to what we usually did with pg_wait_sampling or other tools, like Performance Insights in RDS.
So it's a great tool.
Where to start?
Dmitry: Yeah, so I think we can
start with the time when I first
started working with Postgres,
as I started switching from Oracle.
In Oracle, 1 of my favorite topics and tasks was always performance-related investigations.
And when I switched to Postgres, of course, I started looking for similar tasks and understood that the way of Postgres troubleshooting is very different from how it's done in Oracle.
And I definitely felt the lack of tooling, lack of a standard approach, and so on.
Okay.
Okay.
So after some time, I realized that there are some tools that can bring a similar kind of feeling to what I had in Oracle.
And I'm talking about wait event analysis, because I know in the Oracle world it's a given thing.
Yeah.
Yeah.
Nikolay: I'm going to interrupt.
So when did this switch happen
from Oracle to Postgres for you?
Dmitry: Recently I checked, it
happened 10 years ago.
Nikolay: 2016 roughly.
I remember very well a big wave of people coming from Oracle in 2013-14, and all of them were complaining about the lack of wait event analysis.
This is how the 2 columns were added to pg_stat_activity.
It was maybe a couple of years before your switch, basically, and you had this same feeling that these tools are needed.
Dmitry: Yeah, because when you have worked with Oracle since, I don't know, 2002 or something like that, it's a given way of troubleshooting.
Any Oracle guideline, any book, any training, official or non-official, says: just open the active session history.
It's the gate you go through first, the place you start.
And so I tried to use the same approach with Postgres, because for me it was so obvious why it's the way to analyze performance.
Because basically the wait event tells you why it was slow, where the session, or the database in general, spent the time.
You do not need to guess whether it was I/O or CPU, basically where the time went.
Then you still need to investigate why it happened that way, but at least you know where.
Because the regular approach in Postgres can only say that this query was slow on average, or with some hacking you can get some precise per-execution numbers, but it does not give you the idea of why it was slow.
Okay, yes, you can use EXPLAIN ANALYZE to see where in the plan you spent the time, but again, it does not explain why this join sometimes takes, I don't know, 1 second and sometimes 10 seconds; from the plan perspective it looks the same.
And the tooling for wait event analysis evolved year by year.
And so we have pg_wait_sampling.
And since new wait events kept being introduced as well, it became really a tool that gives good data, a good source for wait event analysis.
And I use it everywhere I manage Postgres.
If it's self-hosted Postgres, for sure I will convince the owner of this Postgres to install pg_wait_sampling.
It's my default way of analyzing.
On all my dashboards, it's the
first panel you see when you open
any dashboard that I created.
So it's a wait event analysis.
I convinced all my colleagues to
use this approach.
You convinced me
Nikolay: To put this ASH graph at the top of the first dashboard, which I resisted initially, because the first dashboard for me was meant to be very shallow and very wide troubleshooting, where you identify areas.
But you convinced me to put the ASH graph at the very top, and now I'm convinced.
Yeah, it's a great idea.
It very quickly helps you understand the pain points for performance.
Dmitry: Yeah, I will try to make it brief: 2 examples where wait event analysis shines.
So 1 was a hardware migration from, let's call it, old hardware to new hardware.
The data files were stored in a storage appliance, so we just remounted; the data layout was the same.
Everything was the same in terms of the data layer, the storage layer, but the compute part changed, and changed to a higher spec: the number of virtual cores increased, the clock increased, everything increased.
All tests were good.
But when we started loading production, we saw a really huge degradation in performance.
And what would you think?
A plan flip or something like that?
But everything is absolutely the same, just the hardware is different.
So it basically cannot be the plans, because the parameters are the same.
Nikolay: Your slide deck has a very interesting example, very simple, so anyone can understand it, about contractors doing some work.
In this very example you describe, it's basically like you hired a different team, said: do the same work, and it does it very differently.
Right.
Yeah.
So, just to explain that example for those who haven't seen it: there are contractors, and they say, we are busy, we spent so much time.
And without wait event analysis, you just know they are busy.
But what exactly they were busy with, you don't know.
And in this case: okay, they were busy with I/O or something else, right?
Dmitry: And yeah, exactly.
Nikolay: Switching hardware, you see a different picture.
So what was at the end of that story?
Dmitry: So the problem was the lock manager LWLock; you know, it's 1 of your favorite LWLocks.
Nikolay: A lot of dots are connected here, because we spoke about so many problems explored with wait event analysis on this very podcast.
Right, Michael, you remember the lock manager?
Michael: Yeah, of course.
It feels like whenever there's a system-wide issue, it's such a great starting point, right?
Dmitry: Sorry, that's the reason for my next case, which I would say is at a lower level.
Nikolay: Let me first disagree with Michael, because I agree on 1 side, system-wide.
But on the other side, more and more I think we just don't have this tool yet, but it should exist.
pgMustard should have it: we have plan analysis, and for a single call it would be great to see the wait event profile as well.
Because imagine we were blocked by a lock acquisition that was pending.
You will see a lot of time spent, buffer numbers are low, and that's it, you need to guess.
With wait event analysis for a single query execution, you will see: okay, this was the Lock wait event type.
Michael: But wait, I think this is where we get into the core of what Dmitry's done, which is cool, which is moving from sampling to tracing.
Historically it has been reliant on...
I do think it's really important.
Maybe I'm wrong, but my understanding is that sampling...
Nikolay: I had the same expectation as well: it's about single backend tracing.
But let's move on and hear what Dmitry will say.
Exactly.
Dmitry: Yeah, the next case is when wait event analysis really saved a lot of time, and it's a single backend analysis case.
It does not work only at the instance level; it also works for a single backend.
I had a case: I got the complaint that a query sometimes takes a few milliseconds and sometimes is killed by a 1 second timeout.
We had, I don't know, 100 calls, 10 of them failed with the 1 second timeout, and the query is a primary key lookup.
So they cannot mess up the plan in this case; it's just a B-tree primary key lookup, very simple.
No joins, nothing, just a very simple lookup.
Nikolay: And it's not lock manager contention, because in many cases lock manager contention is exactly about primary key lookups.
It was not that.
Dmitry: Yeah.
Okay.
So you open the dashboard and you don't see any spikes of any specific wait events.
So it's definitely some very low-level problem with a specific backend.
I also captured, just in case, the EXPLAIN; the plan was the same.
Everything was the same, but sometimes it goes from a few milliseconds to 1 second.
And I started sampling: I made some custom scripts that would just catch the needed backend and start sampling it.
And I saw that when it's slow, it's on I/O.
I mean, I saw the I/O and thought: how can that be, it's just a very simple read, it reads very few blocks from disk.
So with the wait event analysis I understood it's probably nothing else but just slow I/O in some cases.
And with eBPF, tracing these backends, I understood that it's a problem with long tail latency on the storage appliance that we use.
Sometimes it goes from 1 millisecond to tens of milliseconds.
And indeed, if you're not lucky enough and you need to read these blocks from disk, not from memory, not from the OS cache, even these few blocks can add up to a second.
And yeah, I think without wait event analysis it would just be a guessing game.
But with this, in just a couple of days I found the problem.
Michael: A couple of quick questions.
1 is: you can get I/O timings in EXPLAIN ANALYZE output.
Obviously it's a bit tricky if they're always getting cancelled.
I don't know how you got the EXPLAIN ANALYZE for the queries that were timing out, actually.
How did you do that?
Dmitry: I just increased the timeout for some portion of them, to let them finish in a few seconds.
Nikolay: It's about 0, right?
Michael: Did you have I/O
Timings on the server?
Dmitry: Yes, I do.
But when you have 1K calls a minute of this type and just a very few of them fail, you cannot track it.
It's 1 of my favorite topics: averages lie every time.
You have to have P99, and better, more nines, to see the actual picture of the system.
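As a quick illustration of why averages hide this kind of failure, here is a toy calculation; the latency numbers are invented, not from the incident:

```python
import statistics

# Hypothetical latency sample: 990 fast primary-key lookups (~1 ms)
# and 10 slow ones that hit the storage long tail (~1000 ms).
latencies_ms = [1.0] * 990 + [1000.0] * 10

mean = statistics.mean(latencies_ms)
p99 = statistics.quantiles(latencies_ms, n=100)[98]  # 99th percentile

print(f"mean = {mean:.2f} ms")  # ~11 ms: looks almost healthy
print(f"p99  = {p99:.2f} ms")   # the long tail only shows up here
```

The average barely moves, while the P99 exposes the 1-second outliers that were hitting the statement timeout.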
Nikolay: Just to push back,
Michael: I think this is a really interesting topic.
Actually, in this 1 specific case, auto_explain might have been okay.
That's where it can be really good: catching the outlier queries that go beyond a set threshold.
As long as you've got the I/O timings on, in this 1 case we might have seen: oh, I/O timings are the issue for these slow queries.
But I'm sure there are other examples where the wait events would have shown something more interesting.
Nikolay: But wait event analysis... or no: auto_explain will show you the plan, will show you timing if you enable it, will show you buffer numbers if you enable it.
Good.
But it won't show you a wait event profile.
Although now I think: why not?
It's probably a good idea.
Maybe it's even better to have it in auto_explain than in manual EXPLAIN.
Michael: I think you're right, but I was talking about track_io_timing.
You know the parameter track_io_timing?
Nikolay: Yeah, I/O timing, great.
But first of all, it works at a higher level.
And what if it's not I/O timing but IPC, or LWLock, or Lock, or something?
A full picture would be great to have, right?
Also, here we talk at the wait event type level; we have how many, 7, 9 wait event types, but how many wait events?
Hundreds.
So it's a very precise understanding of what the system was doing.
So just imagine auto_explain optionally logging a wait event profile.
I think it's a good idea for those who want to hack Postgres right now.
Michael: Yeah, me too.
And I think that transitions us onto: why don't we have this?
What are the downsides?
What's the overhead?
What are the tricky parts of this?
Dmitry: And we will go there, yeah, because it's definitely a thing that I cannot get out of my mind for weeks already.
With this last case, this single backend problem, I reflected on the experience: I spent a couple of days on it, and I thought, what tooling could help me finish this kind of investigation next time in hours, and better, probably even enable dev people to do it instead of DBAs.
And Oracle has a very simple answer for this type of problem: the 10046 event.
It's tracing.
When you enable it at a certain level, it writes everything into a file: every single syscall, every single tuple, the row sources, everything.
I mean very detailed, real tracing.
And I decided to create the pg_10046 extension that would do this in Postgres.
And the prototype was really good; I'm really proud of it.
For wait events, Oracle also traces the wait events.
And I did the same, but I had to sample them, because by that time I hadn't found any tracing approach for wait events in Postgres.
When I started posting about this on LinkedIn, Jeremy Schneider came and commented: while it's really nice, you cannot call it 10046, because the wait events are not traced.
Very true, but there was no way in Postgres to do it.
Nikolay: Yeah, yeah, 1 second.
I just wanted to send a big hello to Jeremy Schneider if he listens to this, and I hope 1 day we will have him on this podcast.
Jeremy Schneider is important because he was making this popular, blogging about ASH a lot.
And he was on the RDS team when they started to do Performance Insights and so on.
So it's great that you connected with him over LinkedIn.
Michael: I think he was also involved in the very good wait event documentation that RDS and Aurora have.
Nikolay: Oh yes, we discussed this
with Dmitry offline a lot.
Yeah.
Dmitry: Yes, really, a lot of kudos to them for this work.
Okay.
And again, I was brainstorming with all the LLMs I had at that moment.
And I started complaining; sometimes I talk to them like to real people, and I was complaining: yeah, I have a really nice extension, but the lack of real tracing is a big caveat.
And Claude said to me: but you can trace.
And I said: no way.
How?
And it explained to me in detail how I can really trace the wait events.
And for 2 days I challenged this idea a lot, for a reason: it could not be that I was the first to find this (okay, not me, Claude found it), and that no 1 in the Postgres world had used it before.
Either it just does not work, or someone should have found it before.
Anyway, 2 days of testing this idea showed me that it basically works, and I started prototyping the tool.
It's still in a very immature state, but it's already showing nice pictures and can already help to find some interesting, how to say, correlations.
For example, I found that, you know, most extensions still use the generic Extension wait event.
And when an extension does something, you really cannot distinguish whether it's doing real work or just sitting in a wait loop.
When I used pg_wait_sampling, for example, to cross-check the numbers I get from my tool against pg_wait_sampling, I understood that the background worker of pg_wait_sampling, even if it does nothing, because it works in the background and there is no job for it, is still shown in a waiting state there.
And I think since PG17 you can have custom wait events for extensions.
I really hope, if someone is listening: please use custom wait events in your extensions.
It will really help to understand whether your extension is busy doing some valuable work or just spinning in a loop.
Nikolay: Right.
This is why in my pg_ash too, listening to Dmitry, I spent some time analyzing the colors.
So green is what's usually called CPU, and this is why in pg_ash and in our monitoring tool CPU has an asterisk: CPU* means it's CPU, or maybe not.
We also discussed this a few times on our podcast.
So yeah, this is exactly why: some extensions are doing something.
Yeah.
Michael: I was just thinking whether pg_ash would show up in pg_ash with its own custom wait events.
But then I remembered it's not really an extension, so it can't.
Exactly.
Sorry.
Yeah.
Nikolay: True Postgres.
Dmitry: So that's basically the story of how this came to life.
So, shall we show it?
Nikolay: We cannot show it on a podcast, unfortunately.
But just to recap: you complained to an LLM that something is impossible, and the LLM said, let's do it.
Dmitry: Yeah.
And this is how it was.
Nikolay: This is crazy.
It means that it's good to sometimes complain.
Maybe you just don't know how, and it's possible.
It's great.
Dmitry: Yeah.
Brainstorming with an LLM, I think it's 1 of the best features.
Yes.
Nikolay: Similarly, in our monitoring: I remember I was talking to you, and you said it's really hard, or maybe impossible, to show the beginning of the query in query analysis in Grafana.
And that's it.
Dmitry: You mean the query text.
Nikolay: Yeah.
We see the chart, we see query IDs, but it would be good to see the query itself.
Is it a select or an insert or something?
So I just went and did it with AI, and it's there right now.
Because we had a limiting belief that it's not possible.
It's interesting.
So yeah, AI helps here.
Sometimes it doesn't help, but here it definitely helps to explore new options and create something.
Dmitry: Yeah.
The fast validation of a new idea, that's also 1 of the fantastic features.
You can really quickly check whether an idea basically works or not.
Of course, building mature software will take time; an LLM cannot do it in 1 night.
But prototyping an idea, cross-checking it, yes.
Nikolay: So this new tool is tracing.
Is it true that it won't miss any wait event, or does it still have some sampling?
Because when you showed me the pictures, which look amazing... I still think you need to add those pictures to the README, because this is super impressive.
And it was microsecond precision; my mind was blown.
It's really crazy.
But then we started talking, and Dmitry actually came to our hacking sessions, which happen on Wednesdays online on YouTube.
So a couple of times we sat and thought about how to do something, and that something was about changing Postgres, right?
So why is it needed to change Postgres for this tool?
Dmitry: No, for this tool we do not need to change Postgres.
But the thing is, this tool has overhead.
Nikolay: Observer effect, right?
Dmitry: Yes.
For more or less regular load, not pathological, let's call it an acceptable level, like 10%, plus or minus.
You pay overhead for every wait event transition, when you go from 1 to another.
And if you get more transitions per second, you pay more overhead.
The maximum I got was 220K transitions per second.
And in that case, with 220,000 transitions, my overhead was basically 30% of the query duration.
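A back-of-envelope calculation of what those numbers imply, assuming, for the sake of the sketch, a roughly fixed cost per wait event transition:

```python
# Numbers from the episode: 220K transitions/s cost ~30% of query time.
transitions_per_sec = 220_000
overhead_fraction = 0.30

# Overhead per second of wall time, spread over all transitions:
per_transition_us = overhead_fraction * 1_000_000 / transitions_per_sec
print(f"~{per_transition_us:.2f} us per transition")  # ~1.36 us

# The same per-transition cost at an assumed "regular" rate is far cheaper:
typical_rate = 20_000  # transitions/s, invented for illustration
typical_overhead = per_transition_us * typical_rate / 1_000_000
print(f"~{typical_overhead:.1%} overhead")  # ~2.7%
```

This is why the observer effect scales with the transition rate: the synthetic workload is an extreme, and a calmer workload pays proportionally less.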
Nikolay: Was it with a very tiny buffer pool, tiny shared_buffers?
Dmitry: Yes.
That's why it needs to go to the OS to get the block, but the block is in the OS cache, so it's really fast.
But you need to go out for almost every block.
And even for evicting, you need pinning, unpinning, all this kind of stuff.
So in this case I was able to generate this number of transitions.
And while it's, how to say, not an edge case, it's very synthetic.
But anyway, it shows that there is overhead; you cannot avoid it.
And when Jeremy Schneider raised this concern that it's not tracing, he also said that the patch for Postgres could be very simple: you just add 2 probes in the function that is used to set and unset wait events, and basically that's it.
Nikolay: And then you created this...
Dmitry: And then you can...
Nikolay: ...use eBPF?
Yes.
Yeah, yeah.
Dmitry: And then you created this patch for...
Nikolay: I wanted to say that even if, as you said, for a regular workload it's not 30% but 10% overhead, observer effect, it still sounds super useful to me in a lab environment.
So if we reproduce some problem and study it, not in production but on some clone or something, right?
I'm ready to pay 10%, 20%, maybe even 30%, because it gives me an exact understanding of everything that happened.
But of course it would be good to have a very low observer effect and have it in production right away.
Dmitry: To be honest, I personally would pay even 20% in production, if it instruments things so that I can find the problem really fast.
Because once I spent, I don't know, 2 or 3 days finding where the problem was: there had been some change in the security configuration of the server, and basically everything became very slow.
And again, I spent a lot of time writing custom tooling, custom scripts, to get to the problem.
If I had really good instrumentation, like we are discussing now, it would have taken a couple of hours to find the problem.
And as Tanel Põder said in 1 of the comments (he was also triggered by these pictures), observability is an investment; it's not for free.
In the Oracle world, for example, you just pay it, you just don't know it, because you cannot have Oracle without the instrumentation.
I think they instrumented the kernel in version 7.
Nikolay: And you cannot strip it out and run it without it.
Tanel Põder, worth mentioning, is a very noticeable expert in the Oracle world.
It's great that paths also connected here.
Dmitry: Yeah, well...
And that's why I think wait event analysis capabilities should be in the Postgres core.
And there are 2 ways of doing it.
A small patch that you did with this...
Nikolay: Did I do some patch?
I already forgot.
Dmitry: Yes.
Yeah, yeah.
But I wanted to go further and build the patch that will bring an Oracle level of wait event analysis: bring the system time model, the tracing at the backend level.
At the beginning I was skeptical, and I thought the overhead would be similar to the tool's, because why should it be much less?
But in reality it's way less.
I have this patch on my GitHub, and for the last couple of weeks I have been iterating on it, making it better and better, and I hoped it would be ready to propose to the community just today.
But today I was analyzing: there was a commit in master that breaks a lot.
They changed some memory management, and it basically breaks some parts of my code, and I need to rework them.
Nikolay: Yeah, and now there is the heavy work of preparing Postgres 19, and maybe slightly later it will be less super active in terms of code changes, I guess.
I wanted to mention that there was more than 1 limiting belief.
First it was: it's not possible.
Now you think: the observer effect is too high.
And obviously it's not so high.
And I remember upgrades with pause-resume in PgBouncer: I had such a big belief it wouldn't work, but we see now how beautifully it works under huge loads on huge clusters.
So we just go and check, right?
Now experimentation is cheap with AI.
It's great to check instead of thinking it's not possible or hard.
It's super inspiring to hear what you say.
Dmitry: Yeah.
And back to the tool: I still think it's very applicable, in case my patch, or a similar patch with a similar idea, is not committed upstream any day, or you cannot afford new versions, so you will stick with some vanilla Postgres without these new capabilities.
The nice thing is that it's fully independent from Postgres.
It runs like a sidecar, so you can switch it on and off.
You do not need to recompile Postgres, you don't need to change anything in Postgres; it just runs next to it.
Yes, you need to self-host Postgres, you need to have access to the box.
Nikolay: Root access.
Dmitry: Not root, but with elevated access rights for running eBPF, basically.
And that's the nice thing about tracing.
As soon as it's pure tracing, not sampling at any moment, it opens doors for very interesting things that are not even implemented in the Oracle world.
And here we can beat Oracle in Postgres, because it's tracing.
It means that we have a full history of transitions from 1 state to another state to the next state and so on.
And you can reconstruct all performance problems, clashes, or bursts in your system, not only in time, but also find the dependencies: who started the problem.
Because with wait event analysis based on a sampling approach, for example, you cannot distinguish between 10 short I/O waits and 1 long one.
With tracing, you can, and you can find the session that started the problem.
Because in the regular way, it's really hard to distinguish whether a session suffered from the problem or caused it.
So with tracing, I believe, you can find the root of the problem and all its consequences.
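A toy model of the difference Dmitry describes: sampling a 1-second timeline every 10 ms sees the same number of "in I/O wait" samples whether there was 1 long wait or 10 short ones, while a trace preserves every transition. All intervals here are invented:

```python
SAMPLE_EVERY_MS = 10

def samples_in_wait(waits, horizon_ms=1000):
    """Count fixed-interval samples that land inside any wait interval.
    waits: list of (start_ms, end_ms) half-open intervals."""
    ticks = range(0, horizon_ms, SAMPLE_EVERY_MS)
    return sum(any(s <= t < e for s, e in waits) for t in ticks)

one_long  = [(0, 100)]                                   # 1 wait of 100 ms
ten_short = [(k * 100, k * 100 + 10) for k in range(10)] # 10 waits of 10 ms

# Sampling reports the same picture for both workloads...
print(samples_in_wait(one_long), samples_in_wait(ten_short))  # 10 10

# ...while a trace keeps each transition, so the wait counts differ:
print(len(one_long), len(ten_short))  # 1 10
```

The two workloads stress the storage very differently, but the sampled profile cannot tell them apart; only the transition history can.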
And with all these transitions, you can see the system as a set of servers and analyze it from the queuing theory perspective.
You can use the already really mature mathematical apparatus; everything is already there.
You can analyze how much you can serve in a given period of time, and you can find the real bottleneck.
For example, with regular wait event analysis you can see a lot of I/O, that I/O is the bottleneck.
But in reality, if you see from the tracing that after the I/O a backend always takes an LWLock, for example, for any reason, then even if you were able to solve the I/O problem, for example by finding faster disks, the I/O would be shorter, but you would anyway be blocked on LWLocks.
So the bottleneck would just move to the next stage.
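The bottleneck-shifting argument can be sketched with made-up per-request service demands for each stage a backend passes through:

```python
# Invented per-request service demands (seconds) at each stage.
demands = {"io": 0.0020, "lwlock": 0.0015, "cpu": 0.0010}

def bottleneck(d):
    # With 1 "server" per stage, max throughput is 1 / largest demand
    # (the basic bottleneck law from queuing theory).
    stage = max(d, key=d.get)
    return stage, 1.0 / d[stage]

print(bottleneck(demands))  # ('io', 500.0) requests/s

# Buy faster disks: I/O demand drops 4x, but throughput does NOT rise 4x,
# because the bottleneck just moves to the LWLock stage.
faster = dict(demands, io=0.0005)
print(bottleneck(faster))   # bottleneck moves to 'lwlock', ~667 requests/s
```

With only a sampled wait profile you would see "mostly I/O" and buy disks; the trace-level demand numbers show in advance that the win is capped by the next stage.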
Nikolay: This is what makes benchmarking so hard.
Yes.
And what I hear, even if I don't know all the details, is that there is a mathematical apparatus, basically, which could help here analyze bottlenecks much better and identify the real bottleneck reliably.
That sounds amazing.
I want this.
Dmitry: So that's not all of it; it's not done yet.
I have more ideas that are semi-implemented, and I'm just playing with it.
Because collecting the data is not the end, and even analyzing the data is not the last step.
The last step is presenting the data to a human, making it consumable, so that a human can understand this data and take action.
So this is process analysis.
Again, really smart people have already spent a lot of time thinking about this, and they already have really good approaches to analyzing such graphs: when something moves from 1 state to another, for example tickets, when a ticket moves from open to in progress to, I don't know, test completed; the same with logistics, when you need to find the bottlenecks in your graph.
And with this math, you can find the patterns: the most frequent path is this 1, and the bottleneck is this specific state.
It means that, I don't know, most of your backends go through the buffer pin, for example, and the buffer pin is basically the limiting factor.
You do not even need to look at the dashboard with your eyes.
A system, not an LLM, simple code, can do it for you: we analyze the backends' traces, and yes, everything goes through the buffer pin.
And we see that the system can do 100K buffer pins a second, but your workload demands 200K.
So basically, you need to solve this problem here; the problem is here.
It's a very deterministic approach, and it can produce a report really easily.
You do not need hallucinations, you don't need to look at your dashboards for that.
So I think it opens doors for really interesting stuff.
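A minimal sketch of that deterministic check, with invented traces, state names, and capacity numbers:

```python
from collections import Counter

# Invented per-backend traces: sequences of wait states from a tracer.
traces = [
    ["cpu", "buffer_pin", "io", "buffer_pin", "cpu"],
    ["cpu", "buffer_pin", "buffer_pin", "cpu"],
    ["buffer_pin", "io", "buffer_pin"],
]

# 1. Which state do most backends pass through?
visits = Counter(state for trace in traces for state in trace)
hottest, _ = visits.most_common(1)[0]
print(hottest)  # buffer_pin

# 2. Compare measured demand against measured capacity for that state
# (both numbers would come from the traces; here they are invented).
capacity_per_sec = 100_000
demand_per_sec = 200_000
if demand_per_sec > capacity_per_sec:
    ratio = demand_per_sec / capacity_per_sec
    print(f"bottleneck: {hottest}, need {ratio:.0f}x more capacity")
```

No dashboard reading and no LLM involved: the same counting logic produces the same verdict every run.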
And 1 last idea that I have and have already tried to implement: the system can find deviations.
For example, for this query ID the common wait event profile is, I don't know, 1 millisecond on CPU, 1 millisecond on I/O, and after that some time on locks.
But instead of 40 milliseconds it took 50 milliseconds, and it's because a new wait event appeared in between.
So again, I think it's very easy to write the code to make this analysis in a semi-automatic way.
So, like in Oracle, you open a 1 hour AWR report, but at the bottom you will have: I made the analysis for you, the problem is here, the query with this query ID changed its wait event profile from that to that, and you have a problem with, I don't know, this disk or file or something like that.
So it's a kind of a project, but I think it's doable nowadays.
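The deviation idea could look roughly like this in code; the baseline profile, event names, and numbers are all hypothetical:

```python
# Hypothetical per-call wait event profile (ms) for 1 query ID:
# the known baseline versus a fresh observation.
baseline = {"CPU": 1.0, "IO/DataFileRead": 1.0, "Lock/tuple": 38.0}
observed = {"CPU": 1.0, "IO/DataFileRead": 1.0, "Lock/tuple": 38.0,
            "LWLock/LockManager": 10.0}

def deviations(base, obs, threshold_ms=1.0):
    """Wait events that are new or grew noticeably versus the baseline."""
    return {ev: t - base.get(ev, 0.0)
            for ev, t in obs.items()
            if t - base.get(ev, 0.0) >= threshold_ms}

print(deviations(baseline, observed))  # {'LWLock/LockManager': 10.0}
print(sum(observed.values()) - sum(baseline.values()))  # 10.0 ms slower
```

This matches the 40 ms versus 50 ms example: the report can name the exact wait event that accounts for the regression, per query ID.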
Michael: Going back quickly: you mentioned, let's say we're self-managing Postgres today.
Presumably this tool would already work for us.
So I'm wondering what the changes in core would be that you want to make.
What added benefit would that have?
Or is it mostly about getting it to work on managed service providers, or something else?
Dmitry: On the 1 hand, this tool, I think, would enable a lot of debates about making perfect wait event analysis.
But moving these capabilities into the Postgres core will reduce the overhead.
And on the other hand, you will have it in vanilla Postgres everywhere.
So you would not need to bring any tool, you would not need to have access to the box, because not all Postgres DBAs have it, right?
Nikolay: You would have an interface, maybe a SQL interface or something.
Dmitry: Yeah.
So it's definitely going to be
available through the SQL views.
Yes.
Nikolay: But the question is, we haven't started any discussions on the hackers list yet.
Right.
Yes.
So this is the interesting point: if, when, what will they say?
Because...
Dmitry: Yeah.
And that's why I, how to say, want to make this patch really solid, at least at first look, because I expect a lot of pushback, since there is a kind of consensus for some reason.
When I talk at any conf, at any place, to production DBAs, everybody says: yes, we need it, no doubts.
But when you talk to the people who have some weight in the community, who can really promote something, they, how to say it, I would say they spend less time with real production issues, and that's probably why they push back against any changes in core that could introduce performance degradation, even if it brings a lot of new tooling for DBAs.
Because it really depends on your perspective.
I look at Postgres as a production DBA; that's why I think I would pay 20% overhead.
I don't care, just give me a tool.
It's already...
Nikolay: ...paid, so you don't need to go to them.
As you said, you're already working with this 10-20% overhead.
Yeah, but I guess...
Michael: Quick question on the tool as it is.
On the GitHub repo it says version 0.8 is the last 1 I saw.
What are you working towards, what does 1.0 look like?
Is it a reliability thing?
Is it a more-tests thing?
What's missing, that you don't want to call it 1.0 yet?
Dmitry: Yes.
In some tests it still gets out of memory, so it seems there is something there.
So yes, I need more tests, to increase the quality of the code.
And I want to run it on a real box, not a virtual machine from some cloud provider, to be able to say to people that it's, if not production ready, at least safe to run somewhere in a test environment.
Because nowadays I still cannot say that for sure.
But again, I thought I was really close with the patch, close to making it ready for the community, to start a conversation.
I know that it's a really long journey, and I wanted to start it.
And as soon as I had the feeling that I was really close to it, I invested all my time over the last couple of weeks in the patch, and I didn't commit anything to this tool.
It's kind of paused, but I will get back to it for sure.
But yes, it's a question of quality.
Nikolay: A couple of questions for those who
are interested in trying it out
early. First of all, what are the
best use cases for it in its
current state, not some future state?
And second, how do you start
quickly?
Just clone the repo and that's it?
Dmitry: Yeah, at least I
tried to keep the readme up to
date, and it should be sufficient
just to go through the readme
to compile it and run it.
I tailored it for my use case,
I mean for my day-to-day life,
and I hope that it's a good example
and people should be able
to use it right after
cloning.
Nikolay: I obviously see, for example,
that for the next big benchmark
it's probably worth
using it to see how it helps
to understand what's happening
with that benchmark.
But what are the other cases:
troubleshooting
a complex situation in production
and trying to reproduce it on
a clone, or something else?
Dmitry: I think if, for example,
you have a problem and, fortunately,
you can reproduce it in some
isolated environment, but you just
don't know what happens during
the problem, you can run the tool,
and it's like looking at
the problem with a microscope.
It will trace everything, it will
show everything to you.
So I think at this moment it's
good enough for that type of usage,
investigating a problem in an isolated
environment.
But if you're brave enough, you
can try it on something close to
production load, because you
can always just kill it.
Nikolay: Brave enough, or desperate
enough.
Sometimes you have a problem you
cannot reproduce, so let's go
to production, if it's self-managed.
In that case, someone goes and
runs it in production, because
otherwise it's just not possible
to understand what's happening.
Do you recommend attaching to only
one backend, or analyzing
all of them?
What's the better way to use this
tool?
Dmitry: I built it to trace
everything, including the auxiliary
Postgres processes,
because sometimes that's where
the problem is.
I built the tool with this
in mind: just trace everything
and have everything.
Nikolay: But is it possible to narrow
it down to only one session, only one backend?
Is that possible?
Dmitry: Yes, possible as well.
Nikolay: Good.
So the risk
is lower if you decide to analyze
only one backend, right?
Then...
Dmitry: Yeah, theoretically, yes.
The surface should be smaller, yes.
Nikolay: Yeah, that's great.
I definitely plan to use it
in some upcoming
benchmarks.
I had some benchmarks last Friday,
and I just realized I should have involved
this tool, but I'm going to
revisit those benchmarks.
They are fully scripted, so I guess
I will just try it and see what
it says.
It's about this queuing Postgres tool.
I will definitely connect with you
about that.
Anyway, I'm excited to see this
work.
It's a very unexpected angle, for me,
on active session history
analysis, and it actually dissolves
one of the biggest concerns
about the ASH methodology, which is
sampling.
pg_stat_statements is precise,
unless a query ID is evicted, or
unless we talk about
cancelled statements,
which pg_stat_statements doesn't see.
But it is precise: it's just
cumulative counters incrementing,
tracking everything.
I really liked pg_stat_statements
when it was created, because
before that, pre-pgBadger, we had only
pgFouine, it's
a French name, a PHP script,
and it was based on sampling;
everything related to logging
is painful, and sampling is
painful.
So okay, now we have
an exact, precise instrument
to work with.
But ASH, ASH is impressive,
a great approach, but it's sampling
based.
Now you say ASH can be precise.
And this is like a new horizon
opening for me.
I'm super excited.
Thank you for making this work.
It's great.
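The sampling concern Nikolay describes can be shown with a toy simulation (this is not pg_wait_tracer code; the workload and the 10 ms sampling interval are invented for illustration). A sampler sees at most one wait event per tick, so a burst of very short waits mostly goes unobserved, while an event-based tracer records every one:

```python
import random

random.seed(1)

# Toy timeline: 1,000 short waits (1 ms each) and 10 long waits (100 ms each),
# laid end to end in random order. Total wall time: 2,000 ms.
events = [1.0] * 1000 + [100.0] * 10
random.shuffle(events)

intervals, t = [], 0.0
for d in events:
    intervals.append((t, t + d))  # (start, end) of each wait event
    t += d

# Sampling-based ASH: poll every 10 ms and note which event is active.
seen = set()
s, i = 0.0, 0
while s < t:
    while intervals[i][1] <= s:  # advance to the event active at time s
        i += 1
    seen.add(i)
    s += 10.0

# An event-based tracer would record all len(events) waits; the sampler
# can observe at most one event per sample (200 samples here).
print(f"actual events: {len(events)}, events seen by sampler: {len(seen)}")
```

The aggregate time per wait event class is still estimated reasonably well by sampling; what is lost is the exact count and sequence of individual short events, which is exactly what precise tracing recovers.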
Dmitry: And when I was working on this
tool, and working on the patch
for Postgres, I understood that
there are some gaps
in attributing time to
queries.
It's not really straightforward
to understand.
For example, ClientRead is a
wait event that we see really
often, but is it really database
time, or is it still client time?
It's not that simple and easy.
And if you go even deeper,
there's this example: you end
the query.
I mean, you open a transaction with BEGIN,
you make some changes, the
last UPDATE finishes, and you issue
COMMIT or END.
And really, you need to attribute
the time and waits to that END,
to the COMMIT.
I think that's the default behavior.
But even after the commit, there
is some work that Postgres
does, some cleaning up.
And from pg_stat_statements' perspective,
it's already not there.
So some part of the work is
unattributed, and it goes
nowhere.
It's not really a big amount of
work, but still, playing with
this precise tool, if you
are curious enough, will get
you to the next level of understanding
of how Postgres actually
works, down to the internals.
Really interesting.
I really like it for that, because
I never thought about it until
I started working with the tool.
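The attribution gap Dmitry describes can be sketched as a toy accounting exercise (all phase names and durations below are invented, purely illustrative; this is not how pg_stat_statements actually measures anything). The point is that wall-clock transaction time splits into pieces, and not all of them land on a statement:

```python
# Toy model: phases of one transaction, in milliseconds, with a flag saying
# whether pg_stat_statements would attribute that time to some statement.
phases = [
    ("BEGIN",                    0.1, True),
    ("UPDATE",                   4.0, True),
    ("ClientRead (think time)",  2.0, False),  # arguably client time, not DB time
    ("COMMIT (WAL flush)",       1.5, True),   # attributed to COMMIT/END
    ("post-commit cleanup",      0.4, False),  # after the statement has "ended"
]

wall = sum(d for _, d, _ in phases)
attributed = sum(d for _, d, a in phases if a)
print(f"wall = {wall} ms, attributed = {attributed} ms, "
      f"unattributed = {wall - attributed:.1f} ms")
```

A precise tracer sees all the phases on the timeline, which is what makes the leftover slices visible in the first place.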
Nikolay: And it's true that not
only can you understand the
exact sequence of wait events for
a particular load, you can
also map it to the source code,
right?
Dmitry: Of course, and nowadays an LLM
can explain it to you, so it's
not a foreign language anymore;
you can talk to it, and it's
really easy to understand.
Nikolay: Which opens doors for
more optimization of Postgres
itself?
Dmitry: I hope so, yes.
Nikolay: Like usually with GDB
or with perf, now it's possible
with this tool as well.
If you have some pathological workload,
you can reproduce it
in a synthetic environment and dig in.
That's it, yeah, that's
great.
Michael, you wanted to add something,
right?
Michael: Only that when you were talking
about unexplained overhead:
it often shows up in EXPLAIN ANALYZE too.
When you look at the difference
between the execution time at the
end of the query plan and the
actual total time of the
top-level operation, you almost
always see a small discrepancy
there as well, not accounted
for anywhere, which is the same
thing you're talking about,
I think.
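The discrepancy Michael mentions can be read straight out of `EXPLAIN (ANALYZE, FORMAT JSON)` output. The key names below are the ones Postgres actually emits in JSON-format plans; the timings themselves are made up for illustration:

```python
import json

# Trimmed EXPLAIN (ANALYZE, FORMAT JSON) output. "Plan", "Actual Total Time",
# "Planning Time", and "Execution Time" are real Postgres keys; the numbers
# are invented.
raw = """[{
  "Plan": {"Node Type": "Seq Scan", "Relation Name": "t",
           "Actual Total Time": 95.0},
  "Planning Time": 0.2,
  "Execution Time": 100.0
}]"""

doc = json.loads(raw)[0]
top_node = doc["Plan"]["Actual Total Time"]
execution = doc["Execution Time"]

# Time spent inside the executor but outside the top plan node
# (startup/shutdown and other work no node accounts for).
print(f"unaccounted: {execution - top_node:.1f} ms")
```

In real plans this gap is usually small, a few percent, which matches the "few percent here and there" that follows.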
Dmitry: Yeah, a few percent here
and there, and you've lost 10 bucks.
Michael: Yeah, well, I'd like
to echo what Nik's saying: it's
really interesting work that you're
doing, and thanks for sharing.
And also, I think you published
the tool under the PostgreSQL
license, which is really cool.
Nice one.
Dmitry: Yeah, I personally believe in that:
if you develop for Postgres,
then just share it.
Nikolay: And what's that book on queueing theory
you've been reading for quite
a long time, as I know?
Dmitry: Yeah, it's quite an old
one, so let me...
Nikolay: Because it connects
to my current work, like this
tool I just released.
And I guess I need to read it as
well.
Dmitry: I will send it so you can
put it in the description.
Nikolay: Okay, good luck with the
tool, definitely; I hope it
will be popular.
And I hope your patches will be
welcomed on the hackers mailing list.
Dmitry: Yeah, we will see how it
goes.
Nikolay: Great, thank you so much
for coming again.
Michael: Thank you.