pg_wait_tracer
Nikolay: Hello, hello, this is
Postgres.FM.
My name is Nik, PostgresAI, and
as usual, my co-host is Michael
of pgMustard.
Hi, Michael.
Michael: Hi, Nik.
Nikolay: And we have a great guest
today, Dmitry, who is a very
experienced DBA, originally Oracle,
now Postgres, already many
years.
Hi, Dmitry.
Dmitry: Hi, thanks for inviting.
Nikolay: Thank you for coming.
So for me it was a surprise when you built this tool; I didn't expect it to come to the Postgres ecosystem.
But obviously it's something new, and in an area that has been among my personal key interests lately: active session history analysis, wait event analysis.
But you did it very differently compared to what we usually did with pg_wait_sampling or other tools, like Performance Insights in RDS.
So it's a great tool.
Where to start?
Dmitry: Yeah, so I think we can
start with the time when I first
started working with Postgres,
as I started switching from Oracle.
In Oracle, 1 of my favorite topics and tasks was always performance-related investigations.
And when I switched to Postgres, of course, I started looking for similar tasks and understood that the way of Postgres troubleshooting is very different from how it's done in Oracle.
And I definitely felt the lack of tooling, lack of a standard approach, and so on.
Okay.
Okay.
So after some time, I realized that there are some tools that can bring a similar kind of feeling to what I had in Oracle.
And I'm talking about wait event analysis, because I know in the Oracle world it's a given thing.
Yeah.
Yeah.
Nikolay: I'm going to interrupt.
So when did this switch happen
from Oracle to Postgres for you?
Dmitry: Recently I checked, it
happened 10 years ago.
Nikolay: 2016 roughly.
I remember very well a big wave of people coming from Oracle in 2013-14, and all of them were complaining about the lack of wait event analysis.
This is how the 2 columns were added to pg_stat_activity.
It was maybe a couple of years before your switch, basically, and you had this same feeling that these tools are needed.
Dmitry: Yeah, because when you have worked with Oracle since, I don't know, 2002 or something like that, it's a given way of troubleshooting.
Any Oracle guideline, any book, any training, official or non-official, says: just open the active session history.
It's the gate you go through first, the place you start.
And so I tried to use the same approach with Postgres, because for me it was so obvious why it's the way to analyze performance.
Because basically the wait event tells you why it was slow, where the session, or the database in general, spent the time.
You do not need to guess whether it was I/O or CPU, basically where the time went.
Then you still need to investigate why it happened that way, but at least you know where.
Because the regular approach in Postgres can only say that this query was slow on average, or with some hacking you can get some precise per-execution numbers, but it does not give you the idea of why it was slow.
Okay, yes, you can use EXPLAIN ANALYZE to see where in the plan you spent the time, but again, it does not explain why this join sometimes takes, I don't know, 1 second and sometimes 10 seconds; from the plan perspective it looks the same.
And the tooling for wait event analysis evolved year by year.
And so we have pg_wait_sampling.
And since new wait events kept being introduced as well, it became really a tool that gives good data, a good source for wait event analysis.
And I use it everywhere I manage Postgres.
If it's self-hosted Postgres, for sure I will convince the owner of this Postgres to install pg_wait_sampling.
It's my default way of analyzing.
On all my dashboards, it's the
first panel you see when you open
any dashboard that I created.
So it's a wait event analysis.
I convinced all my colleagues to
use this approach.
You convinced me
Nikolay: To put this ASH graph at the top of the first dashboard, which I resisted initially, because the first dashboard for me was meant to be very shallow and very wide troubleshooting, where you identify areas.
But you convinced me to put the ASH graph at the very top, and now I'm convinced.
Yeah, it's a great idea.
It very quickly helps you understand the pain points for performance.
Dmitry: Yeah, I will try to make it brief: 2 examples where wait event analysis shines.
So 1 was a hardware migration from, let's call it, old hardware to new hardware.
The data files were stored in a storage appliance, so we just remounted; the data layout was the same.
Everything was the same in terms of the data layer, the storage layer, but the compute part changed, and changed to a higher spec: the number of virtual cores increased, the clock increased, everything increased.
All tests were good.
But when we started loading production, we saw a really huge degradation in performance.
And what would you think?
A plan flip or something like that?
But everything is absolutely the same, just the hardware is different.
So it basically cannot be the plans, because the parameters are the same.
Nikolay: Your slide deck has a very interesting example, very simple, so anyone can understand it, about contractors doing some work.
In this very example you describe, it's basically like you hired a different team, said: do the same work, and it does it very differently.
Right.
Yeah.
So, just to explain that example for those who haven't seen it: there are contractors, and they say, we are busy, we spent so much time.
And without wait event analysis, you just know they are busy.
But what exactly they were busy with, you don't know.
And in this case: okay, they were busy with I/O or something else, right?
Dmitry: And yeah, exactly.
Nikolay: Switching hardware, you see a different picture.
So what was at the end of that story?
Dmitry: So the problem was the lock manager LWLock; you know, it's 1 of your favorite LWLocks.
Nikolay: A lot of dots are connected here, because we spoke about so many problems explored with wait event analysis on this very podcast.
Right, Michael, you remember the lock manager?
Michael: Yeah, of course.
It feels like whenever there's a system-wide issue, it's such a great starting point, right?
Dmitry: Sorry, that's the reason for my next case, which I would say is at a lower level.
Nikolay: Let me first disagree with Michael, because I agree on 1 side, system-wide.
But on the other side, more and more I think we just don't have this tool yet, but it should exist.
pgMustard should have it: we have plan analysis, and for a single call it would be great to see the wait event profile as well.
Because imagine we were blocked by a lock acquisition that was pending.
You will see a lot of time spent, buffer numbers are low, and that's it, you need to guess.
With wait event analysis for a single query execution, you will see: okay, this was the Lock wait event type.
Michael: But wait, I think this is where we get into the core of what Dmitry's done, which is cool, which is moving from sampling to tracing.
Historically it has been reliant on...
I do think it's really important.
Maybe I'm wrong, but my understanding is that sampling...
Nikolay: I had the same expectation as well: it's about single backend tracing.
But let's move on and hear what Dmitry will say.
Exactly.
Dmitry: Yeah, the next case is when wait event analysis really saved a lot of time, and it's a single backend analysis case.
It does not work only at the instance level; it also works for a single backend.
I had a case: I got the complaint that a query sometimes takes a few milliseconds and sometimes is killed by a 1 second timeout.
We had, I don't know, 100 calls, 10 of them failed with the 1 second timeout, and the query is a primary key lookup.
So they cannot mess up the plan in this case; it's just a B-tree primary key lookup, very simple.
No joins, nothing, just a very simple lookup.
Nikolay: And it's not lock manager contention, because in many cases lock manager contention is exactly about primary key lookups.
It was not that.
Dmitry: Yeah.
Okay.
So you open the dashboard and you don't see any spikes of any specific wait events.
So it's definitely some very low-level problem with a specific backend.
I also captured, just in case, the EXPLAIN; the plan was the same.
Everything was the same, but sometimes it goes from a few milliseconds to 1 second.
And I started sampling: I made some custom scripts that would just catch the needed backend and start sampling it.
And I saw that when it's slow, it's on I/O.
I mean, I saw the I/O and thought: how can that be, it's just a very simple read, it reads very few blocks from disk.
So with the wait event analysis I understood it's probably nothing else but just slow I/O in some cases.
And with eBPF, tracing these backends, I understood that it's a problem with long tail latency on the storage appliance that we use.
Sometimes it goes from 1 millisecond to tens of milliseconds.
And indeed, if you're not lucky enough and you need to read these blocks from disk, not from memory, not from the OS cache, even these few blocks can add up to a second.
And yeah, I think without wait event analysis it would just be a guessing game.
But with this, in just a couple of days I found the problem.
Michael: A couple of quick questions.
1 is: you can get I/O timings in EXPLAIN ANALYZE output.
Obviously it's a bit tricky if they're always getting cancelled.
I don't know how you got the EXPLAIN ANALYZE for the queries that were timing out, actually.
How did you do that?
Dmitry: I just increased the timeout for some portion of them, to let them finish in a few seconds.
Nikolay: It's about 0, right?
Michael: Did you have I/O
Timings on the server?
Dmitry: Yes, I do.
But when you have 1K calls a minute of this type and just a very few of them fail, you cannot track it.
It's 1 of my favorite topics: averages lie every time.
You have to have P99, and better, more nines, to see the actual picture of the system.
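As a quick illustration of why averages hide this kind of failure, here is a toy calculation; the latency numbers are invented, not from the incident:

```python
import statistics

# Hypothetical latency sample: 990 fast primary-key lookups (~1 ms)
# and 10 slow ones that hit the storage long tail (~1000 ms).
latencies_ms = [1.0] * 990 + [1000.0] * 10

mean = statistics.mean(latencies_ms)
p99 = statistics.quantiles(latencies_ms, n=100)[98]  # 99th percentile

print(f"mean = {mean:.2f} ms")  # ~11 ms: looks almost healthy
print(f"p99  = {p99:.2f} ms")   # the long tail only shows up here
```

The average barely moves, while the P99 exposes the 1-second outliers that were hitting the statement timeout.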
Nikolay: Just to push back,
Michael: I think this is a really interesting topic.
Actually, in this 1 specific case, auto_explain might have been okay.
That's where it can be really good: catching the outlier queries that go beyond a set threshold.
As long as you've got the I/O timings on, in this 1 case we might have seen: oh, I/O timings are the issue for these slow queries.
But I'm sure there are other examples where the wait events would have shown something more interesting.
Nikolay: But wait event analysis... or no: auto_explain will show you the plan, will show you timing if you enable it, will show you buffer numbers if you enable it.
Good.
But it won't show you a wait event profile.
Although now I think: why not?
It's probably a good idea.
Maybe it's even better to have it in auto_explain than in manual EXPLAIN.
Michael: I think you're right, but I was talking about track_io_timing.
You know the parameter track_io_timing?
Nikolay: Yeah, I/O timing, great.
But first of all, it works at a higher level.
And what if it's not I/O timing but IPC, or LWLock, or Lock, or something?
A full picture would be great to have, right?
Also, here we talk at the wait event type level; we have how many, 7, 9 wait event types, but how many wait events?
Hundreds.
So it's a very precise understanding of what the system was doing.
So just imagine auto_explain optionally logging a wait event profile.
I think it's a good idea for those who want to hack Postgres right now.
Michael: Yeah, me too.
And I think that transitions us onto: why don't we have this?
What are the downsides?
What's the overhead?
What are the tricky parts of this?
Dmitry: And we will go there, yeah, because it's definitely a thing that I cannot get out of my mind for weeks already.
With this last case, this single backend problem, I reflected on the experience: I spent a couple of days on it, and I thought, what tooling could help me finish this kind of investigation next time in hours, and better, probably even enable dev people to do it instead of DBAs.
And Oracle has a very simple answer for this type of problem: the 10046 event.
It's tracing.
When you enable it at a certain level, it writes everything into a file: every single syscall, every single tuple, the row sources, everything.
I mean very detailed, real tracing.
And I decided to create the pg_10046 extension that would do this in Postgres.
And the prototype was really good; I'm really proud of it.
For wait events, Oracle also traces the wait events.
And I did the same, but I had to sample them, because by that time I hadn't found any tracing approach for wait events in Postgres.
When I started posting about this on LinkedIn, Jeremy Schneider came and commented: while it's really nice, you cannot call it 10046, because the wait events are not traced.
Very true, but there was no way in Postgres to do it.
Nikolay: Yeah, yeah, 1 second.
I just wanted to send a big hello to Jeremy Schneider if he listens to this, and I hope 1 day we will have him on this podcast.
Jeremy Schneider is important because he was making this popular, blogging about ASH a lot.
And he was on the RDS team when they started to do Performance Insights and so on.
So it's great that you connected with him over LinkedIn.
Michael: I think he was also involved in the very good wait event documentation that RDS and Aurora have.
Nikolay: Oh yes, we discussed this
with Dmitry offline a lot.
Yeah.
Dmitry: Yes, really, a lot of kudos to them for this work.
Okay.
And again, I was brainstorming with all the LLMs I had at that moment.
And I started complaining; sometimes I talk to them like to real people, and I was complaining: yeah, I have a really nice extension, but the lack of real tracing is a big caveat.
And Claude said to me: but you can trace.
And I said: no way.
How?
And it explained to me in detail how I can really trace the wait events.
And for 2 days I challenged this idea a lot, for a reason: it could not be that I was the first to find this (okay, not me, Claude found it), and that no 1 in the Postgres world had used it before.
Either it just does not work, or someone should have found it before.
Anyway, 2 days of testing this idea showed me that it basically works, and I started prototyping the tool.
It's still in a very immature state, but it's already showing nice pictures and can already help to find some interesting, how to say, correlations.
For example, I found that, you know, most extensions still use the generic Extension wait event.
And when an extension does something, you really cannot distinguish whether it's doing real work or just sitting in a wait loop.
When I used pg_wait_sampling, for example, to cross-check the numbers I get from my tool against pg_wait_sampling, I understood that the background worker of pg_wait_sampling, even if it does nothing, because it works in the background and there is no job for it, is still shown in a waiting state there.
And I think since PG17 you can have custom wait events for extensions.
I really hope, if someone is listening: please use custom wait events in your extensions.
It will really help to understand whether your extension is busy doing some valuable work or just spinning in a loop.
Nikolay: Right.
This is why in my pg_ash too, listening to Dmitry, I spent some time analyzing the colors.
So green is what's usually called CPU, and this is why in pg_ash and in our monitoring tool CPU has an asterisk: CPU* means it's CPU, or maybe not.
We also discussed this a few times on our podcast.
So yeah, this is exactly why: some extensions are doing something.
Yeah.
Michael: I was just thinking whether pg_ash would show up in pg_ash with its own custom wait events.
But then I remembered it's not really an extension, so it can't.
Exactly.
Sorry.
Yeah.
Nikolay: True Postgres.
Dmitry: So that's basically the story of how this came to life.
So, shall we show it?
Nikolay: We cannot show it on a podcast, unfortunately.
But just to recap: you complained to an LLM that something is impossible, and the LLM said, let's do it.
Dmitry: Yeah.
And this is how it was.
Nikolay: This is crazy.
It means that it's good to sometimes complain.
Maybe you just don't know how, and it's possible.
It's great.
Dmitry: Yeah.
Brainstorming with an LLM, I think it's 1 of the best features.
Yes.
Nikolay: Similarly, in our monitoring: I remember I was talking to you, and you said it's really hard, or maybe impossible, to show the beginning of the query in query analysis in Grafana.
And that's it.
Dmitry: You mean the query text.
Nikolay: Yeah.
We see the chart, we see query IDs, but it would be good to see the query itself.
Is it a select or an insert or something?
So I just went and did it with AI, and it's there right now.
Because we had a limiting belief that it's not possible.
It's interesting.
So yeah, AI helps here.
Sometimes it doesn't help, but here it definitely helps to explore new options and create something.
Dmitry: Yeah.
The fast validation of a new idea, that's also 1 of the fantastic features.
You can really quickly check whether an idea basically works or not.
Of course, building mature software will take time; an LLM cannot do it in 1 night.
But prototyping an idea, cross-checking it, yes.
Nikolay: So this new tool is tracing.
Is it true that it won't miss any wait event, or does it still have some sampling?
Because when you showed me the pictures, which look amazing... I still think you need to add those pictures to the README, because this is super impressive.
And it was microsecond precision; my mind was blown.
It's really crazy.
But then we started talking, and Dmitry actually came to our hacking sessions, which happen on Wednesdays online on YouTube.
So a couple of times we sat and thought about how to do something, and that something was about changing Postgres, right?
So why is it needed to change Postgres for this tool?
Dmitry: No, for this tool we do not need to change Postgres.
But the thing is, this tool has overhead.
Nikolay: Observer effect, right?
Dmitry: Yes.
For more or less regular load, not pathological, let's call it an acceptable level, like 10%, plus or minus.
You pay overhead for every wait event transition, when you go from 1 to another.
And if you get more transitions per second, you pay more overhead.
The maximum I got was 220K transitions per second.
And in that case, with 220,000 transitions, my overhead was basically 30% of the query duration.
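A back-of-envelope calculation of what those numbers imply, assuming, for the sake of the sketch, a roughly fixed cost per wait event transition:

```python
# Numbers from the episode: 220K transitions/s cost ~30% of query time.
transitions_per_sec = 220_000
overhead_fraction = 0.30

# Overhead per second of wall time, spread over all transitions:
per_transition_us = overhead_fraction * 1_000_000 / transitions_per_sec
print(f"~{per_transition_us:.2f} us per transition")  # ~1.36 us

# The same per-transition cost at an assumed "regular" rate is far cheaper:
typical_rate = 20_000  # transitions/s, invented for illustration
typical_overhead = per_transition_us * typical_rate / 1_000_000
print(f"~{typical_overhead:.1%} overhead")  # ~2.7%
```

This is why the observer effect scales with the transition rate: the synthetic workload is an extreme, and a calmer workload pays proportionally less.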
Nikolay: Was it with a very tiny buffer pool, tiny shared_buffers?
Dmitry: Yes.
That's why it needs to go to the OS to get the block, but the block is in the OS cache, so it's really fast.
But you need to go out for almost every block.
And even for evicting, you need pinning, unpinning, all this kind of stuff.
So in this case I was able to generate this number of transitions.
And while it's, how to say, not an edge case, it's very synthetic.
But anyway, it shows that there is overhead; you cannot avoid it.
And when Jeremy Schneider raised this concern that it's not tracing, he also said that the patch for Postgres could be very simple: you just add 2 probes in the function that is used to set and unset wait events, and basically that's it.
Nikolay: And then you created this...
Dmitry: And then you can...
Nikolay: ...use eBPF?
Yes.
Yeah, yeah.
Dmitry: And then you created this patch for...
Nikolay: I wanted to say that even if, as you said, for a regular workload it's not 30% but 10% overhead, observer effect, it still sounds super useful to me in a lab environment.
So if we reproduce some problem and study it, not in production but on some clone or something, right?
I'm ready to pay 10%, 20%, maybe even 30%, because it gives me an exact understanding of everything that happened.
But of course it would be good to have a very low observer effect and have it in production right away.
Dmitry: To be honest, I personally would pay even 20% in production, if it instruments things so that I can find the problem really fast.
Because once I spent, I don't know, 2 or 3 days finding where the problem was: there had been some change in the security configuration of the server, and basically everything became very slow.
And again, I spent a lot of time writing custom tooling, custom scripts, to get to the problem.
If I had really good instrumentation, like we are discussing now, it would have taken a couple of hours to find the problem.
And as Tanel Põder said in 1 of the comments (he was also triggered by these pictures), observability is an investment; it's not for free.
In the Oracle world, for example, you just pay it, you just don't know it, because you cannot have Oracle without the instrumentation.
I think they instrumented the kernel in version 7.
Nikolay: And you cannot strip it out and run it without it.
Tanel Põder, worth mentioning, is a very noticeable expert in the Oracle world.
It's great that paths also connected here.
Dmitry: Yeah, well...
And that's why I think wait event analysis capabilities should be in the Postgres core.
And there are 2 ways of doing it.
A small patch that you did with this...
Nikolay: Did I do some patch?
I already forgot.
Dmitry: Yes.
Yeah, yeah.
But I wanted to go further and build the patch that will bring an Oracle level of wait event analysis: bring the system time model, the tracing at the backend level.
At the beginning I was skeptical, and I thought the overhead would be similar to the tool's, because why should it be much less?
But in reality it's way less.
I have this patch on my GitHub, and for the last couple of weeks I have been iterating on it, making it better and better, and I hoped it would be ready to propose to the community just today.
But today I was analyzing: there was a commit in master that breaks a lot.
They changed some memory management, and it basically breaks some parts of my code, and I need to rework them.
Nikolay: Yeah, and now there is the heavy work of preparing Postgres 19, and maybe slightly later it will be less super active in terms of code changes, I guess.
I wanted to mention that there was more than 1 limiting belief.
First it was: it's not possible.
Now you think: the observer effect is too high.
And obviously it's not so high.
And I remember upgrades with pause-resume in PgBouncer: I had such a big belief it wouldn't work, but we see now how beautifully it works under huge loads on huge clusters.
So we just go and check, right?
Now experimentation is cheap with AI.
It's great to check instead of thinking it's not possible or hard.
It's super inspiring to hear what you say.
Dmitry: Yeah.
And back to the tool: I still think it's very applicable, in case my patch, or a similar patch with a similar idea, is not committed upstream any day, or you cannot afford new versions, so you will stick with some vanilla Postgres without these new capabilities.
The nice thing is that it's fully independent from Postgres.
It runs like a sidecar, so you can switch it on and off.
You do not need to recompile Postgres, you don't need to change anything in Postgres; it just runs next to it.
Yes, you need to self-host Postgres, you need to have access to the box.
Nikolay: Root access.
Dmitry: Not root, but with elevated access rights for running eBPF, basically.
And that's the nice thing about tracing.
As soon as it's pure tracing, not sampling at any moment, it opens doors for very interesting things that are not even implemented in the Oracle world.
And here we can beat Oracle in Postgres, because it's tracing.
It means that we have a full history of transitions from 1 state to another state to the next state and so on.
And you can reconstruct all performance problems, clashes, or bursts in your system, not only in time, but also find the dependencies: who started the problem.
Because with wait event analysis based on a sampling approach, for example, you cannot distinguish between 10 short I/O waits and 1 long one.
With tracing, you can, and you can find the session that started the problem.
Because in the regular way, it's really hard to distinguish whether a session suffered from the problem or caused it.
So with tracing, I believe, you can find the root of the problem and all its consequences.
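A toy model of the difference Dmitry describes: sampling a 1-second timeline every 10 ms sees the same number of "in I/O wait" samples whether there was 1 long wait or 10 short ones, while a trace preserves every transition. All intervals here are invented:

```python
SAMPLE_EVERY_MS = 10

def samples_in_wait(waits, horizon_ms=1000):
    """Count fixed-interval samples that land inside any wait interval.
    waits: list of (start_ms, end_ms) half-open intervals."""
    ticks = range(0, horizon_ms, SAMPLE_EVERY_MS)
    return sum(any(s <= t < e for s, e in waits) for t in ticks)

one_long  = [(0, 100)]                                   # 1 wait of 100 ms
ten_short = [(k * 100, k * 100 + 10) for k in range(10)] # 10 waits of 10 ms

# Sampling reports the same picture for both workloads...
print(samples_in_wait(one_long), samples_in_wait(ten_short))  # 10 10

# ...while a trace keeps each transition, so the wait counts differ:
print(len(one_long), len(ten_short))  # 1 10
```

The two workloads stress the storage very differently, but the sampled profile cannot tell them apart; only the transition history can.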
And with all these transitions, you can see the system as a set of servers and analyze it from the queuing theory perspective.
You can use the already really mature mathematical apparatus; everything is already there.
You can analyze how much you can serve in a given period of time, and you can find the real bottleneck.
For example, with regular wait event analysis you can see a lot of I/O, that I/O is the bottleneck.
But in reality, if you see from the tracing that after the I/O a backend always takes an LWLock, for example, for any reason, then even if you were able to solve the I/O problem, for example by finding faster disks, the I/O would be shorter, but you would anyway be blocked on LWLocks.
So the bottleneck would just move to the next stage.
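The bottleneck-shifting argument can be sketched with made-up per-request service demands for each stage a backend passes through:

```python
# Invented per-request service demands (seconds) at each stage.
demands = {"io": 0.0020, "lwlock": 0.0015, "cpu": 0.0010}

def bottleneck(d):
    # With 1 "server" per stage, max throughput is 1 / largest demand
    # (the basic bottleneck law from queuing theory).
    stage = max(d, key=d.get)
    return stage, 1.0 / d[stage]

print(bottleneck(demands))  # ('io', 500.0) requests/s

# Buy faster disks: I/O demand drops 4x, but throughput does NOT rise 4x,
# because the bottleneck just moves to the LWLock stage.
faster = dict(demands, io=0.0005)
print(bottleneck(faster))   # bottleneck moves to 'lwlock', ~667 requests/s
```

With only a sampled wait profile you would see "mostly I/O" and buy disks; the trace-level demand numbers show in advance that the win is capped by the next stage.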
Nikolay: This is what makes benchmarking so hard.
Yes.
And what I hear, even if I don't know all the details, is that there is a mathematical apparatus, basically, which could help here analyze bottlenecks much better and identify the real bottleneck reliably.
That sounds amazing.
I want this.
Dmitry: So that's not all of it; it's not done yet.
I have more ideas that are semi-implemented, and I'm just playing with it.
Because collecting the data is not the end, and even analyzing the data is not the last step.
The last step is presenting the data to a human, making it consumable, so that a human can understand this data and take action.
So this is process analysis.
Again, really smart people have already spent a lot of time thinking about this, and they already have really good approaches to analyzing such graphs: when something moves from 1 state to another, for example tickets, when a ticket moves from open to in progress to, I don't know, test completed; the same with logistics, when you need to find the bottlenecks in your graph.
And with this math, you can find the patterns: the most frequent path is this 1, and the bottleneck is this specific state.
It means that, I don't know, most of your backends go through the buffer pin, for example, and the buffer pin is basically the limiting factor.
You do not even need to look at the dashboard with your eyes.
A system, not an LLM, simple code, can do it for you: we analyze the backends' traces, and yes, everything goes through the buffer pin.
And we see that the system can do 100K buffer pins a second, but your workload demands 200K.
So basically, you need to solve this problem here; the problem is here.
It's a very deterministic approach, and it can produce a report really easily.
You do not need hallucinations, you don't need to look at your dashboards for that.
So I think it opens doors for really interesting stuff.
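A minimal sketch of that deterministic check, with invented traces, state names, and capacity numbers:

```python
from collections import Counter

# Invented per-backend traces: sequences of wait states from a tracer.
traces = [
    ["cpu", "buffer_pin", "io", "buffer_pin", "cpu"],
    ["cpu", "buffer_pin", "buffer_pin", "cpu"],
    ["buffer_pin", "io", "buffer_pin"],
]

# 1. Which state do most backends pass through?
visits = Counter(state for trace in traces for state in trace)
hottest, _ = visits.most_common(1)[0]
print(hottest)  # buffer_pin

# 2. Compare measured demand against measured capacity for that state
# (both numbers would come from the traces; here they are invented).
capacity_per_sec = 100_000
demand_per_sec = 200_000
if demand_per_sec > capacity_per_sec:
    ratio = demand_per_sec / capacity_per_sec
    print(f"bottleneck: {hottest}, need {ratio:.0f}x more capacity")
```

No dashboard reading and no LLM involved: the same counting logic produces the same verdict every run.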
And 1 last idea that I have and have already tried to implement: the system can find deviations.
For example, for this query ID the common wait event profile is, I don't know, 1 millisecond on CPU, 1 millisecond on I/O, and after that some time on locks.
But instead of 40 milliseconds it took 50 milliseconds, and it's because a new wait event appeared in between.
So again, I think it's very easy to write the code to make this analysis in a semi-automatic way.
So, like in Oracle, you open a 1 hour AWR report, but at the bottom you will have: I made the analysis for you, the problem is here, the query with this query ID changed its wait event profile from that to that, and you have a problem with, I don't know, this disk or file or something like that.
So it's a kind of a project, but I think it's doable nowadays.
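The deviation idea could look roughly like this in code; the baseline profile, event names, and numbers are all hypothetical:

```python
# Hypothetical per-call wait event profile (ms) for 1 query ID:
# the known baseline versus a fresh observation.
baseline = {"CPU": 1.0, "IO/DataFileRead": 1.0, "Lock/tuple": 38.0}
observed = {"CPU": 1.0, "IO/DataFileRead": 1.0, "Lock/tuple": 38.0,
            "LWLock/LockManager": 10.0}

def deviations(base, obs, threshold_ms=1.0):
    """Wait events that are new or grew noticeably versus the baseline."""
    return {ev: t - base.get(ev, 0.0)
            for ev, t in obs.items()
            if t - base.get(ev, 0.0) >= threshold_ms}

print(deviations(baseline, observed))  # {'LWLock/LockManager': 10.0}
print(sum(observed.values()) - sum(baseline.values()))  # 10.0 ms slower
```

This matches the 40 ms versus 50 ms example: the report can name the exact wait event that accounts for the regression, per query ID.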
Michael: Going back quickly: you mentioned, let's say we're self-managing Postgres today.
Presumably this tool would already work for us.
So I'm wondering what the changes in core would be that you want to make.
What added benefit would that have?
Or is it mostly about getting it to work on managed service providers, or something else?
Dmitry: On the 1 hand, this tool, I think, would enable a lot of debates about making perfect wait event analysis.
But moving these capabilities into the Postgres core will reduce the overhead.
And on the other hand, you will have it in vanilla Postgres everywhere.
So you would not need to bring any tool, you would not need to have access to the box, because not all Postgres DBAs have it, right?
Nikolay: You would have an interface, maybe a SQL interface or something.
Dmitry: Yeah.
So it's definitely going to be
available through the SQL views.
Yes.
Nikolay: But the question is, we haven't started any discussions on the hackers list yet.
Right.
Yes.
So this is the interesting point: if, when, what will they say?
Because...
Dmitry: Yeah.
And that's why I, how to say, want to make this patch really solid, at least at first look, because I expect a lot of pushback, since there is a kind of consensus for some reason.
When I talk at any conf, at any place, to production DBAs, everybody says: yes, we need it, no doubts.
But when you talk to the people who have some weight in the community, who can really promote something, they, how to say it, I would say they spend less time with real production issues, and that's probably why they push back against any changes in core that could introduce performance degradation, even if it brings a lot of new tooling for DBAs.
Because it really depends on your perspective.
I look at Postgres as a production DBA; that's why I think I would pay 20% overhead.
I don't care, just give me a tool.
It's already...
Nikolay: ...paid, so you don't need to go to them.
As you said, you're already working with this 10-20% overhead.
Yeah, but I guess...
Michael: Quick question on the tool as it is.
On the GitHub repo it says version 0.8 is the last 1 I saw.
What are you working towards, what does 1.0 look like?
Is it a reliability thing?
Is it a more-tests thing?
What's missing, that you don't want to call it 1.0 yet?
Dmitry: Yes.
In some tests it still gets out of memory, so it seems there is something there.
So yes, I need more tests, to increase the quality of the code.
And I want to run it on a real box, not a virtual machine from some cloud provider, to be able to say to people that it's, if not production ready, at least safe to run somewhere in a test environment.
Because nowadays I still cannot say that for sure.
But again, I thought I was really close with the patch, close to making it ready for the community, to start a conversation.
I know that it's a really long journey, and I wanted to start it.
And as soon as I had the feeling that I was really close to it, I invested all my time over the last couple of weeks in the patch, and I didn't commit anything to this tool.
It's kind of paused, but I will get back to it for sure.
But yes, it's a question of quality.
Nikolay: A couple of questions for those who
are interested in trying it out
early. First of all, what are the
best use cases for it in its
current state, not some future state?
And second, how do you start
quickly?
Just clone the repo and that's it?
Dmitry: Yeah, at least I
tried to keep the readme up to
date, and it should be sufficient
just to go through the readme
to compile it and run it.
I tailored it for my use case,
I mean for my day-to-day life,
and I hope that it's a good example
and people should be able
to use it right after
cloning.
Nikolay: I obviously see, for example,
that for the next big benchmark
it's probably worth
using it to see how it helps
to understand what's happening
with that benchmark.
But what are the other cases:
troubleshooting
a complex situation in production
and trying to reproduce it on
a clone, or something else?
Dmitry: I think if, for example,
you have a problem and, fortunately,
you can reproduce it in some
isolated environment, but you just
don't know what happens during
the problem, you can run the tool,
and it's like looking at
the problem with a microscope.
It will trace everything, it will
show everything to you.
So I think at this moment it's
good enough for that type of usage,
investigating a problem in an isolated
environment.
But if you're brave enough, you
can try it on something close to
production load, because you
can always just kill it.
Nikolay: Brave enough, or desperate
enough.
Sometimes you have a problem you
cannot reproduce, so let's go
to production, if it's self-managed.
In that case, someone goes and
runs it in production, because
otherwise it's just not possible
to understand what's happening.
Do you recommend attaching to only
one backend, or analyzing
all of them?
What's the better way to use this
tool?
Dmitry: I built it to trace
everything, including the auxiliary
Postgres processes,
because sometimes that's where
the problem is.
I built the tool with this
in mind: just trace everything
and have everything.
Nikolay: But is it possible to narrow
it down to only one session, only one backend?
Is that possible?
Dmitry: Yes, possible as well.
Nikolay: Good.
So the risk
is lower if you decide to analyze
only one backend, right?
Then...
Dmitry: Yeah, theoretically, yes.
The surface should be smaller, yes.
Nikolay: Yeah, that's great.
I definitely plan to use it
in some upcoming
benchmarks.
I had some benchmarks last Friday,
and I just realized I should have involved
this tool, but I'm going to
revisit those benchmarks.
They are fully scripted, so I guess
I will just try it and see what
it says.
It's about this queuing Postgres tool.
I will definitely connect with you
about that.
Anyway, I'm excited to see this
work.
It's a very unexpected angle, for me,
on active session history
analysis, and it actually dissolves
one of the biggest concerns
about the ASH methodology, which is
sampling.
pg_stat_statements is precise,
unless a query ID is evicted, or
unless we talk about
cancelled statements,
which pg_stat_statements doesn't see.
But it is precise: it's just
cumulative counters incrementing,
tracking everything.
I really liked pg_stat_statements
when it was created, because
before that, pre-pgBadger, we had only
pgFouine, it's
a French name, a PHP script,
and it was based on sampling;
everything related to logging
is painful, and sampling is
painful.
So okay, now we have
an exact, precise instrument
to work with.
But ASH, ASH is impressive,
a great approach, but it's sampling
based.
Now you say ASH can be precise.
And this is like a new horizon
opening for me.
I'm super excited.
Thank you for making this work.
It's great.
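The sampling concern Nikolay describes can be shown with a toy simulation (this is not pg_wait_tracer code; the workload and the 10 ms sampling interval are invented for illustration). A sampler sees at most one wait event per tick, so a burst of very short waits mostly goes unobserved, while an event-based tracer records every one:

```python
import random

random.seed(1)

# Toy timeline: 1,000 short waits (1 ms each) and 10 long waits (100 ms each),
# laid end to end in random order. Total wall time: 2,000 ms.
events = [1.0] * 1000 + [100.0] * 10
random.shuffle(events)

intervals, t = [], 0.0
for d in events:
    intervals.append((t, t + d))  # (start, end) of each wait event
    t += d

# Sampling-based ASH: poll every 10 ms and note which event is active.
seen = set()
s, i = 0.0, 0
while s < t:
    while intervals[i][1] <= s:  # advance to the event active at time s
        i += 1
    seen.add(i)
    s += 10.0

# An event-based tracer would record all len(events) waits; the sampler
# can observe at most one event per sample (200 samples here).
print(f"actual events: {len(events)}, events seen by sampler: {len(seen)}")
```

The aggregate time per wait event class is still estimated reasonably well by sampling; what is lost is the exact count and sequence of individual short events, which is exactly what precise tracing recovers.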
Dmitry: And when I was working on this
tool, and working on the patch
for Postgres, I understood that
there are some gaps
in attributing time to
queries.
It's not really straightforward
to understand.
For example, ClientRead is a
wait event that we see really
often, but is it really database
time, or is it still client time?
It's not that simple and easy.
And if you go even deeper,
there's this example: you end
the query.
I mean, you open a transaction with BEGIN,
you make some changes, the
last UPDATE finishes, and you issue
COMMIT or END.
And really, you need to attribute
the time and waits to that END,
to the COMMIT.
I think that's the default behavior.
But even after the commit, there
is some work that Postgres
does, some cleaning up.
And from pg_stat_statements' perspective,
it's already not there.
So some part of the work is
unattributed, and it goes
nowhere.
It's not really a big amount of
work, but still, playing with
this precise tool, if you
are curious enough, will get
you to the next level of understanding
of how Postgres actually
works, down to the internals.
Really interesting.
I really like it for that, because
I never thought about it until
I started working with the tool.
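The attribution gap Dmitry describes can be sketched as a toy accounting exercise (all phase names and durations below are invented, purely illustrative; this is not how pg_stat_statements actually measures anything). The point is that wall-clock transaction time splits into pieces, and not all of them land on a statement:

```python
# Toy model: phases of one transaction, in milliseconds, with a flag saying
# whether pg_stat_statements would attribute that time to some statement.
phases = [
    ("BEGIN",                    0.1, True),
    ("UPDATE",                   4.0, True),
    ("ClientRead (think time)",  2.0, False),  # arguably client time, not DB time
    ("COMMIT (WAL flush)",       1.5, True),   # attributed to COMMIT/END
    ("post-commit cleanup",      0.4, False),  # after the statement has "ended"
]

wall = sum(d for _, d, _ in phases)
attributed = sum(d for _, d, a in phases if a)
print(f"wall = {wall} ms, attributed = {attributed} ms, "
      f"unattributed = {wall - attributed:.1f} ms")
```

A precise tracer sees all the phases on the timeline, which is what makes the leftover slices visible in the first place.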
Nikolay: And it's true that not
only can you understand the
exact sequence of wait events for
a particular load, you can
also map it to the source code,
right?
Dmitry: Of course, and nowadays an LLM
can explain it to you, so it's
not a foreign language anymore;
you can talk to it, and it's
really easy to understand.
Nikolay: Which opens doors for
more optimization of Postgres
itself?
Dmitry: I hope so, yes.
Nikolay: Like usually with GDB
or with perf, now it's possible
with this tool as well.
If you have some pathological workload,
you can reproduce it
in a synthetic environment and dig in.
That's it, yeah, that's
great.
Michael, you wanted to add something,
right?
Michael: Only that when you were talking
about unexplained overhead:
it often shows up in EXPLAIN ANALYZE too.
When you look at the difference
between the execution time at the
end of the query plan and the
actual total time of the
top-level operation, you almost
always see a small discrepancy
there as well, not accounted
for anywhere, which is the same
thing you're talking about,
I think.
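The discrepancy Michael mentions can be read straight out of `EXPLAIN (ANALYZE, FORMAT JSON)` output. The key names below are the ones Postgres actually emits in JSON-format plans; the timings themselves are made up for illustration:

```python
import json

# Trimmed EXPLAIN (ANALYZE, FORMAT JSON) output. "Plan", "Actual Total Time",
# "Planning Time", and "Execution Time" are real Postgres keys; the numbers
# are invented.
raw = """[{
  "Plan": {"Node Type": "Seq Scan", "Relation Name": "t",
           "Actual Total Time": 95.0},
  "Planning Time": 0.2,
  "Execution Time": 100.0
}]"""

doc = json.loads(raw)[0]
top_node = doc["Plan"]["Actual Total Time"]
execution = doc["Execution Time"]

# Time spent inside the executor but outside the top plan node
# (startup/shutdown and other work no node accounts for).
print(f"unaccounted: {execution - top_node:.1f} ms")
```

In real plans this gap is usually small, a few percent, which matches the "few percent here and there" that follows.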
Dmitry: Yeah, a few percent here
and there, and you've lost 10 bucks.
Michael: Yeah, well, I'd like
to echo what Nik's saying: it's
really interesting work that you're
doing, and thanks for sharing.
And also, I think you published
the tool under the PostgreSQL
license, which is really cool.
Nice one.
Dmitry: Yeah, I personally believe in that:
if you develop for Postgres,
then just share it.
Nikolay: And what's that book on queueing theory
you've been reading for quite
a long time, as I know?
Dmitry: Yeah, it's quite an old
one, so let me...
Nikolay: Because it connects
to my current work, like this
tool I just released.
And I guess I need to read it as
well.
Dmitry: I will send it so you can
put it in the description.
Nikolay: Okay, good luck with the
tool, definitely; I hope it
will be popular.
And I hope your patches will be
welcomed on the hackers mailing list.
Dmitry: Yeah, we will see how it
goes.
Nikolay: Great, thank you so much
for coming again.
Michael: Thank you.