Postgres FM | Transcript: Monitoring from Postgres AI

July 25, 2025 • 45 Minutes

Monitoring from Postgres AI

Michael: Hello and welcome to Postgres.FM,
a weekly show about

all things PostgreSQL.

I am Michael, founder of pgMustard
and I'm joined as usual

by Nikolay, founder of Postgres AI.

Hey Nik, how you doing?

Nikolay: Hi, Michael.

I'm doing great.

How are you?

Michael: Good.

Thank you.

And this week is launch week or
was maybe if people are listening

in the future.

Nikolay: Yeah, I can tell you,
I can assure you people are listening

to us in the future because we
are not live.

Michael: Yeah, of course.

So yes, so launch week for your
company, Postgres AI.

Nikolay: Yeah, for the first time.

I'm excited.

Yeah.

So we decided enough observing
how others do it.

First of all, Supabase, they just
had launch week before this.

I think number 15 already.

So many.

And since our team has grown and
we had stuff accumulated to

be released, so instead of releasing
it earlier, I said let's

just accumulate a little bit and
release in 1 week.

It's like sprint 5 days in a row.

And we actually, we are releasing
more, we, I hope we released

by the time people are listening,
I hope we released more than

5 things.

So we have also concept of extra
bonus releases.

And we started on Monday, we started
with DBLab version 4, which

actually took 2 years to develop.

Michael: Yeah, that's a long time.

Nikolay: Yeah, finally we polished
it and many users already

installed it.

I mean release candidates.

We needed to have 5 release candidates.

Michael: That's more
than Postgres.

Nikolay: Yeah, yeah.

Well, because a lot of new problems
appeared because we changed

a lot there.

Yeah, but it's another story.

Today we talk about our second
big release, which is called Postgres AI

Monitoring.

Michael: Yeah, so this is 1 you're
doing, so we're recording

on Monday.

This is 1 you're doing tomorrow.

So what is it?

What are you up to?

Nikolay: Yes, we are doing this tomorrow.

Everything is ready already.

So I'm pretty confident we are doing this.

Yeah.

And only a few people know that we are releasing this kind of

monitoring.

And it's called Postgres AI monitoring, but right now there is

a little AI there.

And this is intentional.

Let me explain this first, because of course we had a short chat

before recording and you noticed this discrepancy.

We use AI a lot.

Every day, for example, in my team, it's prohibited to do anything

without AI-first.

It's prohibited to write code.

It's already so.

Well, you can write code, but only if your AI tools failed.

And first people were resistant to this.

Some people said, like, what the heck, I don't want it.

It's bad, it's poor quality.

But in my opinion, the Anthropic release of Cloud 4 a couple

of months ago, it was a game changer.

Quality already very acceptable in many cases.

Not always, sometimes, especially if you are expert in some narrow

topic, you understand problems which were never discussed publicly.

In this case, of course, you will see AI will be wrong 100%.

We have these cases.

For example, a perfect example is our guests last week.

They experienced some problem which was not covered by public

discussions.

Obviously, in such cases, if you just apply AI in a straightforward

way, it will hallucinate some bullshit.

But if you apply AI to brainstorm some ideas and ground it, pre-ground,

post-ground with materials, reliable materials like source code,

documentation, information from reliable sources and experiments,

post-grounding, in this case it's different.

You can explore, behave, reproduce it, and this is the path we

chose.

Not just this problem, take AI and create, I don't know, like

change requests somewhere.

This won't work.

Ungrounded approach won't work.

So this is 1 point.

Another point is that 6, 7 years I keep talking, I avoid building

monitoring system.

For example, when I met with Lukas Fittl in San Francisco 1

day a few years ago, I told him this, like, you're building great

monitoring system, pganalyze.

We build some checkup because we need for our customers to find

problems and then to start exploring them, experimenting.

Our key strength of Postgres AI
team is experiments.

That's why we built DBLab to simplify
specific type of experiments

and benchmarks and so on.

And I mentioned checkups and somehow
pganalyze has checkup section

since then, but it's a different
story.

And I said I'm not going to build
monitoring system, but it was

inevitably like I was pushed somehow
to this point where it's

inevitable that we need to do it
because no other system, and

not only open source system, not
other system is providing abilities

we need.

We need several specific abilities
with our clients.

We always show it.

And some of them is for troubleshooting.

So like high, very high level analysis
of Postgres components.

And then you can understand where
you can dive deeper.

And actually, I remember great
talks from Percona, from Zaitsev,

and so on.

And I saw some bits of good stuff
here and there, like

Percona Monitoring and Management,
or how is it called?

Michael: PMM.

Nikolay: Yeah.

Yeah, yeah, yeah.

And things like use and 4 golden
signals approaches like methodologies.

Some great pieces there.

But then you see something is really
missing.

Like For example, proper query
analysis.

We talked about it so many times.

Comprehensive analysis using pg_stat_statements, pg_stat_kcache,

pg_wait_sampling, top-down, single
query with query texts with

proper characteristics.

And We didn't see any system like
this.

A few years ago, we started maintaining
so-called pgwatch2 Postgres AI

edition, which was a fork of pgwatch2
with our metrics and dashboards.

But then Pavlo Golub decided to
rewrite pgwatch.

So we started lagging in terms
of this vehicle we used as pgwatch

itself.

So time has come and we realized,
okay, and we experimented a

lot.

We actually started with InfluxDB.

It was insane.

We eliminated it.

We used TimescaleDB.

And then we had a big thinking
process.

Should we stay with PureSQL and
TimescaleDB, or we should use

PromQL?

We decided to use PromQL, but with
some nuances.

We still have Postgres inside.

So we have both Prometheus and
Postgres.

So the idea was we need to build
a new monitoring system, which

will be part of our process.

And in consulting, we had maybe 30 clients over the last 1 year

and a half.

We had a few dozens of clients, right?

And to all of them, we installed always our pgwatch2 Postgres AI

edition, and also used

postgres-checkup tool, old tool we

always used, to generate reports in static form, like markdown

or HTML PDF.

So this time we said, okay, we finally do monitoring, but the

main part of it will be open source.

So monitoring itself is open source, but it's a part of bigger,

bigger picture where we perform health checks and help clients

proactively using AI, of course.

Michael: So I'm going to ask your question always.

When you say open source, what do you mean?

Nikolay: License is Apache 2.0.

As I remember, pgwatch is a BSD or Postgres license?

1 of these 2, but our dashboards are Apache 2.0, so anyone can

use it.

Majority of that is...

So we couldn't use a so-called traditional Postgres exporter

for Prometheus because it's so much stuff and it's really slow

to put pieces there.

So basically we have our own exporter inside, I mean, metrics

and dashboards.

And it's Prometheus.

Why Prometheus?

Because we just see that in many cases, well, several reasons.

1 of the key reasons is that we had already 2, actually 3 companies

where we, in consulting mode, we contributed a lot to implementing

our approach to their own monitoring system.

So this is why we have confidence this monitoring system is great.

And usually, bigger companies, they have some stack.

Some people use Datadog, but you know, if you have 100 terabytes

of data, or say, 1000 clusters, like microservices, Datadog is

super expensive.

Some people still keep paying.

But it's not only expensive, it has blind spots in terms of Postgres

monitoring.

A lot of them, for example, pg_stat_statements, they have some

not all metrics.

And for essential metrics, they, for example, can give you per

call, but not per second, or per second, but not per call.

There is no completeness there.

If you check carefully, so they differentiate in 1 dimension,

but not another.

For another metric, it's vice versa.

So it lacks many things.

And also how dashboards are organized.

I don't understand why people put Tuple stats on the front page.

It's not the main metric to observe.

Not the main metric.

It's not the number 1 thing you want to see when you have an

incident and need to troubleshoot.

And people don't put queries per second because Postgres doesn't

have queries per second.

But we can extract it from previous assignments with some asterisk

that it's approximate because some things can be lost.

Sometimes we see queries per second are lower than transactions

per second, but it cannot be so.

Right?

Yeah.

In fact, because we don't see full picture, but in most cases

it's vice versa.

So we see vice versa.

So this monitoring is open source fully, and it can be used in

pieces if your organization already has Prometheus-based something.

But lacks a systematic approach for dashboards.

This is where we can, I mean, you can take this and improve,

take pieces, maybe contribute back?

Also good idea to contribute back always.

And I still think, although I personally invested like more than

5 years already to this.

First few years I wanted to say, no, no, I'm not going to do

it.

But then I was forced.

Well, well, life sometimes force you to do something right.

And now this is my like, I resurrected to this idea, like, okay,

I'm doing this.

And yeah, and not only I, we have really super experienced members

in our team, who also spent a lot of time thinking and contributing

to this.

It's not only my own effort.

So we have several very good team members who contributed a lot.

So we decided to use PromQL also because of this, because it's

more compatible with enterprise systems.

Michael: Yeah, you mentioned using Prometheus, but also Postgres.

Nikolay: But also Postgres, because in Prometheus there is a

problem with query texts.

In query analysis, you do want, like you have to just have statements,

But in Prometheus, it's not normal to get query and store it.

It's very inefficient.

So you need secondary storage, and we chose Postgres, obviously.

Basically we have 2 data stores.

1 is the main 1, Prometheus.

And the second 1 is...

Actually we are debating right now.

We probably will...

Prometheus, it will be Mimir.

We also thought about Victoria Metrics.

So it's still...

We will switch in the next few releases to something else.

And for query texts, it's Postgres, but it's a small 1.

Because you know, by default, pg_stat_statements for each node,

5000 statements.

And do we track all of them?

I need to check.

Maybe no.

So 1 of the things we noticed, we had cases with our previous

monitoring, that sometimes people have millions of objects in

the database.

And this is really requires optimization.

And we made this optimization for really large cases.

I remember our checkup actually also had cases with 100,000

indexes.

In case of monitoring pgwatch2, it was previous version, basically,

right?

Our Postgres AI edition, we had cases with more than 1 million

indexes and hundreds of thousands of tables.

Bloat analysis was not working.

But like, it's and this is interesting and a separate question,

how to perform index maintenance in such a situation, for example.

But

Michael: it's also a weakness of pg_stat_statements, right?

Because often, I don't know if this is the case always, but often

when you see that number of objects, it's because somebody's

doing per tenant, but by schema.

So a lot of the tables are the same dozen or several dozen tables

and indexes.

And then pg_stat_statements kind of falls flat on its face because

it sees objects as unique and it tracks them all individually.

So you can't look at.

It's flooded.

Yeah, and 5000 is not enough.

Nikolay: It's not aggregating properly.

Michael: Exactly, yeah.

Nikolay: Yeah, we discussed this recently also, there is a problem

with temporary tables, for example,

Michael: sometimes

Nikolay: it's like the same schema, same table, but somehow you

end up having different queries, 2 different queries and pages

are set in the same.

I would personally prefer some control how aggregation is done.

There is a compute_query_id mechanism.

So basically it would be great to define a rule, for example,

ignore namespace, ignore schema name, right?

Michael: Yeah, wow.

Yeah, interesting idea.

Nikolay: Yeah, maybe.

Michael: Because the object, the other table names will probably,

for most people, the table names
would stay the same, it's the

schema name that would change.

Nikolay: When analyzing logs, I
still prefer, I know since Postgres

16 we have query ID in the logs.

We can't have, we need to enable
it for auto-explain, but I still

often prefer Lukas Fittl's, already
mentioned him, his second

time, his great library libpgquery
and Qt fingerprints.

I know it works purely based on
syntax, no OIDs from pg_class

or something, so I know how it
will aggregate.

And it's mostly often enough.

Michael: Do you know if it can
handle that?

The different schemas, same object
name?

Nikolay: Well no, I think it distinguishes
them.

If query has fully qualified names
involving schema name, I think

it will produce different fingerprints.

But I guess it's easier to tweak
it than to try to change.

Michael: Maybe.

Nikolay: pg_stat_statements approach
and so on.

I don't know.

I just feel, yeah, it would be
great to have some more flexible,

some default path, but some exotic
paths to tune it, how we aggregate.

Because otherwise you end up with
5000 slots in pg_stat_statements

by default, And then maybe 5 different
queries, each 1000 times,

right?

But just with some differences.

You know, the in problem, I don't
know, was it fixed already?

I know, for example, the lipoge
query, regardless of the members

of in list, it will produce the
same fingerprint.

Michael: Nice.

Okay.

I don't know if the feature set
statements got fixed, but yeah,

you're right.

If there's like the difference
in having 1 in list versus 2 versus

3, they were all considered unique.

Nikolay: Some people implemented
some regexps to convert it before

calculating something.

Usually you want, if you have in-listed,
regardless of number

of members in the list, it should
be considered the same.

Semantically it's the same.

2 members, 3 members, it doesn't
matter.

But pg_stat_statements, that was until, like,
I thought I saw some discussion.

I like my memory tricks me sometimes.

Maybe it

Michael: was definitely been discussions.

I just don't know if there's any
commit.

I think gut feeling is it might
be in a very recent version,

but I'm not I'm not 100% sure without
checking the but the like,

for example, any if you if you
if you transform it into a query

with any, you're going to get the
same entry in pg_stat_statements

every time.

So why should it be different with
in lists?

Like it's, it doesn't make sense.

Yeah.

Although they could have different
plans, like at a certain point

it could flip the plan, but that's
true of any query.

That's true of any querying.

Nikolay: How work, how they work
with not in and nulls, right.

Remember this danger, right.

So yeah, let's discuss.

So yeah, open source, Apache 2.0,
open source components, PG

version 3 as like engine for this.

So we just,

Michael: pgwatch

Nikolay: pgwatch version
3, which was written by Pavlo

Golub, CYBERTEC, great work, a
lot of optimizations, a lot of

things like how we control, they
call it sinks, right?

Like how data streams are controlled
and so on.

And what else?

Very fresh Grafana, obviously,
right?

And there we have several interesting
tricks there.

So where does this dashboards shine?

If we talk about only dashboards,
as I said, it's a part of bigger

vision, but just talking about
dashboards and manual use of them.

First dashboard is the troubleshooting
dashboard.

You have just 1 minute and you
need to understand where to dig

further, right?

We use 4 golden signals there.

So we have throughput, latencies,
errors, what else?

Saturation, right?

Michael: I didn't see that 1.

Dashboard, okay, dashboards.

Nikolay: Dashboard number 1.

Michael: Active session history?

Nikolay: Yeah.

No, active session history, This
is where it starts.

But it's a single node performance
overview, number 1.

Active session history, in our
opinion, we're convinced after

many years observing how RDS does
it, others are also convinced.

Wait event analysis is the king
in monitoring.

If you want 1 single graph or how
to say chart, this is to understand

like quickly performance and what's
wrong.

This is what you need to look at.

That's why we put it on the very
first place on the very first

dashboard.

But it's only very high level,
like 30,000 feet view.

So only wait event types, not wait
events.

No query IDs, not particular wait
events, only wait event types.

So it's very, like, is it like
CPU or I or something?

And CPU, we mark with asterisk
because it's not only CPU we know

sometimes some parts of the code
still are not covered by by

wait events.

So technically, it can be CPU or
unknown wait event.

Right?

Yeah.

But and CPU is green is as everywhere
is in performance insights,

PASH Viewer, anywhere.

So then we have very high level
pieces to understand workload

and various Postgres components.

And we follow 4 golden signals
there.

So latency, traffic, situation,
traffic throughput, right?

Situation errors.

Latencies, we extract from pg_stat_statements.

I think it's a great idea actually
to have it in the

pg_stat_database.

Probably there are reasons why
it's still not there.

You can extract it if you have
PgBouncer, sometimes I just check

PgBouncer logs manually.

To double check that I understand
latency properly.

But from pg_stat_statements, good
enough.

Of course, again, if you have a
situation like we just discussed,

and you see only the tip of the
iceberg in those pg_stat_statements.max

entries of pg_stat_statements,
5000 by default, then it will

be a wrong value.

So I think, yeah, we, we, we, we
understand this, right?

So it's like best efforts.

It's kind

Michael: of an edge case as well,
right?

Yeah.

It's not many people that have
hundreds of thousands of objects.

Nikolay: Yeah.

Yeah.

And throughput, we like in various
ways, like calls per second,

transactions per second.

So these are key ones.

And we didn't put tuples per second,
as you can find in Datadog

and some other places.

Maybe we should, but I'm not convinced
this is super low level

throughput metric should be on
the very front.

Maybe actually, because I'm contradicting
with myself.

I'm always discussing logical replication.

I say on modern Intel and AMD processors,
single slot, Single

publication subscription, it can
saturate subscriber side single

threaded process at levels like
1000, 2000, 3000 tuples

per second, depending on various
factors.

So maybe tuples per second for
writes, maybe we should put it,

actually.

Good idea, maybe.

But this is very low level throughput.

So in database you can send throughput
in various levels.

Transactions, queries, well, tuples
next.

Is tuples next?

Or maybe rows?

Rows, tuples are different, right?

Well, buffers, yeah, yeah.

Actually, this is what we do have.

We have hits and reads and writes,
dirtied buffers.

All those buffers we put it right
on the front page, but only

high level, we don't distinguish
by query ID, this is the job

for different dashboard.

So you quickly can understand what's
happening here.

And of course, we translate it
to bytes, right?

So how many bytes per second of
buffer reads your buffer pool

is having right now?

And did we have spikes, right?

So yeah.

And actually, yeah, and errors.

Yeah.

So if we are errors, we will see
it in transaction analysis,

number of rollbacks.

We currently don't work with logs,
but it's in the roadmap, of

course.

So we extract errors from logs,
aggregate, and so on.

I still have hope that a tiny but
very useful extension called

logerrors.

1 day we'll make it into engine
and we will have some pg_stat_errors

or something.

It would be great to have some
error code and how many times

it occurred, right?

Basic accumulative metric and we
could start observing them in

normal way, exporting and then
alerting.

Speaking of alerts, it's also in
roadmap, definitely.

We have very careful approach because
talking to different customers,

we see very different expectations,
very like opposite sometimes.

So yeah, I will discuss it.

Yeah.

In the past.

Michael: Well, I think it's interesting.

I think alerting is fascinating
and like balancing risks of false

positives versus risks of false
negatives.

Like I think it's a really hard
problem.

Nikolay: It's a hard problem.

Michael: And configurationally,
People always just end up making

everything configurable and it's...

And picking defaults is hard or
like...

Anyway, it's not envious of people
that make monitoring tools.

They quite like making

Nikolay: it simple.

Yeah, I think we will avoid a single
default set, set of defaults,

and we will just give some, basically,
a questionnaire.

What matters for you more?

Creators and Guests

Host

Michael Christofides

Founder of pgMustard

Host

Nikolay Samokhvalov

Founder of Postgres AI

Monitoring from Postgres AI

Creators and Guests

Some kind things our listeners have said