pgBadger
Michael: Hello and welcome to PostgresFM,
a weekly show about
all things PostgreSQL.
I am Michael, founder of pgMustard,
and today I am joined by
Alicja Kucharczyk, Program Manager
for Azure Database for PostgreSQL
at Microsoft and organizer of the
Warsaw PostgreSQL user group.
Thank you so much for joining me
today Alicja, it's an honor
to have you here.
Alicja: Thank you for having me
Michael.
Michael: Wonderful. Well, I proposed a topic this week, mainly because you're one of the biggest fans, or at least advocates, of this tool, I think, in the world. But it's probably an underused tool, in my opinion anyway. I definitely don't use it at times when I could or should, and I know most people that could be using it have never even heard of it, at least newer users. So what are we going to be talking about today?
Alicja: Yeah, about my favorite tool, which is pgBadger. It's totally open source, which is really great, and it allows you to do really comprehensive troubleshooting. I believe it's top of its class.
Michael: Yeah, absolutely. Do you want to give us a little bit of background on the tool, or how people use it?
Alicja: Yes, absolutely. Maybe I will start with my own story, because I learned that the tool exists, I believe, over 10 years ago, when I was working for a Polish bank. The process there was that each time something happened, they would generate a pgBadger report. That was always the first step they took.
So that was the place where I learned
that this tool exists.
And over the years, as I moved from company to company, I became a Postgres consultant, and pgBadger became my main tool for troubleshooting.
Because as a consultant, you don't
really know the system.
You go to the customer, you know something has happened, and the customer says, for instance, that CPU is very high or query latency has suddenly grown, but you still don't really have any knowledge about the system.
And pgBadger is really one of the
best tools to quickly
get knowledge about the workload
that is happening within your
database.
I also really like that pgBadger shows you a really comprehensive picture. You might know other, similar tools, like pg_stat_statements.
And it's not that I want to say
that this is a bad tool or anything.
I do use pg_stat_statements.
I know that many people do, but
pg_stat_statements is actually just
one tab in the pgBadger report.
And next to that, you've also got a report about temporary files. You've got really nicely aggregated errors: everything in the logs that is an error, fatal, or warning is nicely aggregated on the events tab. You would also have the locks, locks with a C-K, right? Not logs, but locks. I know it's easy to mix those two. So you have the locks, and you've also got the number of connections and the number of sessions.
So you really have a comprehensive
view.
For instance, with pg_stat_statements, you see, okay, this query was executed 1,000 times, the total time was this, and the average time was this, right? You would have the same thing in pgBadger, but the difference is that in pgBadger you can quickly go to another tab, for instance the temporary files tab, and see that, okay, this execution time was actually caused by this query generating temporary files, or maybe it was waiting for a lock. So that's why I prefer pgBadger over pg_stat_statements and other tools: because it gives me this comprehensive view.
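For reference, a minimal sketch of what running it looks like, assuming pgBadger is installed and with placeholder paths:

```bash
# Parse a day of Postgres logs and produce a self-contained HTML report
pgbadger /var/log/postgresql/postgresql-2024-01-01.log -o report.html
```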
Michael: Yeah, I like it.
Shots fired right up top.
I'm a big fan of pg_stat_statements
as well, so I'm hoping to
be somewhat converted here.
I definitely think there's pros
of both.
So pgBadger is a log analysis tool.
And it's a standalone tool as well, right? Open source, as you said; we can install it wherever and analyze our log files for anything that we can spit out to the log. So, as you mentioned: locks, wait events, and errors as well. That's a big missing part of pg_stat_statements.
It's quite common advice to have a statement timeout, right? If you have queries that are taking more than, I don't know, maybe 30 seconds on some systems, time them out instead of letting them complete.
In pg_stat_statements, we don't see those at all, but in pgBadger, we can see those because...
Alicja: ...of the errors, right? Absolutely. Yes, yes.
And the errors actually help with other problems too. Recently I was troubleshooting logical replication errors, and if you just have the plain text log and you've got one error, it's really not easy to go up thousands of lines and see what the root cause error was, right? Because sometimes you've got one error that is causing all the other errors, like in this case. And in pgBadger, you see that straight away, which is great because of the aggregation that pgBadger is doing.
Michael: Nice.
So, because we're reliant on the logs here, I would like to ask: a lot of the default settings mean a lot of important things don't get logged. So for many people, if they're new to Postgres, or if they're using a managed provider and haven't changed any settings, their logs won't have many of those things. So if they ran pgBadger now, I'm guessing a lot of them wouldn't get some of the reports you're talking about, because they wouldn't have the right settings on. Is that fair?
Alicja: Yes, yes, absolutely. With default settings, you would probably have a very poor pgBadger report, because as you said, pgBadger is just a Perl script that takes the text logs as input and generates an HTML report as output, right? So whatever is in the logs will be parsed and aggregated into this really nice pgBadger format. But if something is not in the logs, then obviously you will not see it, right? So for instance, if you want to see the lock waits, you need to have log_lock_waits on. There is one setting in Postgres that you need to turn on to have lock waits logged and be able to generate that.
Michael: Yeah, log_lock_waits,
I think it is.
Alicja: log_lock_waits, yes. I believe this is the one, and it's responsible for the locks tab in the pgBadger report. So if you've got this off, you will not see anything there.
The log_line_prefix is also super important. The default prefix is pretty much empty, and I'm using the one that is actually recommended by pgBadger. It contains a lot of stuff: the database name, user name, application name, the timestamp of course, and it also has the host, so the IP the connection is coming from. And this allows you to have a really rich pgBadger report, because pgBadger is also able to split the queries into different tabs per database and per user. So it's able to show you, for instance, how many select queries ran on a particular database or were issued by a particular user. But you need log_line_prefix for that. Otherwise, with the really short default, you will not see a lot of useful information.
But I believe the pgBadger documentation is pretty good here; you've got the entire list of the settings that you need to set.
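For reference, the prefix recommended in the pgBadger documentation for stderr logging looks like this; the user, database, application, and client fields are what enable the per-user and per-database breakdowns she describes:

```
# postgresql.conf, prefix suggested by the pgBadger docs for stderr logs
log_line_prefix = '%t [%p]: user=%u,db=%d,app=%a,client=%h '
```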
And most of those are safe. I say most, because there is of course one setting, log_min_duration_statement, which might be super dangerous, right? Maybe just to explain: log_min_duration_statement is the setting that regulates how many queries we log into the text logs. By default it is -1, which means off, and 0 means log everything. And everything means even SELECT 1, something that takes under a millisecond, right?
With 0, you will see a huge, huge impact on your database. So you always need to adjust this to your system, to the workload you've got, to how intensively your database is processing queries. If you've got an analytical database, with super huge data volumes but, yeah, 5 queries per hour, then log_min_duration_statement = 0 is probably okay. But with thousands or millions of queries per second, this is a killer. I saw even 40% overhead on overall performance with log_min_duration_statement = 0. So this is dangerous. I usually start with something between 1 minute and 10 minutes, depending on the system.
Michael: Oh, wow.
Alicja: Yeah. So it depends on how intensive the system is, on what the customer tells me: is it super intense or not? Let's say we start from 1 minute and then we look at the overhead. If it's huge, we go up. If it's not huge, then we try to go a bit lower with the setting.
Michael: When you say 1 minute,
you mean only log the duration
of statements that take longer
than 60 seconds?
Alicja: Yes, yes, yes, yes.
Michael: I thought you're going
to go much lower, but that makes
a lot of sense, just in case there's
a huge overhead.
Alicja: Exactly, exactly.
This is like to be on the safe
side, right?
Because otherwise if you've got
a production system, then you
go too low and then you've got
an outage.
Michael: And I don't think this requires a restart, right? So you can reload the config, and it's not a huge overhead to change that one. But yeah, that's a really good tip.
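A minimal sketch of that change, using the conservative 60-second starting point discussed above:

```bash
# Raise the slow-query threshold and reload; no restart required
psql -c "ALTER SYSTEM SET log_min_duration_statement = '60s'"
psql -c "SELECT pg_reload_conf()"
```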
Without that, we get almost nothing; we don't get any reporting on the slow statements. But I guess the trade-off is: if our system's biggest bottleneck is CPU, say the customer mentioned they have high CPU, and that's mostly being caused by millions of executions of very fast queries, then we don't spot them. So it's that trade-off between slow queries being the issue versus lots of fast queries being the issue.
Alicja: It is, but this is a very good point, because you are right: I saw outages where we had a lot of super small queries that were executed often, like a thousand times a second. And we've actually got 2 ways to handle that. One way would be to have some kind of cron job that changes log_min_duration_statement to 0 and, after some short period of time, switches it back, right? That would be one.
And the second one would be the newer method that I haven't tried on production yet, which is sampling. It's in Postgres now, and it might be something to consider. So for now, I was just doing cron jobs, but to be very honest with you, it's a problem when you are analyzing the report, right? Because you see really weird charts. So you always need to keep in mind that you've got this switch in the cron job.
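A hypothetical sketch of such a capture window (thresholds and timings are examples; the sampling she refers to is likely log_min_duration_sample plus log_statement_sample_rate, added in Postgres 13):

```bash
#!/bin/bash
# Hypothetical capture window: log every statement for 5 minutes,
# then fall back to a conservative 60-second threshold.
psql -c "ALTER SYSTEM SET log_min_duration_statement = 0"
psql -c "SELECT pg_reload_conf()"
sleep 300
psql -c "ALTER SYSTEM SET log_min_duration_statement = '60s'"
psql -c "SELECT pg_reload_conf()"
```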
Michael: Yeah, or for the fast
queries, rely on something like
pg_stat_statements if you have
that on already, for example.
But back to pgBadger: I looked at the docs at some of the other settings people might want to switch on that have low overhead. Stop me if I mention one that is dangerous. We have log_checkpoints, log_connections, log_disconnections, log_lock_waits, which you mentioned already, and log_temp_files.
Do you do any... So the recommendation in the pgBadger docs is to set log_temp_files to 0?
Alicja: No, I just set it to 0, you know.
Michael: Okay, great, so that's good; you don't see that as being anywhere near as dangerous. I have seen some recommendations to start with a higher setting for that and go downwards as well, which I guess people can do if they want to be extra cautious. But if you've tuned work_mem at all already, then I guess you should be pretty safe to put that to 0.
Alicja: I've never seen any problem when I tweaked temp files or checkpoints, to be very honest. Maybe there are some narrow use cases, but over the years I've never seen any problems with the other settings. The same with autovacuum: people were also a bit afraid of logging autovacuum, but it's really just a couple of lines per minute, and it will not hit your system that much.
Michael: Nice. And yeah, you've got one that I forgot there, which is log_autovacuum_min_duration. Set that to 0 as well, so that we get the important vacuum stuff. Great. So that lets people get this set up.
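Putting those together, the low-overhead settings from the pgBadger documentation look roughly like this:

```
# postgresql.conf, low-overhead logging settings per the pgBadger docs
log_checkpoints = on
log_connections = on
log_disconnections = on
log_lock_waits = on               # feeds the locks tab
log_temp_files = 0                # log every temporary file with its size
log_autovacuum_min_duration = 0   # log all autovacuum activity
lc_messages = 'C'                 # pgBadger needs log messages in English
```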
And I've checked some hosting providers. So, seeing as you work on a managed service provider: do you have some of these set by default, or is this something that people need to go in and change?
Alicja: Yes. Because I work at Microsoft, it's easiest for me to talk about how we've got the setup done; I don't know that much about other cloud providers. We've got log_connections and log_disconnections set to on, and users cannot change them; they're read-only. That's because we need them for our internal telemetry. So those are just on.
Also, pgBadger needs the logs to be in English, and this is also the default on Azure. So a lot of these are already enabled on Azure, which is good news. But you still need to enable lock logging, autovacuum logging, logging for the other stuff, and log_line_prefix.
I know that not all cloud providers let you change log_line_prefix, as far as I know at least. We do allow that, so this is good news, but I've heard that not all do; AWS probably doesn't allow it, as I remember. At least it didn't a couple of years ago.
I remember writing a script to parse the logs, to be able to generate a pgBadger report, because we were not able to change log_line_prefix. And they also had a really weird format, so we needed a script that would parse that. But the good news is that pgBadger has introduced support for this log format; you've got a separate format, which is RDS. So you don't need custom scripts anymore. So this is good news.
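A sketch of that; the -f rds format switch is documented, and the file name is just an example:

```bash
# Parse a log exported from RDS without any custom preprocessing
pgbadger -f rds postgresql.log.2024-01-01-00 -o rds-report.html
```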
Michael: Yeah, I was so impressed reading through the pgBadger docs. With all the different formats it supports, it's an incredibly flexible tool, almost as if it's designed and built by people that use it all the time. There are definitely tons of options, and lots of auto-recognition of formats and file types and things like that.
So for a lot of people, I would guess it would just work. And for people on cloud providers: you've written a great post about using it with Azure, but I also found a couple of posts from people on the RDS and Google Cloud SQL teams that explain how to use it with their systems as well. If any other cloud providers are listening and have guides, please send them to me and I'll include them in the show notes. But yeah, it feels like this is a tool people could be using even with managed Postgres providers.
Alicja: Exactly. It makes me really happy to see the blog posts from AWS and Google Cloud, because exactly as you said, it means that the users of those managed services are using pgBadger, so it was probably needed.
Michael: Nice, yep, and it also means this thing can live on, right? We don't have to completely replace it with cloud-native tools; it should still work. The logs are still the only place Postgres currently puts some of this information, so at some point we're going to have to look at the logs, and having a tool to do so often helps.
Cool, all right. So we've mentioned the min duration setting being a potential gotcha; are there any others that we need to be cautious of when using this?
Alicja: Among the parameter settings, that is actually the most dangerous one; not much else, I would say. The other thing you might have a problem with is the size of the logs themselves.
I sometimes had a problem where I was receiving just terabytes of logs from the customer, and it was: okay, here are my logs, generate whatever you want from that, right? And yeah, that also might be a problem. And although pgBadger has this -J option, like most Postgres tools, on my own laptop it still took days. So if you've got a really huge amount of logs, you probably still need a powerful VM just to parse them. Otherwise it will not work.
Michael: Yeah, I saw it has both -j lowercase and -J uppercase, so it has two different ways of handling this. I think the first one is for parallelizing: having multiple CPUs handle a single large log file. I hadn't heard of terabytes; all the examples I saw were in gigabytes, so that's quite scary. But I think -J was even faster if, let's say, you were given 20 large log files and you wanted to give it 20 CPUs and handle them one CPU per file. So it's quite cool that it has multiple ways of handling huge workloads, but terabytes of logs is terrifying.
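Roughly, the two modes look like this (core counts and paths are examples):

```bash
# -j: split one large file across 8 cores
pgbadger -j 8 huge-server.log -o report.html

# -J: parse 20 files in parallel, one core per file
pgbadger -J 20 postgresql-*.log -o report.html
```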
So I saw that it has so many options. By the way, I have a quick trivia question for you. It has so many single-letter options that I noticed there are only 2 letters in the English alphabet that it doesn't use. Can you guess which letters they are? Only K and Y. It's not a fair question. And they even have an option that's keep comments, so my recommendation to the pgBadger team is to make that the K, and then they only have to find 1 more and they have the whole alphabet. Anyway, sorry, that's an aside. But...
Alicja: No, no, this is actually a great point, because sometimes the customers have some uncommon questions, right? For instance, I had a customer that had a performance problem, and they wanted to share a pgBadger report with me, but they couldn't because of the data, right? Because they were not able to show me the queries.
And I didn't even know this, but yes, pgBadger has so many options that I quickly checked, and there is this anonymize option. So you can find a lot of stuff, really a lot of options, in pgBadger. So this is actually a great point.
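That option looks like this (file names are examples):

```bash
# Obscure literal values in queries so the report can be shared safely
pgbadger --anonymize postgresql.log -o anonymized-report.html
```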
Michael: I was going to ask about 2 of them actually, B and E, so beginning and end. If we have a huge log file, let's say it's from a month of logs or something, and we only care about a day or 2 when we had problems: if we set beginning and end, would that drastically speed things up as well?
Alicja: It depends on the size of the file, right?
It's more of an operating system question, I would say, because if you've got a really huge file, your operating system needs to read it anyway, right? So in this case, it might not help that much. It's a long story, but with really huge logs, huge files, I've tried a lot of stuff. I've tried splitting the files using just the split command; I tried a lot of different ways. But finally, what works best is just a huge VM. That's the easiest way: you put it on a huge VM and you've got it solved.
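For reference, restricting the report to a known problem window uses the begin and end options (dates are examples):

```bash
# Only parse entries from the two problem days
pgbadger -b "2024-01-15 00:00:00" -e "2024-01-16 23:59:59" postgresql-*.log -o incident.html
```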
Michael: Nice.
Cool.
Okay.
I saw you hosted one of the PGSQL Phriday events. We've done a couple of podcast submissions to that event, and yours was on pgBadger. But I saw that Gilles Darold submitted a post, and one of the things they mentioned was flexibility. Like a typical Postgres consultant, or not typical but extra on the Postgres side, they parse the logs into a Postgres database in order to be able to query them, giving themselves a little bit more flexibility. So I guess we are somewhat limited by the reports that pgBadger gives us, but in the real world, it's giving us so much useful information that I guess those cases are few and far between, and we get most of the way there just with the pgBadger report. Is that fair?
Alicja: Yeah, I saw the answer from Lætitia, and it's actually interesting, parsing this into the database. I've had a lot of asks about that, especially from ex-Oracle DBAs, right? They always ask if there is a way to do that. So it might be useful. But when I was a consultant, because I'm not that much anymore, I preferred to use something that I can just send a link to, a tool that is already there and works, and just say: generate me the data, rather than focusing on writing my own tools. But OK, I don't say it's wrong or anything; it's just easier for me.
And what I also like is the visual part, because the visual part really allows you to get to the root cause super quickly. It's not that easy with text: if you're running a query and you've got the result, okay, but you need to focus a bit more. In pgBadger, you've got the chart and you just see everything straight away. So it's way easier and way faster for me, and you've got all the answers straight away. It's just a matter of preferences, I would say, and my preference is to have a graphical presentation.
Michael: Yeah, that's a huge pro for a lot of people, me included. You've also touched on another one there, which I don't think we've mentioned: how tried and tested this is. I think I looked it up, and pgBadger v1 came out around April or May 2012, and it's version 12.4 now, with a ton of bug fixes and all sorts of feature improvements since then. So compared to starting to write your own tool: we've already mentioned some of the features we get, but there are honestly dozens of settings, more than the ones I mentioned. Not just nearly every letter covered, but multiple letters have lowercase and uppercase versions, and there are quite a few options that don't have shortcuts, another few dozen that you have to set separately. So it's incredibly flexible, and it's been tested for many, many years, which is worth a lot.
Alicja: Exactly, exactly. To be honest, to write your own tool that would be that comprehensive, with so many dashboards and so much information in it, would take years. So yeah, I prefer to have this tool instead of writing my own, but I totally get people that prefer to write their own tools.
Michael: Nice. And is there anything they could be doing? I think it's Gilles Darold, I don't know quite how to pronounce his name, but the maintainer. Is there anything they could be doing, or anything you'd like to see in pgBadger in the future?
Alicja: I would like to see support for Azure; that would be my main thing. And probably also the thing about the locks, and I know this is not easy, because it's not that easy to get this from the logs. In the locks tab, you only see the queries that were blocked, not the queries that caused the lock. And of course there is a simple reason for that: it's how Postgres does this, right?
But if you've got log_min_duration_statement at 0, right, and you've got the PID in the prefix, you can correlate, because Postgres will normally tell you: okay, this PID is blocking your query. So then you can search for that PID, and maybe you've got the query, and you try to correlate them. Because it's not that useful if you just see the queries that are blocked and not what is actually blocking them, right? So you need to have an additional cron job with a lock query to compare with.
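A hypothetical sketch of such a cron job, using the built-in pg_blocking_pids() function (available since Postgres 9.6; the output path is an example):

```bash
# crontab entry: every minute, snapshot who is blocked and by which PIDs
* * * * * psql -At -c "SELECT now(), pid, pg_blocking_pids(pid), left(query, 200) FROM pg_stat_activity WHERE cardinality(pg_blocking_pids(pid)) > 0" >> /var/log/pg_lock_snapshots.log
```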
So yeah, that for sure would be an ask, but I'm not sure if this is actually an ask for Gilles. I'm also not sure how to pronounce his name, so I'm saying Gilles. I'm not sure if this is an ask for Gilles or more for Postgres contributors, to be honest, just to have that in the logs.
Michael: Yeah, to log more information. So yeah, that makes sense; it might need to come from Postgres itself.
Cool.
And yeah, it's released under the
PostgreSQL license, which is
one of the most permissive licenses
I've ever seen, which is awesome.
So thank you for doing that.
Thanks for all your hard work on
it.
Is there anything else you'd like
people to know?
Alicja: Nothing's coming to my mind,
to be honest.
Michael: Well actually, because you started with throwing shade at pg_stat_statements, I did wonder... So, for example, with pg_stat_statements, we can see things like buffers. We can see the amount of work done by each query: reads, writes, temp, shared, that kind of thing. Is there a way to see that kind of information? I guess we could look at the auto_explain logs, but I don't think pgBadger is doing anything at the moment to aggregate those per query group.
Alicja: No, I've never seen that. I'm just wondering whether, if we changed the error verbosity, that would be logged and then presented, but I haven't seen it, so I don't want to guess. But for me, actually, with pgBadger, I go to the top queries. Maybe let's go through the process first, that will be easier. So when I see the top queries, sometimes you just see what's wrong. Really, pretty often, much more often than you would guess, you just see: okay, this is a NOT IN, or something similar. So you just see it, and you don't even need to run EXPLAIN ANALYZE. But if not, then I just take the query and run EXPLAIN ANALYZE on it, and that's the next step I would do. So yes, you are right.
Michael: And then you can ask for
buffers.
Alicja: Exactly, and then like buffers,
yes, absolutely, I would
use that option.
Michael: So just to clarify on that, I think this is great. Top queries is a really good point, and it orders those by total execution time across all calls, right? So it could be a slow query that's run a few times that's taking up the most resources, or it could be a fairly fast query that's being run thousands of times per second, as you mentioned. And I believe the main report orders them by total execution time. Is that right?
Alicja: You've got a couple of sections on this page. The most important one, the one I'm using, is exactly the one you mentioned: total execution time. That is always the first thing I look at, what took the most time from the system, and it doesn't matter to me whether it was just one query or multiple executions of other queries, right? It's probably the third section from the top. You also have other aggregations, like the slowest individual queries and other stuff, but I usually don't look at those, to be very honest. This one is absolutely my favorite. If I need to, I go to the others, but that's absolutely my main starting point, to go there and look.
Michael: Nice. That's what I like to recommend people do with pg_stat_statements as well. But with pgBadger, we get a more opinionated view that shows people what to look at, whereas with pg_stat_statements people can query and order by whatever they want. So I see a lot more people making the mistake of just looking at their slowest queries rather than the totals.
Amazing.
Alicja, thank you so much for your
time today.
Is there anything you want to plug
or anything you want to share
with people?
Alicja: Actually, if I may plug my songs, because I just realized not many people know my songs. I produced two songs: the first was about vacuum, and for the second one, you were actually the author of the lyrics, right? So there are two Postgres songs out there, and we are co-producers of the second one. So yeah, that would absolutely be my plug.
Michael: That's kind of you. The lyrics are extremely basic for that second one, so I apologize to people in advance. But you did an awesome job with the vocals, that was not me, and the video as well. It was a nice one from all the Postgres events, so yeah, you did an awesome job. I will happily link those up in the show notes.
Alicja: Thank you.
Michael: Thank you so much, Alicja. It's been a pleasure. Take care.