pgBadger
Michael: Hello and welcome to PostgresFM,
a weekly show about
all things PostgreSQL.
I am Michael, founder of pgMustard,
and today I am joined by
Alicja Kucharczyk, Program Manager
for Azure Database for PostgreSQL
at Microsoft and organizer of the
Warsaw PostgreSQL user group.
Thank you so much for joining me
today Alicja, it's an honor
to have you here.
Alicja: Thank you for having me
Michael.
Michael: Wonderful. Well, I proposed a topic this week, mainly because you're one of the biggest fans, or at least advocates, of this tool, I think, in the world. But it's probably an underused tool, in my opinion anyway. I definitely don't use it at times when I could or should, and I know most people that could be using it have never even heard of it, at least newer users. So what are we going to be talking about today?
Alicja: Yeah, about my favorite tool, which is pgBadger. It's totally open source, which is really great, and it allows you to do really comprehensive troubleshooting. I believe it's top of its class.
Michael: Yeah, absolutely. Do you want to give us a little bit of background on the tool, or how people use it?
Alicja: Yes, absolutely. Maybe I will start with my own story, because I learned that the tool exists, I believe, over 10 years ago, when I was working for a Polish bank. The process there was that each time something happened, they would generate a pgBadger report. That was always the first step they took.
So that was the place where I learned
that this tool exists.
And over the years, as I moved from company to company, I became a Postgres consultant, and pgBadger became my main tool for troubleshooting.
Because as a consultant, you don't
really know the system.
You go to the customer, you know something has happened, and the customer says, for instance, that CPU is very high or query latency has suddenly grown, but you still don't really have any knowledge about the system.
And pgBadger is really one of the
best tools to quickly
get knowledge about the workload
that is happening within your
database.
I also really like that pgBadger shows you a really comprehensive picture. You might know other, similar tools, like pg_stat_statements.
And it's not that I want to say
that this is a bad tool or anything.
I do use pg_stat_statements.
I know that many people do, but
pg_stat_statements is actually just
one tab in the pgBadger report.
And next to that, you've also got a report about temporary files. You've got really nicely aggregated errors: everything in the logs that is an error, fatal, or warning is nicely aggregated on the events tab. You would also have the locks, locks with a C-K, right? Not logs, but locks. I know it's easy to mix those two. So you have the locks, and you've also got the number of connections and the number of sessions.
So you really have a comprehensive
view.
For instance, with pg_stat_statements, you see, okay, this query was executed 1,000 times, the total time was this, and the average time was this, right? You would have the same thing in pgBadger, but the difference is that in pgBadger you can quickly go to another tab, for instance the temporary files tab, and see that, okay, this execution time was actually caused by this query generating temporary files, or maybe it was waiting for a lock. So that's why I prefer pgBadger over pg_stat_statements and other tools: because it gives me this comprehensive view.
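For reference, a minimal sketch of what running it looks like, assuming pgBadger is installed and with placeholder paths:

```bash
# Parse a day of Postgres logs and produce a self-contained HTML report
pgbadger /var/log/postgresql/postgresql-2024-01-01.log -o report.html
```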
Michael: Yeah, I like it.
Shots fired right up top.
I'm a big fan of pg_stat_statements
as well, so I'm hoping to
be somewhat converted here.
I definitely think there's pros
of both.
So pgBadger is a log analysis tool.
And it's a standalone tool as well, right? Open source, as you said; we can install it wherever and analyze our log files for anything that we can spit out to the log. So, as you mentioned: locks, wait events, and errors as well. That's a big missing part of pg_stat_statements.
It's quite common advice to have a statement timeout, right? If you have queries that are taking more than, I don't know, maybe 30 seconds on some systems, time them out instead of letting them complete.
In pg_stat_statements, we don't see those at all, but in pgBadger, we can see those because...
Alicja: ...of the errors, right? Absolutely. Yes, yes.
And the errors actually help with other problems too. Recently I was troubleshooting logical replication errors, and if you just have the plain text log and you've got one error, it's really not easy to go up thousands of lines and see what the root cause error was, right? Because sometimes you've got one error that is causing all the other errors, like in this case. And in pgBadger, you see that straight away, which is great because of the aggregation that pgBadger is doing.
Michael: Nice.
So, because we're reliant on the logs here, I would like to ask: a lot of the default settings mean a lot of important things don't get logged. So for many people, if they're new to Postgres, or if they're using a managed provider and haven't changed any settings, their logs won't have many of those things. So if they ran pgBadger now, I'm guessing a lot of them wouldn't get some of the reports you're talking about, because they wouldn't have the right settings on. Is that fair?
Alicja: Yes, yes, absolutely. With default settings, you would probably have a very poor pgBadger report, because as you said, pgBadger is just a Perl script that takes the text logs as input and generates an HTML report as output, right? So whatever is in the logs will be parsed and aggregated into this really nice pgBadger format. But if something is not in the logs, then obviously you will not see it, right? So for instance, if you want to see the lock waits, you need to have log_lock_waits on. There is one setting in Postgres that you need to turn on to have lock waits logged and be able to generate that.
Michael: Yeah, log_lock_waits,
I think it is.
Alicja: log_lock_waits, yes. I believe this is the one, and it's responsible for the locks tab in the pgBadger report. So if you've got this off, you will not see anything there.
The log_line_prefix is also super important. The default prefix is pretty much empty, and I'm using the one that is actually recommended by pgBadger. It contains a lot of stuff: the database name, user name, application name, the timestamp of course, and it also has the host, so the IP the connection is coming from. And this allows you to have a really rich pgBadger report, because pgBadger is also able to split the queries into different tabs per database and per user. So it's able to show you, for instance, how many select queries ran on a particular database or were issued by a particular user. But you need log_line_prefix for that. Otherwise, with the really short default, you will not see a lot of useful information.
But I believe the pgBadger documentation is pretty good here; you've got the entire list of the settings that you need to set.
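For reference, the prefix recommended in the pgBadger documentation for stderr logging looks like this; the user, database, application, and client fields are what enable the per-user and per-database breakdowns she describes:

```
# postgresql.conf, prefix suggested by the pgBadger docs for stderr logs
log_line_prefix = '%t [%p]: user=%u,db=%d,app=%a,client=%h '
```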
And most of those are safe. I say most, because there is of course one setting, log_min_duration_statement, which might be super dangerous, right? Maybe just to explain: log_min_duration_statement is the setting that regulates how many queries we log into the text logs. By default it is -1, which means off, and 0 means log everything. And everything means even SELECT 1, something that takes under a millisecond, right?
With 0, you will see a huge, huge impact on your database. So you always need to adjust this to your system, to the workload you've got, to how intensively your database is processing queries. If you've got an analytical database, with super huge data volumes but, yeah, 5 queries per hour, then log_min_duration_statement = 0 is probably okay. But with thousands or millions of queries per second, this is a killer. I saw even 40% overhead on overall performance with log_min_duration_statement = 0. So this is dangerous. I usually start with something between 1 minute and 10 minutes, depending on the system.
Michael: Oh, wow.
Alicja: Yeah. So it depends on how intensive the system is, on what the customer tells me: is it super intense or not? Let's say we start from 1 minute and then we look at the overhead. If it's huge, we go up. If it's not huge, then we try to go a bit lower with the setting.
Michael: When you say 1 minute,
you mean only log the duration
of statements that take longer
than 60 seconds?
Alicja: Yes, yes, yes, yes.
Michael: I thought you're going
to go much lower, but that makes
a lot of sense, just in case there's
a huge overhead.
Alicja: Exactly, exactly.
This is like to be on the safe
side, right?
Because otherwise if you've got
a production system, then you
go too low and then you've got
an outage.
Michael: And I don't think this requires a restart, right? So you can reload the config, and it's not a huge overhead to change that one. But yeah, that's a really good tip.
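A minimal sketch of that change, using the conservative 60-second starting point discussed above:

```bash
# Raise the slow-query threshold and reload; no restart required
psql -c "ALTER SYSTEM SET log_min_duration_statement = '60s'"
psql -c "SELECT pg_reload_conf()"
```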
Without that, we get almost nothing; we don't get any reporting on the slow statements. But I guess the trade-off is: if our system's biggest bottleneck is CPU, say the customer mentioned they have high CPU, and that's mostly being caused by millions of executions of very fast queries, then we don't spot them. So it's that trade-off between slow queries being the issue versus lots of fast queries being the issue.
Alicja: It is, but this is a very good point, because you are right: I saw outages where we had a lot of super small queries that were executed often, like a thousand times a second. And we've actually got 2 ways to handle that. One way would be to have some kind of cron job that changes log_min_duration_statement to 0 and, after some short period of time, switches it back, right? That would be one.
And the second one would be the newer method that I haven't tried on production yet, which is sampling. It's in Postgres now, and it might be something to consider. So for now, I was just doing cron jobs, but to be very honest with you, it's a problem when you are analyzing the report, right? Because you see really weird charts. So you always need to keep in mind that you've got this switch in the cron job.
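A hypothetical sketch of such a capture window (thresholds and timings are examples; the sampling she refers to is likely log_min_duration_sample plus log_statement_sample_rate, added in Postgres 13):

```bash
#!/bin/bash
# Hypothetical capture window: log every statement for 5 minutes,
# then fall back to a conservative 60-second threshold.
psql -c "ALTER SYSTEM SET log_min_duration_statement = 0"
psql -c "SELECT pg_reload_conf()"
sleep 300
psql -c "ALTER SYSTEM SET log_min_duration_statement = '60s'"
psql -c "SELECT pg_reload_conf()"
```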
Michael: Yeah, or for the fast
queries, rely on something like
pg_stat_statements if you have
that on already, for example.
But back to pgBadger: I looked at the docs at some of the other settings people might want to switch on that have low overhead. Stop me if I mention one that is dangerous. We have log_checkpoints, log_connections, log_disconnections, log_lock_waits, which you mentioned already, and log_temp_files.
Do you do any... So the recommendation in the pgBadger docs is to set log_temp_files to 0?
Alicja: No, I just set it to 0, you know.
Michael: Okay, great, so that's good; you don't see that as being anywhere near as dangerous. I have seen some recommendations to start with a higher setting for that and go downwards as well, which I guess people can do if they want to be extra cautious. But if you've tuned work_mem at all already, then I guess you should be pretty safe to put that to 0.
Alicja: I've never seen any problem when I tweaked temp files or checkpoints, to be very honest. Maybe there are some narrow use cases, but over the years I've never seen any problems with the other settings. The same with autovacuum: people were also a bit afraid of logging autovacuum, but it's really just a couple of lines per minute, and it will not hit your system that much.
Michael: Nice. And yeah, you've got one that I forgot there, which is log_autovacuum_min_duration. Set that to 0 as well, so that we get the important vacuum stuff. Great. So that lets people get this set up.
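Putting those together, the low-overhead settings from the pgBadger documentation look roughly like this:

```
# postgresql.conf, low-overhead logging settings per the pgBadger docs
log_checkpoints = on
log_connections = on
log_disconnections = on
log_lock_waits = on               # feeds the locks tab
log_temp_files = 0                # log every temporary file with its size
log_autovacuum_min_duration = 0   # log all autovacuum activity
lc_messages = 'C'                 # pgBadger needs log messages in English
```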
And I've checked some hosting providers. So, seeing as you work on a managed service provider: do you have some of these set by default, or is this something that people need to go in and change?
Alicja: Yes. Because I work at Microsoft, it's easiest for me to talk about how we've got the setup done; I don't know that much about other cloud providers. We've got log_connections and log_disconnections set to on, and users cannot change them; they're read-only. That's because we need them for our internal telemetry. So those are just on.
Also, pgBadger needs the logs to be in English, and this is also the default on Azure. So a lot of these are already enabled on Azure, which is good news. But you still need to enable lock logging, autovacuum logging, logging for the other stuff, and log_line_prefix.
I know that not all cloud providers let you change log_line_prefix, as far as I know at least. We do allow that, so this is good news, but I've heard that not all do; AWS probably doesn't allow it, as I remember. At least it didn't a couple of years ago.
I remember writing a script to parse the logs, to be able to generate a pgBadger report, because we were not able to change log_line_prefix. And they also had a really weird format, so we needed a script that would parse that. But the good news is that pgBadger has introduced support for this log format; you've got a separate format, which is RDS. So you don't need custom scripts anymore. So this is good news.
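A sketch of that; the -f rds format switch is documented, and the file name is just an example:

```bash
# Parse a log exported from RDS without any custom preprocessing
pgbadger -f rds postgresql.log.2024-01-01-00 -o rds-report.html
```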
Michael: Yeah, I was so impressed reading through the pgBadger docs. With all the different formats it supports, it's an incredibly flexible tool, almost as if it's designed and built by people that use it all the time. There are definitely tons of options, and lots of auto-recognition of formats and file types and things like that.
So for a lot of people, I would guess it would just work. And for people on cloud providers: you've written a great post about using it with Azure, but I also found a couple of posts from people on the RDS and Google Cloud SQL teams that explain how to use it with their systems as well. If any other cloud providers are listening and have guides, please send them to me and I'll include them in the show notes. But yeah, it feels like this is a tool people could be using even with managed Postgres providers.
Alicja: Exactly. It makes me really happy to see the blog posts from AWS and Google Cloud, because exactly as you said, it means that the users of those managed services are using pgBadger, so it was probably needed.
Michael: Nice, yep, and it also means this thing can live on, right? We don't have to completely replace it with cloud-native tools; it should still work. The logs are still the only place Postgres currently puts some of this information, so at some point we're going to have to look at the logs, and having a tool to do so often helps.
Cool, all right. So we've mentioned the min duration setting being a potential gotcha; are there any others that we need to be cautious of when using this?
Alicja: Among the parameter settings, that is actually the most dangerous one; not much else, I would say. The other thing you might have a problem with is the size of the logs themselves.
I sometimes had a problem where I was receiving just terabytes of logs from the customer, and it was: okay, here are my logs, generate whatever you want from that, right? And yeah, that also might be a problem. And although pgBadger has this -J option, like most Postgres tools, on my own laptop it still took days. So if you've got a really huge amount of logs, you probably still need a powerful VM just to parse them. Otherwise it will not work.
Michael: Yeah, I saw it has both -j lowercase and -J uppercase, so it has two different ways of handling this. I think the first one is for parallelizing: having multiple CPUs handle a single large log file. I hadn't heard of terabytes; all the examples I saw were in gigabytes, so that's quite scary. But I think -J was even faster if, let's say, you were given 20 large log files and you wanted to give it 20 CPUs and handle them one CPU per file. So it's quite cool that it has multiple ways of handling huge workloads, but terabytes of logs is terrifying.
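Roughly, the two modes look like this (core counts and paths are examples):

```bash
# -j: split one large file across 8 cores
pgbadger -j 8 huge-server.log -o report.html

# -J: parse 20 files in parallel, one core per file
pgbadger -J 20 postgresql-*.log -o report.html
```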
So I saw that it has so many options. By the way, I have a quick trivia question for you. It has so many single-letter options that I noticed there are only 2 letters in the English alphabet that it doesn't use. Can you guess which letters they are? Only K and Y. It's not a fair question. And they even have an option that's keep comments, so my recommendation to the pgBadger team is to make that the K, and then they only have to find 1 more and they have the whole alphabet. Anyway, sorry, that's an aside. But...
Alicja: No, no, this is actually a great point, because sometimes the customers have some uncommon questions, right? For instance, I had a customer that had a performance problem, and they wanted to share a pgBadger report with me, but they couldn't because of the data, right? Because they were not able to show me the queries.
And I didn't even know this, but yes, pgBadger has so many options that I quickly checked, and there is this anonymize option. So you can find a lot of stuff, really a lot of options, in pgBadger. So this is actually a great point.
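That option looks like this (file names are examples):

```bash
# Obscure literal values in queries so the report can be shared safely
pgbadger --anonymize postgresql.log -o anonymized-report.html
```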
Michael: I was going to ask about 2 of them actually, B and E, so beginning and end. If we have a huge log file, let's say it's from a month of logs or something, and we only care about a day or 2 when we had problems: if we set beginning and end, would that drastically speed things up as well?
Alicja: It depends on the size of the file, right?
It's more of an operating system question, I would say, because if you've got a really huge file, your operating system needs to read it anyway, right? So in this case, it might not help that much. It's a long story, but with really huge logs, huge files, I've tried a lot of stuff. I've tried splitting the files using just the split command; I tried a lot of different ways. But finally, what works best is just a huge VM. That's the easiest way: you put it on a huge VM and you've got it solved.
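For reference, restricting the report to a known problem window uses the begin and end options (dates are examples):

```bash
# Only parse entries from the two problem days
pgbadger -b "2024-01-15 00:00:00" -e "2024-01-16 23:59:59" postgresql-*.log -o incident.html
```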
Michael: Nice.
Cool.
Okay.
I saw you hosted one of the PGSQL Phriday events. We've done a couple of podcast submissions to that event, and yours was on pgBadger. But I saw that Gilles Darold submitted a post, and one of the things they mentioned was flexibility. Like a typical Postgres consultant, or not typical but extra on the Postgres side, they parse the logs into a Postgres database in order to be able to query them, giving themselves a little bit more flexibility. So I guess we are somewhat limited by the reports that pgBadger gives us, but in the real world, it's giving us so much useful information that I guess those cases are few and far between, and we get most of the way there just with the pgBadger report. Is that fair?
Alicja: Yeah, I saw the answer from Lætitia, and it's actually interesting, parsing this into the database. I've had a lot of asks about that, especially from ex-Oracle DBAs, right? They always ask if there is a way to do that. So it might be useful. But when I was a consultant, because I'm not that much anymore, I preferred to use something that I can just send a link to, a tool that is already there and works, and just say: generate me the data, rather than focusing on writing my own tools. But OK, I don't say it's wrong or anything; it's just easier for me.
And what I also like is the visual part, because the visual part really allows you to get to the root cause super quickly. It's not that easy with text: if you're running a query and you've got the result, okay, but you need to focus a bit more. In pgBadger, you've got the chart and you just see everything straight away. So it's way easier and way faster for me, and you've got all the answers straight away. It's just a matter of preferences, I would say, and my preference is to have a graphical presentation.
Michael: Yeah, that's a huge pro for a lot of people, me included. You've also touched on another one there, which I don't think we've mentioned: how tried and tested this is. I think I looked it up, and pgBadger v1 came out around April or May 2012, and it's version 12.4 now, with a ton of bug fixes and all sorts of feature improvements since then. So compared to starting to write your own tool: we've already mentioned some of the features we get, but there are honestly dozens of settings, more than the ones I mentioned. Not just nearly every letter covered, but multiple letters have lowercase and uppercase versions, and there are quite a few options that don't have shortcuts, another few dozen that you have to set separately. So it's incredibly flexible, and it's been tested for many, many years, which is worth a lot.
Alicja: Exactly, exactly. To be honest, to write your own tool that would be that comprehensive, with so many dashboards and so much information in it, would take years. So yeah, I prefer to have this tool instead of writing my own, but I totally get people that prefer to write their own tools.
Michael: Nice. And is there anything they could be doing? I think it's Gilles Darold, I don't know quite how to pronounce his name, but the maintainer. Is there anything they could be doing, or anything you'd like to see in pgBadger in the future?
Alicja: I would like to see support for Azure; that would be my main thing. And probably also the thing about the locks, and I know this is not easy, because it's not that easy to get this from the logs. In the locks tab, you only see the queries that were blocked, not the queries that caused the lock. And of course there is a simple reason for that: it's how Postgres does this, right?
But if you've got log_min_duration_statement at 0, right, and you've got the PID in the prefix, you can correlate, because Postgres will normally tell you: okay, this PID is blocking your query. So then you can search for that PID, and maybe you've got the query, and you try to correlate them. Because it's not that useful if you just see the queries that are blocked and not what is actually blocking them, right? So you need to have an additional cron job with a lock query to compare with.
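A hypothetical sketch of such a cron job, using the built-in pg_blocking_pids() function (available since Postgres 9.6; the output path is an example):

```bash
# crontab entry: every minute, snapshot who is blocked and by which PIDs
* * * * * psql -At -c "SELECT now(), pid, pg_blocking_pids(pid), left(query, 200) FROM pg_stat_activity WHERE cardinality(pg_blocking_pids(pid)) > 0" >> /var/log/pg_lock_snapshots.log
```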
So yeah, that for sure would be an ask, but I'm not sure if this is actually an ask for Gilles. I'm also not sure how to pronounce his name, so I'm saying Gilles. I'm not sure if this is an ask for Gilles or more for Postgres contributors, to be honest, just to have that in the logs.
Michael: Yeah, to log more information. So yeah, that makes sense; it might need to come from Postgres itself.
Cool.
And yeah, it's released under the
PostgreSQL license, which is
one of the most permissive licenses
I've ever seen, which is awesome.
So thank you for doing that.
Thanks for all your hard work on
it.
Is there anything else you'd like
people to know?
Alicja: Nothing's coming to my mind,
to be honest.
Michael: Well actually, because you started with throwing shade at pg_stat_statements, I did wonder... So, for example, with pg_stat_statements, we can see things like buffers. We can see the amount of work done by each query: reads, writes, temp, shared, that kind of thing. Is there a way to see that kind of information? I guess we could look at the auto_explain logs, but I don't think pgBadger is doing anything at the moment to aggregate those per query group.
Alicja: No, I've never seen that. I'm just wondering whether, if we changed the error verbosity, that would be logged and then presented, but I haven't seen it, so I don't want to guess. But for me, actually, with pgBadger, I go to the top queries. Maybe let's go through the process first, that will be easier. So when I see the top queries, sometimes you just see what's wrong. Really, pretty often, much more often than you would guess, you just see: okay, this is a NOT IN, or something similar. So you just see it, and you don't even need to run EXPLAIN ANALYZE. But if not, then I just take the query and run EXPLAIN ANALYZE on it, and that's the next step I would do. So yes, you are right.
Michael: And then you can ask for
buffers.
Alicja: Exactly, and then like buffers,
yes, absolutely, I would
use that option.
Michael: So just to clarify on that, I think this is great. Top queries is a really good point, and it orders those by total execution time across all calls, right? So it could be a slow query that's run a few times that's taking up the most resources, or it could be a fairly fast query that's being run thousands of times per second, as you mentioned. And I believe the main report orders them by total execution time. Is that right?
Alicja: You've got a couple of sections on this page. The most important one, the one I'm using, is exactly the one you mentioned: total execution time. That is always the first thing I look at, what took the most time from the system, and it doesn't matter to me whether it was just one query or multiple executions of other queries, right? It's probably the third section from the top. You also have other aggregations, like the slowest individual queries and other stuff, but I usually don't look at those, to be very honest. This one is absolutely my favorite. If I need to, I go to the others, but that's absolutely my main starting point, to go there and look.
Michael: Nice. That's what I like to recommend people do with pg_stat_statements as well. But with pgBadger, we get a more opinionated view that shows people what to look at, whereas with pg_stat_statements people can query and order by whatever they want. So I see a lot more people making the mistake of just looking at their slowest queries rather than the totals.
Amazing.
Alicja, thank you so much for your
time today.
Is there anything you want to plug
or anything you want to share
with people?
Alicja: Actually, if I may plug my songs, because I just realized not many people know my songs. I produced two songs: the first was about vacuum, and for the second one, you were actually the author of the lyrics, right? So there are two Postgres songs out there, and we are co-producers of the second one. So yeah, that would absolutely be my plug.
Michael: That's kind of you. The lyrics are extremely basic for that second one, so I apologize to people in advance. But you did an awesome job with the vocals, that was not me, and the video as well. It was a nice one from all the Postgres events, so yeah, you did an awesome job. I will happily link those up in the show notes.
Alicja: Thank you.
Michael: Thank you so much, Alicja. It's been a pleasure. Take care.