RegreSQL
Michael: Hello and welcome to PostgresFM, a weekly show about
all things PostgreSQL.
I am Michael, founder of pgMustard, and I'm joined as usual by
Nik, founder of PostgresAI.
Hey, Nik.
Nikolay: Hi, Michael.
Michael: And we also have Radim Marek with us, who is a consultant
and Postgres enthusiast, whose BoringSQL site we have mentioned
many times on the show.
Welcome, Radim.
Radim: Hello guys.
Thank you for having me.
Nikolay: Thank you for coming.
Michael: Yeah, absolutely.
So we wanted to talk about a newish project you have kicked off
called RegreSQL, which is about regression testing SQL queries.
Very cool.
It got quite a lot of uptake on Hacker News, a lively discussion
on there.
Congrats.
But also, I'd be interested in discussing with you about how
you got into the topic, like regression testing in general.
Where would you like to start?
Radim: Okay, so that's a tough topic to start with.
So effectively it wasn't my project, or I didn't start it as my
project; it actually started as a complementary project to a book
called The Art of PostgreSQL, which I still believe is one of the
best things you can do to learn how to work with Postgres.
Nikolay: Dimitri, right?
Radim: Yes, Dimitri.
And I've been trying to get in touch with him forever, like for
the last 4 years, but I haven't been able to, so maybe if he hears
this podcast I will have a chance to talk to him.
So I started with the project because I was a bit worried about
how people deal with the SQL queries in their code.
Because usually it's either generated by an ORM and you don't see
it, or it's a string hard-coded in, in my case, a Go file, or you
can see it in TypeScript files.
And all the testing around this originally kind of bothered me,
because you have infrastructure which is difficult to start, set
up, and everything.
And I said there must be a simpler way.
Something where actually running a simple query doesn't take
minutes or seconds to get all the infrastructure spun up and torn
down.
And I found RegreSQL, which does this job.
And I tried to fix up the project.
And fast forward, I think, 4 years, because it was actually quite
a long time ago.
I found other use cases, because I'm working on some educational
products and I needed to test whether users are hitting the
metrics that I'm interested in.
And then I kind of figured out, yes, this is actually where it
can come in, because it was comparing the output, the correctness
of the query, and that was only a step away, because then I
realized, okay, every single incident is based on increased time
of SQL queries, and it was just, you know, a small step to go
from something that is unpredictable, because time is
unpredictable, to something predictable.
And this is where I started.
So I started committing, I started
adjusting it.
And right now I'm actually working
on a version 2.0, which is
a big update in terms of functionality
and in terms of developer
experience.
Nikolay: But how do you approach the testing?
So I understand the tool.
Regression testing is super important these days, especially
because there are no more excuses not to have 100% test coverage,
and proper coverage, not just good-looking coverage.
And performance testing in CI, and even before your AI assistant
commits and pushes code, it should probably have happened before
that.
How do you approach the fact that
we need a lot of data to have
proper plans?
Radim: Okay, so this is effectively a never-ending topic: how
developers can even get to something which is representative of
their database.
I actually have a name for this one; I think it came up somewhere
on Hacker News with somebody from Supabase.
We arrived at a topic which was called: it works on my database.
That's the same issue.
But to answer your question, as I said, initially I naively
thought I could use timing, but timing is all over the place and
that wasn't working.
So the only way to approach this one is buffers and cost.
Nikolay: Let's pause here because I think it's super important.
I think the industry doesn't understand it yet.
We've chatted about buffers with Michael for 2 years.
Finally, Postgres 18 has it by default in EXPLAIN ANALYZE; this
is great, like absolutely great.
So, the industry still doesn't realize how important it is to
shift focus from timing to buffers, right?
Why is timing not enough?
Why is it not so good when you have to control performance?
What's your point of view on this?
Radim: Okay, so timing.
Timing is a perfect metric because this is what you see if you
come in in the middle of an incident.
Timing is actually the easiest metric to spot.
Queries are timing out; hopefully all of us have, you know,
statement timeouts, transaction timeouts, everything in place.
Nikolay: And humans need good timing.
This is our final goal, right?
Radim: Exactly.
But the problem is, A, if you want something fast, EXPLAIN alone
doesn't give you timing, because that's just, you know, what the
planner thinks.
And EXPLAIN ANALYZE can't give you something that is predictable.
So if you want to do regression testing, you can't rely on it if
you are running on, I don't know, an underutilized AMD machine or
an ARM machine.
Or, you know, everybody these days has Apple Silicon Macs and
they are super fast.
Their I/O is fast.
So timing is generally, that's the-
Nikolay: Doesn't work in CI.
Radim: It doesn't, yes.
Well, it doesn't even work in production because you will get
a big spread of queries.
You still have outliers.
One customer can hit 10,000 rows.
Another customer hits 5 rows.
How do you...
Nikolay: What about the lock wait on lock acquisition as well?
Yeah, yeah.
That's great.
Yeah.
Michael: I think it's even worse in CI, because flaky tests are a
nightmare.
So even if, for example, we accepted double the time, if we said
fail this test if it takes double the time, and that worked 99%
of runs, then when it fails for a bad reason, when it fails having
picked the same plan, no indexing changes, nothing has actually
changed, but there was just something else happening on the
machine at that moment, and it kind of blitzed past its p99, we
got that one time it was slow, and the test fails, that's not a
flaky test, that's a test that sometimes fails when it shouldn't,
like a false positive.
And those are always the nightmare tests.
Nikolay: Before we proceed to the solution, we all agree, buffers,
right?
Still, let's pause maybe; I have a crazy idea I've been thinking
about for several years.
Maybe it's time to implement it.
It's probably a better topic for the hacking sessions with Kirk
and Andrey that we usually have on Wednesdays, live.
But I'm very interested to hear your opinion.
What if we still keep analyzing timing, but imagine, you know,
wait events, right?
They are usually used in aggregated form, like how many sessions
are in which state, aggregated, and so on.
What if that analysis were brought to EXPLAIN ANALYZE, to
understand how exactly that timing was spent?
Was it waiting on heavyweight lock acquisition?
Or was I/O too slow this time?
In this case, wouldn't it help to structure the whole latency, to
split it into pieces, and in the CI context set aside the pieces
which are unpredictable: okay, this doesn't matter, waiting time,
it doesn't matter, waiting on lock acquisition.
But here we see a lot of actual CPU, non-waiting time, which
usually shows up as null, right?
But there is a trick there.
But anyway, wouldn't it help to still keep timing as the main
metric for regression testing in CI for performance?
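For reference, the aggregated form of wait event analysis being referred to looks like a quick sample of what active sessions are waiting on; the per-query breakdown inside EXPLAIN ANALYZE is only an idea here, not something Postgres does today.

    -- Sample current wait events across active sessions (aggregated form).
    -- A NULL wait_event means the backend is on-CPU rather than waiting.
    SELECT wait_event_type,
           wait_event,
           count(*) AS sessions
    FROM   pg_stat_activity
    WHERE  state = 'active'
    GROUP  BY wait_event_type, wait_event
    ORDER  BY sessions DESC;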
Radim: I believe in theory that would work, but for me the
complexity behind this is already covered at the cost and buffer
level.
So I'm not sure the benefit would actually be better metrics.
Because what I found out is, and we can get to the discussion,
you know, buffers are fixed.
If you have a predictable data set that you start with, and you
have an ideal-state scenario.
I think that's important, the ideal state, because production is
never in an ideal state.
You will have bloat, you will have, so, you know, my regression
testing is never-
Nikolay: Always changing.
Radim: Exactly, everything is changing, so it depends.
But if you want to do regression testing, you just need a metric,
one or two of them, which are predictable.
Buffers are always predictable.
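For reference, the counters being discussed are the ones EXPLAIN (ANALYZE, BUFFERS) prints; in Postgres 18, BUFFERS is on by default with ANALYZE, while on older versions you spell it out. The table and filter here are hypothetical.

    EXPLAIN (ANALYZE, BUFFERS)
    SELECT *
    FROM   orders
    WHERE  customer_id = 42;

    -- Look for lines like:
    --   Buffers: shared hit=128 read=12
    -- On the same data snapshot these counts stay stable across runs,
    -- unlike execution time.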
Nikolay: Yeah, I'm just so interested, because you came and are
talking about what I have basically been shouting about at every
event.
Buffers, I even have a t-shirt with EXPLAIN (ANALYZE, BUFFERS),
where BUFFERS is bold.
Michael: By the way, it can depend a little bit, unfortunately.
It depends what you mean by buffers, but if you include planning
buffers, which are sometimes reported, there are states where,
for example, if autovacuum just ran, or if ANALYZE has just run,
you might end up with a different number of planning buffers than
if it hasn't just run or if it hasn't run recently.
So there are slight variations.
So yeah, it might not be perfect.
But cost is an interesting one, for sure.
Nikolay: So, okay, what I'm going to do, I will postpone my
comments about why not only buffers, and maybe we'll return to
timing and discuss it later, but let's agree with you that buffers
are the thing that doesn't depend on the state so much.
So we just check hits and reads of shared buffers, also temporary
buffers, like everything.
It's I/O.
Databases are all about I/O.
Right?
But still the question remains about your approach: you need a
lot of data to reproduce the executor behavior so you will see a
proper number of buffers, right?
You cannot do it with a small database.
Radim: Yes, that's the "it works on my database" part, because
developers will start with hundreds of rows.
The cardinality of the data is going to be very small.
And I haven't seen a single company that didn't have a problem
with how to get a representative set of data onto developers'
machines.
Well, I know there are attempts, but if you take an average
software team, they usually have huge issues with this one.
So my issue, and that's actually part of the reaction to Michael's
comment, is that I always have to start with a clean slate.
So for me, this is not production, so I can count on the fact
that vacuum ran and the statistics are up to date.
The original solution, which I described, fixtures, is the obvious
one, so, you know, I will always have to improve the ways of
building the fixtures.
But the way to do it, and this is actually my roadmap towards
version 2.0, is snapshot building.
You've probably seen, well, you actually have seen, because you
have discussed it, that article about instant database clones.
That article wasn't a coincidence; it actually came from my need
to get a fast starting state for the regression testing, and I
needed something that works.
This is why I had to dig, and then, same as you guys, I was
surprised it's already there, you know, because it wasn't
something I was aware actually existed in version 18, but suddenly
everything was done.
So it's not for production use, which you mentioned correctly,
because there are still issues.
But if you have a fixed, easy-to-get starting point, if you
imagine you have a predictable build, which actually is coming,
it's already committed, but it's coming in version 2.0 as part of
the developer experience.
And you have fresh statistics, you can run the query.
And the Postgres configuration is actually a very interesting
part, because you can change the costs, how they are calculated,
but you will always get the same buffers.
Exactly.
Nikolay: If you talk about hits plus reads probably the distribution
will be different.
Also hits might be hitting the same buffer multiple times.
So there are nuances there.
Radim: But the total is always the same.
And, Nik, you said it: all of us came from spindle disks.
So you remember times when you'd try to hit Ctrl+C on a query and
it was just flushing WAL files to disk.
And we can say that I/O is cheap these days, but it's not.
It's still the key metric that you have to watch.
And I/O, wherever there's contention, will affect time.
So for me, certainly buffers are number one.
There is nothing more predictable.
Cost is very close.
And I actually did, well, I haven't done scientific research on
the topic, but if you take one version of Postgres and you run it
on different architectures with the same settings, the variance
is surprisingly low.
Nikolay: Not all settings.
There is a specific set of settings, because you can have much
less memory, shared buffers can be very small; you just need to
set a proper effective_cache_size, all the proper planner
settings, and work_mem.
We have an article about it.
And then I agree with you, it will be the same planner behavior
for the same version.
That's it.
It's great.
It's magic.
You can have a Raspberry Pi and reproduce the behavior of a huge,
like 800-core, machine.
It's insane.
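A minimal sketch of that idea, with hypothetical values: match the planner-relevant settings to what production reports, and even a tiny machine will produce the same plans and costs.

    -- Hypothetical values; copy the real ones from production.
    ALTER SYSTEM SET effective_cache_size = '96GB';
    ALTER SYSTEM SET random_page_cost = 1.1;
    ALTER SYSTEM SET seq_page_cost = 1.0;
    ALTER SYSTEM SET work_mem = '64MB';
    SELECT pg_reload_conf();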
Radim: Exactly.
This is what I did.
I actually took my old Raspberry
Pi, I took Hetzner, I took a GCP machine, and the difference was
something around 2%.
So right now I have a threshold of 10% by default, but I'm looking
forward to actually making it configurable.
That's great.
And let's let people decide.
And then the answer to all your initial questions is I don't have
to care about timing, because in production, if you have 1,000
transactions per second, I/O will get saturated, you know, and
that will naturally raise the timing.
So if you care about buffers, you don't have to care about timing,
because timing is just an extrapolation of problems that happened
before the query actually returns.
Nikolay: Some people say, like, let's do proper full-fledged
benchmarking just to study the behavior of the planner for a few
queries.
It's a wrong idea, because you have so many moving parts around.
You just narrow down to a single plan, a single session, right?
And focus on what matters.
But here, like my...
I have...
I had that question before.
So you mentioned CREATE DATABASE in Postgres 18, you mean the
strategy, right?
When you can do a copy-on-write clone, when you create a database
from a template database, but not in the regular way, basically
very fast because of copy-on-write.
Does it mean you use ZFS or something, Btrfs, or what?
Radim: Well, it depends what you want.
I'm not forcing people to use anything.
I'm just offering an option if you want it fast, because, you
know, another part of the frustration was that everything in your
integration tests that touches the database is slow.
Yeah.
And the way people normally use it, it's slow, but, you know,
running most of the queries takes literally milliseconds.
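A minimal sketch of the Postgres 18 fast-clone approach being discussed, assuming a filesystem that supports copy-on-write file cloning (for example Btrfs, XFS with reflink, or APFS); the database names are hypothetical.

    -- file_copy_method is new in Postgres 18; 'clone' asks the OS to
    -- copy-on-write clone the files instead of physically copying them
    -- (it can also be set in postgresql.conf).
    SET file_copy_method = clone;

    CREATE DATABASE regresql_run_42
        TEMPLATE regresql_base      -- hypothetical pre-built snapshot database
        STRATEGY = file_copy;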
Nikolay: Yeah, yeah.
And we have Neon branching, we have other copy-on-write branching,
so that's great.
And here, if you compare... do you know about DBLab or not?
Radim: Yes, I know.
Nikolay: Yeah, so I'm comparing the various approaches, and this
approach, when we have multiple logical databases inside one
physical instance, is great, because this is exactly what we try
to achieve with DBLab, but not many people realize what you just
said.
And the thing is that if you allocate this machine, you can have
a huge database.
With this wonderful Postgres 18 feature, if you use ZFS, we don't
care about time anymore, almost.
So we focus on buffers.
Postgres doesn't know which file system we have.
And then you can achieve a constant price for many, many tests in
parallel.
Unlike Neon or Aurora thin cloning, where you need to pay.
If you run 10 tests in parallel, for example, you can have 10
ideas from an LLM, you need to pay 10 times more for compute.
Yes, storage is solved, but compute is not.
Here, with CREATE DATABASE, 10 databases in parallel, it's good.
The only thing compared to DBLab is that you cannot have a major
upgrade inside a clone and have independent versions in parallel.
But maybe there are some more differences; in general, it's great
that it comes to Postgres naturally.
This feature is a great feature.
I hope it will get traction in CI environments.
Radim: I think it's actually great to have this natively.
I'm not saying it's perfect.
It's just an option.
Nobody is forced into third-party solutions; everybody will pick
whatever they want and, you know, I'm not against Neon or
anything.
I think you evaluate your requirements and you decide what's the
best approach you want to take.
Nikolay: Well, with Neon it's also the same: you just pay for
compute, but you can also study planning and behavior, put it
into CI.
The thing is that you need to focus on buffers, because cold
startup will be different, right?
But we don't care anymore if we focus on buffers.
That's the magic, right?
That's great.
And cost, also cost.
Because this is what says how the planner thinks.
The only problem is there are some queries which fall outside of
this methodology.
For example, people who deal with RLS know this problem, right?
If you have a query counting a million rows and there is a hidden
filter involving current_setting(), it's going to be executed for
each row, and it's not buffers.
It's pure CPU time.
But these are exceptions, right?
We have only a limited number of them, right?
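A hypothetical illustration of that RLS pattern (table and policy invented for the example): the current_setting() call in the policy is evaluated for each row, so the cost is mostly CPU and barely shows up in buffer counts.

    CREATE TABLE events (id bigint, tenant_id text, payload jsonb);
    ALTER TABLE events ENABLE ROW LEVEL SECURITY;
    ALTER TABLE events FORCE ROW LEVEL SECURITY;

    CREATE POLICY tenant_isolation ON events
        USING (tenant_id = current_setting('app.tenant_id', true));

    -- Counting a million rows re-evaluates the policy filter per row:
    SET app.tenant_id = 'acme';
    EXPLAIN (ANALYZE, BUFFERS) SELECT count(*) FROM events;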
Radim: Exactly.
I would say that before hitting that problem, there's the 99% of
problems we can solve much more easily.
Nikolay: I agree.
And still you can troubleshoot
it on this machine because getting
those million rows is much faster
if you have this copy-on-write
and your methodology.
I mean, what you said, it doesn't
mean you should stop thinking
about timing, right?
You still can think about timing,
right?
Radim: You can, and it will hit you.
That's what's going to hit you in production at the end of the
day, because this is what the user perceives when the page is
rendering, waiting for the output or something, but it's not
predictable.
Nikolay: Yeah.
The problem is we gave this tool, DBLab, to many hundreds of
engineers, I think thousands already, and noticed that some
people expect it to be a full reproduction of production behavior.
They say, why is my index being created 3 times slower than in
production?
Well, because it's quite a different environment, right?
So there is some adjustment needed in the mindset to test like
this, right?
Well, okay, enough about thin cloning and branching.
Let's talk about tooling.
Michael: Yeah, I had a question going back to cost tolerance.
You mentioned allowing for 10% higher costs, and that confused me
slightly, because I was thinking on CI we'd always be running on
effectively a clone of the same database, the static state you
mentioned.
Why would the cost... why do we need to allow for 10% higher
costs, potentially?
Radim: That was initially my naive protection, effectively.
Before I did the benchmarking, I said, you know, 10%; literally,
there was no measuring, I just said 10% sounds reasonable.
When I did the measuring, I found, as we discussed, you know,
there is slight variance, but I think on average it was below 2%.
Nikolay: I would like to see details
here because I think there
should be 0 difference.
Let's talk, I'm going to follow
up with you on this because there
are some cases which I don't know,
I'm super interested to understand.
Radim: And yeah.
I will just, you know, finish the topic, because, you know, I did
extensive testing in terms of establishing it, but I haven't dug
into the reason, because, you know, some were ARM machines, some
were, you know, Raspberry Pis, so that wasn't comparable, and
that was where the difference came from.
Some queries were at 0.
Nikolay: But Planner doesn't know
about hardware.
It doesn't know anything about
hardware.
Radim: We actually had this discussion at the Prague Postgres
meetup 2 months ago, that it would actually be very interesting
to dig into the differences between ARM and Intel processors.
What it actually is, whether there is some default somewhere; I
have, for example, seen that I/O startup time, again, it doesn't
involve buffers, but I/O startup times on cold I/O are much slower
on ARM in general.
I believe there might be something else.
Nikolay: Different code paths.
Exactly.
Right, right.
This I can imagine, yeah.
But it's not about hardware, just
code paths.
The architecture might matter.
Yeah, that's interesting.
Yeah.
What's also interesting here, I think we touched an interesting
topic.
We could zoom in there, because it's super interesting.
When you say 10%, I think it's because in CI you need to define
some thresholds to decide when it's failed and when it's passed,
right?
When you work manually, you have some thresholds in your mind
that you don't realize, but then you need to code them, basically.
And this is a challenge, right?
Radim: Exactly.
This is where it started from.
And this is actually where I hit a wall with the original release
of the tool, because people didn't understand, you know, you said
it, people didn't understand buffers, why they matter so much.
People didn't understand why something was changing.
And I had to start thinking again.
And, you know, if you go to the original blog post, there is now
a notice that I had to take the fixtures out of the tests, because
people were kind of linking it together and they were getting
different data sets and they didn't use it correctly.
My fault was that I didn't specify it properly; that was the issue
with the blog post.
But this is why version 2.0 is actually heading in this direction,
because we have touched on the topic.
You can have multiple snapshots, and those snapshots will give
you a different experience, but the same baseline on a given
snapshot should be stable, or more or less stable, over time.
And then it depends, because your business will change, business
requirements will change.
For example, yesterday I committed a change which is a selective
update.
So you can literally link a change in a baseline to a commit, so
you can blame what actually caused that historical trend change
on a query.
So this kind of adds another dimension, because suddenly you don't
just have a git blame for who broke something.
You can actually trace back how it affected the performance.
And there was one of the posts, Michael, you had on your blog,
about an optimization from 7 milliseconds to something, where you
hit the index scan.
And that was actually a big inspiration, because I said, you know,
this is actually very important, because the fact that the query
takes 7 milliseconds, that doesn't matter.
But if you save so many buffers, you should be able to trace it
back, even for the positive outcome.
So at one point, there can be a layer which will take all those
optimizations.
And this is where we get to a very interesting topic, and that's
AI.
Nikolay: Right, experiments, right?
Radim: Experiment, well, I wouldn't
call it experiments.
I call it set reality because I
believe most people-
Nikolay: I mean, code is written
by AI only right now, right?
Radim: Yes.
Nikolay: Yeah, yeah.
And how do we control it, right?
So as I said, we need not only 100% coverage, but real tests
should be inside it, right?
Including performance, I agree.
Yeah.
Radim: So, yeah, we had a discussion with Michael in London during
a PGDay.
If you have context which AI doesn't have at one point, you can
then see RegreSQL in a very different light, because it knows the
query, it knows the baseline, it has a snapshot, and it has
context, which is the schema.
Nikolay: Exactly.
And statistics.
Radim: Exactly.
You have statistics.
So you have suddenly much more
data, which you need.
And I had to, you know, all of
us, all of us have to think about
how AI is going to change our work.
Nikolay: I've been thinking about it every day since 2017.
Radim: Exactly.
I have a weekly session which I
do and you know, I'm trying to
digest the news and do this 1.
But there are things that just confirmed that this is the right
direction, because the joke is that 84% of developers use some
sort of assisted coding or vibe code or whatever; we all do it.
Yes.
But the same 84% of people don't write SQL queries anymore.
Nikolay: Yes.
Radim: And if they were bad at writing SQL queries, you know how
LLM SQL queries look without context.
I actually found it, you know, this is the professional
deformation you have when you look into how the planner works.
I found it very funny, because it doesn't differentiate whether
you're running Postgres 12, 14, or 18; it will mix all that
advice, it kind of reads the blog posts, which just tell you
something wrong, and people will just go with it.
And if you put this in the context of 10 times more pull requests,
who's going to review them?
And now you see where all this is coming from, because if you
don't have time to review it, if you don't have the human aspect
of reviewing it, and if you don't have the DBA knowledge (I think
the lost DBA knowledge is actually kind of key here), you need
something that will govern that your tests are fine.
You can write unit tests, integration tests on a couple of rows,
but how do you do it if you have to return 5,000 rows?
This is very difficult, and this is what RegreSQL has actually
been doing ever since the start.
You can take 5,000 rows and compare them.
And it can suggest to you, you know, the order is wrong, or maybe,
you know, there is a wrong field, and it will notice it.
So it actually sits somewhere between a unit test and an
integration test; it can measure or control that your data and
your contract are still valid.
And this is where I believe the direction of the tool actually
has a big future.
Nikolay: That's absolutely great.
Everything you say makes total sense, absolute sense.
Yeah.
It's great that we think in the same direction.
That's all I can say.
Radim: Well, thank you.
But I would say, you know, if you have experience dealing with
production systems, this is what you will effectively arrive at.
Yeah.
Because you have to generalize,
you have to find underlying issues.
Nikolay: We need guardrails and protection, automated protection,
at a larger scale.
We cannot say, oh, let's not run those pipelines, right?
Because they're expensive and long.
We need to make them cheap and fast, radically, right?
So we can run like 50 of them, right?
And check all the things and throw away all the hallucinated
stuff, right?
This is the way.
Yeah, this is the way.
And then it comes down to many details, including, for example,
that there are some hard problems which come from the production
side, not from the development side.
For example, plan flips, you know.
But that's a different problem, if we split these problems.
Anyway, I agree the biggest danger to performance right now is AI.
But it's also great.
But, like, I mean, it's the wild west right now, right?
So we need some practices and tools to be established.
Radim: But just to put it in contrast, I actually believe most
issues are actually very simple, because most incidents, you know,
I'm not sure if I'm the only one, but most incidents you see start
with something literally simple: a missing index.
Or somebody rewrites an index without understanding how it should
be rewritten.
And, you know, they obviously verify it locally.
It works fine because they have a small database, but they don't
notice that the query exploded 100 times in terms of buffers
retrieved, and that will have an impact.
Well, 100 times is the easy case; usually you can survive it.
Nikolay: Well, another important point I noticed here: if you say
1,000 buffers, people don't understand you.
If you say 8 megabytes, they start understanding how much it is,
right?
So you just multiply by the block size.
And this is magic.
Such simple magic, you know.
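The arithmetic being described: multiply the buffer count by the block size, which is 8 kB by default.

    -- 1,000 buffers x 8 kB block size ~ 8 MB of data touched
    SELECT pg_size_pretty(1000 * current_setting('block_size')::bigint);
    -- -> 8000 kB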
Radim: It is.
And, you know, I don't want this to sound like a promotion, but
I'm building something called labs.
So, you know, we both have labs of some sort.
Nikolay: For experimentation, that's right.
Yeah.
Radim: Because for me, and you mentioned it before, developers
want the same environment as production and they don't understand
why it's different.
And how many of us, I think the 3 of us, are lucky because we had
access to really busy production environments and we know what it
looks like.
But if you take an average developer, during their career they
will never actually touch an environment that has more than 10
TPS and maybe a gigabyte of data; they simply don't see it.
This is why, you know, if you trace the BoringSQL blog, everything
comes down to: this is fine on your database, but it will eat you
alive in production, because it won't work.
Nikolay: Yes, I compare it to a bicycle.
It's hard to become a DBA because you need to learn, but where do
you learn?
If you don't have access to terabyte scale and heavy workloads,
if the only practice you have is production, it's basically
learning how to ride a bicycle on a highway.
Some people survived, and they are called DBAs now.
Others say the database is slow, it's dangerous, don't touch it.
If you have an environment where you can safely make mistakes...
I saw magic when people got access to terabyte-scale clones and
started to do really crazy stuff and realized it was wrong, but
nobody was affected.
And this is how learning actually works, you know.
So yeah, what you do basically 100% resonates with what's in my
mind.
Radim: And then take whatever you do, because I'm actually
following your posts on LinkedIn about self-driving Postgres.
Now, people won't write SQL queries.
They will not, let's forget that it's a regression testing
framework, they will not have this experience, and they will
never hit the issue of Postgres being badly tuned.
Yeah.
That suddenly increases the gap, and sure, there will be AI which
will help, but you still need a manual or human operator at one
point, or somebody who will be able to diagnose what went wrong.
Nikolay: Because real testing shifted
to production that's why.
And we just need to shift it left
again.
Yeah.
Radim: Exactly.
So I think it's all about clarity and giving people... sure, there
will be, I think we will never have 10 times more DBAs.
We will probably have 10 times fewer DBAs.
But we need to make sure the new people will actually learn those
skills, and learn them predictably, without that fear, without,
you know, being scared, and who knows what will happen in 5 to 10
years.
Nikolay: Yeah, I agree.
I agree.
And also it resonates a lot with what happened to code and
containers.
Containers are the way to have reproducible tests.
It's the standard, right?
We just need the same for data.
And copy-on-write is the key.
But the methodologies should also be adjusted: buffers.
So everything fits together.
And I agree, DBAs are like sysadmins.
It's in the past.
The people who build stuff need to learn Postgres and know how to
operate bigger databases, coming from various angles.
I have front-end engineers who are already starting to understand
things.
Or Supabase, Supabase is a great example.
It attracted a lot of front-end engineers to learn SQL and so on,
right?
That's great, I think.
Radim: And that's the boring part, you know, I have to make the
joke, because, you know, for me literally the direction where I'm
going is that those boring things have been with us for 40 years,
and they will be with us for quite some time.
And they just work; those skills are reusable, no matter if you
read about them via an LLM or not.
Somebody needs to understand them.
Nikolay: Yeah, yeah.
So can you explain what your tool
does?
So assuming we understood this,
that we need to focus on buffers
during optimization and establish
baselines focusing on buffers
and cost.
And we managed to achieve the same
data using this feature in
Postgres 18.
What does your tool do and who
should use it?
Radim: Okay, so the main goal of the tool is to find all your SQL
queries and help you in some way to write a plan, because, you
know, queries need parameters, there needs to be a way.
So you need to write fixtures or provide your dumped, anonymized
database.
That's your choice.
I'm not actually forcing you into anything.
You have to get it to a predictable state.
That's your snapshot.
Then you run it on your queries, and based on your plans, each
query can have 1 to n plans.
That means the variations.
So you can have an LLM generate plans for you, you know, using
random values.
You can have fuzzy-generated plans, and that means every single
plan will execute that query once.
What the tool does is it captures the expected output; the
original tool did it also, it captures the data.
So what you have right now is a JSON with all the data that the
query produced.
This is your baseline, and that's the contract for correctness of
the queries.
Then you have the functionality for baselines.
Right now version 1.0 does EXPLAIN only, but version 2.0 will have
full EXPLAIN ANALYZE, because that's what I need for buffers.
And it records that data.
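A rough illustration of the kind of numbers such a baseline can record per query; this is not RegreSQL's actual format, and the table and filter are hypothetical.

    -- Capture the executed plan as JSON, including cost and buffer counters.
    EXPLAIN (ANALYZE, BUFFERS, FORMAT JSON)
    SELECT id, total
    FROM   orders
    WHERE  customer_id = 42;

    -- In the JSON output, the fields worth baselining include
    -- "Total Cost", "Shared Hit Blocks", "Shared Read Blocks",
    -- "Temp Read Blocks" and "Temp Written Blocks" on the Plan node.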
Nikolay: Yeah.
What about other options, newer EXPLAIN options like MEMORY and
SERIALIZE?
Radim: That is interesting, because, again, this is part of the
2.0 roadmap, because the snapshot has to be consistent.
You have to have the data, and now you even have to have the
configuration.
So initially I wasn't saving all the configuration, all the
metadata about where it's running and how it's running, but now
I'm actually saving this as part of the snapshot.
So if something changes, like if you change a configuration
setting, say the cost of a sequential scan, it will warn you: you
changed something, and it will probably give you wrong results.
So you still have to interpret the data.
At one point you arrive at a scenario where you have a good commit
and you can distribute the snapshots for testing.
So ideally, you can have a couple
of versions of the snapshots.
You can have a small snapshot,
which will have 100 megabytes,
200 megabytes.
You can have a mid-size 5 gigabyte
snapshot, which you can use.
And then you can have a large snapshot,
50 gigabytes or something.
And then every time you run the tests, you will get your report
of what has actually changed.
And that change is based on the cost and the buffers.
Nikolay: Yeah, that's great.
And you also capture the version, I think, right?
And settings.
Yes.
And some statistics and the structure of tables, which indexes we
have, and so on, right?
We basically met at some point; it resonates a lot, but you come
more from development, I come more from production.
And in production, what also matters often, including for buffers,
for example, if you think about index-only scans, heap fetches
matter a lot, right?
And this is the state of production, which can be bad or good.
And capturing that state from production, some statistics, might
make sense as well.
Do you see how it could somehow play a role, or not?
For example, when this table was last vacuumed, how many dead
tuples it has, or how much bloat has accumulated, something like
these things, you know, or is it outside of your scope?
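The kind of production state being referred to can be sampled like this, for example which tables were last vacuumed and how many dead tuples they carry (relevant to heap fetches in index-only scans):

    SELECT relname,
           n_live_tup,
           n_dead_tup,
           last_vacuum,
           last_autovacuum,
           last_analyze
    FROM   pg_stat_user_tables
    ORDER  BY n_dead_tup DESC
    LIMIT  10;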
Radim: I have spent so much time thinking about this topic.
And if you asked an LLM, it would say, yes, that's your natural
direction.
But I actually proactively said no to this, because it is a
different domain.
You know, you have to guarantee what's broken on the development
side.
You have to make sure that the contract, which is also
performance, or buffers, is the same or better, and you track it,
but you can't reproduce production.
Because in production you can have latency, you can have a replica
which runs in a different availability zone, you can have network
issues, you can have, you know, so many issues that can happen in
production.
And there are tools that I effectively can't compete with.
And I think that wouldn't be my goal.
And my angle is effectively education.
For me, it stops with CI.
Nikolay: Yeah, it again resonates 100%, because what you do is
you say, let's split the system into pieces and study each piece
separately.
And when we understand each piece's behavior fully, then we can
go and increase complexity.
And this is why you shouldn't use pgbench to study plans, right?
Because pgbench is
Radim: very cool
Nikolay: to test the whole system.
And also that's why you shouldn't bring too much from production.
Study your system first, how it works, and then let's go to production,
understanding our system and knowing what it's capable of already.
Then we bring complexity.
This is like one of the oldest scientific methods, René Descartes,
right?
Let's split the pieces, study each piece, and then compose them
to have the whole understanding.
Without this split, it's super hard and you always have too many
moving pieces, right?
So I like this, but still I'm from production, so I care about
the state of production, you know, it's different.
Yeah.
Cool.
Michael: I think we're talking about different things there, and
I really liked that.
This is about regression testing.
This is about not breaking things ourselves through development.
This is about not breaking correctness, whether that's a human
changing a query or an LLM chatbot changing a query.
So correctness is of primary importance; that makes sense.
And ideally not affecting performance, but we can't measure
production performance in test, in development.
So all we can do is get a proxy for it.
I don't think we can anticipate every single potential performance
issue in development that could hit production.
Nikolay: We've reproduced really difficult problems.
Yeah, with point-in-time recovery, sometimes you can reproduce
really hard problems at the macro level.
Michael: Reproduction is different from anticipation though.
Anticipating... so to get it into your regression tests you would
have to anticipate every possible issue, and I think personally I
really like the direction of let's do one thing at a time, let's
tackle the common things, let's look at the costing.
Nikolay: I'm not arguing with the whole concept, I'm just saying
that that is also possible, it's just much, much heavier, much
more expensive, time consuming.
In development we need to move faster, to write a lot and check
dozens of ideas at the same time, in parallel.
This is what we need, right? And this is a perfect approach.
Michael: So the last thing I wanted to say is that I personally
think this hasn't been a topic of particular interest to people
for many years.
I've expected it to be, like I thought it should be.
I've worked at a company previously that made tools in this area
that weren't very successful, and I was actually pleasantly
surprised by how much positivity and interest there was in the
Hacker News discussion and in your revamp of this tool.
So for me, it actually feels like people are ready for this topic
for some reason.
I don't know what changed.
Nikolay: What is writing there?
Who knows?
Michael: Do you think that's what it is, Radim, from your perspective?
Radim: I think this goes to... there's not one answer.
First, going back one topic: what percentage of the incidents you
deal with are really production-based?
Like, no matter if it's human or LLM, you know, I'll just be
guessing: 90% of all incidents are change management that didn't
go well and something happened.
So this is why, with regression testing, as you mentioned,
Michael, you can't measure production, but if you take a snapshot
of something which is now working, you can actually do something
that is measurable.
You measure whether you're improving on performance.
So, you know, sure, the contract is important, but actually if
you start from a 0.0 baseline, you can technically say we on
average go up or down in terms of performance.
This is very easy to measure.
And, you know, for example, you know how many times you
second-guess when you upgrade from Postgres 17 to 18, what's the
impact?
You can suddenly put a number on it, and if you somehow run all
your queries and get that number, you can put that number into a
bucket.
And the second one is, no matter what you do, I was asking myself,
you know, why did nobody think about this before?
I actually don't understand why nobody did this, because it's
predictable.
The metrics were there, but nobody did it.
And from my experience, the frustration I had was that all the
incidents were very easily avoidable, because the metrics were
not within a 10% margin, they were at 1000%.
Every single incident started with a 1000% difference.
Nikolay: I also have one question about this tooling.
So how do you capture queries to test?
Because, for example, if we have an ORM, it produces various
queries.
We can see pg_stat_statements, normalized
queries in production,
but for each query we might have
multiple plans depending on
parameters.
So how to find those queries?
I have my own methodology.
It's actually described in some
articles.
We could collaborate maybe a little
bit on it, but what's your
approach?
Radim: This is the difficult part.
I'm not going to lie.
I come from a Go background, and ORMs are not that popular in Go,
I would say.
Frameworks like sqlx and things like this actually encourage you
to write SQL queries by hand.
And this is actually where it started.
So, effectively, my first commits were about supporting multiple
queries in one file.
So I have a different base.
There are even frameworks like sqlc, I think, that generate code
based on the queries, or vice versa.
But that would keep the target market very small.
So the obvious next step is the support of the different
frameworks.
So I think in my article I did SQLAlchemy.
That was one that actually allows you to save the queries.
And this, if you remember, is where we talked about the plans.
And plans are that variation of arguments.
Nikolay: Right.
Radim: So you see, you have an artificially generated snapshot,
which is fixed, and then you need to generate the plans.
And plan variation: if you write one plan, your query testing is
not going to amount to much.
So you need to generate them, and this is actually where an LLM
can help you.
Because suddenly you have a tool which can create the variations,
and even hallucinate data.
And you will be testing queries with... I would actually be quite
happy to have hallucinated parameters, because they will show you
whether something is changing or not.
Nikolay: It's a proactive approach, actually, because my approach
is to take everything from production, because we deal with really
large systems and we need to understand, when we, for example,
perform a major upgrade, that there are no plan flips, or that
the plan flips are positive ones.
But what you say is, let's use an LLM to generate parameters which
will probably happen next month, right?
Like some edge cases we haven't met before.
I like this a lot as well.
Yeah.
Radim: Yeah.
So you have to use what you are afraid of.
So if we are afraid of randomized queries, then why not use them?
And effectively say, yeah, I need a variation of a thousand plans.
Nikolay: Yes, again, the problem is, again, I have a slightly
different opinion here, because I'm coming from production
experience.
I would say we can generate thousands and thousands of various
cases, maybe 1,000,000 queries to test, but it will increase the
duration and the price we pay for CI and testing, right?
So we probably need to prioritize some queries.
How to prioritize?
We need to study production.
No?
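One way to do that prioritization, as discussed: take the normalized queries doing the most buffer I/O from pg_stat_statements in production (assuming the extension is installed).

    SELECT queryid,
           calls,
           shared_blks_hit + shared_blks_read AS total_buffers,
           total_exec_time,
           left(query, 80) AS query_sample
    FROM   pg_stat_statements
    ORDER  BY shared_blks_hit + shared_blks_read DESC
    LIMIT  20;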
Radim: Exactly.
Okay.
You need to understand your business
requirements, you need to
understand your outliers, you need
to understand your smallest
customers, you need to understand
your biggest customers, you
need to understand the patterns
which are happening.
Nikolay: That's also an interesting comment, because I was
thinking about this tooling.
Oh, by the way, is the collection of parameters right now just an
idea, not implemented yet?
Radim: No, that's already there...
The queries are already parameterized.
Nikolay: Yeah, I understand the picture, but if I already have a
large 10 TB database with a complex workload and I want to start
using your tool, will it help me to create a set of queries,
example queries, which will be tested?
Radim: Queries have to be extracted from somewhere.
So no, I would say the answer is no, because right now I assume
you have the queries or a way to capture the queries, either by
having them in the SQL files or by being able to have a CI task
that will generate those SQL files from your ORM.
Nikolay: Yeah, I would like to have an additional call with you
to discuss this, because I have an approach; maybe we could
connect the dots here and see if it can land, because this
methodology probably works with your tool naturally, you know,
how to inject this approach into an existing big project.
We deal mostly with projects that have already started, hit walls,
and come to us for help.
So I understand completely how to start from scratch, but when
you have something already, it can be difficult there.
Another thing I wanted to point out, from what I hear, you say
smaller, medium-sized, bigger.
Sometimes we have only one database and that's it.
Some systems, like a SaaS, are installed only once.
But sometimes we develop software which has many installations,
small ones, big ones.
So it's interesting.
I think the approach where you define some reference databases
works better when we have multiple installations of various sizes?
Is that what you think?
Radim: Yes, as I said, there's no universal answer for how to
generate it.
And many people actually wrote to me about fixtures being
completely wrong.
But what I'm trying to get across is that I don't care about the
fixtures.
I care about the way to generate a snapshot.
And if you have the same cardinality and same data, you can then
measure correctness.
So obviously the investment in a tool like RegreSQL is a big one.
You have to start using it, and you will get the dividends months
later.
You might get some dividends as you go, but the true value will
only come later.
So the developer experience and then the whole life cycle, that's
maybe a version 3.0, because, you know, there needs to be much
more support.
So the tool itself is not as important as the whole experience
and the process, because it's the methodology, exactly.
Nikolay: So what I'm actually interested in, let me share, to be
honest with you, what I have in my mind listening to you.
For example, GitLab, they use our approach.
Many engineers say it's great and we have a case study published.
Everything is great.
But listening to you, I realize we actually always cared about
the biggest installation, GitLab.com, which is huge.
We focused on solving that problem.
We solved it, not fully, but we solved how to test queries there.
It can be proven, always.
But now I'm thinking, actually, GitLab has hundreds of thousands
of other installations, and they have reference architectures.
We actually should care about that as well.
So it's interesting, because we could generate different sizes of
databases and cover them by testing as well.
So it's interesting, because I was always thinking about the
biggest one, you know, because the problems are usually there.
Right.
Radim: Exactly.
You hear about the biggest ones.
This is where, you know, production breaks, the incident starts,
the query which was fine...
It sounds similar, but it's a very different problem.
But you need to find some sort of boundary; sometimes you can't
solve all the issues and maybe you need to find a different
solution.
Nikolay: Right, right, right, yeah.
Michael: So, just out of interest, Nik, what would you expect to
break on a smaller instance?
Is it that it's just a different shape, for example, on
GitLab.com's main instance maybe there's tons of one type of
object, and in a different customer's installation there's a
different distribution and therefore plans are different?
Or is it like...
Nikolay: I can tell you.
So what we think about is, we have various kinds of testing, like
real data testing.
It's like the ultimate, final testing, with basically production
clones, which only a few people can have access to.
But also, you cannot download this database due to its size and
also restrictions, PII and so on.
But if you need to care about multiple installations, in this
case the direction DBLab is moving in is having full-fledged,
Git-like database branching, but also cloning between multiple
setups.
And we already run on MacBook.
So I'm thinking, if you care about smaller databases as well,
developers could basically download a reference database and
iterate much faster, because I hate how CI is organized right now.
If AI is hallucinating and I hit it in production, it's an
absolute nightmare.
But hitting problems even in GitLab CI or GitHub Actions, it's
already too late for me.
I want to iterate much faster.
I want all the tests to be run before the push, git push, right?
In this case, if we have a DBLab of a smaller size, you download
it and, like, don't hunt only for the biggest case: medium size,
small size.
I have like a terabyte or so here.
I can have a good example database on my laptop and iterate much
faster, and before the push I already have much better code, you
know.
Michael: Yeah, so catch more issues
earlier and only catch the...
Nikolay: Shift left what you can
shift left.
Michael: Yeah, that makes sense.
Nikolay: Okay.
So it's saying let's test earlier.
Ideally, developers should test.
Usually they don't have time, but for AI I have no excuses.
AI should test everything.
I pay for it, right?
Tokens, not time, right?
That's it.
I think it's a perfect time to
think about all levels, and if
you can shift left to your laptop
to your machine, you should
do it.
So future DBLab versions will be able to clone data between
machines, to pass snapshots between 2 DBLabs.
You can have it on your machine and just extract it, then do
something, push it back and share some tests with your colleagues,
and so on.
Yeah, anyway, I keep pushing my own agenda all the time, because
it's actually my topic.
I'm so glad you came, actually.
Radim: There are actually so many topics, and I think we could go
on for another hour, because, you know, there's, for example, the
support I think I pushed last night, along the same lines: you
take the snapshot, you run it against a migration, either a SQL
file or a command.
So you can actually trigger a third-party system and then run the
same tests, just to test what the migration looks like.
You can start there; there are so many angles on this topic alone.
Nikolay: Oh, yes.
Radim: You know, I don't say I have answers for everything, but I
just feel like I hit a topic.
Nikolay: Let's have part 2 about testing of schema changes.
Let's have part 2 just about that, because it's a big topic.
This is where, as you said, change management, this is where the
danger usually comes from in incidents, right?
So let's just discuss it separately
because I think it's worth
discussing.
Michael: Radim, is there anything on the regression testing side
that you wish we had asked about or that you wanted to mention
before we wrap up?
Radim: I think we have covered most of the things.
Nikolay: Where to start for people?
Radim: Where to start for people?
Nikolay: You know, first steps.
Radim: Follow, first step, follow the project.
I think that's the first good step.
I will definitely make it public and do some, you know, sessions
around the version 2.0 because this is where the core developer
experience will come and I will actually need feedback on that.
So I'm actually open to any feedback.
You can email me at any point or open a GitHub issue.
I would be actually grateful to see that.
But I would say pick your 10 queries.
Don't start with everything.
Find a way to pick 10 queries which you believe are, maybe not
small, I would say representative of something that's breaking.
Try with them, try to create the plans, try the tool, how it
works, what the flow might be, and try to live with it.
And then have a rule that every single handwritten or LLM-written
query needs to be covered.
If you start with that, you will start seeing value in a couple
of weeks, a couple of months, because you will have that check
mark that everything is fine.
And then in the future, hopefully all the features and the
developer flow will support it so that you get much more value
and you actually get your time back in terms of safer production.
Nikolay: Confidence level builds up.
Yeah, I mean, a small set of queries, but it's a good idea to
understand which queries matter most, based on production or
something.
And you have a website, boringsql.com, so this is where you have
the blog and so on, and the project you're talking about is
RegreSQL, right?
Radim: Exactly.
Nikolay: Yeah, great, great.
Cool.
Radim: Right now it's not featured there; I'm preparing a section.
Nikolay: You have a huge roadmap I guess.
Radim: You can't even imagine, and this is actually why I love it.
Nikolay: I can.
Yeah, that's great, that's great.
And the project is in Go, so that's great.
Yeah, good, good, I think, finishing words, right?
Unless Michael has something.
Michael: No, I'll link to everything in the show notes if anybody
needs anything that we've mentioned, and yeah, thanks so much for
joining us, Radim, it's been great.
Radim: Yeah, thank you, it was very interesting and I loved
talking about it, because, you know, it's an important topic.
Nikolay: Thank you for coming.
Radim: Okay, thank you guys.