PostGIS
Michael: Hello and welcome to Postgres.FM, a weekly show about
all things PostgreSQL.
I am Michael, founder of pgMustard and I'm joined as usual by
Nik, founder of PostgresAI.
Hey Nik.
Nikolay: Hi Michael.
Michael: And we have not 1, but 2 heavy hitters in from the PostGIS
world.
Both are also recognized Postgres core contributors.
First we have Regina Obe, President of Paragon Corporation,
a consulting firm, member of the PostGIS core development team
and steering committee, and co-author of a whole host of books
on GIS and SQL, including the book PostGIS in Action.
Welcome, Regina.
Regina: Thanks, thanks for having me.
Michael: It's our pleasure.
And also we have Paul Ramsey, who is Staff Engineer at Snowflake
via the Crunchy Data acquisition, the co-founder of PostGIS, member
of the core development team and the chair of its steering committee.
A pleasure to have you too, Paul.
Paul: Hi, Michael and Nik.
Nice to be here.
Michael: It's wonderful.
Right.
So I was hoping actually that maybe one of you could give us a
kind of brief history of the project and maybe the other one could
give us an update on where it's at today and what it looks like.
Paul: Well I'll take the history side of this because yeah, because
PostGIS comes out of my history as a geospatial consultant and
it's a long time ago now.
PostGIS's first release, 0.1, was at the end of May in 2001.
So we're creeping up on 25 years of PostGIS.
Pretty crazy.
PostGIS was born out of a consulting company here in
Victoria, British Columbia, and it was born because we ran out
of work.
The fiscal year for the government, which is our main source
of money, ends on March 31st.
And the first release of PostGIS came 2 months after that.
We had been doing analytical work for the government using
standard GIS tools and a database to store interim results, and
the database we were using was Postgres.
And having had that experience of using a database and using
SQL, we had become familiar with this sort of powerful tool to
slice through big sets of data in different ways.
And being spatial people, we looked at this database tool and
said, there's gotta be a way, like, there's gotta be a way to
bring together the spatial questions we're asking and the non-spatial
questions we're asking into one place, so we can ask more complicated
questions and have it all fulfilled inside the database.
And there was at that point a relatively fresh international standard,
the Simple Features for SQL standard from the Open Geospatial
Consortium.
There was one extant proprietary product that did that.
Oracle 8i at that point had been released, and it had a pretty
80/20-complete implementation of spatial SQL.
And we both looked at Oracle Spatial and thought, can we use this?
And at the time, if you wanted to use it, you had to have Oracle
Enterprise.
The implication was about a quarter-million-dollar licensing bill
to really use it in anger.
And that obviously scared us off as a small consulting company.
It also successfully scared off anyone in the government.
We had thought, maybe we'll get someone in the government who
wants to do this stuff, and we'll piggyback on their licensing
and learn it and do an implementation for them, and we'll be
spatial database experts.
But now it was too expensive for anyone in the government.
But we had this specification, and we had this experience with
Postgres, and Postgres itself was built for type extension.
So we also had a really clear on-ramp to add a new type,
to bind it to access methods, to add functions that worked on
this type, in a way that really didn't exist in what was at the
time the premier open source database, MySQL.
It took quite a few years beyond
2001 before MySQL added Spatial.
And when they did, we went and
looked at their implementation,
looked at the patch, and it was
big.
It was a big patch that crossed
the whole code base.
It really took someone who understood MySQL soup to nuts to
do it.
They had to do the index implementation,
like everything.
They had to do a hell of a lot
of work to do it.
Whereas we got this nice, tight little code base sitting as an
extension, completely divorced from the core code.
The core code provided the access methods prebuilt.
We were good to go.
So it was doable to have a working
extension that we could demonstrate
to ourselves and to customers within
2 months and then release.
So yeah, we were clever people, but Postgres is a hell of a clever
product, and PostGIS wouldn't exist if Postgres hadn't been
so clever to start with.
Nikolay: And it was just right
at that time.
Paul: At that point it was GiST, and it still is GiST actually,
the access method we use for spatial indexing.
Yeah.
Nikolay: I'm smiling the whole time you're talking, because it
connects to 20 years ago: 2005, 2006, 2007.
I remember first of all, this is
when I left Oracle 8i.
Exactly.
Like last time I touched Oracle
was exactly 8i.
And then when I was building my second startup, a social network,
we put a lot of people and objects, like restaurants and so on,
on the map.
2007, it was too early.
And I remember I hired 2 guys,
Oleg Bartunov and Teodor Sigaev,
to help us.
It was so bad, because it was amazing to have GiST, and reading
all the articles and the basics, and the 7 functions, the abstraction,
all great.
But when we went to practical questions, we had 10 million
people to be placed on the map, and a lot of objects, and then
you try to solve a simple problem: show me the nearest 10 objects
on the map which have good reviews.
And it was terrible, because there was no support for ORDER BY and
KNN.
I didn't know KNN.
And since we moved so fast, I eventually
decided to pivot my
startup away from maps and so on.
But Oleg and Teodor, a couple
of years later, brought KNN to
GiST.
It was 9.1.
And then Alexander Korotkov, and also Nikita Glukhov, helped to
bring the same to SP-GiST.
And this connects us a lot.
But I just showed them this problem, and that's it for me.
It was great.
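[Editor's note] The problem Nikolay describes maps to the KNN ordered index scan that GiST gained in Postgres 9.1. A minimal sketch; the `places` table, its columns, and the coordinates are hypothetical:

```sql
-- Hypothetical schema: places(name text, rating numeric,
--                             geom geometry(Point, 4326))
CREATE INDEX places_geom_idx ON places USING gist (geom);

-- Nearest 10 well-reviewed places to a point: the <-> distance
-- operator in ORDER BY lets the planner walk the GiST index in
-- distance order instead of computing and sorting every row.
SELECT name
FROM   places
WHERE  rating >= 4
ORDER  BY geom <-> ST_SetSRID(ST_MakePoint(-123.37, 48.42), 4326)
LIMIT  10;
```

Before 9.1, the ORDER BY here forced a full scan and sort, which is exactly the problem described above.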
Paul: Yeah, back in the early days,
starting off incubating an
open source project, within a couple
of years, we had quite a
few users and some institutional
users.
And at the time, the GiST infrastructure did not include
recovery.
It wasn't a completely ACID implementation.
And at that point I had a stable of people who had businesses
depending on PostGIS, and I put out a call for development
money, and scraped together about $10,000 or $15,000 from all
the different companies who were using or depended on it, and
then shipped that money to Oleg and Teodor, who did the work
necessary to make GiST recoverable.
And it was a weird thing to sell to the community.
It's: we're going to get this work done.
As a result of the release cycle of Postgres, and the time it
takes to do the work and all that stuff, you'll be able to use
it and show it to your clients in 12 to 18 months.
But you're going to give me the money right up front.
So it was an interesting introduction, both to the Postgres
development cycle, and to trying to drum up crowdsourced
funds for open source development.
And it really taught me that crowdsourcing funds is really hard,
because people are willing to pay to get the bug which is
bothering them right now fixed right away.
It's much more difficult to get them to pay for something which
will maybe make them a little bit happier in the far future.
It's a harder sale.
Nikolay: Especially for startups.
So they move too fast to
Paul: understand that
Nikolay: the thing will come only
in a couple of years.
Regina: Yeah, they might not be
around that long.
Nikolay: But I'm glad KNN was brought into Postgres and PostGIS
as well.
Yeah, great.
Michael: Is that a good excuse for us to get Regina to give us
a little bit of a catch-up?
It's been 25 years, right?
You've done a lot of work since
those early days.
What's changed?
What does it look like now and
how is that different to the first
versions?
Regina: I mean, we definitely have a bigger audience, and now
people seem to be running things in the cloud more than they're
running things on premises.
Or maybe not; I think it's slowly getting to the point where
most people we see are running PostGIS in Google Cloud or
Amazon Cloud or Microsoft Azure.
I think that's the biggest thing
that's changed since we started.
Michael: I was thinking as well,
it looked like it started as
a single extension and it's now
kind of a collection, it's like
a family of extensions that serve
quite a few more use cases
and lots
Regina: of functions.
So when we started off, we just
had the vector support and then
we added raster support.
And originally we thought, Oh,
we'll just throw raster in the
PostGIS extension.
And then it got very fat and people
started complaining.
Why do I have to carry these thousands of functions when I don't
even use raster?
So then we broke it out into 2
extensions, PostGIS and PostGIS
raster.
And we also have PostGIS topology.
And then PostGIS spun off other extensions.
So you have the H3 extension, which is now, well, we brought it
into our organization, but it's still a separate extension.
It's not even part of the PostGIS extension, because we forked
it from the original owner into our system.
So that was very recent.
And then there's another extension called MobilityDB, which is
a separate extension, separate group of people, but they build
on top of PostGIS.
And they deal with things like, you know, vehicle movement, the
speed at which things move and all that and flying birds, trains.
I think those are the key ones.
Yeah.
And then there's also a pgRouting, which I'm a member of too,
which deals with network routing, like driving, biking, walking,
getting directions on that.
So yes, it's spun up a lot of other extensions.
Paul: Yeah.
So that's the technical side.
Deployment is a big deal for sure.
When we started, it was: you want to use PostGIS?
Here's the source code.
Untar it in the contrib directory of an existing Postgres installation.
And then you build, and then you install, and now you have PostGIS.
And around the time Regina started joining us, she helped with
Windows builds.
So Windows people at least could have a pre-built binary.
Regina: And I still do Windows builds.
Paul: At some point, PGDG included us in their builds, and that
was a huge change.
So now it meant that you didn't have to build it yourself.
You could just find the right packages and install those.
That opened up a whole pile of new users who previously wouldn't
have tried it.
But yeah, as Regina said, the real tipping point lately has been
being part of the default installs on all the clouds.
So there's no longer a software install step from the point of
view of the end user.
It's just type create extension PostGIS.
And that ubiquity really goes beyond ubiquity for users; I feel
like we've achieved a certain level of ubiquity in the industry.
Like, first of all, geospatial as database functionality is just
table stakes now.
It's pretty rare to find a database that implements SQL that
doesn't have geospatial, and those who don't are all tacking it
on at a great rate.
Regina: Ours is still the most popular, I think.
I think it's surprising that, even though Postgres started out
as a meager database, PostGIS itself is probably more popular
than Oracle Spatial and SQL Server Spatial and MySQL.
I don't even hear of other people doing serious work on those.
Paul: Yeah, and that's probably not fair, because I bet you there's
a whole bunch of institutional users of those databases who
never leave the ranch.
So all they understand is Oracle Spatial or SQL Server Spatial,
because that's just the way their institutions are shaped.
But there's no doubt that in the
wider world, PostGIS is the
thing.
The thing that made me really understand
that PostGIS was the
lingua franca was seeing the
Google BigQuery announcement
blog post where they said one of
the features of Google BigQuery
was that it was PostGIS compatible.
It was like, Oh, wow.
Okay.
So that's nice to see.
Michael: That's them saying you're
the standard, right?
That's pretty cool.
Paul: Yeah.
And it's a standard.
I could have said we support the Simple Features for SQL standard,
but they decided that PostGIS compatible was more understandable
to the larger marketplace.
It's like, okay, so we have the
mindshare.
Nikolay: Yeah, that's great.
Actually, my daughter is a fourth-year bachelor's degree
student at UCSD, and they have projects related to wildfires in
California, and GIS as well.
But I was super annoyed to hear it's not PostGIS, it's actually
Google BigQuery.
So that's good to hear.
Yeah.
It's at least compatible.
Great.
Paul: Oh, reusable.
Reusable knowledge.
Nikolay: Cool.
What's the latest things you can
share?
What's the latest development?
This is one thing I'm also curious
about.
I know, for example, PostGIS works
quite well with
TimescaleDB.
It should work with pgvector.
I never tried.
I'm curious.
And with sharding systems as well.
There are super extensions.
In the case of PostGIS, it's not a super extension, it's a
collection of extensions.
But sometimes it's hard to make these big building blocks work
together, especially with pgvector, because it's also an
important building block, right?
What's the recent development?
Regina: I don't know if there's any recent development; I think
they just work together, and it's because of the way Postgres
is set up, right?
You can have 3 or 4 different extensions in the same query, and
it's just because they standardize: all data types get the same
core feature set.
And so you don't have that annoying
problem that MySQL has with
their different storage engines
and their different query syntax
for things.
So things just work seamlessly.
So I don't think we did anything
special to try to make ourselves
work with pgvector.
Paul: Yeah, pgvector is more similar to us than, say, like you
said, the super extensions like Timescale or Citus, these sort
of overarching big swings at the architecture of the system.
pgvector, like, they provide a
type, they provide access methods,
they provide some functions that
make sense for that type.
They're very PostGIS-esque in that
respect.
But from the point of view of what
they're for, they're kind
of orthogonal.
I've only seen one example recently
of PostGIS and pgvector being
used together.
And they're not being used together because they're getting any
great leverage out of PostGIS itself.
They're getting leverage out of the fact that PostGIS has a huge
amount of third-party interconnectivity.
So you use something like really high-dimensional search in
pgvector to get a bunch of results.
And if those results happen to
include spatial information, then
you throw that spatial information
on a map because everything
connects to PostGIS, all the mapping
software connects to PostGIS,
and that's how it works.
We really saw a cool example of
this in our last PostGIS Day.
We have a PostGIS Day mid-November
every year, an online conference,
and Shawn Burke from GeoBase
showed us geospatial AI using
pgvector as a search engine, but then getting the results back
through PostGIS to his mapping tools, so it could show us the
results.
A neat example of what geospatial people are calling foundational
models: not large language models, but a completely different
look at how to use this AI kind of technology for doing
geospatial work.
Nikolay: And to come back to TimescaleDB, I saw great examples
of PostGIS and TimescaleDB working together, because if you
collect a lot of geographic signals from drivers or from anything
on the map, you want to partition all this, right?
Because it's a lot of data.
One way or another, partitioning should work, because it's also
time-series data, not only geographical data, right?
So this is a natural pair of extensions working together.
Okay.
What about lakes, since your company got acquired last
year?
Yeah.
Yeah.
What's the recent development there?
Paul: And this ties actually back to your other question, which
at its core is an interoperability question.
The nature of the type model in Postgres is such that when you
add a new type using the Postgres type model, a lot of the time
things just work, because the overarching machinery of a big
piece of machinery, say Citus, doesn't necessarily think about
particular types.
It thinks just about type mappings, and the type mappings are
entered when you add the new type to the pg_type table, and your
pg_attribute entries just refer back to the type table.
So everything is runtime-configured, and that's the underlying
expectation of these extensions.
So stuff just works.
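[Editor's note] The catalog mechanics Paul describes can be seen directly: a column's entry in `pg_attribute` carries only a type OID pointing back to `pg_type`, whether the type is built in or added by an extension. A sketch (the table name is hypothetical):

```sql
-- List each column of a table with the name of its type,
-- resolved through the pg_type catalog.
SELECT a.attname, t.typname
FROM   pg_attribute a
JOIN   pg_type t ON t.oid = a.atttypid
WHERE  a.attrelid = 'my_table'::regclass
AND    a.attnum > 0
AND    NOT a.attisdropped;
```

A PostGIS `geometry` column shows up here exactly like `int4` or `text` would, which is why type-agnostic machinery just works.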
It's only when you have to push out to some external format that
has its own understanding of the world that you have to start
thinking about these mappings again.
pg_lake, which was open sourced a few months ago, is a set of
tooling to allow you to push data from a Postgres database out
to Parquet or Iceberg data on object stores.
Michael: I had a question around
that.
Only in researching for this, I
came across GeoParquet for the
first time.
I didn't know that was a thing.
Paul: Yeah.
There's a long history of the geospatial community taking extant
open standards, ones that are good and clearly going somewhere,
and bolting geo on.
Regina: GeoJSON is another one.
Paul: Exactly.
Back when JSON was turning into the lingua franca wire transfer
protocol, the geospatial community said: but we need a way to
encode geometry too.
So they invented one, and eventually got it ISO, not ISO, IETF
standardized.
Same thing with Parquet.
So you have Parquet, which in its initial versions understands
the basic types of relational databases and doesn't understand
anything about geo.
So geo people want to push data through this format out to their
own clients, and they say: we've got this byte type, and we've
got this metadata slot.
So we'll slot in some information about which columns are holding
geometry, and we'll stick the raw geometry into the byte type.
And that will be just enough affordance
for geo-aware tools just
to look in the metadata and say,
Oh wait, there's geo here.
And we can pull it out.
Fortunately, for the Parquet and Iceberg formats, that has been
a very short interregnum.
Sometimes these things last for years, 5 years, and we're still
playing with these hacks.
I think GeoParquet and GeoIceberg lasted for about a year and
a half before the mainline projects came along and said: you
know what we need?
We need a geometry type and a geography type.
We should just formalize that.
So yeah, that was formalized, I think, 9 to 12 months ago; both
Parquet and Iceberg got around to saying, yeah, we have an actual
type.
And it meant that the metadata could live in proper slots, where
you'd expect it to be, which in turn let engine developers do
things like much faster scans and pruning of these files.
So it's been good all around, and you can forget about GeoParquet
and forget about GeoIceberg, because those are pieces of history
now; the geo stuff is actually embedded in the standards proper.
That's
Michael: great news.
Paul: Yeah.
Yes.
So, yeah.
So pg_lake is a way of pushing data out of Postgres onto the
public internet, into places where these Parquet and Iceberg
engines can scan it.
And the promise of that is that
if you pump your data out in
hopefully an ordering, which is
good for the bulk queries you
expect to apply to the data, you
can then hit those object store
data lakes, if you will, God help
us, with an engine which understands
those formats and can slam through
them in massive parallel.
And, you know, one of those engines is DuckDB, a single-node
engine, which is very good at that.
One of those engines is Snowflake, a multi-node, highly parallel
engine, which is also really good at that.
Another one is Databricks.
All the same gig.
It's like this separation of compute
and storage.
The compute comes to the object
store, rips this file off.
It is ridiculously inefficient.
If you're a Postgres developer,
you think, this is really inefficient,
you should have an index, so you
can find exactly the thing you're
looking for.
And these compute engines are like: no, I will look at every
record.
I will lightly prune the file as
I zip through it.
But they're willing to scan a terabyte of data to get a few
hundred thousand rows of results, which is crazy, unless you
happen to have the compute and storage and cloud infrastructure
to afford to do that, which these days we do.
It's one of the problems of being old: your understanding of
what is efficient and correct is receding into the distance.
Nikolay: But I still have some doubts and questions about how
stable performance for such queries is.
It's maybe not a fully resolved problem.
Paul: I agree.
I find it really weird that you have to reform or re-sort your
data to match your use case.
It's nice, it's great when things work out.
So you're sloughing your data off the transactional store, off
to the object store, usually in something like time-series order,
right?
Time of most recent change.
And it so happens that your queries
are either grouping on time
or filtering on time, and then
the underlying format metadata
allows the engines to really throw
away 90% of the data before
it looks at it.
That's great.
I'm used to the power of SQL inside a real OLTP engine, where
you can address any combination of tables and joins, and you're
not going to be hit too hard on efficiency as long as your
indexes are in place.
Nikolay: And it sounds reasonable
for columnstore and for purely
analytical queries, but when you,
for example, go back to this
KNN, K-Nearest Neighbors problem,
we just need like 10
Paul: rows, but we
Nikolay: have millions, right,
or billions.
Regina: Yeah.
How does that work?
Does it even work?
Can you even do a KNN in those
systems?
Paul: Ah, you end up making some kind of compromise with reality.
If you sort your data on a key, like a Hilbert code or something
that has spatial co-location to it, then you can do that kind
of scan a little better, because you know where to hop in the
Hilbert code.
DuckDB, for its internal implementation, I think just sucked it
up and put a lightweight index next to the table.
So they're not actually doing the full column-store scan for
those particular kinds of queries.
As database people who have put a geospatial type into a standard
relational database, we like to say spatial isn't special,
because you just type SQL and you get your answer back.
But at an implementation level, there's no getting around it:
spatial is special in a whole bunch of ways, just because it's
higher dimensional.
It's the same thing people run into with pgvector.
Once you get beyond one dimension of sort order, a lot of your
preconceptions about the most efficient way to deal with the
data fall away, and you have to rethink problems.
Michael: One difference seems to
be people aren't as forgiving
of accuracy issues when it comes
to spatial data.
You know, you don't have approximate
nearest neighbor.
Paul: That is 100% true.
Yeah.
Things on maps: people are very unforgiving with maps.
It's like: one mistake, you're wrong.
Whereas you roll up a BI dashboard
and give some people a pie
chart, whatever.
It's actually one of my great shaking-my-fist-at-the-cloud things
about the big column stores: people insist on summing up every
one of 5 billion records to make a pie chart.
You could have sampled and got the exact same pie chart.
Michael: I feel like we've dived
into the most complex type of
performance issues, but I think
a lot of people, like I've come
across some performance issues
when it comes to PostGIS work,
a few customers that have the odd
specific case.
And I think it's even the basics that they have questions
around, like geometry versus geography, or GiST versus SP-GiST.
Any advice either of you have for people that are new to this
stuff, on the basics, on the simpler queries?
How should they be thinking about these things?
Nikolay: The absolute basic is how R-tree is implemented and
what R-tree is compared to B-tree.
This is like the gem of Postgres, actually, right?
Maybe let's talk a little bit about this, some basics, in the
middle of it.
Regina: I think that bores most
users, the discussion.
They just want to say that I can
write this and I get this.
And they don't care about the underlying
implementation.
Paul: You think
Nikolay: so?
There are different users, I don't
know.
Maybe.
But okay, let's mention that Postgres
has this GiST, a generalized
search tree, and R-tree is implemented
using it, not separately.
Correct.
And yeah, there is great work by Hellerstein from Berkeley from
the late 90s.
It's basically a generalization of the B-tree idea to
multidimensional space: two-dimensional, three-dimensional,
n-dimensional.
Paul: Well, n-dimensional is defined more or less by the API,
right?
Not so much by the space; whatever space you're in, as long as
you can meet the conditions of the API.
Nikolay: 7 functions, right?
Paul: Yeah.
Given a key, can you say whether things are consistent or not
with that key?
That's one piece of the API.
Another is: given a collection of keys, can you split that
collection?
Depending on your dimensional space, those splits will be
calculated differently, but it's just: given a set, can you do
a split for me?
Yes.
Okay.
Can you do key consistency?
I'm trying to think what the other API pieces are; I feel like
there's a third.
But regardless, with the rules of the GiST API, you can implement
an R-tree.
The kinds of questions that an R-tree answers, the kinds of
guarantees an R-tree gives, are ones that fit inside the
generalized API of GiST.
So it means that you don't need
a separate access method for
R-tree.
You don't actually need a separate
access method for B-tree either.
In fact, there's a B-tree GiST implementation in contrib that
you can look at.
Because it's more generic, there are fewer opportunities for
performance optimizations.
But the beauty is that, because the GiST conditions are really,
really loose and generic, for almost any type you can get an
access method without having to think about all the problems of
pages and consistency and so on.
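[Editor's note] The GiST contract Paul is describing is expressed as an operator class. A sketch of the classic seven support functions, with a hypothetical `mybox` type and `my_*` functions standing in for a real implementation:

```sql
CREATE OPERATOR CLASS mybox_gist_ops
    DEFAULT FOR TYPE mybox USING gist AS
        OPERATOR 3 && ,  -- "overlaps", the query-side predicate
        FUNCTION 1 my_consistent (internal, mybox, smallint, oid, internal),
        FUNCTION 2 my_union (internal, internal),
        FUNCTION 3 my_compress (internal),
        FUNCTION 4 my_decompress (internal),
        FUNCTION 5 my_penalty (internal, internal, internal),
        FUNCTION 6 my_picksplit (internal, internal),
        FUNCTION 7 my_same (mybox, mybox, internal);
```

Later Postgres versions make compress/decompress optional and add an optional distance support function, which is what KNN ordering uses.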
That's one of the things that has made, say, a tool like pgvector
a complex thing to approach, and one that only someone with
serious computer science chops would want to attack: they had
to write their own access method right down to disk.
They did not get to just layer their access method on top of an
existing implementation like GiST.
They had to write it all the way down to the disk and page level,
which is a lot harder to do if you're going to retain guarantees
of consistency.
Nikolay: Right.
So you mentioned, I guess you mentioned
B-tree GiST, it's called,
or GiST B-tree, I always
Paul: confuse it.
Yeah, btree_gist.
Nikolay: Yeah, and it lost in comparison to the native
implementation of B-tree, while R-tree didn't lose, it won.
And just...
Paul: Yeah.
And I'm not sure; probably that's more a matter of people not
wanting to spend the time to work on the R-tree implementation.
It was considered not a good implementation at the time that we
started.
It still existed when we started
with PostGIS, but it was more
tightly bound to the spatial types,
the native spatial types
in Postgres.
And so we didn't use it.
No one else used it.
Eventually it was ripped out.
It had, I think all the way up
to the time it got taken out,
an object size limitation.
So it couldn't handle objects larger
than a page size.
That was, I think, the death knell,
really.
Because practical work with spatial
data inevitably involves
dealing with objects that exceed
the page size.
Nikolay: So what I wanted was just some counter-argument for why
it matters to understand things slightly more than just "let's
use it, and that's it".
Because, for example, in this particular case, B-tree is less
than, greater than, equals.
R-tree is: is contained, contains, overlaps, right?
If you use this btree_gist extension, you can combine them and
have a multicolumn index of very different data types, right?
And
Regina: I use that a lot too.
Nikolay: Yeah.
And this is a great performance trick.
And if you know why it's useful, it can save you, because it's
just a single index scan.
That's it.
So you use it a lot?
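[Editor's note] The trick being described, a single GiST index covering both a scalar and a spatial column via the `btree_gist` contrib extension, sketched with a hypothetical `poi` table:

```sql
CREATE EXTENSION IF NOT EXISTS btree_gist;

-- One multicolumn GiST index over a plain text column and a geometry.
CREATE INDEX poi_cat_geom_idx ON poi USING gist (category, geom);

-- Both the equality filter and the spatial predicate can now be
-- answered by a single index scan.
SELECT *
FROM   poi
WHERE  category = 'restaurant'
AND    geom && ST_MakeEnvelope(-123.4, 48.4, -123.3, 48.5, 4326);
```

Without `btree_gist`, the `category` column could not participate in a GiST index at all, forcing the planner to choose between two separate indexes.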
Regina: Well, if you have things where there are a lot of
attributes that you have to filter against, but you also need
to do a spatial filter at the same time, especially with the
sensitivity that Postgres has: it sometimes picks the wrong plan,
whereas when you have it together, it can't screw up.
So there are some cases I've had where, if I had just a GiST
index and a separate B-tree, the planner would do something
really stupid and try to use the spatial index and not use the
B-tree at all, and on performance you'd go from minutes to
milliseconds in those cases.
Nikolay: Yeah, yeah, I've had that experience more than once.
Michael: Same And I also thought
like Postgres would be able
to do a bitmap scan.
I thought it would be able to use
the GiST index and the B-tree
separately.
I know it wouldn't be as efficient
still, but it seemed to be
refusing to do so for me.
So I also wonder if there's something missing on the Postgres
side, to consider them for bitmap scans as well, multiple indexes
with an AND condition.
I thought it would consider it
but I really struggled to get
it to.
Nikolay: I can also complain: you take pgvector and you cannot
combine it with GiST, basically.
Paul: Yeah, that's super hard.
Given the use case for pgvector, "find me the nearest thing" is
the core use case, and that really does not fit nicely into the
executor.
Nikolay: One case works really fast and the other doesn't.
Michael: What about for beginners?
Regina, you've written some books on this.
How do you describe performance stuff for beginners, like in
terms of practical tips, or what do you tend to see mistakes
people make?
Regina: I think the classic one is people try to do distance
using ST_Distance.
So they say ST_Distance less than 50, instead of using ST_DWithin.
ST_Distance can't use an index, so their queries might take
minutes, and then when they switch to ST_DWithin it finishes in
milliseconds, that kind of thing, because it would have to scan
all the records if you use ST_Distance.
That's a classic mistake that people make.
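[Editor's note] The two forms side by side (hypothetical `parcels` table; `:pt` stands for a point parameter):

```sql
-- Anti-pattern: ST_Distance in the WHERE clause cannot use a
-- spatial index, so the distance is computed for every row.
SELECT * FROM parcels WHERE ST_Distance(geom, :pt) < 50;

-- Index-aware form: ST_DWithin checks bounding boxes through
-- the GiST index first, then refines the exact distance.
SELECT * FROM parcels WHERE ST_DWithin(geom, :pt, 50);
```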
Nikolay: I'm curious, does AI make this mistake or not?
Regina: And what's annoying is that in SQL Server it's different.
You can actually use ST_Distance and it uses an index.
Paul: So what are they doing in SQL Server?
They must be rewriting the query.
Regina: Yeah, they must be rewriting the query.
So I was surprised: oh yeah, you can do distance less than 50,
and it will do the right thing.
It will do pretty much what our ST_DWithin does.
Paul: That's interesting.
Yeah, well, query rewriting is kind of deep.
But the fact that they do that is interesting in and of itself.
It speaks to them either presupposing a problem their users are
going to have, or something their users really did have constantly,
so that it's worth it for them to go to the extra work of doing
a query rewrite for that case.
Regina: I noticed that Postgres does do query rewrites too, right?
Paul: It does.
There's an affordance there.
We could muck with the plan if we wanted to.
Regina: Yeah, if you do IN, they change it to ANY, right?
Paul: Yeah.
Yeah.
Regina: So in theory, we could do a rewrite somewhere.
It's just, where would that go?
Paul: Yeah.
I mean, that's like adding a certain amount of brittleness to
the system where you don't know
that the users actually need
it.
Anyways, your example presupposes
that they have a spatial index
in the first place, which is what
I would have raised as the
number one new user mistake.
Regina: Oh, that's true.
Yeah.
That's the other thing.
Paul: It's quite an easy mistake
to make because, by and large,
people don't think about indexes
as having a type.
If they come from a SQL Server world or
whatever, the idea that
they need this extra keyword,
the USING keyword, to get a
spatial index is not going to be
super obvious when they're just coming
from some other database.
And if you type CREATE INDEX blah
on a geometry column, it'll do
something; it won't error out,
because we've had to do lightweight
B-tree bindings in order to get
aspects of sortability for our
type.
So it's not like it'll just say,
oh, I don't have a B-tree binding.
It'll go and build a really terrible
index that's no good for
you and quietly return.
So I think that's one of the biggest
gotchas.
It's just like having a proper
spatial index.
Regina: Yes, knowing that you need
an index and knowing that
you need it to be a GiST index.
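Spelled out, the gotcha is just a missing USING clause (table and index names hypothetical):

```sql
-- Succeeds quietly via the lightweight B-tree sortability
-- bindings, but the resulting index is useless for spatial queries.
CREATE INDEX roads_geom_idx ON roads (geom);

-- What a spatial query actually needs: a GiST index.
CREATE INDEX roads_geom_gist ON roads USING GIST (geom);
```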
Paul: Yeah.
Regina: As far as between GiST
and SP-GiST, I'd still tend to
go toward GiST because I know we
put in more effort in GiST than
we did in SP-GiST.
Same.
I think our GiST support is more
robust than SP-GiST.
Paul: Yeah.
I have tried different benchmarks
and come up with different
answers from different data, different
queries.
There's not, I've not been able
to like fully characterize the
kinds of places where SP-GiST outperforms.
I don't know, Nik, you, you seemed
like really excited about
SP-GiST or like there are places
you've found where like SP-GiST
is the clear winner.
Nikolay: No, somehow I just haven't
seen those kinds of workloads
recently, maybe for a couple of
years, so I don't have fresh data
at all.
I don't know why; it's just a matter of
luck.
Michael: I came across one recently
where they were using a GiST
index and still struggling with
performance and I suggested trying
SP-GiST because I thought it might
make sense and it didn't help.
It wasn't much worse either; it was
very similar.
But the one thing that they changed
that did help a lot was they
were using geography before, and
it was only to calculate quite
small distances, like within 25 meters type thing and switching
to geometry made a big difference.
Yeah.
That surprised me. Yeah, you two are not surprised.
You two are nodding along.
Paul: I'm not surprised at all.
Regina: Because geography is more complex; it's
three-dimensional, whereas geometry is two-dimensional, really.
Paul: I have a slide in my sort of standard Postgres talk where
I say, you know, geometry or geography, what should I do?
And the thing that I use to demonstrate why geometry is going
to be better for performance purposes is just the distance calculation.
Everyone has learned Pythagoras,
so everyone knows what the distance calculation looks like there, right?
Two squares and a square root.
Any distance calculation in geography space is going to have
to do a bunch of Haversine calculations, which is the distance
on a sphere.
And Haversine has 5 transcendentals in it, plus the square root.
So it's just way more, like orders of magnitude more computationally
intensive.
So if you're doing a bunch of those calculations, you're just
going to pay a big price for sure.
Nikolay: The earth is flat.
The earth is flat.
I knew it.
Paul: In an ideal world, the earth is flat.
Yes.
So that means you need to commit to a planar understanding of
your work area if you can.
So if you are doing an app which works in London and say, you
know, there's a planar projection which is good for London.
If you're doing something in Idaho, there's a planar projection
which is good for Idaho.
If you're doing something which is like the continental
U.S., there is a planar projection which is good for most people's
purposes in the continental U.S.
It's very rare that you have something which requires latitude
and longitude.
You only get the breakdowns when you're starting to work with
truly global data.
Places where you go over the poles, places where you have data
in the Northwest Territories and also in South America, where
the scale distortions of something like Mercator cannot be ignored,
that's when you say, okay, fine, I'll use geography.
And then you'll get some nice advantages out of geography.
If you have truly global data, the fact that the shortest path
might go over the North Pole, that's a useful thing to have.
Or it would go across the dateline.
That's a useful thing for the system to just transparently handle,
which it does for both the calculations of things like distance
and area, And for the performance-based stuff like indexing,
like it does not care.
It goes over the poles.
It goes around the dateline.
It doesn't notice because the model just, it takes away those
singularity points when you're working on a sphere.
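As a sketch of what committing to a planar projection looks like for a London-only app (table name hypothetical; EPSG:27700 is the British National Grid, a planar projection covering Great Britain):

```sql
-- Store in a local planar CRS so distance math is cheap
-- Pythagoras in meters, not spherical trigonometry on geography.
CREATE TABLE shops (
    id   bigint PRIMARY KEY,
    geom geometry(Point, 27700)
);

-- Reproject incoming lon/lat (EPSG:4326) once, at write time.
INSERT INTO shops (id, geom)
VALUES (1, ST_Transform(
              ST_SetSRID(ST_MakePoint(-0.1276, 51.5072), 4326),
              27700));
```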
Michael: Nice.
I think the reason I was surprised was more because I'm
so used to queries being I/O-bound, and it's quite rare to
get one that's actually CPU-bound.
You know, the CPU actually does matter here.
Yeah.
Paul: That is one of the places where spatial is special and people
don't notice it.
And it's a weird one.
Trying to integrate with the Postgres planner, we've
been trying to help the planner get smarter by costing our functions
appropriately.
But the planner only looks at function costs in filters and joins.
A lot of the time, people are doing the spatial calculations
in the SELECT line, right?
The return values.
So if you have a complex function in your result set, Postgres
is not going to add that in and say, oh, I should do a different
thing, or that there's a lot of cost here.
So that's an issue.
And then actually it's a real problem on I/O as well.
I did a blog post for Snowflake about two weeks ago that
went public, talking about that with respect to JSONB, but I
learned about it in the geography and geometry world.
You get a big object, and Canada has like 10,000 vertices,
well above the maximum object size that you'll find in a page.
So it's going to be TOASTed.
And that means any retrieval of that object is going to be a
2 stepper where it has to go and find the TOAST pointer and then
go into the TOAST tables and gather all the parts and put them
all together in order and then decompress all that, like just
to get it out, just in the IO.
And that's a combination of the I/O to pull that across, but
also the computational overhead of just reassembling the thing;
it's about 10X.
I found in joins you can get things to go 10X faster when you
don't have that TOAST overhead, if you pre-cut everything into
shapes which are smaller than the TOAST maximum.
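One way to do that pre-cutting in PostGIS is ST_Subdivide, which splits large geometries into pieces with a bounded vertex count (table names hypothetical):

```sql
-- Split huge polygons (think national borders) into pieces of at
-- most 255 vertices, so most rows fit in-page and joins avoid the
-- TOAST fetch/decompress/reassemble overhead.
CREATE TABLE countries_small AS
SELECT name, ST_Subdivide(geom, 255) AS geom
FROM countries;

CREATE INDEX countries_small_gist
    ON countries_small USING GIST (geom);
```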
Regina: Yeah, speaking of TOAST, there's also the one gigabyte.
Is that the limit on TOAST?
People have already started
complaining about that.
Paul: It's...
Nikolay: What is it for
varlena types?
It's one gigabyte, I think.
Yeah.
But it's better not to go beyond
300 megabytes.
I think there are benchmarks showing
that for JSON it's terrible.
Paul: Oh, I can't imagine.
Nikolay: Yeah.
There is a drop in performance
after 300 megabytes or so.
Michael: That's a lot of JSON though.
Regina: Yeah, I forget
what the reason was.
It had more to do with outputting
stuff.
Nikolay: I also forgot.
Regina: Like if you're trying to
output a big MVT file.
Paul: The whole world in one file?
The level 0 file?
Michael: Does it compress well?
I was just thinking, JSON generally
has a lot of repetition, so a
300-megabyte post-compression JSON file would
be, like...
Paul: Pretty massive.
This is one of the places where there
are potential implementation
paths which would have given us
a much better I/O profile, but at the cost of something
which the users consider
pretty inviolate, and that's perfect
fidelity of input and output.
So users are feeding the system
with geometries which are described
in terms of coordinates which are
built on double precision.
And if you want to return what
they got, you have to store the
double precision.
And double precision does not compress
so great.
So in answer to your compression
question, yeah, not so great.
If you just take the smallest swing
at saying, this stuff is
quantized or like, this has an
effective minimum precision, like
your road segment is not more precise
than a centimeter.
And if I take advantage of that
fact, if I know that there is
a quantization available, I can
squash stuff down into a delta
encoding, which is incredibly cheap
for spatial data, because
spatial data tends to be highly
auto-correlated in space.
So you can store the differences
between each coordinate with
a very small number, which is great.
It's a really good compression
and it's a very fast decompression.
But you have to give up precision
to take advantage of it.
Nikolay: Or if it's a lot of moving
objects on the map; this
should be column compression,
like TimescaleDB does.
The position just changes a little
bit; we cannot jump to different continents
immediately.
Paul: Yeah, but
there will have to be precision
associated with that.
Like they will be losing a little
bit of precision when they
compress.
Like you can't just slam floats
or slam doubles together cleanly
all the time and get a nice clean
delta encoding.
Sometimes you can, but not always.
Nikolay: Similar with timestamps,
actually.
If you want to give up milliseconds,
you can have fewer bytes,
definitely.
Speaking of numbers, I'm curious
what are the biggest PostGIS
setups you saw?
Both single node or maybe sharded
systems like with Citus or
something.
Paul: Regina, you're
the implementer.
What have you seen?
Regina: I have no idea.
I don't work with that big of data
sets.
Paul: Yeah.
So PostGIS is a weird
member of the Postgres
community, because we have, I think,
almost certainly the largest
number of deployed clusters.
Like the number of PostGIS users
is huge.
I remember Stephen Frost came to
his first FOSS4G and his
his first Phosphor G and his
jaw sort of hit the floor because
he was used to going to a Postgres
conference where there'd be like
200 attendees and he came to
1 of the free and open source for
geospatial conferences.
And there are 800 attendees.
And every one of them he talked to
said, yeah, yeah, we use PostGIS.
It was like he'd never seen so
many distinct Postgres users in
1 place.
That's one difference, the population
of PostGIS users is huge
relative to the population of Postgres
users, but they all have
very small deployments.
So yeah, if you're, if you're a
county, you've got several dozen
tables, none of which has more
than 100,000 records, like from
the point of view of scale, it
barely registers, But there are
a few big ones.
It's just that you don't necessarily
know they exist because
no one talks about them.
Like pretty sure Apple Maps production
system is all on Postgres.
But that's only because someone whispered
it to me in a dark alley;
no one's ever made it public.
Then there's HERE, which is a really
large geospatial data aggregator.
If you're going to get raw geospatial
data and want to have a
map which competes with what
Apple or Google provides,
you probably go to HERE and license
their data.
I know that their production pipeline
and a lot of their APIs
are built on PostGIS.
So it's a really big user, global
data, high uptime requirements.
And then one that repeats over and
over again, a lot of these installations, and they're
big and high performance, is people who are doing OSM map builds.
OSM means OpenStreetMap.
OpenStreetMap is a map of the world.
It's a map of the world based on a very strict topological model,
edge node collections, which means that anything which covers
an area has to be built from those edge node relationships, which
implies loading the whole edge node system and then building
up the polygons from the parts and then testing the polygons
against each other.
And so people end up using PostGIS as the intermediate
stage between I've-got-a-lot-of-raw-OpenStreetMap-data and
I-have-a-rendered-tile-or-a-vector-tile; there's a big Postgres
and PostGIS process in the middle of that where it's shuffling through
all the data and building up the higher-level structures that just
don't exist in the raw OSM dump.
And there's gotta be hundreds and hundreds of those,
but they all look about the same size.
There's a size of OSM.
Regina: There's more than hundreds.
Paul: Okay.
Thousands of them.
Nikolay: Yeah.
Also, I saw in your blog posts Redfin, State Farm;
these setups should have a lot of data, right?
Paul: They do, and real production loads.
They're actually, they're asking real questions to the spatial
database and getting spatial answers back.
Nikolay: And also state level, government level, like in France,
Norway?
Paul: Yeah, that was our biggest installation for a long time.
It was the biggest one I could quote chapter and verse on.
They came on very early, in 2005.
One of our first big institutional users was IGN, the French National
Mapping Agency.
And they ran Postgres and PostGIS through a competitive process
with Db2 Spatial and Oracle Spatial at that time, 2005, and determined
that Postgres and PostGIS was just as good as
Db2 Spatial and Oracle Spatial, and most importantly, way
cheaper.
And way easier to deploy.
They didn't have to go through a full commercial
requisition.
They could just say, okay, we're going to use it.
And they did.
So IGN has been on Postgres PostGIS for 20 years.
And that initial database was a 150 million object database.
Michael: At the...
Paul: At the time, very substantial.
Nowadays, it's a rounding error.
But at the time, it was one of the biggest databases.
Nikolay: You needed to be really brave those days because of
lack of replication and proper backup.
Yeah.
Today, it's much, much better.
Much better.
Paul: They wanted to have, one of the requirements was multi-site
replication.
Because they wanted to have...
I think it was Slony to start with.
Nikolay: Yeah.
Good.
I'm glad now it's already solved.
Relatively.
We have problems with large setups, but it's already a different
story.
Great.
Michael: Is there anything we haven't talked about that either
of you wanted to make sure we mentioned?
Regina: I guess I should mention conferences.
So there's going to be a PgDay Boston conference, June 9th.
And then for the conference that Paul was mentioning, the free
and open source geospatial, that's going to be in Japan this
year.
And let's see.
Paul: That's Hiroshima, August 30th to September 5th.
Regina: Yeah.
Oh, and then there's also FOSS4G NA.
So that's coming up.
But I don't think we've set the exact date yet; it's gonna
be probably in November.
Paul: November.
Michael: I really appreciate that you run a PostGIS Day,
the day after GIS Day.
I've actually watched a few of them.
You run them all online, which is great.
I've seen quite a few; I tend to at least watch the
talks on performance, my special interest, and anything by Brian
Timoney as well.
I wanted to give him a shout-out.
I've enjoyed those.
Paul: Yeah.
PostGIS day.
If you just type it in, you'll get all the video archives.
So if you're jonesing for PostGIS conference content, that's
the place to get it for sure.
Michael: Yeah.
It's awesome.
I highly recommend it.
Any last things from you, Paul?
Paul: We haven't talked about AI,
which feels like a huge miss.
Yeah.
And I don't know what your
experience is with this.
Regina: Nik's company has AI
in its name.
Paul: Oh yeah.
So he's really into it.
That's good.
And I wanted to point out like
a couple of things that I've seen
like in the AI space.
One is the geospatial AI space.
There's this whole other field
called foundation models, which
is basically like RAG, retrieval-augmented
generation, for geospatial, where
they build up a characterization
of an area or the whole globe, where
every pixel in that characterization
has a very high-dimensionality
vector associated
with it.
So it really is a RAG process.
But people have been using pgvector
as like the storage engine
for their foundation models and
to do like quick gatherings of
nearest neighbor.
And when you do nearest neighbor
with a foundation model, what
you get is like land, which is
similar to this other land.
So you can draw a circle around,
say, a bunch of solar panels
and say, show me the other solar
panels, and it will just go
off and find every solar panel
in the world for you.
Very cool stuff.
And there's a cool talk
on it from the 2025 PostGIS
Day.
Really worth looking at.
The other thing that came up at
that PostGIS Day around AI
was the idea of natural language
to SQL; swing a cat or throw
a stone and you'll hit someone who's doing
natural language to SQL out
there.
But there was a really cool observation from
Brendan Ashworth of Bunting
Labs during his talk: the way
you get a coding model is to first
train a generic model and then
do reinforcement learning against
it.
And the way you get a code model
which can speak SQL is to
do a lot of reinforcement learning
against standard SQL patterns.
And his observation was that the
standard SQL patterns tend to
be like Postgres SQL patterns.
So you tend to get really good
Postgres SQL and maybe not so
good Oracle SQL.
And the same thing holds true on
the spatial side, because PostGIS
is the lingua franca for geospatial
SQL.
You tend to get good spatial SQL
generated by these models.
And the real limiting step tends
to be providing the correct
amount of like semantic metadata
around your data model so that
the LLM can make sense of the natural
language question you're
making and spit out the right SQL.
But you're already ahead of the
game if you're using Postgres
and PostGIS as your engine, because
these models will tend to
generate SQL that matches them.
Nikolay: Writing in Python
is easier with LLMs
because they know it much better
than newer languages.
Paul: Yeah, give me Perl 6.
Nikolay: Yeah, and that's why Python
is very natural, because
there's so much material it was
trained on, and Postgres is the
same thing.
This is a great point.
So it should help to grow popularity
even more because a lot
of applications are created by
AI these days.
But I'm curious: if you deal with
a lot of data, you need to
verify.
If you don't have a way to verify
all the ideas AI is generating,
the quality will not be good.
So one thing I always think about is
how to give AI tools to verify
its own ideas very fast, without
humans.
If that happens before humans are involved, it's
already much better.
I'm curious what you think about
this.
How to write proper queries and
not use that distance function
you mentioned.
Regina: The good thing is, as Paul
said, it does write the right
distance function.
It does do ST_DWithin, if you give
it a text, say, give me something
within 50 miles, it writes the
right query.
I have tested that.
Nikolay: Good, but it still would
be better if it had
some tooling to test and see actual
results, not just guess.
Paul: Yeah, and we've started to
see that discipline arrive in
the PostGIS codebase.
We have our first agents file.
Regina: Oh yeah, we have that SKILL.md.
Paul: Or SKILL, sorry.
So yeah, we have our first skills
file, which does touch on that
kind of stuff you're talking about,
Nik.
Like these are best practices for
generating spatial SQL.
Nikolay: Yeah, in my opinion, copy-on-write
and database branching
should grow, because we need to
be able to test on large data
sets for very little money and very
fast.
It can be based on S3, like some companies
implemented it, right?
Neon or Timescale, they implemented
it.
It can be with ZFS or anything,
but it should be fast and cheap.
So you give it like millions of points
on the map and tell it: just
check EXPLAIN (ANALYZE,
BUFFERS).
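The verification loop Nikolay describes amounts to running each candidate query under EXPLAIN (ANALYZE, BUFFERS) and checking the plan (query and names hypothetical):

```sql
EXPLAIN (ANALYZE, BUFFERS)
SELECT id
FROM points
WHERE ST_DWithin(geom, ST_MakePoint(37.62, 55.75), 1000);
-- Look for an Index Scan on the GiST index and low shared-buffer
-- reads; a Seq Scan over millions of points means the idea failed.
```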
Michael: Or, Nik, did you see friend of the show, Radim posted
a new blog post recently about just import the stats.
So you don't even need to load all the data.
Nikolay: Yeah, I saw it.
Just test planning and behavior without data actually, just with
stats.
I think it's a valid approach, but at the same time I think it's
limited because you don't see actual execution.
Michael: Of course, but you can at least check like index use
and things, which is cool.
Nikolay: Yeah, definitely.
It's a super lightweight approach.
I like that.
And I think more and more should be developed because of AI.
Otherwise, so many mistakes.
But if you have this testing pipeline, you're not even involved:
AI has some ideas for how to improve a query, plus the tooling to
verify them, and you put it into tests, into CI, and so on.
And it's great.
Yeah.
That's why we created branching many years ago with DBLab.
We were sitting and waiting for these very days when
it's needed so much.
Great.
Michael: Well, it was a pleasure to meet you both.
Thank you so much for coming.
Thank you for joining us.
Nikolay: Thank you for coming.
Regina: It was a nice talk.
Paul: Thanks Nik.
Thanks Michael.
Nikolay: Yeah.
Have a great week.