Michael: Hello and welcome to Postgres.FM, a weekly show about

all things PostgreSQL.

I am Michael, founder of pgMustard and I'm joined as usual by

Nik, founder of PostgresAI.

Hey Nik.

Nikolay: Hi Michael.

Michael: And we have not one, but two heavy hitters in from the PostGIS world.

Both are also recognized Postgres contributors.

First we have Regina Obe, President of Paragon Corporation,

a consulting firm, member of the PostGIS core development team

and steering committee, and co-author of a whole host of books

on GIS and SQL, including the book PostGIS in Action.

Welcome, Regina.

Regina: Thanks, thanks for having me.

Michael: It's our pleasure.

And also we have Paul Ramsey, who is Staff Engineer at Snowflake

via the Crunchy Data acquisition, the co-founder of PostGIS, member

of the core development team and the chair of its steering committee.

A pleasure to have you too, Paul.

Paul: Hi, Michael and Nik.

Nice to be here.

Michael: It's wonderful.

Right.

So I was hoping actually that maybe one of you could give us a

kind of brief history of the project and maybe the other one could

give us an update on where it's at today and what it looks like.

Paul: Well I'll take the history side of this because yeah, because

PostGIS comes out of my history as a geospatial consultant and

it's a long time ago now.

PostGIS's first release, 0.1, was at the end of May 2001.

So we're creeping up on 25 years of PostGIS.

Pretty crazy.

PostGIS was born out of a consulting company here in Victoria, British Columbia, and it was born because we ran out of work.

The fiscal year for the government, which is our main source of money, ends on March 31st.

And the first release of PostGIS was two months after that.

We had been doing analytical work for the government using standard GIS tools and a database to store interim results in, and the database we were using was Postgres.

And having had that experience of using a database and using SQL, becoming familiar with this powerful tool for slicing through big sets of data in different ways, and being spatial people, we looked at this database tool and said: there's gotta be a way to bring together the spatial questions we're asking and the non-spatial questions we're asking into one place, so we can ask more complicated questions and have it all fulfilled inside the database.

And there was at that point a relatively fresh international standard, the Simple Features for SQL standard from the Open Geospatial Consortium.

There was one extant proprietary product that implemented it.

Oracle 8i had been released at that point, and it had a pretty 80/20-complete implementation of spatial SQL.

And we both looked at Oracle Spatial and thought, can we use this?

At the time, if you wanted to use it, you had to have Oracle Enterprise.

Really, the implication was about a quarter-million-dollar licensing bill to use it in anger.

And that obviously scared us off as a small consulting company.

It also successfully scared off anyone in the government. We had thought, maybe we'll get someone in the government who wants to do this stuff, and we'll piggyback on their licensing and learn it and do an implementation for them.

And we'll be spatial database experts.

Now it was too expensive for anyone in the government.

But we had this specification, and we had this experience with Postgres, and Postgres itself was built for type extension.

So we also had a really clear on-ramp to add a new type, to bind it to access methods, to add functions that worked on this type, in a way that really didn't exist in what was at the time the premier open source database, MySQL.

It took quite a few years beyond
2001 before MySQL added Spatial.

And when they did, we went and
looked at their implementation,

looked at the patch, and it was
big.

It was a big patch that crossed
the whole code base.

It really took someone who understood MySQL soup to nuts to do it.

They had to do the index implementation,
like everything.

They had to do a hell of a lot
of work to do it.

Whereas we got this nice, tight
little code base sitting as an

extension completely divorced from
the core code.

The core code provided the access method prebuilt.

We were good to go.

So it was doable to have a working
extension that we could demonstrate

to ourselves and to customers within
2 months and then release.

So yeah, we were clever people, but Postgres is a hell of a clever product, and PostGIS wouldn't exist if Postgres hadn't been so clever to start with.

Nikolay: And it was just right
at that time.

Paul: At that point it was GiST, and it still is GiST, actually, the access method we use for spatial indexing.

Yeah.

Nikolay: I'm smiling the whole time you're talking, because it connects to 20 years ago, 2005, 2006, 2007.

I remember, first of all, this is when I left Oracle 8i.

Exactly: the last time I touched Oracle was exactly 8i.

And then when I was building my second startup, a social network, we put a lot of people and objects, like restaurants and so on, on a map. In 2007, it was too early.

And I remember I hired 2 guys,
Oleg Bartunov and Teodor Sigaev,

to help us.

It was so frustrating, because it was amazing to have GiST, reading all the articles and the basics, the 7 functions, the abstraction, all great.

But when we got to practical questions, we had 10 million people to place on the map, and a lot of objects, and then you try to solve a simple problem: show me the nearest 10 objects on the map which have good reviews.

And it was terrible, because there was no support for ORDER BY, for KNN.

I didn't know KNN.

And since we moved so fast, I eventually
decided to pivot my

startup away from maps and so on.

But Oleg and Teodor, a couple
of years later, brought KNN to

GiST.

It was 9.1.

And then Alexander Korotkov, with help from Nikita Glukhov, brought the same to SP-GiST.

And this connects us a lot.

But I just showed them this problem, and that was it for me.

It was great.
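The problem Nikolay describes, finding the nearest N rows ordered by distance, is what the KNN distance operator added in PostgreSQL 9.1 addresses. A minimal sketch, assuming a hypothetical places table with a point geometry column and a rating attribute:

```sql
-- Hypothetical schema: places(id, name, rating, geom geometry(Point, 4326))
CREATE INDEX places_geom_idx ON places USING gist (geom);

-- Since PostgreSQL 9.1, the <-> distance operator can drive an
-- index-ordered scan over GiST, so this reads rows outward from the
-- query point instead of computing the distance to every row:
SELECT id, name
FROM places
WHERE rating >= 4
ORDER BY geom <-> ST_SetSRID(ST_MakePoint(-122.41, 37.77), 4326)
LIMIT 10;
```

Note that the attribute filter still has to be rechecked on rows the index hands back, so a very selective rating condition can force a deeper index walk.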

Paul: Yeah, back in the early days,
starting off incubating an

open source project, within a couple
of years, we had quite a

few users and some institutional
users.

And at the time the GiST infrastructure did not include recovery.

It wasn't a completely ACID implementation.

And at that point I had a stable of people who had businesses depending on PostGIS, so I put out a call for development money, scraped together about $10,000 or $15,000 from all the different companies who were using or depended on Postgres, and then shipped that money to Oleg and Teodor, who did the work necessary to make GiST recoverable.

And it was a weird thing to sell
to the community.

It's: we're going to get this work done.

As a result of the release cycle of Postgres, and the time it takes to do the work and all that stuff, you'll be able to use it and show it to your clients in 12 to 18 months.

But you're going to give me the money right up front.

So it was an interesting introduction, both to the Postgres development cycle, and also to trying to drum up crowdsourced funds for open source development.

And it really taught me that crowdsourcing funds is really hard, because people are willing to pay to get the bug which is bothering them right now fixed right away.

It's much more difficult to get them to pay for something which will maybe make them a little bit happier in the far future.

It's a harder sale.

Nikolay: Especially for startups.

They move too fast to understand that the thing will come only in a couple of years.

Regina: Yeah, they might not be
around that long.

Nikolay: But I'm glad KNN was brought into Postgres and PostGIS as well.

Yeah, great.

Michael: Is that a good excuse
for us to get Regina to give us

a little bit of like a catching
up from, it's been 25 years,

right?

You've done a lot of work since
those early days.

What's changed?

What does it look like now and
how is that different to the first

versions?

Regina: I mean, we definitely have a bigger audience, and people now seem to be running things in the cloud more than they're running things on premises. Or maybe not.

I think it's slowly getting to the point where most people we see are running PostGIS in Google Cloud or Amazon Cloud or Microsoft Azure.

I think that's the biggest thing
that's changed since we started.

Michael: I was thinking as well,
it looked like it started as

a single extension and it's now
kind of a collection, it's like

a family of extensions that serve
quite a few more use cases

and lots

Regina: of functions.

So when we started off, we just
had the vector support and then

we added raster support.

And originally we thought, oh, we'll just throw raster in the PostGIS extension.

And then it got very fat, and people started complaining: why do I have to carry these thousands of functions when I don't even use raster?

So then we broke it out into two extensions, PostGIS and PostGIS Raster.

And we also have PostGIS Topology. And then PostGIS spun off other extensions.

So you have the H3 extension, which we have now brought into our organization, but it's still a separate extension.

It's not even part of the PostGIS extension, because we forked it from the original owner into our system.

So that was very recent.

And then there's another extension called MobilityDB, which is

a separate extension, separate group of people, but they build

on top of PostGIS.

And they deal with things like, you know, vehicle movement, the

speed at which things move and all that and flying birds, trains.

I think those are the key ones.

Yeah.

And then there's also a pgRouting, which I'm a member of too,

which deals with network routing, like driving, biking, walking,

getting directions on that.

So yes, it's spun up a lot of other extensions.

Paul: Yeah.

So that's the technical side.

Deployment is a big deal for sure.

When we started, it was: you want to use PostGIS?

Here's the source code.

Untar it in the contrib directory of an existing Postgres installation.

And then you build, and then you install, and now you have PostGIS.

And around the time Regina started joining us, she helped with

Windows builds.

So Windows people at least could have a pre-built binary.

Regina: And I still do Windows builds.

Paul: At some point, PGDG included us in their builds, and that was a huge change.

So now it meant that you didn't have to build it yourself.

You could just find the right packages and install those.

That opened up a whole pile of new users who previously wouldn't

have tried it.

But yeah, as Regina said, the real tipping point lately has been

being part of the default installs on all the clouds.

So there's no longer a software install step from the point of

view of the end user.

It's just type create extension PostGIS.
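On a managed cloud instance where the packages are preinstalled, enabling PostGIS really is one statement per database; a quick sketch:

```sql
-- Enable PostGIS in the current database (the binaries are already
-- installed on the cloud image, so no build or package step is needed):
CREATE EXTENSION postgis;

-- Sanity check: report the PostGIS version and its linked libraries.
SELECT postgis_full_version();
```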

And that ubiquity really goes beyond the ubiquity for users.

I feel like we've achieved a certain level of ubiquity in the

industry.

Like, first of all, geospatial as database functionality is just table stakes now.

It's pretty rare to find a SQL database that doesn't have geospatial.

Those who don't are all tacking it on at a great rate.

Regina: Ours is still the most popular, I think.

I think it's surprising that, even though Postgres started out as a meager database, PostGIS itself is probably more popular than Oracle Spatial and SQL Server Spatial and MySQL.

I don't even hear of other people doing serious work on those.

Paul: Yeah, and that's probably not fair, because I bet you there's a whole bunch of institutional users of those databases who never leave the ranch.

So all they understand is Oracle Spatial or SQL Server Spatial, because that's just the way their institutions are shaped.

But there's no doubt that in the
wider world, PostGIS is the

thing.

The thing that made me really understand
that PostGIS was the

lingua franca was seeing the
Google BigQuery announcement

blog post where they said one of
the features of Google BigQuery

was that it was PostGIS compatible.

It was like, Oh, wow.

Okay.

So that's nice to see.

Michael: That's them saying you're
the standard, right?

That's pretty cool.

Paul: Yeah.

And it's a standard. They could have said, we support the Simple Features for SQL standard, but they decided that PostGIS-compatible was more understandable to the larger marketplace.

It's like, okay, so we have the mindshare.

Nikolay: Yeah, that's great.

Actually, my daughter is a fourth-year bachelor's student at UCSD, and they have projects related to wildfires in California, and GIS as well.

But I was super annoyed to hear it's not PostGIS, it's actually Google BigQuery, so that's good to hear.

Yeah.

It's at least compatible.

Great.

Paul: Oh, reusable.

Reusable knowledge.

Nikolay: Cool.

What are the latest things you can share?

What's the latest development?

This is one thing I'm also curious about.

I know, for example, PostGIS works quite well with TimescaleDB.

It should work with pgvector.

I never tried.

I'm curious.

And with sharding systems as well.

There are super extensions.

In the case of PostGIS, it's not a super extension, it's a collection of extensions.

But sometimes it's hard to make these big building blocks work together, especially with pgvector, because it's also an important building block, right?

What's the recent development?

Regina: I don't know if there's any recent development. I think they just work together, and it's because of the way Postgres is set up, right?

You can have three or four different extensions in the same query, just because they standardize: all data types get the same core feature set.

And so you don't have that annoying
problem that MySQL has with

their different storage engines
and their different query syntax

for things.

So things just work seamlessly
well.

So I don't think we did anything
special to try to make ourselves

work with pgvector.

Paul: Yeah, pgvector is more similar to us than, say, like you said, the super extensions like Timescale or Citus, these sort of overarching big swings at the architecture of the system.

pgvector, like, they provide a
type, they provide access methods,

they provide some functions that
make sense for that type.

They're very PostGIS-esque in that
respect.

But from the point of view of what
they're for, they're kind

of orthogonal.

I've only seen one example recently
of PostGIS and pgvector being

used together.

And they're being used together not because they're getting any great computational leverage out of PostGIS, but because PostGIS itself has a huge amount of third-party interconnectivity.

So you use something like really high-dimensional search in pgvector to get a bunch of results.

And if those results happen to
include spatial information, then

you throw that spatial information
on a map because everything

connects to PostGIS, all the mapping
software connects to PostGIS,

and that's how it works.

We really saw a cool example of
this in our last PostGIS Day.

We have a PostGIS Day mid-November
every year, an online conference,

and Shawn Burke from GeoBase
showed us geospatial AI using

pgvector as the search engine, but then getting the results back through PostGIS to his mapping tools, so it could show us the results.

A neat example of what geospatial people are calling foundational models: not large language models, but a completely different look at how to use AI technology for doing geospatial work.

Nikolay: And going back to TimescaleDB, I saw great examples of PostGIS and TimescaleDB working together, because if you collect a lot of geographic signals from drivers, or from anything on the map, you want to partition all this, right?

Because it's a lot of data, right?

One way or another, partitioning should work, because it's also time-series data, not only geographical data, right?

So this is a natural pair of extensions working together.
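The pattern Nikolay describes can be sketched with plain declarative partitioning (TimescaleDB automates the same idea with hypertables); the table and column names here are illustrative:

```sql
-- Time-partitioned table of location pings; each row is both
-- time-series data (ts) and geographic data (geom).
CREATE TABLE gps_pings (
    vehicle_id integer,
    ts         timestamptz NOT NULL,
    geom       geometry(Point, 4326)
) PARTITION BY RANGE (ts);

CREATE TABLE gps_pings_2025_01 PARTITION OF gps_pings
    FOR VALUES FROM ('2025-01-01') TO ('2025-02-01');

-- A partitioned GiST index: each partition gets its own spatial index,
-- and time-range predicates prune whole partitions before any spatial work.
CREATE INDEX ON gps_pings USING gist (geom);
```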

Okay.

What about lakes, since your company got acquired last year?

Yeah.

Yeah.

What's the recent development there?

Paul: And this actually ties back to your other question, which at its core is an interoperability question.

The nature of the type model in Postgres is such that when you add a new type using the Postgres type model, a lot of the time things just work, because a big piece of overarching machinery, say Citus, doesn't necessarily think about particular types.

It thinks just about type mappings, and the type mappings are entered when you add the new type to the pg_type table, and your pg_attribute entries just refer back to the type table.

So everything is runtime-configured, and that's the underlying expectation of these extensions.

So stuff just works.

It's only when you have to push out to some external format that has its own understanding of the world that you have to start thinking about these mappings again.

pg_lake, which was open sourced a few months ago, is a set of tooling that allows you to push data from a Postgres database out to Parquet or Iceberg data on object stores.

Michael: I had a question around
that.

Only in researching for this, I
came across GeoParquet for the

first time.

I didn't know that was a thing.

Paul: Yeah.

There's a long history of the geospatial community taking extant open standards, ones which are good and clearly going somewhere, and extending them.

Regina: GeoJSON is another one.

Paul: Exactly.

Back when JSON was turning into the lingua franca wire-transfer protocol, the geospatial community said: but we need a way to encode geometry too.

So they invented one and eventually got it ISO, not ISO, IETF standardized.

Same thing with Parquet.

So you have Parquet, which in its initial versions understands the basic types of relational databases and doesn't understand anything about geo.

So geo people, wanting to push data through this format out to their own clients, said: we've got this byte type and we've got this metadata slot.

We'll slot in some information about which columns are holding geometry, and we'll stick the raw geometry into the byte type.

And that will be just enough affordance for geo-aware tools to look in the metadata and say, oh wait, there's geo here, and we can pull it out.

Fortunately, for the Parquet and Iceberg formats, that has been a very short interregnum.

Sometimes these things last for years; five years on, you're still playing with the hacks.

I think GeoParquet and GeoIceberg lasted for about a year and a half before the mainline projects came along and said: you know what we need? We need a geometry type and a geography type.

We should just formalize that.

So yeah, that was formalized, I think, 9 to 12 months ago; both Parquet and Iceberg got around to saying, yeah, we have an actual type.

And it meant that the metadata could live in proper slots.

So it's where you'd expect it to be, which in turn let engine developers do things like much faster scans and pruning of these files.

So it's been good all around, and you can forget about GeoParquet and forget about GeoIceberg, because those are pieces of history now; the geo stuff is now actually embedded in the standards proper.

That's

Michael: great news.

Paul: Yeah.

Yes.

So, yeah.

So pg_lake is a way of pushing data out of Postgres onto the public internet, into places where these Parquet and Iceberg engines can scan it.

And the promise of that is that if you pump your data out in, hopefully, an ordering which is good for the bulk queries you expect to apply to the data, you can then hit those object-store data lakes, if you will, God help us, with an engine which understands those formats and can slam through them in massively parallel fashion.

And, you know, one of those engines
is DuckDB, a single node engine,

which is very good at that.

One of those engines is Snowflake,
a multi-node, highly parallel

engine, which is also really good
at that.

Another one is Databricks.

All the same gig.

It's like this separation of compute
and storage.

The compute comes to the object
store, rips this file off.

It is ridiculously inefficient.

If you're a Postgres developer,
you think, this is really inefficient,

you should have an index, so you
can find exactly the thing you're

looking for.

And these compute engines say: no, I will look at every record.

I will lightly prune the file as
I zip through it.

But they're willing to scan a terabyte of data to get a few hundred thousand rows of results, which is crazy, unless you happen to have the compute and storage and cloud infrastructure to afford to do that, which these days we do.

It's one of the problems of being old: your understanding of what is efficient and correct is receding into the distance.

Nikolay: But I still have some doubts and questions about how stable performance is for such queries.

It's maybe not a fully resolved problem.

Paul: I agree.

I find it really weird that you have to reformat or re-sort your data to match your use case.

And it's nice.

It's great when things work out
great.

So you're sloughing your data off the transactional store, off to the object store, usually in something like time-series order, right?

Time of most recent change.

And it so happens that your queries
are either grouping on time

or filtering on time, and then
the underlying format metadata

allows the engines to really throw
away 90% of the data before

it looks at it.

That's great.

I'm used to the power of SQL inside
a real OLTP engine where you

can address any combination of
tables and joins, and you're not

going to be hit too hard with efficiency
as long as your indexes

are in place.

Nikolay: And it sounds reasonable
for columnstore and for purely

analytical queries, but when you,
for example, go back to this

KNN, K-Nearest Neighbors problem,
we just need like 10

Paul: rows, but we

Nikolay: have millions, right,
or billions.

Regina: Yeah.

How does that work?

Does it even work?

Can you even do a KNN in those
systems?

Paul: Ah, you end up making some
kind of compromise with reality.

If you sort your data, like on
a key, like a Hilbert code or

something that has like spatial
co-location to it, then you can

do that kind of scan a little better
because you know where to

hop in the Hilbert code.

DuckDB, for its internal implementation, I think, just sucked the data up and put a lightweight index next to the table.

So they're not actually doing the
full column store scan for

those particular kinds of queries.

As database people who have put a geospatial type into a standard relational database, we like to say spatial isn't special, because you just type SQL and you get your answer back.

But at an implementation level, there's no getting around it.

Spatial is special in a whole bunch of ways, just from being higher dimensional; it's the same thing people run into with pgvector.

Once you get beyond one dimension of sort order, a lot of your preconceptions about the most efficient way to deal with the data fall away, and you have to rethink problems.

Michael: One difference seems to
be people aren't as forgiving

of accuracy issues when it comes
to spatial data.

You know, you don't have approximate
nearest neighbor.

Paul: That is 100% true.

Yeah.

With things on maps, people are very unforgiving.

It's like, one mistake and you're wrong.

Whereas you roll up a BI dashboard
and give some people a pie

chart, whatever.

It's actually one of my great shaking-my-fist-at-the-cloud things about the big column stores: people insist on summing up every one of 5 billion records to make a pie chart.

You could have sampled and gotten the exact same pie chart.

Michael: I feel like we've dived into the most complex type of performance issues, but a lot of people... like, I've come across some performance issues when it comes to PostGIS work, a few customers that have the odd specific case.

And I think it's even the basics that they have questions around, like geometry versus geography, or GiST versus SP-GiST.

Any advice either of you have for people that are new to this stuff, on the basics, on the simpler queries: how should they be thinking about these things?

Nikolay: The absolute basic is how R-tree is implemented and what R-tree is compared to B-tree.

This is like the gem of Postgres, actually, right?

Maybe let's talk a little bit about this, some basics in the middle of it.

Regina: I think that discussion bores most users.

They just want to say: I can write this, and I get this.

And they don't care about the underlying implementation.

Paul: You think

Nikolay: so?

There are different users, I don't
know.

Maybe.

But okay, let's mention that Postgres has this GiST, a generalized search tree, and R-tree is implemented using it, not separately.

Correct.

And yeah, there is great work by Hellerstein from Berkeley, from the late 90s.

It's basically a generalization of the B-tree idea to multidimensional space: two-dimensional, three-dimensional, n-dimensional.

Paul: Well, n-dimensional is defined more or less by the API, right?

Not so much by the space; whatever space you're in, as long as you can meet the conditions of the API.

Nikolay: 7 functions, right?

Paul: Yeah.

Given a key, can you say whether things are consistent or not with that key? That's one piece of the API.

Another is given a collection of
keys, can you split that collection?

And depending on your dimensional space, those splits will be calculated differently, but it's just: given a set, can you do a split for me?

Yes.

Okay.

Can you do key consistency?

I'm trying to think what the other API pieces are; I feel like there's a third. But regardless, with the rules of the GiST API, you can implement an R-tree. The kinds of questions that an R-tree answers, the kinds of guarantees an R-tree gives, are ones that fit inside the generalized API of GiST.
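The API pieces Paul is recalling are GiST's support functions (consistent, union, compress, decompress, penalty, picksplit, same, plus later additions). With PostGIS installed, you can list the ones registered for its 2D operator class from the system catalogs; a sketch, assuming the operator family name used by recent PostGIS versions:

```sql
-- List the GiST support functions PostGIS registers for 2D geometry.
-- amprocnum is the GiST support-function slot number (1 = consistent,
-- 2 = union, ..., 6 = picksplit, 7 = same).
SELECT amprocnum, amproc::regproc
FROM pg_amproc
WHERE amprocfamily = (
    SELECT oid FROM pg_opfamily
    WHERE opfname = 'gist_geometry_ops_2d'
)
ORDER BY amprocnum;
```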

So it means that you don't need
a separate access method for

R-tree.

You don't actually need a separate
access method for B-tree either.

In fact, there's a B-tree GiST implementation in contrib that you can look at.

Because it's more generic, there are fewer opportunities for performance optimizations. But the beauty is that, because the GiST conditions are really, really loose and generic, for almost any type you can get an access method without having to think about all the problems of pages and consistency and so on.

That's one of the things that has made, say, a tool like pgvector a complex thing to approach, something only someone with serious computer science chops would want to attack: they had to write their own access method right down to disk.

They did not get to just layer their access method on top of an existing implementation like GiST.

They had to write it all the way down to the disk and page level, which is a lot harder to do if you're going to retain guarantees of consistency.

Nikolay: Right.

So you mentioned, I guess you mentioned
B-tree GiST, it's called,

or GiST B-tree, I always

Paul: confuse it.

Yeah, B2 GiST.

Nikolay: Yeah, and it lost the comparison to the native implementation of B-tree, while R-tree didn't lose; it won.

And just...

Paul: Yeah.

And I'm not sure; probably that's more a matter of people not wanting to spend the time to work on the R-tree implementation.

It was considered not a good implementation at the time that we started.

It still existed when we started
with PostGIS, but it was more

tightly bound to the spatial types,
the native spatial types

in Postgres.

And so we didn't use it.

No one else used it.

Eventually it was ripped out.

It had, I think all the way up
to the time it got taken out,

an object size limitation.

So it couldn't handle objects larger
than a page size.

That was, I think, the death knell,
really.

Because practical work with spatial
data inevitably involves

dealing with objects that exceed
the page size.

Nikolay: What I wanted was just some counter-argument for why it matters to understand things slightly more than just "let's use it, and that's it."

Because, for example, in this particular case, B-tree is less-than, greater-than, equals.

R-tree is is-contained, contains, right?

Overlaps.

If you use this btree_gist extension, you can combine them and have a multi-column index over very different data types, right?

And

Regina: I use that a lot too.

Nikolay: Yeah.

And this is a great performance trick.

If you know why it's useful, it can save you, because it's just a single index scan.

That's it.

So you use it a lot.

Regina: Well, it's for when you have a lot of attributes that you have to filter against, but you also need to do a spatial filter at the same time. Especially with the sensitivity that Postgres has, it sometimes picks the wrong plan, whereas when you have it all together, it can't screw up.

So there are some cases I've had where, if I had just a GiST index and a separate B-tree, the planner would do something really stupid and try to use the spatial index and not use the B-tree at all, and so you'd go from minutes to milliseconds in those cases.
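The combined-index trick being discussed can be sketched like this (table and column names are made up):

```sql
-- btree_gist lets scalar types participate in a GiST index, so one
-- index can serve both the attribute filter and the spatial filter:
CREATE EXTENSION btree_gist;

CREATE INDEX restaurants_cat_geom_idx
    ON restaurants USING gist (category, geom);

-- Both predicates are answered by a single index scan, so the planner
-- cannot pick the spatial condition alone and ignore the attribute filter:
SELECT name
FROM restaurants
WHERE category = 'cafe'
  AND geom && ST_MakeEnvelope(-122.5, 37.7, -122.3, 37.9, 4326);
```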

Nikolay: Yeah, I've had this experience more than once.

Michael: Same. And I also thought Postgres would be able to do a bitmap scan.

I thought it would be able to use the GiST index and the B-tree separately.

I know it wouldn't be as efficient still, but it seemed to be refusing to do so for me.

So I also wonder if there's something missing on the Postgres side to consider them for bitmap scans as well, multiple indexes with an AND condition.

I thought it would consider it, but I really struggled to get it to.

Nikolay: I can also complain: you take pgvector and you cannot combine it with GiST, basically.

Paul: Yeah, that's super hard.

The find-me-the-nearest-thing query is the core use case for pgvector, and that really does not fit nicely into the executor.

Nikolay: One case works really fast and the other doesn't.

Michael: What about for beginners?

Regina, you've written some books on this.

How do you describe performance stuff for beginners, in terms of practical tips? And what mistakes do you tend to see people make?

Regina: I think the classic one is people trying to do distance filtering using ST_Distance.

So they say ST_Distance less than 50 instead of using ST_DWithin.

ST_Distance can't use an index, so their queries might take minutes, and then when they switch to ST_DWithin it finishes in milliseconds, that kind of thing, because it would have to scan all the records if you do ST_Distance.

That's a classic mistake that people make.
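The mistake and the fix Regina describes, sketched on a hypothetical places table (the distance units depend on the column's SRID):

```sql
-- Slow: ST_Distance is computed for every row, so no index can help.
SELECT name FROM places
WHERE ST_Distance(geom, ST_SetSRID(ST_MakePoint(0, 0), 4326)) < 50;

-- Fast: ST_DWithin expands to an indexable bounding-box search (&&)
-- plus an exact distance recheck, so the GiST index prunes most rows.
SELECT name FROM places
WHERE ST_DWithin(geom, ST_SetSRID(ST_MakePoint(0, 0), 4326), 50);
```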

Nikolay: I'm curious, does AI make this mistake or not?

Regina: And what's annoying is that in SQL Server, it's different.

You can actually use ST_Distance and it uses an index.

Paul: So what are they doing in SQL Server?

They must be rewriting the query.

Regina: Yeah, they must be rewriting the query.

So I was surprised that, oh yeah, you can do distance less than

50 and it will do the right thing.

It will do pretty much what ST_DWithin does.

Paul: That's interesting.

That, yeah, well query rewriting is kind of deep.

But the fact that they do that is interesting just in and of

itself.

It speaks to them, well either like presupposing a problem their

users are going to have or something their users really did have

constantly so it's worth it for them to go to the extra work

of doing a query rewrite for that case.

Regina: I noticed that Postgres does do query rewrites too, right?

Paul: It does.

There's an affordance there.

We could muck with the plan if we wanted to.

Regina: Yeah, if you do in, they change it to any, right?

Paul: Yeah.

Yeah.

Regina: So in theory, we could do a rewrite somewhere.

It's just, where would that go?

Paul: Yeah.

I mean, that's like adding a certain amount of brittleness to

the system where you don't know
that the users actually need

it.

Anyways, your example presupposes
that they have a spatial index

in the first place, which is what
I would have raised as like

the number 1, the number one new
user mistake.

Regina: Oh, that's true.

Yeah.

That's the other thing.

Paul: It's quite an easy mistake
to make because, by and large,

people don't think about indexes
as having a type.

Like if they come from a SQL Server
world or whatever, the idea that

they need this extra keyword,
the USING GIST keyword, to get a

spatial index is not going to be
super obvious if they're just coming

from some other database.

And if you type CREATE INDEX blah
on a geometry column, it'll do

something: it won't error out,
because we've had to do lightweight

B-tree bindings in order to get
aspects of sortability for our

type.

So it's not like it'll just say,
oh, I don't have a B-tree binding.

It'll go and build a really terrible
index that's no good for

you and quietly return.

So I think that's one of the biggest
gotchas.

It's just like having a proper
spatial index.

Regina: Yes, knowing that you need
an index and knowing that

you need it to be a GiST index.

Paul: Yeah.
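A sketch of that gotcha, with a hypothetical table: the first statement builds a B-tree index on the geometry column without any error, but it's useless for spatial queries; only the second one, with USING GIST, builds a spatial index.

```sql
-- Quiet trap: this builds a B-tree index on the geometry column.
-- It succeeds (PostGIS ships a lightweight B-tree opclass for
-- sortability), but it cannot accelerate spatial predicates.
CREATE INDEX parcels_geom_idx ON parcels (geom);

-- What you actually want: a spatial (GiST) index.
CREATE INDEX parcels_geom_gist ON parcels USING GIST (geom);
```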

Regina: As far as between GiST
and SP-GiST, I'd still tend to

go toward GiST because I know we
put in more effort in GiST than

we did in SP-GiST.

Same.

I think our GiST support is more
robust than SP-GiST.

Paul: Yeah.

I have tried different benchmarks
and come up with different

answers from different data, different
queries.

I've not been able
to fully characterize the

kinds of places where SP-GiST outperforms.

I don't know, Nik, you, you seemed
like really excited about

SP-GiST or like there are places
you've found where like SP-GiST

is the clear winner.

Nikolay: No, somehow recently I just
don't see these kinds of workloads

at all, like maybe for a couple of
years, so I don't have fresh data

at all.

I don't know why; obviously,
it's just a matter of

luck.

Michael: I came across one recently
where they were using a GiST

index and still struggling with
performance and I suggested trying

SP-GiST because I thought it might
make sense and it didn't help.

So it wasn't much worse, it was
very similar.

But the one thing that they changed
that did help a lot was they

were using geography before, and
it was only to calculate quite

small distances, like within 25 meters type thing and switching

to geometry made a big difference.

Yeah.

That's what I was surprised by. Yeah, you two are not surprised.

You two are nodding along.

Paul: I'm not surprised at all.

Regina: Because geography is more complex, it's

three-dimensional, whereas geometry is two-dimensional, really.

Paul: I have a slide in my sort of standard Postgres talk where

I say, you know, geometry or geography, what should I do?

And the thing that I use to demonstrate why geometry is going

to be better for performance purposes is just the distance calculation.

Everyone has learned Pythagoras.

So everyone knows what the distance calculation looks like there, right?

Two squares and a square root.

Any distance calculation in geography space is going to have

to do a bunch of Haversine calculations, which is the distance

on a sphere.

And Haversine has 5 transcendentals in it, plus the square root.

So it's just way more, like orders of magnitude more computationally

intensive.

So if you're doing a bunch of those calculations, you're just

going to pay a big price for sure.
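Paul's comparison, written out as a quick Python sketch (the sphere radius is an assumed mean Earth radius): the planar distance is two squares and a square root, while the haversine formula needs several transcendental calls per pair of points.

```python
import math

def planar_distance(x1, y1, x2, y2):
    # Pythagoras: two squares and a square root
    return math.sqrt((x2 - x1) ** 2 + (y2 - y1) ** 2)

def haversine_distance(lon1, lat1, lon2, lat2, radius_m=6371008.8):
    # Great-circle distance on a sphere: several sin/cos/asin calls
    # plus a square root, so far more work per pair of points.
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * radius_m * math.asin(math.sqrt(a))
```

One degree of latitude comes out near 111 km, the usual rule of thumb.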

Nikolay: The earth is flat.

The earth is flat.

I knew it.

Paul: In an ideal world, the earth is flat.

Yes.

So that means you need to commit to a planar understanding of

your work area if you can.

So if you are doing an app which works in London and say, you

know, there's a planar projection which is good for London.

If you're doing something in Idaho, there's a planar projection

which is good for Idaho.

If you're doing something which is like the continental

U.S., there is a planar projection which is good for most people's

purposes in the continental U.S.

It's very rare that you have something which requires latitude

and longitude.

You only get the breakdowns when you're starting to work with

truly global data.

Places where you go over the poles, places where you have data

in the Northwest territories and also in South America, where

the scale distortions of something like Mercator cannot be ignored,

that's when you say, okay, fine, I'll use geography.

And then you'll get some nice advantages out of geography.

If you have truly global data, the fact that the shortest path

might go over the North Pole, that's a useful thing to have.

Or it would go across the dateline.

That's a useful thing for the system to just transparently handle,

which it does for both the calculations of things like distance

and area, And for the performance-based stuff like indexing,

like it does not care.

It goes over the poles.

It goes around the dateline.

It doesn't notice because the model just, it takes away those

singularity points when you're working on a sphere.

Michael: Nice.

I think the reason I was surprised was more because I'm

so used to queries being I/O bound, and it's quite rare to

get one that's actually CPU bound.

You know, where the CPU actually matters here.

Yeah.

Paul: That is one of the places where spatial is special and people

don't notice it.

And it's a weird one.

Like, in trying to integrate with the Postgres planner, we've

been trying to help the planner get smarter by costing our functions

appropriately.

But the planner only looks at function costs in filters and joins.

A lot of the time, people are doing the spatial calculations

in the SELECT list, right?

The return values.

So if you have a complex function in your result set, Postgres

is not going to add that in and say, oh, I should do a different

thing, or that there's a lot of cost here.

So that's an issue.

And then actually it's a real problem on I/O as well.

I actually did a blog post for Snowflake about 2 weeks ago that

went public, talking about that with respect to JSONB, but I

learned about it in the geography and geometry world, which is,

you know, you get a big object, and Canada has like

10,000 vertices, well above the maximum object

size that you'll find in a page.

So it's going to be TOASTed.

And that means any retrieval of that object is going to be a

2 stepper where it has to go and find the TOAST pointer and then

go into the TOAST tables and gather all the parts and put them

all together in order and then decompress all that, like just

to get it out, just in the IO.

And that's a combination of the I/O to pull that across, but

also the computational overhead of just reassembling the thing;

it's about 10 times, 10X.

I found in joins, you can get things to go 10X faster when I

don't have that TOAST overhead, if I pre-cut everything into

shapes which are smaller than the maximum object size.
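One way to do the pre-cutting Paul describes (a sketch with hypothetical table names) is PostGIS's ST_Subdivide, which splits a geometry into pieces with at most a given number of vertices, so each piece can fit in a page without TOASTing.

```sql
-- Cut large polygons (e.g. country outlines) into small pieces so
-- individual rows stay under the TOAST threshold.
CREATE TABLE countries_subdivided AS
SELECT id, ST_Subdivide(geom, 255) AS geom
FROM countries;

-- Joins against the small pieces avoid the TOAST reassembly cost.
CREATE INDEX countries_subdivided_geom_gist
    ON countries_subdivided USING GIST (geom);
```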

Regina: Yeah, speaking of TOAST, there's also the one gigabyte.

Is that the limit on TOAST?

People have already started
complaining about that.

Paul: It is, or

Nikolay: What's the limit for
varlena types?

So it's one gigabyte, I think.

Yeah.

Yeah.

But it's better not to go beyond
300 megabytes.

I think there are benchmarks showing
that for JSON it's terrible.

Paul: Oh, I can't imagine.

Nikolay: Yeah.

There is a drop in performance
after 300 megabytes or so.

Michael: That's a lot of JSON though.

Regina: Yeah, I think I forget
what the reason was.

It had more to do with outputting
stuff.

Nikolay: I also forgot.

Regina: Like if you're trying to
output a big MVT file.

Paul: The whole world in one file?

The level 0 file?

Michael: Does it compress well?

I was just thinking, JSON generally
has a lot of repetition,

so a 300 megabytes post-compression
JSON file would

be like

Paul: pretty massive.

This is one of the places where there
are potential implementation

paths which would have given us
a much better I/O

profile, but at the cost of something
which the users consider

pretty inviolate, and that's perfect
fidelity of input and output.

So users are feeding the system
with geometries which are described

in terms of coordinates which are
built on double precision.

And if you want to return what
they got, you have to store the

double precision.

And double precision does not compress
so great.

So in answer to your compression
question, yeah, not so great.

If you just take the smallest swing
at saying, this stuff is

quantized or like, this has an
effective minimum precision, like

your road segment is not more precise
than a centimeter.

And if I take advantage of that
fact, if I know that there is

a quantization available, I can
squash stuff down into a delta

encoding, which is incredibly cheap
for spatial data, because

spatial data tends to be highly
auto-correlated in space.

So you can store the differences
between each coordinate with

a very small number, which is great.

It's a really good compression
and it's a very fast decompression.

But you have to give up precision
to take advantage of it.
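A minimal Python sketch of the quantize-then-delta idea: snap coordinates to a fixed grid (a centimeter here, an assumed precision), then store only the differences, which stay small because spatial data is highly autocorrelated.

```python
def delta_encode(coords, precision=0.01):
    # Quantize each coordinate to a fixed grid (1 cm by default),
    # then keep only the differences between successive values.
    ints = [round(c / precision) for c in coords]
    return [ints[0]] + [b - a for a, b in zip(ints, ints[1:])]

def delta_decode(deltas, precision=0.01):
    # Running sum of the deltas, scaled back to coordinate units.
    out, acc = [], 0
    for d in deltas:
        acc += d
        out.append(acc * precision)
    return out
```

Successive vertices of a road segment become small integers, which compress far better than raw doubles, at the cost of the sub-centimeter digits.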

Nikolay: Or if it's a lot of moving
objects on the map, this

should be column compression,
like TimescaleDB does.

So we just change the position
a little bit.

We cannot jump to a different continent
immediately.

Paul: Yeah, but

there will have to be precision
associated with that.

Like they will be losing a little
bit of precision when they

compress.

Like you can't just slam floats
or slam doubles together cleanly

all the time and get a nice clean
delta encoding.

Sometimes you can, but not always.

Nikolay: Similar with timestamps,
actually.

If you want to give up milliseconds,
you can have fewer bytes,

definitely.

Speaking of numbers, I'm curious
what are the biggest PostGIS

setups you saw?

Both single node or maybe sharded
systems like with Citus or

something.

Paul: Regina, you're
the implementer.

What have you seen?

Regina: I have no idea.

I don't work with that big of data
sets.

Paul: Yeah.

So PostGIS is a weird
member of the Postgres

community, because we have, I think,
almost certainly the largest

number of deployed clusters.

Like the number of PostGIS users
is huge.

I remember Stephen Frost came to
his first FOSS4G and his

jaw sort of hit the floor, because
he was used to going to a Postgres

conference where there'd be like
200 attendees, and he came to

one of the free and open source for
geospatial conferences.

And there are 800 attendees.

And every one of them he talked to
said, yeah, yeah, we use PostGIS.

It was like he'd never seen so
many distinct Postgres users in

one place.

That's one difference, the population
of PostGIS users is huge

relative to the population of Postgres
users, but they all have

very small deployments.

So yeah, if you're a
county, you've got several dozen

tables, none of which has more
than 100,000 records; from

the point of view of scale, it
barely registers.

But there are
a few big ones.

It's just that you don't necessarily
know they exist because

no one talks about them.

I'm pretty sure the Apple Maps production
system is all on Postgres.

That's only because someone whispered
to me in a dark alley,

like no one's ever made it public.

I know that HERE, which is a really
large geospatial data aggregator.

If you're going to get raw geospatial
data and want to have a

map which competes with what
Apple or Google provides,

you probably go to HERE and license
their data.

I know that their production pipeline
and a lot of their APIs

are built on PostGIS.

So it's a really big user, global
data, high uptime requirements.

And then one that repeats over and
over again.

So there's a lot of these installations, and they're big and they're

high performance: people who are doing OSM map builds.

Because OSM means OpenStreetMap.

OpenStreetMap is a map of the world.

It's a map of the world based on a very strict topological model,

edge node collections, which means that anything which covers

an area has to be built from those edge node relationships, which

implies loading the whole edge node system and then building

up the polygons from the parts and then testing the polygons

against each other.

And so for people who end up using PostGIS as the intermediate

stage between I've got a lot of raw OpenStreetMap data and

I have a rendered tile or a vector tile, there's a big PostGIS,

Postgres process in the middle of that, where it's shuffling through

all the data and building up the higher level structures that just

don't exist in the raw OSM dump.

And there's, there's gotta be hundreds and hundreds of those,

but they all look about the same size.

It's the size of OSM.

Regina: There's more than hundreds.

Paul: Okay.

Thousands of them.

Nikolay: Yeah.

Nikolay: Also, I saw in your blog posts Redfin, State Farm, like

these setups should have a lot of data, right?

Paul: They do, and real production loads.

They're actually, they're asking real questions to the spatial

database and getting spatial answers back.

Nikolay: And also state level, government level, like in France,

Norway?

Paul: Yeah, that was our biggest installation for a long time.

It was the biggest one I could like quote chapter and verse on.

And they came on very early, in 2005.

One of our first big institutional users was IGN, the French National

Mapping Agency.

And they ran Postgres and PostGIS through a competitive process

with Db2 Spatial and Oracle Spatial at that time, 2005, and determined

that Postgres and PostGIS was, in all ways, just as good as

Db2 Spatial and Oracle Spatial, and most importantly, like way

cheaper.

So, and way easier to deploy.

They didn't have to go through like a full commercial, like commercial

requisition.

They could just say, okay, we're going to use it.

And they did.

So IGN has been on Postgres PostGIS for 20 years.

And that initial database was a 150 million object database.

Michael: At the

Paul: time, very substantial.

Nowadays, it's a rounding error.

But at the time, it was one of the biggest databases.

Nikolay: You needed to be really brave those days because of

lack of replication and proper backup.

Yeah.

Today, it's much, much better.

Much better.

Paul: They wanted to have, one of the requirements was multi-site

replication.

Because they wanted to have...

I think it was Slony to start with.

Nikolay: Yeah.

Good.

I'm glad now it's already solved.

Relatively.

We have problems with large setups, but it's already a different

story.

Great.

Michael: Is there anything we haven't talked about that either

of you wanted to make sure we mentioned?

Regina: I guess I should mention conferences.

So there's going to be a PgDay Boston conference, June 9th.

And then for the conference that Paul was mentioning, the free

and open source geospatial, that's going to be in Japan this

year.

And let's see.

Paul: That's Hiroshima, August 30th to September 5th.

Regina: Yeah.

Oh, and then there's also FOSS4G NA.

So that's coming up.

But I don't think we've set the exact date yet,

but it's gonna be probably in November.

Paul: November.

Michael: I really appreciate that you put on PostGIS Day,

the day after GIS Day.

And I've actually watched a few of them.

You run them all online, which is great.

And yes, I've seen quite a few; I tend to at least watch the

talks on performance, my special interest,

and anything by Brian Timoney as well.

I wanted to give him a shout out.

I've enjoyed them.

Paul: Yeah.

PostGIS Day.

If you just type it in, you'll get all the video archives.

So if you're jonesing for PostGIS conference content, that's

the place to get it for sure.

Michael: Yeah.

It's awesome.

I highly recommend them.

Any last things for me, Paul?

Paul: We haven't talked about AI,
which feels like a huge mess.

Yeah.

And I don't know what your
experience is with this.

Regina: Nik's company has AI
in its name.

Paul: Oh yeah.

So he's really into it.

That's good.

And I wanted to point out like
a couple of things that I've seen

like in the AI space.

One is the geospatial AI space.

There's this whole other field
called foundation models, which

is basically like RAG, retrieval-augmented
generation,

like RAG for geospatial, where
they build up a characterization

of an area or the whole globe, where
every pixel in that characterization

has a multi-dimensional and very
high dimensionality vector associated

with it.

So it really is a RAG process.

But people have been using pgvector
as like the storage engine

for their foundation models and
to do like quick gatherings of

nearest neighbor.

And when you do nearest neighbor
with a foundation model, what

you get is like land, which is
similar to this other land.

So you can draw a circle around,
say, a bunch of solar panels

and say, show me the other solar
panels, and it will just go

off and find every solar panel
in the world for you.

Very cool stuff.
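At the SQL level, the nearest-neighbor lookup Paul describes is an ordinary pgvector query; this is a hedged sketch with hypothetical table and column names and an assumed embedding size.

```sql
-- Hypothetical foundation-model store: each grid cell has a geometry
-- and an embedding vector produced by the model.
CREATE TABLE cells (
    id   bigint PRIMARY KEY,
    geom geometry(Polygon, 4326),
    emb  vector(768)
);

-- "Show me land like this land": nearest neighbors by Euclidean
-- distance (<-> is pgvector's distance operator).
SELECT id, geom
FROM cells
ORDER BY emb <-> (SELECT emb FROM cells WHERE id = 42)
LIMIT 20;
```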

And there's a cool talk
on it in the 2025 PostGIS

Day.

Really worth looking at.

The other thing that came up in
that PostGIS Day around AI

was the idea of natural language
to SQL; swing a cat or throw

a stone and you'll hit someone who's doing
natural language to SQL out

there.

But there was a really cool observation from
Brendan Ashworth from Bunting

Labs during his talk about the
code models: how

you get a coding model is you first
train a generic model and then

do reinforcement learning against
it.

And the way you get a code model
which can speak SQL is you

do a lot of reinforcement learning
against standard SQL patterns.

And his observation was that the
standard SQL patterns tend to

be like Postgres SQL patterns.

So you tend to get really good
Postgres SQL and maybe not so

good Oracle SQL.

And the same thing holds true on
the spatial side, because

PostGIS
is the lingua franca for geospatial

SQL.

You tend to get good spatial SQL
generated by these models.

And the real limiting step tends
to be providing the correct

amount of like semantic metadata
around your data model so that

the LLM can make sense of the natural
language question you're

making and spit out the right SQL.

But you're already ahead of the
game if you're using Postgres

and PostGIS as your engine, because
these models will tend to

produce SQL that matches them.

Nikolay: Writing in Python
is easier when using LLMs,

because they know it much better
than newer languages.

Paul: Yeah, give me Perl 6.

Nikolay: Yeah, and that's why Python
is very natural, because

there's so much material it was
trained on, and Postgres is the

same thing.

This is a great point.

So it should help to grow popularity
even more because a lot

of applications are created by
AI these days.

But I'm curious, if you deal with
a lot of data, you need to

verify.

If you don't have a way to verify
all the ideas AI is generating,

the quality will be not good.

So this is one thing I always think about:
how to give AI tools to verify

its own ideas very fast without
humans.

Before humans are involved, it's
already much better.

I'm curious what you think about
this.

How to write proper queries and
not use that distance function

you mentioned.

Regina: The good thing is, as Paul
said, it does write the right

distance function.

It does use ST_DWithin: if you give
it text, say, give me something

within 50 miles, it writes the
right query.

I have tested that.

Nikolay: Good, But it still would
be better if it would have

some tooling to test and see actual
results, not just guess.

Paul: Yeah, and we've started to
see that discipline arrive in

the PostGIS codebase.

We have our first agents file.

Regina: Oh yeah, we have that SKILL.md.

Paul: Or SKILL, sorry.

So yeah, we have our first skills
file, which does touch on that

kind of stuff you're talking about,
Nik.

Like these are best practices for
generating spatial SQL.

Nikolay: Yeah, in my opinion, copy-on-write
and database branching

should grow, because we need to
be able to test on large data

sets for very little money and very
fast.

It can be on S3, like some companies
implemented, right?

Neon or Timescale, they implemented
it.

It can be with ZFS or anything,
but it should be fast and cheap.

So you give it like millions of points
on the map and tell it: just

check, with EXPLAIN (ANALYZE,
BUFFERS).
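A sketch of the check Nikolay means, with a hypothetical table: the pipeline runs the candidate query under EXPLAIN (ANALYZE, BUFFERS) on a full-size branch, and the tooling inspects the plan and buffer counts before a human ever looks.

```sql
-- Did the planner pick the spatial index, and how many buffers did
-- the query touch? Table and coordinates are hypothetical.
EXPLAIN (ANALYZE, BUFFERS)
SELECT id
FROM points
WHERE ST_DWithin(geom, ST_MakePoint(-123.37, 48.42), 0.01);
```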

Michael: Or, Nik, did you see friend of the show Radim posted

a new blog post recently about just importing the stats?

So you don't even need to load all the data.

Nikolay: Yeah, I saw it.

Just test planning and behavior without data actually, just with

stats.

I think it's a valid approach, but at the same time I think it's

limited because you don't see actual execution.

Michael: Of course, but you can at least check like index use

and things, which is cool.

Nikolay: Yeah, definitely.

It's a super lightweight approach.

I like that.

And I think more and more should be developed because of AI.

Otherwise, so many mistakes.

But if you have this testing pipeline, you're not even involved:

AI has some ideas how to improve a query, plus tooling to verify

them; this is the way, and you put it into tests, into CI, and so on.

And it's great.

Yeah.

So that's why we created branching many years ago with DBLab.

We were sitting and waiting for these very days when

it's needed so much.

Great.

Michael: Well, it was a pleasure to meet you both.

Thank you so much for coming.

Thank you for joining us.

Nikolay: Thank you for coming.

Regina: It was a nice talk.

Paul: Thanks Nik.

Thanks Michael.

Nikolay: Yeah.

Have a great week.

Creators and Guests

Paul Ramsey (Guest)
Loving open source software, practicing the piano, moving things around, working @crunchydata

Regina Obe (Guest)
Spatial database enthusiast focused on PostGIS development
