Nikolay: Hello, hello.

This is Postgres.FM.

As usual, my name is Nik, Postgres.AI, and my co-host is Michael,

pgMustard.

Hi, Michael.

Michael: Hi, Nik.

Nikolay: And we have an unexpected guest today, Simon Eskildsen, CEO and

co-founder of turbopuffer.

Hi, Simon.

Simon: Thank you for having me.

Nikolay: Yeah.

Thank you for coming and it was very unexpected because we mentioned

turbopuffer last time and you messaged me on Twitter and I think

it's a great idea sometimes to look outside of traditional Postgres

ecosystem.

I think it's beneficial for everyone, should be.

So yeah, thank you for coming for sure.

It's a great idea.

I think.

Simon: Yeah, the origin story is kind of funny and it was only

a couple of days ago.

I have a, there's a script where if turbopuffer is mentioned

anywhere, then I'll get an email or a summary.

And you guys were discussing last time, different ANN, like both

in Postgres and outside and turbopuffer was mentioned.

And so I just DM'd you and asked, hey, I'd come on, chat about Postgres, chat about MySQL, chat about databases and when you choose one over the other, and when Postgres breaks for some of these workloads that we've seen, and when it's great.

And now it's what?

Yeah, 2 or 3 days later and we're on.

Nikolay: Including the weekend, actually.

So the podcast was out on Friday and we're recording this on Monday.

That's the kind of turnaround everyone should try to achieve.

I like it.

Simon: Yeah.

And you've had chickens hatch in the meantime.

Nikolay: That's why my camera is overexposed: I used this camera to broadcast the whole night.

I didn't have time to tune it properly.

But again, like, thank you for coming.

This is about databases.

This time probably not so much about Postgres, but definitely

we should talk about vector, vectors, right, and ANN.

And maybe we should start from a distance and talk about your background.

And I've heard some MySQL is involved, right?

Can you discuss it a little bit?

Simon: For sure.

Yeah, my background is I spent
almost a decade at Shopify scaling

mainly the Database layer there,
but pretty much anything that

would break as the scale increased
and increased through the

2010s.

Shopify, like most companies started in the early 2000s, was on MySQL.

So a lot of the work that I did was with MySQL, but also every other database that Shopify employed, like Redis, Memcached, Elasticsearch, and a ton of others, Kafka, and so on.

So it's been a long time there.

When I joined, it was a couple
of hundred requests per second.

And when I left, it was in the millions, and I was very, very intimate with the data layer there.

I was on the last resort pager
for many many years and that has

informed a lot about how I write
software today and a couple

years ago I started a company called
turbopuffer because I thought...

Nikolay: Before that, sorry for
interrupting.

I remember Shopify, actually; we always recommend the article.

Shopify had a couple of blog posts
about UUID and how UUID version

4 is not good and version 7 is
much better.

And it doesn't matter, MySQL or Postgres, the mechanics behind the scenes are the same.

It's how the B-tree behaves.

So I remember, and I think, Michael,
we mentioned that article

on podcast a few times as well.

Michael: We have mentioned it, Nik,
for sure.

And I think Shopify has come up
before.

I thought, is it a Vitess shop?

I thought it might have been one of the early Vitess adopters.

Simon: Shopify is using a little
bit of Vitess, but not very

much.

Vitess was not around when we
did all the sharding back in the

early 2010s.

And so we did it all at the, at
the, at the application layer.

I wasn't part of the actual sharding
decision, but I was part

of a lot of the sharding work over
the years.

And it's funny, because at the time I know they looked at a bunch of proxies, and all of the businesses they looked at later went out of business.

It's not a great business, unfortunately.

But everything was done in Ruby land, through just a module called Sharding, and it did a lot of things and a lot of monkey patching into Rails.

But let's talk about this UUID v4 thing, because I think if we wanted to do a pros and cons, MySQL versus Postgres, I spent quite a bit of time with both, and this one actually, to my knowledge, only really matters for MySQL.

Well, it actually matters for Postgres
as well, but in a different

way.

So on MySQL, the primary key dictates
how the B-tree is laid

out for the primary key, right?

So for the entire row.

So if you have UUID v4, it's completely
randomly scattered along

the B-tree.

So whenever you're doing an insert,
right?

It will just kind of fall somewhere
random in the B-tree, which

makes the inserts very expensive, because if you're adding 10 rows at a time, you're doing 10 insertions that land in 10 different leaves, so you're just doing a lot more disk I/O and your write amplification is high.

In Postgres, of course, you're
just appending to the heap with

all of its drawbacks and benefits.

And so it doesn't matter as much
other than on the indexes, right?

To my knowledge, on the indexes it will matter a lot, because the index, of course, is sorted by that.

And if you have some temporal locality
then it's not gonna matter

as much.

So that's my understanding.

So this matters a lot in MySQL.

Now, that article I think was after
I left and Shopify doesn't

use UUIDs as primary keys for anything.

So I don't really know where this
mattered.

It must be something tangential, because MySQL really just does auto-increment.

And every shard does auto-increment with basically a 32K increment step, and then every shard has an offset into that, to allow it to grow to 32K shards.
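(A minimal sketch of that interleaved-ID scheme; the step size and function names here are just illustrative assumptions, not Shopify's actual code.)

```python
INCREMENT_STEP = 32_768  # step size; leaves room to grow to 32K shards

def next_id(shard_offset: int, local_counter: int) -> int:
    """Globally unique, roughly increasing IDs with no cross-shard coordination:
    shard N only ever produces values congruent to N modulo the step."""
    return shard_offset + local_counter * INCREMENT_STEP

# Shard 3's first few IDs:
print([next_id(3, k) for k in range(1, 4)])  # [32771, 65539, 98307]
```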

Given how much of a pain in the
ass it would be to change that,

that's probably still the case.

But I always really liked that
scheme.

So some tables over time at Shopify ended up having a primary key on (shop ID, ID), because that would give locality for a shop.

Because otherwise you have a lot of read amplification if you're trying to dump out a bunch of products for a shop, because the chance that there are multiple products for a shop in a leaf, unless you do that, is just a lot lower.

So that ended up working really
well.

And this is a pain to do in Postgres, because if you want to change the primary key, or recluster the heap by an ID, you have to rewrite the entire thing.

That was 1 of my surprises, having
worked a bit more with Postgres

in later years.

Nikolay: Yeah, yeah, yeah.

And I agree with you, and I agree that in Postgres it also matters, but only for the B-tree itself, the primary key B-tree, if it's auto-increment, or in Postgres it's called bigserial, serial for example, or auto-generated.

Right now there's another method, identity columns, but behind the scenes it's also a sequence and inserts.

Oh no, in the UUID case it's not a sequence; there's a function which generates UUID version 4, and version 4 is random, while version 7 is closer to regular numbers, basically, because it's monotonically growing, right?

Lexicographically ordered, right?

So in this case, you insert only on the right side of the B-tree, and you get fewer dirty pages, if you think about how the checkpointer is working.

Also, in Postgres there is overhead after each checkpoint, the full-page writes, which involve indexes as well.

So if you touch random pages all the time, disk I/O, replication, and backups, everything receives additional overhead.

While with version 7 we write on the right side all the time.

It's much better.

But heap, yes, heap is different.

So I agree with this, and anyway, I just wanted to say we use the MySQL article because it's written very well.

And in Postgres we didn't have version 7 for quite some time.

Last Thursday the version 18 release candidate was out, which will include a full implementation of UUID v7, which was live-coded on Postgres TV with a couple of friends, just in Cursor, I think.

Oh no, it was not in Cursor, it was before that, but anyway, it was just created right there online.

We did it, and it took a couple of years to reach maturity, because Postgres always waits until RFCs are finalized and so on.

Anyway, soon version 18 will be out and UUID version 7 is inside.

But I think everyone is already generating them on the client side.
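(To make the locality argument concrete, here is a minimal Python sketch of a UUIDv7-style value: a 48-bit millisecond timestamp prefix followed by random bits. The version and variant bits are omitted for brevity; Postgres 18's uuidv7() or your client library would do this properly.)

```python
import os
import time
import uuid

def uuid7_like() -> uuid.UUID:
    """Time-ordered UUID sketch: the first 6 bytes are the Unix timestamp in
    milliseconds, so values sort by creation time (version/variant bits omitted)."""
    ms = int(time.time() * 1000)
    return uuid.UUID(bytes=ms.to_bytes(6, "big") + os.urandom(10))

a, b = uuid7_like(), uuid7_like()
assert a.bytes[:6] <= b.bytes[:6]  # monotonic prefix: new rows land on the
                                   # right edge of the primary-key B-tree
print(uuid.uuid4())                # v4: fully random, scattered across the tree
print(a)                           # v7-like: time-ordered
```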

Okay, great.

So you had this great career and
then decided to create another,

is it, can we call it a database
system, database management

system?

Simon: It's certainly a full-blown
database.

You know, underneath turbopuffer
is an LSM.

An LSM works really well for object
storage.

And, you know, every successful database ends up implementing every query eventually, in the limit, and turbopuffer will end up doing the same thing, but every good database also starts with some specialization, right?

And our specialization has been
on, on search workloads.

I would say that it's by no means
a replacement for Postgres.

There always becomes a time where
it starts to make sense to

move parts of data into more specialized
algorithms, more specialized

data structures, and more specialized
storage.

In general, my, my hypothesis on when it's time to create a database

is that you need 2 things in the air in order to create a generational

database company.

The first thing that you need is that you need a new storage

architecture, because if you don't have a new storage architecture,

some new way to store the data, ideally both data structure wise,

and also the actual medium that you're persisting on.

There's no reason why in the limit the other databases won't

do the workload better.

They already have the existing momentum.

They already have the workloads.

In the Postgres case, of course, you know, it's the classic relational

architecture where you replicate every byte onto 3 disks.

And this works phenomenally well, right?

We've had this in production for probably more than 40 years.

And it works great.

It has high performance, it has a very predictable performance profile, and it works really, really well with the page cache or the buffer pool, whatever database you're using.

The problem with this model is that if the data is not very valuable,

this model is expensive.

Every gigabyte of disk, a network-bound disk, is about 10 cents per gigabyte.

And unless you are a really risky DBA, you're gonna run those disks at 50% utilization on all the replicas and on the primary.

So you're paying for this disk kind of 3 times, which ends up

with a whole all-in cost of 60 cents per gigabyte.

And that's not even accounting for all the CPUs that you also

need on the replicas because you need the replicas to process

the writes as fast as the primary.

So you're kind of paying for the same thing 3 times.

So the all-in cost, when you also take into consideration that the system can be a little bit more memory hungry, is probably around 60 cents to $2 per gigabyte
Nikolay: per month.

Simon: Per month.

Yeah.

Per month, USD.

On object storage, the base cost is 2 cents per gigabyte, right.

And when we need that data in RAM or on disk, we only have to pay for one copy.

We only have to pay for some percentage of the data to be in

cache at all times.

We mentioned Cursor earlier.

Cursor is a turbopuffer customer.

They don't need every single code base on SSDs or in memory at

all times.

They need some percentage in memory, the ones that are queried

a lot right now, and some percentage on disk, the ones that are

gonna be queried again in a few minutes or maybe in a few hours.

And so you end up paying a lot less because we only have to keep

1 copy of a subset of the data rather than 3 copies of all of

the data.

Now that comes with a fundamental set of trade-offs, right?

We want to be very public about that.

You can't use turbopuffer for everything.

If you want to do a write to turbopuffer, we commit that to S3,

right?

So by the time it's committed to turbopuffer, the guarantee

is actually stronger than most relational systems because we've

committed it into S3, which takes a couple hundred milliseconds,

but the durability guarantee is very strong.

But if you're building a system like Shopify, well, you can't

live with a commit time in hundreds of milliseconds.

It's just not acceptable.

So that's a trade-off that means that this is not a system that

will ever replace a relational database store.

The other downside is that because not all the data is on a disk

or in memory at all times, it means that you can have tail latency.

And that can be really catastrophic in very large systems that

are doing millions of queries per second.

If you can't rely on a very predictable query profile, you can have massive outages while you rehydrate the caches.

I've seen these outages on disks, I've seen them.

And just even the workload changing slightly can mess with the

buffer pool in a way where you have a massive outage.

So these 2 things may sound like small trade-offs, but they're

massive trade-offs for very large production systems.

But it might mean that if you have, let's say, a billion vectors and you're trying to store them in Postgres, the economics just don't make sense.

You're paying thousands, if not tens of thousands, of dollars in hardware costs for a workload that might cost tens or hundreds of dollars on turbopuffer.

Nikolay: How much is it really, 1 billion vectors?

If 1 vector is, what is it, like 700...

Simon: At 768 dimensions, yeah, it's 3 terabytes.

Nikolay: 3 terabytes to store 1 billion vectors.

Yeah.

And also we don't have a good index for 1 billion scale.

Yeah, I mean, in Postgres, pgvector, HNSW won't work with 1 billion.

Simon: I mean, we could just run the math a couple of different

ways, right?

I'm not saying this is how it works in pgvector, I'm less familiar

with it now.

But even if you have 3 terabytes of data raw, you're probably

going to need to store more than that.

You might be able to do some tricks to make the vector smaller,

so you only have to store maybe, I don't know, a terabyte or

something along those lines, right?

But remember that a gigabyte of DRAM is $5 per month.

And you need that 3 times on all your replicas.

So you're paying $15 per gigabyte per month.

So if you're doing that, if you have to store that 3 times, you

put everything in memory and you're somehow able to get it down

to a terabyte, then you're talking about $15,000 per month,

right, across the 3 replicas, just for the RAM alone.

Nikolay: Yeah, I agree with you.

HNSW, it's all about memory.

Memory is the key, and creation of the index takes a lot of time.

And forget a billion, I already have issues at a few-million-vectors scale.

I know Timescale, which has now renamed to Tiger Data, they developed another index based on DiskANN from Microsoft, I think, which is more for disk, right?

But I agree with you, like for this scale, it's not convenient.

But also, I think it's not only about vectors.

This argument that we need to save on storage costs, and it's

insane to pay for storage and memory as well when we have replicas

and we scale.

In Postgres, if it's a physical replica, it's everything.

So you replicate everything, all indexes, everything.

You cannot replicate even...

There is no ability to replicate only one logical database.

You need to get the whole cluster.

And it means that you multiply costs for storage and for memory.

And it would be so great to have some tiered storage, maybe with partitioning, as automated as possible, and offload all the data to S3, basically as you're describing.

S3 is great for this, and I remember I actually explored turbopuffer through Cursor.

It's great to see the documentation.

I knew Cursor is using Postgres.

It was a few months ago, but then I heard they considered moving

to PlanetScale.

That was before PlanetScale announced support of Postgres, so

I was thinking, are they switching to MySQL?

And then I saw vectors are stored in turbopuffer.

Great.

Then I learned that several of our clients who get consulting

from us and use Postgres, they also store vectors in turbopuffer.

It was, I think, Photoroom, and a couple more companies.

It was a surprise for me, and I started to think, oh, that's

interesting.

And then I checked your talks.

I think you also mentioned there that if we...

So there is this approach with the SPFresh algorithm, right?

It's different, it's not HNSW, a different type of index, but you also mentioned some additional interesting ideas about the economics, right?

Not only this.

Can you elaborate a little bit?

Simon: I think it might be helpful
to just talk at a very high

level about the different algorithms
to do vector indexing.

I'll try to simplify it as much
as possible, and we can dig into

it further if you want.

Fundamentally, the simplest way
to do vector search is that you

just store all of the vectors in
a flat array on disk, right?

And then on every search, you just
compare the query vector to

all of those.

The problem with that is that you
very quickly run up against

bandwidth limits, right?

If you have a gigabyte of vectors
and you're searching that at

maybe 10 gigabytes per second,
if you can exhaust the memory

bandwidth, which is unlikely in a big production system, you're only doing, you know, maybe 5 queries per second, and the query latency is in the hundreds of milliseconds.

So if you have very, very few queries
and you don't care about

latency, this can be feasible on
small scale.

And lots of people are doing that
in production.
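(As a minimal sketch of that flat, exhaustive approach; the corpus size and cosine-similarity metric are just illustrative assumptions.)

```python
import numpy as np

# Toy corpus: 100k unit-normalized vectors of 768 dimensions.
rng = np.random.default_rng(0)
corpus = rng.standard_normal((100_000, 768)).astype(np.float32)
corpus /= np.linalg.norm(corpus, axis=1, keepdims=True)

def brute_force_top_k(query: np.ndarray, k: int = 10) -> np.ndarray:
    """Exact nearest-neighbour search: compare the query against every stored
    vector, so cost (and memory bandwidth) grows linearly with the corpus."""
    query = query / np.linalg.norm(query)
    scores = corpus @ query                  # one dot product per stored vector
    top = np.argpartition(-scores, k)[:k]    # unordered top-k candidates
    return top[np.argsort(-scores[top])]     # sorted best-first

query = rng.standard_normal(768).astype(np.float32)
print(brute_force_top_k(query, k=5))
```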

But if you want to search a million
vectors in less than a couple

hundred milliseconds and maybe
10 milliseconds and as part of

a bigger pipeline, you need some
kind of index.

The problem with indexing vectors
is that there is no known way

to do it in a perfect way, right?

If I search for a query vector
about a fruit or whatever, I know

that if I'm searching for banana,
I get the closest fruit.

Maybe, I don't know, maybe there's
a plantain.

I don't know.

Right, in the cluster.

But you have to build an approximate
index in order to make this

faster.

Nikolay: Because of too many dimensions.

For a small number of dimensions, there are approaches, and databases have had them for years, but for hundreds or thousands of dimensions, yeah.

Simon: That's right, yeah, there are KD-trees and so on for the smaller dimensional spaces, which we can use for geo coordinates and things like that, simpler geometry.

So for very high dimensional spaces these things fall apart.

The curse of dimensionality, it's called.

The fact that they're very large is also important about the vectors.

If you have a kilobyte of text, it can easily turn into tens and tens of kilobytes of vectors, which is why separating them into cheaper storage makes a lot of sense.

So there's 2 fundamental ways that
you can index the data.

There's the graph-based approaches,
HNSW and DiskANN, were the

2 you mentioned before.

And there's the clustering-based
approach.

The graph-based approach is phenomenal
if you can store all of

the data in memory and you have
very high QPS and very low latency

requirements.

So if you have, let's say a hundred
million vectors and you're

searching that at, you know, a
hundred thousand queries per second,

and you need very low latency,
you're not going to beat HNSW.

It's going to create a very, very good graph to navigate with.

The problem is that it's very expensive
and the other problem

is that almost no workloads in
the real world actually look like

this.

So HNSW got very popular because
it's fairly simple to implement

correctly and it's very simple
to maintain.

So when you create the graph, which is essentially just, you know, points that are close in vector space are close in the graph, it's very simple to incrementally maintain.

You put one thing in, you search the graph, and then you add it.

There are very simple rules.

You can implement something like
HNSW in tens of lines of code

if you did a very, very simple
implementation of it.

The problem with HNSW is that every
time you do a write, you

have to update a lot of data, right?

In database land, we call this
write amplification, where every

byte or every page you update,
you have to update a lot of others,

and the reason for that is that you add something, you add a node in the graph, and then you have to update all these other things to make connections to that node in the graph.

So this works great in memory,
because memory is very fast at

updating and very fast at random
writes.

But the problem is also on the
read path.

So memory is very fast at doing
random reads.

You can do a random read at 100
nanoseconds, but a random read

on S3 or on a disk is much slower,
into hundreds of microseconds

to the hundreds of milliseconds
on S3.

And on a graph, right, you don't
really, there's no speculation

that helps, right?

If you start at the middle of the
graph and then greedily navigate

the graph from the query vector
to the closest matching vectors,

well, every single time you do
that, there's a round trip.

And on S3, that's a round trip
that takes hundreds of milliseconds.

So you're sort of navigating from
the root, it's like 200 milliseconds.

You go out one hop, 200 milliseconds.

You go out another one, 200 milliseconds.

And in general, for HNSW on a million vectors, this might be in the tens of round trips.

So this gets very slow, right?

This is in the seconds to do this
on S3.

And even on a disk, this very quickly
adds up to tens of milliseconds.

That's the fundamental problem with graphs.

Now, DiskANN is essentially using a lot of the same ideas

as other graph-based indexes, but instead of trying to have the

tens of round trips that HNSW has, that's very good for memory,

DiskANN basically tries to shrink the graph so that there

are fewer jumps, right?

So instead of 200 milliseconds, 30 times, it tries to get it

to maybe 6 or 7 times, or 10 times, by shrinking the graph as

much as possible.

That's essentially the insight in DiskANN.

The problem with DiskANN is that after you have added more than

10 or so, 10 to 20% of the size of the data, you have to rebuild

the entire graph, which is an incredibly expensive operation.

That is absolutely terrifying to me, as someone that's been on call for large databases: you could max out like 128 cores rebuilding this graph in prod, and it could take you down at 3 a.m.

Because you don't know when you hit that threshold.

And if you don't do it, then the approximations start getting

bad and you start getting bad search results.

The nice thing about the graphs is just they have very low latency

but they're just very expensive to maintain.

Now, that was the first way; then there's the other type of index, which is the clustered indexes.

Clustered indexes are very simple to picture in your head, right?

Let's say you took every song in Spotify and you clustered them in a coordinate system, and you can visualize this in 2 dimensions.

If you then plotted all the songs, the songs that are adjacent are, of course, also adjacent in vector space, and genres will emerge, right?

There'll be a rap cluster, there'll be a rock cluster.

You zoom in and you get, you know, like little sub clusters,

I don't know, death metal, black metal, I don't know what all

the different rock genres are, right?

Somewhere in these clusters.

And you can generate great recommendations based on that because

you can look at, okay, what did Michael listen to and what are

some songs that are close by that he hasn't listened to and same

for Nik.

So this is, it's very simple, you create a clustering algorithm

that basically just tries to divide the data set into clusters

and when you do a query, instead of looking at all the vectors,

you look at, well, clearly the user is asking about a rock song,

so we're only going to look in the rock cluster.

And that way you cut down the number of vectors that you have to scan.
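(A toy sketch of that clustered, IVF-style lookup, assuming unit-normalized vectors and a naive one-pass centroid assignment; a real index would use k-means or SPFresh-style maintenance.)

```python
import numpy as np

rng = np.random.default_rng(0)
vectors = rng.standard_normal((50_000, 128)).astype(np.float32)
vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)

# "Train": pick centroids and assign every vector to its closest one.
n_clusters = 256
centroids = vectors[rng.choice(len(vectors), n_clusters, replace=False)]
assignment = np.argmax(vectors @ centroids.T, axis=1)
clusters = {c: np.where(assignment == c)[0] for c in range(n_clusters)}

def clustered_search(query, k=10, nprobe=8):
    """Only scan the nprobe clusters whose centroids are closest to the query,
    instead of all 50k vectors."""
    query = query / np.linalg.norm(query)
    closest = np.argsort(-(centroids @ query))[:nprobe]
    candidates = np.concatenate([clusters[c] for c in closest])
    scores = vectors[candidates] @ query
    return candidates[np.argsort(-scores)[:k]]

query = rng.standard_normal(128).astype(np.float32)
print(clustered_search(query))
```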

Now the problem with this is that if you have everything in memory,

it's not necessarily as optimal because you might have to look

at more data than you do in a graph-based approach.

And because RAM has such good random
read latency, the penalty

is not necessarily worth it if
everything is in memory at all

times.

But this is great for disks, and
it's great for S3, because I

can go to S3 and in 200 milliseconds
get gigabytes of data back,

right?

It doesn't matter if I'm getting
like, you know, a megabyte or

a gigabyte, I can often get that
in the same round trip time

if I exhaust the network.

So if you then go to S3, basically, you first have to download the cluster list, so like, let's say the clusters are JSON, to really simplify this.

And then you just look at the closest n clusters to your query vector.

And then you download, you know, cluster1.json, cluster2.json, whichever ones are closest, in 2 round trips.

And now instead of on the graph-based
ones where you're doing

200 milliseconds, 200 milliseconds,
200 milliseconds to navigate

the graph, you just have to do
200 milliseconds to get the clusters

and then 200 milliseconds to get
all the clusters that were adjacent.

The nice thing about these clustered
indexes is that with algorithms

like SPFresh and lots of modifications
to them, we can incrementally

maintain these clusters because
you can imagine that when you

add a vector you just have to add
it to the cluster and it's

just one write.

The write amplification is very
low.

Once in a while that cluster will
grow beyond a size.

Let's say it's a thousand elements
long and you have to split

the cluster and then you have to
do some modifications.

That's essentially what SPFresh
is.

And then there's a little bit higher
write amplification.

But it's stable in the way that
you never reach this threshold

where, okay, I've added 20% of
the dataset, I have to rebuild

the entire thing as you do in DiskANN.

HNSW doesn't have to do that, which is why it's very, very nice, but it's just still slow.

So SPFresh, we think, strikes a really, really nice set of trade-offs, where it's gonna be a little bit slower, but slower in terms of, okay, instead of a search returning in a millisecond, it might take 5 milliseconds, and just no one cares about that in production for search.

This matters for a point lookup
into a relational database, but

for search it's a perfect set of
tradeoffs.
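(A toy sketch of that incremental, split-on-overflow maintenance, in the spirit of SPFresh but not turbopuffer's or the paper's actual implementation; the threshold and the crude 2-means split are illustrative.)

```python
import numpy as np

MAX_CLUSTER_SIZE = 1000  # illustrative split threshold

class IncrementalClusters:
    def __init__(self):
        self.centroids = []  # one centroid per cluster
        self.members = []    # list of lists of vectors

    def insert(self, vec):
        """Each insert touches a single cluster (low write amplification);
        a cluster that outgrows the threshold is split in place."""
        if not self.centroids:
            self.centroids.append(vec.copy())
            self.members.append([vec])
            return
        i = int(np.argmax([c @ vec for c in self.centroids]))
        self.members[i].append(vec)
        if len(self.members[i]) > MAX_CLUSTER_SIZE:
            self._split(i)

    def _split(self, i):
        """Split an oversized cluster around two seed points (crude 2-means)."""
        data = np.stack(self.members[i])
        seed_a, seed_b = data[0], data[-1]
        to_a = data @ seed_a >= data @ seed_b
        self.centroids[i] = data[to_a].mean(axis=0)
        self.members[i] = list(data[to_a])
        self.centroids.append(data[~to_a].mean(axis=0))
        self.members.append(list(data[~to_a]))
```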

Michael: Question on the cluster
splitting.

Does that mean we don't ever need
to rebuild the whole index?

Because I think that was a limitation
of the first cluster-based...

IVFFlat, I think, is what we started with in pgvector, and that didn't

have the splitting as far as I'm
aware, and therefore we had

to rebuild the whole index from
time to time if the data changed

significantly.

Simon: That's right.

And there's also MERGE,

right?

Nikolay: There's also not only split, there
is also MERGE in SPFresh,

as I remember.

Simon: Yes, there's MERGE as
well.

Michael: Makes sense.

Simon: And because you might do deletes, which are also a pain.

To my knowledge, pgvector does not implement SPFresh.

It is very difficult to implement correctly.

Nikolay: But also, funny, I did a little bit of research in May, I think, when I discovered turbopuffer and started reading about this.

I saw the original SPFresh implementation was actually on a forked Postgres.

You need to dig into it, in some Microsoft repositories; also, I think some Chinese engineers were involved.

Like, I saw some repository which was basically a forked Postgres, and the original SPFresh implementation was on top of that.

Maybe I'm hugely mistaken, but I saw something like this.

But it's hard.

I agree.

And I also recalled what I wanted
to ask.

I was lost in my previous question because there were too many things.

I recalled that in your talks you discuss that S3 is ready to store data, S3 is ready because over the last few years they added important features.

Can you recap what you talked about in those talks?

Simon: Yeah, I mentioned that
there were 2 things that you

needed to build a new database.

And there's a reason why a database
like turbopuffer hasn't already

been built.

It's a new storage architecture.

It's only really possible now.

The 3 things that have enabled
a database like turbopuffer to

exist with this sort of pufferfish
architecture, right, where

it's, you know, when the pufferfish
is deflated, it's in S3,

and when it's, you know, somewhere
in between, it's in SSD, and

then it's in memory when it's all
the way inflated.

The reason that's possible is because
of 3 things.

The first one is NVMe SSDs.

NVMe SSDs have a new set of trade-offs.

They act completely differently
than other disks, right?

SSDs are just sort of like...

The old SSDs were very fast, but
NVMe SSDs have just phenomenal

performance potential where basically
on an NVMe SSD, the cost

per gigabyte is a hundred times
lower than memory, but the performance,

if you use it correctly, is only
about 5 times worse.

And using NVMe SSDs correctly really means that you have to put a lot of concurrency on the disk; but again, similar to S3, every single round trip takes hundreds of microseconds, but you can drive a lot of bandwidth.

Old storage engines have not been
designed for that.

You have to design for that from the day that you write the

first line of code.

Otherwise, it takes a very long
time to retrofit.

It happens to be that that exact
usage pattern is also what's

very good for S3.

NVMe SSDs were not available in
the cloud until 2017, 2018.

So this is, in database languages,
this is relatively new.

The second thing that needed to happen is that S3 was not consistent until December of 2020.

I think this is the most counterintuitive
to most because most

of us just think that it always
has been, but it hasn't.

What that means is that if you put an object on S3 and then you read it immediately after, you were, as of December of 2020, guaranteed read-after-write consistency.

The third thing that needed to happen, and this is very informed by the fact that I was on call at Shopify for so long: when you're on call for a long time, you gravitate towards very simple systems, and ideally systems whose resiliency is constantly tested, so you don't get paged when something abnormal happens.

Nikolay: Yeah.

Simon: And what was very important for Justin and me, to be on call for a database, was that it only had one dependency.

And that dependency could only be one of the most reliable systems on earth, which is S3, Google Cloud Storage, right, and the other derivatives like Azure Blob Storage and so on.

They're very, very, very reliable
systems.

But you could not build the metadata
layer on top, right?

So Snowflake and Databricks and
others that have built on top

of this in the last generation
of new databases needed another

metadata layer, some consensus
layer like FoundationDB or their

own Paxos or Raft protocol to essentially enforce the read-after-write consistency, but also
to do various metadata operations

atomically.

But in late 2024, at re:Invent, S3 finally announced compare-and-swap.
And what compare-and-swap allows
you to do is to put a file,

let's say metadata.json, you download
the file, you do some modifications

to it, and then you upload it again
with a version number, and

you only upload it if the version
number is the same.

Basically guaranteeing that you did an atomic update and nothing has

changed in the interim.

Very important when you're building
distributed systems, right,

you can really implement anything
on top of that, as long as

you're willing to take the performance
hit of going back and

forth to S3.
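(A sketch of that read-modify-write loop in Python with boto3, assuming an SDK recent enough to expose S3's conditional-write If-Match parameter on PutObject; the bucket and key names are hypothetical.)

```python
import json
import boto3
import botocore

s3 = boto3.client("s3")
BUCKET, KEY = "my-bucket", "metadata.json"  # hypothetical names

def update_metadata(mutate):
    """Compare-and-swap on a metadata object: the PUT only succeeds if the
    object still has the ETag we read; otherwise we re-read and retry."""
    while True:
        obj = s3.get_object(Bucket=BUCKET, Key=KEY)
        etag = obj["ETag"]
        meta = json.loads(obj["Body"].read())
        mutate(meta)
        try:
            s3.put_object(Bucket=BUCKET, Key=KEY,
                          Body=json.dumps(meta).encode(),
                          IfMatch=etag)          # conditional write
            return meta
        except botocore.exceptions.ClientError as err:
            if err.response["Error"]["Code"] in ("PreconditionFailed", "412"):
                continue  # someone else won the race; retry
            raise
```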

And of course, they have a whole metadata, Paxos, whatever thing

to implement that.

In GCS, it's Spanner, but I don't have to worry about that.

That's for them to formally verify and whatever they need to

do to uphold those constraints.

Those were the 3 things that needed to happen.

And that is requirement number one, to build a new database.

And so that's what was in the air for turbopuffer to grab.

The second thing that you need for a new database is that you

need a new workload that's begging for that particular storage

architecture.

So for Snowflake and Databricks, we saw that in: well, we want big-scale analytics, and it doesn't make sense at the dollar-per-gigabyte that we have to pay in operational databases, adding indexes on everything.

So there was a new OLAP workload, and there was at least an acceleration

of the OLAP workload with all the successful web applications,

and then the new storage architecture on top of S3, but they

had to do some more work because these APIs that I just mentioned

weren't available.

And a new workload now is connecting lots of data to AI.

And this storage architecture is a very good fit for it.

Nikolay: Yeah, so that's great.

Thank you for elaborating on these 3 changes at S3.

But they also made another change recently, and they announced S3 Vectors, right?

What do you think about this compared to your system?

Simon: Yeah, I think that S3 Vectors, if you are writing the vectors once and you don't query them that much, and you don't care that much about the query latency, can be a useful product, in the same way that you might be using S3 today.

But S3 Vectors doesn't do full-text search, right?

It doesn't do lots of these features that you need for a serious system.

And even S3 Vectors recommends that you load into OpenSearch.

And so for archival vectors, this can make sense.

So they're still operational, but there's lots of limitations

that would make it very difficult for it to go into production

systems, right?

If you do a query to S3 vectors, it takes hundreds and hundreds

of milliseconds to get the result back.

Whereas with turbopuffer, you can get the result back in less

than 10 milliseconds.

Nikolay: Thanks to cache, right?

Simon: Thanks to cache.

Nikolay: Yeah.

Yeah, that totally makes sense.

I still have skeptical questions, more of them, if you don't mind.

Simon: Let's go.

Nikolay: Yeah, one of them is: your cache layer is on local NVMes, right?

Yeah.

But why?

Like, we could store it there.

PlanetScale recently came to the Postgres ecosystem and they said, let's stop having fears about using local NVMes; yes, it's ephemeral storage and so on, but there is reliable failover and so on.

Let's do it, it's super fast.

Yes, I agree, super fast.

And 4 TB, or 3 TB for a billion vectors, probably won't cost too much, because in AWS, for example, on the instances which have local NVMe, this price is basically included.

And the limit is many, many dozens of terabytes of local NVMe storage these days on larger instances.

So we are fine to store not only vectors, but everything else.

So my question is, like, going back from S3 to MySQL, for example: MySQL has supported storage engines for many years.

Have you considered building a storage engine for MySQL, for example, and using local NVMe?

Simon: You can't outrun the economics,
right?

The economics are still the same.

You have to replicate whether it's local NVMe or not, which is not necessarily cheaper, maybe only marginally, than a network volume.

You still have to replicate it
3 times, right?

You still have to put it on 3 machines.

You still need all the cores and
memory and so on that you would

need on the primary to keep up,
unless you start to have a heterogeneous

replica stack, which for a variety
of reasons would be a really

bad idea.

So you're still paying for all of that.

Now up to a certain point that
makes a lot of sense right.

If I have a customer who gets on
a call with our sales team and

they have a couple million vectors
in pgvector there is no reason

to move off of it.

That is perfect.

You should not be taking on the
complexity of ETLs and so on.

But if you have like tens of terabytes
of vector data, it is

not economical for a lot of businesses.

Now for some businesses it is, right?

But the art of a business is to earn a return on the underlying costs.

And for some businesses, it's very, very challenging to earn a return on storing this vector data on 3 replicas; it's generally not valuable enough to the business.

So turbopuffer doesn't make sense as a storage engine in MySQL or in Postgres.

It's just fundamentally not compatible
with the way that we do

things outside of replica chains,
right?

You could maybe come up with a
storage engine where you page

everything into S3 and all of that,
but you're now trying to

build a new database inside of
an extremely old database, right?

The storage layer, especially in Postgres, more so than in MySQL I think, is very, very weaved into how the query planner works and all of that.

So at that point, you're rebuilding
the query planner to be very

round trip sensitive, you're rebuilding
a storage layer.

It doesn't make sense anymore.

It's a completely different database.

Nikolay: Okay.

Another skeptical question.

I go to the turbopuffer website and, by the way, great design, very popular these days, with a monospace font and this type of graphics, and you advertise 1 billion scale, like 1 billion vectors or 1 billion documents.

But there, if I check 1 billion, the concept of namespaces pops up.

And namespaces: if I choose a hundred million, it's okay, I can have one namespace, but if it's 1 billion, there is a warning that probably you should split it into 10 namespaces.

And this means 10 different indexes, right?

Simon: That's right.

Nikolay: Yeah.

So it's not actually 1 billion scale.

Or, like, something is off here in my mind.

Like, 1 billion should be a single index.

If we talk about 1 billion but divided into 10 collections, with 10 indexes, it's already a different story.

So can you explain this to me?

I saw this in the beginning, and I'm happy that in this position I can ask the founder himself about this.

Simon: Yeah, we try to be extremely
transparent in our limits,

right?

And I think your mental model is
correct.

Before this, we just had a limit
that said that we could do in

a single namespace around 250 million
vectors, right?

But I mean, even that is a simplification
because how big are

the vectors?

If they're 128 dimensions, they take a lot less space, which is ultimately what matters here, which is why we also put the gigabytes there.

So in the past, when we had just 250 million vectors on there as a limit, people came to us and said, or we knew that people weren't testing with us, because they wanted to sort of search a billion at once.

And they didn't realize that you
could just do id % N and

then you could do a billion and
it would be very economical for

them.

So we sort of had to put in the
docs like, yes, you can do a

billion at once, but you have to
shard it.
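(A sketch of that id % N sharding at the application layer; the client.upsert / client.query calls are stand-ins for whatever vector-store client you use, not a specific turbopuffer API.)

```python
import heapq

N_SHARDS = 10  # e.g. ten namespaces of ~100M vectors each for 1B total

def namespace_for(doc_id: int) -> str:
    # Route each document to a fixed namespace by id % N.
    return f"docs-{doc_id % N_SHARDS}"

def upsert(client, doc_id, vector):
    client.upsert(namespace=namespace_for(doc_id), id=doc_id, vector=vector)

def search(client, query_vector, k=10):
    # Fan the query out to every shard, then merge the per-shard top-k lists.
    hits = []
    for shard in range(N_SHARDS):
        hits.extend(client.query(namespace=f"docs-{shard}",
                                 vector=query_vector, top_k=k))
    return heapq.nsmallest(k, hits, key=lambda hit: hit.distance)
```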

Now I would love to handle that
sharding for you, right?

I mean that's what Elasticsearch
does and it's what a lot of

databases do because the only way
to scale any database is sharding,

right?

You don't get around it.

The question is where the complexity
lives, right?

Does the complexity live inside
of the database to handle it

for you.

Some of the most sophisticated
sharding you will find lives inside

of Cockroach and Spanner and these
kinds of systems.

And the simplest type of sharding
is what we're exposing, where

every single shard is just a directory
on S3, and you can put

as many of them as you want and
you can query as many of them

as you want.

Now over time, of course, we need
to create an orchestration

layer on top of that so that a
logical namespace to the user

is actually multiple namespaces
underneath.

But we're challenging ourselves
to make every individual namespace

as large as it possibly can.

When I ran an Elasticsearch cluster,
or was involved in scaling

Elasticsearch clusters, every shard
was around 50 gigabytes.

That was roughly what was recommended
for Elasticsearch.

Nikolay: Quite small, right?

Simon: It's quite small, right?

Like on Postgres it's a lot larger
than that.

It's basically the size of the
machine.

But the problem with a small shard
size is that for something

like a B-tree, right, it's obviously
like it's log n, right?

The number of searches you have
to do.

And if you have log n and n is
very high, well, that's a great

number of operations that you have
to do.

But for every shard, there's sort
of like an m log n.

And now m is very high if the shard
size is small and n is small.

So you're doing a lot more operations.

So we want the shards to be as
large as possible, because of

course you can get to a billion by just doing, you know, a thousand or a million shards, but that's incredibly computationally inefficient.

So we have shards now for some
users that we're testing that

do almost a billion documents at
once, right?

But it requires some careful tuning
on our end.

So we wanna push that number as
high as possible.

The other thing with namespaces
is that turbopuffer is a multi-tenant

system and we have to bin pack
these namespaces.

So if a namespace is 5 terabytes
large, it's much harder for

us to bin pack on node sizes that make sense.

So we have to walk some threshold there, where we try to make the shards as large as we can for the logical data, we have to bin pack, and then we have to index without slowing the indexing down, because the larger an ANN index gets, the harder it is to maintain the index.

So those are the constraints we walk.

So over time, we update these limits continuously.

You will see that shard size increase.

But you're not going to find anyone who's doing 100 billion vectors in a single index on a single machine, aka m is 1 and n is a hundred billion, but we want to get as high as possible, because that is the most computationally efficient.

That's how we get our users the best prices.

And we see the most ambitious use cases.

Nikolay: And are these actual shards or partitions?

Shards, meaning that different compute nodes are used behind the scenes: if it's 10 namespaces, is it 10 shards, 10 different compute nodes?

Simon: So a namespace for us, and this is also why we've chosen a different name for it, right: a namespace to us is a directory on S3, and which compute node that goes to is essentially just a consistent hash of the name of the prefix and the user ID, right?

So compute node 1 could hash to maybe, you know, a hundred thousand different namespaces.

And they share the compute, and then we scale the compute with that, and that's why the bin packing is

Nikolay: flexible.

I see.

Simon: Yes.

And that's also part of why we can get really good economics.
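(A minimal sketch of that placement idea: hash the user and namespace onto a compute node. The modulo here is a simplification; a production system would use a proper consistent-hash ring so that resizing the fleet only moves a small fraction of namespaces.)

```python
import hashlib

def compute_node_for(user_id: str, namespace: str, n_nodes: int) -> int:
    """Deterministically map a (user, namespace) pair onto one of n nodes."""
    key = f"{user_id}/{namespace}".encode()
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:8], "big") % n_nodes

print(compute_node_for("org-42", "docs-3", n_nodes=16))
```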

Nikolay: Yeah.

I wanted to shout out again, like front page is beautiful.

It explains like basics, architecture and pricing right here,

latencies right here and case studies.

This is like how it should be, right?

For engineers, you know, so you quickly see all the numbers and

understand.

That's great.

Thank you for transparency.

That's great.

Simon: I think it just comes from the fact that I was on, you know, your side of the table my whole career, right?

I've been buying databases, evaluating databases, and sometimes, I swear, I end up on a database website and I don't know if they're marketing a sneaker or a database.

Nikolay: Yeah, go figure out storage or compute prices in AWS or GCP.

It's always a task.

Simon: So we really wanted to just put our best foot forward: the diagram is right there.

Okay.

This is my mental model.

This is what it does.

This is what it costs.

This is how fast it is.

These are the limits.

These are the customers and how
they use it.

You go to the documentation.

We talk about guarantees.

We talk about trade-offs.

The kinds of things that I always
look for immediately to slot

it into my model.

Because I don't want you to use
our database at all costs.

I want you to use it if it makes
sense for you.

I want it to be a competitive advantage
to you.

I want to save you like millions of dollars a year, and in order for that to make sense, you have to just put all the bullshit aside and make it very clear what you're good at and what you're not good at.

Nikolay: What about, for example... I know there is vector search and full-text search support to try.

But what about additional filtering, when we have some dimensions we want to filter on?

Is it possible?

Yes.

For example, some categories or time ranges or something, it's all possible, right?

In both types of search in turbopuffer.

Simon: I'll go back to my line
before of every successful database

eventually supports every query,
right?

And we're, I don't know, like maybe 5% of the way there.

So we support filters and filtering
on vector search is actually

very difficult, right?

Because if you do something like,
let's say I'm searching for

banana and it's an e-commerce site
and I have to filter for ships

to Canada.

Well, that might cut off the fruit cluster, which is actually where I want to be, and then, like, you know, I get to the banana-on-a-t-shirt cluster, but it's really far away.

So you have to do very sophisticated
filtering to get good results.

I don't know what pgvector does,
but I know that our query planner

is recall aware.

It does a lot of statistics on the distribution of where the vectors are to ensure that we have high recall.

And for a percentage of queries in production, we check against an exhaustive search what the recall is, the accuracy of the index.

I don't know if pgvector does this,
but I know it's been a monumental

effort for us to

Nikolay: do it.

It's a big headache there.

It's a big headache.

Actually, it's a good point that additional filtering can change the semantics of the search.

If it's searching for bananas,
size equals nano, it's very different.

Michael: One thing they've implemented relatively recently is iterative index scans.

I think they had a problem where, for example, you'd do a similarity search with a limit and a filter, and you'd ask for a hundred results and you'd get back fewer, even though there were more.

So now they go back to the index and request more.

So yes, it's a workaround, I'd say, more than a solution, but it's pretty effective for a lot of these cases.

Simon: I think the problem there
is that I would be very skeptical

of the recall in a case like that,
right?

Because, okay, so you got your 100, but you probably should have looked at a lot more data to know what the true top K was.

So, you know, that solution was
not acceptable to us.

Like we had to go much, much further
to ensure high recall.

And I think the problem with a
lot of these solutions is that

people are not evaluating their
recall because it's actually

very difficult to do.

And it's very computationally expensive.

You're not gonna have your production
Postgres run an exhaustive

search on millions of vectors.

It's too dangerous to do that while
you're doing your transactional

workloads.
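(A sketch of that kind of recall check, run offline or on a small sample of queries against a brute-force ground truth; the sampling rate and k are illustrative.)

```python
import numpy as np

def recall_at_k(exact_ids, approx_ids) -> float:
    """Fraction of the true top-k (from an exhaustive search) that the
    approximate index actually returned."""
    return len(set(exact_ids) & set(approx_ids)) / len(exact_ids)

def spot_check_recall(queries, vectors, approx_search, k=10, sample=0.01, seed=0):
    """Compare the approximate index against exhaustive search on a sample."""
    rng = np.random.default_rng(seed)
    picked = queries[rng.random(len(queries)) < sample]
    scores = []
    for q in picked:
        exact = np.argsort(-(vectors @ q))[:k]   # exhaustive ground truth
        scores.append(recall_at_k(exact, approx_search(q, k)))
    return float(np.mean(scores)) if scores else float("nan")
```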

But again, if you're hitting these
kinds of problems, then it

may be time to consider maybe for
a subset of your workloads,

whether something more specialized
makes sense, whether the trade-offs

make sense to you.

But I think with the pre-filtering and post-filtering, it's very challenging to create a query planner that can do that well.

So we support all the queries,
right?

You can do range queries.

You can do exact queries.

You can operate with arrays.

You can do all types of queries,
set intersections that people

use for permissions.

We can do full-text search queries,
we can do GROUP BYs, we

can do simple aggregates.

So we can do more and more queries
and we're constantly expanding

that with what our users demand
from the system that we've built.

Nikolay: One more annoying question.

I know it's not open source.

There is no free version at all.

And I'm pretty sure it was a decision not to go with an open core or open source model at all; can you explain why?

Simon: Yeah, I think there's never been any particular resistance to open source.
I mean, I love open source.

The reason we're not open source
is because open source is, if

you want to do it well, it's a
lot of effort.

And it's also a lot of effort to
build a company.

It's a lot of effort to find product
market fit.

And we decided to pour all of our
energy into that.

And it's a similar argument for the minimum.

It's really the only thing that a startup has over, you know, the big incumbents: focus.

And so our focus has been on the
customers that are willing to

pay.

And one day I would love to give a free tier.

I would love for people's blogs
to run on turbopuffer.

But for now, I'm afraid that we
have to prioritize the customers

that are willing to put some money
behind it.

It's not because we have provisioned
hardware or anything like

that.

It's really just that we need to
know that you're willing to

put some money behind it so we
can give you amazing support,

right?

A lot of the times there'll be
engineers working on the database

who are supporting your tickets
and we can't do that with a free

tier.

Nikolay: Yeah.

Makes sense.

Yeah.

And Michael and I have different points of view in this area.

Michael: I think that was a good
answer.

I'm on your side, Simon.

I had a question on the full text
search and semantic search.

We had a discussion a while back
about hybrid, I think it's generally

called hybrid search.

Are you seeing much of that where
people are wanting to kind

of mix the results and order between
them?

Simon: Yeah, we do.

And what we see is that the embedding
models are pretty phenomenal

at finding good results for subjects
that they know about, right?

And terms that they know about.

You know, you run a project called pgMustard, right?

And maybe the embedding model doesn't
know.

I think it's popular enough that
it will know what it is, but

let's say it didn't know.

And then it puts it in the cluster
close to ketchup.

And the results are just horrible.

And the full-text search tends to be really good at this, like recall for things that the embedding model can't know about.

If you're searching for, like, an LG, you know, these TV SKUs, it's, you know, indecipherable.

Actually, the embedding models are quite good at these, because they appear enough on the web.

But imagine it was something that isn't one of these kinds of searches it's good for.

The other thing is search-as-you-type.

So if I'm searching for "si", the embedding model might find that, well, he's searching for something agreeable in Spanish, right?

But really what I wanted was the document that starts with Simon.

So sometimes you just need to complement these two.

And I think that's why we've doubled down on the full-text search implementation in turbopuffer, to do this hybrid.

But people can get a lot of mileage out of embedding models alone.
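(One common way to combine the two result lists is reciprocal rank fusion; this is a generic recipe, not necessarily how turbopuffer or any specific engine merges them internally.)

```python
def reciprocal_rank_fusion(result_lists, k=60, top_n=10):
    """Merge ranked lists (e.g. full-text hits and vector hits) by summing
    1 / (k + rank); documents ranked well in either list float to the top."""
    scores = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

fts_hits = ["doc7", "doc2", "doc9"]   # e.g. from full-text search
vec_hits = ["doc2", "doc5", "doc7"]   # e.g. from the embedding index
print(reciprocal_rank_fusion([fts_hits, vec_hits]))
```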

Michael: Wonderful.

Nikolay: Yeah, thank you so much.

Yeah, it was super interesting.

Good luck with your system.

And yeah, I think it's an interesting idea, I guess, for Postgres users who need, as you said, to store more and have all the characteristics turbopuffer has.

Maybe it's worth considering, right, to keep all OLTP workloads and data in regular Postgres while moving vectors to turbopuffer, right?

Simon: I mean, it's just similar to how people have taken workloads out of Postgres before.

It's like updating a posting list: it's a very expensive operation in a transactional store.

And similarly for an index, updating it with the same kinds of ACID semantics that Postgres upholds is very expensive.

And I've ripped full-text search
out of Postgres many times because

it's very, very expensive to do.

So we do that because we don't
want to shard Postgres, because

it's like a lot of work.

And so we start by moving some
of these workloads out first and

search is one of the early ones to go.

Nikolay: So your point is that
it probably postpones the moment

when you need to shard.

Simon: Yeah, same reason as for Memcached and Redis, right?

You separate out these workloads
as soon as possible to avoid

sharding.

Kick the can.

Nikolay: Interesting idea.

Okay, again, thank you so much
for coming.

It was a super interesting discussion.

I enjoyed it a lot.

Simon: Thank you so much.

Michael: Yeah, really nice to meet
you.

Thanks for joining us.
