
turbopuffer
Nikolay: Hello, hello.
This is Postgres.FM.
As usual, my name is Nik, Postgres.AI, and my co-host is Michael,
pgMustard.
Hi, Michael.
Michael: Hi, Nik.
Nikolay: And we have an unexpected guest today, Simon Eskildsen, CEO and
co-founder of turbopuffer.
Hi, Simon.
Simon: Thank you for having me.
Nikolay: Yeah.
Thank you for coming and it was very unexpected because we mentioned
turbopuffer last time and you messaged me on Twitter and I think
it's a great idea sometimes to look outside of traditional Postgres
ecosystem.
I think it's beneficial for everyone, should be.
So yeah, thank you for coming for sure.
It's a great idea.
I think.
Simon: Yeah, the origin story is kind of funny and it was only
a couple of days ago.
There's a script where if turbopuffer is mentioned anywhere, then I'll get an email with a summary.
And you guys were discussing last time, different ANN, like both
in Postgres and outside and turbopuffer was mentioned.
And so I just DM'd you and asked, hey, I'd come on, chat about Postgres, chat about MySQL, chat about databases, and when you choose one over the other, and when Postgres breaks for some of these workloads that we've seen, and when it's great.
And now it's what?
Yeah, 2 or 3 days later and we're on.
Nikolay: Including the weekend, actually.
So the podcast was out on Friday, and we record this on Monday.
That's the kind of turnaround everyone should try to achieve.
I like it.
Simon: Yeah.
And you've had chickens hatch in the meantime.
Nikolay: Yeah, that's why my camera is overexposed: the whole night I used this camera to broadcast.
I didn't have time to tune it properly.
But again, like, thank you for coming.
This is about databases.
This time probably not so much about Postgres, but definitely we should talk about vectors, right, and ANN.
And maybe we should start from a distance and talk about your background.
I've heard some MySQL is involved, right?
Can you discuss it a little bit?
Simon: For sure.
Yeah, my background is I spent almost a decade at Shopify, scaling mainly the database layer there, but pretty much anything that would break as the scale increased and increased through the 2010s.
Shopify, like most companies started in the early 2000s, was on MySQL.
So a lot of the work that I did was with MySQL, but also every other database that Shopify employed, like Redis, Memcached, Elasticsearch, Kafka, and a ton of others.
So it's been a long time there.
When I joined, it was a couple
of hundred requests per second.
And when I left, it was in the millions, so I was very, very intimate with the data layer there.
I was on the last-resort pager for many, many years, and that has informed a lot about how I write software today.
And a couple of years ago I started a company called turbopuffer because I thought...
Nikolay: Before that, sorry for
interrupting.
I remember Shopify, actually; we always recommend the article.
Shopify had a couple of blog posts
about UUID and how UUID version
4 is not good and version 7 is
much better.
And it doesn't matter, MySQL or Postgres, the mechanics behind the scenes are the same.
It's how the B-tree behaves.
So I remember, and I think, Michael,
we mentioned that article
on podcast a few times as well.
Michael: We have mentioned it, Nik,
for sure.
And I think Shopify has come up
before.
I thought, is it a Vitess shop?
I thought it might have been one of the early Vitess adopters.
Simon: Shopify is using a little
bit of Vitess, but not very
much.
Vitess was not around when we
did all the sharding back in the
early 2010s.
And so we did it all at the application layer.
I wasn't part of the actual sharding
decision, but I was part
of a lot of the sharding work over
the years.
And it's funny because at the time I know that they looked at a bunch of proxies, and all of the businesses they looked at had later gone out of business.
It's not a great business, unfortunately.
But everything was done in Ruby
land through a, just a module
called sharding, and it did a lot
of things and a lot of monkey
patches into Rails.
But let's talk about this UUID v4 thing, because I think if we wanted to do a pros and cons, MySQL versus Postgres, I spent quite a bit of time with both, and this one, to my knowledge, only really matters for MySQL.
Well, it actually matters for Postgres
as well, but in a different
way.
So on MySQL, the primary key dictates
how the B-tree is laid
out for the primary key, right?
So for the entire row.
So if you have UUID v4, it's completely randomly scattered along the B-tree.
So whenever you're doing an insert, right, it will just kind of fall somewhere random in the B-tree, which makes the writes very expensive, because if you're adding 10 rows at a time, you're doing 10 insertions that land in 10 different leaves, so you're just doing a lot more disk I/O and your write amplification is high.
In Postgres, of course, you're
just appending to the heap with
all of its drawbacks and benefits.
And so it doesn't matter as much other than on the indexes, right?
That's my knowledge, but on the indexes it will matter a lot, because the index, of course, is sorted by that key.
And if you have some temporal locality, then it's not going to matter as much.
So that's my understanding.
So this matters a lot in MySQL.
Now, that article I think was after
I left and Shopify doesn't
use UUIDs as primary keys for anything.
So I don't really know where this
mattered.
It must be something tangential, because MySQL really just does auto-increment.
And every shard does an auto-increment with basically a 32K increment step.
And then every shard has an offset into that, to allow it to grow to 32K shards.
Given how much of a pain in the
ass it would be to change that,
that's probably still the case.
But I always really liked that
scheme.
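To make the scheme concrete, here is a minimal sketch; the 32K step is from Simon's description (in MySQL terms, auto_increment_increment with a per-shard auto_increment_offset), and the function names are illustrative:

```python
# Interleaved auto-increment: every shard steps by the same stride
# and starts at its own offset, so IDs never collide across shards.
STRIDE = 32768  # leaves room to grow to 32K shards

def shard_ids(shard_offset: int, count: int) -> list[int]:
    """First `count` primary keys generated on a given shard."""
    return [shard_offset + i * STRIDE for i in range(count)]

# Shard 1 gets 1, 32769, 65537, ...; shard 2 gets 2, 32770, 65538, ...
assert not set(shard_ids(1, 1000)) & set(shard_ids(2, 1000))
```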
So some tables over time at Shopify ended up having a primary key on the shop ID comma the ID, because that would give locality for a shop.
Because otherwise you have a lot of read amplification if you're trying to dump out a bunch of products for a shop: unless you do that, the chance that there are going to be multiple products for a shop in a leaf is just a lot lower.
So that ended up working really well.
And this is a pain to do in Postgres,
because if you want to
rewrite the primary key or the
heap by an ID, you have to rewrite
the entire thing.
That was one of my surprises, having worked a bit more with Postgres in later years.
Nikolay: Yeah, yeah, yeah.
And I agree with you, and I agree that in Postgres it also matters, but only for the B-tree itself, the primary key B-tree, if it's auto-increment, or in Postgres it's called bigserial, serial for example, or auto-generated.
Right now there's another method, but behind the scenes it's also a sequence and inserts.
Oh no, in this case it's not a sequence.
There would be a function which generates UUID version 4, and version 4 is random, while version 7 is closer to regular numbers, basically, because it's monotonically growing, right?
Lexicographically ordered, right?
So in this case, you insert only on the right side of the B-tree, and you dirty fewer pages, if you think about how the checkpointer is working.
Also, in Postgres there is overhead after each checkpoint: full-page writes, which involve indexes as well.
So if you touch random pages all the time, then disk I/O, replication, and backups, everything receives additional overhead.
While with version 7 we write on the right side all the time.
It's much better.
But heap, yes, heap is different.
So I agree with this, and anyway, I just wanted to say we recommend the MySQL article because it's written very well.
And in Postgres we didn't have
version 7 for quite some time.
Last Thursday the version 18 release candidate was out, which will include a full implementation of UUID v7, which was live-coded on Postgres TV with a couple of friends.
In Cursor, I think.
Oh no, it was not in Cursor, it was before that, but anyway, it was just created right online.
We did it, and it took a couple of years to reach maturity, because Postgres always waits until RFCs are finalized and so on.
Anyway, soon version 18 will be
out and UUID version 7 is inside.
But I think everyone is already
using it on client.
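To see the locality difference in miniature, here is a sketch; the uuid7_like helper is illustrative only (it mimics the 48-bit millisecond timestamp prefix, not the full RFC bit layout):

```python
import os
import time
import uuid

def uuid7_like() -> uuid.UUID:
    """Toy v7-style UUID: 48-bit ms timestamp prefix, random tail.
    (Illustrative only; it skips the version/variant bits.)"""
    ms = int(time.time() * 1000) & ((1 << 48) - 1)
    tail = int.from_bytes(os.urandom(10), "big")
    return uuid.UUID(int=(ms << 80) | tail)

v7 = []
for _ in range(5):
    v7.append(str(uuid7_like()))
    time.sleep(0.002)  # ensure distinct millisecond timestamps

# Time-prefixed IDs arrive already ordered, so B-tree inserts land on
# the right edge; random v4 IDs generally do not arrive sorted.
assert v7 == sorted(v7)
v4 = [str(uuid.uuid4()) for _ in range(5)]  # scattered across the key space
```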
Okay, great.
So you had this great career and
then decided to create another,
is it, can we call it a database
system, database management
system?
Simon: It's certainly a full-blown
database.
You know, underneath turbopuffer
is an LSM.
An LSM works really well for object
storage.
And, you know, every successful database ends up implementing every query eventually, right, in the limit, and turbopuffer will end up doing the same thing.
But every good database also starts with some specialization, right?
And our specialization has been on search workloads.
I would say that it's by no means a replacement for Postgres.
There always comes a time when it starts to make sense to move parts of the data into more specialized algorithms, more specialized data structures, and more specialized storage.
In general, my hypothesis on when it's time to create a database is that you need 2 things in the air in order to create a generational database company.
The first thing you need is a new storage architecture, some new way to store the data, ideally both data-structure-wise and in the actual medium that you're persisting on.
Because if you don't have a new storage architecture, there's no reason why, in the limit, the other databases won't do the workload better.
They already have the existing momentum.
They already have the workloads.
In the Postgres case, of course, you know, it's the classic relational
architecture where you replicate every byte onto 3 disks.
And this works phenomenally well, right?
We've had this in production for probably more than 40 years.
And it works great.
It has high performance, it has a very predictable performance profile, and it works really, really well with the page cache or the buffer pool, whatever database you're using.
The problem with this model is that if the data is not very valuable,
this model is expensive.
Every gigabyte of a network-bound disk is about 10 cents per gigabyte.
And unless you are a really risky DBA, you're going to run those disks at 50% utilization, on all the replicas and on the primary.
So you're paying for this disk kind of 3 times, which ends up with a whole all-in cost of 60 cents per gigabyte.
And that's not even accounting for all the CPUs that you also need on the replicas, because you need the replicas to process the writes as fast as the primary.
So you're kind of paying for the same thing 3 times.
So the all-in, sort of per-gigabyte cost, when you also take into consideration that the system can be a little bit more memory-hungry, is probably around 60 cents to $2 per gigabyte
Nikolay: per month.
Simon: Per month.
Yeah.
Per month, USD.
On object storage, the base cost is 2 cents per gigabyte.
Right.
And when we need that data in RAM or on disk, we only have to pay for one.
We only have to pay for some percentage of the data to be in
cache at all times.
We mentioned Cursor earlier.
Cursor is a turbopuffer customer.
They don't need every single code base on SSDs or in memory at
all times.
They need some percentage in memory, the ones that are queried
a lot right now, and some percentage on disk, the ones that are
gonna be queried again in a few minutes or maybe in a few hours.
And so you end up paying a lot less because we only have to keep
1 copy of a subset of the data rather than 3 copies of all of
the data.
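Running Simon's numbers once (the dollar figures are his from above; the arithmetic is just spelled out, and the cache figures at the end are illustrative assumptions, not turbopuffer's):

```python
# Replicated-disk model: pay for every byte on every replica.
disk = 0.10          # $/GB/month, network-bound disk
utilization = 0.50   # run disks half full to be safe
copies = 3           # primary + 2 replicas
print(disk / utilization * copies)   # 0.60 -> ~60 cents/GB/month,
                                     # before replica CPU/RAM overhead

# Object-storage model: one durable copy, plus a cache over only
# the hot fraction of the data.
s3 = 0.02            # $/GB/month base cost
hot_fraction = 0.10  # illustrative assumption, workload-dependent
nvme_cache = 0.10    # illustrative $/GB/month for the cache tier
print(s3 + hot_fraction * nvme_cache)  # ~0.03 $/GB/month
```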
Now that comes with a fundamental set of trade-offs, right?
We want to be as upfront about that as possible.
You can't use turbopuffer for everything.
If you want to do a write to turbopuffer, we commit that to S3,
right?
So by the time it's committed to turbopuffer, the guarantee
is actually stronger than most relational systems because we've
committed it into S3, which takes a couple hundred milliseconds,
but the durability guarantee is very strong.
But if you're building a system like Shopify, well, you can't
live with a commit time in hundreds of milliseconds.
It's just not acceptable.
So that's a trade-off that means that this is not a system that
will ever replace a relational database store.
The other downside is that because not all the data is on a disk
or in memory at all times, it means that you can have tail latency.
And that can be really catastrophic in very large systems that
are doing millions of queries per second.
If you can't rely on a very predictable query profile, you can
have massive outages to hydrate the caches.
I've seen these outages on disks, I've seen them.
And just even the workload changing slightly can mess with the
buffer pool in a way where you have a massive outage.
So these 2 things may sound like small trade-offs, but they're
massive trade-offs for very large production systems.
But it might mean that if you have, let's say, a billion vectors and you're trying to store them in Postgres, the economics just don't make sense.
You're paying thousands, if not tens of thousands, of dollars in hardware costs for a workload that might cost tens or hundreds of dollars on turbopuffer.
Nikolay: How much is it really?
1 billion vectors.
If one vector is, what is it, like 700...
Simon: 768 dimensions, yeah.
It's 3 terabytes.
Nikolay: 3 terabytes to store 1 billion vectors.
Yeah.
And also we don't have a good index for 1 billion scale.
Yeah, I mean, in Postgres, pgvector, HNSW won't work with 1 billion.
Simon: I mean, we could just run the math a couple of different
ways, right?
I'm not saying this is how it works in pgvector, I'm less familiar
with it now.
But even if you have 3 terabytes of data raw, you're probably
going to need to store more than that.
You might be able to do some tricks to make the vector smaller,
so you only have to store maybe, I don't know, a terabyte or
something along those lines, right?
But remember that a gigabyte of DRAM is $5 per month.
And you need that 3 times on all your replicas.
So you're paying $15 per gigabyte per month.
So if you're doing that, if you have to store that 3 times, you
put everything in memory and you're somehow able to get it down
to a terabyte, then you're talking about $15,000 per month,
right, across the 3 replicas, just for the RAM alone.
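The same math, written out; the 768 dimensions and the $5/GB DRAM figure are from the conversation, and the 1 TB compressed size is Simon's hypothetical:

```python
dims, bytes_per_float = 768, 4
vectors = 1_000_000_000
raw_bytes = vectors * dims * bytes_per_float
print(raw_bytes / 1e12)            # ~3.07 TB of raw float32 vectors

dram = 5                           # $/GB/month
replicas = 3
compressed_gb = 1000               # assume tricks get it down to ~1 TB
print(compressed_gb * dram * replicas)  # ~$15,000/month for RAM alone
```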
Nikolay: Yeah, I agree with you.
With HNSW, memory is the key, and creation of the index takes a lot of time.
And that's for a billion; I already have issues at the few-million-vectors scale.
I know Timescale, which has now renamed to Tiger Data, developed another index based on DiskANN from Microsoft, I think, which is more disk-oriented, right?
But also, I think it's not only about vectors.
This argument that we need to save on storage costs, and it's
insane to pay for storage and memory as well when we have replicas
and we scale.
In Postgres, if it's a physical replica, it's everything.
So you replicate everything, all indexes, everything.
You cannot replicate even... there is no ability to replicate only one logical database.
You need to get the whole cluster.
And it means that you multiply costs for storage and for memory.
And it would be so great to have some tiered storage, maybe with partitioning, as automated as possible, and offload old data to S3, basically as you do.
S3 is great for this.
And I remember, actually, I explored turbopuffer through Cursor; it's great to see it in the documentation.
I knew Cursor was using Postgres.
It was a few months ago, but then I heard they considered moving
to PlanetScale.
That was before PlanetScale announced support of Postgres, so
I was thinking, are they switching to MySQL?
And then I saw vectors are stored in turbopuffer.
Great.
Then I learned that several of our clients who get consulting
from us and use Postgres, they also store vectors in turbopuffer.
It was, I think, Photoroom, and a couple more companies.
It was a surprise for me, and I started to think, oh, that's
interesting.
And then I checked your talks.
I think you also mentioned there that if we...
So there is this approach with the SPFresh algorithm, right?
It's different, it's not HNSW, a different type of index, but you also mentioned some additional interesting ideas about the economics, right?
Not only this.
Can you elaborate a little bit?
Simon: I think it might be helpful to just talk at a very high level about the different algorithms for vector indexing.
I'll try to simplify it as much as possible, and we can dig into it further if you want.
Fundamentally, the simplest way
to do vector search is that you
just store all of the vectors in
a flat array on disk, right?
And then on every search, you just
compare the query vector to
all of those.
The problem with that is that you
very quickly run up against
bandwidth limits, right?
If you have a gigabyte of vectors and you're searching that at maybe 10 gigabytes per second, if you can exhaust the memory bandwidth, which is unlikely in a big production system, you're only doing, you know, maybe 5 queries per second, and the query latency is in the hundreds of milliseconds.
So if you have very, very few queries
and you don't care about
latency, this can be feasible on
small scale.
And lots of people are doing that
in production.
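A minimal brute-force (flat) search looks like this; exact results, but every query scans every vector, which is the bandwidth wall Simon describes (sizes here are illustrative):

```python
import numpy as np

def flat_search(vectors: np.ndarray, query: np.ndarray, k: int = 10):
    """Exact nearest neighbors by scanning everything (cosine similarity)."""
    sims = vectors @ query / (
        np.linalg.norm(vectors, axis=1) * np.linalg.norm(query)
    )
    top = np.argsort(-sims)[:k]   # O(n) scan per query, no index
    return top, sims[top]

rng = np.random.default_rng(0)
vecs = rng.standard_normal((100_000, 768)).astype(np.float32)
idx, scores = flat_search(vecs, vecs[42])  # vector 42 matches itself first
assert idx[0] == 42
```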
But if you want to search a million
vectors in less than a couple
hundred milliseconds and maybe
10 milliseconds and as part of
a bigger pipeline, you need some
kind of index.
The problem with indexing vectors
is that there is no known way
to do it in a perfect way, right?
If I search for a query vector about a fruit or whatever, I know that if I'm searching for banana, I get the closest fruit.
Maybe, I don't know, maybe there's a plantain, right, in the cluster.
But you have to build an approximate
index in order to make this
faster.
Nikolay: Because of too many dimensions.
For a small number of dimensions there are approaches, and we've had them for years, but for hundreds or thousands of dimensions, yeah.
Simon: That's right, yeah, there are KD-trees and so on for the smaller-dimensional spaces, which we can use for geo coordinates, things like that, and simpler geometry.
For very high-dimensional spaces these things fall apart.
The curse of dimensionality, it's called.
The fact that vectors are very large is also important.
If you have a kilobyte of text, it can easily turn into tens and tens of kilobytes of vectors, which is why separating them into cheaper storage makes a lot of sense.
So there's 2 fundamental ways that you can index the data.
There are the graph-based approaches; HNSW and DiskANN were the 2 you mentioned before.
And there's the clustering-based approach.
The graph-based approach is phenomenal
if you can store all of
the data in memory and you have
very high QPS and very low latency
requirements.
So if you have, let's say a hundred
million vectors and you're
searching that at, you know, a
hundred thousand queries per second,
and you need very low latency,
you're not going to beat HNSW.
It's going to create a very, very good graph to navigate with.
The problem is that it's very expensive
and the other problem
is that almost no workloads in
the real world actually look like
this.
So HNSW got very popular because
it's fairly simple to implement
correctly and it's very simple
to maintain.
So when you create the graph, which is essentially just, you know, points that are close in vector space being close in the graph, it's very simple to incrementally maintain.
You put one thing in, you search the graph, and then you add it.
There are very simple rules.
You could implement something like HNSW in tens of lines of code if you did a very, very simple implementation of it.
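In that spirit, here is a toy single-layer proximity graph; it is nothing like a production HNSW (no layers, no pruning heuristics), just the insert-by-searching idea in a few dozen lines:

```python
import numpy as np

class TinyGraphIndex:
    """Toy NSW-style graph: insert = greedy-search, then link both ways."""

    def __init__(self, m: int = 8):
        self.m, self.vecs, self.edges = m, [], []

    def search(self, q, k: int = 5):
        """Greedy walk from node 0 toward q, keeping the best nodes seen."""
        if not self.vecs:
            return []
        dist = lambda i: float(np.linalg.norm(self.vecs[i] - q))
        cur = 0
        seen = {0: dist(0)}
        improved = True
        while improved:
            improved = False
            for nbr in self.edges[cur]:
                if nbr not in seen:
                    seen[nbr] = dist(nbr)
                if seen[nbr] < seen[cur]:
                    cur, improved = nbr, True  # move closer, keep walking
        return sorted(seen, key=seen.get)[:k]

    def insert(self, v):
        v = np.asarray(v, dtype=np.float32)
        nbrs = self.search(v, k=self.m)  # find close nodes first
        self.vecs.append(v)
        new = len(self.vecs) - 1
        self.edges.append(list(nbrs))
        for n in nbrs:                   # the write amplification Simon
            self.edges[n].append(new)    # mentions: every neighbor is touched
```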
The problem with HNSW is that every time you do a write, you have to update a lot of data, right?
In database land, we call this write amplification: for every byte or every page you update, you have to update a lot of others.
And the reason for that is that you add something, you add a node in the graph, and then you have to update all these other things to make connections to that node in the graph.
So this works great in memory,
because memory is very fast at
updating and very fast at random
writes.
But the problem is also on the
read path.
So memory is very fast at doing
random reads.
You can do a random read at 100 nanoseconds, but a random read on a disk or on S3 is much slower: hundreds of microseconds on a disk, up to hundreds of milliseconds on S3.
And on a graph, right, there's no speculation that helps, right?
If you start at the middle of the graph and then greedily navigate the graph from the query vector to the closest matching vectors, well, every single time you do that, there's a round trip.
And on S3, that's a round trip that takes hundreds of milliseconds.
So you're sort of navigating from the root: that's 200 milliseconds.
You go out one hop, 200 milliseconds.
You go out another one, 200 milliseconds.
And in general, for HNSW on a million,
this might be in the tens
of round trips.
So this gets very slow, right?
This is in the seconds to do this
on S3.
And even on a disk, this very quickly
adds up to tens of milliseconds.
That's the fundamental problem with graphs.
Now, DiskANN is essentially using a lot of the same ideas
as other graph-based indexes, but instead of trying to have the
tens of round trips that HNSW has, that's very good for memory,
DiskANN basically tries to shrink the graph so that there
are fewer jumps, right?
So instead of 200 milliseconds, 30 times, it tries to get it
to maybe 6 or 7 times, or 10 times, by shrinking the graph as
much as possible.
That's essentially the insight in DiskANN.
The problem with DiskANN is that after you have added more than 10 or so, 10 to 20% of the size of the data, you have to rebuild the entire graph, which is an incredibly expensive operation.
That is absolutely terrifying to me, as someone that's been on call for large databases: you could max out 128 cores rebuilding this graph in prod, and it could take you down at 3 a.m., because you don't know when you hit that threshold.
And if you don't do it, then the approximations start getting bad and you start getting bad search results.
The nice thing about the graphs is just that they have very low latency, but they're very expensive to maintain.
Then there's the other type of index, which is the clustered indexes.
Clustered indexes are very simple to picture in your head, right?
Let's say you took every song on Spotify and clustered them in a coordinate system, and you can visualize this in 2 dimensions.
If you then plotted all the songs, the songs that are adjacent are, of course, also adjacent in vector space, and genres will emerge, right?
There'll be a rap cluster, there'll be a rock cluster.
You zoom in and you get, you know, little sub-clusters: death metal, black metal, I don't know what all the different rock genres are, somewhere in these clusters.
And you can generate great recommendations based on that because
you can look at, okay, what did Michael listen to and what are
some songs that are close by that he hasn't listened to and same
for Nik.
So it's very simple: you create a clustering algorithm that basically just tries to divide the data set into clusters, and when you do a query, instead of looking at all the vectors, you look at, well, clearly the user is asking about a rock song, so we're only going to look in the rock cluster.
And that way you divide down the number of vectors that you have to seek.
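A toy clustered (IVF-style) index in that shape, using scikit-learn's k-means for the clustering step; the sizes and parameters are illustrative:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
vecs = rng.standard_normal((100_000, 128)).astype(np.float32)

# Build: cluster the vectors and store each cluster's member ids.
km = KMeans(n_clusters=256, n_init=1).fit(vecs)
clusters = [np.flatnonzero(km.labels_ == c) for c in range(256)]

def search(q, nprobe: int = 8, k: int = 10):
    """Rank centroids, then scan only the nprobe closest clusters."""
    order = np.argsort(np.linalg.norm(km.cluster_centers_ - q, axis=1))
    cand = np.concatenate([clusters[c] for c in order[:nprobe]])
    dists = np.linalg.norm(vecs[cand] - q, axis=1)
    return cand[np.argsort(dists)[:k]]

print(search(vecs[7])[:3])   # vector 7 should come back first
```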
Now the problem with this is that if you have everything in memory,
it's not necessarily as optimal because you might have to look
at more data than you do in a graph-based approach.
And because RAM has such good random
read latency, the penalty
is not necessarily worth it if
everything is in memory at all
times.
But this is great for disks, and
it's great for S3, because I
can go to S3 and in 200 milliseconds
get gigabytes of data back,
right?
It doesn't matter if I'm getting
like, you know, a megabyte or
a gigabyte, I can often get that
in the same round trip time
if I exhaust the network.
So if you then go to S3, basically, you have all the clusters stored there.
So like, let's say clusters of JSON, to really simplify this.
And then you just look at the closest n clusters to your query vector.
And then you download, you know, cluster1.json, cluster2.json, whichever ones are closest, in 2 round trips.
And now instead of on the graph-based
ones where you're doing
200 milliseconds, 200 milliseconds,
200 milliseconds to navigate
the graph, you just have to do
200 milliseconds to get the clusters
and then 200 milliseconds to get
all the clusters that were adjacent.
The nice thing about these clustered indexes is that with algorithms like SPFresh, and lots of modifications to them, we can incrementally maintain these clusters, because you can imagine that when you add a vector, you just have to add it to the cluster, and it's just one write.
The write amplification is very
low.
Once in a while that cluster will
grow beyond a size.
Let's say it's a thousand elements
long and you have to split
the cluster and then you have to
do some modifications.
That's essentially what SPFresh
is.
And then there's a little bit higher
write amplification.
But it's stable in the way that
you never reach this threshold
where, okay, I've added 20% of
the dataset, I have to rebuild
the entire thing as you do in DiskANN.
HNSW doesn't have to do that either, which is why it's very nice, but it's just slow still.
So SPFresh, we think, strikes a really, really nice set of trade-offs, where it's going to be a little bit slower, but slower in the sense that, okay, instead of a search returning in a millisecond, it might take 5 milliseconds, and just no one cares in production for search.
This matters for a point lookup
into a relational database, but
for search it's a perfect set of
tradeoffs.
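A sketch of the insert-and-split idea in that spirit; this is not the actual SPFresh algorithm (which also handles merges, reassignment, and balancing), just the shape of the write path:

```python
import numpy as np

SPLIT_AT = 1000  # "a thousand elements long", per the description above

centroids = [np.zeros(128, dtype=np.float32)]
clusters = [[]]

def insert(v):
    """One cheap write: append to the nearest cluster; split when full."""
    v = np.asarray(v, dtype=np.float32)
    c = int(np.argmin([np.linalg.norm(m - v) for m in centroids]))
    clusters[c].append(v)
    if len(clusters[c]) > SPLIT_AT:       # occasional, bounded extra work
        pts = np.stack(clusters[c])
        d = np.linalg.norm(pts - centroids[c], axis=1)
        b = pts[int(np.argmax(d))]        # farthest point seeds new cluster
        a = centroids[c]
        goes_right = (np.linalg.norm(pts - a, axis=1)
                      > np.linalg.norm(pts - b, axis=1))
        left, right = pts[~goes_right], pts[goes_right]
        clusters[c] = list(left)
        centroids[c] = left.mean(axis=0)
        clusters.append(list(right))
        centroids.append(right.mean(axis=0))
```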
Michael: Question on the cluster
splitting.
Does that mean we don't ever need
to rebuild the whole index?
Because I think that was a limitation of the first cluster-based one...
IVFFlat, I think, is what we started with in pgvector, and that didn't have the splitting as far as I'm aware, and therefore we had to rebuild the whole index from time to time if the data changed significantly.
Simon: That's right.
And there's also merge, right?
Nikolay: Not only split, there is also merge in SPFresh, as I remember.
Simon: Yes, there's merge as well.
Michael: Makes sense.
Simon: And because you might do deletes, which are also a pain.
To my knowledge, pgvector does not implement SPFresh.
It is very difficult to implement correctly.
Nikolay: Also funny: I did a little bit of research in May, I think, when I discovered turbopuffer and started reading about this.
I saw that the original SPFresh implementation was actually on a forked Postgres.
You need to dig into it, in some Microsoft repositories; I think some Chinese engineers were involved.
I saw some repository which was basically a forked Postgres, and the original SPFresh implementation was on top of that.
Maybe I'm hugely mistaken, but I saw something like this.
But it's hard.
I agree.
And I also recalled what I wanted to ask; I was lost in my previous question because there were too many things in it.
I recalled that in your talks you discuss that S3 is ready to store data, ready because over the last few years they added important features.
Can you recall what you talked about in those talks?
Simon: Yeah, I mentioned that there were 2 things that you need to build a new database.
And there's a reason why a database like turbopuffer hasn't already been built: this new storage architecture has only really become possible now.
There are 3 things that have enabled a database like turbopuffer to exist with this sort of pufferfish architecture, right, where, you know, when the pufferfish is deflated, it's in S3; when it's somewhere in between, it's on SSD; and then it's in memory when it's all the way inflated.
The reason that's possible is because of 3 things.
The first one is NVMe SSDs.
NVMe SSDs have a new set of trade-offs; they act completely differently than other disks, right?
The old SSDs were very fast, but NVMe SSDs have just phenomenal performance potential, where basically on an NVMe SSD the cost per gigabyte is a hundred times lower than memory, but the performance, if you use it correctly, is only about 5 times worse.
Using an NVMe SSD correctly really means that you have to put a lot of concurrency on the disk; but again, similar to S3, every single round trip takes hundreds of microseconds, though you can drive a lot of bandwidth.
Old storage engines have not been designed for that.
You have to design for that from the day that you write the first line of code; otherwise, it takes a very long time to retrofit.
It happens to be that that exact
usage pattern is also what's
very good for S3.
NVMe SSDs were not available in
the cloud until 2017, 2018.
So, in database terms, this is relatively new.
The second thing that needed to happen: S3 was not consistent until December of 2020.
I think this is the most counterintuitive to most people, because most of us just think that it always has been, but it hasn't.
What that means is that if you put an object on S3 and then read it immediately after, you were, as of December 2020, guaranteed read-after-write consistency.
The third thing that needed to happen, and this is very informed by the fact that I was on call at Shopify for so long:
When you're on call for a long
time, you gravitate towards very
simple systems and ideally systems
that are constantly tested
on their resiliency.
So you don't get paged when something
abnormal happens.
Nikolay: Yeah.
Simon: And for us, what was very important, for Justin and me to be on call for a database, was that it only had one dependency.
And that dependency could only be one of the most reliable systems on earth, which is S3, or Google Cloud Storage, right?
And the other derivatives like
Azure blob storage and so on.
They're very, very, very reliable
systems.
But you could not build the metadata layer on top, right?
So Snowflake and Databricks and others that have built on top of this in the last generation of new databases needed another metadata layer, some consensus layer like FoundationDB or their own Paxos or Raft protocol, to essentially enforce the read-after-write consistency, but also to do various metadata operations atomically.
But in late 2024, at re:Invent, S3 finally announced compare-and-swap.
And what compare-and-swap allows you to do is to put a file, let's say metadata.json; you download the file, you do some modifications to it, and then you upload it again with a version number, and it's only uploaded if the version number is the same.
Basically guaranteeing that you did an atomic update and nothing has changed in the interim.
Very important when you're building distributed systems, right?
You can really implement anything on top of that, as long as you're willing to take the performance hit of going back and forth to S3.
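A sketch of that loop with boto3, assuming the conditional-write support (If-Match on an ETag) that S3 announced at re:Invent 2024; error handling is simplified, and bucket/key names are illustrative:

```python
import json

import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")

def cas_update(bucket: str, key: str, mutate) -> None:
    """Read-modify-write a JSON object, retrying if we lose the race."""
    while True:
        obj = s3.get_object(Bucket=bucket, Key=key)
        etag = obj["ETag"]                  # version tag of what we read
        doc = json.loads(obj["Body"].read())
        mutate(doc)                         # apply our modification
        try:
            # Succeeds only if the object is still exactly what we read.
            s3.put_object(Bucket=bucket, Key=key,
                          Body=json.dumps(doc).encode(), IfMatch=etag)
            return
        except ClientError as e:
            if e.response["Error"]["Code"] != "PreconditionFailed":
                raise                       # real failure, not a lost race
            # Someone else won the compare-and-swap; retry with fresh state.

# cas_update("my-bucket", "metadata.json", lambda d: d.update(gen=d["gen"] + 1))
```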
And of course, they have a whole metadata, Paxos, whatever thing
to implement that.
In GCS, it's Spanner, but I don't have to worry about that.
That's for them to formally verify and whatever they need to
do to uphold those constraints.
Those were the 3 things that needed to happen.
And that is requirement number one to build a new database.
And so that's what was in the air for turbopuffer to grab.
The second thing that you need for a new database is that you
need a new workload that's begging for that particular storage
architecture.
So for Snowflake and Databricks, we saw that in: well, we want big-scale analytics, and it doesn't make sense at the dollar-per-gigabyte that we have to pay in operational databases, adding indexes on everything.
So there was a new OLAP workload, or at least an acceleration of the OLAP workload with all the successful web applications, and then the new storage architecture on top of S3, but they had to do some more work, because the APIs that I just mentioned weren't available.
And a new workload now is connecting lots of data to AI.
And this storage architecture is a very good fit for it.
Nikolay: Yeah, that's great.
Thank you for elaborating on these 3 changes at S3.
But they also made another change recently: they announced S3 Vectors, right?
What do you think about it compared to your system?
Simon: Yeah, I think that if you are writing the vectors once and you don't query them that much, and you don't care that much about the query latency, S3 Vectors can be a useful product, in the same way that you might be using S3 today.
But S3 Vectors doesn't do full-text search, right?
It doesn't do lots of these features that you need for a serious system.
And even S3 Vectors recommends that you load into OpenSearch.
So for archival vectors that are still operational, this can make sense.
But there are lots of limitations that would make it very difficult for it to go into production systems, right?
If you do a query to S3 Vectors, it takes hundreds and hundreds of milliseconds to get the result back, whereas with turbopuffer, you can get the result back in less than 10 milliseconds.
Nikolay: Thanks to cache, right?
Simon: Thanks to cache.
Nikolay: Yeah.
Yeah, that totally makes sense.
I still have skeptical questions, more of them, if you don't mind.
Simon: Let's go.
Nikolay: Yeah, one of them is: your cache layer is on local NVMe drives, right?
Yeah.
But why?
Like, we could store it there.
PlanetScale recently came to the Postgres ecosystem, and they said: let's stop having fears about using local NVMe drives; yes, it's ephemeral storage and so on, but there is reliable failover and so on.
Let's do it super fast.
Yes, I agree, super fast.
And 3 or 4 terabytes for a billion vectors probably won't cost too much, because in AWS, for example, on the instances which have local NVMe, this price is basically included.
And the limit is many, many dozens
of terabytes of local NVMe
storage these days on larger instances.
So we are fine to store not only vectors, but everything else.
So my question is, going back from S3 to MySQL, for example: MySQL has supported storage engines for many years.
Have you considered building a storage engine for MySQL, for example, and using local NVMe?
Simon: You can't outrun the economics, right?
The economics are still the same.
You have to replicate, whether it's local NVMe or not, which is not necessarily cheaper, maybe only marginally, than a network volume.
You still have to replicate it
3 times, right?
You still have to put it on 3 machines.
You still need all the cores and
memory and so on that you would
need on the primary to keep up,
unless you start to have a heterogeneous
replica stack, which for a variety
of reasons would be a really
bad idea.
So you're still paying for all of that.
Now up to a certain point that
makes a lot of sense right.
If I have a customer who gets on
a call with our sales team and
they have a couple million vectors
in pgvector there is no reason
to move off of it.
That is perfect.
You should not be taking on the
complexity of ETLs and so on.
But if you have like tens of terabytes
of vector data, it is
not economical for a lot of businesses.
Now for some businesses it is, right?
But the art of a business is to
earn a return on the underlying
costs.
And for some businesses, it's very,
very challenging to earn
a return on storing this on 3 replicas,
this vector data, it's
generally not valuable enough to
the business.
So turbopuffer doesn't make sense as a storage engine in a MySQL or in a Postgres.
It's just fundamentally not compatible
with the way that we do
things outside of replica chains,
right?
You could maybe come up with a
storage engine where you page
everything into S3 and all of that,
but you're now trying to
build a new database inside of
an extremely old database, right?
The storage layer, especially, and more so in Postgres I think than in MySQL, is very, very weaved into how the query planner works and all of that.
So at that point, you're rebuilding
the query planner to be very
round trip sensitive, you're rebuilding
a storage layer.
It doesn't make sense anymore.
It's a completely different database.
Nikolay: Okay.
Another skeptical question.
I go to the turbopuffer website, and, by the way, great design, very popular these days, with the monospace font and this type of graphics, and you advertise 1 billion scale, 1 billion vectors or 1 billion documents.
But there, if I check 1 billion, the concept of namespaces pops up.
And namespaces... if I choose a hundred million, it's okay, I can have one namespace; but if it's 1 billion, there is a warning that probably you should split it into 10 namespaces.
And this means 10 different indexes,
right?
Simon: That's right.
Nikolay: Yeah.
So it's not an actual 1 billion scale, or something is off here in my mind.
Like, 1 billion should be a single index.
If we talk about 1 billion but divided into 10 collections, and indexes as well, it's already a different story.
So can you explain this to me?
I saw this in the beginning, and I'm happy that in this position I can ask the founder himself about it.
Simon: Yeah, we try to be extremely
transparent in our limits,
right?
And I think your mental model is
correct.
Before this, we just had a limit
that said that we could do in
a single namespace around 250 million
vectors, right?
But I mean, even that is a simplification
because how big are
the vectors?
If they're 128 dimensions, they take a lot less space, which is ultimately what matters here, which is why we also put the gigabytes.
So in the past, when we had just 250 million vectors on there as a limit, people came to us and said, or we knew, that people weren't testing with us, because they wanted to search a billion at once.
And they didn't realize that you
could just do id % N and
then you could do a billion and
it would be very economical for
them.
So we sort of had to put in the
docs like, yes, you can do a
billion at once, but you have to
shard it.
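A sketch of that client-side sharding; query_namespace below is a placeholder for whatever per-namespace query call you use, not a real turbopuffer API, and the naming scheme is illustrative:

```python
import heapq

N = 10  # split a billion documents across 10 namespaces

def namespace_for(doc_id: int) -> str:
    """Writes route to a shard by id % N."""
    return f"my-corpus-{doc_id % N}"

def query_namespace(name: str, query_vector, k: int):
    """Placeholder for your per-namespace query call; assumed to return
    a list of {"id": ..., "dist": ...} hits. Not a real turbopuffer API."""
    raise NotImplementedError

def search_all(query_vector, k: int = 10):
    """Fan the query out to every shard and merge the top-k by distance."""
    hits = []
    for shard in range(N):
        hits.extend(query_namespace(f"my-corpus-{shard}", query_vector, k))
    return heapq.nsmallest(k, hits, key=lambda hit: hit["dist"])
```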
Now I would love to handle that
sharding for you, right?
I mean that's what Elasticsearch
does and it's what a lot of
databases do because the only way
to scale any database is sharding,
right?
You don't get around it.
The question is where the complexity
lives, right?
Does the complexity live inside of the database, which handles it for you?
Some of the most sophisticated
sharding you will find lives inside
of Cockroach and Spanner and these
kinds of systems.
And the simplest type of sharding
is what we're exposing, where
every single shard is just a directory
on S3, and you can put
as many of them as you want and
you can query as many of them
as you want.
Now over time, of course, we need
to create an orchestration
layer on top of that so that a
logical namespace to the user
is actually multiple namespaces
underneath.
But we're challenging ourselves to make every individual namespace as large as it possibly can be.
When I ran an Elasticsearch cluster,
or was involved in scaling
Elasticsearch clusters, every shard
was around 50 gigabytes.
That was roughly what was recommended
for Elasticsearch.
Nikolay: Quite small, right?
Simon: It's quite small, right?
Like on Postgres it's a lot larger
than that.
It's basically the size of the
machine.
But the problem with a small shard size is that for something like a B-tree, right, it's obviously log n, right, the number of searches you have to do.
And if you have log n and n is very high, well, log n is still a manageable number of operations to do.
But with shards, there's sort of an m log n, and m is very high if the shard size is small and n is small.
So you're doing a lot more operations.
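The shard-size arithmetic, spelled out for a billion items:

```python
import math

n_total = 1_000_000_000
one_big_index = math.log2(n_total)           # ~30 steps in a single B-tree

m = 1000                                     # a thousand small shards
many_small = m * math.log2(n_total // m)     # ~20,000 steps in total
print(round(one_big_index), round(many_small))  # 30 vs 19932
```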
So we want the shards to be as large as possible, because of course you can get to a billion by just doing, you know, a thousand 1-million-vector shards, but that's incredibly computationally inefficient.
So we have shards now for some
users that we're testing that
do almost a billion documents at
once, right?
But it requires some careful tuning
on our end.
So we wanna push that number as
high as possible.
The other thing with namespaces
is that turbopuffer is a multi-tenant
system and we have to bin pack
these namespaces.
So if a namespace is 5 terabytes
large, it's much harder for
us to bin pack on node sizes that make sense.
So we have to walk some threshold there, where we try to make the shards as large as possible from the logical data, but we have to bin pack, and we have to index without slowing the indexing down, because the larger an ANN index gets, the harder it is to maintain the index.
So those are the constraints we walk.
So over time, we update these limits continuously.
You will see that shard size increase.
But you're not going to find anyone who's doing 100 billion vectors in a single index on a single machine, a.k.a. m is 1 and n is a hundred billion; but we want to get as high as possible, because that is the most computationally efficient.
That's how we get our users the best prices.
And we see the most ambitious use cases.
Nikolay: And are shards here actual shards or partitions?
Shards, meaning that different compute nodes are used behind the scenes: if it's 10 namespaces, is it 10 shards, 10 different compute nodes?
Simon: So a namespace for us, and this is also why we've chosen a different name for it, right: a namespace to us is a directory on S3, and which compute node that goes to is essentially just a consistent hash of the name of the prefix and the user ID, right?
Nikolay: flexible.
I see.
Simon: Yes.
And that's also part of why we can get really good economics.
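A minimal consistent-hashing sketch of that placement idea; the hashing details here are illustrative, not turbopuffer's actual scheme:

```python
import bisect
import hashlib

def h(s: str) -> int:
    return int.from_bytes(hashlib.sha256(s.encode()).digest()[:8], "big")

class Ring:
    """Namespaces map to points on a circle and are owned by the next
    node clockwise; adding a node only moves namespaces in its slice."""

    def __init__(self, nodes, vnodes: int = 64):
        self.points = sorted(
            (h(f"{n}#{i}"), n) for n in nodes for i in range(vnodes)
        )
        self.keys = [p for p, _ in self.points]

    def node_for(self, user_id: str, namespace: str) -> str:
        i = bisect.bisect(self.keys, h(f"{user_id}/{namespace}"))
        return self.points[i % len(self.points)][1]

ring = Ring(["node-1", "node-2", "node-3"])
print(ring.node_for("user-42", "my-corpus-7"))
```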
Nikolay: Yeah.
I wanted to shout out again: the front page is beautiful.
It explains the basics, the architecture and pricing right there, latencies right there, and case studies.
This is how it should be, right?
For engineers, you know, so you quickly see all the numbers and understand.
Thank you for transparency.
That's great.
Simon: I think it just comes from the fact that I was on, you know, your side of the table my whole career, right?
I've been buying databases, evaluating databases, and sometimes I swear I end up on a database website and I don't know if they're marketing a sneaker or a database.
Nikolay: Yeah, go figure out storage or compute price in AWS
or GCP.
It's like always a task.
Simon: So we really wanted to put our best foot forward: the diagram is right there.
Okay.
This is my mental model.
This is what it does.
This is what it costs.
This is how fast it is.
These are the limits.
These are the customers and how
they use it.
You go to the documentation.
We talk about guarantees.
We talk about trade-offs.
The kinds of things that I always
look for immediately to slot
it into my model.
Because I don't want you to use
our database at all costs.
I want you to use it if it makes
sense for you.
I want it to be a competitive advantage
to you.
I want to save you like millions
of dollars a year and in order
for that to make sense You have
to just put all the bullshit
aside and make it very clear what
you're good at and what you're
not good at
Nikolay: What about, for example... I know there is vector search and full-text search support to try.
But what about additional filtering, when we have some dimensions we want to filter on?
Is it possible?
Yes.
For example, some categories, or time ranges, or something: it's all possible, right?
In both types of search in turbopuffer.
Simon: I'll go back to my line from before, that every successful database eventually supports every query, right?
And we're, I don't know, maybe 5% of the way there.
So we support filters, and filtering on vector search is actually very difficult, right?
Because if you do something like, let's say I'm searching for banana, and it's an e-commerce site, and I have to filter for ships-to-Canada: well, that might cut off a fruit cluster, which is actually where I want to be, and then, you know, I get to the banana-on-a-t-shirt cluster, but it's really far away.
So you have to do very sophisticated
filtering to get good results.
I don't know what pgvector does,
but I know that our query planner
is recall aware.
It does a lot of statistics on the distribution of where the vectors are, to ensure that we have high recall.
And for a percentage of queries in production, we check the recall, the accuracy of the index, against an exhaustive search.
I don't know if pgvector does this,
but I know it's been a monumental
effort for us to
Nikolay: do it.
It's a big headache there.
It's a big headache.
Actually, it's a good point that additional filtering can change the semantics of the search.
If it's searching for bananas with size equals nano, it's very different.
Michael: One thing they've implemented relatively recently is iterative index scans.
I think they had a problem where, for example, you'd do a similarity search with a limit and a filter, and you'd ask for a hundred results and get back fewer, even though more existed.
So now they go back to the index and request more.
It's a workaround, I'd say, more than a solution, but it's pretty effective for a lot of these cases.
Simon: I think the problem there is that I would be very skeptical of the recall in a case like that, right?
Because, okay, you got 100, but you probably should have looked at a lot more data to know what the true top-k was.
So, you know, that solution was
not acceptable to us.
Like we had to go much, much further
to ensure high recall.
And I think the problem with a
lot of these solutions is that
people are not evaluating their
recall because it's actually
very difficult to do.
And it's very computationally expensive.
You're not gonna have your production
Postgres run an exhaustive
search on millions of vectors.
It's too dangerous to do that while
you're doing your transactional
workloads.
But again, if you're hitting these
kinds of problems, then it
may be time to consider maybe for
a subset of your workloads,
whether something more specialized
makes sense, whether the trade-offs
make sense to you.
But I think with the pre-filtering and post-filtering, it's very challenging to create a query planner that can do that well.
So we support all the queries, right?
You can do range queries.
You can do exact queries.
You can operate with arrays.
You can do all types of queries,
set intersections that people
use for permissions.
We can do full-text search queries,
we can do GROUP BYs, we
can do simple aggregates.
So we can do more and more queries
and we're constantly expanding
that with what our users demand
from the system that we've built.
Nikolay: One more annoying question.
I know it's not open source.
There is no free version at all.
I'm pretty sure it was a decision not to go with an open-core or open-source model at all.
Can you explain why?
Simon: Yeah, I think there's never been any particular resistance to open source.
I mean, I love open source.
The reason we're not open source
is because open source is, if
you want to do it well, it's a
lot of effort.
And it's also a lot of effort to
build a company.
It's a lot of effort to find product
market fit.
And we decided to pour all of our
energy into that.
And it's a similar argument for freemium.
It's really the only thing that a startup has over, you know, the big incumbents: focus.
And so our focus has been on the
customers that are willing to
pay.
And one day I would love to give a free tier.
I would love for people's blogs
to run on turbopuffer.
But for now, I'm afraid that we
have to prioritize the customers
that are willing to put some money
behind it.
It's not because we have provisioned
hardware or anything like
that.
It's really just that we need to
know that you're willing to
put some money behind it so we
can give you amazing support,
right?
A lot of the times there'll be
engineers working on the database
who are supporting your tickets
and we can't do that with a free
tier.
Nikolay: Yeah.
Makes sense.
Yeah.
And Michael and I have different points of view on this area.
Michael: I think that was a good
answer.
I'm on your side, Simon.
I had a question on the full text
search and semantic search.
We had a discussion a while back
about hybrid, I think it's generally
called hybrid search.
Are you seeing much of that where
people are wanting to kind
of mix the results and order between
them?
Simon: Yeah, we do.
And what we see is that the embedding
models are pretty phenomenal
at finding good results for subjects
that they know about, right?
And terms that they know about.
You know, you run a project called pgMustard, right?
And maybe the embedding model doesn't know it.
I think it's popular enough that it will know what it is, but let's say it didn't know.
And then it puts it in the cluster close to ketchup.
And the results are just horrible.
And full-text search tends to be really good at this: recall for things that the embedding model can't know about.
If you're searching for an LG, you know, these TV SKUs, it's, you know, indecipherable.
Actually, the embedding models are quite good at these, because they appear enough on the web.
But imagine it wasn't these kinds of searches it's good for.
The other thing is search-as-you-type.
So if I'm searching for "si", the embedding model might find that, well, he's searching for something agreeable in Spanish, right?
But really what I wanted was the document that starts with Simon.
So sometimes you just need to complement these two.
And I think that's why we've doubled down on the full-text search implementation in turbopuffer, to do this hybrid.
But people can get a lot of mileage out of embedding models alone.
Michael: Wonderful.
Nikolay: Yeah, thank you so much.
Yeah, it was super interesting.
Good luck with your system.
And yeah, I think it's an interesting idea, I guess, for Postgres users who need, as you said, to store more and have all the characteristics turbopuffer has.
Maybe it's worth considering, right,
to keep all OLTP workloads
and data in regular Postgres while
moving vectors to turbopuffer,
right?
Simon: I mean, it's just similar to how people have taken workloads out of Postgres.
Updating a posting list is very expensive in an operational, transactional store.
And similarly, updating an index with the same kinds of ACID semantics that Postgres upholds is very expensive.
And I've ripped full-text search
out of Postgres many times because
it's very, very expensive to do.
So we do that because we don't want to shard Postgres, because it's a lot of work.
And so we start by moving some of these workloads out first, and search is one of the early ones to go.
Nikolay: So your point is that
it probably postpones the moment
when you need to shard.
Simon: Yeah, same reason as for memcached and Redis, right?
You separate out these workloads as soon as possible to avoid sharding.
Kick the can.
Nikolay: Interesting idea.
Okay, again, thank you so much
for coming.
It was a super interesting discussion.
I enjoyed it a lot.
Simon: Thank you so much.
Michael: Yeah, really nice to meet
you.
Thanks for joining us.