We hosted one of our first community/contributor calls a few hours ago, discussing plans for the upcoming 1.0 release and integration with UCall, UStore, and UForm - our other FOSS libraries. Please don't hesitate to reach out with any questions or feature requests - now is the best time :)
I have a general vector retrieval question, if you have time to humor me. Suppose I have 10 features per document, each with an embedding. Is it possible to retrieve the document with the highest average embedding score across its features? The only approach I can think of is retrieving the top 1k results across each feature to generate a candidate set, then recomputing full scores for each document.
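In pseudo-NumPy, I'm imagining something like this (just a sketch on my side; the one-index-per-feature layout, the `stored` per-feature matrices, and the `.keys` attribute on search results are my assumptions, loosely modeled on a USearch-like interface):

    import numpy as np

    def search_by_average(query_embeddings, indexes, stored, k_candidates=1000, k_final=10):
        # 1) Candidate generation: union of top-k hits across all feature indexes
        candidates = set()
        for feature_id, q in enumerate(query_embeddings):
            matches = indexes[feature_id].search(q, k_candidates)
            candidates.update(int(key) for key in matches.keys)

        # 2) Exact re-scoring: average the per-feature scores for every candidate
        candidates = np.array(sorted(candidates))
        scores = np.zeros(len(candidates))
        for feature_id, q in enumerate(query_embeddings):
            scores += stored[feature_id][candidates] @ q
        scores /= len(query_embeddings)

        top = np.argsort(-scores)[:k_final]
        return candidates[top], scores[top]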
Yes, and yes. The last one may be a bit trickier through Python bindings today, but I can easily include that in the next release… shouldn’t take more than 50 LOC.
> Yes, and yes. The last one may be a bit trickier through Python bindings today, but I can easily include that in the next release… shouldn’t take more than 50 LOC.
Appreciate it. That'd be game-changing for me. The ultimate thing I'd like to do is actually use a function of the form score = a*f1(embedding1) + b*f2(embedding2) + ...
That way you could make adjustments like ignoring feature1 unless its score passes a threshold. I'll try looking at Numba to see if that's possible.
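For the record, here's the kind of thing I'd try with Numba (a hypothetical sketch; it assumes each stored vector is the per-feature embeddings concatenated, just two features here to keep it short, and that the index can accept a compiled metric by function pointer - which is exactly the part that may need the next release):

    from numba import cfunc, types, carray

    DIM = 128         # dimensions per feature embedding (assumed)
    A, B = 1.0, 0.5   # the a, b weights from the formula above
    THRESHOLD = 0.3   # ignore feature1 unless its score passes this

    @cfunc(types.float32(types.CPointer(types.float32), types.CPointer(types.float32)))
    def weighted_score(x_ptr, y_ptr):
        # Each stored vector is the two feature embeddings concatenated: [f1 | f2]
        x = carray(x_ptr, 2 * DIM)
        y = carray(y_ptr, 2 * DIM)
        s1 = 0.0
        s2 = 0.0
        for i in range(DIM):
            s1 += x[i] * y[i]
            s2 += x[DIM + i] * y[DIM + i]
        score = B * s2
        if s1 >= THRESHOLD:      # feature1 only counts above the threshold
            score += A * s1
        # HNSW wants a distance (smaller is better), so negate the similarity
        return -score

    # weighted_score.address is the function pointer a library could call as a
    # custom metric, if/when the Python bindings expose such a hook.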
We don't use BLAS. Why? BLAS helps with matrix-matrix multiplications, if you feel lazy and don't want to write the matrix tiling code manually.
They bring essentially nothing of value in vector-vector operations, as compilers can properly auto-vectorize simple dot products... Moreover, they generally only target single and double precision, while we often prefer half or quarter precision. All in all, a meaningless dependency.
What do we use? I wrote a tiny package called SimSIMD. Its idea is to utilize less common SIMD instructions, especially in mixed-type computations, that are hard for compilers to optimize. It was also a fun exercise to evaluate the performance of the new SVE instructions on recent Arm CPUs, like the Graviton 3. You can find the code, the benchmarks, and the results in the repo: https://github.com/ashvardanian/simsimd
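A small usage sketch (the `cosine` entry point is from recent releases and may be named differently depending on the version you install):

    import numpy as np
    import simsimd  # pip install simsimd

    # Half-precision inputs, which plain BLAS typically won't touch
    a = np.random.rand(1536).astype(np.float16)
    b = np.random.rand(1536).astype(np.float16)

    # Cosine distance via SimSIMD's SIMD kernels
    fast = simsimd.cosine(a, b)

    # Same thing in NumPy, upcasting to f32 for a reference value
    a32, b32 = a.astype(np.float32), b.astype(np.float32)
    slow = 1.0 - a32 @ b32 / (np.linalg.norm(a32) * np.linalg.norm(b32))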
If you install from the default PyPI repository, it comes precompiled, but it can still be ad-hoc accelerated with JIT-ed metrics. Either way, it should have decent performance. Still, if you wanna push the limits and work with multi-terabyte indexes on one node, recompiling locally should help.
Are folks typically using HNSW for vector search these days? I thought maybe ScaNN has proven to be better? Especially since it's available in FAISS [2].
Depends... I have a beef with all methods based on "trained quantization". It introduces too much noise into your distribution, suffers from drift, and makes the method mostly inapplicable to other forms of "Similarity Search" that don't strictly fall into the "Vector Search" category.
Many disagree. Pick whatever floats your boat; there is a FOSS library for almost everything these days :)
Ah, those are interesting considerations. I don't have a horse in that race, I just had to implement a similarity search algorithm a few years ago and it was surprisingly difficult to find a consensus on what ANN algo to use!
Yeah, SPANN has better F1 and queries-per-second numbers on some benchmarks, but that's a little like comparing sorting algorithms: they're both fast and good.
The database software behind the ANN algo is probably a little more important in practice than the ANN algo itself, unless you're operating at such scale and speed that it's an actual issue (e.g. you're Google).
Differences between algorithms are a little more interesting when they let you do something totally different, like minimizing the speed hit from doing searches on disk (SPTAG, DiskANN).
Sure! You can pass exact=True to the search interface.
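Roughly like this (a sketch against the current Python bindings; argument names may still shift before 1.0):

    import numpy as np
    from usearch.index import Index

    index = Index(ndim=256, metric="cos")
    vectors = np.random.rand(1000, 256).astype(np.float32)
    index.add(np.arange(len(vectors)), vectors)

    query = np.random.rand(256).astype(np.float32)
    approx = index.search(query, 10)              # regular HNSW traversal
    exact = index.search(query, 10, exact=True)   # brute-force over all stored vectors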
> Adding to index appears to be very slow.
Interesting. Can you please elaborate? We benchmark it on a daily basis, but there is always a chance we forgot some corner case :)
PS: Thanks for considering us! USearch is already used in production by a few companies (small and very large), and we would be happy to assist with integration!
PS2: The argument name inconsistency is solved on the main-dev branch, and will be released with a bunch of major changes in 1.0 this week.
I was going to ask the same. That is a really important feature for replacing traditional indexes, and it's usually poorly implemented in vector search libraries.
In the low-level C++ interface we already support arbitrary predicates (callbacks) evaluated during HNSW graph traversal. JIT-ing them from the Python level is a bit trickier, but we will consider that, if there is demand.
§ Supporting advanced filtering with USearch
We are now in the process of building a bridge between USearch and UStore, that would allow combining Vector Search with a proper Multi-Modal database. This will solve your problem, but will take some time to get it right. Feel free to contribute :)
sqlite-vss is an SQLite extension. Such things are often built on top of libraries like FAISS or USearch. It is just a matter of how many layers of abstraction you want to pay for… performance-wise.
If you already use some DBMS to store your data, an extension can be a good place to start. Once you scale and want to tune… switch to using the underlying engine directly.
Slightly offtopic, but I'm currently working on a video similarity search tool, and the vectors I'm using are pretty big (over 2M dimensions per vector). This is quite different from the normal vector size of maybe 10k max.
Currently I'm using Annoy (mostly because it's what I've used before) but I am a bit worried that this is well outside what it has been designed for.
Has anyone got specific advice for things I should try? I've used FAISS previously, but it seems to target the same design space.
We have built a few video-search systems by now, using USearch and UForm for embedding. The embeddings are only 256-dimensional, and you can concatenate a few from different parts of the video. Any chance that would help?
Have you considered training a shallow MLP autoencoder, perhaps with tied weights between the encoder and the decoder to reduce the dimensionality of the embeddings? Another (IMO, better) approach I can think of off the top of my head, would be to use a semi-supervised contrastive learning approach, with labelled similar and dissimilar video pairs, like in this notebook[1] from OpenAI.
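To make the tied-weights idea concrete, a minimal PyTorch sketch (dimensions shrunk for illustration; at 2M input dims the encoder matrix alone would already be several GB):

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class TiedAutoencoder(nn.Module):
        """Shallow autoencoder whose decoder reuses the encoder weights transposed."""
        def __init__(self, in_dim: int, code_dim: int):
            super().__init__()
            self.enc = nn.Linear(in_dim, code_dim, bias=True)
            self.dec_bias = nn.Parameter(torch.zeros(in_dim))

        def encode(self, x):
            return F.relu(self.enc(x))

        def forward(self, x):
            z = self.encode(x)
            # Tied weights: decode with the transpose of the encoder matrix
            return F.linear(z, self.enc.weight.t(), self.dec_bias)

    # Train with a plain reconstruction loss, then keep only encode()
    # to shrink the video vectors before indexing.
    model = TiedAutoencoder(in_dim=16_384, code_dim=256)
    x = torch.randn(8, 16_384)
    loss = F.mse_loss(model(x), x)
    loss.backward()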
Reading the docs of this library it seems like I should try it, especially since it has built-in downcasting to save space on the indexes (which is rapidly turning into a big problem for me!)
This seems impractical: it's likely that the data is highly redundant, and you'd probably do just as well by picking a random projection to a much smaller subspace (or simply subsampling the dimensions at random, or summing dimensions together, stuff like that) rather than spending the compute to learn a projection via SVD or some such. Hubness might be a significant problem as well and lead to search results not matching your intent. Also, numeric problems (e.g., if you were accumulating distance in floating point) would become an issue with millions of dimensions unless the way distances are summed gets special treatment (like Kahan summation, or reduction trees to sum values of roughly equal expected magnitude, etc.); x += dist[i] won't cut it.
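For instance, coordinate subsampling or pooling is nearly free (a rough NumPy sketch; sizes are just illustrative):

    import numpy as np

    rng = np.random.default_rng(0)
    full_dim, reduced_dim = 2_000_000, 4096

    # Pick one fixed random subset of coordinates and reuse it for every vector
    keep = rng.choice(full_dim, size=reduced_dim, replace=False)

    def subsample(v: np.ndarray) -> np.ndarray:
        return v[keep]

    # Alternative: sum adjacent dimensions together into `reduced_dim` buckets
    def pool(v: np.ndarray) -> np.ndarray:
        trimmed = v[: reduced_dim * (len(v) // reduced_dim)]
        return trimmed.reshape(reduced_dim, -1).sum(axis=1)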
Any kind of acceleration technique to limit the search to a subset of the database (such as cell-probe-ish methods like LSH or IVF, or graph-based methods, etc) would take a ton of time to compute. Simply storing all the data you need for search, even brute force, would rapidly explode, not to mention the compute required.
Most cases with such large vectors I've seen begin with highly sparse vectors. Certainly Faiss (I wrote the GPU side of Faiss), Annoy, and most similarity-search libraries out there are geared to dense vectors in the 20 - 2000ish dimension range: above the number of dimensions where exact methods such as BSP or k-D trees work well (in "high" dimensions your nearest neighbor is quite likely to end up on the other side of a dividing hyperplane), but below the point where simply storing the data uncompressed / unquantized / etc. is hard and the amount of compute is prohibitive as well.
How big is the data set (number of vectors) that you are searching among, and are you performing single queries or batch queries?
I'm only searching hundreds of vectors, and it seems to be working surprisingly well (astonishingly well, really!) - but I've only spent a few hours working on this and am far from having a proper measure of how good it is.
I'll try the smaller subspace ideas - seems like it'd be easy to try and could work.
> How big is the data set (number of vectors) that you are searching among
The biggest dataset I've tried so far is 400 vectors (it is searching for similar scenes in a sports broadcast).
> and are you performing single queries or batch queries?
single - choose a scene and it finds similar ones.
Thanks for Faiss BTW. I love love love it - long time member of the FB group you have. I think I switched to Annoy because I had a packaging issue some time back or something.
Train an autoencoder to reduce your vector dimensions down to something more workable. It’s unlikely you’ll be able to search against such enormous vectors in a reasonable amount of time anyways.
Another option is to shard your vectors into N pieces of length k, where N*k is the length of your vector. Since cosine similarity doesn't care about order, it will be fine. The only requirement is that the i-th shard can only be compared with the i-th shards of other vectors for similarity. The benefit of this approach is that it can be parallelized easily.
The autoencoder was what I was thinking too. I was hoping to avoid training anything though!
Search speed is good (although the datasize is pretty small - hundreds of vectors).
So for the sharding idea, I'd:
slice up the vector
compare each sub-vector with the corresponding sub-vectors from the other vectors to get the similarities
sum the similarities
choose the maximum
Is this the general idea? Is there an implementation of this you've seen?
If you’re looking for something out of the box, I can’t help you unfortunately. I do ML at a company where we have our own in-house tooling.
I agree with your algorithm; it is equivalent to cosine similarity. But you might also consider whether you need all the shards at all. Maybe you can get away with half or even a tenth of them. If you have some metric like NDCG, you can measure the drop in performance and consider the trade-off.
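A rough NumPy sketch of the whole thing (shard count and toy sizes are just for illustration; if you L2-normalize the vectors up front, the per-shard dot products sum to exactly the full-vector cosine similarity):

    import numpy as np

    def sharded_scores(query: np.ndarray, database: np.ndarray, n_shards: int) -> np.ndarray:
        # Each term of the sum only touches one shard, so shards can live on
        # different workers and be scored in parallel.
        q_parts = np.array_split(query, n_shards)
        db_parts = np.array_split(database, n_shards, axis=1)
        return sum(db @ q for q, db in zip(q_parts, db_parts))

    # Toy usage: ~400 scene vectors, normalized once so shard sums equal cosine
    db = np.random.rand(400, 20_000).astype(np.float32)
    db /= np.linalg.norm(db, axis=1, keepdims=True)
    query = db[7]
    scores = sharded_scores(query, db, n_shards=16)
    best = int(np.argmax(scores))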
On this page they have "space-filling curves" as an example in one of the images, but I haven't been able to find production systems that actually use space-filling curves for similarity search. Anyone have any tips?
I've been looking for good examples for my next book, but everything I've read about has turned out to be not used, including everything in Michael Bader's book [1], except for S2Geometry [2], but I feel like I need to talk to someone who has more expertise in the benefits/drawbacks of using that in an actual product.
Is view() for disk-based indexes doing something special over a plain mmap(), e.g. setting read-aheads based on knowledge of the internal structure to make it faster if done over the network?
I'm curious, is HNSW the only option? Do you support IVF-style indexes? Also, FAISS is nice because it supports a pluggable storage layer. Is this something that's easily supported in USearch?
In the vein of single-file databases, I've been enjoying DuckDB (from CWI Amsterdam) and am exploring Kùzu (from the database group at the University of Waterloo). DuckDB aims to be a SQLite for analytics (OLAP), while Kùzu is an analytics-focused graph database.
The fact that USearch has a WASM binding for frontend use (AND supports serialization) is very cool for client-side search/LLM applications!
How would I integrate this into a dense passage retriever workflow for RAG? I could not find any examples for document chunk ingestion and similarity query.
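Something along these lines is what I'm picturing (a hypothetical sketch on my end; sentence-transformers for the embedding step and the exact usearch Index calls are my assumptions, not an official example):

    import numpy as np
    from usearch.index import Index
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("all-MiniLM-L6-v2")   # 384-dim embeddings
    index = Index(ndim=384, metric="cos")

    # Ingestion: split documents into chunks, embed, add under integer keys
    chunks = ["first passage ...", "second passage ...", "third passage ..."]
    embeddings = model.encode(chunks, normalize_embeddings=True).astype(np.float32)
    index.add(np.arange(len(chunks)), embeddings)

    # Retrieval: embed the question, pull top-k chunks, paste them into the prompt
    question = "what does the second passage talk about?"
    q = model.encode([question], normalize_embeddings=True).astype(np.float32)[0]
    matches = index.search(q, 3)
    context = [chunks[int(key)] for key in matches.keys]   # .keys assumed on the result object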