We hosted one of our first community/contributor calls a few hours ago, discussing plans for the upcoming 1.0 release and integration with UCall, UStore, and UForm - our other FOSS libraries. Please don't hesitate to reach out with any questions or feature requests - now is the best time :)
I have a general vector retrieval question, if you have time to humor me. Suppose I have 10 features per document, each with an embedding. Is it possible to retrieve the document with the highest average embedding score across its features? The only approach I can think of is retrieving the top 1k results across each feature to generate a candidate set, then recomputing full scores for each document.
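In pseudo-NumPy, I'm imagining something like this (just a sketch on my side; the one-index-per-feature layout, the `stored` per-feature matrices, and the `.keys` attribute on search results are my assumptions, loosely modeled on a USearch-like interface):

    import numpy as np

    def search_by_average(query_embeddings, indexes, stored, k_candidates=1000, k_final=10):
        # 1) Candidate generation: union of top-k hits across all feature indexes
        candidates = set()
        for feature_id, q in enumerate(query_embeddings):
            matches = indexes[feature_id].search(q, k_candidates)
            candidates.update(int(key) for key in matches.keys)

        # 2) Exact re-scoring: average the per-feature scores for every candidate
        candidates = np.array(sorted(candidates))
        scores = np.zeros(len(candidates))
        for feature_id, q in enumerate(query_embeddings):
            scores += stored[feature_id][candidates] @ q
        scores /= len(query_embeddings)

        top = np.argsort(-scores)[:k_final]
        return candidates[top], scores[top]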
Yes, and yes. The last one may be a bit trickier through Python bindings today, but I can easily include that in the next release… shouldn’t take more than 50 LOC.
> Yes, and yes. The last one may be a bit trickier through Python bindings today, but I can easily include that in the next release… shouldn’t take more than 50 LOC.
Appreciate it. That'd be game-changing for me. The ultimate thing I'd like to do is actually use a function of the form score = a*f1(embedding1) + b*f2(embedding2) + ...
That way you could make adjustments like ignoring feature1 unless its score passes a threshold. I'll try looking at Numba to see if that's possible.
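For the record, here's the kind of thing I'd try with Numba (a hypothetical sketch; it assumes each stored vector is the per-feature embeddings concatenated, just two features here to keep it short, and that the index can accept a compiled metric by function pointer - which is exactly the part that may need the next release):

    from numba import cfunc, types, carray

    DIM = 128         # dimensions per feature embedding (assumed)
    A, B = 1.0, 0.5   # the a, b weights from the formula above
    THRESHOLD = 0.3   # ignore feature1 unless its score passes this

    @cfunc(types.float32(types.CPointer(types.float32), types.CPointer(types.float32)))
    def weighted_score(x_ptr, y_ptr):
        # Each stored vector is the two feature embeddings concatenated: [f1 | f2]
        x = carray(x_ptr, 2 * DIM)
        y = carray(y_ptr, 2 * DIM)
        s1 = 0.0
        s2 = 0.0
        for i in range(DIM):
            s1 += x[i] * y[i]
            s2 += x[DIM + i] * y[DIM + i]
        score = B * s2
        if s1 >= THRESHOLD:      # feature1 only counts above the threshold
            score += A * s1
        # HNSW wants a distance (smaller is better), so negate the similarity
        return -score

    # weighted_score.address is the function pointer a library could call as a
    # custom metric, if/when the Python bindings expose such a hook.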
We don't use BLAS. Why? BLAS helps with matrix-matrix multiplications, if you feel lazy and don't want to write the matrix tiling code manually.
They bring essentially nothing of value in vector-vector operations, as compilers can properly auto-vectorize simple dot products... Moreover, they generally only target single and double precision, while we often prefer half or quarter precision. All in all, a meaningless dependency.
What do we use? I wrote a tiny package called SimSIMD. Its idea is to utilize less common SIMD instructions, especially in mixed-type computations, that are hard for compilers to optimize. It was also a fun exercise to evaluate the performance of the new SVE instructions on recent Arm CPUs, like the Graviton 3. You can find the code, the benchmarks, and the results in the repo: https://github.com/ashvardanian/simsimd
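A small usage sketch (the `cosine` entry point is from recent releases and may be named differently depending on the version you install):

    import numpy as np
    import simsimd  # pip install simsimd

    # Half-precision inputs, which plain BLAS typically won't touch
    a = np.random.rand(1536).astype(np.float16)
    b = np.random.rand(1536).astype(np.float16)

    # Cosine distance via SimSIMD's SIMD kernels
    fast = simsimd.cosine(a, b)

    # Same thing in NumPy, upcasting to f32 for a reference value
    a32, b32 = a.astype(np.float32), b.astype(np.float32)
    slow = 1.0 - a32 @ b32 / (np.linalg.norm(a32) * np.linalg.norm(b32))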
If you install from the default PyPI repository, it comes precompiled, but it can still be ad-hoc accelerated with JIT-ed metrics. Either way, it should have decent performance. Still, if you wanna push the limits and work with multi-terabyte indexes on one node, recompiling locally should help.
Are folks typically using HNSW for vector search these days? I thought maybe ScaNN has proven to be better? Especially since it's available in FAISS [2].
Depends... I have a beef with all methods based on "trained quantization". It introduces too much noise into your distribution, suffers from drift, and makes the method mostly inapplicable to other forms of "Similarity Search" that don't strictly fall into the "Vector Search" category.
Many disagree. Pick whatever floats your boat; there is a FOSS library for almost everything these days :)
Ah, those are interesting considerations. I don't have a horse in that race, I just had to implement a similarity search algorithm a few years ago and it was surprisingly difficult to find a consensus on what ANN algo to use!
Yeah, SPANN has better F1 and queries-per-second numbers on some benchmarks, but that's a little like comparing sorting algorithms: they're both fast and good.
The database software behind the ANN algo is probably a little more important in practice than the ANN algo itself, unless you're operating at such scale and speed that it's an actual issue (e.g. you're Google).
Differences between algorithms are a little more interesting when they let you do something totally different, like minimizing the speed hit from doing searches on disk (SPTAG, DiskANN).
Sure! You can pass exact=True to the search interface.
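Roughly like this (a sketch against the current Python bindings; argument names may still shift before 1.0):

    import numpy as np
    from usearch.index import Index

    index = Index(ndim=256, metric="cos")
    vectors = np.random.rand(1000, 256).astype(np.float32)
    index.add(np.arange(len(vectors)), vectors)

    query = np.random.rand(256).astype(np.float32)
    approx = index.search(query, 10)              # regular HNSW traversal
    exact = index.search(query, 10, exact=True)   # brute-force over all stored vectors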
> Adding to index appears to be very slow.
Interesting. Can you please elaborate? We benchmark it on a daily basis, but there is always a chance we forgot some corner case :)
PS: Thanks for considering us! USearch is already used in production by a few companies (small and very large), and we would be happy to assist with integration!
PS2: The argument name inconsistency is solved on the main-dev branch, and will be released with a bunch of major changes in 1.0 this week.
I was going to ask the same. That is a really important feature for replacing traditional indexes, and it's usually poorly implemented in vector search libraries.
In the low-level C++ interface we already support arbitrary predicates (callbacks) evaluated during HNSW graph traversal. JIT-ing them from the Python level is a bit trickier, but we will consider that, if there is demand.
§ Supporting advanced filtering with USearch
We are now in the process of building a bridge between USearch and UStore, that would allow combining Vector Search with a proper Multi-Modal database. This will solve your problem, but will take some time to get it right. Feel free to contribute :)
sqlite-vss is an SQLite extension. Such things are often built on top of libraries like FAISS or USearch. It is just a matter of how many layers of abstraction you want to pay for… performance-wise.
If you already use some DBMS to store your data, an extension can be a good place to start. Once you scale and want to tune… switch to using the underlying engine directly.
Slightly offtopic, but I'm currently working on a video similarity search tool, and the vectors I'm using are pretty big (over 2M dimensions per vector). This is quite different from the normal vector size of maybe 10k max.
Currently I'm using Annoy (mostly because it's what I've used before) but I am a bit worried that this is well outside what it has been designed for.
Has anyone got specific advice for things I should try? I've used FAISS previously, but it seems to target the same design space.
We have built a few video-search systems by now, using USearch and UForm for embedding. The embeddings are only 256-dimensional, and you can concatenate a few from different parts of the video. Any chance that would help?
Have you considered training a shallow MLP autoencoder, perhaps with tied weights between the encoder and the decoder to reduce the dimensionality of the embeddings? Another (IMO, better) approach I can think of off the top of my head, would be to use a semi-supervised contrastive learning approach, with labelled similar and dissimilar video pairs, like in this notebook[1] from OpenAI.
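To make the tied-weights idea concrete, a minimal PyTorch sketch (dimensions shrunk for illustration; at 2M input dims the encoder matrix alone would already be several GB):

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class TiedAutoencoder(nn.Module):
        """Shallow autoencoder whose decoder reuses the encoder weights transposed."""
        def __init__(self, in_dim: int, code_dim: int):
            super().__init__()
            self.enc = nn.Linear(in_dim, code_dim, bias=True)
            self.dec_bias = nn.Parameter(torch.zeros(in_dim))

        def encode(self, x):
            return F.relu(self.enc(x))

        def forward(self, x):
            z = self.encode(x)
            # Tied weights: decode with the transpose of the encoder matrix
            return F.linear(z, self.enc.weight.t(), self.dec_bias)

    # Train with a plain reconstruction loss, then keep only encode()
    # to shrink the video vectors before indexing.
    model = TiedAutoencoder(in_dim=16_384, code_dim=256)
    x = torch.randn(8, 16_384)
    loss = F.mse_loss(model(x), x)
    loss.backward()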
Reading the docs of this library it seems like I should try it, especially since it has built-in downcasting to save space on the indexes (which is rapidly turning into a big problem for me!)
This seems impractical: it's likely that the data is highly redundant, and you'd probably do just as well by picking a random projection to a much smaller subspace (or simply subsampling the dimensions at random, or summing dimensions together, stuff like that) rather than spending the compute to learn a projection via SVD or some such. Hubness might be a significant problem as well and lead to search results not matching your intent. Also, numeric problems (e.g., if you were accumulating distance in floating point) would become an issue with millions of dimensions unless the way distances are summed gets special treatment (like Kahan summation, or reduction trees to sum values of roughly equal expected magnitude, etc.); x += dist[i] won't cut it.
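For instance, coordinate subsampling or pooling is nearly free (a rough NumPy sketch; sizes are just illustrative):

    import numpy as np

    rng = np.random.default_rng(0)
    full_dim, reduced_dim = 2_000_000, 4096

    # Pick one fixed random subset of coordinates and reuse it for every vector
    keep = rng.choice(full_dim, size=reduced_dim, replace=False)

    def subsample(v: np.ndarray) -> np.ndarray:
        return v[keep]

    # Alternative: sum adjacent dimensions together into `reduced_dim` buckets
    def pool(v: np.ndarray) -> np.ndarray:
        trimmed = v[: reduced_dim * (len(v) // reduced_dim)]
        return trimmed.reshape(reduced_dim, -1).sum(axis=1)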
Any kind of acceleration technique to limit the search to a subset of the database (such as cell-probe-ish methods like LSH or IVF, or graph-based methods, etc) would take a ton of time to compute. Simply storing all the data you need for search, even brute force, would rapidly explode, not to mention the compute required.
Most cases with such large vectors I've seen begin with highly sparse vectors. Certainly Faiss (I wrote the GPU side of Faiss), Annoy, and most similarity-search libraries out there are geared to dense vectors in the 20 - 2000ish dimension range: above the number of dimensions where exact methods such as BSP or k-D trees work well (in "high" dimensions your nearest neighbor is quite likely to end up on the other side of a dividing hyperplane), but below the point where simply storing the data uncompressed / unquantized / etc. is hard and the amount of compute is prohibitive as well.
How big is the data set (number of vectors) that you are searching among, and are you performing single queries or batch queries?
I'm only searching hundreds of vectors, and it seems to be working surprisingly well (astonishingly well, really!) - but I've only spent a few hours working on this and am far from having a proper measure of how good it is.
I'll try the smaller subspace ideas - seems like it'd be easy to try and could work.
> How big is the data set (number of vectors) that you are searching among
The biggest dataset I've tried so far is 400 vectors (it is searching for similar scenes in a sports broadcast).
> and are you performing single queries or batch queries?
single - choose a scene and it finds similar ones.
Thanks for Faiss BTW. I love love love it - long time member of the FB group you have. I think I switched to Annoy because I had a packaging issue some time back or something.
Train an autoencoder to reduce your vector dimensions down to something more workable. It’s unlikely you’ll be able to search against such enormous vectors in a reasonable amount of time anyways.
Another option is to shard your vectors into N pieces of length k, where N*k is the length of your vector. Since cosine similarity doesn't care about order, it will be fine. The only requirement is that the i-th shard can only be compared with the i-th shards of other vectors for similarity. The benefit of this approach is that it can be parallelized easily.
The autoencoder was what I was thinking too. I was hoping to avoid training anything though!
Search speed is good (although the datasize is pretty small - hundreds of vectors).
So for the sharding idea, I'd:
slice up the vector
compare each sub-vector with the corresponding sub-vectors from the other vectors to get the similarities
sum the similarities
choose the maximum
Is this the general idea? Is there an implementation of this you've seen?
If you’re looking for something out of the box, I can’t help you unfortunately. I do ML at a company where we have our own in-house tooling.
I agree with your algorithm; it is equivalent to cosine similarity. But you might also consider whether you need all the shards at all. Maybe you can get away with half or even a tenth of them. If you have some metric like NDCG, you can measure the drop in performance and consider the trade-off.
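A rough NumPy sketch of the whole thing (shard count and toy sizes are just for illustration; if you L2-normalize the vectors up front, the per-shard dot products sum to exactly the full-vector cosine similarity):

    import numpy as np

    def sharded_scores(query: np.ndarray, database: np.ndarray, n_shards: int) -> np.ndarray:
        # Each term of the sum only touches one shard, so shards can live on
        # different workers and be scored in parallel.
        q_parts = np.array_split(query, n_shards)
        db_parts = np.array_split(database, n_shards, axis=1)
        return sum(db @ q for q, db in zip(q_parts, db_parts))

    # Toy usage: ~400 scene vectors, normalized once so shard sums equal cosine
    db = np.random.rand(400, 20_000).astype(np.float32)
    db /= np.linalg.norm(db, axis=1, keepdims=True)
    query = db[7]
    scores = sharded_scores(query, db, n_shards=16)
    best = int(np.argmax(scores))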
On this page they have "space-filling curves" as an example in one of the images, but I haven't been able to find production systems that actually use space-filling curves for similarity search. Anyone have any tips?
I've been looking for good examples for my next book, but everything I've read about has turned out to be not used, including everything in Michael Bader's book [1], except for S2Geometry [2], but I feel like I need to talk to someone who has more expertise in the benefits/drawbacks of using that in an actual product.
Is view() for disk-based indexes doing something special over a plain mmap(), e.g. setting read-aheads based on knowledge of the internal structure to make it faster if done over the network?
I'm curious, is HNSW the only option? Do you support IVF-style indexes? Also, FAISS is nice because it supports a pluggable storage layer. Is this something that's easily supported in USearch?
In the vein of single-file databases, I've been enjoying DuckDB (from CWI Amsterdam) and am exploring Kùzu (from the database group at the University of Waterloo). DuckDB aims to be a SQLite for analytics (OLAP), while Kùzu is an analytics-focused graph database.
The fact that USearch has a WASM binding for frontend use (AND supports serialization) is very cool for client-side search/LLM applications!
How would I integrate this into a dense passage retriever workflow for RAG? I could not find any examples for document chunk ingestion and similarity query.
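Something along these lines is what I'm picturing (a hypothetical sketch on my end; sentence-transformers for the embedding step and the exact usearch Index calls are my assumptions, not an official example):

    import numpy as np
    from usearch.index import Index
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("all-MiniLM-L6-v2")   # 384-dim embeddings
    index = Index(ndim=384, metric="cos")

    # Ingestion: split documents into chunks, embed, add under integer keys
    chunks = ["first passage ...", "second passage ...", "third passage ..."]
    embeddings = model.encode(chunks, normalize_embeddings=True).astype(np.float32)
    index.add(np.arange(len(chunks)), embeddings)

    # Retrieval: embed the question, pull top-k chunks, paste them into the prompt
    question = "what does the second passage talk about?"
    q = model.encode([question], normalize_embeddings=True).astype(np.float32)[0]
    matches = index.search(q, 3)
    context = [chunks[int(key)] for key in matches.keys]   # .keys assumed on the result object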