OpenAI-to-SQLite by simonw (datasette.io)
1 point by thunderbong on 2023-02-07 | 67 comments




I'm always intrigued by Simon, and I enjoy his open-minded way of working. Thanks, Simon!

Simon is singlehandedly pushing up the average package-per-developer ratio of all of us. 1000x developer.

Same with HN. Every time I'm about to comment on HN, simonw has already been there with something more thoughtful. 1000x HNer.

If OpenAI can scrape what humans did, then can you create a public p2p database of request-responses?

I mean that sounds like a natural evolution; if generating a new response costs money, it makes sense to save them in a cheap database for future lookup / reuse.

But that's quite possibly OpenAI's model as well - cache results and reduce costs while keeping pricing.
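A minimal sketch of that idea, with a hypothetical complete() function standing in for the real (paid) API call; the table name and hashing scheme are illustrative:

  import hashlib
  import sqlite3

  db = sqlite3.connect("cache.db")
  db.execute("CREATE TABLE IF NOT EXISTS cache (prompt_hash TEXT PRIMARY KEY, response TEXT)")

  def cached_complete(prompt, complete):
      # Key responses by a hash of the prompt text
      key = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
      row = db.execute("SELECT response FROM cache WHERE prompt_hash = ?", (key,)).fetchone()
      if row:
          return row[0]  # cache hit: no API spend
      response = complete(prompt)  # cache miss: pay for a real completion
      db.execute("INSERT INTO cache VALUES (?, ?)", (key, response))
      db.commit()
      return response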


Also known as "mainly poem database".

That would really only work on initial prompts, before existing conversation state had been established.

It also would not create the non-deterministic responses users have come to expect.


> It also would not create the non-deterministic responses users have come to expect.

Slightly modifying text is probably less expensive than coming up with brand new text. So the text can be cached and, when retrieved, slightly altered before being served. A "cached-but-not-really" cache.


Good point; randomly rephrasing slightly is probably a much easier task resource-wise.

> before existing state had been established.

As I understand it, "state" is actually just feeding the history into the prompt as "context", so realistically this could help build history and context via lookups.

e.g.

  Answer this as an assistant, blah blah blah.
  You already said ${QUERY 1}, ${QUERY 2}. Refer to ${QUERY 3}, ${QUERY 4} for ideal formatting.
  Prompt: "This is what the user asks ChatGPT this round"


I was thinking about that when working on https://eli5.gg. I implemented a caching system internally but it would have been simpler to connect to an OpenAI proxy that handled this automagically.

They can put it in the T&Cs saying you can't do that though. Like what most geocoding APIs do - e.g. Google says you cannot cache/store the geocoding results.

What you do with the results is a legal argument after the fact. It's like saying we give you food, but you can't store the poop. Honestly - there is a house of cards just waiting to be challenged. Then again .. this is Google ($$$)

Do they follow the license requirements of the data they're scraping for training?

You just need to cache it with AI and you should be good to go.


This is interesting, but less interesting than I assumed. I thought they'd provide ChatGPT with a schema and ask it to formulate its responses as `INSERT` statements, etc.

I was able to feed it the following schema (actually I coached it through writing it for me).

  Table: Questions
  - id (Primary Key, Auto-Incremented)
  - title (VARCHAR)
  - description (TEXT)

  Table: Answers
  - id (Primary Key, Auto-Incremented)
  - question_id (Foreign Key referencing the Questions table)
  - answer (TEXT)
Then I responded with:

  Please copy all my questions into SQLite compatible INSERT statements, similarly create a new INSERT for your answer.
  What color is the sky?
What I got was:

  Here is the SQLite compatible INSERT statement for your question:
    INSERT INTO Questions (title, description)
    VALUES ("What color is the sky?", "");
  And here is the SQLite compatible INSERT statement for my answer:

    INSERT INTO Answers (question_id, answer)
    VALUES (1, "The sky is typically blue during the day and black at night.");
  Note: The question_id in the Answers INSERT statement is referencing the id of the question in the Questions table. In this example, the assumption is that the question with id of 1 is "What color is the sky?".

Can't wait for the battle of the OpenAI JavaScript frameworks

What I really want to see is an HNSW index type for SQLite so finding similar vectors would be fast. I'm wondering if this can be done via an extension.

You could do it using an interface similar to the FTS tables in SQLite, where the “index” is actually a table that needs to be kept up to date with the underlying data (you can use triggers to keep them in sync). Querying is a little awkward compared to a normal index, but for these complex, inexact indexes you often want the extra flexibility (like computing the match score).
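A rough sketch of that pattern using the built-in FTS5 extension (the table and trigger names are illustrative; an HNSW extension could expose a similar shape):

  import sqlite3

  db = sqlite3.connect("docs.db")
  db.executescript("""
  CREATE TABLE IF NOT EXISTS documents (id INTEGER PRIMARY KEY, body TEXT);
  -- The "index" is itself a table shadowing the underlying data
  CREATE VIRTUAL TABLE IF NOT EXISTS documents_fts
    USING fts5(body, content='documents', content_rowid='id');
  -- Triggers keep the index table in sync (update trigger omitted for brevity)
  CREATE TRIGGER IF NOT EXISTS documents_ai AFTER INSERT ON documents BEGIN
    INSERT INTO documents_fts(rowid, body) VALUES (new.id, new.body);
  END;
  CREATE TRIGGER IF NOT EXISTS documents_ad AFTER DELETE ON documents BEGIN
    INSERT INTO documents_fts(documents_fts, rowid, body) VALUES ('delete', old.id, old.body);
  END;
  """)
  # Querying exposes a match score, unlike a plain index
  rows = db.execute(
      "SELECT rowid, bm25(documents_fts) AS score "
      "FROM documents_fts WHERE documents_fts MATCH ? ORDER BY score",
      ("example",),
  ).fetchall()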

That's promising. Is there a way to integrate this with other filters in the SQL query? Scenario: let's say I have documents, and each one has a timestamp and an embedding vector. I should be able to write a query like this:

  select * from documents where timestamp = today() order by similarity(embedding, query_vector)

In this case the query planner should be smart enough to first filter by timestamp and then do the HNSW search within the remaining records. I'm unsure whether SQLite's extension interface is flexible enough to support this.


I don't think so. You can have unindexed columns on fts tables but I don't think you can directly combine non-fts query filters in that way.

You can do it, but you need to manually filter from the main table then subquery the fts index, or use a join, etc. It's powerful but the interface isn't as friendly as it could be in that way.

The sqlite extension system is powerful enough that an extension could unify the behaviors as you describe. It's just not how the fts5 extension interface is exposed, and afaik no one has written an extension on top of it to make it easier to use.
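For example, one way to combine an fts match with an ordinary filter is a join against the fts table; a sketch, reusing the documents/documents_fts names from the earlier example:

  import sqlite3

  db = sqlite3.connect("docs.db")
  rows = db.execute(
      """
      SELECT documents.id, documents.body
      FROM documents
      JOIN documents_fts ON documents_fts.rowid = documents.id
      WHERE documents_fts MATCH :q
        AND documents.id > :min_id  -- an ordinary, non-fts filter
      ORDER BY bm25(documents_fts)
      """,
      {"q": "example", "min_id": 100},
  ).fetchall()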


> I'm unsure whether SQLite's extension interface is flexible enough to support this.

I think it is, if I've understood your requirements correctly. e.g. from the datasette-faiss docs:

  with related as (
    select value from json_each(
      faiss_search(
        'intranet',
        'embeddings',
        (select embedding from embeddings where id = :id),
        5
      )
    )
  )
  select id, title from articles, related
  where id = value

An SQLite virtual table can assign a "cost" to a particular query plan, so you could weight your index search higher and discourage SQLite from using it to filter many rows. The tuning works by specifying a number of "disk reads" or estimated rows for a particular query plan.

More info: https://www.sqlite.org/vtab.html#the_xbestindex_method


You might be interested in https://datasette.io/plugins/datasette-faiss, which I'm using alongside openai-to-sqlite for similarity search of embeddings, following @simonw's excellent instructions at https://simonwillison.net/2023/Jan/13/semantic-search-answer...

Thanks, but the index being in-memory makes it unsuitable for large data sets :/

Can you say more? Usually projects that gravitate to SQLite are not those that require massive scale, and a FAISS index of a few GB covers a lot of documents.

My dataset is going to be around 10M documents. With OpenAI embeddings, that will be around 62GB. AFAIK SQLite should be able to handle that size, but I haven't tried.

This is not going to be my primary DB. I would update this maybe once a day and the update doesn't have to be super fast.


you might check out some vector databases:

https://milvus.io/

pinecone.io

there are others too


There is a way of running disk-backed FAISS indexes that don't all fit in memory, but I've not quite figured out how to do that yet: https://github.com/facebookresearch/faiss/issues/2675

OpenSearch's k-NN plugin supports FAISS and it's disk-based:

https://opensearch.org/docs/latest/search-plugins/knn/index/


OpenSearch looks like the best so far, all my requirements combined!

The "embedding" for a text is an array of floating point numbers? How many?

And how do you calculate the distance between two texts then?

Just request the embeddings array "a" for text one and "b" for text two and then calculate "sqrt((a[0]-b[0])^2+(a[1]-b[1])^2+(a[2]-b[2])^2...)" ?


> The "embedding" for a text is an array of floating point numbers?

Yes, that's what an embedding is.

> How many?

It varies depending on the model; in this case it's 1536 (mentioned in the original post).

> And how do you calculate the distance between two texts then?

You take their embedding vectors and calculate the distance between the two vectors using something like cosine similarity - https://en.wikipedia.org/wiki/Cosine_similarity (there are other distance metrics too, obviously)
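A minimal sketch of both measures with numpy:

  import numpy as np

  def cosine_similarity(a, b):
      # 1.0 means same direction; vector lengths are ignored
      return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

  def euclidean_distance(a, b):
      # The sqrt((a[0]-b[0])^2 + ...) formula from the comment above
      return np.linalg.norm(np.asarray(a) - np.asarray(b))

  a, b = np.random.rand(1536), np.random.rand(1536)  # two OpenAI-sized embeddings
  print(cosine_similarity(a, b), euclidean_distance(a, b))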


Thanks!

> 1536

That's quite a lot. Is there a fast way to find the most similar one to a given one in SQLite?

> cosine similarity

Why not Euclidean distance?

Very high-dimensional space is unintuitive. It can help to imagine a cube as N increases: 2^N corners and a vanishingly small volume-to-surface-area ratio.

This gives useful intuitions: e.g. cosine distance ignores length and only looks at angles, which matters less in higher dimensions; by contrast, the Euclidean length becomes less useful.


Do I understand you correctly that you're saying cosine distance is a better measure of similarity than Euclidean distance when there are many dimensions?

I get that cosine distance is easier to calculate, but isn't sorting by cosine distance the same thing as sorting by L2 distance, assuming that the vectors are already L2-normalized?
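For what it's worth, for unit vectors ||a - b||^2 = 2 - 2*cos(a, b), so the two orderings do agree. A quick numerical check of that (a sketch with numpy):

  import numpy as np

  rng = np.random.default_rng(0)
  query = rng.normal(size=128)
  query /= np.linalg.norm(query)  # L2-normalize the query
  docs = rng.normal(size=(100, 128))
  docs /= np.linalg.norm(docs, axis=1, keepdims=True)  # L2-normalize each doc

  by_cosine = np.argsort(-docs @ query)  # descending cosine similarity
  by_l2 = np.argsort(np.linalg.norm(docs - query, axis=1))  # ascending L2 distance
  print((by_cosine == by_l2).all())  # True: for unit vectors, the rankings match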

They use cosine for performance reasons.

Euclidean distance, when it is used, is usually squared, because even the sqrt is too expensive.


Also note that the embeddings are learned based on which distance metric is used, thus making it important to use the right distance metric with the right embeddings above and beyond wall-clock performance. Other metrics may still work, but less well.

txtai is an alternative approach to this. It builds a FAISS (also supports HNSW) index alongside a SQLite database. It works with sentence-transformers models. For example, this model https://huggingface.co/sentence-transformers/all-MiniLM-L6-v... is 384 dimensions and works great for semantic search.

https://github.com/neuml/txtai

https://neuml.github.io/txtai/embeddings/query/
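A sketch of what that looks like, based on txtai's documented usage (details may vary by version):

  from txtai.embeddings import Embeddings

  # content=True stores the text in an SQLite database alongside the index
  embeddings = Embeddings({"path": "sentence-transformers/all-MiniLM-L6-v2",
                           "content": True})
  embeddings.index([(0, "the sky is blue", None),
                    (1, "tango is a style of dance", None)])
  print(embeddings.search("what color is the sky?", 1))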


This looks awesome, thanks for sharing.

Thanks. Just want people to know there are quality FOSS alternatives available for encoding text into embeddings.

Is there a standard library for extracting data in a consistent format from OpenAI queries yet? And testing the results?

Currently using LangChain


I'm trying to wrap my head around embeddings and am not sure I understand how "realtime" embeddings will be in a real world application.

For example, if you use embeddings, do you need to know "a priori" what you want to search for? In other words, if you don't know your search queries up front, you have to generate the embeddings, store them inside the database, and then use them. The first step requires an API call to a commercial company (OpenAI here), running against a private model, via a service (which could have downtime, etc). (I imagine there are other embedding technologies but then I would need to manage the hardware costs of training and all the things that come with running ML models on my own.)

Compare that to a regular LIKE search inside my database: I can do that with just a term that a user provides, without preparing my database beforehand; the database has native support for finding that term in whatever column I choose. Embeddings seem much more powerful in that you can search for something in a much fuzzier way using the cosine distance of the embedding vectors, but that requires me to generate and store those embeddings first.

Am I wrong about my assumptions here?


You need three things: a query, a corpus of text to be searched against, and a language model that can map text to vectors ("compute the embedding").

Essentially, when you get a query, you compute its embedding (through an API call) and return the parts of your corpus whose embeddings are cosine-close to the query's.

In order to do that efficiently, you indeed have to pre-compute all the embeddings of your corpus beforehand and store them in your database.
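A sketch of that two-step flow, with a hypothetical embed() standing in for whatever API or local model computes the embeddings:

  import numpy as np

  def embed(text):
      # Stand-in for a real embedding call (OpenAI API, sentence-transformers, etc.)
      raise NotImplementedError

  # Done once, ahead of time: embed the corpus and store the vectors
  corpus = ["how to dance the tango", "sqlite performance tips", "soup recipes"]
  corpus_vectors = np.array([embed(t) for t in corpus])

  # Done per query: embed the query, return the cosine-closest entries
  def search(query, k=2):
      q = np.asarray(embed(query))
      scores = corpus_vectors @ q / (
          np.linalg.norm(corpus_vectors, axis=1) * np.linalg.norm(q))
      return [corpus[i] for i in np.argsort(-scores)[:k]]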


Thank you. So, this does imply that you need to know the target of your embeddings in advance, right?

If I want to know if my text has something like "dancing" and it does have "tango" inside it, why wouldn't I just generate a list of synonyms and then for-each run those queries on my text and then aggregate them myself?

I can download a few synonyms databases here:

https://stackoverflow.com/questions/5618304/looking-for-thes...

Is the value here that OpenAI can get me a better list of synonyms than I could do on my own?

If OpenAI were better at generating this list of synonyms, especially with more current data (I need to search for a concept like "fuzzy-text" and want text with "embeddings" to be a positive match!) that would be valuable.

It feels like OpenAI will probably be faster to update their model with current data from the Internet than those synonym lists linked above. Having said that, one of the criticisms of ChatGPT is that it does not have great knowledge of more recent events, right? Don't ask it about the Ukraine war unless you want a completely fabricated result.


You don't need to know the target queries: if you compute embeddings of your entries and of your query, you just find which entry embeddings are closest to your query embedding. The advantage over using synonyms is that the embedding is meant to encode the meaning of the content, such that similar embeddings represent similar meaning, and you won't need to deal with the combinatorial explosion of all the different ways you can say the same thing with different words (it can also work for other content, like images, or across languages if your network is trained for it).

That's the value prop of large language models (and here, of OpenAI's LLM): because it's been trained on some "somehow sufficiently large" corpus of data, it has internalized a lot of real world concepts, in a "superficial yet oddly good enough" way. Good enough for it to have embeddings for "dancing" and "tango" that will be fairly close.

And if you really need to, you can also fine-tune your LLM or do few-shot learning to further customise your embeddings to your dataset.


But yes, if you ask OpenAI to predict the next set of tokens (which is how chat works), it won't be up to date with the latest information. But if you're using it for embeddings this is less of a problem, since language itself doesn't evolve as quickly, and using embeddings is all about encoding the meaning of text, which is likely not going to change much. That's not to say it can't, though: for example, the definition of "transformer" pre-2017 is probably not referring to the "transformer architecture".

Look for the Python package "sentence-transformers" and give it a spin.

I think what you would do is some long-running “index” process where you dynamically generate embeddings for all changes that get made to text in the database. Databases that support searching large amounts of text probably do a way simpler version of this (i.e. reverse keyword indexes) already. Granted, this does involve a very beefy database.

One strategy commonly used:

1. Create embeddings of your db entries by running them through a neural network in inference mode, and save them in a database in vector format.

2. Convert your query to an embedding by running it through a neural network in inference mode.

3. Perform a nearest-neighbor search of your query embedding against your db embeddings; there are libraries and databases optimized for this, for example FAISS from Meta/FB [1] (see the sketch below).

So if your network is already trained, or you use something like OpenAI for embeddings, this can still be done in near real time: just think of getting your embedding vector as part of the indexing process.

You can do more things too, like cluster your embeddings db to find similar entries.

[1] https://engineering.fb.com/2017/03/29/data-infrastructure/fa...
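A minimal FAISS sketch of steps 1-3, assuming the embeddings are already computed and L2-normalized (so inner product equals cosine similarity); the dimensions and data here are placeholders:

  import numpy as np
  import faiss

  d = 384  # embedding dimensions (model-dependent)
  db_vectors = np.random.rand(10_000, d).astype("float32")  # step 1: corpus embeddings
  faiss.normalize_L2(db_vectors)  # normalized, so inner product == cosine

  index = faiss.IndexFlatIP(d)  # exact inner-product index
  index.add(db_vectors)

  query = np.random.rand(1, d).astype("float32")  # step 2: the query's embedding
  faiss.normalize_L2(query)

  scores, ids = index.search(query, 5)  # step 3: top-5 nearest neighbors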


> if you use embeddings, do you need to know "a priori" what you want to search for?

No. An embedding is something that represents the data. You initially calculate the embedding for each datapoint with the API (or model) and store them all in an index.

When a user makes a query, it is first embedded by a call to the API. Then you can measure similarity score with a very simple multiply and add operation (dot product) between these embeddings.

More concretely, an embedding is an array of floats, usually between 300 and 10,000 elements long. To compare two embeddings a and b you compute sum_i(a_i * b_i); the larger this score is, the more similar a and b are. If you sort by similarity you have ranked your results.

The fun part is when you compare embeddings of text with embeddings of images. They can exist in the same space, and that's how generative AI relates text to images.


I'm pretty sure they need to be cross-trained on text and images, like CLIP: it was trained on the alt text of images to learn the association.

Total newb on the outside looking in, any primers on this stuff?

I wrote a detailed explanation of embeddings as part of this post here: https://simonwillison.net/2023/Jan/13/semantic-search-answer...

Since building this I figured out a way to calculate embeddings using a model that runs directly on my own hardware - notes here: https://til.simonwillison.net/python/gtr-t5-large
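For anyone wanting to try the local route, the sentence-transformers version looks something like this (a sketch; the model downloads on first use):

  from sentence_transformers import SentenceTransformer

  # Runs entirely on local hardware, no API calls
  model = SentenceTransformer("sentence-transformers/gtr-t5-large")
  vectors = model.encode(["An example sentence to embed"])
  print(vectors.shape)  # (1, 768) for gtr-t5-large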

There's a good leaderboard for embedding-focused models here: https://huggingface.co/spaces/mteb/leaderboard. Some of the sentence-transformers models are even smaller than gtr-t5 while still having very close performance for some use cases.

I'm working on making them 2-4x smaller with SmoothQuant int8 quantization, and hoping to release standalone C++ and Rust implementations optimized for CPU inference.



I look forward to other LLMs besides not-very-OpenAI's becoming available. This is a very clever tool, but what we all actually want is to be able to say 'can you read the paper/article at url:// and help me with XYZ' or highlight text/figures/tables/lists in an editing environment and say 'help parse/graph/translate/implement this', or conversely to say 'here is a buffer/array/heap for you to use, let's build some sort of user interface to make the feedback process more efficient.' This would be particularly useful for code-related tasks.

Right now it's like dealing with a very eager assistant that has staggering powers of information retrieval and really wants to help, but being forced to conduct all your conversations in a dark room which you and the AI take turns entering and leaving to examine what you've received. You can talk it into performing array and dictionary operations and maintaining text buffers in practice, but you're mutually forbidden from sharing/viewing any kind of semantic workspace directly.


Hey Simon, amazing work as usual!

Do you think you'll ever replace the search on your website with Semantic Search like this? I'm not sure how well Semantic Search gels with the ability to facet though, so that could be the hiccup.

Otherwise it seems like you can get Semantic Search off SQLite that rivals e.g. the Typesenses and Elasticsearches of the world

