On a Confluence instance that covers the whole of the Fortune 500 company I work for, I do NOT want to search over the corpus of every document hosted on it. I want a persistent search filter where I can easily restrict my results within certain parameters without having to constantly re-apply the filters.
I think most search engine designers want to make the index as broad as possible, but the problem seems to be that people rarely want such broad searches. What they really want are very detailed indices and rich metadata over well-trodden folders.
I would argue there is no point in indexing the vast majority of content without good search; if the search is good enough, you can index everything and surface only the good stuff at query time.
Yeah, honestly I see using a regular search index as a downside rather than a benefit with this tech. Conflicting info and low-quality blogspam seem to trip these LLMs up pretty badly.
Using a curated search index seems like a much better use case, especially for private data (company info, docs, db schemas, code, chat logs, etc.).
This is an interesting point. User-managed or curated indices offer real advantages, especially when 'depth of coverage' matters more than 'breadth of coverage'. I believe people are already shifting away from demanding 'search breadth', so someone might well decide to build this.
Thanks. My hope is that the search is good enough that it doesn't matter how much is indexed -- you'll always be able to find what you are looking for. This is one reason I felt it was easier, to start, to index content server-side.
This could be genuinely interesting for tools like sphinx-doc, which currently has a client-side search that does indeed ship the entire index to the client.
Just a quick, not-fully-thought-through idea, but would it be possible to build an index in a way that lets clients download only the parts of it relevant to what they search for? I.e. the search client normalizes the query in some way and then requests only the relevant index pages (roughly like the sketch below). The index would probably be a lot larger on the server, but if disk space is not an issue...?
(Though at some point you have to ask yourself what the benefits of such an approach are compared to a search server.)
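Roughly what I had in mind, as a minimal client-side sketch in TypeScript. Everything here is made up for illustration: it assumes the server has pre-split an inverted index into static JSON shards keyed by the first two letters of each normalized term, served at a hypothetical /search-index/<prefix>.json path.

```ts
// Sketch only: assumes the server exposes pre-built index shards at
// /search-index/<prefix>.json, where <prefix> is the first two letters
// of a normalized term. All names and paths here are hypothetical.

type Postings = Record<string, string[]>; // term -> document IDs

const shardCache = new Map<string, Promise<Postings>>();

// Normalize terms the same way the server-side indexer did, so client
// and server agree on which shard a term lives in.
function normalize(term: string): string {
  return term.toLowerCase().replace(/[^\p{L}\p{N}]+/gu, "");
}

// Download (and cache) only the shard that can contain the term.
function loadShard(prefix: string): Promise<Postings> {
  if (!shardCache.has(prefix)) {
    shardCache.set(
      prefix,
      fetch(`/search-index/${prefix}.json`).then((r) =>
        r.ok ? (r.json() as Promise<Postings>) : ({} as Postings)
      )
    );
  }
  return shardCache.get(prefix)!;
}

// Look up each query term in its shard and intersect the posting lists.
export async function search(query: string): Promise<string[]> {
  const terms = query.split(/\s+/).map(normalize).filter(Boolean);
  const lists = await Promise.all(
    terms.map(async (t) => (await loadShard(t.slice(0, 2)))[t] ?? [])
  );
  if (lists.length === 0) return [];
  return lists.reduce((acc, ids) => acc.filter((id) => ids.includes(id)));
}
```

The trade-off is a lot of small shard files server-side, but the client only ever downloads the handful of shards its query terms fall into, instead of the whole index.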
I think it does, but we're sort of against indexing everything, because then we'd just turn into Google search... It's much easier to locate something when you have only three or four documents of that type rather than a hundred...
That’s an interesting point about search developing in the absence of a standard for sites to self-index. I can’t quite imagine what that standard index would look like.
This still doesn't explain how large-scale search indexing could take place effectively... the data still needs to flow upstream unencrypted to be processed and indexed.
Somehow offloading full-text indexing to the client, uploading encrypted indexes to the server, and then elegantly combining those indexes across users into a unified search would be arduous, and I don't see anyone doing it.
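For what it's worth, the first two steps (building the index locally and encrypting it before upload) are the easy part; here's a rough TypeScript sketch using the Web Crypto API. The genuinely hard part -- combining encrypted indexes across users without the server ever decrypting them -- is exactly what's missing here, and would need something like searchable encryption.

```ts
// Sketch of the "easy" steps only: build a trivial inverted index on the
// client, then encrypt it with a key the server never sees. Merging such
// indexes across users server-side is NOT shown (that is the hard part).

async function buildAndEncryptIndex(
  docs: { id: string; text: string }[],
  key: CryptoKey // an AES-GCM key that stays on the client
): Promise<{ iv: Uint8Array; ciphertext: ArrayBuffer }> {
  // Local inverted index: term -> IDs of documents containing it.
  const index: Record<string, string[]> = {};
  for (const doc of docs) {
    for (const term of new Set(doc.text.toLowerCase().split(/\W+/))) {
      if (!term) continue;
      (index[term] ??= []).push(doc.id);
    }
  }

  // Encrypt the serialized index; the server only stores ciphertext.
  const iv = crypto.getRandomValues(new Uint8Array(12));
  const ciphertext = await crypto.subtle.encrypt(
    { name: "AES-GCM", iv },
    key,
    new TextEncoder().encode(JSON.stringify(index))
  );
  return { iv, ciphertext };
}
```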
Also, IndexedDB has been around for years and still suffers from a variety of cross-browser inconsistencies that make it a little painful to work with.
The thing I wanted to address with my idea is how to avoid rebuilding the index on each client, and how to avoid storing the entire index on the client. In my opinion, large indexes are probably a requirement for broad adoption, since accurate search is very important for email (beyond user expectations, it could even be a legal requirement for discovery requests during a lawsuit).
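For concreteness, here's the rough shape I'm imagining, with a completely hypothetical delta endpoint and payload: the server keeps a versioned index, and the client only pulls the postings that changed since the version it last saw, so it never rebuilds or stores the full index.

```ts
// Hypothetical delta-sync protocol: GET /index/delta?since=<version>
// returns only the postings added or removed after <version>. None of
// these names exist in any real product; they just illustrate the idea.

interface IndexDelta {
  version: number;
  added: Record<string, string[]>;   // term -> newly indexed message IDs
  removed: Record<string, string[]>; // term -> deleted message IDs
}

let localVersion = 0;
const localIndex = new Map<string, Set<string>>();

// Merge the latest changes into the partial local index instead of
// rebuilding it or downloading the whole thing.
export async function syncIndex(): Promise<void> {
  const res = await fetch(`/index/delta?since=${localVersion}`);
  const delta: IndexDelta = await res.json();

  for (const [term, ids] of Object.entries(delta.added)) {
    const set = localIndex.get(term) ?? new Set<string>();
    for (const id of ids) set.add(id);
    localIndex.set(term, set);
  }
  for (const [term, ids] of Object.entries(delta.removed)) {
    const set = localIndex.get(term);
    if (set) for (const id of ids) set.delete(id);
  }
  localVersion = delta.version;
}
```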