Hacker News

On a Confluence instance that covers the whole of the Fortune 500 company I work for, I do NOT want to search over the corpus of all the documents hosted on it. I want a persistent search filter that lets me easily restrict my results to certain parameters without having to constantly re-filter them.

I think most search engine designers want to make the index as broad as possible, but the problem seems to be that people rarely want such broad searches. What they really want are very detailed indices and rich metadata over well-trodden folders.




I know our organization would like this too, but it basically makes indexing for search pretty much impossible, doesn't it?

I would argue it makes no sense to index the vast majority of content without good search. If your search is good enough, you can index everything and then surface only the good stuff at query time.

Yeah, honestly I see using a regular search index as a downside rather than a benefit with this tech. Conflicting info or low-quality blogspam seems to trip these LLMs up pretty badly.

Using a curated search index seems like a much better use case, especially for private data (company info, docs, db schemas, code, chat logs, etc.)


This is an interesting point. User-managed or curated indices offer unique advantages, especially when depth of coverage matters more than breadth of coverage. I believe people are shifting away from demanding search breadth as we speak, so someone may well decide to do this.

Thanks. My hope is that the search is good enough that it doesn't matter how much is indexed -- you'll always be able to find what you are looking for. This is one reason, to start, I felt it was easier to index content server-side.

This could be genuinely interesting for tools like sphinx-doc, which currently has a client-side search that does indeed ship the entire index to the client.

It's not a search problem, as much as it's an indexing one, no?

Just a quick, not-thought-through idea, but would it be possible to build an index in a way that allows clients to download only parts of it based on what they search? I.e. the search client normalizes the query in some way and then requests only the relevant index pages. The index would probably be a lot larger on the server, but if disk space is not an issue...?

(Though at some point you have to ask yourself what the benefits of such an approach are compared to a search server.)
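A minimal sketch of that idea, assuming the server pre-splits an inverted index into shards keyed by term prefix, so the client only fetches the shards its query terms map to. All names here (`build_shards`, `fetch_shard`) are illustrative, not a real API; in practice `fetch_shard` would be an HTTP GET.

```python
from collections import defaultdict

def build_shards(docs: dict[str, str], prefix_len: int = 2):
    """Server side: split one inverted index into shards by term prefix."""
    shards: dict[str, dict[str, set[str]]] = defaultdict(dict)
    for doc_id, text in docs.items():
        for term in set(text.lower().split()):
            shard = shards[term[:prefix_len]]
            shard.setdefault(term, set()).add(doc_id)
    return dict(shards)

def search(query: str, fetch_shard, prefix_len: int = 2):
    """Client side: fetch only the shards the query terms map to,
    then intersect the posting lists."""
    result: set[str] | None = None
    for term in query.lower().split():
        shard = fetch_shard(term[:prefix_len]) or {}
        postings = shard.get(term, set())
        result = postings if result is None else result & postings
    return result or set()

# Demo with a dict lookup standing in for the network fetch:
docs = {"d1": "client side search index", "d2": "server side rendering"}
shards = build_shards(docs)
print(search("side search", shards.get))  # only d1 matches both terms
```

The trade-off the parent mentions is visible here: the server stores many small shard files instead of one index, but each query only pulls down a couple of them.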


The indexing kinda sucks though. I can never seem to find what I'm looking for.

I keep wondering how actually costly/difficult it would be to make a text-only index of the parts of the web I actually need to regularly search.

One of these days I'm gonna scratch this itch.


Search is the hard problem; the index is something a SaaS provider can make available.

I think it does, but we're sort of against indexing everything, because then we'd just turn into Google search... It's much easier to locate something when you have only three or four documents of that type rather than a hundred...

Ha, well, OP said build a "search index"... it's not much of a search index if you can't search it :)

That’s an interesting point about search developing in an absence of a standard for sites to self-index. I can’t quite imagine what that standard index would look like.

If it's not in the search index, it's not accessible. Nobody is going to look through every wiki page to find something.

You could make recency a factor in ranking though.


This still doesn't explain how large-scale search indexing could take place effectively... the data still needs to float upstream unencrypted to be processed and indexed.

Somehow offloading full-text indexing to the client, uploading encrypted indexes to the server, and then elegantly combining those indexes across users to display a unified search would be arduous, and I don't see anyone doing it.

Also, IndexedDB has been around for years and still suffers from a variety of cross-browser inconsistencies that make it a little painful to work with.
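For what it's worth, the first two steps the parent describes (client builds the index, server only sees opaque tokens) can be sketched with deterministic HMAC tokens per term, a heavily simplified form of searchable symmetric encryption. The key name and handling here are illustrative only; real key management is omitted, and the cross-user merging is exactly the arduous part.

```python
import hmac, hashlib

KEY = b"per-user-secret"  # illustrative; real systems need key management

def token(term: str) -> str:
    """Deterministic token so the server can match terms without reading them."""
    return hmac.new(KEY, term.lower().encode(), hashlib.sha256).hexdigest()

def build_encrypted_index(docs: dict[str, str]) -> dict[str, set[str]]:
    """Client side: map opaque term tokens to doc ids, then upload."""
    index: dict[str, set[str]] = {}
    for doc_id, text in docs.items():
        for term in set(text.lower().split()):
            index.setdefault(token(term), set()).add(doc_id)
    return index  # the server stores this but cannot read the terms

def query(server_index: dict[str, set[str]], term: str) -> set[str]:
    """Client tokenizes the query term; the server just does a lookup."""
    return server_index.get(token(term), set())

idx = build_encrypted_index({"m1": "quarterly report draft", "m2": "draft email"})
print(query(idx, "draft"))  # matches both documents
```

This only works per user, since each user has their own key; combining indexes across users with different keys is where the scheme runs into the wall the parent points out.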


The thing I wanted to cover with my idea is how to avoid rebuilding the index on each client, as well as not having to store the entire index on the client. In my opinion, large indexes are probably a requirement for broad adoption, since accurate search is very important for email (besides user desires, it could even be a legal requirement for discovery requests during a lawsuit).

It would be quite easy to only index Reddit and StackOverflow.

Is there a plugin which allows you to curate your own index, i.e. a whitelist of sites to search?
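Even without a plugin, a curated whitelist can be a thin filter over whatever search backend you already use. A sketch under that assumption, where `raw_search` and the result shape are stand-ins, not a real API:

```python
from urllib.parse import urlparse

WHITELIST = {"reddit.com", "stackoverflow.com"}  # user-curated (example)

def allowed(url: str) -> bool:
    """True if the URL's host is a whitelisted domain or a subdomain of one."""
    host = urlparse(url).hostname or ""
    return any(host == d or host.endswith("." + d) for d in WHITELIST)

def curated_search(query: str, raw_search):
    """Filter any backend's results down to the whitelist."""
    return [r for r in raw_search(query) if allowed(r["url"])]

# Demo with a stubbed backend:
def raw_search(q):
    return [
        {"url": "https://old.reddit.com/r/python", "title": "r/python"},
        {"url": "https://spam.example.com/seo", "title": "blogspam"},
    ]

print(curated_search("python", raw_search))  # keeps only the reddit result
```

Matching on the parsed hostname rather than a substring avoids letting `notreddit.com.evil.com` sneak through the filter.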
