Hacker News

On a Confluence instance that covers the whole of the Fortune 500 company I work for, I do NOT want to search over the corpus of all the documents hosted on it. I want a persistent search filter that lets me easily restrict my results to certain parameters without having to constantly re-filter them.

I think most search engine designers want to make the index as broad as possible, but the problem seems to be that people rarely want such broad searches. What they really want are very detailed indices and rich metadata over well-trodden folders.




I know our organization would like this too, but it basically makes indexing for search pretty much impossible, doesn't it?

I would argue it makes no sense to index the vast majority of content without good search. If your search is good enough, you can index everything and then surface only the good stuff at query time.

Yeah, honestly I see using a regular search index as a downside rather than a benefit with this tech. Conflicting info or low-quality blogspam seems to trip these LLMs up pretty badly.

Using a curated search index seems like a much better use case, especially for private data (company info, docs, db schemas, code, chat logs, etc.)


This is an interesting point. User-managed or curated indices offer unique advantages, especially when depth of coverage matters more than breadth of coverage. I believe people are shifting away from demanding search breadth as we speak, so someone may well decide to do this.

Thanks. My hope is that the search is good enough that it doesn't matter how much is indexed -- you'll always be able to find what you are looking for. This is one reason, to start, I felt it was easier to index content server-side.

This could be genuinely interesting for tools like sphinx-doc, which currently has a client-side search that does indeed ship the entire index to the client.

It's not a search problem, as much as it's an indexing one, no?

Just a quick, not-thought-through idea, but would it be possible to build an index in a way that allows clients to download only parts of it based on what they search? I.e. the search client normalizes the query in some way and then requests only the relevant index pages. The index would probably be a lot larger on the server, but if disk space is not an issue...?

(Though at some point you have to ask yourself what the benefits of such an approach are compared to a search server.)
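A minimal sketch of that idea, assuming the server pre-splits an inverted index into shards keyed by term prefix, so the client only fetches the shards its query terms map to. All names here (`build_shards`, `fetch_shard`) are illustrative, not a real API; in practice `fetch_shard` would be an HTTP GET.

```python
from collections import defaultdict

def build_shards(docs: dict[str, str], prefix_len: int = 2):
    """Server side: split one inverted index into shards by term prefix."""
    shards: dict[str, dict[str, set[str]]] = defaultdict(dict)
    for doc_id, text in docs.items():
        for term in set(text.lower().split()):
            shard = shards[term[:prefix_len]]
            shard.setdefault(term, set()).add(doc_id)
    return dict(shards)

def search(query: str, fetch_shard, prefix_len: int = 2):
    """Client side: fetch only the shards the query terms map to,
    then intersect the posting lists."""
    result: set[str] | None = None
    for term in query.lower().split():
        shard = fetch_shard(term[:prefix_len]) or {}
        postings = shard.get(term, set())
        result = postings if result is None else result & postings
    return result or set()

# Demo with a dict lookup standing in for the network fetch:
docs = {"d1": "client side search index", "d2": "server side rendering"}
shards = build_shards(docs)
print(search("side search", shards.get))  # only d1 matches both terms
```

The trade-off the parent mentions is visible here: the server stores many small shard files instead of one index, but each query only pulls down a couple of them.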


The indexing kinda sucks though. I can never seem to find what I'm looking for.

I keep wondering how actually costly/difficult it would be to make a text-only index of the parts of the web I actually need to regularly search.

One of these days I'm gonna scratch this itch.


Search is the hard problem; the index is something a SaaS provider can make available.

I think it does, but we're sort of against indexing everything, because then we'd just turn into Google search... It's much easier to locate something when you have only three or four documents of that type rather than a hundred...

Ha, well, OP said build a "search index"... it's not much of a search index if you can't search it :)

That’s an interesting point about search developing in an absence of a standard for sites to self-index. I can’t quite imagine what that standard index would look like.

If it's not in the search index, it's not accessible. Nobody is going to look through every wiki page to find something.

You could make recency a factor in ranking though.


This still doesn't explain how large-scale search indexing could take place effectively... the data still needs to float upstream unencrypted to be processed and indexed.

Somehow offloading full-text indexing to the client, uploading encrypted indexes to the server, and then elegantly combining those indexes across users to display a unified search would be arduous, and I don't see anyone doing it.

Also, IndexedDB has been around for years and still suffers from a variety of cross-browser inconsistencies that make it a little painful to work with.
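For what it's worth, the first two steps the parent describes (client builds the index, server only sees opaque tokens) can be sketched with deterministic HMAC tokens per term, a heavily simplified form of searchable symmetric encryption. The key name and handling here are illustrative only; real key management is omitted, and the cross-user merging is exactly the arduous part.

```python
import hmac, hashlib

KEY = b"per-user-secret"  # illustrative; real systems need key management

def token(term: str) -> str:
    """Deterministic token so the server can match terms without reading them."""
    return hmac.new(KEY, term.lower().encode(), hashlib.sha256).hexdigest()

def build_encrypted_index(docs: dict[str, str]) -> dict[str, set[str]]:
    """Client side: map opaque term tokens to doc ids, then upload."""
    index: dict[str, set[str]] = {}
    for doc_id, text in docs.items():
        for term in set(text.lower().split()):
            index.setdefault(token(term), set()).add(doc_id)
    return index  # the server stores this but cannot read the terms

def query(server_index: dict[str, set[str]], term: str) -> set[str]:
    """Client tokenizes the query term; the server just does a lookup."""
    return server_index.get(token(term), set())

idx = build_encrypted_index({"m1": "quarterly report draft", "m2": "draft email"})
print(query(idx, "draft"))  # matches both documents
```

This only works per user, since each user has their own key; combining indexes across users with different keys is where the scheme runs into the wall the parent points out.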


The thing I wanted to cover with my idea is how to avoid rebuilding the index on each client, as well as not having to store the entire index on the client. In my opinion, large indexes are probably a requirement for broad adoption, since accurate search is very important for email (besides user desires, it could even be a legal requirement for discovery requests during a lawsuit).

It would be quite easy to only index Reddit and StackOverflow.

Is there a plugin which allows you to curate your own index, i.e. a whitelist of sites to search?
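Even without a plugin, a curated whitelist can be a thin filter over whatever search backend you already use. A sketch under that assumption, where `raw_search` and the result shape are stand-ins, not a real API:

```python
from urllib.parse import urlparse

WHITELIST = {"reddit.com", "stackoverflow.com"}  # user-curated (example)

def allowed(url: str) -> bool:
    """True if the URL's host is a whitelisted domain or a subdomain of one."""
    host = urlparse(url).hostname or ""
    return any(host == d or host.endswith("." + d) for d in WHITELIST)

def curated_search(query: str, raw_search):
    """Filter any backend's results down to the whitelist."""
    return [r for r in raw_search(query) if allowed(r["url"])]

# Demo with a stubbed backend:
def raw_search(q):
    return [
        {"url": "https://old.reddit.com/r/python", "title": "r/python"},
        {"url": "https://spam.example.com/seo", "title": "blogspam"},
    ]

print(curated_search("python", raw_search))  # keeps only the reddit result
```

Matching on the parsed hostname rather than a substring avoids letting `notreddit.com.evil.com` sneak through the filter.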
