
Yeah semantic search, if solved, would address this problem.

That's really what I was getting at. Stripped right down, Google is still just viewing documents as a bag of words[1]. I mean, they have PageRank and they apply more weight to words in headings, and they have n-gram indexes and synonyms and all that clever stuff, but at its core it's still lexically centered, not semantically centered.

[1] Further reading: http://en.wikipedia.org/wiki/Bag_of_words_model
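To make the bag-of-words point concrete, here's a toy sketch (not any real engine's code): once word order is thrown away, two sentences with opposite meanings become indistinguishable.

```python
from collections import Counter

def bag_of_words(text: str) -> Counter:
    # Word order and grammar are discarded; only term counts remain.
    return Counter(text.lower().split())

doc_a = bag_of_words("the dog bit the man")
doc_b = bag_of_words("the man bit the dog")

# Both sentences produce the identical bag, which is exactly the
# semantic information a purely lexical model throws away.
assert doc_a == doc_b
```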




That’s how you make a worse search engine than Google. If you are serious about competing in that space I think you need to do something fundamentally different than Google. Treating pages as a bag of words leads to a shitty search engine. Like I said, I’ve built a few search engines, and I have tried this.

Edit: https://en.wikipedia.org/wiki/Bag-of-words_model


I work in search quality at Google, and while certainly not everyone agrees that it's a problem, a lot of people do.

I could write a lot about this, but the central issue is that it is very very hard to make changes that sacrifice on-topic-ness for good-ness that don't make the results in general worse. We're working on it though, and I suspect we'll never stop.

I think a lot of the promise lies in as you said, identifying the tangentially related article, or as I like to frame it, bringing more queries into the head. We've launched a lot of changes that do exactly this. (But you are right, it is difficult, and fundamentally so. Language is hard.)


I remember when "semantic search" was the Next Big Thing (back when all we had were simple keyword searches).

I don't know enough about the internals of Google's search engine to know if it could be called a "semantic search engine", but if not, it gets close enough to fool me.

But I feel like I'm still stuck on keyword searches for a lot of other things, like email (outlook and mutt), grepping IRC logs, searching for products in small online stores, and sometimes even things like searching for text in a long webpage.

I'm sure people have thought about these things: what technical challenges exist in improving search in these areas? Is it just a matter of integrating engines like the one that's linked here? Or maybe keyword searches are often Good Enough, so no one is really clamoring for something better.


"Today none of that is relevant. Today you need to provide semantic search ("SS") or your search will be considered broken because the results will truly suck."

I agree semantic search is useful, but what you're proposing sounds vague, like black box magic.

To provide semantic search, you still do the dirty things you just mentioned: integrate a stemming library, integrate synonyms and a huge corpus of 2+ words topics (ie. when someone searches for "big data", you should always return documents with those terms together, and never apart). You need those things. ML might help you generate them, but it isn't magically going to tell you X, Y and Z documents should also be returned for a query, even though it doesn't contain that term.
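A toy sketch of that "dirty work" (the names and tables here are illustrative, not any particular library's API): known multi-word topics are fused so they can only match together, and single terms are expanded with synonyms.

```python
# Hypothetical hand-built tables; in practice these might be
# generated with ML, as the comment above suggests.
SYNONYMS = {"large": {"big", "huge"}}
PHRASES = {("big", "data")}  # 2+ word topics that must stay together

def expand_query(query):
    """Return, per query position, the set of acceptable terms.
    Known multi-word topics are fused into a single phrase token."""
    tokens = query.lower().split()
    out, i = [], 0
    while i < len(tokens):
        # Fuse "big data" into one token so documents must contain
        # the words together, never apart.
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in PHRASES:
            out.append({tokens[i] + " " + tokens[i + 1]})
            i += 2
        else:
            out.append({tokens[i]} | SYNONYMS.get(tokens[i], set()))
            i += 1
    return out
```

A stemmer would slot in at the same point, normalizing each token before lookup.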


I think if anything we're at a crossroads. Natural language is great for some types of queries, but has its limits for others. Internet search engines have come to be used for a lot of different tasks, and a large part of Google's sticking power has been that it's at least decent at most of them. But that may also be a bit of a problem. Based on everything I've seen so far, it appears that the more conversational a search engine is, the worse it tends to become at the task search engines originally did (i.e. finding documents about a topic). We've also sort of started to perform all sorts of things through this medium of document-finding, that perhaps could be done differently. I think it does a disservice to both those other tasks and the usefulness of document-finding.

Excellent point. I guess the challenge is in first determining whether the strategy of giving pages with semantic info a higher rank is a good one. In other words, will this improve users' satisfaction with search?

I suppose it could start out in Google Labs as an experiment. Although I'm quite convinced of the power of the semantic web, I would think Google would tread carefully before changing what's working for them, now.


Good question.

The dominant paradigm for search today is the "vector space model".

The rough idea is that you start with "OR" and then score the documents so:

* the more words occur, the higher the rank

* the more frequent the words are in the document, the higher the rank

* less common words contribute more to the score

* tuning has to be done so that small documents are not privileged relative to large documents or vice versa (that turns out to be difficult, the first good algorithm for this was discovered 25 years ago, but it is hardly used commercially because nobody can be bothered to tune it for their specific text corpus)
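Those bullets are essentially TF-IDF scoring. A minimal sketch (with naive length normalization, not the tuned kind the last bullet alludes to):

```python
import math
from collections import Counter

def tf_idf_score(query, doc, corpus):
    """Score one document against a query under the bullets above:
    more matching words -> higher; more frequent in the doc -> higher;
    rarer across the corpus (higher IDF) -> contributes more."""
    n = len(corpus)
    counts = Counter(doc)
    score = 0.0
    for term in query:
        df = sum(1 for d in corpus if term in d)  # document frequency
        if df == 0 or term not in counts:
            continue  # implicit OR: a missing term just adds nothing
        tf = counts[term] / len(doc)  # naive length normalization
        idf = math.log(n / df)
        score += tf * idf
    return score
```

Here each document and the query are just lists of tokens; a real engine would work from an inverted index rather than scanning the corpus per term.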

Given that you aren't going to read 5000 results for "chrome desktop" (unless you're researching patents), it is not a problem so long as the results at the top of the ranking are good.

Google has very different concerns than most. If you have a small collection, the immediate problem is that one of the documents is about "chrome desktop" but uses other words to say it, so you miss the document. So you need tricks to find those documents. If you have a huge collection, there is going to be some document where somebody used the same language as you, and you'll be satisfied.

If AND were the default, you'd find that ordinary users would frequently find nothing and give up.

Recruiters, professional patent searchers, and other people who do nothing but search all day collect large collections of "boolean strings" that help them in their work. The best search engines, based on the VSM but using additional tricks such as autoencoders, perform comparably. The typical web search engine is worse.
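A "boolean string" in that sense is just a hand-tuned AND/OR/NOT filter. A toy evaluator, with the boolean structure flattened into three clause lists for simplicity:

```python
def matches(doc, required, any_of, excluded):
    """Evaluate a flat boolean string of the shape
    a AND b AND (c OR d) NOT e against a document's term set."""
    return (all(t in doc for t in required)
            and (not any_of or any(t in doc for t in any_of))
            and not any(t in doc for t in excluded))

# e.g. a recruiter's string: java AND (spring OR hibernate) NOT junior
resume = {"java", "spring", "aws"}
assert matches(resume, ["java"], ["spring", "hibernate"], ["junior"])
```

Real boolean strings nest arbitrarily, but the point stands: unlike VSM ranking, this gives exact, predictable recall, which is why professional searchers keep collections of them.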


"Search" is too broad to ever be solved. That's like "solving entropy".

Google focused on a specific subset — you enter a few keywords or a phrase, and the machine returns the top ~10 links to pre-existing (indexed) web pages. But that's not all there is to search!

Challenges:

1. Intranets: internal documents, typically in different modalities (FAQs, support cases, wikis, public pages) and across diverse storages that evolved throughout the years via acquisitions and osmosis.

2. Clustering: you don't have any keywords, but rather want to find how a particular document (legal template, its clause section) evolved over time. You want to avoid using keywords. Search for similar documents or document sections. Find similarity between two documents that is based on semantics rather than query keywords. Applications: eDiscovery, contract management…

3. SME & Intent: "relevant result" means different things in different domains, or even different aspects of a single domain. Google is doing an amazing job with their "single search box", but there are industries (for example, HR) where search precision matters much more than recall. More elaborate, focused, domain-specific facets or even dialogue systems make sense there.
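Challenge 2 (keyword-free similarity) is typically approached by embedding each document as a vector and comparing with cosine similarity. A minimal sketch; the vectors would come from some document-embedding model, and the values below are placeholders:

```python
import math

def cosine(u, v):
    # Cosine similarity: 1.0 for identical directions, 0.0 for orthogonal.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

clause_v1 = [0.9, 0.1, 0.3]    # a contract clause's embedding
clause_v2 = [0.85, 0.15, 0.35] # a later revision of the same clause
unrelated = [0.1, 0.9, 0.0]

# Revisions of the same clause stay close in embedding space,
# with no query keywords involved at all.
assert cosine(clause_v1, clause_v2) > cosine(clause_v1, unrelated)
```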

Commercial plug: we built a search solution focused around semantic search (in the "machine learning and vectors" sense, not "sematic web and RDFs" sense), https://scaletext.ai. It's still early days in that our clients are all over the place, but to say Google/Lucene solved search is patently false.


With all those smart trillion-weight models, can't they figure out which page is a useless keyword trap (90% of what Google search finds these days) and which is a genuinely useful page? It would be a huge help.

Good idea in theory, but the search is "ridiculously" slow. Not to mention the fact that it doesn't work after the first search. Also, you're using the term "semantic" wrong.

I think search is not only not solved, but a failed concept overall.

It's too hard to find relevant information from the gazillions of pages based only on a few words.

I think search needs to be replaced by some kind of indexing/ontology/knowledge organization system, and then maybe only be applied in the "last mile" of a person's 'search' for relevant information.


I absolutely prefer keyword search. Semantic results are nothing but infuriating, particularly when I'm searching for something specific and the engine substitutes my query for something interpreted.

BS. This can be solved through UX design, among other ways. This problem is not equal across search engines, either. There are other things that annoy me more, like Google and its problems with n-grams.

Excellent! I'm tired of search engines that optimize for natural language queries, because the inevitable trade-off is that they become useless at keyword/exact queries.

I mean, Google search obviously does this using PageRank: if someone links to the page with that word, it uses it. Me, I was searching for arcane words that I remembered hearing in the middle of a podcast. I doubt anyone actually linked to, or searched for, the same words with relevance to that podcast. Also, the solution you're proposing needs far more intuitive and intricate development than just indexing captions lol.

I definitely think there is room for search improvement. I believe the next area of search is contextual search (https://en.wikipedia.org/wiki/Contextual_searching). If you can combine what the user is looking for to actual website content then I think you might be onto something. The trick is finding that link function. Traditionally Google has relied on keywords and ranking by links. There could be other ways to find that user/content relationship.

Yeah, disambiguation is a huge problem in search as well. I remember that when I joined (2009), I kept saying we need better disambiguation algorithms, because I was working on UI stuff and getting nonsensical results when I hooked up other teams' backends to the search result page. A bunch of teams were formed and I think the problem was largely solved by 2012 or so - there're some clever tricks you can pull when you have the amount of data and computing power that Google does.

Well, that's the point I'm trying to make. It's not a back-to-the-drawing-board problem, it's a refine-the-algorithm problem. How many "if you google this string, you can't find the right site" problems have there been in the life of Google search? They continue to refine the ranking, don't they?

Well, I meant that a little facetiously. However, I agree those things should be taken into account--in addition to including documents with the terms you ask for in them. Pages with my terms (in order) should be the absolute highest ranked, and it should go down from there. At least if we're talking about building an intuitive Google vs. a really aggravating Google.
