I mean, Google search obviously does it using PageRank: if someone links to the page using that word, it counts. I, on the other hand, was searching for arcane words that I remembered hearing in the middle of a podcast. I doubt anyone has linked to or searched for those words in connection with that podcast. Also, the solution you're describing needs far more intricate development than just indexing captions lol.
I would have thought that a lack of search capability would matter. How many neat pages or sites have you come across due to searching via Google? There is currently no easy way to search the podcast content for stuff I am interested in.
There are tools like automatic captions + manual tagging that can make search much better. The tools are out there, just not well adopted at this point.
But are you solving the right problem? This sounds like someone has produced a very good and efficient version of AltaVista. Back in the 1990s, if you wanted to do classic keyword searches of the web, and find all pages that had terms A and B but not C, it would give them to you, in a big unsorted pile. The web was still small enough that this was sometimes useful, but until Google came along with tricks to rank pages that are obvious in retrospect, it just wasn't useful for common search terms.
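To make the AltaVista-style "A and B but not C" query concrete, here is a minimal sketch of a boolean search over a toy inverted index. The function names (`build_index`, `boolean_search`) and the sample documents are made up for illustration, not taken from any real engine. Note that the results come back ordered by document id only, which is exactly the "big unsorted pile" problem.

```python
from collections import defaultdict

def build_index(docs):
    """Map each lowercase word to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

def boolean_search(index, must_have, must_not_have=()):
    """Return ids of docs with every must_have term and no must_not_have term."""
    if not must_have:
        return []
    # AND: intersect the sets of documents for each required term.
    result = set.intersection(*(index.get(t, set()) for t in must_have))
    # NOT: drop any document containing an excluded term.
    for term in must_not_have:
        result -= index.get(term, set())
    return sorted(result)  # ordered by id only; there is no notion of relevance

# Hypothetical toy corpus.
docs = {
    1: "cheap flights to paris",
    2: "paris hotels and cheap hostels",
    3: "cheap hotels in rome",
}
index = build_index(docs)
print(boolean_search(index, ["cheap", "paris"], ["hotels"]))  # [1]
```

With three documents the lack of ranking is harmless; with millions of matches for a common term, it is the whole problem.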
It would be a lot better if you could just search an index of all the words on the web, and then we can refine our queries against the results to narrow things down even more. As it is right now, search just doesn't work anymore.
People say Google search is terrible these days, but I find the opposite.
I can vaguely describe in a sentence the gist of an article I've read, or an image, and the proper result will usually be in the first page.
Of course, it doesn't always work; sometimes there are "hash collisions", so to speak. But I don't think the old algorithm would have been more successful either, since if I knew the exact keywords to use, I wouldn't need to start with a vague description in the first place.
That’s how you make a worse search engine than Google. If you are serious about competing in that space I think you need to do something fundamentally different than Google. Treating pages as a bag of words leads to a shitty search engine. Like I said, I’ve built a few search engines, and I have tried this.
What it might need to do is index commonly searched terms and provide a reverse lookup to the location of the row... oh wait, that's called a search engine. :)
Think of a random word that comes to mind: let's try "animus". "No Matches Found. Please rephrase your query and try again.". Hm, okay. How about "theory"? "No Matches Found." What about "set"? Nope. "rank"? No. "war of 1812"? No. "barack obama": nope. The first term I got to return content was the biggest softball I could think of - "procog".
I'm all for a new search engine (especially one that really lets content providers know what they need to do to rank highly for queries) but I'd say this isn't ready yet; most of those queries are the sorts of queries I search for every few minutes.
Good idea in theory, but the search is "ridiculously" slow. Not to mention the fact that it doesn't work after the first search. Also, you're using the term "semantic" wrong.
Yeah semantic search, if solved, would address this problem.
That's really what I was getting at. Stripped right down, Google is still just viewing documents as a bag of words[1]. I mean, they have PageRank, and they apply more weight to words in headings, and they have 6-gram indexes and synonyms and all that clever stuff, but at its core it's still lexically centered, not semantically centered.
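For anyone unfamiliar with the term, here is a tiny sketch of what "bag of words" means. It is only an illustration of the representation (the `bag_of_words` helper is made up here), not a claim about what any real engine does internally: once a document is reduced to word counts, word order and therefore much of the meaning is gone.

```python
from collections import Counter

def bag_of_words(text):
    """Reduce a text to word counts, discarding word order entirely."""
    return Counter(text.lower().split())

a = "dog bites man"
b = "man bites dog"

# Opposite meanings, identical bags: a purely lexical model cannot tell them apart.
print(bag_of_words(a) == bag_of_words(b))  # True
```

N-gram indexes recover some local order, but the representation is still built from surface strings rather than meaning.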
You touched on what I was driving at in my original comment. There is no doubt whatsoever that if you were able to make a full pass over the corpus with a regular expression you could find docs you can't find on Google's search. But that's obviously not how their search works. They have to make it work at their scale, which dictates the format of the index, which in turn limits the possibilities for query operators. They have to make these design choices so that their product can exist at all.
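The full regex pass described above is trivial to write for a small corpus, which is part of why it is tempting. A minimal sketch, assuming the whole corpus fits in memory as a dict of id to text (`grep_corpus` and the sample documents are hypothetical):

```python
import re

def grep_corpus(docs, pattern):
    """Scan every document with a regex; cost grows linearly with corpus size."""
    rx = re.compile(pattern, re.IGNORECASE)
    return [doc_id for doc_id, text in docs.items() if rx.search(text)]

# Hypothetical toy corpus.
docs = {
    "a": "Error code E1042 raised in module foo",
    "b": "No errors here",
    "c": "error code e77 seen twice",
}

# Find "e" followed by 2 to 4 digits: a query a keyword index cannot express.
print(grep_corpus(docs, r"\be\d{2,4}\b"))  # ['a', 'c']
```

Every query touches every document, so this only works while the corpus is small enough to scan in the time you are willing to wait.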
"Grep the world" is a fine strategy for corpora up to a certain size, and I do wish there was a product that just stored everything I've ever seen and let me run expensive searches on that.
I mean, “wiki <search>” works on Google too. But I don’t want to type anything extra. My problem here is that, if we assume the search engine should be good at predicting what I actually want to see for a given term, Google is failing at that.
The problem is when it doesn't work: it fails at the least convenient, most irritating moments. For popular content it doesn't matter if you forget a precise word used there; you'll get to it soon enough. But for specific, niche content, which is the most valuable to me most of the time, even if you get all the keywords right you might not find what you're looking for on Google. The reasons range from the keywords being too generic to the site being no longer online, and it's really frustrating when it happens.
OneTab and bookmarks are not an answer because they don't save the content. I tried Joplin + Web Clipper which does, but it works on a single-tab basis, and when I have 200 tabs open sending them all to Joplin manually takes ages... and then Joplin slows down to a crawl when you're done.
True, but Google is not searching their entire index for those. A simple linear search takes O(N) time, so for a word that occurs billions of times, Google is not going to go through that entire list. They might use some clever hashing and sorting to jump around. However, when intersecting two keywords, they either have to pre-generate the intersection or make the data sets they are intersecting small enough to get those 10 results quickly.
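One way to picture the "jump around" idea is a sketch like the one below. This is an assumption for illustration, not a description of Google's implementation: it keeps each term's document ids sorted, walks the shorter list, binary-searches into the longer one, and stops as soon as it has k hits. The function name `intersect_top_k` and the sample lists are made up.

```python
from bisect import bisect_left

def intersect_top_k(short, long, k=10):
    """Intersect two sorted doc-id lists, stopping after k matches.

    Walks the shorter list and binary-searches the longer one, so the
    longer list is never scanned linearly.
    """
    if k <= 0:
        return []
    out = []
    pos = 0
    for doc_id in short:
        # Jump forward in the long list; never look before `pos` again.
        pos = bisect_left(long, doc_id, pos)
        if pos == len(long):
            break  # every remaining id in `short` is larger than all of `long`
        if long[pos] == doc_id:
            out.append(doc_id)
            if len(out) == k:
                break  # early exit: we only need the first k results
    return out

rare = [3, 8, 21, 40, 95]           # ids of docs containing a rare word
common = list(range(0, 100, 2))     # ids of docs containing a common word
print(intersect_top_k(rare, common))        # [8, 40]
print(intersect_top_k(rare, common, k=1))   # [8]
```

The cost is roughly the length of the short list times the log of the long one, which is why intersecting a rare term with a common one can be cheap even when the common term's list is huge.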
Well, that's the point I'm trying to make. It's not a back-to-the-drawing-board problem, it's a refine-the-algorithm problem. How many "if you google this string, you can't find the right site" problems have there been in the life of Google search? They continue to refine PageRank, don't they?