Thank you for this tool! I think it's a cool idea. It's also private, since it doesn't send any requests.
What I would love is the actual predicted number of points (I don't mind that the error would be quite large from the title alone, as long as the model isn't overfit).
I used the same dataset for a small project during my CS master's. It was a really fun challenge, and it taught me a lot.
Most notably, it taught me that it was incredibly hard to make significant progress past the simplest, most naive approach: "Take the average rating a user gives, take the average rating a movie gets, multiply" (ratings normalized to be between 0 and 1).
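For the curious, here's a minimal sketch of that baseline in Python; the data and column names are made up for illustration:

```python
# Naive baseline: predict(user, movie) = avg(user's ratings) * avg(movie's ratings),
# with ratings normalized to [0, 1]. Data and column names are hypothetical.
import pandas as pd

ratings = pd.DataFrame({
    "user":   [1, 1, 2, 2, 3],
    "movie":  ["A", "B", "A", "C", "B"],
    "rating": [5, 3, 4, 2, 1],
})
ratings["rating"] = (ratings["rating"] - 1) / 4  # normalize 1-5 stars to [0, 1]

user_avg = ratings.groupby("user")["rating"].mean()
movie_avg = ratings.groupby("movie")["rating"].mean()

def predict(user, movie):
    # Multiply the user's average rating by the movie's average rating.
    return user_avg[user] * movie_avg[movie]

print(predict(1, "A"))
```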
Just using this method gave us 95% of the accuracy of our final method. If I remember my calculation right, our method was ~90% as accurate as the prize-winning result.
It's actually very easy to measure. The audience here is close to the 1%, so with a 2% error margin, your use case is already discarded in the grand scheme when measuring behavior for the larger population.
I found that surprisingly useful. I'd be interested in developing that scale a little further, since it's kind of weird that everything has different weights. Take point 7, for example ("The current code base can't be used for anything", scored 1-20): what's the difference between, say, a 15 and a 16? It seems very subjective, so maybe everything could go on a scale of 1 to 4 to make it easier to choose, and then you'd apply some multiplicative weighting for each issue...
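Something like this, purely as a hypothetical sketch (the issue names and weights here are invented, not taken from the actual scale):

```python
# Hypothetical sketch: rate each issue on a coarse 1-4 severity scale, then
# apply a fixed per-issue multiplicative weight, instead of asking raters to
# pick a point on a 1-20 scale directly.
issues = {
    "codebase_unusable": {"severity": 3, "weight": 5.0},  # replaces the 1-20 item
    "missing_tests":     {"severity": 2, "weight": 2.0},
    "poor_naming":       {"severity": 4, "weight": 1.0},
}

total = sum(v["severity"] * v["weight"] for v in issues.values())
print(total)  # 3*5 + 2*2 + 4*1 = 23.0
```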
Relevancy - I'd guess it's difficult to measure, due to the large dataset, the large number of possible search terms, and the large number of possible results.
It is closer to fair sampling than you think, since the average internet user is more likely to install toolbars than the technogeek niches we hang out in. Also, it's more useful for relative changes, like drops and increases, than for absolute percentages.
Also, it's one of the few samples we have, unfortunately.
Does anyone have details on how they converted a multidimensional problem (size of file + user-perceived quality) into a binary win/loss score? It felt like that part of the article jumped from "draw an oval" to "draw the rest of the fucking owl" real quick.
Also, how they evaluated user-perceived quality doesn't seem to be elaborated; that itself was an area of active research last I looked.
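Absent details in the article, here's one guess at how you could collapse the two axes into a win/loss. The Pareto-dominance rule and the weight are my invention, not anything from the article:

```python
# Hypothetical: collapse (file size, perceived quality) into win/loss.
# Declare a win on Pareto dominance; otherwise trade the axes off with an
# arbitrary weight. Nothing here is the article's actual method.
def outcome(a, b, quality_weight=0.7):
    # a, b: (size_mb, quality) pairs; smaller size and higher quality are better.
    size_a, qual_a = a
    size_b, qual_b = b
    if size_a <= size_b and qual_a >= qual_b and a != b:
        return "win"   # a is at least as good on both axes, strictly better somewhere
    if size_b <= size_a and qual_b >= qual_a and a != b:
        return "loss"  # b Pareto-dominates a
    # Mixed case: scalarize with a made-up weight.
    score = lambda size, qual: quality_weight * qual - (1 - quality_weight) * size / 100.0
    return "win" if score(size_a, qual_a) >= score(size_b, qual_b) else "loss"

print(outcome((4.0, 0.92), (5.0, 0.90)))  # "win": smaller file AND higher quality
```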
This is interesting, but in practice the quality of results is paramount, and how does something like searx compare against DDG… or Google? I wonder what kind of metric could be defined to even plot that.
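One candidate (my suggestion, nothing official): collect graded human relevance judgments for the same queries on each engine and plot NDCG per engine. A minimal sketch:

```python
# NDCG@k over graded relevance judgments (0-3) for one query on one engine.
# You'd average this over many queries and plot one curve per search engine.
import math

def ndcg(relevances, k=10):
    # DCG discounts each result's judged relevance by its rank.
    dcg = sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))
    # Normalize by the DCG of the ideal (best-possible) ordering.
    ideal = sum(rel / math.log2(rank + 2)
                for rank, rel in enumerate(sorted(relevances, reverse=True)[:k]))
    return dcg / ideal if ideal > 0 else 0.0

print(ndcg([3, 2, 0, 1]))  # one query's score, judgments in rank order
```

The hard (and expensive) part is the judging itself, not the math, which is probably why nobody publishes such a plot.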
52% correct seems awfully high - I'm guessing it's inflated by how easy the questions on StackOverflow are. In my experience, at least in my domain, I'll be lucky if it can generate an answer that is more than 5 lines of code without a single error. Still incredibly useful - you just have to handhold it and watch it very carefully when working through a solution.
I found that to be one of the more interesting metrics shown - but I have to agree with everyone else in wondering how it's calculated. That said, anecdotally, it seems to jibe with experience, but who knows.