
>> So you are certainly aware that there are avenues to creating the data set. Given that, it is quite reasonable to say that search is unnecessary.

How is it unnecessary? They used none of those methods, so they had to use search. That is search being necessary, not the opposite.




> So I did what any software engineer would do and started digging through the data.

That behavior.


> You can quote a million papers from your cursory Google Books search

Still better data than the literally nothing that you brought to the discussion so far.


> the data you typed in the internal search

Your problem is that you viewed it as an internal search, when it was designed and advertised as a global search.

If you want to argue that the global search should have been an internal search, that's fine. If you imply that the search was an internal search, though, then that's misleading FUD.


> they very explicitly don't let you upload any data unnecessary to find matches.

Is that supposed to be reassuring in some way? I don't understand how that really makes the practice OK.


> why not run it on all the data you can

Because more data requires more cleaning and standardization (with more edge cases), and more infrastructure to obtain and process it at scale.


> We are just the data set.

Suppliers of the data set, without which they are of no value to those you describe as their users.


> I took a look at the data. The data schema is disorganized to the point that a lot of janitorial work would be necessary to get it useable and perform any analysis or visualization.

In other words, it is real world data.


> Why not collect all the information you possibly can, whether or not you immediately see obvious value to it, and then find ways to justify it later?

Because it's non-trivial and it costs a lot of money to do that, in terms of engineers' salaries, storage, etc.


> This is interesting but what problem does it solve better than CTRL+F-ing a transcript?

Producing the transcript?

Being able to classify and search data seems like a pretty big deal these days too.


> PS: The only way a scientist is overwhelmed with information is when they need to do something by hand. It's hard to guess how a scientist would like to collect less information. Worst case you ignore it because you can't process it yet.

I am still trying to parse that statement. Twelve years ago we needed a 250-node cluster to get anything done with the data presented to us, and that was a fraction of what's being generated today.


>We tried A x B in this way, that way, some other way, none of them worked

How searchable is this data? Like, do I need to be an expert who is up-to-date on most proceedings in the subfield to know this, or is this information easy to pull up with a few searches?


> Just because you conceptualize it in your mental model does not mean you need a graph database.

Yes! When I was younger I worked on a problem that required computing some very basic graph metrics. My seniors tried to do the work in an early graph database, and it was a disaster. It turned out that literally just reading the lines from a file and counting things got the job done in a few seconds.

They refused to use the results until they came out of the graph database, "just in case we needed other metrics". We never needed the other metrics.
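For the curious, the "read the lines and count" version really is only a handful of lines. A minimal sketch of that approach; the edge-list format ("src dst" per line) and the sample data are assumptions for illustration, not the actual problem described above:

    # Minimal "read lines and count" graph metrics, no graph database needed.
    from collections import Counter

    # In a real job this would be the lines of an edge-list file;
    # inline sample data keeps the sketch self-contained.
    lines = [
        "a b",
        "a c",
        "b c",
        "c d",
    ]

    degree = Counter()
    edge_count = 0
    for line in lines:
        parts = line.split()
        if len(parts) != 2:
            continue            # skip malformed lines
        src, dst = parts
        degree[src] += 1
        degree[dst] += 1
        edge_count += 1

    print("nodes:", len(degree))                         # 4
    print("edges:", edge_count)                          # 4
    print("max degree:", max(degree.values()))           # 3 (node c)
    print("avg degree:", 2 * edge_count / len(degree))   # 2.0

For node counts, edge counts, and degree distributions, a single streaming pass over the file is usually all it takes.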


> Why do you think that?

Observation from working in research support, I'd guess. It no longer seems typical that you do whatever you need to do for your data.


>What would be the point of adjusting the algorithm other than picking which results should be surfaced?

Well I like to think they do want to improve the results.


> I have to say I was fairly hesitant to search for that data. It's not the kind of thing you want showing up in your search history...

That, in and of itself, is a very interesting phenomenon, don't you think?


> So if this database is not kept up, it will be pretty much useless to answer this kind of question.

Umm, exactly my point. Maybe I wasn't too clear.


>This is 'data mining' right?

That would be data mining done wrong. It's perfectly fine to look at data to provoke new hypotheses, but you should not use the same data to confirm the hypothesis it provoked. Either use fresh data, or make sure you can still ensure correctness if you are reusing the data.
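One common way to honor that rule in practice is to partition the data once, before any exploration, and only ever test on the held-out part. A minimal sketch; the synthetic records and the 50/50 split are assumptions for illustration, not anything from this thread:

    # Exploration/confirmation split: generate hypotheses on one half,
    # test them only on the other half.
    import random

    random.seed(0)
    records = [{"group": random.choice("ab"), "value": random.gauss(0, 1)}
               for _ in range(1000)]

    # Split once, up front, before looking at anything.
    random.shuffle(records)
    explore, confirm = records[:500], records[500:]

    def group_mean(rows, group):
        vals = [r["value"] for r in rows if r["group"] == group]
        return sum(vals) / len(vals)

    # Exploration: browse `explore` freely to come up with a hypothesis
    # (say, "group a has a higher mean than group b").
    print("explore:", group_mean(explore, "a"), group_mean(explore, "b"))

    # Confirmation: evaluate that hypothesis only on `confirm`,
    # never on the data that suggested it.
    print("confirm:", group_mean(confirm, "a"), group_mean(confirm, "b"))

The key design choice is that the split happens before exploration, so the confirmation set never influences which hypothesis gets tested.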


>> If you think that the Cockroach Labs, Redis Labs, Mongos, Elastics, etc., of the world provide a net benefit to the software world

Elastic is an exception, though; I think Lucene would be a better addition to that list.


> It is not usual to get a bunch of data and just publish it without some additional work.

I thought that's what data lakes and event sourcing are supposed to solve.

