
>> So you are certainly aware that there are avenues to creating the data set. Given that, it is quite reasonable to say that search is unnecessary.

How is it unnecessary? They used none of those methods, so they had to use search. That is search being necessary, not the opposite.




> So I did what any software engineer would do and started digging through the data.

That behavior.


> You can quote a million papers from your cursory Google Books search

Still better data than the literally nothing that you brought to the discussion so far.


> the data you typed in the internal search

Your problem is that you viewed it as an internal search, when it was designed and advertised as a global search.

If you want to argue that the global search should have been an internal search, that's fine. If you imply that the search was an internal search, though, then that's misleading FUD.


> they very explicitly don't let you upload any data unnecessary to find matches.

Is that supposed to be reassuring in some way? I don't understand how that really makes the practice OK.


> why not run it on all the data you can

Because more data requires more cleaning and standardization (with more edge cases), and more infrastructure to obtain and process it at scale.


> We are just the data set.

Suppliers of the data set, without which they are of no value to those you describe as their users.


> I took a look at the data. The data schema is disorganized to the point that a lot of janitorial work would be necessary to get it useable and perform any analysis or visualization.

In other words, it is real world data.


> Why not collect all the information you possibly can, whether or not you immediately see obvious value to it, and then find ways to justify it later?

Because it's non-trivial and it costs a lot of money to do that, in terms of engineers' salaries, storage, etc.


> This is interesting but what problem does it solve better than CTRL+F-ing a transcript?

Producing the transcript?

Being able to classify and search data seems like a pretty big deal these days too.


> PS: The only way a scientist is overwhelmed with information is when they need to do something by hand. It's hard to guess how a scientist would like to collect less information. Worst case you ignore it because you can't process it yet.

I am still trying to parse that statement. Twelve years ago we needed a 250-node cluster to get anything done with the data presented to us, and that was a fraction of what's being generated today.


>We tried A x B in this way, that way, some other way, none of them worked

How searchable is this data? Like, do I need to be an expert who is up-to-date on most proceedings in the subfield to know this, or is this information easy to pull up with a few searches?


> Just because you conceptualize it in your mental model does not mean you need a graph database.

Yes! When I was younger I worked on a problem that required computing some very basic graph metrics. My seniors tried to do the work in an early graph database, and it was a disaster. It turned out that literally just reading the lines from a file and counting things got the job done in a few seconds.

They refused to use the results until they came out of the graph database, "just in case we needed other metrics". We never needed the other metrics.
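For the curious, the "read the lines and count" version really is only a handful of lines. A minimal sketch of that approach; the edge-list format ("src dst" per line) and the sample data are assumptions for illustration, not the actual problem described above:

    # Minimal "read lines and count" graph metrics, no graph database needed.
    from collections import Counter

    # In a real job this would be the lines of an edge-list file;
    # inline sample data keeps the sketch self-contained.
    lines = [
        "a b",
        "a c",
        "b c",
        "c d",
    ]

    degree = Counter()
    edge_count = 0
    for line in lines:
        parts = line.split()
        if len(parts) != 2:
            continue            # skip malformed lines
        src, dst = parts
        degree[src] += 1
        degree[dst] += 1
        edge_count += 1

    print("nodes:", len(degree))                         # 4
    print("edges:", edge_count)                          # 4
    print("max degree:", max(degree.values()))           # 3 (node c)
    print("avg degree:", 2 * edge_count / len(degree))   # 2.0

For node counts, edge counts, and degree distributions, a single streaming pass over the file is usually all it takes.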


> Why do you think that?

Observation from working in research support, I'd guess. It no longer seems typical that you do whatever you need to do for your data.


>What would be the point of adjusting the algorithm other than picking which results should be surfaced?

Well I like to think they do want to improve the results.


> I have to say I was fairly hesitant to search for that data. It's not the kind of thing you want showing up in your search history...

That, in and of itself, is a very interesting phenomenon, don't you think?


> So if this database is not kept up, it will be pretty much useless to answer this kind of question.

Umm, exactly my point. Maybe I wasn't too clear.


>This is 'data mining' right?

That would be data mining done wrong. It's perfectly fine to look at data to provoke new hypotheses, but you should not use the same data to confirm the hypothesis it provoked. Either use fresh data, or make sure you can still ensure correctness if you are reusing the data.
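One common way to honor that rule in practice is to partition the data once, before any exploration, and only ever test on the held-out part. A minimal sketch; the synthetic records and the 50/50 split are assumptions for illustration, not anything from this thread:

    # Exploration/confirmation split: generate hypotheses on one half,
    # test them only on the other half.
    import random

    random.seed(0)
    records = [{"group": random.choice("ab"), "value": random.gauss(0, 1)}
               for _ in range(1000)]

    # Split once, up front, before looking at anything.
    random.shuffle(records)
    explore, confirm = records[:500], records[500:]

    def group_mean(rows, group):
        vals = [r["value"] for r in rows if r["group"] == group]
        return sum(vals) / len(vals)

    # Exploration: browse `explore` freely to come up with a hypothesis
    # (say, "group a has a higher mean than group b").
    print("explore:", group_mean(explore, "a"), group_mean(explore, "b"))

    # Confirmation: evaluate that hypothesis only on `confirm`,
    # never on the data that suggested it.
    print("confirm:", group_mean(confirm, "a"), group_mean(confirm, "b"))

The key design choice is that the split happens before exploration, so the confirmation set never influences which hypothesis gets tested.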


>> If you think that the Cockroach Labs, Redis Labs, Mongos, Elastics, etc., of the world provide a net benefit to the software world

Elastic is an exception, though; I think Lucene would be a better addition to that list.


> It is not usual to get a bunch of data and just publish it without some additional work.

I thought that's what data lakes and event sourcing are supposed to solve.

