So this may be true. But if it is, it is still a skill that is not widely held. So there is either a need for a simple-to-use curation utility or an inexpensive curation service. I don't have a specific idea of how to create or implement this, but I think there is a market for it.
His suggestion sounds more like human-curated data sets than the automated curation you're assuming, although this is a good distinction to make.
Yes, whenever there is an abundance of data (a subjective feeling of being overwhelmed), we need curation that aligns with whatever your interests are. Grep, for example, is a data curator. Even the flags in ls could be.
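To make that concrete, here's a minimal sketch of curation-as-filtering in Python; the interests list and feed.txt are made-up examples, not anything from a real system:

    # Curation as filtering: keep only the items that match your interests,
    # much the way grep keeps only the lines that match a pattern.
    # "interests" and "feed.txt" are made-up examples.
    interests = ["open data", "provenance", "metadata"]

    with open("feed.txt") as feed:
        for line in feed:
            if any(topic in line.lower() for topic in interests):
                print(line, end="")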
Do you remember the very first time you went online, perhaps with dial-up? The sheer adrenaline rush of paying by the minute. Every second counted. Imagine having a good data curator back then. It would have saved tons of money, and time.
Yeah, that's actually a good point. I was thinking that what would probably be useful is a system that collects the data first and then curates it from there.
There's an interesting opportunity to _meaningfully_ "open-source the algorithm" by just letting each person define their own curation, at least on top of the provider's initial curation.
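As a rough sketch of what user-defined curation on top of a provider's initial ranking could look like (the Item fields and the example scorer are hypothetical, not any real provider's API):

    # The provider supplies an initial ranking; each user plugs in their own
    # scoring function on top of it.
    from dataclasses import dataclass

    @dataclass
    class Item:
        title: str
        provider_score: float  # the provider's initial curation signal
        topic: str

    def rerank(items, user_score):
        # Combine the provider's score with a user-defined one.
        return sorted(items, key=lambda it: it.provider_score + user_score(it),
                      reverse=True)

    # One user's curation: boost a topic they care about, bury one they don't.
    my_score = lambda it: {"datasets": 2.0, "celebrity news": -5.0}.get(it.topic, 0.0)

    feed = [
        Item("New public dataset released", 0.4, "datasets"),
        Item("Gossip roundup", 0.9, "celebrity news"),
    ]
    for item in rerank(feed, my_score):
        print(item.title)

The provider's ranking still matters, but the last word on ordering belongs to a function the user can read and change.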
The internet of that world would look different from ours, probably in ways that are better for consumers.
Interesting point, though I more or less disagree about the level of curation necessary to make a minimally useful product. I think data.gov is FAR below that level, and is just noise so far. But these are practical questions, and I think we agree in principle about a lot.
The short story is that hand-curation does not seem to work at a very large scale, so we need to create tools for automated discovery (e.g., search, recommendations, a controlled skill vocabulary, etc.) and automated but credible signals of ability and trustworthiness (e.g., feedback systems, skills tests, verified identities, etc.). In a nutshell, the challenge is trying to take the things Josh was doing by hand and turn them into data-driven features.
Note: I'm the staff economist at oDesk & I'm on the research/data science team.
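To give one concrete flavor of what a "data-driven feature" could look like here: a feedback score that stays credible even with few reviews, by shrinking the raw average toward a site-wide prior (a Bayesian average). This is just an illustrative sketch with made-up prior values, not oDesk's actual formula:

    # Until enough reviews accumulate, the site-wide mean dominates the
    # score; after that, the worker's own record does.
    def trust_score(ratings, prior_mean=4.0, prior_weight=10):
        return (prior_mean * prior_weight + sum(ratings)) / (prior_weight + len(ratings))

    print(trust_score([5, 5, 5]))        # new worker: ~4.23, not a perfect 5.0
    print(trust_score([5, 5, 5] * 40))   # established worker: ~4.92

The point of the shrinkage is that a handful of shill reviews can't masquerade as an established track record.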
I think UP42 is trying this model. They have a catalogue where you can download archived data or task it. But yeah, right now their sources are a bit limited, and it's an open question whether or not they'll be able to sign up enough providers to make it work.
And they might be falling into the trap the original article points out, by trying to offer you processed data instead of being laser-focused on having the best data sources and API.
There are people on reddit working on that. If you are interested, it would be good to coordinate efforts. /r/archiveteam and /r/datahoarder are the usual hangouts for that sort of project.
I'd like to open source at least a portion of the data to see if it is possible to build a community-driven database. It could connect works back to museums and auction records, as well as articles and research, and potentially build a crowd-sourced provenance.
Yes, I think you're right that something like this could and should be done. I've been collecting ideas and experimenting with stuff like that for a long time. It's not easy. Finding datasets is not the big problem, so I think tagging datasets doesn't help as much as it does with, say, photo collections.
The crucial thing is data quality. You basically have three kinds of public datasets:
1) Academic ones, which are mostly high quality, but tend to gather dust and not be kept up to date.
2) High quality commercial datasets, which are expensive and tightly guarded.
3) Free datasets of mostly low quality. Yes, you can use Dapper to scrape them and Freebase to store them, but what's missing is a process to assure data quality. That's what a community effort could provide or coordinate. Something like apache.org for data (see the sketch after this comment for one flavor of automated checks). And there would have to be a way for non-programmers to help, because with most datasets programmers are not the ones who know the data best, and the coding can be extremely dull. It's unbelievable how many different ways there are to screw up data and how difficult it is to clean. There's always some manual work left, and you can't beat a pair of eyes (yet) for spotting errors.
There would also have to be a way for users of datasets to pay a reasonable amount of money to have a particular dataset brought up to high quality standards.
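Here's a minimal sketch of the kind of automated quality checks such an effort could standardize. The file and column names are hypothetical, and real rules would have to come from the people who know the data; manual review would still be needed:

    # Flag two common ways datasets get screwed up: duplicate keys and
    # empty fields. A community process would layer many more rules on top.
    import csv
    from collections import Counter

    def quality_report(path, key_column):
        with open(path, newline="") as f:
            rows = list(csv.DictReader(f))
        keys = [row[key_column] for row in rows]
        duplicates = [k for k, n in Counter(keys).items() if n > 1]
        empty = sum(1 for row in rows for v in row.values() if not (v or "").strip())
        return {"rows": len(rows), "duplicate_keys": duplicates, "empty_fields": empty}

    print(quality_report("cities.csv", key_column="city_id"))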
I'm very interested in answers to this question as well. I'm working near this space and have a lot of thoughts on what such a service would need to be successful, at least for image tasks. I think it could offer real value to the work forces that complete data labeling tasks, cutting out middleman companies with worker-owned labeling and marketplace software tools.
I do find it fascinating that this information can be offered as a service, given that: 1. the data is already in the public domain, 2. it is published as a single, easily ingestible file, and 3. it is already free to download and use under its existing license.
Maybe this will lead to less intrusive or less frequent calls for user donations?
They spend a lot of time talking about how dataset owners don't have to do anything to expose their data to GOODS. It sounds like a big part of the work that goes into GOODS is automatic metadata inference.
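A toy version of what metadata inference could look like, just to illustrate the idea (this is not how GOODS actually works internally, and the file name is made up):

    # Derive a dataset's schema from the file itself, so the owner doesn't
    # have to register anything: size, header, and column types guessed
    # from a sample of rows.
    import csv, os

    def infer_metadata(path, sample_size=100):
        with open(path, newline="") as f:
            reader = csv.reader(f)
            header = next(reader)
            sample = [row for _, row in zip(range(sample_size), reader)]

        def col_type(values):
            try:
                for v in values:
                    if v:
                        float(v)
                return "number"
            except ValueError:
                return "string"

        columns = {name: col_type([row[i] for row in sample if i < len(row)])
                   for i, name in enumerate(header)}
        return {"path": path, "bytes": os.path.getsize(path), "columns": columns}

    print(infer_metadata("some_dataset.csv"))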