
> So I did what any software engineer would do and started digging through the data.

That behavior.




>> So you are certainly aware that there are avenues to creating the data set. Given that, it is quite reasonable to say that search is unnecessary.

How is it unnecessary? They used none of those methods, so they had to use search. That is search being necessary, not the opposite.


> I didn’t know what I’d do with the data at this point, but I started collecting it right away so that I’d have as much as possible to play with later.

And people complain that everything everywhere collects data on everyone.


> I took a look at the data. The data schema is disorganized to the point that a lot of janitorial work would be necessary to get it useable and perform any analysis or visualization.

In other words, it is real world data.


> Why not collect all the information you possibly can, whether or not you immediately see obvious value to it, and then find ways to justify it later?

Because it's non-trivial and it costs a lot of money to do that, in terms of engineers' salaries, storage, etc.


>we’ll look at the data

Just to check, did you look at the data in this case?


> I wonder if he has hard data about how often people look at his documentation.

Context: He works for RAD Game Tools, now a subsidiary of Epic Games, and worked on the Oodle data compression suite for a long time. As such, I believe there were tons of direct customers who would contact him and other team members whenever things got hairy. (There was no free version of Oodle, so every user is a paying customer or occasionally an evaluator.) So I'd guess he does have some data, just no hard numbers.


> Of course there will be correlation

Yes, and given enough data this:

> so long as you can't tell sensitive categories

will not be true.

> I have to trust the engineers

The engineers that work for a giant corporation whose entire reason for existing is mining and collating as much data as possible on you? Why would you trust them at all?


> I can hand you all my orgs data and 200 people who work with it everyday and it will still take you years to figure out what anything means.

Yeah, what are you supposed to do with the information? I worked at a place that was paranoid about leaks of non-personal data, like source code.

Even if their direct competitor got the source, they would have almost no use for it, since it is an undocumented mess. The source without the dev departments is useless.

The same applies to business strategy if it leaks. Which competitor is nimble enough to change anything based on that data?


> That's not fun, that's boring. The fun part is looking at the data and gleaning all these potential patterns from it, seeing what potential is there and what could be

Exactly! This is the reason why I love my job. It gets even better when you uncover a non-intuitive insight.


> Because my data comes from a variety of unstructured, possibly dirty sources which need cleaning and transforming before they can be made sense of.

Seattle Data Guy had a great end-of-year top-10 memes post recently, and one of them went like this:

> oh cool you’ve hired a data scientist. so you have a collection of reliable and easy to query data sources, right?

> …

> you do have a collection of reliable and easy to query data sources, right?

---

Like, most of the time in businesses… if the data can’t be queried with SQL then it’s not ready to be used by the rest of the business. Whether that’s for dashboards, monitoring, downstream analytics or reporting. Data engineers do the dirty data cleaning. Data scientists do the actual science.

That’s what I took from the parent at least.

YMMV obviously depending on your domain. ML being a good example where things like end to end speech-to-text operates on wav files directly.
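
As a minimal sketch of that idea, assuming Python with the standard-library sqlite3 and csv modules and a made-up messy CSV export (the column names and values here are invented for illustration), the data-engineering step is what turns raw text into something the rest of the business can hit with SQL:

```python
import csv
import io
import sqlite3

# A hypothetical messy export: inconsistent casing and stray whitespace.
raw = "Name , amount\n alice ,10\nBOB, 20\n"

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (name TEXT, amount INTEGER)")

reader = csv.DictReader(io.StringIO(raw))
# Normalize header names once so downstream SQL is predictable.
reader.fieldnames = [f.strip().lower() for f in reader.fieldnames]
for row in reader:
    conn.execute(
        "INSERT INTO sales VALUES (?, ?)",
        (row["name"].strip().lower(), int(row["amount"])),
    )

total = conn.execute("SELECT SUM(amount) FROM sales").fetchone()[0]
print(total)  # prints 30
```

Once the cleaning lives in a pipeline like this, dashboards, monitoring, and analysts all query the same normalized table instead of each re-parsing the raw export.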


> For example, suppose there is bug where the rate calculations aren't working for certain types of routes. The engineers will want to look up those routes to understand what is causing it.

Then show routes without names.

> If you are designing an algorithm to detect fraud, you are going to want to look at cases of fraud to understand how to design the algorithm.

Then show names without routes.

> Further, if you want to do usability testing you are going to need to test with. You might want to check the different types of names used in the system to make sure they display properly. You may also want to sample the list of customers to user-test with live data or survey customers.

Use a library that can generate realistic but fake data. I feel there is no excuse not to compartmentalize. If data security is not important to the business... well, it only takes one bad article like this to cast doubt on the whole company.
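
A minimal sketch of the fake-data approach, using only Python's standard library (dedicated libraries like Faker do this far more thoroughly; the field names and name lists below are made up for illustration):

```python
import random
import string

# Tiny hypothetical name pools; a real generator would use much larger ones.
FIRST_NAMES = ["Alice", "Bob", "Chioma", "Diego", "Emre"]
LAST_NAMES = ["Nguyen", "Smith", "Okafor", "Garcia", "Yilmaz"]

def fake_customer(rng: random.Random) -> dict:
    """Generate one synthetic customer record containing no real PII."""
    first = rng.choice(FIRST_NAMES)
    last = rng.choice(LAST_NAMES)
    return {
        "name": f"{first} {last}",
        "email": f"{first.lower()}.{last.lower()}@example.com",
        "route_id": "".join(
            rng.choices(string.ascii_uppercase + string.digits, k=8)
        ),
    }

rng = random.Random(42)  # seeded so test fixtures are reproducible
customers = [fake_customer(rng) for _ in range(3)]
for c in customers:
    print(c["name"], c["email"])
```

Seeding the generator means your usability tests and fixtures are reproducible, while engineers never need to touch live customer records for display or layout testing.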


> I could be a lot happier if they gave me access to the analytics data (which they most definitely do log)

Back in the day, us geeks would just fire up a web server and call it a day.

What happened?


> perhaps the data scientist needs to be involved in the data wrangling in order to understand the source better?

Founder of an ETL startup here. This is exactly what we believe: the end-user of the data should be involved as early in the data pipeline as possible, including the wrangling. If you eliminate the engineer from the ETL process you remove a lot of painful back-and-forth and get more flexible pipelines.


> They collected the data

No they didn't. Users input most of their data.


> I'd love to see the dataset you have access to backing up any of that statement.

That would make you a sub-processor.


> I doubt this is the case or it has any side effects on their performance.

"Show me the data!"


> It's a constant fight between analyst and application developer. The analyst wants a history of all states. The application developer often only cares about having a correct current state.

Again, this is totally and completely wrong. Actually, it doesn't even make sense to say it's wrong. If I'm building an application to be used by an analyst, I want to meet all their requirements. They tell me what sort of data they want collected. If requirements change, or the data isn't as valuable as originally thought, or it's badly formatted, or whatever, then we work together to change it.

I don't fight with anyone.... They don't want shit data, and I don't want to store shit data. So who is there to fight with?


>You probably shouldn't be keeping data as pickles, for both compatibility and security reasons

It's science research: my own simulation code and data. Some of it is HDF5, but pickle files are pretty convenient. The workflow is mainly turning data into plots. Heh, I thought I was doing alright since I'm not using text files.


> but selectivly use data to do what the executive wants to be done.

The thirst of some people when I give them something that sounds like what they want... When they re-parrot last year's findings back to me incorrectly, because they only listened to the parts that backed up their instinctual beliefs, it makes me angry that I wasted my time on the data when all they wanted was an excuse. I coulda given them an excuse without doing all that data digging...

