I know it is not good HN policy to doubt your intentions, but by emphasizing the NOTs, DEIDENTIFIED, and YOU, you make me highly suspicious. Anonymization of data is difficult at best and sometimes nearly impossible, so I would advise publishing the entire protocol if you want to give people assurances. The encryption key, as already stated, is pointless without an explanation of how you use it and why it is employed; otherwise it is just smoke and mirrors and no real security.
Using crypto hashes to anonymize data is one of those mistakes I've seen several times, and I wanted to draw some attention to the issue so that hopefully we can all learn from it.
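To make the failure mode concrete: when the identifier comes from a small, enumerable space, hashing it is just a pseudonym that anyone can reverse by brute force. A minimal sketch, using phone numbers in a hypothetical area code as the identifier:

    import hashlib

    def pseudonymize(phone: str) -> str:
        # A common (flawed) approach: replace the identifier with its hash.
        return hashlib.sha256(phone.encode()).hexdigest()

    def recover(leaked_hash: str, area_code: str = "415"):
        # An attacker who knows the input space simply enumerates it.
        # One area code is only ~10 million candidates; all US numbers are
        # ~10^10, still trivial for modern hardware.
        for n in range(10_000_000):
            candidate = f"{area_code}{n:07d}"
            if pseudonymize(candidate) == leaked_hash:
                return candidate
        return None

Keying the hash (HMAC with a secret salt) only moves the problem: whoever holds the key can still re-identify every record.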
I basically don't believe non-degenerate (pseudo)anonymization is possible, although that complicated af homomorphic encryption stuff makes me a little uncertain.
I was trying to avoid the general ideas on what is a good way to anonymize data, because I don't think there are general rules that apply, and I'm not in a position to give authoritative advice on this. The more I dug in, the more I realized this is probably one of the hardest technical problems that exists right now, and there isn't yet a right answer that just works (the way "use scrypt" is for passwords).
As for GDPR, I think digging into this in more detail would be a great follow up.
People are so naive about how hard it is to really anonymize data and how surprisingly easy it can be to de-anonymize it that I'd never trust it without:
1) Some serious explanation and scrutiny of exactly how their anonymization process works
2) How data leaks are removed from logs and elsewhere
3) How data is aggregated and presented, and what measures are taken to prevent de-anonymization (see the sketch after this list)
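On point 3, one concrete (and by itself still insufficient) measure is a k-anonymity-style suppression rule on the aggregation step. A rough sketch, with hypothetical field names:

    from collections import Counter

    K = 10  # only release a group if at least K individuals share it

    def aggregate(records, quasi_identifiers=("zip", "age_bracket", "gender")):
        # Count how many individuals share each combination of quasi-identifiers.
        groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
        released = {g: n for g, n in groups.items() if n >= K}
        suppressed = {g: n for g, n in groups.items() if n < K}
        return released, suppressed

Even this leaks information across repeated releases, which is why the discussion has to cover the whole pipeline, not just one filter.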
Edit: their FAQ is nowhere near a "serious" discussion. And looking at their code is not an efficient way of learning whether it's a trustworthy approach.
Forgive my language, but I'd expect people here to understand that's horseshit; they absolutely have enough data and patterning to de-anonymize the data. They spent time making it look anonymous.
"Anonymization" in the sense of transforming a dataset so that it's still useful but doesn't significantly reduce the privacy of the people it describes, is usually impossible, or at least beyond the state of the art. People start out with just a few tens of bits of anonymity and bits are everywhere.
You probably have a better chance of creating your own secure block cipher than of achieving this goal. In a similar way, your inability to see what's wrong with your scheme is not evidence that it works.
I don't like to be negative, and I'm all for continued research, but at this point the conservative thing to do with data that you need to "anonymize" is delete it.
Tell me how someone anonymous travels from day to day and I will tell you who he or she is.
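That isn't an idle boast; a handful of spatio-temporal points is usually enough. A hypothetical sketch of the standard attack on a pseudonymous location trace, assuming pings of the form (hour_of_day, cell_id) and any auxiliary source of home/work locations:

    from collections import Counter

    def home_work_pair(trace):
        # Home: most common night-time cell; work: most common daytime cell.
        # Assumes the trace covers nights and workdays.
        night = Counter(cell for hour, cell in trace if hour < 6 or hour >= 22)
        day = Counter(cell for hour, cell in trace if 9 <= hour < 17)
        return night.most_common(1)[0][0], day.most_common(1)[0][0]

    def reidentify(anonymous_trace, known_people):
        # known_people: {name: (home_cell, work_cell)} from billing addresses,
        # employer records, or a second "anonymized" dataset.
        target = home_work_pair(anonymous_trace)
        return [name for name, pair in known_people.items() if pair == target]

The (home, work) pair alone is close to unique for most commuters, and every extra point only narrows it further.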
Decoupling identity data from other data does not even work in theory. Moreover, it also conflicts with the ability to retract personal data: if data is anonymized, it is no longer possible to get rid of your personal records.
Homomorphic encryption is the only method that might make a dent here. Laws will be broken.
If their description of the software is correct then that part is irrelevant (privacy-wise). I guess with this source-code you could at least figure out if their implementation is as anonymous as they claim it is.
I sent you (Dylan) an email, but openly for discussion:
I'm building a product around an interactive learning system. Users train it themselves, with our assistance as needed. But I don't want to ship a puppy that will piss on your rug... and it would be great if it already had some useful skills and intuitions out of the box...
So I want to carefully anonymize that data by hand, and retain it for training models that all customers can benefit from. (I also want to train models to recognize PII, but only to assist humans in doing the task; no amount of error would be considered acceptable for this. A rough sketch of that assist-only pass is at the end of this comment.)
I honestly don't see any problem, but I had a security consultant nearly spew beer on me when I got to that part, and insist that I drop that line of thinking.
I need more advice on this.
(The only other path I can think of is homomorphic encryption... but I would not want to retain it in the case where the original is deleted...)
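For the assist-only PII pass mentioned above, I'm picturing something minimal like this (hypothetical patterns; it only catches the obvious cases, which is exactly why the human stays in the loop):

    import re

    # Flag likely PII for a human reviewer; nothing is auto-redacted.
    PATTERNS = {
        "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
        "phone": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
    }

    def flag_pii(text: str):
        hits = []
        for label, pattern in PATTERNS.items():
            for m in pattern.finditer(text):
                hits.append((label, m.start(), m.end(), m.group()))
        return hits  # presented to the reviewer alongside the original text

It says nothing about names, addresses, or quasi-identifiers, so I'm under no illusion it solves the anonymization problem; it's purely a reviewer aid.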
The intent behind this tool seems good, but I don't think it's a good idea. To actually anonymize data requires semantic understanding of that data and an understanding of what sort of data, harmless by itself, is transmuted into identifying data when provided in the context of other otherwise harmless data.
This tool doesn't help you with any of that. It seems to be a glorified awk script. My concern is that helping the user with the easiest part of anonymizing data stands to encourage the user to go full steam ahead without stopping to think very carefully about what they're doing.
That "anonymized technical data" link points to a github repo with no documentation. It's not clear what data you're actually sending. What do you consider personally identifying?
Anonymization of data is not a simple task. Even with the noblest of intentions things can go wrong. See the problems with the sharing of medical data in the UK and examples from the US [1].
De-anonymization is something that we already have a lot of experience with, specifically tying a device to an individual. There’s nothing special about a public key that makes this harder.
Instead, the Cryptosphere favors system robustness over guarantees on anonymity.
You trace back the "provenance of that file" through cryptographic signatures. You could make your own throwaway identity, use it to publish something, and, through the continued propagation of that data through the network, its publication would no longer require your activity.
It should be considered pseudonymous publishing.
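A generic sketch of the throwaway-identity idea (not the Cryptosphere's actual protocol; just the shape of it, using Ed25519 from the Python cryptography package):

    from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey
    from cryptography.hazmat.primitives import serialization

    # Publish a blob under a throwaway identity: the signature proves the same
    # key published the content (provenance), and the key is then discarded.
    throwaway = Ed25519PrivateKey.generate()
    content = b"the data being published"
    signature = throwaway.sign(content)
    public_key = throwaway.public_key().public_bytes(
        serialization.Encoding.Raw, serialization.PublicFormat.Raw
    )
    # Peers re-host (content, signature, public_key); once it has propagated,
    # the publisher no longer needs to stay online or keep the private key.

The linkability remains, though: anything else ever signed with that key is tied to the same pseudonym, which is why it is pseudonymous rather than anonymous.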