
Facebook is doing some great anonymization:

> The objection raises that not all computationally possible numbers are indeed assigned. Therefore, the lossy hash refers not to at least 16 numbers but to a maximum of 16 numbers. Furthermore, if additional data is stored along with the lossy hash, the number of individuals represented by the associated phone numbers can be reduced as data subjects not matching this additional data can be excluded. If e.g., so the DE SA, the gender is also stored, it is possible to at least divide these 16 in half.

So their hashcodes can be mapped to 16 different users, which can be trivially reduced to a single person if you have any additional information about them.




The input space is too small for SHA-1 to effectively anonymize. The NANP, for example, has fewer than 10^10 possible numbers; it would be a very simple task to build a rainbow table mapping every possible phone number to its corresponding SHA-1 hash.
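To make the point concrete, here is a toy brute-force sketch (no rainbow table needed for a space this small). The target number and six-digit prefix are made up for illustration; the same loop over the full NANP space is perfectly feasible offline.

```python
import hashlib

# Illustrative target: the unsalted SHA-1 of a (fictional) phone number.
target = hashlib.sha1(b"2025550123").hexdigest()

def crack(target_hex, prefix="202555"):
    # Enumerate the 10^4 numbers sharing this six-digit prefix.
    # Scaling this to all ~10^10 NANP numbers is a matter of hours, not years.
    for n in range(10**4):
        candidate = f"{prefix}{n:04d}".encode()
        if hashlib.sha1(candidate).hexdigest() == target_hex:
            return candidate.decode()
    return None

print(crack(target))  # -> 2025550123
```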

For the same reason, you can't just use a simple cryptographic hash to "anonymize" data such as birthdates, zip codes, SSNs, or PINs.

Using a key derivation function with a very high cost factor can mitigate this to some extent (e.g. making it take 5 seconds on an average CPU to generate the hash from a phone number), but it by no means makes for secure anonymization; eventually computing power will catch up.

Encrypting the number with a secret key (or using an HMAC), and destroying the key after the anonymization takes place might be a reasonably secure way of doing this, however.
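The keyed approach could be sketched like this: tag each number with an HMAC under a random key, then destroy the key once the dataset is built. The key size and identifiers are illustrative.

```python
import hashlib
import hmac
import secrets

# Random key known only during the anonymization run.
key = secrets.token_bytes(32)

def pseudonymize(phone: str) -> str:
    # HMAC-SHA-256 tag; without the key, an attacker cannot rebuild the
    # phone-number-to-tag mapping by brute force.
    return hmac.new(key, phone.encode(), hashlib.sha256).hexdigest()

records = {pseudonymize(p): "..." for p in ["2025550123", "2025550124"]}

del key  # destroy the key; the tags are now unlinkable to phone numbers
```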


If our goal is true anonymisation, that is, even the host cannot know who the data belongs to, why are we hashing data at all instead of removing it completely? Replace the PII (name, email address, phone number, etc.) with a fixed number of *'s. There's no reversing or guessing that.

If we are wanting information to be readable by some people in some circumstances, that's not anonymisation: that's data protection and an entirely different problem.


There are mathematical ways to guarantee different levels of anonymity without removing all identifiers, though. For example, you can widen age and location buckets until, for every unique combination of identifiers, you have a good distribution of attributes.
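A toy sketch of that bucketing idea (the records, k, and coarsening order are all made up): widen the age bands, then truncate the zip code, until every (age band, zip) combination covers at least k people.

```python
from collections import Counter

# Illustrative (age, zip) records.
people = [(34, "94110"), (36, "94110"), (35, "94114"), (52, "94114"),
          (33, "94110"), (51, "94114"), (37, "94114"), (53, "94110")]

def generalize(people, k=2):
    width, zip_digits = 1, 5
    while True:
        buckets = Counter(((age // width) * width, z[:zip_digits])
                          for age, z in people)
        if all(count >= k for count in buckets.values()):
            return width, zip_digits, buckets
        # Coarsen: first widen the age bands, then truncate the zip code.
        if width < 64:
            width *= 2
        else:
            zip_digits -= 1

width, zip_digits, buckets = generalize(people)
# Every remaining bucket now holds at least k people, so no unique
# identifier combination survives.
```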

Dunno what's proposed, but hashing personal data doesn't work for anonymity, due to the small space of said data.

In the age of cryptocurrencies, brute-forcing the hashes (imagine names, dates of birth, cities) is a trivial task.


Google Analytics drops the last octet IIRC. Hashing isn't anonymizing because the hash can later be used to re-identify a user. (See https://ec.europa.eu/info/law/law-topic/data-protection/refo...)

The article discusses hashing being an inadequate method of anonymization.

De-anonymization is incredibly easy (bordering on trivial) with any rich data set. For example, 87% of the US population is uniquely identifiable by their combination of DoB, zip code, and sex [1]. The richer a dataset is, the easier it is to de-anonymize. This is a pretty rich dataset.

[1] http://latanyasweeney.org/work/identifiability.html


> The way we’ve chosen to anonymize the data is by generating HMAC

You can also truncate the hash after the HMAC to mix the data of different users. It would still be useful for aggregate analytics, abuse protection, rate limiting, etc., but if each user shares an identifier with many others, it becomes much harder to unmask individuals or make correlations.
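A sketch of that truncation idea (the key, user ID, and bucket width are illustrative): keep only a few bytes of the HMAC so that many users deliberately collide into each bucket.

```python
import hashlib
import hmac

# Illustrative secret; in practice this would be rotated and protected.
KEY = b"rotate-me-periodically"

def bucket_id(user_id: str, nbytes: int = 2) -> str:
    # 2 bytes -> 65536 buckets, so with N users roughly N/65536 of them
    # share each identifier, which still supports rate limiting and
    # aggregate counts but frustrates per-user tracking.
    tag = hmac.new(KEY, user_id.encode(), hashlib.sha256).digest()
    return tag[:nbytes].hex()

print(bucket_id("alice@example.com"))  # 4 hex chars shared with many users
```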


That's not the common case, though, and is completely awful to use as an anonymity measure.

I think another problem is that we even call any of that "anonymization". If you replace "foobar" with "1", you haven't anonymized anything. At best, you have pseudonymized your data. Whether you use hashing or a secret mapping function, as long as identity within your dataset is preserved, what you are generating are pseudonyms.

Hashing identity solves this. All of the data can be anonymized while retaining the linkage of measurements and procedures performed on the same subject.

e.g. We don't know who this guy is, but we do know that every measurement of him was linked to this same enormous random string.


But the problem is, it IS easy. Even anonymized data sets are de-anonymized trivially, using secondary datasets. You can uniquely identify more than 87% of the US population with gender, birthday, and zip code. For more than 50%, you can use the municipality name instead of the zip.

If your friend uses a separate, privately maintained email address for everyone and every service, that will help a lot. Then we only know his identity because email is sent in the clear, even between his own server and home computer. Of course, everyone uses separate email addresses for everyone, right?


> "anonymization" generally just involves replacing a name or credit card number with some other identifier.

DJB's description of "anonymization" while talking[1] about his job as the man in the middle at Verizon:

>> Hashing is magic crypto pixie-dust, which takes personally identifiable information and makes it incomprehensible to the marketing department. When a marketing person looks at random letters and numbers they have no idea what it means. They can't imagine that anybody could possibly understand the information, reverse the hash, correlate the hashes, track them, save them, record them.

> lots of industry standard analyses unachievable with those identifiers out of the picture.

Calling something "standard" doesn't mean it's ethical. If someone wants to use that type of identifier, they need to get explicit informed consent from everyone involved, and they need to be liable for any damages that derive from their database of identified records.

[1] https://projectbullrun.org/surveillance/2015/video-2015.html...


The data is not anonymized. If you install an app that requests access to your data and the data of your friends, then it's not anonymized. How could it be? That company would have the exact links between you, under your real name, and all of the friends linked to you, under their real names.

Multiply this by a million.


Yes, anonymization is the problem I was thinking of.

Anonymity is important, but hashing is an issue because the hash has to be created somehow, and that step is just as much a black box as the rest of the chain.

The hash is less sensitive than literally any other proof of personhood currently in existence. Go ahead, try to name a proof of personhood that's more private than a hash.

You seem to be framing this as a reduction in privacy but it is quite the opposite.


The answers can be hashed in many ways to ensure anonymity.

(off topic, but this report is a good example of how to handle user data)

> anonymized

Could we, perhaps, stop using this word? Instead of the vague, often misleading term "anonymized", state directly what actually happened, e.g. "names and addresses were removed", "user data was aggregated by ${group}", or "the UID was replaced with a new, equivalent key". Most of the time, claims about data being "anonymized" are simply not true; replacing names or UIDs with a hashed value merely replaces an existing candidate key with a new synthetic key. As DJB said[1]:

>> Hashing is magic crypto pixie-dust, which takes personally identifiable information and makes it incomprehensible to the marketing department. When a marketing person looks at random letters and numbers they have no idea what it means. They can't imagine that anybody could possibly understand the information, reverse the hash, correlate the hashes, track them, save them, record them.

The rare examples where "anonymized" actually involves meaningfully making user data anonymous are when the actual user-correlated relations[2] have been destroyed. This report specifically discusses how this was done:

> If a sentence fragment appeared in less than 10 unique adventures, it was discarded from the result set to preserve anonymity.

Sometimes this required accepting a small amount of error:

> this data needed to be processed in batches of around 10000 adventures per batch. In each batch, fragments appearing only once were purged. Therefore, counts under around 25 are actually underestimates.
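The report's rule could be sketched roughly like this (the data, batch handling, and threshold here are toy stand-ins): count fragments per batch, purge per-batch singletons, then drop anything below the global threshold. The per-batch purge is what makes small counts underestimates.

```python
from collections import Counter

def count_fragments(batches, threshold=10):
    total = Counter()
    for batch in batches:
        # Count in how many adventures of this batch each fragment appears.
        c = Counter(frag for adventure in batch for frag in set(adventure))
        # Purge per-batch singletons; their counts are lost, which is why
        # totals near the threshold end up underestimated.
        total += Counter({f: n for f, n in c.items() if n > 1})
    # Keep only fragments appearing in at least `threshold` adventures.
    return {f: n for f, n in total.items() if n >= threshold}
```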

[1] https://projectbullrun.org/surveillance/2015/video-2015.html...

[2] https://en.wikipedia.org/wiki/Relation_%28database%29

