
Facebook is doing some great anonymization:

> The objection raises that not all computationally possible numbers are indeed assigned. Therefore, the lossy hash refers not to at least 16 numbers but to a maximum of 16 numbers. Furthermore, if additional data is stored along with the lossy hash, the number of individuals represented by the associated phone numbers can be reduced as data subjects not matching this additional data can be excluded. If e.g., so the DE SA, the gender is also stored, it is possible to at least divide these 16 in half.

So their hashcodes can be mapped to 16 different users, which can be trivially reduced to a single person if you have any additional information about them.




The input space is too small for SHA-1 to effectively anonymize. The NANP, for example, has fewer than 10^10 possible numbers; it would be a very simple task to build a rainbow table mapping every possible phone number to its corresponding SHA-1 hash.
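To make the point concrete, here is a toy brute-force sketch (no rainbow table needed for a space this small). The target number and six-digit prefix are made up for illustration; the same loop over the full NANP space is perfectly feasible offline.

```python
import hashlib

# Illustrative target: the unsalted SHA-1 of a (fictional) phone number.
target = hashlib.sha1(b"2025550123").hexdigest()

def crack(target_hex, prefix="202555"):
    # Enumerate the 10^4 numbers sharing this six-digit prefix.
    # Scaling this to all ~10^10 NANP numbers is a matter of hours, not years.
    for n in range(10**4):
        candidate = f"{prefix}{n:04d}".encode()
        if hashlib.sha1(candidate).hexdigest() == target_hex:
            return candidate.decode()
    return None

print(crack(target))  # -> 2025550123
```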

For the same reason, you can't just use a simple cryptographic hash to "anonymize" data such as birthdates, zip codes, SSNs, or PINs.

Using a key derivation function with a very high cost factor can mitigate this to some extent (e.g. making it take 5 seconds on an average CPU to generate the hash from a phone number), but it by no means makes for secure anonymization; eventually computing power will catch up.

Encrypting the number with a secret key (or using an HMAC), and destroying the key after the anonymization takes place might be a reasonably secure way of doing this, however.
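The keyed approach could be sketched like this: tag each number with an HMAC under a random key, then destroy the key once the dataset is built. The key size and identifiers are illustrative.

```python
import hashlib
import hmac
import secrets

# Random key known only during the anonymization run.
key = secrets.token_bytes(32)

def pseudonymize(phone: str) -> str:
    # HMAC-SHA-256 tag; without the key, an attacker cannot rebuild the
    # phone-number-to-tag mapping by brute force.
    return hmac.new(key, phone.encode(), hashlib.sha256).hexdigest()

records = {pseudonymize(p): "..." for p in ["2025550123", "2025550124"]}

del key  # destroy the key; the tags are now unlinkable to phone numbers
```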


If our goal is true anonymisation, that is, even the host cannot know who the data belongs to, why are we hashing data at all instead of removing it completely? Replace the PII (name, email address, phone number, etc.) with a fixed number of *'s. There's no reversing or guessing that.

If we are wanting information to be readable by some people in some circumstances, that's not anonymisation: that's data protection and an entirely different problem.


There are mathematical ways to guarantee different levels of anonymity without removing all identifiers, though. For example, you can widen age and location buckets until, for every unique combination of identifiers, you have a good distribution of attributes.
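A toy sketch of that bucketing idea (the records, k, and coarsening order are all made up): widen the age bands, then truncate the zip code, until every (age band, zip) combination covers at least k people.

```python
from collections import Counter

# Illustrative (age, zip) records.
people = [(34, "94110"), (36, "94110"), (35, "94114"), (52, "94114"),
          (33, "94110"), (51, "94114"), (37, "94114"), (53, "94110")]

def generalize(people, k=2):
    width, zip_digits = 1, 5
    while True:
        buckets = Counter(((age // width) * width, z[:zip_digits])
                          for age, z in people)
        if all(count >= k for count in buckets.values()):
            return width, zip_digits, buckets
        # Coarsen: first widen the age bands, then truncate the zip code.
        if width < 64:
            width *= 2
        else:
            zip_digits -= 1

width, zip_digits, buckets = generalize(people)
# Every remaining bucket now holds at least k people, so no unique
# identifier combination survives.
```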

Dunno what's proposed, but hashing personal data doesn't work for anonymity, due to the small space of said data.

In the age of cryptocurrencies, brute-forcing the hashes (imagine names, dates of birth, cities) is a trivial task.


Google Analytics drops the last octet IIRC. Hashing isn't anonymizing because the hash can later be used to re-identify a user. (See https://ec.europa.eu/info/law/law-topic/data-protection/refo...)

The article discusses hashing being an inadequate method of anonymization.

De-anonymization is incredibly easy (bordering on trivial) with any rich data set. For example, 87% of the US population is uniquely identifiable by their combination of DoB, zip code, and sex [1]. The richer a dataset is, the easier it is to de-anonymize. This is a pretty rich dataset.

[1] http://latanyasweeney.org/work/identifiability.html


> The way we’ve chosen to anonymize the data is by generating HMAC

You can also truncate the hash after the HMAC to mix the data of different users. It would still be useful for aggregate analytics, abuse protection, rate limiting, etc., but if each user shares an identifier with many others, it becomes much harder to unmask individuals or make correlations.
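A sketch of that truncation idea (the key, user ID, and bucket width are illustrative): keep only a few bytes of the HMAC so that many users deliberately collide into each bucket.

```python
import hashlib
import hmac

# Illustrative secret; in practice this would be rotated and protected.
KEY = b"rotate-me-periodically"

def bucket_id(user_id: str, nbytes: int = 2) -> str:
    # 2 bytes -> 65536 buckets, so with N users roughly N/65536 of them
    # share each identifier, which still supports rate limiting and
    # aggregate counts but frustrates per-user tracking.
    tag = hmac.new(KEY, user_id.encode(), hashlib.sha256).digest()
    return tag[:nbytes].hex()

print(bucket_id("alice@example.com"))  # 4 hex chars shared with many users
```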


That's not the common case, though, and is completely awful to use as an anonymity measure.

I think another problem is that we even call any of that "anonymization". If you replace "foobar" with "1", you haven't anonymized anything. At best, you have pseudonymized your data. Whether you use hashing or a secret mapping function, as long as identity within your dataset is preserved, what you are generating are pseudonyms.

Hashing identity solves this. All of the data can be anonymized while retaining the linkage of measurements and procedures performed on the same subject.

e.g. We don't know who this guy is, but we do know that every measurement of him was linked to this same enormous random string.


But the problem is, it IS easy. Even anonymized data sets are de-anonymized trivially, using secondary datasets. You can uniquely identify more than 87% of the US population with gender, birthday, and zip code. For more than 50%, you can use the municipality name instead of the zip.

If your friend uses a separate, privately maintained email address for everyone and every service, that will help a lot. Then we only know his identity because email is sent in the clear, even between his own server and home computer. Of course, everyone uses separate email addresses for everyone, right?


> "anonymization" generally just involves replacing a name or credit card number with some other identifier.

DJB's description of "anonymization" while talking[1] about his job as the man in the middle at Verizon:

>> Hashing is magic crypto pixie-dust, which takes personally identifiable information and makes it incomprehensible to the marketing department. When a marketing person looks at random letters and numbers they have no idea what it means. They can't imagine that anybody could possibly understand the information, reverse the hash, correlate the hashes, track them, save them, record them.

> lots of industry standard analyses unachievable with those identifiers out of the picture.

Calling something "standard" doesn't mean it's ethical. If someone wants to use that type of identifier, they need to get explicit informed consent from everyone involved, and they need to be liable for any damages that derive from their database of identified records.

[1] https://projectbullrun.org/surveillance/2015/video-2015.html...


The data is not anonymized. If you install an app that requests access to your data and the data of your friends, then it's not anonymized. How could it be? That company would have the exact links between you, under your real name, and all of the friends linked to you, under their real names.

Multiply this by a million.


Yes, anonymization is the problem I was thinking of.

Anonymity is important, but hashing is an issue because the hash has to be created somehow, and that step is just as much a black box as the rest of the chain.

The hash is less sensitive than literally any other proof of personhood currently in existence. Go ahead, try to name a proof of personhood that's more private than a hash.

You seem to be framing this as a reduction in privacy but it is quite the opposite.


The answers can be hashed in many ways to ensure anonymity.

(off topic, but this report is a good example of how to handle user data)

> anonymized

Could we, perhaps, stop using this word? Instead of the vague, often misleading term "anonymized", state directly what actually happened, e.g. "names and addresses were removed", "user data was aggregated by ${group}", or "the UID was replaced with a new, equivalent key". Most of the time, claims about data being "anonymized" are simply not true; replacing names or UIDs with a hashed value merely replaces an existing candidate key with a new synthetic key. As DJB said[1]:

>> Hashing is magic crypto pixie-dust, which takes personally identifiable information and makes it incomprehensible to the marketing department. When a marketing person looks at random letters and numbers they have no idea what it means. They can't imagine that anybody could possibly understand the information, reverse the hash, correlate the hashes, track them, save them, record them.

The rare examples where "anonymized" actually involves meaningfully making user data anonymous are when the actual user-correlated relations[2] have been destroyed. This report specifically discusses how this was done:

> If a sentence fragment appeared in less than 10 unique adventures, it was discarded from the result set to preserve anonymity.

Sometimes this required accepting a small amount of error:

> this data needed to be processed in batches of around 10000 adventures per batch. In each batch, fragments appearing only once were purged. Therefore, counts under around 25 are actually underestimates.
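The report's rule could be sketched roughly like this (the data, batch handling, and threshold here are toy stand-ins): count fragments per batch, purge per-batch singletons, then drop anything below the global threshold. The per-batch purge is what makes small counts underestimates.

```python
from collections import Counter

def count_fragments(batches, threshold=10):
    total = Counter()
    for batch in batches:
        # Count in how many adventures of this batch each fragment appears.
        c = Counter(frag for adventure in batch for frag in set(adventure))
        # Purge per-batch singletons; their counts are lost, which is why
        # totals near the threshold end up underestimated.
        total += Counter({f: n for f, n in c.items() if n > 1})
    # Keep only fragments appearing in at least `threshold` adventures.
    return {f: n for f, n in total.items() if n >= threshold}
```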

[1] https://projectbullrun.org/surveillance/2015/video-2015.html...

[2] https://en.wikipedia.org/wiki/Relation_%28database%29

