
I am not convinced by one of the quiz answers:

> Our calculations confirm that a relatively short series of truly randomly chosen English dictionary words is secure; many people find these somewhat more memorable. Above we used "In the jungle! The mighty Jungle, the lion sleeps tonight!" The important thing is to choose enough words and to choose them in a random un-guessable way, such as by changing the spacing, punctuation, spelling, or capitalization.

The problem with this example is that the 10 words are not chosen independently. Type "in the j" into a Google search box and the whole phrase appears in the drop-down. So the entropy of the choice of source phrase is about lg2(37^8), or roughly 42 bits.

So an approximation of the total entropy is:

Choice of source phrase = lg2(37^8) ~= 41.7 bits

Choose one of the 10 suggestions from the drop-down box = lg2(10) ~= 3.3 bits

Permutation of words = lg2(10! / 2! / 3!) ~= 18.2 bits

Spacing (assume each word may independently be preceded by a space with probability 0.5) = 10 bits

Punctuation (each word may be independently followed by '!') = 10 bits

Capitalization: independently choose one of {lowercase, camelcase, uppercase} for each word = lg2(3^10) ~= 15.8 bits

Total so far: about 99 bits.

Now consider the third option: a mixture of 16 independently chosen letters, numbers and symbols. Assume most ASCII characters are available (let's eliminate the single quote, backslash and $, which cause problems for some web apps) and we have

lg2(92^16) ~= 104.4 bits, which wins.
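
Just to double-check the arithmetic, here is a quick Python sketch of the same budget; the component labels and per-item estimates are my own rough model from above, nothing more:

    import math

    # Rough entropy budget for the modified lyric phrase, using the
    # component estimates above (assumptions, not an exact model).
    components = {
        "source phrase: 8 chars from a ~37-symbol alphabet": 8 * math.log2(37),
        "pick one of ~10 autocomplete suggestions": math.log2(10),
        "order of the 10 words with repeats (3x 'the', 2x 'jungle')":
            math.log2(math.factorial(10) // (math.factorial(2) * math.factorial(3))),
        "optional leading space per word": 10.0,
        "optional trailing '!' per word": 10.0,
        "one of 3 capitalizations per word": 10 * math.log2(3),
    }
    phrase_bits = sum(components.values())
    random_bits = 16 * math.log2(92)   # 16 chars from a ~92-symbol printable set

    print(f"phrase estimate: {phrase_bits:.1f} bits")   # ~99.1 bits
    print(f"16 random chars: {random_bits:.1f} bits")   # ~104.4 bits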




You have completely missed the point of the comic, which is that if you choose 4 common English words at random, the entropy is surprisingly high. It isn't based on "human readable strings" at all.

For example, my /usr/share/dict/american-english contains just shy of 100,000 words. A random word chosen from that set has 16.6 bits of entropy, and four randomly chosen words has over 66 bits of entropy. If anything, XKCD's comic is understating the entropy involved.
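
To reproduce that count on your own machine, something along these lines works (the path is a guess and varies by distribution; /usr/share/dict/words is another common location):

    import math

    # Path varies by system; adjust as needed.
    with open("/usr/share/dict/american-english") as f:
        words = {line.strip() for line in f if line.strip()}

    per_word = math.log2(len(words))
    print(f"{len(words)} words: {per_word:.1f} bits per word, "
          f"{4 * per_word:.1f} bits for four random words")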


The entropy count for each word represents picking one word at random from 2^11 choices of words. It doesn't have anything to do with the characters.

Except that when people create phrases like that, they aren't choosing random words from a dictionary; they're most likely choosing words from their own vocabulary, which is significantly smaller than 100k words. Additionally, the distribution is not uniform, reducing the entropy even further.

Your method is vastly different from Randall's: each word there gets 11 bits of entropy because it's randomly chosen.

In properly formed English sentences, each character only has about 1 to 1.5 bits of entropy, and I'm not certain that taking the first letters of words in a sentence would have much higher per-character entropy than that, as the first letters of words are not very randomly distributed.


English has about 1 bit of entropy per character, so typing a normal 60-character sentence gives you ~60 bits of entropy. Random words give you something like 2-3 bits per character, so ~120-180 bits for a 60-character string. Random alphanumeric characters have about 5.95 bits of entropy each, so you get about 178 bits from 30 of them.

So your comparison is roughly right, but only with an annoying definition of "haiku" and of "dictionary of words". However, a weaker scheme should still be fine.
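
The arithmetic behind those figures, roughly; the 60-character and 30-character lengths are the ones assumed above:

    import math

    english_bits = 60 * 1.0              # ~1 bit/char for ordinary English prose
    word_bits_lo = 60 * 2.0              # random words at ~2 bits/char
    word_bits_hi = 60 * 3.0              # random words at ~3 bits/char
    alnum_bits = 30 * math.log2(62)      # 62 alphanumeric chars -> ~5.95 bits each

    print(f"English sentence (60 chars):    ~{english_bits:.0f} bits")
    print(f"Random words (60 chars):        ~{word_bits_lo:.0f}-{word_bits_hi:.0f} bits")
    print(f"Random alphanumeric (30 chars): ~{alnum_bits:.1f} bits")   # ~178.6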


Those 3000 words are not random in natural language. If they were your calculation would be correct, but they aren't so the actual entropy of the system is likely nowhere near 138 bits. In other words, song title or not, if the sentence was an actual sentence the entropy is much lower. To get maximum entropy out of sets of words you have to use something equivalent to Diceware.

Yes. If you choose your four words from a dictionary of, say, 2000, then there are 1.6e13 combinations -- about 44 bits of entropy.

As a rule of thumb, English text has about one bit per character of entropy. [0, 1] Since we're going with averages, let's say 5 letters + a space for each word. So you need a 7- or 8-word sentence, with normal capitalization and punctuation, to get 42 bits of entropy. And of course it shouldn't be a well-known phrase like "I've got a bad feeling about this!"

[0] The original: http://www.princeton.edu/~wbialek/rome/refs/shannon_51.pdf

[1] And some evidence that it's still correct: http://en.wikipedia.org/wiki/Hutter_Prize
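
Spelling that rule of thumb out, with ~1 bit per character and ~6 characters per word (5 letters plus a space, both rough assumptions from above):

    # Back-of-the-envelope for the ~42-bit target.
    bits_per_char = 1.0
    chars_per_word = 6          # ~5 letters plus a space

    words_needed = 42 / (bits_per_char * chars_per_word)
    print(words_needed)         # 7.0 -> a sentence of roughly 7-8 words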


Can anyone explain to me why he assigns only 11 bits of entropy to a word? Doesn't that correspond to choosing from only about 2000 words? If we choose from the more typical adult vocabulary of 100,000 words, isn't that log2(100,000) ~= 17 bits? Or am I doing it wrong?

If the dictionary really has 100,000 words, you're looking down the barrel of about 50 bits of entropy for a three-word phrase.

In a more likely dictionary of the 5,000 most commonly used words in the English language, a three-word passphrase still gets you about 37 bits of entropy. Make that a four-word passphrase and you're back up around 49 bits.
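
The general formula is just number-of-words times log2(dictionary size); a quick sketch that reproduces those figures:

    import math

    def passphrase_bits(dict_size, n_words):
        """Entropy of n_words drawn uniformly and independently from dict_size words."""
        return n_words * math.log2(dict_size)

    print(round(passphrase_bits(100_000, 3), 1))   # ~49.8 bits
    print(round(passphrase_bits(5_000, 3), 1))     # ~36.9 bits
    print(round(passphrase_bits(5_000, 4), 1))     # ~49.2 bits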


I'm not sure your average English word has as many as five characters on average. Okay, let's count this paragraph.

91 characters, in 21 words. 4.3 characters on average, which would mean 13 bits of entropy. I don't believe it. It sounds like your source didn't account for word frequencies in real sentences, let alone grammatical constraints.
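
For anyone who wants to repeat the count, a rough script is below; the exact totals depend on how you treat apostrophes and punctuation, so the numbers may differ by a character or two from a hand count.

    # Rough average word length of a piece of text.
    text = ("I'm not sure your average English word has as many as five "
            "characters on average. Okay, let's count this paragraph.")

    words = text.split()
    letters = sum(1 for c in text if c.isalpha() or c == "'")
    # word count, letter count, average letters per word
    print(len(words), letters, round(letters / len(words), 1))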


100k words might be a bit too many for a dictionary of words that are easy to remember. And the comic proposes too little entropy (i.e. you should use more than four random words).

The Schneier method is basically equivalent, but with important caveats: 1) such a phrase is not randomly chosen; at a minimum it follows basic English grammar, and at worst it is well known and thus part of the attacker's dictionary; and 2) even if that is accounted for, the first letters of words are not distributed uniformly across the alphabet; just take a look at any (printed) dictionary. That greatly reduces the entropy and makes it hard to estimate reliably.


You only need a phrase of twelve words from a 2048 word dictionary to have 128 bits of entropy. Twelve words is up to "Thy kingdom" in the Lord's Prayer, so certainly people are able to memorize twelve word phrases or even 24 word phrases without too much trouble.

And English is a lot more than 2048 words - so you could probably use a shorter phrase and still be fine.
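
A minimal sketch of generating such a phrase with a CSPRNG; "wordlist.txt" here is a stand-in for any 2048-word list (e.g. a BIP-39 style list), not a particular file:

    import math
    import secrets

    # "wordlist.txt" is a placeholder: any list of 2048 distinct words works.
    with open("wordlist.txt") as f:
        wordlist = [w.strip() for w in f if w.strip()]

    n_words = 12
    phrase = " ".join(secrets.choice(wordlist) for _ in range(n_words))
    bits = n_words * math.log2(len(wordlist))

    print(phrase)
    print(f"~{bits:.0f} bits of entropy")   # 12 * 11 = 132 bits for 2048 words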


Using the american-english dictionary from aspell [1] and filtering for lines that only contain lowercase letters [2] gives you 77649 words. For four words that gives approximately 65 bits of entropy [3].

[1] Probably available on a Linux system at /usr/share/dict/american-english

[2] '^[a-z]+$'

[3] log_2(77649^4)
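
The same filter and calculation in Python (the dictionary path and exact word count will differ between systems):

    import math
    import re

    # Same filter as [2]: keep only lines consisting of lowercase a-z.
    with open("/usr/share/dict/american-english") as f:
        words = {line.strip() for line in f
                 if re.fullmatch(r"[a-z]+", line.strip())}

    print(len(words))                            # ~77649 on the poster's system
    print(round(4 * math.log2(len(words)), 1))   # log_2(77649^4) ~= 65.0 bits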


The thing about the Lord's Prayer doesn't really follow. If you use a grammatically correct and semantically commonplace 12 word sequence like that, you surely don't have 128 bits of entropy. But the ease of memorization comes almost entirely from those attributes!

To get 128 bits of entropy with words, you need to pick about thirteen out of a million words--which is on the order of all the words in the English language--and give all of them equal probability. The sequence needs to be fully random as well. What you end up with will surely be easier to memorize than a UUID, but substantially more difficult than the start of the Lord's Prayer.

EDIT: Math is wrong, I was thinking 10 bits per million instead of 20. So 6-7 words out of a million (whole language) or 13 words out of a thousand (very limited subset of the language). Point about random selection still stands, but it's certainly easier than 13 very uncommon words. Still much harder than a realistic sentence of that length, though.
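
For reference, the corrected arithmetic, as a small sketch:

    import math

    def words_needed(target_bits, dict_size):
        """How many uniformly random words are needed to reach target_bits."""
        return math.ceil(target_bits / math.log2(dict_size))

    for size in (1_000, 2_048, 100_000, 1_000_000):
        print(f"{size:>9}-word dictionary -> {words_needed(128, size)} words")
    # 1000 -> 13, 2048 -> 12, 100000 -> 8, 1000000 -> 7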


The XKCD comic is only partially correct. Depending on which source you believe, English text has about 0.6 to 2.3 bits of entropy per character. This means you need somewhere between roughly 4.8 and 18.3 characters in each word to reach 11 bits of entropy per word. If the entropy is closer to 2 bits per character, that is a realistic word length. However, if the entropy is closer to 1 bit per character, the words have to be too long to be realistic.

I've seen it said somewhere that there is about 1 bit of entropy per letter of English text. The best compressor (the Hutter Prize leader) compresses 100MB of Wikipedia down to about 15MB, including the compressor itself, a ratio of roughly 6.5:1, which isn't that far off.

Considering that, 40k words to 400kbits is not too surprising.


For those wondering, the source including word list is linked [1].

At a glance, the base dictionary is 2280 words; jargon is 8800; science is 575. So definitely consider adding all the lists! Combined, that gives (check my math) ~13.5 bits of entropy per word.

[1] https://bitbucket.org/jvdl/correcthorsebatterystaple/src/mas...
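
Checking that math, with the list sizes read off at a glance (so treat them as approximate):

    import math

    # Word-list sizes as read off the linked source.
    sizes = {"base": 2280, "jargon": 8800, "science": 575}
    total = sum(sizes.values())

    print(total, round(math.log2(total), 2))   # 11655 words -> ~13.51 bits per word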


> The 11 bits of entropy refer to a dictionary of 2K words to choose from. The reason to type full ones is you're not hamstrung by the "no common prefix" limitation, which allows larger (and easier to remember) dictionaries.

Another note on this: assume an average word length of 5, and that's 11/5 or about 2.2 bits per character typed (again, assuming the wordlist doesn't lose some bits to "double coding" like "at hat" / "a that").

At 7 bits per word, where typing the first two characters is enough, we type 7/2 or 3.5 bits per character.

Conversely, we only memorize 7 bits per word vs 11 bits.
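
To make the comparison concrete, a sketch; the 5-character average word length and the 128-word list with unique two-letter prefixes are assumptions for illustration, not properties of any particular wordlist:

    import math

    full_word_bits = math.log2(2048)   # ~11 bits/word from a 2048-word list
    prefix_bits = math.log2(128)       # hypothetical 128-word list with unique 2-char prefixes -> 7 bits
    avg_word_len = 5                   # assumed average word length

    print(round(full_word_bits / avg_word_len, 1))   # ~2.2 bits per character typed
    print(round(prefix_bits / 2, 1))                 # 3.5 bits per character typed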

