
English has about 1 bit of entropy per character, so a normal 60-character phrase gives you ~60 bits of entropy. Random words will give you something like 2-3 bits per character, so ~120-180 bits for a 60-character string. Random alphanumeric characters have about 5.95 bits of entropy each (log2(62)), so you get ~178 bits from 30 of them.
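A quick back-of-the-envelope check in Python, using those assumed per-character rates:

    import math

    natural = 1.0          # assumed bits/char for ordinary English prose
    words = 2.5            # assumed bits/char, mid-range of the 2-3 estimate
    alnum = math.log2(62)  # ~5.95 bits/char for [A-Za-z0-9]

    print(60 * natural)    # ~60 bits for a 60-character phrase
    print(60 * words)      # ~150 bits for 60 characters of random words
    print(30 * alnum)      # ~178.5 bits for 30 random alphanumeric characters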

So your comparison is roughly right, but only with an annoying definition of haiku and of the dictionary of words. However, a less-than-ideal scheme should still be fine.




English words are ~1.5 bits per character; random ASCII garbage is ~6.5 bits per character.

English text has 1.46 bits of entropy per character.

https://www.gwern.net/docs/statistics/1996-teahan.pdf


I'm not sure your average English word has as many as five characters on average. Okay, let's count this paragraph.

91 characters, in 21 words. 4.3 characters on average, which would mean about 13 bits of entropy per word. I don't believe it. It sounds like your source didn't account for word frequencies in real sentences, let alone grammatical constraints.


Your method is vastly different from Randall's: each word there gets 11 bits of entropy because it's chosen at random.

In properly formed English sentences, each character only has about 1 to 1.5 bits of entropy, and I'm not certain that taking the first letters of words in a sentence would have much higher per-character entropy than that, as the first letters of words are not very randomly distributed.


You have completely missed the point of the comic, which is that if you choose 4 common English words at random, the entropy is surprisingly high. It isn't based on "human readable strings" at all.

For example, my /usr/share/dict/american-english contains just shy of 100,000 words. A random word chosen from that set has 16.6 bits of entropy, and four randomly chosen words has over 66 bits of entropy. If anything, XKCD's comic is understating the entropy involved.
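A quick way to check this yourself (the dictionary path and word count vary by system; ~100,000 words assumed here):

    import math

    # System word list; size varies by distribution.
    with open("/usr/share/dict/american-english") as f:
        words = {line.strip() for line in f if line.strip()}

    per_word = math.log2(len(words))   # ~16.6 bits for ~100k words
    print(per_word, 4 * per_word)      # four random words: ~66 bits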


The XKCD comic is only partially correct. Depending on what source you believe, English text has about 0.6 to 2.3 bits of entropy per character. This means each word needs somewhere between roughly 4.8 and 18.3 characters to reach 11 bits of entropy. Assuming entropy is closer to 2 bits per character, this is a realistic situation. However, if you assume entropy is closer to 1 bit per character, the words have to be too long to be realistic.
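The implied word lengths, for each assumed rate:

    target = 11.0                 # bits per word, as in the comic
    for rate in (0.6, 1.0, 2.3):  # assumed bits per character
        print(rate, round(target / rate, 1))  # 18.3, 11.0, 4.8 characters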

Those 3000 words are not random in natural language. If they were, your calculation would be correct, but they aren't, so the actual entropy of the system is likely nowhere near 138 bits. In other words, song title or not, if the sentence is an actual sentence, the entropy is much lower. To get maximum entropy out of sets of words you have to use something equivalent to Diceware.

Can anyone explain to me why he assigns only 11 bits of entropy to a word? Doesn't that correspond to choosing from only about 2000 words? If we choose from the more typical adult vocabulary of 100,000 words, isn't that log2(100,000) = 17 bits? Or am I doing it wrong?

Words are not bad because they're words. They're bad because humans select them predictably.

Entropy is what matters in the end. 30 random characters will have approximately 185.7 bits of entropy. With Diceware and the 7776-entry wordlist, that's about 14 words to match it, although 8-9 words is sufficient for most uses.
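Checking with Python (7776 words as in standard Diceware):

    import math

    target = 185.7              # bits, the figure above for 30 random characters
    per_word = math.log2(7776)  # ~12.92 bits per Diceware word
    print(target / per_word)    # ~14.4, i.e. 14-15 words to match
    print(9 * per_word)         # ~116 bits from 9 words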


As a rule of thumb, English text has about one bit per character of entropy. [0, 1] Since we're going with averages, let's say 5 letters + a space for each word. So you need a 7- or 8-word sentence, with normal capitalization and punctuation, to get 42 bits of entropy. And of course it shouldn't be a well-known phrase like "I've got a bad feeling about this!"

[0] The original: http://www.princeton.edu/~wbialek/rome/refs/shannon_51.pdf

[1] And some evidence that it's still correct: http://en.wikipedia.org/wiki/Hutter_Prize
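The back-of-the-envelope arithmetic, with those assumptions:

    bits_per_char = 1.0   # rule-of-thumb rate from [0]
    chars_per_word = 6    # five letters plus a space
    print(42 / (bits_per_char * chars_per_word))  # 7.0 -> a 7- or 8-word sentence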


The entropy count for each word represents picking one word at random from 2^11 choices of words. It doesn't have anything to do with the characters.

I am not convinced by one of the quiz answers:

> Our calculations confirm that a relatively short series of truly randomly chosen English dictionary words is secure; many people find these somewhat more memorable. Above we used "In the jungle! The mighty Jungle, the lion sleeps tonight!" The important thing is to choose enough words and to choose them in a random un-guessable way, such as by changing the spacing, punctuation, spelling, or capitalization.

The problem with this example is that the 10 words are not chosen independently. Type "in the j" into a Google search box and the whole phrase will appear in the drop-down box. So the entropy for the choice of that phrase is about lg2(37^8), or roughly 42 bits.

So an approximation of the total entropy is:

Choice of source phrase = lg2(37^8) ~= 41.7 bits

Choose one of the 10 suggestions from the drop-down box = lg2(10) ~= 3.3 bits

Permutation of words = lg2(10! / 2! / 3!) ~= 18.2 bits

Spacing (assume each word may independently be preceded by a space with probability 0.5) = 10 bits

Punctuation (each word may be independently followed by '!') = 10 bits

Capitalization: independently choose one of {lowercase, camelcase, uppercase} for each word = lg2(3^10) ~= 15.8 bits

Total so far: about 99 bits.

Now consider the third option: a mixture of 16 independently-chosen letters, numbers and symbols. Assume most ASCII characters are available (let's eliminate single quote, backslash and $, which cause problems for some web apps) and we have

lg2(92^16) ~= 104.4 bits, which wins.
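Sanity-checking that tally in Python:

    import math
    lg2 = math.log2

    parts = {
        "source phrase":  lg2(37 ** 8),                       # ~41.7
        "drop-down pick": lg2(10),                            # ~3.3
        "word order":     lg2(math.factorial(10) / (2 * 6)),  # 10!/2!/3! ~= 18.2
        "spacing":        10.0,                               # coin flip per word
        "punctuation":    10.0,                               # coin flip per word
        "capitalization": lg2(3 ** 10),                       # ~15.8
    }
    print(sum(parts.values()))  # ~99 bits
    print(16 * lg2(92))         # ~104.4 bits for 16 random characters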


> most of those sentences are meaningless so they won't come up in normal use

Feel free to come up with a better entropy model then. Linguistics Stack Exchange gives me confidence that it will be between 5 and 11 bits per word anyway [https://linguistics.stackexchange.com/questions/8480/what-is...].

> if statements can grab patterns just fine in most languages, they're not limited to pure equality

This does not help you one bit. If you want to produce 9 sentences of output per query, then regular expressions, pattern matching or even general intelligence inside your if statements will NOT be able to save the concept.


The thing about the Lord's Prayer doesn't really follow. If you use a grammatically correct and semantically commonplace 12-word sequence like that, you surely don't have 128 bits of entropy. But the ease of memorization comes almost entirely from those attributes!

To get 128 bits of entropy with words, you need to pick about thirteen out of a million words (which is on the order of all the words in the English language) and give all of them equal probability. The sequence needs to be fully random as well. What you end up with will surely be easier to memorize than a UUID, but substantially more difficult than the start of the Lord's Prayer.

EDIT: My math was wrong; I was thinking 10 bits per million instead of 20. So 6-7 words out of a million (the whole language) or 13 words out of a thousand (a very limited subset of the language). The point about random selection still stands, but it's certainly easier than 13 very uncommon words. Still much harder than a realistic sentence of that length, though.
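The corrected arithmetic, for anyone checking:

    import math

    def words_needed(target_bits, vocab):
        return math.ceil(target_bits / math.log2(vocab))

    print(words_needed(128, 10**6))  # 7 words from a million-word vocabulary
    print(words_needed(128, 10**3))  # 13 words from a thousand-word list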


The entropy of English is ~1 bit per letter (measured in a funny way by Shannon: https://cs.stanford.edu/people/eroberts/courses/soco/project...)

In general, it's a bit funny how we build ML models. You take a bucket of matrices, fill them with random numbers, and then start to introduce "biases" through backpropagation. A model can converge to a low loss, but most of it is still filled with random noise.

Binary weights are somewhat obvious in retrospect. Weights indicate the strength of an association between two neurons. Intuitively, most of the value is probably just that an association between two neurons exists, and whether it's positively or negatively associated.
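A toy sketch of that intuition (a random matrix standing in for trained weights; the shared-magnitude scaling is just one common choice, not the only one):

    import numpy as np

    rng = np.random.default_rng(0)
    w = rng.normal(size=(4, 4))            # stand-in for a trained weight matrix
    w_bin = np.sign(w) * np.abs(w).mean()  # keep only the sign, plus one shared magnitude
    print(w_bin)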


I couldn't resist asking it about the word entropy of its training corpus, and quite predictably it wasn't able to understand my question and instead answered similar questions. However, it stuck quite firmly to a figure of 10 to 12 bits per word! I'm fairly sure that refers to the vocabulary entropy and not the word entropy of English.

I would say it has better entropy, but not by much.

The word "idea" appears list of 850 common words:

https://en.wiktionary.org/wiki/Appendix:Basic_English_word_l...

"fish" is ranked #1453 in the list of lemmas linked from here:

https://en.wiktionary.org/wiki/Wiktionary:Frequency_lists#To...

We can arithmetically encode the choice of two words from dictionaries of 1000 and 1500, respectively, using 21 bits.

The two bits estimate is a somewhat pessimistic lower bound.
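The 21 bits is just the summed cost of the two dictionary choices:

    import math
    print(math.log2(1000) + math.log2(1500))  # ~20.5 -> 21 bits for the pair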


> The 11 bits of entropy refer to a dictionary of 2K words to choose from. The reason to type full ones is you're not hamstrung by the "no common prefix" limitation, which allows larger (and easier to remember) dictionaries.

Another note on this: assume an average word length of 5 characters; that's 11/5 or about 2.2 bits per character typed (again, assuming the wordlist doesn't lose some bits to "double coding" like "at hat"/"a that").

At 7 bits per word, of which the first two characters are enough to type, we get 7/2 or 3.5 bits per character.

Conversely, we only have to memorize 7 bits per word instead of 11.
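Side by side, with the wordlist sizes assumed above:

    typed_full   = 11 / 5  # 11-bit list, ~5 characters typed per word: 2.2 bits/char
    typed_prefix = 7 / 2   # 7-bit list, unique 2-character prefixes: 3.5 bits/char
    print(typed_full, typed_prefix)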


I've seen somewhere that there is about 1 bit of entropy per letter of English text. The best packer (Hutter Prize) compresses 100MB of Wikipedia down to 15MB, including the packer itself, a ratio of about 1:6.7, which isn't that far off.

Considering that, 40k words to 400kbits is not too surprising.
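The implied rate from those numbers:

    compressed_bits = 15e6 * 8  # ~15 MB archive, decompressor included
    original_chars = 100e6      # 100 MB of text, ~1 byte per character
    print(compressed_bits / original_chars)  # ~1.2 bits per character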

