
How about 256 words (or sets of words) representing each byte value? It's less dense by a factor of ~5, but it would work, be easy to decode, and be very difficult to identify, especially if you used sets of words. You could even generate them cleverly in a way that is grammatically correct.
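
A quick sketch of this in Python (the 256-entry wordlist here is a placeholder; any list of 256 distinct words works):

```python
# Map each byte value to one word from a 256-entry list.
# WORDS is a hypothetical placeholder list, not a curated one.
WORDS = [f"word{i:03d}" for i in range(256)]
INDEX = {w: i for i, w in enumerate(WORDS)}

def encode(data: bytes) -> str:
    """One word per byte: ~5x less dense than the raw bytes as text."""
    return " ".join(WORDS[b] for b in data)

def decode(text: str) -> bytes:
    return bytes(INDEX[w] for w in text.split())

assert decode(encode(b"hello")) == b"hello"
```

Using sets of words would just mean several entries per byte value, chosen at random on encode and all mapping back to the same byte on decode.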



The words are 36 bits: three 12-bit characters (the analogue of a byte).

I would call that the word size. The word would be 2 bytes of 8 bits each.

Another idea: find 256 different syllables and use that to encode 8-bit numbers. This way we also can avoid profanity (except for the rare case where a string of multiple harmless syllables expresses a profanity).
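
A sketch of the syllable idea, assuming 16 onsets x 16 rimes = 256 syllables (these particular onset/rime sets are arbitrary placeholders; a real scheme would curate them against profanity more carefully):

```python
ONSETS = list("bdfghjklmnprstvz")                      # 16 consonant onsets
RIMES = ["a", "e", "i", "o", "u", "ar", "en", "il", "on", "ut",
         "ay", "ee", "oo", "or", "an", "um"]           # 16 vowel-led rimes
SYLLABLES = [o + r for o in ONSETS for r in RIMES]     # 16 * 16 = 256
LOOKUP = {s: n for n, s in enumerate(SYLLABLES)}

def encode(data: bytes) -> str:
    return "".join(SYLLABLES[b] for b in data)

def decode(text: str) -> bytes:
    # Syllables are 2 or 3 chars. Every syllable starts with a consonant
    # onset and every rime starts with a vowel, so a 3-char read is
    # correct iff it ends the string or the next char is an onset.
    out, i = [], 0
    while i < len(text):
        tri = text[i:i + 3]
        if len(tri) == 3 and tri in LOOKUP and \
                (i + 3 == len(text) or text[i + 3] in ONSETS):
            out.append(LOOKUP[tri]); i += 3
        else:
            out.append(LOOKUP[text[i:i + 2]]); i += 2
    return bytes(out)

assert decode(encode(bytes(range(256)))) == bytes(range(256))
```

The onset/rime split is what keeps decoding unambiguous without separators between syllables.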

We probably don't even need 128 bits of entropy. TOTP 2FA is only 20 bits after all and people seem happy with that.

If we make them be pseudo sentences they will probably be easier to remember (the $adjective $noun $adverb $verb a $adjective2 $noun2)
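
A sketch of the template idea; the tiny 4-entry lists are placeholders, and the entropy is just the sum of log2(len(list)) over the slots:

```python
import math
import secrets

# Placeholder word lists; real lists would be much larger.
ADJ  = ["quick", "lazy", "bright", "odd"]
NOUN = ["fox", "dog", "crab", "owl"]
ADV  = ["slowly", "boldly", "sadly", "gladly"]
VERB = ["eats", "sees", "hugs", "bites"]

SLOTS = (ADJ, NOUN, ADV, VERB, ADJ, NOUN)

def sentence() -> str:
    """Fill 'the $adjective $noun $adverb $verb a $adjective2 $noun2'."""
    a, n, av, v, a2, n2 = (secrets.choice(xs) for xs in SLOTS)
    return f"the {a} {n} {av} {v} a {a2} {n2}"

# Entropy carried by one sentence: sum of bits per slot.
bits = sum(math.log2(len(xs)) for xs in SLOTS)
# Here: 6 slots x 2 bits = 12 bits; 4096-entry lists would give 72 bits.
```

So to reach ~128 bits you'd need either bigger lists or a couple of sentences, but each one reads like (nonsense) English.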


Wow, the nerdsniping on this is something. Note, for my math, I failed to consider that entries wouldn't fit in one byte. Figuring out an encoding might help. Laying out a sequence in blocks might also help overcome the speed difference between memory and storage. I'm not even really trying to delve into this, but it's tempting.

I think you might have miscalculated bits per byte here?

8 * 17,763/64,860 = 2.19

Also, I attempted to implement this as described in this paper (variable-length encoding of the letters and the offsets; used L; dropped F entirely because all words are the same length; N didn't make a big difference).

I achieved a naive size of 20,560 bytes, and I wasn't confident that implementing the more advanced techniques outlined in the paper would get the size down enough to compete with the trie+Huffman representation (15,599 bytes, https://github.com/adamcw/wordle-trie-packing#all-words).

8 * 15,599/64,860 = 1.92 bits per byte.
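
For reference, the arithmetic here: 64,860 bytes is the raw list (12,972 five-letter words at 5 bytes each), so a packed size of S bytes costs 8 * S / 64,860 bits per original byte:

```python
# Bits needed per original byte, given a packed size in bytes.
# 64,860 = 12,972 words * 5 letters (the raw Wordle guess list).
def bits_per_byte(packed_bytes: int, original_bytes: int = 64_860) -> float:
    return 8 * packed_bytes / original_bytes

assert round(bits_per_byte(17_763), 2) == 2.19  # parent comment's figure
assert round(bits_per_byte(15_599), 2) == 1.92  # trie+Huffman repo
assert round(bits_per_byte(20_560), 2) == 2.54  # naive paper implementation
```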


> 27 bits for just one word seems wasteful

Where did you get that you need 27 bits for one word?

> Then to send any word you only need to send one number, and in binary it would have between 1 and at most 19 bits

Yep! By sorting by frequency, you can make the majority of words have shorter bit strings. By my calculations, common words such as "the", "of", and "and" will have ~4-6 bits associated with them. That means you can encode a large number of words (googling says those words make up ~1/7 of words by frequency) with only 4-6 bits each. That's far from the 27 bits you calculated.
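
A sketch of how the frequent words end up with the short codes, using made-up (but realistically shaped) frequencies and a standard Huffman construction:

```python
import heapq

# Hypothetical word frequencies (percent-ish); not from a real corpus.
freqs = {"the": 7.0, "of": 3.4, "and": 2.9, "to": 2.6,
         "a": 2.3, "in": 2.1, "aardvark": 0.0001, "zymurgy": 0.00005}

def huffman_lengths(freqs: dict) -> dict:
    """Return the Huffman code length (in bits) for each word."""
    # Heap entries: (total freq, unique tiebreaker, {word: depth}).
    heap = [(f, i, {w: 0}) for i, (w, f) in enumerate(freqs.items())]
    heapq.heapify(heap)
    n = len(heap)
    while len(heap) > 1:
        f1, _, d1 = heapq.heappop(heap)
        f2, _, d2 = heapq.heappop(heap)
        merged = {w: depth + 1 for w, depth in {**d1, **d2}.items()}
        heapq.heappush(heap, (f1 + f2, n, merged))
        n += 1
    return heap[0][2]

lengths = huffman_lengths(freqs)
assert lengths["the"] == min(lengths.values())      # shortest code
assert lengths["the"] < lengths["aardvark"]         # rare words cost more
```

With a full ~300k-word frequency table the same construction gives "the" a code of a handful of bits while rare words run to 20+.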


In the article the compression doesn’t make sense

If you are only sending one word, and the recipient already needs to know the word, then you only need 1 bit, essentially just signaling that you are saying that specific word

If you want a richer vocabulary, you could create an index of about 300k words (from the English dictionary), shared between the parties

Then to send any word you only need to send one number, and in binary it would have between 1 and at most 19 bits, for any word in the index (2^19 is around 500k)

That’s without even sorting the index by frequency of appearance/usage

27 bits for just one word seems wasteful
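
A sketch of the shared-index scheme, with a tiny placeholder list standing in for the ~300k-word dictionary both parties would share:

```python
# Hypothetical shared wordlist; a real one would have ~300k entries,
# so every index fits in at most 19 bits (2**19 = 524,288).
WORDLIST = ["the", "of", "and", "correct", "horse", "battery", "staple"]

def encode_word(word: str, wordlist: list = WORDLIST) -> str:
    """Minimal-length binary for the word's index: 1..19 bits at 300k words."""
    return format(wordlist.index(word), "b")

assert encode_word("the") == "0"       # index 0 -> a single bit
assert encode_word("horse") == "100"   # index 4 -> three bits
assert 2 ** 19 == 524_288
```

One caveat: bare minimal-length indexes aren't prefix-free, so to concatenate several words into one stream you'd need length markers or a prefix code (Huffman, as the sibling comment suggests).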


The trick is to encode the numbers in binary, not plaintext. But still you'd probably want to use a Huffman coding rather than plain indexes so that the common words are shorter.

Or even more in-your-face:

Encode bits as spanish words. Enumerate 2^8 words in the dictionary and just use them.


We just need to agree on a 2^16 word dictionary; each 16-bit segment gets its own word. Then we can write addresses such as correct:horse:battery:staple::one .
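
A sketch of that, splitting a 128-bit address into eight 16-bit segments; the 65,536-entry wordlist here is a generated stand-in for the agreed dictionary:

```python
# Placeholder for the agreed 2**16-word dictionary.
WORDS = [f"w{i}" for i in range(2 ** 16)]

def addr_to_words(addr: int) -> str:
    """Name each 16-bit segment of a 128-bit address, IPv6-style."""
    segs = [(addr >> (16 * k)) & 0xFFFF for k in reversed(range(8))]
    return ":".join(WORDS[s] for s in segs)

assert addr_to_words(1) == "w0:w0:w0:w0:w0:w0:w0:w1"
```

The `::` zero-compression from IPv6 notation would map naturally onto runs of the word for segment 0.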

128 bits is like, what, ten words chosen randomly from a suitably large dictionary? Is that really prohibitive? Even as random alphanumeric string that doesn't seem beyond muscle memory. Seems like the biggest pain would be to input it consistently on a phone keyboard.
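
The arithmetic checks out: with an N-word dictionary, each randomly chosen word carries log2(N) bits, so:

```python
import math

def words_needed(bits: int, dict_size: int) -> int:
    """How many random words from a dict_size list cover `bits` of entropy."""
    return math.ceil(bits / math.log2(dict_size))

assert words_needed(128, 7776) == 10      # diceware-sized list, ~12.9 bits/word
assert words_needed(128, 2 ** 16) == 8    # a 65,536-word list
```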

24 bits seems about right for the information content of six Latin characters arranged in a pronounceable English orthography (the ‘X’ has pretty high information value though).
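
The arithmetic behind that estimate, roughly: six unconstrained Latin letters carry log2(26^6) ≈ 28.2 bits, and requiring a pronounceable English-like orthography knocks off a few bits, landing near 24:

```python
import math

# Upper bound: six fully random Latin letters.
unconstrained = 6 * math.log2(26)
assert 28.1 < unconstrained < 28.3
# Pronounceability constraints (legal letter sequences only) reduce
# the per-letter entropy, hence the ~24-bit estimate above.
```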

or find 65,536 words and use them to encode 16-bit numbers, then you could leverage your existing word associations to memorize

I would throw out all obscure languages and useless symbols.

One character = one code point.

I would use 32 bits for a code point. The lower 16 bits would be code; the upper 16 bits contain some flags and fields for classification. This would allow simple bit operations to inquire about important properties. There would be a 4-bit flexible type code, leaving 12 bits, some of which would have a code-specific meaning, others fixed. Or something like that.

The goal would be to have a code that programs can work with, without requiring megabytes and megabytes of meta-data about the code.
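
A sketch of what those "simple bit operations" would look like; the exact field layout below is illustrative (my reading of the proposal), not a real standard:

```python
# Proposed 32-bit code point: low 16 bits = code, next 12 bits =
# classification fields, top 4 bits = type code. Layout is hypothetical.
def make_cp(code: int, fields: int, type_code: int) -> int:
    return (type_code & 0xF) << 28 | (fields & 0xFFF) << 16 | (code & 0xFFFF)

def cp_type(cp: int) -> int:
    """Classify a character with one shift, no metadata tables needed."""
    return cp >> 28

cp = make_cp(code=0x0041, fields=0b000000000001, type_code=0x2)
assert cp_type(cp) == 0x2
assert cp & 0xFFFF == 0x0041
```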


There's a bitcoin wordlist of 2^11 words, so each word encodes 11 bits and 3 words could hold a 32-bit hash. You'd need a pretty large wordlist for 3 words to hold 128 bits.
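
A sketch of that BIP39-style packing (the 2048-entry list here is a generated stand-in for the real wordlist): 3 words carry 33 bits, which covers a 32-bit hash; 128 bits would take 12 such words, while fitting 128 bits into 3 words would need a 2^43-word list.

```python
# Stand-in for the real 2048-word (11 bits/word) list.
WORDS = [f"word{i}" for i in range(2048)]

def encode32(h: int) -> list:
    """Pack a 32-bit hash into 3 words (3 * 11 = 33 bits, zero-padded)."""
    bits = format(h, "033b")
    return [WORDS[int(bits[i:i + 11], 2)] for i in range(0, 33, 11)]

assert encode32(0) == ["word0", "word0", "word0"]
assert len(encode32(0xDEADBEEF)) == 3
```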

You are confusing the fact that some encodings store one character per byte (8 bits) with each character actually carrying 8 bits of entropy.

OK. That's the super optimized way to read the encoded data. Makes sense if performance is important.

Writing is the harder part. I suppose you could just make 15 such tables, one for each possible value you could write.


I'm not the OP and I don't have a solution at hand, but I'd start by thinking about error correction and parity, then realize that 10 bits is enough to represent 1000 items completely.
