How about 256 words (or sets of words) representing each byte value? It's less dense by a factor of ~5, but it would work, be easy to decode, and be very difficult to identify, especially if you used sets of words. You could even cleverly generate the output in a way that is grammatically correct.
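Roughly like this (a Python sketch; the 256-entry list here is a numbered placeholder, where a real scheme would use hand-picked English words):

    # Sketch: one word per byte value, using a placeholder 256-entry list.
    WORDS = [f"word{i:03d}" for i in range(256)]
    INDEX = {w: i for i, w in enumerate(WORDS)}

    def encode(data: bytes) -> str:
        return " ".join(WORDS[b] for b in data)

    def decode(text: str) -> bytes:
        return bytes(INDEX[w] for w in text.split())

    assert decode(encode(b"hi")) == b"hi"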
Another idea: find 256 different syllables and use them to encode 8-bit numbers. This way we can also avoid profanity (except for the rare case where a string of multiple harmless syllables spells out a profanity).
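A sketch of a fixed-length variant (the letter sets are invented and a real list would need curating; making every syllable exactly 3 characters keeps decoding trivial):

    # Sketch: 256 CVC syllables (8 onsets x 4 vowels x 8 codas), each
    # exactly 3 characters, so decoding is just 3-character chunking.
    ONSETS = list("bdfgkmnp")
    VOWELS = list("aeio")
    CODAS = list("lmnrstkz")
    SYLLABLES = [o + v + c for o in ONSETS for v in VOWELS for c in CODAS]
    INDEX = {s: i for i, s in enumerate(SYLLABLES)}  # 8*4*8 = 256

    def encode(data: bytes) -> str:
        return "".join(SYLLABLES[b] for b in data)

    def decode(text: str) -> bytes:
        return bytes(INDEX[text[i:i+3]] for i in range(0, len(text), 3))

    assert decode(encode(bytes([0, 128, 255]))) == bytes([0, 128, 255])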
Wow, the nerdsniping on this is something.
Note: in my math I failed to consider that entries wouldn't fit in one byte. Figuring out an encoding might help.
Laying out a sequence in blocks might also help overcome the speed difference between memory and storage.
I'm not even really trying to delve into this, but it's tempting.
I think you might have miscalculated bits per byte here?
8 * 17,763/64,860 = 2.19
Also, I attempted to implement this as described in this paper (variable-length encoding of the letters and the offsets, using L, and dropping F entirely because all words are the same length; N didn't make a big difference).
I achieved a naive size of 20,560 bytes, and I wasn't confident that implementing the more advanced techniques outlined in the paper would get the size down far enough to compete with a trie+Huffman representation (15,599 bytes, https://github.com/adamcw/wordle-trie-packing#all-words).
Where did you get that you need 27 bits for one word?
> Then to send any word you only need to send one number, and in binary it would have between 1 and at most 19 bits
Yep! By sorting by frequency, you can make it so the majority of words have shorter bit strings. By my calculations, common words such as "the", "of", and "and" will have ~4-6 bits associated with them. That means you can encode a large number of words (googling says those words make up ~1/7 of words by frequency) with only 4-6 bits each. That's far from the 27 bits you calculated.
If you are only sending one word, and the recipient already needs to know the word, then you only need 1 bit, essentially just signaling that you are saying that specific word.
If you want a richer vocabulary, you could create an index of about 300k words (from the English dictionary), shared between the parties.
Then to send any word you only need to send one number, and in binary it would have between 1 and at most 19 bits, for any word in the index (2^19 is around 500k); see the sketch below.
That’s without even sorting the index by frequency of appearance/usage.
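A rough sketch (the three entries stand in for the full ~300k list; I pack a fixed 19-bit field here, since dropping leading zeros would need a length prefix or a self-delimiting code to stay decodable):

    # Sketch: both parties hold the same word list; a word is sent as its
    # position, packed into 19 bits (2^19 = 524,288, enough for ~300k).
    WORDLIST = ["a", "aardvark", "aback"]  # ...~300k entries in practice
    INDEX = {w: i for i, w in enumerate(WORDLIST)}

    def encode(word: str) -> str:
        return format(INDEX[word], "019b")  # fixed-width 19-bit string

    def decode(bits: str) -> str:
        return WORDLIST[int(bits, 2)]

    assert decode(encode("aback")) == "aback"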
The trick is to encode the numbers in binary, not plaintext. But still you'd probably want to use a Huffman coding rather than plain indexes so that the common words are shorter.
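For instance (a toy Huffman builder; the frequency numbers are made up for illustration):

    # Sketch: build Huffman codes from word frequencies so that common
    # words get short bit strings.
    import heapq
    from itertools import count

    def huffman(freqs):
        tiebreak = count()  # keeps the heap from ever comparing dicts
        heap = [(f, next(tiebreak), {w: ""}) for w, f in freqs.items()]
        heapq.heapify(heap)
        while len(heap) > 1:
            f1, _, c1 = heapq.heappop(heap)
            f2, _, c2 = heapq.heappop(heap)
            merged = {w: "0" + b for w, b in c1.items()}
            merged.update({w: "1" + b for w, b in c2.items()})
            heapq.heappush(heap, (f1 + f2, next(tiebreak), merged))
        return heap[0][2]

    codes = huffman({"the": 700, "of": 350, "and": 300, "aardvark": 1})
    # "the" gets the shortest code, "aardvark" the longest.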
We just need to agree on a 2^16-word dictionary; each 16-bit segment gets its own word. Then we can write addresses such as correct:horse:battery:staple::one.
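Something like this (Python sketch; the wordlist is a numbered placeholder for the agreed 2^16-entry list):

    # Sketch: render an IPv6 address as one word per 16-bit group.
    import ipaddress

    WORDS = [f"w{i}" for i in range(2**16)]  # placeholder wordlist

    def addr_to_words(addr: str) -> str:
        packed = ipaddress.IPv6Address(addr).packed  # 16 bytes
        groups = [int.from_bytes(packed[i:i+2], "big")
                  for i in range(0, 16, 2)]
        return ":".join(WORDS[g] for g in groups)

    print(addr_to_words("2001:db8::1"))  # eight words, one per group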
128 bits is like, what, ten words chosen randomly from a suitably large dictionary (~13 bits per word, so a dictionary of roughly 8k words)? Is that really prohibitive? Even as a random alphanumeric string that doesn't seem beyond muscle memory. Seems like the biggest pain would be to input it consistently on a phone keyboard.
24 bits seems about right for the information content of six Latin characters arranged in a pronounceable English orthography (the ‘X’ has pretty high information value though).
I would throw out all obscure languages and useless symbols.
One character = one code point.
I would use 32 bits for a code point. The lower 16 bits would be the code; the upper 16 bits would contain flags and fields for classification. This would allow simple bit operations to query important properties. There would be a 4-bit flexible type code, leaving 12 bits, some of which would have a code-specific meaning, others fixed. Or something like that (a sketch follows below).
The goal would be to have a code that programs can work with, without requiring megabytes and megabytes of meta-data about the code.
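Something along these lines (the specific flag and field assignments are invented for illustration):

    # Sketch: 32-bit code point, low 16 bits = code, bits 28-31 = 4-bit
    # type code, bits 16-27 = classification flags/fields.
    CODE_MASK = 0x0000FFFF
    TYPE_SHIFT = 28
    FLAG_UPPER = 1 << 16  # example classification flag

    def make_cp(code: int, type_code: int, flags: int = 0) -> int:
        return (type_code << TYPE_SHIFT) | flags | (code & CODE_MASK)

    def is_upper(cp: int) -> bool:
        return bool(cp & FLAG_UPPER)  # one bit test, no metadata tables

    cp = make_cp(ord("A"), type_code=0x1, flags=FLAG_UPPER)
    assert is_upper(cp) and (cp & CODE_MASK) == ord("A")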
There's a bitcoin wordlist (BIP-39) with 2^11 entries, so each word encodes 11 bits and 3 words could hold a 32-bit hash (33 bits). You'd need a pretty large wordlist (~2^43 entries) for 3 words to hold 128 bits.
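Packing looks like this (sketch; the numbered list stands in for the real 2048-word BIP-39 list):

    # Sketch: split an n-bit value into 11-bit chunks, one word each.
    WORDS = [f"w{i}" for i in range(2**11)]  # placeholder for BIP-39 list

    def to_words(value: int, nbits: int) -> list[str]:
        nwords = -(-nbits // 11)  # ceil(nbits / 11)
        return [WORDS[(value >> (11 * i)) & 0x7FF] for i in range(nwords)]

    print(to_words(0xDEADBEEF, 32))  # 3 words cover up to 33 bits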
I'm not the OP and I don't have a solution at hand, but I'd start by thinking about error correction and parity, then realize that 10 bits is enough to represent 1000 items completely.
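For instance (a minimal sketch):

    # Sketch: 10 bits address 1024 >= 1000 items; one extra parity bit
    # gives cheap single-bit-error detection.
    def with_parity(index: int) -> int:
        assert 0 <= index < 1024
        parity = bin(index).count("1") % 2
        return (index << 1) | parity  # 11 bits total

    def check(word: int) -> int:
        index, parity = word >> 1, word & 1
        assert bin(index).count("1") % 2 == parity, "bit error detected"
        return index

    assert check(with_parity(999)) == 999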