This seems to be the worst of both worlds. It's not easy for a human to read a square (compared to a line of text). The pixellated font is also not easily readable compared to a vector font. It's also not easy for machines to read an optical coding with no spatially distributed redundancy.
QR codes and bar codes are brilliant for machines because misreads due to a spurious reflection or speck of dust are mitigated by error correction.
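Toy sketch of the principle, just majority voting over repeated bytes (real QR codes use Reed-Solomon codes, which are far more efficient; this only illustrates why spatially distributed redundancy survives a localized misread):

```python
# Toy repetition code: each byte is stored three times, and a majority
# vote recovers the original even if one copy is corrupted by "dust".
from collections import Counter

def encode(data: bytes) -> bytes:
    return bytes(b for b in data for _ in range(3))

def decode(coded: bytes) -> bytes:
    out = []
    for i in range(0, len(coded), 3):
        votes = Counter(coded[i:i + 3])
        out.append(votes.most_common(1)[0][0])
    return bytes(out)

coded = bytearray(encode(b"https://example.com"))
coded[4] ^= 0xFF                  # simulate one misread symbol
assert decode(bytes(coded)) == b"https://example.com"
```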
I feel like this problem is already well served by bar codes which have a human readable text representation below them (e.g. serial number stickers).
That said, I can see the security advantage of the computer reading the same representation as a human, although this is probably not the best place to enforce security. Since there's no integrity check, though, there's little guarantee the computer will read what you see. Maybe linear OCR combined with a barcode checksum would be a better way to achieve these goals.
As I suggested at the end, you could still employ OCR and have a barcode checksum. The checksum would ensure a misread of the human-readable text would fail. As long as the checksum was not error-correcting (so it could not be engineered to override what the OCR actually read), it doesn't matter that it is incomprehensible to humans, because the OCR is authoritative.
You could also implement the checksum as OCR-able text, although it wouldn't be as dense, and probably wouldn't help human readability.
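Concretely, the verification flow might look something like this (a minimal sketch; the inputs stand in for whatever OCR engine and barcode reader you use):

```python
import zlib

def verify(ocr_text: str, barcode_crc: int) -> str:
    # ocr_text comes from a hypothetical OCR engine; barcode_crc is the
    # value decoded from the machine-only checksum strip. CRC32 detects
    # misreads but is not error-correcting, so the barcode can never
    # "repair" the text into something else -- it can only reject it.
    if zlib.crc32(ocr_text.encode()) != barcode_crc:
        raise ValueError("OCR result does not match checksum; rescan")
    return ocr_text  # what the human sees remains authoritative

good = "https://example.com"
verify(good, zlib.crc32(good.encode()))                    # passes
# verify("https://examp1e.com", zlib.crc32(good.encode())) # would raise
```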
I think ultimately you should not trust that a machine will read what a code appears to say. That should be enforced on the device: "Are you sure you want to visit malware.site?". It's also easy to manipulate computer vision; you can engineer patterns that read as one thing to humans but another to machines. In some ways it's better for these codes to not be human readable, so that trust is not misplaced and the machine is used as the best source of truth.
The problem this is addressing is a code being impenetrable by a human. If your solution is adding a second (human readable) code beneath the machine readable code...you haven't addressed the problem. The user must still trust their reader to parse the code.
QR codes' reconstructability is a major strength that this lacks, but I'd bet there's a way to expand this to include ECC around it, much as QR codes can.
BUT...OCR is quickly advancing, so the need for a specialized code a specialized machine can read will diminish over time anyway.
It's interesting, but I agree it's not easy on human eyes. I think plain old text in a clear font, with an OCR reader that can lift URLs out of any text, would be nearly as effective and gain traction quicker. Feel free to rebut me!
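Lifting URLs out of OCR'd text is nearly a one-liner (rough sketch; real-world URL detection has more edge cases like trailing punctuation and bare domains):

```python
import re

# Rough pattern; covers the common scheme-prefixed case.
URL_RE = re.compile(r"https?://[^\s<>\"']+")

def lift_urls(ocr_output: str) -> list[str]:
    return URL_RE.findall(ocr_output)

print(lift_urls("Scan me: https://example.com/signup today!"))
# ['https://example.com/signup']
```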
The point isn't that it has flaws, but that its description is wrong. "Non-human eyes" -- normally understood to be OCR -- read it just fine. I think most of us were expecting something that disrupts "computer eyes" (e.g. because of deceiving overly narrow "tricks" that neural networks use to identify characters) but left it readable for the typical human (like an easy Captcha).
A more accurate (and helpful!) description of the problem you're solving is that this disrupts text parsers. That is, any program that just reads this in as text won't see the "real" letters (unless it's been pre-programmed with a specific reverser, etc.) and thus will frustrate, say, text search.
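A "reverser" is just a lookup table. Toy example with a made-up three-letter substitution:

```python
# Hypothetical mapping for illustration: suppose the font displays the
# glyph for "a" at the codepoint of "q", and so on. A plain-text search
# for "cab" fails, but anyone who learns the substitution can undo it.
FORWARD = str.maketrans("abc", "qrs")   # made-up mapping
REVERSE = str.maketrans("qrs", "abc")

stored = "cab".translate(FORWARD)        # what the file actually contains
print(stored)                            # "sqr" -- opaque to text search
print(stored.translate(REVERSE))         # "cab" -- recovered
```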
Which, on that note, I notice elsewhere you mention this being a solution applied to document submission in legal proceedings. There, the assumption might be that one side wishes to run text searches and assumes the document is compatible with that. This could then be viewed as non-compliance with a judge's orders, so FYI.
So the use case is primarily to produce something machine readable rather than human readable (it's not that unreadable, but still). I can see that. Is there a human-readable flag?
This is great work. But I don't understand the key point. Is this all to defeat OCR? Just by using this font in my office program?
My gut feeling is that, given enough work, one could write some ML/AI algorithm to OCR this font as well. But sure, the author states that this is also a symbolic work.
Bottom line: As long as a human can read the text, a computer will be able to do so, too.
We all know how hard CAPTCHAs have gotten to solve (for us humans!) in the last few years. We probably wouldn't enjoy reading a multi-paragraph captcha.
I think as far as culture goes it would be simpler to promote the idea that black bars are the only way to do it, than that you need to make sure you're using a special software that knows how to securely pixelate text.
It is a cool idea though, and text-shaped pixelation is much more satisfying than random noise. (And if someone's cheeky enough to decode it, they will be very disappointed ;)
Yeah, the regular grid and text would probably make OCR even easier if it were necessary. It wouldn't surprise me if they just used uppercase letters, too.
Maybe they assumed that bots would interpret the page the way a human being would, when they're not looking at the source code? Though my money's on someone just not caring.
The issue is with OCR. Imagine something that was printed (to be physically signed or filed into a cabinet, for example) and then rescanned into an image or PDF.
Both printing and scanning are lossy, with lots of noise and artifacts.
The article is saying that serif fonts are harder to OCR.
But rasterized text contains a lot more information about each character than the raw ASCII values. If you halve the resolution or add compression artifacts, OCR will still pick it up.
If they rendered raw binary data they would have to add lots of redundant information, as well as spreading it out over a large enough area. Maybe that's what you meant, but that didn't fit the "trivially to do" part for me. It seems a lot easier to just render the text.
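A quick way to convince yourself, assuming Pillow and pytesseract (with Tesseract installed) are available; untested sketch:

```python
# Render text, halve the resolution, and see whether OCR still reads it.
from PIL import Image, ImageDraw, ImageFont
import pytesseract

img = Image.new("L", (200, 20), 255)
ImageDraw.Draw(img).text((5, 5), "HELLO WORLD",
                         font=ImageFont.load_default(), fill=0)
img = img.resize((800, 80))      # scale up so glyphs are OCR-sized

small = img.resize((400, 40))    # throw away half the pixels
print(pytesseract.image_to_string(small))  # usually still "HELLO WORLD"
```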
The answer is if a machine can't decode it, a human does.
One of my aunts worked as a postal worker for a while many years ago. She was hired for her ability to both read and type quickly. She reviewed addresses that couldn't be automatically recognized and did some sort of encoding that (I believe) generated that yellow printed barcode you see on some letters.
At least according to Atlas Obscura, OCR has improved to the point that there's only one remote encoding site left, and it's in Salt Lake City.
I'd avoid using this. This format is lossy with respect to the text, so letters and numbers can get silently rewritten: it reuses a single letter image, placing it throughout the page wherever that letter is "recognized". A low-resolution scan or bad OCR results in silently corrupted text.
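Toy illustration of the failure mode: if the encoder matches letter images by a coarse fingerprint, distinct characters can collide at low resolution, and one silently replaces the other:

```python
# Glyph-substitution compression matches letter images by a coarse
# fingerprint. If two distinct glyphs collide, the encoder pastes the
# wrong image everywhere it "recognizes" a match -- silent corruption.
def fingerprint(glyph, block=2):
    # Downsample a binary bitmap by OR-ing block x block cells together.
    h, w = len(glyph), len(glyph[0])
    return tuple(
        tuple(
            int(any(glyph[y][x]
                    for y in range(by, min(by + block, h))
                    for x in range(bx, min(bx + block, w))))
            for bx in range(0, w, block))
        for by in range(0, h, block))

SIX   = [[0,1,1,0],    # made-up low-res bitmaps for illustration
         [1,0,0,0],
         [1,1,1,0],
         [1,0,0,1],
         [0,1,1,0]]
EIGHT = [[0,1,1,0],
         [1,0,0,1],
         [0,1,1,0],
         [1,0,0,1],
         [0,1,1,0]]

# Same coarse fingerprint => the encoder would reuse one image for both.
print(fingerprint(SIX) == fingerprint(EIGHT))  # True
```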
I guess the lazy way to prevent this in a foolproof way is to add an OCR step somewhere in the pipeline and use actual images generated from websites. Although maybe then you'll get #010101 text on a #000000 background.
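You could at least catch that near-invisible-text trick with a contrast check before trusting the render. Sketch using the WCAG relative-luminance and contrast-ratio formulas:

```python
# Flag near-invisible text (e.g. #010101 on #000000) before running OCR.
def luminance(hex_color: str) -> float:
    r, g, b = (int(hex_color.lstrip("#")[i:i + 2], 16) / 255
               for i in (0, 2, 4))
    lin = lambda c: c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4
    return 0.2126 * lin(r) + 0.7152 * lin(g) + 0.0722 * lin(b)

def contrast(fg: str, bg: str) -> float:
    l1, l2 = sorted((luminance(fg), luminance(bg)), reverse=True)
    return (l1 + 0.05) / (l2 + 0.05)

print(contrast("#010101", "#000000"))  # ~1.0: effectively invisible
print(contrast("#ffffff", "#000000"))  # 21.0: maximum contrast
```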
Very minor but interesting nitpick: the font used on checks is not OCR (optical) but MICR (magnetic ink). The design objectives are different and different font families exist for the two purposes. MICR as used on checks (more properly called E-13B) bears unusual, distinctive character shapes emphasizing abnormally wide horizontal components due to the need for each character to have a distinctive waveform when read as density from left to right, essentially by a tape recorder read head. Fonts optimized for OCR are usually more normal looking to humans because they emphasize clear detection of lines instead.
E-13B is a bit of an ideal use case for this method because of the highly constrained character set used on checks and the unusually nonuniform density of E-13B. The same thing can be done on text more generally but gets significantly more difficult.
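A toy version of what the read head sees: collapse each column of a glyph bitmap into total ink density, yielding one 1-D waveform sample per column (the bitmap here is made up for illustration; real E-13B shapes have exaggerated horizontal bars precisely so these waveforms stay distinctive):

```python
# Simulate an E-13B-style read: sum each column's "ink" as the character
# passes the read head from left to right.
def waveform(glyph):
    return [sum(col) for col in zip(*glyph)]

GLYPH = [[1,1,1,0],   # made-up bitmap, not a real E-13B character
         [1,0,0,0],
         [1,1,1,0],
         [0,0,1,0],
         [1,1,1,0]]

print(waveform(GLYPH))  # [4, 3, 4, 0] -- one density value per column
```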