
Maybe I'm not getting what you are saying.

But rasterized text has a lot more information about each character than the raw ASCII values. If you halve the resolution or add compression artifacts, OCR will still pick it up.

If they rendered raw binary data, they would have to add lots of redundant information as well as spread it out over a large enough area. Maybe that's what you meant, but that didn't fit the 'trivial to do' part for me. It seems a lot easier to just render the text.




OCR is less reliable than looking at the character data directly, if it's available.

OCR generally does not work as you describe. The common case is for the OCR system to tag characters in an image so that text may be selected. More advanced systems will generate fonts from the images and replace the text with those. Either way, the text isn't reduced to a single byte.

I'm slightly shocked that (all?) modern OCR systems can't handle a perfectly clean image of Courier text with 100% reliability.

I wonder if reducing the font size for faster transmission made it worse? A larger font might have been easier to read, and would probably have saved time in the long run.

EDIT: Actually, looking at the output of the Fax to Binary Converter program, I think that's very likely. Even I'm not 100% sure whether that 8x6 glob of pixels is a 0 or a D.

Hmmm. If nothing else, what about search-and-replacing the Word doc to swap some of the most difficult characters for clearer ones, and then reversing the process on the other end? I mean, that's ridiculously complex, but not as complex as writing a custom Fax to Binary Converter app.


The methodology seems obvious: interpret every three bytes of the text as a pixel in an image, compress the image, then reinterpret each (post-compression) pixel as three ASCII characters.
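
A minimal sketch of that round trip, assuming Pillow and zero-padding to fill out the last pixel (both my choices, not from the comment):

    import math
    from PIL import Image

    def text_to_image(text: bytes) -> Image.Image:
        # Pad so every pixel gets a full (R, G, B) triple of bytes.
        padded = text + b"\x00" * (-len(text) % 3)
        n = len(padded) // 3
        side = math.isqrt(max(n - 1, 0)) + 1  # smallest square holding n pixels
        img = Image.new("RGB", (side, side))
        pixels = [tuple(padded[i:i + 3]) for i in range(0, len(padded), 3)]
        img.putdata(pixels + [(0, 0, 0)] * (side * side - n))
        return img

    def image_to_text(img: Image.Image) -> bytes:
        # After a lossy codec (e.g. JPEG) this comes back corrupted;
        # after a lossless one (PNG) it round-trips exactly.
        return b"".join(bytes(p) for p in img.getdata()).rstrip(b"\x00")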

This is not about OCR.


Why wouldn't it?

They render the ID into rasterized text (blocks of pixels) and blend in these blocks. But they could just as easily have rendered the raw 0s and 1s of the ID into blocks the exact same way. It just wouldn't have looked like text, but like a random pattern.
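
For illustration, a rough Pillow sketch of that alternative (the block size is an arbitrary choice of mine):

    from PIL import Image

    def bits_to_blocks(bits: str, block: int = 8) -> Image.Image:
        # One black or white square per bit, in a single row.
        img = Image.new("1", (len(bits) * block, block), color=1)
        for i, bit in enumerate(bits):
            if bit == "1":
                img.paste(0, (i * block, 0, (i + 1) * block, block))
        return img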


It's also generally a pain in the ass for people with good vision too, since text can't be copied out of a JPEG. If you want to forward a paragraph of text from a pure raster document, you have to read and type it all out yourself.

I don't think you're right on that second point. i'm pretty sure they do OCR, but they're only looking for image data to mine the text out of. The way they're coded now, they think that the document already is all text so it can't find any images to convert. Again. This is not a impervious approach. There are ways around it, it's just that crawlers don't go down that rabbit hole (right now).

Out of curiosity, what problem did you have that this approach solves?

Thinking about diffs: plain-text diffs are typically compressed for transport anyway. So you end up with something that's human-readable at the point of generation and application (where the storage/processing cost associated with legibility is basically insignificant), while being highly compressed during transport (where legibility is irrelevant and no processing beyond copying bits is necessary).
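
A tiny illustration, with a made-up diff and gzip standing in for whatever the transport actually uses:

    import gzip

    # Hypothetical unified diff; legible where it's produced and applied.
    diff_lines = [
        "--- a/hello.txt",
        "+++ b/hello.txt",
        "@@ -1 +1 @@",
        "-Hello world",
        "+Hello, world!",
    ]
    diff_text = "\n".join(diff_lines).encode("utf-8")

    wire = gzip.compress(diff_text)            # compact, opaque in transit
    assert gzip.decompress(wire) == diff_text  # legible again on arrival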


But it would be trivial to OCR the text in a few microseconds, so I don't see how it makes any difference at all to give the text to the computer...

Exactly: the stupid bit is using the original text to drive the pixelization. Another approach would be to just generate random gray pixel values over the redacted text. Simple. Simpler, even, than the weird assumption that you would use the original text.
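
A minimal sketch of that, assuming Pillow, an RGB page image, and a hypothetical bounding box for the redacted region:

    import random
    from PIL import Image

    def redact(img: Image.Image, box: tuple) -> None:
        # Overwrite the region with random grays that carry no trace of
        # the underlying text (unlike a mosaic derived from it).
        left, top, right, bottom = box
        for y in range(top, bottom):
            for x in range(left, right):
                g = random.randint(0, 255)
                img.putpixel((x, y), (g, g, g))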

Are you telling me that our compression algorithms can't compress a page of "e"s tighter than a page of random Chinese characters?
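
They can, by orders of magnitude. A quick zlib check (page size and character range picked arbitrarily):

    import random
    import zlib

    page_of_e = ("e" * 3000).encode("utf-8")
    page_of_cjk = "".join(
        chr(random.randint(0x4E00, 0x9FFF)) for _ in range(3000)
    ).encode("utf-8")

    print(len(zlib.compress(page_of_e)))    # tiny: pure repetition
    print(len(zlib.compress(page_of_cjk)))  # thousands of bytes: high entropy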

Accessibility is a fair point, but for print-to-file applications we're surely at the point where OCR can at least get the text to a readable format, no?


Author here. I was also surprised by this.

I simplified the story a bit for brevity. I actually tried a bunch of different font styles, including a 47-page fax using a pretty large Courier (with only 72 characters per line). The screenshots from the blog post were taken after the point I decided OCR wasn't working, so I was using a heavily reduced font size to optimize the transfer time. Hence the characters looked like barely-legible blobs.

The Fax-to-Binary converter isn't doing anything particularly complicated with the image, just breaking it up into an accurately-aligned grid and hashing the pixel data of each tile.
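
In simplified sketch form (the real grid alignment is fussier than this, and the grid dimensions here are placeholders):

    import hashlib
    from PIL import Image

    def tile_hashes(page: Image.Image, cols: int, rows: int) -> list:
        # Slice the page into a character grid; identical glyphs at the
        # same size yield identical pixel data, hence identical hashes.
        w, h = page.size
        tw, th = w // cols, h // rows
        out = []
        for r in range(rows):
            for c in range(cols):
                tile = page.crop((c * tw, r * th, (c + 1) * tw, (r + 1) * th))
                out.append(hashlib.sha1(tile.tobytes()).hexdigest())
        return out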

Replacing the characters in the document hadn't occurred to me at the time! It's a good idea, but for my programmer brain, writing this software was the easier (and more fun) solution :)


This seems to be the worst of both worlds. It's not easy for a human to read a square (compared to a line of text). The pixellated font is also not easily readable compared to a vector font. It's also not easy for machines to read an optical coding with no spatially distributed redundancy.

QR codes and bar codes are brilliant for machines because misreads due to a spurious reflection or speck of dust are mitigated by error correction.

I feel like this problem is already well served by bar codes which have a human readable text representation below them (e.g. serial number stickers).

That said, I can see the security advantage of the computer reading the same representation as a human, although this is probably not the best place to enforce security. Since there's no integrity check, there's little guarantee the computer will read what you see. Maybe linear OCR combined with a barcode checksum would be a better way to achieve these goals.
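
A sketch of that last idea, assuming the barcode carries a CRC32 of the printed line (the checksum choice is mine):

    import zlib

    def verify(ocr_text: str, barcode_crc32: int) -> bool:
        # Accept the OCR result only if it matches the checksum carried
        # alongside it in machine-readable form.
        return zlib.crc32(ocr_text.encode("utf-8")) == barcode_crc32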


Replacing the typeset text with any reasonable fidelity seems like a much harder problem than reproducing the scan and providing the OCR'ed text content. It might still be a good idea to do; maybe some software does this.

I don't have any references, sorry.


Well sure, but then why don't we just call it "lossy GZIP"? OCR is a pretty specific subset, and it produces characters. This does not produce computer-readable characters, therefore it's not OCR.

What are you on about? What does it produce if not computer-readable characters? Computer-illegible characters? Are you saying it cannot read from the dictionary it creates? Or from the characters it later optically recognizes off that dictionary?

Again from the JBIG2 wiki[1]:

"Textual regions are compressed as follows: the foreground pixels in the regions are grouped into symbols. A dictionary of symbols is then created and encoded.."

It seems that not only is JBIG2 being deployed as OCR by Xerox for whatever reason, but its implementation in this case is an absolute failure.

[1] http://en.wikipedia.org/wiki/JBIG2


The stated goal of JBIG2 is to recognize 'characters' on the fly and compress them together. It's not traditional OCR but I wouldn't take such a hard line.
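
In toy form, the symbol coding works something like this (Hamming distance as the matcher is my simplification; real JBIG2 encoders are more sophisticated):

    def hamming(a: bytes, b: bytes) -> int:
        return sum(bin(x ^ y).count("1") for x, y in zip(a, b))

    def encode_symbols(glyphs: list, threshold: int):
        # glyphs: flattened 1-bit glyph bitmaps of equal size.
        dictionary, pointers = [], []
        for g in glyphs:
            match = next(
                (i for i, d in enumerate(dictionary)
                 if hamming(g, d) <= threshold),
                None,
            )
            if match is None:
                dictionary.append(g)       # new symbol
                match = len(dictionary) - 1
            pointers.append(match)         # reuse an existing symbol
        return dictionary, pointers

Set the threshold too loose and distinct glyphs (a 6 and an 8, say) collapse into one dictionary symbol, which is the failure mode being discussed.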

> why can’t it do a simple OCR to know those are characters not random shapes?

It's pretty easy to add this if you wanted to.

But a better method would be to fine-tune on a bunch of machine-generated images of words if you want your model to be good at generating characters. You'll need to consider which of the many Unicode character sets you want your model to specialize in, though.
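
A minimal sketch of generating such training pairs with Pillow (the word list and font path are placeholders):

    import random
    from PIL import Image, ImageDraw, ImageFont

    WORDS = ["example", "synthetic", "glyph"]  # placeholder corpus
    FONT_PATH = "DejaVuSans.ttf"               # hypothetical font file

    def make_sample():
        # Render a random word in a random size; keep the text as the label.
        word = random.choice(WORDS)
        font = ImageFont.truetype(FONT_PATH, size=random.randint(16, 48))
        img = Image.new("L", (256, 64), color=255)
        ImageDraw.Draw(img).text((8, 8), word, fill=0, font=font)
        return img, word  # (image, ground-truth text) training pair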


But the thing is, in that case the information contained in the images was actually much less than what we are meant to believe.

So if we are reconstructing letters from a known font, we are essentially extracting about 8 bits of information per character from the image (for the 95 printable ASCII characters, log2(95) ≈ 6.6 bits, so 8 bits is a generous upper bound). I'm pretty certain that if you distort the image to an SNR equivalent of below 8 bits, you will not be able to extract the information.

