
If you didn't know it already, give ocrmypdf (a Python/Tesseract wrapper) a try; all you need is `ocrmypdf in.pdf out.pdf`. It's not perfect, but it works well enough in 99% of common cases.
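
A slightly fuller invocation, if it helps; -l, --deskew and --rotate-pages are real ocrmypdf flags, and the file names are just placeholders:

    # OCR in English, deskewing and auto-rotating pages first
    ocrmypdf -l eng --deskew --rotate-pages in.pdf out.pdf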



OCRmyPDF is a tool built on Tesseract and designed specifically for PDFs. I would recommend it over pure Tesseract.

https://github.com/ocrmypdf/OCRmyPDF


I've used Tesseract directly, and there definitely are some footguns when it comes to PDFs, particularly making sure not to re-compress them and lose quality.

If you're looking to add a text layer to a PDF (for search purposes for instance) I can highly recommend OCRmyPDF: https://github.com/jbarlow83/OCRmyPDF/

It uses Tesseract and works quite well for most PDFs. I made a semi-functional script of my own before I discovered it; finding it earlier would have saved a lot of hassle.


You might be interested in https://github.com/ocrmypdf/OCRmyPDF then.

It does quite a bit of preprocessing on the PDF pages before passing them on to Tesseract.
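
Much of that preprocessing can also be controlled explicitly; a sketch with placeholder file names (--clean additionally requires unpaper to be installed):

    # auto-rotate, deskew and clean pages before they reach Tesseract
    ocrmypdf --rotate-pages --deskew --clean input.pdf output.pdf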


Also, ocrmypdf [0] is a great Tesseract-based tool which makes it easy to add OCR layers to raster PDFs. Plus, it takes care of optimizing the resulting file (compression, etc.). I used it on several old academic papers and have been pleased with the results.

[0]: https://github.com/jbarlow83/OCRmyPDF


If you're doing this locally on the CLI:

`pdftotext`, from http://www.foolabs.com/xpdf/

For OCR, `pdfimages` (also from xpdf), combined with ImageMagick's `convert` and `tesseract` (http://code.google.com/p/tesseract-ocr/), works passably well.
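
Roughly like this, assuming placeholder file names and noting that pdfimages emits PPM/PBM by default:

    # extract the embedded page images from the scan
    pdfimages scan.pdf page
    # normalize each image to a grayscale TIFF, then OCR it to a matching .txt file
    for f in page-*.ppm; do
        convert "$f" -colorspace gray -normalize "${f%.ppm}.tif"
        tesseract "${f%.ppm}.tif" "${f%.ppm}"
    done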


I struggled to get Tesseract to OCR my image-based PDFs directly, so I resorted to using Ghostscript to extract the pages to PNGs, which I then put through Tesseract. As an added bonus, I gained a thumbnail PNG for the search front end.
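
A sketch of that extraction step; the 300 dpi render and the file names are my assumptions:

    # render every page to a PNG for Tesseract
    gs -dNOPAUSE -dBATCH -sDEVICE=png16m -r300 -sOutputFile=page-%03d.png input.pdf
    # a low-resolution render of page 1 doubles as the thumbnail
    gs -dNOPAUSE -dBATCH -sDEVICE=png16m -r36 -dFirstPage=1 -dLastPage=1 -sOutputFile=thumb.png input.pdf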

Does anyone know a PDF OCR tool? I am using a free online one. I take pictures with OpenNoteScanner, which spits out a PDF that I want to make searchable.

Tesseract expects PNG input and outputs plain text. I want the same PDF with the OCR text overlaid as a hidden layer.

This free online PDF service does a decent job, but offline would be better.


There is the Python library ocrmypdf (https://ocrmypdf.readthedocs.io/en/latest/), which works really well. I have found the results comparable to Adobe's in accuracy.

I believe it uses Tesseract, Ghostscript and some other libraries.

Speaking of Ghostscript, one way to deal with problematic PDFs is to print them to file and work with the result instead.
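
The headless version of that print-to-file trick is to run the document back through Ghostscript's pdfwrite device (file names are placeholders):

    # re-interpret and re-emit the PDF, which often irons out structural problems
    gs -dNOPAUSE -dBATCH -sDEVICE=pdfwrite -sOutputFile=repaired.pdf problematic.pdf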


OCRmyPDF (based on Tesseract) works very well: https://github.com/jbarlow83/OCRmyPDF

OCRmyPDF adds an OCR text layer to scanned PDF files, allowing them to be searched.


For scanned images we use https://github.com/tesseract-ocr/tesseract. For text-based PDFs we pull the text directly from the file, and all languages are supported.
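
For the text-based case, one common way to pull the text layer out (not necessarily what they use) is Poppler's pdftotext, with placeholder file names:

    pdftotext paper.pdf paper.txt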

To extract text from photos and non-OCRed PDFs, Tesseract[1] with a language-specific model[2] has never failed me.

I use my shell utility[3] to automate the workflow with ImageMagick and Tesseract, with an intermediate step using monochrome TIFFs (roughly sketched after the links below). Extracting each page into a separate text file lets me ag/grep a phrase and then easily trace it back to the original PDF.

Having greppable libraries of books on various domains, and not having to crawl through web search each time, is very useful and time-saving.

[1] https://tesseract-ocr.github.io/

[2] https://github.com/tesseract-ocr/tessdata

[3] https://github.com/undebuggable/pdf2txt
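
The core of that workflow looks roughly like this; the 300 dpi setting, page numbering scheme and file names are my assumptions, not taken from the utility itself:

    # render each page to a monochrome TIFF (ImageMagick delegates PDF rasterizing to Ghostscript)
    convert -density 300 book.pdf -monochrome page-%03d.tif
    # OCR every page into its own text file, so matches map back to page numbers
    for f in page-*.tif; do tesseract "$f" "${f%.tif}"; done
    # then grep (or ag) across the per-page text files
    grep -l 'some phrase' page-*.txt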


I have a friend who has developed a number of Tesseract-based OCR applications specifically for PDFs. His Report Miner application does a nice job of locating and extracting PDF tables.

https://www.opait.com/tesseractstudio/

https://www.opait.com/Pdfreportminer/


If it's a scanned PDF (essentially a collection of one image per page), there would need to be an OCR step to get some text out first. Tesseract would work for this.

Once that's done, you have all the usual options for performing the search. I don't know of a search tool that does the OCR for you, but I did read a blog post about someone uploading PDFs to Google Drive (it OCRs them on upload) as an easy way to do this.
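
One way to wire the two steps together, with placeholder names: ocrmypdf for the OCR pass, then Poppler's pdftotext feeding an ordinary grep:

    # step 1: add a searchable text layer to the scan
    ocrmypdf scan.pdf scan-ocr.pdf
    # step 2: search the now-extractable text
    pdftotext scan-ocr.pdf - | grep -in 'invoice'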


I looked into OCR a while ago for some hundreds of thousands of pages of PDF. All hosted offerings would end up costing quite a bit.

After looking at the options and running a few tests, I figured I'd use https://github.com/jbarlow83/OCRmyPDF. It converts the PDF to an image for Tesseract and then recreates the PDF with the text copyable.

It won't identify the address part of a driver's license, but that wasn't necessary for this project.


Not sure if a multi-step approach is OK, but: convert the PDF to an image format such as PNG, use AI to recognize 'tabular blocks', then convert the PDF to a text format with the tabular blocks embedded as images to preserve spacing. See the sketch after these links.

https://stackoverflow.com/questions/3203790/parsing-pdf-file...

https://excalibur-py.readthedocs.io/en/master/

https://ledgerbox.io/blog/extract-tables-with-tesseract-ocr

https://www.johnsnowlabs.com/extract-tabular-data-from-pdf-i...
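
A hypothetical sketch of the rasterize-then-detect part (the file names, page index and 300 dpi are assumptions; tsv is a stock Tesseract output config):

    # rasterize page 1 of the PDF
    convert -density 300 'report.pdf[0]' page1.png
    # OCR it to TSV; the word bounding boxes help in spotting tabular blocks
    tesseract page1.png page1 tsv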


I agree on all points. I use the following one-liner in directories of PDFs to reduce their file size while retaining dimensions, not hurting readability, and keeping the embedded OCR text in place. It skips re-running the OCR. It's basically a recipe from the docs, I believe.

find . -name '*.pdf' | parallel --tag -j 1 ocrmypdf --tesseract-timeout=0 --skip-text --jbig2-lossy --optimize 3 --output-type pdf '{}' '{}-sm.pdf'


Check out Dangerzone. It encodes a .pdf (and other formats) to image data, then converts it back to .pdf, optionally preserving OCRed text, so that any potentially executable code hidden within is lost. For further security, all operations run sandboxed.

https://github.com/freedomofpress/dangerzone


> For the generated pdfs, I found pdftotext could pull with 100% fidelity, and so that was 'option #1'. for scanned-images-saved-as-pdfs, then tesseract could sometimes extract with 90+% accuracy.

I arrived at a similar conclusion, although I never bothered with a DB or a locally running web interface. Simply grepping the text files works flawlessly for me.


Scanned PDFs only work well if they already have an OCR layer. There's some optional integration of rga with Tesseract, but it's pretty slow and not as good as external OCR tools.

ripgrep-all can apply the same regexes as rg to any filetype it supports, so you could do something like --multiline with foo(\w+[\s\n]+){0,20}bar (full invocation sketched after the breakdown below).

It won't work exactly like this, but something similar should do it:

* --multiline enables multiline matching

* foo searches for foo

* \w+ matches at least one word character

* [\s\n]+ matches at least one whitespace character, including newlines

* {0,20} allows at most 20 repetitions of that word-plus-whitespace group

* bar searches for bar
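
Put together, a sketch of the full invocation (the directory and the words foo/bar are placeholders):

    rga --multiline 'foo(\w+[\s\n]+){0,20}bar' ~/documents/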

