If you didn't know it, give ocrmypdf (a Python wrapper around Tesseract) a try; all you need is `ocrmypdf in.pdf out.pdf`. It's not perfect, but it works well enough in 99% of common cases.
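In practice a couple of extra flags help with messy scans; these are standard ocrmypdf options:

```sh
# -l picks the Tesseract language; --deskew straightens crooked scans and
# --rotate-pages fixes pages scanned sideways or upside down.
ocrmypdf -l eng --deskew --rotate-pages in.pdf out.pdf
```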
I've used tesseract directly, and there are definitely some footguns when it comes to PDFs and making sure you don't re-compress them and lose quality.
If you're looking to add a text layer to a PDF (for search purposes, for instance), I can highly recommend OCRmyPDF: https://github.com/jbarlow83/OCRmyPDF/
It uses Tesseract and works quite well for most PDFs. I made a semi-functional script before I discovered it, and it would have saved me a lot of hassle.
Also, ocrmypdf [0] is a great Tesseract-based tool that makes it easy to add OCR layers to raster PDFs. Plus, it takes care of optimizing the resulting file (compression, etc.). I used it on several old academic papers and have been pleased with the results.
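As a concrete example of the optimization side, ocrmypdf exposes it via --optimize (level 3 being the most aggressive, and benefiting from optional extras like pngquant being installed):

```sh
# Add the OCR layer and apply the strongest optimization pass.
ocrmypdf --optimize 3 old-paper.pdf old-paper-ocr.pdf
```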
For OCR, `pdfimages` (also from xpdf), combined with ImageMagick's `convert`, and `tesseract` (http://code.google.com/p/tesseract-ocr/) works passably well.
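A minimal version of that pipeline might look like this (file names are placeholders; pdfimages writes PPM, or PBM for monochrome scans, which `convert` can turn into something Tesseract likes):

```sh
# Extract the embedded page images from the scan.
pdfimages scan.pdf page

# Convert each image to TIFF and OCR it; tesseract writes page-NNN.txt.
for f in page-*.p?m; do
  convert "$f" "${f%.*}.tif"
  tesseract "${f%.*}.tif" "${f%.*}"
done
```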
I struggled to get tesseract to OCR my image-based PDFs directly, so I resorted to using Ghostscript to extract the pages to PNGs, which I then put through tesseract. As an added bonus, I gained the ability to have a thumbnail PNG for the search front end.
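The Ghostscript step is something like this (exact resolution and device are whatever suits your scans):

```sh
# Rasterize each page to a 300 dpi PNG, then OCR the pages.
gs -dBATCH -dNOPAUSE -sDEVICE=png16m -r300 -sOutputFile=page-%03d.png input.pdf
for p in page-*.png; do
  tesseract "$p" "${p%.png}"     # one .txt per page
done
# A downscaled copy of page 1 also makes a handy thumbnail:
convert -resize 200x page-001.png thumb.png
```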
To extract text from photos and non-OCRed PDFs, Tesseract [1] with a language-specific model [2] never fails me.
I use my shell utility [3] to automate the workflow with ImageMagick and Tesseract, with an intermediate step using monochrome TIFFs. Extracting each page into a separate text file lets me ag/grep a phrase and then easily find it back in the original PDF.
Having greppable libraries of books on various domains, and not having to crawl through web search each time, is very useful and time-saving.
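For anyone curious, a minimal sketch of that kind of workflow; this is just the general shape of it, not the actual utility from [3]:

```sh
#!/bin/sh
# Usage: ocr-grep.sh book.pdf "phrase"
# Rasterize each page to a 300 dpi monochrome TIFF, OCR each page to its
# own text file, then grep across the per-page results.
pdf="$1"
base="${pdf%.pdf}"
convert -density 300 "$pdf" -monochrome "${base}-%03d.tif"
for tif in "${base}"-*.tif; do
  tesseract "$tif" "${tif%.tif}"   # writes one .txt per page
done
grep -n "$2" "${base}"-*.txt       # the page number is right in the filename
```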
I have a friend who has developed a number of applications that do OCR specifically for PDFs, built on Tesseract. His Report Miner application does a nice job of locating and extracting PDF tables.
If it's a scanned PDF (essentially a collection of one image per page), there would need to be an OCR step to get some text out first. Tesseract would work for this.
Once that's done, you have all the usual search options available. But I don't know of a search tool that does the OCR for you. I did read a blog post by someone who uploaded PDFs to Google Drive (it OCRs them on upload) as an easy way to do this.
I looked into OCR a while ago for some hundreds of thousands of pages of PDFs. All the hosted offerings would have ended up costing quite a bit.
After looking at the options and running a few tests, I figured I'd use https://github.com/jbarlow83/OCRmyPDF
It converts the PDF to images for Tesseract and then recreates the PDF with the text copyable.
It won't identify the address part of a driver's license, but that wasn't necessary for this project.
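For that many pages, something like the following batch invocation is the general idea (--skip-text is a real flag that leaves pages with an existing text layer alone; paths here are placeholders):

```sh
# OCR every PDF under scans/ into an ocr/ mirror, skipping born-digital pages.
mkdir -p ocr
for f in scans/*.pdf; do
  ocrmypdf --skip-text "$f" "ocr/$(basename "$f")"
done
```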
Not sure if a multi-step approach is OK, but: convert the PDF to an image format such as PNG, use AI to recognize 'tabular blocks', then convert the PDF to a text format with the tabular blocks embedded as images to preserve spacing.
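Sketching just the rasterization step of that idea (the table-detection model itself is left abstract here):

```sh
# Render each page to a 300 dpi PNG for a layout/table-detection model.
pdftoppm -png -r 300 input.pdf page   # writes page-1.png, page-2.png, ...
```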
I agree on all points. I use the following one-liner in directories of PDFs to reduce their file size while retaining dimensions, not hurting readability, and keeping the embedded OCR text in place. It skips re-running the OCR. It's basically a recipe from the docs, I believe.
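Something along these lines; this is a reconstruction based on the OCRmyPDF docs rather than the exact command:

```sh
# Re-optimize each PDF without re-running OCR: --skip-text passes pages that
# already have a text layer through untouched, --optimize 3 applies the
# strongest compression pass.
mkdir -p slim; for f in *.pdf; do ocrmypdf --skip-text --optimize 3 "$f" "slim/$f"; done
```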
Check out DangerZone. It re-encodes a .pdf (and other formats) to image data and then converts it back to .pdf, optionally preserving the OCRed text, so that any executable code potentially hidden within is lost. For further security, all operations run sandboxed.
> For the generated PDFs, I found pdftotext could pull text with 100% fidelity, and so that was 'option #1'. For scanned-images-saved-as-PDFs, tesseract could sometimes extract with 90+% accuracy.
Arrived at a similar conclusion, although I never bothered with a DB or any web interface running locally. Simply grepping the text files works flawlessly for me.
Scanned PDFs only work well if they already have an OCR layer. There's some optional integration of rga with Tesseract, but it's pretty slow and not as good as external OCR tools.
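If I'm reading the rga docs right, that integration is enabled like this (the pdfpages and tesseract adapters are off by default because of the speed cost):

```sh
# Enable the slow OCR adapters for this one search.
rga --rga-adapters=+pdfpages,tesseract 'invoice total' scans/
```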
ripgrep-all can run the same regexes as rg on any filetype it supports, so you could do something like --multiline with foo(\w+[\W]+){0,20}bar
It won't work exactly like this, but something similar should do it:
* --multiline enables multiline matching
* foo searches for foo
* \w+ searches for at least one word character
* [\W]+ searches for at least one space/non-word character, like sentence punctuation
* {0,20} searches for at most 20 repetitions of the word/non-word combination
* bar searches for bar
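Put together, the invocation looks roughly like this (rga passes rg's flags through, so --multiline should apply):

```sh
# Find "foo" followed by "bar" within ~20 words, even across line breaks.
rga --multiline 'foo(\w+[\W]+){0,20}bar' documents/
```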