
To extract text from photos and non-OCR'd PDFs, Tesseract[1] with a language-specific model[2] has never failed me.

I use my shell utility[3] to automate the workflow with ImageMagick and Tesseract, with an intermediate step using monochrome TIFFs. Extracting each page into a separate text file lets me ag/grep for a phrase and then easily find it back in the original PDF.
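
A minimal sketch of the kind of loop such a utility automates, assuming ImageMagick (identify/convert) and Tesseract are installed; the file names, language, and 300 dpi density are illustrative choices, not taken from the linked repo:

    pdf="book.pdf"
    pages=$(identify -format "%n\n" "$pdf" | head -1)    # total page count
    for i in $(seq 0 $((pages - 1))); do
        # rasterize one page to a monochrome TIFF
        convert -density 300 "$pdf[$i]" -monochrome "page_$i.tif"
        # OCR it; writes page_$i.txt (-l picks the language model)
        tesseract "page_$i.tif" "page_$i" -l eng
    done
    rm -f page_*.tif    # the intermediate TIFFs are huge

Each page_N.txt can then be grepped on its own, which is what makes it easy to map a hit back to a page in the PDF.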

Having greppable libraries of books on various domains, and not having to crawl through web search results each time, is very useful and time-saving.

[1] https://tesseract-ocr.github.io/

[2] https://github.com/tesseract-ocr/tessdata

[3] https://github.com/undebuggable/pdf2txt




For scanned images we use https://github.com/tesseract-ocr/tesseract. For text-based PDFs we pull the text directly from the file, and all languages are supported.
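
A hedged sketch of that split, using poppler's pdftotext and pdftoppm; the empty-output test, resolution, and file names are my own illustration, not the poster's pipeline:

    if pdftotext in.pdf - | grep -q '[[:alnum:]]'; then
        pdftotext in.pdf out.txt             # PDF has a text layer
    else
        pdftoppm -r 300 -png in.pdf page     # page-1.png, page-2.png, ...
        for f in page-*.png; do
            tesseract "$f" "${f%.png}"       # writes ${f%.png}.txt
        done
    fi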

ImageMagick and Tesseract for OCR-ing each page of a PDF into a separate text file (through the TIFF image format; discard the huge TIFFs afterwards), private git repos for hosting, then ag/grep for searching. Not as easy to find the phrase back in the PDF as with e.g. Google Books, but Google Books, with its copyright-related content restrictions, is useless most of the time.

I also built a system that extracted structured and unstructured text from images/PDFs. For the generated PDFs, I found pdftotext could pull text with 100% fidelity, and so that was option #1. For scanned-images-saved-as-PDFs, Tesseract could sometimes extract with 90+% accuracy, but never 100%. Combining pdftotext (with the right flags set) with some of the other associated PDF tools, we were able to achieve what we were after: building a searchable DB and auto-informing corpus of information derived entirely from various PDF sources. All in-house, no sending off to third parties.

I struggled to get Tesseract to OCR my image-based PDFs directly, so I resorted to using Ghostscript to extract the pages to PNGs, which I then put through Tesseract. As an added bonus, I gained the ability to have a thumbnail PNG for the search front end.
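
A rough sketch of that route, assuming Ghostscript and Tesseract; the resolutions and file names are illustrative:

    # render each page to a PNG for OCR
    gs -dBATCH -dNOPAUSE -sDEVICE=png16m -r300 \
       -sOutputFile=page-%03d.png input.pdf
    for f in page-*.png; do
        tesseract "$f" "${f%.png}"
    done
    # a low-resolution second pass produces the thumbnails
    gs -dBATCH -dNOPAUSE -sDEVICE=png16m -r30 \
       -sOutputFile=thumb-%03d.png input.pdf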

If you're doing this locally/on the CLI:

`pdftotext`, from http://www.foolabs.com/xpdf/

For OCR, `pdfimages` (also from xpdf), combined with ImageMagick's `convert` and `tesseract` (http://code.google.com/p/tesseract-ocr/), works passably well.
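
A sketch of that chain (file names are mine; the classic xpdf pdfimages emits PPM/PBM files, hence the convert step):

    pdfimages scan.pdf img                  # dumps img-000.ppm / img-000.pbm ...
    for f in img-*.p?m; do
        convert "$f" -colorspace Gray "${f%.*}.tif"   # normalize for OCR
        tesseract "${f%.*}.tif" "${f%.*}"
    done
    cat img-*.txt > scan.txt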


I have a friend who has developed a number of applications that use Tesseract for OCR specifically on PDFs. The Report Miner application does a nice job of locating and extracting tables from PDFs.

https://www.opait.com/tesseractstudio/

https://www.opait.com/Pdfreportminer/


> For the generated PDFs, I found pdftotext could pull text with 100% fidelity, and so that was option #1. For scanned-images-saved-as-PDFs, Tesseract could sometimes extract with 90+% accuracy.

I arrived at a similar conclusion, although I never bothered with a DB or any web interface running locally. Simply grepping the text files works flawlessly for me.


If it's a scanned PDF (essentially a collection of one image per page), there would need to be an OCR step to get some text out first. Tesseract would work for this.

Once that's done, you have all the usual options for performing that search. But I don't know of a search tool that does the OCR for you. I did read a blog post by someone uploading PDFs to Google Drive (it OCRs them on upload) as an easy way to do this.


Does anyone know a PDF OCR tool? I am using a free online one. I take pictures with OpenNoteScanner, which spits out a PDF that I want to make searchable.

Tesseract expects a PNG and outputs text. I want the same PDF with the OCR text hidden in an overlay.

This free online PDF service does a decent job, but an offline one would be better.
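
For what it's worth, reasonably recent Tesseract versions can emit exactly that kind of searchable PDF themselves, with the text layer hidden under the image (file names here are placeholders):

    tesseract scan.png scan pdf     # writes scan.pdf with a hidden text layer
    # for multi-page input, pass a text file listing the image paths:
    tesseract pages.txt book pdf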


> so I resorted to using Ghostscript to extract the pages to PNGs, which I then put through Tesseract

I understand you use this to extract text from non-OCR'd PDFs, especially those consisting of low-quality scans or photos (e.g. low resolution, JPEG artifacts).

Occasionally, passing a higher resolution to ImageMagick when converting a page to TIFF helped, but this sounds like a reasonable fallback as well.
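
E.g. something along these lines; the 600 dpi value is an illustrative guess for a low-quality scan, not a recommendation from the thread:

    convert -density 600 'scan.pdf[0]' -monochrome page0.tif
    tesseract page0.tif page0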


Although https://www.willus.com/k2pdfopt/ is meant for reformatting PDFs to view on e-readers, it does do a reasonable job of extracting text via OCR and storing it as a PDF layer. The underlying engine can be either https://github.com/tesseract-ocr/tesseract or http://jocr.sourceforge.net/

I've used Tesseract directly, and there definitely are some footguns when it comes to PDFs and making sure not to re-compress them and lose quality.

If you're looking to add a text layer to a PDF (for search purposes, for instance), I can highly recommend OCRmyPDF: https://github.com/jbarlow83/OCRmyPDF/

It uses Tesseract and works quite well for most PDFs. I made a semi-functional script before I discovered it; it would have saved me a lot of hassle.
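
A couple of invocations that cover common messy-scan cases (these flags exist in OCRmyPDF; the file names are placeholders):

    # straighten and auto-rotate pages, OCR in English
    ocrmypdf --deskew --rotate-pages -l eng input.pdf output.pdf
    # skip pages that already contain text instead of erroring out
    ocrmypdf --skip-text input.pdf output.pdf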


I looked into OCR a while ago for some hundreds of thousands of pages of PDFs. All the hosted offerings would have ended up costing quite a bit.

After looking at the options and a few tests, I figured I'd use https://github.com/jbarlow83/OCRmyPDF. It converts the PDF to an image for Tesseract and then recreates the PDF with the text copyable.
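
For a corpus that size, a driver loop along these lines is one option; the directory layout and --jobs value are my own assumptions (ocrmypdf itself parallelizes across pages):

    mkdir -p ocr
    for f in scans/*.pdf; do
        ocrmypdf --jobs 4 --skip-text "$f" "ocr/$(basename "$f")" \
            || echo "failed: $f" >> ocr-failures.log
    done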

It won't identify the address part of a driver's license, but that wasn't necessary for this project.


> Are there any open source tools that would slurp in content like this ...

Yes, tesseract[1] can do a pretty good job. Here[2] is a blog post which describes using it to perform OCR on PDFs.

As for searching the PDF contents, Solr[3] might be what you are looking for instead.

1 - https://github.com/tesseract-ocr/tesseract

2 - http://fransdejonge.com/2012/04/ocr-text-in-pdf-with-tessera...

3 - http://stackoverflow.com/questions/6694327/indexing-pdf-with...


OCRmyPDF (based on Tesseract) works very well: https://github.com/jbarlow83/OCRmyPDF

OCRmyPDF adds an OCR text layer to scanned PDF files, allowing them to be searched.


Hi Hackers,

Often I get PDFs from which I want to extract text and paste it somewhere else. Not all PDFs are well constructed, and a lot of them are scans. Unfortunately, macOS Preview and other classic PDF viewers cannot extract text from those.

So I have built a minimalist website to extract text from any PDF, scanned ones included. It uses OCR to extract text, and the user can highlight specific areas of the document to extract from. The extraction happens locally in the browser, thanks to the awesome Tesseract.js library.

I would love to have your feedback before adding more features (zoom setting, improved area selection, PNG/JPEG support, mobile support, offline support, ...).


Scanned PDFs only work well if they already have an OCR layer. There's some optional integration of rga with Tesseract, but it's pretty slow and not as good as external OCR tools.

ripgrep-all can use the same regexes as rg on any filetype it supports. So you could do something like --multiline and foo(\w+[\s\n]+){0,20}bar

It won't work exactly like this, but something similar should do it (see the sketch after the list):

* --multiline enables multiline matching

* foo matches the literal foo

* \w+ matches at least one word character

* [\s\n]+ matches at least one whitespace character (spaces or newlines)

* {0,20} allows at most 20 repetitions of that word+whitespace combination

* bar matches the literal bar
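
Concretely, the invocation might look like this (the path is a placeholder; rga passes most rg flags, including --multiline, straight through to rg, and Rust's regex engine wants the explicit lower bound {0,20} rather than {,20}):

    rga --multiline 'foo(\w+[\s\n]+){0,20}bar' ~/library/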


With all due respect, if you're looking to extract text and images from documents, there are free and open-source options. DocuHarvest looks like a great service for an end user, but if you're a programmer, I recommend you take a peek at Docsplit, an open-source project of ours that extracts text, images, and metadata from documents, including non-PDFs.

http://documentcloud.github.com/docsplit/

Or the Python port:

http://github.com/anderser/pydocsplit

It's a thin wrapper on top of a number of excellent open-source projects that do the real work:

* OpenOffice / JODConverter, to convert ".doc", ".ppt", ".xls", ".rtf" into PDFs.

* GraphicsMagick and Ghostscript, to render PDFs into images of any size and format.

* Apache PDFBox, to extract UTF-8 plain text and metadata from PDFs.
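
From memory, the command-line usage looks roughly like this; treat the subcommands and options as illustrative and check `docsplit --help` for the real interface:

    docsplit text report.pdf                 # extract plain text
    docsplit images report.pdf --size 700x   # render page images
    docsplit pdf slides.ppt                  # convert an office doc to PDF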

We're thinking about adding Tesseract-based OCR to the text extraction, but it's still a little difficult to figure out how to package that up portably for multiple platforms.

Full disclosure: Docsplit is part of DocumentCloud, a non-profit project funded by the Knight Foundation to help journalists work with primary source documents.


If you didn't know it, give ocrmypdf (a Python/Tesseract wrapper) a try; all you need is `ocrmypdf in.pdf out.pdf`. It's not super-perfect, but it works well enough in 99% of common cases.
