
I have an interesting take on this. Most of the books I've read, I have a copy of. A while back, I endeavored to cut, scan, and OCR them all into my computer. One idea was that I could then do a full-text search limited to what I've already read, rather than what Google thinks is relevant.

So far, I've found it very handy to find something if I at least remember which book it was in. But I need a program that can extract the OCR'd text from .pdf files - anyone know of a simple one?

(I can do it manually, one at a time, by bringing it up in a pdf reader, but that's too tedious and slow.)
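
Roughly the shape of what I'm after, sketched in Python (assuming the pypdf library can read the OCR'd text layer; the directory name is a placeholder):

    import pathlib
    from pypdf import PdfReader

    # Dump the embedded (OCR'd) text layer of every PDF into a sibling .txt file.
    for pdf in pathlib.Path("scanned-books").glob("*.pdf"):
        text = "\n".join(page.extract_text() or "" for page in PdfReader(pdf).pages)
        pdf.with_suffix(".txt").write_text(text)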




Does anyone actually OCR? The systems I've heard of just extract the embedded text from PDFs/docs; when some bits couldn't be extracted, I was asked to type them in myself.

On a related note: I'm looking for a command line utility (*nix) to go the other way. Any recommendations for a tool which extracts text from a PDF?

If it's a scanned PDF (essentially a collection of one image per page), there would need to be an OCR step to get some text out first. Tesseract would work for this.

Once that's done, you have all the options available to perform that search. But I don't know of a search tool that does the OCR for you. I did read a blog post by someone uploading PDFs to Google Drive (it OCRs them on upload) as an easy way to do this.
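
A minimal sketch of that Tesseract step, assuming Python's pdf2image and pytesseract wrappers (Poppler and Tesseract both need to be installed; the filename is a placeholder):

    import pytesseract
    from pdf2image import convert_from_path

    # Rasterize each page of the scanned PDF, then OCR each page image.
    pages = convert_from_path("scan.pdf", dpi=300)
    text = "\n".join(pytesseract.image_to_string(page) for page in pages)
    print(text)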


Amazon Textract does a phenomenal job of extracting text from dodgy scanned PDFs - I've been running it against scanned typewritten text and even handwritten journal text from the 1880s with great results.

I built a tool for running OCR against every PDF in an S3 bucket (which costs about $1.50/thousand pages) here: https://simonwillison.net/2022/Jun/30/s3-ocr/
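
If you'd rather call Textract yourself, the asynchronous boto3 API looks roughly like this (bucket and key names are placeholders, and I've skipped result pagination):

    import time
    import boto3

    textract = boto3.client("textract")

    # Start asynchronous text detection on a PDF already sitting in S3.
    job = textract.start_document_text_detection(
        DocumentLocation={"S3Object": {"Bucket": "my-bucket", "Name": "scan.pdf"}}
    )

    # Poll until the job completes, then print each detected line.
    while True:
        result = textract.get_document_text_detection(JobId=job["JobId"])
        if result["JobStatus"] != "IN_PROGRESS":
            break
        time.sleep(5)

    for block in result.get("Blocks", []):
        if block["BlockType"] == "LINE":
            print(block["Text"])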


Hi, I'm working on plagiarism detection and need some help with text extraction from PDFs. I've tried PDFTextStream, which works really well for extracting the raw text. Now I need to extract the text into a structured format where I could query things like title, chapters, etc. Would appreciate any pointers to achieving this. Thanks.
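
To illustrate, the kind of query I'm after, sketched with pypdf's metadata and outline support (the filename is a placeholder):

    from pypdf import PdfReader

    reader = PdfReader("paper.pdf")
    print("Title:", reader.metadata.title if reader.metadata else None)

    # Walk the bookmark tree; pypdf nests sub-chapters as lists.
    def walk(outline, depth=0):
        for entry in outline:
            if isinstance(entry, list):
                walk(entry, depth + 1)
            else:
                print("  " * depth + entry.title)

    walk(reader.outline)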

In plain text format though?

AFAIK, most is in PDF, EPUB, or MOBI format. I'd presume it's not too difficult to extract text from the latter two, but extracting text from PDFs is far from simple, and something you get working for one PDF won't necessarily work on another.
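
EPUB at least really is easy; it's just zipped XHTML, so a stdlib-only sketch like this gets most of the text out:

    import zipfile
    from html.parser import HTMLParser

    class TextGrabber(HTMLParser):
        def __init__(self):
            super().__init__()
            self.chunks = []

        def handle_data(self, data):
            self.chunks.append(data)

    # Concatenate the text of every (X)HTML file inside the EPUB container.
    def epub_to_text(path):
        parts = []
        with zipfile.ZipFile(path) as z:
            for name in z.namelist():
                if name.endswith((".xhtml", ".html", ".htm")):
                    grabber = TextGrabber()
                    grabber.feed(z.read(name).decode("utf-8", "ignore"))
                    parts.append(" ".join(grabber.chunks))
        return "\n".join(parts)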


Can someone convert this into a searchable PDF, using OCR?
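
(If nobody has yet, OCRmyPDF is built for exactly this. A minimal sketch of its Python API, with placeholder filenames:)

    import ocrmypdf

    # Adds a searchable text layer (via Tesseract) on top of the scanned pages.
    ocrmypdf.ocr("scan.pdf", "searchable.pdf")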

Really great work!

Two questions:

What sort of OCR stack did you use?

Is there a way to see the text inside the search results? I'm only seeing the PDFs themselves and would love to do some full text searches of my own!


Reminds me of printergate. Fair. OK, so what about using an OCR tool to convert to text, then converting that back to PDF?

ImageMagick and Tesseract for OCR-ing each page of a PDF into a separate text file (via TIFF as an intermediate format; discard the huge TIFFs afterwards), private git repos for hosting, then ag/grep for searching. Not as easy to find the phrase back in the PDF as with e.g. Google Books, but then Google Books, with its copyright-related content restrictions, is useless most of the time.
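
Sketched as a single Python script, in case it's useful (placeholder filenames; assumes ImageMagick's convert and tesseract are on the PATH):

    import pathlib
    import subprocess

    # Rasterize each PDF page to a monochrome TIFF with ImageMagick.
    subprocess.run(
        ["convert", "-density", "300", "book.pdf",
         "-monochrome", "page-%04d.tif"],
        check=True,
    )

    # OCR each TIFF into its own text file, then discard the huge TIFF.
    for tif in sorted(pathlib.Path(".").glob("page-*.tif")):
        subprocess.run(["tesseract", str(tif), str(tif.with_suffix(""))], check=True)
        tif.unlink()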

Hi,

I have scanned a couple of books written by an ancestor.

He was not famous, and I doubt too many people are interested, but I wanted to give the books a new digital life. I think the enormous back catalog of things that have been published but then disappeared is sad.

Anyways, I have a few hundred PDF files. They have an image of the page and the OCR text. The text will have to be edited by hand, I think. It's not in English.

I was wondering if anyone knew of scripts or applications that can take that as an input, do some decent formatting and typesetting and spit out a couple of different formats of eBooks at the other end.

There are photos and illustrations on some pages. I am hoping to keep them.

The images are OK, but given the curvature of the page and other factors, I don't feel they are ready to be used as-is.

I could try to cut and paste it all into Word.

There is also an extensive cross-reference and table of contents that I have no idea how to deal with. I presume I will do it by hand, but it would take a long time.

Anyways, if you have any tips for me on how to take the raw files and make them into pretty, easy-to-read eBooks, that would be fabulous.
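
(One direction I'm considering: if I hand-clean the OCR text into Markdown chapter files, Pandoc should be able to do the typesetting and format conversion. A sketch, with placeholder filenames:)

    import subprocess

    # Stitch cleaned-up Markdown chapters into an EPUB with Pandoc.
    subprocess.run(
        ["pandoc", "ch01.md", "ch02.md", "-o", "book.epub",
         "--metadata", "title=My Ancestor's Book"],
        check=True,
    )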


This looks amazing, I'll have to play around with this over the weekend.

I regularly hand-transcribe RPG PDF scans from dubious sources that have not always been run through OCR to have selectable text. When they have, it wasn't always done very well.

It's literally faster to type it all myself than to fix all the errors from copy-pasting (or after using OCR to turn it into text).

Even if the file was an official PDF, the formatting would often get screwed up, with lots of double or triple spaces and even tabs between words.
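
A trivial cleanup sketch for that particular annoyance:

    import re

    # Collapse the stray runs of spaces/tabs that PDF copy-paste introduces.
    def squash_whitespace(text):
        return re.sub(r"[ \t]+", " ", text)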

This would save so much time if I can get it to work. Thanks for sharing!


At least PDF occasionally contains actual text. My organisation systematically scans everything to TIFF images for archival. So now we are embarking on a major project to OCR the TIFFs to get back the text (!).
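
The core of the OCR step is thankfully small. A sketch with pytesseract and Pillow, assuming one page per TIFF (the directory name is a placeholder):

    import pathlib
    import pytesseract
    from PIL import Image

    # OCR every archival TIFF straight to a text file next to it.
    for tiff in pathlib.Path("archive").glob("*.tif"):
        text = pytesseract.image_to_string(Image.open(tiff))
        tiff.with_suffix(".txt").write_text(text)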

Not exactly trivial, but text can be extracted from a PDF.

You could always scrape plain text from the PDF files. It might even be better, since I imagine some books in plain text may be outdated. I did have K&R C and SICP in plain text, but I'm not sure where they've gone off to. Good luck though.

Apologies. The PDFs that we deal with are digital-native, but do not have embedded text and are not searchable. I simply want to OCR the PDF and spit the text into a Word/text file.

I don't even care about perfect formatting, that's easy to fix. I do care about perfect OCR. That's crucial.


To extract text from photos and non-OCRed PDFs, Tesseract [1] with a language-specific model [2] never fails me.

I use my shell utility [3] to automate the workflow with ImageMagick and Tesseract, with an intermediate step using monochrome TIFFs. Extracting each page into a separate text file lets me ag/grep for a phrase and then easily find it back in the original PDF.

Having greppable libraries of books on various domains, and not having to crawl through web search each time, is very useful and time-saving.

[1] https://tesseract-ocr.github.io/

[2] https://github.com/tesseract-ocr/tessdata

[3] https://github.com/undebuggable/pdf2txt
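
The search side is just a loop over the per-page files. A sketch, assuming a layout like library/book/page-0042.txt (the naming scheme here is my assumption, not necessarily what pdf2txt produces):

    import pathlib
    import re

    # Find a phrase across per-page OCR text files and report the PDF page.
    def search(library, phrase):
        for txt in pathlib.Path(library).glob("*/page-*.txt"):
            if phrase.lower() in txt.read_text(errors="ignore").lower():
                page = int(re.search(r"page-(\d+)", txt.name).group(1))
                print(f"{txt.parent.name}.pdf, page {page}")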


Debian has a command-line tool, 'pdftotext', which extracts the text from a PDF. It is not OCR; it pulls the characters from the file itself. It's in the package called poppler-utils.
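
That directly answers the question upthread; a batch sketch over a directory of PDFs (placeholder name), assuming poppler-utils is installed:

    import pathlib
    import subprocess

    # Extract the embedded text of every PDF into a sibling .txt file.
    for pdf in pathlib.Path("books").glob("*.pdf"):
        subprocess.run(["pdftotext", str(pdf), str(pdf.with_suffix(".txt"))], check=True)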

Do you have any tool suggestions or general advice for someone trying to do this? A while back I was trying to extract text from some government PDFs in order to make the information more accessible for others, but I became a bit overwhelmed when I started reading up on PDFs.
