Apologies. The PDFs that we deal with are digital-native, but do not have embedded text and are not searchable. I simply want to OCR the PDF and spit the text into a Word/text file.
I don't even care about perfect formatting; that's easy to fix. I do care about perfect OCR. That's crucial.
If it's a scanned PDF (essentially a collection of one image per page), there would need to be an OCR step to get some text out first. Tesseract would work for this.
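A minimal sketch of that OCR step, assuming Tesseract and poppler are installed and the Python packages pdf2image and pytesseract are available (the file name is just an example):

```python
# Minimal sketch: OCR a scanned PDF with Tesseract via pytesseract.
# Assumes the tesseract binary and poppler utilities are installed,
# plus the pip packages pdf2image and pytesseract.
from pdf2image import convert_from_path
import pytesseract

def ocr_scanned_pdf(pdf_path: str) -> str:
    # Render each page to an image (300 DPI is a reasonable OCR default),
    # then let Tesseract recognize the text on each page.
    pages = convert_from_path(pdf_path, dpi=300)
    return "\n\n".join(pytesseract.image_to_string(page) for page in pages)

if __name__ == "__main__":
    print(ocr_scanned_pdf("input.pdf"))  # example file name
```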
Once that's done, you have all the options available to perform that search. But I don't know of a search tool that does the OCR for you. I did read a blog post by someone who uploaded PDFs to Google Drive (it OCRs them on upload) as an easy way to do this.
Hey amelius! Though OCR would provide a generic solution, it would be overkill for text-based PDFs. I'm working on getting an OCR solution up, since there's still a lot of data trapped inside scanned PDFs rather than text-based ones.
I think people will get tripped up by you saying it "can't" be OCR'd or that it is difficult to do so, and will end up looking past a pretty elegant solution in the process.
This seems like a rather clever way to trip up non-targeted scrapers, which might attempt to OCR any images they encounter but will ignore what looks like random gibberish codepoints. It doesn't eliminate the ability to index this data, but I can see how it might greatly reduce it.
Obviously you could still convert these PDFs to an image and OCR them, but that's not the thing being defended against here.
Easily perform OCR (Optical Character Recognition) on PDFs. `pdf2searchablepdf input.pdf` = voila! "input_searchable.pdf" is created & now has searchable text
ImageMagick and Tesseract for OCR-ing each page of a PDF into a separate text file (via TIFF as the intermediate format; discard the huge TIFFs afterwards), private git repos for hosting, then ag/grep for searching. It's not as easy to find the phrase back in the PDF as with e.g. Google Books, but then GB is useless most of the time because of its copyright-related content restrictions.
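Roughly, the rasterize-then-OCR part of that pipeline could be driven from Python like this; a sketch that assumes the `convert` (ImageMagick) and `tesseract` binaries are on the PATH, with illustrative file names:

```python
# Sketch of the ImageMagick + Tesseract per-page pipeline described above.
import glob
import subprocess

def pdf_to_text_files(pdf_path: str) -> None:
    # Rasterize every page to a TIFF (the huge intermediates can be deleted later).
    subprocess.run(
        ["convert", "-density", "300", pdf_path, "page-%03d.tiff"],
        check=True,
    )
    # OCR each page into its own .txt file, ready for ag/grep.
    for tiff in sorted(glob.glob("page-*.tiff")):
        subprocess.run(["tesseract", tiff, tiff.removesuffix(".tiff")], check=True)

pdf_to_text_files("scan.pdf")  # example file name
```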
Sometimes PDF files are just images of text from a scan; no actual text. Sometimes I try copying and pasting from PDFs and get random garbage. They could add OCR, though.
I think this is a good idea, and that it wouldn't be all that difficult to convert PDFs into text, at least some of them: some are represented as text, while others are just picture scans, which would be much more difficult to deal with.
A cursory Google search suggests you could use a package like poppler to convert the pdf to raw text, and then in theory use regex to create data your server could use and serve.
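As a hedged sketch of that idea: pdftotext (from poppler-utils) dumps the embedded text, then a regex pulls out whatever the server needs. The date pattern below is a made-up placeholder, not something from the thread:

```python
# Sketch: poppler's pdftotext for extraction, then regex over the result.
import re
import subprocess

def extract_dates(pdf_path: str) -> list[str]:
    # Passing "-" as the output file makes pdftotext write to stdout.
    text = subprocess.run(
        ["pdftotext", pdf_path, "-"],
        capture_output=True, text=True, check=True,
    ).stdout
    # Hypothetical example pattern: ISO-style dates (YYYY-MM-DD).
    return re.findall(r"\b\d{4}-\d{2}-\d{2}\b", text)

print(extract_dates("notice.pdf"))  # example file name
```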
If the PDFs are published as scans, as so many municipalities do, then OCR is the only way to go.
Yes, it is very easy. I wrote something similar back in 2020, IIRC; it only took a few hours. The idea was an app you dumped all your work PDFs into to make them searchable: it would pull the text, and then you could search by keyword and open the PDF at the associated page number. For PDFs that were scanned in from paper, I had to use Tesseract OCR.

I didn't pursue it further at the time because the quality of the OCR output was often poor; it would produce large chunks of gibberish mixed into the output text. Now one could do much better. Incorporating the tokenizing for RAG with the OCR output would probably improve the quality drastically, as the small tokens produced during tokenizing would help "filter out" much of the OCR-produced gibberish while preserving much more of the actual text and its contextual/semantic meaning.
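A rough sketch of that "dump PDFs in, search by keyword, jump to a page" idea; pdfplumber and pytesseract here are assumptions for illustration, not what the original app used:

```python
# Sketch: keyword search over a PDF, returning the matching page numbers.
# Text-based pages are read directly; pages with no embedded text are
# rasterized and run through Tesseract instead.
import pdfplumber
import pytesseract

def find_keyword(pdf_path: str, keyword: str) -> list[int]:
    hits = []
    with pdfplumber.open(pdf_path) as pdf:
        for number, page in enumerate(pdf.pages, start=1):
            text = page.extract_text() or ""
            if not text.strip():
                # Likely a scanned page: render it and OCR the image.
                text = pytesseract.image_to_string(
                    page.to_image(resolution=300).original
                )
            if keyword.lower() in text.lower():
                hits.append(number)
    return hits

print(find_keyword("manual.pdf", "warranty"))  # example file and keyword
```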
But does the OCR recognize things like page numbers or chapter names at the top of the page, chapter headers, etc.? If it just makes searchable PDFs with the same layout as the original pages, it still wouldn't help much. But if OCR software does do that, that'd be great: then I could also convert 'normal' PDFs into HTML/EPUB.
The code posted here isn't doing any OCR, but whatever generated the PDFs (Acrobat?) might have.