
Fortunately, the PDFs already contain the plain text as an embedded layer. I believe they are what are known as searchable image PDFs.

The code posted here isn't doing any OCR, but whatever generated the PDFs (Acrobat?) might have.
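For PDFs like these, the text layer can be pulled out directly with an ordinary extraction library - no OCR pass required. A minimal sketch using pypdf (my choice of library; the filename is a placeholder):

    from pypdf import PdfReader  # pip install pypdf

    # Any searchable-image PDF should work; "report.pdf" is a made-up name.
    reader = PdfReader("report.pdf")
    for i, page in enumerate(reader.pages):
        # extract_text() returns the embedded text layer, or "" if a page has none.
        print(f"--- page {i + 1} ---")
        print(page.extract_text() or "(no text layer on this page)")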




Apologies. The PDFs that we deal with are digital-native, but do not have embedded text and are not searchable. I simply want to OCR the PDF and spit the text into a Word/text file.

I don't even care about perfect formatting, that's easy to fix. I do care about perfect OCR. That's crucial.


If someone could help out and make a PR to do the OCR for PDFs, that'd be awesome! Would love to add it in.

If it's a scanned PDF (essentially a collection of one image per page), there would need to be an OCR step to get some text out first. Tesseract would work for this.

Once that's done, you have all the usual options for performing that search. I don't know of a search tool that does the OCR for you, though - I did read a blog post by someone who uploaded PDFs to Google Drive (which OCRs them on upload) as an easy way to do this.
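For reference, that OCR step is only a few lines of Python with pdf2image (which wraps poppler) and pytesseract (which wraps Tesseract). A rough sketch, assuming both packages and their native dependencies are installed:

    import pytesseract                       # pip install pytesseract (needs the tesseract binary)
    from pdf2image import convert_from_path  # pip install pdf2image (needs poppler)

    # Rasterize each page, then OCR it; ~300 DPI is a common sweet spot for Tesseract.
    pages = convert_from_path("scanned.pdf", dpi=300)
    text = "\n".join(pytesseract.image_to_string(page) for page in pages)

    with open("scanned.txt", "w", encoding="utf-8") as f:
        f.write(text)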


Hey amelius! Though OCR would provide a generic solution, it would be overkill for text-based PDFs. I'm working on getting an OCR solution up, since there's still a lot of data trapped inside scanned PDFs rather than text-based ones.

If you have any pointers in the OCR route, do suggest them here, or on this GitHub issue! https://github.com/socialcopsdev/camelot/issues/101


I think people will get tripped up by you saying it "can't" be OCR'd or that it is difficult to do so, and will end up looking past a pretty elegant solution in the process.

This seems like a nicely clever way to trip up non-targeted scrapers which might attempt to OCR any images they encounter, but which will ignore what looks like random gibberish codepoints. It doesn't eliminate the ability to index this data but I can see how it might greatly reduce it.

Obviously you could still convert these PDFs to an image and OCR them, but that's not the thing being defended against here.


Not as easy, no. OCR is needed if the PDF wasn't saved with embedded text/a document file.

OCRmyPDF (based on Tesseract) works very well: https://github.com/jbarlow83/OCRmyPDF

OCRmyPDF adds an OCR text layer to scanned PDF files, allowing them to be searched.
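The usual invocation is just `ocrmypdf input.pdf output.pdf`, and there is also a Python API that mirrors the CLI. A minimal sketch (filenames are placeholders):

    import ocrmypdf  # pip install ocrmypdf (needs the tesseract binary)

    # Adds a hidden text layer on top of the scanned page images;
    # skip_text leaves any pages that already contain text untouched.
    ocrmypdf.ocr("scan.pdf", "scan_searchable.pdf", skip_text=True)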


Easily perform OCR (Optical Character Recognition) on PDFs. `pdf2searchablepdf input.pdf` = voila! "input_searchable.pdf" is created & now has searchable text

Can someone convert this into a searchable PDF? Using OCR?

The text is already embedded in the PDF files, so there's no need for OCR - and you wouldn't see high-quality results with Ocrad anyway.

Really great work!

Two questions:

What sort of OCR stack did you use?

Is there a way to see the text inside the search results? I'm only seeing the PDFs themselves and would love to do some full text searches of my own!


ImageMagick and Tesseract for OCR-ing each page of a PDF into a separate text file (through the TIFF image format; discard the huge TIFFs afterwards), private git repos for hosting, then ag/grep for searching. Not as easy to find a phrase back in the PDF as with e.g. Google Books, but then GB is useless most of the time because of its copyright-related content restrictions.
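For anyone who wants to reproduce that pipeline, a Python wrapper might look like the sketch below (flags from memory; note that ImageMagick 7 renames `convert` to `magick`, and its security policy may need to allow PDF input):

    import glob
    import subprocess

    # Rasterize the PDF into one 300 DPI TIFF per page via ImageMagick.
    subprocess.run(["convert", "-density", "300", "book.pdf", "page_%03d.tif"], check=True)

    # OCR each TIFF into its own text file (tesseract appends ".txt" itself).
    for tif in sorted(glob.glob("page_*.tif")):
        subprocess.run(["tesseract", tif, tif.removesuffix(".tif")], check=True)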

Sometimes PDF files are just images of text from a scan, with no actual text. Sometimes I try copying and pasting from PDFs and get random garbage. They could add OCR, though.

Thanks!

I think this is a good idea, and I think it wouldn't be all that difficult to convert PDFs into text. At least some of them, since some are represented as text and some are just picture scans, which would be much more difficult to deal with.
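A cheap way to triage the two cases is to attempt plain extraction first and only fall back to OCR when almost nothing comes out. A sketch using pypdf (the 100-character threshold is an arbitrary guess):

    from pypdf import PdfReader  # pip install pypdf

    def has_text_layer(path: str, min_chars: int = 100) -> bool:
        """Heuristic: treat the PDF as text-based if extraction yields real text."""
        reader = PdfReader(path)
        extracted = "".join(page.extract_text() or "" for page in reader.pages)
        return len(extracted.strip()) >= min_chars

    # Text-based PDFs convert directly; picture scans need an OCR pass instead.
    print(has_text_layer("document.pdf"))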

I will look into this!


A cursory Google search suggests you could use a package like poppler to convert the PDF to raw text, and then in theory use regex to extract data your server could use and serve.
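Roughly like this (a sketch that shells out to poppler's pdftotext; the date regex is just a stand-in for whatever fields a given PDF actually contains):

    import re
    import subprocess

    # "-" sends pdftotext's output to stdout instead of a file;
    # -layout preserves the page's column structure, which helps regexes.
    raw = subprocess.run(
        ["pdftotext", "-layout", "notice.pdf", "-"],
        capture_output=True, text=True, check=True,
    ).stdout

    # Placeholder pattern: pull out anything that looks like an ISO date.
    dates = re.findall(r"\d{4}-\d{2}-\d{2}", raw)
    print(dates)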

If the PDFs are published as scans, as so many municipalities do, then OCR is the only way to go.

Either way, good luck and decently nice design.


Yes, it is very easy. I wrote something similar back in 2020, iirc; it only took a few hours. The idea was to have an app that you dumped all your work PDFs into to make them searchable: it would pull the text, and then you could search by keyword and open the PDF at the associated page number. For PDFs that were scanned in from paper, I had to use Tesseract OCR. I didn't pursue it further at the time because the quality of the OCR output was often poor - it would produce large chunks of gibberish mixed into the output text.

Now one could do much better. Incorporating the tokenizing for RAG with the OCR output would probably improve the quality drastically, as the small tokens produced during tokenizing would help "filter out" much of the OCR-produced gibberish while preserving much more of the actual text and its contextual/semantic meaning.
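One crude heuristic in that spirit (my illustration, not what the parent built): score each chunk of OCR output by how word-like its tokens are, and drop the worst chunks before indexing. The 0.6 threshold is arbitrary:

    import re

    def keep_chunk(chunk: str, threshold: float = 0.6) -> bool:
        """Drop chunks where most tokens don't look like real words (OCR noise)."""
        tokens = chunk.split()
        if not tokens:
            return False
        wordlike = sum(
            bool(re.fullmatch(r"[A-Za-z][a-z'\-]*[.,;:!?]?", t)) for t in tokens
        )
        return wordlike / len(tokens) >= threshold

    chunks = ["The quarterly report shows revenue grew.", "~#f$ j@9 qzx ..,,"]
    print([c for c in chunks if keep_chunk(c)])  # keeps only the first chunk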

I just use the OCR function built into Adobe Acrobat.

Don't know if the OCR function is available in the Reader version.


Reminds me of printergate. Fair. Ok, so what about using an OCR tool to convert to text, then converting that back to PDF?
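That's scriptable in a few lines; a rough sketch using pytesseract for the OCR half and reportlab to lay the text into a fresh PDF (both library choices are mine, and the result loses all original formatting):

    import pytesseract                       # pip install pytesseract
    from pdf2image import convert_from_path  # pip install pdf2image
    from reportlab.lib.pagesizes import letter
    from reportlab.pdfgen import canvas      # pip install reportlab

    # OCR the original; the round trip through plain text drops pixel-level marks.
    text = "\n".join(
        pytesseract.image_to_string(p) for p in convert_from_path("original.pdf")
    )

    # Lay the recognized text into a brand-new PDF, line by line.
    pdf = canvas.Canvas("clean.pdf", pagesize=letter)
    y = 750
    for line in text.splitlines():
        pdf.drawString(72, y, line)
        y -= 14
        if y < 72:            # crude page break
            pdf.showPage()
            y = 750
    pdf.save()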

But does the OCR recognize things like page numbers or chapter names at the top of the page, chapter headers, etc.? If it just makes searchable PDFs with the same layout as the original pages, it still wouldn't help much. But if OCR software does do that, that'd be great: then I could also convert 'normal' PDFs into HTML/EPUB.
