I worked on an online retailer's book scan ingestion pipeline. It's funny because we soon got most of our "scans" as print-ready PDFs, but we still ran them through the OCR pipeline (which would use the underlying PDF text) since parsing them any other way was a small nightmare.
At least a PDF occasionally contains actual text. My organisation systematically scans everything to TIFF images for archiving. So now we are embarking on a major project to OCR the TIFFs to get the text back (!).
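For what it's worth, the per-file loop for that kind of backfill is short if the TIFFs are reasonably clean; a rough sketch, with pytesseract and Pillow standing in for whatever tooling the project actually ends up using:

    # Rough sketch: OCR a directory of (possibly multi-page) TIFF scans back to text.
    # Assumes Tesseract is installed, plus the pytesseract and Pillow packages.
    import pathlib
    import pytesseract
    from PIL import Image, ImageSequence

    for tiff_path in pathlib.Path("archive_tiffs").glob("*.tif*"):  # hypothetical folder
        pages = []
        with Image.open(tiff_path) as img:
            for frame in ImageSequence.Iterator(img):   # a TIFF can hold many pages
                # convert to greyscale so bilevel scans go through cleanly
                pages.append(pytesseract.image_to_string(frame.convert("L")))
        tiff_path.with_suffix(".txt").write_text("\n\f\n".join(pages))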
OCR has come a long way, so much so that visually interpreting a PDF is about as error-prone as parsing XML output from Microsoft in non-Microsoft software.
If only PDFs could easily be converted back to a useful raw format. Parsing them is a bloody minefield, irregularly stuffed with proprietary metadata galore.
> If you're using an OCR engine to understand PDFs that are nothing but a scanned image embedded in a PDF... what do you need a PDF parser for?
This should be obvious, but the answer is that OCR engines are not terribly accurate. If you have a native PDF, you're far better off parsing the PDF than converting it to an image and OCRing it. But if OCR ever becomes perfect, then sure.
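To make it concrete, with a native PDF the embedded text is basically one library call away; a minimal sketch using pdfminer.six (library and filename chosen just for illustration):

    # Sketch: pull the embedded text straight out of a native PDF
    # instead of rasterising and OCRing it.
    from pdfminer.high_level import extract_text

    text = extract_text("native_document.pdf")  # exact characters, no OCR misreads
    print(text[:500])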
My previous startup worked on parsing PDFs, trying to apply NLP to the text within them: extracting titles, paragraphs, tables, bullet points, etc. Oh my, that was a nightmare. Sure, we were doing difficult things, which made us unique, but it was a slog. Dealing with different page dimensions, pages upside down, sentences spanning multiple pages, etc.
I've also recently worked on a small tool called scholars.io [1] where I had to work with PDFs. I wasn't doing any parsing myself, just using existing PDF tools and libraries, which was much more pleasant, but even building on top of PDF is a challenge.
[1] - https://scholars.io (a tool to read & review research papers together with colleagues)
If you're using an OCR engine to understand PDFs that are nothing but a scanned image embedded in a PDF... what do you need a PDF parser for? You can always just render an image of a document and then use that.
It is not uncommon for some (or all) of the PDF content to actually be a scan. In these cases, there is no text data to extract directly, so we have to resort to OCR techniques.
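A rough sketch of that fallback, with PyMuPDF and pytesseract purely as example choices:

    # Sketch: fall back to OCR only for pages with no embedded text layer.
    import fitz                      # PyMuPDF
    import pytesseract
    from PIL import Image

    doc = fitz.open("mixed_content.pdf")              # hypothetical input file
    for page in doc:
        text = page.get_text().strip()
        if not text:                                  # nothing to extract: likely a scan
            pix = page.get_pixmap(matrix=fitz.Matrix(300 / 72, 300 / 72))  # ~300 dpi render
            img = Image.frombytes("RGB", (pix.width, pix.height), pix.samples)
            text = pytesseract.image_to_string(img)
        print(f"--- page {page.number + 1} ---")
        print(text)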
I've also seen a similar situation, but in some ways quite the opposite --- where all the text was simply vector graphics. In the limited time I had, OCR worked quite well, but I wonder if it would've been faster to recognise the vector shapes directly rather than going through a rasterisation and then traditional bitmap OCR.
The state of the art is to OCR those PDFs to automate the process. There are a huge number of corporate/enterprise solutions that just work around not having proper interfaces and integrations. There's even the concept of "Robotic Process Automation" to make working around the inability to deliver working software sound cool.
Wish I had this to share with my boss years ago. My first big project at my first post-college job was building a PDF parser that would generate notifications if a process document had been updated and it was the first time the logged-in user was seeing it (to ensure they read the changelog of the process). Even with a single source of PDFs (one technical document writer), I could only get a 70% success rate because the text I needed to parse was all over the place. When I stated we would need to use OCR to get better results, no further development was done (ROI reasons). The technical writer was unwilling to standardize more than they already had, or to consider an alternative upload process where they confirm the revision information, which didn't help.
I don't envy working on ingesting even more diverse PDFs.
Yes, it is very easy. I wrote something similar back in 2020, iirc. It only took a few hours. The idea was to have an app that you dumped all your work PDFs into to make them searchable. It would pull the text, and then you could search by keyword and open the PDF at the associated page number. For PDFs that were scanned in from paper, I had to use Tesseract OCR. I didn't pursue it further at the time, because the quality of the OCR output was often poor: it would produce large chunks of gibberish mixed into the text. Now one could do much better. Incorporating the tokenizing for RAG with the OCR output would probably improve the quality drastically, as the small tokens produced during tokenizing would help "filter out" much of the OCR-produced gibberish while preserving much more of the actual text and contextual/semantic meaning.
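Roughly the core of that idea, reconstructed as a sketch with PyMuPDF as a stand-in (the Tesseract fallback for scanned pages is left out for brevity):

    # Sketch: index every page's text, then search by keyword and get back
    # (file, page) pairs so the PDF can be opened at the right page.
    import re
    from collections import defaultdict
    from pathlib import Path
    import fitz  # PyMuPDF

    index = defaultdict(set)                          # word -> {(pdf name, page number)}
    for pdf_path in Path("work_pdfs").glob("*.pdf"):  # hypothetical dump folder
        doc = fitz.open(str(pdf_path))
        for page in doc:
            for word in re.findall(r"[a-z0-9]+", page.get_text().lower()):
                index[word].add((pdf_path.name, page.number + 1))
        doc.close()

    def search(keyword):
        """Return sorted (file, page) hits for a keyword."""
        return sorted(index.get(keyword.lower(), set()))

    print(search("invoice"))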
Your point is? Various methods exist for running OCR on PDFs and then parsing the data and transforming it for structured input into other systems. It could be a periodic sync to the middleware, e.g. daily, or at the user's preference via emailing the doc.
It's possible, but it takes work. I can't remember the last time a PDF did something unreadably weird; usually my only gripe is a scan of an old document where whoever turned it into a PDF didn't do OCR.