
My previous startup worked on parsing PDFs, trying to apply NLP to the text inside them - extracting titles, paragraphs, tables, bullet points, etc. Oh my, that was a nightmare. Sure, we were doing difficult things, so that made us unique, but it was a slog: working with different dimensions, pages upside down, sentences spanning multiple pages, and so on.

I've also recently worked on a small tool called scholars.io [1] where I had to work with PDFs. I wasn't doing any parsing myself, just using existing PDF tools and libraries, which was much more pleasant - but building on top of PDF is still a challenge.

[1] - https://scholars.io (a tool to read & review research papers together with colleagues)




This is really cool! I've been toying with trying to make this as well. What are you using for parsing PDFs? I've found most libs are pretty bad at extracting content well from PDFs. If anyone else has any recommendations, they would be much appreciated!

What tools did you end up settling on for PDF data/text extraction? I ask because I have a side project that I've been neglecting for far too long which depends in part on cleanly extracting text from PDFs (other formats too, but PDFs are by far the most headache-inducing).

What did you use for parsing PDFs?

Fantastic! Any tips on tools one can use to parse PDF including their structure? Much appreciated.

Very nice. I've been doing some table extraction from PDFs recently. Also check out PDF2JSON for Node.js-based parsing - it grabs all the text and positions and dumps them out as JSON, so you don't have to 'intercept' draw calls yourself.
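If you're not on Node, here's a rough equivalent sketch in Python using PyMuPDF (not PDF2JSON itself; the file name is just a placeholder) that dumps every word plus its bounding box to JSON:

    import json
    import fitz  # PyMuPDF

    def dump_words(path):
        """Dump every word and its bounding box to JSON, page by page."""
        doc = fitz.open(path)
        pages = []
        for page in doc:
            words = [
                {"x0": x0, "y0": y0, "x1": x1, "y1": y1, "text": text}
                # get_text("words") yields (x0, y0, x1, y1, word, block, line, word_no)
                for x0, y0, x1, y1, text, *_ in page.get_text("words")
            ]
            pages.append({"page": page.number, "words": words})
        doc.close()
        return json.dumps(pages, indent=2)

    print(dump_words("example.pdf"))  # placeholder path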

I honestly have no clue what makes PDF parsing a complex task. I wasn't trying to sound condescending. It would be great to know what makes this so difficult, considering the PDF file format is ubiquitous.

I had to check we hadn't worked for the same company! Yeah, text extraction and layout analysis from PDFs is a super interesting challenge and still relatively underdeveloped. I'd say table detection is about the hardest challenge in that field.

One of the contributors to the PDF library I'm developing has been implementing some interesting algorithms for layout analysis https://github.com/UglyToad/PdfPig/wiki/Document-Layout-Anal...


I'd be interested in how people parse text off a PDF. I'm making a TTS tool to convert documents (mainly HTML docs at the moment) to speech, and PDFs would be a great addition.

Do you have any tool suggestions or general advice for someone trying to do this? A while back I was trying to extract text from some government PDFs in order to make the information more accessible for others, but I became a bit overwhelmed when I started reading up on PDFs.

Since you are working with raw text, it shouldn't take too much effort. There are a bunch of open source tools to extract text from PDFs.
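For example, a minimal sketch with pdfminer.six (just one of those open source tools; the file name is a placeholder) is often all you need when layout doesn't matter:

    from pdfminer.high_level import extract_text

    # pdfminer.six flattens the whole document into one string;
    # tables, columns and headers are lost at this point
    text = extract_text("document.pdf")  # placeholder path
    for line in text.splitlines():
        if line.strip():
            print(line)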

The hard part would be parsing tables and other layout-dependent semantics. You usually start with text coordinates (like HTML elements with absolute positioning) and have to work backwards from that. I worked for some years on a project for a client that was full of edge cases: whenever the input PDF (from a government agency) had a slight layout change, the parser would break. It took multiple iterations to make it robust enough.
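To give a feel for "working backwards from coordinates", here is a toy sketch (using pdfplumber; the tolerance and file name are invented for illustration) that groups words into visual rows by their vertical position, which is roughly step one of any table parser:

    from collections import defaultdict
    import pdfplumber

    ROW_TOLERANCE = 3  # points; words whose tops fall in the same bucket share a row

    def rows_from_page(page):
        """Group extracted words into visual rows by their 'top' coordinate."""
        rows = defaultdict(list)
        for word in page.extract_words():  # each word has text, x0, x1, top, bottom
            rows[round(word["top"] / ROW_TOLERANCE)].append(word)
        # sort rows top-to-bottom and words left-to-right within each row
        return [
            [w["text"] for w in sorted(words, key=lambda w: w["x0"])]
            for _, words in sorted(rows.items())
        ]

    with pdfplumber.open("statement.pdf") as pdf:  # placeholder path
        for row in rows_from_page(pdf.pages[0]):
            print(" | ".join(row))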


This is really wonderful, thank you! It's great to see someone focusing on the internal structure of PDF files (the "Syntax" chapter of the spec), and doing things with a focus on browsing the internal structure etc. (I had a similar idea and did something in Rust/WASM back in May; let me see if I can dust it off and put it on GitHub. Edit: not very usable, but here FWIW: https://github.com/shreevatsa/pdf-explorer)

In particular, there are so many PDF libraries/tools that simply hide all the structure and try to provide an easy interface to the user, but they are always limited in various ways. Something like your project that focuses on parsing and browsing is really needed IMO.
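As a tiny illustration of that internal structure ("Syntax" chapter territory), here's a sketch that reads nothing but raw bytes and reports where the cross-reference data starts; the file name is a placeholder, and newer files may keep an xref stream rather than a classic xref table at that offset:

    import os

    def find_startxref(path):
        """Return the byte offset of the cross-reference data recorded in the trailer."""
        with open(path, "rb") as f:
            f.seek(0, os.SEEK_END)
            size = f.tell()
            f.seek(max(0, size - 2048))  # the trailer sits at the end of the file
            tail = f.read()
        marker = tail.rfind(b"startxref")
        if marker == -1:
            raise ValueError("no startxref found; damaged file or not a PDF")
        # the line after 'startxref' holds the offset, followed by %%EOF
        return int(tail[marker:].splitlines()[1].strip())

    offset = find_startxref("paper.pdf")  # placeholder path
    print("cross-reference table or stream starts at byte", offset)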


Have you ever actually tried to parse PDFs with software? It's a sheer nightmare. PDFs are often produced by word processors that have very rich formatting information. The PDF strips it all out, and then you somehow have to recreate it.

We are already close to doing that, but with a really slow parser (this one can even replace some text in the PDF). Our problem now is understanding whether developers would rather have better text extraction or other features like image extraction, etc. Let us know what you would prefer.

I've worked with several companies that try to parse things in PDF documents, extracting tables and paragraphs etc. This is actually challenging because a PDF is a large bag of words, and fragments of words, with x/y positions. There is a particularly popular word processor that emits individual characters. Just determining that two fragments are part of the same word is challenging, as is detecting bullet points, etc.

The AI approaches are definitely still worse than human-written rules. I can infer - and I've chatted with the devs to confirm - from the quality of the text and table extraction whether a company is using a modern NN approach or whether someone has sat down and hand-written some simple rules that understand indents, baselines, etc.
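For a sense of what those simple rules can look like, here's a toy sketch (the fragment tuples and thresholds are invented for illustration) that merges per-character fragments into words when they share a baseline and sit close enough together:

    # each fragment is (x0, x1, baseline_y, text) -- a made-up minimal stand-in
    # for what you get after flattening a PDF's text draw calls
    MAX_GAP = 1.5       # points of horizontal space still treated as "same word"
    BASELINE_TOL = 0.5  # fragments whose baselines differ by less sit on one line

    def merge_fragments(fragments):
        """Merge per-character fragments into words by baseline and proximity."""
        frags = sorted(fragments, key=lambda f: (f[2], f[0]))  # top-to-bottom, left-to-right
        words, current, prev = [], "", None
        for x0, x1, baseline, text in frags:
            same_line = prev is not None and abs(baseline - prev[2]) <= BASELINE_TOL
            close = prev is not None and (x0 - prev[1]) <= MAX_GAP
            if same_line and close:
                current += text
            else:
                if current:
                    words.append(current)
                current = text
            prev = (x0, x1, baseline)
        if current:
            words.append(current)
        return words

    # e.g. a word processor that emits one glyph per draw call:
    print(merge_fragments([(10, 15, 100, "H"), (15.5, 20, 100, "i"), (30, 35, 100, "!")]))
    # -> ['Hi', '!']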


I really wish the PDF layout were easier to parse. No matter which library you use, you always run into edge cases that make text selection and extraction an issue on certain files. I was recently extracting financial data from a bank that provides only PDFs, and every time they changed the format just a little bit, I had to change large parts of my code to extract the transactions I wanted.

Reading PDFs as HTML would be nice, but as an ML engineer, having high-quality conversion would be very useful for large-scale analysis and information extraction from PDF documents!

One system for extracting information from PDFs is Fonduer [1], which is built on the Snorkel framework from Stanford. It may be worth checking out for your use case. Here's a blog post introducing it [2].

Disclosure: I worked on the project.

[1] https://arxiv.org/abs/1703.05028

[2] https://hazyresearch.github.io/snorkel/blog/fonduer.html


If the author is reading, I'd love to see more info on how you trained the system to understand the text content of the PDF. And how regular/structured do the PDFs have to be for it to work?

Wish I had this to share with my boss years ago. My first big project at my first post-college job was building a PDF parser that would generate notifications if a process document had been updated and it was the first time the logged-in user was seeing it (to ensure they read the changelog of the process). Even with a single source of PDFs (one technical document writer), I could only get a 70% success rate because the text I needed to parse was all over the place. When I stated we would need to use OCR to get better results, no further development was done (ROI reasons). It also didn't help that the technical writer was unwilling to standardize more than they already had, or to consider an alternative upload process where they would confirm the revision information.

I don't envy working on ingesting even more diverse PDFs.

