
Reading PDFs as HTML would be nice, but as an ML engineer, having high-quality conversion would be very useful for large-scale analysis and information extraction from PDF documents!



Well, sure - what's he expecting? PDF is a low-level, mostly unstructured display format. Converting that to fully semantic markup that recognizes all aspects of high-level document structure is probably an AI-complete problem.

For those of you not familiar with the gory details of PDF, it basically uses absolute positioning for each character. If we converted that directly into HTML it would be a disaster. So, we actually extract quite a bit of structure on top of that, recognizing spaces, lines, columns, and paragraphs, which enables us to write much cleaner HTML. Scribd (and most PDF readers) does this with heuristic algorithms that make reasonable guesses.
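To make the "absolute positioning for each character" point concrete, here is a toy sketch (not Scribd's actual code, and the tolerances are made-up numbers) of the kind of heuristic grouping described above: bucket glyphs into lines by y-coordinate, sort each line by x, and insert spaces where the horizontal gap is large.

```python
def reconstruct_text(glyphs, line_tol=2.0, space_gap=3.0):
    """glyphs: list of (x, y, char) tuples in page coordinates.

    Groups glyphs into lines by snapping y into bands, then walks each
    line left to right, inserting a space wherever the x gap between
    consecutive glyph origins exceeds space_gap.
    """
    lines = {}
    for x, y, ch in glyphs:
        # Snap y to a band so slightly misaligned glyphs share a line.
        key = round(y / line_tol)
        lines.setdefault(key, []).append((x, ch))

    out = []
    for key in sorted(lines, reverse=True):  # PDF y grows upward
        row = sorted(lines[key])             # left to right by x
        text, prev_x = "", None
        for x, ch in row:
            if prev_x is not None and x - prev_x > space_gap:
                text += " "
            text += ch
            prev_x = x
        out.append(text)
    return "\n".join(out)
```

Real extractors also use font metrics and glyph widths rather than fixed gaps, which is exactly where the "reasonable guesses" come in.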

In the future, we'd like to push those algorithms further, and extract ever more semantic markup. But this is a "nice to have" for us - mostly, people just want the documents to display correctly and load quickly. And, anyway, expecting the output of an automated converter to match what a human would write shows a basic ignorance of the state of computers and AI.


You wouldn't need to use computer vision on a picture of the PDF. arXiv has the tex source for most of the papers. An LLM trained on code could do a pretty good job of translating tex to readable html with a bit of effort.
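For the easy end of that spectrum, a handful of LaTeX constructs do map mechanically to HTML; here is a deliberately tiny regex sketch (the rules and function name are illustrative, not a real converter). Anything beyond this - math, environments, user macros - needs a proper parser or, as suggested above, an LLM.

```python
import re

# Toy mapping of a few trivial, non-nested LaTeX commands to HTML tags.
RULES = [
    (r"\\textbf\{([^{}]*)\}", r"<b>\1</b>"),
    (r"\\textit\{([^{}]*)\}", r"<i>\1</i>"),
    (r"\\section\{([^{}]*)\}", r"<h2>\1</h2>"),
]

def tex_to_html(tex):
    for pattern, repl in RULES:
        tex = re.sub(pattern, repl, tex)
    return tex
```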

Agreed. Anything that can lighten the load of having to write custom scripts to handle pdf-to-data conversions will be helpful.

I do maintain some level of skepticism though. It is OCR :D


Thanks!

I think this is a good idea, and it wouldn't be all that difficult to convert PDFs into text. At least some of them, since some are represented as text while others are just picture scans, which would be much harder to deal with.

I will look into this!


My previous startup worked on parsing PDFs, trying to apply NLP to the text within them - extracting titles, paragraphs, tables, bullet points, etc. Oh my, that was a nightmare. Sure, we were doing difficult things, which made us unique, but it was a slog: working with different dimensions, pages upside down, sentences spanning multiple pages, and so on.

I've also recently worked on a small tool called scholars.io [1] where I had to work with PDFs. I wasn't doing any parsing - I just used existing PDF tools and libraries, which was much more pleasant - but even building on top of PDF is a challenge.

[1] - https://scholars.io (a tool to read & review research papers together with colleagues)


If the author is reading, I'd love to see more info on how you trained the system to understand the text content of the PDF. And how regular/structured do the PDFs have to be to work?

I'd be interested in how people parse text off a PDF. I'm making a TTS tool to convert documents (mainly HTML docs at the moment) to speech, PDF's would be a great addition.

we are already close to doing that, but with a really slow parser (it can even replace some text in the PDF). Our problem now is understanding whether developers would rather have better text extraction or other features like image extraction. Let us know what you would prefer.

I translated PDFs to text using PDFBox, not AI. The AI then extracts the needed information from the resulting pile of unformatted text.
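As a hypothetical illustration of that second step (PDFBox itself is Java; the field names and patterns below are invented for the example): once a library has flattened a PDF into plain text, well-delimited fields can often be pulled out with simple patterns before, or instead of, handing the text to a model.

```python
import re

def extract_fields(text):
    """Pull a couple of invoice-style fields out of flat extracted text.

    Purely illustrative: real documents need more robust patterns,
    or the AI pass described above for anything less regular.
    """
    fields = {}
    m = re.search(r"Invoice\s*#?\s*(\d+)", text)
    if m:
        fields["invoice_number"] = m.group(1)
    m = re.search(r"Total[:\s]*\$?([\d,]+\.\d{2})", text)
    if m:
        fields["total"] = m.group(1)
    return fields
```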

This is really cool! I've been toying with making this as well. What are you using for parsing PDFs? I've found most libs are pretty bad at extracting content from PDFs. If anyone has any recommendations, they would be much appreciated!

I have some experience in this field. Typically I try to convert it to as many formats as possible and let the search engines take their pick. What is your PDF data about? Feel free to PM.

Using ML it is possible to parse PDF pages and interpret their layout and contents. Coupled with summarisation and text-to-speech, it could make PDFs accessible to blind people. The parsing software would run on the user's system and be able to cover all sorts of inputs, even images and web pages, as it is just OCR, CV and NLP.

The advantage is that PDFs would become accessible immediately, as opposed to waiting for the day all PDF creators agree on a better presentation format.

Example: A layout dataset - https://github.com/ibm-aur-nlp/PubLayNet

and: SoundGlance: Briefing the Glanceable Cues of Web Pages for Screen Reader Users http://library.usc.edu.ph/ACM/CHI2019/2exabs/LBW1821.pdf

A related problem is invoice/receipt/menu/form information extraction. Also, image based ad filtering.


Isn't there a pdfml format where you convert the PDF stack/tree, expose the transforms, and expand all non-text streams into readable UTF-8?

Not OP, but I would definitely find a lot of value in processing PDFs in such a way that it could, e.g., understand tables and images. I work in mining, and having it digest a 43-101 technical report with images and tables would be supremely valuable.

I know that might be a niche case tho.

Absolutely incredible work you’re doing tho wow, I’m very impressed by what you’re doing and the way you’re doing it. Even if you stopped now this is a masterpiece, so while yes I would definitely find a lot of value from being able to process images and graphs/tables, simply being able to process the text and cite it is already a superpower. Thank you for your amazing work!!!


I think you can hack together something in pdf.js but you have to deal with the pain of digging through its code. I’m working on something of the sort in an application I’ve built but it’s v much “nice to have”.

If you’re talking about raw pdfs then you are at the whim of the encoding surely? I’ve always found Adobe etc to have utterly crap searches


if only .PDFs could easily be converted back to a useful raw format. parsing them is a bloody minefield, irregularly stuffed with proprietary metadata galore

Some types of PDFs have always been a challenge to convert to other file formats like Word or plain text. I've had many go from perfectly formatted PDFs to useless text when converted.

Has anyone found a good converter? In the age of AI you would think there would be a foolproof way to do it.


A staggering number of people in any large organization are basically working as a sort of "information filter" to simply condense information and report it up the organizational food chain. A sufficiently clever combination of OCR, NLP, and ML could automate a lot of those jobs. In other words, the executive set needs a Summly for industry intelligence. (Startup idea that I'm sure someone with VC connections has thought of already)

The trouble with PDFs is they're designed to be consumed by human eyes only. Any attempt to automatically extract information from them is fundamentally a hacky scrape-job.


One thing I'm very interested in, as a grad student who has to consume a huge number of PDFs, is whether there are good tools for converting existing PDFs to portable EPUBs or HTML documents.

If I use, for instance, CloudConvert [1], I generally get a document that gets flowing text roughly right, but still interrupts the text with page numbers and book titles (that were originally at the top of each page) and includes additional bizarre line breaks, etc.

Every so often I wonder if this is an LLM problem ("please reformat the following text to...") but I think that one shouldn't reach for an LLM for these kinds of things.

1. https://cloudconvert.com/pdf-to-epub
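The cleanup described above (stray page numbers, running headers repeated from the top of every page, bizarre line breaks) is actually tractable without an LLM. Here is a heuristic sketch; the thresholds and patterns are guesses for illustration, not tuned values.

```python
import re
from collections import Counter

def clean_extracted_text(text, repeat_threshold=3):
    """Drop bare page numbers and lines that repeat across many pages
    (likely running headers/titles), then rejoin words hyphenated
    across line breaks."""
    lines = text.splitlines()
    counts = Counter(line.strip() for line in lines if line.strip())
    kept = []
    for line in lines:
        s = line.strip()
        if re.fullmatch(r"\d{1,4}", s):       # bare page number
            continue
        if counts[s] >= repeat_threshold:     # likely running header
            continue
        kept.append(line)
    joined = "\n".join(kept)
    # Rejoin words split as "hyphen-\nated".
    joined = re.sub(r"(\w)-\n(\w)", r"\1\2", joined)
    return joined
```

This obviously can't distinguish a genuine hyphenated compound from a line-break hyphen, which is one place a language model could plausibly help.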

