I have scanned a couple of books written by an ancestor. He was not famous, and I doubt many people are interested, but I wanted to give the books a new digital life. I think the enormous back catalog of things that have been published and then disappeared is sad.
Anyway, I have a few hundred PDF files. They have an image of the page and the OCR text. The text will have to be edited by hand, I think. It's not in English.
I was wondering if anyone knew of scripts or applications that can take that as an input, do some decent formatting and typesetting, and spit out a couple of different eBook formats at the other end.
There are photos and illustrations on some pages, and I am hoping to keep them. The images are OK, but given the curvature of the page and other factors, I don't feel they are ready to be used as-is.
I could try to cut and paste it all into Word.
There is also an extensive cross reference and table of contents that I have no idea how to deal with. I presume I will do it by hand, but it would take a long time.
Anyway, if you have any tips on how to take the raw files and make them into pretty, easy to read eBooks, that would be fabulous.
I think this is a good idea, and converting PDFs into text shouldn't be all that difficult, at least for some of them: some store actual text, while others are just picture scans, which are much harder to deal with.
The problem is that many PDFs are simply a bunch of scanned images, and these can't be readily converted to any text-based format.
The exception is to put the PDF through an OCR process, but that destroys the formatting and can't deal with images. The whole point of scanning these old books was to keep the historical information.
I have hundreds of old books I have scanned and converted to PDF, but I can't read these on a Kindle.
The simplest method is to use Adobe Acrobat to make an OCR'd PDF. The only work in that process is combining all the PDFs you want into a single file, and that's something you can do inside Acrobat too. It has OCR built in and will save the OCR'd text inside the file.
Once you have that PDF, you effectively have an eBook. There are several online converters that can turn it into other formats. As for indices and the like, Acrobat won't build them automatically, but your OCR'd eBook is searchable, so it may not matter the way it would in print.
I'd say the world would be better off with your rough-and-ready PDF (and yes, putting it in the Internet Archive would be great!) than waiting for a perfect hand-made version.
I have bought, and will buy, some books that are not sold as eBooks at all and that I want to read as PDFs on my Mac and iPhone.
I'm looking for recommendations of scanning/OCR/PDF conversion services for physical books. I'd consider both destructive and non-destructive options, although one of the books is pretty nice and I'd like to keep it without its spine cut off.
I care more about high image quality/legibility on retina-type screens and accurate OCR/searchability, and less about price, turnaround time, or file size.
Has anybody used such services with high-quality results?
I have an interesting take on this. Most of the books I've read, I have a copy of. A while back, I endeavored to cut, scan, and OCR them all into my computer. One idea was that I could then do a full-text search limited to what I've already read, rather than what Google thinks is relevant.
So far, I've found it very handy to find something if I at least remember which book it was in. But I need a program that can extract the OCR'd text from .pdf files - anyone know of a simple one?
(I can do it manually, one at a time, by bringing it up in a pdf reader, but that's too tedious and slow.)
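(Roughly what I'm after, as a sketch: assuming a library like pypdf and PDFs that already carry an embedded text layer; the folder name is a placeholder.)

```python
# Dump the embedded text layer of every PDF in a folder to a .txt file next to it.
# Assumes pypdf (pip install pypdf) and that the PDFs already contain OCR'd text.
from pathlib import Path
from pypdf import PdfReader

for pdf_path in Path("scanned_books").glob("*.pdf"):
    reader = PdfReader(pdf_path)
    text = "\n".join(page.extract_text() or "" for page in reader.pages)
    pdf_path.with_suffix(".txt").write_text(text, encoding="utf-8")
```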
The one time I needed to turn a scanned PDF (600+ page book) into searchable text, I used this Ruby script https://github.com/gkovacs/pdfocr/ , which pulls out individual pages using pdftk, turns them into images to feed into an OCR engine of your choice (Tesseract seems to be the gold standard) and then puts them back together. It can blow up the file size tremendously, but worked well enough for my use case. (I did write a very special purpose PDF compressor to shrink the file back, but that was more for fun.)
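(If Ruby isn't your thing, the same split-render-OCR-reassemble idea is easy to sketch in Python. This is an illustration, not that script, and it assumes pdf2image, pytesseract, pypdf, and a local Tesseract install; file names are placeholders.)

```python
# Render each page of a scanned PDF, OCR it with Tesseract, and stitch the
# resulting one-page PDFs (each with an invisible text layer) back together.
import io
from pdf2image import convert_from_path
import pytesseract
from pypdf import PdfReader, PdfWriter

writer = PdfWriter()
for page_image in convert_from_path("book.pdf", dpi=300):
    # Tesseract returns a one-page PDF (as bytes) with a searchable text layer.
    page_pdf = pytesseract.image_to_pdf_or_hocr(page_image, extension="pdf")
    writer.append(PdfReader(io.BytesIO(page_pdf)))

with open("book_searchable.pdf", "wb") as out:
    writer.write(out)
```

The DPI and the Tesseract language setting are the main knobs; 300 dpi is a common starting point for book scans.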
I will take a look at it, thanks. I have tried some off-the-shelf OCR things and had little success. But what I do is really a needle-in-the-haystack kind of thing: I am looking for a "U" or an "Sr" in a particular place and am happy to find one in a thousand PDF pages.
I am looking into auto-scrolling PDFs and batch loading/sequencing of PDFs to make things easier. But I will definitely check out fast.ai - thanks!
Yes, it is very easy. I wrote something similar back in 2020, iirc; it only took a few hours. The idea was an app that you dumped all your work PDFs into to make them searchable: it would pull the text, and then you could search by keyword and open the PDF at the associated page number. For PDFs that were scanned in from paper, I had to use Tesseract OCR. I didn't pursue it further at the time because the quality of the OCR output was often poor, with large chunks of gibberish mixed into the text. Now one could do much better. Incorporating the tokenizing used for RAG with the OCR output would probably improve the quality drastically, as the small tokens produced during tokenizing would help "filter out" much of the OCR-produced gibberish while preserving much more of the actual text and its contextual/semantic meaning.
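Something along these lines captures the idea (a sketch for illustration, assuming PyMuPDF and PDFs that already have a text layer; the directory and keyword are placeholders):

```python
# Build a per-page text index over a folder of PDFs, then look up a keyword
# and get back (file, page) pairs you could jump to in a viewer.
from pathlib import Path
import fitz  # PyMuPDF (pip install pymupdf)

def build_index(pdf_dir):
    """Map each file name to a list of (page_number, page_text) pairs."""
    index = {}
    for path in Path(pdf_dir).glob("*.pdf"):
        with fitz.open(str(path)) as doc:
            index[path.name] = [(i + 1, page.get_text()) for i, page in enumerate(doc)]
    return index

def search(index, keyword):
    """Return (file, page) hits containing the keyword, case-insensitively."""
    keyword = keyword.lower()
    return [(name, page_no)
            for name, pages in index.items()
            for page_no, text in pages
            if keyword in text.lower()]

print(search(build_index("work_pdfs"), "invoice"))  # e.g. [('contract.pdf', 12), ...]
```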
I've used many different methods, including Calibre, k2pdfopt, and even OCR'ing the pages, but none of it is really straightforward. It all takes considerable time, and line breaks and paragraphs are my personal hell.
What's the easiest way to go about doing this? Even converting it to a simple text file with proper line breaks and paragraphs would go a long way toward solving this problem.
Assume you have a PDF file with margins, a common font, the book/chapter name in the top/bottom gutters, and a page number. Parsing the contents page would be great, but it is not necessary.
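Given those assumptions, even a crude heuristic pass gets most of the way there: drop bare page numbers and gutter text, rejoin hyphenated line breaks, and merge wrapped lines into paragraphs. A rough sketch (the header pattern is a placeholder and the rules would need tuning to the actual book):

```python
# Heuristic reflow of pdftotext-style output: strips bare page numbers and a
# (hypothetical) running header, rejoins hyphenated words, and merges wrapped
# lines into paragraphs separated by blank lines.
import re
import sys

RUNNING_HEADER = re.compile(r"^(Chapter \d+|THE BOOK TITLE)$")  # placeholder pattern

def reflow(raw_text):
    paragraphs, current = [], []
    for line in raw_text.splitlines():
        line = line.strip()
        if not line:                                   # blank line ends a paragraph
            if current:
                paragraphs.append(" ".join(current))
                current = []
            continue
        if re.fullmatch(r"\d+", line) or RUNNING_HEADER.match(line):
            continue                                   # page number or gutter text: drop it
        if current and current[-1].endswith("-"):
            current[-1] = current[-1][:-1] + line      # rejoin a hyphenated word
        else:
            current.append(line)
    if current:
        paragraphs.append(" ".join(current))
    return "\n\n".join(paragraphs)

if __name__ == "__main__":
    sys.stdout.write(reflow(sys.stdin.read()))
```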
I worked on an online retailer's book scan ingestion pipeline. It's funny because we soon got most of our "scans" as print-ready PDFs, but we still ran them through the OCR pipeline (which would use the underlying PDF text), since parsing them any other way was a small nightmare.
I'm familiar with those tools, but they don't preserve the formatting of the source document (when PDF) and certain types of embedded resources don't transfer properly.
Some PDF documents don't contain the source text as digitized text, either; they're just a bundle of scanned images.
Do you have any tool suggestions or general advice for someone trying to do this? A while back I was trying to extract text from some government PDFs in order to make the information more accessible for others, but I became a bit overwhelmed when I started reading up on PDFs.
I started with a script similar to the one you're using (though hand-crafted) with my ScanSnap S1500, except I have mine run the PDF conversion in the background so I can immediately scan another document without having to wait - this is easy to do now with scanpdf. I've been doing this for about 12 years now, originally sorting manually into directories and using "pdfgrep" to find things, but more recently I've put everything into a paperless-ngx instance (gradually tagging all the old documents).
I've recently switched my hand-crafted scripts to use scanpdf[1], which seems to give better results (once I tweaked it to be a little less eager to downconvert to B+W). I experimented with OpenCV for cropping and straightening (based on examples in a stackoverflow thread at [2]), but so far the results have been worse than scanpdf's.
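For anyone curious, the kind of thing those examples do is roughly the minAreaRect deskew trick. A sketch for illustration only, not my actual script (the threshold and angle handling are guesses, and OpenCV's angle convention has changed between versions):

```python
# Estimate the page skew from the bounding box of the "ink" pixels and rotate
# the scan to compensate. Loosely based on the common minAreaRect recipe.
import cv2
import numpy as np

def deskew(in_path, out_path):
    image = cv2.imread(in_path)
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    thresh = cv2.threshold(gray, 0, 255,
                           cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)[1]
    coords = np.column_stack(np.where(thresh > 0)).astype(np.float32)
    angle = cv2.minAreaRect(coords)[-1]      # tilt of the text block's bounding box
    if angle > 45:                           # newer OpenCV reports angles in (0, 90]
        angle -= 90
    h, w = image.shape[:2]
    matrix = cv2.getRotationMatrix2D((w / 2, h / 2), angle, 1.0)
    rotated = cv2.warpAffine(image, matrix, (w, h),
                             flags=cv2.INTER_CUBIC,
                             borderMode=cv2.BORDER_REPLICATE)
    cv2.imwrite(out_path, rotated)

deskew("page_001.png", "page_001_deskewed.png")
```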
But does the OCR recognize things like page numbers or chapter names at the top of the page, chapter headers, etc.? If it just makes searchable PDFs with the same layout as the original pages, it still wouldn't help much. But if OCR software does do that, that'd be great - then I could also convert 'normal' PDFs into HTML/EPUB.
This looks amazing, I'll have to play around with this over the weekend.
I regularly hand-transcribe RPG PDF scans from dubious sources that have not always been run through OCR to have selectable text. If they have, it wasn't always done very well.
It's literally faster to type it all myself than to fix all the errors from copy-pasting (or after using OCR to turn it into text).
Even if the file was an official PDF, the formatting would often get screwed up, with lots of double or triple spaces and even tabs between words.
This would save so much time if I can get it to work. Thanks for sharing!