pdfsandwich (add OCR text layer to scanned PDF files)
pdfsandwich generates "sandwich" OCR PDF files: PDF files which
contain only images (no text) are processed by optical character
recognition (OCR), and the recognized text is added to each page
invisibly "behind" the images. This makes it possible to search for
text in the PDF and to copy text from it.
Notes:
--
The man page explains this, but I'll mention it here: the "sandwich"
PDF (the output) for filename.pdf will be written to filename_ocr.pdf.
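
If you script around pdfsandwich, it can help to compute that output
name up front. The following is only a sketch of the naming rule
described above; ocr_output_name is a made-up helper (not part of
pdfsandwich), and it assumes the input name ends in ".pdf".

```shell
# Hypothetical helper, not shipped with pdfsandwich: mirrors the naming
# rule from the note above (filename.pdf -> filename_ocr.pdf).
# Assumes the input name ends in ".pdf".
ocr_output_name() {
    src=$1
    # Strip the trailing ".pdf", then append "_ocr.pdf".
    printf '%s_ocr.pdf\n' "${src%.pdf}"
}

# Typical use (needs pdfsandwich installed, so it is left commented out):
#   pdfsandwich scan.pdf && ls -l "$(ocr_output_name scan.pdf)"
```

Keeping the rule in one function means a later step (a viewer, a copy,
a repair pass) doesn't have to repeat the string manipulation.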
--
According to its man page, pdfsandwich can optionally use hocr2pdf.
However, the version of hocr2pdf on SlackBuilds.org (in the
exact-image package) doesn't seem to work correctly with pdfsandwich.
This isn't a real problem; just don't use the -enforcehocr2pdf option
with pdfsandwich.
--
The PDFs created by pdfsandwich are not quite to spec. In mupdf, you
may see "warning: broken xref subsection, proceeding anyway". This
seems to be harmless. If you discover that it isn't harmless, you
can use ghostscript to fix it, thus:
$ gs -dSAFER -dBATCH -dNOPAUSE -sDEVICE=pdfwrite \
     -sOutputFile=pdf_fixed.pdf pdf_ocr.pdf
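
If you end up repairing several files, the ghostscript call can be
wrapped in a small function. This is a sketch, not something the
SlackBuild installs: fix_pdf_xref is a hypothetical name, and the
input/output checks are my own additions. -dBATCH makes gs exit when
it's done instead of waiting at its prompt, and -q just quiets the
startup banner.

```shell
# Hypothetical wrapper around the ghostscript repair command above.
# Usage: fix_pdf_xref input_ocr.pdf output_fixed.pdf
fix_pdf_xref() {
    src=$1
    dst=$2
    # Refuse to run without a readable input file.
    if [ ! -f "$src" ]; then
        echo "fix_pdf_xref: no such file: $src" >&2
        return 1
    fi
    # Never let gs write over the file it is still reading.
    if [ "$src" = "$dst" ]; then
        echo "fix_pdf_xref: input and output must differ" >&2
        return 1
    fi
    # Rewrite the PDF through the pdfwrite device; gs regenerates the
    # xref table as part of writing the new file.
    gs -dSAFER -dBATCH -dNOPAUSE -q -sDEVICE=pdfwrite \
       -sOutputFile="$dst" "$src"
}
```

For example: fix_pdf_xref pdf_ocr.pdf pdf_fixed.pdf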
This requires: ocaml, unpaper, tesseract
Maintained by: B. Watson
Keywords: ocr
ChangeLog: pdfsandwich
Homepage: http://www.tobias-elze.de/pdfsandwich/index.html
Download SlackBuild:
pdfsandwich.tar.gz
pdfsandwich.tar.gz.asc (FAQ)
(the SlackBuild does not include the source)
Individual Files:
  README
  pdfsandwich.SlackBuild
  pdfsandwich.info
  slack-desc