SlackBuilds Repository

15.0 > Office > pdfsandwich (0.1.7)

pdfsandwich (add OCR text layer to scanned PDF files)

pdfsandwich generates "sandwich" OCR PDF files, i.e. PDF files which
contain only images (no text) will be processed by optical character
recognition (OCR) and the text will be added to each page invisibly
"behind" the images. This makes it possible to search for text in
the PDF, and copy text from the PDF.

Notes:

--
The man page explains this, but I'll mention it here: the "sandwich"
PDF (the output) for filename.pdf will be written to filename_ocr.pdf.

--
According to its man page, pdfsandwich can optionally use hocr2pdf.
However, the version of hocr2pdf on SlackBuilds.org (in the
exact-image package) doesn't seem to work correctly with pdfsandwich.
This isn't a real problem, just don't use the -enforcehocr2pdf option
with pdfsandwich.

--
The PDFs created by pdfsandwich are not quite to spec. In mupdf, you
may see "warning: broken xref subsection, proceeding anyway". This
seems to be harmless. If you discover that it isn't harmless, you
can use ghostscript to fix it, thus:

$ gs -dSAFER -dNOPAUSE -sDEVICE=pdfwrite \
     -sOutputFile=pdf_fixed.pdf pdf_ocr.pdf

This requires: ocaml, unpaper, tesseract

Maintained by: B. Watson
Keywords: ocr
ChangeLog: pdfsandwich

Homepage:
http://www.tobias-elze.de/pdfsandwich/index.html

Source Downloads:
pdfsandwich-0.1.7.tar.bz2 (60e617cc398251cec5f42b870b1fccb4)

Download SlackBuild:
pdfsandwich.tar.gz
pdfsandwich.tar.gz.asc (FAQ)

(the SlackBuild does not include the source)

Individual Files:

• README

• pdfsandwich.SlackBuild

• pdfsandwich.info

• slack-desc

Validated for Slackware 15.0

See our HOWTO for instructions on how to use the contents of this repository.

Access to the repository is available via:
ftp git cgit http rsync