Praise-Singing Poppler Utilities

Last year I gave a presentation at Linux Users of Victoria entitled Being An Acrobat: Linux and PDFs (there was an additional discussion, not in the presentation, about embedding JavaScript in a PDF and some related security issues, but that's for another post). Part of this presentation was singing the praises of Poppler Utilities (named after the Futurama episode, "The Problem with Popplers"). This is probably the single most common suite of tools I use when dealing with PDFs, apart from reading and creating them. I have lost count of the times I have encountered an issue with a PDF and resolved it with something from the Poppler utils suite.

The relevant slide from the presentation was as follows:

Derived from xpdf, Poppler is a software library for rendering PDF documents and is used in a variety of PDF readers (including Evince, KPDF, LibreOffice, Inkscape, Okular, Zathura etc). A collection of tools, poppler-utils, is built on Poppler’s API and provides a variety of useful functions, e.g.,

pdffonts - lists the fonts used in a PDF (e.g., pdffonts filename.pdf)
pdfimages - extract images from a PDF (e.g., pdfimages -png filename.pdf images/)
pdfseparate - extract single pages from a PDF (e.g., pdfseparate sample.pdf sample-%d.pdf)
pdftohtml - convert PDF to HTML format retaining formatting
pdftops - convert PDF to printable PS format
pdftotext - extract text from a PDF
pdfunite - merges PDFs (e.g., pdfunite page{01..13}.pdf combined.pdf)
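
These tools compose well. As a minimal sketch (the function name extract_pages and the file names are hypothetical, not from the slide), pdfseparate's -f and -l options select a page range, and pdfunite joins the resulting single-page files back together. The zero-padded %06d pattern is there so that the shell glob sorts the pages in numeric order. This assumes a pdfseparate that accepts printf-style variants of %d, as its man page describes.

```shell
# extract_pages SRC FIRST LAST OUT
# Hypothetical helper: copy pages FIRST..LAST of SRC into a new PDF, OUT.
extract_pages() {
    src=$1
    first=$2
    last=$3
    out=$4
    tmpdir=$(mktemp -d) || return 1
    # One file per page; zero padding keeps the glob below in page order.
    if pdfseparate -f "$first" -l "$last" "$src" "$tmpdir/page-%06d.pdf" &&
        pdfunite "$tmpdir"/page-*.pdf "$out"
    then
        status=0
    else
        status=1
    fi
    # Remove the temporary single-page files whether or not the merge worked.
    rm -rf "$tmpdir"
    return "$status"
}
```

For example, extract_pages sample.pdf 2 5 excerpt.pdf would be expected to write pages 2 to 5 of sample.pdf to excerpt.pdf.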

Recently I had an experience that provides a practical example of one of the tools. I am currently doing an MSc in Information Systems at the University of Salford. The course content itself is delivered through the Robert Kennedy College in Switzerland. For each assignment the course accepts the upload of one file, and one file only, and only in particular formats (e.g., docx, pdf, rtf etc). An additional upload on the RKC system will overwrite one's previously submitted assignment file.

If you have multi-part components to an assignment, you have to export them to a common format, combine them, and upload them as a single document. In a project management course, I ended up with several files, as the assignment demanded a title page, a slideshow with notes, a main body of the assignment (business case and project product plan), a Gantt chart (created through ProjectLibre), and a reference and appendix file.

At the end of the assignment, I had a Title Page file, a Part A file, a Part B Main Body file, a Part B ProjectLibre file, and a Refs and Appendix file. First I converted them all to PDF, which is one of the accepted file formats and a common export format for the text, slideshow, and project files. Then I used pdfunite to combine them into a single file, e.g.,

pdfunite title.pdf parta.pdf partb.pdf gantt.pdf refs.pdf assignment.pdf
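
Since pdfunite treats its last argument as the output file, it can help to guard the call when the upload is a one-shot affair. The following is only a sketch under my own assumptions: merge_pdfs is a hypothetical wrapper, not part of poppler-utils, and it takes the output name first so that the inputs can be checked in order before anything is written.

```shell
# merge_pdfs OUTPUT INPUT1 INPUT2 ...
# Hypothetical wrapper: verify the inputs, refuse to clobber OUTPUT,
# then hand everything to pdfunite (inputs in order, output last).
merge_pdfs() {
    if [ "$#" -lt 3 ]; then
        echo "usage: merge_pdfs output.pdf input1.pdf input2.pdf ..." >&2
        return 2
    fi
    out=$1
    shift
    for f in "$@"; do
        if [ ! -f "$f" ]; then
            # Catches a mistyped or not-yet-exported file name.
            echo "merge_pdfs: missing input: $f" >&2
            return 1
        fi
    done
    if [ -e "$out" ]; then
        echo "merge_pdfs: refusing to overwrite $out" >&2
        return 1
    fi
    pdfunite "$@" "$out"
}
```

Called as merge_pdfs combined.pdf first.pdf second.pdf, it stops with an error message before pdfunite runs if either input is missing or combined.pdf already exists.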

Quite clearly RKC has a limited and arguably poorly designed upload system. But the options are to complain about it, give up, or work around it. Information systems science demands the last of these because we will encounter this all the time. We all have to learn how to work around system limitations. When it comes to PDF limitations, I have found that Poppler Utilities are among the most useful tools available, and I use the various utilities with almost alarming regularity. So here are a few more that I didn't mention in my initial presentation due to time constraints:

pdfdetach - extract embedded documents from a PDF (e.g., pdfdetach -saveall filename.pdf)
pdfinfo - print file information from a PDF (title, subject, keywords, author, creator etc) (e.g., pdfinfo filename.pdf)
pdftocairo - convert a PDF to PNG/JPEG/TIFF/PS/EPS/SVG using cairo (e.g., pdftocairo -svg filename.pdf filename.svg)
pdftoppm - convert a PDF to Portable Pixmap bitmaps (e.g., pdftoppm filename.pdf filename)
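
Two of these are handy inside scripts. As a small sketch (both function names are hypothetical), the first helper pulls the page count out of the "Pages:" line that pdfinfo prints, and the second uses pdftoppm's -png, -singlefile, and -r options to render only the first page as a PNG at 100 DPI.

```shell
# pdf_page_count FILE
# Print the number of pages, parsed from the "Pages:" line of pdfinfo.
pdf_page_count() {
    pdfinfo "$1" | awk '/^Pages:/ { print $2 }'
}

# first_page_png FILE PREFIX
# Render page 1 of FILE to PREFIX.png; -singlefile suppresses the page-number suffix.
first_page_png() {
    pdftoppm -png -singlefile -r 100 "$1" "$2"
}
```

For instance, first_page_png filename.pdf cover should leave a cover.png next to the PDF, which is a quick way to make a thumbnail.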

Comments

From Craig Sanders
I use poppler-utils and other command-line PDF tools a lot, especially pdftotext - searching with grep and viewing the text with less is often far more convenient than viewing in a PDF viewer. BTW, if you have lesspipe configured you can just run "less filename.pdf" and it will automatically run pdftotext and pipe it into less for viewing.
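
The grep workflow can be wrapped up as follows. This is only an illustrative sketch: search_pdfs is a hypothetical name, and it relies on pdftotext writing to standard output when the output file is given as "-".

```shell
# search_pdfs PATTERN FILE...
# Print FILE:LINE:TEXT for every line of extracted text that matches PATTERN.
search_pdfs() {
    pattern=$1
    shift
    for f in "$@"; do
        # "-" sends the extracted text to stdout instead of a .txt file.
        pdftotext "$f" - | grep -n -e "$pattern" |
            while IFS= read -r line; do
                printf '%s:%s\n' "$f" "$line"
            done
    done
}
```

Line numbers refer to lines of the extracted text, not to pages of the PDF.
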
If the PDF file is just a scan with no embedded text, ocrmypdf & tesseract-ocr can be used to add embedded text to a PDF (with the quality of text depending on the quality of the scan - poor scans result in very poor text).
Once you have the text, you can add Markdown or LaTeX formatting and produce an epub or a better quality PDF, or just a summary/cheat-sheet PDF with minimal contents.
Anyway, one of the things I use these tools for is fixing the outlines/bookmarks in PDFs that I've bought from commercial sources (like dtrpg). This is only semi-automated; there's some manual text-editing involved.
Many commercial PDFs come with lousy, completely useless outlines, or no outline at all - but that's fairly easily fixed by:
1. use pdftotext to extract the text.
2. edit the text with vim or whatever to get rid of everything but the table of contents, and then convert the TOC to "<level> <page> <text>" entries (one per line).
"level" is the "indentation-level" of the entry, 0 is top-level, e.g., for chapter headings, 1 for sections within a chapter, 2 for sub-sections, etc.
The page number may have to be offset by 1 or 2 or more depending on how many un-numbered pages there are at the front of the PDF (e.g. cover page, title page) - easily done with awk or perl, e.g. to add 2 to every page number: "awk '{$2 = $2 + 2; print}'"
2a. optionally add extra outline entries for important figures, tables, etc.
3. use pdfoutline from the fntsample package to replace the PDF's outline with one that is actually useful.
result: a pdf with outlines you can click on to instantly jump to chapters, sections, sub-sections etc.
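
The page-offset step above can be sketched as a small filter. The function name shift_outline_pages is hypothetical, and the sketch assumes each outline line has the level in the first field and the page number in the second, followed by the title. Note that awk rebuilds the line when a field is assigned, so runs of spaces in a title collapse to single spaces.

```shell
# shift_outline_pages OFFSET
# Read outline entries on stdin and add OFFSET to the page number (second field).
shift_outline_pages() {
    awk -v offset="$1" '{ $2 = $2 + offset; print }'
}

# Possible end-to-end use, with hypothetical file names. The pdfoutline
# argument order (input, outline file, output) is my understanding of the
# fntsample tool; check its man page before relying on it.
#   pdftotext book.pdf book.txt
#   (edit book.txt by hand into toc.txt)
#   shift_outline_pages 2 < toc.txt > outline.txt
#   pdfoutline book.pdf outline.txt book-outlined.pdf
```
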
---
BTW, qpdf (not to be confused with the poppler-based qpdfview) is another excellent command-line PDF manipulation tool. It's a lot harder to learn how to use than the poppler collection of single-purpose tools (mostly because it's an all-in-one program that does lots of things) but it can do some things that they can't.