Praise-Singing Poppler Utilities

Last year I gave a presentation at Linux Users of Victoria entitled Being An Acrobat: Linux and PDFs (there was an additional discussion not in the presentation about embedding Javascript in a PDF and some related security issues, but that's for another post). Part of this presentation was singing the praises of Poppler Utilities (named after the Futurama episode, "The Problem with Popplers"). This is probably the most single most common suite of tools I use when dealing with PDFs with the single exception of reading and creating. There is an enormous number of times I have encountered an issue with PDFs and resolved it with something from the Popper utils suite.

The entire relevant slide from the presentation is reproduced was as follows:

Derived from xpdf, Poppler is a software library for rendering PDF documents and is used in a variety of PDF readers (including Evince, KPDF, LibreOffice, Inkscape, Okular, Zathura etc). A collection of tools, poppler-utils, is built on Poppler’s API provides a variety of useful functions e.g.,

pdffonts - lists the fonts used in a PDF (e.g., pdfonts filename.pdf)
pdfimages - extract images from a PDF (e.g., pdfimages -png filename.pdf images/)
pdfseparate - extract single pages from a PDF (e.g., pdfseparate sample.pdf sample-%d.pdf)
pdftohtml - convert PDF to HTML format retaining formatting
pdftops - convert PDF to printable PS format
pdftotext - extract text from a PDF
pdfunite - merges PDFs (pdfunite page{01..13}.pdf combined.pdf)

Recently I had an experience with this that illustrates a practical example of one of the tools. I am currently doing a MSc in Information Systems at the University of Salford. The course content itself is conducted through the Robert Kennedy College in Swizterland. For each asssignment the course accepts uploads for one file and one file only, and only in particular formats as well (e.g., docx, pdf, rtf etc). An additional upload on the RKC system will overwrite one's previously submitted assignment file.

If you have multi-part components to an assignment, you will have to export them to a common format, combine them, and upload them as a single document. In a project management course, I ended with several files, as the assignment demanded a title page, a slideshow with nodes, a main body of the assignment (business case and project product plan), a Gannt chart (created through ProjLibre), and an reference and appendix file.

At the end of the assignment, I had a Title Page file, a Part A file, Part B Main Body file, Part B LibreProj file, and a Refs and Appendix File. First I converted them all to PDFs. This is one of the file formats accepted, and is a common export format for the text, slideshow, and project. Then I used the application pdfunite (Linux application) to combine them into a single file. e.g.,

pdfunite title.pdf parta.pdf partb.pdf gannt.pdf refs.pfs assignment.pdf

Quite clearly RKC has a limited and arguably poorly designed upload system. But the options are either complain about it, give up, or work around it. Information systems science demands the latter because we will encounter this all the time. We all have to learn how to workaround system limitations. When it comes to PDF limitations, I have found that Popper Utilities are one of the most useful tools available and I have found that I use the various utilities with almost alarming regularity. So here's a few more that I didn't mention in my initial presentation due to time constraints:

pdfdetach - extract embedded documents from a PDF (e.g., pdfdetach --saveall filename.pdf)
pdfinfo - print file information from a PDF (title, subject, keywords, author, creator etc) (e.g., pdfinfo filename.pdf)
pdftocairo - convert pdf to a png/jpeg/tiff/ps/eps/svg using cairo (e.g., pdftocairo -svg filename.pdf filename.svg)
pdftoppm - convert pdf to Portable Pixmap bitmaps (e.g., pdftoppm filename.pdf filename)