PDF Madness

IRE 2014 // San Francisco

Cheryl Phillips // @cephillips
The Seattle Times

Tyler Dukes // @mtdukes

Fact: PDFs are the worst.

Interrogate your PDFs


In five easy steps!

1. What kind of data do you hold?

  • Text? Numbers? Codes?
  • Note: Ask what data the PDF holds, not what data you want

2. Do you come in another format?

  • Ask for the original, whenever possible.
  • Weigh the risks vs. time
  • Getting stonewalled? Ask the document!

3. Can I select your text?

  • Scanned image vs. "native" PDF
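
A rough way to ask this question in code, as a sketch rather than a definitive test: if text extraction returns almost nothing for a page, that page is probably a scanned image. The function name `classify_pages`, the `min_chars` threshold, and the sample strings are assumptions made for illustration.

```python
def classify_pages(page_texts, min_chars=20):
    """Label each page 'native' if it seems to have a text layer, else 'scanned'.

    page_texts: one extracted string (or None) per page.
    min_chars: hypothetical threshold; tune it for your documents.
    """
    labels = []
    for text in page_texts:
        # Count only non-whitespace characters. Image-only pages usually
        # yield an empty string or a few stray characters.
        visible = sum(1 for ch in (text or "") if not ch.isspace())
        labels.append("native" if visible >= min_chars else "scanned")
    return labels


# Made-up extraction results for a three-page PDF.
sample_pages = [
    "Example report, page one: a full paragraph of selectable text.",
    "",
    None,
]
labels = classify_pages(sample_pages)
print(labels)
```

The per-page strings have to come from an extraction tool. One option, assumed here and not named in the slides, is the pypdf library: `PdfReader(path).pages[i].extract_text()`.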

4. What do you look like?

  • Are the rows and columns neat and orderly?
  • Is the data schema irregular?
  • Are there "one-to-many" relationships?
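
Once a table has been pulled out of a PDF, the "neat and orderly" question can be checked roughly by comparing row widths. The sketch below assumes the rows are already lists of cells. `column_profile` and the sample rows are hypothetical.

```python
from collections import Counter


def column_profile(rows):
    """Return (most common row width, indexes of rows with another width)."""
    widths = [len(row) for row in rows]
    if not widths:
        return 0, []
    # Treat the most frequent width as the expected number of columns.
    expected = Counter(widths).most_common(1)[0][0]
    irregular = [i for i, w in enumerate(widths) if w != expected]
    return expected, irregular


# Made-up rows, as an extraction tool might return them.
rows = [
    ["name", "date", "amount"],
    ["A", "2014-01-02", "10"],
    ["B", "2014-01-03"],  # a cell went missing during extraction
    ["C", "2014-01-04", "12"],
]
expected, irregular = column_profile(rows)
print(expected, irregular)
```

Rows flagged as irregular are worth checking by hand. They often signal merged cells, wrapped text, or a one-to-many layout.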

5. How many of you are there?

  • Is it feasible to review one by one?
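
A back-of-the-envelope sketch for this question: count the PDFs in a folder and multiply by a guessed review time per file. `review_estimate` and the minutes-per-file figure are assumptions, not measurements.

```python
import tempfile
from pathlib import Path


def review_estimate(folder, minutes_per_file=5):
    """Count PDFs under folder and estimate hours to read each one by hand."""
    count = sum(
        1
        for p in Path(folder).rglob("*")
        if p.is_file() and p.suffix.lower() == ".pdf"
    )
    return count, count * minutes_per_file / 60


# Demonstration on a throwaway folder holding two PDFs and one text file.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("a.pdf", "b.PDF", "notes.txt"):
        (Path(tmp) / name).write_bytes(b"")
    count, hours = review_estimate(tmp, minutes_per_file=6)
print(count, hours)
```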

Government reports

  • Cometdocs
  • Tabula

Scanned images

  • DocumentCloud
  • Overview

Password-protected documents

Can't alter the PDF or select the text

Two options

  • OCR the document
  • Crack the password
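
Before choosing between the two options, it can help to confirm that the file is actually encrypted. Encrypted PDFs carry an `/Encrypt` entry in the file trailer, so a byte search is a quick heuristic. It is only a sketch: it can be fooled if that string appears elsewhere in the file. The function names and sample bytes are illustrative.

```python
def looks_encrypted(pdf_bytes):
    """Heuristic: a PDF header near the start plus an /Encrypt entry."""
    has_header = b"%PDF-" in pdf_bytes[:1024]
    return has_header and b"/Encrypt" in pdf_bytes


def looks_encrypted_file(path):
    """Run the same heuristic on a file on disk."""
    with open(path, "rb") as f:
        return looks_encrypted(f.read())


# Minimal made-up fragments, not complete PDF files.
plain = b"%PDF-1.4\ntrailer\n<< /Root 1 0 R /Size 4 >>\n%%EOF"
locked = b"%PDF-1.4\ntrailer\n<< /Root 1 0 R /Encrypt 5 0 R /Size 6 >>\n%%EOF"
print(looks_encrypted(plain), looks_encrypted(locked))
```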

Decrypting tools


Tipsheet // bit.ly/pdf-madness