
What I learned from ripping 16,000 articles from PDF into Drupal

First off, Drupal is made for lots of content – 16,000 articles is no problem. Taxonomy, views, search – so much of what you need for handling big content is baked into the Drupal platform. Add Apache Solr for faceted search, and you have a scalable, flexible web publishing platform.

From a developer perspective, Drupal is a pleasure to work with. Getting content out of PDF, though – that’s not so fun.

The problem is not so much reading the text – lots of open source and free tools can extract text from PDF. (Though drop capitals and drop quotes do complicate this.) The trick is doing it in the proper order so that the content makes sense. Following columns of text to get articles in proper reading order is tough. Differentiating between columnar text and table text can be even tougher.

Since many of these issues are layout-specific, you really need to look at samples of the PDFs you will be working with. Beyond that, here’s what I’d suggest.

  • First, must you extract from PDF sources? Almost any other document format – InDesign, Word, OpenOffice – will give you better information about your text. Once the data is in PDF, software has to guess about capitalization, hyphenation, column position, and run-ons. The source document will have most of that information.
  • Xpdf is the best of the open-source tools for PDF to text. Its pdftotext utility is super-fast, and every Linux distribution packages it. Try this one first. I’ve found that it can scramble columnar text (though the -raw option helps). Pdftotext is written in C++, and it lacks a convenient API for changing the order in which it reads a page. Still, in terms of raw text extraction quality, it’s the best (there’s a short usage sketch after this list).
  • PDFMiner, a newer open-source Python library, does offer convenient command-line options for targeting specific text areas. But it’s not the best at basic text extraction – I found it occasionally garbled pages of my test input.
  • PDFBox is a Java library from the Apache Software Foundation. If you want Solr to index your PDFs, this is the most straightforward way to do it. It’s good at extracting text but bad at following columns, though the -sort option can help. If you write your own Java code, it’s straightforward to point it at just the areas of the page you want to work with.
  • ABBYY FineReader is commercial OCR software. The OCR processing makes it the slowest of the bunch, but it does a great job of putting a point-and-click interface around PDF text extraction, and it does a fair job of guessing what’s column text versus table text. Recommended if you need to hand the job to non-technical staff.
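
Here’s a minimal Python sketch of how the pdftotext and PDFMiner options above can be combined: run pdftotext first, and fall back to PDFMiner’s layout analysis when the raw output looks scrambled. It assumes the pdfminer.six fork of PDFMiner and a pdftotext binary on the PATH; the looks_garbled heuristic and the file names are illustrative, not what we actually used.

```python
# Sketch: fast pdftotext pass first, PDFMiner layout analysis as a fallback.
# Assumes pdfminer.six (pip install pdfminer.six) and pdftotext on the PATH.
import subprocess

from pdfminer.high_level import extract_text
from pdfminer.layout import LAParams


def looks_garbled(text: str) -> bool:
    # Illustrative heuristic only: lots of very short lines often means
    # interleaved columns. Tune this against your own layouts.
    lines = [line for line in text.splitlines() if line.strip()]
    if not lines:
        return True
    return sum(len(line) < 20 for line in lines) / len(lines) > 0.6


def pdf_to_text(pdf_path: str) -> str:
    # -raw emits text in content-stream order, which often un-scrambles columns.
    result = subprocess.run(
        ["pdftotext", "-raw", pdf_path, "-"],
        capture_output=True, text=True, check=True,
    )
    text = result.stdout
    if looks_garbled(text):
        # Let PDFMiner group lines by layout instead.
        text = extract_text(pdf_path, laparams=LAParams(line_margin=0.3))
    return text


print(pdf_to_text("sample-issue.pdf")[:500])  # quick sanity check
```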

The slides below talk about how this all worked for OOSKAnews.com, a publisher of specialized water industry news. For the OOSKAnews project, we used both Xpdf and ABBYY FineReader, depending on the article layout. Once we had the text, a bunch of custom sed scripts converted it to individual articles for import into Drupal.
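
The sed scripts themselves were tuned to the specific OOSKAnews layouts, so they aren’t much use as a template, but the splitting step looks roughly like this. The Python sketch below assumes a hypothetical HEADLINE: marker at the start of each article; whatever delimiter your extracted text actually has goes in its place.

```python
# Sketch of the split-into-articles step. The marker pattern is hypothetical;
# substitute whatever reliably starts each article in your extracted text.
import re
from pathlib import Path

ARTICLE_START = re.compile(r"^HEADLINE:", re.MULTILINE)  # assumed delimiter


def split_articles(text_path: str, out_dir: str) -> None:
    text = Path(text_path).read_text(encoding="utf-8")
    starts = [match.start() for match in ARTICLE_START.finditer(text)]
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for i, start in enumerate(starts):
        end = starts[i + 1] if i + 1 < len(starts) else len(text)
        (out / f"article-{i:05d}.txt").write_text(text[start:end], encoding="utf-8")


split_articles("issue-2011-06.txt", "articles/")  # example file names
```

From there, each per-article file maps to one article imported into Drupal.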

It wasn’t easy, but in the end it worked great.

Need help doing this kind of thing? Have a better way to do it? Then by all means let me know in the comments – or drop us a line.