Monthly Archives: March 2014

Table extraction and PDF to XML with PDFGenie


Intro

PDF is a hugely popular format, and for good reason: with a PDF, you can be virtually assured that a document will display and print exactly the same way on different computers. However, PDF documents suffer from a drawback in that they are usually missing information specifying which content constitutes paragraphs, tables, figures, header/footer info etc. This lack of ‘logical structure’ information makes it difficult to edit files or to view documents on small screens, or to extract meaningful data from a PDF. In a sense, the content becomes ‘trapped’. In this article we discuss the logical structure problem and introduce PDFGenie, a tool for extracting text and tables, as well as establishing a ground truth for evaluating progress in this area by PDFGenie as well as other tools.

Why is PDF so popular and what is its Achilles’ heel?

After HTML, PDF is by far one of most popular document formats on the Web. Google stats show that PDF is used to represent over 70% of the non-html web. These are just the files that Google has indexed. There are likely to be many more in private silos such as company databases, academic archives, bank statements, credit card bills, material safety data sheets, product catalogues, product specifications, etc.

One of the main reasons why PDF is so popular is that it can be used for accurate and reliable visual reproduction across software, hardware, and operating systems.

To achieve this, PDF essentially became the ‘assembly language’ of document formats. It is fairly easy to ‘compile’ (i.e. convert) other document formats to PDF, but the reverse (i.e. decompiling PDF to a high-level representation) is much more difficult.

As a result, most PDF documents are missing logical structures such as paragraphs, tables, figures, header/footers, the reading order, sections, chapters, TOC, etc.

Although PDF could technically be used to store this type of structured information via marked content, it is usually not present. When available, techniques similar to one shown in the LogicalStructure sample can be used to extract structured content.

Unfortunately, even when a file contains some tags, they are frequently not very useful because there is no universally accepted grammar for logical structure in documents (just like there is no universally accepted high-level programming language). Tags are also frequently incorrect or damaged due to file manipulation or errors in PDF generation software.

The lack of structural information makes it difficult to reuse and repurpose the digital content represented by PDF.

So, although massive amounts of unstructured data are held in the form of PDF documents, automated extraction of tables, figures, and other structured information from PDF can be very difficult and costly.

Continue reading