Category Archives: Whitepaper

Semantic Content Recognition in PDF

Semantic content recognition is the ability to identify components of a document by their “class”: that is, whether a given piece of content constitutes a title, subtitle, section, paragraph, word, figure, caption, table, etc. Despite decades of research, this problem remains open. Available solutions are unreliable and fall far short of what a human reader can do.

At the 2015 PDF Technical Conference, PDFTron’s CTO gave a presentation addressing the problem of semantic content recognition in PDF. The presentation gives an overview of the problem itself, why it has been such a hard problem to solve, and how the industry as a whole might organize itself to finally develop solutions that perform with the same accuracy as a person.

pdf.js: Interesting Project, Incorrect Rendering

pdf.js is a well-known project for rendering PDF documents directly in the browser. In that sense, it is similar to our recently announced PDFNetJS. While pdf.js is an interesting project, and may be a reasonable choice in some very specific situations, it has a number of serious problems that make it unreliable for any situation where PDF rendering is important.

Continue reading

Introducing PDFNetJS: A Complete Browser-Side PDF Viewer and Editor

PDFNetJS

The WEB is taking over (obviously)

On desktop computers, web apps continue to take over tasks that were previously handled by Windows/Mac/Linux programs. The advantages are many: web apps are immediately available on every connected computer; the user doesn’t need to download and install anything; they update instantly; and they’re cross-platform. That they naturally lend themselves to a subscription model is yet another reason that companies are choosing to develop web apps rather than traditional desktop programs.

However, web apps have historically had a number of shortcomings. An inability to deal with local files (without long uploads). Multimedia required security-challenged plugins. And they couldn’t display PDF files.

Continue reading

Table extraction and PDF to XML with PDFGenie


Intro

PDF is a hugely popular format, and for good reason: with a PDF, you can be virtually assured that a document will display and print exactly the same way on different computers. However, PDF documents suffer from a drawback: they are usually missing the information that specifies which content constitutes paragraphs, tables, figures, header/footer info, etc. This lack of ‘logical structure’ information makes it difficult to edit files, to view documents on small screens, or to extract meaningful data from a PDF. In a sense, the content becomes ‘trapped’. In this article we discuss the logical structure problem and introduce PDFGenie, a tool for extracting text and tables. We also discuss establishing a ground truth for evaluating the progress that PDFGenie and other tools make in this area.

Why is PDF so popular and what is its Achilles’ heel?

After HTML, PDF is one of the most popular document formats on the Web. Google stats show that PDF is used to represent over 70% of the non-HTML web. These are just the files that Google has indexed. There are likely to be many more in private silos such as company databases, academic archives, bank statements, credit card bills, material safety data sheets, product catalogues, product specifications, etc.

One of the main reasons why PDF is so popular is that it can be used for accurate and reliable visual reproduction across software, hardware, and operating systems.

To achieve this, PDF essentially became the ‘assembly language’ of document formats. It is fairly easy to ‘compile’ (i.e. convert) other document formats to PDF, but the reverse (i.e. decompiling PDF to a high-level representation) is much more difficult.

As a result, most PDF documents are missing logical structures such as paragraphs, tables, figures, headers/footers, the reading order, sections, chapters, TOC, etc.
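To make the ‘decompiling’ difficulty concrete, here is a minimal Python sketch. It assumes a hypothetical page whose content is nothing more than positioned text fragments, which is roughly what remains once a document has been ‘compiled’ to PDF. The fragment data, the `group_into_lines` helper, and the tolerance value are all invented for illustration; this is not how PDFGenie or any particular tool works.

```python
# Hypothetical page content: (x, y, text) fragments in the order a PDF
# generator happened to emit them, with no paragraph or table markers.
# PDF coordinates grow upward, so a larger y is closer to the top of the page.
fragments = [
    (72.0, 700.0, "Quarterly"),
    (72.0, 680.0, "Region"),
    (130.0, 700.0, "Report"),
    (200.0, 680.0, "Sales"),
    (72.0, 660.0, "North"),
    (200.0, 660.0, "120"),
]


def group_into_lines(frags, y_tolerance=2.0):
    """Heuristically rebuild text lines from positioned fragments.

    Fragments whose baselines lie within y_tolerance of the current line's
    baseline are treated as part of that line (an assumed rule of thumb).
    """
    lines = []  # each entry: (baseline_y, [fragments on that line])
    for frag in sorted(frags, key=lambda f: -f[1]):  # top of page first
        if lines and abs(lines[-1][0] - frag[1]) <= y_tolerance:
            lines[-1][1].append(frag)
        else:
            lines.append((frag[1], [frag]))
    # Within each line, order fragments left to right before joining.
    return [" ".join(text for _, _, text in sorted(fs)) for _, fs in lines]


for line in group_into_lines(fragments):
    print(line)
```

Even when the lines come out in a sensible order, nothing in the result says that the first line is a title or that the last two lines are rows of a table. That missing layer is the logical structure discussed here, and recovering it takes further heuristics that can easily fail on real documents.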

Although PDF can technically store this type of structured information via marked content, it is usually not present. When it is available, techniques similar to the one shown in the LogicalStructure sample can be used to extract structured content.

Unfortunately, even when a file contains some tags, they are frequently not very useful because there is no universally accepted grammar for logical structure in documents (just like there is no universally accepted high-level programming language). Tags are also frequently incorrect or damaged due to file manipulation or errors in PDF generation software.

The lack of structural information makes it difficult to reuse and repurpose the digital content represented by PDF.

So, although massive amounts of unstructured data are held in the form of PDF documents, automated extraction of tables, figures, and other structured information from PDF can be very difficult and costly.

Continue reading

Mobile Cross-Platform PDF Viewers: Options for Android, iOS, Windows Store Apps and Windows Phone 8

The rise of mobile platforms, each with its own native programming language and API, has created new demand for cross-platform development tools and SDKs. To display a PDF, most cross-platform toolkits offer either a C++ interface (which does not provide a native UI component) or a simple PDF-to-image style solution. In this post, we will outline some better options for handling PDFs in a cross-platform manner on mobile devices.

Continue reading

All About PDF/A

What is PDF/A?

PDF/A (as in Archive) is a special variant of PDF that has been designed specifically for long-term document preservation. Initially released in 2005 and based on PDF 1.4, the specification’s goal was to create a format that can be reliably rendered on any system when opened with a compliant viewer. PDF itself did not meet this criterion because PDF documents can contain elements whose appearance changes depending on the viewer, the host operating system, or the state of the PDF itself. Some examples include PDF documents that do not embed their fonts and instead hope that the viewing system has them, documents that use device-dependent color, documents that are encrypted, and documents that contain dynamic content such as JavaScript and 3D. PDF/A was developed to solve these problems.

Continue reading