All About PDF/A

What is PDF/A?

PDF/A (the A stands for Archive) is a variant of PDF designed specifically for long-term document preservation. Initially released in 2005 and based on PDF 1.4, the specification’s goal was to create a format that can be reliably rendered on any system when opened with a compliant viewer. PDF itself did not meet this criterion because PDF documents can contain elements whose appearance changes depending on the viewer, the host operating system, or the state of the PDF itself. Examples include documents that do not embed their fonts and instead rely on the viewing system having them, documents that use device-dependent color, documents that are encrypted, and documents that contain dynamic content such as JavaScript and 3D. PDF/A was developed to solve these problems.

Why use PDF/A?

There are many reasons to use PDF/A for archiving purposes, but the two main ones are its advantages over other electronic formats, and its industry acceptance.

Advantages of PDF/A over TIFF

The other widely used format for digital archiving is TIFF. TIFF is a raster image format that promises the same guaranteed visual appearance that PDF/A does. However, because it is an image format, it cannot include vector content such as shapes, gradients and vector fonts, all of which are available in PDF. Not only does vector content describe the original document more accurately, it often takes up less disk space, which can be a consideration when archiving a large number of documents. This is especially relevant for “digitally born” documents, which are now far more common than when TIFF was first adopted.

Another advantage of PDF/A over TIFF is that Unicode information can be included to ensure text is extractable and searchable, making a large digital archive potentially far more useful.

Lastly, the PDF/A format can accommodate embedded digital signatures, giving users a way to verify that the PDF has not been altered, which can be important for the legal admissibility of a document.

Industry Acceptance

PDF/A has achieved a high level of industry acceptance. When the format was published in 2005, a group of companies in Europe formed the PDF/A Competence Center to raise the format’s profile and promote its benefits to industry and government. Since then, many institutions, especially in Europe, have mandated PDF/A as the required file format for archiving. Various US agencies, such as NARA and PACER, also accept PDF/A.

Secondly, the fact that a PDF/A file is a PDF file means that free viewers are widely available on virtually all computing devices.

PDF/A-1, PDF/A-2 and PDF/A-3

PDF/A exists in several subtypes. Below is an explanation of each.

PDF/A-1: ISO 19005-1:2005

This is the original PDF/A specification and the most restrictive of the PDF/A standards. Because it was released before PDF was defined in an ISO standard, it is based on PDF 1.4, an open but proprietary specification published by Adobe Systems Inc. PDF/A-1 files do not allow JPEG2000, attachments or layers. Although transparency is included in PDF 1.4, it was “too new” at the time and so it was omitted from the PDF/A-1 specification. PDF/A-1 is the most frequently used PDF/A format, possibly because it is the original one.

There are two different levels of PDF/A-1 conformance:

PDF/A-1b: Level B (Basic) Conformance

Basic conformance guarantees reliable viewing.

PDF/A-1a: Level A (Accessible) Conformance

Accessible conformance is a superset of level B conformance, adding requirements that help machines and people better understand the content. First, the content must be tagged with a structure tree, meaning that reading order and elements such as figures and tables are explicitly identified through metadata. Second, the language of the document must be identified (which helps screen readers). Third, Unicode mappings must be included to ensure reliable text search and copy. None of these additional requirements changes the appearance of the document, but they are important for computers to quickly and reliably search or repurpose a document. They also help disabled users read the document, because without this information screen readers may not work reliably, and the reading order is not guaranteed to be correct.

PDF/A-2

PDF/A-2 was released in July 2011 and brought with it two big features:

  1. PDF/A-2 is based on an ISO standard, ISO 32000-1.
  2. PDF/A-2 includes new features not available in PDF/A-1, namely transparency, JPEG2000, layers and attachments (only other PDF/A files).

Note that PDF/A-2 is backwards compatible with PDF/A-1; that is, any PDF/A-1 file is also a PDF/A-2 file.

PDF/A-2 has the same conformance levels as PDF/A-1 (that is, PDF/A-2b and PDF/A-2a), plus a new one:

PDF/A-2u: Level U (Unicode) Conformance

PDF/A-2u is like PDF/A-2a, but drops the need for the logical structure information (tags and structure tree) as specified in section 6.8 of ISO 19005-2. This means that a PDF with level U conformance will have text that can be searched and copied, but its reading order will not be guaranteed.

PDF/A-3

PDF/A-3 was released in October 2012, and it is exactly the same as PDF/A-2 (they even left the typos intact) except for one difference, which is that any type of file can be added as an attachment, rather than only other PDF/A files. This change was largely driven by the desire to have a machine readable component available, such as XML or proprietary binary data.

PDF/A-3 has the same levels of conformance as PDF/A-2: B, A and U.
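
To summarize the subtypes described above, here is a small Python sketch that enumerates the part and conformance-level combinations and the attachment rule for each part. The table and function names are our own shorthand for illustration; they are not taken from the ISO documents or from any library.

```python
# Conformance levels available in each part, as described in this article.
LEVELS_BY_PART = {
    1: ("a", "b"),
    2: ("a", "b", "u"),
    3: ("a", "b", "u"),
}

# What each part allows as an attachment, as described in this article.
ATTACHMENTS_BY_PART = {
    1: "not allowed",
    2: "other PDF/A files only",
    3: "any file type",
}


def all_flavours():
    """List every part/level combination, e.g. 'PDF/A-2u'."""
    return [
        f"PDF/A-{part}{level}"
        for part, levels in sorted(LEVELS_BY_PART.items())
        for level in levels
    ]


def is_known_flavour(part, level):
    """True if the given part defines the given conformance level."""
    return level.lower() in LEVELS_BY_PART.get(part, ())


print(all_flavours())
print(is_known_flavour(1, "u"))  # False: level U was introduced with PDF/A-2
print(ATTACHMENTS_BY_PART[3])
```

The list produced by all_flavours() has eight entries, one for each subtype and conformance level named in this article.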

Problems with PDF/A

PDF/A is a more reliable type of PDF that has been accepted as a suitable archive format. That said, there are in our opinion problems with the format that are not often discussed, which we believe has led to a false impression among users regarding the reliability of PDF/A. Further, there are issues with the format that have led to confusion and disagreement between providers of PDF/A-related software.

PDF/A’s Reliability Promise

PDF/A is promoted as a 100% reliable version of PDF. It is said to achieve this by removing features that could display differently on different machines (e.g. non-embedded fonts) or that can create dynamic, changing content (e.g. JavaScript). PDF/A validators will verify that a file complies with the PDF/A specification, giving the user confidence that the file will be viewed correctly on different machines now and in the future. So what is wrong with this picture?

The elephant in the room is that PDF/A validators validate only against what is contained in the PDF/A specification. That is, they will fail a document that contains forbidden content, such as transparency in a PDF/A-1 file, or that lacks required content, such as device-independent color information. Everything else is considered a valid PDF/A file. The problem is that the PDF/A specification references the PDF specification, which is itself enormous (1000+ pages), and violations of the PDF specification are not checked. So what is a viewer supposed to do when it encounters such a violation? Technically, the document is no longer a PDF (or PDF/A) document, so the behavior is undefined. Most viewers silently try to correct the problem by making assumptions about what the creator intended. This is inherently unreliable, as different implementations may make different guesses. And to make a guess, the viewer first has to catch the problem; if it doesn’t, it may well get “confused” and display completely unpredictable content, or perhaps even crash. Not exactly a reliable viewing experience!

Making matters worse, the PDF specification itself references still other specifications, such as JPEG, JPEG2000, various font specifications, color profile specifications, etc. A violation of any of these specifications opens up the same possibility of undefined viewing behavior.
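
The gap can be illustrated with a deliberately simplified Python sketch. It models a document as a set of feature labels, which is our own invention and not how a real validator represents a PDF, and it checks only a short list of forbidden and required PDF/A-1 features drawn from this article. The label "malformed_jpeg_stream" is a hypothetical stand-in for any violation of a referenced specification.

```python
# Toy model only: a "document" is a set of feature labels (hypothetical names).
FORBIDDEN_IN_PDFA1 = {
    "transparency", "javascript", "encryption",
    "jpeg2000", "attachments", "layers",
}
REQUIRED_IN_PDFA1 = {"embedded_fonts", "device_independent_color"}


def validate_pdfa1(features):
    """Return the problems a rule-based check would report (empty list = pass)."""
    problems = [f"forbidden: {name}" for name in sorted(features & FORBIDDEN_IN_PDFA1)]
    problems += [f"missing: {name}" for name in sorted(REQUIRED_IN_PDFA1 - features)]
    return problems


clean = {"embedded_fonts", "device_independent_color"}

# Forbidden content is caught.
print(validate_pdfa1(clean | {"transparency"}))

# A violation of a referenced specification is not on either list,
# so the check has nothing to say about it and the file passes.
print(validate_pdfa1(clean | {"malformed_jpeg_stream"}))
```

The second call returns an empty list: the file "validates" even though a viewer may still stumble over the malformed stream.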

Is PDF/A still useful?

Yes. There is nothing wrong with PDF/A files; it is simply that the PDF/A promise of one hundred percent reliable viewing is overstated. What PDF/A should promise is much more reliable viewing, with the additional promise that if the document does not violate any of the other specifications it relies on, then it will be completely reliable. Validating a file against the PDF/A specification gives a level of confidence that the most common causes of unreliable viewing in PDF files have been addressed.

It is simply important to use the format with a full understanding of its benefits and limitations.

PDF/A Validation and PDF/A Conversion with PDF/A Manager

PDFTron makes both a PDF/A validator and a PDF/A converter. Both are part of the same product, PDF/A Manager. PDF/A Manager is available as an SDK, as a command line program, and as an add-on for PDFNet, our full-featured PDF SDK.

PDF/A Validation

PDF/A validation consists of checking the internal PDF structure against what is defined in the PDF/A specification. An XML report is generated that details any errors that are found. Each error consists of an error ID, a message describing the problem, and the PDF indirect object number that has the problem. (The object numbers could be used with CosEdit to directly view and modify the PDF structure if that level of interaction is required.)
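
As a rough illustration of working with such a report, the Python sketch below groups errors by object number so that all problems affecting one object can be reviewed together. Please note that the XML layout, the element and attribute names, and the error IDs here are invented for this example; they are not PDF/A Manager's actual report schema, which is not reproduced in this article.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# HYPOTHETICAL report: names, IDs and messages are made up for this sketch
# and do not reflect the real PDF/A Manager output format.
SAMPLE_REPORT = """<report>
  <error id="E1" object="12">Font is not embedded</error>
  <error id="E2" object="12">Device-dependent color is used</error>
  <error id="E3" object="40">JavaScript action is not allowed</error>
</report>"""


def errors_by_object(report_xml):
    """Map each PDF object number to a list of (error_id, message) pairs."""
    grouped = defaultdict(list)
    for err in ET.fromstring(report_xml).iter("error"):
        object_number = int(err.get("object"))
        message = (err.text or "").strip()
        grouped[object_number].append((err.get("id"), message))
    return dict(grouped)


for object_number, errors in sorted(errors_by_object(SAMPLE_REPORT).items()):
    print(f"object {object_number}: {len(errors)} problem(s)")
    for error_id, message in errors:
        print(f"  [{error_id}] {message}")
```

Grouping by object number is just one possible design choice; grouping by error ID would instead show which kinds of problems are most frequent in a file.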

PDF/A Manager supports validating all PDF/A subtypes and conformance levels, that is PDF/A-1a, PDF/A-1b, PDF/A-2a, PDF/A-2b, PDF/A-2u, PDF/A-3a, PDF/A-3b, and PDF/A-3u.
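
A PDF/A file states which subtype and conformance level it claims to be in its XMP metadata, using the pdfaid:part and pdfaid:conformance properties of the PDF/A identification schema. The Python sketch below reads that claim from an XMP packet that has already been extracted from a file (the extraction step is not shown, and the sample packet is our own minimal example). It only reports what the file claims; it does not validate anything, and it is not a description of how PDF/A Manager works internally.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
PDFAID_NS = "http://www.aiim.org/pdfa/ns/id/"

# Minimal sample packet written for this sketch.
SAMPLE_XMP = """<x:xmpmeta xmlns:x="adobe:ns:meta/">
  <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
    <rdf:Description rdf:about=""
        xmlns:pdfaid="http://www.aiim.org/pdfa/ns/id/">
      <pdfaid:part>2</pdfaid:part>
      <pdfaid:conformance>B</pdfaid:conformance>
    </rdf:Description>
  </rdf:RDF>
</x:xmpmeta>"""


def claimed_pdfa_flavour(xmp_xml):
    """Return (part, conformance) as claimed in the XMP packet, or None."""
    root = ET.fromstring(xmp_xml)
    for description in root.iter(f"{{{RDF_NS}}}Description"):
        values = {}
        for name in ("part", "conformance"):
            key = f"{{{PDFAID_NS}}}{name}"
            # XMP allows a simple property to be written either as an
            # attribute of rdf:Description or as a child element.
            value = description.get(key)
            if value is None:
                child = description.find(key)
                value = child.text if child is not None else None
            values[name] = value
        if values["part"] and values["conformance"]:
            return int(values["part"]), values["conformance"].strip().upper()
    return None


print(claimed_pdfa_flavour(SAMPLE_XMP))
```

For the sample packet this reports part 2 with conformance B, that is, a file claiming to be PDF/A-2b. Whether the claim is true is exactly what a validator has to check.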

PDF/A Conversion

PDF/A Manager will also convert any PDF document to be PDF/A compliant. It does this by replacing non-PDF/A compliant content with the PDF/A equivalent, embedding missing fonts, adding missing metadata streams, etc.

See below for an example report from converting a PDF document to a PDF/A-2b document:

[Image: PDF/A Validation Results]

For more information on PDF/A Manager and its capabilities, please see the product website: https://www.pdftron.com/pdfamanager/index.html
