How To Integrate a PDF Viewer into HTML5 Apps

HTML5 apps offer many advantages over native ones. Web apps:

  • Are naturally cross-platform: develop once, run on iOS, Android, Windows Phone and everything else.
  • Are easy to update for everyone, immediately.
  • Do not have to go through Apple or Google to reach customers (though they still can, by being embedded in a native shell app).

But web apps suffer from one big problem, and that’s the user experience.

Today, in 2013, even the best-crafted mobile web apps come nowhere near the quality of experience of the best native apps. In fact, with but a few exceptions, the best mobile web apps today still don’t approach the quality of the first batch of native iPhone apps from 2007.

John Gruber, Daring Fireball

One area of the user experience where HTML5 apps have historically been weak is their ability to display a PDF within the app. For a long time, “viewing” a PDF on the web meant downloading it and opening it in a different program. Next came browser PDF plugins that would take over the browser screen in order to display the PDF. A small improvement, but still not integrated and certainly not a good user experience.

So, if the goal is to integrate PDF viewing into a web app, how can that be done? There are a number of approaches, each with pros and cons. Keep reading to see what techniques exist, and which might be best for your app.

1. Rasterization to images

This is probably the simplest way to get “PDF” onto the web. Take the PDF, turn it into images, serve. Voila. PDF on the web in a format that is compatible with all browsers on all operating systems. However, there are some issues:

  • No vector content, which limits quality at high resolutions.
  • Storage- and bandwidth-heavy bitmap data.
  • No support for PDF capabilities such as forms or a standard method of annotations.
  • Scalability problems: rasterization is computationally expensive and storage requirements are large.
  • Extra work is required to implement text selection and indexability.
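To put a rough number on the storage and memory cost, here is a minimal sketch that estimates the uncompressed size of one rasterized page. The page size (US Letter), the resolutions and the 4 bytes per pixel (RGBA) are assumptions chosen for illustration, not figures from a particular converter.

```typescript
// Rough estimate of the uncompressed size of one rasterized page.
// Assumptions for illustration: US Letter page, RGBA at 4 bytes per pixel.
interface PageSizeInches {
  width: number;
  height: number;
}

const US_LETTER: PageSizeInches = { width: 8.5, height: 11 };

function rasterDimensions(page: PageSizeInches, dpi: number): { widthPx: number; heightPx: number } {
  return {
    widthPx: Math.round(page.width * dpi),
    heightPx: Math.round(page.height * dpi),
  };
}

function rawBitmapBytes(page: PageSizeInches, dpi: number, bytesPerPixel = 4): number {
  const { widthPx, heightPx } = rasterDimensions(page, dpi);
  return widthPx * heightPx * bytesPerPixel;
}

// Doubling the resolution quadruples the pixel count.
for (const dpi of [72, 150, 300]) {
  const mib = rawBitmapBytes(US_LETTER, dpi) / (1024 * 1024);
  console.log(`${dpi} DPI: ${mib.toFixed(1)} MiB uncompressed`);
}
```

Compressed formats such as PNG or JPEG are much smaller on disk and over the wire, but a browser typically has to hold the decoded bitmap in memory in order to display it, so the uncompressed figure is the one that matters on memory-constrained devices.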

While converting to images may be a good solution for some applications, it is unlikely to be an optimal one. So what can we do?

2. HTML DOM

The idea here is to use the browser’s native text rendering and layer it on top of an image that contains all of the non-text data. (This technique is implemented by PDFTron in pdftron.PDF.Convert.ToHtml().) While it sounds like an incremental change from full rasterization, there are some significant advantages:

  • Text quality is often preserved. People are especially sensitive to the quality of text, so preserving the vector nature of the glyphs is a big improvement.
  • Users can rely on the browser’s standard text selection and copying, and the text can also be read by search engine robots.

So while this is a step up from full rasterization, problems remain:

  • Quality is still sacrificed for all non-text content, which remains rasterized.
  • Accurate text positioning is possible, however it requires a separate element for every letter. Doing this reduces page load speed and the ability to search/index/select text. So one must accept this limitation, or instead accept somewhat inaccurate text positioning.
  • Degrades to full PDF rasterization when text is semi-transparent, partially occluded or covered by transparent objects, pattern-filled objects, etc.
  • It is easy for users to save DOM content locally, which is a concern if serving copyrighted content.
  • Storage requirements could be significant.
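The positioning trade-off described above can be sketched in a few lines. This is an illustration only and not PDFTron's converter: the `TextRun` shape, the `glyphX` field and the function names are all hypothetical. It builds a page as absolutely positioned text elements over a background image, either one element per run of text or one element per glyph.

```typescript
// A run of text as a PDF-to-HTML converter might emit it.
// This shape is hypothetical, for illustration only.
interface TextRun {
  text: string;
  x: number;         // left edge, CSS pixels
  y: number;         // top edge, CSS pixels
  fontSize: number;  // CSS pixels
  glyphX?: number[]; // optional exact left edge of every glyph
}

function escapeHtml(s: string): string {
  return s
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

function span(text: string, x: number, y: number, fontSize: number): string {
  const style = `position:absolute;left:${x}px;top:${y}px;font-size:${fontSize}px`;
  return `<span style="${style}">${escapeHtml(text)}</span>`;
}

// One element per run: few DOM nodes and easy selection, but the browser
// decides where each glyph inside the run lands.
function runToSpans(run: TextRun): string[] {
  return [span(run.text, run.x, run.y, run.fontSize)];
}

// One element per glyph: exact placement, at the cost of many more nodes.
function runToGlyphSpans(run: TextRun): string[] {
  // Splits by UTF-16 code unit, which is adequate for this sketch.
  const glyphs = run.text.split("");
  const xs = run.glyphX;
  if (!xs || xs.length !== glyphs.length) {
    return runToSpans(run); // no per-glyph data, fall back to one span
  }
  return glyphs.map((g, i) => span(g, xs[i] ?? run.x, run.y, run.fontSize));
}

// Text spans layered over an image that holds all non-text content.
function renderPage(backgroundUrl: string, runs: TextRun[], perGlyph: boolean): string {
  const spans: string[] = [];
  for (const run of runs) {
    spans.push(...(perGlyph ? runToGlyphSpans(run) : runToSpans(run)));
  }
  const img = `<img src="${escapeHtml(backgroundUrl)}" alt="">`;
  return `<div style="position:relative">${img}${spans.join("")}</div>`;
}
```

With per-glyph output, a page containing a few thousand characters becomes a few thousand DOM nodes, which is where the load-time and selection costs come from.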

3. SVG

The W3C recognized the need to bring high-quality vector graphics to the web, and proposed SVG (Scalable Vector Graphics). At first, this technology seemed very promising: it would deliver the vector data and precise positioning we want, with fonts, gradients, masks and more. A “PDF killer”, some predicted. PDFTron took action and developed the first PDF to SVG converter in 2001. However, widespread adoption of SVG and the supplanting of PDF never came to pass. Why not? Here are a few reasons:

  • SVG is not fully compatible with the PDF graphics model (e.g. transparency/blend mode), making it impossible to faithfully reproduce PDF content using SVG.
  • A bloated spec designed to also compete with Flash, incorporating scripting and animation, placed a high burden on those wishing to implement the spec completely.
  • It is missing support for efficient monochrome compression, which is important for many scanned business documents.
  • Worst of all, most implementations were incomplete and buggy. Until IE9, Microsoft did not support SVG at all, and even now there is no support for SVG fonts. In other browsers (Chrome, Firefox) there are many glitches related to text positioning.

SVG had some built-in technical limitations, but its biggest problem was (and still is) a lack of complete and correct implementations within browsers. Ultimately it has found success in certain niches, but it has not experienced widespread adoption for general use cases.

4. HTML5 Canvas

So where does that leave us? Not surprisingly, we are going to take a close look at “HTML5”, specifically the canvas. Does this technology finally deliver the ability to view a PDF inline? Will it succeed where others have come up short?

The HTML5 Canvas gives us 2D drawing capabilities similar to a system-level library like GDI and Direct2D on Windows, and Quartz on OS X and iOS. This means that shapes, curves, text and opacities can be represented mathematically, and rendered by the canvas at any resolution. So the big question is whether we can “translate” the mathematical representation of content in a PDF into a series of JavaScript commands that draw it to the HTML5 Canvas. Let’s take a look.
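For the simplest content, the translation is close to one-to-one. The sketch below maps the PDF path construction operators m, l, c, h and re onto canvas path calls, flipping the y axis because PDF page space has its origin at the bottom-left while the canvas origin is at the top-left. The `PathOp` and `PathSink` types and the `drawPath` function are our own names for this illustration, and the sketch assumes the operators have already been parsed out of the content stream and that no other transform is in effect.

```typescript
// Subset of canvas path methods used below. A real CanvasRenderingContext2D
// has methods with these names and signatures, so it can be passed directly.
interface PathSink {
  moveTo(x: number, y: number): void;
  lineTo(x: number, y: number): void;
  bezierCurveTo(c1x: number, c1y: number, c2x: number, c2y: number, x: number, y: number): void;
  closePath(): void;
}

// Already-parsed PDF path construction operators (hypothetical shape).
type PathOp =
  | { op: "m"; x: number; y: number }
  | { op: "l"; x: number; y: number }
  | { op: "c"; x1: number; y1: number; x2: number; y2: number; x3: number; y3: number }
  | { op: "h" }
  | { op: "re"; x: number; y: number; w: number; h: number };

function drawPath(sink: PathSink, ops: PathOp[], pageHeight: number, scale = 1): void {
  const X = (x: number): number => x * scale;
  const Y = (y: number): number => (pageHeight - y) * scale; // flip the y axis
  for (const o of ops) {
    switch (o.op) {
      case "m":
        sink.moveTo(X(o.x), Y(o.y));
        break;
      case "l":
        sink.lineTo(X(o.x), Y(o.y));
        break;
      case "c":
        sink.bezierCurveTo(X(o.x1), Y(o.y1), X(o.x2), Y(o.y2), X(o.x3), Y(o.y3));
        break;
      case "h":
        sink.closePath();
        break;
      case "re":
        // Expanded into the equivalent move, three lines and a close.
        sink.moveTo(X(o.x), Y(o.y));
        sink.lineTo(X(o.x + o.w), Y(o.y));
        sink.lineTo(X(o.x + o.w), Y(o.y + o.h));
        sink.lineTo(X(o.x), Y(o.y + o.h));
        sink.closePath();
        break;
    }
  }
}

// Recording sink so the sketch runs without a browser.
const calls: string[] = [];
const recorder: PathSink = {
  moveTo: (x, y) => { calls.push(`moveTo(${x}, ${y})`); },
  lineTo: (x, y) => { calls.push(`lineTo(${x}, ${y})`); },
  bezierCurveTo: (a, b, c, d, e, f) => { calls.push(`bezierCurveTo(${a}, ${b}, ${c}, ${d}, ${e}, ${f})`); },
  closePath: () => { calls.push("closePath()"); },
};

// A triangle on a page that is 792 PDF units tall.
drawPath(recorder, [
  { op: "m", x: 100, y: 100 },
  { op: "l", x: 200, y: 100 },
  { op: "l", x: 150, y: 200 },
  { op: "h" },
], 792);
console.log(calls.join("\n"));
```

Paths are the easy part. The harder questions are what happens once those paths have to be filled, blended, clipped and masked the way PDF requires.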

PDF → JS code → HTML5 Canvas: pdf.js

The “holy grail” would be to use JavaScript to directly read a PDF and draw it onto an HTML5 canvas. This would offer a number of benefits:

  • Vector graphics
  • Render the PDF directly rather than using an intermediate format (such as images or SVG)
  • Would not suffer from limitations of the previously outlined techniques
  • Consistent behaviour across browsers

Building such a system would seem a significant task, but it has in fact been attempted by the Mozilla Foundation in pdf.js. pdf.js is an impressive technical achievement, but close examination leads one to conclude that it unfortunately suffers from many usability and quality issues. This is not a reflection on pdf.js per se, but rather a technical limitation that would be inherent in any product that attempted to use JavaScript/HTML5 to render a PDF. Some of the problems we encountered:

1. Accuracy

From the get-go, pdf.js faced issues on the rendering side. For example, standard HTML5 Canvas does not support paths with dashes, the even-odd fill rule, or PDF blend modes. Since Mozilla developers were in control of their own browser, they were able to bandage Firefox with custom extensions (prefixed with moz-… ). Unfortunately these extensions are not part of the HTML5 standard and are not supported by all browsers, including the dominant mobile browsers. Also, even with all of the custom moz extensions, pdf.js can’t deal with some transparency groups, overprint, some soft masks, non-RGB color spaces, etc. Perhaps one day all browsers will add every extension required to accurately render a PDF, but the project clearly showed some limitations of implementing a complex graphics system in JS.

[Figures: side-by-side comparisons of the pdf.js rendering and the correct PDF rendering]
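To see why a missing fill rule changes what ends up on screen, here is a small self-contained sketch of the two rules PDF uses, nonzero winding and even-odd. It is a plain point-in-polygon test written for this illustration (the names are ours, it handles straight-edged subpaths only, and it is not code from pdf.js or any browser).

```typescript
type Point = [number, number];
type Subpath = Point[]; // polygon, implicitly closed

type FillRule = "nonzero" | "evenodd";

// Cast a ray from (px, py) towards +x and look at every edge it crosses.
function isInside(subpaths: Subpath[], px: number, py: number, rule: FillRule): boolean {
  let winding = 0;   // signed count: +1 for edges going up, -1 for edges going down
  let crossings = 0; // unsigned count
  for (const poly of subpaths) {
    for (let i = 0; i < poly.length; i++) {
      const a = poly[i];
      const b = poly[(i + 1) % poly.length];
      if (!a || !b) continue;
      const [x1, y1] = a;
      const [x2, y2] = b;
      // Does this edge straddle the horizontal line y = py?
      if ((y1 <= py) !== (y2 <= py)) {
        // x coordinate where the edge meets that line
        const xCross = x1 + ((py - y1) / (y2 - y1)) * (x2 - x1);
        if (xCross > px) {
          crossings += 1;
          winding += y2 > y1 ? 1 : -1;
        }
      }
    }
  }
  return rule === "nonzero" ? winding !== 0 : crossings % 2 === 1;
}

// Two nested squares drawn in the same direction.
const outer: Subpath = [[0, 0], [10, 0], [10, 10], [0, 10]];
const inner: Subpath = [[3, 3], [7, 3], [7, 7], [3, 7]];

console.log(isInside([outer, inner], 5, 5, "nonzero")); // centre of the inner square
console.log(isInside([outer, inner], 5, 5, "evenodd"));
```

With these two squares, the centre point counts as inside under the nonzero rule but outside under even-odd, so the same path is either a solid square or a square with a hole in it. A renderer that only has one of the two rules available will draw some documents incorrectly.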

2. Performance

JavaScript is much slower than native code. Despite using GPU accelerated canvas rendering, viewing PDFs in pdf.js is slower than native viewers/plug-ins that do not use hardware acceleration. Native viewers will always be able to stay one step ahead of JavaScript viewers in terms of performance.

3. Reliability

Mobile browsers do not respond well when they run out of memory: they simply exit, i.e. crash. Because PDF documents can be large and use complex resources it is not difficult to exceed the limit. (The same issues exist on the desktop, but thanks to large amounts of RAM and virtual memory, they are less critical.)

4. Usability

Because pdf.js uses PDF documents ‘as is’, it is more likely than not that the documents have not been “linearized”, that is, saved in a format that is streamable over the web. This means that the entire document must be downloaded (and stored in memory) before it can be rendered, leaving the user waiting. Although this issue is not specific to a JavaScript viewer, it is a drawback to using PDF documents that have not been processed for online viewing.
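Streaming of this kind generally rests on HTTP range requests: the viewer asks the server for just the bytes it needs instead of the whole file. As a minimal sketch, assume the viewer already knows where one page's data lives in the file (the `PageSpan` shape below is hypothetical). The request header for that page would then be built like this:

```typescript
// Where one page's data lives inside the file. Hypothetical shape.
interface PageSpan {
  offset: number; // first byte of the page data
  length: number; // number of bytes
}

// Build the value of an HTTP Range header. Both ends of a byte range
// are inclusive, hence the "- 1".
function rangeHeader(span: PageSpan): string {
  if (span.offset < 0 || span.length <= 0) {
    throw new RangeError("offset must be >= 0 and length must be > 0");
  }
  const last = span.offset + span.length - 1;
  return `bytes=${span.offset}-${last}`;
}

console.log(rangeHeader({ offset: 5000, length: 200 }));
```

A server that supports range requests answers with 206 Partial Content and only the requested bytes. When the viewer has no way of knowing which bytes a page needs, it has little choice but to fetch everything first.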

A solution: PDF → PDFNet → JS code → HTML5: WebViewer

What can be done to resolve these shortcomings? The source of the problems is that PDF documents can simply be too big and complicated to be competently handled by a pure JavaScript/HTML5 Canvas solution. So, perhaps with some pre-processing, a PDF can be normalized to a format that can be properly handled by a pure JavaScript/HTML5 Canvas viewer. What needs to be done?

  • Optimize the file for fast random access loading. This means that any page could be fetched and displayed regardless of which other pages in the document have already been downloaded.
  • Downsample high resolution images so that they do not consume large amounts of memory, which is a real problem on mobile devices.
  • Reduce the complexity of a document for accurate and efficient display on mobile devices. This means analyzing a PDF page element-by-element, looking for simplifications and alternate representations of content that are known to be compatible with HTML5 Canvas. This may also mean rasterizing content that cannot in any way be accurately rendered by an HTML5 Canvas.
  • Normalize all images to a form that can be natively decoded by a browser.
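The downsampling step is easy to reason about in isolation. As a minimal sketch (the pixel budget and the function name are assumptions for illustration, not how WebViewer chooses its limits), the target size for an image can be derived from a budget of pixels while keeping the aspect ratio:

```typescript
interface Size {
  width: number;
  height: number;
}

// Pick output dimensions for an image so that it uses at most about
// maxPixels pixels. Images already within the budget are left alone.
function downsampleToBudget(src: Size, maxPixels: number): Size {
  const pixels = src.width * src.height;
  if (pixels <= maxPixels) {
    return { width: src.width, height: src.height };
  }
  // Scale both sides by the same factor so the aspect ratio is kept.
  const scale = Math.sqrt(maxPixels / pixels);
  return {
    width: Math.max(1, Math.floor(src.width * scale)),
    height: Math.max(1, Math.floor(src.height * scale)),
  };
}

// A 12 megapixel scan squeezed into a 3 megapixel budget.
const scan: Size = { width: 4000, height: 3000 };
const target = downsampleToBudget(scan, 3_000_000);
console.log(`${scan.width}x${scan.height} -> ${target.width}x${target.height}`);
```

At 4 bytes per decoded pixel, that example goes from roughly 48 MB of decoded bitmap to roughly 12 MB, which can be the difference between a page that displays and a mobile browser that exits.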

So how well does this work? After 3+ years of implementing these optimizations for WebViewer, we are able to say that it indeed works very well. Once the PDF has been optimized for web viewing, all of pdf.js’s shortcomings melt away, and viewing is

  • fast
  • reliable
  • high-quality
  • cross-browser
  • mobile-friendly

These optimized documents have also served as a good basis for implementing PDF features such as interactive forms and annotations.

Conclusions

Displaying a PDF within a web browser is by no means trivial. What is clear is that for accurate and reliable viewing, the PDF needs to be “normalized” to a web-friendly representation. Some normalization methods, such as converting to images, do work, but with limitations. Sophisticated normalization, such as what is done for WebViewer, offers an experience that approaches that of a native PDF viewer.

April 2015 Update

What a difference 18 months makes. Most of the article above holds; however, new technology and an innovative approach have allowed us to provide reliable and correct in-browser PDF rendering without the need to pre-process. (And no, not by using pdf.js; its problems remain.) Check out the newly released WebViewer 2.0, and our post on PDFNetJS.

3 thoughts on “How To Integrate a PDF Viewer into HTML5 Apps”

  1. xodo Post author

    PDFTron also recently gave a presentation in Seattle for a PDF Tech conference, and the second part of this presentation deals with PDF to HTML, and other alternatives.

    (fast forward 18 min in the video)
