A simple example of converting PDF to HTML

We have received lots of interest in our new PDF to HTML/EPUB conversion since it was released in PDFNet 6.0. With this interest we have also gotten questions on customizing the output. So today I’ll provide a quick demo of converting a PDF to HTML using PDFNet.

In another post I will go into some of the details particular to PDF to EPUB conversion, but everything in today’s post applies to both HTML and EPUB output.

PDFNet is available in C/C++, Java, Objective-C, Python, Ruby, PHP, VB and C#. I chose C# for this demo because of its popularity, but the PDFNet API is consistent enough across languages that you should be able to translate the code easily.

Setup

First, download PDFNet from our download page.
http://www.pdftron.com/pdfnet/downloads.html
For this demo I downloaded PDFNet for Windows Desktop .NET 4+, but you can just as easily use any of our desktop versions (including Linux and Mac).

After unzipping the download, navigate to the Samples folder and open one of the Visual Studio solutions. I chose Samples_2013.sln.

Once in Visual Studio, right-click the ConvertTestCS2013 project and select Set as Startup Project.

For this demo, we will simulate the following requirements:

  • Convert only odd-numbered pages.
  • Target iOS devices.
  • High image quality (DPI).
  • Use PNG instead of JPG.
  • No HTML hyperlinks to URLs outside of the document.

Since we are targeting iOS, a quick look at Apple’s official Safari iOS resource limits (see the Safari Web Content Guide) shows that we want a 3 megapixel (MP) limit on images. We will also crank up the DPI so the output looks as good as possible on a retina display.

Code

Here then is the code to accomplish the above.

using (PDFDoc doc = new PDFDoc(inputPath + "newsletter.pdf"))
{
    doc.InitSecurityHandler();
    // remove all even pages
    if(doc.GetPageCount() > 1)
    {
        PageIterator itr = doc.GetPageIterator();
        itr.Next(); // skip first page
        while (itr.HasNext())
        {
            doc.PageRemove(itr); // remove even pages
            itr.Next();
        }
    }
    pdftron.PDF.Convert.HTMLOutputOptions options = new pdftron.PDF.Convert.HTMLOutputOptions();
    options.SetInternalLinks(true);
    options.SetExternalLinks(false);
    options.SetPreferJPG(false);
    options.SetDPI(300);
    options.SetMaximumImagePixels(3000000);
    options.SetSimplifyText(true);
    options.SetScale(2.0);
    pdftron.PDF.Convert.ToHtml(doc, outputPath + "newsletter_odd_pages", options);
}

What does all the code above mean?

After initializing the library and opening the document, we first modify the document in memory by removing the even-numbered pages. As long as we do not call PDFDoc.Save(), these changes do not affect the original source file.
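To sanity-check which pages should survive the removal step, here is a minimal sketch that simulates it on a plain list of page numbers. It does not use PDFNet at all; the class and method names (OddPages, KeepOdd) are hypothetical and exist only for this illustration. It walks backwards from the last even page so that a removal never shifts a page that has not been visited yet.

```csharp
using System.Collections.Generic;

// Hypothetical helper for illustration only; not part of PDFNet.
public static class OddPages
{
    // Builds the 1-based page list 1..pageCount, removes every
    // even-numbered page, and returns what is left.
    public static List<int> KeepOdd(int pageCount)
    {
        var pages = new List<int>();
        for (int n = 1; n <= pageCount; n++)
        {
            pages.Add(n);
        }

        // Start at the last even page and step down by two.
        // Removing from the back leaves earlier indices untouched.
        int lastEven = pageCount - (pageCount % 2);
        for (int n = lastEven; n >= 2; n -= 2)
        {
            pages.RemoveAt(n - 1); // page n lives at index n - 1
        }
        return pages;
    }
}
```

For a five-page document, OddPages.KeepOdd(5) leaves pages 1, 3 and 5, which is the set of pages the conversion in this post is meant to output.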

Tip: There are many more code examples showing how to use PDFNet, available in the downloaded samples and on our forum.
www.pdftron.com/pdfnet/samplecode.html
https://groups.google.com/forum/?fromgroups#!forum/pdfnet-sdk

PDF to HTML Options

Now on to the PDF to HTML code.

options.SetInternalLinks(true);
options.SetExternalLinks(false);

The first line enables internal links, so that any links within the PDF, such as a table of contents, are carried over to the HTML. The second line disables any links that would take the reader outside of the document, such as links to another website.
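If you want to confirm that no external links made it into the output, one option is to scan the generated HTML yourself. The sketch below is an assumption-laden illustration, not part of PDFNet: the LinkCheck class is hypothetical, and it assumes that hyperlinks appear as ordinary quoted href attributes on anchor tags, which may not match every detail of the real output. It treats any target with a URI scheme (http:, mailto: and so on) or a protocol-relative prefix as external.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration only; not part of PDFNet.
public static class LinkCheck
{
    // Captures the quoted href value of each <a ...> tag.
    private static readonly Regex AnchorHref = new Regex(
        @"<a\b[^>]*?\bhref\s*=\s*[""']([^""']*)[""']",
        RegexOptions.IgnoreCase);

    // A target is considered external when it starts with a URI scheme
    // ("http:", "mailto:", ...) or with "//" (protocol-relative).
    private static readonly Regex ExternalTarget = new Regex(
        @"^([a-z][a-z0-9+.-]*:|//)", RegexOptions.IgnoreCase);

    // Returns every anchor target in the HTML that looks external.
    public static List<string> ExternalLinks(string html)
    {
        var found = new List<string>();
        foreach (Match m in AnchorHref.Matches(html))
        {
            string target = m.Groups[1].Value;
            if (ExternalTarget.IsMatch(target))
            {
                found.Add(target);
            }
        }
        return found;
    }
}
```

You would read the converted HTML into a string and pass it to LinkCheck.ExternalLinks; an empty list suggests the external links were dropped as requested. A regular expression is only a rough check, so use a real HTML parser if you need something robust.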

options.SetPreferJPG(false);
options.SetDPI(300);
options.SetMaximumImagePixels(3000000);

Next, we turn on PNG image output and increase the image DPI to 300, but set a 3 MP limit so as not to overload an iOS device. The result is that PNGs will be generated at 300 DPI, except where that would put the image over 3 MP. In that case, the image will be down-sampled to the highest DPI that keeps it under 3 MP.
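To get a feel for how the DPI setting and the pixel limit interact, here is a small sketch of the arithmetic. It is only an illustration under stated assumptions: the ImageBudget class is hypothetical, and it assumes the limit applies per image and that down-sampling scales both dimensions uniformly. It is not a description of PDFNet's internal algorithm.

```csharp
using System;

// Hypothetical helper for illustration only; not part of PDFNet.
public static class ImageBudget
{
    // Returns the highest DPI, capped at requestedDpi, at which an image
    // of the given physical size (in inches) stays within maxPixels.
    public static double EffectiveDpi(double widthIn, double heightIn,
                                      double requestedDpi, long maxPixels)
    {
        double pixels = (widthIn * requestedDpi) * (heightIn * requestedDpi);
        if (pixels <= maxPixels)
        {
            return requestedDpi; // already within the budget
        }
        // Pixel count grows with the square of the DPI,
        // so scale the DPI by the square root of the ratio.
        return requestedDpi * Math.Sqrt(maxPixels / pixels);
    }
}
```

For example, a full-page image on an 8.5 x 11 inch page would need 2550 x 3300 = 8,415,000 pixels at 300 DPI, well over the 3,000,000 limit, so under these assumptions it would come out at roughly 179 DPI. A 2 x 3 inch image needs only 540,000 pixels and would keep the full 300 DPI.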

options.SetSimplifyText(true);

Here, we enable text optimization. This attempts to merge text runs in the PDF file, which reduces both HTML DOM complexity and HTML file size. The trade-off is that text placement may not exactly match the PDF, but the difference is typically not noticeable to the human eye, even when viewing the output side by side with the original. In return, download, layout, and rendering times are reduced.

options.SetScale(2.0);

Finally, we scale the HTML output so that it is easier to read in the browser, without having to rely on the browser to zoom.

DocPub CLI

For those who prefer command line tools, here is how you would get the same output using our DocPub command line tool.

docpub.exe -f html --internal_links --prefer_jpg false --dpi 300 --max_image_pixels 3000000 --simplify_text --scale 2.0 input.pdf

You can download DocPub for Windows, Mac and Linux from here.
http://www.pdftron.com/docpub/downloads.html

Conclusion

I hope that you find this information useful and that you give PDF to HTML conversion a test drive soon. Stay tuned for more information on creating EPUBs!

References:

High Quality EPUB / HTML From PDF : https://blog.pdftron.com/2013/11/15/high-quality-epub-html-from-pdf/

 
