Semantic Content Recognition in PDF

Semantic content recognition is the ability to identify the components of a document by their “class”: that is, whether a given piece of content constitutes a title, subtitle, section, paragraph, word, figure, caption, table, and so on. Despite decades of research, this problem remains open. Available solutions are unreliable and fall far short of what a human reader can do.

At the 2015 PDF Technical Conference, PDFTron’s CTO gave a presentation addressing the problem of semantic content recognition in PDF. The presentation gives an overview of the problem, explains why it has been so hard to solve, and suggests how the industry as a whole might organize itself to finally develop solutions that match the accuracy of a person.
