Getting Serious About PDF Content Integration
An introduction to enterprise-class PDF text and metadata extraction

The PDF file format has become the gold standard of document distribution and archiving. It's therefore virtually certain that data critical to your organization is sitting quietly in PDF documents somewhere. This situation means you have to get serious about integrating PDF content into your applications - taking shortcuts in this area and not finding or leveraging that mission-critical data can lead to millions of dollars in lost sales and/or similar levels of increased costs, compliance difficulties, or liability entanglements. It's time to use a high-performance PDF component that will yield accurate text and metadata extracts suitable for use with your existing search, content management, text analysis/mining, CRM, or other system(s).

However, the PDF file format is complex and wasn't designed for content extraction. So, any PDF library that doesn't specialize in content extraction is likely to exhibit various undesirable traits:

  • Poor performance or performance degradation in high-volume environments
  • Poor text extract accuracy
  • Incomplete PDF file format support
  • Lack of or limited support for extracting Unicode text, including Chinese, Japanese, and Korean text
  • A complicated API that requires knowing the PDF file format
  • Lack of any tools for identifying and converting unstructured data

If the PDF library you're using exhibits any of these problems, it's time to upgrade to an enterprise-class component that specializes in content extraction. And if your application doesn't yet use PDF content, it makes sense to skip the training wheels and use the best tool for the job from the start.

PDFTextStream fits the bill either way. A pure Java library (also available for .NET and Python), PDFTextStream specializes in extracting text and metadata from PDF documents. Because of this focus, it has none of the downsides that are all too common when a general-purpose PDF library is used for content extraction.

This introduction will cover just a few of the use cases where PDFTextStream's focus on content extraction yields significant value.

Simple/Powerful Text Extraction
PDF documents specify their text content a character at a time without any indication of each page's physical layout (such as lines, paragraphs, columns, tables, etc.). Thankfully, PDFTextStream automatically derives these structures for every page it extracts using state-of-the-art page segmentation and read-ordering processes - similar to how an OCR application derives the structure of a scanned document. And thankfully again, this accuracy doesn't come at the expense of speed or ease of use.
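
To make the idea of deriving layout from individually positioned characters concrete, here is a deliberately naive sketch. It is not PDFTextStream's algorithm, which the article does not describe; it only groups characters into lines by baseline and orders each line left to right. The Glyph type, the tolerance parameter, and the coordinates are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class NaiveLineGrouper {
    /** A character and the position of its baseline origin (hypothetical type). */
    static final class Glyph {
        final char ch;
        final double x;
        final double y;
        Glyph(char ch, double x, double y) { this.ch = ch; this.x = x; this.y = y; }
    }

    /** Groups glyphs whose baselines lie within tolerance of each other; top line first. */
    static List<String> toLines(List<Glyph> glyphs, double tolerance) {
        List<Glyph> sorted = new ArrayList<>(glyphs);
        // PDF's y axis points up, so a larger y means higher on the page.
        sorted.sort(Comparator.comparingDouble((Glyph g) -> -g.y));

        List<List<Glyph>> lines = new ArrayList<>();
        double lineY = Double.NaN;
        for (Glyph g : sorted) {
            // Start a new line when this glyph sits too far below the current line's first glyph.
            if (lines.isEmpty() || Math.abs(g.y - lineY) > tolerance) {
                lines.add(new ArrayList<>());
                lineY = g.y;
            }
            lines.get(lines.size() - 1).add(g);
        }

        List<String> result = new ArrayList<>();
        for (List<Glyph> line : lines) {
            // Within a line, reading order is left to right.
            line.sort(Comparator.comparingDouble((Glyph g) -> g.x));
            StringBuilder sb = new StringBuilder();
            for (Glyph g : line) sb.append(g.ch);
            result.add(sb.toString());
        }
        return result;
    }

    public static void main(String[] args) {
        List<Glyph> glyphs = List.of(
            new Glyph('i', 16, 700), new Glyph('H', 10, 700.4), new Glyph('!', 10, 680));
        System.out.println(toLines(glyphs, 2.0));
    }
}
```

Even this toy version needs a tolerance for slightly uneven baselines, and it ignores columns, rotated text, and word spacing entirely, which gives a sense of how much work a real segmentation process has to do.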

As the following code shows, extracting text with PDFTextStream takes only a few lines:

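import com.snowtide.pdf.OutputTarget;
import com.snowtide.pdf.PDFTextStream;

// Buffer that will receive the extracted text
StringBuffer pdfText = new StringBuffer(1024);
OutputTarget tgt = new OutputTarget(pdfText);
PDFTextStream stream = new PDFTextStream(pdfFile);
// Send the text of every page to the target, then release the file
stream.pipe(tgt);
stream.close();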

The full text of the PDF file is now available in the pdfText StringBuffer.

OutputTarget is the default implementation of the com.snowtide.pdf.OutputHandler interface, which can be thought of as a SAX interface for PDFTextStream document model events. These events are generated any time an OutputHandler is passed into a pipe(OutputHandler) function, which is available on many document model objects as well (com.snowtide.pdf.Page, com.snowtide.pdf.layout.Block, and com.snowtide.pdf.layout.Line).

OutputTarget's primary purpose is to provide a straightforward way to direct extracted text to a StringBuffer or a java.io.Writer. Further, OutputTarget passes through PDFTextStream's default text layout: content is in the proper semantic order, columns of text are separated, and rotated text is normalized and grouped in reasonable ways. This is really important if the PDF text you're extracting is going to be used as input to a semantically sensitive process, such as text mining or search engine indexing.

There are many OutputHandler implementations included with PDFTextStream, each of which interprets and processes PDF text events differently. If none of them meet your application's needs, you can easily write your own.
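
The SAX-style pattern behind a custom handler can be sketched without the library. The article does not list the methods of com.snowtide.pdf.OutputHandler, so the TextEvents interface, its three callbacks, and the pipe(...) helper below are hypothetical stand-ins that only illustrate the shape of the idea: a source walks the document and fires events, and the handler decides what to do with them.

```java
import java.util.List;

class HandlerSketch {
    /** Hypothetical event interface; NOT the real OutputHandler API. */
    interface TextEvents {
        void startPage(int pageNumber);
        void text(String chunk);
        void endPage(int pageNumber);
    }

    /** A custom handler that counts pages and whitespace-separated words instead of storing text. */
    static final class WordCounter implements TextEvents {
        private int words;
        private int pages;

        public void startPage(int pageNumber) { pages++; }

        public void text(String chunk) {
            String trimmed = chunk.trim();
            if (!trimmed.isEmpty()) words += trimmed.split("\\s+").length;
        }

        public void endPage(int pageNumber) { }

        int words() { return words; }
        int pages() { return pages; }
    }

    /** Plays the role of a pipe(...) call: walks the pages and fires events at the handler. */
    static void pipe(List<List<String>> pages, TextEvents handler) {
        for (int i = 0; i < pages.size(); i++) {
            handler.startPage(i + 1);
            for (String chunk : pages.get(i)) handler.text(chunk);
            handler.endPage(i + 1);
        }
    }

    public static void main(String[] args) {
        WordCounter counter = new WordCounter();
        pipe(List.of(List.of("Hello world"), List.of("one two three")), counter);
        System.out.println(counter.words() + " words on " + counter.pages() + " pages");
    }
}
```

The appeal of the event model is that a handler like this never holds the whole document in memory; it sees one chunk at a time and keeps only the state it needs.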

Unicode Text Extraction
Today's global economy demands that your application be world-ready, in any major language. Thankfully, PDFTextStream always extracts text from PDF documents as Unicode (a perfect match for Java's consistent and thorough Unicode support). Further, PDFTextStream extracts Chinese, Japanese, and Korean (CJK) text from PDF documents without any performance penalties.

Nothing special needs to be done to enable these capabilities - they're always on, so you can use the simplest code and always get Unicode and CJK text out of your source PDF documents.
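
One point that sits outside the library is still worth a sketch: once Unicode text is sent to a java.io.Writer, the encoding of that Writer decides whether CJK characters survive. The example below uses only the standard library, with a sample string standing in for extracted text; the class and method names are made up for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

class Utf8WriterSketch {
    /** Writes text through a Writer with an explicit UTF-8 encoding and returns the bytes. */
    static byte[] writeUtf8(String extractedText) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        // Naming the charset avoids depending on the platform default encoding.
        try (Writer out = new OutputStreamWriter(bytes, StandardCharsets.UTF_8)) {
            out.write(extractedText);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) {
        String sample = "PDF \u6587\u672c"; // stand-in for extracted CJK text
        byte[] encoded = writeUtf8(sample);
        System.out.println(new String(encoded, StandardCharsets.UTF_8).equals(sample));
    }
}
```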

Search Engine Integration
PDFTextStream was designed to be easily integrated into other applications, including content management systems, text mining processes, and, of course, search engines. A great example is its Lucene integration module, which produces Lucene documents using the content extracted from PDF files. Building a Lucene document that contains all of the text in a PDF file requires one line of code:

Document luceneDoc = com.snowtide.pdf.lucene.PDFDocumentFactory.buildPDFDocument(pdfFile);

You can customize the Lucene document via com.snowtide.pdf.lucene.DocumentFactoryConfig: whether PDF document attributes (such as the author's name, title, and creation date) are included, and which indexing, tokenizing, and storage parameters the document uses.

Also of interest to those who work with search engines, PDFTextStream enables Web crawlers to source new URLs to retrieve from PDF documents - see Enabling PDF Web Crawling below.

Metadata, Metadata Everywhere
Utilizing the metadata embedded in many PDF documents can add a great deal of value to your applications. PDFTextStream gives you easy access to the full world of PDF metadata:

  • Document attributes (as a key/value Map or in Adobe XMP XML format)
  • Document outline/bookmarks
  • AcroForm data (interactive form data)
  • PDF annotations (text notes, embedded URL links, etc.)

There's clearly a ton of metadata that you could work with; let's dig into a couple of examples.

The Bulk Metadata Import
Consider a scenario where you need to load PDF documents into a content management system. A common requirement would be for each document's author, title, and creation date to be imported as well. Let's retrieve those attributes:

import java.util.Date;
import com.snowtide.pdf.PDFDateParser;
import com.snowtide.pdf.PDFTextStream;

PDFTextStream stream = new PDFTextStream(pdfFile);
Object author = stream.getAttribute(PDFTextStream.ATTR_AUTHOR);
Object title = stream.getAttribute(PDFTextStream.ATTR_TITLE);
Object createDtStr = stream.getAttribute(PDFTextStream.ATTR_CREATION_DATE);
// Attribute values are Objects; parse the date only when it is a String
Date createDt = null;
if (createDtStr instanceof String) {
    createDt = PDFDateParser.parseDateString((String) createDtStr);
}
stream.close();

From here, you could easily add the metadata associated with each PDF document to the CMS. This code is straightforward, but there are some points worth noting:

  • The PDFTextStream class provides a set of attribute name constants, making standard attribute lookups easy.
  • The getAttribute(String) function returns an Object, not a String - this is because PDF files can technically specify attribute values of various types.
  • PDF date strings have a standard format; the com.snowtide.pdf.PDFDateParser.parseDateString(String) function can be used to convert PDF date Strings into java.util.Date objects.
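
For readers curious about that standard format, PDF date strings follow the pattern D:YYYYMMDDHHmmSSOHH'mm', where everything after the year is optional and O is +, -, or Z. The sketch below is an independent illustration built on java.time, not the implementation of PDFDateParser. The class name is made up, and treating a missing time zone as UTC is an assumption of this example.

```java
import java.time.OffsetDateTime;
import java.time.ZoneOffset;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PdfDateSketch {
    // Groups: 1 year, 2 month, 3 day, 4 hour, 5 minute, 6 second, 7 zone sign, 8 zone hours, 9 zone minutes
    private static final Pattern PDF_DATE = Pattern.compile(
        "^(?:D:)?(\\d{4})(\\d{2})?(\\d{2})?(\\d{2})?(\\d{2})?(\\d{2})?"
        + "(?:([Z+-])(?:(\\d{2})'?(\\d{2})?'?)?)?$");

    static OffsetDateTime parse(String pdfDate) {
        Matcher m = PDF_DATE.matcher(pdfDate.trim());
        if (!m.matches()) {
            throw new IllegalArgumentException("Not a PDF date string: " + pdfDate);
        }
        ZoneOffset offset = ZoneOffset.UTC; // assumption: no zone information means UTC
        String sign = m.group(7);
        if (sign != null && !sign.equals("Z")) {
            int seconds = num(m.group(8), 0) * 3600 + num(m.group(9), 0) * 60;
            offset = ZoneOffset.ofTotalSeconds(sign.equals("-") ? -seconds : seconds);
        }
        // Missing month and day default to 1; missing time fields default to 0.
        return OffsetDateTime.of(
            num(m.group(1), 0), num(m.group(2), 1), num(m.group(3), 1),
            num(m.group(4), 0), num(m.group(5), 0), num(m.group(6), 0), 0, offset);
    }

    private static int num(String digits, int fallback) {
        return digits == null ? fallback : Integer.parseInt(digits);
    }

    public static void main(String[] args) {
        System.out.println(parse("D:20090715103000-05'00'"));
    }
}
```

If a java.util.Date is needed, as in the snippet above, java.util.Date.from(parsed.toInstant()) converts the result.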

Enabling PDF Web Crawling
PDF documents can contain Internet URLs, but many Web crawlers don't look for and follow such links. Here, we'll retrieve the embedded PDF annotations that contain URL links, which could then be retrieved by a Web crawler.

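import java.util.ArrayList;
import java.util.List;
import com.snowtide.pdf.PDFTextStream;
import com.snowtide.pdf.annot.Annotation;
import com.snowtide.pdf.annot.LinkAnnotation;

PDFTextStream stream = new PDFTextStream(pdfFile);
List<Annotation> annots = stream.getAllAnnotations();
ArrayList<String> uriList = new ArrayList<String>();
for (Annotation annot : annots) {
    // Only link annotations can carry a URL
    if (annot instanceof LinkAnnotation) {
        LinkAnnotation link = (LinkAnnotation) annot;
        // Constant-first equals avoids a NullPointerException if no action name is set
        if ("URI".equals(link.getLinkActionName())) {
            uriList.add(link.getURI());
        }
    }
}
stream.close();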

This example adds all of the available URLs in the PDF document to the uriList ArrayList. The process is very simple: find all of the PDF annotations of type com.snowtide.pdf.annot.LinkAnnotation, and ignore any LinkAnnotations that do not have an "action name" of URI. There are a variety of link action names, each of which behaves differently in a PDF viewer. Only URI LinkAnnotations contain a URL, which is retrieved using the getURI() function.
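
Before a list like this is handed to a crawler, it usually pays to clean it up. The following sketch uses only java.net.URI from the standard library; the class name, the method name, and the policy of keeping only absolute http and https links are assumptions made for illustration, not part of PDFTextStream.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

class CrawlQueueSketch {
    /** Keeps well-formed absolute http/https URLs, normalized, in first-seen order, without duplicates. */
    static List<String> crawlable(List<String> rawUris) {
        Set<String> seen = new LinkedHashSet<>();
        for (String raw : rawUris) {
            if (raw == null) continue;
            try {
                // normalize() collapses "." and ".." path segments so equivalent links compare equal.
                URI uri = new URI(raw.trim()).normalize();
                String scheme = uri.getScheme();
                boolean web = scheme != null
                    && (scheme.equalsIgnoreCase("http") || scheme.equalsIgnoreCase("https"));
                if (web && uri.getHost() != null) {
                    seen.add(uri.toString());
                }
            } catch (URISyntaxException e) {
                // A malformed link in one PDF should not stop the crawl; skip it.
            }
        }
        return new ArrayList<>(seen);
    }

    public static void main(String[] args) {
        System.out.println(crawlable(List.of(
            "http://example.com/a/../b", "http://example.com/b", "mailto:someone@example.com")));
    }
}
```

Links embedded in PDF files are often typed by hand, so tolerating malformed entries matters more here than it would for links harvested from generated HTML.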

Identifying and Converting Unstructured Data
Coping with "unstructured" data is a popular topic these days, mostly because:

  • It's being recognized that unstructured data represents most of the data generated and received by most organizations
  • Significant operational advantages can be achieved only if organizations can identify, convert, and harness the available unstructured data

Given that PDF documents are a primary vehicle for unstructured data, it's worth noting that PDFTextStream provides some tools to make extracting this data easier. The use of these tools is beyond the scope of this article, so please refer to the PDFTextStream documentation for details.

First, PDFTextStream provides a table API (com.snowtide.pdf.layout.Table) that represents the data of any table that PDFTextStream can detect while processing a PDF document. This API can be used as the basis of a process that converts tabular data found in PDF documents into CSV or Excel files, or directly into database records.
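
The conversion step itself is easy to sketch. The accessors of com.snowtide.pdf.layout.Table are not shown in this article, so the example below assumes the cell text has already been copied into nested lists, one inner list per row; the class and method names are hypothetical. The quoting follows the common RFC 4180 conventions for CSV.

```java
import java.util.List;
import java.util.stream.Collectors;

class TableCsvSketch {
    /** Quotes a cell when it contains a comma, a double quote, or a line break. */
    static String escape(String cell) {
        if (cell == null) return "";
        boolean needsQuotes = cell.contains(",") || cell.contains("\"")
            || cell.contains("\n") || cell.contains("\r");
        // Embedded double quotes are escaped by doubling them.
        String doubled = cell.replace("\"", "\"\"");
        return needsQuotes ? "\"" + doubled + "\"" : doubled;
    }

    /** Renders rows of cell text as CSV, one CRLF-terminated record per row. */
    static String toCsv(List<List<String>> rows) {
        StringBuilder csv = new StringBuilder();
        for (List<String> row : rows) {
            csv.append(row.stream().map(TableCsvSketch::escape).collect(Collectors.joining(",")));
            csv.append("\r\n");
        }
        return csv.toString();
    }

    public static void main(String[] args) {
        System.out.print(toCsv(List.of(
            List.of("Region", "Q1 sales"),
            List.of("North, East", "1,200"))));
    }
}
```

Figures with thousands separators, such as "1,200", are common in financial PDF tables, which is why the quoting step cannot be skipped.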

Secondly, for broader unstructured data conversion purposes (or for tabular data that can't be detected automatically through its table API), PDFTextStream provides VisualOutputTarget, an OutputHandler implementation that renders PDF text to a StringBuffer or java.io.Writer while maintaining the visual layout of each page of text. This maintains the visual alignment of table columns and other textual elements, which makes text extracts retrieved using VisualOutputTarget ideal for input into downstream text analysis and mining tools.
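
To show why preserved alignment helps downstream tools, here is a small sketch that recovers columns from visually aligned lines of text. It does not call VisualOutputTarget; the input is assumed to be layout-preserving output of the kind described above, and the class name, method name, and minGap parameter are hypothetical. The approach finds character positions that are blank in every line and treats blank runs of at least minGap characters as column separators.

```java
import java.util.ArrayList;
import java.util.List;

class AlignedColumnsSketch {
    /** Splits visually aligned lines into cells at blank runs shared by all lines. */
    static List<List<String>> split(List<String> lines, int minGap) {
        int width = 0;
        for (String line : lines) width = Math.max(width, line.length());

        // used[i] is true when at least one line has a non-space character at position i.
        boolean[] used = new boolean[width];
        for (String line : lines) {
            for (int i = 0; i < line.length(); i++) {
                if (line.charAt(i) != ' ') used[i] = true;
            }
        }

        // Interior blank runs narrower than minGap belong to a column (the space in "Widget A").
        int pos = 0;
        while (pos < width) {
            if (used[pos]) { pos++; continue; }
            int end = pos;
            while (end < width && !used[end]) end++;
            if (pos > 0 && end < width && end - pos < minGap) {
                for (int k = pos; k < end; k++) used[k] = true;
            }
            pos = end;
        }

        // Each maximal run of used positions is one column, stored as [start, end).
        List<int[]> columns = new ArrayList<>();
        int start = -1;
        for (int i = 0; i <= width; i++) {
            boolean inColumn = i < width && used[i];
            if (inColumn && start < 0) start = i;
            if (!inColumn && start >= 0) {
                columns.add(new int[] {start, i});
                start = -1;
            }
        }

        List<List<String>> rows = new ArrayList<>();
        for (String line : lines) {
            List<String> cells = new ArrayList<>();
            for (int[] col : columns) {
                int from = Math.min(col[0], line.length());
                int to = Math.min(col[1], line.length());
                cells.add(line.substring(from, to).trim());
            }
            rows.add(cells);
        }
        return rows;
    }

    public static void main(String[] args) {
        System.out.println(split(List.of(
            String.format("%-10s%3s%7s", "Item", "Qty", "Price"),
            String.format("%-10s%3s%7s", "Widget A", "12", "3.50")), 2));
    }
}
```

A heuristic like this works only because the alignment was preserved; on text that has been reflowed into reading order, the column positions it relies on no longer exist.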

Conclusion: Enterprise-Class, Indeed
The term "enterprise-class" typically means that a component is robust - that it can take a beating and still keep going, while maintaining high performance levels.

That describes PDFTextStream quite well. It's feature-rich, it has a high degree of PDF file format support, and it's just plain fast: in extensive benchmarking (conducted by Snowtide Informatics and posted for review and verification on its Web site), PDFTextStream is shown to be 223% to 1,141% faster than all other Java PDF libraries that are capable of text extraction. Even better, PDFTextStream clocks in 13% faster than pdftotext, the popular native C/C++ PDF text extraction utility that's part of the Xpdf project.

There's a right tool for every job, and in general, it's better to use a tool that is designed for the specific job at hand. Accurately extracting text and metadata from PDF documents with high levels of performance is a surprisingly difficult job that presents a complex set of problems. Given the importance of finding and accessing critical data available only in PDF documents, it makes sense to use a PDF content extraction library designed from the ground up to solve these problems expertly and without compromises. Doing so will ensure that your application and your users receive the greatest benefits of enterprise-class PDF content integration.
