First Look: Getting Serious About PDF Content Integration
An introduction to enterprise-class PDF text and metadata extraction
By: Chas Emerick
Oct. 1, 2006 06:45 PM
The PDF file format has become the gold standard of document distribution and archiving. It's therefore virtually certain that data critical to your organization is sitting quietly in PDF documents somewhere. That means you have to get serious about integrating PDF content into your applications: taking shortcuts in this area, and so failing to find or leverage that mission-critical data, can lead to millions of dollars in lost sales, comparable increases in costs, compliance difficulties, or liability entanglements. It's time to use a high-performance PDF component that will yield accurate text and metadata extracts suitable for use with your existing search, content management, text analysis/mining, CRM, or other systems.
PDFTextStream fits the bill. A pure Java library (also available for .NET and Python), PDFTextStream specializes in extracting text and metadata from PDF documents. Because of that focus, it has none of the downsides all too common when a general-purpose PDF library is used for content extraction. This introduction covers just a few of the use cases where PDFTextStream's focus on content extraction yields significant value.
Simple/Powerful Text Extraction
Now for some code. As you can see, extracting text using PDFTextStream is super-simple:
StringBuffer pdfText = new StringBuffer(1024);
The full text of the PDF file is now available in the pdfText StringBuffer.
OutputTarget is the default implementation of the com.snowtide.pdf.OutputHandler interface, which can be thought of as a SAX interface for PDFTextStream document model events. These events are generated any time an OutputHandler is passed into a pipe(OutputHandler) function, which is available on many document model objects as well (com.snowtide.pdf.Page, com.snowtide.pdf.layout.Block, and com.snowtide.pdf.layout.Line). OutputTarget's primary purpose is to provide a straightforward way to direct extracted text to a StringBuffer or a java.io.Writer.
Further, OutputTarget passes through PDFTextStream's default text layout: content is in the proper semantic order, columns of text are separated, and rotated text is normalized and grouped in reasonable ways. This is really important if the PDF text you're extracting is going to be used as input to a semantically sensitive process, such as text mining or search engine indexing.
There are many OutputHandler implementations included with PDFTextStream, each of which interprets and processes PDF text events differently. If none of them meets your application's needs, you can easily write your own.
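To make the SAX analogy concrete, here is a minimal, self-contained sketch of the event-handler pattern. It does not use PDFTextStream: the TextEvents interface, the CollectingHandler class, and the pipe method below are hypothetical stand-ins invented for illustration, and they should not be read as the library's actual OutputHandler API.

```java
import java.util.Arrays;
import java.util.List;

final class HandlerSketch {
    /** Hypothetical stand-in for an event-style handler; NOT the real OutputHandler interface. */
    interface TextEvents {
        void startPage(int pageNumber);
        void text(String run);
        void endLine();
        void endPage(int pageNumber);
    }

    /** Collects text events into a StringBuilder, ending each line with '\n'. */
    static final class CollectingHandler implements TextEvents {
        private final StringBuilder out;

        CollectingHandler(StringBuilder out) {
            this.out = out;
        }

        @Override public void startPage(int pageNumber) { /* nothing to emit */ }
        @Override public void text(String run) { out.append(run); }
        @Override public void endLine() { out.append('\n'); }
        @Override public void endPage(int pageNumber) { /* nothing to emit */ }
    }

    /** Hypothetical "document": replays pages (each a list of lines) as events. */
    static void pipe(List<List<String>> pages, TextEvents handler) {
        for (int p = 0; p < pages.size(); p++) {
            handler.startPage(p);
            for (String line : pages.get(p)) {
                handler.text(line);
                handler.endLine();
            }
            handler.endPage(p);
        }
    }

    /** Convenience wrapper: pipes the pages into a CollectingHandler and returns the text. */
    static String extract(List<List<String>> pages) {
        StringBuilder out = new StringBuilder(1024);
        pipe(pages, new CollectingHandler(out));
        return out.toString();
    }

    public static void main(String[] args) {
        List<List<String>> pages = Arrays.asList(
                Arrays.asList("Hello", "World"),
                Arrays.asList("Page two"));
        System.out.print(extract(pages));
    }
}
```

The point of the pattern is that the producer of events (the document model) knows nothing about where the text ends up; swapping in a different handler changes the output without touching the traversal code.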
Unicode Text Extraction
Nothing special needs to be done to enable Unicode and CJK text extraction: it's always on, so you can use the simplest code and always get Unicode and CJK text out of your source PDF documents.
Search Engine Integration
Document luceneDoc = com.snowtide.pdf.lucene.PDFDocumentFactory.buildPDFDocument(pdfFile);
The contents of the Lucene document, including whether PDF document attributes (such as author's name, title, creation date, etc.) should be included, as well as the Lucene document's indexing, tokenizing, and storage parameters, can all be customized (via com.snowtide.pdf.lucene.DocumentFactoryConfig).
Also of interest to those who work with search engines, PDFTextStream enables Web crawlers to source new URLs to retrieve from PDF documents (see Enabling PDF Web Crawling below).
Metadata, Metadata Everywhere
The Bulk Metadata Import
PDFTextStream stream = new PDFTextStream(pdfFile);
From here, you could easily add the metadata associated with each PDF document to the CMS. This code is straightforward, but there are some points worth noting:
Enabling PDF Web Crawling
PDF documents can contain Internet URLs, but many Web crawlers don't look for and follow such links. Here, we'll retrieve the embedded PDF annotations that contain URL links, so that a Web crawler can then follow them.
PDFTextStream stream = new PDFTextStream(pdfFile);
This example will add all of the available URLs in the PDF document to the uriList ArrayList. The process is very simple: find all of the PDF annotations of type com.snowtide.pdf.annot.LinkAnnotation, and ignore any LinkAnnotations that do not have an "action name" of URI. There are a variety of link action names, each of which has different behavior in a PDF viewer. Only URI LinkAnnotations contain a URL, which is retrieved using the getURI() function.
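The filtering step described above can be sketched without the library. In the following self-contained example, the Link class is a hypothetical stand-in for a link annotation (it is not com.snowtide.pdf.annot.LinkAnnotation), and the sample action names and URLs are illustrative assumptions. Only the rule itself, "keep links whose action name is URI and collect their URLs," comes from the text.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class LinkFilterSketch {
    /** Hypothetical stand-in for a link annotation: an action name plus an optional URI. */
    static final class Link {
        final String actionName;
        final String uri; // null unless the link carries a URI action

        Link(String actionName, String uri) {
            this.actionName = actionName;
            this.uri = uri;
        }
    }

    /** Returns the URIs of all links whose action name is "URI", in input order. */
    static List<String> collectUris(List<Link> links) {
        List<String> uriList = new ArrayList<>();
        for (Link link : links) {
            // Skip every other action type; only URI links carry a URL.
            if ("URI".equals(link.actionName) && link.uri != null) {
                uriList.add(link.uri);
            }
        }
        return uriList;
    }

    public static void main(String[] args) {
        List<Link> links = Arrays.asList(
                new Link("URI", "http://example.com/report"),
                new Link("GoTo", null), // an in-document jump, no URL
                new Link("URI", "http://example.org/"));
        System.out.println(collectUris(links));
    }
}
```

A crawler would then queue each entry of the returned list exactly as it queues links found in HTML pages.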
Identifying and Converting Unstructured Data
First, PDFTextStream provides a table API (com.snowtide.pdf.layout.Table) that represents the data of any table that PDFTextStream can detect while processing a PDF document. This API can be used as the basis of a process that converts tabular data found in PDF documents into CSV or Excel files, or directly into database records.
Second, for broader unstructured data conversion purposes (or for tabular data that can't be detected automatically through the table API), PDFTextStream provides VisualOutputTarget, an OutputHandler implementation that renders PDF text to a StringBuffer or java.io.Writer while maintaining the visual layout of each page of text. This preserves the visual alignment of table columns and other textual elements, which makes text extracts retrieved using VisualOutputTarget ideal for input into downstream text analysis and mining tools.
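As a rough illustration of the table-to-CSV conversion mentioned above, the sketch below assumes the detected table has already been copied into plain rows of strings. It does not call the Table API; the TableCsvSketch class and its methods are hypothetical names, and the quoting rules follow the common RFC 4180 convention rather than anything PDFTextStream prescribes.

```java
import java.util.Arrays;
import java.util.List;

final class TableCsvSketch {
    /** Quotes a cell if it contains a comma, quote, or line break; doubles embedded quotes. */
    static String escape(String cell) {
        if (cell == null) {
            return "";
        }
        boolean needsQuotes = cell.contains(",") || cell.contains("\"")
                || cell.contains("\n") || cell.contains("\r");
        String doubled = cell.replace("\"", "\"\"");
        return needsQuotes ? "\"" + doubled + "\"" : doubled;
    }

    /** Renders rows of cells as CSV text, one CRLF-terminated record per row. */
    static String toCsv(List<List<String>> rows) {
        StringBuilder csv = new StringBuilder();
        for (List<String> row : rows) {
            for (int i = 0; i < row.size(); i++) {
                if (i > 0) {
                    csv.append(',');
                }
                csv.append(escape(row.get(i)));
            }
            csv.append("\r\n");
        }
        return csv.toString();
    }

    public static void main(String[] args) {
        List<List<String>> rows = Arrays.asList(
                Arrays.asList("Region", "Sales"),
                Arrays.asList("North, East", "1,200"));
        System.out.print(toCsv(rows));
    }
}
```

Quoting matters here because table cells extracted from PDFs frequently contain commas (thousands separators, multi-part names) that would otherwise shift columns in the resulting file.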
Conclusion: Enterprise-Class, Indeed
That describes PDFTextStream quite well. It's feature-rich, it has a high degree of PDF file format support, and it's just plain fast: in extensive benchmarking (conducted by Snowtide Informatics and posted for review and verification on its Web site), PDFTextStream is shown to be 223% to 1,141% faster than all other Java PDF libraries that are capable of text extraction. Even better, PDFTextStream clocks in as 13% faster than pdftotext, the popular native C/C++ PDF text extraction utility that's part of the Xpdf project.
There's a right tool for every job, and in general, it's better to use a tool that is designed for the specific job at hand. Accurately extracting text and metadata from PDF documents with high levels of performance is a surprisingly difficult job that presents a complex set of problems. Given the importance of finding and accessing critical data available only in PDF documents, it makes sense to use a PDF content extraction library designed from the ground up to solve these problems expertly and without compromises. Doing so will ensure that your application and your users receive the greatest benefits of enterprise-class PDF content integration.