pdfts.examples
Class XMLOutputTarget

java.lang.Object
  extended by com.snowtide.pdf.OutputHandler
      extended by pdfts.examples.XMLOutputTarget

public class XMLOutputTarget
extends OutputHandler

This class is an example OutputHandler implementation that builds up a DOM XML model of extracted PDF content.

The full source code for this class is included in every PDFTextStream distribution.

The aim of this OutputHandler implementation is to output a valid XML document containing the text extracted from a PDF document, along with structural information (where Page and Block structures begin and end) and the text ranges rendered in bold, italic, or underlined fonts. This information may be particularly useful to systems that analyze a document's structure and its significant passages, such as search engines or data mining systems.
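The event-to-DOM mapping described above might look roughly like the following. This is a minimal sketch using only the JDK's DOM API, not the source shipped with PDFTextStream: the class name DomEventSketch, its methods, and the element names ("pdf", "page", "block", "bold") are all assumptions chosen for illustration, since this page does not state the XML vocabulary that XMLOutputTarget emits.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical sketch; names and element vocabulary are assumptions.
final class DomEventSketch {
    private final Document doc;
    private Element current; // element that receives the next child

    DomEventSketch() {
        try {
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException("No DOM implementation available", e);
        }
        current = doc.createElement("pdf"); // assumed root element name
        doc.appendChild(current);
    }

    // Opens a child element (for example "page" or "block") and descends into it.
    void start(String name) {
        Element child = doc.createElement(name);
        current.appendChild(child);
        current = child;
    }

    // Closes the current element and returns to its parent, stopping at the root.
    void end() {
        if (current.getParentNode() instanceof Element) {
            current = (Element) current.getParentNode();
        }
    }

    // Appends text, wrapped in a style element such as "bold" when one is given.
    void text(String value, String styleOrNull) {
        if (styleOrNull == null) {
            current.appendChild(doc.createTextNode(value));
        } else {
            Element styled = doc.createElement(styleOrNull);
            styled.setTextContent(value);
            current.appendChild(styled);
        }
    }

    Document document() {
        return doc;
    }
}
```

In a real handler, start and end would be driven by the startPage/endPage and startBlock/endBlock callbacks, and text by textUnit.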

Version:
©2004-2012 Snowtide Informatics Systems, Inc.

Constructor Summary
XMLOutputTarget()
          Creates a new XMLOutputTarget.
 
Method Summary
 void endBlock(Block block)
          Invoked when PDFTextStream has finished processing a Block.
 void endPage(Page page)
          Invoked when PDFTextStream has finished processing a page.
 void endPDF(java.lang.String pdfName, java.io.File pdfFile)
          Invoked when PDFTextStream has finished processing a PDF.
 java.lang.String getXMLAsString()
          Returns the XML built by this XMLOutputTarget as a String.
 org.w3c.dom.Document getXMLDocument()
          Returns the DOM Document that this XMLOutputTarget is building.
 void linebreaks(int linebreakCnt)
          Invoked when PDFTextStream determines that a series of line breaks should be output between the previous entity (page, block, line, etc.) and the next entity (page, block, line, etc.).
static void main(java.lang.String[] args)
          A main method suitable for using this class's functionality from the command line.
 void spaces(int spaceCnt)
          Invoked when PDFTextStream determines that a series of spaces should be output between the previous entity (block, line, text unit, etc.) and the next entity (block, line, text unit, etc.).
 void startBlock(Block block)
          Invoked when a Block is about to be processed.
 void startPage(Page page)
          Invoked when a page is about to be processed.
 void startPDF(java.lang.String pdfName, java.io.File pdfFile)
          Invoked when a new PDF is about to be processed.
 void textUnit(TextUnit tu)
          Invoked when a run of characters is to be output, as represented by the given TextUnit instance.
 
Methods inherited from class com.snowtide.pdf.OutputHandler
endLine, startLine
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

XMLOutputTarget

public XMLOutputTarget()
                throws java.io.IOException
Creates a new XMLOutputTarget.

Throws:
java.io.IOException - if an error occurs initializing a new DOM document
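One way a constructor can end up declaring java.io.IOException for a DOM setup failure is by wrapping the JDK's ParserConfigurationException. The sketch below shows that pattern; whether XMLOutputTarget does exactly this is an assumption, and the class and method names are hypothetical.

```java
import java.io.IOException;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;

// Hypothetical sketch; not taken from the XMLOutputTarget source.
final class NewDocumentSketch {
    // Creates an empty DOM document, reporting configuration failures as IOException.
    static Document newDocument() throws IOException {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        } catch (ParserConfigurationException e) {
            throw new IOException("Could not initialize a DOM document", e);
        }
    }
}
```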
Method Detail

getXMLDocument

public org.w3c.dom.Document getXMLDocument()
Returns the DOM Document that this XMLOutputTarget is building.


getXMLAsString

public java.lang.String getXMLAsString()
                                throws java.io.IOException
Returns the XML built by this XMLOutputTarget as a String.

Throws:
java.io.IOException
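Turning a DOM Document into a String is commonly done with an identity Transformer from javax.xml.transform. The following is a sketch of that general technique under stated assumptions: the class and method names are hypothetical, and both the omitted XML declaration and the mapping of TransformerException to IOException are illustrative choices, not documented behavior of getXMLAsString().

```java
import java.io.IOException;
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;

// Hypothetical sketch; not taken from the XMLOutputTarget source.
final class DomToStringSketch {
    // Serializes the document with an identity transform.
    static String toXmlString(Document doc) throws IOException {
        try {
            Transformer t = TransformerFactory.newInstance().newTransformer();
            // Assumption: leave out the <?xml ...?> declaration for brevity.
            t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            t.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IOException("Could not serialize the DOM document", e);
        }
    }
}
```

With the real class, the Document returned by getXMLDocument() could be passed to a helper like this when custom output properties (indentation, encoding) are needed.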

textUnit

public void textUnit(TextUnit tu)
Description copied from class: OutputHandler
Invoked when a run of characters is to be output, as represented by the given TextUnit instance.

Overrides:
textUnit in class OutputHandler

spaces

public void spaces(int spaceCnt)
Description copied from class: OutputHandler
Invoked when PDFTextStream determines that a series of spaces should be output between the previous entity (block, line, text unit, etc.) and the next entity (block, line, text unit, etc.).

Overrides:
spaces in class OutputHandler
Parameters:
spaceCnt - the number of spaces that PDFTextStream recommends outputting

linebreaks

public void linebreaks(int linebreakCnt)
Description copied from class: OutputHandler
Invoked when PDFTextStream determines that a series of line breaks should be output between the previous entity (page, block, line, etc.) and the next entity (page, block, line, etc.).

Overrides:
linebreaks in class OutputHandler
Parameters:
linebreakCnt - the number of line breaks that PDFTextStream recommends outputting
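The counts passed to spaces and linebreaks are recommendations, so a handler is free to honor them verbatim or to collapse them. As a minimal sketch of the verbatim option, the hypothetical helper below expands a count into the whitespace a handler might append to its output; how XMLOutputTarget itself treats these counts is not stated on this page.

```java
// Hypothetical helper; not part of the PDFTextStream API.
final class WhitespaceSketch {
    // Returns `count` copies of `unit`; zero or negative counts yield an empty string.
    static String repeat(String unit, int count) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < count; i++) {
            sb.append(unit);
        }
        return sb.toString();
    }

    // Whitespace for a spaces(int) callback.
    static String spaces(int spaceCnt) {
        return repeat(" ", spaceCnt);
    }

    // Whitespace for a linebreaks(int) callback.
    static String linebreaks(int linebreakCnt) {
        return repeat("\n", linebreakCnt);
    }
}
```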

startBlock

public void startBlock(Block block)
Description copied from class: OutputHandler
Invoked when a Block is about to be processed.

Overrides:
startBlock in class OutputHandler
Parameters:
block - a reference to the Block that is about to be processed

endBlock

public void endBlock(Block block)
Description copied from class: OutputHandler
Invoked when PDFTextStream has finished processing a Block.

Overrides:
endBlock in class OutputHandler
Parameters:
block - a reference to the Block that has been processed

startPDF

public void startPDF(java.lang.String pdfName,
                     java.io.File pdfFile)
Description copied from class: OutputHandler
Invoked when a new PDF is about to be processed.

Overrides:
startPDF in class OutputHandler
Parameters:
pdfName - the 'name' of the PDF document, as provided by PDFTextStream.getName()
pdfFile - the file reference PDFTextStream is about to begin processing. This reference may be null if the PDFTextStream instance was not created using one of the java.io.File- or java.io.InputStream-based constructors.

endPDF

public void endPDF(java.lang.String pdfName,
                   java.io.File pdfFile)
Description copied from class: OutputHandler
Invoked when PDFTextStream has finished processing a PDF.

Overrides:
endPDF in class OutputHandler
Parameters:
pdfName - the 'name' of the PDF document, as provided by PDFTextStream.getName()
pdfFile - the file reference PDFTextStream has finished processing

startPage

public void startPage(Page page)
Description copied from class: OutputHandler
Invoked when a page is about to be processed.

Overrides:
startPage in class OutputHandler
Parameters:
page - a reference to the Page that is about to be processed

endPage

public void endPage(Page page)
Description copied from class: OutputHandler
Invoked when PDFTextStream has finished processing a page.

Overrides:
endPage in class OutputHandler
Parameters:
page - a reference to the Page that has been processed

main

public static void main(java.lang.String[] args)
                 throws java.lang.Exception
A main method suitable for using this class's functionality from the command line. All command-line arguments are taken to be paths to input PDF documents; each PDF document is opened by PDFTextStream, and its content is piped through an XMLOutputTarget instance. Each PDF's extracted content is then written to a ".xml" file in the same directory as the input document.

Throws:
java.lang.Exception
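The description of main says each result lands in a ".xml" file beside its input, but not how the file name is formed. The sketch below shows one plausible mapping, assuming a trailing ".pdf" extension (matched case-insensitively) is replaced by ".xml" and that other names simply get ".xml" appended; the class and method names are hypothetical.

```java
import java.io.File;

// Hypothetical helper; the naming rule is an assumption, not documented behavior.
final class OutputPathSketch {
    // Maps "docs/report.pdf" to "docs/report.xml"; names without ".pdf" get ".xml" appended.
    static File xmlFileFor(File pdf) {
        String name = pdf.getName();
        // regionMatches returns false when the name is shorter than four characters.
        boolean hasPdfSuffix = name.regionMatches(true, name.length() - 4, ".pdf", 0, 4);
        String base = hasPdfSuffix ? name.substring(0, name.length() - 4) : name;
        // A null parent yields a path relative to the working directory.
        return new File(pdf.getParentFile(), base + ".xml");
    }
}
```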