XMLOutputTarget (PDFxStream API Reference)

java.lang.Object
- com.snowtide.pdf.OutputHandler
- - pdfts.examples.XMLOutputTarget

```
public class XMLOutputTarget
extends OutputHandler
```
This class is an example OutputHandler implementation that builds up a DOM XML model of extracted PDF content.

The full source code for this class is included in every PDFxStream distribution.

This OutputHandler implementation aims to output a valid XML document containing the text extracted from the PDF document, along with structural information (i.e. where Page and Block structures begin and end) and the text ranges that are rendered in bold, italic, or underlined fonts. This kind of information can be particularly useful to systems that perform semantic analysis of a document's structure and significant textual passages, such as search engines or data mining systems.
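To make the idea of a DOM model of pages, blocks, and styled text more concrete, here is a minimal sketch that uses only the JDK's XML APIs. It is not taken from the XMLOutputTarget source: the class name `DomSketch`, the element names (`pdf`, `page`, `block`, `text`), and the `number` and `bold` attributes are all assumptions chosen for illustration, and the real class may use a different vocabulary.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Illustrative only: element and attribute names are assumptions,
// not the schema that XMLOutputTarget actually emits.
final class DomSketch {

    /** Builds a tiny pdf > page > block > text tree with the JDK DOM API. */
    static Document build() {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().newDocument();
            Element pdf = doc.createElement("pdf");
            doc.appendChild(pdf);

            Element page = doc.createElement("page");
            page.setAttribute("number", "0");
            pdf.appendChild(page);

            Element block = doc.createElement("block");
            page.appendChild(block);

            // A run of bold text, recorded as an attribute on the text element.
            Element text = doc.createElement("text");
            text.setAttribute("bold", "true");
            text.setTextContent("Quarterly results");
            block.appendChild(text);
            return doc;
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException("could not create a DOM document", e);
        }
    }

    /** Serializes a DOM Document to a String using an identity transform. */
    static String toXml(Document doc) {
        try {
            Transformer t = TransformerFactory.newInstance().newTransformer();
            t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            t.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("could not serialize the DOM document", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(toXml(build()));
    }
}
```

The two helpers loosely mirror the roles of `getXMLDocument()` and `getXMLAsString()`: one exposes the DOM tree for further processing, the other produces a serialized form.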

Version:

©2004-2014 Snowtide

Constructor Summary

Constructors
Constructor and Description
`XMLOutputTarget()` Creates a new `XMLOutputTarget`.

Method Summary

Modifier and Type	Method and Description
`void`	`endBlock(Block block)` Invoked when PDFxStream has finished processing a Block.
`void`	`endPage(Page page)` Invoked when PDFxStream has finished processing a page.
`void`	`endPDF(java.lang.String pdfName, java.io.File pdfFile)` Invoked when PDFxStream has finished processing a PDF.
`java.lang.String`	`getXMLAsString()` Returns the XML built by this `XMLOutputTarget` as a `String`.
`org.w3c.dom.Document`	`getXMLDocument()` Returns the DOM Document that this `XMLOutputTarget` is building.
`void`	`linebreaks(int linebreakCnt)` Invoked when PDFxStream determines that a series of line breaks should be outputted between the previous entity (page, block, line, etc) and the next entity (page, block, line, etc).
`static void`	`main(java.lang.String[] args)` Deprecated. Command-line usage of this class may be moved or removed in future PDFxStream releases.
`void`	`spaces(int spaceCnt)` Invoked when PDFxStream determines that a series of spaces should be outputted between the previous entity (block, line, text unit, etc) and the next entity (block, line, text unit, etc).
`void`	`startBlock(Block block)` Invoked when a Block is about to be processed.
`void`	`startPage(Page page)` Invoked when a page is about to be processed.
`void`	`startPDF(java.lang.String pdfName, java.io.File pdfFile)` Invoked when a new PDF is about to be processed.
`void`	`textUnit(TextUnit tu)` Invoked when a run of characters is to be outputted, as represented by the given `TextUnit` instance.

Methods inherited from class com.snowtide.pdf.OutputHandler
endLine, startLine
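The start/end callbacks listed above come in pairs, which is what lets a handler turn an event stream into nested output. The following sketch illustrates that pairing with a stand-in class. `NestingTrace` is hypothetical: it borrows the callback names from the table but does not extend `com.snowtide.pdf.OutputHandler`, and its parameters are simplified placeholders rather than the real `Page`, `Block`, and `TextUnit` types. The call order used in `main` is likewise an assumption based on the method descriptions.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Stand-in for illustration: mirrors the callback names in the method summary,
// but is not a PDFxStream OutputHandler and uses placeholder parameter types.
final class NestingTrace {
    private final Deque<String> stack = new ArrayDeque<>();
    private final StringBuilder out = new StringBuilder();

    void startPage()  { begin("page"); }
    void endPage()    { finish("page"); }
    void startBlock() { begin("block"); }
    void endBlock()   { finish("block"); }

    // No XML escaping here; a DOM-based handler gets escaping from the serializer.
    void textUnit(String chars) { out.append(chars); }

    void spaces(int spaceCnt)         { repeat(' ', spaceCnt); }
    void linebreaks(int linebreakCnt) { repeat('\n', linebreakCnt); }

    /** Returns the accumulated markup, failing if any element is still open. */
    String result() {
        if (!stack.isEmpty()) {
            throw new IllegalStateException("unclosed element: " + stack.peek());
        }
        return out.toString();
    }

    private void begin(String name) {
        stack.push(name);
        out.append('<').append(name).append('>');
    }

    private void finish(String name) {
        // Each end callback must match the most recent unmatched start callback.
        String top = stack.poll();
        if (!name.equals(top)) {
            throw new IllegalStateException("expected </" + top + "> but got </" + name + ">");
        }
        out.append("</").append(name).append('>');
    }

    private void repeat(char c, int count) {
        for (int i = 0; i < count; i++) {
            out.append(c);
        }
    }

    public static void main(String[] args) {
        NestingTrace t = new NestingTrace();
        t.startPage();
        t.startBlock();
        t.textUnit("Hello");
        t.spaces(1);
        t.textUnit("world");
        t.endBlock();
        t.endPage();
        System.out.println(t.result()); // prints <page><block>Hello world</block></page>
    }
}
```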

- Constructor Detail
  - XMLOutputTarget
```
public XMLOutputTarget()
```
    Creates a new XMLOutputTarget.
    
    Throws:
    
    java.io.IOException - if an error occurs initializing a new DOM document
- Method Detail
  - getXMLDocument
```
public org.w3c.dom.Document getXMLDocument()
```
    Returns the DOM Document that this XMLOutputTarget is building.
  - getXMLAsString
```
public java.lang.String getXMLAsString()
```
    Returns the XML built by this XMLOutputTarget as a String.
  - textUnit
```
public void textUnit(TextUnit tu)
```
    Description copied from class: OutputHandler
    
    Invoked when a run of characters is to be outputted, as represented by the given TextUnit instance.
    
    Overrides:
    
    textUnit in class OutputHandler
  - spaces
```
public void spaces(int spaceCnt)
```
    Description copied from class: OutputHandler
    
    Invoked when PDFxStream determines that a series of spaces should be outputted between the previous entity (block, line, text unit, etc) and the next entity (block, line, text unit, etc).
    
    Overrides:
    
    spaces in class OutputHandler
    
    Parameters:
    
    spaceCnt - the number of spaces that PDFxStream recommends should be outputted
  - linebreaks
```
public void linebreaks(int linebreakCnt)
```
    Description copied from class: OutputHandler
    
    Invoked when PDFxStream determines that a series of line breaks should be outputted between the previous entity (page, block, line, etc) and the next entity (page, block, line, etc).
    
    Overrides:
    
    linebreaks in class OutputHandler
    
    Parameters:
    
    linebreakCnt - the number of line breaks that PDFxStream recommends should be outputted
  - startBlock
```
public void startBlock(Block block)
```
    Description copied from class: OutputHandler
    
    Invoked when a Block is about to be processed.
    
    Overrides:
    
    startBlock in class OutputHandler
    
    Parameters:
    
    block - a reference to the Block that is about to be processed
  - endBlock
```
public void endBlock(Block block)
```
    Description copied from class: OutputHandler
    
    Invoked when PDFxStream has finished processing a Block.
    
    Overrides:
    
    endBlock in class OutputHandler
    
    Parameters:
    
    block - a reference to the Block that has been processed
  - startPDF
```
public void startPDF(java.lang.String pdfName,
                     java.io.File pdfFile)
```
    Description copied from class: OutputHandler
    
    Invoked when a new PDF is about to be processed.
    
    Overrides:
    
    startPDF in class OutputHandler
    
    Parameters:
    
    pdfName - the 'name' of the PDF document, as provided by Document.getName()
    
    pdfFile - the file reference PDFxStream is about to begin processing. This reference may be null if the source Document is not reading from a File or InputStream.
  - endPDF
```
public void endPDF(java.lang.String pdfName,
                   java.io.File pdfFile)
```
    Description copied from class: OutputHandler
    
    Invoked when PDFxStream has finished processing a PDF.
    
    Overrides:
    
    endPDF in class OutputHandler
    
    Parameters:
    
    pdfName - the 'name' of the PDF document, as provided by Document.getName()
    
    pdfFile - the file reference PDFxStream has finished processing
  - startPage
```
public void startPage(Page page)
```
    Description copied from class: OutputHandler
    
    Invoked when a page is about to be processed.
    
    Overrides:
    
    startPage in class OutputHandler
    
    Parameters:
    
    page - a reference to the Page that is about to be processed
  - endPage
```
public void endPage(Page page)
```
    Description copied from class: OutputHandler
    
    Invoked when PDFxStream has finished processing a page.
    
    Overrides:
    
    endPage in class OutputHandler
    
    Parameters:
    
    page - a reference to the Page that has been processed
  - main
```
public static void main(java.lang.String[] args)
```
    Deprecated. Command-line usage of this class may be moved or removed in future PDFxStream releases.
    
    A main method suitable for using this class's functionality from the command line. All of the command-line arguments are taken to be paths to input PDF documents; each PDF document is opened by PDF, and its content is piped through an XMLOutputTarget instance. Each PDF's extracted content is then written to a ".xml" file in the same directory as the input document.
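
The description of `main` says that output goes to a ".xml" file next to the input document, but it does not spell out how the file name is formed. The sketch below shows one plausible rule using `java.nio.file`. The class `XmlPathSketch`, the method `xmlPathFor`, and the rule itself (swap a trailing ".pdf" for ".xml", otherwise append ".xml") are assumptions for illustration, not the documented behavior.

```java
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Locale;

// Hypothetical helper: the naming rule is an assumption, not PDFxStream's documented behavior.
final class XmlPathSketch {

    /** Maps an input PDF path to a sibling ".xml" path in the same directory. */
    static Path xmlPathFor(Path pdf) {
        String name = pdf.getFileName().toString();
        // Strip a trailing ".pdf" (case-insensitive) if present; otherwise keep the full name.
        String base = name.toLowerCase(Locale.ROOT).endsWith(".pdf")
                ? name.substring(0, name.length() - 4)
                : name;
        return pdf.resolveSibling(base + ".xml");
    }

    public static void main(String[] args) {
        for (String arg : args) {
            System.out.println(arg + " -> " + xmlPathFor(Paths.get(arg)));
        }
    }
}
```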
