public class XMLOutputTarget extends OutputHandler
This class is an example OutputHandler implementation that builds a DOM XML model of extracted PDF content.
The full source code for this class is included in every PDFxStream distribution.
The aim of this OutputHandler implementation is to output a valid XML document containing the text extracted from the PDF document, along with structural information (i.e. where Page and Block structures begin and end) and the text ranges that are rendered in bold, italic, and underlined fonts. This kind of information might be particularly useful to systems that perform some kind of semantic analysis of a document's structure and significant textual passages (such as a search engine or data mining system).
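To make the idea of a DOM model of pages, blocks, and styled text concrete, here is a minimal sketch that uses only the JDK's `org.w3c.dom` API. It does not depend on PDFxStream and is not the shipped XMLOutputTarget source: the class `DomModelSketch`, its methods, and the element names `pdf`, `page`, `block`, and `b` are all assumptions chosen for illustration.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical sketch: mimics start/end style callbacks without PDFxStream.
// Element names ("pdf", "page", "block", "b") are assumptions, not the real schema.
class DomModelSketch {
    private final Document doc;
    private Element current; // element that new content is appended to

    DomModelSketch() {
        try {
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException("could not create a DOM document", e);
        }
        current = doc.createElement("pdf");
        doc.appendChild(current);
    }

    // Opens a nested element, as a startPage or startBlock callback might.
    void start(String name) {
        Element child = doc.createElement(name);
        current.appendChild(child);
        current = child;
    }

    // Closes the innermost open element, as an endPage or endBlock callback might.
    void end() {
        if (current.getParentNode() instanceof Element) {
            current = (Element) current.getParentNode();
        }
    }

    // Appends a run of text, wrapping it in <b> when the run is bold.
    void text(String chars, boolean bold) {
        if (bold) {
            Element b = doc.createElement("b");
            b.appendChild(doc.createTextNode(chars));
            current.appendChild(b);
        } else {
            current.appendChild(doc.createTextNode(chars));
        }
    }

    Document document() {
        return doc;
    }

    // Builds a tiny one-page, one-block document for demonstration.
    static Document demo() {
        DomModelSketch s = new DomModelSketch();
        s.start("page");
        s.start("block");
        s.text("Quarterly ", false);
        s.text("results", true);
        s.end(); // block
        s.end(); // page
        return s.document();
    }
}
```

Tracking a single "current" element keeps each callback small: start events push one level down, end events pop back up, and text always lands in whichever structure is open.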
Constructor and Description |
---|
XMLOutputTarget() Creates a new XMLOutputTarget. |
Modifier and Type | Method and Description |
---|---|
void | endBlock(Block block) Invoked when PDFxStream has finished processing a Block. |
void | endPage(Page page) Invoked when PDFxStream has finished processing a page. |
void | endPDF(java.lang.String pdfName, java.io.File pdfFile) Invoked when PDFxStream has finished processing a PDF. |
java.lang.String | getXMLAsString() Returns the XML built by this XMLOutputTarget as a String. |
org.w3c.dom.Document | getXMLDocument() Returns the DOM Document that this XMLOutputTarget is building. |
void | linebreaks(int linebreakCnt) Invoked when PDFxStream determines that a series of line breaks should be outputted between the previous entity (page, block, line, etc) and the next entity (page, block, line, etc). |
static void | main(java.lang.String[] args) Deprecated. Command-line usage of this class may be moved or removed in future PDFxStream releases. |
void | spaces(int spaceCnt) Invoked when PDFxStream determines that a series of spaces should be outputted between the previous entity (block, line, text unit, etc) and the next entity (block, line, text unit, etc). |
void | startBlock(Block block) Invoked when a Block is about to be processed. |
void | startPage(Page page) Invoked when a page is about to be processed. |
void | startPDF(java.lang.String pdfName, java.io.File pdfFile) Invoked when a new PDF is about to be processed. |
void | textUnit(TextUnit tu) Invoked when a run of characters is to be outputted, as represented by the given TextUnit instance. |
Methods inherited from class OutputHandler: endLine, startLine
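Callers that hold the `org.w3c.dom.Document` returned by getXMLDocument() may want to serialize it themselves, for example with different output settings than getXMLAsString() provides. The sketch below shows one standard JDK way to do that with a `Transformer`. It is an illustration only and not necessarily how XMLOutputTarget produces its String; the class `DomToStringSketch` and the sample element names are hypothetical.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical helper: serializes any DOM Document to a String using the JDK only.
class DomToStringSketch {
    static String serialize(Document doc) {
        try {
            // An identity transformer copies the DOM tree to the output unchanged.
            Transformer t = TransformerFactory.newInstance().newTransformer();
            t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            t.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("could not serialize DOM document", e);
        }
    }

    // Builds a small sample document (element names are assumptions) and serializes it.
    static String demo() {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
            Element root = doc.createElement("pdf");
            Element page = doc.createElement("page");
            page.appendChild(doc.createTextNode("hello"));
            root.appendChild(page);
            doc.appendChild(root);
            return serialize(doc);
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException("could not create a DOM document", e);
        }
    }
}
```

With a real extraction, the argument to `serialize` would be the Document obtained from getXMLDocument() after the PDF has been fully processed.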
public XMLOutputTarget()
Creates a new XMLOutputTarget.
Throws:
java.io.IOException - if an error occurs initializing a new DOM document
public org.w3c.dom.Document getXMLDocument()
Returns the DOM Document that this XMLOutputTarget is building.
public java.lang.String getXMLAsString()
Returns the XML built by this XMLOutputTarget as a String.
public void textUnit(TextUnit tu)
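Description copied from class: OutputHandler
Invoked when a run of characters is to be outputted, as represented by the given TextUnit instance.
Overrides:
textUnit in class OutputHandler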
public void spaces(int spaceCnt)
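Description copied from class: OutputHandler
Invoked when PDFxStream determines that a series of spaces should be outputted between the previous entity (block, line, text unit, etc) and the next entity (block, line, text unit, etc).
Overrides:
spaces in class OutputHandler
Parameters:
spaceCnt - the number of spaces that PDFxStream recommends should be outputted
public void linebreaks(int linebreakCnt)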
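Description copied from class: OutputHandler
Invoked when PDFxStream determines that a series of line breaks should be outputted between the previous entity (page, block, line, etc) and the next entity (page, block, line, etc).
Overrides:
linebreaks in class OutputHandler
Parameters:
linebreakCnt - the number of line breaks that PDFxStream recommends should be outputted
public void startBlock(Block block)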
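Description copied from class: OutputHandler
Invoked when a Block is about to be processed.
Overrides:
startBlock in class OutputHandler
Parameters:
block - a reference to the Block that is about to be processed
public void endBlock(Block block)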
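Description copied from class: OutputHandler
Invoked when PDFxStream has finished processing a Block.
Overrides:
endBlock in class OutputHandler
Parameters:
block - a reference to the Block that has been processed
public void startPDF(java.lang.String pdfName, java.io.File pdfFile)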
Description copied from class: OutputHandler
Invoked when a new PDF is about to be processed.
Overrides:
startPDF in class OutputHandler
Parameters:
pdfName - the 'name' of the PDF document, as provided by Document.getName()
pdfFile - the file reference PDFxStream is about to begin processing. This reference may be null if the source Document is not reading from a File or InputStream.
public void endPDF(java.lang.String pdfName, java.io.File pdfFile)
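Description copied from class: OutputHandler
Invoked when PDFxStream has finished processing a PDF.
Overrides:
endPDF in class OutputHandler
Parameters:
pdfName - the 'name' of the PDF document, as provided by Document.getName()
pdfFile - the file reference PDFxStream has finished processing
public void startPage(Page page)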
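Description copied from class: OutputHandler
Invoked when a page is about to be processed.
Overrides:
startPage in class OutputHandler
Parameters:
page - a reference to the Page that is about to be processed
public void endPage(Page page)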
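Description copied from class: OutputHandler
Invoked when PDFxStream has finished processing a page.
Overrides:
endPage in class OutputHandler
Parameters:
page - a reference to the Page that has been processed
public static void main(java.lang.String[] args)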
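Deprecated. Command-line usage of this class may be moved or removed in future PDFxStream releases.
Each input document is opened using PDF, and its content piped through an XMLOutputTarget instance. Each PDF's extracted content is then written to a ".xml" file in the same directory as the input document.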
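The rule that the ".xml" output lands in the same directory as the input can be sketched with plain `java.io.File` calls. The description does not say whether ".xml" replaces the input's extension or is appended to the full file name, so the sketch below assumes replacement; the class `XmlOutputPathSketch` and the method `xmlFileFor` are hypothetical names, not part of PDFxStream.

```java
import java.io.File;

// Hypothetical helper: derives the sibling ".xml" file for an input PDF.
class XmlOutputPathSketch {
    // Assumption: the input's extension (if any) is replaced with ".xml".
    static File xmlFileFor(File pdfFile) {
        String name = pdfFile.getName();
        int dot = name.lastIndexOf('.');
        // A leading dot (hidden file) or no dot at all means there is no extension to strip.
        String base = dot > 0 ? name.substring(0, dot) : name;
        // Reusing the input's parent keeps the output in the same directory;
        // a null parent resolves relative to the working directory.
        return new File(pdfFile.getParentFile(), base + ".xml");
    }
}
```

For example, under this assumption an input of `docs/report.pdf` maps to `docs/report.xml`.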