public class XMLOutputTarget extends OutputHandler
This class is an example OutputHandler implementation that builds a DOM XML model of extracted PDF content. The full source code for this class is included in every PDFTextStream distribution.
The aim of this OutputHandler implementation is to output a valid XML document containing the text extracted from the PDF document, along with structural information (i.e. where Page and Block structures begin and end) and the text ranges that are rendered in bold, italic, or underlined fonts. This kind of information may be particularly useful to systems that perform semantic analysis of a document's structure and significant textual passages, such as a search engine or data mining system.
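To make the intended output more concrete, here is a minimal sketch of the kind of DOM tree such a handler might build, using only the standard JAXP APIs. The element names (`pdf`, `page`, `block`, `b`) and the `StructureSketch` class are assumptions made for illustration; the element names actually used by XMLOutputTarget are defined in its source code and may differ.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical illustration; this is not the XMLOutputTarget source.
final class StructureSketch {

    // Builds a tiny DOM tree: one page holding one block with a bold run.
    static Document build() throws ParserConfigurationException {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .newDocument();

        Element pdf = doc.createElement("pdf");      // assumed root element name
        doc.appendChild(pdf);

        Element page = doc.createElement("page");    // would be opened when a page starts
        pdf.appendChild(page);

        Element block = doc.createElement("block");  // would be opened when a block starts
        page.appendChild(block);

        block.appendChild(doc.createTextNode("Plain text, then "));
        Element bold = doc.createElement("b");       // assumed tag for a bold text range
        bold.appendChild(doc.createTextNode("bold text"));
        block.appendChild(bold);

        return doc;
    }

    public static void main(String[] args) throws Exception {
        Document doc = build();
        // Print the concatenated text content of the whole tree.
        System.out.println(doc.getDocumentElement().getTextContent());
    }
}
```

A consumer such as a search indexer could then walk this tree to weight bold passages differently or to split content by page.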
Constructor and Description |
---|
XMLOutputTarget() Creates a new XMLOutputTarget. |
Modifier and Type | Method and Description |
---|---|
void | endBlock(Block block) Invoked when PDFTextStream has finished processing a Block. |
void | endPage(Page page) Invoked when PDFTextStream has finished processing a page. |
void | endPDF(java.lang.String pdfName, java.io.File pdfFile) Invoked when PDFTextStream has finished processing a PDF. |
java.lang.String | getXMLAsString() Returns the XML built by this XMLOutputTarget as a String. |
org.w3c.dom.Document | getXMLDocument() Returns the DOM Document that this XMLOutputTarget is building. |
void | linebreaks(int linebreakCnt) Invoked when PDFTextStream determines that a series of line breaks should be outputted between the previous entity (page, block, line, etc) and the next entity (page, block, line, etc). |
static void | main(java.lang.String[] args) A main method suitable for using this class' functionality from the command line. |
void | spaces(int spaceCnt) Invoked when PDFTextStream determines that a series of spaces should be outputted between the previous entity (block, line, text unit, etc) and the next entity (block, line, text unit, etc). |
void | startBlock(Block block) Invoked when a Block is about to be processed. |
void | startPage(Page page) Invoked when a page is about to be processed. |
void | startPDF(java.lang.String pdfName, java.io.File pdfFile) Invoked when a new PDF is about to be processed. |
void | textUnit(TextUnit tu) Invoked when a run of characters is to be outputted, as represented by the given TextUnit instance. |
Methods inherited from class OutputHandler: endLine, startLine
public XMLOutputTarget() throws java.io.IOException
Creates a new XMLOutputTarget.
Throws: java.io.IOException - if an error occurs initializing a new DOM document
public org.w3c.dom.Document getXMLDocument()
Returns the DOM Document that this XMLOutputTarget is building.
public java.lang.String getXMLAsString() throws java.io.IOException
Returns the XML built by this XMLOutputTarget as a String.
Throws: java.io.IOException
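For readers who call getXMLDocument() and want to control serialization themselves, the sketch below shows one common way to turn a DOM Document into a String with the standard `javax.xml.transform` API. The `DomToStringSketch` class and its `toXmlString` helper are hypothetical names, and this page does not state whether getXMLAsString() works this way internally; wrapping the failure in an IOException simply mirrors the declared signature above.

```java
import java.io.IOException;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical helper; not part of PDFTextStream.
final class DomToStringSketch {

    // Serializes a DOM document using the identity transformer.
    static String toXmlString(Document doc) throws IOException {
        try {
            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            // Leave out the <?xml ...?> header so the result is just the element tree.
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            transformer.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IOException("could not serialize DOM document", e);
        }
    }

    public static void main(String[] args) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .newDocument();
        Element root = doc.createElement("pdf"); // assumed element name
        root.appendChild(doc.createTextNode("extracted text"));
        doc.appendChild(root);
        System.out.println(toXmlString(doc));
    }
}
```

In practice the `doc` argument would be the value returned by getXMLDocument() after a PDF has been processed.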
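public void textUnit(TextUnit tu)
Description copied from class: OutputHandler
Invoked when a run of characters is to be outputted, as represented by the given TextUnit instance.
Overrides: textUnit in class OutputHandler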
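One way a handler could turn a stream of text-run callbacks into marked-up ranges is to keep the currently open style element and reuse it while consecutive runs share the style. The sketch below illustrates that idea for bold text only. It does not use the real TextUnit API: the `StyleRunSketch` class, its `text` method, the boolean `bold` flag, and the `block` and `b` element names are all assumptions standing in for whatever font information a TextUnit exposes.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical illustration of grouping consecutive bold runs into one element.
final class StyleRunSketch {
    private final Document doc;
    private final Element block;
    private Element openBold; // non-null while the previous run was bold

    StyleRunSketch() throws ParserConfigurationException {
        doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        block = doc.createElement("block"); // assumed element name
        doc.appendChild(block);
    }

    // Stand-in for a textUnit-style callback: one run of characters plus a style flag.
    void text(String chars, boolean bold) {
        if (bold) {
            if (openBold == null) {
                // A bold range begins here.
                openBold = doc.createElement("b");
                block.appendChild(openBold);
            }
            openBold.appendChild(doc.createTextNode(chars));
        } else {
            // Any open bold range ends at the first non-bold run.
            openBold = null;
            block.appendChild(doc.createTextNode(chars));
        }
    }

    Element block() {
        return block;
    }

    public static void main(String[] args) throws Exception {
        StyleRunSketch sketch = new StyleRunSketch();
        sketch.text("See ", false);
        sketch.text("Section ", true);
        sketch.text("4", true);
        sketch.text(" for details.", false);
        System.out.println(sketch.block().getElementsByTagName("b").getLength());
    }
}
```

The same pattern extends to italic and underlined ranges by tracking one open element per style.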
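public void spaces(int spaceCnt)
Description copied from class: OutputHandler
Invoked when PDFTextStream determines that a series of spaces should be outputted between the previous entity (block, line, text unit, etc) and the next entity (block, line, text unit, etc).
Overrides: spaces in class OutputHandler
Parameters: spaceCnt - the number of spaces that PDFTextStream recommends should be outputted
public void linebreaks(int linebreakCnt)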
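Description copied from class: OutputHandler
Invoked when PDFTextStream determines that a series of line breaks should be outputted between the previous entity (page, block, line, etc) and the next entity (page, block, line, etc).
Overrides: linebreaks in class OutputHandler
Parameters: linebreakCnt - the number of line breaks that PDFTextStream recommends should be outputted
public void startBlock(Block block)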
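Description copied from class: OutputHandler
Invoked when a Block is about to be processed.
Overrides: startBlock in class OutputHandler
Parameters: block - a reference to the Block that is about to be processed
public void endBlock(Block block)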
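Description copied from class: OutputHandler
Invoked when PDFTextStream has finished processing a Block.
Overrides: endBlock in class OutputHandler
Parameters: block - a reference to the Block that has been processed
public void startPDF(java.lang.String pdfName, java.io.File pdfFile)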
Description copied from class: OutputHandler
Invoked when a new PDF is about to be processed.
Overrides: startPDF in class OutputHandler
Parameters: pdfName - the 'name' of the PDF document, as provided by PDFTextStream.getName()
pdfFile - the file reference PDFTextStream is about to begin processing. This reference may be null if the PDFTextStream instance was not created using one of the java.io.File- or java.io.InputStream-based constructors.
public void endPDF(java.lang.String pdfName, java.io.File pdfFile)
Description copied from class: OutputHandler
Invoked when PDFTextStream has finished processing a PDF.
Overrides: endPDF in class OutputHandler
Parameters: pdfName - the 'name' of the PDF document, as provided by PDFTextStream.getName()
pdfFile - the file reference PDFTextStream has finished processing
public void startPage(Page page)
Description copied from class: OutputHandler
Invoked when a page is about to be processed.
Overrides: startPage in class OutputHandler
Parameters: page - a reference to the Page that is about to be processed
public void endPage(Page page)
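Description copied from class: OutputHandler
Invoked when PDFTextStream has finished processing a page.
Overrides: endPage in class OutputHandler
Parameters: page - a reference to the Page that has been processed
public static void main(java.lang.String[] args) throws java.lang.Exception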
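A main method suitable for using this class' functionality from the command line. Each input PDF is opened with PDFTextStream, and its content is piped through an XMLOutputTarget instance. Each PDF's extracted content is then written to a ".xml" file in the same directory as the input document.
Throws: java.lang.Exception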
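As an illustration of the output naming described for main, the sketch below derives an ".xml" file located in the same directory as an input document. The `OutputNameSketch` class and `xmlFileFor` helper are hypothetical, and replacing the existing extension (rather than appending ".xml" to the full file name) is an assumption; the actual rule is defined in the XMLOutputTarget source.

```java
import java.io.File;

// Hypothetical helper; not part of PDFTextStream.
final class OutputNameSketch {

    // Returns a sibling file of the input whose extension is replaced by ".xml".
    static File xmlFileFor(File pdf) {
        String name = pdf.getName();
        int dot = name.lastIndexOf('.');
        // Keep the whole name when there is no extension (or only a leading dot).
        String base = (dot > 0) ? name.substring(0, dot) : name;
        // A null parent is allowed here and yields a path relative to the working directory.
        return new File(pdf.getParentFile(), base + ".xml");
    }

    public static void main(String[] args) {
        for (String arg : args) {
            System.out.println(arg + " -> " + xmlFileFor(new File(arg)));
        }
    }
}
```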