Class XMLOutputTarget
- java.lang.Object
  - com.snowtide.pdf.OutputHandler
    - pdfts.examples.XMLOutputTarget
public class XMLOutputTarget extends OutputHandler
This class is an example OutputHandler implementation that builds up a DOM XML model of extracted PDF content. The full source code for this class is included in every PDFxStream distribution.
The aim of this OutputHandler implementation is to output a valid XML document containing the text extracted from the PDF document, along with structural information (i.e. where Page and Block structures begin and end) and which text ranges are rendered in bold, italic, or underlined fonts. This kind of information may be particularly useful to systems that perform semantic analysis of a document's structure and significant textual passages (such as a search engine or data mining system).
- Version:
- ©2004-2024 Snowtide
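The idea described above, mapping start/end callbacks onto nested DOM elements, can be sketched with the JDK alone. The class below is a hypothetical stand-in and not the XMLOutputTarget source shipped with PDFxStream: it takes plain values instead of Page, Block, and TextUnit objects, and the element and attribute names (pdf, page, block, bold, number) are assumptions made for illustration only.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

/** Hypothetical sketch: builds a DOM tree from balanced start/end callbacks. */
class DomModelSketch {
    private final Document doc;
    private Element current; // element that new children are appended to

    DomModelSketch() throws ParserConfigurationException {
        doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        current = doc.createElement("pdf"); // assumed root element name
        doc.appendChild(current);
    }

    /** Opens a child element and makes it the current insertion point. */
    private void open(String name) {
        Element e = doc.createElement(name);
        current.appendChild(e);
        current = e;
    }

    /** Returns to the parent element; calls must be balanced with open(). */
    private void close() {
        current = (Element) current.getParentNode();
    }

    void startPage(int pageNumber) {
        open("page");
        current.setAttribute("number", Integer.toString(pageNumber));
    }

    void endPage() { close(); }

    void startBlock() { open("block"); }

    void endBlock() { close(); }

    /** Appends a run of characters, wrapped in a "bold" element if requested. */
    void text(String chars, boolean bold) {
        if (bold) {
            Element b = doc.createElement("bold");
            b.setTextContent(chars);
            current.appendChild(b);
        } else {
            current.appendChild(doc.createTextNode(chars));
        }
    }

    Document getDocument() { return doc; }

    public static void main(String[] args) throws Exception {
        DomModelSketch sketch = new DomModelSketch();
        sketch.startPage(1);
        sketch.startBlock();
        sketch.text("Hello ", false);
        sketch.text("world", true);
        sketch.endBlock();
        sketch.endPage();
        // Text content of an element includes the text of its descendants.
        System.out.println(sketch.getDocument().getDocumentElement().getTextContent());
    }
}
```

Tracking a single "current" element keeps each callback small: a start callback descends one level, and the matching end callback climbs back up.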
Constructor Summary
- XMLOutputTarget(): Creates a new XMLOutputTarget.
-
Method Summary
- void endBlock(Block block): Invoked when PDFxStream has finished processing a Block.
- void endPage(Page page): Invoked when PDFxStream has finished processing a page.
- void endPDF(String pdfName, File pdfFile): Invoked when PDFxStream has finished processing a PDF.
- String getXMLAsString(): Returns the XML built by this XMLOutputTarget as a String.
- Document getXMLDocument(): Returns the DOM Document that this XMLOutputTarget is building.
- void linebreaks(int linebreakCnt): Invoked when PDFxStream determines that a series of line breaks should be outputted between the previous entity (page, block, line, etc) and the next entity (page, block, line, etc).
- static void main(String[] args): Deprecated. Command-line usage of this class may be moved or removed in future PDFxStream releases.
- void spaces(int spaceCnt): Invoked when PDFxStream determines that a series of spaces should be outputted between the previous entity (block, line, text unit, etc) and the next entity (block, line, text unit, etc).
- void startBlock(Block block): Invoked when a Block is about to be processed.
- void startPage(Page page): Invoked when a page is about to be processed.
- void startPDF(String pdfName, File pdfFile): Invoked when a new PDF is about to be processed.
- void textUnit(TextUnit tu): Invoked when a run of characters is to be outputted, as represented by the given TextUnit instance.
Methods inherited from class com.snowtide.pdf.OutputHandler
endLine, endSpan, startLine, startSpan
Constructor Detail
-
XMLOutputTarget
public XMLOutputTarget() throws IOException
Creates a new XMLOutputTarget.
- Throws:
IOException - if an error occurs initializing a new DOM document
-
-
Method Detail
-
getXMLDocument
public Document getXMLDocument()
Returns the DOM Document that this XMLOutputTarget is building.
-
getXMLAsString
public String getXMLAsString() throws IOException
Returns the XML built by this XMLOutputTarget as a String.
- Throws:
IOException
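Turning a DOM Document into a String, as this method does, can be illustrated with the JDK's own javax.xml.transform API. The sketch below is an assumption about one reasonable approach, not the PDFxStream implementation; the class name DomToStringSketch and the helper toXmlString are hypothetical. It wraps serialization failures in an IOException to mirror the signature documented above.

```java
import java.io.IOException;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

/** Hypothetical sketch: serializes a DOM Document to a String. */
class DomToStringSketch {

    static String toXmlString(Document doc) throws IOException {
        try {
            // An identity transformer copies the DOM tree to the output as-is.
            Transformer t = TransformerFactory.newInstance().newTransformer();
            t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            t.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            // Surface serialization problems as IOException, as the documented method does.
            throw new IOException("could not serialize DOM document", e);
        }
    }

    public static void main(String[] args) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        Element root = doc.createElement("pdf"); // element names are illustrative only
        Element page = doc.createElement("page");
        page.setTextContent("hi");
        root.appendChild(page);
        doc.appendChild(root);
        System.out.println(toXmlString(doc));
    }
}
```

With the real class, the equivalent step is simply a call to getXMLAsString() once the document has been piped through the handler.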
-
textUnit
public void textUnit(TextUnit tu)
Description copied from class: OutputHandler
Invoked when a run of characters is to be outputted, as represented by the given TextUnit instance.
- Overrides:
textUnit in class OutputHandler
-
spaces
public void spaces(int spaceCnt)
Description copied from class: OutputHandler
Invoked when PDFxStream determines that a series of spaces should be outputted between the previous entity (block, line, text unit, etc) and the next entity (block, line, text unit, etc).
- Overrides:
spaces in class OutputHandler
- Parameters:
spaceCnt - the number of spaces that PDFxStream recommends should be outputted
-
linebreaks
public void linebreaks(int linebreakCnt)
Description copied from class: OutputHandler
Invoked when PDFxStream determines that a series of line breaks should be outputted between the previous entity (page, block, line, etc) and the next entity (page, block, line, etc).
- Overrides:
linebreaks in class OutputHandler
- Parameters:
linebreakCnt - the number of line breaks that PDFxStream recommends should be outputted
-
startBlock
public void startBlock(Block block)
Description copied from class: OutputHandler
Invoked when a Block is about to be processed.
- Overrides:
startBlock in class OutputHandler
- Parameters:
block - a reference to the Block that is about to be processed
-
endBlock
public void endBlock(Block block)
Description copied from class: OutputHandler
Invoked when PDFxStream has finished processing a Block.
- Overrides:
endBlock in class OutputHandler
- Parameters:
block - a reference to the Block that has been processed
-
startPDF
public void startPDF(String pdfName, File pdfFile)
Description copied from class: OutputHandler
Invoked when a new PDF is about to be processed.
- Overrides:
startPDF in class OutputHandler
- Parameters:
pdfName - the 'name' of the PDF document, as provided by Document.getName()
pdfFile - the file reference PDFxStream is about to begin processing. This reference may be null if the source Document is not reading from a File or InputStream.
-
endPDF
public void endPDF(String pdfName, File pdfFile)
Description copied from class: OutputHandler
Invoked when PDFxStream has finished processing a PDF.
- Overrides:
endPDF in class OutputHandler
- Parameters:
pdfName - the 'name' of the PDF document, as provided by Document.getName()
pdfFile - the file reference PDFxStream has finished processing
-
startPage
public void startPage(Page page)
Description copied from class: OutputHandler
Invoked when a page is about to be processed.
- Overrides:
startPage in class OutputHandler
- Parameters:
page - a reference to the Page that is about to be processed
-
endPage
public void endPage(Page page)
Description copied from class: OutputHandler
Invoked when PDFxStream has finished processing a page.
- Overrides:
endPage in class OutputHandler
- Parameters:
page - a reference to the Page that has been processed
-
main
public static void main(String[] args) throws Exception
Deprecated. Command-line usage of this class may be moved or removed in future PDFxStream releases.
A main method suitable for using this class' functionality from the command line. All of the command-line arguments are taken to be paths to input PDF documents; each PDF document is opened by PDF, and its content is piped through an XMLOutputTarget instance. Each PDF's extracted content is then written to a ".xml" file in the same directory as the input document.
- Throws:
Exception
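The output-file convention described above (a ".xml" file alongside the input document) can be sketched with java.nio.file. The description does not say whether the ".pdf" extension is replaced or ".xml" is appended to the full name; the helper below assumes replacement, and both the class name XmlOutputPathSketch and the method xmlPathFor are hypothetical.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

/** Hypothetical sketch: derives an output path next to the input PDF. */
class XmlOutputPathSketch {

    static Path xmlPathFor(Path pdf) {
        String name = pdf.getFileName().toString();
        int dot = name.lastIndexOf('.');
        // Assumption: drop the last extension, if any, before adding ".xml".
        String base = dot > 0 ? name.substring(0, dot) : name;
        // resolveSibling keeps the result in the same directory as the input.
        return pdf.resolveSibling(base + ".xml");
    }

    public static void main(String[] args) {
        for (String arg : args) {
            System.out.println(arg + " -> " + xmlPathFor(Paths.get(arg)));
        }
    }
}
```

Because command-line use of main is deprecated, deriving the output location in calling code like this keeps an application independent of that entry point.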