Class XMLOutputTarget
- java.lang.Object
  - com.snowtide.pdf.OutputHandler
    - pdfts.examples.XMLOutputTarget
public class XMLOutputTarget extends OutputHandler
This class is an example OutputHandler implementation that builds up a DOM XML model of extracted PDF content. The full source code for this class is included in every PDFxStream distribution.
The aim of this OutputHandler implementation is to output a valid XML document containing the text extracted from the PDF document, along with structural information (i.e. where Page and Block structures begin and end) and which text ranges are rendered in bold, italic, or underlined fonts. This kind of information may be particularly useful to systems that perform semantic analysis of a document's structure and significant textual passages (such as a search engine or data mining system).
- Version:
- ©2004-2025 Snowtide
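To make the idea of a DOM model of pages, blocks, and styled text concrete, here is a minimal sketch that uses only the JDK's DOM API. It is not the source of XMLOutputTarget: the class name `DomStructureSketch`, the element names (`pdf`, `page`, `block`, `text`), and the `bold` attribute are all hypothetical and chosen for illustration only. The constructor wraps DOM setup failures in an IOException, which mirrors the documented constructor signature but is otherwise an assumption.

```java
import java.io.IOException;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

// Hypothetical sketch: element and attribute names are illustrative
// assumptions, not the schema that XMLOutputTarget actually emits.
class DomStructureSketch {
    private final Document doc;
    private Element current; // element that new children are appended to

    DomStructureSketch() throws IOException {
        try {
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        } catch (ParserConfigurationException e) {
            throw new IOException("could not initialize a new DOM document", e);
        }
        current = doc.createElement("pdf");
        doc.appendChild(current);
    }

    // Opens a nested structural element, e.g. on a start-of-page or start-of-block event.
    void open(String name) {
        Element e = doc.createElement(name);
        current.appendChild(e);
        current = e;
    }

    // Closes the innermost open element; the root element is never closed.
    void close() {
        Node parent = current.getParentNode();
        if (parent instanceof Element) {
            current = (Element) parent;
        }
    }

    // Appends a run of text, flagging bold runs with an attribute.
    void text(String s, boolean bold) {
        Element t = doc.createElement("text");
        if (bold) {
            t.setAttribute("bold", "true");
        }
        t.setTextContent(s);
        current.appendChild(t);
    }

    Document document() {
        return doc;
    }

    public static void main(String[] args) throws IOException {
        DomStructureSketch s = new DomStructureSketch();
        s.open("page");
        s.open("block");
        s.text("A bold heading", true);
        s.close();
        s.close();
        System.out.println(s.document().getDocumentElement().getTagName());
    }
}
```

In a real handler, the `open` and `close` calls would presumably be driven by the start/end callbacks listed in the method summary, and `text` by `textUnit`.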
-
-
Constructor Summary
Constructors
- XMLOutputTarget(): Creates a new XMLOutputTarget.
-
Method Summary
- void endBlock(Block block): Invoked when PDFxStream has finished processing a Block.
- void endPage(Page page): Invoked when PDFxStream has finished processing a page.
- void endPDF(String pdfName, File pdfFile): Invoked when PDFxStream has finished processing a PDF.
- String getXMLAsString(): Returns the XML built by this XMLOutputTarget as a String.
- Document getXMLDocument(): Returns the DOM Document that this XMLOutputTarget is building.
- void linebreaks(int linebreakCnt): Invoked when PDFxStream determines that a series of line breaks should be outputted between the previous entity (page, block, line, etc.) and the next entity (page, block, line, etc.).
- static void main(String[] args): Deprecated. Command-line usage of this class may be moved or removed in future PDFxStream releases.
- void spaces(int spaceCnt): Invoked when PDFxStream determines that a series of spaces should be outputted between the previous entity (block, line, text unit, etc.) and the next entity (block, line, text unit, etc.).
- void startBlock(Block block): Invoked when a Block is about to be processed.
- void startPage(Page page): Invoked when a page is about to be processed.
- void startPDF(String pdfName, File pdfFile): Invoked when a new PDF is about to be processed.
- void textUnit(TextUnit tu): Invoked when a run of characters is to be outputted, as represented by the given TextUnit instance.
Methods inherited from class com.snowtide.pdf.OutputHandler
endLine, endSpan, startLine, startSpan
-
-
-
-
Constructor Detail
-
XMLOutputTarget
public XMLOutputTarget() throws IOException
Creates a new XMLOutputTarget.
- Throws:
IOException - if an error occurs initializing a new DOM document
-
-
Method Detail
-
getXMLDocument
public Document getXMLDocument()
Returns the DOM Document that this XMLOutputTarget is building.
-
getXMLAsString
public String getXMLAsString() throws IOException
Returns the XML built by this XMLOutputTarget as a String.
- Throws:
IOException
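As a rough illustration of how a DOM Document can be turned into a String, the sketch below uses the JDK's identity Transformer. This is one common technique and is not confirmed to be what getXMLAsString() does internally; the class name `DomToStringSketch` and the method `toXmlString` are hypothetical. Wrapping the TransformerException in an IOException is an assumption made only so the sketch matches the documented `throws IOException` clause.

```java
import java.io.IOException;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical helper, not part of PDFxStream.
class DomToStringSketch {

    // Serializes a DOM document with the JDK's identity transformer.
    static String toXmlString(Document doc) throws IOException {
        try {
            Transformer t = TransformerFactory.newInstance().newTransformer();
            // Leave out the <?xml ...?> declaration to keep the output compact.
            t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            t.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IOException("could not serialize DOM document", e);
        }
    }

    public static void main(String[] args) throws Exception {
        // Build a tiny stand-in document; the element names are illustrative only.
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        Element root = doc.createElement("pdf");
        doc.appendChild(root);
        Element page = doc.createElement("page");
        page.setTextContent("Hello");
        root.appendChild(page);
        System.out.println(toXmlString(doc));
    }
}
```

With an XMLOutputTarget instance, the same helper could in principle be applied to the result of getXMLDocument() when different output settings (such as indentation) are wanted.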
-
textUnit
public void textUnit(TextUnit tu)
Description copied from class: OutputHandler
Invoked when a run of characters is to be outputted, as represented by the given TextUnit instance.
- Overrides:
textUnit in class OutputHandler
-
spaces
public void spaces(int spaceCnt)
Description copied from class: OutputHandler
Invoked when PDFxStream determines that a series of spaces should be outputted between the previous entity (block, line, text unit, etc.) and the next entity (block, line, text unit, etc.).
- Overrides:
spaces in class OutputHandler
- Parameters:
spaceCnt - the number of spaces that PDFxStream recommends should be outputted
-
linebreaks
public void linebreaks(int linebreakCnt)
Description copied from class: OutputHandler
Invoked when PDFxStream determines that a series of line breaks should be outputted between the previous entity (page, block, line, etc.) and the next entity (page, block, line, etc.).
- Overrides:
linebreaks in class OutputHandler
- Parameters:
linebreakCnt - the number of line breaks that PDFxStream recommends should be outputted
-
startBlock
public void startBlock(Block block)
Description copied from class: OutputHandler
Invoked when a Block is about to be processed.
- Overrides:
startBlock in class OutputHandler
- Parameters:
block - a reference to the Block that is about to be processed
-
endBlock
public void endBlock(Block block)
Description copied from class: OutputHandler
Invoked when PDFxStream has finished processing a Block.
- Overrides:
endBlock in class OutputHandler
- Parameters:
block - a reference to the Block that has been processed
-
startPDF
public void startPDF(String pdfName, File pdfFile)
Description copied from class: OutputHandler
Invoked when a new PDF is about to be processed.
- Overrides:
startPDF in class OutputHandler
- Parameters:
pdfName - the 'name' of the PDF document, as provided by Document.getName()
pdfFile - the file reference PDFxStream is about to begin processing. This reference may be null if the source Document is not reading from a File or InputStream.
-
endPDF
public void endPDF(String pdfName, File pdfFile)
Description copied from class: OutputHandler
Invoked when PDFxStream has finished processing a PDF.
- Overrides:
endPDF in class OutputHandler
- Parameters:
pdfName - the 'name' of the PDF document, as provided by Document.getName()
pdfFile - the file reference PDFxStream has finished processing
-
startPage
public void startPage(Page page)
Description copied from class: OutputHandler
Invoked when a page is about to be processed.
- Overrides:
startPage in class OutputHandler
- Parameters:
page - a reference to the Page that is about to be processed
-
endPage
public void endPage(Page page)
Description copied from class: OutputHandler
Invoked when PDFxStream has finished processing a page.
- Overrides:
endPage in class OutputHandler
- Parameters:
page - a reference to the Page that has been processed
-
main
@Deprecated public static void main(String[] args) throws Exception
Deprecated. Command-line usage of this class may be moved or removed in future PDFxStream releases.
A main method suitable for using this class's functionality from the command line. All of the command-line arguments are taken to be paths to input PDF documents; each PDF document will be opened by PDF, and its content piped through an XMLOutputTarget instance. Each PDF's extracted content is then written to a ".xml" file in the same directory as the input document.
- Throws:
Exception
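The description above says each result is written to a ".xml" file next to its input, but does not spell out how the file name is derived. The sketch below shows one plausible way to compute such a sibling path with java.nio.file. The class `XmlOutputPathSketch`, the method `xmlSibling`, and the naming rule (replace a trailing ".pdf", otherwise append ".xml") are assumptions for illustration, not documented PDFxStream behavior.

```java
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Locale;

// Hypothetical helper, not part of PDFxStream.
class XmlOutputPathSketch {

    // Assumed naming rule: a trailing ".pdf" (any case) is replaced by ".xml";
    // any other file name simply gets ".xml" appended.
    static Path xmlSibling(Path pdf) {
        String name = pdf.getFileName().toString();
        boolean hasPdfExt = name.toLowerCase(Locale.ROOT).endsWith(".pdf");
        String base = hasPdfExt ? name.substring(0, name.length() - 4) : name;
        // resolveSibling keeps the output in the same directory as the input.
        return pdf.resolveSibling(base + ".xml");
    }

    public static void main(String[] args) {
        for (String arg : args) {
            System.out.println(arg + " -> " + xmlSibling(Paths.get(arg)));
        }
    }
}
```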
-
-