Package com.snowtide.pdf
Class VisualOutputTarget
java.lang.Object
  com.snowtide.pdf.OutputHandler
    com.snowtide.pdf.VisualOutputTarget

public class VisualOutputTarget extends OutputHandler
This OutputHandler implementation aims to preserve as much of a PDF's text layout as possible so that text extracts will retain the visual arrangement of text as present in the original document. This is ideal when the content being extracted is to be used as input into a downstream conversion process. For example, this OutputHandler will maintain the layout of most tabular data (this table is a sample for illustration purposes only):

             Column 1    Column 2    Column 3
    Row 1    $500        $1,000      14B
    Row 2    $1,000      $5,621      8A
    Row 3    $6,009      $121        N/A

whereas the default OutputTarget is more likely to output such tabular data with proper read-ordering, but with no concern for spacing or line breaks between the table's cells and rows:

    Column 1 Column 2 Column 3 Row 1 $500 $1,000 14B Row 2 $1,000 $5,621 8A Row 3 $6,009 $121 N/A
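The difference between the two outputs above can be illustrated with a small, self-contained sketch. It does not use PDFxStream at all: the `Cell` class, the character-column positions, and both methods are hypothetical stand-ins, and the padding rule is an assumption chosen only to show the idea of layout-preserving versus read-order output.

```java
import java.util.Arrays;
import java.util.List;

final class LayoutSketch {
    /** Hypothetical positioned run of text; not a PDFxStream type. */
    static final class Cell {
        final int row;
        final int column; // assumed character column on the output line
        final String text;

        Cell(int row, int column, String text) {
            this.row = row;
            this.column = column;
            this.text = text;
        }
    }

    /** Pads each row with spaces so every cell starts at its character column. */
    static String visualLayout(List<Cell> cells, int rowCount) {
        StringBuilder out = new StringBuilder();
        for (int r = 0; r < rowCount; r++) {
            StringBuilder line = new StringBuilder();
            for (Cell c : cells) {
                if (c.row != r) {
                    continue;
                }
                // If the line is already past the column, no padding is added.
                while (line.length() < c.column) {
                    line.append(' ');
                }
                line.append(c.text);
            }
            out.append(line).append('\n');
        }
        return out.toString();
    }

    /** Emits cells in list order with single spaces, ignoring their positions. */
    static String readOrder(List<Cell> cells) {
        StringBuilder out = new StringBuilder();
        for (Cell c : cells) {
            if (out.length() > 0) {
                out.append(' ');
            }
            out.append(c.text);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Cells are assumed to be sorted by row, then by column.
        List<Cell> cells = Arrays.asList(
                new Cell(0, 0, "Row 1"), new Cell(0, 9, "$500"),
                new Cell(1, 0, "Row 2"), new Cell(1, 9, "$1,000"));
        System.out.print(visualLayout(cells, 2));
        System.out.println(readOrder(cells));
    }
}
```

The first method keeps cells aligned in columns, as a layout-preserving handler would; the second flattens everything into one run of text.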
Please note the following regarding VisualOutputTarget:
- Because VisualOutputTarget attempts to maintain the visual appearance of each page's text, it will not separate columns and other document features that might be very important to a content-oriented text analysis process (such as search indexing).
- VisualOutputTarget will yield very poor output when used to format rotated text; in such a case, the results are essentially undefined. You may optionally suppress the inclusion of rotated characters from VisualOutputTarget's output using setIncludingRotatedChars(boolean).
- Using VisualOutputTarget is likely to impose a slight performance penalty compared to using the default OutputTarget. This penalty should be no more than 5%, and is necessary because of the processing needed to normalize the extracted text to appear as it does on the page.
- The current implementation of VisualOutputTarget performs best when working with text rendered using a monospace font. Proportional fonts and (especially) justified text complicate the process of normalizing the spacing of the text formatted by this class.
- VisualOutputTarget does not handle right-to-left (RTL) or bidirectional (bidi) text properly.
Since:
 v2.0
Version:
 ©2004-2025 Snowtide
 
 
- 
- 
Constructor Summary
Constructor                           Description
VisualOutputTarget(Appendable sb)
- 
Method Summary
Modifier and Type, Method, and Description:

void endLine(Line line)
    Invoked when PDFxStream has finished processing a Line.
void endPage(Page page)
    Invoked when PDFxStream has finished processing a page.
float getSpacingScale()
    Returns the spacing scale currently in effect for this VisualOutputTarget.
boolean isIncludingRotatedChars()
    Returns true if this VisualOutputTarget will include rotated TextUnits in its output (true by default).
boolean isMarginTrimmed()
    Returns true if this VisualOutputTarget trims whitespace corresponding to the left margin of each page piped to it.
void linebreaks(int linebreakCnt)
    Invoked when PDFxStream determines that a series of line breaks should be outputted between the previous entity (page, block, line, etc) and the next entity (page, block, line, etc).
void setIncludingRotatedChars(boolean includingRotatedChars)
    Used to set whether or not this VisualOutputTarget will include rotated TextUnits in its output (true by default).
void setMarginTrimmed(boolean marginTrimmed)
    Sets whether or not this VisualOutputTarget trims the whitespace corresponding to the left margin of each page it handles.
void setSpacingScale(float scale)
    Modifies the spacing scale that is used when outputting content laid out using this VisualOutputTarget.
void spaces(int spaceCnt)
    Invoked when PDFxStream determines that a series of spaces should be outputted between the previous entity (block, line, text unit, etc) and the next entity (block, line, text unit, etc).
void startBlock(Block b)
    Invoked when a Block is about to be processed.
void startLine(Line line)
    Invoked when a Line is about to be processed.
void startPage(Page page)
    Invoked when a page is about to be processed.
void textUnit(TextUnit tu)
    Invoked when a run of characters is to be outputted, as represented by the given TextUnit instance.
 - 
 
- 
- 
Constructor Detail
- 
VisualOutputTarget
public VisualOutputTarget(Appendable sb)
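Because the constructor accepts any java.lang.Appendable, the destination for extracted text can be a StringBuilder, a Writer, or another Appendable implementation. The sketch below shows only that JDK-level flexibility; `emit` is a hypothetical stand-in for a handler writing to its sink and is not part of PDFxStream.

```java
import java.io.IOException;
import java.io.StringWriter;
import java.io.UncheckedIOException;

final class AppendableSketch {
    /** Hypothetical stand-in for a handler writing a line of text to its sink. */
    static void emit(Appendable sink, String text) {
        try {
            sink.append(text).append('\n');
        } catch (IOException e) {
            // Appendable.append declares IOException; rethrow unchecked for brevity.
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        StringBuilder inMemory = new StringBuilder();
        emit(inMemory, "collected in a StringBuilder");

        StringWriter writer = new StringWriter();
        emit(writer, "collected through a Writer");

        System.out.print(inMemory);
        System.out.print(writer);
    }
}
```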
 
 - 
 
- 
Method Detail
- 
setSpacingScale
public void setSpacingScale(float scale)
Modifies the spacing scale that is used when outputting content laid out using this VisualOutputTarget. The default is 1; using a value of 2 will (approximately) double the number of spaces that are outputted between recognized words, while a value of 0.5 will (approximately) halve that number. This is useful in circumstances where:
- Columnar data elements should be separated by greater distances, perhaps to aid data collection
- Words are being outputted without any spaces between them at all, perhaps because of the use of (pathologically?) small fonts.
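The effect of the scale can be pictured with a short sketch. The rounding rule here is an assumption made for illustration; the documentation only says the change is approximate, and `scaledSpaces` and `join` are hypothetical helpers, not PDFxStream methods.

```java
final class SpacingScaleSketch {
    /** Scales a space count and rounds to the nearest integer (assumed rule). */
    static int scaledSpaces(int spaceCount, float scale) {
        return Math.round(spaceCount * scale);
    }

    /** Joins two words using the scaled number of spaces. */
    static String join(String left, String right, int spaceCount, float scale) {
        StringBuilder out = new StringBuilder(left);
        int n = scaledSpaces(spaceCount, scale);
        for (int i = 0; i < n; i++) {
            out.append(' ');
        }
        return out.append(right).toString();
    }

    public static void main(String[] args) {
        // The same gap rendered at the default scale, doubled, and halved.
        System.out.println(join("Qty", "Price", 4, 1f));
        System.out.println(join("Qty", "Price", 4, 2f));
        System.out.println(join("Qty", "Price", 4, 0.5f));
    }
}
```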
 
 
- 
getSpacingScale
public float getSpacingScale()
Returns the spacing scale currently in effect for this VisualOutputTarget.
See Also:
 setSpacingScale(float)
 
- 
isIncludingRotatedChars
public boolean isIncludingRotatedChars()
Returns true if this VisualOutputTarget will include rotated TextUnits in its output (true by default).
- 
setIncludingRotatedChars
public void setIncludingRotatedChars(boolean includingRotatedChars)
Used to set whether or not this VisualOutputTarget will include rotated TextUnits in its output (true by default).
- 
setMarginTrimmed
public void setMarginTrimmed(boolean marginTrimmed)
Sets whether or not this VisualOutputTarget trims the whitespace corresponding to the left margin of each page it handles. Defaults to false, meaning that text set off from the left edge of a page will be preceded by a corresponding number of spaces in resulting text extracts. This means that text located at the same horizontal position on different pages using the same font and font size will be found in the same column position in extracted text, simplifying identification and organization of implicitly tabular data that spans page boundaries.

If set to true, then a minimum number of spaces will be added to the beginning of each line of text.
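As a rough picture of what margin trimming means for the extracted text, the sketch below removes the leading spaces shared by every non-blank line of a page. This is an assumption about the general idea only; `trimLeftMargin` is a hypothetical helper and does not reproduce how PDFxStream computes margins.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class MarginTrimSketch {
    /** Removes the leading spaces common to all non-blank lines of one page. */
    static List<String> trimLeftMargin(List<String> pageLines) {
        int margin = Integer.MAX_VALUE;
        for (String line : pageLines) {
            if (line.trim().isEmpty()) {
                continue; // blank lines do not define the margin
            }
            int indent = 0;
            while (indent < line.length() && line.charAt(indent) == ' ') {
                indent++;
            }
            margin = Math.min(margin, indent);
        }
        List<String> trimmed = new ArrayList<>();
        for (String line : pageLines) {
            // Blank lines become empty; other lines lose only the shared margin.
            trimmed.add(line.trim().isEmpty() ? "" : line.substring(margin));
        }
        return trimmed;
    }

    public static void main(String[] args) {
        List<String> page = Arrays.asList("    Name   Qty", "      Bolt 4", "");
        for (String line : trimLeftMargin(page)) {
            System.out.println(line);
        }
    }
}
```

Relative indentation within the page is kept, so columns still line up after the shared margin is removed.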
- 
isMarginTrimmed
public boolean isMarginTrimmed()
Returns true if this VisualOutputTarget trims whitespace corresponding to the left margin of each page piped to it.
See Also:
 setMarginTrimmed(boolean)
 
- 
endPage
public void endPage(Page page)
Description copied from class: OutputHandler
Invoked when PDFxStream has finished processing a page.
Overrides:
 endPage in class OutputHandler
Parameters:
 page - a reference to the Page that has been processed
 
- 
startBlock
public void startBlock(Block b)
Description copied from class: OutputHandler
Invoked when a Block is about to be processed.
Overrides:
 startBlock in class OutputHandler
Parameters:
 b - a reference to the Block that is about to be processed
 
- 
textUnit
public void textUnit(TextUnit tu)
Description copied from class: OutputHandler
Invoked when a run of characters is to be outputted, as represented by the given TextUnit instance.
Overrides:
 textUnit in class OutputHandler
 
- 
spaces
public void spaces(int spaceCnt)
Description copied from class: OutputHandler
Invoked when PDFxStream determines that a series of spaces should be outputted between the previous entity (block, line, text unit, etc) and the next entity (block, line, text unit, etc).
Overrides:
 spaces in class OutputHandler
Parameters:
 spaceCnt - the number of spaces that PDFxStream recommends should be outputted
 
- 
linebreaks
public void linebreaks(int linebreakCnt)
Description copied from class: OutputHandler
Invoked when PDFxStream determines that a series of line breaks should be outputted between the previous entity (page, block, line, etc) and the next entity (page, block, line, etc).
Overrides:
 linebreaks in class OutputHandler
Parameters:
 linebreakCnt - the number of line breaks that PDFxStream recommends should be outputted
 
- 
startLine
public void startLine(Line line)
Description copied from class: OutputHandler
Invoked when a Line is about to be processed.
Overrides:
 startLine in class OutputHandler
Parameters:
 line - a reference to the Line that is about to be processed
 
- 
endLine
public void endLine(Line line)
Description copied from class: OutputHandler
Invoked when PDFxStream has finished processing a Line.
Overrides:
 endLine in class OutputHandler
Parameters:
 line - a reference to the Line that has been processed
 
- 
startPage
public void startPage(Page page)
Description copied from class: OutputHandler
Invoked when a page is about to be processed.
Overrides:
 startPage in class OutputHandler
Parameters:
 page - a reference to the Page that is about to be processed
 
 - 
 
 -