Package com.snowtide.pdf
Class VisualOutputTarget
- java.lang.Object
  - com.snowtide.pdf.OutputHandler
    - com.snowtide.pdf.VisualOutputTarget
public class VisualOutputTarget extends OutputHandler
This OutputHandler implementation aims to preserve as much of a PDF's text layout as possible so that text extracts will retain the visual arrangement of text as present in the original document. This is ideal when the content being extracted is to be used as input into a downstream conversion process. For example, this OutputHandler will maintain the layout of most tabular data (this table is a sample for illustration purposes only):

             Column 1    Column 2    Column 3
    Row 1    $500        $1,000      14B
    Row 2    $1,000      $5,621      8A
    Row 3    $6,009      $121        N/A

whereas the default OutputTarget is more likely to output such tabular data with proper read-ordering, but with no concern for spacing or line breaks between the table's cells and rows:

    Column 1 Column 2 Column 3 Row 1 $500 $1,000 14B Row 2 $1,000 $5,621 8A Row 3 $6,009 $121 N/A
Please note the following regarding VisualOutputTarget:
- Because VisualOutputTarget attempts to maintain the visual appearance of each page's text, it will not separate columns and other document features that might be very important to a content-oriented text analysis process (such as search indexing).
- VisualOutputTarget will yield very poor output when used to format rotated text; in such a case, the results are essentially undefined. You may optionally suppress the inclusion of rotated characters from VisualOutputTarget's output using setIncludingRotatedChars(boolean).
- Using VisualOutputTarget is likely to impose a slight performance penalty compared to using the default OutputTarget. This penalty should be no more than 5%, and is necessary because of the processing needed to normalize the extracted text to appear as it does on the page.
- The current implementation of VisualOutputTarget performs best when working with text rendered using a monospace font. Proportional fonts and (especially) justified text complicate the process of normalizing the spacing of the text formatted by this class.
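The difference between layout-preserving output and read-ordered output can be illustrated without PDFxStream at all. The following is a minimal sketch under stated assumptions: `Cell` and `LayoutSketch` are hypothetical names that are not part of the library, positions are already expressed as character columns, and the cells arrive sorted by row and then by column. It is not the algorithm that VisualOutputTarget uses.

```java
import java.util.List;

/** Hypothetical stand-in for a positioned run of text; not a PDFxStream type. */
final class Cell {
    final int row;     // zero-based output line
    final int column;  // zero-based character column where the text starts
    final String text;

    Cell(int row, int column, String text) {
        this.row = row;
        this.column = column;
        this.text = text;
    }
}

final class LayoutSketch {
    /**
     * Layout-preserving rendering: breaks lines between rows and pads each
     * run with spaces until it starts at its own character column.
     * Assumes the cells are sorted by row, then by column.
     */
    static String visual(List<Cell> cells) {
        StringBuilder out = new StringBuilder();
        int row = 0;
        int col = 0;
        for (Cell c : cells) {
            while (row < c.row) {      // move down to the cell's line
                out.append('\n');
                row++;
                col = 0;
            }
            while (col < c.column) {   // pad out to the cell's column
                out.append(' ');
                col++;
            }
            out.append(c.text);
            col += c.text.length();
        }
        return out.toString();
    }

    /** Read-order rendering: same runs, single spaces, no line breaks. */
    static String readOrder(List<Cell> cells) {
        StringBuilder out = new StringBuilder();
        for (Cell c : cells) {
            if (out.length() > 0) {
                out.append(' ');
            }
            out.append(c.text);
        }
        return out.toString();
    }
}
```

Calling `visual` keeps each value under its column header, while `readOrder` yields one continuous run of text, which mirrors the trade-off described in the notes above.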
- Since: v2.0
- Version: ©2004-2024 Snowtide
Constructor Summary
- VisualOutputTarget(Appendable sb)
Method Summary
- void endLine(Line line): Invoked when PDFxStream has finished processing a Line.
- void endPage(Page page): Invoked when PDFxStream has finished processing a page.
- float getSpacingScale(): Returns the spacing scale currently in effect for this VisualOutputTarget.
- boolean isIncludingRotatedChars(): Return true if this VisualOutputTarget will include rotated TextUnits in its output (true by default).
- boolean isMarginTrimmed(): Returns true if this VisualOutputTarget trims whitespace corresponding to the left margin of each page piped to it.
- void linebreaks(int linebreakCnt): Invoked when PDFxStream determines that a series of line breaks should be outputted between the previous entity (page, block, line, etc) and the next entity (page, block, line, etc).
- void setIncludingRotatedChars(boolean includingRotatedChars): Used to set whether or not this VisualOutputTarget will include rotated TextUnits in its output (true by default).
- void setMarginTrimmed(boolean marginTrimmed): Sets whether or not this VisualOutputTarget trims the whitespace corresponding to the left margin of each page it handles.
- void setSpacingScale(float scale): Modifies the spacing scale that is used when outputting content laid out using this VisualOutputTarget.
- void spaces(int spaceCnt): Invoked when PDFxStream determines that a series of spaces should be outputted between the previous entity (block, line, text unit, etc) and the next entity (block, line, text unit, etc).
- void startBlock(Block b): Invoked when a Block is about to be processed.
- void startLine(Line line): Invoked when a Line is about to be processed.
- void startPage(Page page): Invoked when a page is about to be processed.
- void textUnit(TextUnit tu): Invoked when a run of characters is to be outputted, as represented by the given TextUnit instance.
Constructor Detail
-
VisualOutputTarget
public VisualOutputTarget(Appendable sb)
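Because the constructor accepts any java.lang.Appendable, the destination for extracted text can be an in-memory StringBuilder, a java.io.Writer, or any other implementation of that interface. The sketch below shows only that JDK-level idea: `AppendableSketch.emit` is a hypothetical stand-in for a handler writing to its sink, and it does not call PDFxStream.

```java
import java.io.IOException;
import java.io.UncheckedIOException;

final class AppendableSketch {
    /**
     * Hypothetical stand-in for a handler writing extracted text to its sink.
     * Appendable.append declares IOException, so it is rethrown unchecked here.
     */
    static void emit(Appendable sink) {
        try {
            sink.append("Row 1").append(' ').append("$500").append('\n');
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The same `emit` call works unchanged with a `StringBuilder` (to collect text in memory) or a `java.io.StringWriter` or file-backed `Writer` (to stream it out), which is the flexibility an Appendable parameter provides.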
-
-
Method Detail
-
setSpacingScale
public void setSpacingScale(float scale)
Modifies the spacing scale that is used when outputting content laid out using this VisualOutputTarget. The default is 1; using a value of 2 will (approximately) double the number of spaces that are outputted between recognized words, while a value of .5 will (approximately) halve that number. This is useful in circumstances where:
- Columnar data elements should be separated by greater distances, perhaps to aid data collection
- Words are being outputted without any spaces between them at all, perhaps because of the use of (pathologically?) small fonts.
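The arithmetic behind a spacing scale can be sketched as follows. This is an illustrative model only: the class, the method, and the idea of dividing a horizontal gap by a nominal space width and rounding to the nearest integer are all assumptions, not the documented internals of VisualOutputTarget.

```java
final class SpacingScaleSketch {
    /**
     * Hypothetical model: converts a horizontal gap between two words into a
     * number of output spaces, multiplied by the spacing scale.
     * Rounding to the nearest integer is an assumption of this sketch.
     */
    static int spacesFor(float gapWidth, float spaceWidth, float scale) {
        if (gapWidth <= 0f || spaceWidth <= 0f) {
            return 0;
        }
        return Math.max(0, Math.round(gapWidth / spaceWidth * scale));
    }
}
```

Under this model, a gap worth four spaces becomes eight at a scale of 2 and two at a scale of .5, and a gap too narrow to round up to a single space at the default scale can yield one space once the scale is raised, which corresponds to the two circumstances listed above.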
-
getSpacingScale
public float getSpacingScale()
Returns the spacing scale currently in effect for this VisualOutputTarget.
- See Also: setSpacingScale(float)
-
isIncludingRotatedChars
public boolean isIncludingRotatedChars()
Returns true if this VisualOutputTarget will include rotated TextUnits in its output (true by default).
-
setIncludingRotatedChars
public void setIncludingRotatedChars(boolean includingRotatedChars)
Used to set whether or not this VisualOutputTarget will include rotated TextUnits in its output (true by default).
-
setMarginTrimmed
public void setMarginTrimmed(boolean marginTrimmed)
Sets whether or not this VisualOutputTarget trims the whitespace corresponding to the left margin of each page it handles. Defaults to false, meaning that text set off from the left edge of a page will be preceded by a corresponding number of spaces in resulting text extracts. This means that text located at the same horizontal position on different pages using the same font and font size will be found in the same column position in extracted text, simplifying identification and organization of implicitly tabular data that spans page boundaries.

If set to true, then only a minimal number of spaces will be added to the beginning of each line of text.
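The effect of trimming a left margin can be approximated on plain strings. The sketch below is an assumption-laden illustration, not the library's implementation: `MarginTrimSketch` is a hypothetical name, and it treats the margin as the leading run of spaces shared by every non-blank line of a page.

```java
import java.util.ArrayList;
import java.util.List;

final class MarginTrimSketch {
    /**
     * Removes the leading spaces common to all non-blank lines, so relative
     * indentation within the page survives. Blank lines are kept unchanged.
     */
    static List<String> trimLeftMargin(List<String> lines) {
        int margin = Integer.MAX_VALUE;
        for (String line : lines) {
            if (!line.trim().isEmpty()) {
                margin = Math.min(margin, leadingSpaces(line));
            }
        }
        List<String> out = new ArrayList<>();
        for (String line : lines) {
            // Blank lines never reach substring, so an unset margin is harmless.
            out.add(line.trim().isEmpty() ? line : line.substring(margin));
        }
        return out;
    }

    /** Counts the space characters at the start of a line. */
    static int leadingSpaces(String s) {
        int n = 0;
        while (n < s.length() && s.charAt(n) == ' ') {
            n++;
        }
        return n;
    }
}
```

Note the trade-off this makes visible: because the shared margin is computed per page, two pages with different margins would no longer place the same horizontal position in the same output column, which is why the untrimmed default is helpful for tabular data that spans pages.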
-
isMarginTrimmed
public boolean isMarginTrimmed()
Returns true if this VisualOutputTarget trims whitespace corresponding to the left margin of each page piped to it.
- See Also: setMarginTrimmed(boolean)
-
endPage
public void endPage(Page page)
Description copied from class: OutputHandler
Invoked when PDFxStream has finished processing a page.
- Overrides: endPage in class OutputHandler
- Parameters: page - a reference to the Page that has been processed
-
startBlock
public void startBlock(Block b)
Description copied from class: OutputHandler
Invoked when a Block is about to be processed.
- Overrides: startBlock in class OutputHandler
- Parameters: b - a reference to the Block that is about to be processed
-
textUnit
public void textUnit(TextUnit tu)
Description copied from class: OutputHandler
Invoked when a run of characters is to be outputted, as represented by the given TextUnit instance.
- Overrides: textUnit in class OutputHandler
-
spaces
public void spaces(int spaceCnt)
Description copied from class: OutputHandler
Invoked when PDFxStream determines that a series of spaces should be outputted between the previous entity (block, line, text unit, etc) and the next entity (block, line, text unit, etc).
- Overrides: spaces in class OutputHandler
- Parameters: spaceCnt - the number of spaces that PDFxStream recommends should be outputted
-
linebreaks
public void linebreaks(int linebreakCnt)
Description copied from class: OutputHandler
Invoked when PDFxStream determines that a series of line breaks should be outputted between the previous entity (page, block, line, etc) and the next entity (page, block, line, etc).
- Overrides: linebreaks in class OutputHandler
- Parameters: linebreakCnt - the number of line breaks that PDFxStream recommends should be outputted
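The spaces and linebreaks callbacks each receive a recommended count, which a handler is free to honor literally or adjust. As a rough illustration of the literal case, the sketch below uses a hypothetical `CallbackSketch` class that does not extend OutputHandler and takes a plain String where the real textUnit callback takes a TextUnit.

```java
final class CallbackSketch {
    private final StringBuilder out = new StringBuilder();

    /** Stands in for textUnit(TextUnit); a plain String is used for brevity. */
    void text(String run) {
        out.append(run);
    }

    /** Honors the recommended space count literally. */
    void spaces(int spaceCnt) {
        for (int i = 0; i < spaceCnt; i++) {
            out.append(' ');
        }
    }

    /** Honors the recommended line break count literally. */
    void linebreaks(int linebreakCnt) {
        for (int i = 0; i < linebreakCnt; i++) {
            out.append('\n');
        }
    }

    String result() {
        return out.toString();
    }
}
```

A sequence such as `text("Row 1")`, `spaces(3)`, `text("$500")`, `linebreaks(1)`, `text("Row 2")` then produces two lines with the value offset by three spaces from its row label.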
-
startLine
public void startLine(Line line)
Description copied from class: OutputHandler
Invoked when a Line is about to be processed.
- Overrides: startLine in class OutputHandler
- Parameters: line - a reference to the Line that is about to be processed
-
endLine
public void endLine(Line line)
Description copied from class: OutputHandler
Invoked when PDFxStream has finished processing a Line.
- Overrides: endLine in class OutputHandler
- Parameters: line - a reference to the Line that has been processed
-
startPage
public void startPage(Page page)
Description copied from class: OutputHandler
Invoked when a page is about to be processed.
- Overrides: startPage in class OutputHandler
- Parameters: page - a reference to the Page that is about to be processed
-
-