java.lang.Object
  com.snowtide.pdf.OutputHandler
      com.snowtide.pdf.VisualOutputTarget
public class VisualOutputTarget
This OutputHandler implementation aims to preserve as much of a PDF's text layout as possible so that text extracts yielded by this OutputHandler will retain the visual arrangement of text as present in the original document. This is ideal when the content being extracted is to be used as input into a downstream conversion process. For example, this OutputHandler will maintain the layout of most tabular data (this table is a sample for illustration purposes only):
             Column 1    Column 2    Column 3
    Row 1    $500        $1,000      14B
    Row 2    $1,000      $5,621      8A
    Row 3    $6,009      $121        N/A
whereas the default OutputTarget is more likely to output such tabular data with proper read-ordering, but with no concern for spacing or line breaks between the table's cells and rows:
Column 1 Column 2 Column 3 Row 1 $500 $1,000 14B Row 2 $1,000 $5,621 8A Row 3 $6,009 $121 N/A
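The difference between the two outputs above can be illustrated with a small, self-contained sketch. It does not use PDFTextStream at all: `LayoutSketch`, `flatten`, `preserveLayout`, and the fixed `colWidth` are hypothetical names chosen for illustration, and the real class derives positions from the page rather than from a fixed column width.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; none of these names are part of PDFTextStream.
class LayoutSketch {

    // Read-order output: every non-empty cell joined by a single space,
    // with no regard for row or column boundaries.
    static String flatten(List<String[]> rows) {
        StringBuilder sb = new StringBuilder();
        for (String[] row : rows) {
            for (String cell : row) {
                if (cell.isEmpty()) continue;
                if (sb.length() > 0) sb.append(' ');
                sb.append(cell);
            }
        }
        return sb.toString();
    }

    // Layout-preserving output: each cell is padded to a fixed column width
    // and each row ends with a line break. Cells longer than colWidth are
    // not handled in this sketch.
    static String preserveLayout(List<String[]> rows, int colWidth) {
        StringBuilder sb = new StringBuilder();
        for (String[] row : rows) {
            StringBuilder line = new StringBuilder();
            for (String cell : row) {
                line.append(cell);
                for (int i = cell.length(); i < colWidth; i++) line.append(' ');
            }
            // Drop the padding that follows the last cell of the row.
            int end = line.length();
            while (end > 0 && line.charAt(end - 1) == ' ') end--;
            sb.append(line, 0, end).append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<String[]> rows = Arrays.asList(
            new String[] {"", "Column 1", "Column 2", "Column 3"},
            new String[] {"Row 1", "$500", "$1,000", "14B"},
            new String[] {"Row 2", "$1,000", "$5,621", "8A"},
            new String[] {"Row 3", "$6,009", "$121", "N/A"});
        System.out.print(preserveLayout(rows, 12));
        System.out.println(flatten(rows));
    }
}
```

Padding to character columns only lines up visually in a monospace rendering, which is consistent with the note below about monospace fonts.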
Please note the following regarding VisualOutputTarget:
- VisualOutputTarget will yield very poor output when used to format rotated text; in such a case, the results are essentially undefined. You may optionally suppress the inclusion of rotated characters from VisualOutputTarget's output using setIncludingRotatedChars(boolean).
- VisualOutputTarget is likely to impose a slight performance penalty compared to using the default OutputTarget. This penalty should be no more than 5%, and is necessary because of the processing needed to normalize the extracted text to appear as it does on the page.
- VisualOutputTarget performs best when working with text rendered using a monospace font. Proportional fonts and (especially) justified text complicate the process of normalizing the spacing of the text formatted by this class. Improvements will likely be made in future PDFTextStream releases to make this class more capable when handling justified text or text rendered using proportional fonts.
Constructor Summary |
---|
VisualOutputTarget(java.lang.Appendable sb) |
VisualOutputTarget(java.io.Writer w) |
Method Summary | |
---|---|
void | endLine(Line line): Invoked when PDFTextStream has finished processing a Line. |
void | endPage(Page page): Invoked when PDFTextStream has finished processing a page. |
float | getSpacingScale(): Returns the spacing scale currently in effect for this VisualOutputTarget. |
boolean | isIncludingRotatedChars(): Returns true if this VisualOutputTarget will include rotated TextUnits in its output (true by default). |
void | linebreaks(int linebreakCnt): Invoked when PDFTextStream determines that a series of line breaks should be outputted between the previous entity (page, block, line, etc) and the next entity (page, block, line, etc). |
void | setIncludingRotatedChars(boolean includingRotatedChars): Used to set whether or not this VisualOutputTarget will include rotated TextUnits in its output (true by default). |
void | setSpacingScale(float scale): Modifies the spacing scale that is used when outputting content laid out using this VisualOutputTarget. |
void | spaces(int spaceCnt): Invoked when PDFTextStream determines that a series of spaces should be outputted between the previous entity (block, line, text unit, etc) and the next entity (block, line, text unit, etc). |
void | startBlock(Block b): Invoked when a Block is about to be processed. |
void | startLine(Line line): Invoked when a Line is about to be processed. |
void | startPage(Page page): Invoked when a page is about to be processed. |
void | textUnit(TextUnit tu): Invoked when a run of characters is to be outputted, as represented by the given TextUnit instance. |
Methods inherited from class com.snowtide.pdf.OutputHandler |
---|
endBlock, endPDF, startPDF |
Methods inherited from class java.lang.Object |
---|
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
Constructor Detail |
---|
public VisualOutputTarget(java.io.Writer w)
public VisualOutputTarget(java.lang.Appendable sb)
Method Detail |
---|
public void setSpacingScale(float scale)
Modifies the spacing scale that is used when outputting content laid out using this VisualOutputTarget. The default is 1; using a value of 2 will (approximately) double the number of spaces that are outputted between recognized words, while a value of 0.5 will (approximately) halve that number. This is useful in circumstances where:
public float getSpacingScale()
Returns the spacing scale currently in effect for this VisualOutputTarget.
See Also: setSpacingScale(float)
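The effect of the spacing scale described for setSpacingScale(float) can be modeled with simple arithmetic. The sketch below is not PDFTextStream code: `SpacingScaleSketch` and `scaledSpaces` are hypothetical names, and both the rounding rule and the one-space minimum are assumptions made only so the example is concrete (the documentation states only that the change is approximate).

```java
// Hypothetical model of the documented scaling behavior; not PDFTextStream code.
class SpacingScaleSketch {

    // Assumption: the scaled count is the rounded product of the original
    // count and the scale, and at least one space is kept so that adjacent
    // words never run together.
    static int scaledSpaces(int spaceCnt, float scale) {
        if (spaceCnt <= 0) return 0;
        return Math.max(1, Math.round(spaceCnt * scale));
    }

    public static void main(String[] args) {
        int original = 4;
        System.out.println("scale 1.0: " + scaledSpaces(original, 1.0f));
        System.out.println("scale 2.0: " + scaledSpaces(original, 2.0f));
        System.out.println("scale 0.5: " + scaledSpaces(original, 0.5f));
    }
}
```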
public boolean isIncludingRotatedChars()
Returns true if this VisualOutputTarget will include rotated TextUnits in its output (true by default).
public void setIncludingRotatedChars(boolean includingRotatedChars)
Sets whether or not this VisualOutputTarget will include rotated TextUnits in its output (true by default).
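As a rough illustration of what the rotated-character flag controls, the following self-contained sketch filters rotated runs out of a sequence of text runs. `RotatedFilterSketch`, `Unit`, and `render` are hypothetical stand-ins; `Unit` is not the real TextUnit type, and how PDFTextStream actually detects rotation is not shown here.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in types; this does not use the PDFTextStream API.
class RotatedFilterSketch {

    // Minimal stand-in for a run of characters with a rotation marker.
    static final class Unit {
        final String text;
        final boolean rotated;

        Unit(String text, boolean rotated) {
            this.text = text;
            this.rotated = rotated;
        }
    }

    // Mirrors the documented default of including rotated characters.
    private boolean includingRotatedChars = true;

    void setIncludingRotatedChars(boolean includingRotatedChars) {
        this.includingRotatedChars = includingRotatedChars;
    }

    boolean isIncludingRotatedChars() {
        return includingRotatedChars;
    }

    // Concatenates the runs, skipping rotated ones when the flag is off.
    String render(List<Unit> units) {
        StringBuilder sb = new StringBuilder();
        for (Unit u : units) {
            if (u.rotated && !includingRotatedChars) continue;
            sb.append(u.text);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<Unit> units = Arrays.asList(
            new Unit("Total: ", false),
            new Unit("DRAFT", true),
            new Unit("$500", false));
        RotatedFilterSketch sketch = new RotatedFilterSketch();
        System.out.println(sketch.render(units));
        sketch.setIncludingRotatedChars(false);
        System.out.println(sketch.render(units));
    }
}
```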
public void endPage(Page page)
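Description copied from class: OutputHandler
Invoked when PDFTextStream has finished processing a page.
Overrides: endPage in class OutputHandler
Parameters: page - a reference to the Page that has been processed
public void startBlock(Block b)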
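Description copied from class: OutputHandler
Invoked when a Block is about to be processed.
Overrides: startBlock in class OutputHandler
Parameters: b - a reference to the Block that is about to be processed
public void textUnit(TextUnit tu)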
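Description copied from class: OutputHandler
Invoked when a run of characters is to be outputted, as represented by the given TextUnit instance.
Overrides: textUnit in class OutputHandler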
public void spaces(int spaceCnt)
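Description copied from class: OutputHandler
Invoked when PDFTextStream determines that a series of spaces should be outputted between the previous entity (block, line, text unit, etc) and the next entity (block, line, text unit, etc).
Overrides: spaces in class OutputHandler
Parameters: spaceCnt - the number of spaces that PDFTextStream recommends should be outputted
public void linebreaks(int linebreakCnt)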
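Description copied from class: OutputHandler
Invoked when PDFTextStream determines that a series of line breaks should be outputted between the previous entity (page, block, line, etc) and the next entity (page, block, line, etc).
Overrides: linebreaks in class OutputHandler
Parameters: linebreakCnt - the number of line breaks that PDFTextStream recommends should be outputted
public void startLine(Line line)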
Description copied from class: OutputHandler
Invoked when a Line is about to be processed.
Overrides: startLine in class OutputHandler
Parameters: line - a reference to the Line that is about to be processed
public void endLine(Line line)
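Description copied from class: OutputHandler
Invoked when PDFTextStream has finished processing a Line.
Overrides: endLine in class OutputHandler
Parameters: line - a reference to the Line that has been processed
public void startPage(Page page)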
Description copied from class: OutputHandler
Invoked when a page is about to be processed.
Overrides: startPage in class OutputHandler
Parameters: page - a reference to the Page that is about to be processed
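The callbacks documented above follow an event-driven pattern: text runs, spaces, and line breaks arrive as separate calls, and the handler appends them to its target. The sketch below shows that pattern in isolation. `CallbackSketch` is a hypothetical stand-in that does not extend the real OutputHandler, and its `textUnit` takes a plain String instead of a TextUnit; the only thing it shares with the class documented here is that it writes to a java.lang.Appendable, which both StringBuilder and java.io.Writer implement.

```java
import java.io.IOException;
import java.io.StringWriter;
import java.io.UncheckedIOException;

// Hypothetical stand-in for an event-driven output handler; not the real OutputHandler.
class CallbackSketch {

    private final Appendable out;

    CallbackSketch(Appendable out) {
        this.out = out;
    }

    // A run of characters to emit as-is (a String stands in for TextUnit here).
    void textUnit(String chars) {
        append(chars);
    }

    // Emit the recommended number of spaces between two entities.
    void spaces(int spaceCnt) {
        for (int i = 0; i < spaceCnt; i++) append(" ");
    }

    // Emit the recommended number of line breaks between two entities.
    void linebreaks(int linebreakCnt) {
        for (int i = 0; i < linebreakCnt; i++) append("\n");
    }

    // Appendable.append declares IOException; rethrow it unchecked so the
    // callbacks keep simple signatures.
    private void append(String s) {
        try {
            out.append(s);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        StringWriter writer = new StringWriter();
        CallbackSketch handler = new CallbackSketch(writer);
        handler.textUnit("Row 1");
        handler.spaces(5);
        handler.textUnit("$500");
        handler.linebreaks(1);
        System.out.print(writer);
    }
}
```

Accepting an Appendable keeps the handler independent of where the text ends up, whether that is an in-memory buffer or a stream-backed Writer.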