com.snowtide.pdf
Class VisualOutputTarget

java.lang.Object
  extended by com.snowtide.pdf.OutputHandler
      extended by com.snowtide.pdf.VisualOutputTarget

public class VisualOutputTarget
extends OutputHandler

This OutputHandler implementation aims to preserve as much of a PDF's text layout as possible, so that extracted text retains the visual arrangement of the text in the original document. This is ideal when the extracted content is to be used as input to a downstream conversion process. For example, this OutputHandler will maintain the layout of most tabular data (the table below is a sample for illustration purposes only):

         Column 1     Column 2      Column 3
 Row 1      $500        $1,000           14B
 Row 2    $1,000        $5,621            8A
 Row 3    $6,009          $121           N/A
 

whereas the default OutputTarget is more likely to output such tabular data in proper reading order, but without regard for the spacing or line breaks between the table's cells and rows:

 Column 1 Column 2 Column 3
 Row 1 $500 $1,000 14B
 Row 2 $1,000 $5,621 8A
 Row 3 $6,009 $121 N/A
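
The difference between the two sample outputs above can be illustrated with a minimal, self-contained sketch. It does not use PDFTextStream and is not how VisualOutputTarget is implemented; the `LayoutSketch` class, the `Fragment` type, and the character-column positions are all hypothetical, chosen only to show why padding to a column preserves a table's shape while joining with single spaces does not.

```java
import java.util.Arrays;
import java.util.List;

class LayoutSketch {
    // Hypothetical stand-in for a positioned run of text; not a PDFTextStream type.
    static class Fragment {
        final int column;   // assumed starting character column
        final String text;
        Fragment(int column, String text) {
            this.column = column;
            this.text = text;
        }
    }

    // Read-order rendering: join fragments with single spaces, discarding position.
    static String readOrder(List<Fragment> line) {
        StringBuilder sb = new StringBuilder();
        for (Fragment f : line) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(f.text);
        }
        return sb.toString();
    }

    // Layout-preserving rendering: pad with spaces so each fragment starts at its column.
    // If a fragment would overlap the previous one, fall back to a single space.
    static String visual(List<Fragment> line) {
        StringBuilder sb = new StringBuilder();
        for (Fragment f : line) {
            int pad = f.column - sb.length();
            if (pad <= 0 && sb.length() > 0) pad = 1;
            for (int i = 0; i < pad; i++) sb.append(' ');
            sb.append(f.text);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<Fragment> row = Arrays.asList(
            new Fragment(0, "Row 1"),
            new Fragment(10, "$500"),
            new Fragment(20, "$1,000"));
        System.out.println(readOrder(row));
        System.out.println(visual(row));
    }
}
```

Running `main` prints the same row twice: once collapsed to single spaces and once with each cell starting at its assumed column.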
 

Please note the following regarding VisualOutputTarget:

Since:
v2.0
Version:
©2004-2012 Snowtide Informatics Systems, Inc.

Constructor Summary
VisualOutputTarget(java.lang.Appendable sb)
           
VisualOutputTarget(java.io.Writer w)
           
 
Method Summary
 void endLine(Line line)
          Invoked when PDFTextStream has finished processing a Line.
 void endPage(Page page)
          Invoked when PDFTextStream has finished processing a page.
 float getSpacingScale()
          Returns the spacing scale currently in effect for this VisualOutputTarget.
 boolean isIncludingRotatedChars()
          Returns true if this VisualOutputTarget will include rotated TextUnits in its output (true by default).
 void linebreaks(int linebreakCnt)
          Invoked when PDFTextStream determines that a series of line breaks should be outputted between the previous entity (page, block, line, etc.) and the next entity (page, block, line, etc.).
 void setIncludingRotatedChars(boolean includingRotatedChars)
          Sets whether this VisualOutputTarget will include rotated TextUnits in its output (true by default).
 void setSpacingScale(float scale)
          Modifies the spacing scale that is used when outputting content laid out using this VisualOutputTarget.
 void spaces(int spaceCnt)
          Invoked when PDFTextStream determines that a series of spaces should be outputted between the previous entity (block, line, text unit, etc.) and the next entity (block, line, text unit, etc.).
 void startBlock(Block b)
          Invoked when a Block is about to be processed.
 void startLine(Line line)
          Invoked when a Line is about to be processed.
 void startPage(Page page)
          Invoked when a page is about to be processed.
 void textUnit(TextUnit tu)
          Invoked when a run of characters is to be outputted, as represented by the given TextUnit instance.
 
Methods inherited from class com.snowtide.pdf.OutputHandler
endBlock, endPDF, startPDF
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

VisualOutputTarget

public VisualOutputTarget(java.io.Writer w)

VisualOutputTarget

public VisualOutputTarget(java.lang.Appendable sb)
Method Detail

setSpacingScale

public void setSpacingScale(float scale)
Modifies the spacing scale that is used when outputting content laid out using this VisualOutputTarget. The default is 1; using a value of 2 will (approximately) double the number of spaces that are outputted between recognized words, while a value of 0.5 will (approximately) halve that number. This is useful in circumstances where:


getSpacingScale

public float getSpacingScale()
Returns the spacing scale currently in effect for this VisualOutputTarget.

See Also:
setSpacingScale(float)
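
The effect of the spacing scale described under setSpacingScale can be sketched with a little arithmetic. The following is a minimal illustration only: it assumes the scale is applied by multiplying a space count and rounding to the nearest integer, which the documentation above does not confirm (it says only "approximately"). The `SpacingScaleSketch` class and its methods are hypothetical and are not part of PDFTextStream.

```java
class SpacingScaleSketch {
    // Assumption: the scale multiplies the unscaled space count, and the
    // result is rounded to the nearest integer and never goes below zero.
    static int scaledSpaces(int spaceCnt, float scale) {
        return Math.max(0, Math.round(spaceCnt * scale));
    }

    // Builds the run of spaces that would separate two words at the given scale.
    static String gap(int spaceCnt, float scale) {
        StringBuilder sb = new StringBuilder();
        int n = scaledSpaces(spaceCnt, scale);
        for (int i = 0; i < n; i++) sb.append(' ');
        return sb.toString();
    }

    public static void main(String[] args) {
        float[] scales = {0.5f, 1f, 2f};
        for (float scale : scales) {
            // Brackets make the width of each gap visible in the console.
            System.out.println("[Column 1" + gap(4, scale) + "Column 2] scale=" + scale);
        }
    }
}
```

With an unscaled gap of 4 spaces, this sketch yields gaps of 2, 4, and 8 spaces for scales of 0.5, 1, and 2, matching the halving and doubling described above.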

isIncludingRotatedChars

public boolean isIncludingRotatedChars()
Returns true if this VisualOutputTarget will include rotated TextUnits in its output (true by default).


setIncludingRotatedChars

public void setIncludingRotatedChars(boolean includingRotatedChars)
Sets whether this VisualOutputTarget will include rotated TextUnits in its output (true by default).


endPage

public void endPage(Page page)
Description copied from class: OutputHandler
Invoked when PDFTextStream has finished processing a page.

Overrides:
endPage in class OutputHandler
Parameters:
page - a reference to the Page that has been processed

startBlock

public void startBlock(Block b)
Description copied from class: OutputHandler
Invoked when a Block is about to be processed.

Overrides:
startBlock in class OutputHandler
Parameters:
b - a reference to the Block that is about to be processed

textUnit

public void textUnit(TextUnit tu)
Description copied from class: OutputHandler
Invoked when a run of characters is to be outputted, as represented by the given TextUnit instance.

Overrides:
textUnit in class OutputHandler

spaces

public void spaces(int spaceCnt)
Description copied from class: OutputHandler
Invoked when PDFTextStream determines that a series of spaces should be outputted between the previous entity (block, line, text unit, etc.) and the next entity (block, line, text unit, etc.).

Overrides:
spaces in class OutputHandler
Parameters:
spaceCnt - the number of spaces that PDFTextStream recommends should be outputted

linebreaks

public void linebreaks(int linebreakCnt)
Description copied from class: OutputHandler
Invoked when PDFTextStream determines that a series of line breaks should be outputted between the previous entity (page, block, line, etc.) and the next entity (page, block, line, etc.).

Overrides:
linebreaks in class OutputHandler
Parameters:
linebreakCnt - the number of line breaks that PDFTextStream recommends should be outputted

startLine

public void startLine(Line line)
Description copied from class: OutputHandler
Invoked when a Line is about to be processed.

Overrides:
startLine in class OutputHandler
Parameters:
line - a reference to the Line that is about to be processed

endLine

public void endLine(Line line)
Description copied from class: OutputHandler
Invoked when PDFTextStream has finished processing a Line.

Overrides:
endLine in class OutputHandler
Parameters:
line - a reference to the Line that has been processed

startPage

public void startPage(Page page)
Description copied from class: OutputHandler
Invoked when a page is about to be processed.

Overrides:
startPage in class OutputHandler
Parameters:
page - a reference to the Page that is about to be processed
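
The way the textUnit, spaces, and linebreaks callbacks documented above combine into output text can be sketched as follows. This is a simplified, self-contained stand-in, not the real OutputHandler: the `RecordingHandler` class is hypothetical, it takes plain Strings where the real methods take TextUnit instances, and the order of calls in `demo()` is an assumed example rather than a trace of what PDFTextStream actually emits.

```java
class HandlerEventsSketch {
    // Hypothetical stand-in that mirrors three of the callback names above.
    // The real methods receive TextUnit objects; Strings are used here for brevity.
    static class RecordingHandler {
        final StringBuilder out = new StringBuilder();

        // A run of characters is appended as-is.
        void textUnit(String chars) {
            out.append(chars);
        }

        // The requested number of spaces is appended.
        void spaces(int spaceCnt) {
            for (int i = 0; i < spaceCnt; i++) out.append(' ');
        }

        // The requested number of line breaks is appended.
        void linebreaks(int linebreakCnt) {
            for (int i = 0; i < linebreakCnt; i++) out.append('\n');
        }
    }

    // Drives the handler with an assumed sequence of events for two table rows.
    static String demo() {
        RecordingHandler h = new RecordingHandler();
        h.textUnit("Row 1");
        h.spaces(4);
        h.textUnit("$500");
        h.linebreaks(1);
        h.textUnit("Row 2");
        h.spaces(2);
        h.textUnit("$1,000");
        return h.out.toString();
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

In this sketch the differing space counts (4 and 2) are what right-align the two amounts; a handler that ignored the counts and always wrote one space would lose that alignment.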