Class VisualOutputTarget


  • public class VisualOutputTarget
    extends OutputHandler
    This OutputHandler implementation aims to preserve as much of a PDF's text layout as possible so that text extracts will retain the visual arrangement of text as present in the original document. This is ideal when the content being extracted is to be used as input into a downstream conversion process. For example, this OutputHandler will maintain the layout of most tabular data (this table is a sample for illustration purposes only):

             Column 1     Column 2      Column 3
     Row 1      $500        $1,000           14B
     Row 2    $1,000        $5,621            8A
     Row 3    $6,009          $121           N/A
     

    whereas the default OutputTarget is more likely to output such tabular data with proper read-ordering, but with no concern for spacing or line breaks between the table's cells and rows:

     Column 1 Column 2 Column 3
     Row 1 $500 $1,000 14B
     Row 2 $1,000 $5,621 8A
     Row 3 $6,009 $121 N/A
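The practical difference between the two renderings shows up when a downstream step needs to recover table cells. The sketch below does not use PDFxStream at all; it post-processes already-extracted text, and the class and method names (`VisualRowSplitSketch`, `splitCells`) are hypothetical. It assumes that cells in layout-preserved output are separated by at least two spaces, which holds for the sample above but is not guaranteed for every document.

```java
import java.util.Arrays;

// Illustration only: operates on extracted text, not on the PDFxStream API.
class VisualRowSplitSketch {

    // Splits one line into cells wherever two or more whitespace
    // characters appear in a row (an assumption about the layout).
    static String[] splitCells(String line) {
        String trimmed = line.trim();
        if (trimmed.isEmpty()) {
            return new String[0];
        }
        return trimmed.split("\\s{2,}");
    }

    public static void main(String[] args) {
        String visual = "     Row 1      $500        $1,000           14B";
        String flat = "Row 1 $500 $1,000 14B";

        // Layout-preserved line: the wide gaps mark the cell boundaries.
        System.out.println(splitCells(visual).length + " cells: "
                + Arrays.toString(splitCells(visual)));

        // Flattened line: single spaces cannot distinguish "Row 1" from "Row" and "1".
        System.out.println(splitCells(flat).length + " cell: "
                + Arrays.toString(splitCells(flat)));
    }
}
```

With the flattened form, the space inside "Row 1" is indistinguishable from the space between cells, so the same split yields a single cell.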
     

    Please note the following regarding VisualOutputTarget:

    • Because VisualOutputTarget attempts to maintain the visual appearance of each page's text, it will not separate columns and other document features that might be very important to a content-oriented text analysis process (such as search indexing).
    • VisualOutputTarget will yield very poor output when used to format rotated text; in such a case, the results are essentially undefined. You may optionally exclude rotated characters from VisualOutputTarget's output using setIncludingRotatedChars(boolean).
    • Using VisualOutputTarget is likely to impose a slight performance penalty compared to using the default OutputTarget. This penalty should be no more than 5%, and is necessary because of the processing needed to normalize the extracted text to appear as it does on the page.
    • The current implementation of VisualOutputTarget performs best when working with text rendered using a monospace font. Proportional fonts and (especially) justified text complicate the process of normalizing the spacing of the text formatted by this class.
    Since:
    v2.0
    Version:
    ©2004-2024 Snowtide
    • Method Summary

      Modifier and Type Method Description
      void endLine​(Line line)
      Invoked when PDFxStream has finished processing a Line.
      void endPage​(Page page)
      Invoked when PDFxStream has finished processing a page.
      float getSpacingScale()
      Returns the spacing scale currently in effect for this VisualOutputTarget.
      boolean isIncludingRotatedChars()
      Returns true if this VisualOutputTarget will include rotated TextUnits in its output (true by default).
      boolean isMarginTrimmed()
      Returns true if this VisualOutputTarget trims whitespace corresponding to the left margin of each page piped to it.
      void linebreaks​(int linebreakCnt)
      Invoked when PDFxStream determines that a series of line breaks should be output between the previous entity (page, block, line, etc.) and the next entity (page, block, line, etc.).
      void setIncludingRotatedChars​(boolean includingRotatedChars)
      Used to set whether or not this VisualOutputTarget will include rotated TextUnits in its output (true by default).
      void setMarginTrimmed​(boolean marginTrimmed)
      Sets whether or not this VisualOutputTarget trims the whitespace corresponding to the left margin of each page it handles.
      void setSpacingScale​(float scale)
      Modifies the spacing scale that is used when outputting content laid out using this VisualOutputTarget.
      void spaces​(int spaceCnt)
      Invoked when PDFxStream determines that a series of spaces should be output between the previous entity (block, line, text unit, etc.) and the next entity (block, line, text unit, etc.).
      void startBlock​(Block b)
      Invoked when a Block is about to be processed.
      void startLine​(Line line)
      Invoked when a Line is about to be processed.
      void startPage​(Page page)
      Invoked when a page is about to be processed.
      void textUnit​(TextUnit tu)
      Invoked when a run of characters is to be outputted, as represented by the given TextUnit instance.
    • Constructor Detail

      • VisualOutputTarget

        public VisualOutputTarget​(Appendable sb)
    • Method Detail

      • setSpacingScale

        public void setSpacingScale​(float scale)
        Modifies the spacing scale that is used when outputting content laid out using this VisualOutputTarget. The default is 1; a value of 2 will (approximately) double the number of spaces that are output between recognized words, while a value of 0.5 will (approximately) halve that number. This is useful in circumstances where:
        • Columnar data elements should be separated by greater distances, perhaps to aid data collection.
        • Words are being output without any spaces between them at all, perhaps because of the use of (pathologically?) small fonts.
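One way to picture the effect of the scale is as a multiplier applied before a horizontal gap is rounded to a whole number of space characters. The sketch below is a hypothetical model, not PDFxStream's actual algorithm: the class `SpacingScaleSketch`, the method `spacesForGap`, and the idea of dividing a gap width by a nominal space width are all assumptions made for illustration.

```java
// Hypothetical model of a spacing scale; NOT PDFxStream's actual algorithm.
class SpacingScaleSketch {

    // Converts a horizontal gap into a count of space characters.
    // gapWidth and spaceWidth are assumed to share the same unit (e.g. points).
    static int spacesForGap(double gapWidth, double spaceWidth, double scale) {
        long rounded = Math.round(gapWidth / spaceWidth * scale);
        return (int) Math.max(0L, rounded);
    }

    public static void main(String[] args) {
        // A wide gap between two table columns.
        System.out.println(spacesForGap(12.0, 3.0, 1.0)); // default scale
        System.out.println(spacesForGap(12.0, 3.0, 2.0)); // doubled spacing
        System.out.println(spacesForGap(12.0, 3.0, 0.5)); // halved spacing

        // A narrow gap between words set in a very small font: under this
        // model it rounds down to no space at all until the scale is raised.
        System.out.println(spacesForGap(1.2, 3.0, 1.0));
        System.out.println(spacesForGap(1.2, 3.0, 2.0));
    }
}
```

Under this model the two use cases listed above fall out of the same multiplication: a larger scale widens column gaps, and it also lifts gaps that would otherwise round to zero up to at least one space.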
      • getSpacingScale

        public float getSpacingScale()
        Returns the spacing scale currently in effect for this VisualOutputTarget.
        See Also:
        setSpacingScale(float)
      • isIncludingRotatedChars

        public boolean isIncludingRotatedChars()
        Returns true if this VisualOutputTarget will include rotated TextUnits in its output (true by default).
      • setIncludingRotatedChars

        public void setIncludingRotatedChars​(boolean includingRotatedChars)
        Used to set whether or not this VisualOutputTarget will include rotated TextUnits in its output (true by default).
      • setMarginTrimmed

        public void setMarginTrimmed​(boolean marginTrimmed)
        Sets whether or not this VisualOutputTarget trims the whitespace corresponding to the left margin of each page it handles. Defaults to false, meaning that text set off from the left edge of a page will be preceded by a corresponding number of spaces in resulting text extracts. This means that text located at the same horizontal position on different pages using the same font and font size will be found in the same column position in extracted text, simplifying identification and organization of implicitly tabular data that spans page boundaries.

        If set to true, each line of text will be preceded by only the minimum number of leading spaces.
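The trade-off described above can be illustrated on plain strings. The sketch below does not call PDFxStream; it models margin trimming as removing the indentation shared by every non-blank line of a page, which is an assumed interpretation of the setting rather than the library's documented algorithm. The names `MarginTrimSketch` and `trimLeftMargin` are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustration only: models margin trimming on already-extracted lines.
class MarginTrimSketch {

    // Counts the space characters at the start of a line.
    static int leadingSpaces(String line) {
        int n = 0;
        while (n < line.length() && line.charAt(n) == ' ') {
            n++;
        }
        return n;
    }

    // Removes the indentation common to all non-blank lines of one page.
    static List<String> trimLeftMargin(List<String> pageLines) {
        int margin = Integer.MAX_VALUE;
        for (String line : pageLines) {
            if (!line.trim().isEmpty()) {
                margin = Math.min(margin, leadingSpaces(line));
            }
        }
        List<String> out = new ArrayList<>();
        for (String line : pageLines) {
            // Blank lines carry no layout information, so emit them empty.
            out.add(line.trim().isEmpty() ? "" : line.substring(margin));
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> page1 = Arrays.asList("    Parts", "      Bolt");
        List<String> page2 = Arrays.asList("      Nut");

        // Untrimmed, "Bolt" and "Nut" both begin at column 6.
        // Trimmed per page, they begin at columns 2 and 0 respectively.
        System.out.println(trimLeftMargin(page1));
        System.out.println(trimLeftMargin(page2));
    }
}
```

In this model, trimming each page independently is what breaks the cross-page column alignment that the default (untrimmed) setting preserves.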

      • isMarginTrimmed

        public boolean isMarginTrimmed()
        Returns true if this VisualOutputTarget trims whitespace corresponding to the left margin of each page piped to it.
        See Also:
        setMarginTrimmed(boolean)
      • endPage

        public void endPage​(Page page)
        Description copied from class: OutputHandler
        Invoked when PDFxStream has finished processing a page.
        Overrides:
        endPage in class OutputHandler
        Parameters:
        page - a reference to the Page that has been processed
      • startBlock

        public void startBlock​(Block b)
        Description copied from class: OutputHandler
        Invoked when a Block is about to be processed.
        Overrides:
        startBlock in class OutputHandler
        Parameters:
        b - a reference to the Block that is about to be processed
      • spaces

        public void spaces​(int spaceCnt)
        Description copied from class: OutputHandler
        Invoked when PDFxStream determines that a series of spaces should be output between the previous entity (block, line, text unit, etc.) and the next entity (block, line, text unit, etc.).
        Overrides:
        spaces in class OutputHandler
        Parameters:
        spaceCnt - the number of spaces that PDFxStream recommends should be outputted
      • linebreaks

        public void linebreaks​(int linebreakCnt)
        Description copied from class: OutputHandler
        Invoked when PDFxStream determines that a series of line breaks should be output between the previous entity (page, block, line, etc.) and the next entity (page, block, line, etc.).
        Overrides:
        linebreaks in class OutputHandler
        Parameters:
        linebreakCnt - the number of line breaks that PDFxStream recommends should be outputted
      • startLine

        public void startLine​(Line line)
        Description copied from class: OutputHandler
        Invoked when a Line is about to be processed.
        Overrides:
        startLine in class OutputHandler
        Parameters:
        line - a reference to the Line that is about to be processed
      • endLine

        public void endLine​(Line line)
        Description copied from class: OutputHandler
        Invoked when PDFxStream has finished processing a Line.
        Overrides:
        endLine in class OutputHandler
        Parameters:
        line - a reference to the Line that has been processed
      • startPage

        public void startPage​(Page page)
        Description copied from class: OutputHandler
        Invoked when a page is about to be processed.
        Overrides:
        startPage in class OutputHandler
        Parameters:
        page - a reference to the Page that is about to be processed
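
The callbacks documented in this section are event notifications that a handler turns into text. The sketch below shows that idea with a self-contained stand-in class; it does not extend the real OutputHandler. The real callbacks receive Page, Block, Line and TextUnit objects, whereas this stand-in (`CallbackOrderSketch`, a hypothetical name) takes no arguments or a plain String so that it compiles on its own. The order in which the callbacks are fired in `demo()` is also an assumption made for illustration.

```java
// Hypothetical stand-in for handler callbacks; NOT the PDFxStream OutputHandler class.
class CallbackOrderSketch {

    private final StringBuilder out = new StringBuilder();

    void startLine() { }                       // stand-in for startLine(Line)
    void endLine() { }                         // stand-in for endLine(Line)

    // Stand-in for textUnit(TextUnit): appends a run of characters.
    void textUnit(String chars) {
        out.append(chars);
    }

    // Appends the recommended number of spaces.
    void spaces(int spaceCnt) {
        for (int i = 0; i < spaceCnt; i++) {
            out.append(' ');
        }
    }

    // Appends the recommended number of line breaks.
    void linebreaks(int linebreakCnt) {
        for (int i = 0; i < linebreakCnt; i++) {
            out.append('\n');
        }
    }

    String text() {
        return out.toString();
    }

    // Replays an assumed event sequence for two table rows.
    static String demo() {
        CallbackOrderSketch h = new CallbackOrderSketch();
        h.startLine();
        h.textUnit("Row 1");
        h.spaces(6);
        h.textUnit("$500");
        h.endLine();
        h.linebreaks(1);
        h.startLine();
        h.textUnit("Row 2");
        h.spaces(4);
        h.textUnit("$1,000");
        h.endLine();
        h.linebreaks(1);
        return h.text();
    }

    public static void main(String[] args) {
        System.out.print(demo());
    }
}
```

Because the space counts differ between the two rows, the dollar amounts end at the same column, which is the kind of alignment a layout-preserving handler is meant to produce.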