Class RegionOutputTarget


  • public class RegionOutputTarget
    extends OutputHandler

    This OutputHandler implementation is used to selectively extract text from specific regions of each PDF page.

    Here is the typical usage pattern:

    1. Open a new Document via PDF.open.
    2. Create a RegionOutputTarget instance, optionally specifying which type of OutputTarget should handle text layout rendering.
    3. Register each region of interest with the RegionOutputTarget, optionally specifying a name for each.
    4. For each Page in the Document:
      1. Pass the RegionOutputTarget instance to the Page's OutputSource.pipe(OutputHandler) function.
      2. Retrieve the text extracted for each region from the RegionOutputTarget, using either getRegionText(int) or getRegionText(String).

    Example:

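     // Open the document and register the regions of interest (units are points).
     Document pdf = PDF.open(pdfFile);
     RegionOutputTarget tgt = new RegionOutputTarget();
     tgt.addRegion(40, 600, 120, 16, "name");
     tgt.addRegion(40, 570, 120, 16, "address");
    
     // Pipe one page (index 0) through the target, then close the document.
     Page p = pdf.getPage(0);
     p.pipe(tgt);
     pdf.close();
    
     // Retrieve the text extracted for each named region.
     String name = tgt.getRegionText("name");
     String address = tgt.getRegionText("address");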
     

    Important notes:

    • The coordinates provided to RegionOutputTarget via the addRegion(float, float, float, float) or addRegion(float, float, float, float, String) functions are expressed in units of 1/72" (points). Recall that the origin of each page is its lower-left corner. So, for example, a region that encompasses the top-left quarter of an 8.5" x 11" page (612 x 792 points) would be registered with the parameters 0, 396, 306, 396.
    • By default, RegionOutputTarget uses a "greedy" algorithm: if any portion of a character overlaps a registered region, then that character is included in that region's text. A stricter threshold can be configured with setMinimumOverlapPct(float).
    • Passing a RegionOutputTarget to the pipe(OutputHandler) function of anything other than a Page will have undefined results. RegionOutputTarget depends on a PDF page being the "top-level" object in the PDF event stream.
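
    The point arithmetic in the first note can be sketched as follows. This is a standalone illustration and not part of PDFxStream: the class RegionCoordinates and its methods are hypothetical helpers, and the returned array simply lists the values in the order that addRegion expects them (x, y, width, height).

```java
// Hypothetical helper, not part of PDFxStream: converts page sizes given in
// inches into the point-based, lower-left-origin values that addRegion expects.
class RegionCoordinates {
    /** One inch is 72 points. */
    static final float POINTS_PER_INCH = 72f;

    static float inchesToPoints(float inches) {
        return inches * POINTS_PER_INCH;
    }

    /**
     * Returns {x, y, width, height} for the top-left quarter of a page.
     * Because the origin is the lower-left corner, the region's bottom edge
     * sits at half the page height.
     */
    static float[] topLeftQuarter(float pageWidthInches, float pageHeightInches) {
        float pageWidth = inchesToPoints(pageWidthInches);
        float pageHeight = inchesToPoints(pageHeightInches);
        return new float[] { 0f, pageHeight / 2f, pageWidth / 2f, pageHeight / 2f };
    }

    public static void main(String[] args) {
        float[] r = topLeftQuarter(8.5f, 11f);
        // prints "0.0, 396.0, 306.0, 396.0"
        System.out.println(r[0] + ", " + r[1] + ", " + r[2] + ", " + r[3]);
    }
}
```

    The four values could then be passed to addRegion in that order.
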
    Since:
    v2.0.2
    Version:
    ©2004-2024 Snowtide
    • Constructor Detail

      • RegionOutputTarget

        public RegionOutputTarget()
        Creates a new RegionOutputTarget, using a VisualOutputTarget to lay out the text extracted for each region.
      • RegionOutputTarget

        public RegionOutputTarget​(boolean useVisualTarget)
      • RegionOutputTarget

        public RegionOutputTarget​(boolean useVisualTarget,
                                  Direction bd)
        Creates a new RegionOutputTarget.
        Parameters:
        useVisualTarget - if true, then the layout of the text for each region will be determined by VisualOutputTarget; otherwise, the standard OutputTarget will be used.
    • Method Detail

      • getMinimumOverlapPct

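        public float getMinimumOverlapPct()
        Returns the minimum overlap currently required between a registered region and a character for that character to be included in extracted content.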
      • setMinimumOverlapPct

        public void setMinimumOverlapPct​(float minimumOverlapPct)
        Sets the minimum overlap between registered regions and each considered character on a page for the latter to be included in extracted content. By default, any overlap qualifies a character for inclusion; this configuration option can be used to require that e.g. a majority of the character's bounds be within a registered region.
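
        To make the idea of a minimum overlap concrete, here is a minimal standalone sketch. It is not the library's implementation: the OverlapSketch class, its Rect type, and the choice to measure overlap as a fraction (0 to 1) of the character's bounding-box area are all assumptions made for illustration. The scale that setMinimumOverlapPct actually expects is not stated here.

```java
// Hypothetical illustration, not PDFxStream code: decides whether a character
// belongs to a region based on how much of its bounding box the region covers.
class OverlapSketch {
    /** Axis-aligned rectangle in points, with (x, y) as its lower-left corner. */
    static final class Rect {
        final float x, y, width, height;

        Rect(float x, float y, float width, float height) {
            this.x = x;
            this.y = y;
            this.width = width;
            this.height = height;
        }
    }

    /** Fraction (0..1) of the character's area that lies inside the region. */
    static float overlapFraction(Rect ch, Rect region) {
        float overlapW = Math.min(ch.x + ch.width, region.x + region.width) - Math.max(ch.x, region.x);
        float overlapH = Math.min(ch.y + ch.height, region.y + region.height) - Math.max(ch.y, region.y);
        if (overlapW <= 0f || overlapH <= 0f) {
            return 0f; // the rectangles do not intersect
        }
        float charArea = ch.width * ch.height;
        return charArea <= 0f ? 0f : (overlapW * overlapH) / charArea;
    }

    /** A minimum of 0 reproduces the "any overlap qualifies" behaviour. */
    static boolean included(Rect ch, Rect region, float minimumFraction) {
        float fraction = overlapFraction(ch, region);
        return fraction > 0f && fraction >= minimumFraction;
    }

    public static void main(String[] args) {
        Rect region = new Rect(40f, 600f, 120f, 16f);
        Rect ch = new Rect(156f, 602f, 8f, 10f); // straddles the region's right edge
        System.out.println(overlapFraction(ch, region)); // prints 0.5
        System.out.println(included(ch, region, 0f));    // prints true
        System.out.println(included(ch, region, 0.75f)); // prints false
    }
}
```

        With a minimum of 0 the straddling character is kept; requiring three quarters of its area to be inside the region excludes it.
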
      • addRegion

        public void addRegion​(float x,
                              float y,
                              float width,
                              float height)

        Registers a new unnamed region. The coordinate pair x, y describes the origin and bottom-left corner of the rectangular region to be extracted; the width and height parameters represent the size of the rectangular region, extending up and to the right from the origin specified by x, y.

        All values are expressed in units of 1/72" (called points). Please note that the origin of each page is its lower-left corner. So, for example, a region that encompasses the top-left quarter of an 8.5" x 11" page would be registered with the parameters 0, 396, 306, 396.

      • addRegion

        public void addRegion​(float x,
                              float y,
                              float width,
                              float height,
                              String name)

        Registers a new named region. The coordinate pair x, y describes the origin and bottom-left corner of the rectangular region to be extracted; the width and height parameters represent the size of the rectangular region, extending up and to the right from the origin specified by x, y. Text extracted from this region will be available via the getRegionText(String) function.

        All values are expressed in units of 1/72" (called points). Please note that the origin of each page is its lower-left corner. So, for example, a region that encompasses the top-left quarter of an 8.5" x 11" page would be registered with the parameters 0, 396, 306, 396.
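
        If a region was measured downward from the top of the page (as some viewers and image tools report it), its y value has to be flipped before it is passed to addRegion. The following is a minimal sketch under that assumption; TopDownRegion and toBottomLeftY are hypothetical names, not PDFxStream API, and the page height in points is assumed to be known.

```java
// Hypothetical helper, not part of PDFxStream: converts a top-down measurement
// into the lower-left-origin y value described above.
class TopDownRegion {
    /**
     * @param pageHeight page height in points
     * @param top        distance from the top of the page to the region's top edge, in points
     * @param height     region height in points
     * @return y of the region's bottom edge, measured up from the bottom of the page
     */
    static float toBottomLeftY(float pageHeight, float top, float height) {
        return pageHeight - top - height;
    }

    public static void main(String[] args) {
        float pageHeight = 792f; // 11 inches at 72 points per inch
        // Top-left quarter of the page: starts at the top edge, 396 points tall.
        System.out.println(toBottomLeftY(pageHeight, 0f, 396f));  // prints 396.0
        // A 16-point-tall strip whose top edge is 176 points below the page top.
        System.out.println(toBottomLeftY(pageHeight, 176f, 16f)); // prints 600.0
    }
}
```

        The x, width, and height values are unaffected by the flip; only y changes.
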

      • getRegionText

        public String getRegionText​(int i)
        Returns the text extracted from the i-th region that was registered with this RegionOutputTarget.
      • getRegionText

        public String getRegionText​(String regionName)
        Returns the text extracted from the region that was registered with this RegionOutputTarget using the provided name.
      • getRegionCnt

        public int getRegionCnt()
        Returns the number of registered regions.
      • startPage

        public void startPage​(Page page)
        Description copied from class: OutputHandler
        Invoked when a page is about to be processed.
        Overrides:
        startPage in class OutputHandler
        Parameters:
        page - a reference to the Page that is about to be processed
      • endPage

        public void endPage​(Page page)
        Description copied from class: OutputHandler
        Invoked when PDFxStream has finished processing a page.
        Overrides:
        endPage in class OutputHandler
        Parameters:
        page - a reference to the Page that has been processed