Class RegionOutputTarget
- java.lang.Object
-
- com.snowtide.pdf.OutputHandler
-
- com.snowtide.pdf.RegionOutputTarget
-
public class RegionOutputTarget extends OutputHandler
This OutputHandler implementation is used to selectively extract text from certain regions of each PDF page.
Here is the typical usage pattern:
- Open a new Document via PDF
- Create a RegionOutputTarget instance, optionally specifying which type of OutputTarget to delegate text layout rendering to
- Register each region of interest with the RegionOutputTarget, optionally specifying a name for each
- For each Page in the Document:
  - Pass the created RegionOutputTarget instance to the Page's OutputSource.pipe(OutputHandler) function
  - Retrieve the text extracted for each region from the RegionOutputTarget, using either getRegionText(int) or getRegionText(String)
Example:
    Document pdf = PDF.open(pdfFile);
    RegionOutputTarget tgt = new RegionOutputTarget();
    tgt.addRegion(40, 600, 120, 16, "name");
    tgt.addRegion(40, 570, 120, 16, "address");
    Page p = pdf.getPage(0);
    p.pipe(tgt);
    pdf.close();
    String name = tgt.getRegionText("name");
    String address = tgt.getRegionText("address");
Important notes:
- The coordinates provided to RegionOutputTarget via the addRegion(float, float, float, float) or addRegion(float, float, float, float, String) functions are denominated in 1/72". Recall that the origin of each page is its lower left corner. So, for example, a region that would encompass the top-left quarter of an 8.5" x 11" page would be registered with the parameters 0, 396, 306, 396.
- RegionOutputTarget uses a "greedy" algorithm: if any portion of a character overlaps a registered region, then that character is included in that region's text.
- Passing a RegionOutputTarget to the pipe(OutputHandler) function of anything other than a Page will have undefined results. RegionOutputTarget depends on a PDF page being the "top-level" object in the PDF event stream.
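The unit arithmetic in the first note can be sketched as follows. This is a minimal, standalone illustration; the class `RegionUnits` and its methods are hypothetical helpers and are not part of PDFxStream. It only reproduces the 1/72" conversion and the lower-left origin convention described above.

```java
import java.util.Arrays;

// Hypothetical helper, not part of PDFxStream: computes addRegion-style
// parameters (x, y, width, height) in points for a page sized in inches.
final class RegionUnits {
    static final float POINTS_PER_INCH = 72f;

    // Converts a length in inches to points (1/72").
    static float toPoints(float inches) {
        return inches * POINTS_PER_INCH;
    }

    // Returns {x, y, width, height} for the top-left quarter of a page.
    // The origin is the page's lower-left corner, so the top half of the
    // page starts at y = pageHeight / 2.
    static float[] topLeftQuarter(float pageWidthInches, float pageHeightInches) {
        float w = toPoints(pageWidthInches);
        float h = toPoints(pageHeightInches);
        return new float[] { 0f, h / 2f, w / 2f, h / 2f };
    }

    public static void main(String[] args) {
        // 8.5" x 11" is 612 x 792 points.
        System.out.println(Arrays.toString(topLeftQuarter(8.5f, 11f)));
        // prints [0.0, 396.0, 306.0, 396.0]
    }
}
```

The four resulting values are in the same order as the parameters of addRegion(float, float, float, float).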
- Since:
- v2.0.2
- Version:
- ©2004-2025 Snowtide
Constructor Summary
Constructors
- RegionOutputTarget(): Creates a new RegionOutputTarget, using a VisualOutputTarget to lay out the text extracted for each region.
- RegionOutputTarget(boolean useVisualTarget)
- RegionOutputTarget(boolean useVisualTarget, Direction bd): Creates a new RegionOutputTarget.
-
Method Summary
All Methods / Instance Methods / Concrete Methods
- void addRegion(float x, float y, float width, float height): Registers a new unnamed region.
- void addRegion(float x, float y, float width, float height, String name): Registers a new named region.
- void endPage(Page page): Invoked when PDFxStream has finished processing a page.
- float getMinimumOverlapPct()
- int getRegionCnt(): Returns the number of registered regions.
- Set<String> getRegionNames(): Returns a set containing each of the names used to register regions on this RegionOutputTarget via addRegion(float, float, float, float, String).
- String getRegionText(int i): Returns the text extracted from the i-th region that was registered with this RegionOutputTarget.
- String getRegionText(String regionName): Returns the text extracted from the region that was registered with this RegionOutputTarget using the provided name.
- void setMinimumOverlapPct(float minimumOverlapPct): Sets the minimum overlap between registered regions and each considered character on a page for the latter to be included in extracted content.
- void startPage(Page page): Invoked when a page is about to be processed.
- void textUnit(TextUnit tu): Invoked when a run of characters is to be outputted, as represented by the given TextUnit instance.
Methods inherited from class com.snowtide.pdf.OutputHandler
endBlock, endLine, endPDF, endSpan, linebreaks, spaces, startBlock, startLine, startPDF, startSpan
Constructor Detail
-
RegionOutputTarget
public RegionOutputTarget()
Creates a new RegionOutputTarget, using a VisualOutputTarget to lay out the text extracted for each region.
-
RegionOutputTarget
public RegionOutputTarget(boolean useVisualTarget)
-
RegionOutputTarget
public RegionOutputTarget(boolean useVisualTarget, Direction bd)
Creates a new RegionOutputTarget.
- Parameters:
useVisualTarget - if true, then the layout of the text for each region will be determined by VisualOutputTarget; otherwise, the standard OutputTarget will be used.
-
-
Method Detail
-
getMinimumOverlapPct
public float getMinimumOverlapPct()
-
setMinimumOverlapPct
public void setMinimumOverlapPct(float minimumOverlapPct)
Sets the minimum overlap between registered regions and each considered character on a page for the latter to be included in extracted content. By default, any overlap qualifies a character for inclusion; this configuration option can be used to require that e.g. a majority of the character's bounds be within a registered region.
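To make the effect of a minimum overlap concrete, here is a minimal sketch of one plausible overlap test. It is an assumption-laden model, not PDFxStream's implementation: this reference does not say how overlap is measured or on what scale the threshold is expressed, so the sketch assumes overlap is the fraction (0 to 1) of a character's bounding-box area that falls inside the region. `OverlapSketch`, `Rect`, `overlapFraction`, and `included` are hypothetical names.

```java
// Illustrative model only. How PDFxStream measures overlap, and the scale
// of the threshold, are NOT documented here; this sketch assumes an
// area-based fraction in the range 0..1.
final class OverlapSketch {
    // Axis-aligned rectangle with a lower-left origin, like addRegion's parameters.
    static final class Rect {
        final float x, y, width, height;

        Rect(float x, float y, float width, float height) {
            this.x = x;
            this.y = y;
            this.width = width;
            this.height = height;
        }
    }

    // Fraction of the character box's area that lies inside the region.
    static float overlapFraction(Rect region, Rect ch) {
        float w = Math.min(region.x + region.width, ch.x + ch.width) - Math.max(region.x, ch.x);
        float h = Math.min(region.y + region.height, ch.y + ch.height) - Math.max(region.y, ch.y);
        if (w <= 0f || h <= 0f) {
            return 0f; // boxes do not intersect (or only touch at an edge)
        }
        float area = ch.width * ch.height;
        return area <= 0f ? 0f : (w * h) / area;
    }

    // A threshold of 0 models the default "any overlap qualifies" behaviour.
    static boolean included(Rect region, Rect ch, float minimumOverlap) {
        float f = overlapFraction(region, ch);
        return minimumOverlap <= 0f ? f > 0f : f >= minimumOverlap;
    }

    public static void main(String[] args) {
        Rect region = new Rect(40f, 600f, 120f, 16f);
        Rect straddling = new Rect(156f, 600f, 8f, 16f); // half inside the region
        System.out.println(included(region, straddling, 0f));    // any overlap qualifies
        System.out.println(included(region, straddling, 0.75f)); // requires 75% inside
    }
}
```

Under this model, a character straddling the region's right edge is kept with the default setting and dropped once the threshold exceeds the portion of the character that lies inside the region.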
-
addRegion
public void addRegion(float x, float y, float width, float height)
Registers a new unnamed region. The coordinate pair x, y describes the origin and bottom-left corner of the rectangular region to be extracted; the width and height parameters represent the size of the rectangular region, extending up and to the right from the origin specified by x, y.
All values are denominated in 1/72" (called points). Please note that the origin of each page is its lower left corner. So, for example, a region that would encompass the top-left quarter of an 8.5" x 11" page would be registered with the parameters 0, 396, 306, 396.
-
addRegion
public void addRegion(float x, float y, float width, float height, String name)
Registers a new named region. The coordinate pair x, y describes the origin and bottom-left corner of the rectangular region to be extracted; the width and height parameters represent the size of the rectangular region, extending up and to the right from the origin specified by x, y. Text extracted from this region will be available via the getRegionText(String) function.
All values are denominated in 1/72" (called points). Please note that the origin of each page is its lower left corner. So, for example, a region that would encompass the top-left quarter of an 8.5" x 11" page would be registered with the parameters 0, 396, 306, 396.
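Region measurements are often taken from the top of the page (for example, in an image editor or on-screen viewer), whereas addRegion expects a lower-left origin. The following minimal sketch shows the conversion. `RegionFlip` and `fromTopLeft` are hypothetical names, not part of PDFxStream, and the page height is assumed to be known in points (792 for an 11" page).

```java
import java.util.Arrays;

// Hypothetical helper, not part of PDFxStream: converts a box measured from
// the page's top-left corner into lower-left-origin region parameters.
final class RegionFlip {
    // left/top are distances from the page's left and top edges, in points;
    // y grows downward in the input and upward in the result.
    // Returns {x, y, width, height} where (x, y) is the box's bottom-left corner.
    static float[] fromTopLeft(float left, float top, float width, float height,
                               float pageHeight) {
        float y = pageHeight - top - height; // bottom edge, measured from page bottom
        return new float[] { left, y, width, height };
    }

    public static void main(String[] args) {
        // A 120 x 16 pt box whose top edge is 176 pt below the top of a 792 pt page.
        System.out.println(Arrays.toString(fromTopLeft(40f, 176f, 120f, 16f, 792f)));
        // prints [40.0, 600.0, 120.0, 16.0]
    }
}
```

The result corresponds to the parameters 40, 600, 120, 16 used for the "name" region in the class-level example.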
-
getRegionText
public String getRegionText(int i)
Returns the text extracted from the i-th region that was registered with this RegionOutputTarget.
-
getRegionText
public String getRegionText(String regionName)
Returns the text extracted from the region that was registered with this RegionOutputTarget using the provided name.
-
getRegionNames
public Set<String> getRegionNames()
Returns a set containing each of the names used to register regions on this RegionOutputTarget via addRegion(float, float, float, float, String).
-
getRegionCnt
public int getRegionCnt()
Returns the number of registered regions.
-
startPage
public void startPage(Page page)
Description copied from class: OutputHandler
Invoked when a page is about to be processed.
- Overrides:
startPage in class OutputHandler
- Parameters:
page - a reference to the Page that is about to be processed
-
textUnit
public void textUnit(TextUnit tu)
Description copied from class: OutputHandler
Invoked when a run of characters is to be outputted, as represented by the given TextUnit instance.
- Overrides:
textUnit in class OutputHandler
-
endPage
public void endPage(Page page)
Description copied from class: OutputHandler
Invoked when PDFxStream has finished processing a page.
- Overrides:
endPage in class OutputHandler
- Parameters:
page - a reference to the Page that has been processed
-
-