Class RegionOutputTarget
- java.lang.Object
-
- com.snowtide.pdf.OutputHandler
-
- com.snowtide.pdf.RegionOutputTarget
-
public class RegionOutputTarget extends OutputHandler
This OutputHandler implementation is used to selectively extract text from certain regions of each PDF page. Here is the typical usage pattern:
- Open a new Document via PDF
- Create a RegionOutputTarget instance, optionally specifying which type of OutputTarget to delegate text layout rendering to
- Register each region of interest with the RegionOutputTarget, optionally specifying a name for each
- For each Page in the Document:
  - Pass the created RegionOutputTarget instance to each Page's OutputSource.pipe(OutputHandler) function
  - Retrieve the text extracted for each region from the RegionOutputTarget, using either getRegionText(int) or getRegionText(String)
Example:
Document pdf = PDF.open(pdfFile);
RegionOutputTarget tgt = new RegionOutputTarget();
tgt.addRegion(40, 600, 120, 16, "name");
tgt.addRegion(40, 570, 120, 16, "address");
Page p = pdf.getPage(0);
p.pipe(tgt);
pdf.close();
String name = tgt.getRegionText("name");
String address = tgt.getRegionText("address");
Important notes:
- The coordinates provided to RegionOutputTarget via the addRegion(float, float, float, float) or addRegion(float, float, float, float, String) functions are denominated in 1/72". Recall that the origin of each page is its lower left corner. So, for example, a region that would encompass the top-left quarter of an 8.5" x 11" page would be registered with the parameters 0, 396, 306, 396.
- RegionOutputTarget uses a "greedy" algorithm: if any portion of a character overlaps a registered region, then that character is included in that region's text.
- Passing a RegionOutputTarget to the pipe(OutputHandler) function of anything other than a Page will have undefined results. RegionOutputTarget depends on a PDF page being the "top-level" object in the PDF event stream.
- Since:
- v2.0.2
- Version:
- ©2004-2024 Snowtide
Constructor Summary
Constructors

| Constructor | Description |
| --- | --- |
| RegionOutputTarget() | Creates a new RegionOutputTarget, using a VisualOutputTarget to lay out the text extracted for each region. |
| RegionOutputTarget(boolean useVisualTarget) | |
| RegionOutputTarget(boolean useVisualTarget, Direction bd) | Creates a new RegionOutputTarget. |
-
Method Summary
| Modifier and Type | Method | Description |
| --- | --- | --- |
| void | addRegion(float x, float y, float width, float height) | Registers a new unnamed region. |
| void | addRegion(float x, float y, float width, float height, String name) | Registers a new named region. |
| void | endPage(Page page) | Invoked when PDFxStream has finished processing a page. |
| float | getMinimumOverlapPct() | |
| int | getRegionCnt() | Returns the number of registered regions. |
| Set<String> | getRegionNames() | Returns a set containing each of the names used to register regions on this RegionOutputTarget via addRegion(float, float, float, float, String). |
| String | getRegionText(int i) | Returns the text extracted from the i-th region that was registered with this RegionOutputTarget. |
| String | getRegionText(String regionName) | Returns the text extracted from the region that was registered with this RegionOutputTarget using the provided name. |
| void | setMinimumOverlapPct(float minimumOverlapPct) | Sets the minimum overlap between registered regions and each considered character on a page for the latter to be included in extracted content. |
| void | startPage(Page page) | Invoked when a page is about to be processed. |
| void | textUnit(TextUnit tu) | Invoked when a run of characters is to be outputted, as represented by the given TextUnit instance. |
Methods inherited from class com.snowtide.pdf.OutputHandler
endBlock, endLine, endPDF, endSpan, linebreaks, spaces, startBlock, startLine, startPDF, startSpan
-
-
-
-
Constructor Detail
-
RegionOutputTarget
public RegionOutputTarget()
Creates a new RegionOutputTarget, using a VisualOutputTarget to lay out the text extracted for each region.
-
RegionOutputTarget
public RegionOutputTarget(boolean useVisualTarget)
-
RegionOutputTarget
public RegionOutputTarget(boolean useVisualTarget, Direction bd)
Creates a new RegionOutputTarget.
- Parameters:
useVisualTarget - if true, then the layout of the text for each region will be determined by VisualOutputTarget; otherwise, the standard OutputTarget will be used.
-
-
Method Detail
-
getMinimumOverlapPct
public float getMinimumOverlapPct()
-
setMinimumOverlapPct
public void setMinimumOverlapPct(float minimumOverlapPct)
Sets the minimum overlap between registered regions and each considered character on a page for the latter to be included in extracted content. By default, any overlap qualifies a character for inclusion; this configuration option can be used to require that e.g. a majority of the character's bounds be within a registered region.
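To make the effect of such a threshold concrete, here is a minimal sketch of one way an overlap test could be computed. It is not PDFxStream's implementation: the OverlapSketch class, its Rect type, and the method names are hypothetical, and the sketch assumes the threshold is a fraction between 0 and 1 of the character's bounding-box area, which this page does not specify.

```java
final class OverlapSketch {
    /** Hypothetical helper: axis-aligned rectangle in points, (x, y) is its bottom-left corner. */
    static final class Rect {
        final float x, y, width, height;

        Rect(float x, float y, float width, float height) {
            this.x = x;
            this.y = y;
            this.width = width;
            this.height = height;
        }
    }

    /** Returns the share of the character's bounding box that lies inside the region (0 to 1). */
    static float overlapFraction(Rect ch, Rect region) {
        float w = Math.min(ch.x + ch.width, region.x + region.width) - Math.max(ch.x, region.x);
        float h = Math.min(ch.y + ch.height, region.y + region.height) - Math.max(ch.y, region.y);
        float charArea = ch.width * ch.height;
        if (w <= 0f || h <= 0f || charArea <= 0f) {
            return 0f; // no intersection, or a degenerate character box
        }
        return (w * h) / charArea;
    }

    /** Assumed rule: a threshold of 0 accepts any overlap; otherwise the fraction must reach it. */
    static boolean isIncluded(Rect ch, Rect region, float minFraction) {
        float fraction = overlapFraction(ch, region);
        return fraction > 0f && fraction >= minFraction;
    }

    public static void main(String[] args) {
        Rect region = new Rect(40, 600, 120, 16);
        Rect straddling = new Rect(155, 600, 10, 16); // right half extends past the region's edge
        System.out.println(overlapFraction(straddling, region));   // 0.5
        System.out.println(isIncluded(straddling, region, 0f));    // true
        System.out.println(isIncluded(straddling, region, 0.6f));  // false
    }
}
```

Under these assumptions, a character that straddles a region's edge is kept by the default "any overlap" behaviour but dropped once a majority of its bounds is required.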
-
addRegion
public void addRegion(float x, float y, float width, float height)
Registers a new unnamed region. The coordinate pair x, y describes the origin and bottom-left corner of the rectangular region to be extracted; the width and height parameters represent the size of the rectangular region, extending up and to the right from the origin specified by x, y.
All values are denominated in 1/72" (called points). Please note that the origin of each page is its lower left corner. So, for example, a region that would encompass the top-left quarter of an 8.5" x 11" page would be registered with the parameters 0, 396, 306, 396.
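The arithmetic behind that example can be sketched in plain Java. The RegionCoordinates class and its fromTopLeftInches method below are hypothetical helpers, not part of PDFxStream; they assume a rectangle measured in inches from the page's top-left corner and convert it to the bottom-left-origin point values described above.

```java
import java.util.Arrays;

final class RegionCoordinates {
    /** One inch is 72 points (each point is 1/72"). */
    static final float POINTS_PER_INCH = 72f;

    /**
     * Hypothetical helper: converts a rectangle given in inches, measured from the
     * page's top-left corner, into {x, y, width, height} in points with the origin
     * at the page's bottom-left corner.
     */
    static float[] fromTopLeftInches(float leftIn, float topIn, float widthIn,
                                     float heightIn, float pageHeightIn) {
        float x = leftIn * POINTS_PER_INCH;
        // The region's bottom edge sits (top + height) inches below the top of the page,
        // so its distance from the bottom of the page is pageHeight - top - height.
        float y = (pageHeightIn - topIn - heightIn) * POINTS_PER_INCH;
        float width = widthIn * POINTS_PER_INCH;
        float height = heightIn * POINTS_PER_INCH;
        return new float[] {x, y, width, height};
    }

    public static void main(String[] args) {
        // Top-left quarter of an 8.5" x 11" page: 4.25" wide, 5.5" tall, starting at the top-left corner.
        float[] quarter = fromTopLeftInches(0f, 0f, 4.25f, 5.5f, 11f);
        System.out.println(Arrays.toString(quarter)); // [0.0, 396.0, 306.0, 396.0]
    }
}
```

The four resulting values are in the same order as the x, y, width, and height parameters of addRegion.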
-
addRegion
public void addRegion(float x, float y, float width, float height, String name)
Registers a new named region. The coordinate pair x, y describes the origin and bottom-left corner of the rectangular region to be extracted; the width and height parameters represent the size of the rectangular region, extending up and to the right from the origin specified by x, y. Text extracted from this region will be available via the getRegionText(String) function.
All values are denominated in 1/72" (called points). Please note that the origin of each page is its lower left corner. So, for example, a region that would encompass the top-left quarter of an 8.5" x 11" page would be registered with the parameters 0, 396, 306, 396.
-
getRegionText
public String getRegionText(int i)
Returns the text extracted from the i-th region that was registered with this RegionOutputTarget.
-
getRegionText
public String getRegionText(String regionName)
Returns the text extracted from the region that was registered with this RegionOutputTarget using the provided name.
-
getRegionNames
public Set<String> getRegionNames()
Returns a set containing each of the names used to register regions on this RegionOutputTarget via addRegion(float, float, float, float, String).
-
getRegionCnt
public int getRegionCnt()
Returns the number of registered regions.
-
startPage
public void startPage(Page page)
Description copied from class: OutputHandler
Invoked when a page is about to be processed.
- Overrides:
startPage in class OutputHandler
- Parameters:
page - a reference to the Page that is about to be processed
-
textUnit
public void textUnit(TextUnit tu)
Description copied from class: OutputHandler
Invoked when a run of characters is to be outputted, as represented by the given TextUnit instance.
- Overrides:
textUnit in class OutputHandler
-
endPage
public void endPage(Page page)
Description copied from class: OutputHandler
Invoked when PDFxStream has finished processing a page.
- Overrides:
endPage in class OutputHandler
- Parameters:
page - a reference to the Page that has been processed
-
-