public class RegionOutputTarget extends OutputHandler
This OutputHandler implementation is used to selectively extract text from certain regions of each PDF page.
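Here is the typical usage pattern:
1. Open a Document via PDF.
2. Create a RegionOutputTarget and register the regions of interest using addRegion.
3. For each Page in the Document:
   a. Pass the RegionOutputTarget to the Page.pipe(OutputHandler) function.
   b. Retrieve the text of each region using RegionOutputTarget.getRegionText(int) or RegionOutputTarget.getRegionText(String).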
Example:
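    // Open the PDF and register two named regions (x, y, width, height in points)
    Document pdf = PDF.open(pdfFile);
    RegionOutputTarget tgt = new RegionOutputTarget();
    tgt.addRegion(40, 600, 120, 16, "name");
    tgt.addRegion(40, 570, 120, 16, "address");
    // Pipe the first page through the target, then close the document
    Page p = pdf.getPage(0);
    p.pipe(tgt);
    pdf.close();
    // Read back the text captured for each region
    String name = tgt.getRegionText("name");
    String address = tgt.getRegionText("address");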
Important notes:
1. The coordinates provided to the RegionOutputTarget.addRegion(float, float, float, float) or RegionOutputTarget.addRegion(float, float, float, float, String) functions are denominated in 1/72". Recall that the origin of each page is its lower left corner. So, for example, a region that would encompass the top-left quarter of an 8.5" x 11" page would be registered with the parameters 0, 396, 306, 396.
2. Providing a RegionOutputTarget to the .pipe(OutputHandler) function of anything other than a Page will have undefined results. RegionOutputTarget depends on a PDF page being the "top-level" object in the PDF event stream.

Constructor and Description |
---|
RegionOutputTarget() Creates a new RegionOutputTarget, using a VisualOutputTarget to lay out the text extracted for each region. |
RegionOutputTarget(boolean useVisualTarget) Creates a new RegionOutputTarget. |

Modifier and Type | Method and Description |
---|---|
void | addRegion(float x, float y, float width, float height) Registers a new unnamed region. |
void | addRegion(float x, float y, float width, float height, java.lang.String name) Registers a new named region. |
void | endPage(Page page) Invoked when PDFxStream has finished processing a page. |
float | getMinimumOverlapPct() |
int | getRegionCnt() Returns the number of registered regions. |
java.util.Set | getRegionNames() Returns a set containing each of the names used to register regions on this RegionOutputTarget via RegionOutputTarget.addRegion(float, float, float, float, String). |
java.lang.String | getRegionText(int i) Returns the text extracted from the i-th region that was registered with this RegionOutputTarget. |
java.lang.String | getRegionText(java.lang.String regionName) Returns the text extracted from the region that was registered with this RegionOutputTarget using the provided name. |
void | setMinimumOverlapPct(float minimumOverlapPct) Sets the minimum overlap between registered regions and each considered character on a page for the latter to be included in extracted content. |
void | startPage(Page page) Invoked when a page is about to be processed. |
void | textUnit(TextUnit tu) Invoked when a run of characters is to be outputted, as represented by the given TextUnit instance. |

Methods inherited from class OutputHandler: endBlock, endLine, endPDF, linebreaks, spaces, startBlock, startLine, startPDF
public RegionOutputTarget()
Creates a new RegionOutputTarget, using a VisualOutputTarget to lay out the text extracted for each region.
public RegionOutputTarget(boolean useVisualTarget)
Creates a new RegionOutputTarget.
useVisualTarget - if true, then the layout of the text for each region will be determined by VisualOutputTarget; otherwise, the standard OutputTarget will be used.
public float getMinimumOverlapPct()
public void setMinimumOverlapPct(float minimumOverlapPct)
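This page does not say how the overlap between a character and a region is measured, nor what scale the percentage uses. As a rough illustration only, the sketch below assumes one plausible reading: the fraction of a character's bounding box area that falls inside the region, on a 0.0 to 1.0 scale. The class OverlapSketch and its method are hypothetical and are not part of PDFxStream.

```java
/** Hypothetical illustration (not PDFxStream code) of one way to measure character/region overlap. */
public final class OverlapSketch {
    private OverlapSketch() {}

    /**
     * Returns the fraction (0.0 to 1.0) of the character box's area that lies inside the region.
     * Both rectangles are {x, y, width, height} in points, with a bottom-left origin.
     * ASSUMPTION: area-based overlap; the library's actual rule is not documented here.
     */
    public static float overlapFraction(float[] charBox, float[] region) {
        // Edges of the intersection rectangle
        float left = Math.max(charBox[0], region[0]);
        float bottom = Math.max(charBox[1], region[1]);
        float right = Math.min(charBox[0] + charBox[2], region[0] + region[2]);
        float top = Math.min(charBox[1] + charBox[3], region[1] + region[3]);
        // Clamp to zero when the rectangles do not intersect
        float overlapW = Math.max(0f, right - left);
        float overlapH = Math.max(0f, top - bottom);
        float charArea = charBox[2] * charBox[3];
        return charArea <= 0f ? 0f : (overlapW * overlapH) / charArea;
    }
}
```

Under this reading, a character whose box straddles a region's left edge evenly would yield 0.5, and a threshold above that value would exclude it from the region's text.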
public void addRegion(float x, float y, float width, float height)
Registers a new unnamed region. The coordinate pair x, y describes the origin and bottom-left corner of the rectangular region to be extracted; the width and height parameters represent the size of the rectangular region, extending up and to the right from the origin specified by x, y.
All values are denominated in 1/72" (called points). Please note that the origin of each page is its lower left corner. So, for example, a region that would encompass the top-left quarter of an 8.5" x 11" page would be registered with the parameters 0, 396, 306, 396.
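Because the page origin is the lower left corner, rectangles measured from the top of the page (as in most screen-oriented tools) need their y value flipped before being registered. The following is a minimal sketch of that arithmetic; RegionCoords and its methods are hypothetical helpers, not part of PDFxStream, and the page height is assumed to be known by the caller.

```java
/** Hypothetical helper (not part of PDFxStream) for computing addRegion arguments. */
public final class RegionCoords {
    /** Points per inch: all addRegion values are denominated in 1/72". */
    public static final float POINTS_PER_INCH = 72f;

    private RegionCoords() {}

    /** Converts a length in inches to points. */
    public static float toPoints(float inches) {
        return inches * POINTS_PER_INCH;
    }

    /**
     * Converts a rectangle measured from the top-left corner of the page
     * (top = distance down from the top edge) into {x, y, width, height}
     * with a bottom-left origin. All values are in points.
     */
    public static float[] fromTopLeft(float left, float top, float width, float height, float pageHeight) {
        // y is the rectangle's bottom edge, measured up from the bottom of the page
        float y = pageHeight - top - height;
        return new float[] { left, y, width, height };
    }
}
```

For the top-left quarter of an 8.5" x 11" page, fromTopLeft(0, 0, 306, 396, 792) gives 0, 396, 306, 396, matching the parameters quoted above; the four values could then be passed to addRegion in order.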
public void addRegion(float x, float y, float width, float height, java.lang.String name)
Registers a new named region. The coordinate pair x, y describes the origin and bottom-left corner of the rectangular region to be extracted; the width and height parameters represent the size of the rectangular region, extending up and to the right from the origin specified by x, y.
Text extracted from this region will be available via the RegionOutputTarget.getRegionText(String) function.
All values are denominated in 1/72" (called points). Please note that the origin of each page is its lower left corner. So, for example, a region that would encompass the top-left quarter of an 8.5" x 11" page would be registered with the parameters 0, 396, 306, 396.
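When many named regions are needed, for instance one per cell of a form laid out as a grid, the coordinates can be generated rather than typed by hand. The sketch below is one way to do that under stated assumptions: RegionGrid is a hypothetical helper (not part of PDFxStream), the naming pattern "r<row>c<col>" is made up for illustration, and the page size is supplied by the caller.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper (not part of PDFxStream): splits a page into a grid of named rectangles. */
public final class RegionGrid {
    private RegionGrid() {}

    /**
     * Returns name -> {x, y, width, height} in points, using a bottom-left origin.
     * Row 0 is the top row of the page; names follow the made-up pattern "r<row>c<col>".
     */
    public static Map<String, float[]> build(float pageWidth, float pageHeight, int rows, int cols) {
        float w = pageWidth / cols;
        float h = pageHeight / rows;
        Map<String, float[]> regions = new LinkedHashMap<>();
        for (int row = 0; row < rows; row++) {
            for (int col = 0; col < cols; col++) {
                float x = col * w;
                // Bottom edge of this row, measured up from the bottom of the page
                float y = pageHeight - (row + 1) * h;
                regions.put("r" + row + "c" + col, new float[] { x, y, w, h });
            }
        }
        return regions;
    }
}
```

Each map entry could then be registered with addRegion(x, y, width, height, name), and the same key used later with getRegionText(String).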
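public java.lang.String getRegionText(int i)
Returns the text extracted from the i-th region that was registered with this RegionOutputTarget.
public java.lang.String getRegionText(java.lang.String regionName)
Returns the text extracted from the region that was registered with this RegionOutputTarget using the provided name.
public java.util.Set getRegionNames()
Returns a set containing each of the names used to register regions on this RegionOutputTarget via RegionOutputTarget.addRegion(float, float, float, float, String).
public int getRegionCnt()
Returns the number of registered regions.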
public void startPage(Page page)
Invoked when a page is about to be processed.
Overrides: startPage in class OutputHandler
page - a reference to the Page that is about to be processed
public void textUnit(TextUnit tu)
Invoked when a run of characters is to be outputted, as represented by the given TextUnit instance.
Overrides: textUnit in class OutputHandler
public void endPage(Page page)
Invoked when PDFxStream has finished processing a page.
Overrides: endPage in class OutputHandler
page - a reference to the Page that has been processed