com.snowtide.pdf
Interface Page


public interface Page

Implementations of this interface provide access to the text and attributes of a page extracted from a PDF document.

Version:
©2004-2012 Snowtide Informatics Systems, Inc.

Field Summary
static int COLUMN_POSITION_HALVES
          A constant parameter for use with addColumnPartition(int).
static int COLUMN_POSITION_QUARTERS
          A constant parameter for use with addColumnPartition(int).
static int COLUMN_POSITION_THIRDS
          A constant parameter for use with addColumnPartition(int).
 
Method Summary
 void addColumnPartition(int xcoord)
          Adds the given coordinate as an acceptable midline between columns, used when this page is segmented.
 Page crop(Region area)
          Returns a Page instance that contains only the content held by this Page instance that intersects the given "query" area.
 PDFTextStreamConfig getConfig()
          Returns the PDFTextStreamConfig instance provided to this page by its parent PDFTextStream instance.
 Region getCropBox()
          The "crop box" defined by the PDF for this page, expressed in user space units as with getPageHeight() and getPageWidth().
 int getPageHeight()
          Returns the height of this page in PDF "default user space units" (as specified by the PDF spec).
 int getPageNumber()
          Returns this Page's page number.
 int getPageWidth()
          Returns the width of this page in PDF "default user space units" (as specified by the PDF spec).
 java.lang.String getPdfName()
          Returns the name of the PDF document from which this Page was extracted.
 int getRotationTheta()
          Returns the number of degrees by which the page has been rotated clockwise.
 PDFTextStream getStream()
          Returns the PDFTextStream instance from which this Page was sourced.
 BlockParent getTextContent()
          Returns a BlockParent instance that contains all Block instances held by this Page, which in turn hold all text content for this Page.
 void pipe(OutputHandler tgt)
          Extracts all text from this page, sending necessary events to the given OutputHandler implementation.
 

Field Detail

COLUMN_POSITION_HALVES

static final int COLUMN_POSITION_HALVES
A constant parameter for use with addColumnPartition(int).

See Also:
Constant Field Values

COLUMN_POSITION_THIRDS

static final int COLUMN_POSITION_THIRDS
A constant parameter for use with addColumnPartition(int).

See Also:
Constant Field Values

COLUMN_POSITION_QUARTERS

static final int COLUMN_POSITION_QUARTERS
A constant parameter for use with addColumnPartition(int).

See Also:
Constant Field Values

Method Detail

addColumnPartition

void addColumnPartition(int xcoord)
                        throws java.lang.UnsupportedOperationException
Adds the given coordinate as an acceptable midline between columns, used when this page is segmented. By default, no specific coordinate restrictions are applied to column partitioning. Adding any column partition coordinate will restrict acceptable column spacing midlines to only those coordinates specified.

The exceptions are the special constants COLUMN_POSITION_HALVES, COLUMN_POSITION_THIRDS, and COLUMN_POSITION_QUARTERS. Each of these constants "expands" into multiple column partitions; e.g., specifying COLUMN_POSITION_THIRDS results in two column partitions, one at getPageWidth() / 3 and another at 2 * getPageWidth() / 3.

To be effective, this method must be called before either getTextContent() or pipe(OutputHandler) is invoked.

Throws:
java.lang.UnsupportedOperationException - if this Page's implementation does not support specifying column positions.
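
The expansion described for COLUMN_POSITION_THIRDS can be illustrated with a small standalone sketch. It does not call the library: ColumnPartitionSketch and evenMidlines are hypothetical names, and the use of integer division is an assumption. Applying the same formula to halves and quarters is likewise an assumption made by analogy with the thirds case.

```java
import java.util.Arrays;

// Hypothetical helper, not part of com.snowtide.pdf.
final class ColumnPartitionSketch {

    // Returns the x-coordinates k * pageWidth / columns for k = 1 .. columns - 1,
    // i.e. the midlines that split a page into `columns` equal-width columns.
    static int[] evenMidlines(int pageWidth, int columns) {
        if (pageWidth <= 0 || columns < 2) {
            throw new IllegalArgumentException("need a positive width and at least two columns");
        }
        int[] midlines = new int[columns - 1];
        for (int k = 1; k < columns; k++) {
            midlines[k - 1] = k * pageWidth / columns; // integer division (assumed)
        }
        return midlines;
    }

    public static void main(String[] args) {
        int pageWidth = 612; // 8.5 inches at 72 units per inch
        // Thirds: one midline at width / 3 and another at 2 * width / 3.
        System.out.println(Arrays.toString(evenMidlines(pageWidth, 3))); // prints [204, 408]
    }
}
```

With a real Page, each computed coordinate could in principle be passed to addColumnPartition(int) individually, before getTextContent() or pipe(OutputHandler) is invoked.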

getStream

PDFTextStream getStream()
Returns the PDFTextStream instance from which this Page was sourced.


getPdfName

java.lang.String getPdfName()
Returns the name of the PDF document from which this Page was extracted.


getPageNumber

int getPageNumber()
Returns this Page's page number.


getPageWidth

int getPageWidth()
Returns the width of this page in PDF "default user space units" (as specified by the PDF spec). Typically, each "user space unit" is equivalent to 1/72 of an inch, so dividing the value returned by this method by 72 will yield the page width in inches. The value returned from this method corresponds to the width value of the /MediaBox attribute of a PDF page object.


getPageHeight

int getPageHeight()
Returns the height of this page in PDF "default user space units" (as specified by the PDF spec). Typically, each "user space unit" is equivalent to 1/72 of an inch, so dividing the value returned by this method by 72 will yield the page height in inches. The value returned from this method corresponds to the height value of the /MediaBox attribute of a PDF page object.
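
The unit conversion described for getPageWidth() and getPageHeight() amounts to a single division. A minimal sketch follows, assuming the typical case of 72 user space units per inch; UserSpaceUnits and toInches are hypothetical names, not library API.

```java
// Hypothetical helper, not part of com.snowtide.pdf.
final class UserSpaceUnits {

    // Typical size of one default user space unit: 1/72 of an inch.
    static final double UNITS_PER_INCH = 72.0;

    // Converts a length in default user space units to inches.
    static double toInches(int units) {
        return units / UNITS_PER_INCH;
    }

    public static void main(String[] args) {
        // Values that getPageWidth() and getPageHeight() might return for a US Letter page.
        int width = 612;
        int height = 792;
        System.out.println(toInches(width) + " x " + toInches(height) + " inches"); // prints 8.5 x 11.0 inches
    }
}
```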


getCropBox

Region getCropBox()
The "crop box" defined by the PDF for this page, expressed in user space units as with getPageHeight() and getPageWidth(). This rectangle will default to the page width and height if it is not otherwise specified.


getRotationTheta

int getRotationTheta()
Returns the number of degrees by which the page has been rotated clockwise. This value should be a multiple of 90, and can be negative. The value returned from this method corresponds to the value of the /Rotate attribute of a PDF page object.
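
Because the rotation can be negative, callers may find it convenient to map it onto one of 0, 90, 180, or 270 before branching on it. The sketch below shows one way to do that; RotationSketch and normalize are hypothetical names, and rejecting values that are not multiples of 90 is a choice made for this example rather than documented library behavior.

```java
// Hypothetical helper, not part of com.snowtide.pdf.
final class RotationSketch {

    // Maps a clockwise rotation in degrees onto the range [0, 360).
    static int normalize(int theta) {
        if (theta % 90 != 0) {
            throw new IllegalArgumentException("rotation must be a multiple of 90: " + theta);
        }
        // Math.floorMod returns a non-negative result for a positive modulus,
        // unlike the % operator, which keeps the sign of the dividend.
        return Math.floorMod(theta, 360);
    }

    public static void main(String[] args) {
        System.out.println(normalize(-90)); // prints 270
        System.out.println(normalize(450)); // prints 90
    }
}
```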


pipe

void pipe(OutputHandler tgt)
          throws java.io.IOException
Extracts all text from this page, sending necessary events to the given OutputHandler implementation. Unless custom text event handling is required, using an OutputTarget is the easiest way to take advantage of this function.

Throws:
java.io.IOException

getTextContent

BlockParent getTextContent()
Returns a BlockParent instance that contains all Block instances held by this Page, which in turn hold all text content for this Page.


getConfig

PDFTextStreamConfig getConfig()
Returns the PDFTextStreamConfig instance provided to this page by its parent PDFTextStream instance.


crop

Page crop(Region area)
Returns a Page instance that contains only the content held by this Page instance that intersects the given "query" area. If all of the content held by this instance is intersected by the query area, then this instance may be returned unchanged. If no content in this Page intersects the query area, then an empty Page instance will be returned.

Throws:
java.lang.UnsupportedOperationException - if this Page implementation does not support the crop(Region) function
See Also:
Block.crop(Region)
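
The three outcomes described for crop(Region) (all content kept, some content kept, nothing kept) can be illustrated with a standalone model of intersection-based filtering. This is a sketch under stated assumptions, not the library's implementation: CropSketch and Rect are hypothetical stand-ins for Page content and Region, and treating rectangles that merely touch at an edge as non-intersecting is an assumption.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of intersection-based cropping; not part of com.snowtide.pdf.
final class CropSketch {

    // Hypothetical axis-aligned rectangle; NOT the library's Region type.
    static final class Rect {
        final int x, y, width, height;

        Rect(int x, int y, int width, int height) {
            this.x = x;
            this.y = y;
            this.width = width;
            this.height = height;
        }

        // True if the two rectangles overlap with non-zero area (assumed semantics).
        boolean intersects(Rect o) {
            return x < o.x + o.width && o.x < x + width
                && y < o.y + o.height && o.y < y + height;
        }
    }

    // Keeps only the content rectangles that intersect the query area.
    // Returns the original list when everything intersects, and an empty
    // list when nothing does.
    static List<Rect> crop(List<Rect> content, Rect query) {
        List<Rect> kept = new ArrayList<>();
        for (Rect r : content) {
            if (r.intersects(query)) {
                kept.add(r);
            }
        }
        return kept.size() == content.size() ? content : kept;
    }

    public static void main(String[] args) {
        List<Rect> content = new ArrayList<>();
        content.add(new Rect(0, 0, 100, 20));
        content.add(new Rect(0, 500, 100, 20));

        // Only the first rectangle overlaps this query area.
        System.out.println(crop(content, new Rect(0, 0, 200, 100)).size()); // prints 1
    }
}
```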