Package com.snowtide.pdf
Interface Page
-
- All Superinterfaces:
OutputSource
public interface Page extends OutputSource
Provides access to the text, images, and attributes of a page extracted from a PDF document.
- Version:
- ©2004-2024 Snowtide
- See Also:
Document.getPage(int)
-
-
Field Summary
Fields (Modifier and Type, Field, Description):
- static int COLUMN_POSITION_HALVES: A constant parameter for use with addColumnPartition(int).
- static int COLUMN_POSITION_QUARTERS: A constant parameter for use with addColumnPartition(int).
- static int COLUMN_POSITION_THIRDS: A constant parameter for use with addColumnPartition(int).
-
Method Summary
Methods (Modifier and Type, Method, Description):
- void addColumnPartition(int xcoord): Adds the given coordinate as an acceptable midline between columns, used when this page is segmented.
- Page crop(Region area): Returns a Page instance that contains only the content held by this Page instance that intersects the given "query" area.
- Collection<TextUnit> getCharacters(): Returns a collection of TextUnits on this page.
- Configuration getConfig(): Returns the Configuration instance provided to this page by its parent Document instance.
- Region getCropBox(): The "crop box" defined by the PDF for this page, expressed in user space units as with getPageHeight() and getPageWidth().
- Document getDocument(): Returns the Document from which this Page was sourced.
- Collection<Image> getImages(): Returns a Collection of Image objects, one for each image on this page.
- int getPageHeight(): Returns the height of this page in PDF "default user space units" (as specified by the PDF spec).
- int getPageNumber(): Returns this Page's page number.
- int getPageWidth(): Returns the width of this page in PDF "default user space units" (as specified by the PDF spec).
- int getRotationTheta(): Returns the number of degrees by which the page has been rotated clockwise.
- BlockParent getTextContent(): Returns a BlockParent instance that contains all Block instances held by this Page, which in turn hold all text content for this Page.
- Direction guessDirection(): Returns the "base" Direction of this page, determined via a heuristic that surveys the entire page's contents.
-
Methods inherited from interface com.snowtide.pdf.OutputSource
pipe, pipe
-
-
-
-
Field Detail
-
COLUMN_POSITION_HALVES
static final int COLUMN_POSITION_HALVES
A constant parameter for use with addColumnPartition(int).
- Since:
- 2.5.0
- See Also:
- Constant Field Values
-
COLUMN_POSITION_THIRDS
static final int COLUMN_POSITION_THIRDS
A constant parameter for use with addColumnPartition(int).
- Since:
- 2.5.0
- See Also:
- Constant Field Values
-
COLUMN_POSITION_QUARTERS
static final int COLUMN_POSITION_QUARTERS
A constant parameter for use with addColumnPartition(int).
- Since:
- 2.5.0
- See Also:
- Constant Field Values
-
-
Method Detail
-
addColumnPartition
void addColumnPartition(int xcoord) throws UnsupportedOperationException
Adds the given coordinate as an acceptable midline between columns, used when this page is segmented. By default, no specific coordinate restrictions are applied to column partitioning. Adding any column partition coordinate will restrict acceptable column spacing midlines to only those coordinates specified.
The exceptions to this are the privileged constants COLUMN_POSITION_HALVES, COLUMN_POSITION_THIRDS, and COLUMN_POSITION_QUARTERS. Those constants "expand" into multiple column partitions; e.g. specifying COLUMN_POSITION_THIRDS will result in two column partitions, one at getPageWidth() / 3 and another at 2 * getPageWidth() / 3.
To be effective, this method must be called before either getTextContent() or OutputSource.pipe(OutputHandler) is invoked.
- Throws:
UnsupportedOperationException - if this Page's implementation does not support specifying column positions.
- Since:
- 2.5.0
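The expansion of the privileged constants described above can be sketched with plain integer arithmetic. This is a minimal illustration, not the library's implementation: the class `ColumnPartitionSketch` and its `expand` helper are hypothetical, and only the THIRDS case (midlines at `width / 3` and `2 * width / 3`) is stated by this documentation. The halves and quarters cases are assumed to follow the same pattern.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative only; not part of com.snowtide.pdf. */
class ColumnPartitionSketch {

    /**
     * Returns the midline x-coordinates obtained by dividing a page of the
     * given width into {@code parts} equal columns (hypothetical helper).
     */
    static List<Integer> expand(int pageWidth, int parts) {
        List<Integer> midlines = new ArrayList<>();
        // parts columns are separated by (parts - 1) midlines
        for (int i = 1; i < parts; i++) {
            midlines.add(i * pageWidth / parts);
        }
        return midlines;
    }

    public static void main(String[] args) {
        int width = 612; // a US Letter page is 612 user space units wide
        System.out.println("halves:   " + expand(width, 2));
        System.out.println("thirds:   " + expand(width, 3));
        System.out.println("quarters: " + expand(width, 4));
    }
}
```

For a 612-unit-wide page, the thirds case yields midlines at 204 and 408, matching the `getPageWidth() / 3` and `2 * getPageWidth() / 3` formulas above.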
-
getPageNumber
int getPageNumber()
Returns this Page's page number.
-
getPageWidth
int getPageWidth()
Returns the width of this page in PDF "default user space units" (as specified by the PDF spec). Typically, each "user space unit" is equivalent to 1/72 of an inch, so dividing the value returned by this method by 72 will yield the page width in inches.
The value returned from this method corresponds to the width value of the /MediaBox attribute of a PDF page object.
-
getPageHeight
int getPageHeight()
Returns the height of this page in PDF "default user space units" (as specified by the PDF spec). Typically, each "user space unit" is equivalent to 1/72 of an inch, so dividing the value returned by this method by 72 will yield the page height in inches.
The value returned from this method corresponds to the height value of the /MediaBox attribute of a PDF page object.
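The unit conversion described for getPageWidth() and getPageHeight() amounts to a single division. The sketch below assumes the typical case of 72 user space units per inch; `PageSizeSketch` and `toInches` are hypothetical names, and the literal values stand in for what the two getters might return for a US Letter page.

```java
/** Illustrative only; not part of com.snowtide.pdf. */
class PageSizeSketch {

    /** Typical number of default user space units per inch. */
    static final double UNITS_PER_INCH = 72.0;

    /** Converts a length in default user space units to inches. */
    static double toInches(int userSpaceUnits) {
        return userSpaceUnits / UNITS_PER_INCH;
    }

    public static void main(String[] args) {
        int width = 612;  // stand-in for page.getPageWidth()
        int height = 792; // stand-in for page.getPageHeight()
        // prints "8.5 x 11.0 in"
        System.out.println(toInches(width) + " x " + toInches(height) + " in");
    }
}
```

Dividing by a `double` avoids the truncation that integer division would introduce for page sizes that are not whole inches.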
-
getCropBox
Region getCropBox()
The "crop box" defined by the PDF for this page, expressed in user space units as with getPageHeight() and getPageWidth(). This rectangle will default to the page width and height if it is not otherwise specified.
-
getRotationTheta
int getRotationTheta()
Returns the number of degrees by which the page has been rotated clockwise. This value should be a multiple of 90, and can be negative.
The value returned from this method corresponds to the value of the /Rotate attribute of a PDF page object.
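Because the rotation may be negative, callers may find it convenient to normalize it before comparing values. The following is a minimal sketch under that assumption; `RotationSketch`, `normalize`, and `isSideways` are hypothetical helpers rather than library methods, and the integer arguments stand in for values returned by getRotationTheta().

```java
/** Illustrative only; not part of com.snowtide.pdf. */
class RotationSketch {

    /** Maps any multiple of 90, including negative values, to 0, 90, 180, or 270. */
    static int normalize(int theta) {
        // floorMod always returns a non-negative result for a positive modulus
        return Math.floorMod(theta, 360);
    }

    /** True when the rotation is a quarter turn in either direction. */
    static boolean isSideways(int theta) {
        return normalize(theta) % 180 == 90;
    }

    public static void main(String[] args) {
        int[] samples = {0, 90, -90, 180, 450};
        for (int theta : samples) {
            System.out.println(theta + " -> " + normalize(theta)
                    + ", sideways: " + isSideways(theta));
        }
    }
}
```

`Math.floorMod` is used instead of the `%` operator because `-90 % 360` evaluates to `-90` in Java, whereas `Math.floorMod(-90, 360)` evaluates to `270`.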
-
getTextContent
BlockParent getTextContent()
Returns a BlockParent instance that contains all Block instances held by this Page, which in turn hold all text content for this Page.
- Throws:
InsufficientLicenseException - if a license has been loaded, but that license does not include PDF.Feature.Text.
-
getCharacters
Collection<TextUnit> getCharacters()
Returns a collection of TextUnits on this page.
Note that this collection is unordered unless a license has been loaded that includes PDF.Feature.Text.
If attempting PDF text extraction, using OutputSource.pipe(OutputHandler) with an appropriate OutputHandler, or accessing the document model produced by PDFTextStream, is strongly recommended.
-
getImages
Collection<Image> getImages()
Returns a Collection of Image objects, one for each image on this page. Note that the same image data might be displayed multiple times on a page; in such situations, multiple Image instances will still be included in the returned collection so as to represent each displayed image's dimensions and position.
- Throws:
InsufficientLicenseException - if a license has been loaded, but that license does not include PDF.Feature.Images.
-
getConfig
Configuration getConfig()
Returns the Configuration instance provided to this page by its parent Document instance.
-
crop
Page crop(Region area)
Returns a Page instance that contains only the content held by this Page instance that intersects the given "query" area. If all of the content held by this instance is intersected by the query area, then this instance may be returned unchanged. If no content in this Page intersects the query area, then an empty Page instance will be returned.
- Throws:
UnsupportedOperationException - if this Page implementation does not support crop.
- See Also:
Block.crop(Region)
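The three outcomes described for crop (a subset, the unchanged original, or an empty result) can be illustrated with ordinary rectangle intersection. This sketch does not use the library: `java.awt.geom.Rectangle2D` stands in for Region and for the bounds of each piece of content, and `CropSketch` is a hypothetical class. How the library itself decides that content "intersects" an area is not specified here.

```java
import java.awt.geom.Rectangle2D;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Illustrative only; not part of com.snowtide.pdf. */
class CropSketch {

    /** Keeps only the boxes that intersect the query area. */
    static List<Rectangle2D> crop(List<Rectangle2D> boxes, Rectangle2D query) {
        List<Rectangle2D> kept = new ArrayList<>();
        for (Rectangle2D box : boxes) {
            if (box.intersects(query)) {
                kept.add(box);
            }
        }
        // If nothing was dropped, return the original list unchanged.
        return kept.size() == boxes.size() ? boxes : kept;
    }

    public static void main(String[] args) {
        List<Rectangle2D> boxes = Arrays.<Rectangle2D>asList(
                new Rectangle2D.Double(0, 0, 100, 20),    // near the origin
                new Rectangle2D.Double(0, 500, 100, 20)); // far from the origin

        Rectangle2D partial = new Rectangle2D.Double(0, 0, 200, 100);
        Rectangle2D everything = new Rectangle2D.Double(0, 0, 200, 600);
        Rectangle2D nothing = new Rectangle2D.Double(300, 300, 10, 10);

        System.out.println("partial:    " + crop(boxes, partial).size() + " box(es)");
        System.out.println("everything: " + (crop(boxes, everything) == boxes));
        System.out.println("nothing:    " + crop(boxes, nothing).isEmpty());
    }
}
```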
-
guessDirection
Direction guessDirection()
Returns the "base" Direction of this page, determined via a heuristic that surveys the entire page's contents. This value is what determines the default base direction used when OutputSource.pipe(OutputHandler) is invoked (which always delegates to the OutputSource.pipe(OutputHandler, Direction) overload).
- Since:
- v4.0
-
-