Package com.snowtide.pdf
Interface Page
-
- All Superinterfaces:
OutputSource
public interface Page extends OutputSource
Provides access to the text, images, and attributes of a page extracted from a PDF document.
- Version:
- ©2004-2025 Snowtide
- See Also:
Document.getPage(int)
-
-
Field Summary
Fields (Modifier and Type, Field, Description):
- static int COLUMN_POSITION_HALVES: A constant parameter for use with addColumnPartition(int).
- static int COLUMN_POSITION_QUARTERS: A constant parameter for use with addColumnPartition(int).
- static int COLUMN_POSITION_THIRDS: A constant parameter for use with addColumnPartition(int).
-
Method Summary
Methods (Modifier and Type, Method, Description):
- void addColumnPartition(int xcoord): Adds the given coordinate as an acceptable midline between columns, used when this page is segmented.
- Page crop(Region area): Returns a Page instance that contains only the content held by this Page instance that intersects the given "query" area.
- Collection<TextUnit> getCharacters(): Returns a collection of TextUnits on this page.
- Configuration getConfig(): Returns the Configuration instance provided to this page by its parent Document instance.
- Region getCropBox(): The "crop box" defined by the PDF for this page, expressed in user space units as with getPageHeight() and getPageWidth().
- Document getDocument(): Returns the Document from which this Page was sourced.
- Collection<Image> getImages(): Returns a Collection of Image objects, one for each image on this page.
- int getPageHeight(): Returns the height of this page in PDF "default user space units" (as specified by the PDF spec).
- int getPageNumber(): Returns this Page's page number.
- int getPageWidth(): Returns the width of this page in PDF "default user space units" (as specified by the PDF spec).
- int getRotationTheta(): Returns the number of degrees by which the page has been rotated clockwise.
- BlockParent getTextContent()
- BlockParent getTextContent(Direction baseDir): Returns a BlockParent instance that contains all Block instances held by this Page, which in turn hold all text content for this Page.
- Direction guessDirection(): Returns the "base" Direction of this page, determined via a heuristic that surveys the entire page's contents.
-
Methods inherited from interface com.snowtide.pdf.OutputSource
pipe, pipe
-
-
-
-
Field Detail
-
COLUMN_POSITION_HALVES
static final int COLUMN_POSITION_HALVES
A constant parameter for use with addColumnPartition(int).
- Since:
- 2.5.0
- See Also:
- Constant Field Values
-
COLUMN_POSITION_THIRDS
static final int COLUMN_POSITION_THIRDS
A constant parameter for use with addColumnPartition(int).
- Since:
- 2.5.0
- See Also:
- Constant Field Values
-
COLUMN_POSITION_QUARTERS
static final int COLUMN_POSITION_QUARTERS
A constant parameter for use with addColumnPartition(int).
- Since:
- 2.5.0
- See Also:
- Constant Field Values
-
-
Method Detail
-
addColumnPartition
void addColumnPartition(int xcoord) throws UnsupportedOperationException
Adds the given coordinate as an acceptable midline between columns, used when this page is segmented. By default, no specific coordinate restrictions are applied to column partitioning. Adding any column partition coordinate restricts acceptable column spacing midlines to only those coordinates specified.
The exceptions to this are the special constants COLUMN_POSITION_HALVES, COLUMN_POSITION_THIRDS, and COLUMN_POSITION_QUARTERS. Those constants "expand" into multiple column partitions; e.g. specifying COLUMN_POSITION_THIRDS will result in two column partitions, one at getPageWidth() / 3 and another at 2 * getPageWidth() / 3.
To be effective, this method must be called before either getTextContent() or OutputSource.pipe(OutputHandler) is invoked.
- Throws:
UnsupportedOperationException - if this Page's implementation does not support specifying column positions.
- Since:
- 2.5.0
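The expansion of the COLUMN_POSITION_* constants described above can be illustrated with a small standalone sketch. It does not call the library; the class and method names are hypothetical, and it assumes that halves and quarters follow the same pattern the documentation spells out for thirds (evenly spaced midlines, computed with integer division).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for illustration only; not part of com.snowtide.pdf.
class ColumnPartitionSketch {

    // Returns the midline x coordinates that split pageWidth into `parts` equal columns.
    // For parts == 3 this yields pageWidth / 3 and 2 * pageWidth / 3, as documented.
    static List<Integer> expand(int pageWidth, int parts) {
        List<Integer> midlines = new ArrayList<>();
        for (int i = 1; i < parts; i++) {
            midlines.add(i * pageWidth / parts);
        }
        return midlines;
    }

    public static void main(String[] args) {
        int width = 612; // e.g. a US Letter page in default user space units
        System.out.println("halves:   " + expand(width, 2));
        System.out.println("thirds:   " + expand(width, 3));
        System.out.println("quarters: " + expand(width, 4)); // assumed pattern
    }
}
```

Each coordinate produced this way corresponds to what an explicit addColumnPartition(int) call with that x value would register.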
-
getPageNumber
int getPageNumber()
Returns this Page's page number.
-
getPageWidth
int getPageWidth()
Returns the width of this page in PDF "default user space units" (as specified by the PDF spec). Typically, each "user space unit" is equivalent to 1/72 of an inch, so dividing the value returned by this method by 72 will yield the page width in inches.
The value returned from this method corresponds to the width value of the /MediaBox attribute of a PDF page object.
-
getPageHeight
int getPageHeight()
Returns the height of this page in PDF "default user space units" (as specified by the PDF spec). Typically, each "user space unit" is equivalent to 1/72 of an inch, so dividing the value returned by this method by 72 will yield the page height in inches.
The value returned from this method corresponds to the height value of the /MediaBox attribute of a PDF page object.
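As a worked example of the unit conversion described for getPageWidth() and getPageHeight(), the sketch below converts user space units to inches. It assumes the typical case of 72 units per inch; the class and method names are hypothetical and the code does not depend on the library.

```java
// Hypothetical helper for illustration only; not part of com.snowtide.pdf.
class PageSizeSketch {

    // Assumption: the typical default of 72 user space units per inch.
    static final double UNITS_PER_INCH = 72.0;

    // Converts a page dimension in default user space units to inches.
    static double toInches(int userSpaceUnits) {
        return userSpaceUnits / UNITS_PER_INCH;
    }

    public static void main(String[] args) {
        // 612 x 792 units is the usual size of a US Letter page.
        System.out.println(toInches(612) + " x " + toInches(792) + " inches");
    }
}
```

In real use, the arguments would be the values returned by getPageWidth() and getPageHeight().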
-
getCropBox
Region getCropBox()
The "crop box" defined by the PDF for this page, expressed in user space units as with getPageHeight() and getPageWidth(). This rectangle will default to the page width and height if it is not otherwise specified.
-
getRotationTheta
int getRotationTheta()
Returns the number of degrees by which the page has been rotated clockwise. This value should be a multiple of 90, and can be negative.
The value returned from this method corresponds to the value of the /Rotate attribute of a PDF page object.
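Because the rotation can be negative, callers may find it convenient to normalize it before branching on it. The following is a minimal sketch under that assumption; the class and method names are hypothetical, and rejecting values that are not multiples of 90 is a choice made here, not documented library behavior.

```java
// Hypothetical helper for illustration only; not part of com.snowtide.pdf.
class RotationSketch {

    // Maps a clockwise rotation in degrees onto one of 0, 90, 180, 270.
    static int normalize(int theta) {
        if (theta % 90 != 0) {
            throw new IllegalArgumentException("rotation is not a multiple of 90: " + theta);
        }
        // floorMod always returns a non-negative result for a positive modulus.
        return Math.floorMod(theta, 360);
    }

    public static void main(String[] args) {
        System.out.println(normalize(-90)); // same orientation as 270 degrees clockwise
        System.out.println(normalize(450)); // same orientation as 90 degrees clockwise
    }
}
```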
-
getTextContent
BlockParent getTextContent()
-
getTextContent
BlockParent getTextContent(Direction baseDir)
Returns a BlockParent instance that contains all Block instances held by this Page, which in turn hold all text content for this Page.
- Parameters:
baseDir - the Direction that should be used to disambiguate the order in which extracted text is emitted. At the page level, this directly affects the read-ordering of blocks (e.g. passing Direction.RTL will cause columns to be traversed right-to-left).
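The effect of the base direction on column traversal can be pictured with a simplified model. This sketch is not the library's algorithm: it uses a hypothetical Dir enum in place of Direction, represents each column only by its left edge, and assumes that ordering columns by x coordinate is a fair approximation of "traversed left-to-right" versus "right-to-left".

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical model for illustration only; not part of com.snowtide.pdf.
class ReadOrderSketch {

    enum Dir { LTR, RTL } // stand-in for the library's Direction type

    // Orders column left edges in the sequence they would be read for the given base direction.
    static List<Integer> columnOrder(List<Integer> columnLeftEdges, Dir baseDir) {
        List<Integer> ordered = new ArrayList<>(columnLeftEdges);
        Comparator<Integer> byX = Comparator.naturalOrder();
        ordered.sort(baseDir == Dir.RTL ? byX.reversed() : byX);
        return ordered;
    }

    public static void main(String[] args) {
        List<Integer> columns = List.of(320, 40); // two columns, listed in arbitrary order
        System.out.println("LTR: " + columnOrder(columns, Dir.LTR));
        System.out.println("RTL: " + columnOrder(columns, Dir.RTL));
    }
}
```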
-
getCharacters
Collection<TextUnit> getCharacters()
Returns a collection of TextUnits on this page. If attempting PDF text extraction, using OutputSource.pipe(OutputHandler) with an appropriate OutputHandler, or accessing the document model, is strongly recommended.
-
getImages
Collection<Image> getImages()
Returns a Collection of Image objects, one for each image on this page. Note that the same image data might be displayed multiple times on a page; in such situations, multiple Image instances will still be included in the returned collection so as to represent each displayed image's dimensions and position.
-
getConfig
Configuration getConfig()
Returns the Configuration instance provided to this page by its parent Document instance.
-
crop
Page crop(Region area)
Returns a Page instance that contains only the content held by this Page instance that intersects the given "query" area. If all of the content held by this instance is intersected by the query area, then this instance may be returned unchanged. If no content in this Page intersects the query area, then an empty Page instance will be returned.
- Throws:
UnsupportedOperationException - if this Page implementation does not support crop.
- See Also:
Block.crop(Region)
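The three outcomes described for crop (a subset, the unchanged original, or an empty result) can be sketched with plain rectangles. The Rect class below is a hypothetical stand-in for Region, content is modeled as a list of bounding boxes, and intersecting items are kept whole; whether the library clips partially intersecting content is not stated here, so that part is an assumption.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model for illustration only; not part of com.snowtide.pdf.
class CropSketch {

    // Axis-aligned rectangle in user space units; a stand-in for the library's Region.
    static final class Rect {
        final int left, bottom, right, top;

        Rect(int left, int bottom, int right, int top) {
            this.left = left;
            this.bottom = bottom;
            this.right = right;
            this.top = top;
        }

        // True when the two rectangles share some interior area.
        boolean intersects(Rect o) {
            return left < o.right && o.left < right && bottom < o.top && o.bottom < top;
        }
    }

    // Keeps only the content boxes that intersect the query area.
    static List<Rect> crop(List<Rect> content, Rect query) {
        List<Rect> kept = new ArrayList<>();
        for (Rect r : content) {
            if (r.intersects(query)) {
                kept.add(r);
            }
        }
        // If everything intersects, hand back the original list unchanged.
        return kept.size() == content.size() ? content : kept;
    }

    public static void main(String[] args) {
        List<Rect> content = List.of(new Rect(0, 0, 100, 100), new Rect(0, 500, 100, 600));
        Rect upperPart = new Rect(0, 400, 612, 792);
        System.out.println(crop(content, upperPart).size() + " item(s) intersect the query area");
    }
}
```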
-
guessDirection
Direction guessDirection()
Returns the "base" Direction of this page, determined via a heuristic that surveys the entire page's contents. This value is what determines the default base direction used when OutputSource.pipe(OutputHandler) is invoked (which always delegates to the OutputSource.pipe(OutputHandler, Direction) overload).
- Since:
- v4.0
-
-