public interface Document

Use one of the open methods on the PDF factory class (e.g. PDF.open(java.io.File)) to obtain a Document providing access to the contents of a particular PDF file.

Modifier and Type | Field and Description |
---|---|
static java.lang.String | ATTR_AUTHOR: Document attribute key used to retrieve a String indicating who created a PDF document. |
static java.lang.String | ATTR_CREATION_DATE: Document attribute key used to retrieve a String indicating the date and time that a PDF document was created. |
static java.lang.String | ATTR_CREATOR: Document attribute key used to retrieve a String indicating the name of the application that created the original document from which the PDF was generated. |
static java.lang.String | ATTR_KEYWORDS: Document attribute key used to retrieve a String containing keywords associated with a PDF document. |
static java.lang.String | ATTR_MOD_DATE: Document attribute key used to retrieve a String indicating the date and time that a PDF document was last modified. |
static java.lang.String | ATTR_PRODUCER: Document attribute key used to retrieve a String indicating the name of the application that generated a PDF document. |
static java.lang.String | ATTR_SUBJECT: Document attribute key used to retrieve a String indicating the subject of a PDF document. |
static java.lang.String | ATTR_TITLE: Document attribute key used to retrieve a String indicating the title of a PDF document. |
static java.lang.String | ATTR_TRAPPED: Document attribute key used to retrieve an indicator as to whether a PDF document includes trapping information (trapping is a method for correcting printing errors in high-quality printing environments). |
static java.lang.String | ATTR_USES_GRAPH_FONTS: Some PDF files use fonts that are image-based: instead of their encodings mapping character codes to standard Unicode characters, they map character codes to images of characters. |

Modifier and Type | Method and Description |
---|---|
java.util.List<Annotation> | getAllAnnotations(): Returns a list containing all of the Annotations contained in the current PDF document. |
int | getAllAnnotations(java.util.List tgt): Adds to the given List all of the Annotations contained in the current PDF document. |
java.util.List<EmbeddedFile> | getAllEmbeddedFiles(): Returns a list of all of the embedded files available in the source PDF. |
java.util.List<Annotation> | getAnnotations(int page): Returns a List of all annotations found on the page indicated by the given page number; each object will be an instance of a class that implements the Annotation interface. |
java.lang.Object | getAttribute(java.lang.String attrName): Returns the value of the specified document-level metadata attribute. |
java.util.Set | getAttributeKeys(): Returns a Set containing the keys of all available document metadata attributes. |
java.util.Map | getAttributeMap(): Returns a Map containing a copy of all keys and values of all available document metadata attributes. |
Bookmark | getBookmarks(): If the current PDF document contains a bookmark tree, this function will return its root node. |
Configuration | getConfig(): Returns the Configuration instance that this Document is using to govern its operation. |
java.util.List<EmbeddedFile> | getEmbeddedFiles(): Returns a list of the embedded files associated with the source PDF document itself. |
EncryptionInfo | getEncryptionInfo(): Returns an EncryptionInfo object, which provides access to some of the parameters used for the current PDF document's encryption. |
Form | getFormData(): Loads the form data contained in the current document, and returns a Form object that represents that data. |
java.util.Collection<Image> | getImages() |
java.lang.String | getName() |
Page | getPage(int n): Reads and returns a single page. |
int | getPageCnt(): Returns the number of pages in the PDF document. |
java.util.List<Page> | getPages() |
java.io.File | getPDFFile(): Returns a reference to the file that this Document is processing. |
long | getPdfFileSize(): Returns the size of the PDF file being read, in bytes. |
PDFVersion | getPDFVersion(): Returns the PDFVersion instance that corresponds with the version of the PDF file specification to which the current PDF file adheres. |
byte[] | getXmlMetadata(): Returns the XML metadata available from this Document, or null if no XML metadata is available. |
void | pipe(OutputHandler handler): Extracts all available text from this Document, sending all PDF text events to the given OutputHandler. |
void | setConfig(Configuration config): Sets the Configuration instance that this Document will use in various contexts to govern its operation. |
static final java.lang.String ATTR_TITLE
static final java.lang.String ATTR_AUTHOR
static final java.lang.String ATTR_SUBJECT
static final java.lang.String ATTR_KEYWORDS
static final java.lang.String ATTR_CREATOR
static final java.lang.String ATTR_PRODUCER
static final java.lang.String ATTR_CREATION_DATE
The value is a date String; see the parseDateString(String) method.
static final java.lang.String ATTR_MOD_DATE
The value is a date String; see the parseDateString(String) method.
static final java.lang.String ATTR_TRAPPED
static final java.lang.String ATTR_USES_GRAPH_FONTS
Some PDF files use fonts that are image-based: instead of their encodings mapping character codes to standard Unicode characters, they map character codes to images of characters. This makes it possible for these kinds of fonts (typically referred to as Type3 fonts) to, for example, map the character code 32 to the image of a letter 'g' instead of the standard space character.
PDFxStream can derive the Unicode encoding of Type3 fonts in many cases, and will do so automatically if possible. Otherwise, content that uses a Type3 font for which no proper encoding can be derived will be skipped, and a document attribute with this key will be set and mapped to a Boolean object with a value of true.
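As a rough illustration of how a caller might react to this attribute, the sketch below checks an attribute map (such as the one returned by getAttributeMap()) for a Boolean true under a given key. The class and method names are hypothetical and not part of PDFxStream. The key is passed in as a parameter because the String value behind the constant is not stated here; real code would presumably pass Document.ATTR_USES_GRAPH_FONTS.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper for illustration only; not part of PDFxStream.
class GraphFontCheck {

    // Returns true only when the map holds Boolean.TRUE under the given key.
    // A missing key, a null value, or any non-Boolean value yields false.
    static boolean hasSkippedGraphFontContent(Map<?, ?> attributes, String key) {
        return Boolean.TRUE.equals(attributes.get(key));
    }

    public static void main(String[] args) {
        // Stand-in for the Map returned by Document.getAttributeMap().
        Map<String, Object> attributes = new HashMap<>();
        // Placeholder key (an assumption); real code would use Document.ATTR_USES_GRAPH_FONTS.
        String key = "placeholder-graph-fonts-key";

        System.out.println("before: " + hasSkippedGraphFontContent(attributes, key));
        attributes.put(key, Boolean.TRUE);
        System.out.println("after: " + hasSkippedGraphFontContent(attributes, key));
    }
}
```

Comparing with Boolean.TRUE.equals(...) avoids both a cast and a null check, which is convenient because the attribute is only set when content was actually skipped.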
void setConfig(Configuration config)
Sets the Configuration instance that this Document will use in various contexts to govern its operation.
Note that certain configuration options are utilized only when a Document is being opened. For non-default settings of such options to take effect, a customized Configuration object must either be set as the default configuration, or be provided to one of the com.snowtide.PDF.open() static methods that accept a Configuration object, e.g. PDF.open(java.io.File, byte[], Configuration).
Configuration getConfig()
Returns the Configuration instance that this Document is using to govern its operation.

void pipe(OutputHandler handler)
Extracts all available text from this Document, sending all PDF text events to the given OutputHandler.
If no special PDF text event handling is needed (i.e. you just want a straight text extract), then using an OutputTarget is recommended.
Parameters:
handler - an OutputHandler instance.
Throws:
InsufficientLicenseException - if a license has been loaded, but that license does not include PDF.Feature.Text.
java.io.IOException - if an error occurs during the extraction process
See Also:
OutputHandler, OutputTarget
java.util.Collection<Image> getImages()
Throws:
InsufficientLicenseException - if a license has been loaded, but that license does not include PDF.Feature.Images.
java.io.IOException - if an error occurs during the extraction process

long getPdfFileSize()
Returns the size of the PDF file being read, in bytes.

int getPageCnt()
Returns the number of pages in the PDF document.

Page getPage(int n)
Reads and returns a single page.
Parameters:
n - the number of the page to retrieve.
Throws:
java.io.IOException - if an error occurs while preparing the Page for use

java.util.List<Page> getPages()
Returns a list of the pages from this Document, which are loaded lazily when accessed via the returned list.

java.lang.String getName()
Returns the name of the PDF that this Document is reading; this will be either the name of the PDF file that is being read, or the pdfName String that was provided if this Document was opened using one of the com.snowtide.PDF.open() methods that accept an InputStream or ByteBuffer, e.g. PDF.open(java.io.InputStream, String).
Nearly all of the logging messages generated by PDFxStream include the relevant Document's name, making them easier to interpret in a multithreaded production environment.
java.io.File getPDFFile()
Returns a reference to the file that this Document is processing. This reference may be null if the Document instance is not reading from a File or InputStream.

Form getFormData()
Loads the form data contained in the current document, and returns a Form object that represents that data. If the current PDF contains no forms, this function returns null.
The Form instance that is returned by this function is guaranteed to be an AcroForm.
This function MUST NOT be called after this Document is closed.
Throws:
java.io.IOException - if an error occurs loading the form data
InsufficientLicenseException - if a license has been loaded, but that license does not include the PDF.Feature.Forms feature.

java.util.List<EmbeddedFile> getEmbeddedFiles()
Returns a list of the embedded files associated with the source PDF document itself. Use Document.getAllEmbeddedFiles() to include all embedded files associated with annotations as well.
Throws:
java.io.IOException - if reading the embedded file metadata fails
See Also:
Document.getAllEmbeddedFiles()
java.util.List<EmbeddedFile> getAllEmbeddedFiles()
Returns a list of all of the embedded files available in the source PDF. This method includes all files associated with annotations as well; if you only want those embedded files that are associated with the source document itself (and not annotations), use Document.getEmbeddedFiles().
Throws:
java.io.IOException - if reading the embedded file metadata fails
See Also:
Document.getEmbeddedFiles()
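Bookmark getBookmarks()
If the current PDF document contains a bookmark tree, this function will return its root node.
An exception will be thrown if this function is called after this Document instance is closed.
Throws:
java.io.IOException - if an error occurs reading the bookmark tree
See Also:
Bookmark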
java.util.List<Annotation> getAnnotations(int page)
Returns a List of all annotations found on the page indicated by the given page number; each object will be an instance of a class that implements the Annotation interface.
This function will never return null; if a page contains no annotations, an empty list will be returned. The returned list is guaranteed to offer efficient random access to its elements.
Throws:
java.io.IOException - if an error occurs retrieving the annotation data
See Also:
Annotation
java.util.List<Annotation> getAllAnnotations()
Returns a list containing all of the Annotations contained in the current PDF document. The returned list is guaranteed to offer efficient random access to its elements.
Throws:
java.io.IOException - if an error occurs retrieving the annotation data

int getAllAnnotations(java.util.List tgt)
Adds to the given List all of the Annotations contained in the current PDF document.
Throws:
java.io.IOException - if an error occurs retrieving the annotation data
See Also:
Annotation
PDFVersion getPDFVersion()
Returns the PDFVersion instance that corresponds with the version of the PDF file specification to which the current PDF file adheres. PDF specification version numbers correspond directly with particular versions of Adobe Acrobat.
This method may not be called after the Document is closed.
Throws:
java.io.IOException - if an error occurs in determining what the PDF file's version is

EncryptionInfo getEncryptionInfo()
Returns an EncryptionInfo object, which provides access to some of the parameters used for the current PDF document's encryption. If the current PDF document is not encrypted, this method will return null.
byte[] getXmlMetadata()
Returns the XML metadata available from this Document, or null if no XML metadata is available.
Note: This method must be called before the Document is closed, and it should not be called while text is being actively read out of it. (Supporting such concurrency would require synchronization that would negatively impact performance.) Therefore, the best times to call this method are:
- after opening the Document, but before reading text out of it
- after reading text out of the Document, but before it is closed
PDFxStream does not control the content returned by this method; it just provides access to the data that is already stored in a PDF document. The schema of the returned XML data is defined by Adobe, and is called the Extensible Metadata Platform (XMP). More information about XMP can be found on Adobe's website.
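Because the metadata arrives as a raw byte[], callers typically need to parse it themselves. The following is a minimal sketch using only the JDK's built-in XML parser; the class and method names are hypothetical, and the sample bytes in main are a stand-in for whatever getXmlMetadata() would return. It handles the documented null case and returns the root element's name as a simple proof that parsing worked.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;

// Hypothetical helper for illustration only; not part of PDFxStream.
class XmpMetadataReader {

    // Parses raw XML metadata bytes and returns the root element's qualified name.
    // Returns null when the input is null (i.e. no XML metadata was available).
    static String rootElementName(byte[] xmlMetadata) {
        if (xmlMetadata == null) {
            return null;
        }
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            // PDF content may be untrusted, so refuse DOCTYPE declarations.
            factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
            DocumentBuilder builder = factory.newDocumentBuilder();
            // Fully qualified to avoid confusion with the PDFxStream Document interface.
            org.w3c.dom.Document dom = builder.parse(new ByteArrayInputStream(xmlMetadata));
            return dom.getDocumentElement().getNodeName();
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML metadata", e);
        }
    }

    public static void main(String[] args) {
        // Stand-in for the bytes returned by Document.getXmlMetadata().
        byte[] sample = "<x:xmpmeta xmlns:x=\"adobe:ns:meta/\"></x:xmpmeta>"
                .getBytes(StandardCharsets.UTF_8);
        System.out.println(rootElementName(sample));
    }
}
```

Passing the bytes straight to the parser, rather than decoding them to a String first, lets the parser detect the character encoding from the data itself.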
Throws:
java.io.IOException - if this Document has already been closed, or if an error occurs retrieving the XML metadata.

java.lang.Object getAttribute(java.lang.String attrName)
Returns the value of the specified document-level metadata attribute.
All of the standard attribute names are defined in constants in this class, and are all prefixed with 'ATTR_'. A few notes should be kept in mind when accessing attribute values:
- Use the getAttributeKeys() method to get a Set of the names of all available attributes.
- Date attributes are returned as Strings; use parseDateString(String) to get a Date object.
Note: the attributes available through this method are retrieved from the "classic" document /Info entry. The document metadata in an XML format (which typically contains the same set of metadata attributes that are available through this method) may be obtained via the getXmlMetadata() method.
Parameters:
attrName - the name of the attribute to be retrieved
Throws:
java.io.IOException - if an error occurs while retrieving the PDF document's metadata
See Also:
getXmlMetadata() for access to the XML-formatted document metadata
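To show why date attributes need a parsing step, here is an independent sketch of reading a PDF date string with java.time. It is not the library's parseDateString(String) and makes no claim about how that method behaves. It assumes the value follows the PDF specification's D:YYYYMMDDHHmmSSOHH'mm' layout, where everything after the year is optional, and it treats a missing time zone as UTC, which is an assumption of this sketch. The class and method names are hypothetical.

```java
import java.time.OffsetDateTime;
import java.time.ZoneOffset;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration of the PDF date string layout; not part of PDFxStream.
class PdfDateSketch {

    // Groups: 1 year, 2 month, 3 day, 4 hour, 5 minute, 6 second,
    // 7 offset sign, 8 offset hours, 9 offset minutes.
    private static final Pattern PDF_DATE = Pattern.compile(
            "^(?:D:)?(\\d{4})(\\d{2})?(\\d{2})?(\\d{2})?(\\d{2})?(\\d{2})?"
            + "(?:[Zz]|([+-])(\\d{2})'?(\\d{2})?'?)?$");

    static OffsetDateTime parse(String value) {
        Matcher m = PDF_DATE.matcher(value.trim());
        if (!m.matches()) {
            throw new IllegalArgumentException("Not a recognized PDF date string: " + value);
        }
        // Month and day default to 1; time fields default to 0.
        int year = Integer.parseInt(m.group(1));
        int month = field(m, 2, 1);
        int day = field(m, 3, 1);
        int hour = field(m, 4, 0);
        int minute = field(m, 5, 0);
        int second = field(m, 6, 0);

        // Assumption: no explicit offset (or a literal Z) is treated as UTC.
        ZoneOffset offset = ZoneOffset.UTC;
        if (m.group(7) != null) {
            int sign = m.group(7).equals("-") ? -1 : 1;
            int seconds = field(m, 8, 0) * 3600 + field(m, 9, 0) * 60;
            offset = ZoneOffset.ofTotalSeconds(sign * seconds);
        }
        return OffsetDateTime.of(year, month, day, hour, minute, second, 0, offset);
    }

    // Reads an optional numeric group, falling back to a default when it is absent.
    private static int field(Matcher m, int group, int fallback) {
        String text = m.group(group);
        return text == null ? fallback : Integer.parseInt(text);
    }

    public static void main(String[] args) {
        // Stand-in for a value such as getAttribute(Document.ATTR_CREATION_DATE).
        System.out.println(parse("D:20230115103000+01'00'"));
    }
}
```

In production code the library's own parseDateString(String) is the better choice, since real-world PDFs often contain date strings that deviate from the specification.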
java.util.Set getAttributeKeys()
Returns a Set containing the keys of all available document metadata attributes.
Throws:
java.io.IOException - if an error occurs while retrieving the PDF document's metadata

java.util.Map getAttributeMap()
Returns a Map containing a copy of all keys and values of all available document metadata attributes.
Throws:
java.io.IOException - if an error occurs while retrieving the PDF document's metadata