public class PDFTextStream
extends java.io.Reader
implements java.io.Closeable
PDFTextStream
gives your Java, .NET, and Python applications the ability to:
Instances of this class can either access a PDF file directly, or process equivalent data
delivered via a java.io.InputStream
or java.nio.ByteBuffer
.
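When the application loads the PDF data itself, the java.nio.ByteBuffer route needs a buffer that holds the entire file. The following is a minimal sketch using only the JDK; the class name PdfBytes and both helper methods are hypothetical and are not part of PDFTextStream. The resulting buffer could then be handed to one of the ByteBuffer-based constructors.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class PdfBytes {
    // Every PDF file is expected to start with this header marker.
    private static final byte[] HEADER = "%PDF-".getBytes(StandardCharsets.US_ASCII);

    /** Reads the whole file into a heap-backed ByteBuffer positioned at 0. */
    public static ByteBuffer readFully(Path pdfPath) throws IOException {
        return ByteBuffer.wrap(Files.readAllBytes(pdfPath));
    }

    /**
     * Quick sanity check: does the buffer start with "%PDF-"?
     * Uses absolute reads, so the buffer's position is left unchanged.
     */
    public static boolean looksLikePdf(ByteBuffer data) {
        if (data.remaining() < HEADER.length) {
            return false;
        }
        for (int i = 0; i < HEADER.length; i++) {
            if (data.get(data.position() + i) != HEADER[i]) {
                return false;
            }
        }
        return true;
    }
}
```

The header check is only a cheap guard against passing obviously wrong data; it says nothing about whether the rest of the file is well-formed.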
Certain aspects of PDFTextStream's operation may be customized by providing a suitably-configured
PDFTextStreamConfig
object to a PDFTextStream constructor, or by changing the default PDFTextStreamConfig
instance via the PDFTextStreamConfig.setDefaultConfig(PDFTextStreamConfig)
function, or by
setting a PDFTextStream instance's configuration settings after initialization via the setConfig(PDFTextStreamConfig)
function.
PDFTextStream
supports the core of the PDF file specification up to and
including version 1.7 (corresponding to Acrobat 8), including 40/128-bit document
encryption methods. PDFTextStream
also supports a variety of PDF format variants: formats that
deviate from the official PDF document specification significantly, yet still render as expected in Adobe Reader.
Using PDFTextStream to extract text from PDF documents is simple. First, create an instance of PDFTextStream
with a reference to a PDF file (alternatively, you can provide a java.io.InputStream or a java.nio.ByteBuffer):
PDFTextStream stream = new PDFTextStream(pdfFile);
Once a PDFTextStream instance is available, it can be used just like a java.io.Reader:
BufferedReader bufPDF = new BufferedReader(stream);
String firstLine = bufPDF.readLine();
// ... etc. ...
That's convenient, but only using PDFTextStream's
java.io.Reader
interface can be limiting; for instance, there's no way
to extract text from individual pages of a PDF that way. A more flexible extraction mechanism is available by using
OutputHandler
implementations to control the extraction of text. The "standard" implementation is
OutputTarget
(which is what PDFTextStream uses to format PDF text delivered through its java.io.Reader
interface):
Page page = stream.getPage(0);
StringBuffer sb = new StringBuffer(1024);
OutputTarget tgt = new OutputTarget(sb);
page.pipe(tgt);
String firstPageText = sb.toString();
OutputTarget
can also direct extracted text to a file on disk via a java.io.Writer
,
instead of to a StringBuffer
:
// Note the escaped backslash: "\p" is not a valid escape sequence in a Java string literal.
Writer textOutputFile = new OutputStreamWriter(new BufferedOutputStream(new FileOutputStream(new File("C:\\pdfExtract.txt"))));
OutputTarget tgt = new OutputTarget(textOutputFile);
page.pipe(tgt);
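An OutputStreamWriter created without a charset argument encodes text with the JVM's default charset, which on older JDKs depends on the platform and may not be able to represent every extracted character (CJK text, for example). Below is a minimal sketch, assuming UTF-8 output is acceptable, that opens a buffered Writer with an explicit charset. The class and method names are hypothetical and only JDK calls are used; a Writer opened this way could be given to OutputTarget in place of textOutputFile.

```java
import java.io.IOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class Utf8TextFile {
    /** Opens a buffered Writer that encodes everything written to it as UTF-8. */
    public static Writer open(Path target) throws IOException {
        return Files.newBufferedWriter(target, StandardCharsets.UTF_8);
    }

    /** Writes the given text to the target file; try-with-resources flushes and closes the Writer. */
    public static void write(Path target, CharSequence text) throws IOException {
        try (Writer out = open(target)) {
            out.append(text);
        }
    }
}
```

Closing the Writer explicitly is the conservative choice here: this page does not say whether an OutputHandler flushes or closes the Writer it was given.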
OutputTarget
is only one of the OutputHandler
implementations provided with PDFTextStream. Another commonly-used
implementation is VisualOutputTarget
. In contrast to OutputTarget
, which separates columns and other
blocks of text to enable semantically-sensitive applications (such as search indexing), VisualOutputTarget
retains the visual appearance and layout of each page of extracted text as much as possible. It is used just like
OutputTarget
:
StringBuffer sb = new StringBuffer(1024);
VisualOutputTarget tgt = new VisualOutputTarget(sb);
page.pipe(tgt);
String firstPageText = sb.toString();
Source code for some sample
OutputHandler
implementations is included with PDFTextStream, including
GoogleHTMLOutputHandler
and XMLOutputTarget
. Building a custom OutputHandler
implementation
is sometimes the simplest and most straightforward way to handle PDF text extracts appropriately for one's
application.
PDFTextStream
supports the extraction of interactive AcroForm
data, as well as
updating the values of most field types in such forms.
The AcroForm
instance for a particular PDF file may be retrieved using the getFormData()
function. From there, all of the AcroFormField
s available in that PDF file may be retrieved.
PDFTextStream
also includes XMLFormExport
, which will generate an XML document containing
all interactive form data associated with a PDF document. (The source code for XMLFormExport
is also included
in the PDFTextStream distribution for your reference.)
The persistent values of form fields accessible through the
AcroForm
may also be updated. Doing so is usually as simple as calling
AcroFormField.setValue(String)
on the fields to be changed, using the desired new values as arguments.
Some field types also provide simpler or more comprehensive setters appropriate for that field type; for example,
the AcroCheckboxField
provides the AcroCheckboxField.setValue(boolean)
function, which enables
a checkbox's value to be set without having to determine what String should be used to represent the
"checked" checkbox state.
After updating the values of form fields as appropriate, either AcroForm.writeUpdatedDocument(File) or
AcroForm.writeUpdatedDocument(OutputStream) may be used to write out an updated version of the
PDF document that contains the new form field values.
PDFTextStream provides access to all document-level metadata. This metadata includes creation and modification dates, author information, what application was used to generate a PDF document, and other items of potential interest. There are two potential sources of this metadata within a PDF document, and PDFTextStream provides a mechanism for retrieving metadata from each source.
Most PDF documents contain a mapping of simple name/value pair metadata attributes, which are stored in the document '/Info' object. PDFTextStream provides a set of methods for accessing these metadata attributes:
getAttribute(String) for retrieving the value associated with a named attribute
getAttributeKeys() for retrieving a java.util.Set view of the names of the attributes defined in a particular PDF document
getAttributeMap() for retrieving a java.util.Map view of all of the metadata name / value mappings
These methods may be called at any time before a PDFTextStream instance is closed. For
more details about retrieval of metadata attribute values, please refer to the documentation for
getAttribute(String).
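Creation and modification dates are stored in the /Info object as strings in the PDF date format, D:YYYYMMDDHHmmSSOHH'mm', where every field after the year is optional. The sketch below parses such a string with java.time. It is an independent illustration and not PDFTextStream's parseDateString(String) implementation; the class name PdfDateSketch is hypothetical, and treating a missing time zone as UTC is an assumption made only for this sketch.

```java
import java.time.OffsetDateTime;
import java.time.ZoneOffset;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PdfDateSketch {
    // Groups: 1 year, 2 month, 3 day, 4 hour, 5 minute, 6 second,
    //         7 "Z", 8 offset sign, 9 offset hours, 10 offset minutes.
    private static final Pattern PDF_DATE = Pattern.compile(
        "(?:D:)?(\\d{4})(\\d{2})?(\\d{2})?(\\d{2})?(\\d{2})?(\\d{2})?"
        + "(?:(Z)|([+-])(\\d{2})'?(\\d{2})?'?)?");

    /** Parses a PDF date string; omitted fields fall back to their lowest valid value. */
    public static OffsetDateTime parse(String pdfDate) {
        Matcher m = PDF_DATE.matcher(pdfDate.trim());
        if (!m.matches()) {
            throw new IllegalArgumentException("Not a PDF date string: " + pdfDate);
        }
        ZoneOffset offset = ZoneOffset.UTC; // assumption: no zone information means UTC
        if (m.group(8) != null) {
            int sign = m.group(8).equals("-") ? -1 : 1;
            offset = ZoneOffset.ofHoursMinutes(
                sign * Integer.parseInt(m.group(9)),
                sign * intOr(m.group(10), 0));
        }
        return OffsetDateTime.of(
            Integer.parseInt(m.group(1)),
            intOr(m.group(2), 1), intOr(m.group(3), 1),
            intOr(m.group(4), 0), intOr(m.group(5), 0), intOr(m.group(6), 0),
            0, offset);
    }

    private static int intOr(String digits, int fallback) {
        return digits == null ? fallback : Integer.parseInt(digits);
    }
}
```

For example, PdfDateSketch.parse("D:20240131123456-05'00'") yields 31 January 2024, 12:34:56 at offset -05:00. Strings that do not follow the format cause an IllegalArgumentException rather than a guess.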
Adobe has developed an XML-based architecture for delivering richer, more
flexible metadata within a PDF document, called XMP (Extensible Metadata Platform). Many PDF documents
include XMP streams, which can be accessed via the getXmlMetadata()
method. This XML data typically is just another view of the metadata stored in the 'classic' document /Info
object, but in some PDF workflows, the XMP data is used to carry richer metadata than can be stored
in the 'classic' way. More information about XMP can be found at
Adobe's website.
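Because getXmlMetadata() returns the XMP data as a byte[], it can be handed to any XML parser. Here is a minimal sketch that uses the JDK's DOM parser to pull the first Dublin Core title out of such a packet. The class name XmpSketch and its method are hypothetical, and real XMP packets vary in which properties they carry, so a null result should be expected.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.SAXException;

public class XmpSketch {
    private static final String DC_NS = "http://purl.org/dc/elements/1.1/";

    /** Returns the text of the first dc:title element, or null if there is no data or no title. */
    public static String firstDcTitle(byte[] xmp) {
        if (xmp == null) {
            return null; // no XML metadata available
        }
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            // Refuse DOCTYPE declarations; the data comes from an untrusted document.
            factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
            Document doc = factory.newDocumentBuilder().parse(new ByteArrayInputStream(xmp));
            NodeList titles = doc.getElementsByTagNameNS(DC_NS, "title");
            if (titles.getLength() == 0) {
                return null;
            }
            // dc:title usually wraps its value in rdf:Alt/rdf:li; take the combined text.
            return titles.item(0).getTextContent().trim();
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Could not parse XMP data", e);
        }
    }
}
```

The null check mirrors the documented behavior that getXmlMetadata() returns null when a document carries no XML metadata.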
PDFTextStream supports the retrieval of bookmarks supplied by some PDF documents (sometimes referred to as
outline data). Bookmarks are represented in PDF documents as a simple tree structure, which PDFTextStream's
Bookmark implementation mirrors. See the getBookmarks()
function and the Bookmark
class for details.
PDFTextStream supports the retrieval of PDF annotations; these include textual annotations
(notes, comments, etc.), URLs (used by PDF documents to implement hyperlinks), and others. Several functions
in PDFTextStream support the retrieval of annotations (getAllAnnotations()
, getAllAnnotations(List)
,
and getAnnotations(int)
); see the documentation for Annotation
for details on how each type
of annotation is implemented.
Text in a PDF document can be encoded in a variety of ways. PDFTextStream
supports all single-byte and double-byte Unicode character sets; it is therefore able to extract
all text written using western languages (English, Spanish, French, Icelandic, Dutch, Swedish, German, etc) as well
as Chinese, Japanese, and Korean (including vertical writing modes).
PDFTextStream does not currently support right-to-left writing modes, so text in languages such as
Arabic and Hebrew is not extracted as one would expect.
LoggingRegistry
provides a central hook for customizing which logging
framework PDFTextStream links to, and how. See the documentation for LoggingRegistry
for details.
MergeUtil provides PDF document merging functionality.
KodakPrintData enables the extraction of Kodak print job data (%KDK commands) from PDF documents
that contain such content.
Many PDFTextStream functions and its constructors pass IOExceptions along as they are thrown
due to underlying system I/O errors (permissions issues, etc.). FaultyPDFExceptions
may also be thrown in circumstances where a parsing or file structure problem is detected by PDFTextStream,
and it is suspected that the PDF file in question is corrupt, invalid, or otherwise not readable.
Any errors encountered while decrypting PDF content will be signaled by an EncryptedPDFException.
Modifier and Type | Field and Description |
---|---|
static java.lang.String | ATTR_AUTHOR: Document attribute key used to retrieve a String indicating who created a PDF document. |
static java.lang.String | ATTR_CREATION_DATE: Document attribute key used to retrieve a String indicating the date and time that a PDF document was created. |
static java.lang.String | ATTR_CREATOR: Document attribute key used to retrieve a String indicating the name of the application that created the original document from which the PDF was generated. |
static java.lang.String | ATTR_KEYWORDS: Document attribute key used to retrieve a String containing keywords associated with a PDF document. |
static java.lang.String | ATTR_MOD_DATE: Document attribute key used to retrieve a String indicating the date and time that a PDF document was last modified. |
static java.lang.String | ATTR_PRODUCER: Document attribute key used to retrieve a String indicating the name of the application that generated a PDF document. |
static java.lang.String | ATTR_SUBJECT: Document attribute key used to retrieve a String indicating the subject of a PDF document. |
static java.lang.String | ATTR_TITLE: Document attribute key used to retrieve a String indicating the title of a PDF document. |
static java.lang.String | ATTR_TRAPPED: Document attribute key used to retrieve an indicator as to whether a PDF document includes trapping information (trapping is a method for correcting printing errors in high-quality printing environments). |
static java.lang.String | ATTR_USES_GRAPH_FONTS: Some PDF files use fonts that are image-based -- instead of their encodings mapping character codes to standard Unicode characters, they map character codes to images of characters. |
Constructor and Description |
---|
PDFTextStream(java.nio.ByteBuffer pdfData, java.lang.String pdfName): Creates a new PDFTextStream that reads PDF content from the given ByteBuffer. |
PDFTextStream(java.nio.ByteBuffer pdfData, java.lang.String pdfName, byte[] userPasswd): Creates a new PDFTextStream that reads PDF content from the given ByteBuffer. |
PDFTextStream(java.nio.ByteBuffer pdfData, java.lang.String pdfName, byte[] userPasswd, PDFTextStreamConfig config): Creates a new PDFTextStream that reads PDF content from the given ByteBuffer. |
PDFTextStream(java.io.File pdfFile): Creates a new PDFTextStream that reads PDF content from the given File. |
PDFTextStream(java.io.File pdfFile, byte[] userPasswd): Creates a new PDFTextStream that reads PDF content from the given File. |
PDFTextStream(java.io.File pdfFile, byte[] userPasswd, PDFTextStreamConfig config): Creates a new PDFTextStream that reads PDF content from the given File. |
PDFTextStream(java.io.InputStream is, java.lang.String pdfName): Creates a new PDFTextStream that reads PDF content from the given InputStream. |
PDFTextStream(java.io.InputStream is, java.lang.String pdfName, byte[] userPasswd): Creates a new PDFTextStream that reads PDF content from the given InputStream. |
PDFTextStream(java.io.InputStream is, java.lang.String pdfName, byte[] userPasswd, PDFTextStreamConfig config): Creates a new PDFTextStream that reads PDF content from the given InputStream. |
PDFTextStream(java.lang.String pdfFilePath): Creates a new PDFTextStream that reads PDF content from the file located at the given path. |
PDFTextStream(java.lang.String pdfFilePath, byte[] userPasswd): Creates a new PDFTextStream that reads PDF content from the file located at the given path. |
PDFTextStream(java.lang.String pdfFilePath, byte[] userPasswd, PDFTextStreamConfig config): Creates a new PDFTextStream that reads PDF content from the file located at the given path. |
Modifier and Type | Method and Description |
---|---|
void | close() |
void | finalize() |
java.util.List | getAllAnnotations(): Returns a list containing all of the annotations contained in the current PDF document. |
int | getAllAnnotations(java.util.List tgt): Adds to the given List all of the annotations contained in the current PDF document. |
java.util.List | getAnnotations(int page): Returns a List of all annotations found on the page indicated by the given page number; each object will be an instance of a class that implements the Annotation interface. |
java.lang.Object | getAttribute(java.lang.String attrName): This method is used to access all of the document-level metadata attributes that are set in a PDF document. |
java.util.Set | getAttributeKeys(): Returns a Set containing the keys of all available document attributes. |
java.util.Map | getAttributeMap(): Returns a Map containing a copy of all keys and values of all available document attributes. |
Bookmark | getBookmarks(): If the current PDF document contains a bookmark tree, this function will return its root node. |
PDFTextStreamConfig | getConfig(): Returns the PDFTextStreamConfig instance that this PDFTextStream instance is using to govern its operation. |
EncryptionInfo | getEncryptionInfo(): Returns an EncryptionInfo object, which provides access to some of the parameters used for the current PDF document's encryption. |
Form | getFormData(): Loads the form data contained in the current document, and returns a Form object that represents that data. |
java.lang.String | getName(): Returns the name of the PDF that this stream is configured to read; this will be either the name of the PDF file that is being read, or the pdfName String that was provided if this instance was created with an InputStream constructor. |
Page | getPage(int n): Reads and returns a single page from the current PDF document. |
int | getPageCnt(): Returns the number of pages in the PDF document. |
java.io.File | getPDFFile(): Returns a reference to the file that this PDFTextStream instance is processing. |
long | getPdfFileSize(): Returns the size of the PDF file being read, in bytes. |
PDFVersion | getPDFVersion(): Retrieves the PDFVersion instance that corresponds with the version of the PDF file specification to which the current PDF file adheres. |
byte[] | getXmlMetadata(): Returns the XML metadata available for the current PDF document. |
static boolean | isLicensed(): Returns true if PDFTextStream has loaded and verified a non-evaluation license file that has not yet expired. |
static boolean | loadLicense(java.lang.String licenseFilePath): Loads and attempts to verify a PDFTextStream license file at the given path. |
static boolean | loadLicense(java.net.URL licenseLocation): Loads and attempts to verify a PDFTextStream license file at the given URL. |
static void | main(java.lang.String[] args): Main-method to allow extraction of text from a PDF file from the command line. |
void | pipe(OutputHandler handler): Extracts all available text from this PDFTextStream instance, sending all PDF text events to the given OutputHandler. |
int | read() |
int | read(char[] buf) |
int | read(char[] buf, int off, int len) |
void | setConfig(PDFTextStreamConfig config): Sets the PDFTextStreamConfig instance that this PDFTextStream instance will use in various contexts to govern its operation. |
public static final java.lang.String ATTR_TITLE
public static final java.lang.String ATTR_AUTHOR
public static final java.lang.String ATTR_SUBJECT
public static final java.lang.String ATTR_KEYWORDS
public static final java.lang.String ATTR_CREATOR
public static final java.lang.String ATTR_PRODUCER
public static final java.lang.String ATTR_CREATION_DATE
parseDateString(String) method.
public static final java.lang.String ATTR_MOD_DATE
parseDateString(String) method.
public static final java.lang.String ATTR_TRAPPED
public static final java.lang.String ATTR_USES_GRAPH_FONTS
Some PDF files use fonts that are image-based -- instead of their encodings mapping character codes to standard Unicode characters, they map character codes to images of characters. This makes it possible for these kinds of fonts (typically referred to as Type3 fonts) to, for example, map the character code 32 to the image of a letter 'g' instead of the standard space character.
PDFTextStream can derive the Unicode encoding of Type3 fonts in many cases, and will do
so automatically if possible. Otherwise, content that uses
a Type3 font for which no proper encoding can be derived will be skipped, and a
document attribute with this key will be set and mapped to a
Boolean
object with a value of true
.
public PDFTextStream(java.io.InputStream is, java.lang.String pdfName) throws java.io.IOException
Any InputStream provided to a PDFTextStream constructor will be read in its entirety and
written to a temporary file for processing. All temporary files are closed and deleted when
the creating PDFTextStream instance is closed or (in the worst case) garbage-collected.
is - an InputStream delivering the content of a PDF file
pdfName - the name of the PDF file (used mostly in logging / debugging)
java.io.IOException - if an error occurs while initializing the new PDFTextStream
EncryptedPDFException - if an error occurs configuring the new PDFTextStream to decrypt the PDF data.
public PDFTextStream(java.io.File pdfFile) throws java.io.IOException
pdfFile - the PDF file to be read
java.io.IOException - if an error occurs while initializing the new PDFTextStream
EncryptedPDFException - if an error occurs configuring the new PDFTextStream to decrypt the PDF file.
public PDFTextStream(java.lang.String pdfFilePath) throws java.io.IOException
pdfFilePath - the path to the PDF file to be read
java.io.IOException - if an error occurs while initializing the new PDFTextStream
EncryptedPDFException - if an error occurs configuring the new PDFTextStream to decrypt the PDF file.
public PDFTextStream(java.io.InputStream is, java.lang.String pdfName, byte[] userPasswd, PDFTextStreamConfig config) throws java.io.IOException
Any InputStream provided to a PDFTextStream constructor will be read in its entirety and
written to a temporary file for processing. All temporary files are closed and deleted when
the creating PDFTextStream instance is closed or (in the worst case) garbage-collected.
is - an InputStream delivering the content of a PDF file
pdfName - the name of the PDF file (used mostly in logging / debugging)
userPasswd - the password that should be used to decrypt the given PDF data -- defaults to an empty byte array.
config - a PDFTextStreamConfig object from which the new PDFTextStream instance will obtain
various configuration settings.
java.io.IOException - if an error occurs while initializing the new PDFTextStream
EncryptedPDFException - if an error occurs configuring the new PDFTextStream to decrypt the PDF data.
public PDFTextStream(java.io.InputStream is, java.lang.String pdfName, byte[] userPasswd) throws java.io.IOException
Any InputStream provided to a PDFTextStream constructor will be read in its entirety and
written to a temporary file for processing. All temporary files are closed and deleted when
the creating PDFTextStream instance is closed or (in the worst case) garbage-collected.
is - an InputStream delivering the content of a PDF file
pdfName - the name of the PDF file (used mostly in logging / debugging)
userPasswd - the password that should be used to decrypt the given PDF data -- defaults to an empty byte array.
java.io.IOException - if an error occurs while initializing the new PDFTextStream
EncryptedPDFException - if an error occurs configuring the new PDFTextStream to decrypt the PDF data.
public PDFTextStream(java.io.File pdfFile, byte[] userPasswd, PDFTextStreamConfig config) throws java.io.IOException
pdfFile - the PDF file to be read
userPasswd - the password that should be used to decrypt the given PDF file -- defaults to an empty byte array.
config - a PDFTextStreamConfig object from which the new PDFTextStream instance will obtain
various configuration settings.
java.io.IOException - if an error occurs while initializing the new PDFTextStream
EncryptedPDFException - if an error occurs configuring the new PDFTextStream to decrypt the PDF file.
public PDFTextStream(java.lang.String pdfFilePath, byte[] userPasswd, PDFTextStreamConfig config) throws java.io.IOException
pdfFilePath - the path to the PDF file to be read
userPasswd - the password that should be used to decrypt the given PDF file -- defaults to an empty byte array.
config - a PDFTextStreamConfig object from which the new PDFTextStream instance will obtain
various configuration settings.
java.io.IOException - if an error occurs while initializing the new PDFTextStream
EncryptedPDFException - if an error occurs configuring the new PDFTextStream to decrypt the PDF file.
public PDFTextStream(java.io.File pdfFile, byte[] userPasswd) throws java.io.IOException
pdfFile - the PDF file to be read
userPasswd - the password that should be used to decrypt the given PDF file -- defaults to an empty byte array.
java.io.IOException - if an error occurs while initializing the new PDFTextStream
EncryptedPDFException - if an error occurs configuring the new PDFTextStream to decrypt the PDF file.
public PDFTextStream(java.lang.String pdfFilePath, byte[] userPasswd) throws java.io.IOException
pdfFilePath - the path to the PDF file to be read
userPasswd - the password that should be used to decrypt the PDF file -- defaults to an empty byte array.
java.io.IOException - if an error occurs while initializing the new PDFTextStream
EncryptedPDFException - if an error occurs configuring the new PDFTextStream to decrypt the PDF file.
public PDFTextStream(java.nio.ByteBuffer pdfData, java.lang.String pdfName, byte[] userPasswd, PDFTextStreamConfig config) throws java.io.IOException
pdfData - a ByteBuffer providing the entirety of a PDF file's data
pdfName - the name of the PDF whose data is provided by pdfData (this name is used
only for logging and debugging purposes).
userPasswd - the password that should be used to decrypt the given PDF data -- defaults to an empty byte array.
config - a PDFTextStreamConfig object from which the new PDFTextStream instance will obtain
various configuration settings.
java.io.IOException - if an error occurs while initializing the new PDFTextStream
EncryptedPDFException - if an error occurs configuring the new PDFTextStream to decrypt the PDF data.
public PDFTextStream(java.nio.ByteBuffer pdfData, java.lang.String pdfName, byte[] userPasswd) throws java.io.IOException
pdfData - a ByteBuffer providing the entirety of a PDF file's data
pdfName - the name of the PDF whose data is provided by pdfData (this name is used
only for logging and debugging purposes).
userPasswd - the password that should be used to decrypt the given PDF data -- defaults to an empty byte array.
java.io.IOException - if an error occurs while initializing the new PDFTextStream
EncryptedPDFException - if an error occurs configuring the new PDFTextStream to decrypt the PDF data.
public PDFTextStream(java.nio.ByteBuffer pdfData, java.lang.String pdfName) throws java.io.IOException
pdfData - a ByteBuffer providing the entirety of a PDF file's data
pdfName - the name of the PDF whose data is provided by pdfData (this name is used
only for logging and debugging purposes).
java.io.IOException - if an error occurs while initializing the new PDFTextStream
EncryptedPDFException - if an error occurs configuring the new PDFTextStream to decrypt the PDF data.
public void setConfig(PDFTextStreamConfig config)
Sets the PDFTextStreamConfig instance that this PDFTextStream instance will
use in various contexts to govern its operation.
Note that certain configuration options are utilized only
during PDFTextStream initialization (such as PDFTextStreamConfig.isMemoryMappingEnabled()).
In order for non-default settings for such options to take effect, a customized PDFTextStreamConfig
object must either be set as the default configuration,
or must be provided to any of the PDFTextStream constructors that accept a
PDFTextStreamConfig object.
public PDFTextStreamConfig getConfig()
Returns the PDFTextStreamConfig instance that this PDFTextStream instance is using
to govern its operation.
public int read() throws java.io.IOException
read
in class java.io.Reader
java.io.IOException
public int read(char[] buf) throws java.io.IOException
read
in class java.io.Reader
java.io.IOException
public int read(char[] buf, int off, int len) throws java.io.IOException
read
in class java.io.Reader
java.io.IOException
public void pipe(OutputHandler handler) throws java.io.IOException
Extracts all available text from this PDFTextStream instance, sending all PDF text events
to the given OutputHandler
. Using this method of text extraction will always be
the fastest approach, as it eliminates any and all of the
intermediate data copying that is necessary to support extraction
via PDFTextStream's java.io.Reader
implementation.
If no special PDF text event handling is needed (i.e. you just want a straight text extract),
then just pass a simple OutputTarget
instance to this method.
The results of using both this extraction method and the java.io.Reader interface on the
same PDFTextStream instance are undefined.
handler - an OutputHandler instance.
java.io.IOException - if an error occurs during the extraction process
See also: OutputHandler, OutputTarget
public long getPdfFileSize()
public int getPageCnt()
public Page getPage(int n) throws java.io.IOException
n - the number of the page to retrieve.
java.io.IOException - if an error occurs while preparing the Page for use
public java.lang.String getName()
Returns the name of the PDF that this stream is configured to read; this will be either the name of the PDF
file that is being read, or the pdfName String that was provided if this instance was created
with an InputStream constructor.
Nearly all of the logging messages generated by the PDFTextStream library include the current PDFTextStream
instance's name, making them easier to interpret in a multithreaded environment.
public java.io.File getPDFFile()
java.io.File- or java.io.InputStream-based constructors.
public void finalize()
finalize
in class java.lang.Object
public void close() throws java.io.IOException
close
in interface java.io.Closeable
close
in interface java.lang.AutoCloseable
close
in class java.io.Reader
java.io.IOException
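Since PDFTextStream extends java.io.Reader and close() is inherited from java.lang.AutoCloseable as well as java.io.Closeable, an instance can be managed with try-with-resources like any other Reader. Below is a minimal sketch of that pattern, written against the plain Reader type so that it runs without the library; the class and method names are hypothetical.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;

public class ReaderDrain {
    /**
     * Reads every remaining character from the given Reader and closes it,
     * even if reading fails part-way through.
     */
    public static String drainAndClose(Reader source) {
        StringBuilder text = new StringBuilder();
        char[] buf = new char[4096];
        try (Reader in = source) {
            int n;
            // read(char[], int, int) returns -1 once the stream is exhausted.
            while ((n = in.read(buf, 0, buf.length)) != -1) {
                text.append(buf, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return text.toString();
    }
}
```

Closing promptly matters here because metadata, form, and bookmark accessors on this page are documented as unusable after close, so any such calls need to happen before the Reader is drained and closed this way.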
public Form getFormData() throws java.io.IOException
Loads the form data contained in the current document, and returns a Form object
that represents that data. If the current PDF contains no forms, this function returns null.
The Form instance that is returned by this function is guaranteed to be an AcroForm.
This function MUST NOT be called after this PDFTextStream instance is closed.
java.io.IOException - if an error occurs loading the form data
public Bookmark getBookmarks() throws java.io.IOException
closed.
java.io.IOException - if an error occurs reading the bookmark tree
See also: Bookmark
public java.util.List getAnnotations(int page) throws java.io.IOException
Returns a List of all annotations found on the page indicated by the given page number;
each object will be an instance of a class that implements the Annotation interface.
This function will never return null; if a page contains no annotations, an empty list will be returned.
The returned list is guaranteed to offer efficient random access to its elements.
java.io.IOException - if an error occurs retrieving the annotation data
See also: Annotation
public java.util.List getAllAnnotations() throws java.io.IOException
java.io.IOException - if an error occurs retrieving the annotation data
See also: Annotation
public int getAllAnnotations(java.util.List tgt) throws java.io.IOException
java.io.IOException - if an error occurs retrieving the annotation data
See also: Annotation
public PDFVersion getPDFVersion() throws java.io.IOException
Retrieves the PDFVersion instance that corresponds with the version of the PDF file specification to which the current PDF file adheres. PDF specification version numbers correspond directly with particular versions of Adobe Acrobat.
Newer versions of Acrobat can generally read PDF files written to older versions of the specification. For example, Acrobat 5 should be able to read any PDF file that adheres to versions 1.0, 1.1, 1.2, 1.3, or 1.4 of the PDF file spec, etc.
Note that this method may not be called after the PDFTextStream instance is closed.
java.io.IOException - if an error occurs in determining what the PDF file's version is
public EncryptionInfo getEncryptionInfo()
public byte[] getXmlMetadata() throws java.io.IOException
Returns the XML metadata available for the current PDF document. If no XML metadata is available in the current document, this method returns null.
Note: This method must be called before the PDFTextStream instance is closed, and it should not be called while text is being actively read out of it. (Supporting such concurrency would require synchronization that would negatively impact performance.) Therefore, the best times to call this method are:
PDFTextStream does not control the content returned by this method -- it just provides access to the data that is already stored in a PDF document. The schema of the returned XML data is defined by Adobe, and is called the Extensible Metadata Platform (XMP). More information about XMP can be found on Adobe's website.
java.io.IOException - if this PDFTextStream instance has already been closed, or if an error occurs retrieving
the XML metadata.
public java.lang.Object getAttribute(java.lang.String attrName) throws java.io.IOException
getAttributeKeys() method to get a Set of the names of all available attributes.
parseDateString(String) to get a Date object.
getXmlMetadata() method.
attrName - the name of the attribute to be retrieved
java.io.IOException - if an error occurs while retrieving the PDF document's metadata
See also: getXmlMetadata() for access to the XML-formatted document metadata
public java.util.Set getAttributeKeys() throws java.io.IOException
java.io.IOException - if an error occurs while retrieving the PDF document's metadata
public java.util.Map getAttributeMap() throws java.io.IOException
java.io.IOException - if an error occurs while retrieving the PDF document's metadata
public static boolean loadLicense(java.lang.String licenseFilePath)
Loads and attempts to verify a PDFTextStream license file at the given path.
PDFTextStream may also be configured to load a license file from a specific path by setting the
system property or environment variable pdfts_license_path to that path.
licenseFilePath - an absolute or relative file path
public static boolean loadLicense(java.net.URL licenseLocation)
Loads and attempts to verify a PDFTextStream license file at the given URL.
licenseLocation - a URL object
public static boolean isLicensed()
public static void main(java.lang.String[] args)
java PDFTextStream [pdfFile] [optional outputpath]
pdfFile should be a path to the PDF file you wish to extract text from; outputpath
should be a path to which you want the text extracted from the PDF to be written. If no
outputpath is provided, then the text of the PDF file will be written to stdout.