com.snowtide.pdf
Class PDFTextStream

java.lang.Object
  extended by java.io.Reader
      extended by com.snowtide.pdf.PDFTextStream
All Implemented Interfaces:
java.io.Closeable, java.lang.Readable

public class PDFTextStream
extends java.io.Reader
implements java.io.Closeable

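PDFTextStream gives your Java, .NET, and Python applications the ability to extract text from PDF documents, extract and update interactive form data, and access document metadata, bookmarks, and annotations.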

Instances of this class can either access a PDF file directly, or process equivalent data delivered via a java.io.InputStream or java.nio.ByteBuffer.

Certain aspects of PDFTextStream's operation may be customized by providing a suitably-configured PDFTextStreamConfig object to a PDFTextStream constructor, or by changing the default PDFTextStreamConfig instance via the PDFTextStreamConfig.setDefaultConfig(PDFTextStreamConfig) function, or by setting a PDFTextStream instance's configuration settings after initialization via the setConfig(PDFTextStreamConfig) function.

Level of Support

PDFTextStream supports the core of the PDF file specification up to and including version 1.7 (corresponding to Acrobat 8), including 40/128-bit document encryption methods. PDFTextStream also supports a variety of PDF format variants: formats that deviate from the official PDF document specification significantly, yet still render as expected in Adobe Reader.

Text Extraction

Using PDFTextStream to extract text from PDF documents is very simple; first, create an instance of PDFTextStream with a reference to a PDF file (alternatively, you can provide a java.io.InputStream or a java.nio.ByteBuffer):

 PDFTextStream stream = new PDFTextStream(pdfFile);
 
Once a PDFTextStream instance is available, it can be used just like a java.io.Reader:
 BufferedReader bufPDF = new BufferedReader(stream);
 String firstLine = bufPDF.readLine();
 // ... etc. ...
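
Because PDFTextStream extends java.io.Reader, any general-purpose Reader utility can consume it. The following is a minimal sketch that uses only JDK classes; the ReaderDrain class and its readAll method are hypothetical names introduced for illustration, and a StringReader stands in for a PDFTextStream instance so that the example runs without the library:

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

final class ReaderDrain {
    /** Reads every remaining character from the given reader, then closes it. */
    static String readAll(Reader in) throws IOException {
        StringBuilder out = new StringBuilder();
        char[] buf = new char[4096];
        try (Reader r = in) {
            int n;
            // read(char[], int, int) returns -1 once the end of the stream is reached
            while ((n = r.read(buf, 0, buf.length)) != -1) {
                out.append(buf, 0, n);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) throws IOException {
        // Assumption: in real use, a PDFTextStream would be passed here instead.
        System.out.println(readAll(new StringReader("first line\nsecond line")));
    }
}
```

Closing the reader inside the helper matters for PDFTextStream in particular, since closing an instance is what releases its underlying resources.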
 
That's convenient, but only using PDFTextStream's java.io.Reader interface can be limiting; for instance, there's no way to extract text from individual pages of a PDF that way. A more flexible extraction mechanism is available by using OutputHandler implementations to control the extraction of text. The "standard" implementation is OutputTarget (which is what PDFTextStream uses to format PDF text delivered through its java.io.Reader interface):
 
 Page page = stream.getPage(0);
 StringBuffer sb = new StringBuffer(1024);
 OutputTarget tgt = new OutputTarget(sb);
 page.pipe(tgt);
 
 String firstPageText = sb.toString();
 
OutputTarget can also direct extracted text to a file on disk via a java.io.Writer, instead of to a StringBuffer:
 Writer textOutputFile = new OutputStreamWriter(new BufferedOutputStream(new FileOutputStream(new File("C:\\pdfExtract.txt"))));
 OutputTarget tgt = new OutputTarget(textOutputFile);
 page.pipe(tgt);
 
OutputTarget is only one of the OutputHandler implementations provided with PDFTextStream. Another commonly-used implementation is VisualOutputTarget. In contrast to OutputTarget, which separates columns and other blocks of text to enable semantically-sensitive applications (such as search indexing), VisualOutputTarget retains the visual appearance and layout of each page of extracted text as much as possible. It is used just like OutputTarget:
 StringBuffer sb = new StringBuffer(1024);
 VisualOutputTarget tgt = new VisualOutputTarget(sb);
 page.pipe(tgt);
 
 String firstPageText = sb.toString();
 
Source code for some sample OutputHandler implementations is included with PDFTextStream, including GoogleHTMLOutputHandler and XMLOutputTarget. Building a custom OutputHandler implementation is sometimes the simplest and most straightforward way to handle PDF text extracts appropriately for one's application.

Form Data Extraction and Updating

PDFTextStream supports the extraction of interactive AcroForm data, as well as updating the values of most field types in such forms.

Form Data Extraction

The AcroForm instance for a particular PDF file may be retrieved using the getFormData() function. From there, all of the AcroFormFields available in that PDF file may be retrieved. PDFTextStream also includes XMLFormExport, which will generate an XML document containing all interactive form data associated with a PDF document. (The source code for XMLFormExport is also included in the PDFTextStream distribution for your reference.)

Updating Interactive Forms

The persistent values of form fields accessible through the AcroForm may also be updated. Doing so is usually as simple as calling AcroFormField.setValue(String) on the fields to be changed, using the desired new values as arguments. Some field types also provide simpler or more comprehensive setters appropriate for that field type; for example, the AcroCheckboxField provides the AcroCheckboxField.setValue(boolean) function, which enables a checkbox's value to be set without having to determine what String should be used to represent the "checked" checkbox state.

After updating the values of form fields as appropriate, either the AcroForm.writeUpdatedDocument(File) or AcroForm.writeUpdatedDocument(OutputStream) may be used to write out an updated version of the PDF document that contains the new form field values.

Metadata Access

PDFTextStream provides access to all document-level metadata. This metadata includes creation and modification dates, author information, what application was used to generate a PDF document, and other items of potential interest. There are two potential sources of this metadata within a PDF document, and PDFTextStream provides a mechanism for retrieving metadata from each source.

Name / Value Pairs

Most PDF documents contain a mapping of simple name/value pair metadata attributes, which are stored in the document '/Info' object. PDFTextStream provides a set of methods for accessing these metadata attributes: getAttribute(String), getAttributeKeys(), and getAttributeMap().

These methods may be called at any time before a PDFTextStream instance is closed. For more details about retrieval of metadata attribute values, please refer to the documentation for getAttribute(String).
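
Since getAttributeMap() returns a plain java.util.Map, its contents can be inspected with ordinary collection code. The sketch below formats such a map as key-ordered lines; the AttributeDump class, its describe method, and the sample keys and values are hypothetical, and a hand-built map stands in for the one a PDFTextStream instance would return:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class AttributeDump {
    /** Returns "key = value" lines ordered by key; null values are shown as "(none)". */
    static List<String> describe(Map<?, ?> attrs) {
        // Copy into a TreeMap so the output order does not depend on the source map
        TreeMap<String, Object> sorted = new TreeMap<>();
        for (Map.Entry<?, ?> e : attrs.entrySet()) {
            sorted.put(String.valueOf(e.getKey()), e.getValue());
        }
        List<String> lines = new ArrayList<>();
        for (Map.Entry<String, Object> e : sorted.entrySet()) {
            Object v = e.getValue();
            lines.add(e.getKey() + " = " + (v == null ? "(none)" : v));
        }
        return lines;
    }

    public static void main(String[] args) {
        // Assumption: sample data only; real keys and values come from the PDF document.
        Map<String, Object> attrs = new LinkedHashMap<>();
        attrs.put("Title", "Quarterly Report");
        attrs.put("Author", "A. Writer");
        for (String line : describe(attrs)) {
            System.out.println(line);
        }
    }
}
```

Values are handled as Object rather than String because not every attribute maps to a String; ATTR_USES_GRAPH_FONTS, for example, maps to a Boolean.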

XMP Metadata

Adobe has developed an XML-based architecture for delivering richer, more flexible metadata within a PDF document, called XMP (Extensible Metadata Platform). Many PDF documents include XMP streams, which can be accessed via the getXmlMetadata() method. This XML data typically is just another view of the metadata stored in the 'classic' document /Info object, but in some PDF workflows, the XMP data is used to carry richer metadata than can be stored in the 'classic' way. More information about XMP can be found at Adobe's website.

Bookmark Data Extraction

PDFTextStream supports the retrieval of bookmarks supplied by some PDF documents (sometimes referred to as outline data). Bookmarks are represented in PDF documents as a simple tree structure, which PDFTextStream's Bookmark implementation mirrors. See the getBookmarks() function and the Bookmark class for details.
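
Walking a tree of this kind is usually a short depth-first recursion. The following sketch flattens an outline into indented titles; the Node class is a hypothetical stand-in written for this example and is not the library's Bookmark class, whose accessors are described in its own documentation:

```java
import java.util.ArrayList;
import java.util.List;

final class OutlineWalk {
    /** Hypothetical outline node: a title plus an ordered list of children. */
    static final class Node {
        final String title;
        final List<Node> children = new ArrayList<>();

        Node(String title) { this.title = title; }

        Node add(Node child) { children.add(child); return this; }
    }

    /** Returns one line per node in depth-first order, indented two spaces per level. */
    static List<String> flatten(Node root) {
        List<String> out = new ArrayList<>();
        collect(root, 0, out);
        return out;
    }

    private static void collect(Node node, int depth, List<String> out) {
        StringBuilder line = new StringBuilder();
        for (int i = 0; i < depth; i++) {
            line.append("  ");
        }
        out.add(line.append(node.title).toString());
        for (Node child : node.children) {
            collect(child, depth + 1, out);
        }
    }

    public static void main(String[] args) {
        Node root = new Node("Root")
            .add(new Node("Chapter 1").add(new Node("Section 1.1")))
            .add(new Node("Chapter 2"));
        for (String line : flatten(root)) {
            System.out.println(line);
        }
    }
}
```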

Annotation

PDFTextStream supports the retrieval of PDF annotations; these include textual annotations (notes, comments, etc.), URLs (used by PDF documents to implement hyperlinks), and others. Several functions in PDFTextStream support the retrieval of annotations (getAllAnnotations(), getAllAnnotations(List), and getAnnotations(int)); see the documentation for Annotation for details on how each type of annotation is implemented.

Character Sets and Encodings

Text in a PDF document can be encoded in a variety of ways. PDFTextStream supports all single-byte and double-byte Unicode character sets; it is therefore able to extract all text written using western languages (English, Spanish, French, Icelandic, Dutch, Swedish, German, etc.) as well as Chinese, Japanese, and Korean (including vertical writing modes). PDFTextStream does not currently support right-to-left writing modes, so text in languages such as Arabic and Hebrew is not extracted as one would expect.

Logging

PDFTextStream is designed to integrate smoothly into its environment; logging is commonly a large part of that. To that end, PDFTextStream's LoggingRegistry provides a central hook for customizing which logging framework PDFTextStream links to, and how. See the documentation for LoggingRegistry for details.

Utilities

Errors

Many PDFTextStream functions and constructors pass IOExceptions along as they are thrown due to underlying system I/O errors (permissions issues, etc.). FaultyPDFExceptions may also be thrown in circumstances where a parsing or file structure problem is detected by PDFTextStream, and it is suspected that the PDF file in question is corrupt, invalid, or otherwise not readable. Any errors encountered while decrypting PDF content will be signaled by an EncryptedPDFException.

Version:
©2004-2012 Snowtide Informatics Systems, Inc.

Field Summary
static java.lang.String ATTR_AUTHOR
          Document attribute key used to retrieve a String indicating who created a PDF document.
static java.lang.String ATTR_CREATION_DATE
          Document attribute key used to retrieve a String indicating the date and time that a PDF document was created.
static java.lang.String ATTR_CREATOR
          Document attribute key used to retrieve a String indicating the name of the application that created the original document from which the PDF was generated.
static java.lang.String ATTR_KEYWORDS
          Document attribute key used to retrieve a String containing keywords associated with a PDF document.
static java.lang.String ATTR_MOD_DATE
          Document attribute key used to retrieve a String indicating the date and time that a PDF document was last modified.
static java.lang.String ATTR_PRODUCER
          Document attribute key used to retrieve a String indicating the name of the application that generated a PDF document.
static java.lang.String ATTR_SUBJECT
          Document attribute key used to retrieve a String indicating the subject of a PDF document.
static java.lang.String ATTR_TITLE
          Document attribute key used to retrieve a String indicating the title of a PDF document.
static java.lang.String ATTR_TRAPPED
          Document attribute key used to retrieve an indicator as to whether a PDF document includes trapping information (trapping is a method for correcting printing errors in high-quality printing environments).
static java.lang.String ATTR_USES_GRAPH_FONTS
           Some PDF files use fonts that are image-based -- instead of their encodings mapping character codes to standard Unicode characters, they map character codes to images of characters.
 
Fields inherited from class java.io.Reader
lock
 
Constructor Summary
PDFTextStream(java.nio.ByteBuffer pdfData, java.lang.String pdfName)
          Creates a new PDFTextStream that reads PDF content from the given ByteBuffer.
PDFTextStream(java.nio.ByteBuffer pdfData, java.lang.String pdfName, byte[] userPasswd)
          Creates a new PDFTextStream that reads PDF content from the given ByteBuffer.
PDFTextStream(java.nio.ByteBuffer pdfData, java.lang.String pdfName, byte[] userPasswd, PDFTextStreamConfig config)
          Creates a new PDFTextStream that reads PDF content from the given ByteBuffer.
PDFTextStream(java.io.File pdfFile)
          Creates a new PDFTextStream that reads PDF content from the given File.
PDFTextStream(java.io.File pdfFile, byte[] userPasswd)
          Creates a new PDFTextStream that reads PDF content from the given File.
PDFTextStream(java.io.File pdfFile, byte[] userPasswd, PDFTextStreamConfig config)
          Creates a new PDFTextStream that reads PDF content from the given File.
PDFTextStream(java.io.InputStream is, java.lang.String pdfName)
          Creates a new PDFTextStream that reads PDF content from the given InputStream.
PDFTextStream(java.io.InputStream is, java.lang.String pdfName, byte[] userPasswd)
          Creates a new PDFTextStream that reads PDF content from the given InputStream.
PDFTextStream(java.io.InputStream is, java.lang.String pdfName, byte[] userPasswd, PDFTextStreamConfig config)
          Creates a new PDFTextStream that reads PDF content from the given InputStream.
PDFTextStream(java.lang.String pdfFilePath)
          Creates a new PDFTextStream that reads PDF content from a file located at the given path.
PDFTextStream(java.lang.String pdfFilePath, byte[] userPasswd)
          Creates a new PDFTextStream that reads PDF content from the file located at the given path.
PDFTextStream(java.lang.String pdfFilePath, byte[] userPasswd, PDFTextStreamConfig config)
          Creates a new PDFTextStream that reads PDF content from the file located at the given path.
 
Method Summary
 void close()
           
 void finalize()
           
 java.util.List getAllAnnotations()
          Returns a list containing all of the annotations contained in the current PDF document.
 int getAllAnnotations(java.util.List tgt)
          Adds to the given List all of the annotations contained in the current PDF document.
 java.util.List getAnnotations(int page)
          Returns a List of all annotations found on the page indicated by the given page number; each object will be an instance of a class that implements the Annotation interface.
 java.lang.Object getAttribute(java.lang.String attrName)
          This method is used to access all of the document-level metadata attributes that are set in a PDF document.
 java.util.Set getAttributeKeys()
          Returns a Set containing the keys of all available document attributes.
 java.util.Map getAttributeMap()
          Returns a Map containing a copy of all keys and values of all available document attributes.
 Bookmark getBookmarks()
          If the current PDF document contains a bookmark tree, this function will return its root node.
 PDFTextStreamConfig getConfig()
          Returns the PDFTextStreamConfig instance that this PDFTextStream instance is using to govern its operation.
 EncryptionInfo getEncryptionInfo()
          Returns an EncryptionInfo object, which provides access to some of the parameters used for the current PDF document's encryption.
 Form getFormData()
          Loads the form data contained in the current document, and returns a Form object that represents that data.
 java.lang.String getName()
          Returns the name of the PDF that this stream is configured to read; this will be either the name of the PDF file that is being read, or the pdfName String that was provided if this instance was created with an InputStream constructor.
 Page getPage(int n)
          Reads and returns a single page from the current PDF document.
 int getPageCnt()
          Returns the number of pages in the PDF document.
 java.io.File getPDFFile()
          Returns a reference to the file that this PDFTextStream instance is processing.
 long getPdfFileSize()
          Returns the size of the PDF file being read, in bytes.
 PDFVersion getPDFVersion()
           Retrieves the PDFVersion instance that corresponds with the version of the PDF file specification to which the current PDF file adheres.
 byte[] getXmlMetadata()
           Returns the XML metadata available for the current PDF document.
static boolean isLicensed()
          Returns true if PDFTextStream has loaded and verified a non-evaluation license file that has not yet expired.
static boolean loadLicense(java.lang.String licenseFilePath)
          Loads and attempts to verify a PDFTextStream license file at the given path.
static boolean loadLicense(java.net.URL licenseLocation)
          Loads and attempts to verify a PDFTextStream license file at the given URL.
static void main(java.lang.String[] args)
          Main-method to allow extraction of text from a PDF file from the command line.
 void pipe(OutputHandler handler)
           Extracts all available text from this PDFTextStream instance, sending all PDF text events to the given OutputHandler.
 int read()
           
 int read(char[] buf)
           
 int read(char[] buf, int off, int len)
           
 void setConfig(PDFTextStreamConfig config)
          Sets the PDFTextStreamConfig instance that this PDFTextStream instance will use in various contexts to govern its operation.
 
Methods inherited from class java.io.Reader
mark, markSupported, read, ready, reset, skip
 
Methods inherited from class java.lang.Object
clone, equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

ATTR_TITLE

public static final java.lang.String ATTR_TITLE
Document attribute key used to retrieve a String indicating the title of a PDF document.

See Also:
Constant Field Values

ATTR_AUTHOR

public static final java.lang.String ATTR_AUTHOR
Document attribute key used to retrieve a String indicating who created a PDF document.

See Also:
Constant Field Values

ATTR_SUBJECT

public static final java.lang.String ATTR_SUBJECT
Document attribute key used to retrieve a String indicating the subject of a PDF document.

See Also:
Constant Field Values

ATTR_KEYWORDS

public static final java.lang.String ATTR_KEYWORDS
Document attribute key used to retrieve a String containing keywords associated with a PDF document.

See Also:
Constant Field Values

ATTR_CREATOR

public static final java.lang.String ATTR_CREATOR
Document attribute key used to retrieve a String indicating the name of the application that created the original document from which the PDF was generated.

See Also:
Constant Field Values

ATTR_PRODUCER

public static final java.lang.String ATTR_PRODUCER
Document attribute key used to retrieve a String indicating the name of the application that generated a PDF document.

See Also:
Constant Field Values

ATTR_CREATION_DATE

public static final java.lang.String ATTR_CREATION_DATE
Document attribute key used to retrieve a String indicating the date and time that a PDF document was created. This String may be parsed into a java.util.Date object by passing it to the parseDateString(String) method.

See Also:
Constant Field Values
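
For readers curious about what parsing such a date string involves, the sketch below handles the common full form of a PDF date, D:YYYYMMDDHHmmSS followed by either Z or an offset such as -05'00'. It is an illustration written with JDK classes only and is not the implementation behind parseDateString(String). The PdfDateSketch class is a hypothetical name; truncated date forms are not handled, and treating a missing offset as UTC is an assumption of this sketch:

```java
import java.time.OffsetDateTime;
import java.time.ZoneOffset;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class PdfDateSketch {
    // Groups: 1 year, 2 month, 3 day, 4 hour, 5 minute, 6 second,
    // 7 offset sign, 8 offset hours, 9 offset minutes
    private static final Pattern FULL_FORM = Pattern.compile(
        "D:(\\d{4})(\\d{2})(\\d{2})(\\d{2})(\\d{2})(\\d{2})(?:Z|([+-])(\\d{2})'(\\d{2})'?)?");

    /** Parses a full-form PDF date string; throws IllegalArgumentException otherwise. */
    static OffsetDateTime parse(String pdfDate) {
        Matcher m = FULL_FORM.matcher(pdfDate.trim());
        if (!m.matches()) {
            throw new IllegalArgumentException("Unsupported PDF date: " + pdfDate);
        }
        // Assumption: no offset (or an explicit Z) is treated as UTC
        ZoneOffset offset = ZoneOffset.UTC;
        if (m.group(7) != null) {
            int sign = m.group(7).equals("-") ? -1 : 1;
            offset = ZoneOffset.ofHoursMinutes(
                sign * Integer.parseInt(m.group(8)),
                sign * Integer.parseInt(m.group(9)));
        }
        return OffsetDateTime.of(
            Integer.parseInt(m.group(1)), Integer.parseInt(m.group(2)),
            Integer.parseInt(m.group(3)), Integer.parseInt(m.group(4)),
            Integer.parseInt(m.group(5)), Integer.parseInt(m.group(6)),
            0, offset);
    }

    public static void main(String[] args) {
        System.out.println(parse("D:20040315093000-05'00'"));
    }
}
```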

ATTR_MOD_DATE

public static final java.lang.String ATTR_MOD_DATE
Document attribute key used to retrieve a String indicating the date and time that a PDF document was last modified. This String may be parsed into a java.util.Date object by passing it to the parseDateString(String) method.

See Also:
Constant Field Values

ATTR_TRAPPED

public static final java.lang.String ATTR_TRAPPED
Document attribute key used to retrieve an indicator as to whether a PDF document includes trapping information (trapping is a method for correcting printing errors in high-quality printing environments). This key maps to a String, the valid values of which are 'True', 'False', and 'Unknown'.

See Also:
Constant Field Values

ATTR_USES_GRAPH_FONTS

public static final java.lang.String ATTR_USES_GRAPH_FONTS

Some PDF files use fonts that are image-based -- instead of their encodings mapping character codes to standard Unicode characters, they map character codes to images of characters. This makes it possible for these kinds of fonts (typically referred to as Type3 fonts) to, for example, map the character code 32 to the image of a letter 'g' instead of the standard space character.

PDFTextStream can derive the Unicode encoding of Type3 fonts in many cases, and will do so automatically if possible. Otherwise, content that uses a Type3 font for which no proper encoding can be derived will be skipped, and a document attribute with this key will be set and mapped to a Boolean object with a value of true.

See Also:
Constant Field Values
Constructor Detail

PDFTextStream

public PDFTextStream(java.io.InputStream is,
                     java.lang.String pdfName)
              throws java.io.IOException
Creates a new PDFTextStream that reads PDF content from the given InputStream. Please note that because reading PDF content requires random access to any and all parts of the PDF data, an InputStream provided to a PDFTextStream constructor will be read in its entirety and written to a temporary file for processing. All temporary files are closed and deleted when the creating PDFTextStream instance is closed or (in the worst case) garbage-collected.

Parameters:
is - an InputStream delivering the content of a PDF file
pdfName - the name of the PDF file (used mostly in logging / debugging)
Throws:
java.io.IOException - if an error occurs while initializing the new PDFTextStream
EncryptedPDFException - if an error occurs configuring the new PDFTextStream to decrypt the PDF data.
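
The reason an InputStream has to be spooled to disk is that a stream can only be read forward, while a file can be read at any offset. The sketch below illustrates that general technique with JDK classes only; it is not PDFTextStream's internal code, and the SpoolSketch class and its method names are hypothetical:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

final class SpoolSketch {
    /** Copies the entire stream into a new temporary file and returns its path. */
    static Path spool(InputStream in) throws IOException {
        Path tmp = Files.createTempFile("spool-", ".tmp");
        // createTempFile already created the file, so the copy must replace it
        Files.copy(in, tmp, StandardCopyOption.REPLACE_EXISTING);
        return tmp;
    }

    /** Reads the single byte at the given offset, or -1 if the offset is past the end. */
    static int byteAt(Path file, long offset) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "r")) {
            raf.seek(offset);
            return raf.read();
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "%PDF-1.7 sample bytes".getBytes(StandardCharsets.US_ASCII);
        Path tmp = spool(new ByteArrayInputStream(data));
        try {
            System.out.println((char) byteAt(tmp, 5));
        } finally {
            // Mirror the documented behavior: temporary files are removed when done
            Files.deleteIfExists(tmp);
        }
    }
}
```

When the PDF data is already in memory or on disk, the ByteBuffer- and File-based constructors avoid this extra copy.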

PDFTextStream

public PDFTextStream(java.io.File pdfFile)
              throws java.io.IOException
Creates a new PDFTextStream that reads PDF content from the given File.

Parameters:
pdfFile - the PDF file to be read
Throws:
java.io.IOException - if an error occurs while initializing the new PDFTextStream
EncryptedPDFException - if an error occurs configuring the new PDFTextStream to decrypt the PDF file.

PDFTextStream

public PDFTextStream(java.lang.String pdfFilePath)
              throws java.io.IOException
Creates a new PDFTextStream that reads PDF content from a file located at the given path.

Parameters:
pdfFilePath - the path to the PDF file to be read
Throws:
java.io.IOException - if an error occurs while initializing the new PDFTextStream
EncryptedPDFException - if an error occurs configuring the new PDFTextStream to decrypt the PDF file.

PDFTextStream

public PDFTextStream(java.io.InputStream is,
                     java.lang.String pdfName,
                     byte[] userPasswd,
                     PDFTextStreamConfig config)
              throws java.io.IOException
Creates a new PDFTextStream that reads PDF content from the given InputStream. Please note that because reading PDF content requires random access to any and all parts of the PDF data, an InputStream provided to a PDFTextStream constructor will be read in its entirety and written to a temporary file for processing. All temporary files are closed and deleted when the creating PDFTextStream instance is closed or (in the worst case) garbage-collected.

Parameters:
is - an InputStream delivering the content of a PDF file
pdfName - the name of the PDF file (used mostly in logging / debugging)
userPasswd - the password that should be used to decrypt the given PDF data; defaults to an empty byte array.
config - a PDFTextStreamConfig object from which the new PDFTextStream instance will obtain various configuration settings.
Throws:
java.io.IOException - if an error occurs while initializing the new PDFTextStream
EncryptedPDFException - if an error occurs configuring the new PDFTextStream to decrypt the PDF data.

PDFTextStream

public PDFTextStream(java.io.InputStream is,
                     java.lang.String pdfName,
                     byte[] userPasswd)
              throws java.io.IOException
Creates a new PDFTextStream that reads PDF content from the given InputStream. Please note that because reading PDF content requires random access to any and all parts of the PDF data, an InputStream provided to a PDFTextStream constructor will be read in its entirety and written to a temporary file for processing. All temporary files are closed and deleted when the creating PDFTextStream instance is closed or (in the worst case) garbage-collected.

Parameters:
is - an InputStream delivering the content of a PDF file
pdfName - the name of the PDF file (used mostly in logging / debugging)
userPasswd - the password that should be used to decrypt the given PDF data; defaults to an empty byte array.
Throws:
java.io.IOException - if an error occurs while initializing the new PDFTextStream
EncryptedPDFException - if an error occurs configuring the new PDFTextStream to decrypt the PDF data.

PDFTextStream

public PDFTextStream(java.io.File pdfFile,
                     byte[] userPasswd,
                     PDFTextStreamConfig config)
              throws java.io.IOException
Creates a new PDFTextStream that reads PDF content from the given File.

Parameters:
pdfFile - the PDF file to be read
userPasswd - the password that should be used to decrypt the given PDF file; defaults to an empty byte array.
config - a PDFTextStreamConfig object from which the new PDFTextStream instance will obtain various configuration settings.
Throws:
java.io.IOException - if an error occurs while initializing the new PDFTextStream
EncryptedPDFException - if an error occurs configuring the new PDFTextStream to decrypt the PDF file.

PDFTextStream

public PDFTextStream(java.lang.String pdfFilePath,
                     byte[] userPasswd,
                     PDFTextStreamConfig config)
              throws java.io.IOException
Creates a new PDFTextStream that reads PDF content from the file located at the given path.

Parameters:
pdfFilePath - the path to the PDF file to be read
userPasswd - the password that should be used to decrypt the given PDF file; defaults to an empty byte array.
config - a PDFTextStreamConfig object from which the new PDFTextStream instance will obtain various configuration settings.
Throws:
java.io.IOException - if an error occurs while initializing the new PDFTextStream
EncryptedPDFException - if an error occurs configuring the new PDFTextStream to decrypt the PDF file.

PDFTextStream

public PDFTextStream(java.io.File pdfFile,
                     byte[] userPasswd)
              throws java.io.IOException
Creates a new PDFTextStream that reads PDF content from the given File.

Parameters:
pdfFile - the PDF file to be read
userPasswd - the password that should be used to decrypt the given PDF file; defaults to an empty byte array.
Throws:
java.io.IOException - if an error occurs while initializing the new PDFTextStream
EncryptedPDFException - if an error occurs configuring the new PDFTextStream to decrypt the PDF file.

PDFTextStream

public PDFTextStream(java.lang.String pdfFilePath,
                     byte[] userPasswd)
              throws java.io.IOException
Creates a new PDFTextStream that reads PDF content from the file located at the given path.

Parameters:
pdfFilePath - the path to the PDF file to be read
userPasswd - the password that should be used to decrypt the PDF file; defaults to an empty byte array.
Throws:
java.io.IOException - if an error occurs while initializing the new PDFTextStream
EncryptedPDFException - if an error occurs configuring the new PDFTextStream to decrypt the PDF file.

PDFTextStream

public PDFTextStream(java.nio.ByteBuffer pdfData,
                     java.lang.String pdfName,
                     byte[] userPasswd,
                     PDFTextStreamConfig config)
              throws java.io.IOException
Creates a new PDFTextStream that reads PDF content from the given ByteBuffer.

Parameters:
pdfData - a ByteBuffer providing the entirety of a PDF file's data
pdfName - the name of the PDF whose data is provided by pdfData (this name is used only for logging and debugging purposes).
userPasswd - the password that should be used to decrypt the given PDF data; defaults to an empty byte array.
config - a PDFTextStreamConfig object from which the new PDFTextStream instance will obtain various configuration settings.
Throws:
java.io.IOException - if an error occurs while initializing the new PDFTextStream
EncryptedPDFException - if an error occurs configuring the new PDFTextStream to decrypt the PDF data.

PDFTextStream

public PDFTextStream(java.nio.ByteBuffer pdfData,
                     java.lang.String pdfName,
                     byte[] userPasswd)
              throws java.io.IOException
Creates a new PDFTextStream that reads PDF content from the given ByteBuffer.

Parameters:
pdfData - a ByteBuffer providing the entirety of a PDF file's data
pdfName - the name of the PDF whose data is provided by pdfData (this name is used only for logging and debugging purposes).
userPasswd - the password that should be used to decrypt the given PDF data; defaults to an empty byte array.
Throws:
java.io.IOException - if an error occurs while initializing the new PDFTextStream
EncryptedPDFException - if an error occurs configuring the new PDFTextStream to decrypt the PDF data.

PDFTextStream

public PDFTextStream(java.nio.ByteBuffer pdfData,
                     java.lang.String pdfName)
              throws java.io.IOException
Creates a new PDFTextStream that reads PDF content from the given ByteBuffer.

Parameters:
pdfData - a ByteBuffer providing the entirety of a PDF file's data
pdfName - the name of the PDF whose data is provided by pdfData (this name is used only for logging and debugging purposes).
Throws:
java.io.IOException - if an error occurs while initializing the new PDFTextStream
EncryptedPDFException - if an error occurs configuring the new PDFTextStream to decrypt the PDF data.
Method Detail

setConfig

public void setConfig(PDFTextStreamConfig config)
Sets the PDFTextStreamConfig instance that this PDFTextStream instance will use in various contexts to govern its operation.

Note that certain configuration options are utilized only during PDFTextStream initialization (such as PDFTextStreamConfig.isMemoryMappingEnabled()). In order for non-default settings for such options to take effect, a customized PDFTextStreamConfig object must either be set as the default configuration, or must be provided to any of the PDFTextStream constructors that accept a PDFTextStreamConfig object.


getConfig

public PDFTextStreamConfig getConfig()
Returns the PDFTextStreamConfig instance that this PDFTextStream instance is using to govern its operation.


read

public int read()
         throws java.io.IOException
Overrides:
read in class java.io.Reader
Throws:
java.io.IOException

read

public int read(char[] buf)
         throws java.io.IOException
Overrides:
read in class java.io.Reader
Throws:
java.io.IOException

read

public int read(char[] buf,
                int off,
                int len)
         throws java.io.IOException
Specified by:
read in class java.io.Reader
Throws:
java.io.IOException

pipe

public void pipe(OutputHandler handler)
          throws java.io.IOException

Extracts all available text from this PDFTextStream instance, sending all PDF text events to the given OutputHandler. Using this method of text extraction will always be the fastest approach, as it eliminates any and all of the intermediate data copying that is necessary to support extraction via PDFTextStream's java.io.Reader implementation.

If no special PDF text event handling is needed (i.e. you just want a straight text extract), then just pass a simple OutputTarget instance to this method.

The results of using this extraction method and the java.io.Reader interface on the same PDFTextStream instance are undefined.

Parameters:
handler - an OutputHandler instance.
Throws:
java.io.IOException - if an error occurs during the extraction process
Since:
v1.3
See Also:
OutputHandler, OutputTarget

getPdfFileSize

public long getPdfFileSize()
Returns the size of the PDF file being read, in bytes.

Since:
v1.3

getPageCnt

public int getPageCnt()
Returns the number of pages in the PDF document.


getPage

public Page getPage(int n)
             throws java.io.IOException
Reads and returns a single page from the current PDF document. Page numbers are zero-indexed; they are not meant to correspond with any user-visible page number.

Parameters:
n - the number of the page to retrieve.
Throws:
java.io.IOException - if an error occurs while preparing the Page for use
Since:
v1.3
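
Because page numbers are zero-indexed, applications that accept a 1-based page position from a user need a small conversion before calling getPage(int). The following is a minimal sketch of such a conversion with a bounds check; the PageIndex class and its toZeroBased method are hypothetical, and pageCount is assumed to be the value returned by getPageCnt():

```java
final class PageIndex {
    /**
     * Converts a 1-based ordinal page position (not a printed page label)
     * into the zero-based index expected by a zero-indexed page API.
     */
    static int toZeroBased(int userPage, int pageCount) {
        if (userPage < 1 || userPage > pageCount) {
            throw new IndexOutOfBoundsException(
                "page " + userPage + " is outside 1.." + pageCount);
        }
        return userPage - 1;
    }

    public static void main(String[] args) {
        // Assumption: a 12-page document; the third page has index 2
        System.out.println(toZeroBased(3, 12));
    }
}
```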

getName

public java.lang.String getName()
Returns the name of the PDF that this stream is configured to read; this will be either the name of the PDF file that is being read, or the pdfName String that was provided if this instance was created with an InputStream constructor. Nearly all of the logging messages generated by the PDFTextStream library include the current PDFTextStream instance's name, making them easier to interpret in a multithreaded environment.


getPDFFile

public java.io.File getPDFFile()
Returns a reference to the file that this PDFTextStream instance is processing. This reference may be null if the PDFTextStream instance was not created using one of the java.io.File- or java.io.InputStream-based constructors.


finalize

public void finalize()
Overrides:
finalize in class java.lang.Object

close

public void close()
           throws java.io.IOException
Specified by:
close in interface java.io.Closeable
Specified by:
close in class java.io.Reader
Throws:
java.io.IOException

getFormData

public Form getFormData()
                 throws java.io.IOException
Loads the form data contained in the current document, and returns a Form object that represents that data. If the current PDF contains no forms, this function returns null. The Form instance that is returned by this function is guaranteed to be an AcroForm. This function MUST NOT be called after this PDFTextStream instance is closed.

Throws:
java.io.IOException - if an error occurs loading the form data

getBookmarks

public Bookmark getBookmarks()
                      throws java.io.IOException
If the current PDF document contains a bookmark tree, this function will return its root node. If the document contains no bookmarks, this function will return null. An exception will be thrown if this function is called after this PDFTextStream instance is closed.

Throws:
java.io.IOException - if an error occurs reading the bookmark tree
Since:
v1.3.5
See Also:
Bookmark

getAnnotations

public java.util.List getAnnotations(int page)
                              throws java.io.IOException
Returns a List of all annotations found on the page indicated by the given page number; each object will be an instance of a class that implements the Annotation interface. This function will never return null; if a page contains no annotations, an empty list will be returned. The returned list is guaranteed to offer efficient random access to its elements.

Throws:
java.io.IOException - if an error occurs retrieving the annotation data
Since:
v1.3.5
See Also:
Annotation

getAllAnnotations

public java.util.List getAllAnnotations()
                                 throws java.io.IOException
Returns a list containing all of the annotations contained in the current PDF document. The returned list is guaranteed to offer efficient random access to its elements.

Throws:
java.io.IOException - if an error occurs retrieving the annotation data
Since:
v1.3.5
See Also:
Annotation

getAllAnnotations

public int getAllAnnotations(java.util.List tgt)
                      throws java.io.IOException
Adds to the given List all of the annotations contained in the current PDF document.

Returns:
the number of annotations added to the list
Throws:
java.io.IOException - if an error occurs retrieving the annotation data
Since:
v1.3.5
See Also:
Annotation

getPDFVersion

public PDFVersion getPDFVersion()
                         throws java.io.IOException

Retrieves the PDFVersion instance that corresponds with the version of the PDF file specification to which the current PDF file adheres. PDF specification version numbers correspond directly with particular versions of Adobe Acrobat:

PDF readers are generally backward-compatible: a reader built for a given version of the spec can usually open files written to earlier versions. For example, Acrobat 5 should be able to read any PDF file that adheres to versions 1.0, 1.1, 1.2, 1.3, or 1.4 of the PDF file spec.

Note that this method may not be called after the PDFTextStream instance is closed.

Throws:
java.io.IOException - if an error occurs in determining what the PDF file's version is
Since:
v1.3
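
The compatibility rule of thumb described above (a reader for version 1.4 can generally handle files at 1.4 or earlier) can be expressed as a simple comparison of version numbers. This is a minimal sketch that works on plain "major.minor" strings. PdfSpecVersion and its methods are hypothetical and do not use the PDFVersion class, whose API is not shown here:

```java
// Hypothetical helper, not part of the PDFTextStream API.
final class PdfSpecVersion {
    private PdfSpecVersion() {}

    // Splits "1.4" into {1, 4}; rejects anything not in major.minor form.
    static int[] parse(String version) {
        String[] parts = version.split("\\.");
        if (parts.length != 2) {
            throw new IllegalArgumentException("expected major.minor, got: " + version);
        }
        return new int[] { Integer.parseInt(parts[0]), Integer.parseInt(parts[1]) };
    }

    // True when the file's spec version is no newer than the newest
    // version the reader supports. This is a rule of thumb only.
    static boolean canRead(String readerMaxVersion, String fileVersion) {
        int[] reader = parse(readerMaxVersion);
        int[] file = parse(fileVersion);
        return file[0] < reader[0]
            || (file[0] == reader[0] && file[1] <= reader[1]);
    }
}
```

Comparing the numeric parts rather than the raw strings keeps the check correct for two-digit minor versions.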

getEncryptionInfo

public EncryptionInfo getEncryptionInfo()
Returns an EncryptionInfo object, which provides access to some of the parameters used for the current PDF document's encryption. If the current PDF document is not encrypted, this method will return null.

Since:
v1.3

getXmlMetadata

public byte[] getXmlMetadata()
                      throws java.io.IOException

Returns the XML metadata available for the current PDF document. If no XML metadata is available in the current document, this method returns null.

Note: This method must be called before the PDFTextStream instance is closed, and it should not be called while text is being actively read out of it. (Supporting such concurrency would require synchronization that would negatively impact performance.) Therefore, the best times to call this method are:

PDFTextStream does not control the content returned by this method; it just provides access to the data that is already stored in a PDF document. The schema of the returned XML data is defined by Adobe, and is called the Extensible Metadata Platform (XMP). More information about XMP can be found on Adobe's website.

Throws:
java.io.IOException - if this PDFTextStream instance has already been closed, or if an error occurs retrieving the XML metadata.
Since:
v1.2
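
Since getXmlMetadata() returns the raw XMP packet as a byte array, reading a value from it is left to the caller. The following is a minimal sketch using only the JDK's XML parser. It assumes the packet is well-formed XML and that the title, if present, is stored in a Dublin Core dc:title element, which is common in XMP but not guaranteed for every document. XmpPeek and firstTitle are hypothetical names:

```java
import java.io.ByteArrayInputStream;
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

// Hypothetical helper, not part of the PDFTextStream API.
final class XmpPeek {
    // Dublin Core namespace used by XMP for dc:title.
    static final String DC_NS = "http://purl.org/dc/elements/1.1/";

    private XmpPeek() {}

    // Returns the trimmed text of the first dc:title element, or null
    // when there is no metadata or no title element.
    static String firstTitle(byte[] xmp) throws Exception {
        if (xmp == null) {
            return null; // getXmlMetadata() returns null when no XMP exists
        }
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        // PDFs may come from untrusted sources, so enable secure processing.
        factory.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);
        Document doc = factory.newDocumentBuilder()
                              .parse(new ByteArrayInputStream(xmp));
        NodeList titles = doc.getElementsByTagNameNS(DC_NS, "title");
        if (titles.getLength() == 0) {
            return null;
        }
        return titles.item(0).getTextContent().trim();
    }
}
```

A caller would pass the result of getXmlMetadata() directly, for example XmpPeek.firstTitle(stream.getXmlMetadata()), before closing the stream.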

getAttribute

public java.lang.Object getAttribute(java.lang.String attrName)
                              throws java.io.IOException
This method is used to access all of the document-level metadata attributes that are set in a PDF document. All of the standard attribute names are specified in constants in this class, and are all prefixed with 'ATTR_'. A few notes should be kept in mind when accessing attribute values:

Note: the attributes available through this method are retrieved from the "classic" document /Info entry. The document metadata in an XML format (which typically contains the same set of metadata attributes that are available through this method) may be obtained via the getXmlMetadata() method.

Parameters:
attrName - the name of the attribute to be retrieved
Returns:
the value of the attribute with the given name defined in the PDF document being read, or null if no attribute is available with the given name. The type of this object depends upon which attribute is being retrieved, and is noted in the documentation of the attribute name constants held by this class.
Throws:
java.io.IOException - if an error occurs while retrieving the PDF document's metadata
See Also:
getXmlMetadata() for access to the XML-formatted document metadata

getAttributeKeys

public java.util.Set getAttributeKeys()
                               throws java.io.IOException
Returns a Set containing the keys of all available document attributes.

Throws:
java.io.IOException - if an error occurs while retrieving the PDF document's metadata

getAttributeMap

public java.util.Map getAttributeMap()
                              throws java.io.IOException
Returns a Map containing a copy of all keys and values of all available document attributes.

Throws:
java.io.IOException - if an error occurs while retrieving the PDF document's metadata
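
Because attribute values are plain Objects whose type depends on the attribute, it can help to list each key with the runtime type of its value before writing typed accessors. The following is a minimal sketch that works on any java.util.Map, such as the one returned by getAttributeMap(). AttributeDump and describe are hypothetical names, not part of PDFTextStream:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of the PDFTextStream API.
final class AttributeDump {
    private AttributeDump() {}

    // Produces one "key (Type) = value" line per map entry, in the
    // map's own iteration order. Null values are reported as type "null".
    static List<String> describe(Map<?, ?> attributes) {
        List<String> lines = new ArrayList<>();
        for (Map.Entry<?, ?> entry : attributes.entrySet()) {
            Object value = entry.getValue();
            String type = (value == null) ? "null" : value.getClass().getSimpleName();
            lines.add(entry.getKey() + " (" + type + ") = " + value);
        }
        return lines;
    }
}
```

Since getAttributeMap() returns a copy, a helper like this can iterate the map freely without affecting the PDFTextStream instance.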

loadLicense

public static boolean loadLicense(java.lang.String licenseFilePath)

Loads and attempts to verify a PDFTextStream license file at the given path.

PDFTextStream may also be configured to load a license file from a specific path by setting the system property or environment variable pdfts_license_path to that path.

Parameters:
licenseFilePath - an absolute or relative file path
Returns:
true if a license file was found at the given path, and was successfully verified
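
The property-or-environment lookup described above can be mirrored in application code, for example to log which license path is configured before extraction begins. The following is a minimal sketch. LicensePath is a hypothetical helper, and checking the system property before the environment variable is an assumption, since the order PDFTextStream itself uses is not stated here:

```java
// Hypothetical helper, not part of the PDFTextStream API.
final class LicensePath {
    // Name shared by the system property and the environment variable.
    static final String KEY = "pdfts_license_path";

    private LicensePath() {}

    // Picks the first non-empty candidate; returns null when neither is set.
    // ASSUMPTION: the system property wins over the environment variable.
    static String choose(String systemProperty, String environmentVariable) {
        if (systemProperty != null && !systemProperty.isEmpty()) {
            return systemProperty;
        }
        if (environmentVariable != null && !environmentVariable.isEmpty()) {
            return environmentVariable;
        }
        return null;
    }

    // Reads both sources from the running JVM and applies choose().
    static String resolve() {
        return choose(System.getProperty(KEY), System.getenv(KEY));
    }
}
```

If resolve() returns a path, it can be passed to loadLicense(java.lang.String) and the boolean result checked, so that a missing or invalid license is reported at startup.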

loadLicense

public static boolean loadLicense(java.net.URL licenseLocation)

Loads and attempts to verify a PDFTextStream license file at the given URL.

Parameters:
licenseLocation - a URL object
Returns:
true if a license file was found at the given URL, and was successfully verified

isLicensed

public static boolean isLicensed()
Returns true if PDFTextStream has loaded and verified a non-evaluation license file that has not yet expired.


main

public static void main(java.lang.String[] args)
Main-method to allow extraction of text from a PDF file from the command line. Usage is simple:
java PDFTextStream [pdfFile] [optional outputpath]
pdfFile should be a path to the PDF file you wish to extract text from; outputpath should be a path to which the extracted text is written. If no outputpath is provided, the text of the PDF file is written to stdout.