PDFTextStream (PDFTextStream API Reference)

java.lang.Object
- java.io.Reader
- - com.snowtide.pdf.PDFTextStream

All Implemented Interfaces:

java.io.Closeable, java.lang.AutoCloseable, java.lang.Readable
```
public class PDFTextStream
extends java.io.Reader
implements java.io.Closeable
```
PDFTextStream gives your Java, .NET, and Python applications the ability to:
- Extract text and metadata from PDF documents (including metadata like XMP data, bookmarks, and annotations)
- Extract and update interactive AcroForm data
- Merge PDF documents
Instances of this class can either access a PDF file directly, or process equivalent data delivered via a java.io.InputStream or java.nio.ByteBuffer.

Certain aspects of PDFTextStream's operation may be customized in any of three ways: by providing a suitably configured PDFTextStreamConfig object to a PDFTextStream constructor; by changing the default PDFTextStreamConfig instance via the PDFTextStreamConfig.setDefaultConfig(PDFTextStreamConfig) function; or by changing a PDFTextStream instance's configuration after initialization via the setConfig(PDFTextStreamConfig) function.

Level of Support

PDFTextStream supports the core of the PDF file specification up to and including version 1.7 (corresponding to Acrobat 8), including 40/128-bit document encryption methods. PDFTextStream also supports a variety of PDF format variants: formats that deviate from the official PDF document specification significantly, yet still render as expected in Adobe Reader.

Text Extraction

Using PDFTextStream to extract text from PDF documents is very simple. First, create an instance of PDFTextStream with a reference to a PDF file (alternatively, you can provide a java.io.InputStream or a java.nio.ByteBuffer):
```
 PDFTextStream stream = new PDFTextStream(pdfFile);
 
```
Once a PDFTextStream instance is available, it can be used just like a java.io.Reader:
```
 BufferedReader bufPDF = new BufferedReader(stream);
 String firstLine = bufPDF.readLine();
 // ... etc. ...
 
```
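Because the class is a java.io.Reader and implements java.io.Closeable, ordinary Reader idioms such as try-with-resources should apply to it. The following is a minimal sketch of that idea rather than library code: `ReaderLines` and `firstLines` are hypothetical names, and a StringReader stands in for a PDFTextStream so that the example runs without the library on the classpath.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class; not part of PDFTextStream.
class ReaderLines {

    /** Reads at most maxLines lines from any Reader, then closes it. */
    static List<String> firstLines(Reader source, int maxLines) throws IOException {
        List<String> lines = new ArrayList<>();
        // Closing the BufferedReader also closes the wrapped source Reader.
        try (BufferedReader buffered = new BufferedReader(source)) {
            String line;
            while (lines.size() < maxLines && (line = buffered.readLine()) != null) {
                lines.add(line);
            }
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for: new PDFTextStream(pdfFile)
        Reader standIn = new StringReader("first line\nsecond line\nthird line");
        System.out.println(firstLines(standIn, 2)); // prints [first line, second line]
    }
}
```

Any Reader can be passed to `firstLines`, so the same helper would accept a PDFTextStream instance in place of the StringReader.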
That's convenient, but only using PDFTextStream's java.io.Reader interface can be limiting; for instance, there's no way to extract text from individual pages of a PDF that way. A more flexible extraction mechanism is available by using OutputHandler implementations to control the extraction of text. The "standard" implementation is OutputTarget (which is what PDFTextStream uses to format PDF text delivered through its java.io.Reader interface):
```
 
 Page page = stream.getPage(0);
 StringBuffer sb = new StringBuffer(1024);
 OutputTarget tgt = new OutputTarget(sb);
 page.pipe(tgt);
 
 String firstPageText = sb.toString();
 
```
OutputTarget can also direct extracted text to a file on disk via a java.io.Writer, instead of to a StringBuffer:
```
 Writer textOutputFile = new OutputStreamWriter(new BufferedOutputStream(new FileOutputStream(new File("C:\\pdfExtract.txt"))));
 OutputTarget tgt = new OutputTarget(textOutputFile);
 page.pipe(tgt);
 
```
OutputTarget is only one of the OutputHandler implementations provided with PDFTextStream. Another commonly-used implementation is VisualOutputTarget. In contrast to OutputTarget, which separates columns and other blocks of text to enable semantically-sensitive applications (such as search indexing), VisualOutputTarget retains the visual appearance and layout of each page of extracted text as much as possible. It is used just like OutputTarget:
```
 StringBuffer sb = new StringBuffer(1024);
 VisualOutputTarget tgt = new VisualOutputTarget(sb);
 page.pipe(tgt);
 
 String firstPageText = sb.toString();
 
```
Source code for some sample OutputHandler implementations is included with PDFTextStream, including GoogleHTMLOutputHandler and XMLOutputTarget. Building a custom OutputHandler implementation is sometimes the simplest and most straightforward way to handle PDF text extracts appropriately for one's application.

Form Data Extraction and Updating

PDFTextStream supports the extraction of interactive AcroForm data, as well as updating the values of most field types in such forms.

Form Data Extraction

The AcroForm instance for a particular PDF file may be retrieved using the getFormData() function. From there, all of the AcroFormFields available in that PDF file may be retrieved. PDFTextStream also includes XMLFormExport, which will generate an XML document containing all interactive form data associated with a PDF document. (The source code for XMLFormExport is also included in the PDFTextStream distribution for your reference.)

Updating Interactive Forms

The persistent values of form fields accessible through the AcroForm may also be updated. Doing so is usually as simple as calling AcroFormField.setValue(String) on the fields to be changed, using the desired new values as arguments. Some field types also provide simpler or more comprehensive setters appropriate for that field type; for example, the AcroCheckboxField provides the AcroCheckboxField.setValue(boolean) function, which enables a checkbox's value to be set without having to determine what String should be used to represent the "checked" checkbox state.

After updating the values of form fields as appropriate, either the AcroForm.writeUpdatedDocument(File) or the AcroForm.writeUpdatedDocument(OutputStream) function may be used to write out an updated version of the PDF document that contains the new form field values.

Metadata Access

PDFTextStream provides access to all document-level metadata. This metadata includes creation and modification dates, author information, what application was used to generate a PDF document, and other items of potential interest. There are two potential sources of this metadata within a PDF document, and PDFTextStream provides a mechanism for retrieving metadata from each source.

Name / Value Pairs

Most PDF documents contain a mapping of simple name/value pair metadata attributes, which are stored in the document '/Info' object. PDFTextStream provides a set of methods for accessing these metadata attributes:
- getAttribute(String) for retrieving the value associated with a named attribute
- getAttributeKeys() for retrieving a java.util.Set view of the names of the attributes defined in a particular PDF document
- getAttributeMap() for retrieving a java.util.Map view of all of the metadata name / value mappings.
These methods may be called at any time before a PDFTextStream instance is closed. For more details about retrieval of metadata attribute values, please refer to the documentation for getAttribute(String).
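Date-valued attributes such as ATTR_CREATION_DATE and ATTR_MOD_DATE are delivered as Strings. The sketch below shows one way such a String could be interpreted with only the JDK, assuming it follows the PDF date syntax D:YYYYMMDDHHmmSSOHH'mm' in which every field after the year is optional. `PdfDates` is a hypothetical helper and is not the library's own date parser; treating a missing time zone as UTC is an assumption made here for simplicity, and the sample value is invented.

```java
import java.time.OffsetDateTime;
import java.time.ZoneOffset;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; not part of PDFTextStream.
class PdfDates {

    // Groups: 1 year, 2 month, 3 day, 4 hour, 5 minute, 6 second,
    //         7 "Z", 8 offset sign, 9 offset hours, 10 offset minutes.
    private static final Pattern PDF_DATE = Pattern.compile(
        "^(?:D:)?(\\d{4})(\\d{2})?(\\d{2})?(\\d{2})?(\\d{2})?(\\d{2})?"
        + "(?:(Z)|([+-])(\\d{2})'?(\\d{2})?'?)?$");

    static OffsetDateTime parse(String raw) {
        Matcher m = PDF_DATE.matcher(raw.trim());
        if (!m.matches()) {
            throw new IllegalArgumentException("Not a PDF date string: " + raw);
        }
        // Missing month and day default to 1; missing time fields default to 0.
        int year = Integer.parseInt(m.group(1));
        int month = field(m, 2, 1);
        int day = field(m, 3, 1);
        int hour = field(m, 4, 0);
        int minute = field(m, 5, 0);
        int second = field(m, 6, 0);

        // Assumption: a date with no zone information is treated as UTC.
        ZoneOffset offset = ZoneOffset.UTC;
        if (m.group(8) != null) {
            int sign = m.group(8).equals("-") ? -1 : 1;
            offset = ZoneOffset.ofHoursMinutes(
                sign * Integer.parseInt(m.group(9)), sign * field(m, 10, 0));
        }
        return OffsetDateTime.of(year, month, day, hour, minute, second, 0, offset);
    }

    private static int field(Matcher m, int group, int fallback) {
        String text = m.group(group);
        return text == null ? fallback : Integer.parseInt(text);
    }

    public static void main(String[] args) {
        // Invented sample value in the PDF date syntax.
        System.out.println(parse("D:20040517093000-05'00'"));
    }
}
```

Returning an OffsetDateTime keeps the document's stated UTC offset, which a plain java.util.Date would discard.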

XMP Metadata

Adobe has developed an XML-based architecture for delivering richer, more flexible metadata within a PDF document, called XMP (Extensible Metadata Platform). Many PDF documents include XMP streams, which can be accessed via the getXmlMetadata() method. This XML data typically is just another view of the metadata stored in the 'classic' document /Info object, but in some PDF workflows, the XMP data is used to carry richer metadata than can be stored in the 'classic' way. More information about XMP can be found at Adobe's website.
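Since getXmlMetadata() returns the XMP data as a byte[], that data can be handed to any XML parser. The sketch below uses only the JDK's DOM parser to look up the pdf:Producer property. `XmpProducer` is a hypothetical helper, the embedded packet is an invented sample, and real XMP streams may organize their properties differently.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper class; not part of PDFTextStream.
class XmpProducer {
    static final String RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#";
    static final String PDF_NS = "http://ns.adobe.com/pdf/1.3/";

    /** Returns the pdf:Producer value found in an XMP packet, or null if there is none. */
    static String producer(byte[] xmp) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        factory.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);
        Document doc = factory.newDocumentBuilder().parse(new ByteArrayInputStream(xmp));

        NodeList descriptions = doc.getElementsByTagNameNS(RDF_NS, "Description");
        for (int i = 0; i < descriptions.getLength(); i++) {
            Element description = (Element) descriptions.item(i);
            // XMP allows a simple property to be written as an attribute...
            if (description.hasAttributeNS(PDF_NS, "Producer")) {
                return description.getAttributeNS(PDF_NS, "Producer");
            }
            // ...or as a child element.
            NodeList children = description.getElementsByTagNameNS(PDF_NS, "Producer");
            if (children.getLength() > 0) {
                return children.item(0).getTextContent().trim();
            }
        }
        return null;
    }

    public static void main(String[] args) throws Exception {
        // Invented sample packet; in practice the bytes would come from getXmlMetadata().
        String packet =
            "<x:xmpmeta xmlns:x=\"adobe:ns:meta/\">"
            + "<rdf:RDF xmlns:rdf=\"" + RDF_NS + "\">"
            + "<rdf:Description rdf:about=\"\" xmlns:pdf=\"" + PDF_NS + "\""
            + " pdf:Producer=\"Example Producer\"/>"
            + "</rdf:RDF></x:xmpmeta>";
        System.out.println(producer(packet.getBytes(StandardCharsets.UTF_8)));
    }
}
```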

Bookmark Data Extraction

PDFTextStream supports the retrieval of bookmarks supplied by some PDF documents (sometimes referred to as outline data). Bookmarks are represented in PDF documents as a simple tree structure, which PDFTextStream's Bookmark implementation mirrors. See the getBookmarks() function and the Bookmark class for details.

Annotation

PDFTextStream supports the retrieval of PDF annotations; these include textual annotations (notes, comments, etc.), URLs (used by PDF documents to implement hyperlinks), and others. Several functions in PDFTextStream support the retrieval of annotations (getAllAnnotations(), getAllAnnotations(List), and getAnnotations(int)); see the documentation for Annotation for details on how each type of annotation is implemented.

Character Sets and Encodings

Text in a PDF document can be encoded in a variety of ways. PDFTextStream supports all single-byte and double-byte Unicode character sets; it is therefore able to extract all text written using western languages (English, Spanish, French, Icelandic, Dutch, Swedish, German, etc.) as well as Chinese, Japanese, and Korean (including vertical writing modes). PDFTextStream does not currently support right-to-left writing modes, so text in languages such as Arabic and Hebrew is not extracted as one would expect.

Logging

PDFTextStream is designed to integrate smoothly into its environment; logging is commonly a large part of that. To that end, PDFTextStream's LoggingRegistry provides a central hook for customizing which logging framework PDFTextStream links to, and how. See the documentation for LoggingRegistry for details.

Utilities

- MergeUtil provides PDF document merging functionality
- KodakPrintData enables the extraction of Kodak print job data (%KDK commands) from PDF documents that contain such content.

Errors

Many PDFTextStream functions and constructors pass along IOExceptions thrown by underlying system I/O errors (permissions issues, etc.). FaultyPDFExceptions may also be thrown when PDFTextStream detects a parsing or file structure problem and suspects that the PDF file in question is corrupt, invalid, or otherwise not readable. Any errors encountered while decrypting PDF content will be signaled by an EncryptedPDFException.
Version:

©2004-2012 Snowtide Informatics Systems, Inc.

Field Summary

Fields
Modifier and Type	Field and Description
`static java.lang.String`	`ATTR_AUTHOR` Document attribute key used to retrieve a String indicating who created a PDF document.
`static java.lang.String`	`ATTR_CREATION_DATE` Document attribute key used to retrieve a String indicating the date and time that a PDF document was created.
`static java.lang.String`	`ATTR_CREATOR` Document attribute key used to retrieve a String indicating the name of the application that created the original document from which the PDF was generated.
`static java.lang.String`	`ATTR_KEYWORDS` Document attribute key used to retrieve a String containing keywords associated with a PDF document.
`static java.lang.String`	`ATTR_MOD_DATE` Document attribute key used to retrieve a String indicating the date and time that a PDF document was last modified.
`static java.lang.String`	`ATTR_PRODUCER` Document attribute key used to retrieve a String indicating the name of the application that generated a PDF document.
`static java.lang.String`	`ATTR_SUBJECT` Document attribute key used to retrieve a String indicating the subject of a PDF document.
`static java.lang.String`	`ATTR_TITLE` Document attribute key used to retrieve a String indicating the title of a PDF document.
`static java.lang.String`	`ATTR_TRAPPED` Document attribute key used to retrieve an indicator as to whether a PDF document includes trapping information (trapping is a method for correcting printing errors in high-quality printing environments).
`static java.lang.String`	`ATTR_USES_GRAPH_FONTS` Some PDF files use fonts that are image-based -- instead of their encodings mapping character codes to standard Unicode characters, they map character codes to images of characters.

Fields inherited from class java.io.Reader
lock

Constructor Summary

Constructors
Constructor and Description
`PDFTextStream(java.nio.ByteBuffer pdfData, java.lang.String pdfName)` Creates a new PDFTextStream that reads PDF content from the given ByteBuffer.
`PDFTextStream(java.nio.ByteBuffer pdfData, java.lang.String pdfName, byte[] userPasswd)` Creates a new PDFTextStream that reads PDF content from the given ByteBuffer.
`PDFTextStream(java.nio.ByteBuffer pdfData, java.lang.String pdfName, byte[] userPasswd, PDFTextStreamConfig config)` Creates a new PDFTextStream that reads PDF content from the given ByteBuffer.
`PDFTextStream(java.io.File pdfFile)` Creates a new PDFTextStream that reads PDF content from the given File.
`PDFTextStream(java.io.File pdfFile, byte[] userPasswd)` Creates a new PDFTextStream that reads PDF content from the given File.
`PDFTextStream(java.io.File pdfFile, byte[] userPasswd, PDFTextStreamConfig config)` Creates a new PDFTextStream that reads PDF content from the given File.
`PDFTextStream(java.io.InputStream is, java.lang.String pdfName)` Creates a new PDFTextStream that reads PDF content from the given InputStream.
`PDFTextStream(java.io.InputStream is, java.lang.String pdfName, byte[] userPasswd)` Creates a new PDFTextStream that reads PDF content from the given InputStream.
`PDFTextStream(java.io.InputStream is, java.lang.String pdfName, byte[] userPasswd, PDFTextStreamConfig config)` Creates a new PDFTextStream that reads PDF content from the given InputStream.
`PDFTextStream(java.lang.String pdfFilePath)` Creates a new PDFTextStream that reads PDF content from a file located at the given path.
`PDFTextStream(java.lang.String pdfFilePath, byte[] userPasswd)` Creates a new PDFTextStream that reads PDF content from the file located at the given path.
`PDFTextStream(java.lang.String pdfFilePath, byte[] userPasswd, PDFTextStreamConfig config)` Creates a new PDFTextStream that reads PDF content from the file located at the given path.

Method Summary

Methods
Modifier and Type	Method and Description
`void`	`close()`
`void`	`finalize()`
`java.util.List`	`getAllAnnotations()` Returns a list containing all of the annotations contained in the current PDF document.
`int`	`getAllAnnotations(java.util.List tgt)` Adds to the given List all of the annotations contained in the current PDF document.
`java.util.List`	`getAnnotations(int page)` Returns a List of all annotations found on the page indicated by the given page number; each object will be an instance of a class that implements the `Annotation` interface.
`java.lang.Object`	`getAttribute(java.lang.String attrName)` This method is used to access all of the document-level metadata attributes that are set in a PDF document.
`java.util.Set`	`getAttributeKeys()` Returns a Set containing the keys of all available document attributes.
`java.util.Map`	`getAttributeMap()` Returns a Map containing a copy of all keys and values of all available document attributes.
`Bookmark`	`getBookmarks()` If the current PDF document contains a bookmark tree, this function will return its root node.
`PDFTextStreamConfig`	`getConfig()` Returns the `PDFTextStreamConfig` instance that this `PDFTextStream` instance is using to govern its operation.
`EncryptionInfo`	`getEncryptionInfo()` Returns an EncryptionInfo object, which provides access to some of the parameters used for the current PDF document's encryption.
`Form`	`getFormData()` Loads the form data contained in the current document, and returns a `Form` object that represents that data.
`java.lang.String`	`getName()` Returns the name of the PDF that this stream is configured to read; this will be either the name of the PDF file that is being read, or the `pdfName` String that was provided if this instance was created with an InputStream constructor.
`Page`	`getPage(int n)` Reads and returns a single page from the current PDF document.
`int`	`getPageCnt()` Returns the number of pages in the PDF document.
`java.io.File`	`getPDFFile()` Returns a reference to the file that this PDFTextStream instance is processing.
`long`	`getPdfFileSize()` Returns the size of the PDF file being read, in bytes.
`PDFVersion`	`getPDFVersion()` Retrieves the PDFVersion instance that corresponds with the version of the PDF file specification to which the current PDF file adheres.
`byte[]`	`getXmlMetadata()` Returns the XML metadata available for the current PDF document.
`static boolean`	`isLicensed()` Returns true if PDFTextStream has loaded and verified a non-evaluation license file that has not yet expired.
`static boolean`	`loadLicense(java.lang.String licenseFilePath)` Loads and attempts to verify a PDFTextStream license file at the given path.
`static boolean`	`loadLicense(java.net.URL licenseLocation)` Loads and attempts to verify a PDFTextStream license file at the given URL.
`static void`	`main(java.lang.String[] args)` Main-method to allow extraction of text from a PDF file from the command line.
`void`	`pipe(OutputHandler handler)` Extracts all available text from this PDFTextStream instance, sending all PDF text events to the given `OutputHandler`.
`int`	`read()`
`int`	`read(char[] buf)`
`int`	`read(char[] buf, int off, int len)`
`void`	`setConfig(PDFTextStreamConfig config)` Sets the `PDFTextStreamConfig` instance that this `PDFTextStream` instance will use in various contexts to govern its operation.

Methods inherited from class java.io.Reader
mark, markSupported, read, ready, reset, skip

Methods inherited from class java.lang.Object
clone, equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

- Field Detail
  - ATTR_TITLE
```
public static final java.lang.String ATTR_TITLE
```
    Document attribute key used to retrieve a String indicating the title of a PDF document.
    
    See Also:
    Constant Field Values
  - ATTR_AUTHOR
```
public static final java.lang.String ATTR_AUTHOR
```
    Document attribute key used to retrieve a String indicating who created a PDF document.
    
    See Also:
    Constant Field Values
  - ATTR_SUBJECT
```
public static final java.lang.String ATTR_SUBJECT
```
    Document attribute key used to retrieve a String indicating the subject of a PDF document.
    
    See Also:
    Constant Field Values
  - ATTR_KEYWORDS
```
public static final java.lang.String ATTR_KEYWORDS
```
    Document attribute key used to retrieve a String containing keywords associated with a PDF document.
    
    See Also:
    Constant Field Values
  - ATTR_CREATOR
```
public static final java.lang.String ATTR_CREATOR
```
    Document attribute key used to retrieve a String indicating the name of the application that created the original document from which the PDF was generated.
    
    See Also:
    Constant Field Values
  - ATTR_PRODUCER
```
public static final java.lang.String ATTR_PRODUCER
```
    Document attribute key used to retrieve a String indicating the name of the application that generated a PDF document.
    
    See Also:
    Constant Field Values
  - ATTR_CREATION_DATE
```
public static final java.lang.String ATTR_CREATION_DATE
```
    Document attribute key used to retrieve a String indicating the date and time that a PDF document was created. This String may be parsed into a java.util.Date object by passing it to the parseDateString(String) method.
    
    See Also:
    Constant Field Values
  - ATTR_MOD_DATE
```
public static final java.lang.String ATTR_MOD_DATE
```
    Document attribute key used to retrieve a String indicating the date and time that a PDF document was last modified. This String may be parsed into a java.util.Date object by passing it to the parseDateString(String) method.
    
    See Also:
    Constant Field Values
  - ATTR_TRAPPED
```
public static final java.lang.String ATTR_TRAPPED
```
    Document attribute key used to retrieve an indicator as to whether a PDF document includes trapping information (trapping is a method for correcting printing errors in high-quality printing environments). This key maps to a String, the valid values of which are 'True', 'False', and 'Unknown'.
    
    See Also:
    Constant Field Values
  - ATTR_USES_GRAPH_FONTS
```
public static final java.lang.String ATTR_USES_GRAPH_FONTS
```
    Some PDF files use fonts that are image-based -- instead of their encodings mapping character codes to standard Unicode characters, they map character codes to images of characters. This makes it possible for these kinds of fonts (typically referred to as Type3 fonts) to, for example, map the character code 32 to the image of a letter 'g' instead of the standard space character.
    
    PDFTextStream can derive the Unicode encoding of Type3 fonts in many cases, and will do so automatically if possible. Otherwise, content that uses a Type3 font for which no proper encoding can be derived will be skipped, and a document attribute with this key will be set and mapped to a Boolean object with a value of true.
    
    See Also:
    Constant Field Values
- Constructor Detail
  - PDFTextStream
```
public PDFTextStream(java.io.InputStream is,
             java.lang.String pdfName)
              throws java.io.IOException
```
    Creates a new PDFTextStream that reads PDF content from the given InputStream. Please note that because reading PDF content requires random access to any and all parts of the PDF data, an InputStream provided to a PDFTextStream constructor will be read in its entirety and written to a temporary file for processing. All temporary files are closed and deleted when the creating PDFTextStream instance is closed or (in the worst case) garbage-collected.
    
    Parameters:
    is - an InputStream delivering the content of a PDF file
    pdfName - the name of the PDF file (used mostly in logging / debugging)
    
    Throws:
    
    java.io.IOException - if an error occurs while initializing the new PDFTextStream
    
    EncryptedPDFException - if an error occurs while configuring the new PDFTextStream to decrypt the PDF data.
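    The reason an InputStream has to be spooled to disk can be seen in a small JDK-only sketch: a stream can only be read front to back, whereas a file supports seeking to arbitrary offsets. The code below illustrates that general technique and makes no claim about how PDFTextStream implements it internally; `SpooledStream` and `readAt` are hypothetical names.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical helper class; not part of PDFTextStream.
class SpooledStream {

    /** Copies the whole stream to a temporary file, then reads length bytes starting at offset. */
    static byte[] readAt(InputStream in, long offset, int length) throws IOException {
        Path temp = Files.createTempFile("spooled-", ".pdf");
        try {
            // createTempFile already created the file, so it must be replaced.
            Files.copy(in, temp, StandardCopyOption.REPLACE_EXISTING);
            try (RandomAccessFile file = new RandomAccessFile(temp.toFile(), "r")) {
                byte[] result = new byte[length];
                file.seek(offset);       // random access: jump straight to the offset
                file.readFully(result);  // throws EOFException if the file is too short
                return result;
            }
        } finally {
            // Remove the temporary copy once it is no longer needed.
            Files.deleteIfExists(temp);
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] fake = "%PDF-1.7\n...\n%%EOF".getBytes(StandardCharsets.US_ASCII);
        byte[] version = readAt(new ByteArrayInputStream(fake), 5, 3);
        System.out.println(new String(version, StandardCharsets.US_ASCII)); // prints 1.7
    }
}
```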
  - PDFTextStream
```
public PDFTextStream(java.io.File pdfFile)
              throws java.io.IOException
```
    Creates a new PDFTextStream that reads PDF content from the given File.
    
    Parameters:
    pdfFile - the PDF file to be read
    
    Throws:
    
    java.io.IOException - if an error occurs while initializing the new PDFTextStream
    
    EncryptedPDFException - if an error occurs while configuring the new PDFTextStream to decrypt the PDF file.
  - PDFTextStream
```
public PDFTextStream(java.lang.String pdfFilePath)
              throws java.io.IOException
```
    Creates a new PDFTextStream that reads PDF content from a file located at the given path.
    
    Parameters:
    pdfFilePath - the path to the PDF file to be read
    
    Throws:
    
    java.io.IOException - if an error occurs while initializing the new PDFTextStream
    
    EncryptedPDFException - if an error occurs while configuring the new PDFTextStream to decrypt the PDF file.
  - PDFTextStream
```
public PDFTextStream(java.io.InputStream is,
             java.lang.String pdfName,
             byte[] userPasswd,
             PDFTextStreamConfig config)
              throws java.io.IOException
```
    Creates a new PDFTextStream that reads PDF content from the given InputStream. Please note that because reading PDF content requires random access to any and all parts of the PDF data, an InputStream provided to a PDFTextStream constructor will be read in its entirety and written to a temporary file for processing. All temporary files are closed and deleted when the creating PDFTextStream instance is closed or (in the worst case) garbage-collected.
    
    Parameters:
    is - an InputStream delivering the content of a PDF file
    pdfName - the name of the PDF file (used mostly in logging / debugging)
    userPasswd - the password that should be used to decrypt the given PDF data (defaults to an empty byte array)
    config - a PDFTextStreamConfig object from which the new PDFTextStream instance will obtain various configuration settings
    
    Throws:
    
    java.io.IOException - if an error occurs while initializing the new PDFTextStream
    
    EncryptedPDFException - if an error occurs while configuring the new PDFTextStream to decrypt the PDF data.
  - PDFTextStream
```
public PDFTextStream(java.io.InputStream is,
             java.lang.String pdfName,
             byte[] userPasswd)
              throws java.io.IOException
```
    Creates a new PDFTextStream that reads PDF content from the given InputStream. Please note that because reading PDF content requires random access to any and all parts of the PDF data, an InputStream provided to a PDFTextStream constructor will be read in its entirety and written to a temporary file for processing. All temporary files are closed and deleted when the creating PDFTextStream instance is closed or (in the worst case) garbage-collected.
    
    Parameters:
    is - an InputStream delivering the content of a PDF file
    pdfName - the name of the PDF file (used mostly in logging / debugging)
    userPasswd - the password that should be used to decrypt the given PDF data (defaults to an empty byte array)
    
    Throws:
    
    java.io.IOException - if an error occurs while initializing the new PDFTextStream
    
    EncryptedPDFException - if an error occurs while configuring the new PDFTextStream to decrypt the PDF data.
  - PDFTextStream
```
public PDFTextStream(java.io.File pdfFile,
             byte[] userPasswd,
             PDFTextStreamConfig config)
              throws java.io.IOException
```
    Creates a new PDFTextStream that reads PDF content from the given File.
    
    Parameters:
    pdfFile - the PDF file to be read
    userPasswd - the password that should be used to decrypt the given PDF file (defaults to an empty byte array)
    config - a PDFTextStreamConfig object from which the new PDFTextStream instance will obtain various configuration settings
    
    Throws:
    
    java.io.IOException - if an error occurs while initializing the new PDFTextStream
    
    EncryptedPDFException - if an error occurs while configuring the new PDFTextStream to decrypt the PDF file.
  - PDFTextStream
```
public PDFTextStream(java.lang.String pdfFilePath,
             byte[] userPasswd,
             PDFTextStreamConfig config)
              throws java.io.IOException
```
    Creates a new PDFTextStream that reads PDF content from the file located at the given path.
    
    Parameters:
    pdfFilePath - the path to the PDF file to be read
    userPasswd - the password that should be used to decrypt the given PDF file (defaults to an empty byte array)
    config - a PDFTextStreamConfig object from which the new PDFTextStream instance will obtain various configuration settings
    
    Throws:
    
    java.io.IOException - if an error occurs while initializing the new PDFTextStream
    
    EncryptedPDFException - if an error occurs while configuring the new PDFTextStream to decrypt the PDF file.
  - PDFTextStream
```
public PDFTextStream(java.io.File pdfFile,
             byte[] userPasswd)
              throws java.io.IOException
```
    Creates a new PDFTextStream that reads PDF content from the given File.
    
    Parameters:
    pdfFile - the PDF file to be read
    userPasswd - the password that should be used to decrypt the given PDF file (defaults to an empty byte array)
    
    Throws:
    
    java.io.IOException - if an error occurs while initializing the new PDFTextStream
    
    EncryptedPDFException - if an error occurs while configuring the new PDFTextStream to decrypt the PDF file.
  - PDFTextStream
```
public PDFTextStream(java.lang.String pdfFilePath,
             byte[] userPasswd)
              throws java.io.IOException
```
    Creates a new PDFTextStream that reads PDF content from the file located at the given path.
    
    Parameters:
    pdfFilePath - the path to the PDF file to be read
    userPasswd - the password that should be used to decrypt the PDF file (defaults to an empty byte array)
    
    Throws:
    
    java.io.IOException - if an error occurs while initializing the new PDFTextStream
    
    EncryptedPDFException - if an error occurs while configuring the new PDFTextStream to decrypt the PDF file.
  - PDFTextStream
```
public PDFTextStream(java.nio.ByteBuffer pdfData,
             java.lang.String pdfName,
             byte[] userPasswd,
             PDFTextStreamConfig config)
              throws java.io.IOException
```
    Creates a new PDFTextStream that reads PDF content from the given ByteBuffer.
    
    Parameters:
    pdfData - a ByteBuffer providing the entirety of a PDF file's data
    pdfName - the name of the PDF whose data is provided by pdfData (this name is used only for logging and debugging purposes)
    userPasswd - the password that should be used to decrypt the given PDF data (defaults to an empty byte array)
    config - a PDFTextStreamConfig object from which the new PDFTextStream instance will obtain various configuration settings
    
    Throws:
    
    java.io.IOException - if an error occurs while initializing the new PDFTextStream
    
    EncryptedPDFException - if an error occurs while configuring the new PDFTextStream to decrypt the PDF data.
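    A ByteBuffer holding an entire PDF can be obtained in several ways using only the JDK. The sketch below memory-maps a file read-only and adds a quick sanity check on the mapped data. `PdfBuffers`, `mapWholeFile`, and `looksLikePdf` are hypothetical names, and the header check is a simplification because it only looks at the very first bytes.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical helper class; not part of PDFTextStream.
class PdfBuffers {

    /** Memory-maps the whole file read-only and returns the mapping as a ByteBuffer. */
    static ByteBuffer mapWholeFile(Path path) throws IOException {
        try (FileChannel channel = FileChannel.open(path, StandardOpenOption.READ)) {
            // A mapping stays valid after the channel that created it is closed.
            return channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
        }
    }

    /** Returns true if the buffer's remaining bytes begin with the "%PDF-" marker. */
    static boolean looksLikePdf(ByteBuffer data) {
        byte[] marker = "%PDF-".getBytes(StandardCharsets.US_ASCII);
        if (data.remaining() < marker.length) {
            return false;
        }
        for (int i = 0; i < marker.length; i++) {
            // Absolute get: leaves the buffer's position untouched.
            if (data.get(data.position() + i) != marker[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) throws IOException {
        Path sample = Files.createTempFile("sample-", ".pdf");
        sample.toFile().deleteOnExit();
        Files.write(sample, "%PDF-1.7\n%%EOF".getBytes(StandardCharsets.US_ASCII));
        ByteBuffer pdfData = mapWholeFile(sample);
        System.out.println(looksLikePdf(pdfData)); // prints true
    }
}
```

    Because `looksLikePdf` uses absolute reads, the buffer's position is unchanged afterwards and the same buffer can still be passed on as the pdfData argument.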
  - PDFTextStream
```
public PDFTextStream(java.nio.ByteBuffer pdfData,
             java.lang.String pdfName,
             byte[] userPasswd)
              throws java.io.IOException
```
    Creates a new PDFTextStream that reads PDF content from the given ByteBuffer.
    
    Parameters:
    pdfData - a ByteBuffer providing the entirety of a PDF file's data
    pdfName - the name of the PDF whose data is provided by pdfData (this name is used only for logging and debugging purposes)
    userPasswd - the password that should be used to decrypt the given PDF data (defaults to an empty byte array)
    
    Throws:
    
    java.io.IOException - if an error occurs while initializing the new PDFTextStream
    
    EncryptedPDFException - if an error occurs while configuring the new PDFTextStream to decrypt the PDF data.
  - PDFTextStream
```
public PDFTextStream(java.nio.ByteBuffer pdfData,
             java.lang.String pdfName)
              throws java.io.IOException
```
    Creates a new PDFTextStream that reads PDF content from the given ByteBuffer.
    
    Parameters:
    pdfData - a ByteBuffer providing the entirety of a PDF file's data
    pdfName - the name of the PDF whose data is provided by pdfData (this name is used only for logging and debugging purposes)
    
    Throws:
    
    java.io.IOException - if an error occurs while initializing the new PDFTextStream
    
    EncryptedPDFException - if an error occurs while configuring the new PDFTextStream to decrypt the PDF data.
- Method Detail
  - setConfig
```
public void setConfig(PDFTextStreamConfig config)
```
    Sets the PDFTextStreamConfig instance that this PDFTextStream instance will use in various contexts to govern its operation.
    Note that certain configuration options are used only during PDFTextStream initialization (such as PDFTextStreamConfig.isMemoryMappingEnabled()). For non-default settings of such options to take effect, a customized PDFTextStreamConfig object must either be set as the default configuration or be provided to one of the PDFTextStream constructors that accept a PDFTextStreamConfig object.
  - getConfig
```
public PDFTextStreamConfig getConfig()
```
    Returns the PDFTextStreamConfig instance that this PDFTextStream instance is using to govern its operation.
  - read
```
public int read()
         throws java.io.IOException
```
    Overrides:
    
    read in class java.io.Reader
    
    Throws:
    
    java.io.IOException
  - read
```
public int read(char[] buf)
         throws java.io.IOException
```
    Overrides:
    
    read in class java.io.Reader
    
    Throws:
    
    java.io.IOException
  - read
```
public int read(char[] buf,
       int off,
       int len)
         throws java.io.IOException
```
    Specified by:
    
    read in class java.io.Reader
    
    Throws:
    
    java.io.IOException
  - pipe
```
public void pipe(OutputHandler handler)
          throws java.io.IOException
```
    Extracts all available text from this PDFTextStream instance, sending all PDF text events to the given OutputHandler. Using this method of text extraction will always be the fastest approach, as it eliminates any and all of the intermediate data copying that is necessary to support extraction via PDFTextStream's java.io.Reader implementation.
    
    If no special PDF text event handling is needed (i.e. you just want a straight text extract), then just pass a simple OutputTarget instance to this method.
    
    The results of using this extraction method and the java.io.Reader interface on the same PDFTextStream instance are undefined.
    
    Parameters:
    handler - - an OutputHandler instance.
    
    Throws:
    
    java.io.IOException - - if an error occurs during the extraction process
    Since:
    
    v1.3
    
    See Also:
    OutputHandler, OutputTarget
  - getPdfFileSize
```
public long getPdfFileSize()
```
    Returns the size of the PDF file being read, in bytes.
    
    Since:
    
    v1.3
  - getPageCnt
```
public int getPageCnt()
```
    Returns the number of pages in the PDF document.
  - getPage
```
public Page getPage(int n)
             throws java.io.IOException
```
    Reads and returns a single page from the current PDF document. Page numbers are zero-indexed; they are not meant to correspond with any user-visible page number.
    
    Parameters:
    n - - the number of the page to retrieve.
    
    Throws:
    
    java.io.IOException - if an error occurs while preparing the Page for use
    Since:
    
    v1.3
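
    Since getPage(int) takes a zero-based index, callers that count pages from 1 need a small conversion. The sketch below is one way to do that, assuming the caller's page numbers are plain 1-based ordinals (not page labels such as "iv"). PageIndex and toZeroBased are hypothetical names, and the pageCnt argument stands in for the value returned by getPageCnt().

```java
// Hypothetical helper (not part of PDFTextStream): maps a 1-based ordinal
// page number onto the zero-based index expected by getPage(int).
final class PageIndex {
    static int toZeroBased(int userPageNumber, int pageCnt) {
        if (userPageNumber < 1 || userPageNumber > pageCnt) {
            throw new IllegalArgumentException(
                "page " + userPageNumber + " is outside 1.." + pageCnt);
        }
        return userPageNumber - 1;
    }

    public static void main(String[] args) {
        int pageCnt = 12; // stand-in for stream.getPageCnt()
        // The twelfth page of a 12-page document is index 11.
        System.out.println(toZeroBased(12, pageCnt));
    }
}
```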
  - getName
```
public java.lang.String getName()
```
    Returns the name of the PDF that this stream is configured to read; this will be either the name of the PDF file that is being read, or the pdfName String that was provided if this instance was created with an InputStream constructor. Nearly all of the logging messages generated by the PDFTextStream library include the current PDFTextStream instance's name, making them easier to interpret in a multithreaded environment.
  - getPDFFile
```
public java.io.File getPDFFile()
```
    Returns a reference to the file that this PDFTextStream instance is processing. This reference may be null if the PDFTextStream instance was not created using one of the java.io.File- or java.io.InputStream-based constructors.
  - finalize
```
public void finalize()
```
    Overrides:
    
    finalize in class java.lang.Object
  - close
```
public void close()
           throws java.io.IOException
```
    Specified by:
    
    close in interface java.io.Closeable
    
    Specified by:
    
    close in interface java.lang.AutoCloseable
    
    Specified by:
    
    close in class java.io.Reader
    
    Throws:
    
    java.io.IOException
  - getFormData
```
public Form getFormData()
                 throws java.io.IOException
```
    Loads the form data contained in the current document, and returns a Form object that represents that data. If the current PDF contains no forms, this function returns null. The Form instance that is returned by this function is guaranteed to be an AcroForm. This function MUST NOT be called after this PDFTextStream instance is closed.
    
    Throws:
    
    java.io.IOException - - if an error occurs loading the form data
  - getBookmarks
```
public Bookmark getBookmarks()
                      throws java.io.IOException
```
    If the current PDF document contains a bookmark tree, this function will return its root node. If the document contains no bookmarks, this function will return null. An exception will be thrown if this function is called after this PDFTextStream instance is closed.
    
    Throws:
    
    java.io.IOException - - if an error occurs reading the bookmark tree
    Since:
    
    v1.3.5
    
    See Also:
    Bookmark
  - getAnnotations
```
public java.util.List getAnnotations(int page)
                              throws java.io.IOException
```
    Returns a List of all annotations found on the page indicated by the given page number; each object will be an instance of a class that implements the Annotation interface. This function will never return null; if a page contains no annotations, an empty list will be returned. The returned list is guaranteed to offer efficient random access to its elements.
    
    Throws:
    
    java.io.IOException - - if an error occurs retrieving the annotation data
    Since:
    
    v1.3.5
    
    See Also:
    Annotation
  - getAllAnnotations
```
public java.util.List getAllAnnotations()
                                 throws java.io.IOException
```
    Returns a list containing all of the annotations contained in the current PDF document. The returned list is guaranteed to offer efficient random access to its elements.
    
    Throws:
    
    java.io.IOException - - if an error occurs retrieving the annotation data
    Since:
    
    v1.3.5
    
    See Also:
    Annotation
  - getAllAnnotations
```
public int getAllAnnotations(java.util.List tgt)
                      throws java.io.IOException
```
    Adds to the given List all of the annotations contained in the current PDF document.
    
    Returns:
    the number of annotations added to the list
    
    Throws:
    
    java.io.IOException - - if an error occurs retrieving the annotation data
    Since:
    
    v1.3.5
    
    See Also:
    Annotation
  - getPDFVersion
```
public PDFVersion getPDFVersion()
                         throws java.io.IOException
```
    Retrieves the PDFVersion instance that corresponds with the version of the PDF file specification to which the current PDF file adheres. PDF specification version numbers correspond directly with particular versions of Adobe Acrobat:
    - v1.0 - Acrobat 1
    - v1.1 - Acrobat 2
    - v1.2 - Acrobat 3
    - v1.3 - Acrobat 4
    - v1.4 - Acrobat 5
    - v1.5 - Acrobat 6
    - v1.6 - Acrobat 7
    - v1.7 - Acrobat 8
    PDF files are generally forward-compatible. For example, Acrobat 5 should be able to read any PDF file that adheres to versions 1.0, 1.1, 1.2, 1.3, or 1.4 of the PDF file spec, etc.
    
    Note that this method must not be called after the PDFTextStream instance is closed.
    Throws:
    
    java.io.IOException - - if an error occurs in determining what the PDF file's version is
    Since:
    
    v1.3
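
    The version table above can be expressed as a simple lookup. This is an illustrative sketch only: AcrobatVersions, acrobatFor, and canRead are hypothetical names, and the specification version is passed as a plain String because the accessors of the PDFVersion class are not described in this section.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical lookup (not part of PDFTextStream) built from the table above.
final class AcrobatVersions {
    private static final Map<String, Integer> BY_SPEC = new LinkedHashMap<>();
    static {
        BY_SPEC.put("1.0", 1);
        BY_SPEC.put("1.1", 2);
        BY_SPEC.put("1.2", 3);
        BY_SPEC.put("1.3", 4);
        BY_SPEC.put("1.4", 5);
        BY_SPEC.put("1.5", 6);
        BY_SPEC.put("1.6", 7);
        BY_SPEC.put("1.7", 8);
    }

    // Returns the Acrobat major version for a spec version, or -1 if unlisted.
    static int acrobatFor(String specVersion) {
        Integer acrobat = BY_SPEC.get(specVersion);
        return acrobat == null ? -1 : acrobat;
    }

    // True if the given Acrobat major version is at least the one that
    // corresponds to the file's spec version.
    static boolean canRead(int acrobatMajor, String specVersion) {
        int needed = acrobatFor(specVersion);
        return needed != -1 && acrobatMajor >= needed;
    }

    public static void main(String[] args) {
        System.out.println(acrobatFor("1.4"));
        System.out.println(canRead(5, "1.6"));
    }
}
```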
  - getEncryptionInfo
```
public EncryptionInfo getEncryptionInfo()
```
    Returns an EncryptionInfo object, which provides access to some of the parameters used for the current PDF document's encryption. If the current PDF document is not encrypted, this method will return null.
    
    Since:
    
    v1.3
  - getXmlMetadata
```
public byte[] getXmlMetadata()
                      throws java.io.IOException
```
    Returns the XML metadata available for the current PDF document. If no XML metadata is available in the current document, this method returns null.
    
    Note: This method must be called before the PDFTextStream instance is closed, and it should not be called while text is being actively read out of it. (Supporting such concurrency would require synchronization that would negatively impact performance.) Therefore, the best times to call this method are:
    - just after creating the PDFTextStream instance but before reading text out of it
    - after all text has been read out of the PDFTextStream instance, but before it is closed
    PDFTextStream does not control the content returned by this method -- it just provides access to the data that is already stored in a PDF document. The schema of the returned XML data is defined by Adobe, and is called the Extensible Metadata Platform (XMP). More information about XMP can be found on Adobe's website.
    Throws:
    
    java.io.IOException - - if this PDFTextStream instance has already been closed, or if an error occurs retrieving the XML metadata.
    Since:
    
    v1.2
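
    The byte array returned by getXmlMetadata() is XML, so it can be handed to any XML parser. The sketch below uses the JDK's DOM parser to pull the first Dublin Core title out of an XMP packet. XmpTitle, firstTitle, and sampleBytes are hypothetical names, and the sample packet is a hand-written stand-in for real getXmlMetadata() output, which may be structured differently.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

// Hypothetical helper (not part of PDFTextStream): reads dc:title from XMP bytes.
final class XmpTitle {
    static final String DC_NS = "http://purl.org/dc/elements/1.1/";

    // Returns the text of the first dc:title element, or null if there is none.
    static String firstTitle(byte[] xmp) {
        if (xmp == null) {
            return null; // getXmlMetadata() returns null when no XMP is present
        }
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder()
                                  .parse(new ByteArrayInputStream(xmp));
            NodeList titles = doc.getElementsByTagNameNS(DC_NS, "title");
            return titles.getLength() == 0
                ? null
                : titles.item(0).getTextContent().trim();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XMP packet", e);
        }
    }

    // Hand-written sample packet, used in place of a real PDF's metadata.
    static byte[] sampleBytes() {
        String xml =
            "<x:xmpmeta xmlns:x=\"adobe:ns:meta/\">"
          + "<rdf:RDF xmlns:rdf=\"http://www.w3.org/1999/02/22-rdf-syntax-ns#\">"
          + "<rdf:Description rdf:about=\"\" xmlns:dc=\"" + DC_NS + "\">"
          + "<dc:title><rdf:Alt><rdf:li xml:lang=\"x-default\">Sample</rdf:li>"
          + "</rdf:Alt></dc:title>"
          + "</rdf:Description></rdf:RDF></x:xmpmeta>";
        return xml.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(firstTitle(sampleBytes()));
    }
}
```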
  - getAttribute
```
public java.lang.Object getAttribute(java.lang.String attrName)
                              throws java.io.IOException
```
    This method is used to access all of the document-level metadata attributes that are set in a PDF document. All of the standard attribute names are specified in constants in this class, and are all prefixed with 'ATTR_'. A few notes should be kept in mind when accessing attribute values:
    - It is typical for only a subset of the possible attributes to be defined in a PDF document. Any attributes that are undefined will return a null value when their name is provided to this method.
    - Many more attributes are used in the real world than are formally specified by the PDF specification. It is entirely up to the PDF generator what attributes are to be outputted for a particular document, so some documents may contain attributes whose names are not canonicalized in the 'ATTR_' constants in this class. You can use the getAttributeKeys() method to get a Set of the names of all available attributes.
    - Most attribute values are Strings, but it is possible for attribute values to be Integers, Booleans, etc. The documentation associated with each attribute name constant in this class specifies what type may be expected when retrieving each particular attribute value. Any attributes specified as dates are returned from this method as String instances; these can be passed through parseDateString(String) to get a Date object.
    Note: the attributes available through this method are retrieved from the "classic" document /Info entry. The document metadata in an XML format (which typically contains the same set of metadata attributes that are available through this method) may be obtained via the getXmlMetadata() method.
    Parameters:
    attrName - - the name of the attribute to be retrieved
    
    Returns:
    the value of the attribute with the given name defined in the PDF document being read, or null if no attribute is available with the given name. The type of this object depends upon which attribute is being retrieved, and is noted in the documentation of the attribute name constants held by this class.
    
    Throws:
    
    java.io.IOException - - if an error occurs while retrieving the PDF document's metadata
    See Also:
    getXmlMetadata() for access to the XML-formatted document metadata
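
    Since attribute values are returned as java.lang.Object and may be undefined, calling code should check the type before using a value. The following is a minimal sketch of that check: AttributeValues and describe are hypothetical names, and the map (including its key names) is an illustrative stand-in for the result of getAttributeMap(), not real library output.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper (not part of PDFTextStream): handles untyped attribute values.
final class AttributeValues {
    // Renders one attribute value without assuming that it is a String.
    static String describe(Object value) {
        if (value == null) {
            return "(undefined)";
        }
        if (value instanceof String) {
            return "String: " + value;
        }
        if (value instanceof Integer) {
            return "Integer: " + value;
        }
        if (value instanceof Boolean) {
            return "Boolean: " + value;
        }
        return value.getClass().getSimpleName() + ": " + value;
    }

    public static void main(String[] args) {
        // Stand-in for stream.getAttributeMap(); the keys are illustrative only.
        Map<String, Object> attrs = new LinkedHashMap<>();
        attrs.put("Title", "Quarterly Report");
        attrs.put("ExampleFlag", Boolean.TRUE);
        for (Map.Entry<String, Object> e : attrs.entrySet()) {
            System.out.println(e.getKey() + " -> " + describe(e.getValue()));
        }
        // A name missing from the document yields null, as with getAttribute.
        System.out.println("Author -> " + describe(attrs.get("Author")));
    }
}
```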
  - getAttributeKeys
```
public java.util.Set getAttributeKeys()
                               throws java.io.IOException
```
    Returns a Set containing the keys of all available document attributes.
    
    Throws:
    
    java.io.IOException - - if an error occurs while retrieving the PDF document's metadata
  - getAttributeMap
```
public java.util.Map getAttributeMap()
                              throws java.io.IOException
```
    Returns a Map containing a copy of all keys and values of all available document attributes.
    
    Throws:
    
    java.io.IOException - - if an error occurs while retrieving the PDF document's metadata
  - loadLicense
```
public static boolean loadLicense(java.lang.String licenseFilePath)
```
    Loads and attempts to verify a PDFTextStream license file at the given path.
    
    PDFTextStream may also be configured to load a license file from a specific path by setting the system property or environment variable pdfts_license_path to that path.
    
    Parameters:
    licenseFilePath - - an absolute or relative file path
    
    Returns:
    true if a license file was found at the given path, and was successfully verified
  - loadLicense
```
public static boolean loadLicense(java.net.URL licenseLocation)
```
    Loads and attempts to verify a PDFTextStream license file at the given URL.
    
    Parameters:
    licenseLocation - - a URL object
    
    Returns:
    true if a license file was found at the given URL, and was successfully verified
  - isLicensed
```
public static boolean isLicensed()
```
    Returns true if PDFTextStream has loaded and verified a non-evaluation license file that has not yet expired.
  - main
```
public static void main(java.lang.String[] args)
```
    Main-method to allow extraction of text from a PDF file from the command line. Usage is simple:
    java PDFTextStream [pdfFile] [optional outputpath]
    pdfFile should be a path to the PDF file you wish to extract text from; outputpath should be a path to which the extracted text will be written. If no outputpath is provided, the text of the PDF file is written to stdout.
