PDFTextStream (PDFTextStream API Reference)

com.snowtide.pdf
Class PDFTextStream

java.lang.Object
  java.io.Reader
      com.snowtide.pdf.PDFTextStream

All Implemented Interfaces:: java.io.Closeable, java.lang.Readable

public class PDFTextStream
extends java.io.Reader
implements java.io.Closeable

PDFTextStream gives your Java, .NET, and Python applications the ability to:

Extract text and metadata from PDF documents (including metadata like XMP data, bookmarks, and annotations)
Extract and update interactive AcroForm data
Merge PDF documents

Instances of this class can either access a PDF file directly, or process equivalent data delivered via a java.io.InputStream or java.nio.ByteBuffer.

Certain aspects of PDFTextStream's operation may be customized in three ways: by providing a suitably-configured PDFTextStreamConfig object to a PDFTextStream constructor; by changing the default PDFTextStreamConfig instance via the PDFTextStreamConfig.setDefaultConfig(PDFTextStreamConfig) function; or by changing a PDFTextStream instance's configuration after initialization via the setConfig(PDFTextStreamConfig) function.

Level of Support

PDFTextStream supports the core of the PDF file specification up to and including version 1.7 (corresponding to Acrobat 8), including 40/128-bit document encryption methods. PDFTextStream also supports a variety of PDF format variants: formats that deviate from the official PDF document specification significantly, yet still render as expected in Adobe Reader.

Text Extraction

Using PDFTextStream to extract text from PDF documents is very simple; first, create an instance of PDFTextStream with a reference to a PDF file (alternatively, you can provide a java.io.InputStream or a java.nio.ByteBuffer):

 PDFTextStream stream = new PDFTextStream(pdfFile);

Once a PDFTextStream instance is available, it can be used just like a java.io.Reader:

 BufferedReader bufPDF = new BufferedReader(stream);
 String firstLine = bufPDF.readLine();
 // ... etc. ...
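
As a minimal sketch of the Reader-style usage above, the following reads every line from any java.io.Reader. A StringReader stands in for the PDFTextStream instance so that the example runs without the library; the class and method names (ReaderLines, readAllLines) are illustrative assumptions and are not part of PDFTextStream.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

class ReaderLines {
    // Reads all lines from the given Reader and closes it when done.
    static List<String> readAllLines(Reader source) throws IOException {
        List<String> lines = new ArrayList<>();
        try (BufferedReader buf = new BufferedReader(source)) {
            String line;
            while ((line = buf.readLine()) != null) {
                lines.add(line);
            }
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for a PDFTextStream instance, which is also a java.io.Reader.
        Reader standIn = new StringReader("first line\nsecond line");
        for (String line : readAllLines(standIn)) {
            System.out.println(line);
        }
    }
}
```

Because the try-with-resources block closes the BufferedReader, the wrapped Reader is closed as well once the last line has been read.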

That's convenient, but only using PDFTextStream's java.io.Reader interface can be limiting; for instance, there's no way to extract text from individual pages of a PDF that way. A more flexible extraction mechanism is available by using OutputHandler implementations to control the extraction of text. The "standard" implementation is OutputTarget (which is what PDFTextStream uses to format PDF text delivered through its java.io.Reader interface):

 
 Page page = stream.getPage(0);
 StringBuffer sb = new StringBuffer(1024);
 OutputTarget tgt = new OutputTarget(sb);
 page.pipe(tgt);
 
 String firstPageText = sb.toString();

OutputTarget can also direct extracted text to a file on disk via a java.io.Writer, instead of to a StringBuffer:

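 // Backslashes in Java string literals must be escaped.
 Writer textOutputFile = new OutputStreamWriter(new BufferedOutputStream(new FileOutputStream(new File("C:\\pdfExtract.txt"))));
 OutputTarget tgt = new OutputTarget(textOutputFile);
 page.pipe(tgt);
 textOutputFile.close(); // flushes buffered text to disk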

OutputTarget is only one of the OutputHandler implementations provided with PDFTextStream. Another commonly-used implementation is VisualOutputTarget. In contrast to OutputTarget, which separates columns and other blocks of text to enable semantically-sensitive applications (such as search indexing), VisualOutputTarget retains the visual appearance and layout of each page of extracted text as much as possible. It is used just like OutputTarget:

 StringBuffer sb = new StringBuffer(1024);
 VisualOutputTarget tgt = new VisualOutputTarget(sb);
 page.pipe(tgt);
 
 String firstPageText = sb.toString();

Source code for some sample OutputHandler implementations is included with PDFTextStream, including GoogleHTMLOutputHandler and XMLOutputTarget. Building a custom OutputHandler implementation is sometimes the simplest and most straightforward way to handle PDF text extracts appropriately for one's application.

Form Data Extraction and Updating

PDFTextStream supports the extraction of interactive AcroForm data, as well as updating the values of most field types in such forms.

Form Data Extraction

The AcroForm instance for a particular PDF file may be retrieved using the getFormData() function. From there, all of the AcroFormFields available in that PDF file may be retrieved. PDFTextStream also includes XMLFormExport, which will generate an XML document containing all interactive form data associated with a PDF document. (The source code for XMLFormExport is also included in the PDFTextStream distribution for your reference.)

Updating Interactive Forms

The persistent values of form fields accessible through the AcroForm may also be updated. Doing so is usually as simple as calling AcroFormField.setValue(String) on the fields to be changed, using the desired new values as arguments. Some field types also provide simpler or more comprehensive setters appropriate for that field type; for example, the AcroCheckboxField provides the AcroCheckboxField.setValue(boolean) function, which enables a checkbox's value to be set without having to determine what String should be used to represent the "checked" checkbox state.

After updating the values of form fields as appropriate, either the AcroForm.writeUpdatedDocument(File) or the AcroForm.writeUpdatedDocument(OutputStream) function may be used to write out an updated version of the PDF document that contains the new form field values.

Metadata Access

PDFTextStream provides access to all document-level metadata. This metadata includes creation and modification dates, author information, what application was used to generate a PDF document, and other items of potential interest. There are two potential sources of this metadata within a PDF document, and PDFTextStream provides a mechanism for retrieving metadata from each source.

Name / Value Pairs

Most PDF documents contain a mapping of simple name/value pair metadata attributes, which are stored in the document '/Info' object. PDFTextStream provides a set of methods for accessing these metadata attributes:

getAttribute(String) for retrieving the value associated with a named attribute
getAttributeKeys() for retrieving a java.util.Set view of the names of the attributes defined in a particular PDF document
getAttributeMap() for retrieving a java.util.Map view of all of the metadata name / value mappings.

These methods may be called at any time before a PDFTextStream instance is closed. For more details about retrieval of metadata attribute values, please refer to the documentation for getAttribute(String).

XMP Metadata

Adobe has developed an XML-based architecture for delivering richer, more flexible metadata within a PDF document, called XMP (Extensible Metadata Platform). Many PDF documents include XMP streams, which can be accessed via the getXmlMetadata() method. This XML data typically is just another view of the metadata stored in the 'classic' document /Info object, but in some PDF workflows, the XMP data is used to carry richer metadata than can be stored in the 'classic' way. More information about XMP can be found at Adobe's website.
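
Since getXmlMetadata() returns the XMP stream as a byte array, one way to inspect it is with the JDK's built-in DOM parser. The sketch below is an assumption about how a caller might do that, not part of the PDFTextStream API: the XmpTitle class, its title method, and the sample packet are invented for illustration, and real XMP packets may be structured differently.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

class XmpTitle {
    // Namespace URI of the Dublin Core element set.
    static final String DC_NS = "http://purl.org/dc/elements/1.1/";

    // Returns the trimmed text of the first dc:title element, or null if there is none.
    static String title(byte[] xmp) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        factory.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);
        Document doc = factory.newDocumentBuilder().parse(new ByteArrayInputStream(xmp));
        NodeList titles = doc.getElementsByTagNameNS(DC_NS, "title");
        return titles.getLength() == 0 ? null : titles.item(0).getTextContent().trim();
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical sample packet; in practice the bytes would come from getXmlMetadata().
        String sample =
              "<x:xmpmeta xmlns:x=\"adobe:ns:meta/\">"
            + "<rdf:RDF xmlns:rdf=\"http://www.w3.org/1999/02/22-rdf-syntax-ns#\">"
            + "<rdf:Description xmlns:dc=\"http://purl.org/dc/elements/1.1/\">"
            + "<dc:title><rdf:Alt><rdf:li xml:lang=\"x-default\">Sample title</rdf:li></rdf:Alt></dc:title>"
            + "</rdf:Description></rdf:RDF></x:xmpmeta>";
        System.out.println(title(sample.getBytes(StandardCharsets.UTF_8)));
    }
}
```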

Bookmark Data Extraction

PDFTextStream supports the retrieval of bookmarks supplied by some PDF documents (sometimes referred to as outline data). Bookmarks are represented in PDF documents as a simple tree structure, which PDFTextStream's Bookmark implementation mirrors. See the getBookmarks() function and the Bookmark class for details.
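
To illustrate working with a tree-shaped outline like the one described above, here is a sketch of a depth-first walk that flattens a tree into indented titles. The Node type is a hypothetical stand-in defined only for this example; it is not the PDFTextStream Bookmark class, whose accessors are documented separately.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class OutlineWalk {
    // Hypothetical stand-in for a bookmark node; NOT the PDFTextStream Bookmark class.
    static final class Node {
        final String title;
        final List<Node> children;

        Node(String title, Node... children) {
            this.title = title;
            this.children = Arrays.asList(children);
        }
    }

    // Depth-first walk that records each title, indented by its depth in the tree.
    static List<String> flatten(Node root) {
        List<String> out = new ArrayList<>();
        collect(root, 0, out);
        return out;
    }

    private static void collect(Node node, int depth, List<String> out) {
        StringBuilder line = new StringBuilder();
        for (int i = 0; i < depth; i++) {
            line.append("  ");
        }
        out.add(line.append(node.title).toString());
        for (Node child : node.children) {
            collect(child, depth + 1, out);
        }
    }

    public static void main(String[] args) {
        Node root = new Node("Document",
                new Node("Chapter 1", new Node("Section 1.1")),
                new Node("Chapter 2"));
        for (String line : flatten(root)) {
            System.out.println(line);
        }
    }
}
```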

Annotation

PDFTextStream supports the retrieval of PDF annotations; these include textual annotations (notes, comments, etc.), URLs (used by PDF documents to implement hyperlinks), and others. Several functions in PDFTextStream support the retrieval of annotations (getAllAnnotations(), getAllAnnotations(List), and getAnnotations(int)); see the documentation for Annotation for details on how each type of annotation is implemented.

Character Sets and Encodings

Text in a PDF document can be encoded in a variety of ways. PDFTextStream supports all single-byte and double-byte Unicode character sets; it is therefore able to extract all text written using western languages (English, Spanish, French, Icelandic, Dutch, Swedish, German, etc.) as well as Chinese, Japanese, and Korean (including vertical writing modes). PDFTextStream does not currently support right-to-left writing modes, so text in languages such as Arabic and Hebrew is not extracted as one would expect.

Logging

PDFTextStream is designed to integrate smoothly into its environment; logging is commonly a large part of that. To that end, PDFTextStream's LoggingRegistry provides a central hook for customizing which logging framework PDFTextStream links to, and how. See the documentation for LoggingRegistry for details.

Utilities

MergeUtil provides PDF document merging functionality
KodakPrintData enables the extraction of Kodak print job data (%KDK commands) from PDF documents that contain such content.

Errors

Many PDFTextStream functions and its constructors pass IOExceptions along as they are thrown due to underlying system I/O errors (permissions issues, etc.). FaultyPDFExceptions may also be thrown in circumstances where a parsing or file structure problem is detected by PDFTextStream, and it is suspected that the PDF file in question is corrupt, invalid, or otherwise not readable. Any errors encountered while decrypting PDF content will be signaled by an EncryptedPDFException.

Version:: ©2004-2012 Snowtide Informatics Systems, Inc.

Field Summary
`static java.lang.String`	`ATTR_AUTHOR` Document attribute key used to retrieve a String indicating who created a PDF document.
`static java.lang.String`	`ATTR_CREATION_DATE` Document attribute key used to retrieve a String indicating the date and time that a PDF document was created.
`static java.lang.String`	`ATTR_CREATOR` Document attribute key used to retrieve a String indicating the name of the application that created the original document from which the PDF was generated.
`static java.lang.String`	`ATTR_KEYWORDS` Document attribute key used to retrieve a String containing keywords associated with a PDF document.
`static java.lang.String`	`ATTR_MOD_DATE` Document attribute key used to retrieve a String indicating the date and time that a PDF document was last modified.
`static java.lang.String`	`ATTR_PRODUCER` Document attribute key used to retrieve a String indicating the name of the application that generated a PDF document.
`static java.lang.String`	`ATTR_SUBJECT` Document attribute key used to retrieve a String indicating the subject of a PDF document.
`static java.lang.String`	`ATTR_TITLE` Document attribute key used to retrieve a String indicating the title of a PDF document.
`static java.lang.String`	`ATTR_TRAPPED` Document attribute key used to retrieve an indicator as to whether a PDF document includes trapping information (trapping is a method for correcting printing errors in high-quality printing environments).
`static java.lang.String`	`ATTR_USES_GRAPH_FONTS` Some PDF files use fonts that are image-based -- instead of their encodings mapping character codes to standard Unicode characters, they map character codes to images of characters.

Fields inherited from class java.io.Reader
`lock`

Constructor Summary
`PDFTextStream(java.nio.ByteBuffer pdfData, java.lang.String pdfName)` Creates a new PDFTextStream that reads PDF content from the given ByteBuffer.
`PDFTextStream(java.nio.ByteBuffer pdfData, java.lang.String pdfName, byte[] userPasswd)` Creates a new PDFTextStream that reads PDF content from the given ByteBuffer.
`PDFTextStream(java.nio.ByteBuffer pdfData, java.lang.String pdfName, byte[] userPasswd, PDFTextStreamConfig config)` Creates a new PDFTextStream that reads PDF content from the given ByteBuffer.
`PDFTextStream(java.io.File pdfFile)` Creates a new PDFTextStream that reads PDF content from the given File.
`PDFTextStream(java.io.File pdfFile, byte[] userPasswd)` Creates a new PDFTextStream that reads PDF content from the given File.
`PDFTextStream(java.io.File pdfFile, byte[] userPasswd, PDFTextStreamConfig config)` Creates a new PDFTextStream that reads PDF content from the given File.
`PDFTextStream(java.io.InputStream is, java.lang.String pdfName)` Creates a new PDFTextStream that reads PDF content from the given InputStream.
`PDFTextStream(java.io.InputStream is, java.lang.String pdfName, byte[] userPasswd)` Creates a new PDFTextStream that reads PDF content from the given InputStream.
`PDFTextStream(java.io.InputStream is, java.lang.String pdfName, byte[] userPasswd, PDFTextStreamConfig config)` Creates a new PDFTextStream that reads PDF content from the given InputStream.
`PDFTextStream(java.lang.String pdfFilePath)` Creates a new PDFTextStream that reads PDF content from a file located at the given path.
`PDFTextStream(java.lang.String pdfFilePath, byte[] userPasswd)` Creates a new PDFTextStream that reads PDF content from the file located at the given path.
`PDFTextStream(java.lang.String pdfFilePath, byte[] userPasswd, PDFTextStreamConfig config)` Creates a new PDFTextStream that reads PDF content from the file located at the given path.

Method Summary
`void`	`close()`
`void`	`finalize()`
`java.util.List`	`getAllAnnotations()` Returns a list containing all of the annotations contained in the current PDF document.
`int`	`getAllAnnotations(java.util.List tgt)` Adds to the given List all of the annotations contained in the current PDF document.
`java.util.List`	`getAnnotations(int page)` Returns a List of all annotations found on the page indicated by the given page number; each object will be an instance of a class that implements the `Annotation` interface.
`java.lang.Object`	`getAttribute(java.lang.String attrName)` This method is used to access all of the document-level metadata attributes that are set in a PDF document.
`java.util.Set`	`getAttributeKeys()` Returns a Set containing the keys of all available document attributes.
`java.util.Map`	`getAttributeMap()` Returns a Map containing a copy of all keys and values of all available document attributes.
`Bookmark`	`getBookmarks()` If the current PDF document contains a bookmark tree, this function will return its root node.
`PDFTextStreamConfig`	`getConfig()` Returns the `PDFTextStreamConfig` instance that this `PDFTextStream` instance is using to govern its operation.
`EncryptionInfo`	`getEncryptionInfo()` Returns an EncryptionInfo object, which provides access to some of the parameters used for the current PDF document's encryption.
`Form`	`getFormData()` Loads the form data contained in the current document, and returns a `Form` object that represents that data.
`java.lang.String`	`getName()` Returns the name of the PDF that this stream is configured to read; this will be either the name of the PDF file that is being read, or the `pdfName` String that was provided if this instance was created with an InputStream constructor.
`Page`	`getPage(int n)` Reads and returns a single page from the current PDF document.
`int`	`getPageCnt()` Returns the number of pages in the PDF document.
`java.io.File`	`getPDFFile()` Returns a reference to the file that this PDFTextStream instance is processing.
`long`	`getPdfFileSize()` Returns the size of the PDF file being read, in bytes.
`PDFVersion`	`getPDFVersion()` Retrieves the PDFVersion instance that corresponds with the version of the PDF file specification to which the current PDF file adheres.
`byte[]`	`getXmlMetadata()` Returns the XML metadata available for the current PDF document.
`static boolean`	`isLicensed()` Returns true if PDFTextStream has loaded and verified a non-evaluation license file that has not yet expired.
`static boolean`	`loadLicense(java.lang.String licenseFilePath)` Loads and attempts to verify a PDFTextStream license file at the given path.
`static boolean`	`loadLicense(java.net.URL licenseLocation)` Loads and attempts to verify a PDFTextStream license file at the given URL.
`static void`	`main(java.lang.String[] args)` Main-method to allow extraction of text from a PDF file from the command line.
`void`	`pipe(OutputHandler handler)` Extracts all available text from this PDFTextStream instance, sending all PDF text events to the given `OutputHandler`.
`int`	`read()`
`int`	`read(char[] buf)`
`int`	`read(char[] buf, int off, int len)`
`void`	`setConfig(PDFTextStreamConfig config)` Sets the `PDFTextStreamConfig` instance that this `PDFTextStream` instance will use in various contexts to govern its operation.

Methods inherited from class java.io.Reader
`mark, markSupported, read, ready, reset, skip`

Methods inherited from class java.lang.Object
`clone, equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait`

Field Detail

ATTR_TITLE

public static final java.lang.String ATTR_TITLE

Document attribute key used to retrieve a String indicating the title of a PDF document.

See Also:: Constant Field Values

ATTR_AUTHOR

public static final java.lang.String ATTR_AUTHOR

Document attribute key used to retrieve a String indicating who created a PDF document.

See Also:: Constant Field Values

ATTR_SUBJECT

public static final java.lang.String ATTR_SUBJECT

Document attribute key used to retrieve a String indicating the subject of a PDF document.

See Also:: Constant Field Values

ATTR_KEYWORDS

public static final java.lang.String ATTR_KEYWORDS

Document attribute key used to retrieve a String containing keywords associated with a PDF document.

See Also:: Constant Field Values

ATTR_CREATOR

public static final java.lang.String ATTR_CREATOR

Document attribute key used to retrieve a String indicating the name of the application that created the original document from which the PDF was generated.

See Also:: Constant Field Values

ATTR_PRODUCER

public static final java.lang.String ATTR_PRODUCER

Document attribute key used to retrieve a String indicating the name of the application that generated a PDF document.

See Also:: Constant Field Values

ATTR_CREATION_DATE

public static final java.lang.String ATTR_CREATION_DATE

Document attribute key used to retrieve a String indicating the date and time that a PDF document was created. This String may be parsed into a java.util.Date object by passing it to the parseDateString(String) method.

See Also:: Constant Field Values

ATTR_MOD_DATE

public static final java.lang.String ATTR_MOD_DATE

Document attribute key used to retrieve a String indicating the date and time that a PDF document was last modified. This String may be parsed into a java.util.Date object by passing it to the parseDateString(String) method.

See Also:: Constant Field Values
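
For reference, date strings in a PDF /Info object conventionally follow the pattern D:YYYYMMDDHHmmSS, optionally followed by a UTC offset, with the fields after the year being optional. The sketch below shows one way such a string could be parsed using only the JDK. It is not the library's parseDateString(String) method: the PdfDates class is an assumption made for illustration, it returns a java.time.LocalDateTime rather than a java.util.Date, and it ignores any UTC offset.

```java
import java.time.LocalDateTime;

class PdfDates {
    // Parses the leading date-time digits of a PDF date string.
    // Missing fields default to the start of the period; any UTC offset is ignored.
    static LocalDateTime parseLocal(String s) {
        String body = s.startsWith("D:") ? s.substring(2) : s;
        int end = 0;
        while (end < body.length() && body.charAt(end) >= '0' && body.charAt(end) <= '9') {
            end++;
        }
        String digits = body.substring(0, end);
        if (digits.length() < 4) {
            throw new IllegalArgumentException("no four-digit year in: " + s);
        }
        return LocalDateTime.of(
                field(digits, 0, 4, 0),   // year
                field(digits, 4, 2, 1),   // month
                field(digits, 6, 2, 1),   // day
                field(digits, 8, 2, 0),   // hour
                field(digits, 10, 2, 0),  // minute
                field(digits, 12, 2, 0)); // second
    }

    // Reads len digits at the given offset, or returns the fallback if the field is absent.
    private static int field(String digits, int offset, int len, int fallback) {
        if (digits.length() < offset + len) {
            return fallback;
        }
        return Integer.parseInt(digits.substring(offset, offset + len));
    }

    public static void main(String[] args) {
        System.out.println(parseLocal("D:20120315143000-05'00'"));
    }
}
```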

ATTR_TRAPPED

public static final java.lang.String ATTR_TRAPPED

Document attribute key used to retrieve an indicator as to whether a PDF document includes trapping information (trapping is a method for correcting printing errors in high-quality printing environments). This key maps to a String, the valid values of which are 'True', 'False', and 'Unknown'.

See Also:: Constant Field Values

ATTR_USES_GRAPH_FONTS

public static final java.lang.String ATTR_USES_GRAPH_FONTS

Some PDF files use fonts that are image-based -- instead of their encodings mapping character codes to standard Unicode characters, they map character codes to images of characters. This makes it possible for these kinds of fonts (typically referred to as Type3 fonts) to, for example, map the character code 32 to the image of a letter 'g' instead of the standard space character.

PDFTextStream can derive the Unicode encoding of Type3 fonts in many cases, and will do so automatically if possible. Otherwise, content that uses a Type3 font for which no proper encoding can be derived will be skipped, and a document attribute with this key will be set and mapped to a Boolean object with a value of true.

See Also:: Constant Field Values

Constructor Detail

PDFTextStream

public PDFTextStream(java.io.InputStream is,
                     java.lang.String pdfName)
              throws java.io.IOException

Creates a new PDFTextStream that reads PDF content from the given InputStream. Please note that because reading PDF content requires random access to any and all parts of the PDF data, an InputStream provided to a PDFTextStream constructor will be read in its entirety and written to a temporary file for processing. All temporary files are closed and deleted when the creating PDFTextStream instance is closed or (in the worst case) garbage-collected.

Parameters:: is - - an InputStream delivering the content of a PDF file; pdfName - - the name of the PDF file (used mostly in logging / debugging)
Throws:: java.io.IOException - - if an error occurs while initializing the new PDFTextStream; EncryptedPDFException - - if an error occurs configuring the new PDFTextStream to decrypt the pdf data.
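
The InputStream constructor's note about temporary files reflects a general technique: a stream can only be read forward, so data that needs random access is first spooled to disk. The following sketch shows that technique with plain JDK calls. It is an illustration under stated assumptions, not PDFTextStream's actual implementation, and the TempFileSpool class and file-name prefix are invented for the example.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

class TempFileSpool {
    // Copies the whole stream into a new temporary file and returns its path.
    static Path spool(InputStream in) throws IOException {
        Path tmp = Files.createTempFile("pdfdata-", ".tmp");
        try {
            // createTempFile already created the file, so it must be replaced.
            Files.copy(in, tmp, StandardCopyOption.REPLACE_EXISTING);
        } catch (IOException e) {
            Files.deleteIfExists(tmp);
            throw e;
        }
        return tmp;
    }

    public static void main(String[] args) throws IOException {
        // Placeholder bytes, not a real PDF.
        byte[] data = "%PDF-1.7 placeholder".getBytes(StandardCharsets.US_ASCII);
        Path tmp = spool(new ByteArrayInputStream(data));
        try (RandomAccessFile raf = new RandomAccessFile(tmp.toFile(), "r")) {
            raf.seek(5); // jump directly to an arbitrary offset
            System.out.println((char) raf.read());
        } finally {
            Files.deleteIfExists(tmp); // remove the temporary file when finished
        }
    }
}
```

When the PDF data already exists as a file on disk, passing a File or a path to the constructor avoids this extra copy.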

PDFTextStream

public PDFTextStream(java.io.File pdfFile)
              throws java.io.IOException

Creates a new PDFTextStream that reads PDF content from the given File.

Parameters:: pdfFile - - the PDF file to be read
Throws:: java.io.IOException - - if an error occurs while initializing the new PDFTextStream; EncryptedPDFException - - if an error occurs configuring the new PDFTextStream to decrypt the pdf file.

PDFTextStream

public PDFTextStream(java.lang.String pdfFilePath)
              throws java.io.IOException

Creates a new PDFTextStream that reads PDF content from a file located at the given path.

Parameters:: pdfFilePath - - the path to the PDF file to be read
Throws:: java.io.IOException - - if an error occurs while initializing the new PDFTextStream; EncryptedPDFException - - if an error occurs configuring the new PDFTextStream to decrypt the pdf file.

PDFTextStream

public PDFTextStream(java.io.InputStream is,
                     java.lang.String pdfName,
                     byte[] userPasswd,
                     PDFTextStreamConfig config)
              throws java.io.IOException

Creates a new PDFTextStream that reads PDF content from the given InputStream.

Parameters:: is - - an InputStream delivering the content of a PDF file; pdfName - - the name of the PDF file (used mostly in logging / debugging); userPasswd - - the password that should be used to decrypt the given pdf data -- defaults to an empty byte array.; config - - a PDFTextStreamConfig object from which the new PDFTextStream instance will obtain various configuration settings.
Throws:: java.io.IOException - - if an error occurs while initializing the new PDFTextStream; EncryptedPDFException - - if an error occurs configuring the new PDFTextStream to decrypt the pdf data.

PDFTextStream

public PDFTextStream(java.io.InputStream is,
                     java.lang.String pdfName,
                     byte[] userPasswd)
              throws java.io.IOException

Creates a new PDFTextStream that reads PDF content from the given InputStream.

Parameters:: is - - an InputStream delivering the content of a PDF file; pdfName - - the name of the PDF file (used mostly in logging / debugging); userPasswd - - the password that should be used to decrypt the given pdf data -- defaults to an empty byte array.
Throws:: java.io.IOException - - if an error occurs while initializing the new PDFTextStream; EncryptedPDFException - - if an error occurs configuring the new PDFTextStream to decrypt the pdf data.

PDFTextStream

public PDFTextStream(java.io.File pdfFile,
                     byte[] userPasswd,
                     PDFTextStreamConfig config)
              throws java.io.IOException

Creates a new PDFTextStream that reads PDF content from the given File.

Parameters:: pdfFile - - the PDF file to be read; userPasswd - - the password that should be used to decrypt the given pdf file -- defaults to an empty byte array.; config - - a PDFTextStreamConfig object from which the new PDFTextStream instance will obtain various configuration settings.
Throws:: java.io.IOException - - if an error occurs while initializing the new PDFTextStream; EncryptedPDFException - - if an error occurs configuring the new PDFTextStream to decrypt the pdf file.

PDFTextStream

public PDFTextStream(java.lang.String pdfFilePath,
                     byte[] userPasswd,
                     PDFTextStreamConfig config)
              throws java.io.IOException

Creates a new PDFTextStream that reads PDF content from the file located at the given path.

Parameters:: pdfFilePath - - the path to the PDF file to be read; userPasswd - - the password that should be used to decrypt the given pdf file -- defaults to an empty byte array.; config - - a PDFTextStreamConfig object from which the new PDFTextStream instance will obtain various configuration settings.
Throws:: java.io.IOException - - if an error occurs while initializing the new PDFTextStream; EncryptedPDFException - - if an error occurs configuring the new PDFTextStream to decrypt the pdf file.

PDFTextStream

public PDFTextStream(java.io.File pdfFile,
                     byte[] userPasswd)
              throws java.io.IOException

Creates a new PDFTextStream that reads PDF content from the given File.

Parameters:: pdfFile - - the PDF file to be read; userPasswd - - the password that should be used to decrypt the given pdf file -- defaults to an empty byte array.
Throws:: java.io.IOException - - if an error occurs while initializing the new PDFTextStream; EncryptedPDFException - - if an error occurs configuring the new PDFTextStream to decrypt the pdf file.

PDFTextStream

public PDFTextStream(java.lang.String pdfFilePath,
                     byte[] userPasswd)
              throws java.io.IOException

Creates a new PDFTextStream that reads PDF content from the file located at the given path.

Parameters:: pdfFilePath - - the path to the PDF file to be read; userPasswd - - the password that should be used to decrypt the pdf file -- defaults to an empty byte array.
Throws:: java.io.IOException - - if an error occurs while initializing the new PDFTextStream; EncryptedPDFException - - if an error occurs configuring the new PDFTextStream to decrypt the pdf file.

PDFTextStream

public PDFTextStream(java.nio.ByteBuffer pdfData,
                     java.lang.String pdfName,
                     byte[] userPasswd,
                     PDFTextStreamConfig config)
              throws java.io.IOException

Creates a new PDFTextStream that reads PDF content from the given ByteBuffer.

Parameters:: pdfData - - a ByteBuffer providing the entirety of a PDF file's data; pdfName - - the name of the PDF whose data is provided by pdfData (this name is used only for logging and debugging purposes).; userPasswd - - the password that should be used to decrypt the given PDF data -- defaults to an empty byte array.; config - - a PDFTextStreamConfig object from which the new PDFTextStream instance will obtain various configuration settings.
Throws:: java.io.IOException - - if an error occurs while initializing the new PDFTextStream; EncryptedPDFException - - if an error occurs configuring the new PDFTextStream to decrypt the pdf data.

PDFTextStream

public PDFTextStream(java.nio.ByteBuffer pdfData,
                     java.lang.String pdfName,
                     byte[] userPasswd)
              throws java.io.IOException

Creates a new PDFTextStream that reads PDF content from the given ByteBuffer.

Parameters:: pdfData - - a ByteBuffer providing the entirety of a PDF file's data; pdfName - - the name of the PDF whose data is provided by pdfData (this name is used only for logging and debugging purposes).; userPasswd - - the password that should be used to decrypt the given PDF data -- defaults to an empty byte array.
Throws:: java.io.IOException - - if an error occurs while initializing the new PDFTextStream; EncryptedPDFException - - if an error occurs configuring the new PDFTextStream to decrypt the pdf data.

PDFTextStream

public PDFTextStream(java.nio.ByteBuffer pdfData,
                     java.lang.String pdfName)
              throws java.io.IOException

Creates a new PDFTextStream that reads PDF content from the given ByteBuffer.

Parameters:: pdfData - - a ByteBuffer providing the entirety of a PDF file's data; pdfName - - the name of the PDF whose data is provided by pdfData (this name is used only for logging and debugging purposes).
Throws:: java.io.IOException - - if an error occurs while initializing the new PDFTextStream; EncryptedPDFException - - if an error occurs configuring the new PDFTextStream to decrypt the pdf data.

Method Detail

setConfig

public void setConfig(PDFTextStreamConfig config)

Sets the PDFTextStreamConfig instance that this PDFTextStream instance will use in various contexts to govern its operation.

Note that certain configuration options are utilized only during PDFTextStream initialization (such as PDFTextStreamConfig.isMemoryMappingEnabled()). In order for non-default settings for such options to take effect, a customized PDFTextStreamConfig object must either be set as the default configuration, or be provided to any of the PDFTextStream constructors that accept a PDFTextStreamConfig object.

getConfig

public PDFTextStreamConfig getConfig()

Returns the PDFTextStreamConfig instance that this PDFTextStream instance is using to govern its operation.

read

public int read()
         throws java.io.IOException

Overrides:: read in class java.io.Reader

Throws:: java.io.IOException

read

public int read(char[] buf)
         throws java.io.IOException

Overrides:: read in class java.io.Reader

Throws:: java.io.IOException

read

public int read(char[] buf,
                int off,
                int len)
         throws java.io.IOException

Specified by:: read in class java.io.Reader

Throws:: java.io.IOException

pipe

public void pipe(OutputHandler handler)
          throws java.io.IOException

Extracts all available text from this PDFTextStream instance, sending all PDF text events to the given OutputHandler. Using this method of text extraction will always be the fastest approach, as it eliminates any and all of the intermediate data copying that is necessary to support extraction via PDFTextStream's java.io.Reader implementation.

If no special PDF text event handling is needed (i.e. you just want a straight text extract), then just pass a simple OutputTarget instance to this method.

The results of using both this extraction method and the java.io.Reader interface on the same PDFTextStream instance are undefined.

Parameters:: handler - - an OutputHandler instance.
Throws:: java.io.IOException - - if an error occurs during the extraction process
Since:: v1.3
See Also:: OutputHandler, OutputTarget

getPdfFileSize

public long getPdfFileSize()

Returns the size of the PDF file being read, in bytes.

Since:: v1.3

getPageCnt

public int getPageCnt()

Returns the number of pages in the PDF document.

getPage

public Page getPage(int n)
             throws java.io.IOException

Reads and returns a single page from the current PDF document. Page numbers are zero-indexed; they are not meant to correspond with any user-visible page number.

Parameters:: n - - the number of the page to retrieve.
Throws:: java.io.IOException - if an error occurs while preparing the Page for use
Since:: v1.3

getName

public java.lang.String getName()

Returns the name of the PDF that this stream is configured to read; this will be either the name of the PDF file that is being read, or the pdfName String that was provided if this instance was created with an InputStream constructor. Nearly all of the logging messages generated by the PDFTextStream library include the current PDFTextStream instance's name, making them easier to interpret in a multithreaded environment.

getPDFFile

public java.io.File getPDFFile()

Returns a reference to the file that this PDFTextStream instance is processing. This reference may be null if the PDFTextStream instance was not created using one of the java.io.File- or java.io.InputStream-based constructors.

finalize

public void finalize()

Overrides:: finalize in class java.lang.Object

close

public void close()
           throws java.io.IOException

Specified by:: close in interface java.io.Closeable
Specified by:: close in class java.io.Reader

Throws:: java.io.IOException

getFormData

public Form getFormData()
                 throws java.io.IOException

Loads the form data contained in the current document, and returns a Form object that represents that data. If the current PDF contains no forms, this function returns null. The Form instance that is returned by this function is guaranteed to be an AcroForm. This function MUST NOT be called after this PDFTextStream instance is closed.

Throws:: java.io.IOException - - if an error occurs loading the form data

getBookmarks

public Bookmark getBookmarks()
                      throws java.io.IOException

If the current PDF document contains a bookmark tree, this function will return its root node. If the document contains no bookmarks, this function will return null. An exception will be thrown if this function is called after this PDFTextStream instance is closed.

Throws:: java.io.IOException - - if an error occurs reading the bookmark tree
Since:: v1.3.5
See Also:: Bookmark

getAnnotations

public java.util.List getAnnotations(int page)
                              throws java.io.IOException

Returns a List of all annotations found on the page indicated by the given page number; each object will be an instance of a class that implements the Annotation interface. This function will never return null; if a page contains no annotations, an empty list will be returned. The returned list is guaranteed to offer efficient random access to its elements.
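As a rough illustration of the contract described above, the following sketch gathers annotations page by page. AnnotationSource is a hypothetical stand-in interface that only mirrors the getPageCnt() and getAnnotations(int) signatures documented for this class, so that the example compiles without the library; AnnotationsByPage and collect are likewise illustrative names. The sketch assumes that the page argument is zero-indexed in the same way as getPage(int), which this section does not state explicitly.

```java
import java.io.IOException;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/**
 * Hypothetical stand-in mirroring two documented PDFTextStream signatures,
 * so this sketch compiles without the library on the classpath.
 */
interface AnnotationSource {
    int getPageCnt();
    List<?> getAnnotations(int page) throws IOException;
}

final class AnnotationsByPage {
    /** Maps each page index that has annotations to that page's annotation list. */
    static Map<Integer, List<?>> collect(AnnotationSource pdf) throws IOException {
        Map<Integer, List<?>> byPage = new LinkedHashMap<>();
        // Assumption: page numbers are zero-indexed, as documented for getPage(int).
        for (int page = 0; page < pdf.getPageCnt(); page++) {
            List<?> annotations = pdf.getAnnotations(page);
            // No null check: an empty list is documented for pages without annotations.
            if (!annotations.isEmpty()) {
                byPage.put(page, annotations);
            }
        }
        return byPage;
    }
}
```

In real code the same loop would call the methods on a PDFTextStream instance directly. When the per-page grouping is not needed, getAllAnnotations() avoids the loop entirely.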

Throws:: java.io.IOException - - if an error occurs retrieving the annotation data
Since:: v1.3.5
See Also:: Annotation

getAllAnnotations

public java.util.List getAllAnnotations()
                                 throws java.io.IOException

Returns a list containing all of the annotations contained in the current PDF document. The returned list is guaranteed to offer efficient random access to its elements.

Throws:: java.io.IOException - - if an error occurs retrieving the annotation data
Since:: v1.3.5
See Also:: Annotation

getAllAnnotations

public int getAllAnnotations(java.util.List tgt)
                      throws java.io.IOException

Adds to the given List all of the annotations contained in the current PDF document.

Returns:: the number of annotations added to the list
Throws:: java.io.IOException - - if an error occurs retrieving the annotation data
Since:: v1.3.5
See Also:: Annotation

getPDFVersion

public PDFVersion getPDFVersion()
                         throws java.io.IOException

Retrieves the PDFVersion instance that corresponds with the version of the PDF file specification to which the current PDF file adheres. PDF specification version numbers correspond directly with particular versions of Adobe Acrobat:

v1.0 - Acrobat 1
v1.1 - Acrobat 2
v1.2 - Acrobat 3
v1.3 - Acrobat 4
v1.4 - Acrobat 5
v1.5 - Acrobat 6
v1.6 - Acrobat 7
v1.7 - Acrobat 8

Acrobat versions are generally backward-compatible with earlier versions of the specification. For example, Acrobat 5 should be able to read any PDF file that adheres to version 1.0, 1.1, 1.2, 1.3, or 1.4 of the PDF file specification.
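The version table and the compatibility rule above can be sketched in a few lines. PdfSpecVersions and its two methods are hypothetical helpers written for this illustration; they work on plain version strings such as "1.4" and do not use the library's PDFVersion class, whose API is not shown in this section.

```java
import java.util.Arrays;
import java.util.List;

final class PdfSpecVersions {
    // Spec versions in release order; the entry at index i corresponds to Acrobat (i + 1).
    private static final List<String> SPEC_VERSIONS =
            Arrays.asList("1.0", "1.1", "1.2", "1.3", "1.4", "1.5", "1.6", "1.7");

    /** Returns the Acrobat major version for a spec version such as "1.4", or -1 if unlisted. */
    static int acrobatVersionFor(String specVersion) {
        int index = SPEC_VERSIONS.indexOf(specVersion);
        return index < 0 ? -1 : index + 1;
    }

    /** True if the given Acrobat release is expected to read files of the given spec version. */
    static boolean shouldBeReadableBy(int acrobatVersion, String specVersion) {
        int required = acrobatVersionFor(specVersion);
        return required > 0 && acrobatVersion >= required;
    }
}
```

For instance, shouldBeReadableBy(5, "1.3") is true, while shouldBeReadableBy(5, "1.5") is false because version 1.5 corresponds to Acrobat 6.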

Note that this method must not be called after the PDFTextStream instance is closed.

Throws:: java.io.IOException - - if an error occurs in determining what the PDF file's version is
Since:: v1.3

getEncryptionInfo

public EncryptionInfo getEncryptionInfo()

Returns an EncryptionInfo object, which provides access to some of the parameters used for the current PDF document's encryption. If the current PDF document is not encrypted, this method will return null.

Since:: v1.3

getXmlMetadata

public byte[] getXmlMetadata()
                      throws java.io.IOException

Returns the XML metadata available for the current PDF document. If no XML metadata is available in the current document, this method returns null.

Note: This method must be called before the PDFTextStream instance is closed, and it should not be called while text is being actively read out of it. (Supporting such concurrency would require synchronization that would negatively impact performance.) Therefore, the best times to call this method are:

just after creating the PDFTextStream instance but before reading text out of it
after all text has been read out of the PDFTextStream instance, but before it is closed

PDFTextStream does not control the content returned by this method -- it just provides access to the data that is already stored in a PDF document. The schema of the returned XML data is defined by Adobe, and is called the Extensible Metadata Platform (XMP). More information about XMP can be found on Adobe's website.
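Because the metadata is returned as raw bytes, callers typically hand it to an XML parser. The sketch below shows one way to do that with the JDK's built-in DOM parser; XmpMetadataParser and parse are hypothetical names, and nothing here is part of the PDFTextStream API. It assumes the bytes form a well-formed XML document, which is not guaranteed for every PDF.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.SAXException;

final class XmpMetadataParser {
    /** Parses raw XMP bytes into a DOM document; returns null when no metadata was available. */
    static Document parse(byte[] xmp)
            throws IOException, SAXException, ParserConfigurationException {
        if (xmp == null) {
            return null; // getXmlMetadata() is documented to return null in this case
        }
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        // XMP relies on XML namespaces (x:, rdf:, dc:, ...), so resolve them.
        factory.setNamespaceAware(true);
        // PDF content may be untrusted, so request the parser's secure-processing limits.
        factory.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);
        return factory.newDocumentBuilder().parse(new ByteArrayInputStream(xmp));
    }
}
```

A caller would pass in the byte array obtained from getXmlMetadata() at one of the two points listed above, then navigate the resulting Document with the usual DOM or XPath calls.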

Throws:: java.io.IOException - - if this PDFTextStream instance has already been closed, or if an error occurs retrieving the XML metadata.
Since:: v1.2

getAttribute

public java.lang.Object getAttribute(java.lang.String attrName)
                              throws java.io.IOException

This method is used to access all of the document-level metadata attributes that are set in a PDF document. All of the standard attribute names are specified in constants in this class, and are all prefixed with 'ATTR_'. A few notes should be kept in mind when accessing attribute values:

It is typical for only a subset of the possible attributes to be defined in a PDF document. Any attributes that are undefined will return a null value when their name is provided to this method.
Many more attributes are used in the real world than are formally defined by the PDF specification. The PDF generator decides which attributes are written for a particular document, so some documents may contain attributes whose names are not covered by the 'ATTR_' constants in this class. You can use the getAttributeKeys() method to get a Set of the names of all available attributes.
Most attribute values are Strings, but it is possible for attribute values to be Integers, Booleans, etc. The documentation associated with each attribute name constant in this class specifies what type may be expected when retrieving each particular attribute value. Any attributes specified as dates are returned from this method as String instances; these can be passed through parseDateString(String) to get a Date object.

Note: the attributes available through this method are retrieved from the "classic" document /Info entry. The document metadata in an XML format (which typically contains the same set of metadata attributes that are available through this method) may be obtained via the getXmlMetadata() method.
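The notes above suggest a defensive pattern when reading attributes: enumerate the keys, skip values that come back null, and avoid assuming that every value is a String. A minimal sketch of that pattern follows. AttributeSource is a hypothetical stand-in interface that mirrors the getAttribute(String) and getAttributeKeys() signatures documented for this class, so that the example compiles without the library; AttributeDump and describe are illustrative names as well.

```java
import java.io.IOException;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

/**
 * Hypothetical stand-in mirroring two documented PDFTextStream signatures,
 * so this sketch compiles without the library on the classpath.
 */
interface AttributeSource {
    Object getAttribute(String attrName) throws IOException;
    Set<?> getAttributeKeys() throws IOException;
}

final class AttributeDump {
    /** Returns each defined attribute name mapped to "value (TypeName)", sorted by name. */
    static Map<String, String> describe(AttributeSource pdf) throws IOException {
        Map<String, String> described = new TreeMap<>();
        for (Object key : pdf.getAttributeKeys()) {
            String name = String.valueOf(key);
            Object value = pdf.getAttribute(name);
            if (value == null) {
                continue; // undefined attributes are documented to come back as null
            }
            // Values are usually Strings, but Integers, Booleans, etc. are possible.
            described.put(name, value + " (" + value.getClass().getSimpleName() + ")");
        }
        return described;
    }
}
```

In real code the same calls would be made on a PDFTextStream instance, and date-valued attributes would additionally be passed through parseDateString(String) as described above.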

Parameters:: attrName - - the name of the attribute to be retrieved
Returns:: the value of the attribute with the given name defined in the PDF document being read, or null if no attribute is available with the given name. The type of this object depends upon which attribute is being retrieved, and is noted in the documentation of the attribute name constants held by this class.
Throws:: java.io.IOException - - if an error occurs while retrieving the PDF document's metadata
See Also:: getXmlMetadata() for access to the XML-formatted document metadata

getAttributeKeys

public java.util.Set getAttributeKeys()
                               throws java.io.IOException

Returns a Set containing the keys of all available document attributes.

Throws:: java.io.IOException - - if an error occurs while retrieving the PDF document's metadata

getAttributeMap

public java.util.Map getAttributeMap()
                              throws java.io.IOException

Returns a Map containing a copy of all keys and values of all available document attributes.

Throws:: java.io.IOException - - if an error occurs while retrieving the PDF document's metadata

loadLicense

public static boolean loadLicense(java.lang.String licenseFilePath)

Loads and attempts to verify a PDFTextStream license file at the given path.

PDFTextStream may also be configured to load a license file from a specific path by setting the system property or environment variable pdfts_license_path to that path.

Parameters:: licenseFilePath - - an absolute or relative file path
Returns:: true if a license file was found at the given path, and was successfully verified

loadLicense

public static boolean loadLicense(java.net.URL licenseLocation)

Loads and attempts to verify a PDFTextStream license file at the given URL.

Parameters:: licenseLocation - - a URL object
Returns:: true if a license file was found at the given URL, and was successfully verified

isLicensed

public static boolean isLicensed()

Returns true if PDFTextStream has loaded and verified a non-evaluation license file that has not yet expired.

main

public static void main(java.lang.String[] args)

Main-method to allow extraction of text from a PDF file from the command line. Usage is simple:

java PDFTextStream [pdfFile] [optional outputpath]

pdfFile should be a path to the PDF file you wish to extract text from; outputpath should be a path to which the extracted text will be written. If no outputpath is provided, the text of the PDF file is written to stdout.
