#### Changes in PDFTextStream v2.7.0

- PDFTextStream.NET now uses and ships with IKVM 0.46.0.4
- ASCII "control characters" (0-8) are no longer added to page document models
- Added support for /CXXX character names (observed in the wild, naming Unicode code points in hex), which yields proper decoding of certain documents
- Added support for embedded CMAP files that use CR characters for linebreaks
- Added support for embedded font files that assume Windows-1250 text encoding
- Added workaround for certain PDF documents with malformed embedded CMAP files that would cause an infinite loop / hang in PDFTextStream's processing of said files.
- Fixed bug in PDF file merge facility that prevented certain object stream-encoded objects from being included in the result of the merge.
- Fixed bug where certain signature form fields were not decoded properly, which was causing AcroSignatureField.getValue() to return null (instead of a Map of the field's properties).
- Fixed bug where compressed PDF object references were unnecessarily loaded repeatedly under .NET 4.5 on Windows
- Fixed bug where some PDF document merge operations would yield an incorrect reference to /Metadata objects, and thus prevent Acrobat Reader from printing them

#### Changes in PDFTextStream v2.6.4

- Added table detection enablement option to com.snowtide.pdf.PDFTextStreamConfig (PDFTextStreamConfig.isTableDetectionEnabled() and PDFTextStreamConfig.setTableDetectionEnabled(boolean))
- Eliminated potential NullPointerException when cropping a page prior to its layout being initially calculated (com.snowtide.pdf.Page.crop(Rect))
- Fixed bug where incorrect character spacing data was applied to Adobe-standard fonts embedded in PDFs (resulting in poor/nonexistent word spacing)

#### Changes in PDFTextStream v2.6.3

- Significantly improved 'repair' procedure for damaged or malformed PDF documents
- Fixed bug in PDF merge functionality that would occasionally manifest as a blank page

#### Changes in PDFTextStream v2.6.2

- Fixed rendering issue associated with usage of TrueType fonts with multibyte-encoded text streams.
- Added compatibility fix for PDF documents that contain spurious (out of range) non-printing bytes.

#### Changes in PDFTextStream v2.6.1

- Enhanced "repair" procedure for PDF documents with one-off stream encoding errors.
- Fixed handling of text encoding found in PDF documents generated within Mac OS X Lion (whitespace in cmap codepoints)

#### Changes in PDFTextStream v2.6.0

- New OutputHandler: com.snowtide.pdf.SelectionOutputTarget, implementing text extraction based on "selection coordinates", as commonly found in user-facing PDF viewer UIs.
- PDFTextStream is now free for use in single-threaded applications; all previous "evaluation" limitations no longer apply when PDFTextStream is operated without a license file.

#### Changes in PDFTextStream v2.5.0

- Added support for decryption of AES-encrypted PDF documents (includes support for 256-bit and variable bit length ciphers)
- PDFTextStream for Java now requires v1.5.0 or higher of the JVM/JRE
- PDFTextStream.NET is now tested and supported under Mono
- PDFTextStream.NET now uses and ships with IKVM 0.46.0.1, and requires .NET 2.0 or higher.
- com.snowtide.pdf.PDFTextStream now implements java.io.Closeable
- com.snowtide.pdf.OutputTarget and its subclasses now accept java.lang.Appendables instead of strictly java.lang.StringBuffers
- com.snowtide.pdf.PDFTextStream now offers String-based (file path) constructors
- Dozens of performance and PDF document compatibility enhancements
- Added LinkAnnotation.getTargetPageNumber(); LinkAnnotation no longer improperly shadows Annotation.getPageNumber()
- Fixed a fatal character decoding bug on IBM J9 JVMs
- Fixed support for Windows ANSI-encoded PDF text
- Fixed support for tracking the position and rotation of PDF media boxes (no longer just height/width)
- The "NOCJK" build of PDFTextStream (all the same functionality, but without the font encoding files needed to extract CJK character sets) is no longer offered
- PDF merge capability (com.snowtide.pdf.util.MergeUtil) has been deprecated
- Memory-mapping of opened PDF files is now disabled by default, and has been deprecated

#### Changes in PDFTextStream v2.3.2

- Fixed issue where PDFTextStream would fail to initialize when the default system locale was set to Shift_JIS (i.e. SJIS, MS932, Windows-31J)
- Fixed an issue where certain Chinese, Japanese, and Korean fonts were not being loaded properly when specific encoding config data was missing.
- Fixed an octal string parsing bug that could lead to a PDF parsing failure.
- Added crop box attribute to com.snowtide.pdf.Page interface
- An expanded set of control characters is now treated as whitespace.
- Added support for non-compliant PDF documents produced by TXT2PDF for OS/390.

#### Changes in PDFTextStream v2.3.1

- Added methods to VisualOutputTarget to enable the optional exclusion of rotated content from its output (523)
- Fixed a bug where rotated characters were reporting a rotation angle (theta) of 0 when presented to VisualOutputTarget. (519)
- Fixed a bug where use of PDFTextStream.NET in a multithreaded environment could produce garbled or missing text extracts in very limited circumstances. (512)
- Added support for PDFs that contain malformed arrays in their graphics output streams (509)
- Fixed a bug where text rendered using a Type3 font that has a proper unicode mapping was being omitted from extracts (507)
- Significantly improved the emission of whitespace between words on lines with large amounts of tracking (506)
- Fixed character mapping for 'ã' and '·' ("middle dot") (502)
- Fixed a bug affecting VisualOutputTarget and RegionOutputTarget where smaller characters would not be included in resulting text extracts. (499)
- Fixed an issue where string values held in compressed object streams were being re-encrypted (primarily affecting key/value PDF attributes) (495)
- Fixed an issue where PDF documents generated by PDFSharp were improperly handled, leading to significant degradation of extraction accuracy. (490)
- Fixed an issue where CFF font encodings were being applied inappropriately, potentially leading to garbled extracts. (479)
- Fixed a bug related to zero-length cross-reference entry codes that was resulting in an improper FaultyPDFException being thrown (450)

#### Changes in PDFTextStream v2.3.0

- Added an .isStruckThrough() method to com.snowtide.pdf.TextUnit, indicating whether a character has a strikethrough drawn through it.
- Improved PDFTextStream's support for embedded character mappings.
- The calculation of whitespace between words has been fixed to properly account for whitespace that is explicitly encoded in the source PDF documents.
- Improved PDFTextStream's handling of composite content encodings, which previously could fail resulting in some ranges of PDF content being 'ignored' during extraction.
- Fixed a bug in VisualOutputTarget where text from a single line would be split over multiple lines
- Improved vertical alignment of text extracted using VisualOutputTarget
- Improved VisualOutputTarget-produced extracts to eliminate spurious additional whitespace between closely-adjacent words

#### Changes in PDFTextStream v2.2.5

- Added support for extracting XFA forms data as XML
- Significantly improved the performance of text extraction using VisualOutputTarget
- Added support for PDF documents larger than 2GB
- Fixed a bug where the encodings from embedded Type1 fonts were previously not being applied properly in some circumstances.
- Fixed a problem where newer content in updated PDF documents was sometimes being ignored.
- Fixed a problem where PDFDocEncoding-encoded bookmarks and metadata were not being decoded properly
- Added .getDestinationName() method to com.snowtide.pdf.Bookmark

#### Changes in PDFTextStream v2.2.1

- PDFTextStream.NET now ships with ikvm v0.3.4, which fixes a number of problems that prevented PDFTextStream from functioning properly across multiple AppDomains (598)
- Added PDFTextStream.loadLicense(URL) function (475)
- Added a 'spacing scale' property to VisualOutputTarget which allows applications to control the amount of horizontal whitespace that should be emitted per physical amount of whitespace found in the source document (528)
- PDFTextStream will now attempt to load a license file from the host application's current directory before checking the current classpath / AppDomain (661)
- Fixed a problem where pathological embedded Unicode character encodings were causing PDFTextStream to output strings of control characters rather than reasonable extracted content. (428)
- Fixed a bug in PDFTextStream's handling of cross reference entries that caused fatal errors in some documents (620)
- Fixed a problem where UTF-16 encoded bookmark titles were not being decoded properly (618)

#### Changes in PDFTextStream v2.2

- Added support for Apache Lucene v2.1 and v2.2 to PDFTextStream's Lucene integration module (com.snowtide.pdf.lucene.PDFDocumentFactory)
- Added com.snowtide.pdf.PDFTextStreamConfig, which enables simple static and runtime configuration of PDFTextStream
- Added new PDFTextStream constructors that accept customized PDFTextStreamConfig instances, and a setConfiguration(PDFTextStreamConfig) function to set a PDFTextStream instance's configuration at runtime
- PDFTextStream now joins adjacent rectangles that have similar stroke and fill colors, which improves various page segmentation results
- Improved table detection processes to adaptively recognize very small "variant" table cells
- Improved pdfts.examples.XMLOutputTarget to build an XML DOM Document instead of constructing XML using a StringBuffer; block elements now include a type attribute of "table" if the block is a table
- Significantly improved the quality of PDF documents generated when merging PDF files (com.snowtide.pdf.util.MergeUtil) and when saving updated PDF forms (com.snowtide.pdf.forms.AcroForm#writeUpdatedDocument(OutputStream))
- Rotated text blocks are now properly grouped within bounded regions
- Changed pdfts.cjk.disable and pdfts.mmap.disable system properties to pdfts.cjk.enable and pdfts.mmap.enable, respectively
- Fixed an overflow bug in PDFTextStream's PDF data parser
- Fixed a bug where the ascent and descent characteristics of some fonts were defaulting to improper values
- Fixed a bug where lines and rectangles drawn with a Separation color space were not being recognized properly
- Fixed a bug where an error would result when reading a PDF file with a non-conforming linebreak sequence after the `stream` tag
- Fixed a bug where tables containing underlined text would not be recognized properly
- Fixed a bug where edges of rectangles were improperly recognized as text underlines
- Fixed a bug where PDFTextStream wouldn't recognize PDF data stream filter name abbreviations

#### Changes in PDFTextStream v2.1.6

- Added com.snowtide.pdf.util.TableUtils, which provides a set of CSV conversion functions for exporting the contents of tables
- Added options to specify path to load PDFTextStream license file via pdfts_license_path environment variable or system property
- Added com.snowtide.pdf.PDFTextStream.loadLicense(String) - programmatic way to specify path from which to load PDFTextStream license file
- Changed PDFTextStream's default page segmentation algorithms to not eliminate empty table cells, making it simpler to export tabular content to Excel, etc.
- Fixed bug in VisualOutputTarget where vertically-adjacent lines of text were being inappropriately combined
- Fixed text encoding bug where text extracted from PDF documents generated by Adobe InDesign v4.0 - v5.0 would be "scrambled", or appear to be a series of Chinese glyphs
- Fixed bug where AFM font mappings were sometimes applied in an incorrect order, leading to spot errors in text extracts
- Fixed bug where certain embedded Type1 font encodings were not being loaded correctly, resulting in single-character extraction errors

#### Changes in PDFTextStream v2.1.5

- Significant improvements in the handling and standard output of rotated content
- Added com.snowtide.pdf.layout.TextUnit.getTheta()

#### Changes in PDFTextStream v2.1.3

- Added com.snowtide.pdf.Font.isItalic() -- indicates whether a font is italicized
- Added com.snowtide.pdf.layout.TextUnit.isUnderlined() -- indicates whether a character is underlined
- Added tagging of italic text regions in pdfts.examples.XMLOutputTarget

#### Changes in PDFTextStream v2.1.2

- Fixed page rotation detection bug when processing PDF documents generated by Crystal Reports

#### Changes in PDFTextStream v2.1.1

- Significant improvements in output of VisualOutputTarget, especially for pages with many different font sizes
- Fixed calculation of character widths for Type0 fonts that have a recognized AFM base font name

#### Changes in PDFTextStream v2.1

- Added support for updating text, checkbox, radio button, and choice interactive form fields
- Added support for Kodak print job data extraction (%KDK commands) via com.snowtide.pdf.util.KodakPrintData
- Exposed the AcroFormField.isReadOnly() function
- Added ByteBuffer-based buildPDFDocument() functions to com.snowtide.pdf.lucene.PDFDocumentFactory
- Added the `pdfts.logfactory` and `pdfts.loggingtype` system variables to simplify the customization of logging via com.snowtide.util.logging.LoggingRegistry
- java.util.logging is now the default logging toolkit; `pdfts.loggingtype` may be used to change that. Refer to the LoggingRegistry javadocs for more info.
- Improved documentation significantly
- Fixed a problem where merged PDF documents that contained empty dictionaries would be improperly generated
- Fixed a problem where the "rich text" value of text interactive form fields would not be loaded

#### Changes in PDFTextStream v2.0.5

- Fixed handling of text spacing that was causing some columnated text to overrun column boundaries improperly
- Fixed a problem where text from adjacent lines would be inappropriately intermingled
- Changed unlicensed functionality so that evaluation use would not require a special evaluation license file; specifically, PDFTextStream will randomize some digits in text extracts when it is operating unlicensed, and the 8-page extract limitation has been removed

#### Changes in PDFTextStream v2.0.2

- Added com.snowtide.pdf.RegionOutputTarget to support region-specific content extraction
- Added ability to derive encoding and spatial metrics of Type3 fonts; added pdfts.type3.derive system property to disable derivation if necessary (359)
- Fixed problem with com.snowtide.pdf.VisualOutputTarget, where lines would sometimes be inappropriately combined (356)

#### Changes in PDFTextStream v2.0.1

- Better indication of corrupted or otherwise unreadable PDF files (com.snowtide.pdf.FaultyPDFException)
- Added pipe(OutputHandler) function to com.snowtide.pdf.layout.Line
- Added `pdfts.mmap.disable` system property option to disable memory-mapping of PDF files - avoids JDK bug #4724038 (355)

#### Changes in PDFTextStream v2.0

- PDFTextStream now available for .NET and Python
- Added support for extraction of Chinese, Japanese, and Korean text (CJK)
- Added support for accessing derived table structure (com.snowtide.pdf.layout.Table)
- Significantly improved performance
- Significantly improved accuracy of extraction of rotated text
- Added support for Lucene v1.9 and v2.0
- Added "visual" text layout output target (com.snowtide.pdf.VisualOutputTarget)
- Added PDF merge capability (com.snowtide.pdf.util.PDFMergeUtil)
- Added support for Type1C embedded font files (274)
- Fixed issue where some bookmarks would have invalid page number attributes (i.e. -1) (324)
- Fixed issues where blocks, lines, and textunits that represented rotated text reported inaccurate positions on the page
- Fixed issue where xref table was not being rebuilt when object locator was simply missing (338)
- Eliminated com.snowtide.pdf.PDFTextStreamOptions (deprecated in v1.3)

#### Changes in PDFTextStream v1.4

- Added support for interactive PDF forms (AcroForms) (com.snowtide.pdf.forms.* and com.snowtide.pdf.PDFTextStream.getFormData()) (118)
- Added support for derivation of 'graphical' font encoding (Type3) (297)
- Added com.snowtide.pdf.OutputHandler base class for OutputTarget
- Added PDFTextStream constructor that takes a java.nio.ByteBuffer, enabling completely in-memory operation
- Added an example class that extracts form data as XML (pdfts.examples.XMLFormExport)
- Added sample implementation of com.snowtide.pdf.OutputHandler that outputs PDF text as XML, indicating document structure and where bolded text ranges exist (pdfts.examples.XMLOutputTarget)
- Added sample OutputHandler implementation that exports PDF text content as an XHTML document (pdfts.examples.GoogleHTMLOutputHandler)
- Fixed bug where inline images were not being properly skipped (308)
- Fixed bug where destination bounds of some bookmarks and annotations were not being properly set (307)
- Fixed bug where text properties (font size, character encoding, etc) would persist beyond where they should (298)

#### Changes in PDFTextStream v1.3.6

- Fixed potential OutOfMemoryError caused by complex graphical regions (295)
- Fixed bug where out-of-date content might be extracted from updated PDF documents (296)

#### Changes in PDFTextStream v1.3.5

- Added PDF annotation API (com.snowtide.pdf.annot.*) (76)
- Added PDF bookmark API (com.snowtide.pdf.Bookmark and com.snowtide.pdf.PDFTextStream.getBookmarks()) (284)
- Significantly improved performance parsing PDF data containing very complex illustrations (282)
- Improved triage procedures for handling damaged or malformed PDF files (292)
- Fixed bug where com.snowtide.pdf.Page.getPageNumber() was reporting 1-indexed page numbers; it now properly reports 0-indexed page numbers (283)
- Fixed parsing bug related to zero-length PDF names (290)

#### Changes in PDFTextStream v1.3.4

- Improved rectangle and line detection to avoid skipping graphics that impact text layout (272)
- Improved the algorithm used to calculate the number of line breaks to be outputted between lines of text (271)
- Improved detection and handling of malformed PDF documents to prevent potential infinite loops (278)
- Fixed compatibility problem with PDFs generated by IBM Manyimage tool
- Fixed compatibility problem with PDFs generated by SAP R/3 (276)
- Fixed error thrown when some blank pages are encountered (270)

#### Changes in PDFTextStream v1.3.3

- Expanded support for referenced form XObjects; results in more complete text extracts (263)
- Improved font lookup routines; now caching frequently-referenced fonts for improved performance
- Fixed logging classloading issue on JDK 1.3.1_01

#### Changes in PDFTextStream v1.3.2

- Significant performance enhancement through improved usage of java.nio.* classes; available only on JDK 1.4+

#### Changes in PDFTextStream v1.3.1

- Fixed integration with JDK v1.4 java.util.logging toolkit

#### Changes in PDFTextStream v1.3

- Added ability to retrieve PDF document page attributes (height, width, rotation, etc) (94)
- Added ability to retrieve PDF document pages one at a time (94)
- Added ability to retrieve PDF document encryption parameters (99)
- Added ability to retrieve PDF file specification version number (91)
- Added pipe() method to PDFTextStream and retrieved PDF pages, allowing easy redirection of content to a buffer or file (92)
- Significantly improved page segmentation and document read-ordering, resulting in more semantically-consistent text extracts
- Significantly improved extraction of rotated text
- Significantly improved extraction of line-bounded tables (107)
- Deprecated PDFTextStreamOptions class: strictEncoding and page header options no longer used (87, 98)
- PDFTextStream now always produces Unicode text; the ASCII-only option is no longer provided, as it proved to be unreliable (87)
- Fixed some minor Unicode text extraction issues related to selecting the proper character encoding for Type 1 fonts (86)
- Fixed PDFTextStream's implementation of the PDF graphics state stack to more closely conform to the PDF spec (90)
- Fixed problem where certain monospaced characters might be omitted from output (35)
- Fixed problem where text might be scrambled on a line that contains certain monospaced text (182)

#### Changes in PDFTextStream v1.2

- Added support for retrieving document-level Adobe XMP data (document metadata in an XML format) (66)
- Added support for PDF v1.5 files encrypted using crypt filters that specify an invalid decryption key length (63)
- Improved overview documentation of metadata access in Javadoc and Developer's Guide (70)
- Fixed support for decrypting updated PDF v1.4 files encrypted with 128-bit passwords (62)
- Fixed internal error that might have occurred in connection with processing updated PDF documents (72)

#### Changes in PDFTextStream v1.1.2

- Enhanced the core parsing routines to accept PDF files that use improper (or nonexistent) string escape sequences
- Fixed a bug that caused hard errors when processing some PDF v1.5 documents.
- Fixed a bug where a particular text mapping (hex / CIDFont mappings) used in some PDFs would be misinterpreted, resulting in space characters being outputted instead of 'regular' characters

#### Changes in PDFTextStream v1.1.1

- Fixed a problem where some PDFs that use a particular type of TrueType font were converted into useless text content

#### Changes in PDFTextStream v1.1

- JDK v1.3 is now fully supported.
- Significant improvements have been made in the layout and formatting of rotated text.
- All logging is now channeled through Jakarta's commons-logging library to enable usage of logging toolkits other than log4j.