OutputHandler for extracting text, as well as Java API updates to add the Span abstraction (see com.snowtide.pdf.layout.Line.getSpans()) for document model access to RTL and bidi text.
Comprehensive revision to the handling of damaged/incomplete PDF documents, addressing dozens of classes thereof.
VisualOutputTarget to incorrectly gauge the width of such pages, leading to wildly inappropriate spacing.
VisualOutputTarget to avoid an OutOfMemoryError when the source page contains oddly-skewed characters.
com.snowtide.pdf.Configuration.setElideHorizontalTextualRules(boolean))
com.snowtide.pdf.RegionOutputTarget now provides a “minimum overlap” configuration option via .setMinimumOverlapPct(float), used to determine how much of each considered character must lie within defined regions in order to be included in the extracted content.
pdfxs.layout.ignoreNonCardinalRotatedChars. When set via system property, environment variable, or com.snowtide.pdf.Configuration.setIgnoreNonCardinalRotatedChars(boolean), PDFxStream will ignore any characters that are rotated by non-cardinal angles, i.e. any angle other than 0, 90, 180, or 270 degrees.
com.snowtide.pdf.layout.Image.id()) is unique to the underlying bitmap data. This makes it easy to never load/decode the same bitmap more than once, even if it appears in multiple places in a source PDF.
com.snowtide.pdf.forms.AcroRadioButtonGroupFields now properly report their pageNumber() and bounds() (the latter being calculated as the minimum bounding rectangle, or MBR, of the group’s component radio buttons).
com.snowtide.pdf.Page.addColumnPartition(int) would be recognized.
This release contains a significant fix to how text encodings and embedded character maps are unioned to produce efficient decoding of multibyte text encodings.
Further small fixes include:
com.snowtide.pdf.Console class, now includes an option to use VisualOutputTarget when extracting text content from a source PDF document.
VisualOutputTarget’s handling of pages that contain text rendered at different sizes has been significantly improved.
/CIDInit are now recognized.
FontBBox and ascender/descender metrics was fixed in PDFxStream’s bundled Courier font descriptions.
pdfts.examples.GoogleHTMLOutputHandler now includes appropriate META tags in order to ensure proper encoding and display of high-code-point Unicode characters.
#20 where spaces should be)
com.snowtide.pdf.layout.TextUnits carrying the same character (sequence) and rendered using monospace fonts would be omitted from text extracts entirely.
ToUnicode character mappings.
This release contains a number of new features and capabilities, as well as a large number of fixes of customer-reported bugs.
com.snowtide.pdf.Page for each annotation that defines its own appearance, accessible via com.snowtide.pdf.annot.Annotation.getAppearance().
com.snowtide.pdf.VisualOutputTarget now includes leading space on each line, as necessary, corresponding to the left margin of the page. This makes it easier to concatenate the text extracted from multiple pages and process the result cumulatively, which is useful for e.g. tabular data spanning multiple pages where columns are delineated by whitespace. The old behaviour can be recovered via com.snowtide.pdf.VisualOutputTarget.setMarginTrimmed(true).
com.snowtide.pdf.TextUnit.getFontSize()
com.snowtide.pdf.Console now offers an --attrs option, which will emit all of the document-level metadata attributes present in the provided input file.
Fixed a problem with the packaging of PDFxStream.NET bundles.
This is the first public release of PDFxStream v3.x. It introduces a number of new capabilities and adds tons of smaller improvements over PDFTextStream v2.7.0, which preceded it. Upgrading to PDFxStream should be relatively painless: steps have been taken to maximize the API compatibility between the two releases, though there are some minor breaking changes (mostly related to rebranding to “PDFxStream” as the main product name).
Since moving from v2.7.0 to v3.1.0 constitutes a major upgrade, existing PDFTextStream customers will need new PDFxStream v3.x license file(s). Please contact us to request issuance of your new PDFxStream license file(s).
com.snowtide.PDF.Feature.Images). The key method is com.snowtide.pdf.Page.getImages(), which returns a collection of com.snowtide.pdf.layout.Image objects that can provide either encoded image data (as PNG, JPEG, etc.) or “live” platform-suitable image objects (either java.awt.image.BufferedImage on the JVM, or System.Drawing.Bitmap on .NET).
com.snowtide.pdf.Page.getCharacters(), yielding a collection of com.snowtide.pdf.layout.TextUnits on a page without incurring the costs associated with page segmentation and read-ordering required by calling Page.getTextContent() or Page.pipe(OutputHandler).
com.snowtide.pdf.Document.getEmbeddedFiles()) and those associated with particular annotations (com.snowtide.pdf.annot.FileAttachmentAnnotation).
com.snowtide.pdf.DocumentLocation as superinterface of com.snowtide.pdf.forms.AcroFormField.
The Lucene integration API provided by classes in the com.snowtide.pdf.lucene package is no longer included in PDFxStream. It will be open-sourced separately in the near future according to customer demand.
com.snowtide.pdf.PDFTextStreamConfig has been renamed Configuration.
The memory-mapping option previously offered by PDFTextStreamConfig has been removed: memory mapping is now never used by PDFxStream. This addresses various problems with memory-mapping PDF files on Windows, and eliminates an option that no longer had any benefit due to improvements made in how PDFxStream utilizes I/O. Concretely, this change eliminates the following methods and other facilities, with no replacement in the PDFxStream API:
PDFTextStreamConfig.setMemoryMappingEnabled(boolean)
PDFTextStreamConfig.isMemoryMappingEnabled()
pdfts.mmap.disable system property / environment variable
com.snowtide.pdf.layout.Rectangle is no longer an empty marker interface; it is now the concrete, default implementation of com.snowtide.pdf.layout.Region. The Rectangle interface was likely never used by any code consuming PDFTextStream, so breakage associated with this change should be minimal to nonexistent.
com.snowtide.pdf.PDFTextStream no longer extends java.io.Reader or implements java.lang.Readable. Based on customer feedback, these affordances were never used.
Many classes in com.snowtide.pdf.layout that used to implement the Region interface now implement Bounded instead. At worst, customer code that used to access the spatial properties of a Region directly must now obtain that Region from a Bounded object first. For example, this code:
Block block = ...;
float xposition = block.xpos();
must be changed to:
Block block = ...;
float xposition = block.bounds().xpos();
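The same pattern extends to code that handles several layout types at once. The following is a minimal, self-contained sketch rather than PDFxStream code: the Region and Bounded interfaces declared here are hypothetical stand-ins reduced to the two methods shown in the snippet above (xpos() and bounds()), and the helper names leftmost and at are invented for illustration.

```java
import java.util.List;

// Hypothetical stand-in for the library's Region interface; only xpos()
// is modeled here, and the real interface declares more than this.
interface Region {
    float xpos();
}

// Hypothetical stand-in for the library's Bounded interface.
interface Bounded {
    Region bounds();
}

final class BoundsMigrationSketch {
    /** Returns the smallest x position found among the given bounded objects. */
    static float leftmost(List<? extends Bounded> items) {
        float min = Float.POSITIVE_INFINITY;
        for (Bounded item : items) {
            // Spatial properties are read through bounds(), not from the item itself.
            min = Math.min(min, item.bounds().xpos());
        }
        return min;
    }

    /** Builds a Bounded fixed at the given x position, for demonstration only. */
    static Bounded at(float x) {
        Region region = () -> x;
        return () -> region;
    }

    public static void main(String[] args) {
        List<Bounded> blocks = List.of(at(72f), at(36f));
        System.out.println(leftmost(blocks));
    }
}
```

Writing helpers against Bounded rather than a concrete layout class means one method can serve every type that exposes bounds().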
Various methods on com.snowtide.pdf.EncryptionInfo were originally made public in error; they are of no use outside of PDFxStream’s internals. These methods are now private.
com.snowtide.pdf.Page.getStream() has been renamed to getDocument().
The public static void main(String[]) method previously provided by com.snowtide.pdf.PDFTextStream has been consolidated into the catch-all main method provided by com.snowtide.pdf.Console.
com.snowtide.pdf.forms.Form, its sub-interfaces, and its implementations are now generified based on the type of form fields they contain.
com.snowtide.pdf.EncryptionInfo.getErrorType() now returns an instance of EncryptionInfo.ErrorType, a new enumeration.
com.snowtide.pdf.FaultyPDFException’s constructors are now private.
com.snowtide.pdf.PDFVersion is now defined as an enumeration, instances of which are returned by com.snowtide.pdf.Document.getPDFVersion().
com.snowtide.pdf.Page.getPdfName() has been removed; use Page.getDocument().getName() instead.
com.snowtide.pdf.annot.Annotation now implements com.snowtide.pdf.DocumentLocation; its old methods getRect() and getPageNumber() have been renamed to match the analogous methods defined by DocumentLocation.
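With annotations and form fields both typed as DocumentLocation, a single helper can operate on either. The sketch below is self-contained and does not use PDFxStream itself: the DocumentLocation interface declared here is a hypothetical stand-in, and its pageNumber() accessor is an assumption based on the pageNumber() method these notes mention for form fields. The class and method names are invented for illustration.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical stand-in for com.snowtide.pdf.DocumentLocation. Only a page
// number accessor is modeled, and its exact signature is an assumption.
interface DocumentLocation {
    int pageNumber();
}

final class LocationSortSketch {
    /** Returns a new list ordered by page number; the input list is not modified. */
    static <T extends DocumentLocation> List<T> byPage(List<T> items) {
        List<T> sorted = new ArrayList<>(items);
        sorted.sort(Comparator.comparingInt(DocumentLocation::pageNumber));
        return sorted;
    }
}
```

Because the helper is generic over any DocumentLocation subtype, a list of annotations, a list of form fields, or a mixed list typed as DocumentLocation can all be ordered the same way.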
com.snowtide.pdf.PDFTextStream is now deprecated, though its API remains to prevent immediate breakage of code referencing it. Please update your projects to open PDF files via com.snowtide.PDF.open(), and use the com.snowtide.pdf.Document interface as the type representing those files.
pdfts.logfactory and pdfts.loggingtype system variables to simplify the customization of logging via com.snowtide.util.logging.LoggingRegistry
pdfts.mmap.disable system property option to disable memory-mapping of PDF files - avoids JDK bug #4724038 (355)