public class Configuration
extends java.lang.Object
Various configuration options for PDFxStream may be set using this class. A custom configuration may be registered with PDFxStream in any of three ways:
- By modifying the default Configuration, or by creating a new Configuration, modifying it as desired, and setting it as the new default instance.
- By providing a Configuration instance to one of the PDF.open factory methods.
- By applying it to an existing Document via Document.setConfig(Configuration). Note that certain configuration properties are used only during Document initialization, so the default settings will be used during that initialization phase.

Constructor and Description |
---|
Configuration(): Creates a Configuration whose properties are derived from the original defaults and the current value of system properties. |
Configuration(Configuration other): Creates a copy of the given Configuration instance. |
Modifier and Type | Method and Description |
---|---|
static Configuration | getDefault(): Returns the configuration that new Document instances use by default. |
java.lang.String | getLinebreakString(): Returns the string that OutputTarget (and its subclasses) output for each linebreak identified in extracted PDF content. |
int | getMinTableCellCount(): Returns the minimum number of adjacent cells that must be present in order for PDFxStream to recognize those cells collectively as a Table. |
TextUnit.Predicate | getTextUnitPredicate() |
static boolean | isCJKSupportEnabled(): Returns true if this configuration will cause PDFxStream to extract and decode Chinese, Japanese, and Korean content. |
boolean | isDeriveType3Fonts(): Returns true if this configuration will cause PDFxStream to derive the Unicode encodings of Type3 PDF fonts. |
boolean | isElideHorizontalTextualRules(): Returns true only if runs of characters forming a likely horizontal rule should be elided from extraction results. |
boolean | isIgnoreNonCardinalRotatedChars(): When true, characters that are rotated by angles other than 0, 90, 180, and 270 will be ignored. |
boolean | isImplicitLineDetectionEnabled() |
boolean | isStripXFAFormDataEnabled() |
boolean | isTableDetectionEnabled(): Returns true only if Table detection is enabled; defaults to true. |
static void | setCJKSupportEnabled(boolean enableCJK): Changes the setting that controls whether or not PDFxStream extracts and decodes Chinese, Japanese, and Korean content. |
static void | setDefault(Configuration defaultConfig): Sets the configuration that new Document instances use by default. |
void | setDeriveType3Fonts(boolean deriveType3Fonts): Changes the setting that controls whether or not PDFxStream derives the Unicode encodings of Type3 PDF fonts. |
void | setElideHorizontalTextualRules(boolean elideHorizontalTextualRules): Sets whether or not runs of characters forming a likely horizontal rule should be elided from extraction results. |
void | setIgnoreNonCardinalRotatedChars(boolean ignoreRotated): Sets whether characters that are rotated by angles other than 0, 90, 180, and 270 will be ignored. |
void | setImplicitLineDetectionEnabled(boolean detectImplicitLines) |
void | setLinebreakString(java.lang.String linebreak): Sets the string that OutputTarget (and its subclasses) output for each linebreak identified in extracted PDF content. |
void | setMinTableCellCount(int minTableCellCount): Changes the setting that controls the minimum number of adjacent cells that must be present in order for PDFxStream to recognize those cells collectively as a Table. |
void | setStripXFAFormDataEnabled(boolean stripXFAFormData) |
void | setTableDetectionEnabled(boolean detectTables): Sets whether or not Table detection is enabled. |
void | setTextUnitPredicate(TextUnit.Predicate p) |
java.lang.String | toString() |
public Configuration(Configuration other)
Creates a copy of the given Configuration instance.
public Configuration()
Creates a Configuration whose properties are derived from the original defaults and the current value of system properties. This does not take the current default configuration into account.
public static Configuration getDefault()
Returns the configuration that new Document instances use by default.
See also: Configuration.setDefault(Configuration)
public static void setDefault(Configuration defaultConfig)
Sets the configuration that new Document instances use by default.
public java.lang.String toString()
Overrides: toString in class java.lang.Object
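The difference between the no-argument constructor, the copy constructor, and the default instance can be easy to miss, so here is a minimal sketch of those semantics. It deliberately does not use PDFxStream itself: SketchConfig is a hypothetical stand-in that models only two documented properties (table detection, default true; minimum table cell count, default 4) and ignores the system-property lookup that the real no-argument constructor is described as performing.

```java
// Hypothetical stand-in mirroring only the semantics described in this page.
// It is NOT PDFxStream's Configuration class.
final class SketchConfig {
    private static SketchConfig defaultConfig = new SketchConfig();

    private boolean tableDetectionEnabled = true; // documented default: true
    private int minTableCellCount = 4;            // documented default: 4

    /** Starts from the original defaults; the current default instance is ignored. */
    SketchConfig() {}

    /** Copies every property of the given instance. */
    SketchConfig(SketchConfig other) {
        this.tableDetectionEnabled = other.tableDetectionEnabled;
        this.minTableCellCount = other.minTableCellCount;
    }

    static SketchConfig getDefault() { return defaultConfig; }
    static void setDefault(SketchConfig config) { defaultConfig = config; }

    boolean isTableDetectionEnabled() { return tableDetectionEnabled; }
    void setTableDetectionEnabled(boolean enabled) { tableDetectionEnabled = enabled; }
    int getMinTableCellCount() { return minTableCellCount; }
    void setMinTableCellCount(int count) { minTableCellCount = count; }
}

final class DefaultConfigDemo {
    public static void main(String[] args) {
        // Modify the default instance in place.
        SketchConfig.getDefault().setMinTableCellCount(6);

        // A fresh instance starts from the original defaults...
        SketchConfig fresh = new SketchConfig();
        // ...while a copy of the current default carries the change over.
        SketchConfig copy = new SketchConfig(SketchConfig.getDefault());

        System.out.println(fresh.getMinTableCellCount()); // 4
        System.out.println(copy.getMinTableCellCount());  // 6

        // Alternatively, build a new instance and register it as the default.
        copy.setTableDetectionEnabled(false);
        SketchConfig.setDefault(copy);
        System.out.println(SketchConfig.getDefault().isTableDetectionEnabled()); // false
    }
}
```

With the real class, the same pattern would presumably read `new Configuration(Configuration.getDefault())` when the intent is to start from the current default rather than from the original defaults.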
public void setTextUnitPredicate(TextUnit.Predicate p)
public TextUnit.Predicate getTextUnitPredicate()
public void setElideHorizontalTextualRules(boolean elideHorizontalTextualRules)
In many kinds of documents, a series of repeating characters (often dashes '-' or underscores '_') is used to "draw" what would otherwise usually be rendered using a vector line. For example, to separate a heading from some body text:
OVERVIEW OF THIS INFORMATION COLLECTION
--------------------------------------------------
A combined total of 4,607 respondents will utilize the form and then package and ship/deliver business records to the agency heretofore cited, in order to...
Sometimes these runs of characters (called "rules" in printing and typesetting contexts) can be helpful to identify structure within extracted content. At other times they are a hindrance, especially if the repeating characters are positioned immediately below some other content, so closely that PDFxStream infers that they are actually part of that other content. This can produce unfortunate text extraction results like:
_______O_V_E_R_V_I_E_W___O_F___T_H_I_S__I_N_F_O_R_M_A_T_I_O_N__C_O_L_L_E_C_T_I_O_N______
Enabling this configuration setting will cause PDFxStream to detect and remove the characters forming these kinds of horizontal rules from extracted texts.
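To make the idea of eliding a textual rule concrete, the following is a rough sketch that works on already-extracted plain text. It is not PDFxStream's algorithm, which is not described here and presumably operates on positioned characters. The class name, method name, and the threshold of five repeated characters are all assumptions chosen for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Illustrative only: a plain-text approximation of horizontal rule elision.
final class TextualRuleSketch {
    // Assumption: a "rule" is a run of at least five dashes or underscores.
    private static final Pattern RULE = Pattern.compile("[-_]{5,}");

    /** Removes rule-like runs, dropping lines that held nothing but a rule. */
    static String elideRules(String text) {
        List<String> kept = new ArrayList<>();
        for (String line : text.split("\n", -1)) {
            String stripped = RULE.matcher(line).replaceAll("");
            boolean wasOnlyRule = !line.trim().isEmpty() && stripped.trim().isEmpty();
            if (!wasOnlyRule) {
                kept.add(stripped);
            }
        }
        return String.join("\n", kept);
    }

    public static void main(String[] args) {
        String extracted = "OVERVIEW OF THIS INFORMATION COLLECTION\n"
                + "--------------------------------------------------\n"
                + "A combined total of 4,607 respondents will utilize the form";
        // The heading and body lines survive; the dashed line is removed.
        System.out.println(elideRules(extracted));
    }
}
```

A string-level filter like this cannot untangle the interleaved case shown above, where single underscores sit between the letters of a heading. Handling that requires the character position information available during extraction, which is why the setting exists at the extraction level.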
public boolean isElideHorizontalTextualRules()
See setElideHorizontalTextualRules(boolean) for more information about this setting.
public boolean isTableDetectionEnabled()
Returns true only if Table detection is enabled; defaults to true.
public void setTableDetectionEnabled(boolean detectTables)
Sets whether or not Table detection is enabled.
public boolean isStripXFAFormDataEnabled()
public void setStripXFAFormDataEnabled(boolean stripXFAFormData)
public boolean isIgnoreNonCardinalRotatedChars()
public void setIgnoreNonCardinalRotatedChars(boolean ignoreRotated)
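public int getMinTableCellCount()
Returns the minimum number of adjacent cells that must be present in order for PDFxStream to recognize those cells collectively as a Table. This setting defaults to 4.
public void setMinTableCellCount(int minTableCellCount)
Changes the setting that controls the minimum number of adjacent cells that must be present in order for PDFxStream to recognize those cells collectively as a Table. This setting defaults to 4.
public boolean isImplicitLineDetectionEnabled()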
public void setImplicitLineDetectionEnabled(boolean detectImplicitLines)
public static boolean isCJKSupportEnabled()
public static void setCJKSupportEnabled(boolean enableCJK)
public boolean isDeriveType3Fonts()
public void setDeriveType3Fonts(boolean deriveType3Fonts)
public java.lang.String getLinebreakString()
Returns the string that OutputTarget (and its subclasses) output for each linebreak identified in extracted PDF content. This value defaults to the current platform's line break string, as identified by the line.separator system property.
public void setLinebreakString(java.lang.String linebreak)
Sets the string that OutputTarget (and its subclasses) output for each linebreak identified in extracted PDF content. This value defaults to the current platform's line break string, as identified by the line.separator system property.
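As a small illustration of what the platform default means in practice, the sketch below reads the line.separator system property using only the standard library and shows the effect of substituting a fixed linebreak string in some text. The LinebreakSketch class and its withLinebreaks helper are hypothetical and are not part of PDFxStream.

```java
import java.util.regex.Matcher;

// Illustrative only: standard-library view of the platform line break string.
final class LinebreakSketch {
    /** Replaces every \r\n, \r, or \n in the text with the given linebreak string. */
    static String withLinebreaks(String text, String linebreak) {
        // quoteReplacement keeps '$' and '\' in the linebreak string literal.
        return text.replaceAll("\\r\\n|\\r|\\n", Matcher.quoteReplacement(linebreak));
    }

    public static void main(String[] args) {
        // The value the documentation names as the default.
        String platformDefault = System.getProperty("line.separator");
        System.out.println("Platform line break length: " + platformDefault.length());

        // Forcing Unix-style breaks regardless of platform.
        String normalized = withLinebreaks("first\r\nsecond\rthird", "\n");
        System.out.println(normalized);
    }
}
```

Setting an explicit linebreak string is mainly useful when extracted text must be byte-for-byte identical across operating systems, for example when comparing output generated on Windows and Linux hosts.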