Class Configuration
- java.lang.Object
  - com.snowtide.pdf.Configuration

public class Configuration extends Object
Various configuration options for PDFxStream may be set using this class. A custom configuration may be registered with PDFxStream in any of three ways:
- Retrieving and changing the current default, or creating a new Configuration, modifying it as desired, and setting it as the new default instance.
- Providing a customized Configuration instance to one of the PDF.open factory methods.
- Setting the configuration used by an existing Document via Document.setConfig(Configuration). Note that certain configuration properties are utilized only during Document initialization, so default settings will end up being used during that initialization phase.
- Since: v3.0
- Version: ©2004-2024 Snowtide
-
Nested Class Summary
Nested Classes
Modifier and Type | Class | Description
static class | Configuration.TelemetryMode | PDFxStream makes very limited use of remote telemetry, strictly to ensure licensing compliance and to aid Snowtide's technical support operations.
-
Constructor Summary
Constructors
Constructor | Description
Configuration() | Creates a Configuration whose properties are derived from the original defaults and the current value of system properties.
Configuration(Configuration other) | Creates a copy of the given Configuration instance.
-
Method Summary
Modifier and Type | Method | Description
static Configuration | getDefault() | Returns the configuration that new Document instances use by default.
String | getLinebreakString() | Returns the string that OutputTarget (and its subclasses) output for each linebreak identified in extracted PDF content.
int | getMinTableCellCount() | Returns the minimum number of adjacent cells that must be present in order for PDFxStream to recognize those cells collectively as a Table.
Configuration.TelemetryMode | getTelemetryMode() |
TextUnit.Predicate | getTextUnitPredicate() |
static boolean | isCJKSupportEnabled() | Returns true if this configuration will cause PDFxStream to extract and decode Chinese, Japanese, and Korean content.
boolean | isDeriveType3Fonts() | Returns true if this configuration will cause PDFxStream to derive the Unicode encodings of Type3 PDF fonts.
boolean | isElideHorizontalTextualRules() | Returns true only if runs of characters forming a likely horizontal rule should be elided from extraction results.
boolean | isIgnoreNonCardinalRotatedChars() | When true, characters that are rotated by angles other than 0, 90, 180, and 270 will be ignored.
boolean | isImplicitLineDetectionEnabled() |
boolean | isStripXFAFormDataEnabled() |
boolean | isTableDetectionEnabled() | Returns true only if Table detection is enabled; defaults to true.
static void | setCJKSupportEnabled(boolean enableCJK) | Changes the setting that controls whether or not PDFxStream extracts and decodes Chinese, Japanese, and Korean content.
static void | setDefault(Configuration defaultConfig) | Sets the configuration that new Document instances use by default.
void | setDeriveType3Fonts(boolean deriveType3Fonts) | Changes the setting that controls whether or not PDFxStream derives the Unicode encodings of Type3 PDF fonts.
void | setElideHorizontalTextualRules(boolean elideHorizontalTextualRules) | Sets whether or not runs of characters forming a likely horizontal rule should be elided from extraction results.
void | setIgnoreNonCardinalRotatedChars(boolean ignoreRotated) | Sets whether characters that are rotated by angles other than 0, 90, 180, and 270 will be ignored.
void | setImplicitLineDetectionEnabled(boolean detectImplicitLines) |
void | setLinebreakString(String linebreak) | Sets the string that OutputTarget (and its subclasses) output for each linebreak identified in extracted PDF content.
void | setMinTableCellCount(int minTableCellCount) | Changes the setting that controls the minimum number of adjacent cells that must be present in order for PDFxStream to recognize those cells collectively as a Table.
void | setStripXFAFormDataEnabled(boolean stripXFAFormData) |
void | setTableDetectionEnabled(boolean detectTables) | Sets whether or not Table detection is enabled.
void | setTelemetryMode(Configuration.TelemetryMode m) |
void | setTextUnitPredicate(TextUnit.Predicate p) |
String | toString() |
-
-
-
Constructor Detail
-
Configuration
public Configuration(Configuration other)
Creates a copy of the given Configuration instance.
-
Configuration
public Configuration()
Creates a Configuration whose properties are derived from the original defaults and the current value of system properties. This does not take the current default configuration into account.
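The difference between the two constructors can be easy to miss, so here is a minimal sketch of the semantics described above. It uses a hypothetical stand-in class, DemoConfig, rather than PDFxStream itself, so that it runs without the library; the property name demo.linebreak and the single linebreak field are assumptions made only for illustration and say nothing about how Configuration is actually implemented.

```java
// Hypothetical stand-in for illustration only; this is NOT com.snowtide.pdf.Configuration.
final class DemoConfig {
    // Hypothetical system property name, invented for this sketch.
    static final String LINEBREAK_PROPERTY = "demo.linebreak";

    // The instance that newly opened documents would pick up by default.
    private static DemoConfig defaultConfig = new DemoConfig();

    private String linebreak;

    // Mirrors the no-argument constructor: original defaults plus system
    // properties, ignoring whatever the current default instance holds.
    DemoConfig() {
        this.linebreak = System.getProperty(LINEBREAK_PROPERTY, "\n");
    }

    // Mirrors the copy constructor: duplicates the given instance's settings.
    DemoConfig(DemoConfig other) {
        this.linebreak = other.linebreak;
    }

    static DemoConfig getDefault() { return defaultConfig; }

    static void setDefault(DemoConfig config) { defaultConfig = config; }

    String getLinebreak() { return linebreak; }

    void setLinebreak(String linebreak) { this.linebreak = linebreak; }

    public static void main(String[] args) {
        // Customize the current default in place.
        DemoConfig.getDefault().setLinebreak("\r\n");

        // A fresh instance does not see the customized default...
        DemoConfig fresh = new DemoConfig();
        // ...while a copy of the default does.
        DemoConfig copy = new DemoConfig(DemoConfig.getDefault());

        System.out.println("fresh matches default: "
                + fresh.getLinebreak().equals(DemoConfig.getDefault().getLinebreak()));
        System.out.println("copy matches default: "
                + copy.getLinebreak().equals(DemoConfig.getDefault().getLinebreak()));
    }
}
```

In practice this suggests starting from a copy of the current default when only one or two settings should differ from it, and using the no-argument constructor when a clean baseline is wanted.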
-
-
Method Detail
-
getDefault
public static Configuration getDefault()
Returns the configuration that new Document instances use by default.
- See Also: setDefault(Configuration)
-
setDefault
public static void setDefault(Configuration defaultConfig)
Sets the configuration that new Document instances use by default.
-
getTelemetryMode
public Configuration.TelemetryMode getTelemetryMode()
-
setTelemetryMode
public void setTelemetryMode(Configuration.TelemetryMode m)
-
setTextUnitPredicate
public void setTextUnitPredicate(TextUnit.Predicate p)
-
getTextUnitPredicate
public TextUnit.Predicate getTextUnitPredicate()
-
setElideHorizontalTextualRules
public void setElideHorizontalTextualRules(boolean elideHorizontalTextualRules)
Sets whether or not runs of characters forming a likely horizontal rule should be elided from extraction results. Defaults to false.
In many kinds of documents, a series of repeating characters (often dashes '-' or underscores '_') is used to "draw" what might otherwise be rendered using a vector line. For example, to separate a heading from some body text:
    OVERVIEW OF THIS INFORMATION COLLECTION
    --------------------------------------------------
    A combined total of 4,607 respondents will utilize the form and then package and ship/deliver business records to the agency heretofore cited, in order to...
Sometimes these runs of characters (called "rules" in printing and typesetting contexts) can be helpful to identify structure within extracted content. At other times they are a hindrance, especially if the repeating characters are positioned immediately below some other content, so closely that PDFxStream infers that they are actually part of that other content. This can produce unfortunate text extraction results like:
    _______O_V_E_R_V_I_E_W___O_F___T_H_I_S__I_N_F_O_R_M_A_T_I_O_N__C_O_L_L_E_C_T_I_O_N______
Enabling this configuration setting will cause PDFxStream to detect and remove the characters forming these kinds of horizontal rules from extracted text.
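To make the idea of a textual rule concrete, the following is a rough sketch that flags plain-text lines consisting of a single repeated rule character and drops them. It is not PDFxStream's detection logic, which works on positioned characters and is not described here; the class name, the set of rule characters, and the minimum-length threshold are all assumptions chosen for illustration.

```java
// Illustrative sketch only; not PDFxStream's algorithm.
final class TextualRuleSketch {
    // Characters treated as rule characters here (an assumption of this sketch).
    private static final String RULE_CHARS = "-_=";

    // Returns true if the trimmed line is a run of at least minLength
    // copies of one rule character and nothing else.
    static boolean isLikelyRule(String line, int minLength) {
        String s = line.trim();
        if (s.isEmpty() || s.length() < minLength) {
            return false;
        }
        char first = s.charAt(0);
        if (RULE_CHARS.indexOf(first) < 0) {
            return false;
        }
        for (int i = 1; i < s.length(); i++) {
            if (s.charAt(i) != first) {
                return false;
            }
        }
        return true;
    }

    // Removes lines that look like rules, keeping all other lines in order.
    static String elideRules(String text, int minLength) {
        StringBuilder out = new StringBuilder();
        boolean first = true;
        for (String line : text.split("\n", -1)) {
            if (isLikelyRule(line, minLength)) {
                continue;
            }
            if (!first) {
                out.append('\n');
            }
            out.append(line);
            first = false;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String text = "OVERVIEW OF THIS INFORMATION COLLECTION\n"
                + "--------------------------------------------------\n"
                + "A combined total of 4,607 respondents will utilize the form";
        System.out.println(elideRules(text, 5));
    }
}
```

A line-based filter like this cannot repair the interleaved case shown above, where rule characters have already been merged into the heading; avoiding that requires handling the rule before text is assembled, which is what the configuration setting is for.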
-
isElideHorizontalTextualRules
public boolean isElideHorizontalTextualRules()
Returns true only if runs of characters forming a likely horizontal rule should be elided from extraction results. Defaults to false.
- See Also: setElideHorizontalTextualRules(boolean) for more information about this setting.
-
isTableDetectionEnabled
public boolean isTableDetectionEnabled()
Returns true only if Table detection is enabled; defaults to true.
-
setTableDetectionEnabled
public void setTableDetectionEnabled(boolean detectTables)
Sets whether or not Table detection is enabled.
-
isStripXFAFormDataEnabled
public boolean isStripXFAFormDataEnabled()
-
setStripXFAFormDataEnabled
public void setStripXFAFormDataEnabled(boolean stripXFAFormData)
-
isIgnoreNonCardinalRotatedChars
public boolean isIgnoreNonCardinalRotatedChars()
When true, characters that are rotated by angles other than 0, 90, 180, and 270 will be ignored. Defaults to false.
-
setIgnoreNonCardinalRotatedChars
public void setIgnoreNonCardinalRotatedChars(boolean ignoreRotated)
Sets whether characters that are rotated by angles other than 0, 90, 180, and 270 will be ignored. Defaults to false.
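As an illustration of what "cardinal" rotation means here, the sketch below tests whether an angle in degrees is a multiple of 90, within a small tolerance. The class and method names and the tolerance parameter are assumptions made for this example; how PDFxStream itself measures and compares character rotation is not documented in this section.

```java
// Illustrative sketch only; not PDFxStream's implementation.
final class RotationSketch {
    // Returns true if angleDegrees is within toleranceDegrees of
    // 0, 90, 180, or 270 (after normalizing into [0, 360)).
    static boolean isCardinal(double angleDegrees, double toleranceDegrees) {
        // Normalize so that negative angles and angles >= 360 are handled.
        double normalized = ((angleDegrees % 360.0) + 360.0) % 360.0;
        // Distance past the nearest lower multiple of 90.
        double remainder = normalized % 90.0;
        return remainder <= toleranceDegrees
                || 90.0 - remainder <= toleranceDegrees;
    }

    public static void main(String[] args) {
        double[] samples = {0.0, 45.0, 90.0, -90.0, 270.0, 12.5};
        for (double angle : samples) {
            System.out.println(angle + " cardinal: " + isCardinal(angle, 0.01));
        }
    }
}
```

With the setting enabled, text at angles such as 45 degrees (diagonal watermarks are a common case) would be the kind of content left out of extraction results.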
-
getMinTableCellCount
public int getMinTableCellCount()
Returns the minimum number of adjacent cells that must be present in order for PDFxStream to recognize those cells collectively as a Table. This setting defaults to 4.
-
setMinTableCellCount
public void setMinTableCellCount(int minTableCellCount)
Changes the setting that controls the minimum number of adjacent cells that must be present in order for PDFxStream to recognize those cells collectively as a Table. This setting defaults to 4.
-
isImplicitLineDetectionEnabled
public boolean isImplicitLineDetectionEnabled()
-
setImplicitLineDetectionEnabled
public void setImplicitLineDetectionEnabled(boolean detectImplicitLines)
-
isCJKSupportEnabled
public static boolean isCJKSupportEnabled()
Returns true if this configuration will cause PDFxStream to extract and decode Chinese, Japanese, and Korean content. This setting defaults to true.
-
setCJKSupportEnabled
public static void setCJKSupportEnabled(boolean enableCJK)
Changes the setting that controls whether or not PDFxStream extracts and decodes Chinese, Japanese, and Korean content. This setting defaults to true. Changing it to false will minimize PDFxStream's memory utilization, but no CJK content will be extracted.
-
isDeriveType3Fonts
public boolean isDeriveType3Fonts()
Returns true if this configuration will cause PDFxStream to derive the Unicode encodings of Type3 PDF fonts. This setting defaults to true.
-
setDeriveType3Fonts
public void setDeriveType3Fonts(boolean deriveType3Fonts)
Changes the setting that controls whether or not PDFxStream derives the Unicode encodings of Type3 PDF fonts. This setting defaults to true. Changing it to false will result in a small performance improvement, but any PDF content rendered using Type3 fonts that lack a Unicode encoding will not be extracted by PDFxStream.
-
getLinebreakString
public String getLinebreakString()
Returns the string that OutputTarget (and its subclasses) output for each linebreak identified in extracted PDF content. This value defaults to the current platform's line break string, as identified by the line.separator system property.
-
setLinebreakString
public void setLinebreakString(String linebreak)
Sets the string that OutputTarget (and its subclasses) output for each linebreak identified in extracted PDF content. This value defaults to the current platform's line break string, as identified by the line.separator system property.
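For reference, the platform default mentioned above can be inspected with standard Java alone. The sketch below reads the line.separator system property and shows how the choice of linebreak string changes joined output; the LinebreakSketch class and its join helper are hypothetical and only stand in for what an output target would do with extracted lines.

```java
// Illustrative sketch only; uses nothing beyond the Java standard library.
final class LinebreakSketch {
    // Joins lines using the given linebreak string, the way extracted
    // lines might be written out with a configured separator.
    static String join(String[] lines, String linebreak) {
        return String.join(linebreak, lines);
    }

    // Makes CR and LF visible so the separator can be printed legibly.
    static String visible(String s) {
        return s.replace("\r", "\\r").replace("\n", "\\n");
    }

    public static void main(String[] args) {
        // The platform line break string, as named by the system property.
        String platform = System.getProperty("line.separator");
        System.out.println("platform separator: " + visible(platform));

        String[] lines = {"first line", "second line"};
        // Forcing a fixed separator gives identical output on every platform.
        System.out.println(visible(join(lines, "\n")));
        System.out.println(visible(join(lines, "\r\n")));
    }
}
```

Setting an explicit linebreak string is worth considering when extracted text is compared or stored across operating systems, since the platform default differs between Windows and Unix-like systems.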
-
-