Class Configuration
- java.lang.Object
  - com.snowtide.pdf.Configuration
public class Configuration extends Object
Various configuration options for PDFxStream may be set using this class. A custom configuration may be registered with PDFxStream in any of three ways:
- Retrieving and changing the current default, or creating a new Configuration, modifying it as desired, and setting it as the new default instance.
- Providing a customized Configuration instance to one of the PDF.open factory methods.
- Setting the configuration used by an existing Document via Document.setConfig(Configuration). Note that certain configuration properties are utilized only during Document initialization, so default settings will end up being used during that initialization phase.
- Since:
- v3.0
- Version:
- ©2004-2025 Snowtide
Nested Class Summary
Nested Classes (Modifier and Type, Class: Description)
- static class Configuration.TelemetryMode: PDFxStream makes very limited use of remote telemetry, strictly to ensure licensing compliance and to aid Snowtide's technical support operations.
-
Constructor Summary
Constructors (Constructor: Description)
- Configuration(): Creates a Configuration whose properties are derived from the original defaults and the current value of system properties.
- Configuration(Configuration other): Creates a copy of the given Configuration instance.
-
Method Summary
Methods (Modifier and Type, Method: Description)
- static Configuration getDefault(): Returns the configuration that new Document instances use by default.
- String getLinebreakString(): Returns the string that OutputTarget (and its subclasses) output for each linebreak identified in extracted PDF content.
- int getMinTableCellCount(): Returns the minimum number of adjacent cells that must be present in order for PDFxStream to recognize those cells collectively as a Table.
- Configuration.TelemetryMode getTelemetryMode()
- TextUnit.Predicate getTextUnitPredicate()
- static boolean isCJKSupportEnabled(): Returns true if this configuration will cause PDFxStream to extract and decode Chinese, Japanese, and Korean content.
- boolean isDeriveType3Fonts(): Returns true if this configuration will cause PDFxStream to derive the Unicode encodings of Type3 PDF fonts.
- boolean isElideHorizontalTextualRules(): Returns true only if runs of characters forming a likely horizontal rule should be elided from extraction results.
- boolean isIgnoreNonCardinalRotatedChars(): When true, characters that are rotated by angles other than 0, 90, 180, and 270 will be ignored.
- boolean isImplicitLineDetectionEnabled()
- boolean isStripXFAFormDataEnabled()
- boolean isTableDetectionEnabled(): Returns true only if Table detection is enabled; defaults to true.
- static void setCJKSupportEnabled(boolean enableCJK): Changes the setting that controls whether or not PDFxStream extracts and decodes Chinese, Japanese, and Korean content.
- static void setDefault(Configuration defaultConfig): Sets the configuration that new Document instances use by default.
- void setDeriveType3Fonts(boolean deriveType3Fonts): Changes the setting that controls whether or not PDFxStream derives the Unicode encodings of Type3 PDF fonts.
- void setElideHorizontalTextualRules(boolean elideHorizontalTextualRules): Sets whether or not runs of characters forming a likely horizontal rule should be elided from extraction results.
- void setIgnoreNonCardinalRotatedChars(boolean ignoreRotated): Sets whether characters that are rotated by angles other than 0, 90, 180, and 270 will be ignored.
- void setImplicitLineDetectionEnabled(boolean detectImplicitLines)
- void setLinebreakString(String linebreak): Sets the string that OutputTarget (and its subclasses) output for each linebreak identified in extracted PDF content.
- void setMinTableCellCount(int minTableCellCount): Changes the setting that controls the minimum number of adjacent cells that must be present in order for PDFxStream to recognize those cells collectively as a Table.
- void setStripXFAFormDataEnabled(boolean stripXFAFormData)
- void setTableDetectionEnabled(boolean detectTables): Sets whether or not Table detection is enabled.
- void setTelemetryMode(Configuration.TelemetryMode m)
- void setTextUnitPredicate(TextUnit.Predicate p)
- String toString()
-
-
-
Constructor Detail
-
Configuration
public Configuration(Configuration other)
Creates a copy of the given Configuration instance.
-
Configuration
public Configuration()
Creates a Configuration whose properties are derived from the original defaults and the current value of system properties. This does not take the current default configuration into account.
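The difference between the two constructors can be illustrated with a minimal sketch. ConfigSketch below is a hypothetical stand-in written for this example, not com.snowtide.pdf.Configuration itself; it models a single property (the minimum table cell count, with its documented default of 4) and ignores system properties entirely. It only mirrors the documented behavior that the no-argument constructor starts from the original defaults, while the copy constructor picks up whatever the given instance currently holds.

```java
/** Hypothetical stand-in for illustration; this is not com.snowtide.pdf.Configuration. */
class ConfigSketch {
    // Shared default instance, analogous to what getDefault() returns.
    private static ConfigSketch defaultConfig = new ConfigSketch();

    // Original default, taken from the documentation of getMinTableCellCount().
    private int minTableCellCount = 4;

    /** Starts from the original defaults, ignoring the current default instance. */
    ConfigSketch() {}

    /** Copies every property of the given instance. */
    ConfigSketch(ConfigSketch other) {
        this.minTableCellCount = other.minTableCellCount;
    }

    static ConfigSketch getDefault() { return defaultConfig; }

    static void setDefault(ConfigSketch config) { defaultConfig = config; }

    int getMinTableCellCount() { return minTableCellCount; }

    void setMinTableCellCount(int count) { minTableCellCount = count; }

    public static void main(String[] args) {
        // Change the shared default in place.
        ConfigSketch.getDefault().setMinTableCellCount(6);

        ConfigSketch fresh = new ConfigSketch();                        // original defaults
        ConfigSketch copy = new ConfigSketch(ConfigSketch.getDefault()); // current default

        System.out.println(fresh.getMinTableCellCount()); // prints 4
        System.out.println(copy.getMinTableCellCount());  // prints 6
    }
}
```

Under these assumptions, code that wants to start from the application's current default would use the copy constructor on the default instance rather than the no-argument constructor.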
-
-
Method Detail
-
getDefault
public static Configuration getDefault()
Returns the configuration that new Document instances use by default.
- See Also:
setDefault(Configuration)
-
setDefault
public static void setDefault(Configuration defaultConfig)
Sets the configuration that new Document instances use by default.
-
getTelemetryMode
public Configuration.TelemetryMode getTelemetryMode()
-
setTelemetryMode
public void setTelemetryMode(Configuration.TelemetryMode m)
-
setTextUnitPredicate
public void setTextUnitPredicate(TextUnit.Predicate p)
-
getTextUnitPredicate
public TextUnit.Predicate getTextUnitPredicate()
-
setElideHorizontalTextualRules
public void setElideHorizontalTextualRules(boolean elideHorizontalTextualRules)
Sets whether or not runs of characters forming a likely horizontal rule should be elided from extraction results. Defaults to false.
In many kinds of documents, a series of repeating characters (often dashes '-' or underscores '_') is used to "draw" what would otherwise be rendered using a vector line. For example, to separate a heading from some body text:
    OVERVIEW OF THIS INFORMATION COLLECTION
    --------------------------------------------------
    A combined total of 4,607 respondents will utilize the form and then package and ship/deliver business records to the agency heretofore cited, in order to...
Sometimes these runs of characters (called "rules" in printing and typesetting contexts) can be helpful to identify structure within extracted content. At other times they are a hindrance, especially if the repeating characters are positioned immediately below some other content, so closely that PDFxStream infers that they are actually part of that other content. The result can be unfortunate text extraction results like:
    _______O_V_E_R_V_I_E_W___O_F___T_H_I_S__I_N_F_O_R_M_A_T_I_O_N__C_O_L_L_E_C_T_I_O_N______
Enabling this configuration setting will cause PDFxStream to detect and remove the characters forming these kinds of horizontal rules from extracted text.
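To make the idea of a textual rule concrete, here is a rough sketch of a post-processing filter that drops lines made up solely of dashes or underscores. It is not PDFxStream's detection logic, which works on positioned characters during extraction; the class name TextualRuleSketch and the threshold of five repeated characters are assumptions chosen for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical illustration only; not part of PDFxStream. */
class TextualRuleSketch {
    // Assumption: a "rule" is a line consisting solely of 5 or more dashes or
    // underscores, optionally surrounded by whitespace.
    private static final Pattern RULE_LINE = Pattern.compile("\\s*[-_]{5,}\\s*");

    /** Returns the input lines minus any line that looks like a textual rule. */
    static List<String> elideRules(List<String> lines) {
        List<String> kept = new ArrayList<>();
        for (String line : lines) {
            if (!RULE_LINE.matcher(line).matches()) {
                kept.add(line);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> extracted = List.of(
            "OVERVIEW OF THIS INFORMATION COLLECTION",
            "--------------------------------------------------",
            "A combined total of 4,607 respondents will utilize the form");
        // Prints the heading and the body line; the rule line is dropped.
        elideRules(extracted).forEach(System.out::println);
    }
}
```

A line-based filter like this cannot repair the interleaved case shown above, where the rule characters have already been merged into the heading; that is the situation the configuration setting is meant to address during extraction.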
-
isElideHorizontalTextualRules
public boolean isElideHorizontalTextualRules()
Returns true only if runs of characters forming a likely horizontal rule should be elided from extraction results. Defaults to false.
- See Also:
setElideHorizontalTextualRules(boolean) for more information about this setting.
-
isTableDetectionEnabled
public boolean isTableDetectionEnabled()
Returns true only if Table detection is enabled; defaults to true.
-
setTableDetectionEnabled
public void setTableDetectionEnabled(boolean detectTables)
Sets whether or not Table detection is enabled.
-
isStripXFAFormDataEnabled
public boolean isStripXFAFormDataEnabled()
-
setStripXFAFormDataEnabled
public void setStripXFAFormDataEnabled(boolean stripXFAFormData)
-
isIgnoreNonCardinalRotatedChars
public boolean isIgnoreNonCardinalRotatedChars()
When true, characters that are rotated by angles other than 0, 90, 180, and 270 will be ignored. Defaults to false.
-
setIgnoreNonCardinalRotatedChars
public void setIgnoreNonCardinalRotatedChars(boolean ignoreRotated)
Sets whether characters that are rotated by angles other than 0, 90, 180, and 270 will be ignored. Defaults to false.
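As an illustration of what "angles other than 0, 90, 180, and 270" means in practice, the following sketch classifies a rotation angle as cardinal or not. It is not PDFxStream code: the class name RotationSketch, the tolerance value, and the choice to normalize negative or larger-than-360 angles before testing are all assumptions made for this example.

```java
/** Hypothetical illustration only; not part of PDFxStream. */
class RotationSketch {
    // Assumed tolerance for floating-point noise in the angle, in degrees.
    private static final double EPSILON = 1e-6;

    /** Returns true if the angle is (within tolerance) a multiple of 90 degrees. */
    static boolean isCardinal(double degrees) {
        // Normalize into the range [0, 360) so that -90 is treated like 270.
        double normalized = ((degrees % 360) + 360) % 360;
        double remainder = normalized % 90;
        return remainder < EPSILON || 90 - remainder < EPSILON;
    }

    public static void main(String[] args) {
        System.out.println(isCardinal(90));  // prints true
        System.out.println(isCardinal(45));  // prints false
    }
}
```

With the setting enabled, characters whose rotation falls outside the cardinal set (for example, diagonal watermark text) would be the ones left out of extraction results.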
-
getMinTableCellCount
public int getMinTableCellCount()
Returns the minimum number of adjacent cells that must be present in order for PDFxStream to recognize those cells collectively as a Table. This setting defaults to 4.
-
setMinTableCellCount
public void setMinTableCellCount(int minTableCellCount)
Changes the setting that controls the minimum number of adjacent cells that must be present in order for PDFxStream to recognize those cells collectively as a Table. This setting defaults to 4.
-
isImplicitLineDetectionEnabled
public boolean isImplicitLineDetectionEnabled()
-
setImplicitLineDetectionEnabled
public void setImplicitLineDetectionEnabled(boolean detectImplicitLines)
-
isCJKSupportEnabled
public static boolean isCJKSupportEnabled()
Returns true if this configuration will cause PDFxStream to extract and decode Chinese, Japanese, and Korean content. This setting defaults to true.
-
setCJKSupportEnabled
public static void setCJKSupportEnabled(boolean enableCJK)
Changes the setting that controls whether or not PDFxStream extracts and decodes Chinese, Japanese, and Korean content. This setting defaults to true. Changing it to false will minimize PDFxStream's memory utilization, but no CJK content will be extracted.
-
isDeriveType3Fonts
public boolean isDeriveType3Fonts()
Returns true if this configuration will cause PDFxStream to derive the Unicode encodings of Type3 PDF fonts. This setting defaults to true.
-
setDeriveType3Fonts
public void setDeriveType3Fonts(boolean deriveType3Fonts)
Changes the setting that controls whether or not PDFxStream derives the Unicode encodings of Type3 PDF fonts. This setting defaults to true. Changing it to false will result in a small performance improvement, but any PDF content rendered using Type3 fonts that lack a Unicode encoding will not be extracted by PDFxStream.
-
getLinebreakString
public String getLinebreakString()
Returns the string that OutputTarget (and its subclasses) output for each linebreak identified in extracted PDF content. This value defaults to the current platform's line break string, as identified by the line.separator system property.
-
setLinebreakString
public void setLinebreakString(String linebreak)
Sets the string that OutputTarget (and its subclasses) output for each linebreak identified in extracted PDF content. This value defaults to the current platform's line break string, as identified by the line.separator system property.
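Because the default linebreak string depends on the platform, extracted text can differ between, say, Windows and Linux hosts. The sketch below shows the platform default and a way to normalize already-extracted text to one chosen linebreak string; it uses only the Java standard library, and the class name LinebreakSketch and its helper method are hypothetical names for this example, not part of PDFxStream.

```java
import java.util.regex.Matcher;

/** Hypothetical illustration only; not part of PDFxStream. */
class LinebreakSketch {
    /** Replaces every CRLF, lone CR, or lone LF in the text with the given linebreak string. */
    static String withLinebreak(String text, String linebreak) {
        // CRLF is listed first so that it is replaced as one unit rather than as two breaks.
        return text.replaceAll("\\r\\n|\\r|\\n", Matcher.quoteReplacement(linebreak));
    }

    public static void main(String[] args) {
        // The platform default, as found in the line.separator system property.
        String platformDefault = System.getProperty("line.separator");
        System.out.println("Platform linebreak length: " + platformDefault.length());

        // Normalize mixed linebreaks to a single LF.
        String normalized = withLinebreak("first\r\nsecond\rthird", "\n");
        System.out.println(normalized);
    }
}
```

Setting the linebreak string on the configuration up front avoids the need for this kind of normalization after the fact.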
-
-