Class Configuration

    • Constructor Detail

      • Configuration

        public Configuration()
        Creates a Configuration whose properties are derived from the original defaults and the current values of system properties. This does not take the current default configuration into account.
    • Method Detail

      • setDefault

        public static void setDefault(Configuration defaultConfig)
        Sets the configuration that new Document instances use by default.
      • setElideHorizontalTextualRules

        public void setElideHorizontalTextualRules(boolean elideHorizontalTextualRules)
        Sets whether or not runs of characters forming a likely horizontal rule should be elided from extraction results. Defaults to false.

        In many kinds of documents, a run of repeating characters (often dashes '-' or underscores '_') is used to "draw" what would otherwise be rendered as a vector line. For example, to separate a heading from some body text:

                            OVERVIEW OF THIS INFORMATION COLLECTION
                       --------------------------------------------------
         A combined total of 4,607 respondents will utilize the form and then package and ship/deliver
         business records to the agency heretofore cited, in order to...
         

        Sometimes these runs of characters (called "rules" in printing and typesetting contexts) help identify structure within extracted content. At other times they are a hindrance, especially if the repeating characters are positioned immediately below some other content, so closely that PDFxStream infers that they are part of that other content. This can produce unfortunate extraction results like:

           _______O_V_E_R_V_I_E_W___O_F___T_H_I_S__I_N_F_O_R_M_A_T_I_O_N__C_O_L_L_E_C_T_I_O_N______
         

        Enabling this configuration setting causes PDFxStream to detect and remove the characters forming these kinds of horizontal rules from extracted text.
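
        As a rough illustration of the idea only, the sketch below drops lines of already-extracted text that consist of a single repeated dash or underscore. It is not PDFxStream's detection logic, which would need to work on positioned characters to untangle rules that overlap other content. The class name TextualRuleSketch, its methods, and the MIN_RULE_LENGTH threshold are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: NOT how PDFxStream detects rules internally.
final class TextualRuleSketch {
    // Hypothetical threshold: shorter runs are treated as ordinary punctuation.
    static final int MIN_RULE_LENGTH = 5;

    /** Returns true if the trimmed line is one repeated '-' or '_' character. */
    static boolean isLikelyRule(String line) {
        String t = line.trim();
        if (t.length() < MIN_RULE_LENGTH) {
            return false;
        }
        char first = t.charAt(0);
        if (first != '-' && first != '_') {
            return false;
        }
        for (int i = 1; i < t.length(); i++) {
            if (t.charAt(i) != first) {
                return false;
            }
        }
        return true;
    }

    /** Returns the input lines with likely rules removed, order preserved. */
    static List<String> elideRules(List<String> lines) {
        List<String> kept = new ArrayList<>();
        for (String line : lines) {
            if (!isLikelyRule(line)) {
                kept.add(line);
            }
        }
        return kept;
    }
}
```

        A line-based filter like this cannot repair output in which the rule characters are already interleaved with a heading, which is the case this setting is meant to address during extraction itself.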

      • isElideHorizontalTextualRules

        public boolean isElideHorizontalTextualRules()
        Returns true only if runs of characters forming a likely horizontal rule should be elided from extraction results. Defaults to false.
        See Also:
        setElideHorizontalTextualRules(boolean) for more information about this setting.
      • isTableDetectionEnabled

        public boolean isTableDetectionEnabled()
        Returns true only if Table detection is enabled; defaults to true.
      • setTableDetectionEnabled

        public void setTableDetectionEnabled(boolean detectTables)
        Sets whether or not Table detection is enabled. Defaults to true.
      • isStripXFAFormDataEnabled

        public boolean isStripXFAFormDataEnabled()
      • setStripXFAFormDataEnabled

        public void setStripXFAFormDataEnabled(boolean stripXFAFormData)
      • isIgnoreNonCardinalRotatedChars

        public boolean isIgnoreNonCardinalRotatedChars()
        When true, characters that are rotated by angles other than 0, 90, 180, and 270 degrees will be ignored. Defaults to false.
      • setIgnoreNonCardinalRotatedChars

        public void setIgnoreNonCardinalRotatedChars(boolean ignoreRotated)
        Sets whether characters that are rotated by angles other than 0, 90, 180, and 270 degrees will be ignored. Defaults to false.
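
        To make "non-cardinal" concrete, here is a minimal sketch of a test for whether a rotation angle is a multiple of 90 degrees. It is an assumption about what such a check could look like, not PDFxStream's implementation. The class name CardinalRotationSketch, the isCardinal method, and the TOLERANCE value are hypothetical.

```java
// Illustrative only: NOT PDFxStream's internal rotation check.
final class CardinalRotationSketch {
    // Hypothetical tolerance, in degrees, to absorb floating-point noise.
    static final double TOLERANCE = 1e-6;

    /** Returns true if the angle is a multiple of 90 degrees (0, 90, 180, 270). */
    static boolean isCardinal(double degrees) {
        // Normalize into [0, 360) so negative and over-range angles are handled.
        double normalized = ((degrees % 360.0) + 360.0) % 360.0;
        double remainder = normalized % 90.0;
        // A remainder near 0 or near 90 both indicate a multiple of 90.
        return remainder < TOLERANCE || 90.0 - remainder < TOLERANCE;
    }
}
```

        Under this sketch, a character rotated by 45 degrees (for example, a diagonal watermark) would count as non-cardinal, while one rotated by 270 degrees would not.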
      • getMinTableCellCount

        public int getMinTableCellCount()
        Returns the minimum number of adjacent cells that must be present in order for PDFxStream to recognize those cells collectively as a Table. This setting defaults to 4.
      • setMinTableCellCount

        public void setMinTableCellCount(int minTableCellCount)
        Changes the setting that controls the minimum number of adjacent cells that must be present in order for PDFxStream to recognize those cells collectively as a Table. This setting defaults to 4.
      • isImplicitLineDetectionEnabled

        public boolean isImplicitLineDetectionEnabled()
      • setImplicitLineDetectionEnabled

        public void setImplicitLineDetectionEnabled(boolean detectImplicitLines)
      • isCJKSupportEnabled

        public static boolean isCJKSupportEnabled()
        Returns true if this configuration will cause PDFxStream to extract and decode Chinese, Japanese, and Korean content. This setting defaults to true.
      • setCJKSupportEnabled

        public static void setCJKSupportEnabled(boolean enableCJK)
        Changes the setting that controls whether or not PDFxStream extracts and decodes Chinese, Japanese, and Korean content. This setting defaults to true. Changing it to false will minimize PDFxStream's memory utilization, but no CJK content will be extracted.
      • isDeriveType3Fonts

        public boolean isDeriveType3Fonts()
        Returns true if this configuration will cause PDFxStream to derive the Unicode encodings of Type3 PDF fonts. This setting defaults to true.
      • setDeriveType3Fonts

        public void setDeriveType3Fonts(boolean deriveType3Fonts)
        Changes the setting that controls whether or not PDFxStream derives the Unicode encodings of Type3 PDF fonts. This setting defaults to true. Changing it to false will result in a small performance improvement, but any PDF content rendered using Type3 fonts that lack a Unicode encoding will not be extracted by PDFxStream.
      • getLinebreakString

        public String getLinebreakString()
        Returns the string that OutputTarget (and its subclasses) output for each linebreak identified in extracted PDF content. This value defaults to the current platform's line break string, as identified by the line.separator system property.
      • setLinebreakString

        public void setLinebreakString(String linebreak)
        Sets the string that OutputTarget (and its subclasses) output for each linebreak identified in extracted PDF content. This value defaults to the current platform's line break string, as identified by the line.separator system property.
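
        The sketch below shows, using only the standard Java library, where the documented default comes from and what a different linebreak string does to rendered text. It does not call PDFxStream; the class name LinebreakSketch and its methods are hypothetical stand-ins for what an output target might do with this setting.

```java
import java.util.List;

// Illustrative only: a stand-in for how a linebreak string might be applied.
final class LinebreakSketch {
    /** Returns the platform line break, the documented default for this setting. */
    static String platformLinebreak() {
        // Same value as the initial line.separator system property.
        return System.lineSeparator();
    }

    /** Joins lines, writing the given linebreak string after each one. */
    static String render(List<String> lines, String linebreak) {
        StringBuilder sb = new StringBuilder();
        for (String line : lines) {
            sb.append(line).append(linebreak);
        }
        return sb.toString();
    }
}
```

        Passing a fixed string such as "\n" instead of the platform default is one way to keep extracted text identical across operating systems.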