Interface TextUnit

  • All Superinterfaces:
    Bounded

    public interface TextUnit
    extends Bounded
    A single character or discrete character grouping positioned within a Line.

    Note that space characters are typically not encoded in PDF documents; rather, they are implicit in the spacing between the bounding boxes of adjacent TextUnits.
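
    The following is a minimal sketch of how spaces might be inferred from gaps between adjacent bounding boxes. It does not use the PDFxStream API: `Glyph`, `joinWithInferredSpaces`, and the `gapRatio` threshold are hypothetical names introduced only for illustration, and the threshold heuristic is an assumption rather than the library's actual rule.

```java
import java.util.List;

class SpaceInference {
    // Hypothetical stand-in for a positioned text unit; not a PDFxStream type.
    static final class Glyph {
        final String text;
        final float left;
        final float width;

        Glyph(String text, float left, float width) {
            this.text = text;
            this.left = left;
            this.width = width;
        }

        float right() {
            return left + width;
        }
    }

    // Joins glyphs that are already sorted left to right on one line, inserting
    // a space wherever the horizontal gap to the previous glyph exceeds
    // gapRatio times the previous glyph's width (an assumed heuristic).
    static String joinWithInferredSpaces(List<Glyph> glyphs, float gapRatio) {
        StringBuilder out = new StringBuilder();
        Glyph prev = null;
        for (Glyph g : glyphs) {
            if (prev != null && g.left - prev.right() > prev.width * gapRatio) {
                out.append(' ');
            }
            out.append(g.text);
            prev = g;
        }
        return out.toString();
    }
}
```

    In real use, the positions would come from each TextUnit's bounds rather than from hand-built values.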

    Since:
    v1.4
    Version:
    ©2004-2025 Snowtide
    • Nested Class Summary

      Nested Classes 
      Modifier and Type Interface Description
      static interface  TextUnit.Predicate
      Type to be satisfied when implementing a TextUnit predicate for filtering characters in a Page.
    • Method Summary

      Modifier and Type Method Description
      char[] getCharacterSequence()
      Returns the characters that should be rendered for this TextUnit.
      int getCharCode()
      Returns the 'raw' character code used to encode this TextUnit in the source PDF document.
      Font getFont()
      Returns the Font that was in force when this TextUnit was output.
      float getFontSize()
      Returns the size of the font used to render this TextUnit.
      char[] getMappedCharSequence()
      Returns the characters that the source PDF mapped to the "raw" character code, via the font and encoding information in force when the character code was read from the PDF document.
      float getTheta()
      Returns the angle (in degrees) by which this TextUnit's baseline is rotated.
      boolean isStruckThrough()
      Returns true if this TextUnit is struck through.
      boolean isUnderlined()
      Returns true if this TextUnit is underlined.
      • Methods inherited from interface com.snowtide.pdf.layout.Bounded

        bounds
    • Method Detail

      • getCharCode

        int getCharCode()
        Returns the 'raw' character code used to encode this TextUnit in the source PDF document.

        In many cases, this character code is equal to the character's Unicode code point. Otherwise, the font and encoding information in force when the character code was read from the PDF document dictates that a particular character sequence be rendered instead of the Unicode character corresponding to the character code returned by this method.

        Nearly all use cases should use the getCharacterSequence() method in preference to this one.

        See Also:
        getCharacterSequence()
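
        To illustrate why a raw character code can differ from the text it represents, here is a small self-contained sketch. `CharCodeMapping`, `map`, and `resolve` are hypothetical names, and the lookup table merely stands in for a font's encoding information; it is not how PDFxStream resolves codes internally.

```java
import java.util.HashMap;
import java.util.Map;

class CharCodeMapping {
    // Hypothetical encoding table: raw character code -> text it stands for.
    private final Map<Integer, String> encoding = new HashMap<>();

    CharCodeMapping map(int charCode, String text) {
        encoding.put(charCode, text);
        return this;
    }

    // Uses the table when the code is mapped; otherwise treats the code as a
    // Unicode code point (the "in many cases" situation described above).
    String resolve(int charCode) {
        String mapped = encoding.get(charCode);
        return mapped != null ? mapped : new String(Character.toChars(charCode));
    }
}
```

        With code 3 mapped to "fi", resolving 3 yields two characters, while an unmapped code such as 65 resolves to the single character with that code point.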
      • getMappedCharSequence

        char[] getMappedCharSequence()
        Returns the characters that the source PDF mapped to the "raw" character code, via the font and encoding information in force when the character code was read from the PDF document.

        Note that this character sequence will not reflect normalization that PDFxStream applies in order to produce getCharacterSequence(), including ligature re-folding, Arabic "un-shaping", the un-mirroring of brackets in right-to-left and bidirectional text, etc. Unless you have specific cause to avoid the result of these normalization steps, you should prefer getCharacterSequence() to this method.

      • getCharacterSequence

        char[] getCharacterSequence()
        Returns the characters that should be rendered for this TextUnit. This sequence is the result of applying:
        • the font and encoding information in force when the character code was read from the PDF document
        • ligature re-folding
        • Arabic "un-shaping"
        • reversal of right-to-left, multi-character sequences so that characters are in memory/logical order and not presentation order
        • the un-mirroring of brackets in right-to-left and bidirectional text
        • … and other normalization transformations that may be added from time to time

        This method will never return null, but may return an empty array if the TextUnit's "raw" character code is explicitly mapped to an empty character sequence.
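
        Because the result is never null but may be empty, callers can concatenate sequences without a null check. The sketch below assumes the char[] values have already been collected from a series of TextUnits; `TextUnitText` and `concat` are hypothetical helper names, not part of the API.

```java
import java.util.List;

class TextUnitText {
    // Concatenates per-unit character sequences in order. Empty arrays
    // contribute nothing, so no special handling is needed for them.
    static String concat(List<char[]> sequences) {
        StringBuilder sb = new StringBuilder();
        for (char[] seq : sequences) {
            sb.append(seq);
        }
        return sb.toString();
    }
}
```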

      • getFont

        Font getFont()
        Returns the Font that was in force when this TextUnit was output.
      • getFontSize

        float getFontSize()
        Returns the size of the font used to render this TextUnit.
      • isUnderlined

        boolean isUnderlined()
        Returns true if this TextUnit is underlined. While this will report an appropriate value for text that is rotated by a "regular" angle (90°, -90°, 180°), it will always return false for text that is rotated by any other angle (e.g. 30°, -45°, 16°).
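
        Since a false result is only meaningful for "regular" angles, a caller may want to check the rotation first. The sketch below is one way to do that on a value such as the one returned by getTheta(). `RotationCheck`, `isRegularAngle`, and the tolerance parameter are hypothetical, and treating 0° as a regular angle is an assumption.

```java
class RotationCheck {
    // True when the angle lies within `tolerance` degrees of a multiple of 90
    // (0, 90, -90, 180, ...). The tolerance absorbs floating-point noise.
    static boolean isRegularAngle(float thetaDegrees, float tolerance) {
        float remainder = Math.abs(thetaDegrees % 90f);
        return remainder <= tolerance || 90f - remainder <= tolerance;
    }
}
```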
      • isStruckThrough

        boolean isStruckThrough()
        Returns true if this TextUnit is struck through. This will report an appropriate value for text that is not rotated, and will always return false otherwise.
      • getTheta

        float getTheta()
        Returns the angle (in degrees) by which this TextUnit's baseline is rotated.
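
        Because the angle is reported in degrees, it must be converted before use with the trigonometric functions in java.lang.Math, which expect radians. The sketch below derives a unit vector along the baseline. `BaselineDirection` and `unitVector` are hypothetical names, and the counter-clockwise, y-up convention is an assumption that this documentation does not confirm.

```java
class BaselineDirection {
    // Returns {dx, dy}, a unit vector along a baseline rotated by thetaDegrees,
    // assuming counter-clockwise rotation with the y axis pointing up.
    static double[] unitVector(float thetaDegrees) {
        double radians = Math.toRadians(thetaDegrees);
        return new double[] { Math.cos(radians), Math.sin(radians) };
    }
}
```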