Interface TextUnit
-
- All Superinterfaces:
Bounded
public interface TextUnit extends Bounded
A single character or discrete character grouping positioned within a Line.
Note that space characters are typically not encoded in PDF documents; rather, they are implicit in the spacing between the bounding boxes of adjacent TextUnits.
- Since:
- v1.4
- Version:
- ©2004-2025 Snowtide
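Because spaces are implicit, code that assembles text from individual units has to infer them from geometry. The sketch below illustrates the idea only and is not PDFxStream's own algorithm: `Glyph` is a hypothetical stand-in for a TextUnit with a horizontal extent (the accessors of Bounded are not shown in this section), and the gap threshold is an arbitrary assumption.

```java
import java.util.List;

final class SpaceInferenceSketch {
    // Hypothetical stand-in for a TextUnit and its bounding box; not a PDFxStream type.
    static final class Glyph {
        final String text;
        final float left;
        final float width;

        Glyph(String text, float left, float width) {
            this.text = text;
            this.left = left;
            this.width = width;
        }

        float right() {
            return left + width;
        }
    }

    // Joins glyphs from one line, already ordered left to right, inserting a space
    // wherever the horizontal gap between adjacent boxes exceeds gapThreshold.
    static String joinWithInferredSpaces(List<Glyph> line, float gapThreshold) {
        StringBuilder out = new StringBuilder();
        Glyph previous = null;
        for (Glyph glyph : line) {
            if (previous != null && glyph.left - previous.right() > gapThreshold) {
                out.append(' ');
            }
            out.append(glyph.text);
            previous = glyph;
        }
        return out.toString();
    }
}
```

In practice a threshold would likely be scaled by font size rather than fixed; the fixed value here just keeps the sketch short.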
-
-
Nested Class Summary
Nested Classes
Modifier and Type | Interface | Description
static interface | TextUnit.Predicate | Type to be satisfied when implementing a TextUnit predicate for filtering characters in a Page.
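As a rough illustration of filtering characters with a predicate, the sketch below uses `java.util.function.Predicate` as a stand-in, because the method that TextUnit.Predicate requires is not shown in this section. `Unit`, `underlinedAtLeast`, and `filter` are hypothetical names introduced for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

final class PredicateSketch {
    // Stand-in exposing two of the methods documented on this page; not the real TextUnit.
    interface Unit {
        float getFontSize();
        boolean isUnderlined();
    }

    // Example criterion: underlined units rendered at or above a minimum font size.
    static Predicate<Unit> underlinedAtLeast(final float minFontSize) {
        return u -> u.isUnderlined() && u.getFontSize() >= minFontSize;
    }

    // Returns the units that satisfy the predicate, preserving their order.
    static List<Unit> filter(List<Unit> units, Predicate<Unit> predicate) {
        List<Unit> kept = new ArrayList<>();
        for (Unit u : units) {
            if (predicate.test(u)) {
                kept.add(u);
            }
        }
        return kept;
    }

    // Convenience factory for building sample units.
    static Unit unit(final float fontSize, final boolean underlined) {
        return new Unit() {
            public float getFontSize() { return fontSize; }
            public boolean isUnderlined() { return underlined; }
        };
    }
}
```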
-
Method Summary
Modifier and Type | Method | Description
char[] | getCharacterSequence() | Returns the characters that should be rendered for this TextUnit.
int | getCharCode() | Returns the 'raw' character code used to encode this TextUnit in the source PDF document.
Font | getFont() | Returns the Font that was in force when this TextUnit was outputted.
float | getFontSize() | Returns the size of the font used to render this TextUnit.
char[] | getMappedCharSequence() | Returns the characters that the source PDF mapped to the "raw" character code, via the font and encoding information in force when the character code was read from the PDF document.
float | getTheta() | Returns the angle (in degrees) by which this TextUnit's baseline is rotated.
boolean | isStruckThrough() | Returns true if this TextUnit is struck through (like this).
boolean | isUnderlined() | Returns true if this TextUnit is underlined (like this).
-
Method Detail
-
getCharCode
int getCharCode()
Returns the 'raw' character code used to encode this TextUnit in the source PDF document.
In many cases, this character code is equivalent to the Unicode character id. Otherwise, the font and encoding information in force when the character code was read from the PDF document dictates that a particular character sequence be rendered instead of the Unicode character corresponding to the character code returned by this function.
Nearly all use cases should use the getCharacterSequence() method in preference to this one.
- See Also:
getCharacterSequence()
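To make the distinction concrete, the sketch below checks whether a unit's raw code is itself the single Unicode code point that gets rendered. `Unit` is a stand-in that mirrors only the two methods discussed here, and the sample codes in the usage note are invented, not taken from any real font.

```java
final class CharCodeSketch {
    // Stand-in mirroring two documented methods; not the real PDFxStream interface.
    interface Unit {
        int getCharCode();
        char[] getCharacterSequence();
    }

    // True when the rendered sequence is exactly one code point equal to the raw code.
    static boolean rawCodeIsRenderedText(Unit u) {
        char[] seq = u.getCharacterSequence();
        return seq.length > 0
                && Character.codePointCount(seq, 0, seq.length) == 1
                && Character.codePointAt(seq, 0) == u.getCharCode();
    }

    // Convenience factory for building sample units.
    static Unit unit(final int code, final String text) {
        return new Unit() {
            public int getCharCode() { return code; }
            public char[] getCharacterSequence() { return text.toCharArray(); }
        };
    }
}
```

For example, `unit(65, "A")` passes the check, while a unit whose made-up code 2 renders as `"fi"` does not, which is the situation where relying on the raw code would give the wrong text.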
-
getMappedCharSequence
char[] getMappedCharSequence()
Returns the characters that the source PDF mapped to the "raw" character code, via the font and encoding information in force when the character code was read from the PDF document.
Note that this character sequence will not reflect normalization that PDFxStream applies in order to produce getCharacterSequence(), including ligature re-folding, Arabic "un-shaping", the un-mirroring of brackets in right-to-left and bidirectional text, etc. Unless you have specific cause to avoid the result of these normalization steps, you should prefer getCharacterSequence() to this method.
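One way to see where normalization had an effect is to compare the two sequences for a unit. This is a minimal sketch under the assumption that a plain array comparison is sufficient; `Unit` is a stand-in for the two methods involved, and the sample values in the usage note are invented rather than actual PDFxStream output.

```java
import java.util.Arrays;

final class NormalizationSketch {
    // Stand-in mirroring two documented methods; not the real PDFxStream interface.
    interface Unit {
        char[] getMappedCharSequence();
        char[] getCharacterSequence();
    }

    // True when the normalized sequence differs from what the PDF itself mapped.
    static boolean normalizationChangedText(Unit u) {
        return !Arrays.equals(u.getMappedCharSequence(), u.getCharacterSequence());
    }

    // Convenience factory for building sample units.
    static Unit unit(final String mapped, final String normalized) {
        return new Unit() {
            public char[] getMappedCharSequence() { return mapped.toCharArray(); }
            public char[] getCharacterSequence() { return normalized.toCharArray(); }
        };
    }
}
```

With invented values, `normalizationChangedText(unit("a", "a"))` is false and `normalizationChangedText(unit(")", "("))` is true.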
-
getCharacterSequence
char[] getCharacterSequence()
Returns the characters that should be rendered for this TextUnit. This sequence is the result of applying:
- the font and encoding information in force when the character code was read from the PDF document
- ligature re-folding
- Arabic "un-shaping"
- reversal of right-to-left, multi-character sequences so that characters are in memory/logical order and not presentation order
- the un-mirroring of brackets in right-to-left and bidirectional text
- … and other normalization transformations that may be added from time to time
This function will never return null, but may return an empty array if the TextUnit's "raw" character code is explicitly mapped to an empty character sequence.
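Since the result is never null but may be empty, callers can concatenate sequences without null checks. The sketch below shows this with a stand-in `Unit` interface that mirrors only getCharacterSequence(); `textOf` and `unit` are hypothetical helpers.

```java
import java.util.List;

final class SequenceJoinSketch {
    // Stand-in mirroring one documented method; not the real PDFxStream interface.
    interface Unit {
        char[] getCharacterSequence();
    }

    // Concatenates the rendered characters of each unit in order.
    // An empty array contributes nothing, so no special case is needed.
    static String textOf(List<Unit> units) {
        StringBuilder sb = new StringBuilder();
        for (Unit u : units) {
            sb.append(u.getCharacterSequence());
        }
        return sb.toString();
    }

    // Convenience factory for building sample units.
    static Unit unit(final String text) {
        return new Unit() {
            public char[] getCharacterSequence() { return text.toCharArray(); }
        };
    }
}
```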
-
getFontSize
float getFontSize()
Returns the size of the font used to render this TextUnit.
-
isUnderlined
boolean isUnderlined()
Returns true if this TextUnit is underlined (like this). While this will report an appropriate value for text that is rotated by a "regular" angle (90°, -90°, 180°), it will always return false for text that is rotated by any other angle (e.g. 30°, -45°, 16°).
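Because a false result is uninformative at other angles, a caller may want to check the rotation first. The helper below is a sketch only: it treats any multiple of 90° (including 0°) as "regular", which is an assumption beyond the three angles listed above, and the tolerance parameter is likewise hypothetical.

```java
final class UnderlineAngleSketch {
    // True when thetaDegrees is within tolerance of a multiple of 90 degrees.
    static boolean isRegularAngle(float thetaDegrees, float tolerance) {
        float remainder = Math.abs(thetaDegrees % 90f);
        return remainder <= tolerance || 90f - remainder <= tolerance;
    }
}
```

A caller could then trust `isUnderlined()` only when `isRegularAngle(unit.getTheta(), 0.01f)` holds.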
-
isStruckThrough
boolean isStruckThrough()
Returns true if this TextUnit is struck through (like this). This will report an appropriate value for text that is not rotated, and will always return false otherwise.
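One way to avoid misreading that unconditional false is to surface "unknown" explicitly for rotated text. The sketch below assumes that "not rotated" corresponds to getTheta() returning exactly 0, which this page does not state; `Unit` is a stand-in for the two methods involved.

```java
import java.util.Optional;

final class StrikethroughSketch {
    // Stand-in mirroring two documented methods; not the real PDFxStream interface.
    interface Unit {
        float getTheta();
        boolean isStruckThrough();
    }

    // Empty for rotated units, where a false result carries no information.
    // Assumption: an unrotated unit reports a theta of exactly 0.
    static Optional<Boolean> struckThrough(Unit u) {
        if (u.getTheta() != 0f) {
            return Optional.empty();
        }
        return Optional.of(u.isStruckThrough());
    }

    // Convenience factory for building sample units.
    static Unit unit(final float theta, final boolean struck) {
        return new Unit() {
            public float getTheta() { return theta; }
            public boolean isStruckThrough() { return struck; }
        };
    }
}
```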
-
getTheta
float getTheta()
Returns the angle (in degrees) by which this TextUnit's baseline is rotated.
-
-