java.lang.Object

org.apache.lucene.util.UnicodeUtil

public final class UnicodeUtil extends Object

Class to encode Java's UTF16 char[] into UTF8 byte[] without always allocating a new byte[], as String.getBytes(StandardCharsets.UTF_8) does.

NOTE: This API is for internal purposes only and might change in incompatible ways in the next release.

Field Summary

Fields

Modifier and Type

Field

Description

static final BytesRef

BIG_TERM

A binary term consisting of a number of 0xff bytes, likely to be bigger than other terms (e.g. collation keys) one would normally encounter, and definitely bigger than any UTF-8 terms.

static final int

MAX_UTF8_BYTES_PER_CHAR

Maximum number of UTF8 bytes per UTF16 character.

static final int

UNI_REPLACEMENT_CHAR

static final int

UNI_SUR_HIGH_END

static final int

UNI_SUR_HIGH_START

static final int

UNI_SUR_LOW_END

static final int

UNI_SUR_LOW_START

Method Summary

Modifier and Type

Method

Description

static int

calcUTF16toUTF8Length(CharSequence s, int offset, int len)

Calculates the number of UTF8 bytes necessary to write a UTF16 string.

static int

codePointCount(BytesRef utf8)

Returns the number of code points in this UTF8 sequence.

static int

maxUTF8Length(int utf16Length)

Returns the maximum number of UTF8 bytes required to encode a UTF16 sequence (e.g., a Java char[] or String) of the given length.

static String

newString(int[] codePoints, int offset, int count)

Cover JDK 1.5 API.

static String

toHexString(String s)

static int

UTF16toUTF8(char[] source, int offset, int length, byte[] out)

Encode characters from a char[] source, starting at offset for length chars.

static int

UTF16toUTF8(CharSequence s, int offset, int length, byte[] out)

Encode characters from the given CharSequence, starting at offset for length characters.

static int

UTF16toUTF8(CharSequence s, int offset, int length, byte[] out, int outOffset)

Encode characters from the given CharSequence, starting at offset for length characters.

static int

UTF8toUTF16(byte[] utf8, int offset, int length, char[] out)

Interprets the given byte array as UTF-8 and converts to UTF-16.

static int

UTF8toUTF16(BytesRef bytesRef, char[] chars)

Utility method for UTF8toUTF16(byte[], int, int, char[])

static int

UTF8toUTF32(BytesRef utf8, int[] ints)

This method assumes valid UTF8 input.

static boolean

validUTF16String(char[] s, int size)

static boolean

validUTF16String(CharSequence s)

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Field Details
- BIG_TERM
  
  public static final BytesRef BIG_TERM
  
  A binary term consisting of a number of 0xff bytes, likely to be bigger than other terms (e.g. collation keys) one would normally encounter, and definitely bigger than any UTF-8 terms.
  WARNING: This is not a valid UTF8 term.
- UNI_SUR_HIGH_START
  
  public static final int UNI_SUR_HIGH_START
  See Also:
  
  Constant Field Values
- UNI_SUR_HIGH_END
  
  public static final int UNI_SUR_HIGH_END
  See Also:
  
  Constant Field Values
- UNI_SUR_LOW_START
  
  public static final int UNI_SUR_LOW_START
  See Also:
  
  Constant Field Values
- UNI_SUR_LOW_END
  
  public static final int UNI_SUR_LOW_END
  See Also:
  
  Constant Field Values
- UNI_REPLACEMENT_CHAR
  
  public static final int UNI_REPLACEMENT_CHAR
  See Also:
  
  Constant Field Values
- MAX_UTF8_BYTES_PER_CHAR
  
  public static final int MAX_UTF8_BYTES_PER_CHAR
  
  Maximum number of UTF8 bytes per UTF16 character.
  See Also:
  
  Constant Field Values
Method Details
- UTF16toUTF8
  
  public static int UTF16toUTF8(char[] source, int offset, int length, byte[] out)
  
  Encode characters from a char[] source, starting at offset for length chars. It is the responsibility of the caller to make sure that the destination array is large enough.
- UTF16toUTF8
  
  public static int UTF16toUTF8(CharSequence s, int offset, int length, byte[] out)
  
  Encode characters from the given CharSequence, starting at offset for length characters. It is the responsibility of the caller to make sure that the destination array is large enough.
- UTF16toUTF8
  
  public static int UTF16toUTF8(CharSequence s, int offset, int length, byte[] out, int outOffset)
  
  Encode characters from the given CharSequence, starting at offset for length characters. Output to the destination array will begin at outOffset. It is the responsibility of the caller to make sure that the destination array is large enough.
  Note: this method returns the final output offset (outOffset + number of bytes written).
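  The return-value contract is easy to misread, so here is a minimal stand-alone sketch of an encoder with the same shape. It is not Lucene's implementation: the class name Utf8EncodeSketch and the method encode are hypothetical, and replacing an unpaired surrogate with U+FFFD is an assumption made for this illustration.

  ```java
  /** Illustrative stand-in written for this page; not Lucene code. */
  final class Utf8EncodeSketch {

    /**
     * Encodes s[offset, offset + length) as UTF-8 into out, starting at outOffset.
     * Returns the final output offset, i.e. outOffset + number of bytes written.
     * The caller is responsible for making out large enough.
     */
    static int encode(CharSequence s, int offset, int length, byte[] out, int outOffset) {
      int upto = outOffset;
      final int end = offset + length;
      for (int i = offset; i < end; i++) {
        final int code = s.charAt(i);
        if (code < 0x80) {
          // 1 byte: ASCII
          out[upto++] = (byte) code;
        } else if (code < 0x800) {
          // 2 bytes
          out[upto++] = (byte) (0xC0 | (code >> 6));
          out[upto++] = (byte) (0x80 | (code & 0x3F));
        } else if (code < 0xD800 || code > 0xDFFF) {
          // 3 bytes: rest of the BMP, excluding surrogates
          out[upto++] = (byte) (0xE0 | (code >> 12));
          out[upto++] = (byte) (0x80 | ((code >> 6) & 0x3F));
          out[upto++] = (byte) (0x80 | (code & 0x3F));
        } else if (code < 0xDC00 && i + 1 < end
            && Character.isLowSurrogate(s.charAt(i + 1))) {
          // 4 bytes: a valid high/low surrogate pair consumes two chars
          final int cp = Character.toCodePoint((char) code, s.charAt(i + 1));
          i++;
          out[upto++] = (byte) (0xF0 | (cp >> 18));
          out[upto++] = (byte) (0x80 | ((cp >> 12) & 0x3F));
          out[upto++] = (byte) (0x80 | ((cp >> 6) & 0x3F));
          out[upto++] = (byte) (0x80 | (cp & 0x3F));
        } else {
          // Assumption: an unpaired surrogate becomes U+FFFD (EF BF BD)
          out[upto++] = (byte) 0xEF;
          out[upto++] = (byte) 0xBF;
          out[upto++] = (byte) 0xBD;
        }
      }
      return upto;
    }
  }
  ```

  With this shape, the number of bytes written is the return value minus outOffset, and the return value can be passed as the next outOffset to append further text to the same buffer.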
- calcUTF16toUTF8Length
  
  public static int calcUTF16toUTF8Length(CharSequence s, int offset, int len)
  
  Calculates the number of UTF8 bytes necessary to write a UTF16 string.
  
  Returns:
  
  the number of UTF8 bytes needed to encode the given range (nothing is written)
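  To illustrate how a UTF-8 length can be computed from UTF-16 without encoding anything, here is a small stand-alone sketch. The class Utf8LengthSketch and the method utf8Length are hypothetical names, and counting an unpaired surrogate as three bytes (the size of U+FFFD) is an assumption of this sketch, not something this page states.

  ```java
  /** Illustrative stand-in written for this page; not Lucene code. */
  final class Utf8LengthSketch {

    /** Counts the UTF-8 bytes needed for s[offset, offset + len) without writing any. */
    static int utf8Length(CharSequence s, int offset, int len) {
      final int end = offset + len;
      int bytes = 0;
      for (int i = offset; i < end; i++) {
        final char c = s.charAt(i);
        if (c < 0x80) {
          bytes += 1;
        } else if (c < 0x800) {
          bytes += 2;
        } else if (Character.isHighSurrogate(c) && i + 1 < end
            && Character.isLowSurrogate(s.charAt(i + 1))) {
          bytes += 4; // a surrogate pair encodes one supplementary code point
          i++;        // skip the low surrogate
        } else {
          // BMP char, or (assumption) an unpaired surrogate counted as U+FFFD
          bytes += 3;
        }
      }
      return bytes;
    }
  }
  ```

  A result like this can be used to size the destination array exactly before encoding, instead of allocating for the worst case.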
- validUTF16String
  
  public static boolean validUTF16String(CharSequence s)
- validUTF16String
  
  public static boolean validUTF16String(char[] s, int size)
- codePointCount
  
  public static int codePointCount(BytesRef utf8)
  
  Returns the number of code points in this UTF8 sequence.
  This method assumes valid UTF8 input. It does not perform full UTF8 validation; it checks only the first byte of each code point (for multi-byte sequences, any bytes after the head are skipped).
  
  Throws:
  
  IllegalArgumentException - If an invalid code point header byte occurs or the content is prematurely truncated.
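  The head-byte-only check described above can be sketched as follows. This is a stand-alone illustration working on a plain byte[] range rather than a BytesRef; the class Utf8CodePointCountSketch and its method are hypothetical names, not part of Lucene.

  ```java
  /** Illustrative stand-in written for this page; not Lucene code. */
  final class Utf8CodePointCountSketch {

    /**
     * Counts code points in utf8[offset, offset + length) by inspecting only the
     * first byte of each sequence; continuation bytes are skipped unchecked.
     */
    static int codePointCount(byte[] utf8, int offset, int length) {
      final int limit = offset + length;
      int pos = offset;
      int count = 0;
      while (pos < limit) {
        final int lead = utf8[pos] & 0xFF;
        if (lead < 0x80) {
          pos += 1;                       // 0xxxxxxx
        } else if (lead >= 0xC0 && lead < 0xE0) {
          pos += 2;                       // 110xxxxx
        } else if (lead >= 0xE0 && lead < 0xF0) {
          pos += 3;                       // 1110xxxx
        } else if (lead >= 0xF0 && lead < 0xF8) {
          pos += 4;                       // 11110xxx
        } else {
          // a continuation byte (10xxxxxx) or 0xF8..0xFF in head position
          throw new IllegalArgumentException("invalid header byte at " + pos);
        }
        count++;
      }
      if (pos > limit) {
        // the last sequence claimed more bytes than the range contains
        throw new IllegalArgumentException("truncated input");
      }
      return count;
    }
  }
  ```

  Because continuation bytes are never inspected, input such as a valid head byte followed by arbitrary bytes is counted without complaint; callers that need validation have to perform it separately.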
- UTF8toUTF32
  
  public static int UTF8toUTF32(BytesRef utf8, int[] ints)
  
  This method assumes valid UTF8 input. It does not perform full UTF8 validation; it checks only the first byte of each code point (for multi-byte sequences, any bytes after the head are skipped). It is the responsibility of the caller to make sure that the destination array is large enough.
  
  Throws:
  
  IllegalArgumentException - If an invalid code point header byte occurs or the content is prematurely truncated.
- newString
  
  public static String newString(int[] codePoints, int offset, int count)
  
  Cover JDK 1.5 API. Create a String from an array of codePoints.
  
  Parameters:
  
  codePoints - The code point array
  
  offset - The start of the text in the code point array
  
  count - The number of code points
  
  Returns:
  
  a String representing the count code points starting at offset
  
  Throws:
  
  IllegalArgumentException - If an invalid code point is encountered
  
  IndexOutOfBoundsException - If the offset or count are out of bounds.
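  For comparison, the JDK itself has offered a String(int[] codePoints, int offset, int count) constructor since Java 5, with the same parameters and the same two documented exceptions. The sketch below uses only that JDK constructor; the wrapper class NewStringSketch is a hypothetical name, and whether newString delegates to this constructor is not stated on this page.

  ```java
  /** Illustrative stand-in written for this page; uses only the JDK. */
  final class NewStringSketch {

    /** Builds a String from count code points, starting at index offset. */
    static String fromCodePoints(int[] codePoints, int offset, int count) {
      // Supplementary code points (above U+FFFF) become surrogate pairs,
      // so the resulting String may have more chars than count.
      return new String(codePoints, offset, count);
    }
  }
  ```

  Note that offset and count are a start index and a length, not a start and end index: passing (codePoints, 1, 2) selects the second and third code points.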
- toHexString
  
  public static String toHexString(String s)
- UTF8toUTF16
  
  public static int UTF8toUTF16(byte[] utf8, int offset, int length, char[] out)
  
  Interprets the given byte array as UTF-8 and converts to UTF-16. It is the responsibility of the caller to make sure that the destination array is large enough.
  NOTE: Full characters are read, even if this reads past the length passed (and can result in an ArrayIndexOutOfBoundsException if invalid UTF-8 is passed). Explicit checks for valid UTF-8 are not performed.
- maxUTF8Length
  
  public static int maxUTF8Length(int utf16Length)
  
  Returns the maximum number of UTF8 bytes required to encode a UTF16 sequence (e.g., a Java char[] or String) of the given length.
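  The worst-case bound follows from the UTF-8 encoding rules: a single UTF-16 char outside the surrogate range needs at most 3 bytes, and a surrogate pair uses 2 chars for 4 bytes, so 3 bytes per char is always enough. A minimal sketch of such a bound is shown here. The class MaxUtf8LengthSketch is a hypothetical name, the constant 3 comes from those encoding rules rather than from this page (which does not list the value of MAX_UTF8_BYTES_PER_CHAR), and failing on int overflow is a choice made for the sketch.

  ```java
  /** Illustrative stand-in written for this page; not Lucene code. */
  final class MaxUtf8LengthSketch {

    // Assumption: worst case of 3 UTF-8 bytes per UTF-16 char.
    static final int MAX_BYTES_PER_CHAR = 3;

    /** Upper bound on the UTF-8 size of any UTF-16 sequence of the given length. */
    static int maxUtf8Length(int utf16Length) {
      // multiplyExact throws ArithmeticException instead of silently overflowing
      return Math.multiplyExact(utf16Length, MAX_BYTES_PER_CHAR);
    }
  }
  ```

  Sizing a reusable buffer with a bound like this avoids a separate pass to compute the exact length, at the cost of over-allocating for mostly-ASCII text.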
- UTF8toUTF16
  
  public static int UTF8toUTF16(BytesRef bytesRef, char[] chars)
  
  Utility method for UTF8toUTF16(byte[], int, int, char[])
  See Also:
  
  UTF8toUTF16(byte[], int, int, char[])
