UTF-16

UTF-16 is an encoding for Unicode characters, optimized for the most frequently used characters, those of the Basic Multilingual Plane (BMP).

UTF-16 is defined by the Unicode Consortium and also by ISO/IEC 10646. Unicode defines additional semantics on top of the encoding. A detailed comparison can be found in Appendix C of the Unicode 4.0 standard. The ISO standard also defines an encoding called UCS-2, which, however, only permits 16-bit representations of the BMP.

The BMP contains the Unicode characters whose code lies in the range U+0000 to U+FFFF. Within this range, the code points U+D800 to U+DFFF are reserved for UTF-16 as surrogates (surrogate characters).

Characters from the BMP are mapped directly onto the 16 bits of a single UTF-16 code unit. Unicode characters whose code cannot be represented in 16 bits occupy two 16-bit code units, which are built from so-called surrogates (surrogate characters) as follows:

Bit
31            24|23           16|15            8|7             0|
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|0 0 0 0 0 0 0 0|0 0 0 z z z z z|x x x x x x y y|y y y y y y y y|
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

High surrogate (U+D800 ... U+DBFF)

|15            8|7             0|
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|1 1 0 1 1 0 Z Z|Z Z x x x x x x|
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

Low surrogate (U+DC00 ... U+DFFF)

|15            8|7             0|
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|1 1 0 1 1 1 y y|y y y y y y y y|
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

Here, ZZZZ = zzzzz - 1.

This yields the permitted value range for UTF-16:

zzzzz = 00000 -> one 16-bit code unit -> U+00xxxx
otherwise: ZZZZ = 0000..1111 -> zzzzz = 00001..10000 -> U+01xxxx .. U+10xxxx

The two surrogates are sent in the order high surrogate, low surrogate. This makes it possible to encode Unicode characters up to U+10FFFF.
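
The bit layout above can be sketched in a few lines of Python. This is only an illustrative sketch, not a reference implementation; the function name encode_utf16_code_units is made up for this example. Subtracting 0x10000 corresponds to the rule ZZZZ = zzzzz - 1, and the remaining 20 bits are split into the upper 10 bits (ZZZZ xxxxxx) and the lower 10 bits (yyyyyyyyyy).

```python
def encode_utf16_code_units(code_point: int) -> list[int]:
    """Return the one or two 16-bit UTF-16 code units for a code point.

    Hypothetical helper for illustration only.
    """
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("code point outside U+0000..U+10FFFF")
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("surrogate code points cannot be encoded")
    if code_point <= 0xFFFF:
        # BMP character: mapped directly onto one code unit.
        return [code_point]
    # Subtracting 0x10000 implements ZZZZ = zzzzz - 1 and leaves 20 bits.
    offset = code_point - 0x10000
    high = 0xD800 | (offset >> 10)    # 110110 + upper 10 bits (ZZZZ xxxxxx)
    low = 0xDC00 | (offset & 0x3FF)   # 110111 + lower 10 bits (yyyyyyyyyy)
    # The high surrogate always comes first.
    return [high, low]
```

For example, U+1F600 yields the pair 0xD83D, 0xDE00, while a BMP character such as U+20AC yields the single code unit 0x20AC.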

When UTF-16 data is transmitted, for example over a network, or stored on a storage medium, the two bytes that make up a 16-bit code unit are written one after the other. Depending on the byte order of the computer architecture, they are arranged in one of two different orders. This results in two distinct encodings, referred to as UTF-16BE (big endian) and UTF-16LE (little endian):

UTF-16 code unit

Bit
|15            8|7             0|
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|y y y y y y y y|x x x x x x x x|
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

UTF-16BE encoding

     1st byte          2nd byte
|7             0| |7             0|
+-+-+-+-+-+-+-+-+ +-+-+-+-+-+-+-+-+
|y y y y y y y y| |x x x x x x x x|
+-+-+-+-+-+-+-+-+ +-+-+-+-+-+-+-+-+

UTF-16LE encoding

     1st byte          2nd byte
|7             0| |7             0|
+-+-+-+-+-+-+-+-+ +-+-+-+-+-+-+-+-+
|x x x x x x x x| |y y y y y y y y|
+-+-+-+-+-+-+-+-+ +-+-+-+-+-+-+-+-+
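
The two byte layouts can be illustrated with a short Python sketch. The helper name code_units_to_bytes and its byte_order parameter are assumptions made for this example; the sketch relies only on the built-in int.to_bytes method.

```python
def code_units_to_bytes(code_units, byte_order="big"):
    """Serialize 16-bit code units as UTF-16BE ("big") or UTF-16LE ("little").

    Hypothetical helper for illustration only.
    """
    out = bytearray()
    for unit in code_units:
        # Each code unit becomes exactly two bytes in the chosen order.
        out += unit.to_bytes(2, byte_order)
    return bytes(out)
```

With the code unit 0x20AC, the big-endian form is the byte sequence 20 AC and the little-endian form is AC 20. Only the bytes within each code unit are swapped; the order of the code units themselves, including high surrogate before low surrogate, stays the same.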

To be able to distinguish between these encodings, it is recommended to place the Unicode character U+FEFF (BOM, byte order mark), which stands for a zero-width no-break space, at the start of the data stream. If it is received as U+FFFE, which is not a valid Unicode character, this means that the byte order of sender and receiver differs, so the receiver must swap the two bytes of every 16-bit code unit.
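
A minimal sketch of such a check in Python follows. The function name detect_utf16_byte_order is hypothetical, and the sketch assumes the data is already known to be UTF-16; it does not try to tell UTF-16 apart from other encodings whose byte order marks begin with the same bytes.

```python
def detect_utf16_byte_order(data: bytes):
    """Return "big", "little", or None based on a leading byte order mark.

    Hypothetical helper; assumes the input is UTF-16 data.
    """
    prefix = data[:2]
    if prefix == b"\xfe\xff":
        # U+FEFF with the high byte first: UTF-16BE.
        return "big"
    if prefix == b"\xff\xfe":
        # Bytes arrive swapped, which would read as U+FFFE in big-endian
        # order, so the data is UTF-16LE.
        return "little"
    # No BOM present: the byte order must be known from elsewhere.
    return None
```

If the function returns None, the stream carries no BOM and the byte order has to come from another source, such as a protocol header or an explicit encoding label.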

When converting UTF-16 strings into UTF-8 byte sequences, note that the surrogates must first be combined into a single Unicode code point before that code point can be converted into a UTF-8 byte sequence.
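
The following Python sketch shows the order of operations under simple assumptions: the input is a list of 16-bit integers, and unpaired surrogates are treated as an error (other error-handling strategies are possible). The function name utf16_units_to_utf8 is made up for this example; the UTF-8 step itself is delegated to Python's built-in str.encode.

```python
def utf16_units_to_utf8(code_units):
    """Convert a sequence of UTF-16 code units to UTF-8 bytes.

    Hypothetical helper; rejects unpaired surrogates.
    """
    out = bytearray()
    i = 0
    while i < len(code_units):
        unit = code_units[i]
        if 0xD800 <= unit <= 0xDBFF:
            # High surrogate: it must be followed by a low surrogate.
            if i + 1 >= len(code_units) or not 0xDC00 <= code_units[i + 1] <= 0xDFFF:
                raise ValueError("unpaired high surrogate")
            low = code_units[i + 1]
            # Combine the pair into one code point first.
            code_point = 0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00)
            i += 2
        elif 0xDC00 <= unit <= 0xDFFF:
            raise ValueError("unpaired low surrogate")
        else:
            code_point = unit
            i += 1
        # Only the combined code point is encoded as UTF-8.
        out += chr(code_point).encode("utf-8")
    return bytes(out)
```

Encoding each surrogate separately would instead produce two three-byte sequences, which is not valid UTF-8; combining first gives the correct four-byte sequence for characters outside the BMP.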

External links

Unicode Standard 4.0, Chapter 2 (PDF), General Structure. UTF-16 is defined under 2.5 Encoding Forms (p. 20).
Unicode TN12: UTF-16 for Processing

See also: Unicode