A Misconception About Unicode

Source: Internet | Published by: 28大神软件下载 | Edited by: 程序博客网 | Time: 2024/06/08 11:33

A few days ago I read a Java certification book published by Sybex: Complete Java 2 Certification Study Guide, Fifth Edition.

On page 312 (Chapter 9, I/O and Streams), the book describes Unicode as follows:

Clearly, 8 bits are not enough to represent all the characters of all the communities of our planet. The Unicode standard was developed as a way to map characters to 16-bit values. Using 16 bits means that there are 65,536 possible characters that can be represented, so almost all languages can be fully encoded. (Chinese, Japanese, and Korean, which have huge numbers of characters, are not completely represented.)

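Reading this passage, we would come away with the following understanding: because 8 bits are not enough to encode every character, Unicode encodes characters with 16 bits, which is more or less sufficient (except for Chinese, Japanese, and so on).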

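This description is wrong. Unicode assigns each character a code point in the range U+0000 to U+10FFFF, and the Unicode standard defines several ways of encoding those code points, including UTF-8, UTF-16, and UTF-32.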

UTF-8 encodes characters using 8 bits as one unit (the code unit). The encoding of a single character consists of one or more 8-bit units, and never more than four.

UTF-16 encodes characters using 16 bits as one unit (the code unit). The encoding of a single character consists of one or two 16-bit units.

UTF-32 encodes characters using 32 bits as one unit (the code unit). The encoding of a single character always consists of exactly one 32-bit unit.
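To make the three code unit sizes concrete, here is a minimal Java sketch that counts how many code units a few sample characters need in each encoding form. The class name CodeUnitCount, its helper methods, and the choice of sample characters are illustrative assumptions and do not come from the book. The sketch relies on the fact that a Java String is a sequence of UTF-16 code units, and it takes the UTF-32 count to be the number of code points instead of using a UTF-32 charset.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper class for illustration only.
public class CodeUnitCount {
    // Number of 8-bit code units (bytes) in the UTF-8 form of s.
    static int utf8Units(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // A Java String stores UTF-16 code units, so length() is the unit count.
    static int utf16Units(String s) {
        return s.length();
    }

    // UTF-32 uses one 32-bit unit per code point, so count the code points.
    static int utf32Units(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        String[] samples = {
            "A",                                    // U+0041, Latin letter
            "\u00E9",                               // U+00E9, e with acute accent
            "\u4E2D",                               // U+4E2D, a common Chinese character
            new String(Character.toChars(0x20000))  // U+20000, outside the 16-bit range
        };
        for (String s : samples) {
            System.out.printf("U+%04X  UTF-8: %d  UTF-16: %d  UTF-32: %d%n",
                s.codePointAt(0), utf8Units(s), utf16Units(s), utf32Units(s));
        }
    }
}
```

For these samples the UTF-8 counts are 1, 2, 3, and 4, the UTF-16 counts are 1, 1, 1, and 2, and the UTF-32 count is always 1. The last sample, U+20000, does not fit in a single 16-bit value, which is exactly the case the "16-bit" description of Unicode leaves out.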