﻿<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:trackback="http://madskills.com/public/xml/rss/module/trackback/" xmlns:wfw="http://wellformedweb.org/CommentAPI/" xmlns:slash="http://purl.org/rss/1.0/modules/slash/"><channel><title>C++ Blog - 可冰 - Post Category - UTF-8</title><link>http://www.cppblog.com/kb/category/55.html</link><description>Ice is water that is sleeping......</description><language>zh-cn</language><lastBuildDate>Tue, 20 May 2008 05:11:46 GMT</lastBuildDate><pubDate>Tue, 20 May 2008 05:11:46 GMT</pubDate><ttl>60</ttl><item><title>Please review my UTF-8 / Unicode conversion code</title><link>http://www.cppblog.com/kb/archive/2005/09/29/491.html</link><dc:creator>可冰</dc:creator><author>可冰</author><pubDate>Thu, 29 Sep 2005 12:34:00 GMT</pubDate><guid>http://www.cppblog.com/kb/archive/2005/09/29/491.html</guid><wfw:comment>http://www.cppblog.com/kb/comments/491.html</wfw:comment><comments>http://www.cppblog.com/kb/archive/2005/09/29/491.html#Feedback</comments><slash:comments>8</slash:comments><wfw:commentRss>http://www.cppblog.com/kb/comments/commentRss/491.html</wfw:commentRss><trackback:ping>http://www.cppblog.com/kb/services/trackbacks/491.html</trackback:ping><description><![CDATA[<font color="#000000" face="Verdana" size="2">Last week I put a lot of effort into writing a template-based facility for converting between UTF-8 and Unicode (see the file </font><a href="http://www.cnblogs.com/Files/kb/Code.rar"><font color="#000080" face="Verdana" size="2">code.rar</font></a><font color="#000000" face="Verdana" size="2">). At first it seemed fine, but over the past few days I have started to wonder: why not simply provide two functions? Wouldn't that be simpler and more convenient? What extra benefit does my design actually bring? My original aim was code that is convenient to use and easy to extend and maintain, but I now feel it offers little beyond what plain C-style functions would. Perhaps a feature this simple does not need code this complex at all. As Eric Raymond said of C++, it "inclines programmers to write complicated code".<br>I would like everyone to look at my code and give me some comments and suggestions.</font><img src ="http://www.cppblog.com/kb/aggbug/491.html" width = "1" height = "1" /><br><br><div align=right><a style="text-decoration:none;" href="http://www.cppblog.com/kb/" target="_blank">可冰</a> 2005-09-29 20:34 <a href="http://www.cppblog.com/kb/archive/2005/09/29/491.html#Feedback" 
target="_blank" style="text-decoration:none;">Post a comment</a></div>]]></description></item><item><title>Designing a UTF-8 decoding module</title><link>http://www.cppblog.com/kb/archive/2005/09/22/399.html</link><dc:creator>可冰</dc:creator><author>可冰</author><pubDate>Thu, 22 Sep 2005 15:24:00 GMT</pubDate><guid>http://www.cppblog.com/kb/archive/2005/09/22/399.html</guid><wfw:comment>http://www.cppblog.com/kb/comments/399.html</wfw:comment><comments>http://www.cppblog.com/kb/archive/2005/09/22/399.html#Feedback</comments><slash:comments>1</slash:comments><wfw:commentRss>http://www.cppblog.com/kb/comments/commentRss/399.html</wfw:commentRss><trackback:ping>http://www.cppblog.com/kb/services/trackbacks/399.html</trackback:ping><description><![CDATA[<p style="font-family: verdana; font-size: 12px;">
I want to implement an "engine" that decodes UTF-8 documents into Unicode code points, and it should be convenient and pleasant to use.<br>But after several days of thinking I still have no suitable design.<br>Sigh......<br>Today I tried writing a little to get a feel for it; I will keep thinking...<br>
</p><img src ="http://www.cppblog.com/kb/aggbug/399.html" width = "1" height = "1" /><br><br><div align=right><a style="text-decoration:none;" href="http://www.cppblog.com/kb/" target="_blank">可冰</a> 2005-09-22 23:24 <a href="http://www.cppblog.com/kb/archive/2005/09/22/399.html#Feedback" target="_blank" style="text-decoration:none;">Post a comment</a></div>]]></description></item><item><title>How does std::wfstream support wide characters?</title><link>http://www.cppblog.com/kb/archive/2005/09/22/396.html</link><dc:creator>可冰</dc:creator><author>可冰</author><pubDate>Thu, 22 Sep 2005 14:47:00 GMT</pubDate><guid>http://www.cppblog.com/kb/archive/2005/09/22/396.html</guid><wfw:comment>http://www.cppblog.com/kb/comments/396.html</wfw:comment><comments>http://www.cppblog.com/kb/archive/2005/09/22/396.html#Feedback</comments><slash:comments>4</slash:comments><wfw:commentRss>http://www.cppblog.com/kb/comments/commentRss/396.html</wfw:commentRss><trackback:ping>http://www.cppblog.com/kb/services/trackbacks/396.html</trackback:ping><description><![CDATA[<span style="font-family: Verdana; font-size: 12px;"><br>
std::wfstream is defined as:<br>
</span><span style="color: rgb(0, 0, 153);font-family: Verdana; font-size: 12px;">typedef</span><span style="font-family: Verdana; font-size: 12px;"> basic_fstream&lt;<span style="color: rgb(0, 0, 153);">wchar_t</span>, char_traits&lt;<span style="color: rgb(0, 0, 153);">wchar_t</span>&gt; &gt; wfstream;<br>
When reading a character:<br>
wfstream wfile( "wcharfile.txt" );<br>
<span style="color: rgb(0, 0, 153);">wchar_t</span> wch = wfile.get();<br>
Semantically, this should read one wide character's worth of content (two bytes where wchar_t is 16 bits). But checking the output shows that it reads only one byte, so how is it any different from fstream?<br>
So how exactly should wide-character streams be used when processing Unicode-encoded files?</span><img src ="http://www.cppblog.com/kb/aggbug/396.html" width = "1" height = "1" /><br><br><div align=right><a style="text-decoration:none;" href="http://www.cppblog.com/kb/" target="_blank">可冰</a> 2005-09-22 22:47 <a href="http://www.cppblog.com/kb/archive/2005/09/22/396.html#Feedback" target="_blank" style="text-decoration:none;">Post a comment</a></div>]]></description></item><item><title>Several encodings of the string "这是一个UTF-8格式的文档!" ("This is a UTF-8 document!")</title><link>http://www.cppblog.com/kb/archive/2005/09/20/343.html</link><dc:creator>可冰</dc:creator><author>可冰</author><pubDate>Tue, 20 Sep 2005 12:39:00 GMT</pubDate><guid>http://www.cppblog.com/kb/archive/2005/09/20/343.html</guid><wfw:comment>http://www.cppblog.com/kb/comments/343.html</wfw:comment><comments>http://www.cppblog.com/kb/archive/2005/09/20/343.html#Feedback</comments><slash:comments>0</slash:comments><wfw:commentRss>http://www.cppblog.com/kb/comments/commentRss/343.html</wfw:commentRss><trackback:ping>http://www.cppblog.com/kb/services/trackbacks/343.html</trackback:ping><description><![CDATA[<p class="box"><img src="http://www.cppblog.com/images/cppblog_com/kb/58/r_charcode.gif">
</p><img src ="http://www.cppblog.com/kb/aggbug/343.html" width = "1" height = "1" /><br><br><div align=right><a style="text-decoration:none;" href="http://www.cppblog.com/kb/" target="_blank">可冰</a> 2005-09-20 20:39 <a href="http://www.cppblog.com/kb/archive/2005/09/20/343.html#Feedback" target="_blank" style="text-decoration:none;">Post a comment</a></div>]]></description></item><item><title>A summary of the UTF-8 encoding format</title><link>http://www.cppblog.com/kb/archive/2005/09/19/320.html</link><dc:creator>可冰</dc:creator><author>可冰</author><pubDate>Mon, 19 Sep 2005 12:03:00 GMT</pubDate><guid>http://www.cppblog.com/kb/archive/2005/09/19/320.html</guid><wfw:comment>http://www.cppblog.com/kb/comments/320.html</wfw:comment><comments>http://www.cppblog.com/kb/archive/2005/09/19/320.html#Feedback</comments><slash:comments>3</slash:comments><wfw:commentRss>http://www.cppblog.com/kb/comments/commentRss/320.html</wfw:commentRss><trackback:ping>http://www.cppblog.com/kb/services/trackbacks/320.html</trackback:ping><description><![CDATA[

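<font style="font-family: Verdana; font-size: 9pt;" color="#009966">[The following is only my personal summary; if anything is wrong, please correct me. Thank you!]<br></font>
<font style="font-family: Verdana; font-size: 12px;">
The following byte sequences are used to represent a character. Which sequence is used depends on the character's Unicode code point.<br>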
<table style="font-size: 12px; font-family: Verdana;" align="center" border="1">
  <tbody><tr>
    <td width="192">U+00000000 - U+0000007F: </td>
    <td width="438">0 <em>xxxxxxx </em></td>
    <td width="278">0x - 7x</td>
    <td width="37">&nbsp;</td>
  </tr>
  <tr>
    <td>U+00000080 - U+000007FF: </td>
    <td>110 <em>xxxxx </em> 10 <em>xxxxxx </em></td>
    <td>Cx 8x - Dx Bx </td>
    <td>&nbsp;</td>
  </tr>
  <tr>
    <td>U+00000800 - U+0000FFFF: </td>
    <td>1110 <em>xxxx </em> 10 <em>xxxxxx </em> 10 <em>xxxxxx </em></td>
    <td>Ex 8x 8x - Ex Bx Bx</td>
    <td>&nbsp;</td>
  </tr>
  <tr>
    <td height="17">U+00010000 - U+001FFFFF: </td>
    <td>11110 <em>xxx </em> 10 <em>xxxxxx </em> 10 <em>xxxxxx </em> 10 <em>xxxxxx </em></td>
    <td>F0 8x 8x 8x - F7 Bx Bx Bx </td>
    <td rowspan="3">Rarely used</td>
  </tr>
  <tr>
    <td>U+00200000 - U+03FFFFFF: </td>
    <td>111110 <em>xx </em> 10 <em>xxxxxx </em> 10 <em>xxxxxx </em> 10 <em>xxxxxx </em> 10 <em>xxxxxx </em></td>
    <td>F8 8x 8x 8x 8x - FB Bx Bx Bx Bx </td>
  </tr>
  <tr>
    <td>U+04000000 - U+7FFFFFFF: </td>
    <td>1111110 <em>x </em> 10 <em>xxxxxx </em> 10 <em>xxxxxx </em> 10 <em>xxxxxx </em> 10 <em>xxxxxx </em> 10 <em>xxxxxx </em></td>
    <td>FC 8x 8x 8x 8x 8x - FD Bx Bx Bx Bx Bx </td>
  </tr>
</tbody></table>
</font>
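<p><font style="font-family: Verdana; font-size: 9pt;">As a rough illustration of the table above, the sequence length can be read off the first byte alone. This is only a sketch: <code>utf8_sequence_length</code> is a hypothetical helper name, not something from the code discussed in these posts, and it follows the original six-byte scheme shown in the table.</font></p>

```cpp
#include <cstdint>

// Returns the sequence length (1 to 6) announced by a lead byte under the
// six-byte scheme tabulated above, or 0 if the byte cannot start a character
// (a continuation byte 0x80-0xBF, or 0xFE / 0xFF).
// utf8_sequence_length is a hypothetical name chosen for this sketch.
inline int utf8_sequence_length(std::uint8_t lead) {
    if (lead < 0x80) return 1;  // 0xxxxxxx: single-byte (ASCII) character
    if (lead < 0xC0) return 0;  // 10xxxxxx: continuation byte, not a start
    if (lead < 0xE0) return 2;  // 110xxxxx
    if (lead < 0xF0) return 3;  // 1110xxxx
    if (lead < 0xF8) return 4;  // 11110xxx
    if (lead < 0xFC) return 5;  // 111110xx
    if (lead < 0xFE) return 6;  // 1111110x
    return 0;                   // 0xFE and 0xFF never occur in UTF-8
}
```

<p><font style="font-family: Verdana; font-size: 9pt;">A caller would typically treat a return value of 0 as "not at a character boundary" and handle it as an error or skip the byte.</font></p>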
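<p><font style="font-family: Verdana; font-size: 9pt;"><br>
  * The bytes <span style="color: rgb(153, 0, 0); font-weight: bold;">0xFE and 0xFF</span> never appear in UTF-8 encoded data.<br>
  * Every byte after the first is in the range
      <span style="color: rgb(153, 0, 0); font-weight: bold;">0x80 to 0xBF</span>. The start of each multi-byte character can be identified by a lead byte in 0xC0-0xDF, 0xE0-0xEF, 0xF0-0xF7 and so on (check the high-order bits of the byte); a byte in 0x00-0x7F is a single-byte character. Any byte in <span style="color: rgb(153, 0, 0); font-weight: bold;">0x80 to 0xBF</span> is a continuation byte and must be skipped when counting characters.
  
  <br>
* Unicode is a coded character set: it only assigns each character a number (Unicode actually does somewhat more, such as providing algorithms for comparison, display and so on);<br>
whereas UTF-8 is an encoding form: it defines how those numbers are represented and stored as bytes.
<br>
* Converting UTF-8 to a Unicode code point: strip all the marker bits, concatenate the remaining bits, and pad with zeros on the high-order side up to 32 bits.<br>
* Converting a Unicode code point to UTF-8: starting from the low-order end, take 6 bits at a time and prefix each group with the two bits 10; the remaining high-order bits (leading zeros not counted) go into the first byte, prefixed with the marker that matches the sequence length: 0, 110, 1110 and so on.<br>
* Note: since RFC 3629 (2003), UTF-8 is restricted to U+0000 - U+10FFFF, so sequences longer than four bytes are no longer valid; the 5- and 6-byte rows in the table describe the original ISO 10646 scheme.</font></p>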
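<p><font style="font-family: Verdana; font-size: 9pt;">The UTF-8 to Unicode rule and the character-counting rule in the notes above can be sketched roughly as follows. This is a minimal sketch, not the code from these posts: <code>decode_utf8</code> and <code>count_utf8_chars</code> are hypothetical names, the decoder follows the six-byte table above, and it does not reject overlong forms or surrogate values.</font></p>

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Decodes the sequence starting at s[pos] by stripping the marker bits and
// concatenating the remaining payload bits. On success it stores the code
// point in `out`, advances `pos` past the sequence and returns true.
// It returns false for a stray continuation byte, for 0xFE / 0xFF, and for a
// truncated or malformed sequence. Overlong forms are NOT rejected here.
inline bool decode_utf8(const std::string& s, std::size_t& pos, std::uint32_t& out) {
    if (pos >= s.size()) return false;
    const std::uint8_t lead = static_cast<std::uint8_t>(s[pos]);
    std::size_t len = 0;
    std::uint32_t cp = 0;
    if      (lead < 0x80) { len = 1; cp = lead; }         // 0xxxxxxx
    else if (lead < 0xC0) { return false; }               // continuation byte
    else if (lead < 0xE0) { len = 2; cp = lead & 0x1F; }  // 110xxxxx
    else if (lead < 0xF0) { len = 3; cp = lead & 0x0F; }  // 1110xxxx
    else if (lead < 0xF8) { len = 4; cp = lead & 0x07; }  // 11110xxx
    else if (lead < 0xFC) { len = 5; cp = lead & 0x03; }  // 111110xx
    else if (lead < 0xFE) { len = 6; cp = lead & 0x01; }  // 1111110x
    else                  { return false; }               // 0xFE, 0xFF
    if (s.size() - pos < len) return false;               // truncated input
    for (std::size_t i = 1; i < len; ++i) {
        const std::uint8_t b = static_cast<std::uint8_t>(s[pos + i]);
        if ((b & 0xC0) != 0x80) return false;             // must be 10xxxxxx
        cp = (cp << 6) | (b & 0x3F);                      // append 6 payload bits
    }
    out = cp;
    pos += len;
    return true;
}

// Counts characters by skipping every continuation byte (0x80 to 0xBF).
inline std::size_t count_utf8_chars(const std::string& s) {
    std::size_t n = 0;
    for (unsigned char c : s) {
        if ((c & 0xC0) != 0x80) ++n;
    }
    return n;
}
```

<p><font style="font-family: Verdana; font-size: 9pt;">Calling <code>decode_utf8</code> in a loop until <code>pos</code> reaches the end of the string yields the code points one by one; a production decoder would also check that each value uses the shortest possible form.</font></p>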
<img src ="http://www.cppblog.com/kb/aggbug/320.html" width = "1" height = "1" /><br><br><div align=right><a style="text-decoration:none;" href="http://www.cppblog.com/kb/" target="_blank">可冰</a> 2005-09-19 20:03 <a href="http://www.cppblog.com/kb/archive/2005/09/19/320.html#Feedback" target="_blank" style="text-decoration:none;">Post a comment</a></div>]]></description></item><item><title>UTF types</title><link>http://www.cppblog.com/kb/archive/2005/09/19/312.html</link><dc:creator>可冰</dc:creator><author>可冰</author><pubDate>Mon, 19 Sep 2005 07:38:00 GMT</pubDate><guid>http://www.cppblog.com/kb/archive/2005/09/19/312.html</guid><wfw:comment>http://www.cppblog.com/kb/comments/312.html</wfw:comment><comments>http://www.cppblog.com/kb/archive/2005/09/19/312.html#Feedback</comments><slash:comments>0</slash:comments><wfw:commentRss>http://www.cppblog.com/kb/comments/commentRss/312.html</wfw:commentRss><trackback:ping>http://www.cppblog.com/kb/services/trackbacks/312.html</trackback:ping><description><![CDATA[<table style="font-family: Verdana; font-size: 12px;" border="1" cellpadding="2">
  <tbody><tr> 
    <th align="left">UTF</th>
    <th align="left">Formats</th>
    <th colspan="2" align="left">Estimated average storage required per page (3000 
      characters)</th>
  </tr>
  <tr> 
    <th align="center">UTF-8</th>
    <td> 
      <p align="left"><img src="http://icu.sourceforge.net/docs/papers/forms_of_unicode/images/8S.gif" border="0" height="26" width="26"><br>
        <img src="http://icu.sourceforge.net/docs/papers/forms_of_unicode/images/8L.gif" border="0" height="26" width="26"><img src="http://icu.sourceforge.net/docs/papers/forms_of_unicode/images/8T.gif" border="0" height="26" width="26"><br>
        <img src="http://icu.sourceforge.net/docs/papers/forms_of_unicode/images/8L3.gif" border="0" height="26" width="26"><img src="http://icu.sourceforge.net/docs/papers/forms_of_unicode/images/8T.gif" border="0" height="26" width="26"><img src="http://icu.sourceforge.net/docs/papers/forms_of_unicode/images/8T.gif" border="0" height="26" width="26"><br>
        <img src="http://icu.sourceforge.net/docs/papers/forms_of_unicode/images/8L4.gif" border="0" height="26" width="26"><img src="http://icu.sourceforge.net/docs/papers/forms_of_unicode/images/8T.gif" border="0" height="26" width="26"><img src="http://icu.sourceforge.net/docs/papers/forms_of_unicode/images/8T.gif" border="0" height="26" width="26"><img src="http://icu.sourceforge.net/docs/papers/forms_of_unicode/images/8T.gif" border="0" height="26" width="26"></p>
    </td>
    <td align="center">3 KB<br>
      (1999) 
      <hr>
      5 KB<br>
      (2003)</td>
    <td>On average, English takes slightly over one unit per code point. Most 
      Latin-script languages take about 1.1 bytes. Greek, Russian, Arabic and 
      Hebrew take about 1.7 bytes, and most others (including Japanese, Chinese, 
      Korean and Hindi) take about 3 bytes. Characters in surrogate space take 
      4 bytes, but as a proportion of all world text they will always be very 
      rare.</td>
  </tr>
  <tr> 
    <th align="center">UTF-16</th>
    <td> 
      <p align="left"><img src="http://icu.sourceforge.net/docs/papers/forms_of_unicode/images/16S.gif" border="0" height="26" width="52"><br>
        <img src="http://icu.sourceforge.net/docs/papers/forms_of_unicode/images/16L.gif" border="0" height="26" width="52"><img src="http://icu.sourceforge.net/docs/papers/forms_of_unicode/images/16T.gif" border="0" height="26" width="59"></p>
    </td>
    <td align="center">6 KB</td>
    <td>All of the most common characters in use for all modern writing systems 
      are already represented with 2 bytes. Characters in surrogate space take 
      4 bytes, but as a proportion of all world text they will always be very 
      rare.</td>
  </tr>
  <tr> 
    <th align="center">UTF-32</th>
    <td> 
      <p align="left"><img src="http://icu.sourceforge.net/docs/papers/forms_of_unicode/images/32S.gif" border="0" height="26" width="111"></p>
    </td>
    <td align="center">12 KB</td>
    <td>All take 4 bytes</td>
  </tr>
</tbody></table>
<font style="font-family: Verdana; font-size: 12px;">
<p>[Source: <a href="http://icu.sourceforge.net/docs/papers/forms_of_unicode/">http://icu.sourceforge.net/docs/papers/forms_of_unicode/</a>]</p>
</font>
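<p><font size="2">The per-character sizes in the table above can be reproduced with a few small functions. This is only an illustrative sketch that is not taken from the cited paper: the function names are hypothetical, and code points are assumed to be at most U+10FFFF.</font></p>

```cpp
#include <cstddef>
#include <cstdint>

// Bytes needed to store one code point (assumed <= U+10FFFF) in each form.
// All three function names are hypothetical, chosen for this sketch.
inline std::size_t utf8_bytes(std::uint32_t cp) {
    if (cp < 0x80)    return 1;  // ASCII
    if (cp < 0x800)   return 2;  // e.g. Greek, Russian, Arabic, Hebrew
    if (cp < 0x10000) return 3;  // rest of the BMP, e.g. Chinese, Japanese
    return 4;                    // supplementary planes
}

inline std::size_t utf16_bytes(std::uint32_t cp) {
    return cp < 0x10000 ? 2 : 4; // one 16-bit unit, or a surrogate pair
}

inline std::size_t utf32_bytes(std::uint32_t /*cp*/) {
    return 4;                    // always one 32-bit unit
}
```

<p><font size="2">For a page of 3000 BMP characters this gives 3000 * 2 = 6000 bytes in UTF-16 and 3000 * 4 = 12000 bytes in UTF-32, in line with the 6 KB and 12 KB figures in the table; the UTF-8 figure depends on the mix of scripts.</font></p>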

<br>

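<p><font size="2">UTF-8 (ISO 10646-1) has the following properties:
</font></p><ul>
<li><font size="2">UCS characters <span style="color: red;">U+0000 to U+007F</span> (ASCII) are encoded as the <span style="color: red;">bytes 0x00 to 0x7F</span> (ASCII compatibility). This means that a file containing only 7-bit ASCII characters is identical under the ASCII and UTF-8 encodings.</font></li>
<li><font size="2">All UCS characters <span style="color: red;">&gt; U+007F</span> are encoded as a sequence of several bytes, each of which has its most significant bit set. Therefore an ASCII byte (0x00-0x7F) can never appear as part of any other character.</font></li>
<li><font size="2">The <span style="color: red;">first byte</span> of a multi-byte sequence representing a non-ASCII character is always in the range <span style="color: red;">0xC0 to 0xFD</span> and indicates how many bytes the character occupies. All <span style="color: red;">further bytes</span> of the sequence are in the range <span style="color: red;">0x80 to 0xBF</span>. This makes resynchronization very easy, and makes the encoding stateless and robust against missing bytes.</font></li>
<li><font size="2">All possible 2<sup>31</sup> UCS codes can be encoded.</font></li>
<li><font size="2">UTF-8 encoded characters may in theory be up to 6 bytes long; however, 16-bit BMP characters are at most 3 bytes long.</font></li>
<li><font size="2">The sorting order of big-endian UCS-4 byte strings is preserved.</font></li>
<li><font size="2">The bytes <span style="color: red;">0xFE and 0xFF</span> are never used in the UTF-8 encoding.</font></li></ul>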
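<p><font size="2">The resynchronization property in the list above can be illustrated with a short sketch: from any byte offset, stepping back over continuation bytes reaches the start of the current character. <code>utf8_char_start</code> is a hypothetical helper name, and the caller is assumed to pass an index inside the string.</font></p>

```cpp
#include <cstddef>
#include <string>

// Given a byte index i (assumed < s.size()), backs up to the first byte of
// the character that contains it by skipping continuation bytes, which are
// exactly the bytes in 0x80-0xBF (bit pattern 10xxxxxx).
// utf8_char_start is a hypothetical name chosen for this sketch.
inline std::size_t utf8_char_start(const std::string& s, std::size_t i) {
    while (i > 0 && (static_cast<unsigned char>(s[i]) & 0xC0) == 0x80) {
        --i;
    }
    return i;
}
```

<p><font size="2">No state from earlier in the stream is needed, which is what makes the encoding stateless: a reader that lands in the middle of a character loses at most that one character.</font></p>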

<p><font size="2">The following byte sequences are used to represent a character. Which sequence is used depends on the character's Unicode code number.</font></p>
<div align="center"><center>

<table border="1">
  <tbody><tr>
    <td><font size="2">U-00000000 - U-0000007F: </font></td>
    <td><font size="2">0<i>xxxxxxx</i></font> </td>
  </tr>
  <tr>
    <td><font size="2">U-00000080 - U-000007FF: </font></td>
    <td><font size="2">110<i>xxxxx</i> 10<i>xxxxxx</i></font> </td>
  </tr>
  <tr>
    <td><font size="2">U-00000800 - U-0000FFFF: </font></td>
    <td><font size="2">1110<i>xxxx</i> 10<i>xxxxxx</i> 10<i>xxxxxx</i></font> </td>
  </tr>
  <tr>
    <td><font size="2">U-00010000 - U-001FFFFF: </font></td>
    <td><font size="2">11110<i>xxx</i> 10<i>xxxxxx</i> 10<i>xxxxxx</i> 10<i>xxxxxx</i></font> </td>
  </tr>
  <tr>
    <td><font size="2">U-00200000 - U-03FFFFFF: </font></td>
    <td><font size="2">111110<i>xx</i> 10<i>xxxxxx</i> 10<i>xxxxxx</i> 10<i>xxxxxx</i> 10<i>xxxxxx</i></font> </td>
  </tr>
  <tr>
    <td><font size="2">U-04000000 - U-7FFFFFFF: </font></td>
    <td><font size="2">1111110<i>x</i> 10<i>xxxxxx</i> 10<i>xxxxxx</i> 10<i>xxxxxx</i> 10<i>xxxxxx</i> 10<i>xxxxxx</i></font> 
    </td>
  </tr>
</tbody></table>
</center></div>

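<p><font size="2">The xxx bit positions are filled with the bits of the character code number in binary representation. The rightmost x is the least significant bit. 
Only the shortest multi-byte sequence that can represent the code number of a character may be used. 
Note that in a multi-byte sequence, the number of leading "1" bits in the first byte equals the number of bytes in the whole sequence.</font></p>

<p><font size="2"><strong>Example</strong>: the Unicode character U+00A9 = 1010 1001 (copyright sign) is encoded in UTF-8 
as:</font></p>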

<blockquote>
  <p><font size="2">11000010 10101001 = 0xC2 0xA9</font></p>
</blockquote>

<p><font size="2">and the character U+2260 = 0010 0010 0110 0000 (not equal to) is encoded as:</font></p>

<blockquote>
  <p><font size="2">11100010 10001001 10100000 = 0xE2 0x89 0xA0</font></p>
</blockquote>
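<p><font size="2">The two worked examples above can be checked with a small encoder. This is a sketch under stated assumptions, not code from the quoted source: <code>encode_utf8</code> is a hypothetical name, it only produces the 1- to 4-byte forms and returns an empty string for values above U+10FFFF, and it does not reject surrogate values.</font></p>

```cpp
#include <cstdint>
#include <string>

// Encodes one code point using the shortest form from the table above.
// Only the 1- to 4-byte forms are produced; values above U+10FFFF yield an
// empty string. Surrogate values (U+D800 to U+DFFF) are NOT rejected here.
// encode_utf8 is a hypothetical name chosen for this sketch.
inline std::string encode_utf8(std::uint32_t cp) {
    std::string out;
    if (cp < 0x80) {                                        // 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {                                // 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {                              // 1110xxxx + 2 more
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp <= 0x10FFFF) {                            // 11110xxx + 3 more
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

<p><font size="2">With this sketch, <code>encode_utf8(0x00A9)</code> yields the bytes 0xC2 0xA9 and <code>encode_utf8(0x2260)</code> yields 0xE2 0x89 0xA0, matching the examples above.</font></p>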

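<p><font size="2">The official name and spelling of this encoding is UTF-8, where UTF stands for <strong>U</strong>CS <strong>T</strong>ransformation 
<strong>F</strong>ormat. Please do not refer to UTF-8 by any other name (such as utf8 or UTF_8) in documentation, 
unless of course you mean a variable name rather than the encoding itself.</font></p>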

<h2 style="font-weight: normal;"><font size="2">Which programming languages support Unicode?</font></h2>

<p><font size="2">Most modern programming languages developed after about 1993 
have a dedicated data type for 
Unicode/ISO 10646-1 characters. In Ada95 it is called Wide_Character; in Java it is called char.</font></p>

<p><font size="2">ISO C also specifies mechanisms for handling multi-byte encodings and wide characters, 
and more were added when <a href="http://www.lysator.liu.se/c/na1.html">Amendment 1 to ISO C</a> 
was published in September 1994. These mechanisms were designed mainly with the various East Asian encodings in mind, 
and they are rather more elaborate than what handling UCS requires. UTF-8 is an example of what the ISO C 
standard calls a multi-byte encoding, and the <em>wchar_t</em> 
type can be used to hold Unicode characters.<br>[Source: <a href="http://www.linuxforum.net/books/UTF-8-Unicode.html">http://www.linuxforum.net/books/UTF-8-Unicode.html</a>]</font></p><img src ="http://www.cppblog.com/kb/aggbug/312.html" width = "1" height = "1" /><br><br><div align=right><a style="text-decoration:none;" href="http://www.cppblog.com/kb/" target="_blank">可冰</a> 2005-09-19 15:38 <a href="http://www.cppblog.com/kb/archive/2005/09/19/312.html#Feedback" target="_blank" style="text-decoration:none;">Post a comment</a></div>]]></description></item><item><title>UTF serializations</title><link>http://www.cppblog.com/kb/archive/2005/09/19/310.html</link><dc:creator>可冰</dc:creator><author>可冰</author><pubDate>Mon, 19 Sep 2005 07:23:00 GMT</pubDate><guid>http://www.cppblog.com/kb/archive/2005/09/19/310.html</guid><wfw:comment>http://www.cppblog.com/kb/comments/310.html</wfw:comment><comments>http://www.cppblog.com/kb/archive/2005/09/19/310.html#Feedback</comments><slash:comments>0</slash:comments><wfw:commentRss>http://www.cppblog.com/kb/comments/commentRss/310.html</wfw:commentRss><trackback:ping>http://www.cppblog.com/kb/services/trackbacks/310.html</trackback:ping><description><![CDATA[<font style="font-family: Verdana;" size="2">
<a name="t2"><span class="atitle3"></span></a><br>
<table style="font-family: Verdana; font-size: 12px;" border="1" cellpadding="2" width="100%">
  <tbody><tr><th align="left" valign="top">UTF-8</th>
  <td valign="top"> 
    <ul><li>Initial <code>EF BB BF</code> is a signature, indicating that the rest 
        of the file is UTF-8.</li><li>Any <code>EF BF BE</code> is an error.</li><li>A real ZWNBSP at the start of a file requires a signature first.</li></ul>
  </td>
  </tr>
  <tr> 
    <th align="left" valign="top"><i>UTF-8N</i></th>
    <td valign="top"> 
      <ul><li>All of the text is normal UTF-8; there is no signature.</li><li>Initial <code>EF BB BF</code> is a ZWNBSP.</li><li>Any <code>EF BF BE</code> is an error.</li></ul>
    </td>
  </tr>
  <tr> 
    <th align="left" valign="top">UTF-16</th>
    <td valign="top"> 
      <ul><li>Initial <code>FE FF</code> is a signature indicating the rest of the 
          text is big endian UTF-16.</li><li>Initial <code>FF FE</code> is a signature indicating the rest of the 
          text is little endian UTF-16.</li><li>If neither of these are present, all of the text is big endian.</li><li>A real ZWNBSP at the start of a file requires a signature first.</li></ul>
    </td>
  </tr>
  <tr> 
    <th align="left" valign="top">UTF-16BE</th>
    <td valign="top"> 
      <ul><li>All of the text is big endian: there is no signature.</li><li>Initial <code>FE FF</code> is a ZWNBSP.</li><li>Any <code>FF FE</code> is an error.</li></ul>
    </td>
  </tr>
  <tr> 
    <th align="left" valign="top">UTF-16LE</th>
    <td valign="top"> 
      <ul><li>All of the text is little endian: there is no signature.</li><li>Initial <code>FF FE</code> is a ZWNBSP.</li><li>Any <code>FE FF</code> is an error.</li></ul>
    </td>
  </tr>
  <tr> 
    <th align="left" valign="top"><i>UTF-32</i></th>
    <td valign="top"> 
      <ul><li>Initial <code>00 00 FE FF</code> is a signature indicating the rest 
          of the text is big endian UTF-32.</li><li>Initial <code>FF FE 00 00</code> is a signature indicating the rest 
          of the text is little endian UTF-32.</li><li>If neither of these are present, all of the text is big endian.</li><li>A real ZWNBSP at the start of a file requires a signature first.</li></ul>
    </td>
  </tr>
  <tr> 
    <th align="left" valign="top"><i>UTF-32BE</i></th>
    <td valign="top"> 
      <ul><li>All of the text is big endian: there is no signature.</li><li>Initial <code>00 00 FE FF</code> is a ZWNBSP.</li><li>Any <code>FF FE 00 00</code> is an error.</li></ul>
    </td>
  </tr>
  <tr> 
    <th align="left" valign="top"><i>UTF-32LE</i></th>
    <td valign="top"> 
      <ul><li>All of the text is little endian: there is no signature.</li><li>Initial <code>FF FE 00 00</code> is a ZWNBSP.</li><li>Initial <code>00 00 FE FF</code> is an error.</li></ul>
    </td>
  </tr>
</tbody></table>
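<p>The signatures in the table above could be detected along the following lines. This is only a sketch that is not taken from the cited paper: the <code>Signature</code> enum and <code>detect_signature</code> are hypothetical names, and the function only inspects the leading bytes without validating the rest of the data.</p>

```cpp
#include <cstddef>
#include <string>

// Hypothetical names for this sketch; they do not come from the cited paper.
enum class Signature { None, Utf8, Utf16BE, Utf16LE, Utf32BE, Utf32LE };

// Looks only at the first bytes of `data` and reports which signature from
// the table above is present, if any.
inline Signature detect_signature(const std::string& data) {
    auto starts_with = [&data](const char* sig, std::size_t n) {
        return data.size() >= n && data.compare(0, n, sig, n) == 0;
    };
    // UTF-32 must be tested before UTF-16: FF FE 00 00 begins with FF FE.
    // (FF FE 00 00 could also be UTF-16LE text starting with U+0000; this
    // sketch resolves that ambiguity in favour of UTF-32.)
    if (starts_with("\x00\x00\xFE\xFF", 4)) return Signature::Utf32BE;
    if (starts_with("\xFF\xFE\x00\x00", 4)) return Signature::Utf32LE;
    if (starts_with("\xEF\xBB\xBF", 3))     return Signature::Utf8;
    if (starts_with("\xFE\xFF", 2))         return Signature::Utf16BE;
    if (starts_with("\xFF\xFE", 2))         return Signature::Utf16LE;
    return Signature::None;
}
```

<p>When no signature is found, the caller still has to choose a default; per the table, unsigned UTF-16 and UTF-32 text is taken to be big endian.</p>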

<blockquote> <i> <b>Note: </b>The italicized names are not yet registered, but 
  are useful for reference.</i> </blockquote>
[from: <a href="http://icu.sourceforge.net/docs/papers/forms_of_unicode/">http://icu.sourceforge.net/docs/papers/forms_of_unicode/</a>]</font><img src ="http://www.cppblog.com/kb/aggbug/310.html" width = "1" height = "1" /><br><br><div align=right><a style="text-decoration:none;" href="http://www.cppblog.com/kb/" target="_blank">可冰</a> 2005-09-19 15:23 <a href="http://www.cppblog.com/kb/archive/2005/09/19/310.html#Feedback" target="_blank" style="text-decoration:none;">Post a comment</a></div>]]></description></item></channel></rss>