﻿<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:trackback="http://madskills.com/public/xml/rss/module/trackback/" xmlns:wfw="http://wellformedweb.org/CommentAPI/" xmlns:slash="http://purl.org/rss/1.0/modules/slash/"><channel><title>C++博客-woaidongmao-随笔分类-字符编码</title><link>http://www.cppblog.com/woaidongmao/category/8755.html</link><description>文章均收录自他人博客，但不喜标题前加-[转贴]，因其丑陋，见谅！~</description><language>zh-cn</language><lastBuildDate>Thu, 10 Sep 2009 18:47:18 GMT</lastBuildDate><pubDate>Thu, 10 Sep 2009 18:47:18 GMT</pubDate><ttl>60</ttl><item><title>怎样学习使用libiconv库</title><link>http://www.cppblog.com/woaidongmao/archive/2009/09/10/95869.html</link><dc:creator>肥仔</dc:creator><author>肥仔</author><pubDate>Thu, 10 Sep 2009 15:52:00 GMT</pubDate><guid>http://www.cppblog.com/woaidongmao/archive/2009/09/10/95869.html</guid><wfw:comment>http://www.cppblog.com/woaidongmao/comments/95869.html</wfw:comment><comments>http://www.cppblog.com/woaidongmao/archive/2009/09/10/95869.html#Feedback</comments><slash:comments>0</slash:comments><wfw:commentRss>http://www.cppblog.com/woaidongmao/comments/commentRss/95869.html</wfw:commentRss><trackback:ping>http://www.cppblog.com/woaidongmao/services/trackbacks/95869.html</trackback:ping><description><![CDATA[<p class="MsoNormal" style="margin-bottom: 12pt; line-height: 150%"><b><span lang="EN-US" style="font-size: 12pt; color: black; line-height: 150%; font-family: 宋体; mso-bidi-font-family: arial">&nbsp;&nbsp;&nbsp; libiconv</span></b><b><span style="font-size: 12pt; color: black; line-height: 150%; font-family: 宋体; mso-bidi-font-family: arial">库</span></b><span style="font-size: 12pt; color: black; line-height: 150%; font-family: 宋体; mso-bidi-font-family: arial">是一个基于<span lang="EN-US">GNU</span>协议的开源库，主要是解决多语言编码处理转换等应用问题。<span lang="EN-US"><br>&nbsp;&nbsp;&nbsp; </span>怎样学习使用<span 
lang="EN-US">libiconv</span>库？ If you are new to the library, this article is worth a read. If you have already used it and ran into problems along the way, we can work through them together; I can be reached at <span lang="EN-US"><a href="mailto:cnangel@gmail.com"><span style="color: black">cnangel@gmail.com</span></a></span>.<span lang="EN-US"><?xml:namespace prefix = o /><o:p></o:p></span></span></p>
<p class="MsoNormal" style="line-height: 150%"><span lang="EN-US" style="font-size: 12pt; color: black; line-height: 150%; font-family: 宋体; mso-bidi-font-family: arial">&nbsp;&nbsp;&nbsp; The three function prototypes:</span></p>
<p class="MsoNormal" style="line-height: 150%"><span lang="EN-US" style="font-size: 12pt; color: black; line-height: 150%; font-family: 宋体; mso-bidi-font-family: arial">iconv_t iconv_open(const char *tocode, const char *fromcode);<br>size_t iconv(iconv_t cd, char **inbuf, size_t *inbytesleft, char **outbuf, size_t *outbytesleft);<br>int iconv_close(iconv_t cd);</span></p>
<p class="MsoNormal" style="line-height: 150%"><span lang="EN-US" style="font-size: 12pt; color: black; line-height: 150%; font-family: 宋体; mso-bidi-font-family: arial">&nbsp;&nbsp;&nbsp; Here:<br>iconv_open opens a conversion descriptor, much like opening a pipe (channel) between two encodings; on failure it returns (iconv_t)-1 and sets errno.<br>iconv converts the actual input; on failure it returns (size_t)-1 and sets errno, otherwise it returns the number of characters it converted in a non-reversible way (0 for a clean conversion).<br>iconv_close closes the descriptor; it returns 0 on success and -1 on failure.<br>&nbsp;&nbsp;&nbsp; An example:</span></p>
<p class="MsoNormal" style="line-height: 150%"><span lang="EN-US" style="font-size: 12pt; color: black; line-height: 150%; font-family: 宋体; mso-bidi-font-family: arial">#include &lt;stdio.h&gt;<br>#include &lt;string.h&gt;<br>#include &lt;stdlib.h&gt;<br>#include 
&lt;iconv.h&gt;<br><br>
#define OUTLEN 255<br>
int covert(const char *, const char *, char *, size_t, char *, size_t);<br><br>
int main(int argc, char *argv[])<br>
{<br>
&nbsp;&nbsp;&nbsp; /* Assumption: this file is saved as GBK, so the literal holds GBK bytes. */<br>
&nbsp;&nbsp;&nbsp; char *input = "中国";<br>
&nbsp;&nbsp;&nbsp; size_t len = strlen(input);<br>
&nbsp;&nbsp;&nbsp; char *output = (char *)malloc(OUTLEN);<br>
&nbsp;&nbsp;&nbsp; if (output == NULL) return 1;<br>
&nbsp;&nbsp;&nbsp; /* tocode comes first: this converts from GBK to UTF-8. */<br>
&nbsp;&nbsp;&nbsp; if (covert("UTF-8", "GBK", input, len, output, OUTLEN) != 0)<br>
&nbsp;&nbsp;&nbsp; {<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; free(output);<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; return 1;<br>
&nbsp;&nbsp;&nbsp; }<br>
&nbsp;&nbsp;&nbsp; printf("%s\n", output);<br>
&nbsp;&nbsp;&nbsp; free(output);<br>
&nbsp;&nbsp;&nbsp; return 0;<br>
}<br><br>
int covert(const char *desc, const char *src, char *input, size_t ilen, char *output, size_t olen)<br>
{<br>
&nbsp;&nbsp;&nbsp; /* input and output are this function's own copies, so iconv may advance them. */<br>
&nbsp;&nbsp;&nbsp; char **pin = &amp;input;<br>
&nbsp;&nbsp;&nbsp; char **pout = &amp;output;<br>
&nbsp;&nbsp;&nbsp; size_t ret;<br>
&nbsp;&nbsp;&nbsp; iconv_t cd;<br>
&nbsp;&nbsp;&nbsp; if (olen == 0) return -1;<br>
&nbsp;&nbsp;&nbsp; cd = iconv_open(desc, src);<br>
&nbsp;&nbsp;&nbsp; if (cd == (iconv_t)-1)<br>
&nbsp;&nbsp;&nbsp; {<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; return -1;<br>
&nbsp;&nbsp;&nbsp; }<br>
&nbsp;&nbsp;&nbsp; memset(output, 0, olen);<br>
&nbsp;&nbsp;&nbsp; olen -= 1; /* keep the last byte as the terminating NUL */<br>
&nbsp;&nbsp;&nbsp; ret = iconv(cd, pin, &amp;ilen, pout, &amp;olen);<br>
&nbsp;&nbsp;&nbsp; iconv_close(cd); /* close the descriptor on the error path too */<br>
&nbsp;&nbsp;&nbsp; return ret == (size_t)-1 ? -1 : 0;<br>
}</span></p>
<p class="MsoNormal" style="line-height: 150%"><span lang="EN-US" style="font-size: 12pt; color: black; line-height: 150%; font-family: 宋体; mso-bidi-font-family: arial">&nbsp;&nbsp;&nbsp; The covert function above performs the conversion. The points to watch are the arguments passed to iconv:<br>
1. iconv takes 5 arguments.<br>
2. The 3rd and 5th arguments are pointers to byte counts: the number of input bytes still to be converted and the number of bytes still free in the output buffer. For a NUL-terminated input the first is normally strlen(string); the second is the allocated size of the output buffer. iconv decreases both as it works.<br>
3. The 2nd and 4th arguments must not be the addresses of the pointers that own the buffers, because iconv advances the pointers it is given; pass the address of a copy of each pointer instead.<br>
&nbsp;&nbsp;&nbsp; For large amounts of data covert is a poor fit. First, memory limits may rule out converting everything in one call. Second, calling it many times opens and closes a conversion descriptor on every call, which wastes resources. It is better to split the function up and use the parts as the situation requires.<br>
&nbsp;&nbsp;&nbsp; Some supplementary code follows.<br>
translateSP.h:</span></p>
<p class="MsoNormal" style="line-height: 150%"><span lang="EN-US" style="font-size: 12pt; color: black; line-height: 150%; font-family: 宋体; mso-bidi-font-family: arial"> #ifndef TRANSLATESP_H_<br>
 #define TRANSLATESP_H_<br>
 #include &lt;stdio.h&gt;<br>
 #include &lt;iconv.h&gt;<br>
&nbsp;<br>
 class TranslateSP<br>
 {<br>
&nbsp;&nbsp;&nbsp;&nbsp; public:<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; // (iconv_t)-1 marks "no descriptor open"<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; TranslateSP():i_cd((iconv_t)-1){}<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; TranslateSP(const char *from_charset,const char *to_charset)<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; {<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; i_cd = iconv_open(to_charset, from_charset);<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; if ((iconv_t)-1 == i_cd) printf("iconv open error!\n");<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; }<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; ~TranslateSP()<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; {<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; // only close a descriptor that was opened successfully<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; if (i_cd != (iconv_t)-1)<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; iconv_close(i_cd);<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; }<br>
&nbsp;<br>
&nbsp;&nbsp;&nbsp;&nbsp; public:<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; size_t translate(char *src, size_t srcLen, char *desc, size_t descLen);<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; size_t convert(const char *from_charset, const char *to_charset, 
<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; char *src, size_t srcLen, char *desc, size_t descLen);<br>
&nbsp;<br>
&nbsp;&nbsp;&nbsp;&nbsp; private:<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; iconv_t i_cd;<br>
 };<br>
&nbsp;<br>
 #endif</span></p>
<p class="MsoNormal" style="line-height: 150%"><span lang="EN-US" style="font-size: 12pt; color: black; line-height: 150%; font-family: 宋体; mso-bidi-font-family: arial">translateSP.cpp:</span></p>
<p class="MsoNormal" style="line-height: 150%"><span lang="EN-US" style="font-size: 12pt; color: black; line-height: 150%; font-family: 宋体; mso-bidi-font-family: arial"> #include &lt;stdio.h&gt;<br>
 #include &lt;string.h&gt;<br>
 #include "translateSP.h"<br>
&nbsp;<br>
 #define MAX_LEN 200<br>
&nbsp;<br>
 // Converts with the descriptor opened by the constructor; returns (size_t)-1 on error.<br>
 size_t TranslateSP::translate(char *src, size_t srcLen, char *desc, size_t descLen)<br>
 {<br>
&nbsp;&nbsp;&nbsp;&nbsp; char **inbuf = &amp;src;<br>
&nbsp;&nbsp;&nbsp;&nbsp; char **outbuf = &amp;desc;<br>
&nbsp;&nbsp;&nbsp;&nbsp; if (descLen == 0) return (size_t)-1;<br>
&nbsp;&nbsp;&nbsp;&nbsp; memset(desc, 0, descLen);<br>
&nbsp;&nbsp;&nbsp;&nbsp; descLen -= 1; // keep the last byte as the terminating NUL<br>
&nbsp;&nbsp;&nbsp;&nbsp; return iconv(i_cd, inbuf, &amp;srcLen, outbuf, &amp;descLen);<br>
 }<br>
&nbsp;<br>
 // One-shot conversion: opens and closes its own descriptor on every call.<br>
 size_t TranslateSP::convert(const char *from_charset, const char *to_charset, <br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; char *src, size_t srcLen, char *desc, size_t descLen)<br>
 {<br>
&nbsp;&nbsp;&nbsp;&nbsp; char **inbuf = &amp;src;<br>
&nbsp;&nbsp;&nbsp;&nbsp; char **outbuf = &amp;desc;<br>
&nbsp;&nbsp;&nbsp;&nbsp; if (descLen == 0) return (size_t)-1;<br>
&nbsp;&nbsp;&nbsp;&nbsp; iconv_t cd = iconv_open(to_charset, from_charset);<br>
&nbsp;&nbsp;&nbsp;&nbsp; if ((iconv_t)-1 == cd) return (size_t)-1;<br>
&nbsp;&nbsp;&nbsp;&nbsp; memset(desc, 0, descLen);<br>
&nbsp;&nbsp;&nbsp;&nbsp; descLen -= 1; // keep the last byte as the terminating NUL<br>
&nbsp;&nbsp;&nbsp;&nbsp; size_t n = iconv(cd, inbuf, &amp;srcLen, outbuf, &amp;descLen);<br>
&nbsp;&nbsp;&nbsp;&nbsp; iconv_close(cd);<br>
&nbsp;&nbsp;&nbsp;&nbsp; return n;<br>
 }<br>
&nbsp;<br>
 int main(int argc, char 
*argv[])<br> {<br>
&nbsp;&nbsp;&nbsp;&nbsp; // Assumption: this file is saved as UTF-8, so the literals hold UTF-8 bytes.<br>
&nbsp;&nbsp;&nbsp;&nbsp; // Arrays are used instead of char * because iconv takes non-const pointers.<br>
&nbsp;&nbsp;&nbsp;&nbsp; char str[] = "我爱zhong国! ％＃＠＃";<br>
&nbsp;&nbsp;&nbsp;&nbsp; char str1[] = "i大量需要转换的编码";<br>
&nbsp;&nbsp;&nbsp;&nbsp; char str2[] = "函数就是用于将hello进行转换";<br>
&nbsp;&nbsp;&nbsp;&nbsp; char newstr[MAX_LEN];<br>
&nbsp;&nbsp;&nbsp;&nbsp; TranslateSP tsp;<br>
&nbsp;&nbsp;&nbsp;&nbsp; // one-shot conversion from UTF-8 to GBK<br>
&nbsp;&nbsp;&nbsp;&nbsp; tsp.convert("utf-8", "gbk", str, strlen(str), newstr, MAX_LEN);<br>
&nbsp;&nbsp;&nbsp;&nbsp; printf("%s\n", newstr);<br>
&nbsp;&nbsp;&nbsp;&nbsp; // one descriptor reused for several conversions<br>
&nbsp;&nbsp;&nbsp;&nbsp; TranslateSP newtsp("UTF-8", "GBK");<br>
&nbsp;&nbsp;&nbsp;&nbsp; newtsp.translate(str1, strlen(str1), newstr, MAX_LEN);<br>
&nbsp;&nbsp;&nbsp;&nbsp; printf("%s\n", newstr);<br>
&nbsp;&nbsp;&nbsp;&nbsp; newtsp.translate(str2, strlen(str2), newstr, MAX_LEN);<br>
&nbsp;&nbsp;&nbsp;&nbsp; printf("%s\n", newstr);<br>
&nbsp;&nbsp;&nbsp;&nbsp; return 0;<br>
 }</span></p>
<p class="MsoNormal" style="line-height: 150%"><span lang="EN-US" style="font-size: 12pt; color: black; line-height: 150%; font-family: 宋体; mso-bidi-font-family: arial">Compile and run (add -liconv when iconv comes from a separate libiconv rather than from the C library):</span></p>
<p class="MsoNormal" style="line-height: 150%"><span lang="EN-US" style="font-size: 12pt; color: black; line-height: 150%; font-family: 宋体; mso-bidi-font-family: arial">g++ translateSP.cpp -o test<br>./test<br></span><span style="font-size: 12pt; color: black; line-height: 150%; font-family: 宋体; mso-bidi-font-family: arial">我爱<span lang="EN-US">zhong</span>国<span lang="EN-US">! 
</span>％＃＠＃<span lang="EN-US"><br>i</span>大量需要转换的编码<span lang="EN-US"><br></span>函数就是用于将<span lang="EN-US">hello</span>进行转换<span lang="EN-US"><o:p></o:p></span></span></p> <p class="MsoNormal" style="line-height: 150%"><span lang="EN-US" style="font-size: 12pt; color: black; line-height: 150%; font-family: 宋体; mso-bidi-font-family: arial">(</span><span style="font-size: 12pt; color: black; line-height: 150%; font-family: 宋体; mso-bidi-font-family: arial">以上输出是<span lang="EN-US">GBK</span>编码<span lang="EN-US">)<o:p></o:p></span></span></p> <p class="MsoNormal" style="line-height: 150%"><span lang="EN-US" style="font-size: 12pt; color: black; line-height: 150%; font-family: 宋体; mso-bidi-font-family: arial"><o:p>&nbsp;</o:p></span></p><img src ="http://www.cppblog.com/woaidongmao/aggbug/95869.html" width = "1" height = "1" /><br><br><div align=right><a style="text-decoration:none;" href="http://www.cppblog.com/woaidongmao/" target="_blank">肥仔</a> 2009-09-10 23:52 <a href="http://www.cppblog.com/woaidongmao/archive/2009/09/10/95869.html#Feedback" target="_blank" style="text-decoration:none;">发表评论</a></div>]]></description></item><item><title>unicode utf-8 gb18030 gb2312 gbk各种编码对比</title><link>http://www.cppblog.com/woaidongmao/archive/2009/09/10/95868.html</link><dc:creator>肥仔</dc:creator><author>肥仔</author><pubDate>Thu, 10 Sep 2009 15:42:00 GMT</pubDate><guid>http://www.cppblog.com/woaidongmao/archive/2009/09/10/95868.html</guid><wfw:comment>http://www.cppblog.com/woaidongmao/comments/95868.html</wfw:comment><comments>http://www.cppblog.com/woaidongmao/archive/2009/09/10/95868.html#Feedback</comments><slash:comments>0</slash:comments><wfw:commentRss>http://www.cppblog.com/woaidongmao/comments/commentRss/95868.html</wfw:commentRss><trackback:ping>http://www.cppblog.com/woaidongmao/services/trackbacks/95868.html</trackback:ping><description><![CDATA[&nbsp;&nbsp;&nbsp;&nbsp; 摘要: 
在修改一个cms的过程当中遇到一个php截取字符串的函数（当然得兼容中英字符了），因为对各种编码的字符范围和字符表示不清楚，感觉一头迷雾，虽然可以直接来调用这个函数但是我这个的特点是追究原理，我在乎的事情都想弄明白，于是各个qq群依次发信息，没人理会。唉，郁闷。只好自己google it and teach myself 。下面是详细介绍。还有对各方求助没有人理会，我有些个人想法。现在的人已经很少...&nbsp;&nbsp;<a href='http://www.cppblog.com/woaidongmao/archive/2009/09/10/95868.html'>阅读全文</a><img src ="http://www.cppblog.com/woaidongmao/aggbug/95868.html" width = "1" height = "1" /><br><br><div align=right><a style="text-decoration:none;" href="http://www.cppblog.com/woaidongmao/" target="_blank">肥仔</a> 2009-09-10 23:42 <a href="http://www.cppblog.com/woaidongmao/archive/2009/09/10/95868.html#Feedback" target="_blank" style="text-decoration:none;">发表评论</a></div>]]></description></item><item><title>GB18030编码研究以及GBK、GB18030与Unicode的映射</title><link>http://www.cppblog.com/woaidongmao/archive/2009/09/10/95867.html</link><dc:creator>肥仔</dc:creator><author>肥仔</author><pubDate>Thu, 10 Sep 2009 15:37:00 GMT</pubDate><guid>http://www.cppblog.com/woaidongmao/archive/2009/09/10/95867.html</guid><wfw:comment>http://www.cppblog.com/woaidongmao/comments/95867.html</wfw:comment><comments>http://www.cppblog.com/woaidongmao/archive/2009/09/10/95867.html#Feedback</comments><slash:comments>0</slash:comments><wfw:commentRss>http://www.cppblog.com/woaidongmao/comments/commentRss/95867.html</wfw:commentRss><trackback:ping>http://www.cppblog.com/woaidongmao/services/trackbacks/95867.html</trackback:ping><description><![CDATA[&nbsp;&nbsp;&nbsp;&nbsp; 摘要: GB18030有两个版本：GB18030-2000和GB18030-2005。在本文中，没有指明版本的GB18030是指GB18030-2005。本文讨论了以下问题： 1.&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; GB2312有682个图形符号，都放在1区。GBK的1区有717个图形符号，5区有 166个图形符号，一共...&nbsp;&nbsp;<a href='http://www.cppblog.com/woaidongmao/archive/2009/09/10/95867.html'>阅读全文</a><img src ="http://www.cppblog.com/woaidongmao/aggbug/95867.html" width = "1" height = "1" /><br><br><div align=right><a style="text-decoration:none;" href="http://www.cppblog.com/woaidongmao/" target="_blank">肥仔</a> 2009-09-10 23:37 <a 
href="http://www.cppblog.com/woaidongmao/archive/2009/09/10/95867.html#Feedback" target="_blank" style="text-decoration:none;">发表评论</a></div>]]></description></item><item><title>GBK, UCS和UTF8相互转换</title><link>http://www.cppblog.com/woaidongmao/archive/2009/09/10/95864.html</link><dc:creator>肥仔</dc:creator><author>肥仔</author><pubDate>Thu, 10 Sep 2009 15:13:00 GMT</pubDate><guid>http://www.cppblog.com/woaidongmao/archive/2009/09/10/95864.html</guid><wfw:comment>http://www.cppblog.com/woaidongmao/comments/95864.html</wfw:comment><comments>http://www.cppblog.com/woaidongmao/archive/2009/09/10/95864.html#Feedback</comments><slash:comments>0</slash:comments><wfw:commentRss>http://www.cppblog.com/woaidongmao/comments/commentRss/95864.html</wfw:commentRss><trackback:ping>http://www.cppblog.com/woaidongmao/services/trackbacks/95864.html</trackback:ping><description><![CDATA[&nbsp;&nbsp;&nbsp;&nbsp; 摘要: 最近学习了下编码 以下地址可以很好的学习到相关的知识 http://dev.csdn.net/develop/article/69/69883.shtm http://dev.csdn.net/develop/article/72/72888.shtm 其中讲了UTF8的编码 当要表示的内容是　7位　的时候就用一个字节：0******* 　第一个0为标志位，剩下的空间正好可以表示ASCII　0－1...&nbsp;&nbsp;<a href='http://www.cppblog.com/woaidongmao/archive/2009/09/10/95864.html'>阅读全文</a><img src ="http://www.cppblog.com/woaidongmao/aggbug/95864.html" width = "1" height = "1" /><br><br><div align=right><a style="text-decoration:none;" href="http://www.cppblog.com/woaidongmao/" target="_blank">肥仔</a> 2009-09-10 23:13 <a href="http://www.cppblog.com/woaidongmao/archive/2009/09/10/95864.html#Feedback" target="_blank" style="text-decoration:none;">发表评论</a></div>]]></description></item><item><title>C程序实现汉字内码与GB码</title><link>http://www.cppblog.com/woaidongmao/archive/2008/11/08/66314.html</link><dc:creator>肥仔</dc:creator><author>肥仔</author><pubDate>Sat, 08 Nov 2008 04:17:00 
GMT</pubDate><guid>http://www.cppblog.com/woaidongmao/archive/2008/11/08/66314.html</guid><wfw:comment>http://www.cppblog.com/woaidongmao/comments/66314.html</wfw:comment><comments>http://www.cppblog.com/woaidongmao/archive/2008/11/08/66314.html#Feedback</comments><slash:comments>0</slash:comments><wfw:commentRss>http://www.cppblog.com/woaidongmao/comments/commentRss/66314.html</wfw:commentRss><trackback:ping>http://www.cppblog.com/woaidongmao/services/trackbacks/66314.html</trackback:ping><description><![CDATA[<p class="MsoNormal" style="text-align: left; mso-margin-top-alt: auto; mso-pagination: widow-orphan; mso-margin-bottom-alt: auto" align="left"><span style="font-size: 12pt; font-family: 宋体; mso-bidi-font-family: 宋体; mso-font-kerning: 0pt">　<span lang="EN-US">// HZEncode.cpp : Defines the entry point for the console application.<?xml:namespace prefix = o /><o:p></o:p></span></span></p> <p class="MsoNormal" style="text-align: left; mso-margin-top-alt: auto; mso-pagination: widow-orphan; mso-margin-bottom-alt: auto" align="left"><span style="font-size: 12pt; font-family: 宋体; mso-bidi-font-family: 宋体; mso-font-kerning: 0pt">　　<span lang="EN-US">//<o:p></o:p></span></span></p> <p class="MsoNormal" style="text-align: left; mso-margin-top-alt: auto; mso-pagination: widow-orphan; mso-margin-bottom-alt: auto" align="left"><span style="font-size: 12pt; font-family: 宋体; mso-bidi-font-family: 宋体; mso-font-kerning: 0pt">　　<span lang="EN-US">/*<o:p></o:p></span></span></p> <p class="MsoNormal" style="text-align: left; mso-margin-top-alt: auto; mso-pagination: widow-orphan; mso-margin-bottom-alt: auto" align="left"><span style="font-size: 12pt; font-family: 宋体; mso-bidi-font-family: 宋体; mso-font-kerning: 0pt">　　参考文献：<span lang="EN-US"><o:p></o:p></span></span></p> <p class="MsoNormal" style="text-align: left; mso-margin-top-alt: auto; mso-pagination: widow-orphan; mso-margin-bottom-alt: auto" align="left"><span style="font-size: 12pt; font-family: 宋体; 
mso-bidi-font-family: 宋体; mso-font-kerning: 0pt">　　Encoding and representation of Chinese characters</span></p>
<p class="MsoNormal" style="text-align: left; mso-margin-top-alt: auto; mso-pagination: widow-orphan; mso-margin-bottom-alt: auto" align="left"><span style="font-size: 12pt; font-family: 宋体; mso-bidi-font-family: 宋体; mso-font-kerning: 0pt">　　1) Chinese character interchange code (national standard code, 国标码). The interchange code is used mainly for exchanging Chinese-character information.</span></p>
<p class="MsoNormal" style="text-align: left; mso-margin-top-alt: auto; mso-pagination: widow-orphan; mso-margin-bottom-alt: auto" align="left"><span style="font-size: 12pt; font-family: 宋体; mso-bidi-font-family: 宋体; mso-font-kerning: 0pt">　　National standard code: the interchange code defined in 《信息交换用汉字编码字符集 基本集》 (coded Chinese character set for information interchange, basic set; code GB2312-80), issued by the national standards bureau in 1980, serves as the national standard Chinese character encoding. GB2312-80 contains 7445 characters and symbols in total:<br>
　　6763 Chinese characters, made up of 3755 level-1 characters (ordered by Hanyu Pinyin) and 3008 level-2 characters (ordered by radical and stroke count);<br>
　　682 non-Chinese symbols.<br>
　　GB2312-80 arranges all of its characters and symbols in a 94 × 94 grid. Each row of the grid is called a "区" (zone) and each column a "位" (position). The grid therefore forms a character set of 94 zones (numbered 01 to 94), each holding 94 positions (numbered 01 to 94). The zone number and position number of a character together form its "区位码" (zone-position code): the upper two decimal digits are the zone number and the lower two are the position number. A zone-position code thus identifies exactly one character or symbol, and conversely every character or symbol has exactly one zone-position code, with no duplicates.<span 
lang="EN-US"><o:p></o:p></span></span></p>
<p class="MsoNormal" style="text-align: left; mso-margin-top-alt: auto; mso-pagination: widow-orphan; mso-margin-bottom-alt: auto" align="left"><span style="font-size: 12pt; font-family: 宋体; mso-bidi-font-family: 宋体; mso-font-kerning: 0pt">　　The zones are allocated as follows:</span></p>
<p class="MsoNormal" style="text-align: left; mso-margin-top-alt: auto; mso-pagination: widow-orphan; mso-margin-bottom-alt: auto" align="left"><span style="font-size: 12pt; font-family: 宋体; mso-bidi-font-family: 宋体; mso-font-kerning: 0pt">　　Zone: contents<br>
　　1: various symbols not found on the keyboard<br>
　　2: various serial numbers<br>
　　3: the symbols found on the keyboard (in their Chinese full-width form)<br>
　　4-5: Japanese kana<br>
　　6: Greek letters<br>
　　7: Russian letters<br>
　　8: vowels with pinyin tone marks and the names of the pinyin letters<br>
　　9: box-drawing symbols<br>
　　10-15: unused<br>
　　16-55: level-1 Chinese characters (ordered by pinyin)<br>
　　56-87: level-2 Chinese characters (ordered by radical and stroke count)<br>
　　88-94: user-defined Chinese characters</span></p>
<p class="MsoNormal" style="text-align: left; mso-margin-top-alt: auto; mso-pagination: widow-orphan; mso-margin-bottom-alt: auto" align="left"><span style="font-size: 12pt; font-family: 宋体; mso-bidi-font-family: 宋体; mso-font-kerning: 0pt">　　As the list shows, the 94 zones of characters and symbols fall into four groups:</span></p>
<p class="MsoNormal" style="text-align: left; mso-margin-top-alt: auto; mso-pagination: widow-orphan; mso-margin-bottom-alt: auto" align="left"><span style="font-size: 12pt; font-family: 宋体; mso-bidi-font-family: 宋体; mso-font-kerning: 0pt">　　① Zones 1-15: graphic symbol zones. Zones 1-9 are the standard symbol zones; zones 10-15 are user-defined symbol zones.</span></p>
<p class="MsoNormal" style="text-align: left; mso-margin-top-alt: auto; mso-pagination: widow-orphan; mso-margin-bottom-alt: auto" align="left"><span style="font-size: 12pt; font-family: 宋体; mso-bidi-font-family: 宋体; mso-font-kerning: 0pt">　　② Zones 16-55: level-1 Chinese characters, 3755 in all. They are ordered by Hanyu Pinyin, and characters with the same pronunciation are listed by stroke order.</span></p>
<p class="MsoNormal" style="text-align: left; mso-margin-top-alt: auto; mso-pagination: widow-orphan; mso-margin-bottom-alt: auto" align="left"><span style="font-size: 12pt; font-family: 宋体; mso-bidi-font-family: 宋体; mso-font-kerning: 0pt">　　③ Zones 56-87: level-2 Chinese characters, 3008 in all. They are ordered by radical and stroke count.</span></p>
<p class="MsoNormal" style="text-align: left; mso-margin-top-alt: auto; mso-pagination: widow-orphan; mso-margin-bottom-alt: auto" align="left"><span style="font-size: 12pt; font-family: 宋体; mso-bidi-font-family: 宋体; mso-font-kerning: 0pt">　　④ Zones 88-94: user-defined Chinese character zones.</span></p>
<p class="MsoNormal" style="text-align: left; mso-margin-top-alt: auto; mso-pagination: widow-orphan; mso-margin-bottom-alt: auto" align="left"><span style="font-size: 12pt; font-family: 宋体; mso-bidi-font-family: 宋体; mso-font-kerning: 
0pt">　　The national standard code represents every Chinese character (and the non-Chinese symbols in the set) with a 2-byte code. The most significant bit of each byte is 0 and only the low 7 bits are used. Of the 2<sup>7</sup> = 128 values those 7 bits can take, 34 are reserved (the 32 control characters, the space and DEL), which leaves 2<sup>7</sup> - 34 = 94 values per byte for characters. Two bytes therefore give 94 × 94 = 8836 code points. Of the 2 bytes that represent a character, the high byte corresponds to the row number in the code table and is called the zone number; the low byte corresponds to the column number and is called the position number.</span></p>
<p class="MsoNormal" style="text-align: left; mso-margin-top-alt: auto; mso-pagination: widow-orphan; mso-margin-bottom-alt: auto" align="left"><span style="font-size: 12pt; font-family: 宋体; mso-bidi-font-family: 宋体; mso-font-kerning: 0pt">　　In binary, the national standard code ranges from 00100001 00100001 to 01111110 01111110, that is, from (1+32)<sub>10</sub> (1+32)<sub>10</sub> to (94+32)<sub>10</sub> (94+32)<sub>10</sub>. 7-bit ASCII is a character set of 128 characters. The code values 0 to 31 (00000000 to 00011111) do not correspond to any printable character; they are usually called control characters and are used for communication control or for controlling the functions of computer devices. Code value 32 (00100000) is the space character SP. Code value 127 (01111111) is the delete character DEL.</span></p>
<p class="MsoNormal" style="text-align: left; mso-margin-top-alt: auto; mso-pagination: widow-orphan; mso-margin-bottom-alt: auto" align="left"><span style="font-size: 12pt; font-family: 宋体; mso-bidi-font-family: 宋体; mso-font-kerning: 0pt">　　The national standard code starts at 00100001, i.e. (33)<sub>10</sub>, in order to skip the 32 ASCII control characters and the space character. Each byte of the national standard code is therefore larger than the corresponding part of the zone-position code by (32)<sub>10</sub> = (00100000)<sub>2</sub> = (20)<sub>H</sub>:<br>
　　national standard code high byte = zone number + 20H (H denotes hexadecimal)<br>
　　national standard code low byte = position number + 20H</span></p> <p 
class="MsoNormal" style="text-align: left; mso-margin-top-alt: auto; mso-pagination: widow-orphan; mso-margin-bottom-alt: auto" align="left"><span style="font-size: 12pt; font-family: 宋体; mso-bidi-font-family: 宋体; mso-font-kerning: 0pt">　　2) Chinese character internal code (机内码, also called the storage code)</span></p>
<p class="MsoNormal" style="text-align: left; mso-margin-top-alt: auto; mso-pagination: widow-orphan; mso-margin-bottom-alt: auto" align="left"><span style="font-size: 12pt; font-family: 宋体; mso-bidi-font-family: 宋体; mso-font-kerning: 0pt">　　The internal code gives the many different Chinese input codes a single representation inside the computer. Whatever input code is used at the keyboard, it is converted to the internal code for storage, which simplifies Chinese text processing. The internal code is the code that is stored and processed inside the computer. A computer has to handle both Chinese and English text, so it must be able to tell Chinese characters from English characters. The internal code of an English character is its ASCII code stored in an 8-bit byte whose most significant bit is 0. To avoid clashing with 7-bit ASCII, the most significant bit of each byte of the national standard code is changed from 0 to 1, the other bits are left unchanged, and the result is used as the internal code of the Chinese character.</span></p>
<p class="MsoNormal" style="text-align: left; mso-margin-top-alt: auto; mso-pagination: widow-orphan; mso-margin-bottom-alt: auto" align="left"><span style="font-size: 12pt; font-family: 宋体; mso-bidi-font-family: 宋体; mso-font-kerning: 0pt">　　In binary, the internal code ranges from 10100001 10100001 to 11111110 11111110. Each byte of the internal code is larger than the corresponding byte of the national standard code by (128)<sub>10</sub> = (10000000)<sub>2</sub> = (80)<sub>H</sub>:<br>
　　internal code high byte = national standard code high byte + 80H<br>
　　internal code low byte = national standard code low byte + 80H<br>
　　Since<br>
　　national standard code high byte = zone number + 20H<br>
　　national standard code low byte = position number + 20H<br>
　　it follows that<br>
　　internal code high byte = zone number + A0H<br>
　　internal code low byte = position number + A0H<br>
　　In other words, the high and low bytes of the internal code are larger than the zone number and position number by (160)<sub>10</sub> = (10100000)<sub>2</sub> = (A0)<sub>H</sub>. For example, the zone-position code of the character "啊" is "1601": the zone number is (16)<sub>10</sub> = (10)<sub>H</sub> and the position number is (01)<sub>10</sub> = (01)<sub>H</sub>. Then:<br>
　　internal code high byte = 10H + A0H = B0H<br>
　　internal code low byte = 01H + A0H = A1H<br>
　　so the internal code is B0A1H.</span></p>
<p class="MsoNormal" style="text-align: left; mso-margin-top-alt: auto; mso-pagination: widow-orphan; mso-margin-bottom-alt: auto" align="left"><span style="font-size: 12pt; font-family: 宋体; mso-bidi-font-family: 宋体; mso-font-kerning: 0pt">　　3) Chinese character input code (external code)</span></p> <p 
class="MsoNormal" style="text-align: left; mso-margin-top-alt: auto; mso-pagination: widow-orphan; mso-margin-bottom-alt: auto" align="left"><span style="font-size: 12pt; font-family: 宋体; mso-bidi-font-family: 宋体; mso-font-kerning: 0pt">　　The input code (external code) is an encoding designed for entering Chinese characters into the computer with the keys of a keyboard. When typing English you press the key for the character you want, so the input code and the internal code are the same. When typing Chinese, several keystrokes may be needed to enter one character. There are hundreds of Chinese input schemes, but however different their external codes are, they are all converted to the same internal code once they reach the computer. Input schemes fall roughly into the following 4 types:</span></p>
<p class="MsoNormal" style="text-align: left; mso-margin-top-alt: auto; mso-pagination: widow-orphan; mso-margin-bottom-alt: auto" align="left"><span style="font-size: 12pt; font-family: 宋体; mso-bidi-font-family: 宋体; mso-font-kerning: 0pt">　　(1) Phonetic codes: for example Quanpin (全拼), Shuangpin (双拼) and Microsoft Pinyin (微软拼音)</span></p>
<p class="MsoNormal" style="text-align: left; mso-margin-top-alt: auto; mso-pagination: widow-orphan; mso-margin-bottom-alt: auto" align="left"><span style="font-size: 12pt; font-family: 宋体; mso-bidi-font-family: 宋体; mso-font-kerning: 0pt">　　(2) Shape codes: for example Wubi (五笔字型), Zhengma (郑码) and Biaoxingma (表形码)</span></p>
<p class="MsoNormal" style="text-align: left; mso-margin-top-alt: auto; mso-pagination: widow-orphan; mso-margin-bottom-alt: auto" align="left"><span style="font-size: 12pt; font-family: 宋体; mso-bidi-font-family: 宋体; mso-font-kerning: 0pt">　　(3) Phonetic-shape codes: for example Zhineng ABC (智能ABC) and Ziranma (自然码)</span></p>
<p class="MsoNormal" style="text-align: left; mso-margin-top-alt: auto; mso-pagination: widow-orphan; mso-margin-bottom-alt: auto" align="left"><span style="font-size: 12pt; font-family: 宋体; mso-bidi-font-family: 宋体; mso-font-kerning: 0pt">　　(4) Numeric codes: for example the zone-position code (区位码) and the telegraph code (电报码)</span></p> <p class="MsoNormal" style="text-align: left; mso-margin-top-alt: auto; mso-pagination: widow-orphan; 
mso-margin-bottom-alt: auto" align="left"><span style="font-size: 12pt; font-family: 宋体; mso-bidi-font-family: 宋体; mso-font-kerning: 0pt">4) Hanzi glyph code (output code)</span></p> <p class="MsoNormal" style="text-align: left; mso-margin-top-alt: auto; mso-pagination: widow-orphan; mso-margin-bottom-alt: auto" align="left"><span style="font-size: 12pt; font-family: 宋体; mso-bidi-font-family: 宋体; mso-font-kerning: 0pt">A Hanzi glyph code (output code) is used to display and print Hanzi; it is the digitized form of the glyph. The internal code represents a Hanzi as a numeric code, but to make the character visible on output its glyph must be produced. Hanzi systems generally represent glyphs as dot matrices (bitmaps). [Figure: a 16 × 16 Hanzi dot matrix.] A character in a 16 × 16 dot-matrix font takes 32 bytes to store (16 × 16 / 8 = 32), and one in a 24 × 24 font takes 72 bytes (24 × 24 / 8 = 72).</span></p> <p class="MsoNormal" style="text-align: left; mso-margin-top-alt: auto; mso-pagination: widow-orphan; mso-margin-bottom-alt: auto" align="left"><span style="font-size: 12pt; font-family: 宋体; mso-bidi-font-family: 宋体; mso-font-kerning: 0pt">In general, the larger the dot matrix, the better the glyph quality, and the more storage each character needs.</span></p> <p class="MsoNormal" style="text-align: left; mso-margin-top-alt: auto; mso-pagination: widow-orphan; mso-margin-bottom-alt: auto" align="left"><span style="font-size: 12pt; font-family: 宋体; mso-bidi-font-family: 宋体; mso-font-kerning: 0pt">5) Hanzi address code</span></p> <p class="MsoNormal" style="text-align: left; mso-margin-top-alt: auto; mso-pagination: widow-orphan; mso-margin-bottom-alt: auto" align="left"><span style="font-size: 12pt; font-family: 宋体; mso-bidi-font-family: 宋体; mso-font-kerning: 0pt">A Hanzi address code is the logical address at which a character's glyph data is stored in a Hanzi font library (here mainly a whole-glyph dot-matrix font library). Glyph data is stored contiguously on the storage medium in a fixed order (usually the order of the characters in the standard Hanzi interchange code), so address codes are mostly contiguous and ordered as well, and they map simply to internal codes, which keeps the conversion from internal code to address code easy.</span></p> <div align="center"> <table class="MsoNormalTable" style="width: 95%; mso-cellspacing: 0cm; mso-padding-alt: 4.5pt 4.5pt 4.5pt 4.5pt" cellspacing="0" cellpadding="0" width="95%" border="0"> <tbody> <tr style="mso-yfti-irow: 0; mso-yfti-firstrow: yes; mso-yfti-lastrow: yes"> <td style="padding-right: 4.5pt; padding-left: 4.5pt; background: #f3f3f3; padding-bottom: 4.5pt; padding-top: 4.5pt"> <p class="MsoNormal" style="text-align: left; mso-pagination: widow-orphan" align="left"><b><span style="font-size: 12pt; color: #990000; font-family: 宋体; mso-bidi-font-family: 宋体; mso-font-kerning: 0pt">Quoted snippet:</span></b><span lang="EN-US" style="font-size: 12pt; font-family: 宋体; mso-bidi-font-family: 宋体; mso-font-kerning: 0pt"><br>*/&nbsp; <br>#include "stdafx.h" <br>#include "HZEncode.h" <br><br>#ifdef _DEBUG <br>#define new DEBUG_NEW <br>#undef THIS_FILE <br>static char THIS_FILE[] = __FILE__; <br>#endif <br>// Note: these two macros come after the headers, so they should not change how TCHAR, _T or _tmain were already resolved <br>#define UNICODE <br>#define _UNICODE <br>///////////////////////////////////////////////////////////////////////////// <br>// The one and only application object <br><br>CWinApp theApp; <br><br>using namespace std; <br>unsigned short* ptr; <br>const char* pszHZ = "啊"; <br>byte bt[] = {0xc4,0xe3,0xBA,0xC3}; // GB2312 internal code of "你好" <br>int _tmain(int argc, TCHAR* argv[], TCHAR* envp[]) <br>{ <br>&nbsp;&nbsp;&nbsp;&nbsp;int nRetCode = 0; <br><br>&nbsp;&nbsp;&nbsp;&nbsp;// initialize MFC and print an error on failure <br>&nbsp;&nbsp;&nbsp;&nbsp;if (!AfxWinInit(::GetModuleHandle(NULL), NULL, ::GetCommandLine(), 0)) <br>&nbsp;&nbsp;&nbsp;&nbsp;{ <br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;// TODO: change error code to suit your needs <br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;cerr &lt;&lt; _T("Fatal Error: MFC initialization failed") &lt;&lt; endl; <br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;nRetCode = 1; <br>&nbsp;&nbsp;&nbsp;&nbsp;} <br>&nbsp;&nbsp;&nbsp;&nbsp;else <br>&nbsp;&nbsp;&nbsp;&nbsp;{ <br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;// Rows 16-55 of GB2312 hold the level-1 Hanzi <br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;for (int i = 16; i &lt;= 55; i++) <br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;{ <br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;byte Temp[3]; <br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Temp[2] = 0; // string terminator <br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Temp[0] = i + 0xA0; // high byte = row number + 0xA0 <br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;for (int j = 1; j &lt;= 94; j++) // each row has cells 1-94 <br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;{ <br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Temp[1] = j + 0xA0; // low byte = cell number + 0xA0 <br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;cout &lt;&lt; (const char*) Temp; // the two bytes form a narrow GB2312 string <br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;} <br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;cout &lt;&lt; endl; <br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;} <br>&nbsp;&nbsp;&nbsp;&nbsp;} <br><br>&nbsp;&nbsp;&nbsp;&nbsp;system("pause"); <br>&nbsp;&nbsp;&nbsp;&nbsp;return nRetCode; <br>} <br></span></p></td></tr></tbody></table></div> <p class="MsoNormal" style="text-align: left" align="left"><span lang="EN-US" style="font-size: 12pt; font-family: 宋体; mso-bidi-font-family: arial">&nbsp;</span></p><img src ="http://www.cppblog.com/woaidongmao/aggbug/66314.html" width = "1" height = "1" /><br><br><div align=right><a style="text-decoration:none;" href="http://www.cppblog.com/woaidongmao/" target="_blank">肥仔</a> 2008-11-08 12:17 <a href="http://www.cppblog.com/woaidongmao/archive/2008/11/08/66314.html#Feedback" target="_blank" 
style="text-decoration:none;">Post a comment</a></div>]]></description></item><item><title>Three character encoding schemes in C++</title><link>http://www.cppblog.com/woaidongmao/archive/2008/11/07/66259.html</link><dc:creator>肥仔</dc:creator><author>肥仔</author><pubDate>Fri, 07 Nov 2008 15:27:00 GMT</pubDate><guid>http://www.cppblog.com/woaidongmao/archive/2008/11/07/66259.html</guid><wfw:comment>http://www.cppblog.com/woaidongmao/comments/66259.html</wfw:comment><comments>http://www.cppblog.com/woaidongmao/archive/2008/11/07/66259.html#Feedback</comments><slash:comments>0</slash:comments><wfw:commentRss>http://www.cppblog.com/woaidongmao/comments/commentRss/66259.html</wfw:commentRss><trackback:ping>http://www.cppblog.com/woaidongmao/services/trackbacks/66259.html</trackback:ping><description><![CDATA[<p></p> <p class="MsoNormal" style="line-height: 150%"><span lang="EN-US" style="font-size: 12pt; color: black; line-height: 150%; font-family: 宋体">C++ code commonly works with three kinds of character encoding: SBCS (single-byte character set), MBCS (multi-byte character set) and Unicode. In SBCS every character is one byte. In MBCS a character takes a variable number of bytes, one, two, three or more, but in practice two bytes is by far the most common, so it is no surprise to see DBCS (double-byte character set) used in place of MBCS. Unicode, as the term is used on Windows, means UTF-16 stored in 2-byte wchar_t units: most characters take one unit, and characters outside the Basic Multilingual Plane take two. The Windows NT kernel APIs use Unicode throughout, so if your program uses a non-Unicode encoding, the system converts the strings to Unicode, does the work, and converts the results back to the type you use. Single-byte characters are stored in char and Unicode characters in wchar_t. We grew up with single-byte characters and giving them up completely is hard to accept, yet sometimes Unicode cannot be avoided. Microsoft's answer is a type selected by a macro: TCHAR</span></p> <p class="MsoNormal" 
style="line-height: 150%"><span style="font-size: 12pt; color: black; line-height: 150%; font-family: 宋体">Here is its definition:</span></p> <p class="MsoNormal" style="line-height: 150%"><span lang="EN-US" style="font-size: 12pt; color: black; line-height: 150%; font-family: 宋体">#ifdef UNICODE<br>typedef wchar_t TCHAR;<br>#else<br>typedef char TCHAR;<br>#endif</span></p> <p class="MsoNormal" style="line-height: 150%"><span lang="EN-US" style="font-size: 12pt; color: black; line-height: 150%; font-family: 宋体">That solves the problem: define UNICODE and TCHAR becomes wchar_t, which is convenient. Note also that COM on Windows uses Unicode throughout, whereas MFC has traditionally defaulted to MBCS. So if a class library written with MFC is moved into COM, some string formatting or return values can come out wrong. The reason is that COM always uses Unicode, and a Unicode string ends with a two-byte wchar_t terminator (L'\0') while a char string ends with a one-byte '\0'. In general, string literals need to be wrapped in the _T macro to work in both builds. For example, if your MFC code says S = "FSDFSDF", then after moving the class to COM you need to write S = _T("FSDFSDF"). You can think of _T as doing for literals what TCHAR does for types: in a Unicode build it prefixes the string literal with L, otherwise it leaves the literal unchanged.</span></p> <p class="MsoNormal" style="line-height: 150%"><span style="font-size: 12pt; color: black; line-height: 150%; font-family: 宋体">A few small points:</span></p> <p class="MsoNormal" style="line-height: 150%"><span lang="EN-US" style="font-size: 12pt; color: black; line-height: 150%; font-family: 
宋体">VC6</span><span lang="EN-US" style="font-size: 12pt; color: black; line-height: 150%; font-family: 宋体"> generates a console application with<br>int main(int argc, char* argv[])</span></p> <p class="MsoNormal" style="line-height: 150%"><span lang="EN-US" style="font-size: 12pt; color: black; line-height: 150%; font-family: 宋体">Visual C++ 2005 generates</span></p> <p class="MsoNormal" style="line-height: 150%"><span lang="EN-US" style="font-size: 12pt; color: black; line-height: 150%; font-family: 宋体">int _tmain(int argc, _TCHAR* argv[])</span></p> <p class="MsoNormal" style="line-height: 150%"><span lang="EN-US" style="font-size: 12pt; color: black; line-height: 150%; font-family: 宋体">Clearly _tmain is the better choice. Why?</span></p> <p class="MsoNormal" style="line-height: 150%"><span lang="EN-US" style="font-size: 12pt; color: black; line-height: 150%; font-family: 宋体">You can also use <b>_tmain</b>, which is defined in TCHAR.h. <b>_tmain</b> will resolve to <b>main</b> unless _UNICODE is defined, in which case <b>_tmain</b> will resolve to <b>wmain</b>. (<a href="http://msdn2.microsoft.com/en-us/library/6wd819wh.aspx"><span style="color: black">http://msdn2.microsoft.com/en-us/library/6wd819wh.aspx</span></a>)</span></p> <p class="MsoNormal" style="line-height: 150%"><span lang="EN-US" style="font-size: 12pt; color: black; line-height: 150%; font-family: 宋体">You will also often see the following character types:</span></p> <table class="MsoNormalTable" cellspacing="0" cellpadding="4" border="1"> <tbody> <tr><th>Type</th><th>Meaning in MBCS builds</th><th>Meaning in Unicode builds</th></tr> <tr><td>WCHAR</td><td>wchar_t</td><td>wchar_t</td></tr> <tr><td>LPSTR</td><td>zero-terminated string of char (char*)</td><td>zero-terminated string of char (char*)</td></tr> <tr><td>LPCSTR</td><td>constant zero-terminated string of char (const char*)</td><td>constant zero-terminated string of char (const char*)</td></tr> <tr><td>LPWSTR</td><td>zero-terminated Unicode string (wchar_t*)</td><td>zero-terminated Unicode string (wchar_t*)</td></tr> <tr><td>LPCWSTR</td><td>constant zero-terminated Unicode string (const wchar_t*)</td><td>constant zero-terminated Unicode string (const wchar_t*)</td></tr> <tr><td>TCHAR</td><td>char</td><td>wchar_t</td></tr> <tr><td>LPTSTR</td><td>zero-terminated string of TCHAR (TCHAR*)</td><td>zero-terminated string of TCHAR (TCHAR*)</td></tr> <tr><td>LPCTSTR</td><td>constant zero-terminated string of TCHAR (const TCHAR*)</td><td>constant zero-terminated string of TCHAR (const TCHAR*)</td></tr> </tbody></table> <p class="MsoNormal" style="line-height: 150%"><span lang="EN-US" style="font-size: 12pt; color: black; line-height: 150%; font-family: 宋体">In these names C stands for constant, P for pointer, LP for long pointer, and W for wide character, that is, Unicode. With that, the purpose of each type should be clear.</span></p> <p class="MsoNormal" style="line-height: 150%"><span lang="EN-US" style="font-size: 12pt; color: black; line-height: 150%; font-family: 宋体">You will also often see functions such as _mbsstr(). These are the MBCS versions of the string functions; they handle SBCS text as well, but the reverse is not true. To be safe you can use _mbsstr in place of strstr, but if the program only ever handles SBCS text that costs some efficiency, so weigh efficiency against portability for your own case.</span></p> <p class="MsoNormal" style="line-height: 150%"><span lang="EN-US" style="font-size: 12pt; color: black; line-height: 150%; font-family: 宋体">From now on, if a C++ program produces garbled text, first check which character encoding the code is using. Very often the terminator is the problem: SBCS and MBCS strings end with a single zero byte, while Unicode (wchar_t) strings end with two zero bytes.</span></p> <p class="MsoNormal" style="line-height: 150%"><b style="mso-bidi-font-weight: normal"><span lang="EN-US" style="font-size: 12pt; color: black; line-height: 150%; font-family: 宋体; mso-bidi-font-family: arial">&nbsp;</span></b></p></span><img src ="http://www.cppblog.com/woaidongmao/aggbug/66259.html" width = "1" height = "1" /><br><br><div align=right><a style="text-decoration:none;" href="http://www.cppblog.com/woaidongmao/" target="_blank">肥仔</a> 2008-11-07 23:27 <a href="http://www.cppblog.com/woaidongmao/archive/2008/11/07/66259.html#Feedback" target="_blank" style="text-decoration:none;">Post a comment</a></div>]]></description></item><item><title>Character encoding basics</title><link>http://www.cppblog.com/woaidongmao/archive/2008/11/07/66252.html</link><dc:creator>肥仔</dc:creator><author>肥仔</author><pubDate>Fri, 07 Nov 2008 14:43:00 GMT</pubDate><guid>http://www.cppblog.com/woaidongmao/archive/2008/11/07/66252.html</guid><wfw:comment>http://www.cppblog.com/woaidongmao/comments/66252.html</wfw:comment><comments>http://www.cppblog.com/woaidongmao/archive/2008/11/07/66252.html#Feedback</comments><slash:comments>0</slash:comments><wfw:commentRss>http://www.cppblog.com/woaidongmao/comments/commentRss/66252.html</wfw:commentRss><trackback:ping>http://www.cppblog.com/woaidongmao/services/trackbacks/66252.html</trackback:ping><description><![CDATA[<p style="line-height: 150%"><span 
lang="EN-US" style="mso-bidi-font-family: arial">ASCII: the basic character set has 128 common characters and the extended set adds another 128, 256 in all, each stored in 1 byte.<br>GB2312: more than 6,000 common Hanzi<br>GBK: more than 20,000 Hanzi<br>GB18030: more still; a character takes one, two or four bytes.<br>On Chinese-language Windows these three GB* encodings are loosely referred to as "ANSI" encodings. In their two-byte codes the highest bit of the first byte is always 1.<br>BIG5: a Traditional Chinese character set, used in Taiwan</span></p> <p style="line-height: 150%"><span lang="EN-US" style="mso-bidi-font-family: arial">Unicode: a universal character set for the world's scripts. What Windows calls "Unicode" text is UTF-16; saved as a text file it starts with a two-byte header (the byte order mark).<br>UTF-8: a Unicode encoding form built from 8-bit units; saved as a text file it may start with a three-byte header (the byte order mark EF BB BF).<br>UTF-16: a Unicode encoding form built from 16-bit units</span></p> <p style="line-height: 150%"><span lang="EN-US" style="mso-bidi-font-family: arial">What the abbreviations stand for:<br>ASCII: American Standard Code for Information Interchange<br>ANSI: American National Standards Institute<br>GB: Guo Biao (国标, "national standard")<br>UTF: Unicode Transformation Format</span></p> <p style="line-height: 150%"><span lang="EN-US" style="mso-bidi-font-family: arial">========================================================<br>Character is the general term for all letters and symbols, including national scripts, punctuation, graphic symbols and digits. A character set is a collection of characters. There are many character sets, each with a different number of characters; common ones include ASCII, GB2312, BIG5, GB 18030 and Unicode. For a computer to process text in these character sets accurately, the characters must be encoded so that the computer can recognize and store them.</span></p> <p style="line-height: 150%"><span lang="EN-US" style="mso-bidi-font-family: arial">Chinese has a very large number of characters and is written in two forms, Simplified and Traditional, while computers were originally designed around single-byte English characters. Encoding Chinese characters is therefore the technical foundation of Chinese information exchange. This article discusses several typical character sets in chronological order, including representative Chinese ones, and looks at their origins, features and technical characteristics.</span></p> <p style="line-height: 150%"><span lang="EN-US" style="mso-bidi-font-family: arial">ASCII character set</span></p> <p style="line-height: 150%"><span lang="EN-US" style="mso-bidi-font-family: arial">1. Origin of the name</span></p> <p style="line-height: 150%"><span lang="EN-US" style="mso-bidi-font-family: arial">ASCII (American Standard Code for Information Interchange) is a computer encoding system based on the Latin alphabet.</span></p> <p style="line-height: 150%"><span lang="EN-US" style="mso-bidi-font-family: arial">2. Features</span></p> <p style="line-height: 150%"><span lang="EN-US" style="mso-bidi-font-family: arial">It is used mainly to display modern English and other Western European languages. It is the most widely used single-byte encoding system today and is equivalent to the international standard ISO 646.</span></p> <p style="line-height: 150%"><span lang="EN-US" style="mso-bidi-font-family: arial">3. Contents</span></p> <p style="line-height: 150%"><span lang="EN-US" style="mso-bidi-font-family: arial">Control characters: carriage return, backspace, line feed and so on.</span></p> <p style="line-height: 150%"><span lang="EN-US" style="mso-bidi-font-family: arial">Printable characters: upper- and lower-case English letters, Arabic numerals, and Western punctuation and symbols.</span></p> <p style="line-height: 150%"><span lang="EN-US" style="mso-bidi-font-family: arial">4. Technical characteristics</span></p> <p style="line-height: 150%"><span lang="EN-US" style="mso-bidi-font-family: arial">Each character is represented by 7 bits, giving 128 characters.</span></p> <p style="line-height: 150%"><span lang="EN-US" style="mso-bidi-font-family: arial">5. Extended ASCII</span></p> <p style="line-height: 150%"><span lang="EN-US" style="mso-bidi-font-family: arial">A 7-bit character set can support only 128 characters. To represent more of the characters commonly used in Europe, ASCII was extended: extended ASCII character sets use 8 bits per character, giving 256 characters.</span></p> <p style="line-height: 150%"><span lang="EN-US" style="mso-bidi-font-family: arial">The symbols added by the extended set include box-drawing symbols, mathematical symbols, Greek letters and special Latin characters.</span></p> <p style="line-height: 150%"><span lang="EN-US" style="mso-bidi-font-family: arial">GB2312 character set</span></p> <p style="line-height: 150%"><span lang="EN-US" style="mso-bidi-font-family: arial">1. Origin of the name</span></p> <p style="line-height: 150%"><span lang="EN-US" style="mso-bidi-font-family: arial">GB2312, also called GB2312-80, has the full title "Code of Chinese Graphic Character Set for Information Interchange, Primary Set" (《信息交换用汉字编码字符集·基本集》). It was issued by the former State Administration of Standards of China and took effect on 1 May 1981.</span></p> <p style="line-height: 150%"><span lang="EN-US" style="mso-bidi-font-family: arial">2. Features</span></p> <p style="line-height: 150%"><span lang="EN-US" style="mso-bidi-font-family: arial">GB2312 is the Chinese national standard character set for Simplified Chinese. The Hanzi it contains cover 99.75% of usage by frequency, which largely meets the needs of computer processing of Hanzi. It is widely used in mainland China and Singapore.</span></p> <p style="line-height: 150%"><span lang="EN-US" style="mso-bidi-font-family: arial">3. Contents</span></p> <p style="line-height: 150%"><span lang="EN-US" style="mso-bidi-font-family: arial">GB2312 includes simplified Hanzi together with general symbols, ordinal marks, digits, Latin letters, Japanese kana, Greek letters, Russian (Cyrillic) letters, Hanyu Pinyin symbols and Zhuyin (Bopomofo) letters, 7,445 graphic characters in all. Of these, 6,763 are Hanzi (3,755 level-1 and 3,008 level-2), and 682 are full-width non-Hanzi characters such as Latin letters, Greek letters, Japanese hiragana and katakana, and Cyrillic letters.</span></p> <p style="line-height: 150%"><span lang="EN-US" style="mso-bidi-font-family: arial">4. Technical characteristics</span></p> <p style="line-height: 150%"><span lang="EN-US" style="mso-bidi-font-family: arial">(1) Row-based layout:</span></p> <p style="line-height: 150%"><span lang="EN-US" style="mso-bidi-font-family: arial">GB2312 arranges its characters into rows (区), each holding 94 cells (位) for Hanzi or symbols. A character's row number and cell number together form its row-cell code (区位码).</span></p> <p style="line-height: 150%"><span 
style="mso-bidi-font-family: arial">The rows are allocated as follows: rows 01-09 hold special symbols; rows 16-55 hold the level-1 Hanzi, sorted by pinyin; rows 56-87 hold the level-2 Hanzi, sorted by radical and stroke count; rows 10-15 and 88-94 are unassigned.</span></p> <p style="line-height: 150%"><span style="mso-bidi-font-family: arial">(2) Two-byte representation</span></p> <p style="line-height: 150%"><span style="mso-bidi-font-family: arial">Of the two bytes, the one that comes first is the first byte and the one that follows is the second byte. By convention the first byte is called the "high byte" and the second the "low byte".</span></p> <p style="line-height: 150%"><span style="mso-bidi-font-family: arial">The high byte uses 0xA1-0xF7 (the row numbers 01-87 plus 0xA0), and the low byte uses 0xA1-0xFE (the cell numbers 01-94 plus 0xA0).</span></p> <p style="line-height: 150%"><span style="mso-bidi-font-family: arial">5. Encoding example</span></p> <p style="line-height: 150%"><span style="mso-bidi-font-family: arial">Take "啊", the first Hanzi in GB2312, as an example. Its row number is 16 and its cell number is 01, so its row-cell code is 1601. In most programs 0xA0 is added to each of the two numbers, giving 0xB0A1, the code that programs actually process. The calculation is 0xB0 = 0xA0 + 16 and 0xA1 = 0xA0 + 1.<span 
lang="EN-US"><o:p></o:p></span></span></p> <p style="line-height: 150%"><span lang="EN-US" style="mso-bidi-font-family: arial">BIG5 </span><span style="mso-bidi-font-family: arial">字符集<span lang="EN-US"><o:p></o:p></span></span></p> <p style="line-height: 150%"><span lang="EN-US" style="mso-bidi-font-family: arial">1</span><span style="mso-bidi-font-family: arial">．名称的由来<span lang="EN-US"><o:p></o:p></span></span></p> <p style="line-height: 150%"><span style="mso-bidi-font-family: arial">又称大<st1:chmetcnv w:st="on" tcsc="1" numbertype="3" negative="False" hasspace="False" sourcevalue="5" unitname="码">五码</st1:chmetcnv>或五大码，<span lang="EN-US">1984</span>年由台湾财团法人信息工业策进会和五间软件公司宏碁<span lang="EN-US"> (Acer)</span>、神通<span lang="EN-US"> (MiTAC)</span>、佳佳、零壹<span lang="EN-US"> (Zero One)</span>、大众<span lang="EN-US"> (FIC)</span>创立，故称大<st1:chmetcnv w:st="on" tcsc="1" numbertype="3" negative="False" hasspace="False" sourcevalue="5" unitname="码">五码</st1:chmetcnv>。<span lang="EN-US"><o:p></o:p></span></span></p> <p style="line-height: 150%"><span lang="EN-US" style="mso-bidi-font-family: arial">Big<st1:chmetcnv w:st="on" tcsc="0" numbertype="1" negative="False" hasspace="False" sourcevalue="5" unitname="码">5<span lang="EN-US"><span lang="EN-US">码</span></span></st1:chmetcnv><span lang="EN-US">的产生，是因为当时台湾不同厂商各自推出不同的编码，如倚天码、IBM PS55</span></span><span style="mso-bidi-font-family: arial">、王安码等，彼此不能兼容；另一方面，台湾政府当时尚未推出官方的汉字编码，而中国大陆的<span lang="EN-US">GB2312</span>编码亦未有收录繁体中文字。<span lang="EN-US"><o:p></o:p></span></span></p> <p style="line-height: 150%"><span lang="EN-US" style="mso-bidi-font-family: arial">2</span><span style="mso-bidi-font-family: arial">．特点<span lang="EN-US"><o:p></o:p></span></span></p> <p style="line-height: 150%"><span lang="EN-US" style="mso-bidi-font-family: arial">Big5</span><span style="mso-bidi-font-family: arial">字符集共收录<span lang="EN-US">13,053</span>个中文字，该字符集在中国台湾使用。耐人寻味的是该字符集重复地收录了两个相同的字：<span lang="EN-US">“</span>兀<span 
lang="EN-US">”(0xA461</span>及<span lang="EN-US">0xC<st1:chmetcnv w:st="on" tcsc="0" numbertype="1" negative="False" hasspace="False" sourcevalue="94" unitname="a">94A</st1:chmetcnv>)</span>、<span lang="EN-US">“</span>嗀<span lang="EN-US">”(0xDCD1</span>及<span lang="EN-US">0xDDFC)</span>。<span lang="EN-US"><o:p></o:p></span></span></p> <p style="line-height: 150%"><span lang="EN-US" style="mso-bidi-font-family: arial">3</span><span style="mso-bidi-font-family: arial">．字符编码方法<span lang="EN-US"><o:p></o:p></span></span></p> <p style="line-height: 150%"><span lang="EN-US" style="mso-bidi-font-family: arial">Big<st1:chmetcnv w:st="on" tcsc="0" numbertype="1" negative="False" hasspace="False" sourcevalue="5" unitname="码">5<span lang="EN-US"><span lang="EN-US">码</span></span></st1:chmetcnv><span lang="EN-US">使用了双字节储存方法，以两个字节来编码一个字。第一个字节称为“</span></span><span style="mso-bidi-font-family: arial">高位字节<span lang="EN-US">”</span>，第二个字节称为<span lang="EN-US">“</span>低位字节<span lang="EN-US">”</span>。高位字节的编码范围<span lang="EN-US">0xA1-0xF9</span>，低位字节的编码范围<span lang="EN-US">0x40-0x7E</span>及<span lang="EN-US">0xA1-0xFE</span>。<span lang="EN-US"><o:p></o:p></span></span></p> <p style="line-height: 150%"><span style="mso-bidi-font-family: arial">各编码范围对应的字符类型如下：<span lang="EN-US">0xA140-0xA3BF</span>为标点符号、希腊字母及特殊符号，另外于<span lang="EN-US">0xA259-0xA261</span>，存放了双音节度量衡单位用字：兙兛兞兝兡兣嗧瓩糎；<span lang="EN-US">0xA440-0xC67E</span>为常用汉字，先按笔划再按部首排序；<span lang="EN-US">0xC940-0xF9D5</span>为次常用汉字，亦是先按笔划再按部首排序。<span lang="EN-US"><o:p></o:p></span></span></p> <p style="line-height: 150%"><span lang="EN-US" style="mso-bidi-font-family: arial">4</span><span style="mso-bidi-font-family: arial">．<span lang="EN-US">Big5 </span>的局限性<span lang="EN-US"><o:p></o:p></span></span></p> <p style="line-height: 150%"><span style="mso-bidi-font-family: arial">尽管<span lang="EN-US">Big<st1:chmetcnv w:st="on" tcsc="0" numbertype="1" negative="False" hasspace="False" sourcevalue="5" unitname="码">5<span lang="EN-US"><span 
lang="EN-US">码</span></span></st1:chmetcnv><span lang="EN-US">内包含一万多个字符，但是没有考虑社会上流通的人名、地名用字、方言用字、化学及生物科等用字，没有包含日文平假名及片假名字母。<o:p></o:p></span></span></span></p> <p style="line-height: 150%"><span style="mso-bidi-font-family: arial">例如台湾视<span lang="EN-US">“</span>着<span lang="EN-US">”</span>为<span lang="EN-US">“</span>著<span lang="EN-US">”</span>的异体字，故没有收录<span lang="EN-US">“</span>着<span lang="EN-US">”</span>字。康熙字典中的一些部首用字<span lang="EN-US">(</span>如<span lang="EN-US">“</span>亠<span lang="EN-US">”</span>、<span lang="EN-US">“</span>疒<span lang="EN-US">”</span>、<span lang="EN-US">“</span>辵<span lang="EN-US">”</span>、<span lang="EN-US">“</span>癶<span lang="EN-US">”</span>等<span lang="EN-US">)</span>、常见的人名用字<span lang="EN-US">(</span>如<span lang="EN-US">“</span>堃<span lang="EN-US">”</span>、<span lang="EN-US">“</span>煊<span lang="EN-US">”</span>、<span lang="EN-US">“</span>栢<span lang="EN-US">”</span>、<span lang="EN-US">“</span>喆<span lang="EN-US">”</span>等<span lang="EN-US">) </span>也没有收录到<span lang="EN-US">Big5</span>之中。<span lang="EN-US"><o:p></o:p></span></span></p> <p style="line-height: 150%"><span lang="EN-US" style="mso-bidi-font-family: arial">GB18030 </span><span style="mso-bidi-font-family: arial">字符集<span lang="EN-US"><o:p></o:p></span></span></p> <p style="line-height: 150%"><span lang="EN-US" style="mso-bidi-font-family: arial">1</span><span style="mso-bidi-font-family: arial">．名称的由来<span lang="EN-US"><o:p></o:p></span></span></p> <p style="line-height: 150%"><span lang="EN-US" style="mso-bidi-font-family: arial">GB 18030</span><span style="mso-bidi-font-family: arial">的全称是<span lang="EN-US">GB18030-2000</span>《信息交换用汉字编码字符集基本集的扩充》，是我国政府于<st1:chsdate w:st="on" isrocdate="False" islunardate="False" day="17" month="3" year="2000"><span lang="EN-US">2000</span>年<span lang="EN-US">3</span>月<span lang="EN-US">17</span>日</st1:chsdate>发布的新的汉字编码国家标准，<st1:chsdate w:st="on" isrocdate="False" islunardate="False" day="31" month="8" year="2001"><span 
lang="EN-US">After 31 August 2001</span></st1:chsdate>, software released on the Chinese market must conform to this standard.</span></p> <p style="line-height: 150%"><span style="mso-bidi-font-family: arial">2. Features</span></p> <p style="line-height: 150%"><span style="mso-bidi-font-family: arial">The GB 18030 standard was produced after broad participation and review, including well-known IT companies in China and abroad, and was jointly implemented by the Ministry of Information Industry and the former State Bureau of Quality and Technical Supervision.</span></p> <p style="line-height: 150%"><span style="mso-bidi-font-family: arial">The GB 18030 standard addresses computer encoding for a large character set made up of hanzi, Japanese kana, Korean, and the scripts of China's ethnic minorities. Its total code space exceeds 1.5 million code positions, and it includes 27484 hanzi, covering Chinese, Japanese, Korean and Chinese minority scripts. It meets the needs of information interchange across East Asia (mainland China, Hong Kong, Taiwan, Japan and Korea) for multiple scripts, large repertoires, multiple uses and a unified encoding format. It is compatible with Unicode 3.0 and includes the content of the Unicode extension block “CJK Unified Ideographs Extension A”. It is also compatible with the earlier national character encoding standards (GB2312 and GB13000.1).</span></p> <p style="line-height: 150%"><span style="mso-bidi-font-family: arial">3. Encoding method</span></p> <p style="line-height: 150%"><span style="mso-bidi-font-family: arial">GB 18030 encodes characters in three ways: single-byte, double-byte and four-byte. The single-byte part uses codes 0x00 to 0x7F (the same codes as ASCII). In the double-byte part, the lead byte ranges over 0x81 to 0xFE and the trail byte ranges over 0x40 to <span 
lang="EN-US">0x7E</span> and 0x80 to 0xFE. The four-byte part uses 0x30 to 0x39, values not used by GB/T 11383, as a suffix that extends the double-byte encoding; the resulting four-byte codes range from 0x81308130 to 0xFE39FE39. The first and third bytes range over 0x81 to 0xFE, and the second and fourth bytes over 0x30 to 0x39.</span></p> <p style="line-height: 150%"><span style="mso-bidi-font-family: arial">4. Contents</span></p> <p style="line-height: 150%"><span style="mso-bidi-font-family: arial">The double-byte part mainly includes all 20902 CJK hanzi of GB13000.1, the related punctuation, 13 ideographic description characters, 80 supplementary hanzi and radicals/components, and a double-byte euro sign. The four-byte part includes all characters of GB 13000.1 outside the double-byte part, including CJK Unified Ideographs Extension A.</span></p> <p style="line-height: 150%"><span style="mso-bidi-font-family: arial">The Unicode character set</span></p> <p style="line-height: 150%"><span style="mso-bidi-font-family: arial">1. Origin of the name</span></p> <p style="line-height: 150%"><span style="mso-bidi-font-family: arial">The Unicode character set corresponds to the Universal Multiple-Octet Coded Character Set (UCS, the title of ISO/IEC 10646). It is a character encoding system developed by an organization called the Unicode Consortium, and it supports the interchange, processing and display of written text in the world's many languages. Development of the encoding began in 1990; <span 
lang="EN-US">in 1994</span> it was officially published, and the latest version at the time of writing was Unicode 4.1.0, dated 31 March 2005.</span></p> <p style="line-height: 150%"><span style="mso-bidi-font-family: arial">2. Characteristics</span></p> <p style="line-height: 150%"><span style="mso-bidi-font-family: arial">Unicode is a character encoding used on computers. It assigns a single, unique binary code to every character of every language, to meet the needs of converting and processing text across languages and platforms.</span></p> <p style="line-height: 150%"><span style="mso-bidi-font-family: arial">3. Encoding method</span></p> <p style="line-height: 150%"><span style="mso-bidi-font-family: arial">The Unicode standard always uses hexadecimal numbers, written with the prefix “U+”. For example, the code of the letter “A” is 0041<sub>16</sub> and the code of the character “€” is 20AC<sub>16</sub>, so the code of “A” is written “U+0041”.</span></p> <p style="line-height: 150%"><span style="mso-bidi-font-family: arial">4. UTF-8 encoding<br>UTF-8 is <span 
lang="EN-US">a Unicode</span> encoding form, one of the ways Unicode is used. UTF stands for Unicode Transformation Format, that is, a format into which Unicode is transformed.</span></p> <p style="line-height: 150%"><span style="mso-bidi-font-family: arial">UTF-8 makes it easy for different computers to exchange text in different languages over a network, and lets Unicode text, which is otherwise often handled in 16-bit units, pass correctly through existing systems built around single bytes.</span></p> <p style="line-height: 150%"><span style="mso-bidi-font-family: arial">UTF-8 stores Unicode characters in a variable number of bytes: ASCII letters still take 1 byte; accented letters, Greek and Cyrillic take 2 bytes; commonly used hanzi take 3 bytes; supplementary-plane characters take 4 bytes.</span></p> <p style="line-height: 150%"><span style="mso-bidi-font-family: arial">5. UTF-16 and UTF-32 encoding<br>UTF-32, UTF-16 and UTF-8 are character encoding schemes for the coded character set of the Unicode standard. UTF-16 encodes a Unicode code point as a sequence of one or two 16-bit code units; UTF-32 represents every Unicode code point as a 32-bit integer of the same value.<br>========================================================<br>What are Unicode, GB2312, GBK, ANSI and UTF?</span></p> <p style="line-height: 150%"><span style="mso-bidi-font-family: arial">Evolution: ASCII → GB2312 (Big5) → GBK → GB18030</span></p> <p style="line-height: 150%"><span style="mso-bidi-font-family: arial">Characters must be encoded before a computer can process them. The default encoding a computer uses is its internal code. Early computers used <span 
lang="EN-US">7-bit ASCII</span>; to handle Chinese characters, programmers designed GB2312 for Simplified Chinese and Big5 for Traditional Chinese.</span></p> <p style="line-height: 150%"><span style="mso-bidi-font-family: arial">GB2312 (1980) contains 7445 characters in total: 6763 hanzi and 682 other symbols. In the hanzi area the high byte runs from B0 to F7 and the low byte from A1 to FE, giving 72*94=6768 code positions, of which five (D7FA-D7FE) are unassigned.</span></p> <p style="line-height: 150%"><span style="mso-bidi-font-family: arial">GB2312 supports too few hanzi. The 1995 hanzi extension specification GBK 1.0 contains 21886 symbols, divided into a hanzi area and a graphic symbol area. The hanzi area contains 21003 characters.</span></p> <p style="line-height: 150%"><span style="mso-bidi-font-family: arial">From ASCII through GB2312 to GBK, these encodings are backward compatible: the same character always has the same code in each scheme, and later standards support more characters. English and Chinese can be handled uniformly in these encodings. A Chinese character is recognized by the most significant bit of its high byte being non-zero. In programmers' terms, GB2312 and GBK are double-byte character sets (DBCS).</span></p> <p style="line-height: 150%"><span style="mso-bidi-font-family: arial">GB18030 (2000) is the official national standard that replaces GBK 1.0. It contains 27484 hanzi and also includes the major minority scripts such as Tibetan, Mongolian and Uyghur. In terms of hanzi repertoire, GB18030 starts from the 20902 hanzi of GB13000.1 and adds the CJK Extension A set of <span 
lang="EN-US">6582</span> hanzi (Unicode 0x3400-0x4DB5), for a total of 27484 hanzi.</span></p> <p style="line-height: 150%"><span style="mso-bidi-font-family: arial">CJK stands for Chinese, Japanese and Korean. To save code positions, Unicode encodes the characters of these three languages in a unified way. GB13000.1 is the Chinese edition of ISO/IEC 10646-1 and is equivalent to Unicode 1.1.</span></p> <p style="line-height: 150%"><span style="mso-bidi-font-family: arial">GB18030 uses single-byte, double-byte and four-byte codes. The single-byte and double-byte codes are fully compatible with GBK. The four-byte code positions are where the 6582 hanzi of CJK Extension A are placed. For example, UCS 0x3400 is encoded in GB18030 as 8139EF30, and UCS 0x3401 as 8139EF31.</span></p> <p style="line-height: 150%"><span style="mso-bidi-font-family: arial">Microsoft provided a GB18030 upgrade package, but it only supplies a new font, 新宋体-18030, that supports the 6582 hanzi of CJK Extension A; it does not change the internal code. The internal code of Windows is still GBK.</span></p> <p style="line-height: 150%"><span style="mso-bidi-font-family: arial">The encodings from ASCII, GB2312 and GBK to GB18030 are backward compatible, whereas Unicode is compatible only with ASCII.</span></p> <p 
style="line-height: 150%"><span style="mso-bidi-font-family: arial">Unicode is also a character encoding, but one designed by an international organization to accommodate every written language in the world. Unicode is the bridge for encoding conversion in Java, which uses a set of stream filters to bridge Unicode text and text in the local operating system's encoding (the internal code, such as GBK on Windows). All of these classes derive from the abstract classes Reader and Writer. More on this later.</span></p> <p style="line-height: 150%"><span style="mso-bidi-font-family: arial">Because a large body of existing programs and documents use a language-specific encoding such as GBK, Windows cannot drop support for existing encodings and switch entirely to Unicode. We call GBK the internal code of Windows. Windows uses code pages to adapt to different countries and regions; a code page can be understood as an internal code. The code page corresponding to GBK is CP936.</span></p> <p style="line-height: 150%"><span lang="EN-US" style="mso-bidi-font-family: arial">What is UCS?</span></p> <p style="line-height: 150%"><span style="mso-bidi-font-family: arial">The formal name behind Unicode is “Universal Multiple-Octet Coded Character Set”, abbreviated UCS. UCS can also be read as short for “Unicode Character Set”.</span></p> <p style="line-height: 150%"><span style="mso-bidi-font-family: arial">UCS comes in two forms, UCS-2 and UCS-4. As the names suggest, UCS-2 encodes with two bytes and UCS-4 uses <span 
lang="EN-US">4</span> bytes (in practice only 31 bits are used; the most significant bit must be 0).</span></p> <p style="line-height: 150%"><span style="mso-bidi-font-family: arial">What is UTF?</span></p> <p style="line-height: 150%"><span style="mso-bidi-font-family: arial">UTF is short for Unicode (or UCS) Transformation Format. For 16-bit Unicode characters, the UTF-8 form is defined as follows:</span></p> <p style="line-height: 150%"><span style="mso-bidi-font-family: arial">（1）If the top 9 bits of a 16-bit Unicode character are 0, it is represented in one byte whose first bit is “0” and whose remaining 7 bits equal the low 7 bits of the original character. For example, “\u0034” (0000 0000 0011 0100) is represented as “34” (0011 0100), the same value as the source character.</span></p> <p style="line-height: 150%"><span style="mso-bidi-font-family: arial">（2）If the top 5 bits of a 16-bit Unicode character are 0, it is represented in 2 bytes. The first byte starts with “110”, followed by the 5 highest bits that remain after dropping the 5 leading zeros; the second byte starts with “10”, followed by the low 6 bits of the source character. For example, “\u025d” (0000 0010 0101 1101) becomes “c99d” (1100 1001 1001 1101).</span></p> <p style="line-height: 150%"><span style="mso-bidi-font-family: arial">（<span 
lang="EN-US">3</span>）Otherwise, three bytes are used. The first byte starts with “1110” followed by the top four bits of the source character; the second byte starts with “10” followed by the middle six bits; the third byte starts with “10” followed by the low six bits. For example, “\u9da7” (1001 1101 1010 0111) becomes “e9b6a7” (1110 1001 1011 0110 1010 0111).</span></p> <p style="line-height: 150%"><span style="mso-bidi-font-family: arial">How UCS and UTF relate</span></p> <p style="line-height: 150%"><span style="mso-bidi-font-family: arial">UTF-8 encodes UCS in units of 8 bits.</span></p> <p style="line-height: 150%"><span style="mso-bidi-font-family: arial">UTF-16 encodes UCS in units of 16 bits.</span></p> <p style="line-height: 150%"><span style="mso-bidi-font-family: arial">Big endian and little endian</span></p> <p style="line-height: 150%"><span style="mso-bidi-font-family: arial">Big endian and little endian are the two ways a CPU lays out multi-byte numbers. For example, <span 
lang="EN-US">the</span> character “汉” has the Unicode code point 6C49. When it is written to a file, does 6C go first, or 49? Writing 6C first is big endian; writing 49 first is little endian.</span></p> <p style="line-height: 150%"><span style="mso-bidi-font-family: arial">The word “endian” comes from Gulliver's Travels. The civil war in Lilliput arose over whether an egg should be cracked at the big end (Big-Endian) or the little end (Little-Endian); this led to six rebellions, in which one emperor lost his life and another his crown.</span></p> <p style="line-height: 150%"><span style="mso-bidi-font-family: arial">In Chinese, “endian” is usually rendered as “字节序” (byte order), and big endian and little endian as “大尾” and “小尾”.<br>=================================================<br>GB2312 is a subset of GBK, and GBK is a subset of GB18030 <br>GBK is a large character set that includes Chinese, Japanese and Korean characters <br>For a Chinese website, GB2312 is recommended; GBK still causes occasional problems <br>To avoid garbled text altogether, use UTF-8, which also makes later internationalization easy <br>UTF-8 can be regarded as a large character set covering most scripts.<span lang="EN-US"> 
<br></span>One benefit of UTF-8 is that users in other regions (such as Hong Kong and Taiwan) can read your text without garbling and without installing Simplified Chinese support.</span></p> <p style="line-height: 150%"><span style="mso-bidi-font-family: arial">Entry: UTF8 <br>UTF8 is not really a character encoding in its own right but a format for storage and transmission. As noted above, each Unicode/UCS character is stored in 2 or 4 bytes. Compare the following:</span></p> <p style="line-height: 150%"><span style="mso-bidi-font-family: arial">　　Taking "I am Chinese" as an example<br>　　　Stored as ANSI: 12 bytes<br>　　　Stored as Unicode/UCS2: 24 bytes + 2 bytes (header)<br>　　　Stored as UCS4: 48 bytes + 4 bytes (header)</span></p> <p style="line-height: 150%"><span style="mso-bidi-font-family: arial">　　Taking "我是中国人" as an example<br>　　　Stored as ANSI: 10 bytes<br>　　　Stored as Unicode/UCS2: 10 bytes + 2 bytes (header)<br>　　　Stored as UCS4: 20 bytes + 4 bytes (header)</span></p> <p style="line-height: 150%"><span style="mso-bidi-font-family: arial">　　This shows that storing text directly in raw Unicode/UCS form is very wasteful and is also unfavourable for transmission over the Internet (Chinese comes off slightly better ^_^).</span></p> <p style="line-height: 150%"><span style="mso-bidi-font-family: arial">　　For this reason UTF8, a more compact form of Unicode/UCS, was introduced. To quote the first sentence of the official site: 『<span lang="EN-US">UTF-8 stands for Unicode Transformation Format-8. 
It is an octet (8-bit) lossless encoding of Unicode characters.</span>』 Because UTF also applies to encoding UCS, it can also be called 『UCS transformation formats (UTF)』.</span></p> <p style="line-height: 150%"><span style="mso-bidi-font-family: arial">　　UTF8 uses 8 bits, that is 1 byte, as its basic code unit. There are also forms based on 16 bits and 32 bits, called UTF16 and UTF32, but at the time of writing they were not widely used, whereas UTF8 is widely used for file storage and network transmission.</span></p> <p style="line-height: 150%"><span style="mso-bidi-font-family: arial"><br>Encoding principle</span></p> <p style="line-height: 150%"><span style="mso-bidi-font-family: arial">First look at this template:</span></p> <p style="line-height: 150%"><span lang="EN-US" style="mso-bidi-font-family: arial">UCS-4 range (hex.) UTF-8 octet sequence (binary)<br>0000 0000-0000 007F 0xxxxxxx<br>0000 0080-0000 07FF 110xxxxx 10xxxxxx<br>0000 0800-0000 FFFF 1110xxxx 10xxxxxx 10xxxxxx<br>0001 0000-001F FFFF 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx<br>0020 0000-03FF FFFF 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx<br>0400 0000-7FFF FFFF 1111110x 10xxxxxx ... 
10xxxxxx</span></p> <p style="line-height: 150%"><span style="mso-bidi-font-family: arial">Encoding steps:<br>1) First determine how many octets (8-bit units) are needed<br>2) Fill in the high bits of each octet according to the template above<br>3) Fill the character's bits into the x positions, working from the character's least significant bit to its most significant, and in the UTF8 sequence from the last x of the last octet to the first x of the first octet<br>4) Decoding works the same way in reverse.</span></p> <p style="line-height: 150%"><span style="mso-bidi-font-family: arial">Examples (in the original, each bit is colour-coded and the template bits are in bold):</span></p> <p style="line-height: 150%"><span lang="EN-US" style="mso-bidi-font-family: arial">UCS-4 (hex) | UCS-4 (binary) | bytes | UTF-8 (binary) | UTF-8 (hex) | bytes<br>0000 000A | 00001010 | 4 | 00001010 | 0A | 1<br>0000 0099 | 10011001 | 4 | 11000010 10011001 | C2 99 | 2<br>0000 8D99 | 10001101 10011001 | 4 | 11101000 10110110 10011001 | E8 B6 99 | 3</span></p> <p style="line-height: 150%"><span style="mso-bidi-font-family: arial">　　Whether or not this is all clear does not matter much: you never have to compute it by hand, since software does the work.</span></p> <p style="line-height: 150%"><span style="mso-bidi-font-family: arial">　　A file stored as UTF8 may begin with the signature EF BB BF (the byte order mark).</span></p> <p style="line-height: 150%"><span lang="EN-US" style="mso-bidi-font-family: arial"><br></span><span style="mso-bidi-font-family: 
arial">Efficiency</span></p> <p style="line-height: 150%"><span style="mso-bidi-font-family: arial">　　The conclusions from the encoding principle above are:<br>　　　1. Each English letter or digit takes 1 byte;<br>　　　2. Letters of other European languages, including Slavic ones, take 2 bytes;<br>　　　3. Hanzi take 3 bytes.</span></p> <p style="line-height: 150%"><span style="mso-bidi-font-family: arial">　　So UTF8 is a very attractive scheme for English but less economical for Chinese: encoded in either ANSI or Unicode/UCS2 a hanzi takes only 2 bytes, but UTF8 needs 3 bytes.</span></p> <p style="line-height: 150%"><span style="mso-bidi-font-family: arial">　　Below are some statistics on the average number of bytes per character when a file is stored as UTF8:<br>　　　1. Latin-script languages average 1.1 bytes;<br>　　　2. Greek, Russian, Arabic and Hebrew average 1.7 bytes;<br>　　　3. Most other scripts, such as Chinese, Japanese, Korean and Hindi, use about 3 bytes;<br>　　　4. Only very rarely used characters and symbols take 4 bytes or more.</span></p> <p style="line-height: 150%"><span style="mso-bidi-font-family: arial">Entry: GB2312<br>Characters must be encoded before a computer can process them. The default encoding a computer uses is its internal code. Early computers used 7-bit ASCII; to handle Chinese characters, programmers designed GB2312 for Simplified Chinese and Big5 for Traditional Chinese.</span></p> <p style="line-height: 150%"><span lang="EN-US" 
style="mso-bidi-font-family: arial">GB2312 (1980) contains 7445 characters in total: 6763 hanzi and 682 other symbols. In the hanzi area the high byte runs from B0 to F7 and the low byte from A1 to FE, giving 72*94=6768 code positions, of which five (D7FA-D7FE) are unassigned.</span></p> <p style="line-height: 150%"><span style="mso-bidi-font-family: arial">GB2312 supports too few hanzi. The 1995 hanzi extension specification GBK 1.0 contains 21886 symbols, divided into a hanzi area and a graphic symbol area. The hanzi area contains 21003 characters. GB18030 (2000) is the official national standard that replaces GBK 1.0. It contains 27484 hanzi and also includes the major minority scripts such as Tibetan, Mongolian and Uyghur. PC platforms are now required to support GB18030, while embedded products are not yet subject to the requirement, so mobile phones and MP3 players generally support only GB2312.</span></p> <p style="line-height: 150%"><span style="mso-bidi-font-family: arial">From ASCII through GB2312 and GBK to GB18030, these encodings are backward compatible: the same character always has the same code in each scheme, and later standards support more characters. English and Chinese can be handled uniformly in these encodings. A Chinese character is recognized by the most significant bit of its high byte being non-zero. In programmers' terms, GB2312, GBK and GB18030 all belong to the double-byte character sets (DBCS), although GB18030 also has four-byte codes.</span></p> <p style="line-height: 150%"><span style="mso-bidi-font-family: arial">Some Chinese versions of Windows still default to GBK as the internal code and can be upgraded to GB18030 with the GB18030 upgrade package. However, the characters GB18030 adds over GBK are rarely needed by ordinary users, so we usually still use <span 
lang="EN-US">GBK</span> to refer to the internal code of Chinese Windows.</span></p> <p style="line-height: 150%"><span style="mso-bidi-font-family: arial">A few more details:</span></p> <p style="line-height: 150%"><span style="mso-bidi-font-family: arial">GB2312 was originally defined in terms of row-cell codes (区位码); to go from a row-cell code to the internal code, add A0 to both the high byte and the low byte.</span></p> <p style="line-height: 150%"><span style="mso-bidi-font-family: arial">In DBCS, GB internal codes are always stored big endian, that is, high byte first.</span></p> <p style="line-height: 150%"><span style="mso-bidi-font-family: arial">In GB2312 the most significant bit of both bytes is 1, but only 128*128=16384 code positions satisfy that condition. So in GBK and GB18030 the most significant bit of the low byte may not be 1. This does not affect parsing of a DBCS character stream: when reading the stream, whenever a byte with its high bit set is encountered, that byte and the next can be taken together as one double-byte code, regardless of the high bit of the low byte.</span></p> <p class="MsoNormal"><span lang="EN-US" style="font-size: 12pt; font-family: 宋体; mso-bidi-font-family: arial"><o:p>&nbsp;</o:p></span></p><img src ="http://www.cppblog.com/woaidongmao/aggbug/66252.html" width = "1" height = "1" /><br><br><div align=right><a style="text-decoration:none;" href="http://www.cppblog.com/woaidongmao/" target="_blank">肥仔</a> 2008-11-07 22:43 <a href="http://www.cppblog.com/woaidongmao/archive/2008/11/07/66252.html#Feedback" target="_blank" style="text-decoration:none;">Post a comment</a></div>]]></description></item><item><title>How VC/C++ handles Chinese characters</title><link>http://www.cppblog.com/woaidongmao/archive/2008/11/07/66250.html</link><dc:creator>肥仔</dc:creator><author>肥仔</author><pubDate>Fri, 07 Nov 2008 14:39:00 
GMT</pubDate><guid>http://www.cppblog.com/woaidongmao/archive/2008/11/07/66250.html</guid><wfw:comment>http://www.cppblog.com/woaidongmao/comments/66250.html</wfw:comment><comments>http://www.cppblog.com/woaidongmao/archive/2008/11/07/66250.html#Feedback</comments><slash:comments>0</slash:comments><wfw:commentRss>http://www.cppblog.com/woaidongmao/comments/commentRss/66250.html</wfw:commentRss><trackback:ping>http://www.cppblog.com/woaidongmao/services/trackbacks/66250.html</trackback:ping><description><![CDATA[<p class="MsoNormal" style="background: white; margin: 0cm 6pt 0pt; word-break: break-all; line-height: 20pt; text-align: left; mso-pagination: widow-orphan" align="left"><span style="font-size: 12pt; color: black; font-family: 宋体; mso-bidi-font-family: 宋体; mso-font-kerning: 0pt">How do you convert a Chinese character into integers, and how do you turn those integers back into the character?<span lang="EN-US"><?xml:namespace prefix = o /><o:p></o:p></span></span></p> <p class="MsoNormal" style="background: white; margin: 0cm 6pt 0pt; word-break: break-all; line-height: 20pt; text-align: left; mso-pagination: widow-orphan" align="left"><span lang="EN-US" style="font-size: 12pt; color: black; font-family: 宋体; mso-bidi-font-family: 宋体; mso-font-kerning: 0pt">char * str="汉字"; BYTE *pstr=(BYTE*)str; BYTE B=pstr[i]; </span><span style="font-size: 12pt; color: black; font-family: 宋体; mso-bidi-font-family: 宋体; mso-font-kerning: 0pt">B is then the integer value of byte i.</span></p> <p class="MsoNormal" style="background: white; margin: 0cm 6pt 0pt; word-break: break-all; line-height: 20pt; text-align: left; mso-pagination: widow-orphan" align="left"><span style="font-size: 12pt; color: red; font-family: 宋体; 
mso-bidi-font-family: 宋体; mso-font-kerning: 0pt">I. The problem</span><span lang="EN-US" style="font-size: 12pt; color: black; font-family: 宋体; mso-bidi-font-family: 宋体; mso-font-kerning: 0pt"><o:p></o:p></span></p> <p class="MsoNormal" style="background: white; margin: 0cm 6pt 0pt; layout-grid-mode: char; word-break: break-all; text-indent: 21pt; line-height: 20pt; text-align: left; mso-pagination: widow-orphan" align="left"><span style="font-size: 12pt; color: black; font-family: 宋体; mso-bidi-font-family: 宋体; mso-font-kerning: 0pt">The code<span lang="EN-US"> wchar_t a[3]=L"</span>中国<span lang="EN-US">"</span> fails to compile, and the error reported is an array bounds overflow. Yet <span lang="EN-US">wchar_t </span>is a wide character type (2 bytes in VC), so array <span lang="EN-US">a</span> should be <span lang="EN-US">6</span> bytes; the <span lang="EN-US">Unicode</span> codes of two Chinese characters take <span lang="EN-US">4</span> bytes, and with the terminator that is at most <span lang="EN-US">6</span> bytes, so there should be no overflow.<span lang="EN-US"><o:p></o:p></span></span></p> <p class="MsoNormal" style="background: white; margin: 0cm 6pt 0pt; word-break: break-all; line-height: 20pt; text-align: left; mso-pagination: widow-orphan" align="left"><span style="font-size: 12pt; color: black; font-family: 宋体; mso-bidi-font-family: 宋体; mso-font-kerning: 0pt">Is the compiler at fault?<span lang="EN-US"><o:p></o:p></span></span></p> <p class="MsoNormal" style="background: white; margin: 0cm 6pt 0pt; word-break: break-all; line-height: 20pt; text-align: left; mso-pagination: widow-orphan" align="left"><span style="font-size: 12pt; color: red; font-family: 宋体; mso-bidi-font-family: 宋体; mso-font-kerning: 0pt">II. Background needed to solve the problem</span><span lang="EN-US" style="font-size: 12pt; color: black; font-family: 宋体; mso-bidi-font-family: 宋体; mso-font-kerning: 0pt"><o:p></o:p></span></p> <p class="MsoNormal" style="background: white; margin: 0cm 6pt 0pt; word-break: break-all; line-height: 20pt; text-align: left; mso-pagination: widow-orphan" align="left"><span lang="EN-US" style="font-size: 12pt; color: red; font-family: 宋体; mso-bidi-font-family: 宋体; mso-font-kerning: 0pt">&nbsp;&nbsp; 
</span><span style="font-size: 12pt; color: black; font-family: 宋体; mso-bidi-font-family: 宋体; mso-font-kerning: 0pt">Two areas of knowledge are needed. The first is character encoding, Chinese character encoding in particular, together with how languages and tools support it. The second is how memory is allocated for the <span lang="EN-US">MultiByte Character Set </span>and the<span lang="EN-US"> Wide Character Set</span> in <span lang="EN-US">vc/c++</span>.<span lang="EN-US"><o:p></o:p></span></span></p> <p class="MsoNormal" style="background: white; margin: 0cm 6pt 0pt; word-break: break-all; line-height: 20pt; text-align: left; mso-pagination: widow-orphan" align="left"><span style="font-size: 12pt; color: red; font-family: 宋体; mso-bidi-font-family: 宋体; mso-font-kerning: 0pt">III. Chinese character encodings and how <span lang="EN-US">vc/c++</span> handles them</span><span lang="EN-US" style="font-size: 12pt; color: black; font-family: 宋体; mso-bidi-font-family: 宋体; mso-font-kerning: 0pt"><o:p></o:p></span></p> <p class="MsoNormal" style="background: white; margin: 0cm 6pt 0pt; word-break: break-all; line-height: 20pt; text-align: left; mso-pagination: widow-orphan" align="left"><span lang="EN-US" style="font-size: 12pt; color: blue; font-family: 宋体; mso-bidi-font-family: 宋体; mso-font-kerning: 0pt">1.</span><span style="font-size: 12pt; color: blue; font-family: 宋体; mso-bidi-font-family: 宋体; mso-font-kerning: 0pt"> Introduction to Chinese character encodings</span><span lang="EN-US" style="font-size: 12pt; color: black; font-family: 宋体; mso-bidi-font-family: 宋体; mso-font-kerning: 0pt"><o:p></o:p></span></p> <p class="MsoNormal" style="background: white; margin: 0cm 6pt 0pt; layout-grid-mode: char; word-break: break-all; text-indent: 24pt; line-height: 20pt; text-align: left; mso-pagination: widow-orphan" align="left"><span style="font-size: 12pt; color: black; font-family: 宋体; mso-bidi-font-family: 宋体; mso-font-kerning: 0pt">For English text, the characters of the <span lang="EN-US">7</span>-bit <span lang="EN-US">ASCII</span> set are sufficient, and entering and displaying them on a computer is simple. English characters can therefore use one and the same code (such as <span lang="EN-US">ASCII</span>) for input, storage, internal processing, and output.<span lang="EN-US"><o:p></o:p></span></span></p> <p class="MsoNormal" style="background: white; margin: 
0cm 6pt 0pt; layout-grid-mode: char; word-break: break-all; text-indent: 24pt; line-height: 20pt; text-align: left; mso-pagination: widow-orphan" align="left"><span style="font-size: 12pt; color: black; font-family: 宋体; mso-bidi-font-family: 宋体; mso-font-kerning: 0pt">Chinese, by contrast, is a logographic script with a very large character set (modern Chinese has six to seven thousand characters in common use alone, and more than <span lang="EN-US">50,000</span> in total). The shapes are complex, every character has the three aspects of <span lang="EN-US">"</span>sound, shape, and meaning<span lang="EN-US">"</span>, and homophones and variant forms abound. All of this makes computer processing of Chinese difficult. To process Chinese characters on a computer, several problems must be solved. First, input: how to enter structurally complex square characters into the computer, which is the key to Chinese processing. Second, how are characters represented and stored inside the computer, and how do they stay compatible with Western text? Finally, how are the processed results output from the computer? <span lang="EN-US"><o:p></o:p></span></span></p> <p class="MsoNormal" style="background: white; margin: 0cm 6pt 0pt; layout-grid-mode: char; word-break: break-all; text-indent: 21pt; line-height: 20pt; text-align: left; mso-pagination: widow-orphan" align="left"><span style="font-size: 12pt; color: black; font-family: 宋体; mso-bidi-font-family: 宋体; mso-font-kerning: 0pt">For this, Chinese characters must be coded. Corresponding to the three main stages above (input, internal processing, and output), each character has an input code, an interchange code, an internal code, and a glyph code. A Chinese information processing system converts between them in this order: input code → interchange code → internal code → glyph code.<span lang="EN-US"><o:p></o:p></span></span></p> <p class="MsoNormal" style="background: white; margin: 0cm 6pt 0pt 24pt; layout-grid-mode: char; word-break: break-all; text-indent: -18pt; line-height: 20pt; text-align: left; mso-pagination: widow-orphan" align="left"><span lang="EN-US" style="font-size: 12pt; color: black; font-family: 宋体; mso-bidi-font-family: 宋体; mso-font-kerning: 0pt">(1)</span><span style="font-size: 12pt; color: black; font-family: 宋体; mso-bidi-font-family: 宋体; mso-font-kerning: 0pt"> Input code: lets Chinese characters be entered with an existing standard Western keyboard. It is also called the external code. Input codes fall into four main classes:<span lang="EN-US"><o:p></o:p></span></span></p> <p class="MsoNormal" style="background: white; margin: 0cm 6pt 0pt 48pt; layout-grid-mode: char; word-break: break-all; text-indent: -21pt; line-height: 20pt; text-align: left; mso-pagination: widow-orphan" align="left"><span lang="EN-US" style="font-size: 12pt; color: black; font-family: 宋体; mso-bidi-font-family: 宋体; 
mso-font-kerning: 0pt">a)</span><span lang="EN-US" style="font-size: 12pt; color: black; font-family: 宋体; mso-font-kerning: 0pt">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; </span><span style="font-size: 12pt; color: black; font-family: 宋体; mso-bidi-font-family: 宋体; mso-font-kerning: 0pt">Numeric codes: every character is numbered with a fixed-length digit string, and that number serves as its input code. Row-cell (qu-wei) codes and telegraph codes are examples.<span lang="EN-US"><o:p></o:p></span></span></p> <p class="MsoNormal" style="background: white; margin: 0cm 6pt 0pt 48pt; layout-grid-mode: char; word-break: break-all; text-indent: -21pt; line-height: 20pt; text-align: left; mso-pagination: widow-orphan" align="left"><span lang="EN-US" style="font-size: 12pt; color: black; font-family: 宋体; mso-bidi-font-family: 宋体; mso-font-kerning: 0pt">b)</span><span lang="EN-US" style="font-size: 12pt; color: black; font-family: 宋体; mso-font-kerning: 0pt">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; </span><span style="font-size: 12pt; color: black; font-family: 宋体; mso-bidi-font-family: 宋体; mso-font-kerning: 0pt">Phonetic (pinyin) codes: input methods based on the pronunciation of the character.<span lang="EN-US"><o:p></o:p></span></span></p> <p class="MsoNormal" style="background: white; margin: 0cm 6pt 0pt 48pt; layout-grid-mode: char; word-break: break-all; text-indent: -21pt; line-height: 20pt; text-align: left; mso-pagination: widow-orphan" align="left"><span lang="EN-US" style="font-size: 12pt; color: black; font-family: 宋体; mso-bidi-font-family: 宋体; mso-font-kerning: 0pt">c)</span><span lang="EN-US" style="font-size: 12pt; color: black; font-family: 宋体; mso-font-kerning: 0pt">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; </span><span style="font-size: 12pt; color: black; font-family: 宋体; mso-bidi-font-family: 宋体; mso-font-kerning: 0pt">Shape-based codes: input codes based on the structure of the character's shape, for example Wubi (the Wang code).<span lang="EN-US"><o:p></o:p></span></span></p> <p class="MsoNormal" style="background: white; margin: 0cm 6pt 0pt 48pt; layout-grid-mode: char; word-break: break-all; text-indent: -21pt; line-height: 20pt; text-align: left; mso-pagination: widow-orphan" align="left"><span lang="EN-US" style="font-size: 12pt; 
color: black; font-family: 宋体; mso-bidi-font-family: 宋体; mso-font-kerning: 0pt">d)</span><span lang="EN-US" style="font-size: 12pt; color: black; font-family: 宋体; mso-font-kerning: 0pt">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; </span><span style="font-size: 12pt; color: black; font-family: 宋体; mso-bidi-font-family: 宋体; mso-font-kerning: 0pt">Phonetic-shape codes: input codes that take both the pronunciation and the shape of the character into account.<span lang="EN-US"><o:p></o:p></span></span></p> <p class="MsoNormal" style="background: white; margin: 0cm 6pt 0pt 24pt; layout-grid-mode: char; word-break: break-all; text-indent: -18pt; line-height: 20pt; text-align: left; mso-pagination: widow-orphan" align="left"><span lang="EN-US" style="font-size: 12pt; color: black; font-family: 宋体; mso-bidi-font-family: 宋体; mso-font-kerning: 0pt">(2)</span><span style="font-size: 12pt; color: black; font-family: 宋体; mso-bidi-font-family: 宋体; mso-font-kerning: 0pt"> Interchange code: used to exchange data between external (input) codes and internal codes. The national standard that defines the interchange code is <span lang="EN-US">GB2312-80</span>.<span lang="EN-US"><o:p></o:p></span></span></p> <p class="MsoNormal" style="background: white; margin: 0cm 6pt 0pt 24pt; layout-grid-mode: char; word-break: break-all; text-indent: -18pt; line-height: 20pt; text-align: left; mso-pagination: widow-orphan" align="left"><span lang="EN-US" style="font-size: 12pt; color: black; font-family: 宋体; mso-bidi-font-family: 宋体; mso-font-kerning: 0pt">(3)</span><span style="font-size: 12pt; color: black; font-family: 宋体; mso-bidi-font-family: 宋体; mso-font-kerning: 0pt"> Internal code: the basic form in which a character is represented inside the computer, used for recognition, storage, processing, and transmission. It is also a double-byte code: setting the top bit of both bytes of the national-standard (GB) code to <span lang="EN-US">"1"</span> yields the internal code.<span lang="EN-US"><o:p></o:p></span></span></p> <p class="MsoNormal" style="background: white; margin: 0cm 6pt 0pt 24pt; layout-grid-mode: char; word-break: break-all; text-indent: -18pt; line-height: 20pt; text-align: left; mso-pagination: widow-orphan" align="left"><span lang="EN-US" style="font-size: 12pt; color: black; font-family: 宋体; mso-bidi-font-family: 宋体; mso-font-kerning: 0pt">(4)</span><span style="font-size: 12pt; 
color: black; font-family: 宋体; mso-bidi-font-family: 宋体; mso-font-kerning: 0pt"> Glyph code: encodes the visual form of a character (its structure, shape, strokes, and so on) and is used to output characters on screen or in print.<span lang="EN-US"><o:p></o:p></span></span></p> <p class="MsoNormal" style="background: white; margin: 0cm 6pt 0pt; layout-grid-mode: char; word-break: break-all; line-height: 20pt; text-align: left; mso-pagination: widow-orphan" align="left"><span lang="EN-US" style="font-size: 12pt; color: blue; font-family: 宋体; mso-bidi-font-family: 宋体; mso-font-kerning: 0pt">2.</span><span style="font-size: 12pt; color: blue; font-family: 宋体; mso-bidi-font-family: 宋体; mso-font-kerning: 0pt"> How Chinese characters are encoded in <span lang="EN-US">VC</span></span><span lang="EN-US" style="font-size: 12pt; color: black; font-family: 宋体; mso-bidi-font-family: 宋体; mso-font-kerning: 0pt"><o:p></o:p></span></p> <p class="MsoNormal" style="background: white; margin: 0cm 6pt 0pt; layout-grid-mode: char; word-break: break-all; line-height: 20pt; text-align: left; mso-pagination: widow-orphan" align="left"><span lang="EN-US" style="font-size: 12pt; color: black; font-family: 宋体; mso-bidi-font-family: 宋体; mso-font-kerning: 0pt">&nbsp;&nbsp; vc/c++</span><span style="font-size: 12pt; color: black; font-family: 宋体; mso-bidi-font-family: 宋体; mso-font-kerning: 0pt"> uses the <span lang="EN-US">GB2312</span> internal code for Chinese characters<span lang="EN-US">,</span> so the input and output facilities of <span lang="EN-US">vc/c++</span>, such as <span lang="EN-US">cin/wcin, cout/wcout, scanf/wscanf, printf/wprintf...</span>, all assume <span lang="EN-US">GB2312</span>. If the characters are in some other internal code, these facilities will not parse them correctly.<span lang="EN-US"><o:p></o:p></span></span></p> <p class="MsoNormal" style="background: white; margin: 0cm 6pt 0pt; layout-grid-mode: char; word-break: break-all; text-indent: 24pt; line-height: 20pt; text-align: left; mso-pagination: widow-orphan" align="left"><span style="font-size: 12pt; color: black; font-family: 宋体; mso-bidi-font-family: 宋体; mso-font-kerning: 0pt">Look closely at the <span lang="EN-US">ASCII</span> table: the characters from code <span lang="EN-US">161</span> onward are rarely used, and the negative values (bytes read as signed char) are unused too. <span 
lang="EN-US">GB2312</span> takes advantage of this and uses the byte values <span lang="EN-US">161-254</span> (<span lang="EN-US">-95 to -2</span> as signed char) to mark Chinese characters. Since the <span lang="EN-US">254-161+1 = 94</span> values of a single byte are far too few for Chinese, two bytes are combined<span lang="EN-US"> (</span>so one Chinese character occupies two bytes<span lang="EN-US">)</span>, and <span lang="EN-US">94*94 = 8836</span> positions largely cover the characters in common use. When processing text, the computer treats two consecutive bytes greater than <span lang="EN-US">160</span> (negative as signed char) as one Chinese character. The following <span lang="EN-US">Demo</span> program imitates how <span lang="EN-US">vc/c++</span> recognizes Chinese characters on output.<span lang="EN-US"><o:p></o:p></span></span></p> <p class="MsoNormal" style="background: white; margin: 0cm 6pt 0pt; word-break: break-all; line-height: 150%; text-align: left; mso-pagination: widow-orphan" align="left"><span lang="EN-US" style="font-size: 12pt; color: black; line-height: 150%; font-family: 宋体; mso-bidi-font-family: 宋体; mso-font-kerning: 0pt">&nbsp;&nbsp;&nbsp; </span><span lang="EN-US" style="font-size: 12pt; color: blue; line-height: 150%; font-family: 宋体; mso-bidi-font-family: 宋体; mso-font-kerning: 0pt">unsigned</span><span lang="EN-US" style="font-size: 12pt; color: black; line-height: 150%; font-family: 宋体; mso-bidi-font-family: 宋体; mso-font-kerning: 0pt"> </span><span lang="EN-US" style="font-size: 12pt; color: blue; line-height: 150%; font-family: 宋体; mso-bidi-font-family: 宋体; mso-font-kerning: 0pt">char</span><span lang="EN-US" style="font-size: 12pt; color: black; line-height: 150%; font-family: 宋体; mso-bidi-font-family: 宋体; mso-font-kerning: 0pt"> input[50];<o:p></o:p></span></p> <p class="MsoNormal" style="background: white; margin: 0cm 6pt 0pt; word-break: break-all; text-indent: 24pt; line-height: 150%; text-align: left; mso-pagination: widow-orphan" align="left"><span lang="EN-US" style="font-size: 12pt; color: black; line-height: 150%; font-family: 宋体; mso-bidi-font-family: 宋体; mso-font-kerning: 0pt">cin&gt;&gt;input;<o:p></o:p></span></p> <p class="MsoNormal" style="background: white; margin: 0cm 6pt 0pt; word-break: break-all; line-height: 
150%; text-align: left; mso-pagination: widow-orphan" align="left"><span lang="EN-US" style="font-size: 12pt; color: black; line-height: 150%; font-family: 宋体; mso-bidi-font-family: 宋体; mso-font-kerning: 0pt">&nbsp;&nbsp;&nbsp; </span><span lang="EN-US" style="font-size: 12pt; color: blue; line-height: 150%; font-family: 宋体; mso-bidi-font-family: 宋体; mso-font-kerning: 0pt">int</span><span lang="EN-US" style="font-size: 12pt; color: black; line-height: 150%; font-family: 宋体; mso-bidi-font-family: 宋体; mso-font-kerning: 0pt"> flag=0;<o:p></o:p></span></p> <p class="MsoNormal" style="background: white; margin: 0cm 6pt 0pt; word-break: break-all; line-height: 150%; text-align: left; mso-pagination: widow-orphan" align="left"><span lang="EN-US" style="font-size: 12pt; color: black; line-height: 150%; font-family: 宋体; mso-bidi-font-family: 宋体; mso-font-kerning: 0pt">&nbsp;&nbsp;&nbsp; </span><span lang="EN-US" style="font-size: 12pt; color: blue; line-height: 150%; font-family: 宋体; mso-bidi-font-family: 宋体; mso-font-kerning: 0pt">for</span><span lang="EN-US" style="font-size: 12pt; color: black; line-height: 150%; font-family: 宋体; mso-bidi-font-family: 宋体; mso-font-kerning: 0pt">(</span><span lang="EN-US" style="font-size: 12pt; color: blue; line-height: 150%; font-family: 宋体; mso-bidi-font-family: 宋体; mso-font-kerning: 0pt">int</span><span lang="EN-US" style="font-size: 12pt; color: black; line-height: 150%; font-family: 宋体; mso-bidi-font-family: 宋体; mso-font-kerning: 0pt"> i =0 ;i &lt; 50 ;i++)<o:p></o:p></span></p> <p class="MsoNormal" style="background: white; margin: 0cm 6pt 0pt; word-break: break-all; line-height: 150%; text-align: left; mso-pagination: widow-orphan" align="left"><span lang="EN-US" style="font-size: 12pt; color: black; line-height: 150%; font-family: 宋体; mso-bidi-font-family: 宋体; mso-font-kerning: 0pt">&nbsp;&nbsp;&nbsp; {<o:p></o:p></span></p> <p class="MsoNormal" style="background: white; margin: 0cm 6pt 0pt; word-break: break-all; line-height: 
150%; text-align: left; mso-pagination: widow-orphan" align="left"><span lang="EN-US" style="font-size: 12pt; color: black; line-height: 150%; font-family: 宋体; mso-bidi-font-family: 宋体; mso-font-kerning: 0pt">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; </span><span lang="EN-US" style="font-size: 12pt; color: blue; line-height: 150%; font-family: 宋体; mso-bidi-font-family: 宋体; mso-font-kerning: 0pt">if</span><span lang="EN-US" style="font-size: 12pt; color: black; line-height: 150%; font-family: 宋体; mso-bidi-font-family: 宋体; mso-font-kerning: 0pt">(input[i] &gt; 0xa0 &amp;&amp; input[i] != 0)<o:p></o:p></span></p> <p class="MsoNormal" style="background: white; margin: 0cm 6pt 0pt; word-break: break-all; line-height: 150%; text-align: left; mso-pagination: widow-orphan" align="left"><span lang="EN-US" style="font-size: 12pt; color: black; line-height: 150%; font-family: 宋体; mso-bidi-font-family: 宋体; mso-font-kerning: 0pt">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; {<o:p></o:p></span></p> <p class="MsoNormal" style="background: white; margin: 0cm 6pt 0pt; word-break: break-all; line-height: 150%; text-align: left; mso-pagination: widow-orphan" align="left"><span lang="EN-US" style="font-size: 12pt; color: black; line-height: 150%; font-family: 宋体; mso-bidi-font-family: 宋体; mso-font-kerning: 0pt">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; </span><span lang="EN-US" style="font-size: 12pt; color: blue; line-height: 150%; font-family: 宋体; mso-bidi-font-family: 宋体; mso-font-kerning: 0pt">if</span><span lang="EN-US" style="font-size: 12pt; color: black; line-height: 150%; font-family: 宋体; mso-bidi-font-family: 宋体; mso-font-kerning: 0pt">(flag == 1)<o:p></o:p></span></p> <p class="MsoNormal" style="background: white; margin: 0cm 6pt 0pt; word-break: break-all; line-height: 150%; text-align: left; mso-pagination: widow-orphan" align="left"><span lang="EN-US" style="font-size: 12pt; color: black; line-height: 150%; font-family: 宋体; mso-bidi-font-family: 宋体; 
mso-font-kerning: 0pt">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; {<o:p></o:p></span></p> <p class="MsoNormal" style="background: white; margin: 0cm 6pt 0pt; word-break: break-all; line-height: 150%; text-align: left; mso-pagination: widow-orphan" align="left"><span lang="EN-US" style="font-size: 12pt; color: black; line-height: 150%; font-family: 宋体; mso-bidi-font-family: 宋体; mso-font-kerning: 0pt">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; cout&lt;&lt;"chinese character"&lt;&lt;endl;<o:p></o:p></span></p> <p class="MsoNormal" style="background: white; margin: 0cm 6pt 0pt; word-break: break-all; line-height: 150%; text-align: left; mso-pagination: widow-orphan" align="left"><span lang="EN-US" style="font-size: 12pt; color: black; line-height: 150%; font-family: 宋体; mso-bidi-font-family: 宋体; mso-font-kerning: 0pt">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; flag = 0;<o:p></o:p></span></p> <p class="MsoNormal" style="background: white; margin: 0cm 6pt 0pt; word-break: break-all; line-height: 150%; text-align: left; mso-pagination: widow-orphan" align="left"><span lang="EN-US" style="font-size: 12pt; color: black; line-height: 150%; font-family: 宋体; mso-bidi-font-family: 宋体; mso-font-kerning: 0pt">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; }<o:p></o:p></span></p> <p class="MsoNormal" style="background: white; margin: 0cm 6pt 0pt; word-break: break-all; line-height: 150%; text-align: left; mso-pagination: widow-orphan" align="left"><span lang="EN-US" style="font-size: 12pt; color: black; line-height: 150%; font-family: 宋体; mso-bidi-font-family: 宋体; mso-font-kerning: 0pt">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; </span><span lang="EN-US" style="font-size: 12pt; color: blue; line-height: 150%; font-family: 宋体; mso-bidi-font-family: 宋体; mso-font-kerning: 0pt">else</span><span lang="EN-US" style="font-size: 12pt; color: black; line-height: 150%; 
font-family: 宋体; mso-bidi-font-family: 宋体; mso-font-kerning: 0pt"><o:p></o:p></span></p> <p class="MsoNormal" style="background: white; margin: 0cm 6pt 0pt; word-break: break-all; line-height: 150%; text-align: left; mso-pagination: widow-orphan" align="left"><span lang="EN-US" style="font-size: 12pt; color: black; line-height: 150%; font-family: 宋体; mso-bidi-font-family: 宋体; mso-font-kerning: 0pt">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; {<o:p></o:p></span></p> <p class="MsoNormal" style="background: white; margin: 0cm 6pt 0pt; word-break: break-all; line-height: 150%; text-align: left; mso-pagination: widow-orphan" align="left"><span lang="EN-US" style="font-size: 12pt; color: black; line-height: 150%; font-family: 宋体; mso-bidi-font-family: 宋体; mso-font-kerning: 0pt">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; flag++;<o:p></o:p></span></p> <p class="MsoNormal" style="background: white; margin: 0cm 6pt 0pt; word-break: break-all; line-height: 150%; text-align: left; mso-pagination: widow-orphan" align="left"><span lang="EN-US" style="font-size: 12pt; color: black; line-height: 150%; font-family: 宋体; mso-bidi-font-family: 宋体; mso-font-kerning: 0pt">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; }<o:p></o:p></span></p> <p class="MsoNormal" style="background: white; margin: 0cm 6pt 0pt; word-break: break-all; line-height: 150%; text-align: left; mso-pagination: widow-orphan" align="left"><span lang="EN-US" style="font-size: 12pt; color: black; line-height: 150%; font-family: 宋体; mso-bidi-font-family: 宋体; mso-font-kerning: 0pt">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; }<o:p></o:p></span></p> <p class="MsoNormal" style="background: white; margin: 0cm 6pt 0pt; word-break: break-all; line-height: 150%; text-align: left; mso-pagination: widow-orphan" align="left"><span lang="EN-US" style="font-size: 12pt; color: black; line-height: 150%; font-family: 宋体; mso-bidi-font-family: 宋体; mso-font-kerning: 
0pt">&nbsp;<o:p></o:p></span></p> <p class="MsoNormal" style="background: white; margin: 0cm 6pt 0pt; word-break: break-all; line-height: 150%; text-align: left; mso-pagination: widow-orphan" align="left"><span lang="EN-US" style="font-size: 12pt; color: black; line-height: 150%; font-family: 宋体; mso-bidi-font-family: 宋体; mso-font-kerning: 0pt">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; </span><span lang="EN-US" style="font-size: 12pt; color: blue; line-height: 150%; font-family: 宋体; mso-bidi-font-family: 宋体; mso-font-kerning: 0pt">else</span><span lang="EN-US" style="font-size: 12pt; color: black; line-height: 150%; font-family: 宋体; mso-bidi-font-family: 宋体; mso-font-kerning: 0pt"> </span><span lang="EN-US" style="font-size: 12pt; color: blue; line-height: 150%; font-family: 宋体; mso-bidi-font-family: 宋体; mso-font-kerning: 0pt">if</span><span lang="EN-US" style="font-size: 12pt; color: black; line-height: 150%; font-family: 宋体; mso-bidi-font-family: 宋体; mso-font-kerning: 0pt">(input[i] == 0)<o:p></o:p></span></p> <p class="MsoNormal" style="background: white; margin: 0cm 6pt 0pt; word-break: break-all; line-height: 150%; text-align: left; mso-pagination: widow-orphan" align="left"><span lang="EN-US" style="font-size: 12pt; color: black; line-height: 150%; font-family: 宋体; mso-bidi-font-family: 宋体; mso-font-kerning: 0pt">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; {<o:p></o:p></span></p> <p class="MsoNormal" style="background: white; margin: 0cm 6pt 0pt; word-break: break-all; line-height: 150%; text-align: left; mso-pagination: widow-orphan" align="left"><span lang="EN-US" style="font-size: 12pt; color: black; line-height: 150%; font-family: 宋体; mso-bidi-font-family: 宋体; mso-font-kerning: 0pt">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; </span><span lang="EN-US" style="font-size: 12pt; color: blue; line-height: 150%; font-family: 宋体; mso-bidi-font-family: 宋体; mso-font-kerning: 0pt">break</span><span lang="EN-US" style="font-size: 12pt; color: black; 
line-height: 150%; font-family: 宋体; mso-bidi-font-family: 宋体; mso-font-kerning: 0pt">;<o:p></o:p></span></p> <p class="MsoNormal" style="background: white; margin: 0cm 6pt 0pt; word-break: break-all; line-height: 150%; text-align: left; mso-pagination: widow-orphan" align="left"><span lang="EN-US" style="font-size: 12pt; color: black; line-height: 150%; font-family: 宋体; mso-bidi-font-family: 宋体; mso-font-kerning: 0pt">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; }<o:p></o:p></span></p> <p class="MsoNormal" style="background: white; margin: 0cm 6pt 0pt; word-break: break-all; line-height: 150%; text-align: left; mso-pagination: widow-orphan" align="left"><span lang="EN-US" style="font-size: 12pt; color: black; line-height: 150%; font-family: 宋体; mso-bidi-font-family: 宋体; mso-font-kerning: 0pt">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; </span><span lang="EN-US" style="font-size: 12pt; color: blue; line-height: 150%; font-family: 宋体; mso-bidi-font-family: 宋体; mso-font-kerning: 0pt">else</span><span lang="EN-US" style="font-size: 12pt; color: black; line-height: 150%; font-family: 宋体; mso-bidi-font-family: 宋体; mso-font-kerning: 0pt"> <o:p></o:p></span></p> <p class="MsoNormal" style="background: white; margin: 0cm 6pt 0pt; word-break: break-all; line-height: 150%; text-align: left; mso-pagination: widow-orphan" align="left"><span lang="EN-US" style="font-size: 12pt; color: black; line-height: 150%; font-family: 宋体; mso-bidi-font-family: 宋体; mso-font-kerning: 0pt">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; {<o:p></o:p></span></p> <p class="MsoNormal" style="background: white; margin: 0cm 6pt 0pt; word-break: break-all; line-height: 150%; text-align: left; mso-pagination: widow-orphan" align="left"><span lang="EN-US" style="font-size: 12pt; color: black; line-height: 150%; font-family: 宋体; mso-bidi-font-family: 宋体; mso-font-kerning: 0pt">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; cout&lt;&lt;"english character"&lt;&lt;endl;<o:p></o:p></span></p> <p class="MsoNormal" 
style="background: white; margin: 0cm 6pt 0pt; word-break: break-all; line-height: 150%; text-align: left; mso-pagination: widow-orphan" align="left"><span lang="EN-US" style="font-size: 12pt; color: black; line-height: 150%; font-family: 宋体; mso-bidi-font-family: 宋体; mso-font-kerning: 0pt">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; }<o:p></o:p></span></p> <p class="MsoNormal" style="background: white; margin: 0cm 6pt 0pt; layout-grid-mode: char; word-break: break-all; text-indent: 24pt; line-height: 20pt; text-align: left; mso-pagination: widow-orphan" align="left"><span lang="EN-US" style="font-size: 12pt; color: black; font-family: 宋体; mso-bidi-font-family: 宋体; mso-font-kerning: 0pt">}<o:p></o:p></span></p> <p class="MsoNormal" style="background: white; margin: 0cm 6pt 0pt; layout-grid-mode: char; word-break: break-all; text-indent: 24pt; line-height: 20pt; text-align: left; mso-pagination: widow-orphan" align="left"><span style="font-size: 12pt; color: black; font-family: 宋体; mso-bidi-font-family: 宋体; mso-font-kerning: 0pt">Input: <span lang="EN-US">Hello</span>中国 (the <span lang="EN-US">GB2312</span> internal code of <span lang="EN-US">"</span>中国<span lang="EN-US">"</span> is <span lang="EN-US">214 208</span>, <span lang="EN-US">185 250</span>)<span lang="EN-US"><o:p></o:p></span></span></p> <p class="MsoNormal" style="background: white; margin: 0cm 6pt 0pt; layout-grid-mode: char; word-break: break-all; text-indent: 24pt; line-height: 20pt; text-align: left; mso-pagination: widow-orphan" align="left"><span style="font-size: 12pt; color: black; font-family: 宋体; mso-bidi-font-family: 宋体; mso-font-kerning: 0pt">Output: <span lang="EN-US">english character<o:p></o:p></span></span></p> <p class="MsoNormal" style="background: white; margin: 0cm 6pt 0pt; layout-grid-mode: char; word-break: break-all; text-indent: 60pt; line-height: 20pt; text-align: left; mso-pagination: widow-orphan" align="left"><span lang="EN-US" style="font-size: 12pt; color: black; font-family: 宋体; mso-bidi-font-family: 宋体; 
mso-font-kerning: 0pt">english character<o:p></o:p></span></p> <p class="MsoNormal" style="background: white; margin: 0cm 6pt 0pt; layout-grid-mode: char; word-break: break-all; text-indent: 60pt; line-height: 20pt; text-align: left; mso-pagination: widow-orphan" align="left"><span lang="EN-US" style="font-size: 12pt; color: black; font-family: 宋体; mso-bidi-font-family: 宋体; mso-font-kerning: 0pt">english character<o:p></o:p></span></p> <p class="MsoNormal" style="background: white; margin: 0cm 6pt 0pt; layout-grid-mode: char; word-break: break-all; text-indent: 60pt; line-height: 20pt; text-align: left; mso-pagination: widow-orphan" align="left"><span lang="EN-US" style="font-size: 12pt; color: black; font-family: 宋体; mso-bidi-font-family: 宋体; mso-font-kerning: 0pt">english character<o:p></o:p></span></p> <p class="MsoNormal" style="background: white; margin: 0cm 6pt 0pt; layout-grid-mode: char; word-break: break-all; text-indent: 60pt; line-height: 20pt; text-align: left; mso-pagination: widow-orphan" align="left"><span lang="EN-US" style="font-size: 12pt; color: black; font-family: 宋体; mso-bidi-font-family: 宋体; mso-font-kerning: 0pt">english character<o:p></o:p></span></p> <p class="MsoNormal" style="background: white; margin: 0cm 6pt 0pt; layout-grid-mode: char; word-break: break-all; text-indent: 60pt; line-height: 20pt; text-align: left; mso-pagination: widow-orphan" align="left"><span lang="EN-US" style="font-size: 12pt; color: black; font-family: 宋体; mso-bidi-font-family: 宋体; mso-font-kerning: 0pt">chinese character<o:p></o:p></span></p> <p class="MsoNormal" style="background: white; margin: 0cm 6pt 0pt; layout-grid-mode: char; word-break: break-all; text-indent: 60pt; line-height: 20pt; text-align: left; mso-pagination: widow-orphan" align="left"><span lang="EN-US" style="font-size: 12pt; color: black; font-family: 宋体; mso-bidi-font-family: 宋体; mso-font-kerning: 0pt">chinese character<o:p></o:p></span></p> <p class="MsoNormal" style="background: white; 
margin: 0cm 6pt 0pt; layout-grid-mode: char; word-break: break-all; line-height: 20pt; text-align: left; mso-pagination: widow-orphan" align="left"><span lang="EN-US" style="font-size: 12pt; color: black; font-family: 宋体; mso-bidi-font-family: 宋体; mso-font-kerning: 0pt">vc/c++</span><span style="font-size: 12pt; color: black; font-family: 宋体; mso-bidi-font-family: 宋体; mso-font-kerning: 0pt"> still uses <span lang="EN-US">ASCII</span> for English characters. Presumably, when programmers in other countries enter their own characters in <span lang="EN-US">vc/c++</span>, <span lang="EN-US">vc/c++</span> handles them in that country's character encoding.<span lang="EN-US"><o:p></o:p></span></span></p> <p class="MsoNormal" style="background: white; margin: 0cm 6pt 0pt; layout-grid-mode: char; word-break: break-all; line-height: 20pt; text-align: left; mso-pagination: widow-orphan" align="left"><span lang="EN-US" style="font-size: 12pt; color: black; font-family: 宋体; mso-bidi-font-family: 宋体; mso-font-kerning: 0pt">&nbsp;&nbsp;&nbsp; </span><span style="font-size: 12pt; color: black; font-family: 宋体; mso-bidi-font-family: 宋体; mso-font-kerning: 0pt">This raises another problem: when a Korean <span lang="EN-US">vc/c++</span> program runs on a Chinese <span lang="EN-US">vc/c++</span> installation that lacks the corresponding code tables, Korean text may display as garbage. My personal guess is that the <span lang="EN-US">vc</span> installer ships code tables for many countries, which would take up a lot of space. How much better it would be if every country used one unified encoding, and every programming language and development tool supported it! In fact such an encoding already exists, and many newer languages such as <span lang="EN-US">Java</span> and <span lang="EN-US">C#</span> support it: <span lang="EN-US">Unicode</span>, described next.<span lang="EN-US"><o:p></o:p></span></span></p> <p class="MsoNormal" style="background: white; margin: 0cm 6pt 0pt; layout-grid-mode: char; word-break: break-all; line-height: 20pt; text-align: left; mso-pagination: widow-orphan" align="left"><span lang="EN-US" style="font-size: 12pt; color: blue; font-family: 宋体; mso-bidi-font-family: 宋体; mso-font-kerning: 0pt">3.</span><span style="font-size: 12pt; color: blue; font-family: 宋体; mso-bidi-font-family: 宋体; mso-font-kerning: 0pt"> A new internal-code standard: <span lang="EN-US">Unicode</span></span><span lang="EN-US" style="font-size: 12pt; 
color: black; font-family: 宋体; mso-bidi-font-family: 宋体; mso-font-kerning: 0pt"><o:p></o:p></span></p> <p class="MsoNormal" style="background: white; margin: 0cm 6pt 0pt; layout-grid-mode: char; word-break: break-all; text-indent: 24pt; line-height: 20pt; text-align: left; mso-pagination: widow-orphan" align="left"><span lang="EN-US" style="font-size: 12pt; color: black; font-family: 宋体; mso-bidi-font-family: 宋体; mso-font-kerning: 0pt">Unicode</span><span style="font-size: 12pt; color: black; font-family: 宋体; mso-bidi-font-family: 宋体; mso-font-kerning: 0pt"> (known in Chinese as 统一码, 万国码, or 单一码) is a character encoding used on computers. It assigns every character of every language a single, unique binary code, so that text can be converted and processed across languages and platforms. Work on it began in the late <span lang="EN-US">1980s</span>, and version <span lang="EN-US">1.0</span> was published in <span lang="EN-US">1991</span>. As computers grew more capable, <span lang="EN-US">Unicode</span> became widespread in the decade or so after its release. At the time this text was written, the latest version was <span lang="EN-US">Unicode 4.1.0</span>, released on March <span lang="EN-US">31</span>, <span lang="EN-US">2005</span>; the <span lang="EN-US">5.0 Beta</span> was released on December <span lang="EN-US">12</span>, <span lang="EN-US">2005</span> for review by members.<span lang="EN-US"><o:p></o:p></span></span></p> <p class="MsoNormal" style="background: white; margin: 0cm 6pt 0pt; layout-grid-mode: char; word-break: break-all; text-indent: 24pt; line-height: 20pt; 
text-align: left; mso-pagination: widow-orphan" align="left"><span lang="EN-US" style="font-size: 12pt; color: black; font-family: 宋体; mso-bidi-font-family: 宋体; mso-font-kerning: 0pt">The Unicode </span><span style="font-size: 12pt; color: black; font-family: 宋体; mso-bidi-font-family: 宋体; mso-font-kerning: 0pt">encoding system can be divided into two layers: the encoding and the implementation.<span lang="EN-US"><o:p></o:p></span></span></p> <p class="MsoNormal" style="background: white; margin: 0cm 6pt 0pt; layout-grid-mode: char; word-break: break-all; text-indent: 24pt; line-height: 20pt; text-align: left; mso-pagination: widow-orphan" align="left"><span style="font-size: 12pt; color: black; font-family: 宋体; mso-bidi-font-family: 宋体; mso-font-kerning: 0pt">Encoding: Unicode's encoding corresponds to the Universal Character Set (UCS) of ISO 10646. The Unicode version in practical use today corresponds to UCS-2, which uses a 16-bit code space, so each character occupies 2 bytes. In theory this allows at most 2<sup>16</sup> (65,536) characters, which basically covers the needs of the various languages. In practice, the current version of Unicode has not filled this 16-bit space; large ranges are reserved for special uses or future expansion. Characters beyond this range, in the supplementary planes, cannot be represented in UCS-2 and need UTF-16 surrogate pairs or a wider encoding.<span lang="EN-US"><o:p></o:p></span></span></p> <p class="MsoNormal" style="background: white; margin: 0cm 6pt 0pt; layout-grid-mode: char; word-break: break-all; text-indent: 24pt; line-height: 20pt; text-align: left; mso-pagination: widow-orphan" align="left"><span style="font-size: 12pt; color: black; font-family: 宋体; mso-bidi-font-family: 宋体; mso-font-kerning: 0pt">Implementation: The implementation of Unicode is distinct from its encoding. The Unicode code of a character is fixed, but in actual transmission, because platforms differ in design and because of the wish to save space, Unicode codes are serialized in different ways. These implementations are called Unicode Transformation Formats (UTF). UTF-8, for example, is a variable-length encoding: the basic 7-bit ASCII characters are still represented with 7 bits and occupy one byte (with the high bit set to 0). Other Unicode characters are converted by a fixed algorithm into multi-byte sequences, using 1 to 3 bytes per character within the 16-bit range (4 bytes for characters beyond it), and the high bit of each byte (0 or 1) tells single-byte ASCII apart from multi-byte sequences.<span lang="EN-US"><o:p></o:p></span></span></p>
<p class="MsoNormal" style="background: white; margin: 0cm 6pt 0pt; layout-grid-mode: char; word-break: break-all; text-indent: 21pt; line-height: 20pt; text-align: left; mso-pagination: widow-orphan" align="left"><span lang="EN-US" style="font-size: 12pt; color: black; font-family: 宋体; mso-bidi-font-family: 宋体; mso-font-kerning: 0pt">Java</span><span style="font-size: 12pt; color: black; font-family: 宋体; mso-bidi-font-family: 宋体; mso-font-kerning: 0pt"> and C# both use Unicode: a character defined in either language is stored in memory as its two-byte Unicode code (a UTF-16 code unit), as shown below:<span lang="EN-US"><o:p></o:p></span></span></p> <p class="MsoNormal" style="background: white; margin: 0cm 6pt 0pt; layout-grid-mode: char; word-break: break-all; text-indent: 21pt; line-height: 20pt; text-align: left; mso-pagination: widow-orphan" align="left"><span lang="EN-US" style="font-size: 12pt; color: blue; font-family: 宋体; mso-bidi-font-family: 宋体; mso-font-kerning: 0pt">char</span><span lang="EN-US" style="font-size: 12pt; color: black; font-family: 宋体; mso-bidi-font-family: 宋体; mso-font-kerning: 0pt"> a='</span><span style="font-size: 12pt; color: black; font-family: 宋体; mso-bidi-font-family: 宋体; mso-font-kerning: 0pt">我<span lang="EN-US">';&nbsp;&nbsp;&nbsp; =&gt; Unicode code stored in memory: 25105 (U+6211)</span></span></p><img src ="http://www.cppblog.com/woaidongmao/aggbug/66250.html" width = "1" height = "1" /><br><br><div align=right><a style="text-decoration:none;" href="http://www.cppblog.com/woaidongmao/" target="_blank">肥仔</a> 2008-11-07 22:39 <a 
href="http://www.cppblog.com/woaidongmao/archive/2008/11/07/66250.html#Feedback" target="_blank" style="text-decoration:none;">Post a comment</a></div>]]></description></item><item><title>Win32 Character Encoding</title><link>http://www.cppblog.com/woaidongmao/archive/2008/11/07/66246.html</link><dc:creator>肥仔</dc:creator><author>肥仔</author><pubDate>Fri, 07 Nov 2008 14:33:00 GMT</pubDate><guid>http://www.cppblog.com/woaidongmao/archive/2008/11/07/66246.html</guid><wfw:comment>http://www.cppblog.com/woaidongmao/comments/66246.html</wfw:comment><comments>http://www.cppblog.com/woaidongmao/archive/2008/11/07/66246.html#Feedback</comments><slash:comments>0</slash:comments><wfw:commentRss>http://www.cppblog.com/woaidongmao/comments/commentRss/66246.html</wfw:commentRss><trackback:ping>http://www.cppblog.com/woaidongmao/services/trackbacks/66246.html</trackback:ping><description><![CDATA[&nbsp;&nbsp;&nbsp;&nbsp; Summary: You have undoubtedly seen all sorts of string types such as TCHAR, std::string, and BSTR, along with those odd macros that start with _tcs, and you may well be staring at your monitor in dismay. This guide summarizes why each character type was introduced, shows some simple usage, and explains how to convert between the string types when necessary. Part one introduces three character encoding types. It is important to understand how each encoding scheme works. Even if you already know that a string is an array of characters, you...&nbsp;&nbsp;<a href='http://www.cppblog.com/woaidongmao/archive/2008/11/07/66246.html'>Read more</a><img src ="http://www.cppblog.com/woaidongmao/aggbug/66246.html" width = "1" height = "1" /><br><br><div align=right><a style="text-decoration:none;" href="http://www.cppblog.com/woaidongmao/" target="_blank">肥仔</a> 2008-11-07 22:33 <a href="http://www.cppblog.com/woaidongmao/archive/2008/11/07/66246.html#Feedback" target="_blank" style="text-decoration:none;">Post a comment</a></div>]]></description></item><item><title>Three Character Encodings in C++</title><link>http://www.cppblog.com/woaidongmao/archive/2008/11/07/66247.html</link><dc:creator>肥仔</dc:creator><author>肥仔</author><pubDate>Fri, 07 Nov 2008 14:33:00 
GMT</pubDate><guid>http://www.cppblog.com/woaidongmao/archive/2008/11/07/66247.html</guid><wfw:comment>http://www.cppblog.com/woaidongmao/comments/66247.html</wfw:comment><comments>http://www.cppblog.com/woaidongmao/archive/2008/11/07/66247.html#Feedback</comments><slash:comments>0</slash:comments><wfw:commentRss>http://www.cppblog.com/woaidongmao/comments/commentRss/66247.html</wfw:commentRss><trackback:ping>http://www.cppblog.com/woaidongmao/services/trackbacks/66247.html</trackback:ping><description><![CDATA[<p class="MsoNormal"><span lang="EN-US" style="font-size: 12pt; color: #333333; font-family: 宋体">C++</span><span style="font-size: 12pt; color: #333333; font-family: 宋体"> code commonly works with three character encodings: SBCS (single-byte character set), MBCS (multi-byte character set), and Unicode. In SBCS every character is one byte. In MBCS a character takes a varying number of bytes (one, two, three, and so on), but in practice two bytes is by far the most common, so it is no surprise to see DBCS (double-byte character set) used in place of MBCS. Unicode, as used on Windows, stores characters in two-byte (16-bit) units throughout. In the Windows NT kernel the API uses Unicode throughout, so if your program uses a non-Unicode encoding, the system automatically converts to Unicode to execute the call and then converts the returned results back to the type you use. Single-byte characters are represented with char, and Unicode with wchar_t. We grew up in the glow of single-byte characters, so abandoning them altogether is hard to accept, yet sometimes we cannot avoid the Unicode character set. Microsoft's solution is to select the character type with a macro: <span lang="EN-US">TCHAR<?xml:namespace prefix = o /><o:p></o:p></span></span></p> <p class="MsoNormal"><span style="font-size: 12pt; color: #333333; font-family: 宋体">Let's look at its definition:<span lang="EN-US"><o:p></o:p></span></span></p> <p class="MsoNormal"><span lang="EN-US" style="font-size: 12pt; color: #333333; font-family: 宋体">#ifdef UNICODE<br>typedef wchar_t TCHAR;<br>#else<br>typedef char TCHAR;<br>#endif<o:p></o:p></span></p> <p 
class="MsoNormal"><span lang="EN-US" style="font-size: 12pt; color: #333333; font-family: 宋体">OK</span><span style="font-size: 12pt; color: #333333; font-family: 宋体">, that solves the problem: we only need to define UNICODE and TCHAR becomes wchar_t, which is very convenient. In addition, COM on Windows uses Unicode throughout, whereas MFC defaults to MBCS. So if a class library written with MFC is moved into COM, some string formatting or return values may come out wrong. The reason is that COM always uses Unicode, and a Unicode string ends with a two-byte wchar_t terminator (L'\0'), whereas a char string ends with a single-byte '\0'. In general, ordinary string literals need to be wrapped in the _T macro to work correctly: where you wrote S = "FSDFSDF" in MFC, the class needs S = _T("FSDFSDF"); once it moves to COM. You can think of the _T macro as the counterpart of TCHAR: when UNICODE is defined it automatically prefixes the string constant with L; otherwise the literal is used as is.<span lang="EN-US"><o:p></o:p></span></span></p> <p class="MsoNormal"><span style="font-size: 12pt; color: #333333; font-family: 宋体">A few minor points:<span lang="EN-US"><o:p></o:p></span></span></p> <p class="MsoNormal"><span lang="EN-US" style="font-size: 12pt; color: #333333; font-family: 宋体">VC6</span><span style="font-size: 12pt; color: #333333; font-family: 宋体"> generates a console application with this entry point:<span lang="EN-US"><br>int main(int argc, char* argv[])<o:p></o:p></span></span></p> <p class="MsoNormal"><span lang="EN-US" style="font-size: 12pt; color: #333333; font-family: 宋体">VS C++ 2005</span><span style="font-size: 12pt; color: #333333; font-family: 宋体"> generates:<span lang="EN-US"><o:p></o:p></span></span></p> <p class="MsoNormal"><span lang="EN-US" style="font-size: 12pt; color: #333333; font-family: 宋体">int _tmain(int argc, 
_TCHAR* argv[])<o:p></o:p></span></p> <p class="MsoNormal"><span style="font-size: 12pt; color: #333333; font-family: 宋体">Clearly, <span lang="EN-US">_tmain</span> is the better choice. Why?<span lang="EN-US"><o:p></o:p></span></span></p> <p class="MsoNormal"><span lang="EN-US" style="font-size: 12pt; color: #333333; font-family: 宋体">You can also use <b>_tmain</b>, which is defined in TCHAR.h. <b>_tmain</b> will resolve to <b>main</b> unless _UNICODE is defined, in which case <b>_tmain</b> will resolve to <b>wmain</b>. (<a href="http://msdn2.microsoft.com/en-us/library/6wd819wh.aspx">http://msdn2.microsoft.com/en-us/library/6wd819wh.aspx</a>)<o:p></o:p></span></p> <p class="MsoNormal"><span style="font-size: 12pt; color: #333333; font-family: 宋体">We also frequently come across character types such as the following:<span lang="EN-US"><o:p></o:p></span></span></p> <table style="font-size: 12pt; color: #333333; font-family: 宋体" cellspacing="0" cellpadding="0" border="1"> <tbody> <tr> <td><b>Type</b></td> <td><b>Meaning in MBCS builds</b></td> <td><b>Meaning in Unicode builds</b></td></tr> <tr> <td>WCHAR</td> <td>wchar_t</td> <td>wchar_t</td></tr> <tr> <td>LPSTR</td> <td>zero-terminated string of char (char*)</td> <td>zero-terminated string of char (char*)</td></tr> <tr> <td>LPCSTR</td> <td>constant zero-terminated string of char (const char*)</td> <td>constant zero-terminated string of char (const char*)</td></tr> <tr> <td>LPWSTR</td> <td>zero-terminated Unicode string (wchar_t*)</td> <td>zero-terminated Unicode string (wchar_t*)</td></tr> <tr> <td>LPCWSTR</td> <td>constant zero-terminated Unicode string (const wchar_t*)</td> <td>constant zero-terminated Unicode string (const wchar_t*)</td></tr> <tr> <td>TCHAR</td> <td>char</td> <td>wchar_t</td></tr> <tr> <td>LPTSTR</td> <td>zero-terminated string of TCHAR (TCHAR*)</td> <td>zero-terminated string of TCHAR (TCHAR*)</td></tr> <tr> <td>LPCTSTR</td> <td>constant zero-terminated string of TCHAR (const TCHAR*)</td> <td>constant zero-terminated string of TCHAR (const TCHAR*)</td></tr></tbody></table> <p class="MsoNormal"><span style="font-size: 12pt; color: #333333; font-family: 宋体">In these names, C generally stands for constant, P for pointer, LP for long pointer, and W for wide character, that is, Unicode. With that, the purpose of each of these types should be clear.<span lang="EN-US"><o:p></o:p></span></span></p> <p class="MsoNormal"><span style="font-size: 12pt; color: #333333; font-family: 
宋体">We also often see functions such as <span lang="EN-US">_mbsstr()</span>. These are the MBCS string functions; they can of course handle SBCS text as well, but the reverse does not hold. To be safe, then, we can use <span lang="EN-US">_mbsstr</span> in place of <span lang="EN-US">strstr</span>; but if the program only ever handles SBCS, that clearly costs efficiency. How to balance efficiency against portability is a judgment call you have to make yourself.<span lang="EN-US"><o:p></o:p></span></span></p> <p class="MsoNormal"><span style="font-size: 12pt; color: #333333; font-family: 宋体">From now on, when a C++ program produces garbled text, first check which character encoding the code is using. In most cases the terminator is what went wrong: SBCS and MBCS strings end with a single zero byte, while Unicode (wchar_t) strings end with two zero bytes.<span lang="EN-US"><o:p></o:p></span></span></p> <p class="MsoNormal"><span lang="EN-US" style="font-size: 12pt; font-family: 宋体; mso-bidi-font-family: arial"><o:p>&nbsp;</o:p></span></p></span><img src ="http://www.cppblog.com/woaidongmao/aggbug/66247.html" width = "1" height = "1" /><br><br><div align=right><a style="text-decoration:none;" href="http://www.cppblog.com/woaidongmao/" target="_blank">肥仔</a> 2008-11-07 22:33 <a href="http://www.cppblog.com/woaidongmao/archive/2008/11/07/66247.html#Feedback" target="_blank" style="text-decoration:none;">Post a comment</a></div>]]></description></item><item><title>Wikipedia: UTF-16</title><link>http://www.cppblog.com/woaidongmao/archive/2008/11/07/66245.html</link><dc:creator>肥仔</dc:creator><author>肥仔</author><pubDate>Fri, 07 Nov 2008 14:31:00 
GMT</pubDate><guid>http://www.cppblog.com/woaidongmao/archive/2008/11/07/66245.html</guid><wfw:comment>http://www.cppblog.com/woaidongmao/comments/66245.html</wfw:comment><comments>http://www.cppblog.com/woaidongmao/archive/2008/11/07/66245.html#Feedback</comments><slash:comments>0</slash:comments><wfw:commentRss>http://www.cppblog.com/woaidongmao/comments/commentRss/66245.html</wfw:commentRss><trackback:ping>http://www.cppblog.com/woaidongmao/services/trackbacks/66245.html</trackback:ping><description><![CDATA[<h3 style="background: #f8fcff"><span style="font-size: 12pt; color: black; mso-hansi-font-family: 'Trebuchet MS'; mso-ascii-font-family: 'Trebuchet MS'">Wikipedia, the free encyclopedia</span><span lang="EN-US" style="font-family: 'Trebuchet MS'"><?xml:namespace prefix = u2 /><u2:p></u2:p><?xml:namespace prefix = o /><o:p></o:p></span></h3> <p style="background: #f8fcff"><b><span lang="EN-US" style="font-size: 10.5pt; font-family: 'Trebuchet MS'">UTF-16</span></b><span 
style="font-size: 10.5pt; mso-hansi-font-family: 'Trebuchet MS'; mso-ascii-font-family: 'Trebuchet MS'"> is one of the encoding forms of </span><span lang="EN-US" style="font-size: 10.5pt; font-family: 'Trebuchet MS'; mso-fareast-language: zh-tw"><a title="Unicode" href="http://zh.wikipedia.org/w/index.php?title=Unicode&amp;variant=zh-tw"><span style="mso-fareast-language: zh-cn">Unicode</span></a></span><span lang="EN-US" style="font-size: 10.5pt; font-family: 'Trebuchet MS'">. UTF stands for <i>Unicode/UCS Transformation Format</i>, that is, a way of transforming Unicode into a particular format.<u2:p></u2:p><o:p></o:p></span></p>
<p style="background: #f8fcff"><span lang="EN-US" style="font-size: 10.5pt; font-family: 'Trebuchet MS'">It is defined in Annex Q of </span><span lang="EN-US" style="font-size: 10.5pt; font-family: 'Trebuchet MS'; mso-fareast-language: zh-tw"><a title="ISO 10646" href="http://zh.wikipedia.org/w/index.php?title=ISO_10646&amp;variant=zh-tw"><span style="mso-fareast-language: zh-cn">ISO/IEC 10646-1</span></a></span><span lang="EN-US" style="font-size: 10.5pt; font-family: 'Trebuchet MS'">, and </span><span lang="EN-US" style="font-size: 10.5pt; font-family: 'Trebuchet MS'; mso-fareast-language: zh-tw"><a title="RFC" href="http://zh.wikipedia.org/w/index.php?title=RFC&amp;variant=zh-tw"><span style="mso-fareast-language: zh-cn">RFC</span></a></span><span lang="EN-US" style="font-size: 10.5pt; font-family: 'Trebuchet MS'"> 2781 defines a similar scheme.<u2:p></u2:p><o:p></o:p></span></p>
<p style="background: #f8fcff"><span lang="EN-US" style="font-size: 10.5pt; font-family: 'Trebuchet MS'">Characters defined in the Unicode </span><span lang="EN-US" style="font-size: 10.5pt; font-family: 'Trebuchet MS'; mso-fareast-language: zh-tw"><a title="基本多文種平面" href="http://zh.wikipedia.org/w/index.php?title=%E5%9F%BA%E6%9C%AC%E5%A4%9A%E6%96%87%E7%A8%AE%E5%B9%B3%E9%9D%A2&amp;variant=zh-tw"><span style="mso-fareast-language: zh-cn">Basic Multilingual Plane</span></a></span><span lang="EN-US" style="font-size: 10.5pt; font-family: 'Trebuchet MS'"> (whether Latin letters, Chinese characters, or other scripts and symbols) are always stored in 2 </span><span lang="EN-US" style="font-size: 10.5pt; font-family: 'Trebuchet MS'; mso-fareast-language: zh-tw"><a title="字节" href="http://zh.wikipedia.org/w/index.php?title=%E5%AD%97%E8%8A%82&amp;variant=zh-tw"><span style="mso-fareast-language: zh-cn">bytes</span></a></span><span lang="EN-US" style="font-size: 10.5pt; font-family: 'Trebuchet MS'">. Characters defined in the </span><span lang="EN-US" style="font-size: 10.5pt; font-family: 'Trebuchet MS'; mso-fareast-language: zh-tw"><a title="辅助平面" href="http://zh.wikipedia.org/w/index.php?title=%E8%BE%85%E5%8A%A9%E5%B9%B3%E9%9D%A2&amp;variant=zh-tw"><span style="mso-fareast-language: zh-cn">supplementary planes</span></a></span><span lang="EN-US" style="font-size: 10.5pt; font-family: 'Trebuchet MS'"> are stored as a <i>surrogate pair</i>, that is, as two 2-byte values.<u2:p></u2:p><o:p></o:p></span></p>
<p style="background: #f8fcff"><span lang="EN-US" style="font-size: 10.5pt; font-family: 'Trebuchet MS'">Compared with </span><span lang="EN-US" style="font-size: 10.5pt; font-family: 'Trebuchet MS'; mso-fareast-language: zh-tw"><a title="UTF-8" href="http://zh.wikipedia.org/w/index.php?title=UTF-8&amp;variant=zh-tw"><span style="mso-fareast-language: zh-cn">UTF-8</span></a></span><span lang="EN-US" style="font-size: 10.5pt; font-family: 'Trebuchet MS'">, UTF-16 has the advantage that most characters are stored in a fixed number of bytes (2 bytes), but UTF-16 is not compatible with </span><span lang="EN-US" style="font-size: 10.5pt; font-family: 'Trebuchet MS'; mso-fareast-language: zh-tw"><a title="ASCII" href="http://zh.wikipedia.org/w/index.php?title=ASCII&amp;variant=zh-tw"><span style="mso-fareast-language: zh-cn">ASCII</span></a></span><span lang="EN-US" style="font-size: 10.5pt; font-family: 'Trebuchet MS'">.<u2:p></u2:p><o:p></o:p></span></p> <h2 style="background: #f8fcff"><a id="UTF-16.E7.9A.84.E7.B7.A8.E7.A2.BC.E6.A8.A1.E5.BC.8F" name="UTF-16.E7.9A.84.E7.B7.A8.E7.A2.BC.E6.A8."></a><span lang="EN-US" style="font-size: 12pt"><span class="mw-headline">UTF-16 encoding schemes</span></span><span lang="EN-US" style="font-family: 'Trebuchet MS'"><u2:p></u2:p><o:p></o:p></span></h2> <p style="background: #f8fcff"><span lang="EN-US" style="font-size: 10.5pt; font-family: 'Trebuchet MS'">UTF-16</span><span style="font-size: 10.5pt; mso-hansi-font-family: 'Trebuchet MS'; 
mso-ascii-font-family: 'Trebuchet MS'"> is stored in both big-endian and little-endian forms. Generally speaking, text created or saved on </span><span lang="EN-US" style="font-size: 10.5pt; font-family: 'Trebuchet MS'; mso-fareast-language: zh-tw"><a title="Macintosh" href="http://zh.wikipedia.org/w/index.php?title=Macintosh&amp;variant=zh-tw"><span style="mso-fareast-language: zh-cn">Macintosh</span></a></span><span lang="EN-US" style="font-size: 10.5pt; font-family: 'Trebuchet MS'"> uses the big-endian format, while text created or saved on </span><span lang="EN-US" style="font-size: 10.5pt; font-family: 'Trebuchet MS'; mso-fareast-language: zh-tw"><a title="Microsoft" href="http://zh.wikipedia.org/w/index.php?title=Microsoft&amp;variant=zh-tw"><span style="mso-fareast-language: zh-cn">Microsoft</span></a></span><span lang="EN-US" style="font-size: 10.5pt; font-family: 'Trebuchet MS'"> or </span><span lang="EN-US" style="font-size: 10.5pt; font-family: 'Trebuchet MS'; mso-fareast-language: zh-tw"><a title="Linux" href="http://zh.wikipedia.org/w/index.php?title=Linux&amp;variant=zh-tw"><span style="mso-fareast-language: zh-cn">Linux</span></a></span><span lang="EN-US" style="font-size: 10.5pt; font-family: 'Trebuchet MS'"> systems uses the little-endian format.<u2:p></u2:p><o:p></o:p></span></p>
<p style="background: #f8fcff"><span lang="EN-US" style="font-size: 10.5pt; font-family: 'Trebuchet MS'">To make the byte order of a UTF-16 file identifiable, a U+FEFF character is normally placed at the start of the file as a Byte Order Mark (stored as FF FE in UTF-16LE and as FE FF in UTF-16BE), which also signals that the text file is encoded in UTF-16. In Unicode, U+FEFF is ZERO WIDTH NO-BREAK SPACE: as the name suggests, a space that has no width and does not permit a line break.<u2:p></u2:p><o:p></o:p></span></p>
<p style="background: #f8fcff"><span lang="EN-US" style="font-size: 10.5pt; font-family: 'Trebuchet MS'">The following example contains three characters: 朱 (U+6731), the half-width comma (U+002C), and 聿 (U+807F).<u2:p></u2:p><o:p></o:p></span></p> <table class="MsoNormalTable" style="border-right: medium none; border-top: medium none; background: #f9f9f9; border-left: medium none; border-bottom: medium none; border-collapse: collapse; mso-border-alt: solid #aaaaaa .75pt; mso-padding-alt: 0cm 0cm 0cm 0cm" cellspacing="0" cellpadding="0" border="1"> <tbody> 
<tr style="mso-yfti-irow: 0; mso-yfti-firstrow: yes"> <td style="border-right: #aaaaaa 1pt solid; padding-right: 2.4pt; border-top: #aaaaaa 1pt solid; padding-left: 2.4pt; background: #f2f2f2; padding-bottom: 2.4pt; border-left: #aaaaaa 1pt solid; padding-top: 2.4pt; border-bottom: #aaaaaa 1pt solid; mso-border-alt: solid #aaaaaa .75pt" colspan="6"> <p class="MsoNormal" style="margin: 12pt 0cm; text-align: center" align="center"><b><span style="font-family: 宋体; mso-hansi-font-family: 'Times New Roman'; mso-ascii-font-family: 'Times New Roman'">Example of</span><span lang="EN-US"> UTF-16 </span></b><b><span style="font-family: 宋体; mso-hansi-font-family: 'Times New Roman'; mso-ascii-font-family: 'Times New Roman'">encoding</span></b><span lang="EN-US" style="font-size: 12pt; font-family: 宋体; mso-bidi-font-family: 宋体"><u2:p></u2:p><o:p></o:p></span></p></td></tr> <tr style="mso-yfti-irow: 1"> <td style="border-right: #aaaaaa 1pt solid; padding-right: 2.4pt; border-top: medium none; padding-left: 2.4pt; background: #f2f2f2; padding-bottom: 2.4pt; border-left: #aaaaaa 1pt solid; padding-top: 2.4pt; border-bottom: #aaaaaa 1pt solid; mso-border-alt: solid #aaaaaa .75pt; mso-border-top-alt: solid #aaaaaa .75pt" rowspan="2"> <p class="MsoNormal" style="margin: 12pt 0cm; text-align: center" align="center"><b><span style="font-family: 宋体; mso-hansi-font-family: 'Times New Roman'; mso-ascii-font-family: 'Times New Roman'">Encoding name</span></b><span lang="EN-US" style="font-size: 12pt; font-family: 宋体; mso-bidi-font-family: 宋体"><u2:p></u2:p><o:p></o:p></span></p></td> <td style="border-right: #aaaaaa 1pt solid; padding-right: 2.4pt; border-top: medium none; padding-left: 2.4pt; background: #f2f2f2; padding-bottom: 2.4pt; border-left: medium none; padding-top: 2.4pt; border-bottom: #aaaaaa 1pt solid; mso-border-alt: solid #aaaaaa .75pt; mso-border-top-alt: solid #aaaaaa .75pt; mso-border-left-alt: solid #aaaaaa .75pt" rowspan="2"> <p class="MsoNormal" style="margin: 12pt 0cm; text-align: center" 
align="center"><b><span style="font-family: 宋体; mso-hansi-font-family: 'Times New Roman'; mso-ascii-font-family: 'Times New Roman'">Byte order</span></b><span lang="EN-US" style="font-size: 12pt; font-family: 宋体; mso-bidi-font-family: 宋体"><u2:p></u2:p><o:p></o:p></span></p></td> <td style="border-right: #aaaaaa 1pt solid; padding-right: 2.4pt; border-top: medium none; padding-left: 2.4pt; background: #f2f2f2; padding-bottom: 2.4pt; border-left: medium none; padding-top: 2.4pt; border-bottom: #aaaaaa 1pt solid; mso-border-alt: solid #aaaaaa .75pt; mso-border-top-alt: solid #aaaaaa .75pt; mso-border-left-alt: solid #aaaaaa .75pt" colspan="5"> <p class="MsoNormal" style="margin: 12pt 0cm; text-align: center" align="center"><b><span style="font-family: 宋体; mso-hansi-font-family: 'Times New Roman'; mso-ascii-font-family: 'Times New Roman'">Encoding</span></b><span lang="EN-US" style="font-size: 12pt; font-family: 宋体; mso-bidi-font-family: 宋体"><u2:p></u2:p><o:p></o:p></span></p></td></tr> <tr style="mso-yfti-irow: 2"> <td style="border-right: #aaaaaa 1pt solid; padding-right: 2.4pt; border-top: medium none; padding-left: 2.4pt; background: #f2f2f2; padding-bottom: 2.4pt; border-left: medium none; padding-top: 2.4pt; border-bottom: #aaaaaa 1pt solid; mso-border-alt: solid #aaaaaa .75pt; mso-border-top-alt: solid #aaaaaa .75pt; mso-border-left-alt: solid #aaaaaa .75pt"> <p class="MsoNormal" style="margin: 12pt 0cm; text-align: center" align="center"><b><span lang="EN-US">BOM</span></b><span lang="EN-US" style="font-size: 12pt; font-family: 宋体; mso-bidi-font-family: 宋体"><u2:p></u2:p><o:p></o:p></span></p></td> <td style="border-right: #aaaaaa 1pt solid; padding-right: 2.4pt; border-top: medium none; padding-left: 2.4pt; background: #f2f2f2; padding-bottom: 2.4pt; border-left: medium none; padding-top: 2.4pt; border-bottom: #aaaaaa 1pt solid; mso-border-alt: solid #aaaaaa .75pt; mso-border-top-alt: solid #aaaaaa .75pt; mso-border-left-alt: solid #aaaaaa .75pt"> <p class="MsoNormal" 
style="margin: 12pt 0cm; text-align: center" align="center"><b><span lang="EN-US">"</span></b><b><span style="font-family: 宋体; mso-hansi-font-family: 'Times New Roman'; mso-ascii-font-family: 'Times New Roman'">朱</span><span lang="EN-US">"</span></b><span lang="EN-US" style="font-size: 12pt; font-family: 宋体; mso-bidi-font-family: 宋体"><u2:p></u2:p><o:p></o:p></span></p></td> <td style="border-right: #aaaaaa 1pt solid; padding-right: 2.4pt; border-top: medium none; padding-left: 2.4pt; background: #f2f2f2; padding-bottom: 2.4pt; border-left: medium none; padding-top: 2.4pt; border-bottom: #aaaaaa 1pt solid; mso-border-alt: solid #aaaaaa .75pt; mso-border-top-alt: solid #aaaaaa .75pt; mso-border-left-alt: solid #aaaaaa .75pt"> <p class="MsoNormal" style="margin: 12pt 0cm; text-align: center" align="center"><b><span lang="EN-US">","</span></b><span lang="EN-US" style="font-size: 12pt; font-family: 宋体; mso-bidi-font-family: 宋体"><u2:p></u2:p><o:p></o:p></span></p></td> <td style="border-right: #aaaaaa 1pt solid; padding-right: 2.4pt; border-top: medium none; padding-left: 2.4pt; background: #f2f2f2; padding-bottom: 2.4pt; border-left: medium none; padding-top: 2.4pt; border-bottom: #aaaaaa 1pt solid; mso-border-alt: solid #aaaaaa .75pt; mso-border-top-alt: solid #aaaaaa .75pt; mso-border-left-alt: solid #aaaaaa .75pt"> <p class="MsoNormal" style="margin: 12pt 0cm; text-align: center" align="center"><b><span lang="EN-US">"</span></b><b><span style="font-family: 宋体; mso-hansi-font-family: 'Times New Roman'; mso-ascii-font-family: 'Times New Roman'">聿</span><span lang="EN-US">"</span></b><span lang="EN-US" style="font-size: 12pt; font-family: 宋体; mso-bidi-font-family: 宋体"><u2:p></u2:p><o:p></o:p></span></p></td> <td style="border-right: #aaaaaa 1pt solid; padding-right: 0.75pt; border-top: medium none; padding-left: 0.75pt; padding-bottom: 0.75pt; border-left: medium none; padding-top: 0.75pt; border-bottom: medium none; mso-border-right-alt: solid #aaaaaa .75pt"> <p 
class="MsoNormal" style="mso-margin-top-alt: auto; mso-margin-bottom-alt: auto"><span lang="EN-US"><u2:p>&nbsp;</u2:p></span><span lang="EN-US" style="font-size: 12pt; font-family: 宋体; mso-bidi-font-family: 宋体"><o:p></o:p></span></p></td></tr> <tr style="mso-yfti-irow: 3"> <td style="border-right: #aaaaaa 1pt solid; padding-right: 2.4pt; border-top: medium none; padding-left: 2.4pt; padding-bottom: 2.4pt; border-left: #aaaaaa 1pt solid; padding-top: 2.4pt; border-bottom: #aaaaaa 1pt solid; mso-border-alt: solid #aaaaaa .75pt; mso-border-top-alt: solid #aaaaaa .75pt"> <p class="MsoNormal" style="margin: 12pt 0cm"><span lang="EN-US">UTF-16LE</span><span lang="EN-US" style="font-size: 12pt; font-family: 宋体; mso-bidi-font-family: 宋体"><u2:p></u2:p><o:p></o:p></span></p></td> <td style="border-right: #aaaaaa 1pt solid; padding-right: 2.4pt; border-top: medium none; padding-left: 2.4pt; padding-bottom: 2.4pt; border-left: medium none; padding-top: 2.4pt; border-bottom: #aaaaaa 1pt solid; mso-border-alt: solid #aaaaaa .75pt; mso-border-top-alt: solid #aaaaaa .75pt; mso-border-left-alt: solid #aaaaaa .75pt"> <p class="MsoNormal" style="margin: 12pt 0cm"><span style="font-family: 宋体; mso-hansi-font-family: 'Times New Roman'; mso-ascii-font-family: 'Times New Roman'">小尾序</span><span lang="EN-US" style="font-size: 12pt; font-family: 宋体; mso-bidi-font-family: 宋体"><u2:p></u2:p><o:p></o:p></span></p></td> <td style="border-right: #aaaaaa 1pt solid; padding-right: 2.4pt; border-top: medium none; padding-left: 2.4pt; padding-bottom: 2.4pt; border-left: medium none; padding-top: 2.4pt; border-bottom: #aaaaaa 1pt solid; mso-border-alt: solid #aaaaaa .75pt; mso-border-top-alt: solid #aaaaaa .75pt; mso-border-left-alt: solid #aaaaaa .75pt"> <p class="MsoNormal" style="margin: 12pt 0cm"><span lang="EN-US"><u2:p>&nbsp;</u2:p></span><span lang="EN-US" style="font-size: 12pt; font-family: 宋体; mso-bidi-font-family: 宋体"><o:p></o:p></span></p></td> <td style="border-right: #aaaaaa 1pt solid; 
padding-right: 2.4pt; border-top: medium none; padding-left: 2.4pt; padding-bottom: 2.4pt; border-left: medium none; padding-top: 2.4pt; border-bottom: #aaaaaa 1pt solid; mso-border-alt: solid #aaaaaa .75pt; mso-border-top-alt: solid #aaaaaa .75pt; mso-border-left-alt: solid #aaaaaa .75pt"> <p class="MsoNormal" style="margin: 12pt 0cm"><span lang="EN-US">31 67</span><span lang="EN-US" style="font-size: 12pt; font-family: 宋体; mso-bidi-font-family: 宋体"><u2:p></u2:p><o:p></o:p></span></p></td> <td style="border-right: #aaaaaa 1pt solid; padding-right: 2.4pt; border-top: medium none; padding-left: 2.4pt; padding-bottom: 2.4pt; border-left: medium none; padding-top: 2.4pt; border-bottom: #aaaaaa 1pt solid; mso-border-alt: solid #aaaaaa .75pt; mso-border-top-alt: solid #aaaaaa .75pt; mso-border-left-alt: solid #aaaaaa .75pt"> <p class="MsoNormal" style="margin: 12pt 0cm"><u3:chmetcnv tcsc="0" numbertype="1" negative="False" hasspace="False" sourcevalue="2" unitname="C" u4:st="on"><st1:chmetcnv w:st="on" tcsc="0" numbertype="1" negative="False" hasspace="False" sourcevalue="2" unitname="C"><span lang="EN-US">2C</span></st1:chmetcnv></u3:chmetcnv><span lang="EN-US"> 00</span><span lang="EN-US" style="font-size: 12pt; font-family: 宋体; mso-bidi-font-family: 宋体"><u2:p></u2:p><o:p></o:p></span></p></td> <td style="border-right: #aaaaaa 1pt solid; padding-right: 2.4pt; border-top: medium none; padding-left: 2.4pt; padding-bottom: 2.4pt; border-left: medium none; padding-top: 2.4pt; border-bottom: #aaaaaa 1pt solid; mso-border-alt: solid #aaaaaa .75pt; mso-border-top-alt: solid #aaaaaa .75pt; mso-border-left-alt: solid #aaaaaa .75pt"> <p class="MsoNormal" style="margin: 12pt 0cm"><u3:chmetcnv tcsc="0" numbertype="1" negative="False" hasspace="False" sourcevalue="7" unitname="F" u4:st="on"><st1:chmetcnv w:st="on" tcsc="0" numbertype="1" negative="False" hasspace="False" sourcevalue="7" unitname="F"><span lang="EN-US">7F</span></st1:chmetcnv></u3:chmetcnv><span lang="EN-US"> 
80</span><span lang="EN-US" style="font-size: 12pt; font-family: 宋体; mso-bidi-font-family: 宋体"><u2:p></u2:p><o:p></o:p></span></p></td> <td style="border-right: #aaaaaa 1pt solid; padding-right: 0.75pt; border-top: medium none; padding-left: 0.75pt; padding-bottom: 0.75pt; border-left: medium none; padding-top: 0.75pt; border-bottom: medium none; mso-border-right-alt: solid #aaaaaa .75pt"> <p class="MsoNormal" style="mso-margin-top-alt: auto; mso-margin-bottom-alt: auto"><span lang="EN-US"><u2:p>&nbsp;</u2:p></span><span lang="EN-US" style="font-size: 12pt; font-family: 宋体; mso-bidi-font-family: 宋体"><o:p></o:p></span></p></td></tr> <tr style="mso-yfti-irow: 4"> <td style="border-right: #aaaaaa 1pt solid; padding-right: 2.4pt; border-top: medium none; padding-left: 2.4pt; padding-bottom: 2.4pt; border-left: #aaaaaa 1pt solid; padding-top: 2.4pt; border-bottom: #aaaaaa 1pt solid; mso-border-alt: solid #aaaaaa .75pt; mso-border-top-alt: solid #aaaaaa .75pt"> <p class="MsoNormal" style="margin: 12pt 0cm"><span lang="EN-US">UTF-16BE</span><span lang="EN-US" style="font-size: 12pt; font-family: 宋体; mso-bidi-font-family: 宋体"><u2:p></u2:p><o:p></o:p></span></p></td> <td style="border-right: #aaaaaa 1pt solid; padding-right: 2.4pt; border-top: medium none; padding-left: 2.4pt; padding-bottom: 2.4pt; border-left: medium none; padding-top: 2.4pt; border-bottom: #aaaaaa 1pt solid; mso-border-alt: solid #aaaaaa .75pt; mso-border-top-alt: solid #aaaaaa .75pt; mso-border-left-alt: solid #aaaaaa .75pt"> <p class="MsoNormal" style="margin: 12pt 0cm"><span style="font-family: 宋体; mso-hansi-font-family: 'Times New Roman'; mso-ascii-font-family: 'Times New Roman'">大尾序</span><span lang="EN-US" style="font-size: 12pt; font-family: 宋体; mso-bidi-font-family: 宋体"><u2:p></u2:p><o:p></o:p></span></p></td> <td style="border-right: #aaaaaa 1pt solid; padding-right: 2.4pt; border-top: medium none; padding-left: 2.4pt; padding-bottom: 2.4pt; border-left: medium none; padding-top: 2.4pt; 
border-bottom: #aaaaaa 1pt solid; mso-border-alt: solid #aaaaaa .75pt; mso-border-top-alt: solid #aaaaaa .75pt; mso-border-left-alt: solid #aaaaaa .75pt"> <p class="MsoNormal" style="margin: 12pt 0cm"><span lang="EN-US"><u2:p>&nbsp;</u2:p></span><span lang="EN-US" style="font-size: 12pt; font-family: 宋体; mso-bidi-font-family: 宋体"><o:p></o:p></span></p></td> <td style="border-right: #aaaaaa 1pt solid; padding-right: 2.4pt; border-top: medium none; padding-left: 2.4pt; padding-bottom: 2.4pt; border-left: medium none; padding-top: 2.4pt; border-bottom: #aaaaaa 1pt solid; mso-border-alt: solid #aaaaaa .75pt; mso-border-top-alt: solid #aaaaaa .75pt; mso-border-left-alt: solid #aaaaaa .75pt"> <p class="MsoNormal" style="margin: 12pt 0cm"><span lang="EN-US">67 31</span><span lang="EN-US" style="font-size: 12pt; font-family: 宋体; mso-bidi-font-family: 宋体"><u2:p></u2:p><o:p></o:p></span></p></td> <td style="border-right: #aaaaaa 1pt solid; padding-right: 2.4pt; border-top: medium none; padding-left: 2.4pt; padding-bottom: 2.4pt; border-left: medium none; padding-top: 2.4pt; border-bottom: #aaaaaa 1pt solid; mso-border-alt: solid #aaaaaa .75pt; mso-border-top-alt: solid #aaaaaa .75pt; mso-border-left-alt: solid #aaaaaa .75pt"> <p class="MsoNormal" style="margin: 12pt 0cm"><span lang="EN-US">00 <u3:chmetcnv tcsc="0" numbertype="1" negative="False" hasspace="False" sourcevalue="2" unitname="C" u4:st="on"><st1:chmetcnv w:st="on" tcsc="0" numbertype="1" negative="False" hasspace="False" sourcevalue="2" unitname="C">2C</u3:chmetcnv></st1:chmetcnv></span><span lang="EN-US" style="font-size: 12pt; font-family: 宋体; mso-bidi-font-family: 宋体"><u2:p></u2:p><o:p></o:p></span></p></td> <td style="border-right: #aaaaaa 1pt solid; padding-right: 2.4pt; border-top: medium none; padding-left: 2.4pt; padding-bottom: 2.4pt; border-left: medium none; padding-top: 2.4pt; border-bottom: #aaaaaa 1pt solid; mso-border-alt: solid #aaaaaa .75pt; mso-border-top-alt: solid #aaaaaa .75pt; 
mso-border-left-alt: solid #aaaaaa .75pt"> <p class="MsoNormal" style="margin: 12pt 0cm"><span lang="EN-US">80 <u3:chmetcnv tcsc="0" numbertype="1" negative="False" hasspace="False" sourcevalue="7" unitname="F" u4:st="on"><st1:chmetcnv w:st="on" tcsc="0" numbertype="1" negative="False" hasspace="False" sourcevalue="7" unitname="F">7F</u3:chmetcnv></st1:chmetcnv></span><span lang="EN-US" style="font-size: 12pt; font-family: 宋体; mso-bidi-font-family: 宋体"><u2:p></u2:p><o:p></o:p></span></p></td> <td style="border-right: #aaaaaa 1pt solid; padding-right: 0.75pt; border-top: medium none; padding-left: 0.75pt; padding-bottom: 0.75pt; border-left: medium none; padding-top: 0.75pt; border-bottom: medium none; mso-border-right-alt: solid #aaaaaa .75pt"> <p class="MsoNormal" style="mso-margin-top-alt: auto; mso-margin-bottom-alt: auto"><span lang="EN-US"><u2:p>&nbsp;</u2:p></span><span lang="EN-US" style="font-size: 12pt; font-family: 宋体; mso-bidi-font-family: 宋体"><o:p></o:p></span></p></td></tr> <tr style="mso-yfti-irow: 5"> <td style="border-right: #aaaaaa 1pt solid; padding-right: 2.4pt; border-top: medium none; padding-left: 2.4pt; padding-bottom: 2.4pt; border-left: #aaaaaa 1pt solid; padding-top: 2.4pt; border-bottom: #aaaaaa 1pt solid; mso-border-alt: solid #aaaaaa .75pt; mso-border-top-alt: solid #aaaaaa .75pt"> <p class="MsoNormal" style="margin: 12pt 0cm"><span lang="EN-US">UTF-16</span><span lang="EN-US" style="font-size: 12pt; font-family: 宋体; mso-bidi-font-family: 宋体"><u2:p></u2:p><o:p></o:p></span></p></td> <td style="border-right: #aaaaaa 1pt solid; padding-right: 2.4pt; border-top: medium none; padding-left: 2.4pt; padding-bottom: 2.4pt; border-left: medium none; padding-top: 2.4pt; border-bottom: #aaaaaa 1pt solid; mso-border-alt: solid #aaaaaa .75pt; mso-border-top-alt: solid #aaaaaa .75pt; mso-border-left-alt: solid #aaaaaa .75pt"> <p class="MsoNormal" style="margin: 12pt 0cm"><span style="font-family: 宋体; mso-hansi-font-family: 'Times New Roman'; 
mso-ascii-font-family: 'Times New Roman'">小尾序，包含</span><span lang="EN-US">BOM</span><span lang="EN-US" style="font-size: 12pt; font-family: 宋体; mso-bidi-font-family: 宋体"><u2:p></u2:p><o:p></o:p></span></p></td> <td style="border-right: #aaaaaa 1pt solid; padding-right: 2.4pt; border-top: medium none; padding-left: 2.4pt; padding-bottom: 2.4pt; border-left: medium none; padding-top: 2.4pt; border-bottom: #aaaaaa 1pt solid; mso-border-alt: solid #aaaaaa .75pt; mso-border-top-alt: solid #aaaaaa .75pt; mso-border-left-alt: solid #aaaaaa .75pt"> <p class="MsoNormal" style="margin: 12pt 0cm"><span lang="EN-US">FF FE</span><span lang="EN-US" style="font-size: 12pt; font-family: 宋体; mso-bidi-font-family: 宋体"><u2:p></u2:p><o:p></o:p></span></p></td> <td style="border-right: #aaaaaa 1pt solid; padding-right: 2.4pt; border-top: medium none; padding-left: 2.4pt; padding-bottom: 2.4pt; border-left: medium none; padding-top: 2.4pt; border-bottom: #aaaaaa 1pt solid; mso-border-alt: solid #aaaaaa .75pt; mso-border-top-alt: solid #aaaaaa .75pt; mso-border-left-alt: solid #aaaaaa .75pt"> <p class="MsoNormal" style="margin: 12pt 0cm"><span lang="EN-US">31 67</span><span lang="EN-US" style="font-size: 12pt; font-family: 宋体; mso-bidi-font-family: 宋体"><u2:p></u2:p><o:p></o:p></span></p></td> <td style="border-right: #aaaaaa 1pt solid; padding-right: 2.4pt; border-top: medium none; padding-left: 2.4pt; padding-bottom: 2.4pt; border-left: medium none; padding-top: 2.4pt; border-bottom: #aaaaaa 1pt solid; mso-border-alt: solid #aaaaaa .75pt; mso-border-top-alt: solid #aaaaaa .75pt; mso-border-left-alt: solid #aaaaaa .75pt"> <p class="MsoNormal" style="margin: 12pt 0cm"><u3:chmetcnv tcsc="0" numbertype="1" negative="False" hasspace="False" sourcevalue="2" unitname="C" u4:st="on"><st1:chmetcnv w:st="on" tcsc="0" numbertype="1" negative="False" hasspace="False" sourcevalue="2" unitname="C"><span lang="EN-US">2C</span></st1:chmetcnv></u3:chmetcnv><span lang="EN-US"> 00</span><span lang="EN-US" 
style="font-size: 12pt; font-family: 宋体; mso-bidi-font-family: 宋体"><u2:p></u2:p><o:p></o:p></span></p></td> <td style="border-right: #aaaaaa 1pt solid; padding-right: 2.4pt; border-top: medium none; padding-left: 2.4pt; padding-bottom: 2.4pt; border-left: medium none; padding-top: 2.4pt; border-bottom: #aaaaaa 1pt solid; mso-border-alt: solid #aaaaaa .75pt; mso-border-top-alt: solid #aaaaaa .75pt; mso-border-left-alt: solid #aaaaaa .75pt"> <p class="MsoNormal" style="margin: 12pt 0cm"><u3:chmetcnv tcsc="0" numbertype="1" negative="False" hasspace="False" sourcevalue="7" unitname="F" u4:st="on"><st1:chmetcnv w:st="on" tcsc="0" numbertype="1" negative="False" hasspace="False" sourcevalue="7" unitname="F"><span lang="EN-US">7F</span></st1:chmetcnv></u3:chmetcnv><span lang="EN-US"> 80</span><span lang="EN-US" style="font-size: 12pt; font-family: 宋体; mso-bidi-font-family: 宋体"><u2:p></u2:p><o:p></o:p></span></p></td> <td style="border-right: #aaaaaa 1pt solid; padding-right: 0.75pt; border-top: medium none; padding-left: 0.75pt; padding-bottom: 0.75pt; border-left: medium none; padding-top: 0.75pt; border-bottom: medium none; mso-border-right-alt: solid #aaaaaa .75pt"> <p class="MsoNormal" style="mso-margin-top-alt: auto; mso-margin-bottom-alt: auto"><span lang="EN-US"><u2:p>&nbsp;</u2:p></span><span lang="EN-US" style="font-size: 12pt; font-family: 宋体; mso-bidi-font-family: 宋体"><o:p></o:p></span></p></td></tr> <tr style="mso-yfti-irow: 6; mso-yfti-lastrow: yes"> <td style="border-right: #aaaaaa 1pt solid; padding-right: 2.4pt; border-top: medium none; padding-left: 2.4pt; padding-bottom: 2.4pt; border-left: #aaaaaa 1pt solid; padding-top: 2.4pt; border-bottom: #aaaaaa 1pt solid; mso-border-alt: solid #aaaaaa .75pt; mso-border-top-alt: solid #aaaaaa .75pt"> <p class="MsoNormal" style="margin: 12pt 0cm"><span lang="EN-US">UTF-16</span><span lang="EN-US" style="font-size: 12pt; font-family: 宋体; mso-bidi-font-family: 宋体"><u2:p></u2:p><o:p></o:p></span></p></td> <td 
style="border-right: #aaaaaa 1pt solid; padding-right: 2.4pt; border-top: medium none; padding-left: 2.4pt; padding-bottom: 2.4pt; border-left: medium none; padding-top: 2.4pt; border-bottom: #aaaaaa 1pt solid; mso-border-alt: solid #aaaaaa .75pt; mso-border-top-alt: solid #aaaaaa .75pt; mso-border-left-alt: solid #aaaaaa .75pt"> <p class="MsoNormal" style="margin: 12pt 0cm"><span style="font-family: 宋体; mso-hansi-font-family: 'Times New Roman'; mso-ascii-font-family: 'Times New Roman'">大尾序，包含</span><span lang="EN-US">BOM</span><span lang="EN-US" style="font-size: 12pt; font-family: 宋体; mso-bidi-font-family: 宋体"><u2:p></u2:p><o:p></o:p></span></p></td> <td style="border-right: #aaaaaa 1pt solid; padding-right: 2.4pt; border-top: medium none; padding-left: 2.4pt; padding-bottom: 2.4pt; border-left: medium none; padding-top: 2.4pt; border-bottom: #aaaaaa 1pt solid; mso-border-alt: solid #aaaaaa .75pt; mso-border-top-alt: solid #aaaaaa .75pt; mso-border-left-alt: solid #aaaaaa .75pt"> <p class="MsoNormal" style="margin: 12pt 0cm"><span lang="EN-US">FE FF</span><span lang="EN-US" style="font-size: 12pt; font-family: 宋体; mso-bidi-font-family: 宋体"><u2:p></u2:p><o:p></o:p></span></p></td> <td style="border-right: #aaaaaa 1pt solid; padding-right: 2.4pt; border-top: medium none; padding-left: 2.4pt; padding-bottom: 2.4pt; border-left: medium none; padding-top: 2.4pt; border-bottom: #aaaaaa 1pt solid; mso-border-alt: solid #aaaaaa .75pt; mso-border-top-alt: solid #aaaaaa .75pt; mso-border-left-alt: solid #aaaaaa .75pt"> <p class="MsoNormal" style="margin: 12pt 0cm"><span lang="EN-US">67 31</span><span lang="EN-US" style="font-size: 12pt; font-family: 宋体; mso-bidi-font-family: 宋体"><u2:p></u2:p><o:p></o:p></span></p></td> <td style="border-right: #aaaaaa 1pt solid; padding-right: 2.4pt; border-top: medium none; padding-left: 2.4pt; padding-bottom: 2.4pt; border-left: medium none; padding-top: 2.4pt; border-bottom: #aaaaaa 1pt solid; mso-border-alt: solid #aaaaaa .75pt; 
mso-border-top-alt: solid #aaaaaa .75pt; mso-border-left-alt: solid #aaaaaa .75pt"> <p class="MsoNormal" style="margin: 12pt 0cm"><span lang="EN-US">00 <u3:chmetcnv tcsc="0" numbertype="1" negative="False" hasspace="False" sourcevalue="2" unitname="C" u4:st="on"><st1:chmetcnv w:st="on" tcsc="0" numbertype="1" negative="False" hasspace="False" sourcevalue="2" unitname="C">2C</u3:chmetcnv></st1:chmetcnv></span><span lang="EN-US" style="font-size: 12pt; font-family: 宋体; mso-bidi-font-family: 宋体"><u2:p></u2:p><o:p></o:p></span></p></td> <td style="border-right: #aaaaaa 1pt solid; padding-right: 2.4pt; border-top: medium none; padding-left: 2.4pt; padding-bottom: 2.4pt; border-left: medium none; padding-top: 2.4pt; border-bottom: #aaaaaa 1pt solid; mso-border-alt: solid #aaaaaa .75pt; mso-border-top-alt: solid #aaaaaa .75pt; mso-border-left-alt: solid #aaaaaa .75pt"> <p class="MsoNormal" style="margin: 12pt 0cm"><span lang="EN-US">80 <u3:chmetcnv tcsc="0" numbertype="1" negative="False" hasspace="False" sourcevalue="7" unitname="F" u4:st="on"><st1:chmetcnv w:st="on" tcsc="0" numbertype="1" negative="False" hasspace="False" sourcevalue="7" unitname="F">7F</u3:chmetcnv></st1:chmetcnv></span><span lang="EN-US" style="font-size: 12pt; font-family: 宋体; mso-bidi-font-family: 宋体"><u2:p></u2:p><o:p></o:p></span></p></td> <td style="border-right: #aaaaaa 1pt solid; padding-right: 0.75pt; border-top: medium none; padding-left: 0.75pt; padding-bottom: 0.75pt; border-left: medium none; padding-top: 0.75pt; border-bottom: #aaaaaa 1pt solid; mso-border-bottom-alt: solid #aaaaaa .75pt; mso-border-right-alt: solid #aaaaaa .75pt"> <p class="MsoNormal" style="mso-margin-top-alt: auto; mso-margin-bottom-alt: auto"><span lang="EN-US"><u2:p>&nbsp;</u2:p></span><span lang="EN-US" style="font-size: 12pt; font-family: 宋体; mso-bidi-font-family: 宋体"><o:p></o:p></span></p></td></tr></tbody></table> <h2 style="background: #f8fcff"><a id="UTF-16_.E8.88.87_UCS-2_.E7.9A.84.E9.97.9C.E4.BF.82" 
name="UTF-16_.E8.88.87_UCS-2_.E7.9A.84.E9.97.9"></a><span lang="EN-US" style="font-size: 12pt"><span class="mw-headline">Relationship between UTF-16 and UCS-2</span></span></h2> <p style="background: #f8fcff"><span style="font-size: 10.5pt; font-family: 'Trebuchet MS'">UTF-16 can be regarded as a <a title="父集" href="http://zh.wikipedia.org/w/index.php?title=%E7%88%B6%E9%9B%86&amp;variant=zh-tw">superset</a> of UCS-2. Before supplementary-plane characters existed, UTF-16 and UCS-2 meant the same thing. Once supplementary-plane characters were introduced, the encoding was called UTF-16 only. Today, when a piece of software claims to support UCS-2, that is really a euphemism for saying it cannot handle supplementary-plane characters.</span></p><br><br><div align=right><a style="text-decoration:none;" href="http://www.cppblog.com/woaidongmao/" target="_blank">肥仔</a> 2008-11-07 22:31 <a href="http://www.cppblog.com/woaidongmao/archive/2008/11/07/66245.html#Feedback" target="_blank" style="text-decoration:none;">Post a comment</a></div>]]></description></item><item><title>On Unicode encoding: a brief explanation of UCS, UTF, BMP, BOM and related terms</title><link>http://www.cppblog.com/woaidongmao/archive/2008/11/07/66242.html</link><dc:creator>肥仔</dc:creator><author>肥仔</author><pubDate>Fri, 07 Nov 2008 14:14:00
GMT</pubDate><guid>http://www.cppblog.com/woaidongmao/archive/2008/11/07/66242.html</guid><wfw:comment>http://www.cppblog.com/woaidongmao/comments/66242.html</wfw:comment><comments>http://www.cppblog.com/woaidongmao/archive/2008/11/07/66242.html#Feedback</comments><slash:comments>0</slash:comments><wfw:commentRss>http://www.cppblog.com/woaidongmao/comments/commentRss/66242.html</wfw:commentRss><trackback:ping>http://www.cppblog.com/woaidongmao/services/trackbacks/66242.html</trackback:ping><description><![CDATA[<p>&#160;</p>
<p class=MsoNormal>This is a light read written by a programmer for programmers. "Light" here means that you can pick up a few concepts that used to be fuzzy and add to your knowledge without much effort, rather like levelling up in an RPG. Two questions prompted me to put it together:</p>
<p class=MsoNormal><span style="FONT-SIZE: 12pt; FONT-FAMILY: 宋体">Question 1:</span></p>
<p style="MARGIN-LEFT: 36pt">The &#8220;Save As&#8221; dialog of Windows Notepad can convert a file between GBK, Unicode, Unicode big endian and UTF-8. These are all plain txt files, so how does Windows tell which encoding a file uses?</p>
<p style="MARGIN-LEFT: 36pt">I noticed long ago that txt files saved as Unicode, Unicode big endian and UTF-8 start with a few extra bytes: FF FE (Unicode), FE FF (Unicode big endian) and EF BB BF (UTF-8). But which standard are these markers based on?</p>
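<p>The leading bytes observed above can be checked with a short Python sketch. It is only an illustration of the observation, not a description of how Notepad itself works; the helper name <code>sniff_bom</code> and the table <code>BOMS</code> are made up for this example, while the three constants come from Python's standard <code>codecs</code> module.</p>

```python
import codecs

# Marker bytes from the text, paired with the names Notepad shows for them.
BOMS = [
    (codecs.BOM_UTF8, "UTF-8"),                    # EF BB BF
    (codecs.BOM_UTF16_LE, "Unicode"),              # FF FE
    (codecs.BOM_UTF16_BE, "Unicode big endian"),   # FE FF
]


def sniff_bom(data):
    """Return the Notepad-style name implied by the leading bytes, or None."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    # No marker found: the file may be GBK, or any other unmarked encoding.
    return None
```

<p>For example, <code>sniff_bom(open(path, "rb").read(4))</code> would report which of the three markers, if any, a file on disk starts with (<code>path</code> being whatever file you want to inspect).</p>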
<p class=MsoNormal><span style="FONT-SIZE: 12pt; FONT-FAMILY: 宋体">Question 2:</span></p>
<p class=MsoNormal style="MARGIN-LEFT: 36pt"><span style="FONT-SIZE: 12pt; FONT-FAMILY: 宋体">I recently came across a file called ConvertUTF.c on the web that converts between UTF-32, UTF-16 and UTF-8. I was already familiar with Unicode (UCS2), GBK and UTF-8, but this program left me confused: I could not recall how UTF-16 relates to UCS2.</span></p>
<p>After looking up the relevant material I finally sorted these questions out, and learned some details of Unicode along the way. I have written them up as an article for anyone who has had similar doubts. I have tried to keep it easy to follow, but the reader is expected to know what a byte is and what hexadecimal is.</p>
<div style="BORDER-RIGHT: medium none; PADDING-RIGHT: 0cm; BORDER-TOP: medium none; PADDING-LEFT: 0cm; PADDING-BOTTOM: 0cm; BORDER-LEFT: medium none; PADDING-TOP: 0cm; BORDER-BOTTOM: #aaaaaa 1pt solid; mso-element: para-border-div; mso-border-bottom-alt: solid #aaaaaa .75pt">
<h3><span style="FONT-SIZE: 12pt">0. big endian and little endian</span></h3>
</div>
<p>big endian and little endian are two different ways in which a CPU handles multi-byte numbers. For example, the Unicode code of the character &#8220;汉&#8221; is 6C49. When it is written to a file, should 6C come first or 49? If 6C is written first, that is big endian; if 49 is written first, that is little endian.</p>
<p>The word &#8220;endian&#8221; comes from Gulliver's Travels. The civil war in Lilliput arose from a dispute over whether an egg should be cracked at the big end (Big-Endian) or the little end (Little-Endian); it led to six rebellions, in which one emperor lost his life and another lost his crown.</p>
<p>In Chinese, endian is usually rendered as &#8220;字节序&#8221; (byte order), and big endian and little endian are called &#8220;大尾&#8221; and &#8220;小尾&#8221;.</p>
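<p>A minimal Python sketch of the two byte orders, using the same character as the text. The variable names and the helper <code>hex_bytes</code> are chosen for this illustration only; <code>int.to_bytes</code> and the <code>utf-16-be</code> / <code>utf-16-le</code> codecs are part of standard Python.</p>

```python
code_point = ord("汉")   # 0x6C49, the value quoted in the text

# Write the same 16-bit number with the high byte first, then with the low byte first.
big_endian = code_point.to_bytes(2, "big")        # bytes 6C 49
little_endian = code_point.to_bytes(2, "little")  # bytes 49 6C


def hex_bytes(data):
    """Format bytes as space-separated upper-case hex, e.g. '6C 49'."""
    return " ".join(format(b, "02X") for b in data)
```

<p>The two results are the same bytes that Python's <code>utf-16-be</code> and <code>utf-16-le</code> codecs produce for this character.</p>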
<div style="BORDER-RIGHT: medium none; PADDING-RIGHT: 0cm; BORDER-TOP: medium none; PADDING-LEFT: 0cm; PADDING-BOTTOM: 0cm; BORDER-LEFT: medium none; PADDING-TOP: 0cm; BORDER-BOTTOM: #aaaaaa 1pt solid; mso-element: para-border-div; mso-border-bottom-alt: solid #aaaaaa .75pt">
<h3><span style="FONT-SIZE: 12pt">1. Character encodings and internal codes, with a brief look at Chinese character encodings</span></h3>
</div>
<p>Characters must be encoded before a computer can process them. The default encoding a computer uses is its internal code. Early computers used 7-bit ASCII; to handle Chinese characters, programmers designed GB2312 for Simplified Chinese and big5 for Traditional Chinese.</p>
<p>GB2312 (1980) contains 7445 characters in total: 6763 Chinese characters and 682 other symbols. In the Chinese-character area the internal code has a high byte from B0 to F7 and a low byte from A1 to FE, which gives 72*94=6768 code points. Five of them, D7FA to D7FE, are unassigned.</p>
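<p>The counts in the paragraph above follow directly from the byte ranges. The arithmetic can be written out as a small Python sketch; the constant and variable names are invented for this example, and the figure 682 is taken from the text rather than computed.</p>

```python
HIGH_FIRST, HIGH_LAST = 0xB0, 0xF7   # high-byte range of the Chinese-character area
LOW_FIRST, LOW_LAST = 0xA1, 0xFE     # low-byte range

rows = HIGH_LAST - HIGH_FIRST + 1    # 72 possible high bytes
cells = LOW_LAST - LOW_FIRST + 1     # 94 possible low bytes
code_points = rows * cells           # 6768 code points in the area

unassigned = 0xD7FE - 0xD7FA + 1     # the 5 empty positions D7FA..D7FE
hanzi = code_points - unassigned     # 6763 Chinese characters
total = hanzi + 682                  # plus the other symbols: 7445 characters
```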
<p>GB2312 supports too few Chinese characters. The 1995 extension specification GBK1.0 contains 21886 symbols, divided into a Chinese-character area and a graphic-symbol area; the Chinese-character area holds 21003 characters. GB18030 (2000) is the official national standard that supersedes GBK1.0. It contains 27484 Chinese characters as well as the scripts of major ethnic minorities such as Tibetan, Mongolian and Uyghur. PC platforms are now required to support GB18030, while embedded products are not yet subject to this requirement, so mobile phones and MP3 players generally support only GB2312.</p>
<p>From ASCII through GB2312 and GBK to GB18030, these encodings are backward compatible: a given character always has the same code in each scheme, and the later standards support more characters. English and Chinese text can be handled uniformly in these encodings. A Chinese code is recognised by the most significant bit of its high byte being non-zero. In programmers' terms, GB2312, GBK and GB18030 are all double-byte character sets (DBCS); strictly speaking, GB18030 also defines four-byte codes in addition to its one-byte and two-byte codes.</p>
<p>Some Chinese editions of Windows still use GBK as the default internal code and can be upgraded to GB18030 with the GB18030 upgrade package. The characters that GB18030 adds over GBK, however, are rarely needed by ordinary users, so we usually still use GBK to refer to the internal code of Chinese Windows.</p>
<p>A few more details:</p>
<p style="MARGIN-LEFT: 36pt; TEXT-INDENT: -18pt; mso-list: l0 level1 lfo1; tab-stops: list 36.0pt"><span lang=EN-US style="FONT-SIZE: 10pt; FONT-FAMILY: symbol; mso-bidi-font-size: 12.0pt; mso-fareast-font-family: symbol; mso-bidi-font-family: symbol"><span style="mso-list: ignore">&#183;<span style="FONT: 7pt 'Times New Roman'">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; </span></span></span>The GB2312 standard itself is written in terms of qu-wei (row and cell) codes. To get from a qu-wei code to the internal code, add A0 to both the high byte and the low byte.</p>
<p style="MARGIN-LEFT: 36pt; TEXT-INDENT: -18pt; mso-list: l0 level1 lfo1; tab-stops: list 36.0pt"><span lang=EN-US style="FONT-SIZE: 10pt; FONT-FAMILY: symbol; mso-bidi-font-size: 12.0pt; mso-fareast-font-family: symbol; mso-bidi-font-family: symbol"><span style="mso-list: ignore">&#183;<span style="FONT: 7pt 'Times New Roman'">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; </span></span></span>In DBCS, GB internal codes are always stored big endian, that is, high byte first.</p>
<p style="MARGIN-LEFT: 36pt; TEXT-INDENT: -18pt; mso-list: l0 level1 lfo1; tab-stops: list 36.0pt"><span lang=EN-US style="FONT-SIZE: 10pt; FONT-FAMILY: symbol; mso-bidi-font-size: 12.0pt; mso-fareast-font-family: symbol; mso-bidi-font-family: symbol"><span style="mso-list: ignore">&#183;<span style="FONT: 7pt 'Times New Roman'">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; </span></span></span>Both bytes of a GB2312 code have their most significant bit set to 1, but only 128*128=16384 code points satisfy that condition. In GBK and GB18030 the low byte may therefore have a most significant bit that is not 1. This does not affect the parsing of a DBCS stream: whenever a byte with its high bit set to 1 is encountered, that byte and the one after it are taken together as one double-byte code, whatever the high bit of the low byte is.</p>
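<p>The details listed above can be sketched in Python. This is a minimal illustration under stated assumptions, not a full decoder: the function names <code>quwei_to_internal</code> and <code>split_dbcs</code> are made up here, and <code>split_dbcs</code> assumes the stream contains only one-byte and two-byte codes, so it does not handle the four-byte codes of GB18030.</p>

```python
def quwei_to_internal(qu, wei):
    """Add 0xA0 to the row (qu) and the cell (wei) to get the internal code."""
    return bytes([qu + 0xA0, wei + 0xA0])


def split_dbcs(data):
    """Split a DBCS byte stream into one-byte and two-byte units.

    Only the lead byte is inspected: if its top bit is set, it and the
    following byte form one double-byte code, whatever that second byte is.
    """
    units = []
    i = 0
    while len(data) > i:
        width = 2 if data[i] >= 0x80 else 1
        units.append(data[i:i + width])
        i += width
    return units
```

<p>With these helpers, row 26, cell 26 gives the bytes BA BA, which is the GB code of &#8220;汉&#8221;, and the same two bytes come out of Python's <code>gb2312</code>, <code>gbk</code> and <code>gb18030</code> codecs for that character. A pair such as 81 40, whose low byte has its top bit clear, is still kept together as one unit by <code>split_dbcs</code>.</p>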
<div style="BORDER-RIGHT: medium none; PADDING-RIGHT: 0cm; BORDER-TOP: medium none; PADDING-LEFT: 0cm; PADDING-BOTTOM: 0cm; BORDER-LEFT: medium none; PADDING-TOP: 0cm; BORDER-BOTTOM: #aaaaaa 1pt solid; mso-element: para-border-div; mso-border-bottom-alt: solid #aaaaaa .75pt">
<h3><span style="FONT-SIZE: 12pt">2. Unicode, UCS, and UTF</span></h3>
</div>
<p>As mentioned earlier, the encodings from ASCII through GB2312 and GBK to GB18030 are backward compatible. Unicode, by contrast, is compatible only with ASCII (more precisely, with ISO-8859-1) and is not compatible with the GB codes. For example, the Unicode code of &#8220;汉&#8221; is 6C49, while its GB code is BABA.</p>
<p>Unicode is also a character encoding, but one designed by international organizations as a scheme that can hold the scripts of every language in the world. Strictly speaking, "Universal Multiple-Octet Coded Character Set" (UCS for short) is the formal name of the character set defined by ISO 10646; because the two standards share one code table, UCS can loosely be read as an abbreviation of "Unicode Character Set".</p>
<p>According to Wikipedia (http://zh.wikipedia.org/wiki/), two organizations historically tried to design a universal character set independently of each other: the International Organization for Standardization (ISO) and an association of software vendors (unicode.org). ISO developed the ISO 10646 project, and the Unicode Consortium developed the Unicode project.</p>
<p>Around 1991, both sides recognized that the world did not need two incompatible character sets. They began to merge their work and to cooperate on a single code table. Since Unicode 2.0, the Unicode project has used the same repertoire and code points as ISO 10646-1.</p>
<p>Both projects still exist and publish their standards independently. At the time of writing, the latest version from the Unicode Consortium is Unicode 4.1.0 (2005), and the latest ISO standard is ISO/IEC 10646:2003.</p>
<p>UCS specifies how to represent characters with multiple bytes. How those codes are transmitted is specified by the UTF (UCS Transformation Format) specifications; the common ones are UTF-8, UTF-7, and UTF-16.</p>
<p>The IETF's RFC 2781 and RFC 3629 describe the UTF-16 and UTF-8 encoding methods in the usual RFC style: clear, brisk, and still rigorous. I can never remember that IETF stands for Internet Engineering Task Force, but the RFCs it maintains are the basis of every specification on the Internet.</p>
<div style="BORDER-RIGHT: medium none; PADDING-RIGHT: 0cm; BORDER-TOP: medium none; PADDING-LEFT: 0cm; PADDING-BOTTOM: 0cm; BORDER-LEFT: medium none; PADDING-TOP: 0cm; BORDER-BOTTOM: #aaaaaa 1pt solid; mso-element: para-border-div; mso-border-bottom-alt: solid #aaaaaa .75pt">
<h3><span style="FONT-SIZE: 12pt">3. UCS-2, UCS-4, and BMP</span></h3>
</div>
<p>UCS comes in two forms: UCS-2 and UCS-4. As the names suggest, UCS-2 encodes with two bytes and UCS-4 with four bytes (only 31 bits are actually used; the most significant bit must be 0). Let us do a little arithmetic:</p>
<p>UCS-2 has 2^16=65536 code positions, and UCS-4 has 2^31=2147483648 code positions.</p>
<p>UCS-4 is divided into 2^7=128 groups according to the highest byte, whose most significant bit is 0. Each group is divided into 256 planes according to the second-highest byte. Each plane is divided into 256 rows according to the third byte, and each row contains 256 cells. Cells in the same row differ only in the last byte.</p>
<p>Plane 0 of group 0 is called the Basic Multilingual Plane, or BMP. Put another way, the code positions in UCS-4 whose upper two bytes are 0 make up the BMP.</p>
<p>Dropping the two leading zero bytes from a code in the UCS-4 BMP gives UCS-2; putting two zero bytes in front of the two bytes of a UCS-2 code gives a code in the UCS-4 BMP. Most characters in everyday use lie in the BMP, although characters have been assigned outside it since Unicode 3.1 (2001).</p>
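<p>The group/plane/row/cell layout described in this section is just a byte-wise split of the 31-bit code. A minimal sketch in C, where the struct and function names (<code>ucs4_parts</code>, <code>ucs4_split</code>, <code>ucs4_in_bmp</code>, <code>ucs2_to_ucs4</code>) are hypothetical and chosen only for illustration:</p>

```c
#include <stdint.h>

/* The four coordinates of a UCS-4 code position. */
struct ucs4_parts {
    uint8_t group;  /* highest byte; top bit is always 0, so 0..127 */
    uint8_t plane;  /* second-highest byte */
    uint8_t row;    /* third byte */
    uint8_t cell;   /* lowest byte */
};

/* Split a UCS-4 code into group, plane, row and cell. */
struct ucs4_parts ucs4_split(uint32_t code)
{
    struct ucs4_parts p;
    p.group = (uint8_t)((code >> 24) & 0x7F);
    p.plane = (uint8_t)((code >> 16) & 0xFF);
    p.row   = (uint8_t)((code >> 8) & 0xFF);
    p.cell  = (uint8_t)(code & 0xFF);
    return p;
}

/* A code is in the BMP when its upper two bytes are both zero. */
int ucs4_in_bmp(uint32_t code)
{
    return (code >> 16) == 0;
}

/* UCS-2 to UCS-4: put two zero bytes in front (zero extension). */
uint32_t ucs2_to_ucs4(uint16_t code)
{
    return (uint32_t)code;
}
```

<p>For the code 6C49, the split gives group 0, plane 0, row 6C, cell 49, so the code is in the BMP.</p>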
<div style="BORDER-RIGHT: medium none; PADDING-RIGHT: 0cm; BORDER-TOP: medium none; PADDING-LEFT: 0cm; PADDING-BOTTOM: 0cm; BORDER-LEFT: medium none; PADDING-TOP: 0cm; BORDER-BOTTOM: #aaaaaa 1pt solid; mso-element: para-border-div; mso-border-bottom-alt: solid #aaaaaa .75pt">
<h3><span style="FONT-SIZE: 12pt">4. UTF encodings</span></h3>
</div>
<p>UTF-8 encodes UCS in 8-bit units. The mapping from UCS-2 to UTF-8 is as follows:</p>
<table class=MsoNormalTable style="WIDTH: 75%; mso-cellspacing: 1.5pt" cellPadding=0 width="75%" border=1>
    <tbody>
        <tr style="mso-yfti-irow: 0; mso-yfti-firstrow: yes">
            <td style="PADDING-RIGHT: 0.75pt; PADDING-LEFT: 0.75pt; PADDING-BOTTOM: 0.75pt; PADDING-TOP: 0.75pt">
            <p class=MsoNormal><span style="FONT-SIZE: 12pt; FONT-FAMILY: 宋体">UCS-2 code (hexadecimal)</span></p>
            </td>
            <td style="PADDING-RIGHT: 0.75pt; PADDING-LEFT: 0.75pt; PADDING-BOTTOM: 0.75pt; PADDING-TOP: 0.75pt">
            <p class=MsoNormal><span style="FONT-SIZE: 12pt; FONT-FAMILY: 宋体">UTF-8 byte stream (binary)</span></p>
            </td>
        </tr>
        <tr style="mso-yfti-irow: 1">
            <td style="PADDING-RIGHT: 0.75pt; PADDING-LEFT: 0.75pt; PADDING-BOTTOM: 0.75pt; PADDING-TOP: 0.75pt">
            <p class=MsoNormal><span lang=EN-US style="FONT-SIZE: 12pt; FONT-FAMILY: 宋体">0000 - 007F</span></p>
            </td>
            <td style="PADDING-RIGHT: 0.75pt; PADDING-LEFT: 0.75pt; PADDING-BOTTOM: 0.75pt; PADDING-TOP: 0.75pt">
            <p class=MsoNormal><span lang=EN-US style="FONT-SIZE: 12pt; FONT-FAMILY: 宋体">0xxxxxxx</span><span lang=EN-US style="FONT-SIZE: 12pt; FONT-FAMILY: 宋体; mso-bidi-font-family: 宋体"><o:p></o:p></span></p>
            </td>
        </tr>
        <tr style="mso-yfti-irow: 2">
            <td style="PADDING-RIGHT: 0.75pt; PADDING-LEFT: 0.75pt; PADDING-BOTTOM: 0.75pt; PADDING-TOP: 0.75pt">
            <p class=MsoNormal><span lang=EN-US style="FONT-SIZE: 12pt; FONT-FAMILY: 宋体">0080 - 07FF</span><span lang=EN-US style="FONT-SIZE: 12pt; FONT-FAMILY: 宋体; mso-bidi-font-family: 宋体"><o:p></o:p></span></p>
            </td>
            <td style="PADDING-RIGHT: 0.75pt; PADDING-LEFT: 0.75pt; PADDING-BOTTOM: 0.75pt; PADDING-TOP: 0.75pt">
            <p class=MsoNormal><span lang=EN-US style="FONT-SIZE: 12pt; FONT-FAMILY: 宋体">110xxxxx 10xxxxxx</span><span lang=EN-US style="FONT-SIZE: 12pt; FONT-FAMILY: 宋体; mso-bidi-font-family: 宋体"><o:p></o:p></span></p>
            </td>
        </tr>
        <tr style="mso-yfti-irow: 3; mso-yfti-lastrow: yes">
            <td style="PADDING-RIGHT: 0.75pt; PADDING-LEFT: 0.75pt; PADDING-BOTTOM: 0.75pt; PADDING-TOP: 0.75pt">
            <p class=MsoNormal><span lang=EN-US style="FONT-SIZE: 12pt; FONT-FAMILY: 宋体">0800 - FFFF</span><span lang=EN-US style="FONT-SIZE: 12pt; FONT-FAMILY: 宋体; mso-bidi-font-family: 宋体"><o:p></o:p></span></p>
            </td>
            <td style="PADDING-RIGHT: 0.75pt; PADDING-LEFT: 0.75pt; PADDING-BOTTOM: 0.75pt; PADDING-TOP: 0.75pt">
            <p class=MsoNormal><span lang=EN-US style="FONT-SIZE: 12pt; FONT-FAMILY: 宋体">1110xxxx 10xxxxxx 10xxxxxx</span><span lang=EN-US style="FONT-SIZE: 12pt; FONT-FAMILY: 宋体; mso-bidi-font-family: 宋体"><o:p></o:p></span></p>
            </td>
        </tr>
    </tbody>
</table>
<p>For example, the Unicode code of &#8220;汉&#8221; is 6C49. 6C49 lies between 0800 and FFFF, so it must use the three-byte template: <span style="COLOR: blue">1110</span>xxxx <span style="COLOR: blue">10</span>xxxxxx <span style="COLOR: blue">10</span>xxxxxx. Written in binary, 6C49 is 0110 110001 001001. Substituting this bit stream for the x's in the template, in order, gives <span style="COLOR: blue">1110</span>0110 <span style="COLOR: blue">10</span>110001 <span style="COLOR: blue">10</span>001001, that is, E6 B1 89.</p>
<p>Readers can use Notepad to check whether our encoding is correct.</p>
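<p>The three templates in the table translate directly into shifts and masks. The following is a minimal sketch in C; the function name <code>ucs2_to_utf8</code> is hypothetical, and as an assumption the sketch does not reject the surrogate range D800-DFFF, which a strict encoder would.</p>

```c
#include <stddef.h>
#include <stdint.h>

/* Encode one UCS-2 code (0000-FFFF) as UTF-8 using the three templates
 * from the table. Writes 1 to 3 bytes into out and returns the count.
 * Assumption: codes in the surrogate range D800-DFFF are not rejected. */
size_t ucs2_to_utf8(uint16_t code, unsigned char out[3])
{
    if (code <= 0x007F) {
        /* 0xxxxxxx */
        out[0] = (unsigned char)code;
        return 1;
    }
    if (code <= 0x07FF) {
        /* 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (code >> 6));
        out[1] = (unsigned char)(0x80 | (code & 0x3F));
        return 2;
    }
    /* 1110xxxx 10xxxxxx 10xxxxxx */
    out[0] = (unsigned char)(0xE0 | (code >> 12));
    out[1] = (unsigned char)(0x80 | ((code >> 6) & 0x3F));
    out[2] = (unsigned char)(0x80 | (code & 0x3F));
    return 3;
}
```

<p>Calling <code>ucs2_to_utf8(0x6C49, buf)</code> returns 3 and fills <code>buf</code> with E6 B1 89, matching the hand calculation above.</p>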
<p>UTF-16 encodes UCS in 16-bit units. For UCS codes below 0x10000, the UTF-16 encoding is simply the 16-bit unsigned integer equal to the UCS code. For UCS codes of 0x10000 and above, an algorithm is defined. Because every UCS-2 code, that is, every code in the BMP of UCS-4, is below 0x10000, UTF-16 and UCS-2 can be regarded as essentially the same for BMP characters. UCS-2 is only a coding scheme, however, whereas UTF-16 is used for actual transmission, so byte order has to be taken into account.</p>
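<p>The algorithm for codes of 0x10000 and above is the surrogate-pair scheme described in RFC 2781: subtract 0x10000, then put the upper 10 bits of the result into a unit starting at D800 and the lower 10 bits into a unit starting at DC00. A minimal sketch in C, with the hypothetical name <code>ucs_to_utf16</code>; as an assumption, input in the surrogate range D800-DFFF is passed through unchecked.</p>

```c
#include <stddef.h>
#include <stdint.h>

/* Encode one UCS code as UTF-16 code units.
 * Returns the number of 16-bit units written (1 or 2),
 * or 0 if the code is above 0x10FFFF and cannot be encoded.
 * Assumption: codes in D800-DFFF are not rejected. */
size_t ucs_to_utf16(uint32_t code, uint16_t out[2])
{
    if (code < 0x10000) {
        out[0] = (uint16_t)code;          /* same value as the UCS code */
        return 1;
    }
    if (code > 0x10FFFF)
        return 0;                         /* outside the UTF-16 range */

    code -= 0x10000;                      /* 20 bits remain */
    out[0] = (uint16_t)(0xD800 | (code >> 10));    /* high surrogate */
    out[1] = (uint16_t)(0xDC00 | (code & 0x3FF));  /* low surrogate */
    return 2;
}
```

<p>For example, the code 1D11E becomes the two units D834 DD1E, while 6C49 stays a single unit 6C49.</p>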
<div style="BORDER-RIGHT: medium none; PADDING-RIGHT: 0cm; BORDER-TOP: medium none; PADDING-LEFT: 0cm; PADDING-BOTTOM: 0cm; BORDER-LEFT: medium none; PADDING-TOP: 0cm; BORDER-BOTTOM: #aaaaaa 1pt solid; mso-element: para-border-div; mso-border-bottom-alt: solid #aaaaaa .75pt">
<h3><span style="FONT-SIZE: 12pt">5. UTF byte order and the BOM</span></h3>
</div>
<p>UTF-8 uses the byte as its code unit, so it has no byte-order problem. UTF-16 uses two bytes as its code unit, so before interpreting a UTF-16 text you must first know the byte order of each code unit. For example, the Unicode code of &#8220;奎&#8221; is 594E and the Unicode code of &#8220;乙&#8221; is 4E59. If we receive the UTF-16 byte stream &#8220;594E&#8221;, is it &#8220;奎&#8221; or &#8220;乙&#8221;?</p>
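<p>The ambiguity is easy to reproduce. The sketch below, with the hypothetical name <code>utf16_unit</code>, assembles one 16-bit code unit from two received bytes under either byte order.</p>

```c
#include <stdint.h>

/* Build one UTF-16 code unit from two bytes in the order they arrived.
 * big_endian != 0: the first byte is the high byte.
 * big_endian == 0: the first byte is the low byte. */
uint16_t utf16_unit(const unsigned char b[2], int big_endian)
{
    if (big_endian)
        return (uint16_t)((b[0] << 8) | b[1]);
    return (uint16_t)((b[1] << 8) | b[0]);
}
```

<p>With the bytes 59 4E, the big-endian reading is 594E and the little-endian reading is 4E59, so the same two bytes name two different characters.</p>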
<p>The method that the Unicode specification recommends for marking byte order is the BOM. This is not the BOM of &#8220;Bill Of Material&#8221; but the Byte Order Mark. The BOM is a rather clever idea:</p>
<p>UCS contains a character called "ZERO WIDTH NO-BREAK SPACE", whose code is FEFF. FFFE, on the other hand, is not a character in UCS, so it should never appear in actual transmission. The UCS specification suggests that we transmit the character "ZERO WIDTH NO-BREAK SPACE" before transmitting the byte stream itself.</p>
<p>If the receiver then gets FEFF, the byte stream is big-endian; if it gets FFFE, the byte stream is little-endian. For this reason the character "ZERO WIDTH NO-BREAK SPACE" is also called the BOM.</p>
<p>UTF-8 does not need a BOM to indicate byte order, but a BOM can be used to indicate the encoding. The UTF-8 encoding of the character "ZERO WIDTH NO-BREAK SPACE" is EF BB BF (readers can verify this with the encoding method described earlier). So if a receiver gets a byte stream that begins with EF BB BF, it knows the stream is UTF-8.</p>
<p>Windows uses the BOM in exactly this way to mark the encoding of text files.</p>
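<p>BOM sniffing as described in this section comes down to comparing the first two or three bytes. A minimal sketch in C follows; the enum and function names (<code>bom_kind</code>, <code>detect_bom</code>) are hypothetical, and the sketch assumes that only the three BOM forms discussed here can occur, so it does not try to tell a UTF-32 BOM apart from a UTF-16 one.</p>

```c
#include <stddef.h>

enum bom_kind { BOM_NONE, BOM_UTF8, BOM_UTF16_BE, BOM_UTF16_LE };

/* Inspect the start of a byte stream and report which BOM, if any,
 * it begins with. Assumption: only UTF-8 and UTF-16 BOMs are considered. */
enum bom_kind detect_bom(const unsigned char *s, size_t len)
{
    if (len >= 3 && s[0] == 0xEF && s[1] == 0xBB && s[2] == 0xBF)
        return BOM_UTF8;       /* FEFF encoded as UTF-8 */
    if (len >= 2 && s[0] == 0xFE && s[1] == 0xFF)
        return BOM_UTF16_BE;   /* FEFF with the high byte first */
    if (len >= 2 && s[0] == 0xFF && s[1] == 0xFE)
        return BOM_UTF16_LE;   /* FEFF with the low byte first */
    return BOM_NONE;
}
```

<p>A stream with no BOM yields <code>BOM_NONE</code>, in which case the caller has to fall back on some other rule, such as a default code page.</p>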
<div style="BORDER-RIGHT: medium none; PADDING-RIGHT: 0cm; BORDER-TOP: medium none; PADDING-LEFT: 0cm; PADDING-BOTTOM: 0cm; BORDER-LEFT: medium none; PADDING-TOP: 0cm; BORDER-BOTTOM: #aaaaaa 1pt solid; mso-element: para-border-div; mso-border-bottom-alt: solid #aaaaaa .75pt">
<h3><span style="FONT-SIZE: 12pt">6. Further reading</span></h3>
</div>
<p>The main reference for this article is "Short overview of ISO-IEC 10646 and Unicode" (http://www.nada.kth.se/i18n/ucs/unicode-iso10646-oview.html).</p>
<p>I also found two documents that look good, but since my original questions had already been answered, I did not read them:</p>
<ol type=1>
    <li class=MsoNormal style="TEXT-ALIGN: left; mso-list: l1 level1 lfo2; tab-stops: list 36.0pt; mso-margin-top-alt: auto; mso-margin-bottom-alt: auto; mso-pagination: widow-orphan"><span lang=EN-US style="FONT-SIZE: 12pt; FONT-FAMILY: 宋体">"Understanding Unicode: A general introduction to the Unicode Standard" (http://scripts.sil.org/cms/scripts/page.php?site_id=nrsi&amp;item_id=IWS-Chapter04a)</span></li>
    <li class=MsoNormal style="TEXT-ALIGN: left; mso-list: l1 level1 lfo2; tab-stops: list 36.0pt; mso-margin-top-alt: auto; mso-margin-bottom-alt: auto; mso-pagination: widow-orphan"><span lang=EN-US style="FONT-SIZE: 12pt; FONT-FAMILY: 宋体">"Character set encoding basics: Understanding character set encodings and legacy encodings" (http://scripts.sil.org/cms/scripts/page.php?site_id=nrsi&amp;item_id=IWS-Chapter03)</span></li>
</ol>
<p>I have written packages that convert among UTF-8, UCS-2, and GBK, in versions that use the Windows API and versions that do not. When I have time, I will tidy them up and put them on my personal home page (http://fmddlmyy.home4u.china.com).</p>
<p>I started writing this article only after I had thought all the questions through, and I expected to finish quickly. Choosing the wording and checking the details took far longer than I expected: I ended up writing from 1:30 in the afternoon until 9:00. I hope some readers benefit from it.</p>
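<p class=MsoNormal>Appendix 1&nbsp; More on qu-wei codes, GB2312, internal codes, and code pages<br>Some readers still had questions about this sentence in the article:<br>&#8220;The GB2312 standard itself is specified in terms of qu-wei codes. To get from a qu-wei code to the internal code, add 0xA0 to both the high byte and the low byte.&#8221;<br><br>Let me explain in more detail.<br><br>&#8220;The GB2312 standard itself&#8221; refers to the 1980 national standard &#8220;Code of Chinese Graphic Character Set for Information Interchange, Primary Set, GB&nbsp;2312-80&#8221;. This standard encodes Chinese characters and Chinese symbols with two numbers. The first number is called the &#8220;qu&#8221; (row) and the second the &#8220;wei&#8221; (cell), hence the name qu-wei code. Rows 1-9 hold Chinese symbols, rows 16-55 hold level-1 Chinese characters, and rows 56-87 hold level-2 Chinese characters. Windows still ships a qu-wei input method; for example, typing 1601 yields &#8220;啊&#8221;. (This input method recognizes both hexadecimal GB2312 codes and decimal qu-wei codes, so typing B0A1 also yields &#8220;啊&#8221;.)<br><br>The internal code is the character encoding that an operating system uses internally. In early operating systems the internal code depended on the language. Today's Windows supports Unicode internally and uses code pages to adapt to each language, so the notion of an &#8220;internal code&#8221; has become rather fuzzy. Microsoft generally describes the encoding specified by the default code page as the internal code.<br><br>The term &#8220;internal code&#8221; has no official definition, and &#8220;code page&#8221; is merely Microsoft's own name for the concept. As programmers, we only need to know what these things are; there is no need to dig too deeply into the terminology.<br><br>A code page is a character encoding for the script of one language. For example, the code page for GBK is CP936, the code page for BIG5 is CP950, and the code page for GB2312 is CP20936.<br><br>Windows has the notion of a default code page, that is, the encoding used by default to interpret characters. Suppose Windows Notepad opens a text file whose content is the byte stream BA, BA, D7, D6. How should Windows interpret it?<br><br>Should it be interpreted as Unicode, as GBK, as BIG5, or as ISO8859-1? Interpreted as GBK, the bytes give the two characters &#8220;汉字&#8221;. Interpreted under another encoding, they may match no character at all, or they may match the wrong characters. &#8220;Wrong&#8221; here means different from what the author of the text intended, and that is how garbled text arises.<br><br>The answer is that Windows interprets the byte stream in a text file according to the current default code page. The default code page can be set through Regional Options in Control Panel. Notepad's Save As dialog has an ANSI option, which simply saves the file in the encoding of the default code page.<br><br>The internal code of Windows is Unicode, and technically it can support several code pages at once. As long as a file states which encoding it uses and the user has the corresponding code page installed, Windows can display it correctly; an HTML file, for example, can specify its charset.<br><br>Some HTML authors, especially English-speaking ones, assume that everyone in the world uses English and do not specify a charset in their files. If such an author uses characters in the range 0x80-0xff and Chinese Windows interprets them with the default GBK, the text is garbled. In that case it is enough to add a statement that specifies the charset to the HTML file, for example:<br>&lt;meta&nbsp;http-equiv="Content-Type"&nbsp;content="text/html;&nbsp;charset=ISO8859-1"&gt;<br>If the code page the original author used is compatible with ISO8859-1, the garbling disappears.<br><br>Back to qu-wei codes: the qu-wei code of &#8220;啊&#8221; is 1601, which in hexadecimal is 0x10, 0x01. This conflicts with the ASCII encoding that computers use everywhere. To stay compatible with the ASCII codes 00-7f, we add A0 to both the high byte and the low byte of the qu-wei code. The code of &#8220;啊&#8221; thus becomes B0A1. We also call the code with the two A0s added the GB2312 encoding, even though the GB2312 standard itself does not mention this at all.</p>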
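<p>The qu-wei arithmetic in this appendix can be written out as a short C sketch. The function name <code>quwei_to_gb2312</code> is hypothetical; the sketch assumes the qu and wei numbers are given in decimal and rejects values outside 1-94, the size of the GB2312 grid, by returning 0.</p>

```c
#include <stdint.h>

/* Convert a qu-wei code (qu and wei given as decimal numbers, 1..94)
 * into the two-byte GB2312 code by adding 0xA0 to each part.
 * The high byte of the result is qu + 0xA0, the low byte is wei + 0xA0.
 * Returns 0 if either number is out of range. */
uint16_t quwei_to_gb2312(unsigned qu, unsigned wei)
{
    if (qu < 1 || qu > 94 || wei < 1 || wei > 94)
        return 0;
    return (uint16_t)(((qu + 0xA0) << 8) | (wei + 0xA0));
}
```

<p>For the qu-wei code 1601, <code>quwei_to_gb2312(16, 1)</code> gives B0A1, the code of &#8220;啊&#8221; quoted above.</p>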
<br><br><div align=right><a style="text-decoration:none;" href="http://www.cppblog.com/woaidongmao/" target="_blank">肥仔</a> 2008-11-07 22:14 <a href="http://www.cppblog.com/woaidongmao/archive/2008/11/07/66242.html#Feedback" target="_blank" style="text-decoration:none;">Post a comment</a></div>]]></description></item></channel></rss>