UTF-8 BOM - xiaoguozi's Blog

UTF-8 BOM

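When searching for Chinese characters in PHP, there are two cases.

1. The Chinese text is GBK (GB2312)

There are two ways to handle it.

Method 1:

Save the PHP file as GBK (what Windows editors call ANSI on a Simplified Chinese system; plain ASCII cannot hold Chinese characters), then search with strpos, for example:

strpos($curl_res, '哈哈')

Method 2:

Save the PHP file as UTF-8 without BOM, convert the string to UTF-8, then search, for example:

$curl_res = mb_convert_encoding($curl_res, 'utf-8', 'gbk');

mb_strpos($curl_res, '哈哈', 0, 'utf-8');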

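2. The Chinese text is UTF-8

There are two ways to handle it.

Method 1:

Save the PHP file as UTF-8 without BOM, then search with strpos, for example:

strpos($curl_res, '哈哈')

Method 2:

Save the PHP file as GBK (ANSI), convert the string to GBK, then search, for example:

$curl_res = mb_convert_encoding($curl_res, 'gbk', 'utf-8');

mb_strpos($curl_res, '哈哈', 0, 'gbk');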

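A pattern should be visible here: a Chinese string literal passed to a function is encoded the same way the PHP file is saved, so that encoding must match the encoding of the string being searched. Keep this in mind whenever you use these functions!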
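The same rule can be seen at the byte level. The following is a minimal Java sketch, not the PHP code from above: the class name `NeedleEncodingDemo` and its `indexOf` helper are hypothetical, and the byte-wise search only stands in for what a byte-oriented function such as strpos does. It assumes the JDK in use ships the GBK charset. The Chinese text is written with `\u` escapes so the example does not itself depend on how the source file is saved.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class NeedleEncodingDemo {
    // Naive byte-wise search: returns the byte offset of needle in haystack, or -1.
    static int indexOf(byte[] haystack, byte[] needle) {
        outer:
        for (int i = 0; i + needle.length <= haystack.length; i++) {
            for (int j = 0; j < needle.length; j++) {
                if (haystack[i + j] != needle[j]) continue outer;
            }
            return i;
        }
        return -1;
    }

    public static void main(String[] args) {
        Charset gbk = Charset.forName("GBK"); // assumption: GBK is available in this JDK
        String needle = "\u54c8\u54c8";       // the two characters searched for in the PHP examples
        byte[] gbkBody = ("abc" + needle).getBytes(gbk); // stands in for a GBK response body

        // Needle encoded like the haystack: found at byte offset 3.
        System.out.println(indexOf(gbkBody, needle.getBytes(gbk)));
        // Needle encoded as UTF-8 against a GBK haystack: not found (-1).
        System.out.println(indexOf(gbkBody, needle.getBytes(StandardCharsets.UTF_8)));
        // Decoding the haystack first, then searching characters: found at index 3.
        System.out.println(new String(gbkBody, gbk).indexOf(needle));
    }
}
```

The second call fails only because the two byte sequences differ, which is the situation a PHP file saved in one encoding creates when it searches a string held in another.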

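The HTML file I generated was identified by EmEditor as UTF-8 with Signature, while the HTML file that worked was identified as UTF-8 without Signature.
To convert between these two UTF-8 formats, I looked for information online. In Notepad, EmEditor, and other text editors, choosing Save As and selecting UTF-8 as the encoding enables the option Add a Unicode Signature (BOM); with that option checked, the file is saved as UTF-8 with Signature. The real question was how to make Java write my file directly as UTF-8 with Signature.
I started searching Google for keywords such as UTF-8 with Signature, BOM, and Add a Unicode Signature.
http://www.unicode.org/unicode/faq/utf_bom.html#BOM
From this I got a rough idea of the difference between the two.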
Q: What is a BOM?

A: A byte order mark (BOM) consists of the character code U+FEFF at the beginning of a data stream, where it can be used as a signature defining the byte order and encoding form, primarily of unmarked plaintext files. Under some higher level protocols, use of a BOM may be mandatory (or prohibited) in the Unicode data stream defined in that protocol.
http://mindprod.com/jgloss/bom.html
BOM
Byte Order Marks are special characters at the beginning of a Unicode file to indicate whether it is big or little endian, in other words does the high or low order byte come first. These codes also tell whether the encoding is 8, 16 or 32 bit. You can recognise Unicode files by their starting byte order marks, and by the way Unicode-16 files are half zeroes and Unicode-32 files are three-quarters zeros.

Unicode Endian Markers
Byte-order mark    Description
EF BB BF           UTF-8
FF FE              UTF-16 aka UCS-2, little endian
FE FF              UTF-16 aka UCS-2, big endian
FF FE 00 00        UTF-32 aka UCS-4, little endian
00 00 FE FF        UTF-32 aka UCS-4, big endian
There are also variants of these encodings that have an implied endian marker.
Unfortunately, often applications, even Javac.exe, choke on these byte order marks. Java Readers don't automatically filter them out. There is not much you can do but manually remove them.
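Recognising a file by its leading bytes can be sketched in a few lines of Java. The class `BomSniffer` and its methods are hypothetical names for illustration. The UTF-32 little-endian mark is taken to be FF FE 00 00, the byte-swapped form of 00 00 FE FF, which is why the 4-byte patterns are tested before the 2-byte ones.

```java
class BomSniffer {
    // Returns the encoding implied by a leading byte order mark, or null if none is present.
    static String detect(byte[] data) {
        // Test UTF-32 first: FF FE 00 00 also begins with the UTF-16 LE mark FF FE.
        if (startsWith(data, 0x00, 0x00, 0xFE, 0xFF)) return "UTF-32BE";
        if (startsWith(data, 0xFF, 0xFE, 0x00, 0x00)) return "UTF-32LE";
        if (startsWith(data, 0xEF, 0xBB, 0xBF)) return "UTF-8";
        if (startsWith(data, 0xFE, 0xFF)) return "UTF-16BE";
        if (startsWith(data, 0xFF, 0xFE)) return "UTF-16LE";
        return null;
    }

    // Compares the first bytes of data against the given unsigned byte values.
    static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] sample = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'h', 'i'};
        System.out.println(detect(sample)); // prints UTF-8
    }
}
```

One limitation of this sketch: a UTF-16 little-endian file whose first character is U+0000 would be reported as UTF-32LE, so the result is a hint rather than a proof.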

http://cache.baidu.com/c?word=java%2Cbom&url=http%3A//tgdem530%2Eblogchina%2Ecom/&b=0&a=1&user=baidu
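c. UTF byte order and the BOM
UTF-8 uses the byte as its code unit, so it has no byte-order problem. UTF-16 uses two bytes as its code unit, so before interpreting UTF-16 text you must first know the byte order of each code unit. For example, the Unicode code of "奎" is 594E and that of "乙" is 4E59. If we receive the UTF-16 byte stream "594E", is it "奎" or "乙"?

The method the Unicode standard recommends for marking byte order is the BOM. Here BOM does not mean "Bill Of Material" but Byte Order Mark. The BOM is a rather clever idea:

UCS contains a character called "ZERO WIDTH NO-BREAK SPACE", whose code is FEFF. FFFE, on the other hand, is not a character in UCS, so it should never appear in real transmission. The UCS specification suggests transmitting the character "ZERO WIDTH NO-BREAK SPACE" before the byte stream itself.

If the receiver then gets FEFF, the byte stream is big-endian; if it gets FFFE, the byte stream is little-endian. This is why the character "ZERO WIDTH NO-BREAK SPACE" is also called the BOM.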
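The ambiguity of the bytes 59 4E, and the way a leading BOM resolves it, can be tried out with Java's standard charsets. This is a small sketch; the class `ByteOrderDemo` and its helpers are hypothetical names. It relies on the documented behaviour of Java's "UTF-16" decoder, which reads a leading byte order mark to choose the byte order and does not return the mark as part of the text.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class ByteOrderDemo {
    // Formats each UTF-16 code unit of s as four hex digits.
    static String hex(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            sb.append(String.format("%04X", (int) c));
        }
        return sb.toString();
    }

    // Decodes the bytes with the given charset and returns the code units in hex.
    static String decode(byte[] bytes, Charset charset) {
        return hex(new String(bytes, charset));
    }

    public static void main(String[] args) {
        byte[] raw = {0x59, 0x4E};
        System.out.println(decode(raw, StandardCharsets.UTF_16BE)); // 594E
        System.out.println(decode(raw, StandardCharsets.UTF_16LE)); // 4E59

        // The same two bytes, preceded by a BOM, decoded with the BOM-aware charset.
        byte[] bigEndian = {(byte) 0xFE, (byte) 0xFF, 0x59, 0x4E};
        byte[] littleEndian = {(byte) 0xFF, (byte) 0xFE, 0x59, 0x4E};
        System.out.println(decode(bigEndian, StandardCharsets.UTF_16));    // 594E
        System.out.println(decode(littleEndian, StandardCharsets.UTF_16)); // 4E59
    }
}
```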

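UTF-8 does not need a BOM to indicate byte order, but a BOM can be used to indicate the encoding. The UTF-8 encoding of the character "ZERO WIDTH NO-BREAK SPACE" is EF BB BF (readers can verify this with the encoding method introduced earlier). So a receiver that sees a byte stream beginning with EF BB BF knows that it is UTF-8.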
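The verification can be done by hand with the three-byte UTF-8 pattern 1110xxxx 10xxxxxx 10xxxxxx, which covers code points U+0800 through U+FFFF. A short Java sketch (the class `BomUtf8Check` and its methods are hypothetical names) applies that pattern to FEFF and compares the result with what the standard library produces:

```java
import java.nio.charset.StandardCharsets;

class BomUtf8Check {
    // Applies the three-byte UTF-8 pattern; only valid for U+0800..U+FFFF.
    static int[] encode3(int codePoint) {
        return new int[] {
            0xE0 | (codePoint >> 12),         // 1110xxxx: top 4 bits
            0x80 | ((codePoint >> 6) & 0x3F), // 10xxxxxx: middle 6 bits
            0x80 | (codePoint & 0x3F)         // 10xxxxxx: low 6 bits
        };
    }

    // Joins unsigned byte values as two-digit hex separated by spaces.
    static String hex(int[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (int b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b));
        }
        return sb.toString();
    }

    // Encodes a string with the library's UTF-8 encoder and returns unsigned byte values.
    static int[] libraryUtf8(String s) {
        byte[] raw = s.getBytes(StandardCharsets.UTF_8);
        int[] out = new int[raw.length];
        for (int i = 0; i < raw.length; i++) out[i] = raw[i] & 0xFF;
        return out;
    }

    public static void main(String[] args) {
        System.out.println(hex(encode3(0xFEFF)));       // EF BB BF
        System.out.println(hex(libraryUtf8("\uFEFF"))); // EF BB BF
    }
}
```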

Windows uses the BOM in exactly this way to mark the encoding of text files.

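So the BOM is just a few bytes added at the start of the file as a marker. With this marker, certain protocols and systems can recognize the encoding. Good, now let's see how to add these bytes.
I finally found it here: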
http://mindprod.com/jgloss/encoding.html
UTF-8
8-bit encoded Unicode. née UTF8. Optional marker on front of file: EF BB BF for reading. Unfortunately, OutputStreamWriter does not automatically insert the marker on writing. Notepad can't read the file without this marker. Now the question is, how do you get that marker in there? You can't just emit the bytes EF BB BF since they will be encoded and changed. However, the solution is quite simple. prw.write( '\ufeff' ); at the head of the file. This will be encoded as EF BB BF.
DataOutputStreams have a binary length count in front of each string. Endianness does not apply to 8-bit encodings. Java DataOutputStream and ObjectOutputStream uses a slight variant of kosher UTF-8. To aid with compatibility with C in JNI, the null byte '\u0000' is encoded in 2-byte format rather than 1-byte, so that the encoded strings never have embedded nulls. Only the 1-byte, 2-byte, and 3-byte formats are used. Supplementary characters, (above 0xffff), are represented in the form of surrogate pairs (a pair of encoded 16 bit characters in a special range), rather than directly encoding the character.
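The differences described in this excerpt can be observed directly. The following Java sketch uses a hypothetical class `ModifiedUtf8Demo`; it compares the bytes written by `DataOutputStream.writeUTF` with the output of the ordinary UTF-8 encoder, once for the null character and once for a supplementary character (U+1F600, chosen here only as an example).

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

class ModifiedUtf8Demo {
    // Returns everything writeUTF emits: a 2-byte length prefix plus the encoded string.
    static byte[] writeUtf(String s) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            out.writeUTF(s);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    // Formats bytes as two-digit hex separated by spaces.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Null character: length prefix 00 02, then the 2-byte form C0 80.
        System.out.println(hex(writeUtf("\0")));                          // 00 02 C0 80
        System.out.println(hex("\0".getBytes(StandardCharsets.UTF_8)));   // 00

        // U+1F600 as a surrogate pair: 2 + 6 bytes from writeUTF, 4 bytes in standard UTF-8.
        String supplementary = "\uD83D\uDE00";
        System.out.println(writeUtf(supplementary).length);                           // 8
        System.out.println(supplementary.getBytes(StandardCharsets.UTF_8).length);    // 4
    }
}
```

Because of the length prefix and these variations, data written with `writeUTF` is meant to be read back with `DataInputStream.readUTF`, not treated as a plain UTF-8 text file.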

prw.write( '\ufeff' ); is exactly what I needed.
So my code became:
    // Requires java.io.* imports. Assumes the enclosing class declares the
    // String fields outFileName and res, as the original code did.
    // charsetName is unused: the signature below only makes sense for UTF-8.
    public void htmlWrite(String charsetName) {
        // try-with-resources flushes and closes the writer even if a write fails
        try (Writer out = new BufferedWriter(new OutputStreamWriter(
                new FileOutputStream(outFileName), "UTF-8"))) {
            out.write('\ufeff'); // encoded as EF BB BF, the UTF-8 signature
            out.write(res);
        } catch (IOException e) {
            System.err.println("write errors! " + e);
        }
    }
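A quick way to check the result is to read the file back as raw bytes and look at the first three. The sketch below is an illustration under assumptions, not part of the original program: the class `BomRoundTrip`, its method names, and the temporary file are all hypothetical. It also shows one way to drop the signature again when reading, since Java readers do not filter it out.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class BomRoundTrip {
    // Writes text as UTF-8 preceded by U+FEFF, the same idea as htmlWrite.
    static void writeWithBom(Path file, String text) throws IOException {
        try (Writer out = Files.newBufferedWriter(file, StandardCharsets.UTF_8)) {
            out.write('\ufeff');
            out.write(text);
        }
    }

    // Decodes bytes as UTF-8 and removes one leading U+FEFF if present.
    static String stripBom(byte[] bytes) {
        String s = new String(bytes, StandardCharsets.UTF_8);
        return s.startsWith("\uFEFF") ? s.substring(1) : s;
    }

    // Writes to a temporary file and returns the raw bytes that ended up on disk.
    static byte[] roundTrip(String text) {
        try {
            Path tmp = Files.createTempFile("bom-demo", ".html");
            try {
                writeWithBom(tmp, text);
                return Files.readAllBytes(tmp);
            } finally {
                Files.deleteIfExists(tmp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] bytes = roundTrip("<html></html>");
        System.out.printf("%02X %02X %02X%n",
                bytes[0] & 0xFF, bytes[1] & 0xFF, bytes[2] & 0xFF); // EF BB BF
        System.out.println(stripBom(bytes));                        // <html></html>
    }
}
```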
Problem solved.

posted on 2013-02-04 15:38 by 小果子. Category: Study Notes
