An Introduction to Code Pages (repost)

1. Definition and history of code pages

A character code is the internal code used to represent a character. Internal codes are used whenever a document is entered or stored, and they fall into two kinds:

Single-byte character sets (SBCS), which can encode up to 256 characters.
Double-byte character sets (DBCS), which can encode roughly 65,000 characters (at most 65,536 two-byte values). They are used mainly for East Asian scripts with large character sets.

A code page is a selected list of character codes arranged in a specific order. For the early single-byte languages, the order of the codes in the code page lets the system translate a keyboard input value into the corresponding character code by looking it up in this list. For double-byte codes, the code page instead provides the multibyte-to-Unicode mapping table, so that characters stored as Unicode can be converted to the corresponding character code and back. In the Linux kernel the corresponding conversion functions for UTF-8 are utf8_mbtowc and utf8_wctomb.
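The idea that a code page is just a mapping table can be tried out with a minimal Python sketch. It relies on Python's bundled cp437 and cp936 codecs as stand-ins for the tables discussed here, and the helper name show is made up for illustration.

```python
# Illustration only: Python's bundled codecs stand in for the code page tables.

def show(raw: bytes, codepage: str) -> str:
    """Decode raw bytes with the given code page and return the text."""
    return raw.decode(codepage)

# Single-byte code page: one byte, one character.
print(show(b"\x80", "cp437"))          # Ç (U+00C7)

# Double-byte code page: a lead byte plus a trail byte form one character.
text = show(b"\x81\x40", "cp936")
print(text, hex(ord(text)))            # 丂 0x4e02

# The reverse direction (Unicode -> character code) uses the same table.
print("丂".encode("cp936"))            # b'\x81@'
```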

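In the early 1980s there was still no international standard such as ISO-8859 or Unicode that defined how to extend US-ASCII for users in non-English-speaking countries. Many IT vendors invented their own encodings and identified them by hard-to-remember numbers:

For example, 936 stands for Simplified Chinese and 950 for Traditional Chinese.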
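1.1 CJK code pages

Unlike the Extended Unix Code (EUC) encodings, all of the Far East code pages below use the C1 control codes (0x80..0x9F) as lead bytes and ASCII values (0x40..0x7E) as trail bytes, which is how they manage to hold tens of thousands of double-byte characters. It also means that in these encodings a byte value greater than 0x3F does not necessarily represent an ASCII character.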
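The consequence of ASCII-range trail bytes can be seen in a short Python sketch. It uses Python's cp932 codec; the helper naive_has_backslash is a hypothetical name standing for any byte-wise scan that ignores lead bytes.

```python
# The trail byte of a double-byte character may fall in the ASCII range.
encoded = "表".encode("cp932")      # a kanji in Shift-JIS (CP932)
lead, trail = encoded

print([hex(b) for b in encoded])    # ['0x95', '0x5c']
print(chr(trail))                   # \  (0x5C is the ASCII backslash)

def naive_has_backslash(raw: bytes) -> bool:
    """Byte-wise search that ignores lead bytes (hypothetical helper)."""
    return b"\\" in raw

# No backslash was typed, yet a byte-wise scan reports one.
print(naive_has_backslash(encoded))  # True
```

Code that parses paths or escape sequences byte by byte therefore has to track lead bytes when it handles text in these code pages.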

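CP932

Shift-JIS contains the Japanese character sets JIS X 0201 (one byte per character) and JIS X 0208 (two bytes per character). The JIS X 0201 katakana are therefore included as single-byte half-width characters, and the remaining 60 byte values serve as lead bytes for 7076 kanji and 648 other full-width characters. Unlike EUC-JP, Shift-JIS does not include the 5802 kanji defined in JIS X 0212.

CP936

GBK extends the EUC-CN encoding (GB 2312-80, which contains 6763 hanzi) to the 20902 hanzi defined in Unicode (GB 13000.1-93). It is used for Simplified Chinese (zh_CN) in mainland China.

CP949

Unified Hangul Code (UHC) is a superset of the Korean EUC-KR encoding (KS C 5601-1992, with 2350 hangul syllables and 4888 hanja). It adds 8822 further hangul syllables (in the C1 range).

CP950

This is the Big5 encoding (13072 Traditional Chinese hanzi, zh_TW), which is used in place of EUC-TW (CNS 11643-1992). All of these definitions can be found in Ken Lunde's CJK.INF or in the Unicode mapping tables.

Note: Microsoft uses the four code pages above, so they are required when accessing Microsoft file systems.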

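1.2 IBM's Far East code pages

IBM code pages come in two kinds, SBCS and DBCS:

IBM SBCS code pages

37 (English) *
290 (Japanese) *
833 (Korean) *
836 (Simplified Chinese) *
891 (Korean)
897 (Japanese)
903 (Simplified Chinese)
904 (Traditional Chinese)

IBM DBCS code pages

300 (Japanese) *
301 (Japanese)
834 (Korean) *
835 (Traditional Chinese) *
837 (Simplified Chinese) *
926 (Korean)
927 (Traditional Chinese)
928 (Simplified Chinese)

Combining an SBCS code page with a DBCS code page gives an IBM MBCS code page:

930 (Japanese) (code page 300 plus 290) *
932 (Japanese) (code page 301 plus 897)
933 (Korean) (code page 834 plus 833) *
934 (Korean) (code page 926 plus 891)
938 (Traditional Chinese) (code page 927 plus 904)
936 (Simplified Chinese) (code page 928 plus 903)
5031 (Simplified Chinese) (code page 837 plus 836) *
5033 (Traditional Chinese) (code page 835 plus 37) *

* marks code pages that use the EBCDIC encoding.

From this it can be seen that Microsoft's CJK code pages derive from IBM's code pages.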
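To make the EBCDIC marker concrete, here is a small Python sketch comparing code page 37 with ASCII. It assumes Python's cp037 codec as the stand-in for IBM code page 37; nothing here comes from IBM's own tables.

```python
# Code page 37 is EBCDIC-based; Python ships it as the "cp037" codec.
for ch in ("A", "0", " "):
    ascii_byte = ch.encode("ascii")[0]
    ebcdic_byte = ch.encode("cp037")[0]
    print(repr(ch), hex(ascii_byte), hex(ebcdic_byte))
# 'A' 0x41 0xc1
# '0' 0x30 0xf0
# ' ' 0x20 0x40
```

Even plain Latin letters and digits have different byte values, which is why the EBCDIC-based pages in the lists above are not interchangeable with the ASCII-based ones.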

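2. The role of code pages in Linux

Code page support was introduced into Linux mainly to handle multilingual file names on file systems such as FAT/VFAT/FAT32/NTFS/NCPFS. NTFS and FAT32/VFAT store file names in Unicode, so the system has to convert them dynamically into the encoding of the language in use when it reads them. NLS support was introduced for this purpose. The corresponding source files are in /usr/src/linux/fs/nls: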

Config.in
Makefile
nls_base.c
nls_cp437.c
nls_cp737.c
nls_cp775.c
nls_cp850.c
nls_cp852.c
nls_cp855.c
nls_cp857.c
nls_cp860.c
nls_cp861.c
nls_cp862.c
nls_cp863.c
nls_cp864.c
nls_cp865.c
nls_cp866.c
nls_cp869.c
nls_cp874.c
nls_cp936.c
nls_cp950.c
nls_iso8859-1.c
nls_iso8859-15.c
nls_iso8859-2.c
nls_iso8859-3.c
nls_iso8859-4.c
nls_iso8859-5.c
nls_iso8859-6.c
nls_iso8859-7.c
nls_iso8859-8.c
nls_iso8859-9.c
nls_koi8-r.c

They implement the following functions:

extern int utf8_mbtowc(__u16 *, const __u8 *, int);
extern int utf8_mbstowcs(__u16 *, const __u8 *, int);
extern int utf8_wctomb(__u8 *, __u16, int);
extern int utf8_wcstombs(__u8 *, const __u16 *, int);
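What such a pair of conversion functions has to do can be sketched in Python. This is an illustration of UTF-8 encoding for 16-bit values, not the kernel code; the names wctomb and mbtowc are borrowed loosely, and surrogate handling and input validation are left out.

```python
def wctomb(wc: int) -> bytes:
    """Encode one 16-bit code value as UTF-8 (1 to 3 bytes)."""
    if wc < 0x80:
        return bytes([wc])
    if wc < 0x800:
        return bytes([0xC0 | (wc >> 6), 0x80 | (wc & 0x3F)])
    return bytes([0xE0 | (wc >> 12),
                  0x80 | ((wc >> 6) & 0x3F),
                  0x80 | (wc & 0x3F)])

def mbtowc(raw: bytes):
    """Decode the first UTF-8 sequence; return (code value, bytes used)."""
    b0 = raw[0]
    if b0 < 0x80:
        return b0, 1
    if (b0 & 0xE0) == 0xC0:
        return ((b0 & 0x1F) << 6) | (raw[1] & 0x3F), 2
    if (b0 & 0xF0) == 0xE0:
        return (((b0 & 0x0F) << 12) | ((raw[1] & 0x3F) << 6)
                | (raw[2] & 0x3F)), 3
    raise ValueError("sequence does not fit in 16 bits")

print(wctomb(0x4E02).hex())            # e4b882
print(mbtowc(b"\xe4\xb8\x82"))         # (19970, 3)
```

The string variants (utf8_mbstowcs, utf8_wcstombs) would presumably repeat the same step over a whole buffer.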

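When mounting the file system, the code page can then be set with the following options.

For code page 437:

mount -t vfat /dev/hda1 /mnt/1 -o codepage=437,iocharset=cp437

Long file names in different languages can then be accessed correctly under Linux.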
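The conversion that the iocharset option selects can be imitated in user space. The Python sketch below assumes the long name is held as UTF-16LE and that unmappable characters turn into '?'; to_iocharset is a hypothetical helper, not a kernel interface.

```python
# A VFAT long file name is stored on disk as 16-bit Unicode (UTF-16LE here).
on_disk = "文件.txt".encode("utf-16-le")

def to_iocharset(raw_name: bytes, iocharset: str) -> bytes:
    """Convert an on-disk Unicode name to the bytes a program would see
    (hypothetical helper; unmappable characters become '?')."""
    return raw_name.decode("utf-16-le").encode(iocharset, errors="replace")

print(to_iocharset(on_disk, "cp936"))   # GBK bytes for the Chinese name
print(to_iocharset(on_disk, "cp437"))   # b'??.txt'
```

With a character set that lacks the characters, the name is still listed but no longer readable, which is why the option has to match the language of the file names.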

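3. Code pages supported by Linux

nls codepage 437 -- United States/Canadian English
nls codepage 737 -- Greek
nls codepage 775 -- Baltic
nls codepage 850 -- some characters used in Western European languages (German, Spanish, Italian)
nls codepage 852 -- Latin 2, covering Central and Eastern European languages (Albanian, Croatian, Czech, English, Finnish, Hungarian, Irish, German, Polish, Romanian, Serbian, Slovak, Slovenian, Sorbian)
nls codepage 855 -- Cyrillic
nls codepage 857 -- Turkish
nls codepage 860 -- Portuguese
nls codepage 861 -- Icelandic
nls codepage 862 -- Hebrew
nls codepage 863 -- Canadian French
nls codepage 864 -- Arabic
nls codepage 865 -- Nordic languages
nls codepage 866 -- Cyrillic/Russian
nls codepage 869 -- Greek (2)
nls codepage 874 -- Thai
nls codepage 936 -- Simplified Chinese GBK
nls codepage 950 -- Traditional Chinese Big5
nls iso8859-1 -- Western European languages (Albanian, Catalan, Danish, Dutch, English, Faeroese, Finnish, French, German, Galician, Irish, Icelandic, Italian, Norwegian, Portuguese, Swedish); also suitable for US English
nls iso8859-2 -- Latin 2 character set, Slavic/Central European languages (Czech, German, Hungarian, Polish, Romanian, Croatian, Slovak, Slovenian)
nls iso8859-3 -- Latin 3 character set (Esperanto, Galician, Maltese, Turkish)
nls iso8859-4 -- Latin 4 character set (Estonian, Latvian, Lithuanian); a predecessor of the Latin 6 character set
nls iso8859-5 -- Cyrillic (Bulgarian, Byelorussian, Macedonian, Russian, Serbian, Ukrainian); the KOI8-R code page is generally recommended instead
nls iso8859-6 -- Arabic
nls iso8859-7 -- Modern Greek
nls iso8859-8 -- Hebrew
nls iso8859-9 -- Latin 5 character set (replaces the rarely used Icelandic letters of Latin 1 with Turkish letters)
nls iso8859-10 -- Latin 6 character set (adds the Inuit (Greenlandic) and Sami letters missing from Latin 4, for the Nordic languages)
nls iso8859-15 -- Latin 9 character set, an update of Latin 1 that removes some rarely used characters, adds support for Estonian, corrects the French and Finnish support, and adds the euro sign
nls koi8-r -- the default choice for Russian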
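Why the choice among these tables matters can be shown with one byte decoded under several of them. This is a small Python sketch using Python's own codecs for the listed character sets, not the kernel NLS modules.

```python
raw = b"\xe4"
for charset in ("iso8859-1", "iso8859-5", "koi8-r", "cp866"):
    ch = raw.decode(charset)
    print(f"{charset:10} -> {ch} (U+{ord(ch):04X})")
# iso8859-1  -> ä (U+00E4)
# iso8859-5  -> ф (U+0444)
# koi8-r     -> Д (U+0414)
# cp866      -> ф (U+0444)
```

Three Cyrillic-capable tables already disagree on the same byte, so a file name written under one of them is misread under another.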

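4. Code pages for Simplified Chinese GBK and Traditional Chinese Big5

How are the code pages for Simplified Chinese GBK and Traditional Chinese Big5 built?

First obtain the Unicode mapping definitions for GBK and Big5 from the Unicode Consortium.
GBK is based on the ISO 10646-1:1993 standard; the corresponding Japanese standard is JIS X 0221-1995 and the Korean one is KS C 5700-1995. They line up with the Unicode versions as follows:
Unicode Version 1.0
Unicode Version 1.1 <-> ISO 10646-1:1993, JIS X 0221-1995, GB 13000.1-93
Unicode Version 2.0 <-> KS C 5700-1995

GBK has been in use since Windows 95. The files you need are CP936.TXT and BIG5.TXT.
The following commands then convert them into the Unicode<->Big5 and Unicode<->GBK tables that the Linux kernel needs:
./genmap BIG5.TXT | perl uni2big5.pl
./genmap CP936.TXT | perl uni2gbk.pl
After that, modifying the relevant functions in fat/vfat/ntfs completes the kernel changes. In practice the file system can be mounted with the following commands:

Simplified Chinese: mount -t vfat /dev/hda1 /mnt/1 -o codepage=936,iocharset=cp936

Traditional Chinese: mount -t vfat /dev/hda1 /mnt/1 -o codepage=950,iocharset=cp936

Interestingly, because GBK contains all of the Chinese characters found in GB2312, Big5 and JIS (at its own code values), code page 936 can also display Big5 file names.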
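The closing remark can be checked with a short Python sketch, assuming Python's cp950 and cp936 codecs as stand-ins for the two kernel tables: a Traditional Chinese character is representable in both, but with different bytes.

```python
name = "體"                         # a Traditional Chinese character

as_big5 = name.encode("cp950")
as_gbk = name.encode("cp936")

print(as_big5.hex(), as_gbk.hex())  # two different byte sequences
print(as_big5 != as_gbk)            # True

# Both decode back to the same Unicode character.
print(as_big5.decode("cp950") == as_gbk.decode("cp936"))   # True
```

Because the long names on disk are Unicode, what matters for display is only that the output character set contains the character, which GBK does here.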

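5. Appendix

5.1 Authors and related documents

Code page 950 support was written by Mr. cosmos of Taiwan; home page: http://www.cis.nctu.edu.tw:8080/~is84086/Project/kernel_cp950/

GBK (cp936) support was written by Fang Han (方汉) and Chen Xiangyang (陈向阳) of the TurboLinux Chinese R&D group.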

5.2 genmap

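#!/bin/sh
# Usage: ./genmap MAPPING.TXT
# 1. drop lines whose first field contains '#' (comments)
# 2. split on "0x" so that only the hex digits of the two codes remain
# 3. keep lines whose two codes have the same number of digits
cat "$1" | awk '{if(index($1,"#")==0)print $0}' | awk 'BEGIN{FS="0x"}{print $2 $3}' |
 awk '{if(length($1)==length($2))print $1,$2}'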
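For readers who find the awk pipeline hard to follow, here is a Python sketch that approximates it and then builds the two lookup tables that the Perl scripts derive from its output. The sample lines only imitate the style of the mapping files, and the function and variable names are chosen for illustration.

```python
SAMPLE = """\
# sample lines in the style of a vendor mapping file
0x41\t0x0041\t#LATIN CAPITAL LETTER A
0x8140\t0x4E02\t#CJK UNIFIED IDEOGRAPH
0x8141\t0x4E04\t#CJK UNIFIED IDEOGRAPH
"""

def genmap(text):
    """Yield (charset_hex, unicode_hex) for the double-byte entries."""
    for line in text.splitlines():
        fields = line.split()
        if not fields or "#" in fields[0]:
            continue                      # comment or blank line
        if len(fields) < 2 or not fields[1].startswith("0x"):
            continue                      # undefined entry
        charset, uni = fields[0][2:], fields[1][2:]
        if len(charset) == len(uni):      # drops "41" vs "0041"
            yield charset, uni

charset2uni, uni2charset = {}, {}
for charset, uni in genmap(SAMPLE):
    charset2uni[charset] = uni
    uni2charset[uni] = charset

print(charset2uni)   # {'8140': '4E02', '8141': '4E04'}
```

The single-byte ASCII line is filtered out because its two codes differ in length, which matches the last awk stage.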

5.3 uni2big5.pl

#!/usr/bin/perl

# Two-digit uppercase hex strings "00".."FF", indexed by byte value.
@code = map { sprintf("%02X", $_) } 0 .. 255;

# Each input line (from genmap) is "<Big5 code> <Unicode code>",
# both written as four hex digits.
while (<STDIN>){
        ($big5, $unicode) = split;
        ($high, $low) = $big5 =~ /(..)(..)/;
        $table2{$high}{$low} = $unicode;        # Big5 -> Unicode
        ($high, $low) = $unicode =~ /(..)(..)/;
        $table{$high}{$low} = $big5;            # Unicode -> Big5
}

print <<EOF;
/*
 * linux/fs/nls_cp950.c
 *
 * Charset cp950 translation tables.
 * Generated automatically from the Unicode and charset
 * tables from the Unicode Organization (www.unicode.org).
 * The Unicode to charset table has only exact mappings.
 */

#include <linux/module.h>
#include <linux/kernel.h>
#include <linux/string.h>
#include <linux/nls.h>

/* A1 - F9*/
static struct nls_unicode charset2uni[(0xF9-0xA1+1)*(0x100-0x60)] = {
EOF

# Big5 -> Unicode: lead bytes A1..F9, trail bytes 40..7F and A0..FF.
for ($high=0xA1; $high <= 0xF9; $high++){
        for ($low=0x40; $low <= 0x7F; $low++){
                $unicode = $table2{$code[$high]}{$code[$low]};
                $unicode = "0000" if (!(defined $unicode));
                print "\n\t" if ($low%4 == 0);
                print "/* $code[$high]$code[$low]*/\n\t" if ($low%0x10 == 0);
                ($uhigh, $ulow) = $unicode =~ /(..)(..)/;
                printf("{0x%2s, 0x%2s}, ", $ulow, $uhigh);
        }
        for ($low=0xA0; $low <= 0xFF; $low++){
                $unicode = $table2{$code[$high]}{$code[$low]};
                $unicode = "0000" if (!(defined $unicode));
                print "\n\t" if ($low%4 == 0);
                print "/* $code[$high]$code[$low]*/\n\t" if ($low%0x10 == 0);
                ($uhigh, $ulow) = $unicode =~ /(..)(..)/;
                printf("{0x%2s, 0x%2s}, ", $ulow, $uhigh);
        }
}

print "\n};\n\n";

# Unicode -> Big5: one 512-byte page (two bytes per entry) for every
# Unicode high byte that has mappings; unmapped entries become "??".
for ($high=1; $high <= 255;$high++){
        if (defined $table{$code[$high]}){
                print "static unsigned char page$code[$high]\[512\] = {\n\t";
                for ($low=0; $low<=255;$low++){
                        $big5 = $table{$code[$high]}{$code[$low]};
                        $big5 = "3F3F" if (!(defined $big5));
                        if ($low > 0 && $low%4 == 0){
                                printf("/* 0x%02X-0x%02X */\n\t", $low-4, $low-1);
                        }
                        print "\n\t" if ($low == 0x80);
                        ($bhigh, $blow) = $big5 =~ /(..)(..)/;
                        printf("0x%2s, 0x%2s, ", $bhigh, $blow);
                }
                print "/* 0xFC-0xFF */\n};\n\n";
        }
}

# Index of the pages above, by Unicode high byte.
print "static unsigned char *page_uni2charset[256] = {";
for ($high=0; $high<=255;$high++){
        print "\n\t" if ($high%8 == 0);
        if ($high>0 && defined $table{$code[$high]}){
                print "page$code[$high], ";
        }
        else{
                print "NULL,   ";
        }
}
print <<EOF;

};

static unsigned char charset2upper[256] = {
        0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, /* 0x00-0x07 */
        0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, /* 0x08-0x0f */
        0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, /* 0x10-0x17 */
        0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f, /* 0x18-0x1f */
        0x20, 0x21, 0x22, 0x23, 0x24, 0x25, 0x26, 0x27, /* 0x20-0x27 */
        0x28, 0x29, 0x2a, 0x2b, 0x2c, 0x2d, 0x2e, 0x2f, /* 0x28-0x2f */
        0x30, 0x31, 0x32, 0x33, 0x34, 0x35, 0x36, 0x37, /* 0x30-0x37 */
        0x38, 0x39, 0x3a, 0x3b, 0x3c, 0x3d, 0x3e, 0x3f, /* 0x38-0x3f */
        0x40, 0x41, 0x42, 0x43, 0x44, 0x45, 0x46, 0x47, /* 0x40-0x47 */
        0x48, 0x49, 0x4a, 0x4b, 0x4c, 0x4d, 0x4e, 0x4f, /* 0x48-0x4f */
        0x50, 0x51, 0x52, 0x53, 0x54, 0x55, 0x56, 0x57, /* 0x50-0x57 */
        0x58, 0x59, 0x5a, 0x5b, 0x5c, 0x5d, 0x5e, 0x5f, /* 0x58-0x5f */
        0x60, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, /* 0x60-0x67 */
        0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, /* 0x68-0x6f */
        0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, /* 0x70-0x77 */
        0x00, 0x00, 0x00, 0x7b, 0x7c, 0x7d, 0x7e, 0x7f, /* 0x78-0x7f */
        0x80, 0x81, 0x82, 0x83, 0x84, 0x85, 0x86, 0x87, /* 0x80-0x87 */
        0x88, 0x89, 0x8a, 0x8b, 0x8c, 0x8d, 0x8e, 0x8f, /* 0x88-0x8f */
        0x90, 0x91, 0x92, 0x93, 0x94, 0x95, 0x96, 0x97, /* 0x90-0x97 */
        0x98, 0x99, 0x9a, 0x00, 0x9c, 0x00, 0x00, 0x00, /* 0x98-0x9f */
        0x00, 0x00, 0x00, 0x00, 0xa4, 0xa5, 0xa6, 0xa7, /* 0xa0-0xa7 */
        0xa8, 0xa9, 0xaa, 0xab, 0xac, 0xad, 0xae, 0xaf, /* 0xa8-0xaf */
        0xb0, 0xb1, 0xb2, 0xb3, 0xb4, 0xb5, 0xb6, 0xb7, /* 0xb0-0xb7 */
        0xb8, 0xb9, 0xba, 0xbb, 0xbc, 0xbd, 0xbe, 0xbf, /* 0xb8-0xbf */
        0xc0, 0xc1, 0xc2, 0xc3, 0xc4, 0xc5, 0xc6, 0xc7, /* 0xc0-0xc7 */
        0xc8, 0xc9, 0xca, 0xcb, 0xcc, 0xcd, 0xce, 0xcf, /* 0xc8-0xcf */
        0xd0, 0xd1, 0xd2, 0xd3, 0xd4, 0xd5, 0x00, 0x00, /* 0xd0-0xd7 */
        0x00, 0xd9, 0xda, 0xdb, 0xdc, 0x00, 0x00, 0xdf, /* 0xd8-0xdf */
        0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, /* 0xe0-0xe7 */
        0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0xef, /* 0xe8-0xef */
        0xf0, 0xf1, 0x00, 0x00, 0x00, 0xf5, 0x00, 0xf7, /* 0xf0-0xf7 */
        0xf8, 0xf9, 0x00, 0x00, 0x00, 0x00, 0xfe, 0xff, /* 0xf8-0xff */
};


static void inc_use_count(void)
{
        MOD_INC_USE_COUNT;
}

static void dec_use_count(void)
{
        MOD_DEC_USE_COUNT;
}

static struct nls_table table = {
        "cp950",
        page_uni2charset,
        charset2uni,
        inc_use_count,
        dec_use_count,
        NULL
};

int init_nls_cp950(void)
{
        return register_nls(&table);
}

#ifdef MODULE
int init_module(void)
{
        return init_nls_cp950();
}


void cleanup_module(void)
{
        unregister_nls(&table);
        return;
}
#endif

/*
 * Overrides for Emacs so that we follow Linus's tabbing style.
 * Emacs will notice this stuff at the end of the file and automatically
 * adjust the settings for this buffer only.  This must remain at the end
 * of the file.
 *
 * ---------------------------------------------------------------------------
 * Local variables:
 * c-indent-level: 8
 * c-brace-imaginary-offset: 0
 * c-brace-offset: -8
 * c-argdecl-indent: 8
 * c-label-offset: -8
 * c-continued-statement-offset: 8
 * c-continued-brace-offset: 0
 * End:
 */
EOF
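The Unicode-to-charset side of the generated file is a two-level table: an index by Unicode high byte, pointing at 512-byte pages with two charset bytes per low byte. The Python sketch below mimics that layout with a single illustrative mapping; the lookup function uni2char is an assumption about how such a table would be read, not code taken from the kernel.

```python
# Tiny stand-in for the generated tables; the single mapping is illustrative.
mapping = {0x3000: 0xA140}           # Unicode -> Big5 (ideographic space)

# page_uni2charset[high] is either None or a 512-byte page holding two
# bytes (charset high, charset low) for each of the 256 low-byte values.
page_uni2charset = [None] * 256
for uni, code in mapping.items():
    high, low = uni >> 8, uni & 0xFF
    if page_uni2charset[high] is None:
        page_uni2charset[high] = bytearray(b"\x3f\x3f" * 256)   # "3F3F" filler
    page = page_uni2charset[high]
    page[2 * low] = code >> 8
    page[2 * low + 1] = code & 0xFF

def uni2char(uni):
    """Look up a 16-bit Unicode value; unmapped values give b'?' bytes."""
    page = page_uni2charset[uni >> 8]
    if page is None:
        return b"?"
    low = uni & 0xFF
    return bytes(page[2 * low:2 * low + 2])

print(uni2char(0x3000))   # b'\xa1@'
print(uni2char(0x3001))   # b'??'
print(uni2char(0x0041))   # b'?'
```

Splitting by high byte keeps the table small, since only the Unicode rows that actually contain mapped characters need a page.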

5.4 uni2gbk.pl

#!/usr/bin/perl

# Two-digit uppercase hex strings "00".."FF", indexed by byte value.
@code = map { sprintf("%02X", $_) } 0 .. 255;

# Each input line (from genmap) is "<GBK code> <Unicode code>",
# both written as four hex digits.
while (<STDIN>){
        ($gbk, $unicode) = split;
        ($high, $low) = $gbk =~ /(..)(..)/;
        $table2{$high}{$low} = $unicode;        # GBK -> Unicode
        ($high, $low) = $unicode =~ /(..)(..)/;
        $table{$high}{$low} = $gbk;             # Unicode -> GBK
}

print <<EOF;
/*
 * linux/fs/nls_cp936.c
 *
 * Charset cp936 translation tables.
 * Generated automatically from the Unicode and charset
 * tables from the Unicode Organization (www.unicode.org).
 * The Unicode to charset table has only exact mappings.
 */

#include <linux/module.h>
#include <linux/kernel.h>
#include <linux/string.h>
#include <linux/nls.h>

/* 81 - FE*/
static struct nls_unicode charset2uni[(0xFE-0x81+1)*(0x100-0x40)] = {
EOF

# GBK -> Unicode: lead bytes 81..FE, trail bytes 40..7F and 80..FF.
for ($high=0x81; $high <= 0xFE; $high++){
        for ($low=0x40; $low <= 0x7F; $low++){
                $unicode = $table2{$code[$high]}{$code[$low]};
                $unicode = "0000" if (!(defined $unicode));
                print "\n\t" if ($low%4 == 0);
                print "/* $code[$high]$code[$low]*/\n\t" if ($low%0x10 == 0);
                ($uhigh, $ulow) = $unicode =~ /(..)(..)/;
                printf("{0x%2s, 0x%2s}, ", $ulow, $uhigh);
        }
        for ($low=0x80; $low <= 0xFF; $low++){
                $unicode = $table2{$code[$high]}{$code[$low]};
                $unicode = "0000" if (!(defined $unicode));
                print "\n\t" if ($low%4 == 0);
                print "/* $code[$high]$code[$low]*/\n\t" if ($low%0x10 == 0);
                ($uhigh, $ulow) = $unicode =~ /(..)(..)/;
                printf("{0x%2s, 0x%2s}, ", $ulow, $uhigh);
        }
}

print "\n};\n\n";

# Unicode -> GBK: one 512-byte page (two bytes per entry) for every
# Unicode high byte that has mappings; unmapped entries become "??".
for ($high=1; $high <= 255;$high++){
        if (defined $table{$code[$high]}){
                print "static unsigned char page$code[$high]\[512\] = {\n\t";
                for ($low=0; $low<=255;$low++){
                        $gbk = $table{$code[$high]}{$code[$low]};
                        $gbk = "3F3F" if (!(defined $gbk));
                        if ($low > 0 && $low%4 == 0){
                                printf("/* 0x%02X-0x%02X */\n\t", $low-4, $low-1);
                        }
                        print "\n\t" if ($low == 0x80);
                        ($ghigh, $glow) = $gbk =~ /(..)(..)/;
                        printf("0x%2s, 0x%2s, ", $ghigh, $glow);
                }
                print "/* 0xFC-0xFF */\n};\n\n";
        }
}

# Index of the pages above, by Unicode high byte.
print "static unsigned char *page_uni2charset[256] = {";
for ($high=0; $high<=255;$high++){
        print "\n\t" if ($high%8 == 0);
        if ($high>0 && defined $table{$code[$high]}){
                print "page$code[$high], ";
        }
        else{
                print "NULL,   ";
        }
}
print <<EOF;

};

static unsigned char charset2upper[256] = {
        0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, /* 0x00-0x07 */
        0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, /* 0x08-0x0f */
        0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, /* 0x10-0x17 */
        0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f, /* 0x18-0x1f */
        0x20, 0x21, 0x22, 0x23, 0x24, 0x25, 0x26, 0x27, /* 0x20-0x27 */
        0x28, 0x29, 0x2a, 0x2b, 0x2c, 0x2d, 0x2e, 0x2f, /* 0x28-0x2f */
        0x30, 0x31, 0x32, 0x33, 0x34, 0x35, 0x36, 0x37, /* 0x30-0x37 */
        0x38, 0x39, 0x3a, 0x3b, 0x3c, 0x3d, 0x3e, 0x3f, /* 0x38-0x3f */
        0x40, 0x41, 0x42, 0x43, 0x44, 0x45, 0x46, 0x47, /* 0x40-0x47 */
        0x48, 0x49, 0x4a, 0x4b, 0x4c, 0x4d, 0x4e, 0x4f, /* 0x48-0x4f */
        0x50, 0x51, 0x52, 0x53, 0x54, 0x55, 0x56, 0x57, /* 0x50-0x57 */
        0x58, 0x59, 0x5a, 0x5b, 0x5c, 0x5d, 0x5e, 0x5f, /* 0x58-0x5f */
        0x60, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, /* 0x60-0x67 */
        0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, /* 0x68-0x6f */
        0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, /* 0x70-0x77 */
        0x00, 0x00, 0x00, 0x7b, 0x7c, 0x7d, 0x7e, 0x7f, /* 0x78-0x7f */
        0x80, 0x81, 0x82, 0x83, 0x84, 0x85, 0x86, 0x87, /* 0x80-0x87 */
        0x88, 0x89, 0x8a, 0x8b, 0x8c, 0x8d, 0x8e, 0x8f, /* 0x88-0x8f */
        0x90, 0x91, 0x92, 0x93, 0x94, 0x95, 0x96, 0x97, /* 0x90-0x97 */
        0x98, 0x99, 0x9a, 0x00, 0x9c, 0x00, 0x00, 0x00, /* 0x98-0x9f */
        0x00, 0x00, 0x00, 0x00, 0xa4, 0xa5, 0xa6, 0xa7, /* 0xa0-0xa7 */
        0xa8, 0xa9, 0xaa, 0xab, 0xac, 0xad, 0xae, 0xaf, /* 0xa8-0xaf */
        0xb0, 0xb1, 0xb2, 0xb3, 0xb4, 0xb5, 0xb6, 0xb7, /* 0xb0-0xb7 */
        0xb8, 0xb9, 0xba, 0xbb, 0xbc, 0xbd, 0xbe, 0xbf, /* 0xb8-0xbf */
        0xc0, 0xc1, 0xc2, 0xc3, 0xc4, 0xc5, 0xc6, 0xc7, /* 0xc0-0xc7 */
        0xc8, 0xc9, 0xca, 0xcb, 0xcc, 0xcd, 0xce, 0xcf, /* 0xc8-0xcf */
        0xd0, 0xd1, 0xd2, 0xd3, 0xd4, 0xd5, 0x00, 0x00, /* 0xd0-0xd7 */
        0x00, 0xd9, 0xda, 0xdb, 0xdc, 0x00, 0x00, 0xdf, /* 0xd8-0xdf */
        0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, /* 0xe0-0xe7 */
        0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0xef, /* 0xe8-0xef */
        0xf0, 0xf1, 0x00, 0x00, 0x00, 0xf5, 0x00, 0xf7, /* 0xf0-0xf7 */
        0xf8, 0xf9, 0x00, 0x00, 0x00, 0x00, 0xfe, 0xff, /* 0xf8-0xff */
};


static void inc_use_count(void)
{
        MOD_INC_USE_COUNT;
}

static void dec_use_count(void)
{
        MOD_DEC_USE_COUNT;
}

static struct nls_table table = {
        "cp936",
        page_uni2charset,
        charset2uni,
        inc_use_count,
        dec_use_count,
        NULL
};

int init_nls_cp936(void)
{
        return register_nls(&table);
}

#ifdef MODULE
int init_module(void)
{
        return init_nls_cp936();
}


void cleanup_module(void)
{
        unregister_nls(&table);
        return;
}
#endif

/*
 * Overrides for Emacs so that we follow Linus's tabbing style.
 * Emacs will notice this stuff at the end of the file and automatically
 * adjust the settings for this buffer only.  This must remain at the end
 * of the file.
 *
 * ---------------------------------------------------------------------------
 * Local variables:
 * c-indent-level: 8
 * c-brace-imaginary-offset: 0
 * c-brace-offset: -8
 * c-argdecl-indent: 8
 * c-label-offset: -8
 * c-continued-statement-offset: 8
 * c-continued-brace-offset: 0
 * End:
 */
EOF

5.5 A code page conversion tool

/*
 * CPI.C: A program to examine MSDOS codepage files (*.cpi)
 * and extract specific codepages.
 * Compiles under Linux & DOS (using BC++ 3.1).
 *
 * Compile: gcc -o cpi cpi.c
 * Call: codepage file.cpi [-a|-l|nnn]
 *
 * Author: Ahmed M. Naas (ahmed@oea.xs4all.nl)
 * Many changes: aeb@cwi.nl  [changed until it would handle all
 *      *.cpi files people have sent me; I have no documentation,
 *      so all this is experimental]
 * Remains to do: DRDOS fonts.
 *
 * Copyright: Public domain.
 */

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int handle_codepage(int);
void handle_fontfile(void);

#define PACKED __attribute__ ((packed))
/* Use this (instead of the above) to compile under MSDOS */
/*#define PACKED  */

/*
 * The structs below are read straight from the file, so they only match
 * the on-disk layout where short is 16 bits and long is 32 bits.
 */
struct {
        unsigned char id[8] PACKED;
        unsigned char res[8] PACKED;
        unsigned short num_pointers PACKED;
        unsigned char p_type PACKED;
        unsigned long offset PACKED;
} FontFileHeader;

struct {
        unsigned short num_codepages PACKED;
} FontInfoHeader;

struct {
        unsigned short size PACKED;
        unsigned long off_nexthdr PACKED;
        unsigned short device_type PACKED; /* screen=1; printer=2 */
        unsigned char device_name[8] PACKED;
        unsigned short codepage PACKED;
        unsigned char res[6] PACKED;
        unsigned long off_font PACKED;
} CPEntryHeader;

struct {
        unsigned short reserved PACKED;
        unsigned short num_fonts PACKED;
        unsigned short size PACKED;
} CPInfoHeader;

struct {
        unsigned char height PACKED;
        unsigned char width PACKED;
        unsigned short reserved PACKED;
        unsigned short num_chard PACKED;
} ScreenFontHeader;

struct {
        unsigned short p1 PACKED;
        unsigned short p2 PACKED;
} PrinterFontHeader;

FILE *in, *out;
void usage(void);

int opta, optc, optl, optL, optx;
extern int optind;
extern char *optarg;

unsigned short codepage;

int main (int argc, char *argv[])
{
        if (argc < 2)
                usage();

        if ((in = fopen(argv[1], "r")) == NULL) {
                printf("\nUnable to open file %s.\n", argv[1]);
                exit(0);
        }

        opta = optc = optl = optL = optx = 0;
        optind = 2;
        if (argc == 2)
                optl = 1;
        else
        while(1) {
            switch(getopt(argc, argv, "alLc")) {
              case 'a':
                opta = 1;
                continue;
              case 'c':
                optc = 1;
                continue;
              case 'L':
                optL = 1;
                continue;
              case 'l':
                optl = 1;
                continue;
              case '?':
              default:
                usage();
              case -1:
                break;
            }
            break;
        }
        if (optind != argc) {
            if (optind != argc-1 || opta)
              usage();
            codepage = atoi(argv[optind]);
            optx = 1;
        }

        if (optc)
          handle_codepage(0);
        else
          handle_fontfile();

        if (optx) {
            printf("no page %d found\n", codepage);
            exit(1);
        }

        fclose(in);
        return (0);
}

void
handle_fontfile(){
        int i, j;

        j = fread(&FontFileHeader, 1, sizeof(FontFileHeader), in);
        if (j != sizeof(FontFileHeader)) {
            printf("error reading FontFileHeader - got %d chars\n", j);
            exit (1);
        }
        if (!strcmp((char *) FontFileHeader.id + 1, "DRFONT ")) {
            printf("this program cannot handle DRDOS font files\n");
            exit(1);
        }
        if (optL)
          printf("FontFileHeader: id=%8.8s res=%8.8s num=%d typ=%c offset=%ld\n\n",
                 FontFileHeader.id, FontFileHeader.res,
                 FontFileHeader.num_pointers,
                 FontFileHeader.p_type,
                 FontFileHeader.offset);

        j = fread(&FontInfoHeader, 1, sizeof(FontInfoHeader), in);
        if (j != sizeof(FontInfoHeader)) {
            printf("error reading FontInfoHeader - got %d chars\n", j);
            exit (1);
        }
        if (optL)
          printf("FontInfoHeader: num_codepages=%d\n\n",
                 FontInfoHeader.num_codepages);

        for (i = FontInfoHeader.num_codepages; i; i--)
          if (handle_codepage(i-1))
            break;
}

int
handle_codepage(int more_to_come) {
        int j;
        char outfile[20];
        unsigned char *fonts;
        long inpos, nexthdr;

        j = fread(&CPEntryHeader, 1, sizeof(CPEntryHeader), in);
        if (j != sizeof(CPEntryHeader)) {
            printf("error reading CPEntryHeader - got %d chars\n", j);
            exit(1);
        }
        if (optL) {
            int t = CPEntryHeader.device_type;
            printf("CPEntryHeader: size=%d dev=%d [%s] name=%8.8s \
codepage=%d\n\t\tres=%6.6s nxt=%ld off_font=%ld\n\n",
                   CPEntryHeader.size,
                   t, (t==1) ? "screen" : (t==2) ? "printer" : "?",
                   CPEntryHeader.device_name,
                   CPEntryHeader.codepage,
                   CPEntryHeader.res,
                   CPEntryHeader.off_nexthdr, CPEntryHeader.off_font);
        } else if (optl) {
            printf("\nCodepage = %d\n", CPEntryHeader.codepage);
            printf("Device = %.8s\n", CPEntryHeader.device_name);
        }
#if 0
        if (CPEntryHeader.size != sizeof(CPEntryHeader)) {
            /* seen 26 and 28, so that the difference below is -2 or 0 */
            if (optl)
              printf("Skipping %d bytes of garbage\n",
                     CPEntryHeader.size - sizeof(CPEntryHeader));
            fseek(in, CPEntryHeader.size - sizeof(CPEntryHeader),
                  SEEK_CUR);
        }
#endif
        if (!opta && (!optx || CPEntryHeader.codepage != codepage) && !optc)
          goto next;

        inpos = ftell(in);
        if (inpos != CPEntryHeader.off_font && !optc) {
            if (optL)
              printf("pos=%ld font at %ld\n", inpos, CPEntryHeader.off_font);
            fseek(in, CPEntryHeader.off_font, SEEK_SET);
        }

        j = fread(&CPInfoHeader, 1, sizeof(CPInfoHeader), in);
        if (j != sizeof(CPInfoHeader)) {
            printf("error reading CPInfoHeader - got %d chars\n", j);
            exit(1);
        }
        if (optl) {
            printf("Number of Fonts = %d\n", CPInfoHeader.num_fonts);
            printf("Size of Bitmap = %d\n", CPInfoHeader.size);
        }
        if (CPInfoHeader.num_fonts == 0)
          goto next;
        if (optc)
          return 0;

        sprintf(outfile, "%d.cp", CPEntryHeader.codepage);
        if ((out = fopen(outfile, "w")) == NULL) {
            printf("\nUnable to open file %s.\n", outfile);
            exit(1);
        } else printf("\nWriting %s\n", outfile);

        fonts = (unsigned char *) malloc(CPInfoHeader.size);

        /* Copy the entry header, info header and font bitmaps to NNN.cp. */
        fread(fonts, CPInfoHeader.size, 1, in);
        fwrite(&CPEntryHeader, sizeof(CPEntryHeader), 1, out);
        fwrite(&CPInfoHeader, sizeof(CPInfoHeader), 1, out);
        j = fwrite(fonts, 1, CPInfoHeader.size, out);
        if (j != CPInfoHeader.size) {
            printf("error writing %s - wrote %d chars\n", outfile, j);
            exit(1);
        }
        fclose(out);
        free(fonts);
        if (optx) exit(0);
      next:
        /*
         * It seems that if entry headers and fonts are interspersed,
         * then nexthdr will point past the font, regardless of
         * whether more entries follow.
         * Otherwise, first all entry headers are given, and then
         * all fonts; in this case nexthdr will be 0 in the last entry.
         */
        nexthdr = CPEntryHeader.off_nexthdr;
        if (nexthdr == 0 || nexthdr == -1) {
            if (more_to_come) {
                printf("more codepages expected, but nexthdr=%ld\n",
                       nexthdr);
                exit(1);
            } else
                return 1;
        }

        inpos = ftell(in);
        if (inpos != CPEntryHeader.off_nexthdr) {
            if (optL)
              printf("pos=%ld nexthdr at %ld\n", inpos, nexthdr);
            if (opta && !more_to_come) {
                printf("no more code pages, but nexthdr != 0\n");
                return 1;
            }

            fseek(in, CPEntryHeader.off_nexthdr, SEEK_SET);
        }

        return 0;
}

void usage(void)
{
        printf("\nUsage: cpi code_page_file [-c] [-L] [-l] [-a|nnn]\n");
        printf(" -c: input file is a single codepage\n");
        printf(" -L: print header info (you don't want to see this)\n");
        printf(" -l or no option: list all codepages contained in the file\n");
        printf(" -a: extract all codepages from the file\n");
        printf(" nnn (3 digits): extract codepage nnn from the file\n");
        printf("Example: cpi ega.cpi 850 \n");
        printf(" will create a file 850.cp containing the requested codepage.\n\n");
        exit(1);
}
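The first header that the program reads can also be decoded with Python's struct module, which makes the byte layout explicit. This sketch assumes little-endian fields with 16-bit short and 32-bit long values; the sample bytes are synthetic, built for this example rather than taken from a real .cpi file, and parse_font_file_header is a hypothetical name.

```python
import struct

# Little-endian, no padding: id[8], res[8], num_pointers (u16),
# p_type (u8), offset (u32) -> 23 bytes under the assumptions above.
FONT_FILE_HEADER = struct.Struct("<8s8sHBI")

def parse_font_file_header(raw: bytes) -> dict:
    """Unpack the leading FontFileHeader fields from raw file bytes."""
    ident, res, num_pointers, p_type, offset = FONT_FILE_HEADER.unpack_from(raw)
    return {"id": ident, "res": res, "num_pointers": num_pointers,
            "p_type": p_type, "offset": offset}

# Synthetic header bytes made up for this example, not from a real file.
sample = FONT_FILE_HEADER.pack(b"\xffFONT   ", bytes(8), 1, 1, 23)
header = parse_font_file_header(sample)
print(FONT_FILE_HEADER.size)                 # 23
print(header["id"], header["offset"])        # b'\xffFONT   ' 23
```

Fixing the field widths in the format string avoids depending on the compiler's sizes for short and long, which is the main portability risk of reading the structs directly as the C program does.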

Posted on 2007-04-05 17:24 by CPP&&设计模式小屋
