Delphi in a Unicode World Part I



Delphi in a Unicode World Part I: What is Unicode, Why do you need it, and How do you work with it in Delphi?

By: Nick Hodges


Abstract: This article discusses Unicode, how Delphi developers can benefit from using Unicode, and how Unicode will be implemented in Delphi 2009.


The Internet has broken down geographical barriers, enabling worldwide software distribution. As a result, applications can no longer live in a purely ANSI-based environment. The world has embraced Unicode as the standard means of transferring text and data. Since it provides support for virtually any writing system in the world, Unicode text is now the norm throughout the global technological ecosystem.

    What is Unicode?

Unicode is a character encoding standard that allows virtually all alphabets to be encoded into a single character set. Unicode allows computers to manage and represent text in most of the world’s writing systems. Unicode is managed by The Unicode Consortium and codified in a standard. More simply put, Unicode is a system for enabling everyone to use each other’s alphabets. Heck, there is even an unofficial Unicode mapping for Klingon, which lives in the Private Use Area rather than in the standard itself.

This series of articles isn’t meant to give you a full rundown of exactly what Unicode is and how it works; instead it is meant to get you going on using Unicode within Delphi 2009. If you want a good overview of Unicode, Joel Spolsky has a great article entitled “The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)” which is highly recommended reading. As Joel clearly points out “IT’S NOT THAT HARD”. This article, Part I of III, will discuss why Unicode is important, and how Delphi will implement the new UnicodeString type.

    Why Unicode?

Among the many new features found in Delphi 2009 is Unicode support throughout the product. The default string in Delphi is now a Unicode-based string. Since Delphi is largely built with Delphi, the IDE, the compiler, the RTL, and the VCL are all fully Unicode-enabled.

The move to Unicode in Delphi is a natural one. Windows itself is fully Unicode-aware, so it makes sense that applications built for it use a Unicode string as the default string. And for Delphi developers, the benefits don’t stop merely at being able to use the same string type as Windows.

The addition of Unicode support provides Delphi developers with a great opportunity. Delphi developers can now read, write, accept, produce, display, and deal with Unicode data, and it’s all built right into the product. With only a few code changes, or in some cases none at all, your applications can be ready for any kind of data you, your customers, or your users can throw at them. Applications that were previously restricted to ANSI-encoded data can be easily modified to handle almost any character set in the world.

Delphi developers will now be able to serve a global market with their applications -- even if they don’t do anything special to localize or internationalize their applications. Windows itself supports many different localized versions, and Delphi applications need to be able to adapt and work on machines running any of the large number of locales that Windows supports, including the Japanese, Chinese, Greek, or Russian versions of Windows. Users of your software may be entering non-ANSI text into your application or using non-ANSI based path names. ANSI-based applications won’t always work as desired in those scenarios. Windows applications built with a fully Unicode-enabled Delphi will be able to handle and work in those situations. Even if you don’t translate your application into any other spoken languages, your application still needs to be able to work properly -- no matter what the end user’s locale is.

For existing ANSI-based Delphi applications, the opportunity to localize applications and expand the reach of those applications into Unicode-based markets is potentially huge. And if you do want to localize your applications, Delphi makes that very easy, especially now at design-time. The Integrated Translation Environment (ITE) enables you to translate, compile, and deploy an application right in the IDE. If you require external translation services, the IDE can export your project in a form that translators can use in conjunction with the deployable External Translation Manager. These tools work together with the IDE for both Delphi and C++Builder to make localizing your applications a smooth and easy-to-manage process.

The world is Unicode-based, and now Delphi developers can be a part of that in a native, organic way. So if you want to be able to handle Unicode data, or if you want to sell your applications to emerging and global markets, you can do it with Delphi 2009.

    A Word about Terminology

Unicode encourages the use of some new terms. For instance, the idea of a “character” is a bit less precise in the world of Unicode than you might be used to. In Unicode, the more precise term is “code point”: a unique number assigned to an element defined by the Unicode Consortium. Most commonly that element is a “character”, but not always. In Delphi 2009, SizeOf(Char) is 2, but even that doesn’t tell the whole story. Depending on the encoding, it is possible for a given code point to take up more than two bytes. In UTF-16, such a code point is stored as a sequence of two Char elements called a “surrogate pair”.
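
As a rough illustration of the difference between a Char and a code point, the following console sketch stores one character from outside the Basic Multilingual Plane (U+1D11E, MUSICAL SYMBOL G CLEF) and reports how many Char elements and bytes it occupies. It assumes Delphi 2009 or later, where string is UTF-16 based; the program and variable names are purely illustrative.

   program SurrogateDemo;
   {$APPTYPE CONSOLE}
   var
     S: string;
   begin
     // U+1D11E lies outside the BMP, so UTF-16 encodes it as the
     // surrogate pair D834 DD1E (two Char elements).
     S := #$D834#$DD1E;
     Writeln('SizeOf(Char): ', SizeOf(Char));              // 2
     Writeln('Length(S):    ', Length(S));                 // 2 elements, 1 code point
     Writeln('Bytes:        ', Length(S) * SizeOf(Char));  // 4
     // A high (leading) surrogate falls in the range D800..DBFF.
     Writeln('Starts with high surrogate: ',
       (Ord(S[1]) >= $D800) and (Ord(S[1]) <= $DBFF));
   end.

The point to take away is that Length counts Char elements, not code points, so code that equates one Char with one visible character can be wrong for text of this kind.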

Another term you will see in relation to Unicode is “BOM”, or Byte Order Mark, a very short prefix placed at the beginning of a text file to indicate the encoding used for that file. MSDN has a nice article on what a BOM is. The new TEncoding class (to be discussed in Part II) has a method called GetPreamble which returns the BOM for a given encoding.
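
To make the BOM concrete, here is a minimal sketch that prints the preamble bytes for three of the standard encodings exposed by TEncoding in SysUtils. The ShowPreamble helper is a hypothetical name introduced only for this example, and GetPreamble is called here on the encoding instances returned by TEncoding.UTF8 and its siblings.

   program BomDemo;
   {$APPTYPE CONSOLE}
   uses
     SysUtils;

   // Hypothetical helper: writes the preamble bytes of an encoding as hex.
   procedure ShowPreamble(const Name: string; Encoding: TEncoding);
   var
     Bom: TBytes;
     I: Integer;
   begin
     Bom := Encoding.GetPreamble;
     Write(Name, ': ');
     for I := 0 to High(Bom) do
       Write(IntToHex(Bom[I], 2), ' ');
     Writeln;
   end;

   begin
     ShowPreamble('UTF-8', TEncoding.UTF8);                    // EF BB BF
     ShowPreamble('UTF-16 LE', TEncoding.Unicode);             // FF FE
     ShowPreamble('UTF-16 BE', TEncoding.BigEndianUnicode);    // FE FF
   end.

A reader of an unknown text file can compare the first bytes of the file against these preambles to make a first guess at its encoding.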

Now that all that has been explained, we’ll look at how Delphi 2009 implements a Unicode-based string.

    The New UnicodeString Type

The default string in Delphi 2009 is the new UnicodeString type. By default, the UnicodeString type will have an affinity for UTF-16, the same encoding used by Windows. This is a change from previous versions which had AnsiString as the default type. The Delphi RTL has in the past included the WideString type to handle Unicode data, but this type is not reference-counted as the AnsiString type is, and thus isn’t as full-featured as Delphi developers expect the default string to be.

For Delphi 2009, a new UnicodeString type has been designed that incorporates the capabilities of both the AnsiString and WideString types. A UnicodeString can contain either a Unicode-sized character or an ANSI byte-sized character. The Char and PChar types will map to WideChar and PWideChar, respectively. Note, as well, that no string types have disappeared: AnsiString, WideString, and all the other types that developers are used to still exist and work as before.

However, for Delphi 2009, the default string type will be equivalent to UnicodeString. In addition, the default Char type is WideChar, and the default PChar type is PWideChar.

That is, the compiler effectively declares the following:

  string = UnicodeString;
  Char = WideChar;
  PChar = PWideChar;

UnicodeString is assignment compatible with all other string types; however, assignments between AnsiStrings and UnicodeStrings will do type conversions as appropriate. Thus, an assignment of a UnicodeString to an AnsiString could result in data loss. That is, if a UnicodeString contains characters that the target ANSI code page cannot represent, converting that string to AnsiString will lose those characters.
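
The possible data loss can be checked with a simple round trip. The sketch below, which assumes Delphi 2009 or later, converts a UnicodeString to AnsiString and back, then compares the result with the original. The outcome depends on the ANSI code page of the machine it runs on, so no particular result is claimed here; the variable names are illustrative.

   program LossDemo;
   {$APPTYPE CONSOLE}
   var
     Original, RoundTrip: string;
     Narrow: AnsiString;
   begin
     Original := #$4E16#$754C;         // two CJK characters, U+4E16 U+754C
     Narrow := AnsiString(Original);   // converts to the system ANSI code page
     RoundTrip := string(Narrow);      // converts back to UTF-16
     if RoundTrip = Original then
       Writeln('Round trip preserved the text')
     else
       Writeln('Round trip lost data');
   end.

Writing the conversions as explicit casts, as done here, at least makes each conversion point visible when reviewing existing code.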

The important thing to note here is that this new UnicodeString behaves pretty much like strings always have (with the notable exception of their ability to hold Unicode data, of course). You can still add any string data to them, you can index them, you can concatenate them with the ‘+’ sign, etc.

For example, instances of a UnicodeString will still be able to index characters. Consider the following code:

   var
     MyChar: Char;
     MyString: string;
   begin
     MyString := 'This is a string';
     MyChar := MyString[1];  // first character of the string
   end;

The variable MyChar will still hold the character found at the first index position, i.e. ‘T’. The functionality of this code hasn’t changed at all. Similarly, if we are handling Unicode data:

   var
     MyChar: Char;
     MyString: string;
   begin
     MyString := '世界您好';
     MyChar := MyString[1];  // first character of the string
   end;

The variable MyChar will still hold the character found at the first index position, i.e. ‘世’.

The RTL provides helper functions for explicit conversions between code pages and between element sizes. Code that uses the Move function on a string’s character array can no longer make assumptions about the element size.
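
As a small sketch of the Move caveat, the example below copies a string into a fixed Char buffer and computes the byte count from SizeOf(Char) instead of assuming one byte per character. The buffer size and names are arbitrary choices for the illustration.

   program MoveDemo;
   {$APPTYPE CONSOLE}
   var
     Source: string;
     Buffer: array[0..31] of Char;
   begin
     Source := 'Hello';
     // SizeOf(Buffer) is a size in bytes, which is what FillChar expects.
     FillChar(Buffer, SizeOf(Buffer), 0);
     // Move also counts bytes, so scale the character count by SizeOf(Char).
     Move(Source[1], Buffer[0], Length(Source) * SizeOf(Char));
     Writeln(PChar(@Buffer[0]));   // Hello
   end.

Passing Length(Source) alone as the count would copy only half of the data in Delphi 2009, since each Char element now occupies two bytes.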

As you can imagine, this new string type has ramifications for existing code. With Unicode, it is no longer true that one Char represents one Byte. In fact, it isn’t even always true that one Char is equal to two bytes! As a result, you may have to make some adjustments to your code. However, we’ve worked very hard to make the transition a smooth one, and we are confident that you’ll be able to be up and running quite quickly. Parts II and III of this series will discuss further the new UnicodeString type, talk about some of the new features of the RTL that support Unicode enablement, and then discuss specific coding idioms that you’ll want to look for in your code. This series should help make your transition to Unicode a smooth and painless endeavor.


With the addition of Unicode as the default string, Delphi can accept, process, and display virtually any alphabet or code page in the world. Applications you build with Delphi 2009 will be able to accept, display, and handle Unicode text with ease, and they will work much better in almost any Windows locale. Delphi developers can now easily localize and translate their applications to enter markets that have previously been more difficult to enter. It’s a Unicode world out there, and now your Delphi apps can live in it.

In Part II, we’ll discuss the changes and updates to the Delphi Runtime Library that will enable you to work easily with Unicode strings.




Posted on 2010-01-11 22:43 by cpploverr
