diceidea

parser

常用链接

Others

我的外文地址 (rss)
C++ blog这里主要是国内交流，外语的嘛在msn空间里，目前只有英语，以后可能会学西班牙语，或者法语，我想学的好多啊

有用的东西

DFA和lexical analysis

对于hand written的lexical analyzer来说，NFA和DFA的运用是不可避免的，除非你的grammer十分简单。
一旦给出了source program(也就是你想处理的character stream)的一个pattern的正则表达式，就可以构造对应的NFA，然后转换为DFA，这个DFA就可以用来处理你的source program, 将里面能够match这个pattern的lexeme全都找出来。按照这样的流程，对于一种编程语言，不管是常用的语言，还是脚本语言，只要对所有的pattern构造DFA，就能够写出自己的lexical analyzer了。
有两篇关于正则表达式到DFA的文章写的很好：
1.Writing own regular expression parser By Amer Gerzic英文的
http://www.codeproject.com/KB/recipes/OwnRegExpressionsParser.aspx
有源码
2. 《构造正则表达式引擎》新鲜出炉啦！中文的，by vczh,华南理工大学
http://www.cppblog.com/vczh/archive/2008/05/22/50763.html
阅读完上面两篇文章，写个能够运行的lexer就不成问题了。
另外附上龙书（Compilers, principles techniques and tools）里一段token,pattern和lexeme术语的区别：
1. A t o k e n is a pair consisting of a token name and an optional attribute
value. The token name is an abstract symbol representing a kind of
lexical unit(lexeme), e.g., a particular keyword, or a sequence of input characters
denoting an identifier. The token names are the input symbols that the
parser processes. In what follows, we shall generally write the name of a
token in boldface. We will often refer to a token by its token name.
2. A pattern is a description of the form that the lexemes of a token may take.
In the case of a keyword as a token, the pattern is just the sequence of
characters that form the keyword. For identifiers and some other tokens,
the pattern is a more complex structure that is matched by many strings.
3. A lexeme is a sequence of characters in the source program that matches
the pattern for a token and is identified by the lexical analyzer as an
instance of that token.
notes:
1. more than one lexeme can match a pattern
2. 看看example 3.1

posted on 2008-05-24 13:59 diceidea 阅读(585) 评论(0) 编辑收藏引用所属分类: Dev log

只有注册用户登录后才能发表评论。


相关文章: DFA和lexical analysis dev log(1)

网站导航: 博客园博客园最新博文博问管理

diceidea

导航

留言簿(1)

随笔分类

文章分类

收藏夹

随笔档案

阅读排行榜

评论排行榜

常用链接

Others

有用的东西

最新评论

DFA和lexical analysis