DFA和lexical analysis

diceidea — Sat, 24 May 2008 05:59:00 GMT

对于hand written的lexical analyzer来说，NFA和DFA的运用是不可避免的，除非你的grammer十分简单。
一旦给出了source program(也就是你想处理的character stream)的一个pattern的正则表达式，就可以构造对应的NFA，然后转换为DFA，这个DFA就可以用来处理你的source program, 将里面能够match这个pattern的lexeme全都找出来。按照这样的流程，对于一种编程语言，不管是常用的语言，还是脚本语言，只要对所有的pattern构造DFA，就能够写出自己的lexical analyzer了。
有两篇关于正则表达式到DFA的文章写的很好：
1.Writing own regular expression parser By Amer Gerzic英文的
http://www.codeproject.com/KB/recipes/OwnRegExpressionsParser.aspx
有源码
2. 《构造正则表达式引擎》新鲜出炉啦！中文的，by vczh,华南理工大学
http://www.cppblog.com/vczh/archive/2008/05/22/50763.html
阅读完上面两篇文章，写个能够运行的lexer就不成问题了。
另外附上龙书（Compilers, principles techniques and tools）里一段token,pattern和lexeme术语的区别：
1. A t o k e n is a pair consisting of a token name and an optional attribute
value. The token name is an abstract symbol representing a kind of
lexical unit(lexeme), e.g., a particular keyword, or a sequence of input characters
denoting an identifier. The token names are the input symbols that the
parser processes. In what follows, we shall generally write the name of a
token in boldface. We will often refer to a token by its token name.
2. A pattern is a description of the form that the lexemes of a token may take.
In the case of a keyword as a token, the pattern is just the sequence of
characters that form the keyword. For identifiers and some other tokens,
the pattern is a more complex structure that is matched by many strings.
3. A lexeme is a sequence of characters in the source program that matches
the pattern for a token and is identified by the lexical analyzer as an
instance of that token.
notes:
1. more than one lexeme can match a pattern
2. 看看example 3.1

diceidea 2008-05-24 13:59 发表评论

dev log(1)

diceidea — Mon, 12 May 2008 03:50:00 GMT

Summary for last week:

1. Understand the basic of lexical parse and syntax parse, know what they are used for separately. The former with regular expression to recognize tokens, the latter using operator precedence/recursive descent to build a syntax tree for continuing step, e.g, computing the math expression. The former is considered on character, while the latter is on token.

2. Have a superficial knowledge in Automate, know how to build a FA for specific regular expression and reduce e-FA to FA, if the result is NFA, then convert to DFA

3. I find rolling dice can reduce to a SM.

Todo in coming week:

1. Study the internal lexical and syntax parts of muParser.

2. Write code for my own math expression parser.

3. Add 'If' and 'while' to the parser using my own way, then read more in Parsing related Docs for better solution.

Time stamp:

1:00 May.12.2008

diceidea 2008-05-12 11:50 发表评论

C++博客-diceidea-随笔分类-Dev log

DFA和lexical analysis

dev log(1)