﻿<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:trackback="http://madskills.com/public/xml/rss/module/trackback/" xmlns:wfw="http://wellformedweb.org/CommentAPI/" xmlns:slash="http://purl.org/rss/1.0/modules/slash/"><channel><title>C++博客-sunrise-随笔分类-自然语言处理</title><link>http://www.cppblog.com/sunrise/category/18950.html</link><description>每天不断学习，才能不断提升自己。

Feel free to get in touch. QQ: 703979707

My page: http://www.u148.net/u/lwx</description><language>zh-cn</language><lastBuildDate>Thu, 20 Sep 2012 11:03:41 GMT</lastBuildDate><pubDate>Thu, 20 Sep 2012 11:03:41 GMT</pubDate><ttl>60</ttl><item><title>NLP Data Collection</title><link>http://www.cppblog.com/sunrise/archive/2012/09/20/191408.html</link><dc:creator>SunRise_at</dc:creator><author>SunRise_at</author><pubDate>Thu, 20 Sep 2012 09:29:00 GMT</pubDate><guid>http://www.cppblog.com/sunrise/archive/2012/09/20/191408.html</guid><wfw:comment>http://www.cppblog.com/sunrise/comments/191408.html</wfw:comment><comments>http://www.cppblog.com/sunrise/archive/2012/09/20/191408.html#Feedback</comments><slash:comments>0</slash:comments><wfw:commentRss>http://www.cppblog.com/sunrise/comments/commentRss/191408.html</wfw:commentRss><trackback:ping>http://www.cppblog.com/sunrise/services/trackbacks/191408.html</trackback:ping><description><![CDATA[<p>There is a great deal of data available for download on the web, but it is a mixed bag; below I have pulled out the more useful resources.<br /><br />The wiki family:<br />Everyone knows Wikipedia. Its dumps can be downloaded at <a href="http://dumps.wikimedia.org/" rel="nofollow" target="_blank">http://dumps.wikimedia.org/</a>, with a detailed guide at <a href="http://en.wikipedia.org/wiki/Wikipedia:Database_download" rel="nofollow" target="_blank">http://en.wikipedia.org/wiki/Wikipedia:Database_download</a>.<br />Wikipedia, however, is only one project of the Wikimedia Foundation, which runs several other important projects:<br />Wiktionary: a semantically linked dictionary, similar in form to WordNet<br />Wikiquote: collections of famous quotations<br />Wikibooks: free textbooks and manuals<br />Wikinews: a large body of news stories<br />Wikiversity: free educational materials<br />Wikisource: free source texts<br />All of the above can likewise be downloaded from <a href="http://dumps.wikimedia.org/" rel="nofollow" target="_blank">http://dumps.wikimedia.org/</a>.<br />There are also some smaller wiki projects, for example:<br /><a href="http://simple.wikipedia.org/" rel="nofollow" target="_blank">http://simple.wikipedia.org</a>: a wiki written in Basic English, aimed at children and beginners<br /><a href="http://simple.wiktionary.org/" rel="nofollow" target="_blank">http://simple.wiktionary.org</a>: a Wiktionary written in Basic English<br /><br />There are many ways to process Wikipedia data; I particularly recommend these two tools:<br />jwpl: <a href="http://code.google.com/p/jwpl/" rel="nofollow" target="_blank">http://code.google.com/p/jwpl/</a><br />wikipedia-miner: <a href="http://wikipedia-miner.cms.waikato.ac.nz/wiki/" rel="nofollow" target="_blank">http://wikipedia-miner.cms.waikato.ac.nz/wiki/</a><br /><br />Next, a commercial wiki site: <a href="http://www.wikia.com/" rel="nofollow" target="_blank">http://www.wikia.com</a>, where users can create their own wiki sites. The top 250 Wikia wikis are listed at <a href="http://wikis.wikia.com/wiki/List_of_Wikia_wikis" rel="nofollow" target="_blank">http://wikis.wikia.com/wiki/List_of_Wikia_wikis</a>.<br />Wikia content is also available for download: <a href="http://community.wikia.com/wiki/Help:Database_download" rel="nofollow" target="_blank">http://community.wikia.com/wiki/Help:Database_download</a><br /><br />Freebase:<br />Freebase needs no introduction; the data can be downloaded here:<br /><a href="http://wiki.freebase.com/wiki/Data_dumps" rel="nofollow" target="_blank">http://wiki.freebase.com/wiki/Data_dumps</a>: Freebase's own data<br /><a href="http://wiki.freebase.com/wiki/WEX" rel="nofollow" target="_blank">http://wiki.freebase.com/wiki/WEX</a>: data Freebase extracted from Wikipedia<br /><br />YAGO2:<br /><a href="http://www.mpi-inf.mpg.de/yago-naga/yago/" rel="nofollow" target="_blank">http://www.mpi-inf.mpg.de/yago-naga/yago/</a><br /><br />DBpedia:<br /><a href="http://www.dbpedia.org/" rel="nofollow" target="_blank">http://www.dbpedia.org</a><br /><br />For Linked Data, try:<br /><a href="http://www.thedatahub.org/" rel="nofollow" target="_blank">http://www.thedatahub.org</a>: a large collection of Linked Data<br /><a href="http://linkeddata.org/" rel="nofollow" target="_blank">http://linkeddata.org/</a>: includes a diagram showing the relationships and influence of the various Linked Data sets<br /><br />For web APIs of all kinds: <a href="http://www.programmableweb.com/" rel="nofollow" target="_blank">http://www.programmableweb.com</a><br />Foreign governments are now opening up their data; here are several government open-data portals:<br /><a href="http://data.gov.au/" rel="nofollow" target="_blank">http://data.gov.au</a>: Australia<br /><a href="http://data.dc.gov/" rel="nofollow" target="_blank">http://data.dc.gov</a>: Washington, D.C. (District of Columbia), USA<br /><a href="http://www.data.gov/" rel="nofollow" target="_blank">http://www.data.gov</a>: United States<br /><a href="http://data.gov.uk/" rel="nofollow" target="_blank">http://data.gov.uk</a>: United Kingdom<br /><a href="http://databases.lapl.org/" rel="nofollow" target="_blank">http://databases.lapl.org/</a>: open datasets for the Los Angeles area; now you know why Silicon Valley is so strong<br /><a href="http://www.gov.hk/en/theme/psi/welcome" rel="nofollow" target="_blank">http://www.gov.hk/en/theme/psi/welcome</a>: the Hong Kong government has also released a great deal of data<br />Compare all the concrete work these governments have done, and ask what the good-for-nothings in the Great Hall of the People are doing.<br /><br /><a href="http://lexsrv3.nlm.nih.gov/LexSysGroup/Projects/lexAccess/current/web/download.html" rel="nofollow" target="_blank">http://lexsrv3.nlm.nih.gov/LexSysGroup/Projects/lexAccess/current/web/download.html</a>: vocabularies published by the US National Institutes of Health<br /><a href="http://www.census.gov/genealogy/www/data/2000surnames/index.html" rel="nofollow" target="_blank">http://www.census.gov/genealogy/www/data/2000surnames/index.html</a>: surname data from the US Census Bureau<br /><a href="https://www.cia.gov/library/publications/download/" rel="nofollow" target="_blank">https://www.cia.gov/library/publications/download/</a>: the CIA World Factbook, with profiles of countries around the world<br />When even the health, census, and intelligence agencies contribute this much to America's information infrastructure, the size of the gap between us and the US should be obvious.<br /><br />Thesauri:<br /><a href="http://www.nlm.nih.gov/mesh/filelist.html" rel="nofollow" target="_blank">http://www.nlm.nih.gov/mesh/filelist.html</a>: MeSH, a controlled vocabulary for medicine<br /><a href="http://id.loc.gov/download/" rel="nofollow" target="_blank">http://id.loc.gov/download/</a>: thesauri published by the US Library of Congress<br /><br />Some triple datasets:<br /><a href="http://www.cs.utexas.edu/users/pclark/dart/" rel="nofollow" target="_blank">http://www.cs.utexas.edu/users/pclark/dart/</a>: 23 million triples harvested from the BNC (British National Corpus) and Reuters<br /><a href="http://reverb.cs.washington.edu/" rel="nofollow" target="_blank">http://reverb.cs.washington.edu/</a>: a University of Washington project, 15 million triples<br /><a href="http://www.cs.washington.edu/research/sherlock-hornclauses/" rel="nofollow" target="_blank">http://www.cs.washington.edu/research/sherlock-hornclauses/</a>: roughly 2-3 million entries<br /><a href="http://www.cs.rochester.edu/research/knext" rel="nofollow" target="_blank">http://www.cs.rochester.edu/research/knext</a>: 5.35 million entries, from the BNC and the Brown corpus<br /><a href="http://rtw.ml.cmu.edu/rtw/resources" rel="nofollow" target="_blank">http://rtw.ml.cmu.edu/rtw/resources</a>: the Read the Web project; a relatively small dataset<br /><br />Machine-readable dictionaries:<br /><a href="http://wordnet.princeton.edu/" rel="nofollow" target="_blank">http://wordnet.princeton.edu/</a>: WordNet for English<br /><a href="http://nlpwww.nict.go.jp/wn-ja/index.en.html" rel="nofollow" target="_blank">http://nlpwww.nict.go.jp/wn-ja/index.en.html</a>: WordNet for Japanese<br /><a href="http://alpage.inria.fr/~sagot/wolf-en.html" rel="nofollow" target="_blank">http://alpage.inria.fr/~sagot/wolf-en.html</a>: WordNet for French<br /><a href="http://wordnet.ru/" rel="nofollow" target="_blank">http://wordnet.ru/</a>: WordNet for Russian<br /><a href="http://cl.haifa.ac.il/projects/mwn/index.shtml" rel="nofollow" target="_blank">http://cl.haifa.ac.il/projects/mwn/index.shtml</a>: WordNet for Hebrew<br /><a href="http://wordnet.dk/dannet/menu?item=2" rel="nofollow" target="_blank">http://wordnet.dk/dannet/menu?item=2</a>: WordNet for Danish<br /><a href="http://grial.uab.es/sensem/download?idioma=en" rel="nofollow" target="_blank">http://grial.uab.es/sensem/download?idioma=en</a>: WordNet for Spanish<br /><a href="http://www.ling.helsinki.fi/en/lt/research/finnwordnet/download.shtml" rel="nofollow" target="_blank">http://www.ling.helsinki.fi/en/lt/research/finnwordnet/download.shtml</a>: WordNet for Finnish<br />All of these WordNet versions are free to download. It is a shame that China, a civilization of five thousand years with an ocean of classical texts, still has no free, openly available machine-readable dictionary: a disgrace for the Chinese language, for China, and for the Chinese nation. What do the people at the CAS Institute of Computing Technology and Institute of Automation think of that? (And best wishes to HowNet for its flourishing business.)<br /><br /><a href="http://dico.fj.free.fr/dico.php" rel="nofollow" target="_blank">http://dico.fj.free.fr/dico.php</a>: Japanese-French dictionary<br /><a href="http://www.csse.monash.edu.au/~jwb/edict.html" rel="nofollow" target="_blank">http://www.csse.monash.edu.au/~jwb/edict.html</a>: Japanese-English dictionary<br /><a href="http://cc-cedict.org/wiki/start" rel="nofollow" target="_blank">http://cc-cedict.org/wiki/start</a>: a Chinese-to-English dictionary; finally one for Chinese, though sadly built by foreigners<br /><a href="https://framenet.icsi.berkeley.edu/" rel="nofollow" target="_blank">https://framenet.icsi.berkeley.edu</a>: based on frame semantics, so perhaps not strictly a dictionary, but there is nowhere else to put it<br /><br />Corpora:<br /><a href="http://opus.lingfil.uu.se/" rel="nofollow" target="_blank">http://opus.lingfil.uu.se/</a>: open parallel corpora<br /><a href="http://opus.lingfil.uu.se/OpenSubtitles_v2.php" rel="nofollow" target="_blank">http://opus.lingfil.uu.se/OpenSubtitles_v2.php</a>: a large collection of movie subtitles<br /><a href="http://www.statmt.org/europarl" rel="nofollow" target="_blank">http://www.statmt.org/europarl</a>: the European Parliament parallel corpus<br /><a href="http://www.anc.org/OANC/" rel="nofollow" target="_blank">http://www.anc.org/OANC/</a>: the Open American National Corpus<br /><br /><a href="http://snap.stanford.edu/data/" rel="nofollow" target="_blank">http://snap.stanford.edu/data/</a>: Stanford's SNAP project, with many crawled datasets; they are fairly old, so mainly of research value</p><img src ="http://www.cppblog.com/sunrise/aggbug/191408.html" width = "1" height = "1" /><br><br><div align=right><a style="text-decoration:none;" href="http://www.cppblog.com/sunrise/" target="_blank">SunRise_at</a> 2012-09-20 17:29 <a href="http://www.cppblog.com/sunrise/archive/2012/09/20/191408.html#Feedback" target="_blank" style="text-decoration:none;">Post a comment</a></div>]]></description></item><item><title>Notes on Setting Up a Wiki Mirror</title><link>http://www.cppblog.com/sunrise/archive/2012/08/21/187819.html</link><dc:creator>SunRise_at</dc:creator><author>SunRise_at</author><pubDate>Tue, 21 Aug 2012 01:22:00 GMT</pubDate><guid>http://www.cppblog.com/sunrise/archive/2012/08/21/187819.html</guid><wfw:comment>http://www.cppblog.com/sunrise/comments/187819.html</wfw:comment><comments>http://www.cppblog.com/sunrise/archive/2012/08/21/187819.html#Feedback</comments><slash:comments>0</slash:comments><wfw:commentRss>http://www.cppblog.com/sunrise/comments/commentRss/187819.html</wfw:commentRss><trackback:ping>http://www.cppblog.com/sunrise/services/trackbacks/187819.html</trackback:ping><description><![CDATA[<p
style="margin-bottom: 0cm">1. Apache, PHP5, and MySQL are required; then download the MediaWiki software.</p> <p style="margin-bottom: 0cm">I had never worked with any of these packages before, so every one of them had to be installed....</p> <p style="margin-bottom: 0cm">(1) Apache configuration</p> <p style="margin-bottom: 0cm">On Debian, after installation the package puts its configuration files under /etc/apache2:</p> <p>tony@tonybox:/etc/apache2$ ls -l</p> <p>total 72</p> <p>-rw-r--r-- 1 root root 12482 2006-01-16 18:15 apache2.conf</p> <p>drwxr-xr-x 2 root root 4096 2006-06-30 13:56 conf.d</p> <p>-rw-r--r-- 1 root root 748 2006-01-16 18:05 envvars</p> <p>-rw-r--r-- 1 root root 268 2006-06-30 13:56 httpd.conf</p> <p>-rw-r--r-- 1 root root 12441 2006-01-16 18:15 magic</p> <p>drwxr-xr-x 2 root root 4096 2006-06-30 13:56 mods-available</p> <p>drwxr-xr-x 2 root root 4096 2006-06-30 13:56 mods-enabled</p> <p>-rw-r--r-- 1 root root 10 2006-06-30 13:56 ports.conf</p> <p>-rw-r--r-- 1 root root 2266 2006-01-16 18:15 README</p> <p>drwxr-xr-x 2 root root 4096 2006-06-30 13:56 sites-available</p> <p>drwxr-xr-x 2 root root 4096 2006-06-30 13:56 sites-enabled</p> <p>drwxr-xr-x 2 root root 4096 2006-01-16 18:15&nbsp;</p> <p>Of these:</p> <p>apache2.conf is the main configuration file of the apache2 server. Looking inside it, you will find:</p> <p># Include module configuration:</p> <p>Include /etc/apache2/mods-enabled/*.load</p> <p>Include /etc/apache2/mods-enabled/*.conf</p> <p># Include all the user configurations:</p> <p>Include /etc/apache2/httpd.conf</p> <p># Include ports listing</p> <p>Include /etc/apache2/ports.conf</p> <p># Include generic snippets of statements</p> <p>Include /etc/apache2/conf.d/[^.#]*</p> <p>As this shows, apache2 splits its configuration into separate files by function, which makes it easier to manage.</p> <p>conf.d holds additional configuration snippets. By default only a charset snippet is provided:</p> <p>tony@tonybox:/etc/apache2/conf.d$ cat charset</p> <p>AddDefaultCharset UTF-8</p> <p>If needed, the default encoding can be changed to GB2312 by making the file read: AddDefaultCharset GB2312</p> <p>httpd.conf is an empty file.</p> <p>magic contains data for the mod_mime_magic module and normally does not need to be modified.</p> <p>ports.conf sets the IP addresses and ports the server listens on:</p> <p>tony@tonybox:/etc/apache2$ cat ports.conf</p> <p>Listen 80</p> <p>mods-available contains .conf and .load files, the configuration files for every module that can be loaded, while mods-enabled holds symbolic links to them. As apache2.conf shows, the server loads modules through the mods-enabled directory; that is, only those mods-available configuration files that have a symlink in mods-enabled are loaded. Two commands, a2enmod and a2dismod (provided by the apache2-common package), maintain these symlinks; their usage is simply a2enmod [module] or a2dismod [module].</p> <p>sites-available holds the configuration files of configured sites, and sites-enabled holds symbolic links to them; sites are enabled through these links. The symlinks in sites-enabled carry a numeric prefix such as 000-default, which determines the startup order: the smaller the number, the higher the priority. Two commands, a2ensite and a2dissite (also from apache2-common), maintain these symlinks.</p> <p>/var/www</p> <p>By default, the pages to be published should be placed under /var/www; this default can be changed with the DocumentRoot option in the main configuration file.</p> <p style="margin-bottom: 0cm">2. Extract MediaWiki directly into Apache's document root (that is, under /var/www) and rename the directory to wiki.</p> <p style="margin-bottom: 0cm">3. Then open the front page at localhost/wiki and run the MediaWiki installer, which creates the database wikidb containing 41 tables. Before importing data, first empty the page, revision, and text tables:</p> <p>delete from page;</p> <p>delete from revision;</p> <p>delete from text;</p> <p>4. At <a href="http://dumps.wikimedia.org/backup-index.html">http://dumps.wikimedia.org/backup-index.html</a> you can download the database XML dump for any language edition of the wiki. The files look like enwiki-20061130-pages-articles.xml.bz2 (the English edition); the dumps are refreshed roughly every two months.</p> <p>5. Install MediaWiki. Download the MediaWiki source code (if the official site is blocked, the Chinese site www.allwiki.com mirrors it), extract it into a directory Apache can serve, and set the config directory's permissions to 777. Then open config/index.php in a browser; after you answer some configuration questions, a LocalSettings.php file is generated in the config directory. Copy it to the parent directory, and finally remember to restore the config directory's original permissions.</p> <p>6. Import the file into the database with:<br />java -Xmx600M -server -jar mwdumper.jar --format=sql:1.5 enwiki-20061130-pages-articles.xml.bz2 | mysql -u wikiuser -p wikidb</p> <p>See: <a href="http://fuhao-987.iteye.com/blog/1044933">http://fuhao-987.iteye.com/blog/1044933</a></p> <p><a
href="http://jgs80.blog.163.com/blog/static/3566265320076177435762/">http://jgs80.blog.163.com/blog/static/3566265320076177435762/</a></p><img src ="http://www.cppblog.com/sunrise/aggbug/187819.html" width = "1" height = "1" /><br><br><div align=right><a style="text-decoration:none;" href="http://www.cppblog.com/sunrise/" target="_blank">SunRise_at</a> 2012-08-21 09:22 <a href="http://www.cppblog.com/sunrise/archive/2012/08/21/187819.html#Feedback" target="_blank" style="text-decoration:none;">Post a comment</a></div>]]></description></item><item><title>Penn Treebank Tags</title><link>http://www.cppblog.com/sunrise/archive/2012/07/31/185743.html</link><dc:creator>SunRise_at</dc:creator><author>SunRise_at</author><pubDate>Tue, 31 Jul 2012 05:31:00 GMT</pubDate><guid>http://www.cppblog.com/sunrise/archive/2012/07/31/185743.html</guid><wfw:comment>http://www.cppblog.com/sunrise/comments/185743.html</wfw:comment><comments>http://www.cppblog.com/sunrise/archive/2012/07/31/185743.html#Feedback</comments><slash:comments>0</slash:comments><wfw:commentRss>http://www.cppblog.com/sunrise/comments/commentRss/185743.html</wfw:commentRss><trackback:ping>http://www.cppblog.com/sunrise/services/trackbacks/185743.html</trackback:ping><description><![CDATA[<p>Note: This information comes from "Bracketing Guidelines for Treebank II Style Penn Treebank Project" - part of the documentation that comes with the Penn Treebank.</p>  <p>&nbsp;</p>  <p>Contents:</p>  <p>Bracket Labels</p>  <p>Clause Level</p>  <p>Phrase Level</p>  <p>Word Level</p>  <p>Function Tags</p>  <p>Form/function discrepancies</p>  <p>Grammatical role</p>  <p>Adverbials</p>  <p>Miscellaneous</p>  <p>Index of All Tags</p>  <p>Bracket Labels</p>  <p>Clause Level</p>  <p>S - simple declarative clause,
i.e. one that is not introduced by a (possibly empty) subordinating conjunction or a wh-word and that does not exhibit subject-verb inversion.</p>  <p>SBAR - Clause introduced by a (possibly empty) subordinating conjunction.</p>  <p>SBARQ - Direct question introduced by a wh-word or a wh-phrase. Indirect questions and relative clauses should be bracketed as SBAR, not SBARQ.</p>  <p>SINV - Inverted declarative sentence, i.e. one in which the subject follows the tensed verb or modal.</p>  <p>SQ - Inverted yes/no question, or main clause of a wh-question, following the wh-phrase in SBARQ.</p>  <p>&nbsp;</p>  <p>Phrase Level</p>  <p>ADJP - Adjective Phrase.</p>  <p>ADVP - Adverb Phrase.</p>  <p>CONJP - Conjunction Phrase.</p>  <p>FRAG - Fragment.</p>  <p>INTJ - Interjection. Corresponds approximately to the part-of-speech tag UH.</p>  <p>LST - List marker. Includes surrounding punctuation.</p>  <p>NAC - Not a Constituent; used to show the scope of certain prenominal modifiers within an NP.</p>  <p>NP - Noun Phrase.</p>  <p>NX - Used within certain complex NPs to mark the head of the NP. Corresponds very roughly to N-bar level but used quite differently.</p>  <p>PP - Prepositional Phrase.</p>  <p>PRN - Parenthetical.</p>  <p>PRT - Particle. Category for words that should be tagged RP.</p>  <p>QP - Quantifier Phrase (i.e. complex measure/amount phrase); used within NP.</p>  <p>RRC - Reduced Relative Clause.</p>  <p>UCP - Unlike Coordinated Phrase.</p>  <p>VP - Verb Phrase.</p>  <p>WHADJP - Wh-adjective Phrase. Adjectival phrase containing a wh-adverb, as in how hot.</p>  <p>WHADVP - Wh-adverb Phrase. Introduces a clause with an NP gap. May be null (containing the 0 complementizer) or lexical, containing a wh-adverb such as how or why.</p>  <p>WHNP - Wh-noun Phrase. Introduces a clause with an NP gap. May be null (containing the 0 complementizer) or lexical, containing some wh-word, e.g.
who, which book, whose daughter, none of which, or how many leopards.</p>  <p>WHPP - Wh-prepositional Phrase. Prepositional phrase containing a wh-noun phrase (such as of which or by whose authority) that either introduces a PP gap or is contained by a WHNP.</p>  <p>X - Unknown, uncertain, or unbracketable. X is often used for bracketing typos and in bracketing the...the-constructions.</p>  <p>Word Level</p>  <p>CC - Coordinating conjunction</p>  <p>CD - Cardinal number</p>  <p>DT - Determiner</p>  <p>EX - Existential there</p>  <p>FW - Foreign word</p>  <p>IN - Preposition or subordinating conjunction</p>  <p>JJ - Adjective</p>  <p>JJR - Adjective, comparative</p>  <p>JJS - Adjective, superlative</p>  <p>LS - List item marker</p>  <p>MD - Modal</p>  <p>NN - Noun, singular or mass</p>  <p>NNS - Noun, plural</p>  <p>NNP - Proper noun, singular</p>  <p>NNPS - Proper noun, plural</p>  <p>PDT - Predeterminer</p>  <p>POS - Possessive ending</p>  <p>PRP - Personal pronoun</p>  <p>PRP$ - Possessive pronoun (prolog version PRP-S)</p>  <p>RB - Adverb</p>  <p>RBR - Adverb, comparative</p>  <p>RBS - Adverb, superlative</p>  <p>RP - Particle</p>  <p>SYM - Symbol</p>  <p>TO - to</p>  <p>UH - Interjection</p>  <p>VB - Verb, base form</p>  <p>VBD - Verb, past tense</p>  <p>VBG - Verb, gerund or present participle</p>  <p>VBN - Verb, past participle</p>  <p>VBP - Verb, non-3rd person singular present</p>  <p>VBZ - Verb, 3rd person singular present</p>  <p>WDT - Wh-determiner</p>  <p>WP - Wh-pronoun</p>  <p>WP$ - Possessive wh-pronoun (prolog version WP-S)</p>  <p>WRB - Wh-adverb</p>  <p>Function Tags</p>  <p>Form/function discrepancies</p>  <p>-ADV (adverbial) - marks a constituent other than ADVP or PP when it is used adverbially (e.g. NPs or free ("headless") relatives). However, constituents that themselves are modifying an ADVP generally do not get -ADV. If a more specific tag is available (for example, -TMP) then it is used alone and -ADV is implied.
See the Adverbials section.</p>  <p>-NOM (nominal) - marks free ("headless") relatives and gerunds when they act nominally.</p>  <p>Grammatical role</p>  <p>-DTV (dative) - marks the dative object in the unshifted form of the double object construction. If the preposition introducing the "dative" object is for, it is considered benefactive (-BNF). -DTV (and -BNF) is only used after verbs that can undergo dative shift.</p>  <p>-LGS (logical subject) - is used to mark the logical subject in passives. It attaches to the NP object of by and not to the PP node itself.</p>  <p>-PRD (predicate) - marks any predicate that is not VP. In the do so construction, the so is annotated as a predicate.</p>  <p>-PUT - marks the locative complement of put.</p>  <p>-SBJ (surface subject) - marks the structural surface subject of both matrix and embedded clauses, including those with null subjects.</p>  <p>-TPC ("topicalized") - marks elements that appear before the subject in a declarative sentence, but in two cases only:</p>  <p>if the fronted element is associated with a *T* in the position of the gap.</p>  <p>if the fronted element is left-dislocated (i.e. it is associated with a resumptive pronoun in the position of the gap).</p>  <p>-VOC (vocative) - marks nouns of address, regardless of their position in the sentence. It is not coindexed to the subject and does not get -TPC when it is sentence-initial.</p>  <p>Adverbials</p>  <p>Adverbials are generally VP adjuncts.</p>  <p>-BNF (benefactive) - marks the beneficiary of an action (attaches to NP or PP).</p>  <p>This tag is used only when (1) the verb can undergo dative shift and (2) the prepositional variant (with the same meaning) uses for. The prepositional objects of dative-shifting verbs with other prepositions than for (such as to or of) are annotated -DTV.</p>  <p>&nbsp;</p>  <p>-DIR (direction) - marks adverbials that answer the questions "from where?" and "to where?"
It implies motion, which can be metaphorical as in "...rose 5 pts. to 57-1/2" or "increased 70% to 5.8 billion yen". -DIR is most often used with verbs of motion/transit and financial verbs.</p>  <p>-EXT (extent) - marks adverbial phrases that describe the spatial extent of an activity. -EXT was incorporated primarily for cases of movement in financial space, but is also used in analogous situations elsewhere. Obligatory complements do not receive -EXT. Words such as fully and completely are absolutes and do not receive -EXT.</p>  <p>-LOC (locative) - marks adverbials that indicate place/setting of the event. -LOC may also indicate metaphorical location. There is likely to be some variation in the use of -LOC due to differing annotator interpretations. In cases where the annotator is faced with a choice between -LOC or -TMP, the default is -LOC. In cases involving SBAR, SBAR should not receive -LOC. -LOC has some uses that are not adverbial, such as with place names that are adjoined to other NPs and NAC-LOC premodifiers of NPs. The special tag -PUT is used for the locative argument of put.</p>  <p>-MNR (manner) - marks adverbials that indicate manner, including instrument phrases.</p>  <p>-PRP (purpose or reason) - marks purpose or reason clauses and PPs.</p>  <p>-TMP (temporal) - marks temporal or aspectual adverbials that answer the questions when, how often, or how long. It has some uses that are not strictly adverbial, such as with dates that modify other NPs at S- or VP-level. In cases of apposition involving SBAR, the SBAR should not be labeled -TMP. Only in "financialspeak," and only when the dominating PP is a PP-DIR, may temporal modifiers be put at PP object level. Note that -TMP is not used in possessive phrases.</p>  <p>&nbsp;</p>  <p>Miscellaneous</p>  <p>-CLR (closely related) - marks constituents that occupy some middle ground between arguments and adjuncts of the verb phrase.
These roughly correspond to "predication adjuncts", prepositional ditransitives, and some "phrasal verbs". Although constituents marked with -CLR are not strictly speaking complements, they are treated as complements whenever it makes a bracketing difference. The precise meaning of -CLR depends somewhat on the category of the phrase.</p>  <p>on S or SBAR - These categories are usually arguments, so the -CLR tag indicates that the clause is more adverbial than normal clausal arguments. The most common case is the infinitival semi-complement of use, but there are a variety of other cases.</p>  <p>on PP, ADVP, SBAR-PRP, etc - On categories that are ordinarily interpreted as (adjunct) adverbials, -CLR indicates a somewhat closer relationship to the verb. For example:</p>  <p>Prepositional Ditransitives</p>  <p>In order to ensure consistency, the Treebank recognizes only a limited class of verbs that take more than one complement (-DTV and -PUT and Small Clauses). Verbs that fall outside these classes (including most of the prepositional ditransitive verbs in class [D2]) are often associated with -CLR.</p>  <p>Phrasal verbs</p>  <p>Phrasal verbs are also annotated with -CLR or a combination of -PRT and PP-CLR. Words that are considered borderline between particle and adverb are often bracketed with ADVP-CLR.</p>  <p>Predication Adjuncts</p>  <p>Many of Quirk's predication adjuncts are annotated with -CLR.</p>  <p>on NP - To the extent that -CLR is used on NPs, it indicates that the NP is part of some kind of "fixed phrase" or expression, such as take care of. Variation is more likely for NPs than for other uses of -CLR.</p>  <p>-CLF (cleft) - marks it-clefts ("true clefts") and may be added to the labels S, SINV, or SQ.</p>  <p>-HLN (headline) - marks headlines and datelines.
Note that headlines and datelines always constitute a unit of text that is structurally independent from the following sentence.</p>  <p>-TTL (title) - is attached to the top node of a title when this title appears inside running text. -TTL implies -NOM. The internal structure of the title is bracketed as usual.</p>  <p>Index of All Tags</p>  <p>ADJP, -ADV, ADVP, -BNF, CC, CD, -CLF, -CLR, CONJP, -DIR, DT, -DTV, EX, -EXT, FRAG, FW, -HLN, IN, INTJ, JJ, JJR, JJS, -LGS, -LOC, LS, LST, MD, -MNR, NAC, NN, NNS, NNP, NNPS, -NOM, NP, NX, PDT, POS, PP, -PRD, PRN, PRP, -PRP, PRP$ or PRP-S, PRT, -PUT, QP, RB, RBR, RBS, RP, RRC, S, SBAR, SBARQ, -SBJ, SINV, SQ, SYM, -TMP, TO, -TPC, -TTL, UCP, UH, VB, VBD, VBG, VBN, VBP, VBZ, -VOC, VP, WDT, WHADJP, WHADVP, WHNP, WHPP, WP, WP$ or WP-S, WRB, X</p><p>&nbsp;</p><img src ="http://www.cppblog.com/sunrise/aggbug/185743.html" width = "1" height = "1" /><br><br><div align=right><a style="text-decoration:none;" href="http://www.cppblog.com/sunrise/" target="_blank">SunRise_at</a> 2012-07-31 13:31 <a href="http://www.cppblog.com/sunrise/archive/2012/07/31/185743.html#Feedback" target="_blank" style="text-decoration:none;">Post a comment</a></div>]]></description></item><item><title>Recall and Precision</title><link>http://www.cppblog.com/sunrise/archive/2012/07/23/184693.html</link><dc:creator>SunRise_at</dc:creator><author>SunRise_at</author><pubDate>Mon, 23 Jul 2012 01:41:00
GMT</pubDate><guid>http://www.cppblog.com/sunrise/archive/2012/07/23/184693.html</guid><wfw:comment>http://www.cppblog.com/sunrise/comments/184693.html</wfw:comment><comments>http://www.cppblog.com/sunrise/archive/2012/07/23/184693.html#Feedback</comments><slash:comments>0</slash:comments><wfw:commentRss>http://www.cppblog.com/sunrise/comments/commentRss/184693.html</wfw:commentRss><trackback:ping>http://www.cppblog.com/sunrise/services/trackbacks/184693.html</trackback:ping><description><![CDATA[Reposted from: <a href="http://uwei.blogbus.com/logs/11424864.html" target="_blank">http://uwei.blogbus.com/logs/11424864.html</a><br />As an outsider to the internet industry, many concepts are unfamiliar to me. Take the most basic ones, "recall" and "precision": reading material online gave me a rough idea, and when I use the terms myself I can reason my way to their meaning, but when someone else uses them fluently I still cannot react at once and have to think them through again. So I gathered some material and organized these two concepts here, hoping to get more fluent with them.
<p>Recall and precision are two important concepts and metrics in the design of search engines (and other retrieval systems).<br />Recall (召回率) is also called the "completeness" of retrieval;<br />Precision (准确率) is also called "accuracy" or "correctness".<br />When retrieving documents from a large collection, every document in the collection falls into one of four classes:</p>
<table border="0" cellspacing="0" cellpadding="0" width="258" align="left" height="79">
<tbody>
<tr>
<td height="27" width="96">&nbsp;</td>
<td width="74">
<div align="center">Relevant</div></td>
<td width="74">
<div align="center">Not relevant</div></td></tr>
<tr>
<td height="25">
<div align="center">Retrieved</div></td>
<td bgcolor="#6633cc">
<div align="center"><strong>A</strong></div></td>
<td bgcolor="#ffff00">
<div align="center"><strong>B</strong></div></td></tr>
<tr>
<td height="23">
<div align="center">Not retrieved</div></td>
<td bgcolor="#66ff66">
<div align="center"><strong>C</strong></div></td>
<td bgcolor="#66ccff">
<div align="center"><strong>D</strong></div></td></tr></tbody></table>
<p>&nbsp;</p>
<p>&nbsp;</p>
<div align="left">
<p>&nbsp;</p>
<p>A: retrieved and relevant (found, and wanted)<br />B: retrieved but not relevant (found, but useless)<br />C: not retrieved, yet relevant (not found, but actually wanted)<br />D: not retrieved and not relevant (not found, and not wanted)</p>
<p>Usually we want as many as possible of the relevant documents in the collection to be retrieved; this is the pursuit of recall, A/(A+C): the larger, the better.<br />At the same time we want as many as possible of the retrieved documents to be relevant, and as few as possible to be irrelevant; this is the pursuit of precision, A/(A+B): the larger, the better.<br />&nbsp; <br />In summary:<br />Recall = relevant documents retrieved / all relevant documents in the collection<br />Precision = relevant documents retrieved / all documents retrieved<br />&nbsp; <br />Although recall and precision are not tied to each other by necessity (as the formulas show), on large data sets the two metrics do constrain each other.<br />Because the retrieval strategy is never perfect, when we relax it in the hope of retrieving more relevant documents, some irrelevant results tend to come along as well, and precision suffers.<br />And when we want to remove the irrelevant documents from the results, we must tighten the strategy, which also drops some relevant documents, and recall suffers.</p></div>
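The A/B/C/D cells and the two formulas above can be checked with a tiny Python sketch; the counts are invented purely for illustration:

```python
# Counts for the four cells of the table above:
# A = retrieved & relevant, B = retrieved & not relevant,
# C = not retrieved & relevant, D = not retrieved & not relevant.
A, B, C, D = 30, 20, 10, 940  # hypothetical counts

recall = A / (A + C)     # fraction of all relevant docs that were retrieved
precision = A / (A + B)  # fraction of retrieved docs that are relevant

print(f"recall = {recall:.2f}")       # 30 / 40 = 0.75
print(f"precision = {precision:.2f}") # 30 / 50 = 0.60
```

Relaxing the strategy grows A+B faster than A (precision drops); tightening it shrinks A along with B (recall drops), which is exactly the trade-off described above.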
<p>Any retrieval or selection over a large data set involves these two metrics, recall and precision. And because the two constrain each other, we usually tune the retrieval strategy to a suitable operating point for the task at hand, neither too strict nor too loose, seeking a balance between recall and precision. Where that balance lies is decided by the concrete requirements.</p>
<p>Precision is actually fairly easy to grasp. The one that is hard to parse at a glance is "recall" (召回率), and I think the Chinese rendering is partly to blame: in Chinese, 召回 means "to order something back", as when Sony recalls defective batteries, so the metric's meaning cannot be read off the term.<br />Since the translation is unhelpful, look back at the English word "recall": besides "order sth. to return", it also means "remember".</p>
<p>Recall: the ability to remember sth. that you have learned or sth. that has happened in the past.</p>
<p>这里，recall应该是这个意思，这样就更容易理解&#8220;召回率&#8221;的意思了。<br />当我们问检索系统某一件事的所有细节时（输入检索query），Recall就是指：检索系统能&#8220;回忆&#8221;起那些事的多少细节，通俗来讲就是&#8220;回忆的能力&#8221;。能回忆起来的细节数 除以 系统知道这件事的所有细节，就是&#8220;记忆率&#8221;，也就是recall&#8212;&#8212;召回率。<br />&nbsp; <br />这样想，要容易的多了。</p><img src ="http://www.cppblog.com/sunrise/aggbug/184693.html" width = "1" height = "1" /><br><br><div align=right><a style="text-decoration:none;" href="http://www.cppblog.com/sunrise/" target="_blank">SunRise_at</a> 2012-07-23 09:41 <a href="http://www.cppblog.com/sunrise/archive/2012/07/23/184693.html#Feedback" target="_blank" style="text-decoration:none;">发表评论</a></div>]]></description></item><item><title>《数学之美》－－马尔可夫链</title><link>http://www.cppblog.com/sunrise/archive/2012/07/12/182950.html</link><dc:creator>SunRise_at</dc:creator><author>SunRise_at</author><pubDate>Thu, 12 Jul 2012 00:54:00 GMT</pubDate><guid>http://www.cppblog.com/sunrise/archive/2012/07/12/182950.html</guid><wfw:comment>http://www.cppblog.com/sunrise/comments/182950.html</wfw:comment><comments>http://www.cppblog.com/sunrise/archive/2012/07/12/182950.html#Feedback</comments><slash:comments>4</slash:comments><wfw:commentRss>http://www.cppblog.com/sunrise/comments/commentRss/182950.html</wfw:commentRss><trackback:ping>http://www.cppblog.com/sunrise/services/trackbacks/182950.html</trackback:ping><description><![CDATA[&nbsp; &nbsp; 最近在读的一本书《数学之美》，由于自己对马尔可夫链缺乏相关的知识背景，故学习了一下。对于N久没有看过概率论的人来说，重拾起来也花费了一点时间。比如：P(A|B)是指在B的条件下A的概率，诸如此类，都需要重新复习一下，正所谓温故而知新。知道这个了，也就不难理解马尔可夫链的性质，即<span style="font-family: sans-serif; ">每一步可以移动到任何一个相邻的点，在这里移动到每一个点的概率都是相同的。<br /></span><p>&nbsp;&nbsp; &nbsp;关于马尔可夫链的定义：&nbsp;<a href="http://zh.wikipedia.org/wiki/%E9%A6%AC%E5%8F%AF%E5%A4%AB%E9%8F%88">http://zh.wikipedia.org/wiki/%E9%A6%AC%E5%8F%AF%E5%A4%AB%E9%8F%88<br />&nbsp;</a>&nbsp; &nbsp;&nbsp;<br />&nbsp; &nbsp; 隐含马尔可夫模型是上述马尔可夫链的一个扩展：任何一个时刻t的状态St是不可见的。隐含马尔可夫模型在每一个时刻t会输出一个符号，而且这个符合和st相关，而且仅和st相关，这个被称为独立输出假设。关于隐含马尔可夫模型的成功应用可以参见吴军的《数学之美》第5章的内容。<br />&nbsp; 
&nbsp;Well, it is almost time for work, so just a quick note. Back to coding......</p><img src ="http://www.cppblog.com/sunrise/aggbug/182950.html" width = "1" height = "1" /><br><br><div align=right><a style="text-decoration:none;" href="http://www.cppblog.com/sunrise/" target="_blank">SunRise_at</a> 2012-07-12 08:54 <a href="http://www.cppblog.com/sunrise/archive/2012/07/12/182950.html#Feedback" target="_blank" style="text-decoration:none;">发表评论</a></div>]]></description></item><item><title>统计自然语言处理--互信息</title><link>http://www.cppblog.com/sunrise/archive/2012/06/01/177044.html</link><dc:creator>SunRise_at</dc:creator><author>SunRise_at</author><pubDate>Fri, 01 Jun 2012 05:06:00 GMT</pubDate><guid>http://www.cppblog.com/sunrise/archive/2012/06/01/177044.html</guid><wfw:comment>http://www.cppblog.com/sunrise/comments/177044.html</wfw:comment><comments>http://www.cppblog.com/sunrise/archive/2012/06/01/177044.html#Feedback</comments><slash:comments>2</slash:comments><wfw:commentRss>http://www.cppblog.com/sunrise/comments/commentRss/177044.html</wfw:commentRss><trackback:ping>http://www.cppblog.com/sunrise/services/trackbacks/177044.html</trackback:ping><description><![CDATA[<div style="layout-grid:  15.6pt none" class="Section0">
<p style="margin-top: 0pt; margin-bottom: 0pt" class="p0">&nbsp;&nbsp;&nbsp;It is June 1st (Children's Day) and C小加 is not around, the rascal. For work I need to read Manning's Foundations of Statistical Natural Language Processing, and the task involves mutual information. Names like this always sound profound to me, yet once I get into them they turn out not to be that hard.</p>
<p style="margin-top: 0pt; margin-bottom: 0pt" class="p0">Collocations</p>
<p style="margin-top: 0pt; margin-bottom: 0pt" class="p0">A collocation is characterized by limited compositionality.</p>
<p style="margin-top: 0pt; margin-bottom: 0pt" class="p0">There are three kinds of methods for identifying collocation pairs: 1. identification using frequency information; 2. identification based on the mean and variance of the distance between the head word and its collocate; 3. identification based on hypothesis testing and mutual information.</p>
<p style="margin-top: 0pt; margin-bottom: 0pt" class="p0">1. Frequency</p>
<p style="margin-top: 0pt; margin-bottom: 0pt" class="p0">After filtering the corpus down to verbs and nouns, pair the words up and count how many times each pair occurs together within a sentence or within a paragraph; this count is the frequency.</p>
<p style="margin-top: 0pt; margin-bottom: 0pt" class="p0">2. Mean and variance</p>
<p style="margin-top: 0pt; margin-bottom: 0pt" class="p0">&nbsp;&nbsp;&nbsp;Since the distance between two words can vary, we compute the mean and the variance of the offsets between them.</p>
<p style="margin-top: 0pt; margin-bottom: 0pt" class="p0">The mean is simply the average offset.</p>
<p style="margin-top: 0pt; margin-bottom: 0pt" class="p0">The variance measures how far the individual offsets deviate from the mean:</p>
<p style="margin-top: 0pt; margin-bottom: 0pt" class="p0"><img border="0" alt="variance formula" src="http://www.cppblog.com/images/cppblog_com/sunrise/QQ截图20120601103629.png" width="232" height="140" /><br />&nbsp;Here d<sub>i</sub> is the offset of co-occurrence i, and the barred d denotes the mean of the sample offsets.&nbsp;</p>
<p style="margin-top: 0pt; margin-bottom: 0pt" class="p0">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;We can use this information to discover collocations: look for word pairs whose offsets have low deviation. A low deviation means the two words usually appear roughly the same distance apart; zero deviation means they always appear at exactly the same distance.</p>
<p style="margin-top: 0pt; margin-bottom: 0pt" class="p0">&nbsp;&nbsp;&nbsp;The variance is thus a measure of how peaked the distribution of one word is around the other.</p>
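The mean-and-variance method above can be sketched in a few lines of Python; the offset sample is made up for illustration, and the n - 1 (sample variance) denominator is an assumption about the exact formula behind the screenshot:

```python
# Signed offsets at which a candidate collocate appears relative to the
# head word across a corpus; hypothetical sample for illustration.
offsets = [2, 2, 3, 2, 1, 2]

n = len(offsets)
mean = sum(offsets) / n  # average offset d-bar
# Sample variance: mean squared deviation from the mean, n - 1 denominator.
variance = sum((d - mean) ** 2 for d in offsets) / (n - 1)
deviation = variance ** 0.5  # standard deviation

print(mean, variance, round(deviation, 3))  # 2.0 0.4 0.632
```

A pair with a small deviation like this one tends to occur at a fixed distance and is therefore a good collocation candidate.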
<p style="margin-top: 0pt; margin-bottom: 0pt" class="p0">On mutual information</p>
<p style="margin-top: 0pt; margin-bottom: 0pt" class="p0">The formula for (pointwise) mutual information is:</p>
<p style="margin-top: 0pt; margin-bottom: 0pt" class="p0">MI(a,b) = log( p(a,b) / (p(a) * p(b)) )</p>
<p style="margin-top: 0pt; margin-bottom: 0pt" class="p0">where the logarithm is base 2 and p(x) denotes the probability that x occurs.</p>
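The formula is easy to turn into code. A small Python sketch, with all corpus counts invented for illustration:

```python
import math

# Pointwise mutual information: MI(a,b) = log2( p(a,b) / (p(a) * p(b)) ),
# with probabilities estimated from corpus counts (all counts hypothetical).
N = 100_000      # total tokens in the corpus
count_a = 1_000  # occurrences of word a
count_b = 500    # occurrences of word b
count_ab = 100   # co-occurrences of a and b

p_a, p_b, p_ab = count_a / N, count_b / N, count_ab / N
mi = math.log2(p_ab / (p_a * p_b))

# p_ab / (p_a * p_b) = 0.001 / 0.00005 = 20, so MI = log2(20)
print(round(mi, 2), "bits")  # 4.32 bits
```

A positive MI means the pair co-occurs far more often than chance would predict, which is what makes it a collocation candidate.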
<p style="margin-top: 0pt; margin-bottom: 0pt" class="p0">Well then: light, and simple. Off to write the code.</p></div><!--EndFragment-->  <img src ="http://www.cppblog.com/sunrise/aggbug/177044.html" width = "1" height = "1" /><br><br><div align=right><a style="text-decoration:none;" href="http://www.cppblog.com/sunrise/" target="_blank">SunRise_at</a> 2012-06-01 13:06 <a href="http://www.cppblog.com/sunrise/archive/2012/06/01/177044.html#Feedback" target="_blank" style="text-decoration:none;">发表评论</a></div>]]></description></item><item><title>基于HowNet语义相似度的ＦＡＱ的研究</title><link>http://www.cppblog.com/sunrise/archive/2012/04/12/171089.html</link><dc:creator>SunRise_at</dc:creator><author>SunRise_at</author><pubDate>Thu, 12 Apr 2012 08:19:00 GMT</pubDate><guid>http://www.cppblog.com/sunrise/archive/2012/04/12/171089.html</guid><wfw:comment>http://www.cppblog.com/sunrise/comments/171089.html</wfw:comment><comments>http://www.cppblog.com/sunrise/archive/2012/04/12/171089.html#Feedback</comments><slash:comments>1</slash:comments><wfw:commentRss>http://www.cppblog.com/sunrise/comments/commentRss/171089.html</wfw:commentRss><trackback:ping>http://www.cppblog.com/sunrise/services/trackbacks/171089.html</trackback:ping><description><![CDATA[<p>&nbsp;&nbsp;&nbsp; A new task, following the construction of the synonym lexicon: read the literature and work out a solution to our problem. The papers I found all study sentence-to-sentence similarity, whereas our key problem is word-to-sentence similarity. FAQ systems are said to be a hot research topic in natural language processing; I read several papers, and they all feel much alike.<br />&nbsp;&nbsp; Since this is my first contact with this material, there were many unfamiliar terms, so I looked them up myself.<br />&nbsp;&nbsp; On HowNet, see <a href="http://www.keenage.com/zhiwang/c_zhiwang.html">http://www.keenage.com/zhiwang/c_zhiwang.html</a><br />&nbsp;&nbsp; The core problem of an automatic FAQ system is how to quickly compare a customer's question against the questions in the FAQ database and determine the most similar one; if there is a match, the corresponding answer is returned to the customer.<br
/>&nbsp;<img border="0" alt="FAQ system architecture" src="http://www.cppblog.com/images/cppblog_com/sunrise/QQ截图20120412143643.jpg" width="559" longdesc="FAQ系统结构" height="215" /><br /><br />FAQ system architecture diagram<br />&nbsp;&nbsp;&nbsp;&nbsp; &nbsp;The similarity computation proceeds bottom-up: first sememe similarity, then concept similarity, then word similarity, and finally sentence similarity.<br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;The formulas cannot be displayed here, so the similarity computations are attached: <a href="/Files/sunrise/相似度.doc">/Files/sunrise/相似度.doc</a><br />&nbsp;&nbsp;&nbsp;&nbsp; The FAQ work has gone about this far. A beginner's article by a programming beginner, and this beginner will keep at it.</p>
<p>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<br />&nbsp; <br /></p> <img src ="http://www.cppblog.com/sunrise/aggbug/171089.html" width = "1" height = "1" /><br><br><div align=right><a style="text-decoration:none;" href="http://www.cppblog.com/sunrise/" target="_blank">SunRise_at</a> 2012-04-12 16:19 <a href="http://www.cppblog.com/sunrise/archive/2012/04/12/171089.html#Feedback" target="_blank" style="text-decoration:none;">发表评论</a></div>]]></description></item><item><title>自然语言处理相关书籍及其他资源</title><link>http://www.cppblog.com/sunrise/archive/2012/03/28/169243.html</link><dc:creator>SunRise_at</dc:creator><author>SunRise_at</author><pubDate>Wed, 28 Mar 2012 01:35:00 GMT</pubDate><guid>http://www.cppblog.com/sunrise/archive/2012/03/28/169243.html</guid><wfw:comment>http://www.cppblog.com/sunrise/comments/169243.html</wfw:comment><comments>http://www.cppblog.com/sunrise/archive/2012/03/28/169243.html#Feedback</comments><slash:comments>1</slash:comments><wfw:commentRss>http://www.cppblog.com/sunrise/comments/commentRss/169243.html</wfw:commentRss><trackback:ping>http://www.cppblog.com/sunrise/services/trackbacks/169243.html</trackback:ping><description><![CDATA[<p style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; font-family: Arial; line-height: 26px; text-align: left; ">特别推荐：<br />1、<a href="http://ishare.iask.sina.com.cn/f/22444899.html" target="_blank" style="color: #ca0000; text-decoration: none; ">HMM学习最佳范例</a>全文文档<br />2、<a href="http://ishare.iask.sina.com.cn/f/22444904.html" target="_blank" style="color: #ca0000; text-decoration: none; ">无约束最优化</a>全文文档</p><p style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; font-family: Arial; line-height: 26px; text-align: left; ">一、书籍：<br />1、<a 
href="http://ishare.iask.sina.com.cn/f/22444344.html" target="_blank" style="color: #ca0000; text-decoration: none; ">《自然语言处理综论》英文版第二版</a><br />2、<a href="http://ishare.iask.sina.com.cn/f/22445062.html" target="_blank" style="color: #ca0000; text-decoration: none; ">《统计自然语言处理基础》英文版</a><br />3、<a href="http://ishare.iask.sina.com.cn/f/22442315.html" target="_blank" style="color: #ca0000; text-decoration: none; ">《用Python进行自然语言处理》，NLTK配套书</a><br />4、<a href="http://ishare.iask.sina.com.cn/f/22444405.html" target="_blank" style="color: #ca0000; text-decoration: none; ">《Learning Python第三版》</a>，Python入门经典书籍，详细而不厌其烦<br />5、<a href="http://www.xun6.com/file/5b2fa5c16/Pattern+Recognition+in+Speech+and+Language+Processing.pdf.html" target="_blank" style="color: #ca0000; text-decoration: none; ">《自然语言处理中的模式识别》</a><br />6、<a href="http://www.xun6.com/file/fe894b416/Mc+Lachlan%2C+Krishnan+-+the+EM+algorithm+and+extensions.pdf.html" target="_blank" style="color: #ca0000; text-decoration: none; ">《EM算法及其扩展》</a><br />7、<a href="http://www.xun6.com/file/7d914ba16/The+Elements+of+Statistical+Learning.pdf.html" target="_blank" style="color: #ca0000; text-decoration: none; ">《统计学习基础》</a><br />8、《<a href="http://www.xun6.com/file/49f979f18/Natural_Language_Understanding.chm.html" target="_blank" style="color: #ca0000; text-decoration: none; ">自然语言理解</a>》英文版（似乎只有前9章）<br />9、<a href="http://www.xun6.com/file/2fe8ed418/Fundamentals_of_Speech_Recognition.pdf.html" target="_blank" style="color: #ca0000; text-decoration: none; ">《Fundamentals of Speech Recognition》</a>，质量不太好，不过第6章关于HMM的部分比较详细，作者之一便是Lawrence Rabiner；<br />10、概率统计经典入门书：《概率论及其应用》（英文版，威廉*费勒著）<br />　　<a href="http://www.xun6.com/file/1d285e118/An+Introduction+to+Probability+Theory+and+Its+Applications%2C+Volume+1.djvu.html" target="_blank" style="color: #ca0000; text-decoration: none; ">第一卷</a>　　<a href="http://www.xun6.com/file/7237f1618/An+Introduction+to+Probability+Theory+and+Its+Applications%2C+Volume+2.djvu.html" 
target="_blank" style="color: #ca0000; text-decoration: none; ">第二卷</a>　　<a href="http://www.xun6.com/file/f034c6418/DjVuLibre%2BDjView-3.5.22%2B4.5-Setup.exe.html" target="_blank" style="color: #ca0000; text-decoration: none; ">DjVuLibre阅读器</a>（阅读前两卷书需要）<br />11、一本利用Perl和Prolog进行自然语言处理的介绍书籍：《<a href="http://www.xun6.com/file/0d80185e5/An+Introduction+to+Language+Processing+with+Perl+and+Prolog.pdf.html" target="_blank" style="color: #ca0000; text-decoration: none; ">An Introduction to Language Processing with Perl and Prolog</a>》<br />12、国外机器学习书籍之：<br />　1) &#8220;<a href="http://www.xun6.com/file/9e6fc1914/OReilly_Programming_Collective_Intelligence.pdf.html" target="_blank" style="color: #ca0000; text-decoration: none; ">Programming Collective Intelligence</a>&#8220;，中文译名《集体智慧编程》，机器学习&amp;数据挖掘领域&#8221;近年出的入门好书，培养兴趣是最重要的一环，一上来看大部头很容易被吓走的&#8221;<br />　2) &#8220;<a href="http://www.xun6.com/file/8d6a6eb14/McGrawHill.Tom.Mitchell-Machine.Learning.pdf.html" target="_blank" style="color: #ca0000; text-decoration: none; ">Machine Learning</a>&#8220;,机器学习领域无可争议的经典书籍，下载完毕将后缀改为pdf即可。豆瓣评论 by王宁）：老书，牛人。现在看来内容并不算深，很多章节有点到为止的感觉，但是很适合新手（当然，不能&#8221;新&#8221;到连算法和概率都不知道）入门。比如决策树部分就很精彩，并且这几年没有特别大的进展，所以并不过时。另外，这本书算是对97年前数十年机器学习工作的大综述，参考文献列表极有价值。国内有翻译和影印版，不知道绝版否。<br />　3) &#8220;<a href="http://www.xun6.com/file/7bf0464f5/Introduction+to+Machine+Learning.html" target="_blank" style="color: #ca0000; text-decoration: none; ">Introduction to Machine Learning</a>&#8221;<br />13、国外数据挖掘书籍之：<br />　1) &#8220;<a href="http://www.xun6.com/file/da7274114/Data.Mining.Concepts.and.Techniques.2nd.Ed.pdf.html" target="_blank" style="color: #ca0000; text-decoration: none; ">Data.Mining.Concepts.and.Techniques.2nd</a>&#8220;，数据挖掘经典书籍 作者 : Jiawei Han/Micheline Kamber 出版社 : Morgan Kaufmann 评语 : 华裔科学家写的书，相当深入浅出。<br />　2)&nbsp;<a href="http://www.xun6.com/file/b3fbba9f5/Data_Mining.pdf.html" target="_blank" style="color: #ca0000; text-decoration: none; ">Data Mining:Practical Machine Learning Tools and 
Techniques</a><br />　3)&nbsp;<a href="http://www.xun6.com/file/cc9883d12/Beautiful+Data.pdf.html" target="_blank" style="color: #ca0000; text-decoration: none; ">Beautiful Data: The Stories Behind Elegant Data Solutions</a>（ Toby Segaran, Jeff Hammerbacher）<br />14、国外模式识别书籍之：<br />　1）&#8220;<a href="http://www.xun6.com/file/52ea8a539/Pattern_Recognition.pdf.html" target="_blank" style="color: #ca0000; text-decoration: none; ">Pattern Recognition</a>&#8221;<br />　2）&#8220;<a href="http://www.xun6.com/file/61bf4ca39/Pattern.Recognition.Technologies.rar.html" target="_blank" style="color: #ca0000; text-decoration: none; ">Pattern Recongnition Technologies and Applications</a>&#8221;<br />　3）&#8220;<a href="http://www.xun6.com/file/d73ad5716/An_Introduction_to_Pattern_Recognition_-_Michael_Alder.pdf.html" target="_blank" style="color: #ca0000; text-decoration: none; ">An Introduction to Pattern Recognition</a>&#8221;<br />　4）&#8220;<a href="http://www.xun6.com/file/ec9acbb16/Introduction_to_Statistical_Pattern_Recognition.pdf.html" target="_blank" style="color: #ca0000; text-decoration: none; ">Introduction to Statistical Pattern Recognition</a>&#8221;<br />　5）&#8220;<a href="http://www.xun6.com/file/e59b45816/Statistical_Pattern_Recognition__2nd_Edition.zip.html" target="_blank" style="color: #ca0000; text-decoration: none; ">Statistical Pattern Recognition 2nd Edition</a>&#8221;<br />　6）&#8220;<a href="http://www.xun6.com/file/d3bf01816/Supervised_and_Unsupervised_Pattern_Recognition.pdf.html" target="_blank" style="color: #ca0000; text-decoration: none; ">Supervised and Unsupervised Pattern Recognition</a>&#8221;<br />　7）&#8220;<a href="http://www.xun6.com/file/847d54516/Support+Vector+Machines+for+Pattern+Classification.pdf.html" target="_blank" style="color: #ca0000; text-decoration: none; ">Support Vector Machines for Pattern Classification</a>&#8221;<br />15、国外人工智能书籍之：<br />　1）<a 
href="http://www.xun6.com/file/a7b4eb814/Artificial+Intelligence+A+Modern+Approach-+Second+Edition.pdf.html" target="_blank" style="color: #ca0000; text-decoration: none; ">Artificial Intelligence: A Modern Approach</a>&nbsp;(2nd Edition) 人工智能领域无争议的经典。<br />　2）&#8220;<a href="http://www.xun6.com/file/9d3b5ca16/Paradigms+of+Artificial+Intelligence+Programming+Case+Studies+in+Common+LISP.pdf.html" target="_blank" style="color: #ca0000; text-decoration: none; ">Paradigms of Artificial Intelligence Programming: Case Studies in Common LISP</a>&#8221;<br />16、其他相关书籍：<br />　1）<a href="http://www.xun6.com/file/c8dcedf12/Programming+the+Semantic+Web.pdf.html" target="_blank" style="color: #ca0000; text-decoration: none; ">Programming the Semantic Web</a>，Toby Segaran , Colin Evans, Jamie Taylor<br />　2）<a href="http://www.xun6.com/file/b43b36b12/Oreilly.Learning.Python.4th.Edition.Sep.2009.pdf.html" target="_blank" style="color: #ca0000; text-decoration: none; ">Learning.Python第四版</a>，英文</p><p style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; font-family: Arial; line-height: 26px; text-align: left; ">二、课件：<br />1、哈工大刘挺老师的&#8220;<a href="http://ishare.iask.sina.com.cn/f/22509389.html" target="_blank" style="color: #ca0000; text-decoration: none; ">统计自然语言处理</a>&#8221;课件；<br />2、哈工大刘秉权老师的&#8220;<a href="http://www.xun6.com/file/6b34e1a15/%E8%87%AA%E7%84%B6%E8%AF%AD%E8%A8%80%E5%A4%84%E7%90%86-%E5%93%88%E5%B7%A5%E5%A4%A7%E5%88%98%E7%A7%89%E6%9D%83.rar.html" target="_blank" style="color: #ca0000; text-decoration: none; ">自然语言处理</a>&#8221;课件；<br />3、中科院计算所刘群老师的&#8220;<a href="http://www.xun6.com/file/ea552a315/%E8%AE%A1%E7%AE%97%E8%AF%AD%E8%A8%80%E5%AD%A6%E8%AE%B2%E4%B9%89-%E5%88%98%E7%BE%A4.rar.html" target="_blank" style="color: #ca0000; text-decoration: none; ">计算语言学讲义</a>&#8220;课件；<br />4、中科院自动化所宗成庆老师的&#8220;<a href="http://ishare.iask.sina.com.cn/f/22509399.html" 
target="_blank" style="color: #ca0000; text-decoration: none; ">自然语言理解</a>&#8221;课件；<br />5、北大常宝宝老师的&#8220;<a href="http://ishare.iask.sina.com.cn/f/22509403.html" target="_blank" style="color: #ca0000; text-decoration: none; ">计算语言学</a>&#8221;课件；<br />6、北大詹卫东老师的&#8220;<a href="http://ishare.iask.sina.com.cn/f/22509596.html" target="_blank" style="color: #ca0000; text-decoration: none; ">中文信息处理基础</a>&#8221;的课件及相关代码；<br />7、MIT Regina Barzilay教授的&#8220;<a href="http://www.xun6.com/file/7a1d66015/MIT%E8%87%AA%E7%84%B6%E8%AF%AD%E8%A8%80%E5%A4%84%E7%90%86-Regina+Barzilay.rar.html" target="_blank" style="color: #ca0000; text-decoration: none; ">自然语言处理</a>&#8221;课件，52nlp上翻译了前5章；<br />8、MIT大牛Michael Collins的&#8220;<a href="http://www.xun6.com/file/d15696e15/MIT-Machine-learning-for-nlp-Michael+Collins.rar.html" target="_blank" style="color: #ca0000; text-decoration: none; ">Machine Learning Approaches for Natural Language Processing</a>(面向自然语言处理的机器学习方法)&#8221;课件；<br />9、Michael Collins的&#8220;<a href="http://www.xun6.com/file/072cfd815/MIT-Machine-learning-Michael+Collins.rar.html" target="_blank" style="color: #ca0000; text-decoration: none; ">Machine Learning&nbsp;</a>（机器学习）&#8221;课件；<br />10、SMT牛人Philipp Koehn &#8220;<a href="http://www.xun6.com/file/e45969316/Advanced+Natural+Language+Processing-Philipp+Koehn.rar.html" target="_blank" style="color: #ca0000; text-decoration: none; ">Advanced Natural Language Processing</a>（高级自然语言处理）&#8221;课件；<br />11、Philipp Koehn &#8220;<a href="http://www.xun6.com/file/7dbe55316/Empirical+Methods+in+Natural+Language+Processing-Philipp+Koehn.rar.html" target="_blank" style="color: #ca0000; text-decoration: none; ">Empirical Methods in Natural Language Processing</a>&#8221;课件；<br />12、Philipp Koehn&#8220;<a href="http://www.xun6.com/file/6e5766d16/Machine+Translation-Philipp+Koehn.rar.html" target="_blank" style="color: #ca0000; text-decoration: none; ">Machine Translation</a>（机器翻译）&#8221;课件；</p><p style="margin-top: 0px; margin-right: 
0px; margin-bottom: 0px; margin-left: 0px; padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; font-family: Arial; line-height: 26px; text-align: left; ">三、语言资源和开源工具：<br />1、Brown语料库：<br />　a)&nbsp;<a href="http://www.xun6.com/file/fe1271a16/brown_tei.zip.html" target="_blank" style="color: #ca0000; text-decoration: none; ">XML格式的brown语料库</a>，带词性标注；<br />　b)&nbsp;<a href="http://www.xun6.com/file/7b4ab4816/brown.zip.html" target="_blank" style="color: #ca0000; text-decoration: none; ">普通文本格式的brown语料库</a>，带词性标注；<br />　c) 合并并去除空行、行首空格，用于词性标注训练：<a href="http://www.xun6.com/file/cdf8c0b16/browntest.zip.html" target="_blank" style="color: #ca0000; text-decoration: none; ">browntest.zip</a><br />2、<a href="http://nltk.googlecode.com/svn/trunk/nltk_data/index.xml" target="_blank" style="color: #ca0000; text-decoration: none; ">NLTK官方提供的语料库资源列表</a><br />3、<a href="http://opennlp.sourceforge.net/projects.html" target="_blank" style="color: #ca0000; text-decoration: none; ">OpenNLP上的开源自然语言处理工具列表</a><br />4、斯坦福大学自然语言处理组维护的&#8220;<a href="http://nlp.stanford.edu/links/statnlp.html" target="_blank" style="color: #ca0000; text-decoration: none; ">统计自然语言处理及基于语料库的计算语言学资源列表</a>&#8221;<br />5、<a href="http://projects.ldc.upenn.edu/Chinese/" target="_blank" style="color: #ca0000; text-decoration: none; ">LDC上免费的中文信息处理资源</a><br />6、中文分词相关工具：<br />　1）Java版本的MMSEG：<a href="http://www.xun6.com/file/43e341b16/mmseg-v0.3.zip.html" target="_blank" style="color: #ca0000; text-decoration: none; ">mmseg-v0.3.zip</a>，作者为solol，详情可参见：《<a href="http://www.52nlp.cn/%E4%B8%AD%E6%96%87%E5%88%86%E8%AF%8D%E5%85%A5%E9%97%A8%E4%B9%8B%E7%AF%87%E5%A4%96" target="_blank" style="color: #ca0000; text-decoration: none; ">中文分词入门之篇外</a>》<br />　2）张华平老师的ICTCLAS2010，该版本非商用免费一年，下载地址：<br /><a href="http://cid-51de2738d3ea0fdd.skydrive.live.com/self.aspx/.Public/ICTCLAS2010-packet-release.rar" target="_blank" style="color: #ca0000; text-decoration: none; 
">http://cid-51de2738d3ea0fdd.skydrive.live.com/self.aspx/.Public/ICTCLAS2010-packet-release.rar</a><br />7、热心读者&#8220;<a href="http://www.cnblogs.com/finallyliuyu/" target="_blank" style="color: #ca0000; text-decoration: none; ">finallyliuyu</a>&#8221;提供的一批新闻语料库，包括腾讯，新浪，网易，凤凰等，目前放在CSDN上：<a href="http://finallyliuyu.download.csdn.net/" target="_blank" style="color: #ca0000; text-decoration: none; ">http://finallyliuyu.download.csdn.net/</a><br />　　另外finalllyliuyu在2010年9月又提供了一批文本文类语料，详情见：<a href="http://www.cnblogs.com/finallyliuyu/archive/2010/09/11/1824091.html" target="_blank" style="color: #ca0000; text-decoration: none; ">献给热衷于自然语言处理的业余爱好者的中文新闻分类语料库之二</a></p><p style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; font-family: Arial; line-height: 26px; text-align: left; ">四、文献：<br />1、ACL-IJCNLP 2009论文全集：<br />　a)&nbsp;<a href="http://www.xun6.com/file/e5150b618/ACLIJCNLP-2009-1.pdf.html" target="_blank" style="color: #ca0000; text-decoration: none; ">大会论文Full Paper第一卷</a><br />　b)&nbsp;<a href="http://www.xun6.com/file/0c783df18/ACLIJCNLP-2009-2.pdf.html" target="_blank" style="color: #ca0000; text-decoration: none; ">大会论文Full Paper第二卷</a><br />　c)&nbsp;<a href="http://www.xun6.com/file/53b33b618/Short-2009.pdf.html" target="_blank" style="color: #ca0000; text-decoration: none; ">大会论文Short Paper合集</a><br />　d)&nbsp;<a href="http://www.xun6.com/file/8e6e49f18/EMNLP-2009.pdf.html" target="_blank" style="color: #ca0000; text-decoration: none; ">ACL09之EMNLP-2009合集</a><br />　e)&nbsp;<a href="http://www.xun6.com/file/5f1aa7518/ACL09_others.rar.html" target="_blank" style="color: #ca0000; text-decoration: none; ">ACL09 所有workshop论文合集</a></p><img src ="http://www.cppblog.com/sunrise/aggbug/169243.html" width = "1" height = "1" /><br><br><div align=right><a style="text-decoration:none;" href="http://www.cppblog.com/sunrise/" target="_blank">SunRise_at</a> 2012-03-28 
09:35 <a href="http://www.cppblog.com/sunrise/archive/2012/03/28/169243.html#Feedback" target="_blank" style="text-decoration:none;">发表评论</a></div>]]></description></item></channel></rss>