Python: generating a corpus using the Dirichlet distribution

First, let's define the sample function:

from numpy import array, cumsum, zeros
from numpy.random import dirichlet, uniform

def sample(dist, num_samples=1):
    """
    Uses the inverse CDF method to return samples drawn from an
    (unnormalized) discrete distribution.

    Arguments:

    dist -- (unnormalized) distribution

    Keyword arguments:

    num_samples -- number of samples to draw
    """

    cdf = cumsum(dist)
    r = uniform(size=num_samples) * cdf[-1]

    return cdf.searchsorted(r)

As we can see, the sample function takes two parameters: dist, which may be an unnormalized distribution, and num_samples, the number of samples we want to draw. It returns the indices of the sampled outcomes.
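
As a quick check of the inverse CDF idea, here is a minimal sketch that draws from an unnormalized distribution and compares the empirical frequencies with the normalized weights. The rng argument, the seed, and the weights are assumptions added for illustration; they are not part of the original function.

```python
import numpy as np

def sample(dist, num_samples=1, rng=None):
    """Draw indices from an (unnormalized) discrete distribution."""
    rng = np.random.default_rng() if rng is None else rng
    cdf = np.cumsum(dist)                        # running totals; cdf[-1] is the normalizer
    r = rng.uniform(size=num_samples) * cdf[-1]  # uniform points scaled to the total weight
    return cdf.searchsorted(r)                   # first index whose running total reaches r

rng = np.random.default_rng(0)                   # fixed seed, chosen only for repeatability
weights = [1.0, 2.0, 7.0]                        # unnormalized: sums to 10, not 1
draws = sample(weights, num_samples=10000, rng=rng)
freqs = np.bincount(draws, minlength=len(weights)) / len(draws)
print(freqs)                                     # should be close to 0.1, 0.2, 0.7
```

Because the uniform draws are scaled by cdf[-1], the caller never has to normalize dist first.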

Now let's see how to generate a corpus for a Dirichlet--multinomial unigram language model:

def generate_corpus(beta, mean, N):
    """
    Returns a corpus of tokens drawn from a Dirichlet--multinomial
    unigram language model. Each token is an instance of one of V
    unique word types, represented by indices 0, ..., V - 1.

    Arguments:

    beta -- concentration parameter for the Dirichlet prior
    mean -- V-dimensional mean of the Dirichlet prior
    N -- number of tokens to generate
    """

    # Draw the word distribution phi from the Dirichlet prior,
    # then draw N tokens from phi.
    phi = dirichlet(beta * array(mean))
    return sample(phi, N)

Keep in mind that dirichlet is imported with "from numpy.random.mtrand import dirichlet" (the same function is also exposed as numpy.random.dirichlet), and that the parameter vector it receives is beta*array(mean). Here beta is the concentration parameter, and mean is a vector that sums to 1.
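
To see what the concentration parameter does, the following sketch draws many distributions for several values of beta and summarizes them. It uses NumPy's newer Generator interface (np.random.default_rng) rather than the mtrand import above; the helper name summarize, the seed, and the particular beta values are assumptions made for illustration.

```python
import numpy as np

def summarize(beta, mean, num_draws, rng):
    """Draw num_draws distributions from Dirichlet(beta * mean)."""
    phis = rng.dirichlet(beta * np.asarray(mean), size=num_draws)  # one phi per row
    return phis, phis.mean(axis=0), phis.std(axis=0)

rng = np.random.default_rng(0)
mean = [0.5, 0.3, 0.2]                 # must sum to 1

results = {}
for beta in (1.0, 10.0, 1000.0):
    phis, avg, spread = summarize(beta, mean, 5000, rng)
    results[beta] = (phis, avg, spread)
    print(beta, avg.round(2), spread.round(3))
```

The average of the draws stays near mean for every beta, while the spread shrinks as beta grows: a large beta keeps each phi close to mean, and a small beta lets phi vary widely.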

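Another way to generate the corpus is to use the predictive probability of the next token given the tokens D generated so far, with phi integrated out:
$$P(w_{N+1} = v \mid D, H) = \frac{N_v + \beta\, n_v}{N + \beta}$$
where N_v is the number of tokens of type v in D, N is the total number of tokens in D, and n_v is the v-th component of the mean (mean[v] in the code below).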
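
The predictive rule can be worked through on a small example. The helper name predictive and the counts used here are hypothetical, chosen only to make the arithmetic easy to follow.

```python
import numpy as np

def predictive(Nv, beta, mean):
    """Probability of each word type for the next token, given counts Nv."""
    Nv = np.asarray(Nv, dtype=float)
    mean = np.asarray(mean, dtype=float)
    return (Nv + beta * mean) / (Nv.sum() + beta)

# Three word types, three tokens seen so far (two of type 0, one of type 2),
# beta = 3 and a uniform mean, so beta * mean = [1, 1, 1].
probs = predictive([2, 0, 1], beta=3.0, mean=[1 / 3, 1 / 3, 1 / 3])
print(probs)   # (2+1)/6, (0+1)/6, (1+1)/6, i.e. about 1/2, 1/6, 1/3
```

With no tokens seen yet (all counts zero) the rule reduces to mean itself, which is why the first token of the collapsed process is drawn from the prior mean.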

def generate_corpus_collapsed(beta, mean, N):
    """
    Returns a corpus of tokens drawn from a Dirichlet--multinomial
    unigram language model using the 'collapsed' generative process
    (i.e., phi is not explicitly represented). Each token is an
    instance of one of V unique word types.

    Arguments:

    beta -- concentration parameter for the Dirichlet prior
    mean -- V-dimensional mean of the Dirichlet prior
    N -- number of tokens to generate
    """

    V = len(mean) # vocabulary size

    corpus = zeros(N, dtype=int) # corpus

    Nv = zeros(V, dtype=int) # counts for each word type

    for n in range(N):
        # Predictive probability of each word type given the n tokens so far.
        corpus[n] = sample((Nv + beta * array(mean)) / (n + beta), 1)[0]
        Nv[corpus[n]] += 1
    return corpus
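
The explicit and the collapsed process should produce corpora with the same statistics. The sketch below is one rough way to check this by simulation: it compares the mean and variance of the count of word type 0 over many generated corpora. The function names explicit and collapsed, the rng argument, the seed, and all parameter values are assumptions made for this illustration.

```python
import numpy as np

def sample(dist, num_samples, rng):
    cdf = np.cumsum(dist)
    return cdf.searchsorted(rng.uniform(size=num_samples) * cdf[-1])

def explicit(beta, mean, N, rng):
    phi = rng.dirichlet(beta * mean)          # draw phi once, then N tokens from it
    return sample(phi, N, rng)

def collapsed(beta, mean, N, rng):
    Nv = np.zeros(len(mean))                  # counts for each word type
    corpus = np.zeros(N, dtype=int)
    for n in range(N):
        # Unnormalized predictive; sample() handles the normalization.
        corpus[n] = sample(Nv + beta * mean, 1, rng)[0]
        Nv[corpus[n]] += 1
    return corpus

rng = np.random.default_rng(0)
beta, mean, N, runs = 2.0, np.array([0.5, 0.3, 0.2]), 20, 1000

counts_explicit = np.array([(explicit(beta, mean, N, rng) == 0).sum() for _ in range(runs)])
counts_collapsed = np.array([(collapsed(beta, mean, N, rng) == 0).sum() for _ in range(runs)])

print(counts_explicit.mean(), counts_collapsed.mean())   # both near N * mean[0] = 10
print(counts_explicit.var(), counts_collapsed.var())     # should be of similar size
```

Agreement of two moments is only a sanity check, not a proof, but a large gap between the two rows would point to a bug in one of the generators.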

Finally, let's see how to generate a corpus for a mixture of Dirichlet--multinomial unigram language models:

def generate_corpus(alpha, m, beta, n, D, Nd):
    """
    Returns a grouped corpus drawn from a mixture of
    Dirichlet--multinomial unigram language models.

    Arguments:

    alpha -- concentration parameter for the Dirichlet prior over theta
    m -- T-dimensional mean of the Dirichlet prior over theta
    beta -- concentration parameter for the Dirichlet prior over phis
    n -- V-dimensional mean of the Dirichlet prior over phis
    D -- number of documents to generate
    Nd -- number of tokens to generate per document
    """
    corpus = GroupedCorpus()

    # theta: distribution over the T topics, drawn from Dirichlet(alpha * m).
    # phis: one distribution over words per topic (row t), each drawn
    # from Dirichlet(beta * n).
    theta = dirichlet(alpha * array(m))
    phis = dirichlet(beta * array(n), len(m))
    for d in range(D):
        [t] = sample(theta, 1)  # one topic per document
        corpus.add(str(d), str(t), [str(x) for x in sample(phis[t, :], Nd)])
    return corpus

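Note that there are T topics (groups): phis=dirichlet(beta*array(n),len(m)) produces T word distributions, one draw from the Dirichlet prior per topic, and every document with the same topic t should use the same distribution phis[t,:].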
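
GroupedCorpus is not defined in this post, so the following self-contained sketch stores each document as a plain (topic, tokens) pair instead. The function name generate_mixture_corpus, the rng argument, the return values, and the parameter values are all assumptions made for illustration.

```python
import numpy as np

def sample(dist, num_samples, rng):
    cdf = np.cumsum(dist)
    return cdf.searchsorted(rng.uniform(size=num_samples) * cdf[-1])

def generate_mixture_corpus(alpha, m, beta, n, D, Nd, rng):
    m = np.asarray(m, dtype=float)
    n = np.asarray(n, dtype=float)
    theta = rng.dirichlet(alpha * m)              # distribution over the T topics
    phis = rng.dirichlet(beta * n, size=len(m))   # row t: word distribution of topic t
    docs = []
    for d in range(D):
        t = int(sample(theta, 1, rng)[0])         # one topic per document
        tokens = sample(phis[t], Nd, rng)         # all tokens share phis[t]
        docs.append((t, tokens))
    return theta, phis, docs

rng = np.random.default_rng(0)
theta, phis, docs = generate_mixture_corpus(
    alpha=1.0, m=[0.5, 0.5],                      # T = 2 topics
    beta=2.0, n=[0.25, 0.25, 0.25, 0.25],         # V = 4 word types
    D=5, Nd=10, rng=rng)

for t, tokens in docs:
    print(t, tokens)
```

Drawing phis once, outside the document loop, is what makes two documents with the same topic share a word distribution.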

posted on 2012-10-28 10:13 by luis, filed under: Python
