hLDA(层次潜在狄利克雷分配)在层次主题建模中的良好效果已经得到广泛验证.为了实现半监督或无监督,通常采用交叉验证或抽样超参来确定参数.但由于语料特征、建模需求等不确定因素,参数调节方法、建模效果和数率都是实际应用中的难点.该文首先结合贝叶斯线索和范围线索构成的统一分析框架,研究hLDA主题建模中的关键影响因素,然后给出一个切实有效的建模策略及流程,最终结合ACL MultiLing 2013多文档摘要语料进行实际建模效果评估.
The results of hLDA (hierarchical Latent Dirichlet Allocation) in the hierarchical topic modeling have been widely validated. In order to achieve semi-supervised or unsupervised learning, cross-validation or sampling super parameters are usually used to determine the true parameters. However, corpus features, modeling demand and some other factors are uncertain. Hence, parameter adjustment, modeling effectiveness and efficiency are difficulty to achieve in practical applications. This paper builds a unified analytical framework by combining Bayesian theory and boundary information, analyzes the key factors in its topic modeling, then gives a series of practical and effective modeling strategies and processes, and finally evaluates the modeling results with multi-document summary corpus from ACL MultiLing 2013.