位置:成果数据库 > 期刊 > 期刊详情页
基于增量型聚类的自动话题检测研究
  • 期刊名称:软件学报(录用)
  • 时间:0
  • 分类:TP391[自动化与计算机技术—计算机应用技术;自动化与计算机技术—计算机科学与技术]
  • 作者机构:[1]北京航空航天大学计算机科学与工程系,北京100191
  • 相关基金:国家自然科学基金(61170189,61003111);国家教育部博士点基金(20101102120016);国家重点实验室基金(SKLSDE+2011ZX.03)
  • 相关项目:基于图的统计机器翻译方法研究
中文摘要:

随着网络信息飞速的发展,收集并组织相关信息变得越来越困难.话题检测与跟踪(topic detection and tracking,简称TDT)就是为解决该问题而提出来的研究方向.话题检测是TDT中重要的研究任务之一,其主要研究内容是把讨论相同话题的故事聚类到一起.虽然话题检测已经有了多年的研究,但面对日益变化的网络信息,它具有了更大的挑战性.提出了一种基于增量型聚类的和自动话题检测方法,该方法旨在提高话题检测的效率,并且能够自动检测出文本库中话题的数量.采用改进的权重算法计算特征的权重,通过自适应地提炼具有较强的主题辨别能力的文本特征来提高文档聚类的准确率,并且在聚类过程中利用BIC来判断话题类别的数目,同时利用话题的延续性特征来预聚类文档,并以此提高话题检测的速度.基于TDT-4语料库的实验结果表明,该方法能够大幅度提高话题检测的效率和准确率.

英文摘要:

With the exponential growth of information on the Intemet, it has become increasingly difficult to find and organize relevant material. Topic detection and tracking (TDT) is a research area addressing this problem. As one of the basic tasks of TDT, topic detection is the problem of grouping all stories, based on the topics they discuss. This paper proposes a new topic detection method (TPIC) based on an incremental clustering algorithm. The proposed topic detection strives to achieve a high accuracy and the capability of estimating the true number of topics in the document corpus. Term reweighing algorithm is used to accurately and efficiently cluster the given document corpus, and a self-refinement process of discriminative feature identification is proposed to improve the performance of clustering. Furthermore, topics' "aging" nature is used to precluster stories, and Bayesian information criterion (B1C) is used to estimate the true number of topics. Experimental results on linguistic data consortium (LDC) datasets TDT-4 show that the proposed model can improve both efficiency and accuracy, compared to other models.

同期刊论文项目
期刊论文 3 会议论文 4
同项目期刊论文