To meet the needs of topic detection in Internet public opinion monitoring systems, and to overcome the two main shortcomings of the classic single-pass algorithm when clustering web texts, namely its sensitivity to input order and its low precision, this paper proposes an incremental clustering algorithm called ICIT. ICIT inherits the simple principle of single-pass, which preserves the real-time performance of web text clustering. It improves clustering quality through five measures: tagging parts of speech during word segmentation and keeping only nouns and verbs to vectorize the body text; building a title vector that represents each text jointly with its body vector; adopting an average-link comparison strategy; introducing the concept of a "generation" so that texts are clustered in batches; and adding, after each batch is clustered, a step in which reports reselect and adjust the cluster they belong to. Experiments demonstrate the effectiveness and practicality of ICIT in improving the accuracy of topic detection.
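The measures listed in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the abstract gives no formulas or parameter values, so the weighted sum of title and body similarity, the `title_weight`, `threshold`, and `generation_size` values, and every function name below are assumptions made for illustration. Word segmentation and the noun/verb filter are assumed to have been applied already, so each text arrives as two lists of terms.

```python
from collections import Counter
from math import sqrt


def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(w * b.get(t, 0) for t, w in a.items())
    na = sqrt(sum(w * w for w in a.values()))
    nb = sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def make_doc(title_terms, body_terms):
    """Represent a text by a title vector and a body vector."""
    return {"title": Counter(title_terms), "body": Counter(body_terms)}


def doc_sim(d1, d2, title_weight):
    # Assumed combination rule: weighted sum of title and body similarity.
    return (title_weight * cosine(d1["title"], d2["title"])
            + (1 - title_weight) * cosine(d1["body"], d2["body"]))


def avg_link(i, cluster, docs, title_weight):
    """Average-link score: mean similarity of doc i to the other members."""
    others = [j for j in cluster if j != i]
    if not others:
        return 0.0
    return sum(doc_sim(docs[i], docs[j], title_weight) for j in others) / len(others)


def best_cluster(i, clusters, docs, title_weight):
    """Return (cluster index, score) of the best match, or (None, 0.0)."""
    best, best_score = None, 0.0
    for c, cluster in enumerate(clusters):
        score = avg_link(i, cluster, docs, title_weight)
        if score > best_score:
            best, best_score = c, score
    return best, best_score


def cluster_generations(docs, generation_size=2, threshold=0.3, title_weight=0.5):
    """Cluster docs batch by batch, with an adjustment step after each batch."""
    clusters = []
    for start in range(0, len(docs), generation_size):
        batch = range(start, min(start + generation_size, len(docs)))
        # Single-pass step: join the best cluster or open a new one.
        for i in batch:
            c, score = best_cluster(i, clusters, docs, title_weight)
            if c is not None and score >= threshold:
                clusters[c].append(i)
            else:
                clusters.append([i])
        # Adjustment step: each text of this generation reconsiders its cluster.
        for i in batch:
            current = next(c for c, cl in enumerate(clusters) if i in cl)
            c, score = best_cluster(i, clusters, docs, title_weight)
            if c is not None and c != current and score >= threshold:
                clusters[current].remove(i)
                clusters[c].append(i)
        clusters = [cl for cl in clusters if cl]  # drop emptied clusters
    return clusters


# Hypothetical toy input: two earthquake reports and two stock market reports.
docs = [
    make_doc(["earthquake", "rescue"], ["earthquake", "rescue", "team", "arrive"]),
    make_doc(["stock", "market"], ["stock", "market", "fall", "investor"]),
    make_doc(["earthquake", "relief"], ["earthquake", "relief", "rescue", "send"]),
    make_doc(["market", "rebound"], ["stock", "market", "rise", "investor"]),
]
print(cluster_generations(docs))
```

On this toy input the two earthquake reports end up in one cluster and the two market reports in another. The adjustment loop is what reduces the dependence on input order in this sketch: a text placed early in a generation gets a second chance to move once the rest of its batch has been clustered.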