Based on the structural characteristics of scientific literature, this paper proposes a four-layer mining model and combines the K-means algorithm with the Apriori algorithm to construct a new feature-word extraction method, MultiLM-FE. The method first divides a scientific document into four layers according to its structure. It then extracts feature words from the first three layers, layer by layer, using K-means clustering. Finally, it uses the Apriori algorithm to find the maximal frequent itemsets of the fourth layer and takes them as that layer's feature-word set. The method addresses the inability of K-means to determine the best initial cluster centers automatically and reduces information loss during clustering, which allows it to find feature words in a literature corpus more accurately. Experimental results show that the method is feasible and effective and that it improves substantially on the previous method, particularly for scientific literature.
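The fourth-layer step can be illustrated with a minimal sketch of level-wise Apriori followed by a filter that keeps only the maximal frequent itemsets. This is not the paper's implementation. The abstract does not say what counts as a transaction or how the support threshold is chosen, so the sketch assumes each transaction is the set of terms in one text unit and that `min_support` is an absolute count. The function names are hypothetical.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_support):
    """Level-wise Apriori: map each frequent itemset to its support count.

    Assumption: min_support is an absolute count, not a fraction.
    """
    transactions = [frozenset(t) for t in transactions]

    # Level 1: count individual terms.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    current = {s: c for s, c in counts.items() if c >= min_support}
    frequent = dict(current)

    k = 2
    while current:
        # Join step: merge frequent (k-1)-itemsets into k-item candidates.
        # Prune step: keep a candidate only if all (k-1)-subsets are frequent.
        candidates = set()
        for a, b in combinations(list(current), 2):
            union = a | b
            if len(union) == k and all(
                frozenset(sub) in current for sub in combinations(union, k - 1)
            ):
                candidates.add(union)

        # Count the support of the surviving candidates.
        current = {}
        for cand in candidates:
            support = sum(1 for t in transactions if cand <= t)
            if support >= min_support:
                current[cand] = support
        frequent.update(current)
        k += 1
    return frequent


def maximal_frequent_itemsets(transactions, min_support):
    """Keep only frequent itemsets that have no frequent proper superset."""
    frequent = frequent_itemsets(transactions, min_support)
    return [s for s in frequent if not any(s < other for other in frequent)]


# Hypothetical toy data: each set holds the terms of one fourth-layer text unit.
units = [
    {"clustering", "kmeans", "feature"},
    {"clustering", "kmeans", "apriori"},
    {"clustering", "kmeans", "feature"},
    {"apriori", "itemset"},
    {"clustering", "feature"},
]
feature_sets = maximal_frequent_itemsets(units, min_support=3)
```

Because every subset of a frequent itemset is itself frequent, keeping only the maximal itemsets gives a compact feature-word set from which the shorter frequent itemsets can still be recovered.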