An improved K-means clustering method is proposed for identifying Chinese chunks. It aims to avoid data sparseness while taking full account of the relationship between neighboring parts of speech and the internal composition patterns of each phrase type. The method regards each phrase as a cluster whose core is its headword, which lets it exploit the regularities in how each type of phrase is composed. It also combines a supervised statistical method with an unsupervised clustering method by setting the initial center of each class from corpus data. This improves clustering accuracy and avoids the data sparseness that would otherwise result from the small size of Chinese chunk corpora. In tests on the Chinese Penn Treebank, the improved K-means method was applied to seven types of Chinese chunks and reached an F-score of 92.94%, which indicates that the method is effective for Chinese text chunking.
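The idea of seeding the cluster centers from labeled corpus data can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the feature representation, so the numeric feature vectors, the labels `NP` and `VP`, and the function names `seed_centers` and `seeded_kmeans` are all assumptions made for illustration. The sketch only shows the general pattern of computing one initial center per class from a small labeled sample and then running standard K-means iterations from those centers.

```python
from collections import defaultdict


def mean(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def seed_centers(labeled):
    """Compute one initial center per class from labeled examples.

    `labeled` is a list of (feature_vector, label) pairs. The feature
    vectors are hypothetical; the source does not define them.
    """
    groups = defaultdict(list)
    for vec, label in labeled:
        groups[label].append(vec)
    return {label: mean(vecs) for label, vecs in groups.items()}


def seeded_kmeans(points, centers, max_iter=100):
    """Standard K-means iterations starting from the given centers.

    Returns the label assigned to each point and the final centers.
    A class that receives no points keeps its previous center.
    """
    centers = dict(centers)
    assignment = None
    for _ in range(max_iter):
        # Assign every point to its nearest center.
        new_assignment = [
            min(centers, key=lambda c: sq_dist(p, centers[c])) for p in points
        ]
        if new_assignment == assignment:
            break  # converged: no assignment changed
        assignment = new_assignment
        # Recompute each center as the mean of its assigned points.
        groups = defaultdict(list)
        for p, label in zip(points, assignment):
            groups[label].append(p)
        for label, members in groups.items():
            centers[label] = mean(members)
    return assignment, centers


# Illustrative data only: two hypothetical chunk types in a 2-D feature space.
labeled_sample = [
    ([0.0, 0.0], "NP"), ([0.0, 1.0], "NP"),
    ([5.0, 5.0], "VP"), ([6.0, 5.0], "VP"),
]
unlabeled = [[0.2, 0.1], [5.1, 4.9], [0.1, 0.9], [6.2, 5.3]]

initial = seed_centers(labeled_sample)
labels, final_centers = seeded_kmeans(unlabeled, initial)
```

Because each cluster starts from a center tied to a known class, the clusters keep their class names throughout, so no separate step is needed to map clusters to chunk types after convergence.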