A directed graph of candidate segmentation paths is built for Chinese text using a longest and second-longest matching method, which turns automatic Chinese word segmentation into the problem of selecting the correct segmentation path in that graph. Node costs in the graph correspond to single-word frequencies, and edge costs correspond to the adjacency frequency of the two words that an edge connects. A modified Dijkstra minimum-cost path algorithm finds the path with the lowest total cost, which is taken as the segmentation result. Segmentation ambiguities are filtered and resolved step by step. A mechanism driven by the characteristic words of unknown words is introduced to pre-process unknown words, which reduces the segmentation errors they cause. Experimental results show that the method effectively improves the precision and recall of Chinese word segmentation.
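The graph formulation can be illustrated with a minimal Python sketch. It is not the authors' implementation: it uses the standard Dijkstra algorithm rather than their modified version, and it does not model the step-by-step ambiguity filtering or the unknown-word pre-processing. The dictionary `WORD_FREQ`, the pair counts `PAIR_FREQ`, and the negative-log cost functions `node_cost` and `edge_cost` are all hypothetical choices made for illustration, since the abstract states only that costs correspond to frequencies.

```python
import heapq
import math

# Hypothetical toy counts; the paper's dictionary and corpus statistics are not given.
WORD_FREQ = {"研究": 50, "研究生": 20, "生命": 40, "生": 8, "命": 5, "的": 200, "起源": 30}
PAIR_FREQ = {("研究", "生命"): 12, ("生命", "的"): 15, ("的", "起源"): 10}
TOTAL = sum(WORD_FREQ.values())
MAX_LEN = max(len(w) for w in WORD_FREQ)


def candidates(text, i):
    """Longest and second-longest dictionary words starting at position i."""
    found = [text[i:j]
             for j in range(min(len(text), i + MAX_LEN), i, -1)
             if text[i:j] in WORD_FREQ]
    # Fall back to a single character when nothing in the dictionary matches.
    return found[:2] or [text[i]]


def node_cost(word):
    """Assumed node cost: negative log of the add-one smoothed word frequency."""
    return -math.log((WORD_FREQ.get(word, 0) + 1) / (TOTAL + 1))


def edge_cost(prev, word):
    """Assumed edge cost: negative log of the add-one smoothed adjacency frequency."""
    pair = PAIR_FREQ.get((prev, word), 0)
    return -math.log((pair + 1) / (WORD_FREQ.get(prev, 0) + len(WORD_FREQ)))


def segment(text):
    """Return the minimum-cost segmentation found by standard Dijkstra search."""
    if not text:
        return []
    # Search state is (position, previous word), because edge costs depend on word pairs.
    heap = [(0.0, 0, "", ())]
    done = set()
    while heap:
        cost, pos, prev, path = heapq.heappop(heap)
        if pos == len(text):
            return list(path)  # first complete path popped has the lowest cost
        if (pos, prev) in done:
            continue
        done.add((pos, prev))
        for word in candidates(text, pos):
            step = node_cost(word) + (edge_cost(prev, word) if prev else 0.0)
            heapq.heappush(heap, (cost + step, pos + len(word), word, path + (word,)))
    return []


print(segment("研究生命的起源"))
```

With these toy counts, the search prefers 研究 / 生命 / 的 / 起源 over the competing path that starts with the longer match 研究生, because the word and adjacency frequencies make the second path more expensive. Both costs are non-negative under this smoothing, which is what Dijkstra's algorithm requires.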