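Based on an analysis of Chinese word segmentation algorithms and of the characteristics of traffic information expressed in natural language, this paper proposes a cross-step matching word segmentation algorithm for such information, to meet the urgent need of dynamic travel information services for structured, digital real-time traffic information. The algorithm takes full account of the record-length characteristics of the lexicons used to describe traffic information in natural language. By setting corresponding segmentation orders (step sizes), it replaces the one-step advance of the string pointer in traditional Chinese word segmentation with a multi-step advance that varies with the properties of the lexicon, so that Chinese character strings that may form a word are processed as a whole. This greatly improves the efficiency of real-time segmentation and understanding of traffic information expressed in natural language. In an experimental comparison with an improved MM (maximum matching) algorithm, the proposed method is more than 10 times as efficient as the MM segmentation algorithm at the same understanding success rate and fault tolerance.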
This paper proposes a novel cross-step word segmentation algorithm for processing real-time traffic information expressed in natural-language Chinese, to meet the urgent need of real-time travel information services for dynamic traffic information. Considering the record-length distribution of the lexicons that describe real-time traffic information, the algorithm sets a corresponding segmentation step for the address, direction and event lexicons, and replaces the one-step advance of the string pointer in classical Chinese word segmentation with a flexible multi-step advance, so that candidate Chinese words are aggregated efficiently. A case study shows that the proposed algorithm runs more than 10 times faster than an improved MM (maximum matching) algorithm while keeping similar accuracy and robustness. The authors argue that the presented algorithm greatly helps the automatic and intelligent processing of real-time traffic information and facilitates the development of travel information services.
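The multi-step pointer idea can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the matching rules, so the sketch assumes that the "steps" of a lexicon are simply the distinct record lengths it contains, that lexicons are tried in a fixed order, and that unmatched characters are skipped one at a time. The function names `build_steps` and `cross_step_segment` and the sample lexicon entries are hypothetical.

```python
from typing import Dict, List, Set, Tuple


def build_steps(lexicon: Set[str]) -> List[int]:
    """Return the distinct record lengths of a lexicon, longest first.

    Assumption: these lengths play the role of the per-lexicon
    segmentation steps mentioned in the abstract.
    """
    return sorted({len(word) for word in lexicon}, reverse=True)


def cross_step_segment(
    text: str, lexicons: Dict[str, Set[str]]
) -> List[Tuple[str, str]]:
    """Segment text into (lexicon name, word) pairs.

    At each position only substrings whose length occurs in a lexicon
    are tested; on a match the pointer jumps over the whole word
    instead of advancing one character at a time.
    """
    steps = {name: build_steps(words) for name, words in lexicons.items()}
    tokens: List[Tuple[str, str]] = []
    i = 0
    while i < len(text):
        matched = False
        for name, words in lexicons.items():
            for step in steps[name]:
                candidate = text[i:i + step]
                # Skip truncated candidates near the end of the text.
                if len(candidate) == step and candidate in words:
                    tokens.append((name, candidate))
                    i += step  # multi-step jump over the matched word
                    matched = True
                    break
            if matched:
                break
        if not matched:
            i += 1  # assumption: characters outside all lexicons are skipped
    return tokens


# Hypothetical sample lexicons for address, direction and event records.
SAMPLE_LEXICONS: Dict[str, Set[str]] = {
    "address": {"中关村大街", "学院路"},
    "direction": {"由南向北", "北向南"},
    "event": {"拥堵", "交通事故"},
}
```

Under these assumptions, calling `cross_step_segment("中关村大街由南向北发生交通事故", SAMPLE_LEXICONS)` tests at most two candidate lengths per lexicon at each position, whereas a plain maximum matching scan would try every length from the longest record down to one character.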