Given the importance of Chinese field matching in information retrieval and increasingly complex retrieval demands, this paper proposes and implements, for the first time, a Chinese abbreviation field matching model based on the longest common subsequence (LCS). The model avoids cumbersome word segmentation and thereby simplifies the field matching process. Experiments on a subset of web pages from the CWT100G dataset show that the method performs stably and retrieves well, and that it outperforms the traditional word segmentation model based on string matching, especially when matching longer abbreviation fields.
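To illustrate the idea of matching an abbreviation against a full field without word segmentation, the following is a minimal sketch of character-level LCS in Python. It is not the model described in this paper: the function names `lcs_length` and `abbreviation_score`, and the choice to normalize the LCS length by the abbreviation length, are assumptions made here for illustration only.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b, character by character."""
    # prev[j] holds the LCS length of the processed prefix of a and b[:j].
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                # Matching characters extend the LCS of both shorter prefixes.
                curr.append(prev[j - 1] + 1)
            else:
                # Otherwise keep the better of dropping one character from either side.
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def abbreviation_score(abbr: str, field: str) -> float:
    """Hypothetical score: share of abbreviation characters found, in order, in field."""
    if not abbr:
        return 0.0
    return lcs_length(abbr, field) / len(abbr)
```

For example, `abbreviation_score("北大", "北京大学")` evaluates to `1.0`, because both characters of the abbreviation occur in order in the full name; no segmentation of either string is needed.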