To address the problem of complex feature extraction in part-of-speech tagging, this paper applies rough set theory to mine complex contextual features, including long-distance features, from the corpus, and to handle the noisy and inconsistent samples that the corpus contains. The resulting rough rules are then integrated as features into a maximum entropy model, and their weights are assigned during training according to the overall performance of the model. In an open test, adding the rough rules yields a tagging precision of 96.29%, an improvement of 0.83% over the original model.
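As an illustration of how rough sets can separate reliable contextual rules from inconsistent samples, the following minimal sketch computes the standard lower and upper approximations of a tag over a toy decision table. The attribute choice (the tags one and three positions to the left), the sample data, and the function name `approximations` are all hypothetical and are not taken from the paper; the paper's actual attribute set and rule-extraction procedure are not described in the abstract.

```python
from collections import defaultdict


def approximations(samples, target_tag):
    """Return (lower, upper) approximations of target_tag.

    samples: list of (condition_tuple, tag) pairs, where condition_tuple
    holds the contextual attributes of one token (hypothetical layout).
    Each approximation is returned as a set of condition tuples, one per
    equivalence class of indiscernible samples.
    """
    # Samples with identical condition attributes are indiscernible and
    # form one equivalence class; record every tag seen in each class.
    classes = defaultdict(set)
    for cond, tag in samples:
        classes[cond].add(tag)

    # Lower approximation: classes whose samples all carry target_tag.
    lower = {c for c, tags in classes.items() if tags == {target_tag}}
    # Upper approximation: classes in which target_tag occurs at least once.
    upper = {c for c, tags in classes.items() if target_tag in tags}
    return lower, upper


# Hypothetical toy table: (tag at position -1, tag at position -3) -> current tag.
samples = [
    (("DT", "VB"), "NN"),
    (("DT", "VB"), "NN"),
    (("TO", "PRP"), "VB"),
    (("TO", "PRP"), "NN"),  # conflicts with the previous sample (noise)
    (("MD", "CC"), "VB"),
]

lower_vb, upper_vb = approximations(samples, "VB")
print("certain contexts for VB:", lower_vb)
print("possible contexts for VB:", upper_vb)
```

Contexts in the lower approximation support certain rules, while contexts that appear only in the upper approximation (here the conflicting `("TO", "PRP")` class) support only possible rules. One plausible use, stated here as an assumption, is to pass both kinds of rule to the maximum entropy model as features and let training decide how much weight each deserves.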