To address the scarcity of Chinese corpora annotated with personal social relations and the coarse granularity of existing relation categories, this paper uses a simple annotation scheme to label eight main types of personal social relations. To reduce the dimensionality of the feature vectors, avoid the curse of dimensionality, and remove noisy features so as to improve the accuracy of relation extraction, it proposes a feature selection method that combines verb and noun extraction with the chi-square (χ², CHI) statistic, and uses TF-IDF to compute feature weights. In experiments with an SVM classifier, both the F-score and the accuracy improved after feature selection. To make full use of the data set when evaluating the feature selection method, its effectiveness was further verified with k-fold cross-validation. The experimental results show that the classification model produced by this method has strong discriminative power and generalization ability.
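The CHI step of the feature selection described above can be illustrated with a small sketch. It is not the authors' implementation: it assumes the standard 2x2 contingency-table form of the chi-square statistic, χ²(t, c) = N(AD − CB)² / ((A + C)(B + D)(A + B)(C + D)), and it assumes the documents have already been reduced to verbs and nouns by a part-of-speech tagger. The function names, the toy tokens, and the two relation labels are hypothetical.

```python
from collections import Counter


def chi_square(a, b, c, d):
    """CHI score of a term for a class from a 2x2 contingency table.

    a: docs in the class that contain the term
    b: docs outside the class that contain the term
    c: docs in the class that lack the term
    d: docs outside the class that lack the term
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        return 0.0  # term or class is degenerate; no evidence either way
    return n * (a * d - c * b) ** 2 / denom


def chi_scores(docs, labels, target):
    """Score every term against one target class.

    docs: list of token lists (assumed to hold only verbs and nouns)
    labels: relation label of each doc
    """
    n_docs = len(docs)
    n_target = sum(1 for y in labels if y == target)
    df_all = Counter()     # document frequency over all docs
    df_target = Counter()  # document frequency inside the target class
    for tokens, y in zip(docs, labels):
        for term in set(tokens):
            df_all[term] += 1
            if y == target:
                df_target[term] += 1
    scores = {}
    for term, df in df_all.items():
        a = df_target[term]
        b = df - a
        c = n_target - a
        d = n_docs - n_target - b
        scores[term] = chi_square(a, b, c, d)
    return scores


# Hypothetical toy data; the tokens and labels are not from the paper's corpus.
docs = [
    ["marry", "wife"],
    ["marry", "husband"],
    ["teach", "student"],
    ["teach", "class"],
]
labels = ["spouse", "spouse", "teacher", "teacher"]

scores = chi_scores(docs, labels, "spouse")
# Keep the two highest-scoring terms as the selected features.
top_terms = sorted(scores, key=scores.get, reverse=True)[:2]
```

Because the statistic squares the difference AD − CB, a term that never occurs in the target class scores as high as one that occurs only in it. In this toy data both "marry" and "teach" are selected for the "spouse" class. The selected terms would then be weighted with TF-IDF before being passed to the classifier.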