In entity disambiguation, feature text is the text fed to the disambiguation system to represent entity mentions and candidate entities, and its quality has a significant effect on disambiguation performance. This paper studies feature text selection for web text. Four factors are considered together: special characters in the text, the position of the feature text, whether the feature text contains the entity mention, and the length of individual sentences. The text is filtered and cleaned according to these factors to produce the feature text. The selection method is evaluated on two entity ranking models, the Deep Structured Semantic Model (DSSM) and the Vector Similarity Model (VSM). Feature text filtering improves ranking accuracy on DSSM, with gains of 12.2%, 12.3% and 12.2% on P@3, P@5 and P@10, respectively. Among the processing steps, special character handling improves VSM precision by 5.5%. The results indicate that reasonable filtering and cleaning of feature text helps the candidate entity ranking step of entity disambiguation.
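The four selection factors named in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation: the function names (`clean_special`, `select_feature_text`), the regular expressions, and all thresholds (`min_len`, `max_len`, `lead`, `window`) are assumptions chosen for illustration, since the abstract does not give the actual rules or values.

```python
import re

# Assumed patterns for "special characters" in web text (illustrative only).
_URL = re.compile(r"https?://\S+")
_TAG = re.compile(r"<[^>]+>")
# Anything that is not a word character, whitespace, or common punctuation.
_SYMBOLS = re.compile(r"[^\w\s.,;:!?'\-。,;:!?]")
# A sentence is a run of non-terminators followed by optional terminators.
_SENTENCE = re.compile(r"[^.!?。!?]+[.!?。!?]*")


def clean_special(text):
    """Remove URLs, HTML tags and stray symbols, then collapse whitespace."""
    text = _URL.sub(" ", text)
    text = _TAG.sub(" ", text)
    text = _SYMBOLS.sub(" ", text)
    return re.sub(r"\s+", " ", text).strip()


def select_feature_text(text, mention, min_len=10, max_len=200, lead=1, window=0):
    """Keep sentences by position, mention containment and length.

    lead   : number of leading sentences kept regardless of the mention
    window : number of neighbouring sentences kept around each mention hit
    All defaults are hypothetical values, not taken from the paper.
    """
    sentences = [s.strip() for s in _SENTENCE.findall(clean_special(text))]
    sentences = [s for s in sentences if s]
    # Indices of sentences that contain the entity mention (case-insensitive).
    hits = [i for i, s in enumerate(sentences) if mention.lower() in s.lower()]

    selected = []
    for i, s in enumerate(sentences):
        if not (min_len <= len(s) <= max_len):
            continue  # sentence length filter
        in_lead = i < lead  # position filter
        near_mention = any(abs(i - h) <= window for h in hits)
        if in_lead or near_mention:
            selected.append(s)
    return selected


if __name__ == "__main__":
    page = (
        "<p>Michael Jordan played for the Chicago Bulls.</p> "
        "Click here: http://example.com/x ★★★ Ok. "
        "Jordan won six NBA titles. "
        "The weather was nice that day in the city."
    )
    for sentence in select_feature_text(page, "Jordan"):
        print(sentence)
```

The selected sentences would then be concatenated and passed to a ranking model such as DSSM or VSM as the representation of the mention or candidate entity. Cleaning before sentence splitting is a design choice here, so that dots inside URLs do not create spurious sentence boundaries.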