Short texts on the Web are brief and share few words with one another, so classification tends to encounter many out-of-vocabulary (OOV) words, which degrades performance. To address this, a classification method based on word-vector similarity is proposed: word vectors are first trained on unlabeled data with an unsupervised method, and OOV words appearing in the test set are then expanded through the similarity between word vectors. Comparison with the baseline system shows that the proposed method improves classification accuracy by 1%-2%, and by more than 10% relative when the training data is small.
Because Web short texts are brief and share few words, many out-of-vocabulary (OOV) words appear, making text classification more difficult. To solve this problem, a new general framework based on word-embedding similarity was proposed. First, word embeddings are trained on unlabeled data with an unsupervised learning method. Second, OOV words are expanded with similar words from the training data by computing the similarity between their embeddings. Comparison with the baseline system shows that the proposed method improves accuracy by 1%-2%, and by more than 10% relative on a small training set.
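The core expansion step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes embeddings are available for OOV words (trained on the large unlabeled corpus) and maps each OOV word to its most similar word in the training vocabulary by cosine similarity; the toy embedding values are invented for the example.

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def expand_oov(word, embeddings, train_vocab):
    """Replace an OOV word with the most embedding-similar
    in-vocabulary word; leave it unchanged if no embedding exists."""
    if word in train_vocab or word not in embeddings:
        return word
    vec = embeddings[word]
    return max(
        train_vocab,
        key=lambda w: cosine(vec, embeddings[w]) if w in embeddings else -1.0,
    )

# Toy 2-d embeddings (hypothetical values for illustration only)
emb = {
    "film":  [0.90, 0.10], "movie": [0.85, 0.15],
    "price": [0.10, 0.90], "cost":  [0.12, 0.88],
}
vocab = {"film", "price"}          # words seen in the (small) training set
print(expand_oov("movie", emb, vocab))  # → film
print(expand_oov("cost", emb, vocab))   # → price
```

In practice the embeddings would come from an unsupervised model trained on the unlabeled corpus, and expansion would be applied to each OOV token in the test set before classification.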