从词向量的训练模式入手,研究了基于语料语句分割(BWP)算法,分隔符分割(BSP)算法以及属性主题分割(BTP)算法三种分割情况下的词向量训练结果的优劣。研究发现,由于评论短文本的自身特征,传统的无分割(NP)训练方法,在词向量训练结果的准确率和相似度等方面与BWP算法、BSP算法以及BTP算法具有明显的差异。通过对0.7亿条评论短文本进行词向量构建实验对比后发现,该文所提出的BTP算法在同义词(属性词)测试任务上获得的结果是最佳的,因此BTP算法对于优化评论短文本词向量的训练,评论短文本属性词的抽取以及情感倾向分析等在内的,以词向量为基础的应用研究工作具有较为重要的实践意义。同时,该文在超大规模评论语料集上构建的词向量(开源)对于其他商品评论文本分析的应用任务具有较好可用性。
We propose a method for Word2vec training on the short review textshy a partition according to the topic. We examine three kinds of partition methods, i.e. Based on Whole-review (BWP), Based on sentence-Separator (BSP) and Based on Topic(BTP), to improve the result of Word2vec training. Our findings suggest that there is a big difference on accuracy and similarity rates between the None Partition Model (NP) and BWP, BSP, BTP, due to the characteristic of the review short text. Experiment on various models and vector dimensions demonstrate that the result of word vector trained by Word2vec model has been greatly enhanced by BTP.