Word segmentation is a fundamental task in Chinese natural language processing. The prevailing approach is character-based statistical machine learning, which learns to decide where word boundaries fall. Conventional machine learning methods, however, rely heavily on manually designed features, and verifying the effectiveness of those features takes repeated trial and revision, which is time-consuming and labor-intensive. The rise of neural-network-based representation learning has made it possible to learn features automatically. This paper explores a Chinese word segmentation method based on representation learning. We first learn embedding vectors for Chinese characters from a large-scale corpus in an unsupervised manner, and then apply these embeddings to supervised, neural-network-based Chinese word segmentation. Experiments show that representation learning is an effective approach to Chinese word segmentation. However, owing to limitations such as corpus size, it cannot yet fully replace conventional supervised machine learning methods based on manually designed features.
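To make the character-based formulation concrete, the following minimal sketch shows how word boundaries can be encoded as one label per character. It assumes the common B/M/E/S tag set (begin, middle, end, single-character word); the abstract does not state which tag set the paper uses, and the function names `words_to_tags` and `tags_to_words` are hypothetical, introduced only for illustration.

```python
def words_to_tags(words):
    """Map a segmented sentence to one tag per character.

    Assumed tag set (not specified in the abstract):
    B = first character of a word, M = middle, E = last, S = single-character word.
    """
    tags = []
    for word in words:
        if not word:
            continue  # skip empty strings so they produce no tags
        if len(word) == 1:
            tags.append("S")
        else:
            tags.extend(["B"] + ["M"] * (len(word) - 2) + ["E"])
    return tags


def tags_to_words(chars, tags):
    """Rebuild words from characters and their (gold or predicted) tags."""
    words, current = [], ""
    for ch, tag in zip(chars, tags):
        current += ch
        if tag in ("E", "S"):  # a word ends at this character
            words.append(current)
            current = ""
    if current:  # flush a trailing word left open by an ill-formed tag sequence
        words.append(current)
    return words


gold = ["中文", "分词", "是", "基础", "技术"]
tags = words_to_tags(gold)             # one tag per character
restored = tags_to_words("".join(gold), tags)
```

Under this encoding a segmenter, whether built on hand-designed features or on learned character embeddings, only has to predict one tag per character; the segmentation is then read off the tag sequence as in `tags_to_words`.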