汉语的基本块识别是汉语句法语义自动分析中的重要任务之一.传统的方法大多数直接将汉语基本块识别任务转化成词层面的一个序列标注问题,采用CRF模型来处理.虽然,在许多评测中得到最好的结果,但基于词为标注单位,在实用中受限于自动分词系统以及汉语词特征的稀疏性.为此,该文给出了一种以字为标注单位,以字为原始输入层,来构建汉语的基本块识别的深层神经网络模型,并通过无监督方法,学习到字的C&W和word2vec两种分布表征,将其作为深层神经网络模型的字的表示层的初始输入参数来强化模型参数的训练.实验结果表明,使用五层神经网络模型,以[-3,3]窗口的字的word2vec分布表征,其准确率、召回率和F值分别达到80.74%,73.80%和77.12%,这比基于字的CRF高出约5%.这表明深层神经网络模型在汉语的基本块识别中是有作用的.
Chinese base-chunk identification is an important task for automatically syntactic and semantic analysis. A widely-used strategy is to transform it into a word-level sequence labeling problem, and use models like CRFs to deal with it. Despite its best results in many open evaluations, practical application of such method is limited by accuracy of Chinese word segmentation systems and sparsity of Chinese word features. Therefore, this paper presents a base- chunk identification model based on deep neural network models, which take Chinese character as tagging unit and original input layer. Moreover, Chinese character's C&W distributed representation and word2vec distributed repre- sentation are derived through unsupervised learning models, and they are taken as initial input parameters of deep neural network to improve the training procedure. Experimental results show that the precision, recall and F-meas- ure of our final identification model can achieved 80.74%, 73.80% and 77.12%, respectively, conditioned on a five- layer neural network with feature window of size[--3, 3] and word2vec distributed representation.