

An Exploration of Chinese Word Segmentation Algorithms Based on Representation Learning

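ISSN: 1003-0077
Journal: 中文信息学报 (Journal of Chinese Information Processing)
Year: 2013
Pages: -
Classification: TP391 [Automation and Computer Technology: Computer Application Technology; Automation and Computer Technology: Computer Science and Technology]
Author affiliation: [1] National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, Beijing 100190
Funding: National Natural Science Foundation of China (61070106, 61272332, 61202329); National High Technology Research and Development Program of China (863 Program) (2012AA011102); National Basic Research Program of China (973 Program) (2012CB316300); Open Project of the Beijing Key Laboratory of Internet Culture and Digital Dissemination Research (ICDD201201)
Related project: Research on Key Technologies for Chinese Entity Knowledge Mining in the Internet Environment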

Keywords: representation learning (表示学习), Chinese word segmentation (中文分词)

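Chinese abstract (translated):

Word segmentation is a key foundational technology in Chinese natural language processing. The mainstream approach today is to learn to identify word boundaries with character-based statistical machine learning methods. However, traditional machine learning methods depend heavily on manually designed features, and verifying the effectiveness of those features requires repeated trial and revision, which is time-consuming and laborious. The rise of neural-network-based representation learning has made it possible to learn features automatically. This paper explores a Chinese word segmentation method based on representation learning. Semantic vectors for Chinese characters are first learned from a large-scale corpus without supervision, and these character vectors are then applied to supervised, neural-network-based Chinese word segmentation. Experiments show that representation learning is an effective approach to Chinese word segmentation; however, we also find that, owing to limitations such as corpus size, representation learning cannot yet fully replace traditional supervised machine learning methods based on manually designed features.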

English abstract:

Word segmentation is a fundamental technology of Chinese natural language processing. Using character-based statistical machine learning methods to perform Chinese word segmentation is currently the main trend. However, conventional machine learning methods rely heavily on manually designed features, and modifying these features and verifying their effectiveness requires intensive labor. With the rapid development of neural-network-based representation learning, it has become realistic to learn features automatically. This paper investigates a Chinese word segmentation method based on representation learning. We first learn embedding vectors for Chinese characters from a large corpus in an unsupervised manner, and then apply them to supervised, neural-network-based Chinese word segmentation. Experimental results show that representation learning is an effective method for Chinese word segmentation. However, due to the limitation of corpus size, it still cannot replace conventional machine learning methods that are based on manually designed features.
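
The character-based formulation that the abstract refers to can be illustrated with a short Python sketch. It is not the paper's implementation: the B/M/E/S tag set, the window size, the `<PAD>` symbol, and every function name below are assumptions made for illustration, since the abstract specifies none of them. The sketch shows how a segmented sentence maps to per-character boundary tags, how tags map back to words, and which characters a window-based model might look up embedding vectors for.

```python
from typing import List


def words_to_tags(words: List[str]) -> List[str]:
    """Convert a segmented sentence into per-character tags.

    Assumed tag set (not taken from the paper):
    B = word begin, M = word middle, E = word end, S = single-character word.
    """
    tags: List[str] = []
    for word in words:
        if len(word) == 1:
            tags.append("S")
        else:
            tags.extend(["B"] + ["M"] * (len(word) - 2) + ["E"])
    return tags


def tags_to_words(chars: str, tags: List[str]) -> List[str]:
    """Rebuild the word sequence from characters and their B/M/E/S tags."""
    words: List[str] = []
    current = ""
    for ch, tag in zip(chars, tags):
        current += ch
        if tag in ("E", "S"):  # a word boundary follows this character
            words.append(current)
            current = ""
    if current:  # tolerate a tag sequence that stops mid-word
        words.append(current)
    return words


def context_window(chars: str, i: int, size: int = 2,
                   pad: str = "<PAD>") -> List[str]:
    """Return the characters in a window centered on position i.

    In a neural tagger, the embedding vectors of these characters would
    typically be concatenated to form the input for predicting the tag
    of character i. Positions outside the sentence are padded.
    """
    return [chars[j] if 0 <= j < len(chars) else pad
            for j in range(i - size, i + size + 1)]
```

Under these assumptions, a supervised model only has to predict one tag per character; the segmentation itself then follows deterministically from `tags_to_words`.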
