A Graph-Based Method for Automatic Acquisition of Synonym Sets

Journal: Journal of Computer Research and Development (计算机研究与发展)
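Pages: 610-616
Language: Chinese
Classification: TP18 [Automation and Computer Technology: Control Science and Engineering; Automation and Computer Technology: Control Theory and Control Engineering]
Author affiliations: [1] Key Laboratory of Computational Linguistics (Ministry of Education), Peking University, Beijing 100871; [2] Institute of Computational Linguistics, Peking University, Beijing 100871; [3] Laboratory of Intelligent Information Processing and Application, Leshan Normal University, Leshan, Sichuan 614000
Funding: National Natural Science Foundation of China (60703063, 61003206); Chiang Ching-kuo Foundation for International Scholarly Exchange, year-98 grant (RG013-D-09)
Related project: Word Sense Similarity Computation Combining Distributional Similarity and Chinese Word-Formation Features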

Keywords: similar words, semantic class (synonym set), graph model, coordinate structure, Newman algorithm, edge weight

Chinese abstract (in translation):

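Synonym sets are an important kind of basic linguistic knowledge, and acquiring them automatically from large-scale corpora is a fundamental research topic in natural language processing. Word pairs linked by coordinate structures are extracted automatically from a large corpus and used to build a graph; the Newman algorithm then partitions the graph, which clusters similar words automatically. The work concentrates on building on the Newman algorithm by exploiting the properties of coordinate structures and of Chinese word formation, using six methods to adjust the edge weights in the graph and improve the results: splitting the corpus, removing low-frequency edges, boosting bidirectional edges, boosting cliques, boosting edges whose words share the same final character, and penalizing edges whose words differ in syllable count. The precision of automatic synonym-set acquisition rises from an initial 23.28% to 53.12%, an improvement of about 30 percentage points.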
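
The pipeline described above (coordinate word pairs, a weighted graph, partitioning with the Newman algorithm) could be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes "Newman algorithm" refers to Newman's greedy agglomerative modularity maximization, it recomputes modularity naively at every step instead of using the efficient incremental updates, and the word pairs are made up. The names `build_graph`, `modularity`, and `greedy_modularity_clusters` are hypothetical.

```python
from collections import defaultdict


def build_graph(pairs):
    """Count coordinate word pairs into an undirected weighted graph."""
    weights = defaultdict(float)
    for a, b in pairs:
        if a != b:
            weights[frozenset((a, b))] += 1.0
    return dict(weights)


def modularity(weights, community_of):
    """Weighted modularity Q = sum_c [L_c / m - (d_c / 2m)^2]."""
    total = sum(weights.values())
    internal = defaultdict(float)  # edge weight inside each community
    degree = defaultdict(float)    # summed node strength per community
    for edge, w in weights.items():
        a, b = tuple(edge)
        degree[community_of[a]] += w
        degree[community_of[b]] += w
        if community_of[a] == community_of[b]:
            internal[community_of[a]] += w
    return sum(internal[c] / total - (degree[c] / (2 * total)) ** 2
               for c in degree)


def greedy_modularity_clusters(weights):
    """Merge connected communities while a merge still raises modularity."""
    nodes = sorted({n for edge in weights for n in edge})
    community_of = {n: n for n in nodes}  # start with singleton communities
    best_q = modularity(weights, community_of)
    while True:
        # Only communities joined by at least one edge are merge candidates.
        candidates = set()
        for edge in weights:
            a, b = tuple(edge)
            ca, cb = community_of[a], community_of[b]
            if ca != cb:
                candidates.add((min(ca, cb), max(ca, cb)))
        best_merge = None
        for keep, drop in sorted(candidates):
            trial = {n: (keep if c == drop else c)
                     for n, c in community_of.items()}
            q = modularity(weights, trial)
            if q > best_q + 1e-12:
                best_q, best_merge = q, trial
        if best_merge is None:
            break  # no merge improves modularity any further
        community_of = best_merge
    clusters = defaultdict(set)
    for n, c in community_of.items():
        clusters[c].add(n)
    return sorted(sorted(members) for members in clusters.values())


# Made-up coordinate pairs: two tight groups plus one spurious bridging pair.
pairs = [("苹果", "梨"), ("梨", "桃"), ("苹果", "桃"),
         ("北京", "上海"), ("上海", "广州"), ("北京", "广州"),
         ("桃", "北京")]
graph = build_graph(pairs)
clusters = greedy_modularity_clusters(graph)
print(clusters)
```

Stopping when no merge raises modularity is what separates the two groups in this toy input: the single bridging pair is too weak to justify merging them.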

English abstract:

A semantic class is a collection of terms that share a similar meaning. Knowing the semantic classes of words can be extremely valuable for many natural language processing tasks. This paper investigates the use of linguistic knowledge in graph-based acquisition of Chinese semantic classes and demonstrates that such knowledge improves the graph-based method. The corpus is the Xinhua News portion of the LDC Chinese Gigaword. A graph is built by extracting word pairs in coordinate structures from the corpus, with the co-occurring words as nodes and the co-occurrence frequency of two words as the weight of the edge between them. The Newman algorithm is then applied to cluster the words in the graph. The paper focuses on transforming the edge weights, motivated by the properties of coordinate structures and of the Chinese language. Six methods are presented: dividing the corpus into smaller parts, cutting low-frequency edges, increasing the weight of bidirectional edges, increasing the weight of edges within cliques, increasing the weight of edges whose two nodes share the same last character, and reducing the weight of edges whose two nodes have different numbers of characters. With the six methods combined, the experiments yield a precision of 53.12%, which outperforms the baseline Newman algorithm (23.28%) by 29.84 percentage points.
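
Four of the six edge-weight adjustments can be illustrated with a short sketch. The abstract gives no thresholds or scaling factors, so `min_count`, `bidir_boost`, `suffix_boost`, and `length_penalty` are hypothetical parameters with placeholder values, and the multiplicative form is an assumption. "Bidirectional" is read here as the pair being observed in both orders, and the number of characters stands in for syllable count, which holds for words written entirely in Chinese characters. Corpus splitting and clique boosting are left out.

```python
def reweight_edges(directed_counts, min_count=2, bidir_boost=2.0,
                   suffix_boost=1.5, length_penalty=0.5):
    """Turn directed pair counts into adjusted undirected edge weights.

    directed_counts maps (first_word, second_word) to how often the two
    words appeared in that order inside a coordinate structure.
    All default parameter values are placeholders, not published settings.
    """
    weights = {}
    for (a, b), count in directed_counts.items():
        if a == b:
            continue
        edge = frozenset((a, b))
        if edge in weights:
            continue  # already handled when the reverse order was seen
        reverse = directed_counts.get((b, a), 0)
        weight = float(count + reverse)
        if weight < min_count:
            continue  # remove low-frequency edges
        if count > 0 and reverse > 0:
            weight *= bidir_boost  # pair attested in both orders
        if a[-1] == b[-1]:
            weight *= suffix_boost  # same final character
        if len(a) != len(b):
            weight *= length_penalty  # unequal number of characters
        weights[edge] = weight
    return weights


# Made-up counts for illustration only.
directed_counts = {
    ("北京", "上海"): 3, ("上海", "北京"): 2,  # seen in both orders
    ("工人", "农民"): 4,                       # one order only
    ("大学", "小学"): 2,                       # shared final character
    ("老师", "留学生"): 2,                     # two vs. three characters
    ("桌子", "天空"): 1,                       # below the frequency cutoff
}
adjusted = reweight_edges(directed_counts)
for edge, weight in adjusted.items():
    print(sorted(edge), weight)
```

The adjusted weights would then replace the raw co-occurrence counts as input to the graph partitioning step.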

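Projects associated with this journal paper

Word Sense Similarity Computation Combining Distributional Similarity and Chinese Word-Formation Features

Journal papers: 3; conference papers: 4

Automatic Construction of a Large-Scale Sense-Tagged Corpus Based on Distinctive Features of Words

Journal papers: 10; conference papers: 7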

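Journal papers from the same project

A Brief Introduction to Terminology Related to Word Sense Disambiguation

Word Sense Disambiguation Research: Resources, Methods, and Evaluation

Automatic Identification of Prepositional Phrases Based on SVM with Multi-Feature Fusion

Chinese Word Sense Disambiguation with Multi-Classifier Ensembles

A Survey of Sense-Tagged Corpus Construction

A Study of Modifier-Head Constructions with "很" as the Modifier in Modern Chinese

The Hierarchical Structure of "纸张粉碎机" (paper shredder)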

SemEval-2010 Task 18: Disambiguating Sentiment Ambiguous Adjectives

Using Clustering Engine and Selectional Preference to Generate Targets in Conceptual Metonymies