With the continuing development and improvement of natural language information processing, large-scale corpus text processing has become a hot topic in computational linguistics. One important reason is that the required knowledge can be extracted from large-scale corpora. Drawing on the development experience of the 973 pre-research project "Research on Word Segmentation and Tagging Standards for Tibetan Corpora" (《藏文语料库分词标注规范研究》), this paper describes the construction of the large-scale Banzhida Tibetan corpus and the design and implementation of its word segmentation and tagging dictionary and of the segmentation and tagging software. It focuses on the index structure and lookup algorithm of the dictionary, as well as the case-particle block matching algorithm and the restoration algorithm of the segmentation and tagging software.