Statistics and Analysis of Tibetan Syllable Spelling Errors Based on a Large-Scale Web Corpus

ISSN: 1003-0077
Journal: Journal of Chinese Information Processing (中文信息学报)
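Classification: TP391 [Automation and Computer Technology: Computer Application Technology; Automation and Computer Technology: Computer Science and Technology]
Author affiliation: Institute of Software, Chinese Academy of Sciences, Beijing 100190
Funding: National Natural Science Foundation of China (61202219, 61303165); Informatization Special Project of the Chinese Academy of Sciences (XXH12504-1-10); Major Science and Technology Project of Press and Publication (0610-1041BJNF 2328/23)
Related project: Research on component-based online handwritten Tibetan syllable recognition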

Keywords: Tibetan spell check, spell check, corpus, statistics, Tibetan information processing, Chinese information processing

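Chinese abstract (translated):

This paper examines a Tibetan text corpus collected from the Internet, comprising 190 thousand Tibetan web pages with a total of 4.27 million sentences and 93.28 million syllables, and uses predefined rules to count and analyze the Tibetan syllable spelling errors it contains. The data show that, of the 20 743 distinct Tibetan syllables occurring in the corpus, 9 700 contain spelling errors, or 46.762 8% of all distinct syllables. These erroneous syllables nevertheless occur only 27 427 times in the corpus, just 0.030 8% of all occurrences, which indicates that the text quality of the corpus is quite high. The paper also reports detailed statistics on the share of erroneous syllables taking each form of error, and analyzes four main causes of spelling errors: (1) superfluous vowel signs were entered; (2) a syllable dot or sentence-final space is missing; (3) the same stack or character has several representations; (4) similar-looking characters were used by mistake.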

English abstract:

A large-scale Tibetan text corpus is built, comprising 4.27 million sentences in 190 thousand documents and totaling 93 million syllables. Predefined rules are applied to detect spelling errors, identifying 9 700 misspelt syllable types among the 20 743 Tibetan syllable types occurring in the corpus (46.762 8%). At the token level, however, the corpus is of very high quality, with only 27 427 misspelt syllable tokens, roughly 0.030 8% of the total 93 million syllable tokens. Further analysis identifies four main causes of these spelling errors: extra vowel sign(s); a missing syllable delimiter or sentence delimiter; characters that can be written in different forms; and similar-looking characters.
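
The abstract distinguishes a type-level error rate (the share of distinct syllables that are misspelt) from a token-level error rate (the share of syllable occurrences that are misspelt). The following is a minimal sketch of that distinction, not the authors' implementation: the function `error_rates`, the predicate `is_misspelt`, and the toy syllable strings are all hypothetical, since the abstract does not give the paper's checking rules. The last lines recompute the reported type-level percentage from the two counts in the abstract; the token-level figure is not recomputed because the abstract gives the token total only in rounded form.

```python
from collections import Counter
from typing import Callable, Dict, Iterable


def error_rates(tokens: Iterable[str],
                is_misspelt: Callable[[str], bool]) -> Dict[str, float]:
    """Return type-level and token-level misspelling rates as fractions.

    `is_misspelt` is a caller-supplied predicate (hypothetical here); the
    paper's actual spelling rules are not reproduced in this sketch.
    """
    counts = Counter(tokens)                       # syllable type -> frequency
    bad_types = [s for s in counts if is_misspelt(s)]
    total_types = len(counts)
    total_tokens = sum(counts.values())
    bad_tokens = sum(counts[s] for s in bad_types)
    return {
        "type_rate": len(bad_types) / total_types if total_types else 0.0,
        "token_rate": bad_tokens / total_tokens if total_tokens else 0.0,
    }


# Toy data (assumed, for illustration only): three syllable types, ten tokens.
# "kax" stands in for a rare misspelt form of a frequent syllable.
toy_tokens = ["ka"] * 6 + ["kha"] * 3 + ["kax"]
toy_bad = {"kax"}
rates = error_rates(toy_tokens, lambda s: s in toy_bad)
# One of three types is misspelt, but only one of ten tokens is.

# Type-level percentage recomputed from the counts stated in the abstract.
reported_type_rate = 9700 / 20743 * 100
print(rates, round(reported_type_rate, 4))
```

Even in the toy data the type-level rate is much higher than the token-level rate. This is the pattern the abstract reports: misspelt forms are numerous as distinct types but each occurs rarely.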
