With the continuing development and improvement of natural language information processing, large-scale corpus text processing has become a hot topic in computational linguistics. One important reason is that the required knowledge can be extracted from large-scale corpora. Drawing on the development experience of the 973 pre-research project "Research on Word Segmentation and Tagging Standards for Tibetan Corpora" (《藏文语料库分词标注规范研究》), this paper describes the construction of the large-scale Banzhida Tibetan corpus and the design and implementation of its word segmentation and tagging dictionary and of the segmentation and tagging software. It focuses on the index structure and lookup algorithm of the dictionary, as well as the case-particle block matching algorithm and the restoration algorithm of the segmentation and tagging software.