With the spread of the Internet and the continued development of Tibetan information technology, a large number of Tibetan-language websites have appeared. This paper identifies Tibetan Web pages by the characteristic Tibetan "syllable dot" and crawls them. After building a DOM tree for each page, it analyzes how strongly the page's links and non-link text relate to the topical information blocks, and extracts the Tibetan topical information with a semantic pruning algorithm. Test results show that the algorithm adapts well to both Tibetan Web page identification and Tibetan topical information extraction.
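
As an illustration of the identification step only, the following is a minimal sketch of a syllable-dot test. It assumes the syllable dot corresponds to the Unicode character U+0F0B (TIBETAN MARK INTERSYLLABIC TSHEG) and that a page can be classified by the share of that character in its decoded text. The function names and the 0.1 threshold are hypothetical and are not taken from the paper.

```python
# Assumption: the "syllable dot" is U+0F0B (TIBETAN MARK INTERSYLLABIC TSHEG).
TSHEG = "\u0f0b"


def tsheg_ratio(text: str) -> float:
    """Return the fraction of non-whitespace characters that are syllable dots."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    return sum(c == TSHEG for c in chars) / len(chars)


def looks_tibetan(text: str, threshold: float = 0.1) -> bool:
    """Heuristic page classifier; the threshold is an illustrative guess."""
    return tsheg_ratio(text) >= threshold


if __name__ == "__main__":
    sample = "\u0f56\u0f7c\u0f51\u0f0b\u0f61\u0f72\u0f42\u0f0b"  # two Tibetan syllables
    print(looks_tibetan(sample))
    print(looks_tibetan("Plain English text with no syllable dots."))
```

In a crawler, a check of this kind would be applied to the decoded text of each fetched page before its links are queued. A real system would also need to handle legacy, non-Unicode Tibetan font encodings, which this sketch does not cover.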