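Phrase treebanks are an important resource for natural language processing research and applications. Vietnamese currently lacks such a resource, which hinders Chinese-Vietnamese bilingual information processing. This paper proposes a method for building a Vietnamese phrase treebank that combines Vietnamese grammatical features with an improved PCFG (probabilistic context-free grammar). The method automatically derives phrase structure trees for Vietnamese sentences and thereby addresses the problem of automatic treebank construction. First, a set of Vietnamese linguistic features is defined by analyzing the characteristics of the language. Next, the Inside-Outside algorithm is used to obtain the grammar rule set of the PCFG model from a small number of manually annotated Vietnamese phrase trees. Finally, the grammatical feature set is incorporated into the PCFG model as a supplement to the grammar rule set, and the resulting model is used to build the Vietnamese phrase treebank. Experimental results show that the new PCFG model reaches an accuracy of 81.14% on Vietnamese phrase treebank construction, 2% to 3% higher than the conventional PCFG model and a maximum-entropy-based treebank construction method.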
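To make the role of the grammar rule set concrete, the following is a minimal sketch of how a PCFG might be read off a handful of annotated trees and then used to score a sentence. It is not the method described in this paper. It assumes plain relative-frequency estimation from fully bracketed trees and shows only the inside pass of Inside-Outside, with no outside pass or re-estimation. The tree encoding, the labels (`S`, `NP`, `VP`, `V`), the toy Vietnamese sentences, and the function names are all hypothetical, and the grammar is assumed to be in Chomsky normal form.

```python
from collections import defaultdict

# Hypothetical toy treebank: (label, left, right) for binary nodes,
# (label, word) for preterminal nodes.
TREES = [
    ("S", ("NP", "tôi"), ("VP", ("V", "ăn"), ("NP", "cơm"))),
    ("S", ("NP", "tôi"), ("VP", ("V", "đọc"), ("NP", "sách"))),
]


def count_rules(tree, counts):
    """Count every rule LHS -> RHS that occurs in one tree."""
    label, children = tree[0], tree[1:]
    if len(children) == 1 and isinstance(children[0], str):
        rhs = (children[0],)  # lexical rule, e.g. NP -> tôi
    else:
        rhs = tuple(child[0] for child in children)
        for child in children:
            count_rules(child, counts)
    counts[label][rhs] += 1


def estimate_pcfg(trees):
    """Relative-frequency estimate: P(A -> rhs) = count(A -> rhs) / count(A)."""
    counts = defaultdict(lambda: defaultdict(int))
    for tree in trees:
        count_rules(tree, counts)
    probs = {}
    for lhs, rhs_counts in counts.items():
        total = sum(rhs_counts.values())
        for rhs, c in rhs_counts.items():
            probs[(lhs, rhs)] = c / total
    return probs


def inside_probability(words, probs, start="S"):
    """Inside pass: total probability that `start` derives `words` (CNF grammar)."""
    n = len(words)
    # chart[i][j][A] = P(A derives words[i:j])
    chart = [[defaultdict(float) for _ in range(n + 1)] for _ in range(n + 1)]
    for i, word in enumerate(words):
        for (lhs, rhs), p in probs.items():
            if rhs == (word,):
                chart[i][i + 1][lhs] += p
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):  # split point
                for (lhs, rhs), p in probs.items():
                    if len(rhs) != 2:
                        continue
                    left = chart[i][k].get(rhs[0], 0.0)
                    right = chart[k][j].get(rhs[1], 0.0)
                    if left and right:
                        chart[i][j][lhs] += p * left * right
    return chart[0][n].get(start, 0.0)


if __name__ == "__main__":
    grammar = estimate_pcfg(TREES)
    print(inside_probability(["tôi", "ăn", "cơm"], grammar))
```

In a full Inside-Outside setup, these inside scores would be combined with outside scores to re-estimate rule probabilities iteratively. The sketch stops at the counting step because the abstract does not specify the training details.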
A phrase treebank is an important resource for natural language processing (NLP) research and practical applications. Vietnamese still lacks this kind of treebank, which makes Chinese-Vietnamese bilingual information processing difficult to carry out. This paper presents a method for constructing a Vietnamese phrase treebank by fusing Vietnamese grammatical features with an improved PCFG (probabilistic context-free grammar) model. Such a treebank is a necessary resource both for linguistic research in general and for developing real NLP applications. The method automatically analyzes Vietnamese phrase structure trees and thereby addresses the problem of constructing the Vietnamese phrase treebank. First, a Vietnamese grammatical feature set is established by analyzing the grammatical features of Vietnamese. Then, the grammar rule set of the PCFG model is obtained from manually annotated Vietnamese phrase trees. At the same time, the traditional PCFG model is improved by adding more contextual semantic information, namely the pre co-occurrence probability and the post co-occurrence probability. Finally, the Vietnamese grammatical feature set is fused into the improved PCFG model as a supplement, and the resulting model is used to construct the Vietnamese phrase treebank. The final improved PCFG model obtains good results for Vietnamese syntactic analysis: it not only improves accuracy but also reduces parsing time. Automatic syntactic analysis of Vietnamese in turn supports the construction of the Vietnamese phrase treebank. The experimental results show that the accuracy of the proposed PCFG model for Vietnamese phrase treebank construction reaches 81.14%. Compared with the conventional PCFG model and the maximum
entropy method,t