Compared with other languages such as English, Chinese dependency treebanks still lag behind in both scale and quality. Annotating a treebank requires substantial human and material resources, and guaranteeing annotation quality is difficult. This paper explores a method that combines rule-based and statistical techniques to convert a constituent treebank, the Penn Chinese Treebank, into the annotation scheme of the HIT Chinese Dependency Treebank (HIT-IR-CDT), thereby enlarging the available dependency treebank. The converted trees are added to HIT-IR-CDT, and a dependency parser is then retrained and evaluated on the enlarged data. Experiments show that adding a small amount of converted data improves parser performance, whereas adding a large amount degrades it. A detailed analysis suggests that constituent-to-dependency conversion, as a way of improving dependency parsing by exploiting multiple treebanks, still requires further in-depth research.
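To make the rule-based side of constituent-to-dependency conversion concrete, the following is a minimal sketch of generic head-rule conversion. It is not the paper's method: the head-rule table `HEAD_RULES`, the tuple-based tree encoding, and the function names are assumptions introduced purely for illustration, and the statistical component and the mapping to HIT-IR-CDT relation labels are not modeled. Only the phrase and part-of-speech labels (IP, VP, NP, VV, NN, NR) are taken from the Penn Chinese Treebank tag set.

```python
# Toy head rules (hypothetical, for illustration only): for each phrase label,
# a search direction and a priority list of child labels.
HEAD_RULES = {
    "IP": ("right", ["VP", "IP"]),
    "VP": ("left", ["VV", "VP"]),
    "NP": ("right", ["NN", "NR", "NP"]),
}


def is_leaf(node):
    # A leaf is (POS, word): exactly two elements, the second one a string.
    return len(node) == 2 and isinstance(node[1], str)


def find_head_child(label, children):
    """Return the position of the head child according to HEAD_RULES."""
    direction, priorities = HEAD_RULES.get(label, ("right", []))
    order = list(range(len(children)))
    if direction == "right":
        order.reverse()
    for wanted in priorities:
        for i in order:
            if children[i][0] == wanted:
                return i
    # Fallback: the first child encountered in the search direction.
    return order[0]


def convert(tree):
    """Convert a constituent tree to unlabeled dependencies.

    Returns (words, heads), where heads[i] is the 0-based index of the
    head of word i, or -1 for the root word.
    """
    words, heads = [], []

    def visit(node):
        if is_leaf(node):
            words.append(node[1])
            heads.append(-1)
            return len(words) - 1
        label, children = node[0], node[1:]
        lexical_heads = [visit(child) for child in children]
        head = lexical_heads[find_head_child(label, children)]
        # Every non-head child's lexical head depends on the head child's.
        for idx in lexical_heads:
            if idx != head:
                heads[idx] = head
        return head

    visit(tree)
    return words, heads


# Example bracketing in Penn Chinese Treebank style (hand-written, not real data).
example_tree = (
    "IP",
    ("NP", ("NR", "张三")),
    ("VP", ("VV", "喜欢"), ("NP", ("NN", "音乐"))),
)
example_words, example_heads = convert(example_tree)
```

In this example the verb 喜欢 becomes the root and both nouns attach to it. A real conversion would additionally need a complete head-rule table and a way to assign dependency relation labels in the target scheme, which is where a statistical component could plausibly be applied.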