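Hedges are commonly used to express uncertainty and possibility, and the information they introduce is hedged information. To extract factual information from Chinese text, hedged information must be distinguished from factual information. However, Chinese hedge corpus resources are very scarce, which has hindered research on detecting Chinese hedges and hedged information. This paper studies the categorization of Chinese hedges and designs and constructs a Chinese hedge corpus of 24,000 sentences covering two domains, biomedicine and Wikipedia. We statistically analyze the annotation agreement of the corpus and the relationship between hedge types and domains. These resources are of great significance for research on Chinese hedged-information detection and for Chinese factual information extraction. They also provide strong knowledge-base support for linguists studying hedges from semantic, pragmatic, and other perspectives.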
Hedges are commonly used to express uncertainty and possibility. When authors cannot back up their statements, they usually use hedges to mark the information as uncertain. To avoid extracting uncertain statements as factual information, uncertain information should be distinguished from factual information. However, the lack of Chinese hedge corpora has limited research on Chinese hedges. This paper discusses the categorization of Chinese hedges and introduces the design and construction of a 24,000-sentence Chinese hedge corpus in the biomedical and Wikipedia domains. We calculate agreement rates for the corpus and reveal the domain and genre dependency of hedges. The construction of the corpus is of great significance for research on Chinese hedge detection and Chinese information extraction. The resource also provides strong support for linguists studying semantic and pragmatic hedges.