Text segmentation is an important problem in information retrieval. It is the automatic identification of boundaries between units (segments) that each carry an independent meaning, in a written document or a speech sequence; the material to be segmented may be written text, speech, or dynamic text. The main goal of linear text segmentation is to locate topic boundaries, which is valuable for many natural language processing tasks such as automatic summarization and question answering. Drawing on a large body of literature, this paper summarizes the main approaches to linear text segmentation and proposes directions for future research.
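To make the task concrete, the following is a minimal sketch of linear segmentation by lexical cohesion: adjacent blocks of sentences are compared, and a topic boundary is placed wherever their vocabulary overlap drops below a threshold. It is an illustration only, not one of the specific methods surveyed here; the function names, the `window` size, and the `threshold` value are all assumptions chosen for the example.

```python
import math
import re
from collections import Counter


def bag_of_words(sentence):
    """Lowercase a sentence and count its alphabetic word tokens."""
    return Counter(re.findall(r"[a-z]+", sentence.lower()))


def cosine(a, b):
    """Cosine similarity between two word-count vectors (0.0 if either is empty)."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def segment(sentences, window=2, threshold=0.1):
    """Return each index i where a topic boundary falls just before sentences[i].

    `window` (block size in sentences) and `threshold` (minimum similarity
    for two blocks to count as the same topic) are illustrative assumptions.
    """
    boundaries = []
    for i in range(1, len(sentences)):
        # Vocabulary of the block ending just before the candidate gap.
        left = Counter()
        for s in sentences[max(0, i - window):i]:
            left.update(bag_of_words(s))
        # Vocabulary of the block starting just after the candidate gap.
        right = Counter()
        for s in sentences[i:i + window]:
            right.update(bag_of_words(s))
        if cosine(left, right) < threshold:
            boundaries.append(i)
    return boundaries


sentences = [
    "The cat sat on the mat.",
    "The cat chased the mouse across the mat.",
    "Stock markets fell sharply today.",
    "Investors sold stock as markets fell.",
]
print(segment(sentences))  # prints [2]
```

Here the only boundary falls before the third sentence, where the text moves from the cat to the stock market. A practical segmenter would at least remove stop words and smooth the similarity curve before picking boundaries.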