This paper surveys and discusses recent research on automatic text classification from the perspectives of text representation, feature selection, classification algorithms, commonly used benchmark corpora, and evaluation metrics. It argues that short-text classification and multilingual text classification and organization are newly emerging problems that are both important and urgent. It examines these two problems in depth, together with several other key issues: class imbalance (skewed datasets), hierarchical classification, and the labeled-corpus bottleneck. Finally, the paper summarizes this body of work and outlines directions for future research.