Spam review detection for the automobile domain is divided into four sub-tasks: supporting review detection, irrelevant review detection, advertisement detection, and fake review detection. Both rule-based and machine learning methods are used to identify each of the four spam types. The rule-based method jointly considers automobile domain knowledge, polarity words, and author level information. The machine learning method combines review content features with author information features and trains a maximum entropy (maxent) classifier. Experimental results show that machine learning is well suited to spam review detection in domains that have distinct domain features, numerical feedback information, and clearly labeled data.
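The machine learning method described above can be illustrated with a minimal sketch. It is not the authors' implementation: the lexicons, the feature set, the label subset, and the toy posts are all hypothetical, chosen only to show how content features and an author feature might be combined in a maximum entropy model (equivalently, multinomial logistic regression), trained here with plain gradient ascent.

```python
import math

# All names and data below are illustrative assumptions, not from the paper.
LABELS = ["normal", "supporting", "advertisement"]  # subset of the spam types
DOMAIN_TERMS = {"engine", "mileage", "gearbox", "fuel"}  # assumed domain lexicon
POLARITY_WORDS = {"great", "terrible", "love", "awful"}  # assumed polarity lexicon


def extract_features(text, author_level):
    """Combine content features with an author feature into one vector."""
    tokens = text.lower().split()
    return [
        1.0,                                               # bias term
        float(sum(t in DOMAIN_TERMS for t in tokens)),     # domain terms
        float(sum(t in POLARITY_WORDS for t in tokens)),   # polarity words
        1.0 if any(ch.isdigit() for ch in text) else 0.0,  # crude contact cue
        min(len(tokens), 20) / 20.0,                       # scaled post length
        author_level / 10.0,                               # scaled author level
    ]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def predict_proba(weights, x):
    """Class probabilities under the maxent (softmax) model."""
    return softmax([sum(w * xi for w, xi in zip(row, x)) for row in weights])


def train_maxent(samples, n_classes, epochs=2000, lr=0.3, l2=1e-3):
    """Batch gradient ascent on the L2-regularized log-likelihood."""
    n_features = len(samples[0][0])
    weights = [[0.0] * n_features for _ in range(n_classes)]
    for _ in range(epochs):
        grad = [[0.0] * n_features for _ in range(n_classes)]
        for x, y in samples:
            probs = predict_proba(weights, x)
            for c in range(n_classes):
                err = (1.0 if c == y else 0.0) - probs[c]
                for j, xj in enumerate(x):
                    grad[c][j] += err * xj
        for c in range(n_classes):
            for j in range(n_features):
                step = grad[c][j] / len(samples) - l2 * weights[c][j]
                weights[c][j] += lr * step
    return weights


def classify(weights, text, author_level):
    """Return the most probable label for a post."""
    probs = predict_proba(weights, extract_features(text, author_level))
    return LABELS[probs.index(max(probs))]


# Made-up toy posts: (text, author level, label index into LABELS).
TOY_POSTS = [
    ("the engine is great and fuel mileage is fine on the highway", 6, 0),
    ("gearbox feels awful in traffic but the engine is strong", 7, 0),
    ("up", 1, 1),
    ("bump", 2, 1),
    ("cheap parts call 5551234 now", 1, 2),
    ("best price add qq 998877 today", 1, 2),
]

SAMPLES = [(extract_features(t, lvl), y) for t, lvl, y in TOY_POSTS]
WEIGHTS = train_maxent(SAMPLES, len(LABELS))
```

Calling `classify(WEIGHTS, text, author_level)` on a new post returns one of the labels in `LABELS`. A real system would replace the toy lexicons and posts with labeled forum data and would likely use an established maxent or logistic regression library rather than this hand-written trainer.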