Web pages are becoming increasingly diverse, complex, and non-standardized, which makes information extraction more difficult. To address this, the paper proposes a method for extracting the main content of web pages based on feature text density. The method classifies the text on a page according to its purpose and features, then builds a mathematical model to analyze text proportion and density, so that the topic text can be identified accurately. Both the time and space complexity of the method are low. Experiments show that it effectively extracts the main content of complex web pages and of pages with multiple topic segments, and that it generalizes well.