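The rapid growth of Web data places heavy storage and network-service pressure on search engines, and large volumes of redundant, low-quality, or even spam data waste a great deal of their storage and computing capacity. Under these circumstances, how to build a Web page quality evaluation framework and evaluation algorithms suited to the real-world Web environment has become an important research topic in information retrieval. Building on previous work, and with the participation of Web users and Web page designers, this paper proposes a Web page quality evaluation framework comprising thirteen factors in four dimensions: authority and reputation, content, timeliness, and visual presentation. The annotation data show that the framework is highly operable and that the annotation results are fairly consistent. Finally, the paper uses an Ordinal Logistic Regression model to analyze the importance of each dimension of the framework and draws some instructive conclusions: whether the content and timeliness of a Web page meet users' needs is an important factor in determining its quality.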
The rapid growth of Web data poses a great challenge to search engines in both storage and service quality. The existence of low-quality Web pages, or even spam pages, increases the cost of crawling, indexing, and storage in search engines. This paper presents a measure of Web page quality with 4 dimensions: authority, content, timeliness, and appearance. Human assessors were recruited to rate sampled pages using this evaluation framework. The high inter-rater reliability of the ratings shows that the framework is consistent and functional. Finally, Ordinal Logistic Regression analyses were conducted to model the relationship between the 4 core dimensions and the quality of Web pages.