As the number of blogs and the volume of information they carry grow exponentially, information retrieval and knowledge discovery in the blogosphere face major challenges, and the format peculiar to blogs further complicates blog-based data mining. In this paper, we investigate the problem of profiling a blog's interests by selecting its m most representative entries. The selection follows two principles: the importance of each chosen entry and the topical diversity of the chosen set. We combine the two into an entry evaluation function and formulate entry selection as an instance selection optimization problem. We evaluate the proposed method on blog classification; the experiments indicate that preprocessing blogs in this way reduces time complexity and improves classification accuracy.
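The idea of trading importance against diversity when picking m entries can be illustrated with a minimal greedy sketch. This is not the paper's evaluation function, which the abstract does not specify. Everything below is an assumption made for illustration: entries are plain whitespace-tokenized strings, importance is taken to be cosine similarity to the blog's term centroid, diversity is taken to be one minus the highest similarity to an already selected entry, and the names `cosine`, `select_entries`, and `trade_off` are hypothetical.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def select_entries(entries, m, trade_off=0.5):
    """Greedily pick up to m entry indices (hypothetical scoring, see lead-in).

    trade_off weights importance against diversity; 0.5 is an arbitrary
    illustrative default, not a value taken from the paper.
    """
    # Assumed representation: bag-of-words term counts per entry.
    vectors = [Counter(text.lower().split()) for text in entries]

    # Assumed importance: similarity of an entry to the whole blog's centroid.
    centroid = Counter()
    for vec in vectors:
        centroid.update(vec)
    importance = [cosine(vec, centroid) for vec in vectors]

    selected = []
    candidates = list(range(len(entries)))
    while candidates and len(selected) < m:
        def score(i):
            # Assumed diversity: penalize similarity to entries already chosen.
            redundancy = max(
                (cosine(vectors[i], vectors[j]) for j in selected),
                default=0.0,
            )
            return trade_off * importance[i] + (1 - trade_off) * (1 - redundancy)

        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


if __name__ == "__main__":
    blog = [
        "python code tips",
        "python code tips",
        "garden flowers soil",
    ]
    print(select_entries(blog, 2))
```

In this toy blog the second entry duplicates the first, so the diversity term steers the second pick toward the gardening entry even though its importance score is lower. A greedy pass of this kind only approximates the optimum of a combined objective; the optimization procedure the authors actually use may differ.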