This paper discusses the role of metadata in topic-specific Web information gathering and analyzes the common types of Web metadata. Three types, Href, Anchor Text and Surrounding Text, are identified as the most suitable basis for topic-specific information gathering. Using association rule mining, support and confidence are combined into a criterion for judging relevance. Together with stop-word filtering and relevance-strategy filtering, this yields a method for metadata extraction and iterative topic expansion. Experimental results show that the proposed metadata processing strategy brings the topic-relevant terms into good agreement with the actually relevant terms and reduces both false inclusions and false exclusions, providing a sound basis for topic-specific Web information gathering.
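To make the relevance criterion concrete, the following is a minimal sketch of how support and confidence might be combined to expand a topic from anchor and surrounding text. It is an illustration under stated assumptions, not the algorithm of the paper: the abstract does not say how the two measures are combined, so the sketch simply requires both to pass a threshold. The stop-word list, the threshold values, the sample records and the function names (`tokenize`, `related_terms`, `expand_topic`) are all hypothetical, and the relevance-strategy filtering step is omitted.

```python
from collections import Counter

# Hypothetical stop-word list; the paper's actual list is not given.
STOP_WORDS = {"the", "a", "an", "of", "and", "for", "in", "click", "here"}


def tokenize(text):
    """Lower-case a metadata string and drop stop words."""
    return {w for w in text.lower().split() if w not in STOP_WORDS}


def related_terms(transactions, topic_terms, min_support, min_confidence):
    """Return terms whose rule `term -> topic` passes both thresholds.

    support    = share of all records containing the term and a topic term
    confidence = share of the term's records that also contain a topic term
    """
    n = len(transactions)
    term_count = Counter(w for t in transactions for w in t)
    joint_count = Counter(
        w for t in transactions if t & topic_terms for w in t
    )
    found = {}
    for term, joint in joint_count.items():
        if term in topic_terms:
            continue
        support = joint / n
        confidence = joint / term_count[term]
        # Assumed combination: both measures must reach their threshold.
        if support >= min_support and confidence >= min_confidence:
            found[term] = (support, confidence)
    return found


def expand_topic(records, seed_terms, min_support=0.4, min_confidence=0.8,
                 max_rounds=5):
    """Iteratively add related terms to the topic until nothing new appears."""
    transactions = [tokenize(r) for r in records]
    topic = set(seed_terms)
    for _ in range(max_rounds):
        new = related_terms(transactions, topic, min_support, min_confidence)
        if not new:
            break
        topic |= set(new)
    return topic


# Each record stands for the anchor text plus surrounding text of one link.
records = [
    "the focused crawler download page",
    "focused crawler link queue",
    "crawler download link",
    "download free music",
    "free music page",
]
expanded = expand_topic(records, {"crawler"})
print(sorted(expanded))
```

With this sample data, "focused" and "link" join the seed term "crawler". "download" is rejected because it also appears in off-topic records (low confidence), and "queue" is rejected because it is too rare (low support). Requiring both conditions is one simple way to limit false inclusions without discarding frequent, reliable co-occurring terms.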