Dongli Research Big Data Discovery System (DRDS)

Design and Implementation of a User-Customized Topic Focused Crawler

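ISSN: 1000-7024
Journal: Computer Engineering and Design (《计算机工程与设计》)
Date: 0
Classification: TP311 [Automation and Computer Technology: Computer Software and Theory; Automation and Computer Technology: Computer Science and Technology]
Author affiliations: [1] Institute of Information Cognition and Intelligent Systems, Department of Electronic Engineering, Tsinghua University, Beijing 100084; [2] Tsinghua National Laboratory for Information Science and Technology, Beijing 100084
Funding: National High Technology Research and Development Program of China (863 Program) (2012AA011004); Tsinghua University Initiative Scientific Research Program (20111081023); National Natural Science Foundation of China Forward-Looking Program (61161140454)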

Authors: Min Yulin (闵钰麟), Huang Yongfeng (黄永峰) [1,2]

Keywords: focused crawler, K-means, best-first strategy, adaptive topic model, user customized topic focused crawler

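Abstract (translated from Chinese):

Traditional focused crawlers cannot perform topical crawling when the topic is unknown or the corresponding training set is missing. To give focused crawlers better topic adaptability, an adaptive topic model based on a clustering algorithm is proposed. It guides a focused crawler through a topical crawl starting from only a small number of initial URLs that share the same (unknown) topic. The initial pages are clustered to obtain a topic center vector, and relevant pages are then sought to update the position of the topic center. URLs are ranked with a best-first strategy, and a user-customized topic focused crawler is built on this model. Comparison experiments verify that a crawler using this model achieves a relatively high harvest rate.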
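The pipeline summarized in the abstract (a topic center derived from seed pages, updated as relevant pages are found, with a best-first frontier) can be illustrated with a minimal sketch. This is not the authors' implementation: a single mean vector stands in for the paper's K-means clustering step, link priority is assumed to be the score of the page the link was found on, and the names `tf_vector`, `AdaptiveTopicModel`, `crawl`, `TOY_WEB`, and the threshold of 0.3 are all hypothetical choices made for illustration.

```python
import heapq
import math


def tf_vector(text):
    """L2-normalised term-frequency vector (whitespace tokenisation is an assumption)."""
    counts = {}
    for token in text.lower().split():
        counts[token] = counts.get(token, 0) + 1
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {t: c / norm for t, c in counts.items()} if norm else {}


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


class AdaptiveTopicModel:
    """Topic center that moves as relevant pages are found (hypothetical name)."""

    def __init__(self, seed_texts, threshold=0.3):
        self.total = {}             # running sum of member vectors
        self.count = 0              # number of member pages
        self.threshold = threshold  # assumed relevance cut-off
        for text in seed_texts:
            self._add(tf_vector(text))

    def _add(self, vec):
        for term, weight in vec.items():
            self.total[term] = self.total.get(term, 0.0) + weight
        self.count += 1

    @property
    def center(self):
        # Mean of all member vectors: a one-cluster simplification of K-means.
        return {t: w / self.count for t, w in self.total.items()}

    def score(self, text):
        return cosine(tf_vector(text), self.center)

    def update(self, text):
        """Absorb the page into the center if it is relevant; report relevance."""
        vec = tf_vector(text)
        if cosine(vec, self.center) < self.threshold:
            return False
        self._add(vec)
        return True


def crawl(seed_urls, fetch, threshold=0.3, max_pages=10):
    """Best-first crawl: expand the link whose parent page scored highest."""
    seeds = [fetch(url) for url in seed_urls]
    model = AdaptiveTopicModel([text for text, _ in seeds], threshold)
    seen, frontier = set(seed_urls), []

    def enqueue(links, priority):
        for link in links:
            if link not in seen:
                seen.add(link)
                heapq.heappush(frontier, (-priority, link))  # max-heap via negation

    for text, links in seeds:
        enqueue(links, model.score(text))

    relevant, visited = [], 0
    while frontier and visited < max_pages:
        _, url = heapq.heappop(frontier)
        text, links = fetch(url)
        visited += 1
        page_score = model.score(text)  # scored before the center is updated
        if model.update(text):
            relevant.append(url)
        enqueue(links, page_score)
    return relevant, visited


# Toy in-memory "web": url -> (page text, outgoing links). Entirely made up.
TOY_WEB = {
    "s1": ("python web crawler spider", ["a", "b"]),
    "s2": ("focused crawler topic spider", ["c"]),
    "a": ("web crawler topic python", ["d"]),
    "b": ("cooking pasta recipe tomato", ["e"]),
    "c": ("spider crawler web", []),
    "d": ("crawler python spider topic", []),
    "e": ("pasta sauce recipe", []),
}

relevant, visited = crawl(["s1", "s2"], TOY_WEB.__getitem__)
print(sorted(relevant), visited)
```

In this toy graph the two cooking pages share no terms with the seed pages, so they score zero and never shift the topic center, while the crawler-related pages are absorbed into it.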

English abstract:

Traditional focused crawlers cannot work without training sets for the corresponding topics. To let a focused crawler adapt to more topics, a clustering-based adaptive topic model is proposed that enables the crawler to work from a few URLs sharing the same topic. The topic vector is obtained by clustering the initial pages, relevant pages are then found and used to update the topic vector, and URLs are ordered with the best-first strategy. Based on the adaptive topic model, a user-customized topic focused crawler is implemented. Finally, an experiment was carried out. The results show that the focused crawler using the adaptive topic model performs well.
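The record reports the experimental result as a harvest rate. As a point of reference, harvest rate is commonly defined as the fraction of fetched pages that are judged relevant to the topic. The sketch below assumes that common definition (the record itself does not spell out the formula), and the function names are hypothetical.

```python
def harvest_rate(relevance_flags):
    """Fraction of fetched pages judged relevant; 0.0 when nothing was fetched."""
    flags = list(relevance_flags)
    return sum(1 for f in flags if f) / len(flags) if flags else 0.0


def running_harvest_rate(relevance_flags):
    """Harvest rate after each fetch, useful for plotting crawl progress."""
    rates, hits = [], 0
    for n, flag in enumerate(relevance_flags, start=1):
        hits += bool(flag)
        rates.append(hits / n)
    return rates


# Made-up relevance judgements for five fetched pages, in fetch order.
flags = [True, False, True, True, False]
print(harvest_rate(flags))
print(running_harvest_rate(flags))
```

Tracking the running value rather than only the final one shows whether a crawler stays on topic as the crawl proceeds or drifts away from it.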

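Project associated with this journal paper

Next-Generation Internet

Journal papers: 40; conference papers: 31

Journal papers from the same project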

Towards Evolvable Internet Architecture-Design Constraints and Models Analysis.

Revisiting the Design of Mega Data Centers: Considering Heterogeneity among Containers

Source Address Validation Improvement (SAVI) Framework

LTTP: An LT-code Based Transport Protocol for Many-to-One Communication in Data Centers.

CRRP: Cost-based Replacement with Random Placement for En-route Caching.

On the Deployability of Inter-AS Spoofing Defenses

A Unified Approach to Routing Protection in IP Networks

Control Theory-Based Load Balancing for Wireless Sensor Network

Guaranteeing Heterogeneous Bandwidth Demand in Multi-tenant Data Center Networks.

Willow: Saving Data Center Network Energy for Network-limited Flows.

An Adaptive Load-Balancing Algorithm for Delay-Sensitive Traffic Flows in IP Networks

Public IPv4-over-IPv6 Access Network

Towards Fast Rerouting-based Energy Efficient Routing

A Dormant Multi-Controller Model for Software Defined Networking

A Hybrid Computing Cluster for High-Energy Physics Data Processing

A bottleneck-free model for P4P

Research Progress in Software-Defined Networking

Configuring IPv4 over IPv6 Networks: Transitioning with DHCP

Tunnel-based IPv6 Transition.

VegaNet Network Virtual Router

A Stable Route Selection Strategy for the Internet

Can P2P Technology Benefit Eyeball ISPs? A Cooperative Profit Distribution Answer.

An Anti-Tracking Source-Location Privacy Protection Protocol in WSNs Based on Path Extension

A Data Aggregation System for Metadata Discovery in the CMS Experiment

Implementing OpenFlow Functionality on Traditional Switches

Reliable Multicast in Data Center Networks

Revisiting the Design of Mega Data Centers: Considering the Heterogeneity among Containers

Load-aware Spectrum Allocation based on Interference Graph Adapting to Radio Characteristics.

Stateless Source Address Mapping for ICMPv6 Pa

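A Joint Adaptation Algorithm for Rate, Mode, and Channel in IEEE 802.11n

A High-Energy Physics Data Processing System Supporting Parallelism across Heterogeneous Clusters

Design and Implementation of a Database-Based File System Management Tool

Design of an Ontology Library Model for Medical Terminology and Its Service System

Research on Energy-Saving Mechanisms for Dense Wireless Networks Based on an AP Substitutability Model

SIONA: A Service- and Information-Oriented Network Architecture

Research Progress in Software-Defined Networking (SDN)

A Method for Defining Trusted Service Attributes of Routing Devices and the Design of a Trusted Routing Protocol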

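Journal information

Computer Engineering and Design (《计算机工程与设计》)
Peking University Core Journal (2011 edition)

Supervising organization: China Aerospace Science and Industry Corporation
Sponsor: Institute 706, Second Academy of China Aerospace Science and Industry Corporation
Editor-in-chief: Tang Mingrui (汤铭瑞)
Address: P.O. Box 142, Sub-box 37, Beijing
Postal code: 100854
Email: ced@china-ced.com
Phone: 010-68389884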

International standard serial number: ISSN 1000-7024
Domestic serial number: CN 11-1775/TP
Postal distribution code: 82-425

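Awards:
Source journal of the Chinese Science Citation Database, source journal of the Chinese Academic Journal Comprehensive Evaluation Database, journal used for Chinese scientific and technical paper statistics and analysis

Indexed in domestic and international databases:
Index Copernicus (Poland), Cambridge Scientific Abstracts (USA), Science Abstracts (UK), China Science and Technology Core Journals (China), Peking University Core Journals (China; 2004, 2008, 2011, and 2014 editions)

Citations: 45616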