Dongli Research Big Data Discovery System (DRDS)


Web Information Extraction Based on a Mutual Information Metric (基于互信息度量的 Web 信息抽取)

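ISSN: 1000-386X
Journal: Computer Applications and Software (《计算机应用与软件》)
Date: 0
Classification: TP391 [Automation and Computer Technology: Computer Application Technology; Automation and Computer Technology: Computer Science and Technology]
Author affiliations: [1] School of Computers, Guangdong University of Technology, Guangzhou, Guangdong 510006; [2] State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing, Jiangsu 210093
Funding: National Natural Science Foundation of China (61070033, 61100148); Natural Science Foundation of Guangdong Province (9251009001000005, S2011040004804).

Authors: Zhang Qi (张奇)[1], Hao Zhifeng (郝志峰)[1], Wen Wen (温雯)[1], Cai Ruichu (蔡瑞初)[1,2]

Keywords: Information extraction, DOM, Mutual information, Threshold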

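Abstract (translated from the Chinese):

How to extract valuable information from cluttered web pages is an important problem in information retrieval and Web data mining. Exploiting the distribution characteristics that information exhibits across a set of web pages, we propose a Web information extraction method based on a mutual information metric that automatically identifies noise and retains key information. The method parses each web page into a DOM tree and computes the mutual information value of the leaf nodes; it then aggregates the leaf nodes into blocks following the DOM tree structure, recursively computes upward the mutual information value of the 〈body〉 tag, and uses this value as the threshold for separating noise from non-noise. Finally, experiments and comparisons on several well-known Chinese websites demonstrate the effectiveness of the method.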

English abstract:

How to extract valuable information from complex web pages is an important issue in information retrieval and Web data mining. We utilise the distribution characteristics exhibited by the information in a set of web pages and propose a Web information extraction method based on a mutual information metric, which automatically identifies noisy information and keeps the key information. In this method, each web page is parsed into a DOM tree and the mutual information value of its leaf nodes is calculated. The leaf nodes are then aggregated into blocks according to the structure of the DOM tree, and the mutual information value of the 〈body〉 tag is computed recursively upward and used as the threshold to distinguish noise from non-noise. Experiments and comparative results on several well-known domestic websites demonstrate the effectiveness of the proposed method.
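
The pipeline described in the abstract (parse to a DOM tree, score leaf nodes, aggregate upward to 〈body〉, threshold) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not give the scoring formula, so the sketch assumes pointwise mutual information between a leaf's text and its page, estimated over the page set, and assumes a block's value is the mean of its children's values. The names `Node`, `TreeBuilder`, and `extract` are hypothetical, and the parser assumes well-nested markup with no unclosed void elements.

```python
import math
from collections import Counter
from html.parser import HTMLParser


class Node:
    """A minimal DOM node: an element with children, or a text leaf."""

    def __init__(self, tag, text=None):
        self.tag = tag
        self.text = text          # None for elements, a string for text leaves
        self.children = []


class TreeBuilder(HTMLParser):
    """Builds a simplified DOM tree (assumes every start tag has an end tag)."""

    def __init__(self):
        super().__init__()
        self.root = Node("document")
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = Node(tag)
        self.stack[-1].children.append(node)
        self.stack.append(node)

    def handle_endtag(self, tag):
        if len(self.stack) > 1:
            self.stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.stack[-1].children.append(Node("#text", text))


def parse(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def find(node, tag):
    """Depth-first search for the first element with the given tag."""
    if node.tag == tag:
        return node
    for child in node.children:
        hit = find(child, tag)
        if hit is not None:
            return hit
    return None


def leaf_texts(node):
    """All text leaves under a node, in document order."""
    if node.text is not None:
        return [node.text]
    return [t for child in node.children for t in leaf_texts(child)]


def score(node, pmi):
    """Leaf: its PMI value. Element: mean of child scores (assumed aggregation)."""
    if node.text is not None:
        return pmi[node.text]
    scores = [score(child, pmi) for child in node.children]
    return sum(scores) / len(scores) if scores else 0.0


def extract(pages):
    """Return, per page, the leaf texts scoring above that page's <body> value."""
    bodies = [find(parse(html), "body") for html in pages]  # assumes <body> exists
    texts = [set(leaf_texts(body)) for body in bodies]
    df = Counter(t for page in texts for t in page)   # pages containing each text
    total = sum(len(page) for page in texts)          # number of (text, page) pairs
    results = []
    for body, page in zip(bodies, texts):
        # Assumed metric: PMI(t, p) = log(P(t, p) / (P(t) * P(p)))
        # with P(t, p) = 1/total, P(t) = df[t]/total, P(p) = len(page)/total.
        pmi = {t: math.log(total / (df[t] * len(page))) for t in page}
        threshold = score(body, pmi)                  # value propagated up to <body>
        results.append([t for t in leaf_texts(body) if pmi[t] > threshold])
    return results
```

Under these assumptions, text that recurs on every page of a site (navigation links, footers) receives a low value, while page-specific text scores above the 〈body〉 value and is kept. For three pages that share a navigation bar and a footer but differ in their heading and paragraph, `extract` returns only the heading and paragraph of each page.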

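Projects with papers in the same journal

Research on disease-causing gene discovery algorithms based on causal inference

Journal papers: 24; conference papers: 5; awards: 2

Design and analysis of fast multi-class classification algorithms based on support vector machines

Journal papers: 45; conference papers: 19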

Journal papers from the same project

Gaussian kernel-based fuzzy inference systems for high dimensional regression

Higher-order Takagi-Sugeno fuzzy model based on kernel mapping

A Linear Support Higher-Order Tensor Machine for Classification

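An ESSC ensemble algorithm based on real-valued link analysis

A blog post classification method based on analysis of the influence of article elements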

Regularized Gaussian Mixture Model based discretization for gene expression data association mining

Multi-objective Differential Evolution Algorithm based on Adaptive Mutation and Partition Selection

BILGO: Bilateral greedy optimization for large scale semidefinite programming

Example-based learning particle swarm optimization for continuous optimization

Exact algorithm and heuristic for the Closest String Problem

Noisy text clustering and its application to anti-spam filtering

An adaptive class pairwise dimensionality reduction algorithm

Convergence time analysis of ant system algorithm

An Efficient Algorithm for the Longest Cycle

Superior-in-Status Analysis of Improved Genetic Algorithm for GTSP

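Convergence analysis and comparison of evolutionary algorithms based on a relation model

A multi-objective evolutionary algorithm based on direction-partitioned selection search

Quality-metric-driven data aggregation and multidimensional data visualization

A selective K-means clustering ensemble algorithm based on random sampling

A study of a bidirectional-feedback ant colony algorithm for network load balancing

Improvement and proof of the fundamental theorem of drift analysis for evolutionary algorithms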

An Improved Crawler Algorithm Based on Hierarchical Structure Preservation

Causal gene identification using combinatorial V-structure search

An ant colony optimization algorithm with multi-neighborhood descent search for the vehicle routing problem

SVDD-based outlier detection on uncertain data

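A genetic algorithm with distance matrix reconstruction for the second type of GTSP

Gaussian process classification for image data sets

Maximum-margin-based selection of gene expression rules

Factor-analysis-based NBC and its application to slope identification

A Bayesian predictive evolutionary algorithm

Research on an attack-resistant recommendation algorithm combining information entropy and a trust mechanism

Global convergence and premature convergence of the two-membered evolution strategy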

A General Framework of Hierarchical Clustering and Its Applications

Causal discovery on high dimensional data

Product named entity recognition for Chinese query questions based on a skip-chain CRF model

Causal Gene Identif

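Query expansion combining correlation rules and an ontology-weighted graph

A mutual-information-based causal inference algorithm for high-dimensional data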

A Novel Fast Pattern Matching Algorithm

Determining Molecular Predictors of Adverse Drug Reactions with Causality Analysis based on Structur

An improved link analysis based clustering ensemble method

Causal gene identification based on high dimensional causal network discovery

Two novel interestingness measures for gene association rule mining

Software project risk analysis using Bayesian networks with causality constraints

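A gene expression data classification model based on association rules and support vector machines

Web information extraction based on a mutual information metric

A distributed architecture for public opinion analysis systems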

PageRank-based measurement of microblog user influence

A causal inference algorithm for high-dimensional data

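Journal information

Computer Applications and Software (《计算机应用与软件》)
Peking University Core Journal (2011 edition)

Supervising body: Shanghai Academy of Sciences (上海科学院)
Sponsors: Shanghai Institute of Computing Technology (上海市计算技术研究所); Shanghai Development Center of Computer Software Technology (上海计算机软件技术开发中心)
Editor-in-chief: Zhu Sanyuan (朱三元)
Address: 546 Yuyuan Road, Shanghai
Postal code: 200040
Email: cas@sict.stc.sh.cn
Telephone: 021-62254715, 62520070-505

International standard serial number (ISSN): 1000-386X
Domestic serial number (CN): 31-1260/TP
Postal distribution code: 4-379

Awards:
National Chinese Core Journal in Computer Science

Indexed in:
Index Copernicus (Poland); Cambridge Scientific Abstracts (USA); Chinese Science and Technology Core Journals (China); Peking University Core Journals (China; 2000, 2004, and 2011 editions)

Citations: 27463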