A Clustering Algorithm Based on Corpus Characteristics

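ISSN: 1000-9825
Journal: Journal of Software (软件学报)
Date: 2010.11.11
Pages: 2802-2813
Classification: TP18 [Automation and Computer Technology: Control Science and Engineering; Automation and Computer Technology: Control Theory and Control Engineering]
Author affiliations: [1] Key Laboratory of Network, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190; [2] Graduate University of the Chinese Academy of Sciences, Beijing 100049
Funding: Supported by the National Natural Science Foundation of China under Grant No. 60933005; the National Basic Research Program of China (973) under Grant Nos. 2007CB311100 and 2004CB318109; the National High-Tech Research and Development Plan of China (863) under Grant No. 2007AA01Z441
Related project: New Theories and Methods for Web Search and Mining: Research on the Theory and Methods of Web Search and Mining in Support of Public Opinion Monitoring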

Keywords: CADIC (clustering algorithm based on the distributions of intrinsic clusters); text clustering; model misfit; rescaling; information retrieval

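Abstract (translated from the Chinese):

To find an appropriate solution to the model misfit problem, this paper proposes CADIC (clustering algorithm based on the distributions of intrinsic clusters), a clustering algorithm based on the distribution characteristics of the corpus. CADIC implicitly incorporates corpus characteristics into the algorithm framework in the form of rescaling, which gives the algorithm model more flexible adaptability. During clustering, CADIC selects a set of highly discriminative directions to construct the CADIC coordinate system, gathers statistics on the distribution of the intrinsic clusters in that coordinate system in order to construct a rescaling function for each coordinate axis, and implicitly normalizes the corpus distribution through rescaling, thereby improving the effectiveness of clustering decisions. CADIC converges to its final solution iteratively, and its time complexity stays at the same order of magnitude as that of K-means. Experimental results on internationally well-known benchmark corpora show that the basic framework of CADIC is reasonable and that its clustering performance is comparable to that of current state-of-the-art clustering algorithms.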

English abstract:

To find a flexible approach to the model misfit problem, a clustering algorithm based on the distributions of intrinsic clusters (CADIC) is proposed, which implicitly integrates distribution characteristics into the clustering framework by applying rescaling operations. In the clustering process, a set of discriminative directions is chosen to construct the CADIC coordinate system, under which the distribution characteristics are analyzed in order to design rescaling functions. Along every axis, rescaling functions are applied to implicitly normalize the data distribution, so that more reliable clustering decisions can be made. By using a K-means-like iteration strategy, CADIC keeps its time complexity at the same order as that of K-means. Experiments on well-known benchmark evaluation datasets show that the framework of CADIC is reasonable and that its performance in text clustering is comparable to that of state-of-the-art algorithms.
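The abstract describes the overall loop (choose discriminative axes, measure how the current clusters are distributed along each axis, rescale each axis, reassign, iterate) but does not give the axes or the rescaling functions themselves. The following is only a minimal sketch of that loop under stated assumptions, not the algorithm as published: it assumes one axis per cluster (the unit-normalized centroid) and a simple linear rescaling that maps the mean projection of non-members to 0 and the mean projection of members to 1. The names `rescaled_kmeans` and `axis_rescaler` are hypothetical.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def unit(v):
    """Scale v to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v] if norm > 0 else list(v)


def centroid(points, dim):
    """Component-wise mean; a zero vector if the cluster is empty."""
    if not points:
        return [0.0] * dim
    return [sum(p[j] for p in points) / len(points) for j in range(dim)]


def axis_rescaler(proj, is_member):
    """Return (offset, scale) for one axis.

    Assumption (not from the paper): a linear map that sends the mean
    projection of non-members to 0 and the mean projection of members to 1.
    Falls back to the identity map when that is not well defined.
    """
    inside = [p for p, m in zip(proj, is_member) if m]
    outside = [p for p, m in zip(proj, is_member) if not m]
    if not inside or not outside:
        return 0.0, 1.0
    mu_in = sum(inside) / len(inside)
    mu_out = sum(outside) / len(outside)
    gap = mu_in - mu_out
    if abs(gap) < 1e-12:
        return 0.0, 1.0
    return mu_out, gap


def rescaled_kmeans(docs, init_labels, k, max_iter=20):
    """K-means-like iteration with per-axis rescaling (illustrative only)."""
    labels = list(init_labels)
    dim = len(docs[0])
    for _ in range(max_iter):
        scores = []  # scores[c][i]: rescaled coordinate of doc i on axis c
        for c in range(k):
            is_member = [lab == c for lab in labels]
            members = [d for d, m in zip(docs, is_member) if m]
            axis = unit(centroid(members, dim))      # assumed axis choice
            proj = [dot(d, axis) for d in docs]      # coordinate on this axis
            offset, scale = axis_rescaler(proj, is_member)
            scores.append([(p - offset) / scale for p in proj])
        # Assign every document to the axis with the largest rescaled value.
        new_labels = [max(range(k), key=lambda c: scores[c][i])
                      for i in range(len(docs))]
        if new_labels == labels:                     # converged
            break
        labels = new_labels
    return labels


# Toy usage: two groups of 2-D "documents", one deliberately mislabeled.
docs = [[1.0, 0.1], [0.9, 0.2], [0.8, 0.1],
        [0.1, 1.0], [0.2, 0.9], [0.1, 0.8]]
print(rescaled_kmeans(docs, [0, 0, 0, 1, 1, 0], k=2))
```

Each pass projects every document onto k axes, so the cost per iteration is proportional to (number of documents) x k x (dimensions), the same order as a K-means pass. That matches the complexity claim in the abstract, although the real rescaling functions may carry different constants.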

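Project of this journal article

New Theories and Methods for Web Search and Mining: Research on the Theory and Methods of Web Search and Mining in Support of Public Opinion Monitoring

Journal papers: 113; conference papers: 114; awards: 6; patents: 39

Journal papers from the same project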

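A Method for Discovering Collaboration Patterns in Social Networks Based on Similarity Clustering

Research on Text Classification Algorithms

A Chord-Based Information Service Method for the Internet of Things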

NaEPASC: a novel and efficient public auditing scheme for cloud data

Computing Semantic Relatedness Based on the Link Structure and Category System of Chinese Wikipedia

Construction of unsupervised sentiment classifier on idioms resources

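An Ontology Learning Method Based on Social Annotation

Construction and Analysis of the Corpus for the Third Chinese Opinion Analysis Evaluation (COAE2011)

An Efficient In-Memory Online Data Processing Service Framework

Dynamic Knowledge Network Modeling for Intelligent Search

Message Popularity Prediction Based on Propagation Simulation

Information Retrieval and Data Mining Based on Open Web Knowledge

Research on Methods for Modeling the Importance of Internet Users

Community Structure in Complex Networks

A Survey of Graph Indexing Techniques

Research on Evaluating the Influence of Online Public Opinion Information Sources

Research on Twitter Data Collection Schemes

Query Structure Analysis Based on Pointwise Mutual Information

An Improved PageRank Algorithm Resistant to Link Spam

Identification of Online Uyghur Text and a Discussion of the Lower Bound on Text Length

Incomplete Clustering of Large-Scale Short Texts

High-Quality Topic Discovery for Online Forums

Research on Cross-Domain Sentiment Analysis Based on a Random Walk Model

A Clustering Framework Based on Space Mapping and Scale Transformation

Spammer Detection in Microblogs Based on Statistical Features and Bidirectional Voting

A Parallel Computing Method for Hierarchical Communities Based on Weighted Graphs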

A two-stage framework for cross-domain sentiment classification

Adapting centroid classifier for document categorization

Uncovering the community structure associated with the diffusion dynamics on networks

Spectral methods for the detection of network community structure: a comparative analysis

Comprehensive Quantitative Analysis for Privacy Leak Software Behavior

Bridgeness: a local index on edge significance in maintaining global connectivity

Contextual Correlation Based Thread Detection in Short Text Message Streams

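Research on Techniques for Cross-Domain Sentiment Analysis

Chinese Spam Microblog Filtering Based on Multi-View Feature Fusion

Network Big Data: Current Status and Prospects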

Design of an Evaluation System for Large Scale Network Attack Based on Emulab

A sampling method for mining user's preference

Detecting Hidden Anomalies Using Sketch for High-speed Network Data Stream Monitoring

Topic Diffusion Behavior Tracking in Online Social Network

A Density-Estimation-Based Method for Mining Feature Clusters in Social Networks

Detecting Spammers in Microblogs

Cross-language Opinion Lexicon Extraction using Mutual-reinforcement Label Propagation

Modelling and Analysis of an Integrated Scheduling Scheme with Heterogeneous LRD and SRD Traffic

Modelling priority queuing systems with varying service capacity

Degree-strength correlation reveals anomalous trading behavior

Quality-of-Service Analysis of Queuing Systems with Long-Range-Dependent Network Traffic and Variabl

Providing Hierarchical Lookup Service for P2P-VoD Systems

Auto-sampling of feature words on imbalanced data

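Research on Backbone Network Anomaly Detection Based on Multidimensional Entropy Classification

A Novel Parallel Computing Method for Hierarchical Dynamic Communities

Unsupervised Conversation Extraction for Short-Text Message Streams

An LDA-Based Model for Mining Online Topic Evolution

Topic-Level Influence Analysis in Microblogs Based on Multi-Relational Networks

Entity Mining from Query Logs Using a Semi-Supervised Bipartite Graph Method

A Hierarchical Concept Lattice Model and Mining Algorithm for Faceted Navigation

Named Entity Mining from User Query Logs Based on a Semi-Supervised Topic Model

Long-Tail Query Recommendation Based on Query Intent

Sensitivity Analysis of Data Noise in Learning to Rank

A Web Page Ranking Algorithm Based on Social Annotation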

Analytical Modelling and Optimization of Congestion Control for Prioritized Multi-Class Self-Similar

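Research on Sentiment Classification Based on Extraction of Key Sentiment Sentences

A Quantitative Analysis Method for Network Attack and Defense Based on a Stochastic Game Model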

Modeling the clustering in citation networks

Stochastic Game Net and Applications in Security Analysis for Enterprise Network

Mining Topical Influencers Based on the Multi-Relational Network in Micro-Blogging Sites

A dimensionality reduction framework for detection of multiscale structure in heterogeneous networks

An Update Summarization Algorithm Based on a Heat Conduction Model

Improving Text Categorization with Semantic Knowledge in Wikipedia

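Covariance, correlation matrix, and the multiscale community structure of networks

A Filtering-Rule Learning Algorithm for Topic-Relevant Microblog Messages

A Trust-Based RBAC Model for Open Environments

A Two-Stage Utility-Based Query Recommendation Method Using Absorbing Random Walks

Candidate Category Search in Large-Scale Hierarchical Classification

An Online Burst Event Detection Method Based on Sentiment Symbols

A Word-Embedding-Based Method for Domain Concept Recognition in Open Text

Design and Implementation of a Network Information Security Testing Platform

Research and Progress on Large-Scale Hierarchical Classification

Research on Peer Search Mechanisms in P2P Video-on-Demand Systems

A Parallel Computing Method for Hierarchical Communities Based on Weighted Graphs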

Symbolic representation based on trend features for knowledge discovery in long time series

Journal information

Journal of Software (软件学报)
Peking University Core Journal (2011 edition)

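Supervising body: Chinese Academy of Sciences
Sponsors: Institute of Software, Chinese Academy of Sciences; China Computer Federation
Editor-in-chief: Zhao Chen (赵琛)
Address: P.O. Box 8718, Beijing, Institute of Software, Chinese Academy of Sciences
Postal code: 100190
Email: jos@iscas.ac.cn
Phone: 010-62562563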

International standard serial number: ISSN 1000-9825
Domestic unified serial number: CN 11-2560/TP
Postal distribution code: 82-367

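Awards:
Selected as a "Double Hundred Journal" of the China Periodical Matrix in 2001; awarded the First Prize for Outstanding Science and Technology Journals of the Chinese Academy of Sciences in 2000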

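Indexed in domestic and international databases:
Abstracts Journal (Russia), Mathematical Reviews online edition (USA), Index Copernicus (Poland), Zentralblatt MATH (Germany), Abstract and Citation Database (Netherlands), Engineering Index (USA), Cambridge Scientific Abstracts (USA), Science Abstracts database (UK), Japan Science and Technology Agency database (Japan), China Science and Technology Core Journals (China), Peking University Core Journals (China; 2000, 2004, 2008, 2011, and 2014 editions)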

Citations: 54609