Dynamic Maintenance Algorithm for the Index Library of a Distributed Search Engine

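Journal: Journal of Shandong University (Natural Science) (山东大学学报(理学版))
Date: 2011.5.5
Pages: 24-27
Classification: TP301.6 [Automation and Computer Technology - Computer System Architecture; Automation and Computer Technology - Computer Science and Technology]
Author affiliation: [1] School of Computer Science, Fudan University, Shanghai 200433
Funding: National Natural Science Foundation of China (61073170)
Related project: Research on multi-granularity semantic patterns for online perception of objectionable text content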

Keywords: string similarity, algorithm, edit distance, content filtering

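Chinese abstract (translated):

With the development of network technology, a wide variety of text-based communication applications, such as chat rooms and BBS, have appeared online. To keep the online environment civil, these applications filter profane words posted by users. To keep their messages from being filtered, some malicious users deliberately disguise profane words, so recognizing the disguised forms is an important problem. This paper identifies disguised words by computing the similarity between a variant and the sensitive word it derives from. The method has the following features: (1) its results are close to human judgment; (2) its time complexity is low; (3) its recognition rate for variants is high. The computed similarity value determines whether a suspected sensitive word is filtered. Experimental data show that the proposed similarity measure outperforms existing algorithms.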

English abstract:

With the development of Internet technology, various network applications for textual communication have appeared, such as chat rooms and BBS. To keep the network environment healthy, many applications filter the profanities posted by users. To avoid being filtered, some malicious users disguise these profanities in the messages they post. How to recognise these disguised profanities is an important issue. In this paper we present an algorithm that recognises disguised profanities by computing the string similarity of aberrant sensitive words. This algorithm has the following features: (1) the string-similarity score it assigns to a disguised profanity is very close to human judgment; (2) very low time complexity; (3) a very high identification rate for disguised profanities. The algorithm decides whether to filter a suspected sensitive word according to the calculated similarity value. Experimental data show that this algorithm outperforms the state-of-the-art string similarity metrics on newly coined profanities.
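
The abstract does not state the similarity formula, so the following is only a minimal sketch of the general idea it describes: score a suspected word against a list of sensitive words and filter it when the score is high enough. It uses the classic Levenshtein edit distance normalized by the longer string's length, which is a swapped-in baseline and not the metric proposed in the paper. The function names, the default `threshold` of 0.7, and the sample words are all hypothetical.

```python
from typing import Iterable


def edit_distance(a: str, b: str) -> int:
    """Classic Levenshtein distance, computed with a two-row dynamic program."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def similarity(a: str, b: str) -> float:
    """Normalized similarity in [0, 1]; 1.0 means the strings are identical."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # two empty strings are treated as identical
    return 1.0 - edit_distance(a, b) / longest


def should_filter(word: str, sensitive_words: Iterable[str],
                  threshold: float = 0.7) -> bool:
    """Return True if `word` is close enough to any sensitive word.

    The threshold is an assumed value, not one taken from the paper.
    """
    w = word.lower()
    return any(similarity(w, s.lower()) >= threshold for s in sensitive_words)
```

Under these assumptions, a one-character substitution in a four-letter word (for example `d4mn` against `damn`) scores 0.75 and would be filtered at the default threshold. A plain edit distance treats every substitution equally, so it cannot tell a visually similar replacement from an arbitrary one; closing that gap with human judgment is what the abstract says the proposed metric aims to do.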

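Project associated with this journal paper

Research on multi-granularity semantic patterns for online perception of objectionable text content

Journal papers: 8; conference papers: 6; patents: 3

Journal papers from the same project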

Semantic multi-grain mixture topic model for text analysis

Text stream clustering algorithm based on adaptive feature selection

A hybrid generative/discriminative method for semi-supervised classification

Posterior probability model for stock return prediction based on analyst's recommendation behavi

Inter-training: Exploiting unlabeled data in multi-classifier systems

Web objectionable text content detection using topic modeling technique

Topics modeling based on selective Zipf distribution