With the development of Internet technology, a wide variety of text-based communication applications, such as chat rooms and bulletin board systems (BBS), have appeared online. To keep the online environment civil, these applications usually filter profanities posted by users. To evade filtering, some malicious users disguise the profanities in their posts by altering them, so recognising these disguised forms is an important problem. In this paper we present an algorithm that recognises disguised profanities by computing string similarity for variant forms of sensitive words. The algorithm has the following features: (1) its similarity scores are close to human judgement; (2) its time complexity is low; (3) its recognition rate for disguised variants is high. The calculated similarity value determines whether a suspected sensitive word is filtered. Experimental results show that the proposed similarity measure outperforms existing string similarity algorithms on newly coined profanities.
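The abstract does not specify the similarity measure, so the following is only an illustrative sketch of the general pipeline it describes (score a suspected word against known sensitive words, then filter when the score is high enough). It substitutes plain normalized Levenshtein distance, which is not the paper's method, and the function names, the placeholder blocklist entry, and the 0.7 threshold are all assumptions made for the example.

```python
from typing import Iterable


def edit_distance(a: str, b: str) -> int:
    """Classic Levenshtein distance, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def similarity(a: str, b: str) -> float:
    """Map edit distance to a score in [0, 1]; 1.0 means identical strings."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(a, b) / longest


def should_filter(word: str, blocklist: Iterable[str],
                  threshold: float = 0.7) -> bool:
    """Flag `word` if it is similar enough to any blocklist entry.

    The threshold is a hypothetical value chosen for illustration.
    """
    word = word.lower()
    return any(similarity(word, entry.lower()) >= threshold
               for entry in blocklist)


if __name__ == "__main__":
    # "darn" is a mild placeholder standing in for a real blocklist entry.
    print(should_filter("d4rn", ["darn"]))
    print(should_filter("hello", ["darn"]))
```

A plain edit-distance score treats every character substitution alike, whereas the abstract aims for scores close to human judgement, so a measure in that spirit would presumably weight visually or phonetically similar substitutions differently.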