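Exploiting the fact that a vector quantization codebook is an optimal representative set of the classification patterns in the data, this paper proposes the concept of a codebook-based outlier and demonstrates its intrinsic connection to the definition of an outlier in classical statistics. An outlier detection algorithm is constructed on the basis of a learning-based vector quantization codebook generation algorithm and a nearest-neighbor codeword search algorithm. Experimental results show that the proposed definition of an outlier is reasonable and that the algorithm is effective.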
In vector quantization, the codebook is chosen to best represent the distributional structure of a dataset of vectors. This property makes the codebook well suited to outlier detection. This paper defines the concept of a codebook-based outlier and analyzes its relation to the definition of an outlier in classical statistics. With this definition, outliers can be found with a two-phase algorithm. Experiments on a real-world dataset show that the approach is quite promising in terms of both the soundness of the definition and the effectiveness of the algorithm.
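The two phases can be illustrated with a minimal sketch. It is not the paper's algorithm: the abstract does not give the exact definition of a codebook-based outlier, so the sketch assumes a simple stand-in (a sample is flagged when its distance to the nearest codeword exceeds a user-chosen threshold). It also assumes plain online competitive learning as the learning-based codebook generator and a full search as the nearest-neighbor codeword search. The names `train_codebook`, `nearest_codeword`, `detect_outliers`, and all parameters are hypothetical.

```python
import math


def nearest_codeword(x, codebook):
    """Return (index, distance) of the codeword closest to x (full search)."""
    best_index, best_dist = -1, math.inf
    for j, c in enumerate(codebook):
        d = math.dist(x, c)  # Euclidean distance
        if d < best_dist:
            best_index, best_dist = j, d
    return best_index, best_dist


def train_codebook(data, k, epochs=20, lr=0.1):
    """Phase 1 (assumed): learn k codewords by online competitive learning."""
    n = len(data)
    # Assumption: initialise from k evenly spaced samples so runs are repeatable.
    codebook = [list(data[i * n // k]) for i in range(k)]
    for epoch in range(epochs):
        rate = lr * (1.0 - epoch / epochs)  # linearly decaying learning rate
        for x in data:
            j, _ = nearest_codeword(x, codebook)
            # Move only the winning codeword a small step toward the sample.
            codebook[j] = [c + rate * (xi - c) for c, xi in zip(codebook[j], x)]
    return codebook


def detect_outliers(data, codebook, threshold):
    """Phase 2 (assumed criterion): flag samples far from every codeword."""
    return [i for i, x in enumerate(data)
            if nearest_codeword(x, codebook)[1] > threshold]
```

A typical call would be `detect_outliers(data, train_codebook(data, k), threshold)`, where `data` is a list of equal-length numeric tuples. The codebook size `k` and the `threshold` are free parameters in this sketch and would have to be chosen for the dataset at hand.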