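Starting from feature selection, local region division, and the computation of semantic similarity between terms, this paper applies a random terms iterative model (RTIM) to classify massive volumes of point of interest (POI) texts. Feature terms are selected using term frequency, concentration, and dispersion measures. The texts are divided into local regions according to the similarity between each text and each POI category. Within each local region, a word frequency vector is built for each text following the order in which terms appear in the text, and a feature mapping matrix is obtained by randomly deleting word frequencies from these vectors and reconstructing them. The feature mapping matrix then converts each text into a feature vector, and an SVM classifier performs the POI text classification. Experiments show that the method effectively improves the accuracy and coverage of POI text classification.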
This paper presents a novel approach to open POI text classification based on the RTIM, which draws on feature selection, local region division, and the computation of semantic similarities between terms. First, feature terms are extracted using improved concentration, dispersion, and frequency methods. The POI text dataset is then divided into local regions according to the similarity between each text and the POI categories. In each local region, a word frequency vector is created for every text based on the order of words in the text. A feature mapping matrix is then generated by randomly deleting word frequencies and reconstructing the word frequency vectors. All texts are transformed into the feature space by the feature mapping matrix. Finally, the POI texts are classified by support vector machines. The experimental results show that the approach substantially improves the precision and coverage rate of POI text classification.
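The local region division step can be illustrated with a minimal sketch. The abstract does not specify the similarity measure, so the code below assumes cosine similarity between term counts and represents each POI category by a hypothetical bag of representative terms. The names `cosine`, `assign_regions`, and `category_profiles`, as well as the toy data, are illustrative assumptions and not taken from the paper.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse term-count mappings."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def assign_regions(texts, category_profiles):
    """Place each text index in the region of its most similar category.

    Assumption: one local region per category. Ties (including texts with
    zero similarity to every category) fall to the first category listed.
    """
    regions = {name: [] for name in category_profiles}
    for i, tokens in enumerate(texts):
        counts = Counter(tokens)  # term frequencies of this text
        best = max(category_profiles,
                   key=lambda c: cosine(counts, category_profiles[c]))
        regions[best].append(i)
    return regions


# Hypothetical toy data: two categories and three tokenized POI texts.
category_profiles = {
    "food": Counter(["restaurant", "noodle", "cafe"]),
    "lodging": Counter(["hotel", "inn", "hostel"]),
}
texts = [
    ["beef", "noodle", "restaurant"],
    ["express", "hotel"],
    ["corner", "cafe"],
]
regions = assign_regions(texts, category_profiles)
```

In this sketch each text lands in exactly one region. The later steps (random deletion of word frequencies, reconstruction, and SVM classification) would then be run separately within each region.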