An Automatic Method for Identifying Micro-blog Messages That Contain Geographical Events
  • ISSN: 1560-8999
  • Journal: 《地球信息科学学报》 (Journal of Geo-information Science)
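  • Classification: TP393 [Automation and Computer Technology: Computer Application Technology; Automation and Computer Technology: Computer Science and Technology]
  • Author affiliations: [1] State Key Laboratory of Resources and Environmental Information System, Institute of Geographic Sciences and Natural Resources Research, Chinese Academy of Sciences, Beijing 100101; [2] University of Chinese Academy of Sciences, Beijing 100101
  • Funding: National "863" Program project (2013AA120305); National Natural Science Foundation of China project (41401460)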
Abstract (translated from Chinese):

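Micro-blog texts contain many types of geographical event information, which can compensate for the shortcomings of traditional fixed-point monitoring and improve the quality of emergency response. However, because large-scale annotated corpora are generally scarce, supervised learning cannot be used to identify micro-blog texts that contain geographical event information. This paper therefore proposes an automatic method for identifying micro-blog messages that contain geographical events, which improves identification by drawing on quickly acquired corpus resources. The method exploits the topic model's ability to extract the set of topics in a document: candidate corpus texts are filtered by topic, so that a geographical event corpus is extracted automatically. In addition, a distributed-representation word vector model is introduced into the event relevance calculation; the semantic information implicit in the word vectors enriches the context of short micro-blog texts and further improves the identification of event messages. Experiments using Sina Weibo as the data source show that the proposed method reaches an F-1 value of 71.41% when identifying message texts drawn from event-related Weibo topics, 10.79% higher than the classic SVM-based supervised learning method. On a data set of 5 million micro-blog messages that simulates a real Weibo environment, the identification accuracy reaches 60%.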
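
The event relevance calculation mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not give the actual formula, so everything here is an assumption: the message and the event are each represented by the mean of their word vectors, relevance is the cosine similarity of the two means, and the names `mean_vector`, `event_relevance`, `is_event_message`, the toy vectors, and the 0.9 threshold are all hypothetical.

```python
import math

# Toy 2-dimensional word vectors, made up for illustration only.
# A real system would load vectors trained on a large corpus.
TOY_VECTORS = {
    "rainstorm": [0.9, 0.1],
    "flood": [0.8, 0.2],
    "street": [0.5, 0.5],
    "concert": [0.1, 0.9],
    "ticket": [0.2, 0.8],
}


def mean_vector(tokens, vectors):
    """Average the vectors of the tokens found in `vectors`; None if none are found."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return None
    dim = len(known[0])
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def event_relevance(message_tokens, event_tokens, vectors):
    """Assumed relevance score: cosine similarity of the two mean vectors."""
    m = mean_vector(message_tokens, vectors)
    e = mean_vector(event_tokens, vectors)
    if m is None or e is None:
        return 0.0
    return cosine(m, e)


def is_event_message(message_tokens, event_tokens, vectors, threshold=0.9):
    """Hypothetical decision rule: the message is event-related above a threshold."""
    return event_relevance(message_tokens, event_tokens, vectors) >= threshold


if __name__ == "__main__":
    event = ["rainstorm"]
    print(event_relevance(["flood", "street"], event, TOY_VECTORS))
    print(event_relevance(["concert", "ticket"], event, TOY_VECTORS))
```

Averaging word vectors is only one simple way to give a short message a denser representation; the threshold would need to be tuned on real data.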

English abstract:

Micro-blogs contain many types of geographical event information, which can compensate for the shortcomings of traditional fixed-point monitoring technologies and improve the quality of emergency response. Identifying the micro-blog messages that contain geographical event information is the prerequisite for fully utilizing this data source. Trigger-based methods and supervised machine learning methods are commonly adopted to identify event-related texts. Of the two, supervised machine learning methods perform better on unrestricted texts. Unfortunately, the lack of large-scale tagged corpora means that supervised machine learning methods cannot be applied to identify messages related to geographical events. In this paper, we propose an automatic method, based on the topic model and word vectors, for recognizing micro-blogs that are related to geographical events. The method achieves a satisfactory identification result by rapidly increasing the corpus scale. First, because the topic model can extract topics from documents, the web pages fetched by a search engine are grouped by topic, and the corpus is obtained by combining the pages under the topics that are judged, from each topic's keywords, to be related to geographical events. Second, a distributed-representation word vector model is introduced to compensate for the lack of context in micro-blogs, which is caused by their character limit. During vector generation, these word vectors incorporate contextual semantic information from training on the corpus. Third, the correlation between a micro-blog message and the given geographical event is calculated and used to determine whether the message contains the specified geographical event. In addition, heuristic rules are used to correct erroneous correlations for very short messages.
Experiments in which the rainstorm is set as the target geographical event are conducted to valida
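
The first step of the English abstract (group fetched pages by topic, then keep the pages under event-related topics) can be sketched as follows. This is only an illustration under stated assumptions: the topic model is taken as already run, so each topic's top keywords and each page's dominant topic are given as inputs; a topic is judged event-related when its keywords overlap a seed word list at least `min_hits` times. The abstract does not say how the keywords are judged, and the names `event_topics`, `build_corpus`, the seed list, and the toy data are all hypothetical.

```python
# Toy inputs, made up for illustration. In practice the keywords and the
# per-page dominant topic would come from a topic model fitted to the pages.
TOPIC_KEYWORDS = {
    0: ["rainstorm", "flood", "rainfall", "warning"],
    1: ["concert", "ticket", "stage", "band"],
}
PAGES = ["page about heavy rain", "page about a gig", "page about flooding"]
DOMINANT_TOPIC = [0, 1, 0]  # dominant topic id of each page, same order as PAGES
SEED_WORDS = ["rainstorm", "flood", "waterlogging"]  # hypothetical seed list


def event_topics(topic_keywords, seed_words, min_hits=2):
    """Return ids of topics whose keywords contain at least `min_hits` seed words."""
    seeds = set(seed_words)
    return {
        topic_id
        for topic_id, keywords in topic_keywords.items()
        if len(seeds & set(keywords)) >= min_hits
    }


def build_corpus(pages, dominant_topic, topic_keywords, seed_words, min_hits=2):
    """Keep the pages whose dominant topic is judged event-related."""
    keep = event_topics(topic_keywords, seed_words, min_hits)
    return [page for page, topic_id in zip(pages, dominant_topic) if topic_id in keep]


if __name__ == "__main__":
    print(build_corpus(PAGES, DOMINANT_TOPIC, TOPIC_KEYWORDS, SEED_WORDS))
```

Filtering at the topic level rather than the page level means one keyword judgment per topic covers every page assigned to it, which is what lets the corpus grow quickly without per-page labeling.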

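Journal information
  • 《地球信息科学学报》 (Journal of Geo-information Science)
  • Chinese Science and Technology Core Journal
  • Supervising organization: Chinese Academy of Sciences
  • Sponsors: Institute of Geographic Sciences and Natural Resources Research, Chinese Academy of Sciences; The Geographical Society of China
  • Editor-in-chief: Xu Guanhua (徐冠华)
  • Address: A11 Datun Road, Beijing
  • Postal code: 100101
  • Email: sxfu@lreis.ac.cn
  • Telephone: 010-64888891
  • International standard serial number: ISSN 1560-8999
  • Domestic unified serial number: CN 11-5809/P
  • Postal distribution code: 82-919
  • Awards:
  • Indexed in domestic and international databases:
  • Chinese Science and Technology Core Journal; Peking University Core Journal (2008 edition); Peking University Core Journal (2011 edition); Peking University Core Journal (2014 edition)
  • Times cited: 3181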