This paper proposes a novel weakly supervised approach to mining named entities (NEs) from search engine query logs. Starting from a small number of manually selected seed NEs, the approach uses a random walk model to obtain a large number of NEs from a query log. Specifically, the context patterns of NEs in queries, the URLs clicked by users, and the candidate NEs extracted from the query log are used to construct a tripartite graph. A random walk on this graph assigns each candidate NE a probability of belonging to a given target NE category, so that the NEs of that category can be extracted from the query log. Experiments on a real-world query log covering 7 NE categories show that the approach achieves good NE mining performance in every category.
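The idea of scoring candidates by a random walk over the tripartite graph can be illustrated with a minimal sketch. The abstract does not specify the transition probabilities or the stopping rule, so everything below is an assumption made for illustration: the function name `random_walk_scores`, the count-proportional transitions, the restart-to-seeds step with probability `restart`, and the toy query-log data are all hypothetical and are not taken from the paper.

```python
from collections import defaultdict


def random_walk_scores(edges, seeds, restart=0.15, iterations=50):
    """Score candidate entities by a random walk with restart (illustrative only).

    edges: iterable of (entity, feature, weight) with weight > 0, where a
           feature is a context-pattern node or a clicked-URL node.
    seeds: manually selected seed entities of the target category.
    """
    ent_to_feat = defaultdict(dict)
    feat_to_ent = defaultdict(dict)
    for entity, feature, weight in edges:
        ent_to_feat[entity][feature] = ent_to_feat[entity].get(feature, 0.0) + weight
        feat_to_ent[feature][entity] = feat_to_ent[feature].get(entity, 0.0) + weight

    seeds = [s for s in seeds if s in ent_to_feat]
    if not seeds:
        raise ValueError("no seed entity occurs in the graph")

    # Assumption: the walk starts (and restarts) uniformly over the seeds.
    start = {s: 1.0 / len(seeds) for s in seeds}
    scores = dict(start)

    for _ in range(iterations):
        # Step 1: entity -> pattern/URL, proportional to co-occurrence weight.
        feat_mass = defaultdict(float)
        for entity, mass in scores.items():
            total = sum(ent_to_feat[entity].values())
            for feature, weight in ent_to_feat[entity].items():
                feat_mass[feature] += mass * weight / total

        # Step 2: pattern/URL -> entity, proportional to co-occurrence weight.
        walked = defaultdict(float)
        for feature, mass in feat_mass.items():
            total = sum(feat_to_ent[feature].values())
            for entity, weight in feat_to_ent[feature].items():
                walked[entity] += mass * weight / total

        # Step 3: mix in the restart distribution over the seeds.
        scores = {
            entity: (1.0 - restart) * walked.get(entity, 0.0)
            + restart * start.get(entity, 0.0)
            for entity in set(walked) | set(start)
        }
    return scores


# Hypothetical toy query log for a "film" category; "#" marks the entity slot.
toy_edges = [
    ("titanic", ("pattern", "# trailer"), 3),
    ("titanic", ("url", "imdb.com"), 2),
    ("avatar", ("pattern", "# trailer"), 2),
    ("avatar", ("url", "imdb.com"), 1),
    ("paris", ("pattern", "hotels in #"), 4),
    ("paris", ("url", "tripadvisor.com"), 2),
]
scores = random_walk_scores(toy_edges, seeds=["titanic"])
for entity, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{entity}: {score:.3f}")
```

In this toy graph, "avatar" shares a context pattern and a clicked URL with the seed "titanic" and therefore receives probability mass, whereas "paris" shares neither and receives none. A threshold or top-k cut on the resulting scores would then select the NEs of the target category; how the paper makes that selection is not stated in the abstract.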