To address keyword extraction from massive volumes of Web text, this paper proposes a keyword extraction scheme built on the Hadoop distributed computing platform. First, the Hadoop platform is configured to support natural language processing. Then, the GATE tool is used to perform word and sentence segmentation, part-of-speech tagging, and annotation-rule operations on the Web text, yielding a set of candidate keywords. Finally, the traditional TF-IDF algorithm is weighted by word position and span importance factors to compute the relevance between each candidate keyword and the document, and the resulting keywords are used to label the document's attributes. Experimental results show that the proposed distributed scheme extracts keywords from Web documents quickly and accurately.
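The weighting step can be illustrated with a minimal single-machine sketch. The abstract does not give the exact formulas, so everything below is an assumption made for illustration: the function names `weighted_tfidf` and `top_keywords`, the smoothed IDF, the position factor (larger when a word first appears earlier in the document), and the span factor (the fraction of the document between a word's first and last occurrence). It does not reproduce the Hadoop or GATE stages, and it should not be read as the paper's actual weighting scheme.

```python
import math
from collections import Counter


def weighted_tfidf(docs, doc_index):
    """Score each word of docs[doc_index]; docs is a list of token lists.

    Assumed weighting (not taken from the paper):
      score = tf * idf * position_factor * (1 + span_factor)
    """
    tokens = docs[doc_index]
    n = len(tokens)
    if n == 0:
        return {}

    tf = Counter(tokens)

    # Document frequency: number of documents containing each word.
    df = Counter()
    for doc in docs:
        df.update(set(doc))

    # Record the first and last position of every word in the target document.
    first, last = {}, {}
    for i, word in enumerate(tokens):
        first.setdefault(word, i)
        last[word] = i

    scores = {}
    for word, count in tf.items():
        # Smoothed IDF so that a word present in every document keeps weight 1.
        idf = math.log((1 + len(docs)) / (1 + df[word])) + 1.0
        # Position factor in (1, 2]: an earlier first occurrence scores higher.
        position_factor = 1.0 + (1.0 - first[word] / n)
        # Span factor in (0, 1]: share of the document covered by the word.
        span_factor = (last[word] - first[word] + 1) / n
        scores[word] = (count / n) * idf * position_factor * (1.0 + span_factor)
    return scores


def top_keywords(scores, k):
    """Return the k highest-scoring words, best first."""
    return sorted(scores, key=scores.get, reverse=True)[:k]


if __name__ == "__main__":
    corpus = [
        ["hadoop", "cluster", "text", "hadoop"],
        ["text", "mining"],
        ["text", "web"],
    ]
    print(top_keywords(weighted_tfidf(corpus, 0), 2))
```

In a distributed setting, the per-document scoring would plausibly run in map tasks, with the document frequencies computed in a separate pass. That split is also an assumption, not something the abstract states.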