Extensible Markup Language (XML) documents carry both structural and semantic information. Compared with plain text, XML offers more precise description and richer forms of expression, which benefits XML-based data integration and in-depth utilization, but it also means that traditional natural language processing (NLP) and data mining (DM) techniques cannot be applied directly. Because the content and structure of XML are not independent of each other (content influences structure, and structure acts on content), a tensor-based method for XML feature dimension reduction and general similarity computation is proposed. XML documents are represented as tensors, and their dimension is reduced with a method based on maximum mutual information (MMI). A general similarity measure that fuses structure and content, computed from a non-linear perspective, is then designed to capture the intrinsic relationship between the two and the way they act together, which improves the performance of general similarity computation for XML. Experiments and analysis of the results verify the effectiveness of the proposed method.
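The pipeline outlined above (tensor representation, mutual-information-based reduction, fused similarity) can be illustrated with a minimal sketch. The abstract does not give the tensor construction, the exact MMI criterion, or the non-linear similarity formula, so everything below is an assumption made for illustration: each document is stored as a sparse path × term slice of a documents × paths × terms tensor, features are ranked by mutual information with hypothetical class labels, and plain cosine similarity over the retained (path, term) entries stands in for the fused measure. The function names `path_term_counts`, `mutual_information`, `select_features`, and `similarity` are hypothetical.

```python
import math
import xml.etree.ElementTree as ET
from collections import Counter


def path_term_counts(xml_text):
    """Return a sparse (path, term) -> count slice for one XML document.

    Stacking these slices over a corpus gives a documents x paths x terms
    tensor (an assumed layout; the source does not specify one).
    """
    counts = Counter()

    def walk(node, prefix):
        path = prefix + "/" + node.tag
        # Only text directly inside the element is counted; tail text is ignored.
        for term in (node.text or "").lower().split():
            counts[(path, term)] += 1
        for child in node:
            walk(child, path)

    walk(ET.fromstring(xml_text), "")
    return counts


def mutual_information(slices, labels, feature):
    """Mutual information (in nats) between feature presence and class label."""
    n = len(slices)
    joint = Counter((feature in s, y) for s, y in zip(slices, labels))
    present = Counter(feature in s for s in slices)
    classes = Counter(labels)
    mi = 0.0
    for (f, y), c in joint.items():
        # p(f, y) * log( p(f, y) / (p(f) * p(y)) ), written with raw counts
        mi += (c / n) * math.log(c * n / (present[f] * classes[y]))
    return mi


def select_features(slices, labels, k):
    """Keep the k (path, term) features with the highest mutual information."""
    features = set().union(*slices)
    ranked = sorted(
        features, key=lambda f: (-mutual_information(slices, labels, f), f)
    )
    return set(ranked[:k])


def similarity(a, b, keep):
    """Cosine similarity over retained (path, term) entries.

    A term only matches when it occurs under the same path, so structure
    and content are coupled rather than scored separately. This is a
    stand-in, not the non-linear measure proposed in the paper.
    """
    dot = sum(a[f] * b[f] for f in keep)
    norm_a = math.sqrt(sum(a[f] ** 2 for f in keep))
    norm_b = math.sqrt(sum(b[f] ** 2 for f in keep))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0
```

In this sketch, two documents that share the word "tensor" under different element paths (for example `/book/title` and `/memo/body`) contribute nothing to each other's similarity, which is one simple way of letting structure constrain content matching.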