With the growth of the Internet, information on the Web has exploded and the ways in which it is presented have become increasingly diverse, which creates great difficulty for computer processing tasks such as multimedia content retrieval and information extraction. To address the inconsistency among a web page's multimedia contents after information extraction, this paper proposes a fusion algorithm for web page multimedia information extraction. By fusing the semantics of images and text, the algorithm judges whether the various forms of content in an extracted page are consistent with one another, and it uses the text on the page to represent more accurately what the images convey. Experiments on 307 web pages from 30 web sites show that the proposed method is feasible.