To help robots localize a target object in the environment from a natural language expression, a fast, end-to-end object detection algorithm based on natural language expressions is proposed. A convolutional neural network and a recurrent neural network are trained jointly to learn visual and linguistic information. The recurrent neural network encodes the natural language expression into a vector representation, and the convolutional neural network extracts features of image regions. The region features are compared with the language feature, and a region with high similarity to the expression is taken as the target region. The model was trained and tested on the open-source UNC-Ref and G-Ref datasets, and the results demonstrate its speed and accuracy.
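The matching step described in the abstract (comparing each region feature with the language feature and keeping a region with high similarity) can be illustrated with a minimal sketch. The abstract does not state which similarity measure or selection rule is used, so cosine similarity and picking the single highest-scoring region are assumptions made here for illustration. The function names `cosine_similarity` and `select_target_region` are hypothetical, and the short hand-written vectors merely stand in for the outputs of the convolutional and recurrent networks.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors (assumed measure)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # A zero vector has no direction; treat it as dissimilar to everything.
        return 0.0
    return dot / (norm_a * norm_b)


def select_target_region(
    region_features: Sequence[Sequence[float]],
    language_feature: Sequence[float],
) -> int:
    """Return the index of the region most similar to the language feature.

    Taking the argmax is an assumption; the abstract only says that a
    region with high similarity is the target.
    """
    if not region_features:
        raise ValueError("at least one region feature is required")
    scores = [cosine_similarity(r, language_feature) for r in region_features]
    return max(range(len(scores)), key=scores.__getitem__)


# Toy stand-ins: three region features (CNN side) and one language
# feature (RNN side), all in the same 3-dimensional space.
regions = [
    [1.0, 0.0, 0.0],
    [0.1, 0.9, 0.2],
    [0.0, 0.0, 1.0],
]
expression = [0.0, 1.0, 0.0]

best = select_target_region(regions, expression)
print(best)  # index of the selected region
```

In this toy input the second region points in nearly the same direction as the expression vector, so it is the one selected. In a trained model, both kinds of feature would have to be projected into a shared space before such a comparison is meaningful.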