LSTM-in-LSTM for generating long descriptions of images
  • Classification: TP391.41 [Automation and Computer Technology - Computer Application Technology; Automation and Computer Technology - Computer Science and Technology]
  • Author affiliations: College of Computer Science and Technology, Zhejiang University; Department of Computer Science, Watson School of Engineering and Applied Sciences, Binghamton University
  • Funding: supported in part by the National Basic Research Program of China (No. 2012CB316400); the National Natural Science Foundation of China (Nos. 61472353 and 61572431); the China Knowledge Centre for Engineering Sciences and Technology; the Fundamental Research Funds for the Central Universities; the 2015 Qianjiang Talents Program of Zhejiang Province; and in part by the US NSF (No. CCF1017828)
Abstract:

In this paper, we propose an approach for generating rich, fine-grained textual descriptions of images. In particular, we use an LSTM-in-LSTM (long short-term memory) architecture, which consists of an inner LSTM and an outer LSTM. The inner LSTM effectively encodes the long-range implicit contextual interactions between visual cues (i.e., the spatially concurrent visual objects), while the outer LSTM captures the explicit multi-modal relationship between sentences and images (i.e., the correspondence of sentences and images). This architecture produces a long description by predicting one word at every time step, conditioned on the previously generated word, a hidden vector (via the outer LSTM), and a context vector of fine-grained visual cues (via the inner LSTM). Our model outperforms state-of-the-art methods on several benchmark datasets (Flickr8k, Flickr30k, MSCOCO) when used to generate long, rich, fine-grained descriptions of given images, in terms of four different metrics (BLEU, CIDEr, ROUGE-L, and METEOR).
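To make the decoding step concrete, below is a minimal sketch (not the authors' released code) of one LSTM-in-LSTM prediction step, assuming PyTorch. All layer names, dimensions, and the simple encoding of visual cues are illustrative assumptions; the paper's specific cue-interaction mechanism is omitted.

```python
# Illustrative sketch of one LSTM-in-LSTM decoding step (assumed PyTorch);
# layer names and sizes are hypothetical, not the authors' implementation.
import torch
import torch.nn as nn

class LSTMinLSTM(nn.Module):
    def __init__(self, vocab_size, embed_dim=256, hidden_dim=512, cue_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        # Inner LSTM: encodes the sequence of fine-grained visual cues
        # (features of spatially concurrent objects) into a context vector.
        self.inner = nn.LSTM(cue_dim, hidden_dim, batch_first=True)
        # Outer LSTM: models the sentence, conditioned at every time step
        # on the previous word embedding and the visual context vector.
        self.outer = nn.LSTMCell(embed_dim + hidden_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, cues, prev_word, state):
        # cues: (batch, num_cues, cue_dim) visual-cue features
        # prev_word: (batch,) indices of the previously generated word
        # state: (h, c) hidden and cell state of the outer LSTM
        _, (ctx, _) = self.inner(cues)   # context vector via the inner LSTM
        ctx = ctx.squeeze(0)             # (batch, hidden_dim)
        x = torch.cat([self.embed(prev_word), ctx], dim=1)
        h, c = self.outer(x, state)      # hidden vector via the outer LSTM
        return self.out(h), (h, c)       # next-word logits and new state
```

At generation time this step would be applied repeatedly, feeding the argmax (or sampled) word back in as prev_word until an end-of-sentence token is produced, which is how the architecture yields a long description one word at a time.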
