In this paper, we propose an approach for generating rich, fine-grained textual descriptions of images. In particular, we use an LSTM-in-LSTM (long short-term memory) architecture, which consists of an inner LSTM and an outer LSTM. The inner LSTM effectively encodes the long-range implicit contextual interaction between visual cues (i.e., the spatially concurrent visual objects), while the outer LSTM captures the explicit multi-modal relationship between sentences and images (i.e., the correspondence of sentences and images). This architecture is capable of producing a long description by predicting one word at every time step conditioned on the previously generated word, a hidden vector (via the outer LSTM), and a context vector of fine-grained visual cues (via the inner LSTM). Our model outperforms state-of-the-art methods on several benchmark datasets (Flickr8k, Flickr30k, MSCOCO) in terms of four different metrics (BLEU, CIDEr, ROUGE-L, and METEOR) when generating long, rich, fine-grained descriptions of given images.
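To make the decoding scheme concrete, the following is a minimal PyTorch sketch of one word-prediction step with an inner LSTM over visual cues and an outer LSTM over the sentence; it is an illustrative assumption of how such a decoder could be wired, not the authors' implementation, and all names (`InnerOuterDecoder`, `region_feats`, the dimensions) are hypothetical.

```python
# Minimal sketch (assumed structure, not the paper's reference code):
# the inner LSTM scans region-level visual cues into a context vector,
# and the outer LSTM predicts the next word from the previous word,
# its own hidden vector, and that context vector.
import torch
import torch.nn as nn

class InnerOuterDecoder(nn.Module):
    def __init__(self, vocab_size, embed_dim=256, hidden_dim=512, feat_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        # Inner LSTM: encodes contextual interaction among visual cues.
        self.inner = nn.LSTMCell(feat_dim, hidden_dim)
        # Outer LSTM: models the sentence-image correspondence over time.
        self.outer = nn.LSTMCell(embed_dim + hidden_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def step(self, prev_word, region_feats, outer_state):
        """One decoding step: next-word logits given the previous word,
        the outer hidden state, and a context vector of visual cues."""
        batch = prev_word.size(0)
        h_in = region_feats.new_zeros(batch, self.inner.hidden_size)
        c_in = region_feats.new_zeros(batch, self.inner.hidden_size)
        # Inner LSTM runs over the spatially concurrent region features.
        for t in range(region_feats.size(1)):
            h_in, c_in = self.inner(region_feats[:, t], (h_in, c_in))
        context = h_in                              # fine-grained context vector
        x = torch.cat([self.embed(prev_word), context], dim=1)
        h_out, c_out = self.outer(x, outer_state)   # outer hidden vector
        logits = self.out(h_out)                    # next-word distribution
        return logits, (h_out, c_out)

# Usage: greedy generation of a few words from random region features.
if __name__ == "__main__":
    dec = InnerOuterDecoder(vocab_size=1000)
    regions = torch.randn(2, 5, 512)                # 2 images, 5 visual cues each
    word = torch.zeros(2, dtype=torch.long)         # assumed <BOS> token id = 0
    state = (torch.zeros(2, 512), torch.zeros(2, 512))
    for _ in range(3):
        logits, state = dec.step(word, regions, state)
        word = logits.argmax(dim=1)                 # previously generated word
```

In this sketch the context vector is recomputed from the visual cues at every time step, so each predicted word is conditioned jointly on the previously generated word, the outer hidden vector, and the inner LSTM's encoding of the visual cues, mirroring the conditioning described above.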