Show, Attend and Tell: Neural Image Caption Generation with Visual Attention

Kelvin Xu*, Jimmy Lei Ba^†, Ryan Kiros^†, Kyunghyun Cho*,
Aaron Courville*, Ruslan Salakhutdinov^†, Richard Zemel^†, Yoshua Bengio*
University of Toronto^†/University of Montreal*

Overview

How does it work?

The model combines convolutional neural networks, recurrent neural networks, and recent work on attention mechanisms.

[Figure: model diagram]

Above: At a high level, the model uses a convolutional neural network as a feature extractor, then uses a recurrent neural network with attention over those features to generate the sentence.
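To make the attention step concrete, here is a minimal sketch of one soft (deterministic) attention step in plain Python. It is an illustration under stated assumptions, not the authors' released code: the function names and the parameters W_f, W_h and v are hypothetical, the scoring function is a small tanh layer chosen for illustration, and the toy numbers stand in for real CNN features and a real decoder state.

```python
import math

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def soft_attention(features, hidden, W_f, W_h, v):
    """One soft-attention step (hypothetical parameter names).

    features: one feature vector per image location (from the CNN)
    hidden:   previous hidden state of the recurrent decoder
    W_f, W_h: projection matrices (assumed for illustration)
    v:        vector that turns each projected location into a scalar score
    """
    projected_hidden = matvec(W_h, hidden)
    scores = []
    for feature in features:
        projected_feature = matvec(W_f, feature)
        activation = [math.tanh(a + b)
                      for a, b in zip(projected_feature, projected_hidden)]
        scores.append(sum(vi * ai for vi, ai in zip(v, activation)))
    # Attention weights: positive and summing to 1 across image locations.
    weights = softmax(scores)
    # Context vector: attention-weighted average of the location features.
    dim = len(features[0])
    context = [sum(w * f[d] for w, f in zip(weights, features))
               for d in range(dim)]
    return weights, context

# Toy example: three image locations with 2-dimensional features.
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
hidden = [0.5, -0.5]
W_f = [[1.0, 0.0], [0.0, 1.0]]
W_h = [[0.1, 0.0], [0.0, 0.1]]
v = [1.0, 1.0]

weights, context = soft_attention(features, hidden, W_f, W_h, v)
```

In a full captioning model, the context vector would be fed to the recurrent network together with the previously generated word, and the weights would be recomputed at every step so the model can attend to different regions for different words.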

[Figure: convolutional network]

[Figure: recurrent network]

The model in action

Want all the details? Interested in what else we've been up to?

Please see the following technical report and visit the authors' pages:

Show, Attend and Tell: Neural Image Caption Generation with Visual Attention (2015)

Code