Recurrent Models of Visual Attention

Volodymyr Mnih, Nicolas Heess, Alex Graves, Koray Kavukcuoglu

Link to paper


  • This paper is important not necessarily for its results but for its formulation of “hard attention.”
  • Where to focus in an image is decided by a stochastic policy (in reinforcement learning parlance) that outputs a single location (the agent’s action) given the hidden state of the RNN. This sampling step is non-differentiable, so gradients cannot be backpropagated through it.
    • More detailed notes can be found here; also check Andrej Karpathy’s post on policy gradients here.
  • By sampling rollouts under the policy, we can estimate the gradient of the expected reward with respect to the RNN parameters using the REINFORCE rule.
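
The REINFORCE update above can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation: it assumes the location policy is a Gaussian with fixed standard deviation `sigma` whose mean is a linear function `W @ h` of the hidden state, and the names `sample_location`, `reinforce_grad`, and `toy_reward`, along with the toy reward itself, are hypothetical.

```python
import numpy as np

def sample_location(W, h, sigma, rng):
    """Sample a 2-D glimpse location from a Gaussian policy with mean W @ h."""
    mu = W @ h
    loc = mu + sigma * rng.standard_normal(mu.shape)
    return loc, mu

def reinforce_grad(h, loc, mu, reward, baseline, sigma):
    """Single-sample REINFORCE estimate of the gradient of E[reward] w.r.t. W.

    For a Gaussian with fixed sigma, d log p(loc) / d mu = (loc - mu) / sigma^2.
    Since mu = W @ h, the gradient w.r.t. W is the outer product with h,
    scaled by the baseline-subtracted reward.
    """
    dlogp_dmu = (loc - mu) / sigma**2
    return (reward - baseline) * np.outer(dlogp_dmu, h)

rng = np.random.default_rng(0)
h = rng.standard_normal(8)               # stand-in for the RNN hidden state
W = 0.1 * rng.standard_normal((2, 8))    # policy parameters (assumed linear)
sigma = 0.2

# Toy reward (assumption): 1 if the sampled location lands near a target.
target = np.array([0.5, -0.5])
def toy_reward(loc):
    return float(np.linalg.norm(loc - target) < 0.3)

# Sample several rollouts and average the estimator to reduce variance.
rollouts = []
for _ in range(256):
    loc, mu = sample_location(W, h, sigma, rng)
    rollouts.append((loc, mu, toy_reward(loc)))

# Using the batch-mean reward as the baseline is a simplification.
baseline = np.mean([r for _, _, r in rollouts])
grad_W = np.mean(
    [reinforce_grad(h, loc, mu, r, baseline, sigma) for loc, mu, r in rollouts],
    axis=0,
)
W = W + 0.01 * grad_W                    # gradient ascent on expected reward
```

The sampling step never needs a derivative; only the log-probability of the sampled location is differentiated, which is what makes the update possible despite the non-differentiable choice of location.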


  • How does this attention differ from Bahdanau attention for MT?
  • In terms of ops/speed/O(n) calculations, is this really more efficient than a CNN? My intuition suggests not.


  • Bahdanau attention captures and compresses the information already seen by the encoder, and is produced by asking the neural net “what do I expect to produce next?” This is done by computing a similarity (cosine/MLP/...) between the “expected input vector” (a function of the previous hidden state and the last word produced) and the encoder hidden states.
    • In this paper, attention is used for exploration: “what should I see next?”
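
To make the contrast concrete, here is a rough NumPy sketch of the two mechanisms side by side. It assumes the MLP (additive) scoring variant for the Bahdanau-style attention and a simple square crop for the hard glimpse; the function names, parameter names (`W_q`, `W_k`, `v`), and shapes are illustrative choices, not taken from either paper.

```python
import numpy as np

def additive_attention(query, enc_states, W_q, W_k, v):
    """Soft attention: score every encoder state, softmax, take a weighted sum.

    query:      (d_q,)     decoder-side "expected input" vector
    enc_states: (T, d_k)   encoder hidden states
    W_q: (d_a, d_q), W_k: (d_a, d_k), v: (d_a,)  MLP scoring parameters
    """
    scores = np.tanh(enc_states @ W_k.T + query @ W_q.T) @ v  # shape (T,)
    scores = scores - scores.max()                            # numerical stability
    weights = np.exp(scores) / np.exp(scores).sum()
    context = weights @ enc_states                            # shape (d_k,)
    return context, weights

def hard_glimpse(image, center, size):
    """Hard attention: look at exactly one square patch of a 2-D image.

    center is a (row, col) pixel position; the patch is shifted as needed
    so that it stays inside the image.
    """
    height, width = image.shape
    top = int(np.clip(center[0] - size // 2, 0, height - size))
    left = int(np.clip(center[1] - size // 2, 0, width - size))
    return image[top:top + size, left:left + size]

rng = np.random.default_rng(0)

# Soft: every encoder state contributes, so the whole thing is differentiable.
enc = rng.standard_normal((5, 6))
context, weights = additive_attention(
    rng.standard_normal(4), enc,
    rng.standard_normal((3, 4)), rng.standard_normal((3, 6)),
    rng.standard_normal(3),
)

# Hard: only the chosen patch is ever read; the rest of the image is ignored.
image = rng.standard_normal((28, 28))
patch = hard_glimpse(image, center=(3, 25), size=8)
```

The soft version touches all `T` encoder states on every step, while the hard version reads a fixed-size patch regardless of image size, which is where the choice of location has to come from a sampled policy rather than a softmax.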
