Sequence to Sequence Learning Meta-post


I’ve studied neural nets before in classes, but my first serious foray into modern “deep learning” architectures has been through Sequence-to-Sequence models. Suffice it to say, most of what I learnt was new to me. Here I’m going to lay out the resources that I wish I had found when I first got started.


If you’re like me, you might have studied the basics of neural nets, backpropagation, and Stochastic Gradient Descent at the level covered by Andrew Ng in his ML class. Modern Deep Learning architectures are far more complex, especially when you’re looking to implement them.

  • Introduction to RNNs
    • A four-part series that covers simple RNNs and Backpropagation Through Time (BPTT)
    • BPTT is needed to perform backpropagation for RNNs. It is a straightforward application of the chain rule, but unrolled over the time steps of a single sequential training sample
  • Introduction to LSTMs
    • Great introduction to LSTMs by Christopher Olah. IMO, you don’t need to know too much about LSTMs when getting started, or even when you start coding. Treat them as a modular unit that can be swapped in wherever you would use a plain RNN cell.
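
To make the BPTT bullet concrete, here is a minimal sketch in plain Python. It assumes the smallest possible RNN: a scalar hidden state h_t = tanh(w * h_{t-1} + u * x_t) with a squared-error loss at every time step. The function names (`forward`, `loss`, `bptt`) and the toy numbers are made up for illustration and are not taken from the linked series. The backward loop is just the chain rule walked over the time steps in reverse.

```python
import math


def forward(w, u, xs, h0=0.0):
    """Run a scalar RNN h_t = tanh(w*h_{t-1} + u*x_t); return [h_0, h_1, ..., h_T]."""
    hs = [h0]
    for x in xs:
        hs.append(math.tanh(w * hs[-1] + u * x))
    return hs


def loss(w, u, xs, ys):
    """Sum of 0.5*(h_t - y_t)^2 over all time steps of one sequence."""
    hs = forward(w, u, xs)
    return sum(0.5 * (h - y) ** 2 for h, y in zip(hs[1:], ys))


def bptt(w, u, xs, ys):
    """Return (dL/dw, dL/du) by backpropagating through time."""
    hs = forward(w, u, xs)
    dw = du = 0.0
    dh_next = 0.0  # gradient arriving at h_t from later time steps
    for t in reversed(range(len(xs))):
        h, h_prev = hs[t + 1], hs[t]
        dh = (h - ys[t]) + dh_next  # loss at step t plus everything downstream
        da = dh * (1.0 - h * h)     # back through tanh
        dw += da * h_prev           # the same w is shared by every time step
        du += da * xs[t]
        dh_next = da * w            # pass the gradient on to h_{t-1}
    return dw, du


if __name__ == "__main__":
    # Toy sequence (assumed values): compare against a finite-difference estimate.
    xs, ys = [0.5, -1.0, 0.25], [0.1, -0.3, 0.2]
    w, u, eps = 0.7, -0.4, 1e-6
    dw, du = bptt(w, u, xs, ys)
    num_dw = (loss(w + eps, u, xs, ys) - loss(w - eps, u, xs, ys)) / (2 * eps)
    print("analytic dL/dw:", dw, "numeric dL/dw:", num_dw)
```

The gradient for `w` is accumulated across time steps because the same weight is reused at every step. That accumulation is the only real difference from backpropagation in a feed-forward net.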

Sequence to Sequence models



I played around with Keras last week. It has a dead simple API that you definitely want if your task is fairly standard. Keras does have a good RNN module and can be used for tasks like sentiment detection, language modeling, and NER. But the abstraction seems insufficient for Sequence to Sequence models, and I couldn’t find any non-toy implementations using Keras. Tensorflow seems like the de facto

The rule of thumb seems to be that the more standard a technique is, the more likely it is that Keras has assimilated it.


  • Excellent notebooks on Sequence-to-Sequence models in Tensorflow.
    • Consolidates a lot of the details needed when building practical seq2seq models, such as padding, dynamic batch/sequence sizes (dynamic_rnn API), bidirectional RNNs, and the raw_rnn TF API needed for implementing more complex decoders (e.g. attention models). It is missing some details, such as how to implement beam search and how to mask the loss when working with variable output lengths.
  • Tutorial on RNNs by Quoc Le
    • Elaborates on some of the tricks of the trade for sequence models that are glossed over in the original paper, especially about the decoder (greedy search, beam search, etc.)
  • Stackexchange thread on the same topic
  • Denny Britz brings it again with a great list of best practices for working with RNNs in Tensorflow
    • This was the first time I had heard of a method to compute accurate losses for variable-length outputs
    • Seriously. Read it all
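
Padding and loss masking come up in several of the resources above, so here is a rough sketch of the idea in plain Python rather than Tensorflow. It is not the code from any of the linked posts. `PAD`, `pad_batch`, and `masked_cross_entropy` are hypothetical names, and it assumes the model's output is already a per-step probability distribution over the vocabulary. Padded positions get a mask weight of 0, so they contribute nothing to the loss, and the average is taken over real tokens only.

```python
import math

PAD = 0  # hypothetical id reserved for the padding token


def pad_batch(seqs, pad=PAD):
    """Pad variable-length id sequences to a common length; return (padded, mask)."""
    max_len = max(len(s) for s in seqs)
    padded = [s + [pad] * (max_len - len(s)) for s in seqs]
    mask = [[1.0] * len(s) + [0.0] * (max_len - len(s)) for s in seqs]
    return padded, mask


def masked_cross_entropy(probs, targets, mask):
    """Mean negative log-likelihood over non-padded positions only.

    probs[b][t] is a list of probabilities over the vocabulary,
    targets[b][t] is the target id, and mask[b][t] is 1.0 for real
    tokens and 0.0 for padding. Assumes at least one real token.
    """
    total = 0.0
    count = 0.0
    for p_seq, t_seq, m_seq in zip(probs, targets, mask):
        for p, t, m in zip(p_seq, t_seq, m_seq):
            total += -math.log(p[t]) * m  # padded steps are zeroed out
            count += m
    return total / count  # divide by the number of real tokens, not batch * max_len


if __name__ == "__main__":
    targets, mask = pad_batch([[1, 2, 1], [2]])
    print(targets)  # second sequence is padded with PAD up to length 3
    print(mask)
```

Dividing by the sum of the mask rather than by batch size times maximum length is the design choice that matters. Without it, batches with a lot of padding report an artificially different loss from batches with very little.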

Tensorflow basics, nuances
