Seq2Seq

With all the attention being paid to ChatGPT, my own attention has drifted toward neural networks and related topics, and toward playing with them myself, sometimes with work in mind, sometimes not. Recent chats with people led me back to thinking about transformers, LSTMs, and the like, and finally landed me on some material on RNN techniques and Seq2Seq in particular. In the back of my head I keep thinking about playing out certain chronologies of a life (address history, for instance) and wondering whether models like this could be used to, say, predict where people will live next. So, back to basic Seq2Seq: this post is just a quick review of a Medium post I came across on the topic:

Encoder-Decoder Seq2Seq Models, Clearly Explained!!

That’s the title of the Medium article by Kriz Moses. It really is very nicely written, and he includes some useful links, in particular to an arXiv preprint by three Google researchers who wrote a seminal paper on the topic. Like the arXiv paper, the article is built around understanding Seq2Seq models through their use in translating from one language to another (English to French in their example), and it nicely demystifies the whole thing. There are a couple of minor details I’m not sure about, and one semi-major aspect that was never articulated but is, I think, pretty important – assuming I’m correct about it. At the end, they point to an interesting variant where you can use these models to do image-to-caption ‘translation’ too.

The key to these models is that they describe model-based transformations of sequences as a chain of connected LSTMs and say this is ‘equivalent to unrolling an RNN’, but I think they need to make this point more strongly, since it was the only way I could get my head around how they handle variable-length sequences. That is, an RNN is a ‘state-preserving’ NN: what it predicts next depends on its current state, which preserves a memory of what it has done before. Assuming some sort of ‘framing’ device for a sequence, a set of records presented in order can be seen as a sequence. The state-preserving nature of an RNN means that the stream of records handled within a sequence can be expressed as an equivalent chain of NN cells (they use LSTMs). With time going from left to right, the first token enters the LSTM at the left, the second token enters the next LSTM to the right, and so on. That’s the same as the RNN: token 1 enters the RNN, the state is preserved and presented back to the same RNN when the 2nd token enters, and so on. Paying attention to what demarcates the start and stop of a sequence is key. I’m not completely certain, but I think either the start or the stop probably needs to act as a ‘reset’ button (though if you want to connect sets of sequences, that’s probably not right).
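To convince myself of that stateful-vs-unrolled equivalence, here is a minimal sketch in plain NumPy. The toy sizes, random weights, and zero-state ‘reset’ are my own assumptions for illustration, not anything from the article:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: a 5-word one-hot vocabulary and a hidden state of size 4.
V, H = 5, 4
Wx = rng.normal(size=(H, V))   # input-to-hidden weights
Wh = rng.normal(size=(H, H))   # hidden-to-hidden ("memory") weights

def rnn_step(h, x):
    """One step of a vanilla RNN cell: new state from old state + token."""
    return np.tanh(Wx @ x + Wh @ h)

def run_stateful(tokens):
    """The single-RNN view: one cell, state carried from step to step."""
    h = np.zeros(H)            # the 'reset button': start from zero state
    for x in tokens:
        h = rnn_step(h, x)
    return h

def run_unrolled(tokens):
    """The unrolled view: one cell copy per time step, left to right."""
    states = [np.zeros(H)]
    for x in tokens:
        states.append(rnn_step(states[-1], x))
    return states[-1]

# A length-3 sequence of one-hot tokens; both views give the same state.
tokens = [np.eye(V)[i] for i in [2, 0, 4]]
```

Because the loop just applies the same cell repeatedly, any sequence length works, which is how the unrolled picture handles variable-length input.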

As I said above, there were a couple of minor technical points where the article felt either in error or I missed something, in particular in designating the ‘true’ and ‘predicted’ outputs of the RNNs (there are two in this type of model, one referred to as the ‘encoder’ and one as the ‘decoder’). During training, the loss function compares the ‘true’ target sequence with the ‘predicted’ sequence produced by the decoder (the encoder only supplies the initial state), and I think the author mangled some of the specific content there.
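As a sketch of what I believe that loss is doing: the decoder emits a probability distribution over the vocabulary at each step, and cross-entropy compares those distributions against the true target tokens. The numbers and vocabulary here are entirely made up for illustration:

```python
import numpy as np

# Toy 'decoder' output: a distribution over a 4-word vocabulary per step.
predicted = np.array([
    [0.7, 0.1, 0.1, 0.1],   # step 1: model favours word 0
    [0.1, 0.6, 0.2, 0.1],   # step 2: model favours word 1
])
true_targets = [0, 1]        # the true target sequence (word indices)

# Cross-entropy: average negative log-probability the decoder assigned
# to each true target token. The 'true' side of the loss is the target
# sentence, not anything the encoder produces.
loss = -np.mean([np.log(predicted[t, w]) for t, w in enumerate(true_targets)])
```

A perfect decoder would put probability 1.0 on each true token, driving the loss to zero.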

I’m quite eager to try this out, because I think there are many sequenced things beyond literal sentences that would be of interest here (I’m curious what it would make of irrational numbers like pi). This also motivated me to look at word2vec, because the seq2seq framework he demonstrated relies on it, or something similar, as a preprocessing step to handle the input. The original input is a (presumably) massive vector of 1-hot encodings of all possible words you want considered.
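For concreteness, here is a minimal sketch of that 1-hot input representation, with a made-up four-word vocabulary standing in for the massive real one:

```python
import numpy as np

# A tiny vocabulary; a real system would have tens of thousands of words.
vocab = ["the", "cat", "sat", "mat"]
index = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    """Encode a word as a 1-hot vector over the whole vocabulary."""
    v = np.zeros(len(vocab))
    v[index[word]] = 1.0
    return v

x = one_hot("cat")   # all zeros except position 1
```

Embeddings like word2vec then compress these sparse, vocabulary-sized vectors into short dense ones before they reach the encoder.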
