Étienne Simon

Neural Machine Translation with Memory Network Based Attention

Link: paper

Figure 1: Neural MT system. When a dot product is used to calculate the weights, the attention mechanism performs an operation very similar to a memory network.

We investigated whether memory networks can be used to improve the attention model of a neural MT system. Using multiple hops on the source sentence resulted in improvements of up to 1.4 BLEU on the IWSLT De/En task. We then integrated a second memory of all preceding target words, which brought an additional gain of 0.4 BLEU. Finally, we completely removed the decoder LSTM and showed that a memory network can jointly handle the attention mechanism and the generation of the target sentence.

Architecture

We noticed that the classical attention mechanism using a dot product is very similar to a memory network: at each time step of the decoder, the attention can be seen as a memory network whose memories are the hidden states of the encoder and whose query is the previous state of the decoder. This similarity is illustrated in Figure 1.
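To make this correspondence concrete, here is a minimal sketch of a dot-product attention step written as a single memory-network hop. It is in PyTorch with hypothetical names and shapes, not code from the paper: the encoder states play the role of the memories and the previous decoder state plays the role of the query.

```python
import torch

def attention_as_memory_hop(encoder_states, decoder_state):
    """One dot-product attention step, phrased as a single memory-network hop.

    encoder_states: (src_len, d) tensor -- the "memories" (encoder hidden states)
    decoder_state:  (d,) tensor         -- the "query" (previous decoder state)
    """
    # Addressing: dot product between the query and every memory slot.
    scores = encoder_states @ decoder_state        # (src_len,)
    weights = torch.softmax(scores, dim=0)         # attention weights
    # Reading: weighted sum of the memories, i.e. the usual context vector.
    context = weights @ encoder_states             # (d,)
    return context, weights
```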

With this in mind, we extended the standard attention mechanism in NMT in three ways:

  • Use a memory network with multiple hops on the source to calculate the attention vector. That is, we stack several memory networks on top of each other by setting the query of a layer to the output of the previous layer, as illustrated by Figure 2 (center) and sketched in code below. In this way we hope to achieve better, incrementally refined alignments.
  • Perform hops on the source and the preceding target words. That is, we use a memory network where the memories are filled with embeddings of the preceding words of the generated sentence. Including a memory on the targets could help to selectively focus on some past words. This is illustrated by Figure 2 (right).
  • Replace the decoder LSTM with a memory network which simultaneously performs the attention and sentence generation. In this case the output of the memory network is directly used to generate the target word with a softmax layer.

Figure 2: Different attention mechanisms used in our NMT system. Left: Standard attention (with dot-product). Center: Extension to multiple hops on the source sentence. Right: Hops on the target and source sentence.
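The following is a rough sketch of how the multi-hop variants of Figure 2 could be wired, again in PyTorch with hypothetical names. The additive query update between hops and the sharing of the memories across hops are assumptions made for illustration, not details taken from the paper.

```python
import torch

def multi_hop_attention(source_memory, target_memory, query,
                        n_target_hops, n_source_hops):
    """Multi-hop attention in the spirit of Figure 2 (center and right).

    source_memory: (src_len, d) encoder hidden states
    target_memory: (tgt_len, d) embeddings of the already generated target words
    query:         (d,)         previous decoder state
    """
    def hop(memory, q):
        weights = torch.softmax(memory @ q, dim=0)   # address the memory
        return weights @ memory                      # read a context vector

    # Target hops first, then source hops (the ordering reported to work best).
    for _ in range(n_target_hops):
        query = query + hop(target_memory, query)
    for _ in range(n_source_hops):
        query = query + hop(source_memory, query)
    return query  # fed to the decoder LSTM, or directly to a softmax layer
```

Setting n_target_hops to zero gives the center variant of Figure 2, while one source hop and no target hops corresponds (up to the residual query update assumed here) to the standard attention on the left.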

Results

We trained each model 5 times with different random initializations and report the test-set score of the run with the smallest validation cost. First of all, we observed that models which attend to the source sentence before attending to the target sentence performed poorly. Furthermore, alternating between source and target hops did not bring any improvement. Therefore, we report results for models performing target hops followed by source hops:

     S1    S2    S3    S4    S5    S6    S7
T0   23.0  23.5  23.9  24.1  24.4  24.1  23.9
T1   22.0  23.9  24.0  24.2  24.8  24.1  23.9
T2   21.8  23.8  24.0  24.2  24.3  23.8  23.7
T3   21.5  23.5  23.8  23.8  23.9  23.9  23.6
T4   21.9  23.5  24.1  24.0  24.3  23.8  23.6
T5   21.4  22.5  23.9  23.7  24.2  23.7  23.2

BLEU scores on the test set as a function of the number of hops on the source (S) and target (T) sentence. Best results are obtained with one hop on the target (T1) and five hops on the source sentence (S5). The standard attention model achieves a BLEU score of 23.0 (entry T0S1).

When looking at the alignment matrices, a similar pattern is observed for most sentences: the model first predicts a fuzzy alignment, which is then refined by the subsequent hops. This is further confirmed by truncating the memory network: when we train a model with 5 source hops and then test it with only 3 source hops, we observe that the theme of the translated sentences is somewhat preserved while the exact meaning is lost:
Original: Das hat uns enorme einsichten und inspiration für unsere eigenen, autonomen fahrzeuge gegeben.
Target:   Now, this has given us tremendous insight and inspiration for our own autonomous vehicles.
5 hops:   It has given us enormous insight and inspiration for our own, autonomous vehicles.
4 hops:   So this has given us a tremendous insight and inspiration for our own, autonomous vehicles.
3 hops:   This is what we've been given to, this is a tremendous idea that we have, is to give rise to our own, UNK, UNK vehicles.
2 hops:   This is what we've found out of, the way that we have a lot of time, is what is going on is that it's a lot of way for us to think about it.
1 hop:    That's the way that we have is that, when we think about it, a lot of the way that we think is that we have a lot of the way to think about it, is that it's a lot of time, you know, UNK UNK, or is it, you know, UNK UNK.

In the past, memory networks have been successfully applied to language modeling (Sukhbaatar et al., End-To-End Memory Networks). The decoder LSTM of an NMT system is basically a language model conditioned on the source sentence. In our best configuration (1 target and 5 source hops), all the preceding target words are already in memory. In other words, our memory network already has all the information necessary to generate the next word, and one may wonder whether the LSTM in the decoder is still needed.

To verify this, we completely removed the decoder LSTM. The output of the memory network is directly fed to a softmax output layer to generate the next word. In this case, more hops on the target sentence are needed (as was observed in language modeling with memory networks). We achieved a BLEU score of 22.9 using 3 hops on the target and 7 hops on the source. This is not as good as our best configuration with an LSTM (BLEU of 24.8), but it matches the baseline NMT system with a standard attention mechanism (BLEU of 23.0).
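For completeness, here is a minimal sketch of what a generation step might look like without the decoder LSTM, again in PyTorch with hypothetical names. The choice of initial query (here the embedding of the previously generated word) is an assumption, not a detail taken from the paper.

```python
import torch
import torch.nn as nn

class MemoryOnlyDecoderStep(nn.Module):
    """One decoding step where a memory network replaces the decoder LSTM."""

    def __init__(self, d, vocab_size):
        super().__init__()
        self.out = nn.Linear(d, vocab_size)  # projection to the vocabulary

    def forward(self, source_memory, target_memory, prev_word_embedding,
                n_target_hops=3, n_source_hops=7):
        def hop(memory, q):
            weights = torch.softmax(memory @ q, dim=0)  # address the memory
            return weights @ memory                     # read a context vector

        # The previous word embedding serves as the initial query (assumption).
        query = prev_word_embedding
        for _ in range(n_target_hops):
            query = query + hop(target_memory, query)
        for _ in range(n_source_hops):
            query = query + hop(source_memory, query)
        # No LSTM: the memory-network output goes straight into the softmax layer.
        return torch.log_softmax(self.out(query), dim=-1)
```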