Towards Interlingua Neural Machine Translation

A common intermediate language representation in neural machine translation can be used to extend bilingual systems to multilingual systems by incremental training. In this paper, we propose a new architecture based on introducing an interlingual loss as an additional training objective. By adding and forcing this interlingual loss, we are able to train multiple encoders and decoders, one per language, sharing a common intermediate representation. Translation results on low-resourced tasks (Turkish-English and Kazakh-English, from the popular Workshop on Machine Translation benchmark) show BLEU improvements of up to 2.8. However, results on a larger dataset (Russian-English and Kazakh-English, from the same benchmark) show BLEU losses of a similar magnitude. While our system only improves translation quality for the low-resourced tasks, it is capable of quickly deploying new language pairs without retraining the rest of the system, which may be a game-changer in some situations (e.g., in a disaster crisis where international help is required towards a small region, or when a translation system has to be developed quickly for a client). Precisely, what is most relevant about our architecture is that it is capable of: (1) reducing the number of production systems, with respect to the number of languages, from quadratic to linear; (2) incrementally adding a new language to the system without retraining the languages already present; and (3) allowing translations from the new language to all the other languages present in the system.


Introduction
Machine translation in a highly multilingual environment poses several challenges, as the number of possible translation directions grows quadratically with the number of languages. Among those challenges are the acquisition and curation of parallel data and the allocation of hardware resources for training and inference. The situation becomes even worse when we want to offer translation for a language pair with little or no parallel data available, since translation quality depends strongly on the amount of available training data.
Neural Machine Translation (NMT) (Cho et al. 2014; Sutskever, Vinyals, and Le 2014) has arisen as a completely new paradigm for machine translation, outperforming previous statistical approaches (Koehn, Och, and Marcu 2003) in most tasks. One clear exception is low-resourced tasks (Koehn and Knowles 2017), where statistical machine translation can still outperform or be competitive with neural machine translation (Artetxe, Labaka, and Agirre 2018; Lample et al. 2018).
Among others, one clear advantage of neural machine translation is that it opens new challenges in machine translation, such as multimodal machine translation (Elliott et al. 2017) or the search for a common intermediate representation that allows training a single encoder and decoder per language, reducing the number of translation systems from a quadratic dependency on the number of languages to a linear one. As we will show in section 2, several approaches have used the idea of a common intermediate language in neural machine translation. However, recent research on this topic has mainly evaluated whether the encoder-decoder architecture with recurrent neural networks (RNNs), with or without attention mechanisms, is able to reach a universal language representation while training on multiple languages (Johnson et al. 2016; Schwenk and Douze 2017). These ideas have been further explored in unsupervised machine translation, where the system learns to translate between languages without parallel data just by enforcing the generation and representation of the tokens to be similar (Artetxe et al. 2017; Lample, Denoyer, and Ranzato 2017). All these architectures share parameters between languages and/or require all languages to be trained at the same time. This forces the system to be retrained whenever a new language has to be added and share the same representation.
Differently, in this paper, we specifically pursue training a common intermediate representation for the benefit of incremental multilingual translation. Incremental translation is of high importance in situations such as a natural disaster (a hurricane, tsunami or earthquake) where humanitarian help is required from all over the world towards the place of the disaster. When the affected region speaks a minority language with few speakers and/or no machine translation systems available, a quickly deployed machine translation system has been shown to be of vital need (?). Beyond this, companies may be interested in deploying new machine translation systems for clients without retraining the other languages they already support. The incremental multilingual translation that we present is supported by a common intermediate representation, which implies that encoders or decoders for new languages only have to be trained against one encoder or decoder already in the system, while keeping the latter frozen. Therefore, the proposed approach differs from previous cases in which the intermediate representation is an end in itself (e.g., (Conneau et al. 2017)). Our proposed architecture combines variational autoencoders and the state-of-the-art Transformer architecture (Vaswani et al. 2017). In particular, we use vector quantization as the type of variational autoencoder, with the main motivation of matching several close sentences to the same quantization step; however, this vector quantization strategy turns out to degrade translation quality. Another novelty of our architecture is that, in the optimisation process, we add a loss term: the correlation distance between intermediate representations from different languages. In this way, we force the system to learn the intermediate representation while training multiple translation systems. One of the challenges at this point is to find a suitable distance function between intermediate representations. In order to address this, we propose and evaluate different distance measures. Results on the Workshop on Machine Translation (WMT) benchmark on low-resourced tasks (English, Turkish and Kazakh) show that our architecture (with the new loss term based on correlation distance and without variational autoencoders) produces competitive translations and improves over the baseline system while benefiting from an easy and inexpensive way of extending to new languages. Regarding adding new languages, we show that our architecture is capable of scaling to new source languages without requiring retraining of all languages in the system.
Contribution. This paper proposes a proof of concept of a new multilingual neural machine translation approach. The approach is based on joint training without parameter sharing, enforcing a compatible representation between the jointly trained languages and using multitask learning. We propose to effectively use the correlation distance between intermediate representations. This approach is shown to offer a strategy that scales to new languages without retraining any of the previous languages in the system, while also enabling zero-shot translation. Our architecture, while being much more flexible and allowing new languages to be deployed quickly, is also able to improve BLEU results over the baseline systems.
Organisation. The rest of the paper is organised as follows. Section 2 reports the work most closely related to this topic. Section 3 explains the necessary background to make the manuscript self-contained. Section 4 details the architecture proposed in this study, both the joint training and how to scale to new languages. Section 5 describes the data and implementation used in the experiments, and reports the translation results. Section 6 provides an analysis of how the system is able to recover sentences from the intermediate representation, as well as an insightful visualization of that representation. Finally, section 7 presents the most relevant conclusions of this study.

Related Work
In this work, we focus on training a common language representation with deep learning techniques. The objective is to train an intermediate representation that allows using independent encoders and decoders for each language. In this scenario, the number of translation systems required in a highly multilingual environment is reduced from quadratic to linear and, in addition, translation becomes available for language pairs that have never been explicitly trained. Related works that follow a similar objective or methodology include the following.
Shared Encoders/Decoders. Johnson et al. (2016) feed a single encoder and decoder with multiple input and output languages. With this approach, the authors show that zero-shot learning is possible. By visualizing similar sentences in different languages, they also give some hints that these sentences appear close in the common representation. More recently, Arivazhagan et al. (2019) propose auxiliary losses on the neural machine translation encoder that impose representational invariance across languages, which is shown to improve zero-shot translation.
Dedicated Encoder/Decoder. These approaches vary from many encoders to one decoder (many-to-one) (Zoph and Knight 2016), one encoder to many decoders (one-to-many) (Dong et al. 2015a) and, finally, one encoder to one decoder (one-to-one), which we focus on because it is closest to our approach. Firat et al. (2016) propose to extend the classical recurrent bilingual neural machine translation architecture (Bahdanau, Cho, and Bengio 2015) to the multilingual case by designing a single encoder and decoder for each language with a shared attention-based mechanism. Schwenk et al. (2017) and Espana-Bonet et al. (2017) evaluate how a recurrent neural machine translation architecture without attention is able to generate a common representation between languages; they use the inner product or cosine distance to evaluate the distance between sentence representations. Recently, Lu et al. (2018) train single encoders and decoders for each language, generating interlingual embeddings which are agnostic to the input and output languages.
Other related architectures. Unsupervised machine translation (Lample, Denoyer, and Ranzato 2017; Artetxe et al. 2017) does not directly pursue a common intermediate representation, but it is related to our approach. Artetxe et al. (2017) and Lample et al. (2017) propose a system that learns to translate when trained only on monolingual corpora. The architecture is basically a shared encoder with pre-trained embeddings and two decoders (one of them acting as an autoencoder). On the other hand, our work is also related to recent work on sentence representations (Conneau et al. 2017; Eriguchi et al. 2018) and to approaches taking advantage of multitask learning (Dong et al. 2015b). However, the main difference is that these works aim at extending representations to other natural language processing tasks, while we aim at finding the most suitable representation to make interlingual machine translation feasible. While interesting for further research, the evaluation and adaptation of this intermediate representation to multiple tasks is out of the scope of this study.

Background
In this section, we report the techniques that are used for the development of the proposed architecture in this paper: variational autoencoders (Rumelhart, Hinton, and Williams 1985), decomposed vector quantization (van den Oord, Vinyals et al. 2017) and Transformer (Vaswani et al. 2017).

Variational Autoencoders
Autoencoders are generative models able to regenerate their own input. This is useful to train an intermediate representation, which can later be employed as a feature for another task or even as a dimensionality reduction technique. Traditional autoencoders learn to produce an intermediate representation for an existing example. Variational autoencoders (Rumelhart, Hinton, and Williams 1985; Kingma and Welling 2013; Zhang et al. 2016) take a different approach in which the objective is to learn the parameters of a probability distribution that characterizes the intermediate representation. This allows sampling new synthetic instances from the distribution and generating them using the decoder part of the network.
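As a minimal illustration of this idea, the sketch below (in PyTorch, with illustrative layer sizes and module names that are not those used in this paper) shows a Gaussian variational bottleneck: the encoder predicts a mean and a log-variance, a latent code is sampled with the reparameterization trick, and a KL term keeps the learned distribution close to the prior.

```python
import torch
import torch.nn as nn

class ToyVAE(nn.Module):
    """Minimal VAE bottleneck: the encoder predicts a Gaussian and the decoder
    reconstructs the input from a sample of that Gaussian."""
    def __init__(self, input_dim=784, hidden_dim=256, latent_dim=32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(input_dim, hidden_dim), nn.ReLU())
        self.to_mu = nn.Linear(hidden_dim, latent_dim)
        self.to_logvar = nn.Linear(hidden_dim, latent_dim)
        self.decoder = nn.Sequential(nn.Linear(latent_dim, hidden_dim), nn.ReLU(),
                                     nn.Linear(hidden_dim, input_dim))

    def forward(self, x):
        h = self.encoder(x)
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        # Reparameterization trick: sample z while keeping gradients w.r.t. mu/logvar.
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
        recon = self.decoder(z)
        # KL divergence of N(mu, sigma^2) from the standard normal prior.
        kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp(), dim=-1).mean()
        return recon, kl
```

New instances can then be generated by sampling z from the prior and running only the decoder.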

Decomposed Vector Quantization
One of the strategies to create variational autoencoders is vector quantization (van den Oord, Vinyals et al. 2017). It consists of adding a table of dimension K · D, where K is the number of possible representations and D is the dimension of each representation. The vector in the table closest to the output of the encoder is fed to the decoder as a discrete latent representation, which is then employed to reconstruct the input.
As discussed in (Kaiser et al. 2018), this approach may produce a vector quantization in which only a small part of the table is actually used. To solve this, decomposed vector quantization uses a set of n tables in which each table represents a portion of the representation; the portions are later concatenated and fed to the decoder. The advantage of this approach is that, by using n tables of dimension K · D/n, and therefore optimizing the same number of parameters, K^n possible vectors of dimension D can be generated.
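The following sketch (PyTorch; table sizes and module names are illustrative, and the straight-through gradient estimator and commitment losses used in practice are omitted) shows the decomposed lookup described above: each chunk of the encoder output is replaced by its nearest entry in a dedicated codebook and the chunks are concatenated back.

```python
import torch
import torch.nn as nn

class DecomposedVQ(nn.Module):
    """Sketch of decomposed vector quantization: the encoder output of size D is
    split into n chunks, each quantized against its own codebook of K entries of
    size D/n, and the selected entries are concatenated again."""
    def __init__(self, num_tables=4, codebook_size=512, dim=512):
        super().__init__()
        assert dim % num_tables == 0
        self.n, self.k, self.chunk = num_tables, codebook_size, dim // num_tables
        # n codebooks, each K x (D/n); jointly they hold the same K*D parameters.
        self.codebooks = nn.Parameter(torch.randn(num_tables, codebook_size, self.chunk))

    def forward(self, h):                          # h: (batch, D) continuous encoder output
        chunks = h.view(h.size(0), self.n, self.chunk)            # (batch, n, D/n)
        quantized = []
        for i in range(self.n):
            # Nearest codebook entry (L2 distance) for each chunk.
            dist = torch.cdist(chunks[:, i], self.codebooks[i])   # (batch, K)
            idx = dist.argmin(dim=-1)
            quantized.append(self.codebooks[i][idx])
        # (batch, D) discrete latent that would be fed to the decoder.
        return torch.cat(quantized, dim=-1)
```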

Transformer
The current state-of-the-art architecture for neural machine translation is the Transformer (Vaswani et al. 2017). It uses multiple self-attention and feed-forward layers, which process the whole input sequence at once, using self-attention to attend to the relevant parts of the sequence and to resolve coreference issues. Apart from improving the results of previous sequence-to-sequence systems (Vaswani et al. 2017), the Transformer can be run in parallel. Among other particularities, this architecture requires summing positional embeddings, as previously proposed in (?), to the word embeddings in order to explicitly encode the relative positions of tokens, because, being a non-recurrent architecture, it has no inherent notion of sequentiality.
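For concreteness, the sinusoidal positional encoding of the original Transformer can be sketched as follows (a simple reference implementation, not the Fairseq code used later in the paper):

```python
import torch

def sinusoidal_positions(max_len: int, dim: int) -> torch.Tensor:
    """Sinusoidal positional encodings as in the original Transformer: even
    dimensions use sine, odd dimensions use cosine, with geometrically spaced
    frequencies. Assumes an even embedding dimension."""
    pos = torch.arange(max_len, dtype=torch.float32).unsqueeze(1)          # (max_len, 1)
    inv_freq = torch.exp(-torch.arange(0, dim, 2, dtype=torch.float32)
                         * (torch.log(torch.tensor(10000.0)) / dim))       # (dim/2,)
    pe = torch.zeros(max_len, dim)
    pe[:, 0::2] = torch.sin(pos * inv_freq)
    pe[:, 1::2] = torch.cos(pos * inv_freq)
    return pe

# The encodings are summed with the word embeddings, broadcast over the batch:
# x = token_embeddings + sinusoidal_positions(seq_len, dim).unsqueeze(0)
```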

Model Architecture
In this section, we report the details of our proposed architecture. We describe the joint training and how we are scaling to new languages.

Definitions
Before explaining our proposed model, we introduce the notation that will be assumed hereinafter. Languages will be referred to with capital letters X, Y, Z, while sentences will be referred to in lower case x, y, z, given that x ∈ X, y ∈ Y, and z ∈ Z. Sentence i in the corpus is then referred to as x_i, y_i, z_i.
We consider as an encoder (e_x, e_y, e_z) the layers of the network that, given an input sentence, produce a sentence representation in a given space. Analogously, a decoder (d_x, d_y, d_z) consists of the layers of the network that, given the sentence representation of the source sentence, produce the tokens of the target sentence. Encoders and decoders are always considered independent modules that can be arranged and combined individually, as no parameter is shared between them. Each language and module has its own weights, independent from all the others present in the system.

Joint Training
Given two languages X and Y, our objective is to train independent encoders and decoders for each language, e_x, d_x and e_y, d_y, that produce compatible sentence representations. For instance, given a sentence x in language X, we can obtain a representation r_x from the encoder e_x that can be used either to generate a sentence reconstruction using decoder d_x or a translation using decoder d_y. With this objective in mind, we propose a training schedule that combines two tasks (auto-encoding and translation) and the two translation directions simultaneously by optimizing the following loss:

L = L_{XX} + L_{YY} + L_{XY} + L_{YX} + d(h(X), h(Y))    (1)

where L_{XX} and L_{YY} correspond to the reconstruction losses of languages X and Y (defined as the cross-entropy between the generated tokens and the source sentence for each language); L_{XY} and L_{YX} correspond to the translation terms of the loss, measuring the token generation of each decoder given a representation produced by the encoder of the other language (using the cross-entropy between the generated tokens and the translation reference); and d corresponds to the distance metric between the representations computed by the encoders. This last term forces the representations to be similar without sharing parameters, while providing a measure of similarity between the generated spaces. We have tested different distance metrics such as L1, L2 or the addition of a discriminator (which tries to predict from which language the representation was generated). For all these alternatives, we experienced a space collapse in which all sentences tend to be located in the same spatial region. This closeness between the sentences of the same language makes the representations non-informative for decoding; as a consequence, the decoder performs as a language model, producing an output based only on the information provided by the previously decoded tokens. To prevent this collapse, we propose a less restrictive measure based on the correlation distance (Chandar 2015), computed as in equations 2 and 3. The rationale behind this loss is to maximize the correlation between the representations produced by each language while not enforcing a distance over the individual values of the representations:

c(h(X), h(Y)) = \frac{\sum_i (h(x_i) - \overline{h(X)}) (h(y_i) - \overline{h(Y)})}{\sqrt{\sum_i (h(x_i) - \overline{h(X)})^2 \sum_i (h(y_i) - \overline{h(Y)})^2}}    (2)

d(h(X), h(Y)) = 1 - c(h(X), h(Y))    (3)
where X and Y correspond to the data sources we are trying to represent; h(x_i) and h(y_i) correspond to the intermediate representations learned by the network for a given observation i; and \overline{h(X)} and \overline{h(Y)} are, for a given batch, the means of the intermediate representations of X and Y, respectively. Figure 1 shows the different tasks and directions that the system is trained to perform. Each decoder is able to process the representation produced by each encoder in order to either translate or reconstruct the source sentence.
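A minimal sketch of this distance term, assuming a fixed-size sentence representation per language (e.g., mean-pooled encoder states) and batch-level statistics as in the equations above, could look as follows; the exact normalization used in our implementation may differ:

```python
import torch

def correlation_distance(hx: torch.Tensor, hy: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Correlation-based distance between two batches of sentence representations
    hx, hy of shape (batch, dim). Each side is centered with its own batch mean,
    so only the co-variation across the batch is constrained, not the absolute
    scale or location of the vectors."""
    hx_c = hx - hx.mean(dim=0, keepdim=True)    # center with the batch mean of h(X)
    hy_c = hy - hy.mean(dim=0, keepdim=True)    # center with the batch mean of h(Y)
    num = (hx_c * hy_c).sum()
    den = torch.sqrt((hx_c ** 2).sum() * (hy_c ** 2).sum()) + eps
    corr = num / den
    return 1.0 - corr                           # small when the representations correlate

# Total training objective (sketch of equation 1):
# loss = l_xx + l_yy + l_xy + l_yx + correlation_distance(enc_x(x), enc_y(y))
```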

Scaling to new languages
Given the jointly trained model between languages X and Y, the next step is to add new languages in order to use our architecture as a multilingual system. Since no parameters are shared between the independent encoders and decoders, our architecture allows adding new languages without retraining the languages currently in the system. Let us say we want to add language Z. To do so, we require parallel data between Z and any language already in the system. Assuming that we have trained X and Y, we need either Z-X or Z-Y parallel data; for illustration, assume we have Z-X parallel data. We can then set up a new bilingual system with language Z as source and language X as target. To ensure that the representation produced by this new pair is compatible with the previously jointly trained system, we reuse the previous X decoder (d_x) as the decoder of the new Z-X system and freeze it. During training, we optimize the cross-entropy between the generated tokens and the language X reference, but we only update the layers belonging to the language Z encoder (e_z). In this way, e_z is trained not only to produce good-quality translations but also to produce representations similar to those of the already trained languages. A minimal sketch of this freezing step is given below.
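The following sketch (PyTorch; function and argument names are hypothetical) freezes the already trained decoder and hands the optimizer only the new encoder's parameters:

```python
import torch

def add_language(new_encoder, frozen_decoder, lr=1e-3):
    """Sketch of the language-addition step: the decoder of an already trained
    language (e.g. d_x) is frozen and only the new encoder (e.g. e_z) is updated,
    so that e_z learns to emit representations the existing decoder understands."""
    for p in frozen_decoder.parameters():
        p.requires_grad = False                 # d_x stays exactly as jointly trained
    frozen_decoder.eval()
    # Only the new encoder's parameters are handed to the optimizer.
    return torch.optim.Adam(new_encoder.parameters(), lr=lr)

# Training step (sketch): standard cross-entropy against the X reference;
# gradients flow through the frozen d_x but only update e_z.
# logits = frozen_decoder(new_encoder(z_batch), x_prev_tokens)
# loss = torch.nn.functional.cross_entropy(logits.transpose(1, 2), x_ref_tokens)
```

Once e_z is trained in this way, zero-shot Z-to-Y translation simply pairs e_z with the existing decoder d_y, as described next.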

Figure 2
Language addition and zero-shot training scheme

Our training schedule enforces the generation of a compatible representation, which means that the newly trained encoder e_z can be used as input to the decoder d_y of the jointly trained system to produce zero-shot Z-to-Y translations; see Figure 2 for an illustration. The fact that the system enables zero-shot translation shows that the representations produced by our training schedule contain useful information, and that this information can be preserved and shared with new languages just by training the new modules against the previous ones, without any modification of the architecture.
A current limitation is the need to use the same vocabulary for the shared language (X) in both training steps. The use of subwords (Sennrich, Haddow, and Birch 2015) mitigates the impact of this constraint.

Experimental framework
In this section, we provide details about the data and implementation for the experiments. Additionally, we report the translation results. Results are presented in terms of BLEU (Papineni et al. 2002) which is the standard automatic measure in machine translation.

Data
For our experiments, we use the Turkish-English parallel data from SETIMES2 (Tiedemann 2009), used in WMT 2017 (http://www.statmt.org/WMT17/), and the Kazakh-English parallel data from the news domain, used in WMT 2019 (http://www.statmt.org/WMT19/). The training set for Turkish-English contains around 200,000 parallel sentences and the one for Kazakh-English around 100,000 parallel sentences. As development and test sets we used newsdev2016 and newstest2016, respectively, for Turkish-English, while newsdev2019 was split into development and test sets for the Kazakh-English experiments. Additionally, we extracted a Kazakh-Turkish test set from the OPUS database (Tiedemann 2012) to evaluate zero-shot translation. For the experiments with larger datasets we used the Russian-English data from WMT 2019 and Russian-Kazakh data. The validation and test sets for Russian-English were extracted from the Yandex corpus, and the validation set for Russian-Kazakh was extracted from news-commentary-v14. Finally, and only for visualization and analysis purposes, we extracted 381 sentences which are multi-way parallel in Turkish-Kazakh-English. These sentences were also extracted from the OPUS database: we downloaded the TK-EN and KK-EN data and matched the English sentences that were identical. The datasets that we prepared for Kazakh-Turkish and Turkish-Kazakh-English, which are the only ones that do not belong to a benchmark, are freely available upon request. Detailed statistics of the corpora are shown in Table 1.
Preprocessing consisted of a pipeline of punctuation normalization, tokenization, filtering out of sentences longer than 80 words, and true-casing. These steps were performed using the scripts available in the Moses tools (Koehn et al. 2007). In the experiments using subwords, the preprocessed data is further tokenized using Byte Pair Encoding (BPE) (Sennrich, Haddow, and Birch 2016).

Implementation
We used the Transformer implementation provided by Fairseq (release v0.6.0, available at https://github.com/pytorch/fairseq). Parameters varied depending on the configuration of the experiment. For experiments within configuration 1 (conf1), we used 6 blocks of multi-head attention with 8 heads each, an embedding/hidden dimensionality of 128, a fixed learning rate of 0.001 and a vocabulary size of 12,000 words. For experiments within configuration 2 (conf2), we used 6 attention blocks with 4 heads, an embedding/hidden dimensionality of 512, a fixed learning rate of 0.001 and a vocabulary of 16,000 BPE tokens. In all cases, we used Adam (Kingma and Ba 2014) as the optimizer. The joint training was performed on two Nvidia Titan X GPUs with 12 GB of RAM, while the addition of languages used one Titan X GPU. As stopping criterion, systems trained until no improvement was seen on the validation set. Table 2 shows BLEU results for the two experimental configurations described above and for each translation direction, English-to-Turkish (EN-TK) and Turkish-to-English (TK-EN). Within the first configuration, we show results for variations of the proposed architecture (JointTrain), which include both the non-variational and the variational (dvq) versions with the same hyperparameters as the baseline system, and a comparison of two distance losses.

Translation quality for the Joint Training
Variational vs non-variational. When using decomposed vector quantization, performance gets worse in both directions, and the loss is larger for EN-TK. When contrasting the impact of decomposed vector quantization within our proposed architecture, we see that the non-variational architecture also performs better than the decomposed vector quantization (dvq) version with any type of distance. In this case, however, when using the correlation distance, the loss is larger in the TK-EN direction than in the opposite one.
Correlation vs Maximum distance loss. Regarding the distance loss, the correlation distance clearly provides better translation results, by approximately 1.5 BLEU in both directions when using the non-variational architecture. The improvement of the correlation distance over the maximum distance is even larger when using the variational architecture. Within this configuration, our best proposed architecture, the non-variational autoencoder with correlation distance (JointTrain + corr), shows performance similar to the baseline system (Transformer).
Up to this point, experiments were performed using words. From here on, we employ subword-level neural machine translation (Sennrich, Haddow, and Birch 2016), the standard tokenization scheme, following the usual set-up with 16,000 merge operations and a vocabulary shared between both languages. The second part of Table 2 shows the performance of our best architecture from configuration 1 against the baseline system under the second experimental configuration (using BPE as tokenization and a larger word embedding of 512).
We achieve gains of +0.5 BLEU from English-to-Turkish and +0.7 BLEU from Turkish-to-English over the corresponding baseline. At this point, we found it interesting to train our architecture without the autoencoding tasks: it still improves over the baseline system, but not over our complete original joint training. Therefore, training with autoencoders helps to improve translation performance.
Note that the performance of both the baseline and our architecture in this second configuration is higher than the best system results from WMT 2017 (García-Martínez et al. 2017), when comparing to the case of using only parallel data, without adding back-translated monolingual data (which were 10.9 BLEU for EN-TK and 14.2 for TK-EN).

Adding new languages and Zero-shot Translation
At this point, we use the best configuration from the above experiments (configuration 2, using BPE and a word embedding dimension of 512). We add Kazakh as a new language to this system, as proposed in section 4.3. Table 3 shows that Kazakh-English performs +0.6 BLEU points above the baseline. The frozen English decoder, previously trained with the Turkish-English parallel data, may be responsible for this increase in performance.
Finally, another relevant aspect of the proposed architecture is that it enables zero-shot translation. To evaluate it, we compare the performance of Kazakh-Turkish translation with that of a pivot system based on a cascade, which translates from Kazakh to English and then from English to Turkish. We can do this pivoting either with the baseline system (standard Transformer) or with the modules of our JointTrain architecture. Finally, we can use the zero-shot strategy, which consists in combining the Kazakh encoder with the Turkish decoder. Results show that zero-shot translation provides slightly lower quality than the pivot systems, while the joint-training pivot improves over the baseline pivot.

Table 3
New supervised language (KK-EN), comparing the baseline architecture to our added language (AddLang), and zero-shot translation (KK-TK) provided by our architecture compared to a baseline pivot system through KK-EN and EN-TK. Best results in bold.

Results in Table 3 may seem low, but note that in Kazakh-to-English our system obtains results comparable to those of the WMT 2019 participants that did not use back-translated monolingual data.

Larger Datasets
One of the main advantages of shared-architecture multilingual NMT systems (Johnson et al. 2016) is that parameter sharing allows information to flow from high-resourced languages to low-resourced languages, improving the performance of the latter.
In this experiment, we want to confirm that our proposed architecture also presents this property, by adding a new low-resourced language to a frozen model trained on a high-resourced language pair. To do so, we trained a Russian-English model and added Kazakh as the low-resourced language, attaching it to the already trained Russian decoder from the Russian-English system.
As the objective of this experiment is only to measure the impact of the additional data for the low-resourced language, the vocabularies were shared among all three languages, covering two different scripts: Latin for English and Cyrillic for Russian and Kazakh. Table 4 shows the results of the baselines and our architecture between Russian and English. Since the system benefits from the larger amount of available data, we want to evaluate its performance when Kazakh is trained against this larger system and zero-shot translation from Kazakh to English is performed; as mentioned, Kazakh is added to the system by training a Kazakh encoder against the already trained Russian decoder. Results in Table 5 show that this zero-shot approach outperforms both the baseline and the direct Kazakh-to-English addition, showing that having access to more data provides better models in this architecture. This holds even though no Kazakh-English parallel data has been used during training and the translation relies only on the compatibility of the sentence representations.
Note that both Table 4 and Table 5 show that our proposed system has a lower performance than the baseline system when trained on this new, larger dataset.

Encoders/Decoders Compatibility and Intermediate Representation Visualization
Our main objective is to create an intermediate representation that can be understood by the different modules trained in the system, where the modules are the encoders and decoders of all languages involved in training. However, similar representations may not lead to compatible encoders/decoders; likewise, different trainings can produce representations with different mean distances that nevertheless generate similar translation outputs. With this in mind, in this section we further analyze our architecture by looking at the compatibility between encoders and decoders and by visualising the intermediate representation.
The model used for the analysis in this section is JointTrain+corr with configuration 2, which is the best performing model for Turkish-English in Table 2 and for Kazakh-English in Table 3.

Encoders/Decoders Compatibility
We propose the following analysis to measure the compatibility of our encoders and decoders. Given a parallel set of sentences in the languages on which the system has been trained, we can generate the encodings e_x and e_y. Since both encodings come from a parallel test set, they consist of the same number of vectors, each of the same dimensionality.
Our proposed analysis consists of feeding each of the decoders in the system (d_x and d_y) with either e_x or e_y as input. This generates two different outputs per decoder: an autoencoding output and a machine translation output. As we have parallel references for both languages, we can measure the BLEU score (Papineni et al. 2002) of each result against the reference to assess how the models perform.
Additionally, we can compute a new BLEU score comparing the autoencoding and the machine translation outputs. In the ideal case, the encoders of two different languages should produce the same representation for the same sentence. Therefore, the difference between the BLEU score obtained with the autoencoding output and with the translation output shows how different the e_x and e_y representations are, in terms of how well the decoder can generate accurate results from them. Our analysis consists in computing the BLEU score using the autoencoding output as reference and the machine translation output as hypothesis (this comparison will be referred to as A-T). Figure 3 shows the full pipeline of this procedure. Table 6 shows that the quality of the decoder output is considerably better when the input comes from the encoder of the same language (autoencoding) than from another language (machine translation). We also include the BLEU score between the autoencoding and translation outputs (A-T), which is the measure we propose to evaluate the quality of our intermediate representation. The low BLEU scores in A-T indicate that we are still far from being able to decode from the common intermediate representation.

Table 6
Comparison of BLEU scores of the JointTrain+corr architecture (configuration 2) when performing autoencoding and machine translation. The third column is the BLEU between the autoencoding and translation outputs (A-T).
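The A-T measure can be computed with any standard BLEU implementation; the sketch below uses the sacrebleu package as a stand-in (function and variable names are illustrative):

```python
import sacrebleu

def compatibility_scores(autoencoded, translated, references):
    """Sketch of the A-T analysis: BLEU of each output against the reference, plus
    BLEU of the translation output using the autoencoding output as reference.
    All arguments are lists of detokenized sentences over the same parallel test set."""
    auto_bleu = sacrebleu.corpus_bleu(autoencoded, [references]).score    # e_x -> d_x
    trans_bleu = sacrebleu.corpus_bleu(translated, [references]).score    # e_y -> d_x
    a_t_bleu = sacrebleu.corpus_bleu(translated, [autoencoded]).score     # A-T measure
    return auto_bleu, trans_bleu, a_t_bleu
```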

Visualization and Translation Examples
In what follows, we analyze the intermediate representations at the last attention block of the encoders, where we force the similarity. In order to graphically show the representation, we use an in-house visualization tool (?), which is freely available. The tool trains a UMAP (McInnes et al. 2018) model combining the representations of the different languages and performs a dimensionality reduction of the sentence representations. We use the 381 multi-way parallel sentences extracted specifically for this analysis (see statistics in Table 1). Figure 4 shows the sentence representations created by the corresponding encoders. The separated clusters show that languages are not yet represented in the same space. Related work (Arivazhagan et al. 2018) reports similar results for a multilingual system with shared encoder and decoder: while the system is able to produce compatible representations, clear clusters can be observed for each language in the system. Plausible explanations for this behaviour are the distance measure that we use and/or the alignment of the source sentences. Some distance measures cause the representations to collapse into a small region of the space, making them non-informative for the decoder. Our distance measure, the correlation distance, enforces the representations to correlate but does not constrain the scale of the values in the contextual vectors; it enforces the sentence distribution within each language to be similar across languages, but since the scale is not constrained, each language can end up represented in a different region of the space. Table 7 shows some examples of English translations (using the English decoder) obtained from the three different encoders (English, Turkish and Kazakh), together with the reference translation.

Figure 4
Encoder representations of the 381 multi-way parallel sentences: Turkish sentence representations (blue) compared to English sentence representations (red) and Kazakh sentence representations (green).
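A projection similar to Figure 4 can be reproduced with the open-source umap-learn and matplotlib packages (a sketch, not the in-house tool cited above):

```python
import numpy as np
import umap                      # umap-learn package
import matplotlib.pyplot as plt

def plot_sentence_spaces(reps: dict):
    """Sketch of the visualization step: fit a single UMAP model on the encoder
    representations of all languages and plot the 2-D projection per language.
    `reps` maps a language name to an array of shape (num_sentences, dim)."""
    reducer = umap.UMAP(n_components=2)
    stacked = np.concatenate(list(reps.values()), axis=0)
    projected = reducer.fit_transform(stacked)              # joint 2-D embedding
    offset = 0
    for lang, mat in reps.items():
        pts = projected[offset:offset + len(mat)]
        plt.scatter(pts[:, 0], pts[:, 1], s=8, label=lang)
        offset += len(mat)
    plt.legend()
    plt.show()
```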

Table 7
Examples of English translations produced by the English decoder from the English (EN), Turkish (TK) and Kazakh (KK) encoders, together with the reference translation.

Reference: it was a terrific season.
EN: we had a strong season .
TK: it was a terrific season.
KK: we made a very big season .

Reference: in London and Madrid it is completely natural for people with serious handicaps to be independently out in public, and they can use the toilets, go to the museum, or wherever ...
EN: in London and Madrid , it is very normal for people with severe disability to be left to the public and be able to serve , to the museum , where ...
TK: in London and Madrid it is completely natural for people with serious handicaps to be independently out in public, and they can use the toilets, go to the museum, or wherever ...
KK: in London and Madrid, it is quite common for people with a heavy disability to travel on their own in public spaces; they can go to the toilets, to the museum, anywhere ...

Reference: from the Czech viewpoint, it seems they tend to put me to the left.
EN: from a Czech point of view, I have the impression that people see me more than on the left.
TK: from the Czech viewpoint, it seems they tend to put me to the left.
KK: from a Czech point of view , I have the impression that people are putting me on the left .

Conclusions
We proposed a novel translation architecture which aims at a common intermediate representation for the benefit of incremental training in machine translation.
While there are already some machine translation systems for which the implicit emergence of an internal interlingua representation has been suggested, the main difference proposed in the current paper is forcing the neural machine translation system to learn an intermediate multilingual representation while using independent encoders and decoders. This is achieved by combining the maximum likelihood loss normally used in neural machine translation with an extra loss term that computes a measure of the distance between intermediate representations of different languages.
By achieving an interlingua representation, encoders and decoders are decoupled, with the interlingua acting as an interface. This enables every possible combination of encoder and decoder, effectively turning the quadratic needs, in terms of training data and resources, into linear ones. Furthermore, such a decoupling also allows training encoders/decoders to/from a new language that only has parallel data with one of the already supported languages, enabling translation to/from any of the other supported languages. As a consequence, our architecture allows extending a bilingual system to a multilingual one by incremental training. This particularity allows for quickly deploying new systems when new languages are required, which is highly relevant in scenarios such as humanitarian crises.
We have evaluated our model on different low-resourced language pairs. Our model outperforms current bilingual systems in the low-resourced setting (but not in the high-resourced one), in addition to presenting a flexible architecture that enables scaling to new languages (achieving multilingual and zero-shot translation) without retraining the languages already in the system. Our approach also supersedes pivoting approaches, again in the context of a low-resourced setting. One of the next steps will be to further exploit monolingual data in our architecture, avoiding the dependency on the availability of parallel data.