Current neural machine translation (NMT) systems can translate many types of texts with high accuracy and fluency. The most important condition for high-quality translation is a sufficient amount of training data, i.e. parallel texts in the desired language pair. Aside from the language itself, other properties of the training data, and their overall similarity to the inputs we plan to translate, also matter: style, domain and topic, specific vocabulary, or even sentence length. The more similar the training data are to the texts that will be translated, the better the results. Research in domain adaptation therefore mostly consists of exploring methods for obtaining specific types of training data and using them to fine-tune translation models for particular input texts.
There are, however, types of texts for which domain adaptation via additional training data may not be sufficient, and novel training and decoding algorithms are necessary to reach satisfying translation quality. More specifically, models based on neural networks are notoriously bad at processing surprising or even adversarial inputs. As a real-world example of such texts, some forms of literature base their appeal on surprising the reader. We formally define surprisal as the information conveyed by a linguistic unit in a text, typically by a word in a sentence, i.e. the negative log-probability of the word given its preceding context. In NMT, the problem is twofold: first, the model needs to meaningfully encode the atypical input (this part of the issue is shared by all NLP problems solved by neural networks, for example sentiment classification or named entity recognition), and then it has to produce equally atypical output (this part is specific to NMT). To generate the output sequence from the model, a decoding algorithm is necessary. The most popular decoding algorithm of today, beam search, prefers outputs in which all words carry a similar amount of information (i.e. have similar levels of surprisal for the model/reader; see, for example, Fig. 2 in [1]).
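To make the notion concrete, the following minimal Python sketch turns per-token log-probabilities, as produced by any autoregressive NMT model, into surprisal values. The numbers in `log_probs` are purely illustrative and do not come from a real model.

```python
import math

def token_surprisals(token_log_probs):
    """Convert per-token log-probabilities (natural log) into surprisal in bits.

    Surprisal of a token w_t given its context is -log2 P(w_t | w_<t).
    A flat surprisal profile across the sentence is the kind of output
    beam search tends to prefer; human-written text is typically more varied.
    """
    return [-lp / math.log(2) for lp in token_log_probs]

# Illustrative log-probabilities an NMT model might assign to the tokens
# of one candidate translation.
log_probs = [-0.3, -2.1, -0.8, -4.5, -0.2]
print(token_surprisals(log_probs))
# A high value (here the 4th token) marks a word that is surprising to the model;
# beam search, which maximizes the sum of log-probabilities, tends to avoid
# hypotheses containing such spikes.
```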
Another typical issue of today's NMT algorithms is underestimation of the probability of rare words and, as a result, an inadequate overrepresentation of frequent words in the output [2]. Both of these issues make NMT less suitable for the described types of texts, since they reduce the diversity and surprisal of the output. Another part of the NMT architecture that might be suboptimal for our use case is the objective function used to train the model. For a source sentence x and a target sentence y, traditional NMT models a single distribution P(y|x) over all possible sentences in the target language, i.e. for a given source sentence, all target sentences must "compete" for the probability mass of a single distribution. This formulation builds in the assumption of a single correct translation. This is of course not true, and the assumption may further hurt the diversity of the translation, since rare alternative translations are forgotten during training in favor of the more common ones. Recently, a novel objective function was introduced: SCONES [3] models NMT as multi-label classification. For each pair (x, y), a separate binary classifier is trained to indicate whether the sentences are translations of each other, which allows the model to express the ambiguity of the translation task.
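The sketch below contrasts the two objectives at the level of a single decoding step. It is a simplified illustration of the multi-label idea rather than the actual SCONES implementation from [3]; the logits and gold indices are made up for the example.

```python
import torch
import torch.nn.functional as F

# Hypothetical decoder logits for one target position: batch of 2, vocabulary of 5.
logits = torch.randn(2, 5)
gold = torch.tensor([1, 3])  # indices of the reference target tokens

# Conventional NMT: a single softmax distribution over the vocabulary,
# so all candidate tokens compete for one unit of probability mass.
softmax_loss = F.cross_entropy(logits, gold)

# SCONES-style multi-label view (simplified): each vocabulary entry gets its
# own binary "is this a valid continuation?" classifier, trained with binary
# cross-entropy, so several tokens could be marked correct at once without
# competing for probability mass.
targets = torch.zeros_like(logits)
targets[torch.arange(logits.size(0)), gold] = 1.0
multilabel_loss = F.binary_cross_entropy_with_logits(logits, targets)

print(float(softmax_loss), float(multilabel_loss))
```

With the binary formulation, an alternative valid translation can simply receive a second positive label instead of taking probability mass away from the reference, which is the property we want to exploit for ambiguous or surprising inputs.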
We plan to examine the effect of this objective function and its modifications in our use case. In general, it has been shown that conventional NMT, i.e. the combination of maximum likelihood estimation during training and maximum-a-posteriori decoding (i.e. beam search), shifts the statistical properties of the produced texts towards more uniform outputs, both in lexical choice and in the distribution of surprisal over the words of the output sentence. Our proposal focuses on alternative approaches that mitigate this problem. A specific translation task is simultaneous translation (SimulMT if the input is text, SimulST in the case of direct speech input). In simultaneous translation, the NMT model is expected to produce a translation while the source text is still incomplete and growing. This further increases the difficulty of the task, as we usually also require fast and stable translations. There are two major approaches: incremental and retranslation. In the incremental approach, once a partial hypothesis is shown to the user, the model cannot revisit its decision. In the retranslation approach, revising the partial hypothesis is allowed; however, too many retranslations hurt the user experience. There are several challenges connected to simultaneous translation. First, although their amount is growing, spoken corpora are still scarce, which leads to the use of written text during training. This is not optimal because, as mentioned before, NMT is notorious for poor generalization. Another issue in simultaneous translation is over- and under-generation, which might be associated with the inappropriate assumption of a single correct translation. Since simultaneous translation thus also has many properties that differ from ordinary, general-domain text, we decided to include it as one of the analyzed types of input.
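As an illustration of the retranslation approach, the toy loop below re-translates the whole source prefix every time a new source token arrives and counts how often already displayed output changes. Here `translate` is a hypothetical stand-in for an arbitrary NMT system, not a function of any specific toolkit.

```python
def retranslation_stream(source_tokens, translate):
    """Toy retranslation loop for simultaneous translation.

    `translate` is assumed to map a list of source tokens to a list of
    target tokens. The function returns the successive hypotheses and the
    number of "flickers", i.e. how often an already displayed token was
    revised -- the quantity that hurts user experience when it grows large.
    """
    shown, flickers, history, prefix = [], 0, [], []
    for tok in source_tokens:
        prefix.append(tok)
        hypothesis = translate(prefix)
        # Count positions where the new hypothesis changes what was already shown.
        flickers += sum(1 for a, b in zip(shown, hypothesis) if a != b)
        shown = hypothesis
        history.append(hypothesis)
    return history, flickers
```

An incremental policy would correspond to a loop that only ever appends to `shown` and never rewrites it, which keeps the output stable but prevents the model from correcting early mistakes.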
In the first year, we will carry out the following steps:
Bibliography:
[1] Holtzman, Ari, et al. "The Curious Case of Neural Text Degeneration." 2019.
[2] Ott, Myle, et al. "Analyzing Uncertainty in Neural Machine Translation." 2018.
[3] Stahlberg, Felix, and Shankar Kumar. "Jam or Cream First? Modeling Ambiguity in Neural Machine Translation with SCONES." 2022.