[Paper] Survey: Cross-Lingual Representation Learning

Part of the related work section in my Master Thesis. Download the full version here.


Table of Contents

  1. Motivation
  2. Constructing Word Representations
  3. Evolution of Architectures
  4. Cross-Lingual Static Word Representations
  5. Cross-Lingual Contextualized Word Representations
  6. Challenges in Multilingual Transformers

Motivation

The latest NLP technology relies on pre-training on massive amounts of text in the respective language in an unsupervised fashion, producing fixed-size sequence or word representations that can then be fine-tuned on a task with sufficient labeled data. In most cases, both data sources are needed to meet the performance of state-of-the-art NLP approaches. However, the lack of both data sources severely degrades the performance of these methods, posing a fundamental problem for scaling to low-resource languages. Cross-lingual representation techniques try to alleviate this data scarcity for low-resource languages by inducing an aligned representation space across languages, i.e., language-agnostic language representations. The idea is to transfer lexical, syntactic, and semantic knowledge across languages so that it can be used for cross-lingual downstream tasks. This gives rise to two advantages: (1) Transferring lexical knowledge across languages enables us to reason about the semantics of words in multilingual contexts and is a vital source of knowledge for multilingual systems such as machine translation (Artetxe et al.; Qi et al.; Lample et al.), multilingual search, and question answering (Vulić and Moens). (2) More importantly, given a downstream task, models can exploit the joint representation space by training on a high-resource language such as English, where labeled data exists, and then transfer the acquired task knowledge to other (low-resource) languages. The hope is that the model generalizes lexical properties and relations across languages (Plath). Ultimately, cross-lingual representations can also be seen as a type of transfer learning, which can help us understand why transferring knowledge across languages works.

Relation to Transfer Learning. Transfer Learning is a sub-field of Machine Learning that focuses on reusing knowledge acquired from past related tasks to help the learning process of solving a new task. Cross-Lingual Representation Learning is therefore a type of transfer learning, most similar to domain adaptation; see Figure 1 for a taxonomy of transfer learning in NLP.

A taxonomy of transfer learning for NLP. Given a downstream task, Cross-Lingual Representation Learning utilizes the joint representation space by fine-tuning on a high-resource language and then transferring the acquired task knowledge to other (low-resource) languages.

Viewing cross-lingual representation as a form of transfer learning can help us understand in which cases knowledge from one to another language can be transferred:

  • Transfer Learning works well when the underlying structures are similar to each other. Languages share many aspects on many different levels: on a lexical level, languages incorporate words from other languages (loanwords) and have words with a common origin (cognates); on a syntactic level, languages might structure sentences similarly; and on a semantic level, languages are built upon a so-called natural semantic metalanguage, see (Goddard) for a more in-depth analysis.
  • Transfer Learning fails when the source and target settings are vastly different. In our cross-lingual setting, it is hard to transfer any knowledge between languages that are not related in any way, i.e., languages that are neither typologically nor etymologically related.

In summary, languages must be related in some way; otherwise, we can not transfer knowledge across them. Languages in the same language family share much more than unrelated languages, which is also reflected in the performance of cross-lingual methods (Lample and Conneau; Lauscher et al.). Furthermore, even when languages come from different language families, (Goddard) argues that, on a semantic level, languages are built upon a natural semantic metalanguage and therefore share a connection.

Constructing Word Representations

The question remains how one can exploit shared structures across languages to build a cross-lingual representation. As word representations, i.e., words represented as real-valued vectors, are the basis of modern NLP and of cross-lingual representations, we first discuss different approaches to building word representations.

Bag-of-Words. The simplest way to represent words in vector form is the bag-of-words approach, which describes the occurrence of words within a document. However, the bag-of-words approach has major problems: it loses information about word order and semantics, it is highly dimensional (curse of dimensionality), and it can not represent out-of-vocabulary tokens.
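
As a concrete illustration, the following minimal sketch (plain Python; the two toy sentences are invented for illustration) builds a vocabulary and turns each document into a vector of word counts:

```python
from collections import Counter

# Toy corpus; each document becomes a vector of word counts.
docs = ["the cat sat on the mat", "the dog chased the cat"]

# Build a fixed vocabulary from all documents.
vocab = sorted({word for doc in docs for word in doc.split()})

def bag_of_words(doc):
    """Represent a document as a vector of word counts over the vocabulary."""
    counts = Counter(doc.split())
    return [counts[word] for word in vocab]

for doc in docs:
    print(doc, "->", bag_of_words(doc))
# Any word outside `vocab` is simply dropped: out-of-vocabulary tokens cannot
# be represented, and word order is lost entirely.
```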

Distributed Word Representation. To improve upon bag-of-words, distributed word representations were introduced, which represent words (or, more generally, tokens1) as dense representations of lower dimensionality, trained to capture syntactic and semantic relationships between words. The approach is motivated by the underlying idea that "a word is characterized by the company it keeps", known as the distributional hypothesis (Harris). This means that words that occur in the same contexts are semantically related. The general approach to generating distributed word embeddings is to compute word co-occurrence statistics from unlabeled free text (unsupervised).

Language Models. The most prominent way to make use of the distributional hypothesis to create distributed word representations is by using language models (LMs), which are the backbone of modern NLP. LMs first gained momentum when (Collobert and Weston) showed the effectiveness of applying neural models to language modeling to create semantic word embeddings for downstream tasks. It was the starting point of the modern approach to NLP: pre-train neural models to create word representations for downstream tasks. Formally, given a sequence of tokens ${t_1, ..., t_m}$ of length $m$, an LM outputs a probability distribution over the sequence, i.e., the likelihood of the tokens occurring jointly:

\[\begin{equation} \label{eq:lm_likelihood} P(t_1, ..., t_m) \stackrel{\text{(chain rule)}}{=} P(t_1) \cdot P(t_2 \vert t_1) \cdot ... \cdot P(t_m\vert t_1,...,t_{m-1}) = \prod_{i=1}^m P (t_i \vert t_{1:i-1}). \end{equation}\]

The chain rule allows us to calculate the joint probability of the tokens in the sequence as conditional probabilities of a token given its respective previous tokens. Even then, this quickly becomes intractable due to the combinatorial explosion of the possible number of previous tokens. To avoid this problem, we typically leverage the Markov assumption, i.e., we assume that the probability $P(t_m\vert t_1,...,t_{m-1})$ depends only on the previous $n-1 \ll m$ tokens:

\[P(t_m\vert t_1,...,t_{m-1}) \approx P(t_m\vert t_{m-(n-1)},...,t_{m-1})\]

\[\implies P(t_1, ..., t_m) \approx \prod_{i=1}^m P (t_i \vert t_{i-(n-1):i-1}).\]

As the joint probability of the tokens now only depends on a product of probabilities of the form

\[\begin{equation} \label{eq:lm} P(t_i \vert t_{i-(n-1)}, ..., t_{i-1}), \end{equation}\]

called $n$-grams, we need to estimate these $n$-gram probabilities, which can be done with the maximum likelihood estimate (MLE) on the training corpus. In practice, models are trained to either predict the subsequent token (directional) or to predict a missing token given the surrounding context of the word (bidirectional).
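
As a concrete illustration, the sketch below (plain Python; the two-sentence corpus is invented for illustration) estimates bigram probabilities ($n = 2$) with the MLE, i.e., $P(t_i \vert t_{i-1}) = \text{count}(t_{i-1}, t_i) / \text{count}(t_{i-1})$:

```python
from collections import Counter

# Toy training corpus; <s> and </s> mark sentence boundaries.
corpus = [["<s>", "the", "cat", "sat", "</s>"],
          ["<s>", "the", "dog", "sat", "</s>"]]

unigram_counts = Counter(t for sent in corpus for t in sent)
bigram_counts = Counter((sent[i - 1], sent[i]) for sent in corpus
                        for i in range(1, len(sent)))

def bigram_mle(prev, token):
    """MLE estimate of P(token | prev) from raw counts."""
    return bigram_counts[(prev, token)] / unigram_counts[prev]

print(bigram_mle("the", "cat"))  # 0.5: "the" is followed by "cat" in 1 of its 2 occurrences
```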

To evaluate the performance of a language model, the usual metric is perplexity (Jelinek et al.), which is defined as the inverse probability of the sequences in a validation set, normalized by the number of tokens:

\[\begin{equation} PP(t_1,...,t_{m}) = \sqrt[m]{\frac{1}{P(t_1, t_2, ...,t_{m})}} \stackrel{\text{(chain rule)}}{=} \sqrt[m]{\frac{1}{\prod_{i=1}^m P (t_i \vert t_{1:i-1})}}. \end{equation}\]

A lower perplexity indicates a better model.
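
Equivalently, perplexity can be computed as the exponential of the average negative log-probability per token; a minimal sketch, assuming the per-token conditional probabilities have already been produced by some language model:

```python
import math

def perplexity(token_probs):
    """Perplexity = exp(-(1/m) * sum(log P(t_i | t_1..t_{i-1})))."""
    m = len(token_probs)
    return math.exp(-sum(math.log(p) for p in token_probs) / m)

# Hypothetical conditional probabilities assigned to a 4-token sequence.
print(perplexity([0.2, 0.5, 0.1, 0.4]))  # lower is better
```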

The following subsection will explore different model architectures that utilize language modeling to create powerful word embeddings.

Evolution of Architectures

We outline the evolution of (neural) architectures in NLP to induce strong word representations that can be utilized for downstream tasks, such as Natural Language Understanding. For now, we restrict ourselves to inducing monolingual word representations using the language modeling task on monolingual text corpora.

Feedforward Neural Networks. (Mikolov et al.) introduce an efficient way of learning high-quality word vectors for millions of words in the vocabulary from a large amount of unlabeled text data in an unsupervised fashion. The released word embeddings not only capture semantic and syntactic information but also learn relationships between words2, e.g., Paris - France + Italy = Rome. They dub their approach word2vec and give two novel model architectures: Skip-Gram (SG) and Continuous Bag-of-Words (CBOW).

Both CBOW and SG architectures are based on a simple feedforward neural network. The CBOW method computes the probability of the current token based on the context tokens within a window $W$ of $k$ neighboring tokens. Conversely, SG computes the probability of the surrounding tokens within a window $W$ of $k$ neighboring tokens given the current token. The network encodes each token $t_i$ into a center-token embedding $e_i$ and a context-token embedding $c_i$, which correspond to the $i$-th rows of the center-token matrix $E \in \mathbb{R}^{\vert V \vert \times d}$ and the context-token matrix $C \in \mathbb{R}^{\vert V \vert \times d}$, where $\vert V \vert$ is the size of the vocabulary and $d$ the token embedding size. Given a center token $t$, SG estimates the likelihood of seeing a context token $w$ conditioned on the given center token with the softmax function:

\[\begin{equation} \label{eq:word2vec_likelihood} P(c_w \vert e_t) = \frac{\exp{(e_t^T c_w)}}{\sum_{i=1}^{\vert V \vert} \exp{(e_t^T c_i)} }, \end{equation}\]

where $e_t$ denotes the embedding of the center token $t$ and $c_w$ the embedding of the context token $w$ in the window $c_w \in W$. Given a text corpus ${t_1, ..., t_T}$ of length $T$ and assuming that the context tokens are generated independently given any center token, we learn the model parameters (the token embeddings) by maximizing the likelihood over the text corpus, which is equivalent to minimizing the negative log-likelihood:

\[\begin{equation} \label{eq:word2vec} \max_{e,c} \prod_{t=1}^T \prod_{w \in W_t} P(c_w \vert e_t) \Leftrightarrow \min_{e,c} - \sum_{t=1}^T \sum_{w \in W_t} \log \left( P(c_w \vert e_t) \right). \end{equation}\]

For a downstream task, the final embedding of token $t$ is either its center representation or the element-wise average or sum of its center and context representations. The word2vec objective therefore directly uses the language modeling task to generate effective word embeddings.
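
To make the Skip-Gram likelihood above concrete, the following numpy sketch (toy dimensions, randomly initialized embedding matrices; not the actual word2vec training loop) computes $\log P(c_w \vert e_t)$ with a full softmax and the resulting negative log-likelihood contribution of a single (center, context) pair:

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 1000, 50                          # vocabulary size and embedding dimension
E = rng.normal(scale=0.1, size=(V, d))   # center-token embeddings
C = rng.normal(scale=0.1, size=(V, d))   # context-token embeddings

def skipgram_log_prob(center_id, context_id):
    """log P(context | center) with a full softmax over the vocabulary."""
    scores = C @ E[center_id]                            # dot product with every context vector
    log_probs = scores - np.log(np.sum(np.exp(scores)))  # log-softmax
    return log_probs[context_id]

# Negative log-likelihood contribution of one (center, context) pair.
nll = -skipgram_log_prob(center_id=3, context_id=17)
print(nll)
```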

Even though word2vec is very effective in creating powerful word representations, there are considerable drawbacks: the denominator of the word2vec likelihood sums over the entire vocabulary, slowing down the calculation of the softmax. Approaches such as hierarchical softmax and negative sampling exist to overcome this (Mikolov et al.). Still, there are two major conceptual disadvantages of word2vec representations: First, they can not embed tokens outside the vocabulary, and second, they do not account for the linguistic morphology of a word, e.g., the representations of "eat" and "eaten" are learned separately (no parameter-sharing) based on the contexts they appear in.

To solve the above issues, (Bojanowski et al.) introduce FastText, a new embedding method. FastText extends the idea of word2vec by using the internal structure of a word to improve the word representations of word2vec. Instead of constructing representations for words, FastText learns representations for character n-grams, which are then used to build word representations by summing the bag of character n-grams up. E.g., for $n=3$, the word "artificial" is represented by $<$ar, art, rti, tif, ifi, fic, ici, cia, ial, al$>$, where the angular brackets indicate the beginning and end of the word. This allows us to better represent and capture the meaning of suffixes and prefixes. Furthermore, words that do not appear during training can be represented by breaking the word into $n$-grams to get its representation.
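
A minimal sketch of the character n-gram extraction for $n=3$ (FastText itself uses a range of n-gram lengths and additionally keeps the full word as its own token):

```python
def char_ngrams(word, n=3):
    """Character n-grams of a word, with < and > marking the word boundaries."""
    wrapped = f"<{word}>"
    return [wrapped[i:i + n] for i in range(len(wrapped) - n + 1)]

print(char_ngrams("artificial"))
# ['<ar', 'art', 'rti', 'tif', 'ifi', 'fic', 'ici', 'cia', 'ial', 'al>']
# The word vector is then the sum of the vectors of these n-grams.
```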

The released representations of FastText and word2vec became famous because of their ease of use and effectiveness in a variety of NLP problems (Lample et al.; Kiros et al.; Kusner et al.). Furthermore, these (monolingual) representations can be used to construct a cross-lingual representation space by mapping representations of multiple languages into a shared space (see Section 3).

However, word2vec and FastText have several drawbacks: each word has a static word representation. Consequently, both methods can not correctly capture phrases and polysemy of words. Furthermore, during training, we only consider the context of a word, leading to similar representations for a word and its antonym, since both appear in similar contexts. Another drawback is that we only consider a fixed-size window of context words for conditioning the language model. A more natural way to learn representations is to allow a variable number of context words.

Recurrent Neural Network (RNN). An RNN is a class of artificial neural networks specialized in processing sequential data, e.g., natural language text. RNNs are capable of conditioning the model on an arbitrary number of words in the sequence. Unrolled over time, a uni-directional RNN consists of one hidden layer per time step $t$. At each time step $t$, the hidden layer receives two inputs: the output of the previous hidden layer $h_{t-1}$ and the input at that time step $x_t$.


To produce the output features $h_t$ and to obtain a prediction $\hat{y}_t$ of the next word, we utilize the weight matrices $W^{(hh)}, W^{(hx)}, W^{(S)}$ as follows:

\[h_t = \sigma\left(W^{(hh)} h_{t-1} + W^{(hx)} x_{[t]} \right)\] \[\hat{y}_t = \text{softmax}\left(W^{(S)} h_{t} \right)\]

Notice that the weights $W^{(hh)}, W^{(hx)}$ are applied repeatedly at each time step, i.e., they are shared across time steps. This allows the model to process sequences of arbitrary length. Furthermore, the model size does not increase with longer input sequences. In theory, RNNs can use information from any step in the past. In practice, however, this is difficult, as vanishing and exploding gradients become a big issue with long sequences (Hochreiter; Bengio et al.), which makes the model insensitive to past inputs. To alleviate these issues, we mention some heuristic solutions: clipping the gradient to a small number whenever it explodes (Pascanu et al.), initializing $W^{(hh)}$ as the identity matrix, which helps avoid vanishing gradients (Le et al.), and using the Rectified Linear Unit (ReLU) instead of the sigmoid function (Agarap). However, one of the most important extensions to solve the vanishing gradient problem is the so-called long short-term memory (LSTM) (Hochreiter and Schmidhuber; Gers et al.), a sub-architecture for the hidden layer of an RNN. The LSTM unit introduces a gating mechanism that selectively propagates only a subset of relevant information across time steps and consequently mitigates the vanishing gradient problem.
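
To make the recurrence above concrete, here is a small numpy sketch (toy dimensions, random weights, sigmoid as $\sigma$) that unrolls the RNN over a sequence while reusing the same weight matrices at every time step:

```python
import numpy as np

rng = np.random.default_rng(0)
d_h, d_x, V = 16, 8, 100             # hidden size, input size, vocabulary size
W_hh = rng.normal(scale=0.1, size=(d_h, d_h))
W_hx = rng.normal(scale=0.1, size=(d_h, d_x))
W_S = rng.normal(scale=0.1, size=(V, d_h))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Unroll the RNN over a sequence of input vectors, reusing the same weights.
xs = [rng.normal(size=d_x) for _ in range(5)]
h = np.zeros(d_h)
for x_t in xs:
    h = sigmoid(W_hh @ h + W_hx @ x_t)   # hidden state update
    y_hat = softmax(W_S @ h)             # distribution over the next token
print(y_hat.shape)  # (100,)
```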

RNNs and LSTMs started to dominate NLP, either performing competitively with or outperforming the existing state of the art on various tasks (Sutskever et al.; Mikolov et al.; Sutskever et al.). One particularly interesting architecture emerged to address tasks where an output sequence is needed, such as machine translation: the encoder-decoder architecture. It was first proposed by (Hinton and Zemel) and later used in the context of NLP (Kalchbrenner and Blunsom; Sutskever et al.).


The encoder part takes a sequence as input and outputs a single vectorized representation of the whole sequence (called the thought vector), which the decoder then takes as input to generate an output sequence.

However, since the thought vector is a fixed-size representation and the decoder only depends on the thought vector, the representative power of the (uni-directional) RNN encoder-decoder architecture is naturally limited: all the information about the input has to be encoded into the fixed-size thought vector, which becomes increasingly difficult for long sequences.

Attention. To address the shortcoming of encoder-decoders described above, (Bahdanau et al.) introduce the concept of attention. The attention module allows the decoder to re-access and select the encoder's hidden states at decoding time.

An encoder-decoder model for machine translation with an added attention mechanism.

Following the numbers in the figure above, the attention module can be explained as follows: (1) the decoder's hidden state and (2) the intermediate hidden states of the encoder are fed into the attention module. (3) The attention module then selects relevant information from the hidden states of the encoder based on the decoder's hidden state and calculates the context vector. Finally, (4) the decoder takes the context vector and the last output word ("cómo") as input and outputs the next word. (Galassi et al.) give an exhaustive overview of different attention implementations. However, we restrict ourselves to the common attention mechanism used in transformers, explained in the next paragraph.

First, the decoder state $\mathbf{h}_D \in \mathbb{R}^{1 \times d}$ is embedded into a query $\mathbf{q} \in \mathbb{R}^{1 \times d_k}$ using a learnable weight matrix $W_Q \in \mathbb{R}^{d \times d_k}$:

\[\mathbf{q} = \mathbf{h}_D W_Q\]

and the encoder states $\mathbf{h}_E^{(i)}$, where $i$ denotes the encoder time step, are stacked into an encoder state matrix $\mathbf{H}_E$, which is used to produce the key matrix $\mathbf{K}$ and value matrix $\mathbf{V}$:

\[\mathbf{K} = \mathbf{H}_E W_K, \quad \mathbf{V} = \mathbf{H}_E W_V,\]

where $W_K \in \mathbb{R}^{d \times d_k}, W_V \in \mathbb{R}^{d \times d_v}$ are learnable weights. We then calculate the attention weights, which measure how relevant a single key vector $\mathbf{k}^{(i)}$ is for the query $\mathbf{q}$:

\[w = \mathbf{q} \mathbf{K}^T.\]

Commonly, the weights are normalized into a probability distribution using the softmax function and are then used to create the context vector $\mathbf{c}$ by taking the weighted average of the values:

\[\mathbf{c} = \sum_i a^{(i)} \mathbf{v}^{(i)} \quad \text{ where } \quad a^{(i)} = \text{softmax}(w)_i.\]

During training, we optimize the weights $W_Q, W_K, W_V$, which improves the selective focus of the attention module. We can summarize (dot-product) attention as:

\[\text{Attention}(Q, K, V) = \text{softmax}\left( Q K^T \right) V.\]

Notice that we extended the query vector to a query matrix $Q$, where each row represents one query vector.
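
A numpy sketch of this dot-product attention (toy shapes; a single decoder query attending over a handful of encoder states):

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Dot-product attention: softmax(Q K^T) V."""
    weights = softmax(Q @ K.T, axis=-1)   # one attention distribution per query row
    return weights @ V                    # weighted average of the value vectors

rng = np.random.default_rng(0)
d_k, d_v, n_enc = 8, 8, 6                 # key/value dims and number of encoder states
Q = rng.normal(size=(1, d_k))             # a single decoder query
K = rng.normal(size=(n_enc, d_k))
V = rng.normal(size=(n_enc, d_v))
print(attention(Q, K, V).shape)           # (1, 8): the context vector
```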

Transformers. Even though attention solves the issue of restricted expressiveness, RNNs have another main architectural drawback: they are slow, since they process sequences sequentially and are therefore hard to parallelize. A new model architecture based solely on attention mechanisms and fully parallelizable was proposed by (Vaswani et al.), called the Transformer, an encoder-decoder model. In this thesis, our models only rely on the encoder part of the model, which is why we omit the description of the decoder. We visualize the architecture of the encoder:

A transformer encoder block, adapted from (Vaswani et al.).

First, the sequence is tokenized; the model then embeds each token in the input sequence with a token embedding layer and adds a positional encoding3 depending on the position of the token in the sequence. These representations are then routed $N$ times through separate (self-)attention and feedforward sub-networks. The core difference between the attention module described above and self-attention is that the query matrix $\mathbf{Q}$ is generated from the tokens of the input sequence themselves, and each token can attend to all other tokens of the same sequence, including itself, to generate its new representation. Furthermore, transformers do not use plain dot-product attention but scaled dot-product attention:

\[\text{Attention}(Q, K, V) = \text{softmax}\left( \frac{QK^T}{\sqrt{d_k}} \right) V.\]

Additionally, they utilize multiple heads (multiple self-attention layers), which split the query, key, and value matrices $Q, K, V$ along the embedding dimension with $d_k = d_v = d / h$, where $h$ is the number of heads. Self-attention is then applied independently in each head, each having its own parameters. The advantage of multiple heads is that tokens can jointly attend to multiple tokens in the sequence. Each head produces its own output; the outputs are concatenated and once again projected, resulting in the final values

\[\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, ..., \text{head}_h) W^O\]

where $W^O \in \mathbb{R}^{hd_v \times d}$ is the projection matrix. Finally, the result is fed into a feedforward neural network. Additionally, the architecture uses dropout, residual connections, and layer normalization to stabilize and improve training.
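
For reference, such an encoder block is available off the shelf in PyTorch; the following is a minimal usage sketch, assuming a recent PyTorch installation (the hyperparameters mirror the base configuration of (Vaswani et al.), not the models used later in this thesis):

```python
import torch
import torch.nn as nn

# One transformer encoder block: multi-head self-attention + feedforward,
# with residual connections, layer normalization, and dropout.
encoder_layer = nn.TransformerEncoderLayer(
    d_model=512, nhead=8, dim_feedforward=2048, dropout=0.1, batch_first=True)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=6)   # stack N = 6 blocks

# A batch of 2 sequences, each with 10 token embeddings of size 512
# (token + positional embeddings are assumed to be computed beforehand).
x = torch.randn(2, 10, 512)
out = encoder(x)
print(out.shape)  # torch.Size([2, 10, 512])
```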

Using the transformer architecture improves upon RNNs in many ways: through multiple heads, the total computational complexity per layer is much lower, and through the ability to parallelize many computations, the scalability of transformers far exceeds that of RNNs. Therefore, stacking transformer blocks to increase the representative model capacity can be done efficiently. Furthermore, the path length between long-range dependencies in the network is reduced from $O(n)$ to $O(1)$, as self-attention allows direct access to the input.

Another critical aspect of transformers is the pre-training and fine-tuning paradigm: the general procedure is to pre-train on a language modeling task on huge text corpora, which is possible because of the high parallelizability of transformers; e.g., (Radford and Narasimhan) train on the next-word prediction task on a corpus of over $7000$ books. Given a downstream task, the whole pre-trained model (coupled with a task head) is then fine-tuned on the task dataset.

Cross-Lingual Static Word Representations

In the previous Section 2, we outlined different architectures to induce word representations. However, we restricted ourselves to inducing monolingual word representations by pre-training on monolingual text corpora (with the language modeling task, i.e., predicting the next word). Since monolingual word embeddings are pre-trained on each language independently and therefore only learn monolingual distributional information, they can not capture semantic relations between words across languages. The fundamental idea behind cross-lingual word representations is to create an aligned representation space for word representations across multiple languages. This section briefly discusses how to extend static word embeddings (induced by, e.g., word2vec and FastText) to create a cross-lingual space.

FastText and word2vec induce static word embeddings, i.e., they do not consider the context of a word in its representation. Therefore, phrases and polysemous words can not be captured correctly, which limits the effectiveness of an aligned space across languages with static embeddings. Nonetheless, we discuss two popular approaches: (1) Projection-based methods and (2) Bilingual Embeddings.

Projection-based methods. Projection-based methods rely on independently trained monolingual word vector spaces in the source language $\mathbf{X}_{L1}$ and the target language $\mathbf{X}_{L2}$, which are post-hoc aligned into one cross-lingual word embedding space $\mathbf{X}_{CL}$:

Illustration of projection-based methods. $X$ and $Y$ are monolingual spaces and $W$ the projection matrix. Adapted from (Conneau et al.).

The alignment is based on a set of word translation pairs $D$, which can be obtained from existing dictionaries or induced automatically. The former is a supervised method, the latter an unsupervised approach that usually assumes (approximate) isomorphism between the monolingual spaces. Typically, a supervised projection-based method uses a pre-obtained dictionary $D$ containing word pairs for finding the alignment (Mikolov et al.; Huang et al.). Unsupervised methods induce the dictionary using different strategies such as adversarial learning (Conneau et al.), similarity-based heuristics (Artetxe et al.), PCA (Hoshen and Wolf), and optimal transport (Alvarez-Melis and Jaakkola).

Using the dictionary, projection-based methods construct aligned monolingual subspaces $\mathbf{X}_S$ and $\mathbf{X}_T$, where the aligned rows are translations of each other. (Mikolov et al.) learn the projection $\mathbf{W}_{L1}$ by minimizing the Euclidean distance between the linear projection of $\mathbf{X}_S$ and $\mathbf{X}_T$:

\[\mathbf{W}\_{L1} = \underset{\mathbf{W}}{\arg\min} \; \Vert \mathbf{X}\_{S} \mathbf{W} - \mathbf{X}\_{T} \Vert\]

which can be further improved by constraining $\mathbf{W}_{L1}$ to be an orthogonal matrix (Xing et al.).
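
A numpy sketch of this mapping step under simplified assumptions ($\mathbf{X}_S$ and $\mathbf{X}_T$ are random toy matrices standing in for dictionary-aligned rows of word vectors): the unconstrained projection is found by least squares, the orthogonality-constrained one by the SVD-based Procrustes solution:

```python
import numpy as np

rng = np.random.default_rng(0)
n_pairs, d = 5000, 300
X_S = rng.normal(size=(n_pairs, d))   # source-language vectors of the dictionary pairs
X_T = rng.normal(size=(n_pairs, d))   # corresponding target-language vectors

# Unconstrained linear map: minimize ||X_S W - X_T|| by least squares.
W, *_ = np.linalg.lstsq(X_S, X_T, rcond=None)

# Orthogonality-constrained map (Procrustes): W = U V^T with U S V^T = SVD(X_S^T X_T).
U, _, Vt = np.linalg.svd(X_S.T @ X_T)
W_orth = U @ Vt

# Map the whole source space into the target space.
X_S_mapped = X_S @ W_orth
```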

The induced cross-lingual space performs well on the bilingual lexicon induction (BLI) task for related languages but degrades when the language pair is distant (Vulić et al.). Furthermore, (Glavas et al.) show that BLI performance is not necessarily correlated with downstream performance.

Bilingual Embeddings. Bilingual embeddings induce the cross-lingual space by jointly learning representations from scratch. The joint objective can in general be expressed as:

\[\alpha(Mono_1 + Mono_2) + \beta Bi\]

where $Mono_1$ and $Mono_2$ are monolingual models aiming to capture the clustering structure of each language, whereas the bilingual component $Bi$ encodes the information that ties the two monolingual spaces together (Klementiev et al.; Luong et al.). The hyperparameters $\alpha$ and $\beta$ weight the influence of the monolingual components and the bilingual component.

One popular choice is the BiSkip-gram model, which extends the Skip-Gram model (see Section 2) by predicting words cross-lingually rather than just monolingually:

The BiSkip-gram model predicts within a language and cross-lingually based on the alignment information. Image taken from (Luong et al.).

However, the approach is expensive in terms of supervision, as BiSkip-gram is based on a parallel corpus. Furthermore, for low-resource languages, this level of supervision is, in some cases, impossible to acquire.

Cross-Lingual Contextualized Word Representations

Transformers and RNNs produce contextualized word embeddings, i.e., they encode the same word differently depending on its context. We already discussed in Section 2 why transformers improve upon RNNs from an architectural standpoint and how this benefits pre-training. This section first dives into one of the most popular transformer models and its pre-training tasks. We then explore multilingual transformers and, most importantly, the XLM-R model, as it builds the foundation of our thesis.

BERT. Before exploring multilingual transformer models, we introduce the perhaps most popular transformer: BERT (Bidirectional Encoder Representations from Transformers) (Devlin et al.). BERT is an encoder-only sequence model, which slightly modifies the Transformer architecture (Vaswani et al.) in the following ways: as it only consists of an encoder, the model allows information to flow bidirectionally, creating bidirectional representations.

Specifically, the BERT encoder is composed of $L$ layers, each layer $l$ with $M$ self-attention heads, where each self-attention head $(m,l)$ has key, query, and value encoders, each calculated by a linear layer:

\[\begin{aligned} \mathbf{Q}^{m,l}(\mathbf{x}) & = \mathbf{W}_q^{m,l} (\mathbf{x}) + \mathbf{b}_q^{m,l} \\ \mathbf{K}^{m,l}(\mathbf{x}) & = \mathbf{W}_k^{m,l} (\mathbf{x}) + \mathbf{b}_k^{m,l} \\ \mathbf{V}^{m,l}(\mathbf{x}) & = \mathbf{W}_v^{m,l} (\mathbf{x}) + \mathbf{b}_v^{m,l} \end{aligned}\]

where the input $\mathbf{x}$ is the output of the embedding layer for the first encoder layer, and for the remaining layers, $\mathbf{x}$ is the output representation of the previous encoder layer. These are then fed into the scaled dot-product attention (which does not introduce any new parameters), concatenated, and projected, outputting $\mathbf{h}_1^l$. The output is then fed into the MLP and layer norm components of the transformer block:

\[\begin{aligned} \mathbf{h}\_2^{l} & = \text{Dropout}(\mathbf{W}\_{m_1}^{l} \cdot \mathbf{h}\_1^l + \mathbf{b}\_{m_1}^{l}) \\ \mathbf{h}\_3^{l} & = \mathbf{g}^l\_{LN_1} \odot \frac{(\mathbf{h}\_2^l + \mathbf{x}) - \mu}{\sigma} + \mathbf{b}^l\_{LN_1} \\ \mathbf{h}\_4^{l} & = \text{GELU}(\mathbf{W}\_{m_2}^{l} \cdot \mathbf{h}\_3^l + \mathbf{b}\_{m_2}^{l}) \\ \mathbf{h}\_5^{l} & = \text{Dropout}(\mathbf{W}\_{m_3}^{l} \cdot \mathbf{h}\_4^l + \mathbf{b}\_{m_3}^{l}) \\ \text{out}^{l} & = g^l\_{LN_2} \odot \frac{(\mathbf{h}\_5^l + \mathbf{h}\_3^l) - \mu}{\sigma} + \mathbf{b}^l\_{LN_2}. \end{aligned}\]

The model parameters $\Theta$ are therefore constructed from the weight matrices $\mathbf{W}^{l,(\cdot)}_{(\cdot)}$, bias vectors $\mathbf{b}_{(\cdot)}^{l, (\cdot)}$ and vectors $\mathbf{g}^l_{(\cdot)}$.

Furthermore, the pre-training objective is no longer next-word prediction but consists of two novel pre-training tasks: (1) the masked language modeling (MLM) task and (2) the next-sentence prediction (NSP) task. MLM masks tokens of the input sequence, and the task is to predict the original tokens based on the masked sequence:

Masked Language Modeling task in BERT. Masked tokens are replaced with the special token [MASK]. Taken from (Torregrossa et al.).

The masking selects $15\%$ of the tokens; of these, $80\%$ are replaced by the [MASK] token, $10\%$ are replaced by a random token, and $10\%$ are left unchanged. NSP, on the other hand, asks the model to predict whether the two concatenated sentences that form the input are consecutive. Specifically, sentence pairs are constructed such that $50\%$ are actual consecutive sentence pairs and $50\%$ are artificial pairs that are not consecutive. Additionally, while tokenizing the input sequence, BERT inserts special tokens such as [CLS] and [SEP], where the former is always inserted at the start of a sequence and the latter separates two sentences, allowing the model to process sentence pairs. The pre-trained BERT model can then be fine-tuned end-to-end by adding one additional output layer, which takes either the token representations (for token-level tasks) or the [CLS] representation (for classification tasks) as input. BERT achieves state-of-the-art results on a wide range of tasks, such as question answering and language inference (Devlin et al.), and marks the start of modern NLP with transformers. A noteworthy modification of BERT is RoBERTa (Robustly Optimized BERT Pretraining Approach) (Liu et al.): it removes the NSP task, uses bigger batch sizes and longer sequences, and uses a dynamic masking strategy, i.e., the masking pattern is generated anew every time a sequence is fed to the model. RoBERTa outperforms BERT on GLUE, RACE, and SQuAD.
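
A minimal sketch of this 80/10/10 masking scheme over a sequence of token ids (the special-token id and vocabulary size below are hypothetical, not BERT's actual values):

```python
import random

MASK_ID = 103          # hypothetical id of the [MASK] token
VOCAB_SIZE = 30000     # hypothetical vocabulary size

def mask_tokens(token_ids, mask_prob=0.15):
    """Return (corrupted input, labels); unselected positions get label -100."""
    inputs, labels = list(token_ids), []
    for i, tok in enumerate(token_ids):
        if random.random() < mask_prob:          # select 15% of tokens for prediction
            labels.append(tok)
            r = random.random()
            if r < 0.8:                          # 80%: replace with [MASK]
                inputs[i] = MASK_ID
            elif r < 0.9:                        # 10%: replace with a random token
                inputs[i] = random.randrange(VOCAB_SIZE)
            # remaining 10%: leave the token unchanged
        else:
            labels.append(-100)                  # ignored by the MLM loss (common convention)
    return inputs, labels

corrupted, labels = mask_tokens([2023, 2003, 1037, 7279, 1012])
```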

mBERT. The idea behind extending BERT to multiple languages is to concatenate multiple monolingual corpora and jointly pre-train on them. As this massive multilingual corpus spans many languages and thus has an enormous vocabulary size and a large number of out-of-vocabulary tokens, BERT/mBERT uses a subword-based tokenizer. The idea is to split rare words into smaller meaningful subwords, e.g., "papers" is split into "paper" and "s". The model then learns that the word "papers" is formed from the root word "paper" with a slightly different meaning. There are many different implementations of this idea, such as WordPiece (Wu et al.), BPE (Sennrich et al.), and SentencePiece (Kudo and Richardson); the original BERT/mBERT implementation uses WordPiece. To encode text from multiple languages, the subword tokenizer builds its vocabulary on the concatenated text. The multilingual version of BERT, dubbed mBERT, pre-trains on $104$ languages and surprisingly learns strong cross-lingual representations that generalize well to other languages via zero-shot transfer (Pires et al.; Wu et al.) without any explicit supervision.
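
To see this subword splitting in practice, mBERT's pre-trained WordPiece tokenizer can be loaded through the Hugging Face transformers library (a usage sketch, assuming the library is installed and using the public bert-base-multilingual-cased checkpoint; the exact splits may differ from the example above):

```python
from transformers import AutoTokenizer

# mBERT's WordPiece tokenizer, trained on the concatenation of 104 languages.
tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")

print(tokenizer.tokenize("papers"))           # rare words are split into subword pieces
print(tokenizer.tokenize("Spracherkennung"))  # one shared vocabulary covers many languages
```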

This ability can be explained by three factors (Pires et al.): (1) The subword tokenizer maps common subwords across languages, which act as anchor points for learning an alignment, e.g., "DNA"4 has a similar meaning even in distantly related languages. The anchor points are similar to the seed dictionary in the projection-based approach (see Section 3). (2) This effect is then reinforced and distributed to other, non-overlapping tokens by jointly training across multiple languages, forcing co-occurring tokens to be mapped to a shared space as well. (3) mBERT learns cross-lingual representations that go deeper than simple vocabulary memorization, generalizing across languages. However, recent work (Wu and Dredze; K et al.) shows that a shared vocabulary is not required to create a strong cross-lingual representation. (K et al.) additionally demonstrate that word order plays an important role.

XLM. (Lample and Conneau) introduce a new unsupervised method for learning cross-lingual representations, called XLM (cross-lingual language models). XLM builds upon BERT and makes the following adjustments: first, they include the Translation Language Modeling (TLM) task in the pre-training. Each training sample consists of a pair of parallel sentences (a source and a target sentence), in which tokens are randomly masked. To predict a masked word, the model is allowed to attend either to the surrounding source words or to the target translation, encouraging the model to align the source and target representations:

Cross-lingual language model pretraining. Image taken from (Lample and Conneau).

They drop the NSP task and alternate training between MLM and TLM. Furthermore, the model receives a language ID as an additional input (similar to the positional encoding), helping the model learn the relationship between related tokens in different languages. XLM uses the subword tokenizer BPE (Sennrich et al.), which learns the splits on the concatenation of sentences sampled randomly from the monolingual corpora. Furthermore, XLM samples sentences according to a multinomial distribution with probabilities:

\[\begin{aligned} q_i = \frac{p_i^\alpha}{\sum_{j=1}^N p_j^\alpha} \quad \text{ with } \quad p_i = \frac{n_i}{\sum_{k=1}^N n_k} \end{aligned}\]

where $i$ denotes the index of the language, $N$ the number of languages, and $n_i$ the number of sentences in the text corpus of language $i$. XLM uses $\alpha = 0.5$.
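
A small numpy sketch of this smoothing with hypothetical corpus sizes, showing how $\alpha < 1$ up-samples low-resource languages:

```python
import numpy as np

# Hypothetical number of sentences per language (high-, mid-, and low-resource).
n = np.array([100_000_000, 5_000_000, 100_000], dtype=float)

def sampling_probs(n, alpha):
    p = n / n.sum()                        # raw proportions p_i
    return p ** alpha / np.sum(p ** alpha)  # smoothed sampling probabilities q_i

print(sampling_probs(n, alpha=1.0))   # proportional to corpus size
print(sampling_probs(n, alpha=0.5))   # XLM: low-resource languages sampled more often
print(sampling_probs(n, alpha=0.3))   # XLM-R: even stronger up-sampling
```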

XLM outperforms mBERT on XNLI (Conneau et al.) in 15 languages. However, XLM handles fewer languages than mBERT, is based on a larger model, and uses a high amount of supervision, as it needs parallel sentences during pre-training. Therefore, the difference may not be as significant in practice. Furthermore, acquiring parallel sentences for low-resource languages is problematic, making the model unsuitable for such scenarios.

XLM-R. (Conneau et al.) propose to take a step back, drop the TLM task, and pre-train only with the MLM task, in RoBERTa fashion, on a huge multilingual dataset. They dub their multilingual model XLM-RoBERTa (XLM-R). They crawled a massive amount of text, over 2.5TB of data in 100 languages. Additionally, they increased the vocabulary size to $250,000$ tokens, compared to RoBERTa's $50,000$ tokens. They employ the subword tokenizer SentencePiece (Kudo and Richardson) with a unigram language model (Kudo). They use the same sampling strategy as XLM but with $\alpha = 0.3$. Furthermore, XLM-R does not use language IDs, which allows XLM-R to better deal with code-switching. (Conneau et al.) provide two models: XLM-R Base ($L = 12$, $H = 768$, $A = 12$, 270M params) and XLM-R ($L = 24$, $H = 1024$, $A = 16$, 550M params).
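
Both checkpoints are publicly available; a minimal usage sketch with the Hugging Face transformers library (assuming it is installed; the xlm-roberta-base checkpoint corresponds to XLM-R Base):

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")  # SentencePiece, 250k vocab
model = AutoModel.from_pretrained("xlm-roberta-base")

# No language IDs are needed: the same model encodes text from any of the 100 languages.
batch = tokenizer(["Hello world!", "Hallo Welt!"], padding=True, return_tensors="pt")
with torch.no_grad():
    hidden_states = model(**batch).last_hidden_state
print(hidden_states.shape)  # (2, sequence_length, 768)
```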

(Conneau et al.) show that XLM-R sets a new State-of-the-Art on numerous cross-lingual tasks. Compared to mBERT and XLM, XLM-R provides substantial gains in classification, sequence labeling, and question answering without any explicit cross-lingual supervision.

Challenges in Multilingual Transformers

As XLM-R and the concept of multilingual transformers form the basis of our thesis, we further analyze their weaknesses and how our approach tries to alleviate them.

Low-Resource Languages. Even though multilingual LMs do not use any explicit cross-lingual signals, they still create multilingual representations (Pires et al.; Wu and Dredze), which can be used for cross-lingual downstream tasks. By releasing a new multi-task benchmark for evaluating cross-lingual generalization, called XTREME5, which covers nine tasks and 40 languages, (Hu et al.) showed that even though models achieve human performance on many tasks in English, there is a sizable gap in the performance of cross-lingually transferred models, especially for low-resource languages. Additionally, (Wu and Dredze; Lauscher et al.) showed that multilingual transformers pre-trained via language modeling significantly underperform in resource-lean scenarios and for distant languages. Furthermore, the literature (Pires et al.; Wu and Dredze; K et al.) has focused on evaluating languages that are from the same language family or have large pre-training corpora, languages such as German, Spanish, or French. For example, (K et al.) investigate Hindi, Spanish, and Russian, which all belong to the same language family, Indo-European, and have large Wikipedia corpora. This concern is raised by multiple sources (Lauscher et al.; Wu and Dredze), which show that performance drops sharply for distant target languages and for target languages with small pre-training corpora. Furthermore, (Lauscher et al.) show empirically that for massively multilingual transformers, pre-training corpora sizes affect the zero-shot performance on higher-level tasks, whereas results on lower-level tasks are more impacted by typological language proximity.

As multilingual transformers struggle with low-resource languages, we investigate if our approach is capable of improving in these scenarios. To strengthen the performance of low-resource languages, we try to avoid the curse of multilinguality.

Curse of Multilinguality. (Conneau et al.) experiment with different settings for the XLM-R model, showing that scaling the batch size, training data, training time, and shared vocabulary improves performance on downstream tasks.


More importantly, they show that, for a fixed model capacity, adding more languages to the pre-training leads to better cross-lingual performance on low-resource languages until the per-language capacity becomes too low (capacity dilution), after which the overall performance on monolingual and cross-lingual benchmarks degrades. They call this the curse of multilinguality.


Adding more languages to the pre-training ultimately has two important effects: (1) Positive cross-lingual transfer, especially for low-resource languages, and (2) lower per-language capacity, which then, in turn, can degrade the overall model performance. Multilingual transformer models have to carefully balance these two effects of capacity dilution and positive transfer. Adding more capacity to the model can alleviate some of the curse but is not a solution for moderate-size models.

Furthermore, the allocation of the model capacity across languages is influenced by the training set sizes, the size of the shared subword vocabulary, and the rate at which the model samples training instances from each language during pre-training. (Conneau et al.) show that sampling batches from low-resource languages at a higher rate improves performance on low-resource languages and vice versa. XLM-R uses a rate of $\alpha = 0.3$, which still leaves some room for improvement on low-resource languages.

In my Master Thesis, we introduce language-specific students that allocate $100\%$ of the model parameters to one language, avoiding capacity dilution while still benefiting from the cross-lingual knowledge acquired by XLM-R through distillation from the XLM-R model.

References

  1. Artetxe, Mikel, et al. Unsupervised Neural Machine Translation. arXiv, 2017, doi:10.48550/ARXIV.1710.11041.
  2. Qi, Ye, et al. “When and Why Are Pre-Trained Word Embeddings Useful for Neural Machine Translation?” Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), Association for Computational Linguistics, 2018, pp. 529–35, doi:10.18653/v1/N18-2084.
  3. Lample, Guillaume, et al. “Phrase-Based & Neural Unsupervised Machine Translation.” Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, 2018, pp. 5039–49, doi:10.18653/v1/D18-1549.
  4. Vulić, Ivan, and Marie-Francine Moens. “Monolingual and Cross-Lingual Information Retrieval Models Based on (Bilingual) Word Embeddings.” Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval, Association for Computing Machinery, 2015, pp. 363–72, doi:10.1145/2766462.2767752.
  5. Plath, Warren J. “Early Years in Machine Translation: Memoirs and Biographies of Pioneers.” Computational Linguistics, vol. 28, no. 4, December 2002, pp. 554–59, doi:10.1162/089120102762671990.
  6. Goddard, Cliff. Natural Semantic Metalanguage. 2006.
  7. Lample, Guillaume, and Alexis Conneau. Cross-Lingual Language Model Pretraining. 2019.
  8. Lauscher, Anne, et al. “From Zero to Hero: On the Limitations of Zero-Shot Language Transfer with Multilingual Transformers.” Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Association for Computational Linguistics, 2020, pp. 4483–99, doi:10.18653/v1/2020.emnlp-main.363.
  9. Harris, Zellig S. “Distributional Structure.” WORD, vol. 10, no. 2-3, Routledge, 1954, pp. 146–62, doi:10.1080/00437956.1954.11659520.
  10. Collobert, Ronan, and Jason Weston. “A Unified Architecture for Natural Language Processing: Deep Neural Networks with Multitask Learning.” Proceedings of the 25th International Conference on Machine Learning, Association for Computing Machinery, 2008, pp. 160–67, doi:10.1145/1390156.1390177.
  11. Jelinek, Frederick, et al. “Perplexity—a Measure of the Difficulty of Speech Recognition Tasks.” Journal of the Acoustical Society of America, vol. 62, 1977.
  12. Mikolov, Tomas, et al. Exploiting Similarities among Languages for Machine Translation. 2013.
  13. Bojanowski, Piotr, et al. Enriching Word Vectors with Subword Information. arXiv, 2016, doi:10.48550/ARXIV.1607.04606.
  14. Lample, Guillaume, et al. Neural Architectures for Named Entity Recognition. arXiv, 2016, doi:10.48550/ARXIV.1603.01360.
  15. Kiros, Ryan, et al. Skip-Thought Vectors. arXiv, 2015, doi:10.48550/ARXIV.1506.06726.
  16. Kusner, Matt J., et al. “From Word Embeddings To Document Distances.” ICML, 2015.
  17. Hochreiter, Sepp. “The Vanishing Gradient Problem During Learning Recurrent Neural Nets and Problem Solutions.” International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, vol. 6, April 1998, pp. 107–16, doi:10.1142/S0218488598000094.
  18. Bengio, Yoshua, et al. “Learning Long-Term Dependencies with Gradient Descent Is Difficult.” IEEE Transactions on Neural Networks, vol. 5 2, 1994, pp. 157–66 .
  19. Pascanu, Razvan, et al. On the Difficulty of Training Recurrent Neural Networks. arXiv, 2012, doi:10.48550/ARXIV.1211.5063.
  20. Le, Quoc V., et al. “A Simple Way to Initialize Recurrent Networks of Rectified Linear Units.” ArXiv, vol. abs/1504.00941, 2015.
  21. Agarap, Abien Fred. Deep Learning Using Rectified Linear Units (ReLU). arXiv, 2018, doi:10.48550/ARXIV.1803.08375.
  22. Hochreiter, Sepp, and Jürgen Schmidhuber. “Long Short-Term Memory.” Neural Computation, vol. 9, December 1997, pp. 1735–80, doi:10.1162/neco.1997.9.8.1735.
  23. Gers, Felix, et al. “Learning to Forget: Continual Prediction with LSTM.” Neural Computation, vol. 12, October 2000, pp. 2451–71, doi:10.1162/089976600300015015.
  24. Sutskever, Ilya, et al. “Generating Text with Recurrent Neural Networks.” Proceedings of the 28th International Conference on Machine Learning (ICML-11), 2011, pp. 1017–24.
  25. Mikolov, Tomas, et al. “Extensions of Recurrent Neural Network Language Model.” ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings, 2011, pp. 5528–31, doi:10.1109/ICASSP.2011.5947611.
  26. Sutskever, Ilya, et al. “Sequence to Sequence Learning with Neural Networks.” NIPS, 2014.
  27. Hinton, Geoffrey E., and Richard S. Zemel. “Autoencoders, Minimum Description Length and Helmholtz Free Energy.” NIPS, 1993.
  28. Kalchbrenner, Nal, and Phil Blunsom. “Recurrent Continuous Translation Models.” EMNLP, 2013.
  29. Bahdanau, Dzmitry, et al. Neural Machine Translation by Jointly Learning to Align and Translate. arXiv, 2014, doi:10.48550/ARXIV.1409.0473.
  30. Vaswani, Ashish, et al. Attention Is All You Need. 2017.
  31. Radford, Alec, and Karthik Narasimhan. Improving Language Understanding by Generative Pre-Training. 2018.
  32. Conneau, Alexis, et al. Word Translation Without Parallel Data. arXiv, 2017, doi:10.48550/ARXIV.1710.04087.
  33. Huang, Kejun, et al. “Translation Invariant Word Embeddings.” Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, 2015, pp. 1084–88, doi:10.18653/v1/D15-1127.
  34. Artetxe, Mikel, et al. “A Robust Self-Learning Method for Fully Unsupervised Cross-Lingual Mappings of Word Embeddings.” Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, 2018, pp. 789–98, doi:10.18653/v1/P18-1073.
  35. Hoshen, Yedid, and Lior Wolf. “Non-Adversarial Unsupervised Word Translation.” Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, 2018, pp. 469–78, doi:10.18653/v1/D18-1043.
  36. Alvarez-Melis, David, and Tommi Jaakkola. “Gromov-Wasserstein Alignment of Word Embedding Spaces.” Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, 2018, pp. 1881–90, doi:10.18653/v1/D18-1214.
  37. Xing, Chao, et al. “Normalized Word Embedding and Orthogonal Transform for Bilingual Word Translation.” Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Association for Computational Linguistics, 2015, pp. 1006–11, doi:10.3115/v1/N15-1104.
  38. Vulić, Ivan, et al. Do We Really Need Fully Unsupervised Cross-Lingual Embeddings? arXiv, 2019, doi:10.48550/ARXIV.1909.01638.
  39. Glavas, Goran, et al. How to (Properly) Evaluate Cross-Lingual Word Embeddings: On Strong Baselines, Comparative Analyses, and Some Misconceptions. 2019.
  40. Klementiev, Alexandre, et al. “Inducing Crosslingual Distributed Representations of Words.” Proceedings of COLING 2012, The COLING 2012 Organizing Committee, 2012, pp. 1459–74, https://aclanthology.org/C12-1089.
  41. Luong, Thang, et al. “Bilingual Word Representations with Monolingual Quality in Mind.” Proceedings of the 1st Workshop on Vector Space Modeling for Natural Language Processing, Association for Computational Linguistics, 2015, pp. 151–59, doi:10.3115/v1/W15-1521.
  42. Devlin, Jacob, et al. BERT: Pre-Training of Deep Bidirectional Transformers for Language Understanding. 2019.
  43. Torregrossa, François, et al. “A Survey on Training and Evaluation of Word Embeddings.” International Journal of Data Science and Analytics, vol. 11, March 2021, pp. 1–19, doi:10.1007/s41060-021-00242-8.
  44. Liu, Yinhan, et al. RoBERTa: A Robustly Optimized BERT Pretraining Approach. 2019.
  45. Wu, Yonghui, et al. Google’s Neural Machine Translation System: Bridging the Gap between Human and Machine Translation. arXiv, 2016, doi:10.48550/ARXIV.1609.08144.
  46. Sennrich, Rico, et al. Neural Machine Translation of Rare Words with Subword Units. arXiv, 2015, doi:10.48550/ARXIV.1508.07909.
  47. Kudo, Taku, and John Richardson. SentencePiece: A Simple and Language Independent Subword Tokenizer and Detokenizer for Neural Text Processing. arXiv, 2018, doi:10.48550/ARXIV.1808.06226.
  48. Pires, Telmo, et al. “How Multilingual Is Multilingual BERT?” Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, 2019, pp. 4996–5001, doi:10.18653/v1/P19-1493.
  49. Wu, Shijie, et al. Emerging Cross-Lingual Structure in Pretrained Language Models. arXiv, 2019, doi:10.48550/ARXIV.1911.01464.
  50. Wu, Shijie, and Mark Dredze. “Beto, Bentz, Becas: The Surprising Cross-Lingual Effectiveness of BERT.” Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Association for Computational Linguistics, 2019, pp. 833–44, doi:10.18653/v1/D19-1077.
  51. K, Karthikeyan, et al. Cross-Lingual Ability of Multilingual BERT: An Empirical Study. 2020.
  52. Conneau, Alexis, et al. “XNLI: Evaluating Cross-Lingual Sentence Representations.” EMNLP, 2018.
  53. ---. Unsupervised Cross-Lingual Representation Learning at Scale. 2020.
  54. Kudo, Taku. Subword Regularization: Improving Neural Network Translation Models with Multiple Subword Candidates. arXiv, 2018, doi:10.48550/ARXIV.1804.10959.
  55. Hu, Junjie, et al. XTREME: A Massively Multilingual Multi-Task Benchmark for Evaluating Cross-Lingual Generalization. 2020.
  1. A tokenizer splits the text into smaller units called tokens. Tokens can be words, characters, or subwords. In our thesis, we mostly use the term word representations to illustrate concepts better. Only when necessary do we explicitly state tokens. Nevertheless, all presented approaches can be generalized to any token-level. 

  2. Relationships are defined by subtracting two words vectors, and the result is added to another word. 

  3. Notice that without the positional encoding, the transformer has no notion of word order. 

  4. "DNA" is indeed a subword in mBERT (Wu and Dredze)

  5. https://sites.research.google/xtreme 
