[Project] Distilling Multilingual Encoders into Monolingual Components

In this blog post, I motivate and lay out the main contributions of my Master's thesis. To read the full thesis, download it here. To read the related-work sections as blog posts, see Cross-Lingual Representation, Parameter-Efficient Fine-Tuning in NLP, and Knowledge Distillation.


Table of Contents

  1. Motivation
  2. Research Objective & Contribution

1. Motivation

Natural language processing (NLP) has made significant progress in recent years, achieving impressive performance across diverse tasks. However, these advances are concentrated on a tiny fraction of the roughly 7,000 languages in the world, e.g., English, for which sufficient amounts of text are available (high-resource languages). When text data in a language is scarce, language technologies fail. These languages are called low-resource languages, e.g., Swahili, Basque, or Urdu; see the next figure for a clearer distinction between high-, mid-, and low-resource languages:

Figure: A conceptual view of the NLP resource hierarchy, categorized by the availability of task-specific labels and of unlabeled language-specific text. Taken from (Ruder et al.).

This leaves low-resource languages, and therefore most languages, understudied, which further widens the digital language divide1 on a technological level. Developing technologies for low-resource languages is vital for scientific, social, and economic reasons: Africa and India alone host around 2,000 low-resource languages and are home to more than 2.5 billion people (Magueresse et al.). Opening up the newest NLP technologies, e.g., digital assistants, to low-resource languages can help bridge this gap and reduce the discrimination against speakers of non-English languages (missing reference).

To improve language technologies for low-resource languages, the field of Cross-Lingual Representation Learning focuses on creating high-quality representations for these languages by leveraging abundant data in another language through a shared representation space. As static word representations gained popularity, many multilingual embedding methods were proposed (missing reference). The idea behind these methods is to induce embeddings2 such that the embeddings of two languages are aligned, i.e., word translations, e.g., cat and Katze, have similar representations. Recently, however, large pre-trained language models, the so-called transformer models, have overtaken static word embedding methods in virtually every aspect, partly because they induce context-dependent word representations that better capture the rich meaning of a word (missing reference). For example, mBERT and XLM-R, transformer-based multilingual masked language models pre-trained on text in (approximately) 100 languages, obtain impressive performance on a variety of cross-lingual transfer tasks (missing reference). Even though these models were not trained with any cross-lingual objective, they still produce representations that generalize well across languages for a wide range of downstream tasks (missing reference). To analyze the cross-lingual transfer ability of a multilingual model, the model is first fine-tuned on annotated data of a downstream task and then evaluated in the zero- or few-shot scenario, i.e., the fine-tuned model is evaluated in the target language with no or only a few additional labeled target-language examples (Hu et al.).
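
To make this evaluation protocol concrete, the sketch below fine-tunes a multilingual encoder on English natural language inference data and then evaluates it directly on Swahili without any Swahili labels. The use of XNLI, xlm-roberta-base, and the subsample size are illustrative assumptions, not necessarily the exact setup of the thesis.

```python
# Sketch: zero-shot cross-lingual transfer with a multilingual encoder.
# Step 1: fine-tune on English task data. Step 2: evaluate on Swahili
# without any Swahili labels. Dataset/model choices are illustrative.
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

model_name = "xlm-roberta-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=3)

def encode(batch):
    return tokenizer(batch["premise"], batch["hypothesis"],
                     truncation=True, padding="max_length", max_length=128)

train_en = (load_dataset("xnli", "en", split="train")
            .shuffle(seed=0).select(range(20_000))   # subsample to keep the sketch fast
            .map(encode, batched=True))
test_sw = load_dataset("xnli", "sw", split="test").map(encode, batched=True)

def accuracy(eval_pred):
    logits, labels = eval_pred
    return {"accuracy": float((logits.argmax(-1) == labels).mean())}

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="xlmr-nli-en",
                           num_train_epochs=1,
                           per_device_train_batch_size=16),
    train_dataset=train_en,
    compute_metrics=accuracy,
)
trainer.train()                                      # fine-tune on English only
print(trainer.evaluate(eval_dataset=test_sw))        # zero-shot evaluation on Swahili
```

Any accuracy above the 33% random baseline on the Swahili test set is obtained purely through cross-lingual transfer, since the model never sees labeled Swahili examples.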

As impressive as these multilingual transformers might seem, low-resource languages still perform sub-par compared to high-resource languages (missing reference), partly due to smaller pre-training corpora (Conneau et al.), the curse of multilinguality (Conneau et al.), and the importance of vocabulary curation and size (missing reference). The curse of multilinguality states that, assuming the model capacity stays constant, adding more languages improves cross-lingual performance on low-resource languages up to a point, after which the overall performance on monolingual and cross-lingual benchmarks degrades. Intuitively, adding more languages to the model has two effects: (1) positive cross-lingual transfer, especially for low-resource languages, and (2) lower per-language capacity, which in turn can degrade the overall model performance. These two effects of capacity dilution and positive transfer need to be carefully traded off against each other. The model either needs a large capacity3 or has to be specialized (constrained) to a subset of languages beforehand. For these reasons, it is hard to create a single model that can effectively represent a diverse set of languages. One solution is to create language-specific models (monolingual models) with language-specific vocabulary and model parameters (missing reference), but monolingual models then need enough text to pre-train on the language modeling task, which is typically not available for low-resource languages. Additionally, they cannot benefit from any cross-lingual transfer from related languages, making it harder to create adequate representations for low-resource languages (missing reference).

2. Research Objective & Contribution

In this thesis, we explore how to alleviate the issues of large multilingual transformers for low-resource languages, especially the curse of multilinguality. Specifically, our two main objectives are: (1) improving the cross-lingual alignment for low-resource languages and (2) improving cross-lingual downstream task performance for low-resource languages. We utilize Knowledge Distillation (KD) by distilling the multilingual model into language-specialized (also called monolingual) language models. We make the following contributions:

  • We propose a novel setup to distill multilingual transformers into monolingual components. Based on this setup, we propose two KD strategies: one for improving the alignment between two languages and one for improving cross-lingual downstream task performance. We call the former MonoAlignment and the latter MonoShot (a generic distillation-loss sketch follows this list).
  • MonoAlignment uses a distillation strategy to distill multilingual transformer models into smaller monolingual components that have a better-aligned representation space between a high-resource and a low-resource language. We demonstrate its effectiveness by distilling XLM-R and experimenting with aligning English with Turkish, Swahili, Urdu, and Basque.
  • We compare MonoAlignment to other Knowledge Distillation strategies, showing that it outperforms them on the retrieval task for low-resource languages.
  • Our work suggests that an increase in the cross-lingual alignment of a multilingual transformer model does not necessarily translate into an increase in cross-lingual downstream task performance.
  • Therefore, we propose MonoShot, another Knowledge Distillation strategy that distills multilingual transformer models into smaller monolingual components which instead have strong cross-lingual downstream performance in the zero- and few-shot settings.
  • We show that MonoShot performs best among many different Knowledge Distillation strategies, although it still lags behind the teacher's performance. However, it outperforms models that are built upon the teacher architecture, trimmed down to the same size as the distilled components, and initialized from parts of the teacher.
  • We demonstrate an effective fine-tuning strategy for the zero-shot scenario for aligned monolingual models and compare it against many other strategies.
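
The MonoAlignment and MonoShot objectives themselves are detailed in the thesis; purely as a point of reference for what such teacher-student objectives typically look like, the sketch below combines a DistilBERT-style soft-target loss with a TinyBERT-style hidden-state loss. The tensor shapes, the mixing weight alpha, and the proj layer that maps the smaller student width to the teacher width are illustrative assumptions.

```python
# Generic knowledge-distillation objective (not the thesis's exact losses):
# soft-target KL on the logits (DistilBERT-style) plus MSE between
# projected student hidden states and teacher hidden states (TinyBERT-style).
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits,
                      student_hidden, teacher_hidden,
                      proj, temperature=2.0, alpha=0.5):
    # (1) Match the teacher's softened output distribution.
    kl = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2
    # (2) Match intermediate representations; `proj` lifts the smaller
    # student hidden size to the teacher's hidden size.
    hidden_mse = F.mse_loss(proj(student_hidden), teacher_hidden)
    return alpha * kl + (1.0 - alpha) * hidden_mse

# Example with random tensors (batch=8, seq=16, student dim 384, teacher dim 768):
proj = torch.nn.Linear(384, 768)
loss = distillation_loss(
    torch.randn(8, 3), torch.randn(8, 3),
    torch.randn(8, 16, 384), torch.randn(8, 16, 768),
    proj,
)
loss.backward()
```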

To conduct our research, we draw inspiration from the fields of Cross-Lingual Representation Learning, Knowledge Distillation, and Parameter-Efficient Fine-Tuning. Following different Knowledge Distillation strategies, such as those of DistilBERT (Sanh et al.) or TinyBERT (Jiao et al.), we distill the aligned cross-lingual representation space of the multilingual transformer model XLM-R (Conneau et al.) into smaller monolingual students. To fine-tune aligned monolingual models in a zero-shot scenario, we study the field of parameter-efficient fine-tuning, e.g., Adapters (missing reference), BitFit (Zaken et al.), and Sparse Fine-Tuning (Guo et al.). Finally, we evaluate the general-purpose cross-lingual representations of our monolingual models on retrieval, classification, structured prediction, and question-answering tasks.
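
To give a flavor of the parameter-efficient methods mentioned above, the snippet below sketches the BitFit idea: freeze every weight and leave only the bias terms (plus the freshly initialized task head) trainable. The model name and the classifier head prefix are illustrative assumptions, not the thesis's exact configuration.

```python
# Sketch of BitFit-style parameter-efficient fine-tuning (Zaken et al.):
# update only bias terms (and the new task head), freeze everything else.
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "xlm-roberta-base", num_labels=3)          # illustrative model/task choice

trainable, total = 0, 0
for name, param in model.named_parameters():
    # Biases and the freshly initialized classification head stay trainable.
    param.requires_grad = name.endswith(".bias") or name.startswith("classifier")
    trainable += param.numel() if param.requires_grad else 0
    total += param.numel()

print(f"trainable: {trainable:,} of {total:,} parameters "
      f"({100 * trainable / total:.3f}%)")
# The model can now be trained with any standard loop or the HF Trainer;
# only a tiny fraction of the parameters receives gradient updates.
```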

References

  1. Ruder, Sebastian, et al. “Unsupervised Cross-Lingual Representation Learning.” Proceedings of ACL 2019, Tutorial Abstracts, 2019, pp. 31–38.
  2. Magueresse, Alexandre, et al. Low-Resource Languages: A Review of Past Work and Future Challenges. 2020.
  3. Hu, Junjie, et al. XTREME: A Massively Multilingual Multi-Task Benchmark for Evaluating Cross-Lingual Generalization. 2020.
  4. Conneau, Alexis, et al. Unsupervised Cross-Lingual Representation Learning at Scale. 2020.
  5. Sanh, Victor, et al. DistilBERT, a Distilled Version of BERT: Smaller, Faster, Cheaper and Lighter. 2020.
  6. Jiao, Xiaoqi, et al. TinyBERT: Distilling BERT for Natural Language Understanding. 2020.
  7. Conneau, Alexis, et al. Word Translation Without Parallel Data. arXiv, 2017, doi:10.48550/ARXIV.1710.04087.
  8. Zaken, Elad Ben, et al. BitFit: Simple Parameter-Efficient Fine-Tuning for Transformer-Based Masked Language-Models. arXiv, 2021, doi:10.48550/ARXIV.2106.10199.
  9. Guo, Demi, et al. Parameter-Efficient Transfer Learning with Diff Pruning. arXiv, 2020, doi:10.48550/ARXIV.2012.07463.
  1. http://labs.theguardian.com/digital-language-divide/ 

  2. Word embeddings and word representation are interchangeable in our thesis. 

  3. Here: Measured in the number of free parameters in the model. 

Last updated: 22-08-2022