[Paper] Survey: Parameter-Efficient Fine-Tuning in NLP

Part of the related work section of my Master's thesis. Download the full version here.


Table of Contents

  1. Motivation
  2. Adapters
  3. Sparse Fine-Tuning

1. Motivation

The standard approach to solving a new task with a pre-trained transformer is to add a task head (e.g., a linear classification layer) on top of the pre-trained transformer (encoder) and to minimize the task loss end-to-end. However, this approach results in an entirely new, large model for every task, making it harder to track what changed significantly during fine-tuning and therefore also harder to transfer the acquired task-specific knowledge (modularity). The latter is important in our monolingual setup, as we want to transfer the task-specific knowledge acquired by our source student into our target student. Ideally, transferring the acquired task-specific knowledge still matches the results of fully fine-tuning one model.

The first work that we explore is adapters (Houlsby et al.), which insert a small set of trainable task-specific parameters between the layers of a model and only update these during fine-tuning, keeping the original parameters frozen. We then discuss sparse fine-tuning, which only changes a subset of the pre-trained model parameters. Specifically, we consider Diff Pruning (Guo et al.), which adds a sparse, task-specific difference vector to the original parameters, and BitFit (Zaken et al.), which enforces sparseness by only fine-tuning the bias terms and the classification layer.

2. Adapters

Adapters were initially proposed in computer vision to adapt to multiple domains (Rebuffi et al.) and were later used as a lightweight alternative training strategy for pre-trained transformers in NLP (Houlsby et al.). Adapters introduce additional parameters to a pre-trained transformer, usually small bottleneck feed-forward networks inserted at each transformer layer. They allow us to keep the pre-trained parameters of the model fixed and to fine-tune only the newly introduced parameters on either a new task (Houlsby et al.) or new domains (Bapna et al.). Adapters perform either on par with or slightly below full fine-tuning (Houlsby et al.). Importantly, adapters learn task-specific representations that are compatible with subsequent transformer layers (Pfeiffer et al.).
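To make this concrete, the following minimal PyTorch sketch freezes all pre-trained weights and hands only the newly introduced parameters to the optimizer. The assumption that adapter and task-head parameters contain "adapter" or "classifier" in their names is purely illustrative, not a convention of any specific library.

```python
import torch
from torch import nn

def freeze_all_but_adapters(model: nn.Module, lr: float = 1e-4) -> torch.optim.Optimizer:
    """Freeze pre-trained weights; train only adapter and task-head parameters.

    Hypothetically assumes that adapter and task-head parameters carry
    "adapter" or "classifier" in their names.
    """
    for name, param in model.named_parameters():
        param.requires_grad = ("adapter" in name) or ("classifier" in name)
    # Only the remaining trainable parameters (adapters + task head) go to the optimizer.
    trainable = [p for p in model.parameters() if p.requires_grad]
    return torch.optim.AdamW(trainable, lr=lr)
```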

Placement & Architecture. Most works insert adapters at each layer of the transformer model; the architecture and placement of adapters are, however, non-trivial. Houlsby et al. experiment with different adapter architectures and empirically validate that a two-layer feed-forward network with a bottleneck works well:

[Figure: two-layer bottleneck adapter architecture (Houlsby et al.)]
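A minimal PyTorch sketch of such a bottleneck adapter is given below; the hidden and bottleneck dimensions and the GELU non-linearity are illustrative assumptions, not the exact configuration of Houlsby et al.

```python
import torch
from torch import nn

class BottleneckAdapter(nn.Module):
    """Two-layer bottleneck feed-forward adapter with a residual connection."""

    def __init__(self, hidden_dim: int = 768, bottleneck_dim: int = 64):
        super().__init__()
        self.down_proj = nn.Linear(hidden_dim, bottleneck_dim)  # down-projection
        self.non_linearity = nn.GELU()                          # non-linearity
        self.up_proj = nn.Linear(bottleneck_dim, hidden_dim)    # up-projection

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # The residual connection keeps the adapter close to an identity function,
        # so the frozen transformer is not disturbed at the start of fine-tuning.
        return hidden_states + self.up_proj(self.non_linearity(self.down_proj(hidden_states)))
```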

This simple down- and up-projection with a non-linearity has become the common adapter architecture. The placement and number of adapters within each transformer block are still debated. Houlsby et al. place adapters at two positions: one after the multi-head attention and one after the feed-forward layers. Stickland and Murray use just one adapter after the feed-forward layers, which Bapna et al. adopt and extend by adding a layer norm (Ba et al.) after the adapter. Pfeiffer et al. test different adapter positions and architectures jointly and conclude that the adapter architecture of Houlsby et al. works well, but place the adapter only after the feed-forward network:

[Figure: adapter placement within the transformer block (Pfeiffer et al.)]
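The following sketch illustrates this Pfeiffer-style placement with simplified residual and layer-norm wiring (the exact wiring differs between implementations); the Houlsby et al. variant would insert a second adapter after the attention sub-layer.

```python
import torch
from torch import nn

class TransformerLayerWithAdapter(nn.Module):
    """Single bottleneck adapter after the feed-forward sub-layer of an
    otherwise frozen transformer layer (simplified illustration)."""

    def __init__(self, attention: nn.Module, feed_forward: nn.Module,
                 hidden_dim: int = 768, bottleneck_dim: int = 64):
        super().__init__()
        self.attention = attention        # frozen pre-trained self-attention sub-layer
        self.feed_forward = feed_forward  # frozen pre-trained feed-forward sub-layer
        self.adapter = nn.Sequential(     # same down/up bottleneck as sketched above
            nn.Linear(hidden_dim, bottleneck_dim),
            nn.GELU(),
            nn.Linear(bottleneck_dim, hidden_dim),
        )
        self.norm_attn = nn.LayerNorm(hidden_dim)
        self.norm_ffn = nn.LayerNorm(hidden_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.norm_attn(x + self.attention(x))                   # no adapter after attention
        ffn_out = self.feed_forward(x)
        return self.norm_ffn(x + ffn_out + self.adapter(ffn_out))   # adapter only after the FFN
```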

Modularity of Representations. One important property of adapters is that they encapsulate task-specific knowledge within each adapter component. Because they are placed inside a frozen transformer block, the adapters are forced to learn an output representation that is compatible with the subsequent layer of the transformer model. This makes adapters modular: they can be stacked on top of each other or replaced dynamically (Pfeiffer et al.). This modularity can be used to fine-tune multiple aligned monolingual students on cross-lingual downstream tasks: instead of fine-tuning the whole student model on the source language, we insert and fine-tune adapters, which can then be inserted into the monolingual student corresponding to the target language.
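As a rough sketch of this transfer, one could copy only the fine-tuned adapter weights from the source student into the target student. The function names and the "adapter" naming convention below are assumptions for illustration, and both students are assumed to share the same adapter layout.

```python
import torch
from torch import nn

def extract_adapter_weights(model: nn.Module) -> dict:
    """Collect only the (fine-tuned) adapter parameters of a model."""
    return {name: param.detach().clone()
            for name, param in model.named_parameters() if "adapter" in name}

def transfer_adapters(source_student: nn.Module, target_student: nn.Module) -> None:
    """Plug task adapters fine-tuned on the source-language student into the
    target-language student."""
    adapter_weights = extract_adapter_weights(source_student)
    # strict=False: only the adapter entries are loaded; all other parameters stay untouched.
    target_student.load_state_dict(adapter_weights, strict=False)
```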

3. Sparse Fine-Tuning

Sparse fine-tuning (SFT) only fine-tunes a small subset of the original pre-trained model parameters at each step, making fine-tuning parameter-efficient. The fine-tuning procedure can be described as

\[\begin{aligned} \Theta_{\text{task}} = \Theta_{\text{pretrained}} + \delta_{\text{task}} \end{aligned}\]

where $\Theta_{\text{task}}$ is the task-specific parameterization of the model after fine-tuning, $\Theta_{\text{pretrained}}$ is the set of pre-trained parameters, which is kept fixed, and $\delta_{\text{task}}$ is the task-specific diff vector. We call the procedure sparse if $\delta_{\text{task}}$ is sparse. Since we only have to store the non-zero positions and values of the diff vector, the method is parameter-efficient. Methods generally differ in how $\delta_{\text{task}}$ is calculated and how its sparseness is induced.
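The following toy example illustrates this view on a small parameter vector; the numbers are arbitrary and only serve to show that storing the diff vector reduces to storing its non-zero positions and values.

```python
import torch

# Stand-in for the pre-trained parameters (kept fixed).
theta_pretrained = torch.randn(10)

# Sparse task-specific diff vector: only two entries are non-zero.
delta_task = torch.zeros(10)
delta_task[[2, 7]] = torch.tensor([0.5, -0.3])

# Task-specific parameters used at inference time.
theta_task = theta_pretrained + delta_task

# Parameter-efficient storage: keep only the non-zero positions and values.
nonzero_idx = delta_task.nonzero(as_tuple=True)[0]
stored = {"indices": nonzero_idx, "values": delta_task[nonzero_idx]}
```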

Methods. Guo et al. introduce Diff Pruning, which determines $\delta_{\text{task}}$ by adaptively pruning the diff vector during training. To induce sparseness, they use a differentiable approximation of the $L_0$-norm penalty (Louizos et al.). Zaken et al. induce sparseness by allowing non-zero differences only in the bias parameters (and the classification layer) of the transformer model. Both their method, called BitFit, and Diff Pruning can match the performance of fully fine-tuned baselines on the GLUE benchmark (missing reference). Ansell et al. learn sparse, real-valued masks based on a simple variant of the Lottery Ticket Hypothesis (Frankle and Carbin): they first fine-tune a pre-trained model on a specific task or language and select the subset of parameters that change the most; these correspond to the non-zero entries of the diff vector $\delta_{\text{task}}$. The authors then reset the model to its original pre-trained initialization and fine-tune it again, this time updating only the selected subset of parameters. The resulting diff vector $\delta_{\text{task}}$ is therefore sparse.
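The sketch below illustrates two of these selection schemes on name-to-tensor parameter dictionaries: a BitFit-style predicate that keeps only biases (and the classifier head) trainable, and a rough approximation of the parameter-selection step in the procedure of Ansell et al. Both helpers are hypothetical and simplified, not the authors' actual implementations.

```python
import torch

def bitfit_trainable(name: str) -> bool:
    """BitFit-style selection: only bias terms (and the classifier head) may change."""
    return name.endswith(".bias") or "classifier" in name

def select_top_k_diff_mask(pretrained: dict, finetuned: dict, k: int) -> dict:
    """Keep the k parameters that changed the most during the first fine-tuning run.

    Returns a boolean mask per parameter tensor: True where the parameter may be
    updated in the second fine-tuning run from the pre-trained initialization.
    """
    abs_diffs = {name: (finetuned[name] - pretrained[name]).abs() for name in pretrained}
    all_diffs = torch.cat([d.flatten() for d in abs_diffs.values()])
    threshold = torch.topk(all_diffs, k).values.min()  # k-th largest absolute change
    return {name: diff >= threshold for name, diff in abs_diffs.items()}
```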

Comparison to Adapters. In contrast to adapters, sparse fine-tuning (SFT) does not modify the architecture of the model but instead restricts fine-tuning to a subset of the model parameters (missing reference). As a result, SFT is more expressive: it is not limited to modifying the outputs of transformer layers with a shallow MLP but can directly modify the pre-trained model's embedding and attention layers (Ansell et al.). Similar to adapters, Ansell et al. show that their sparse fine-tuning technique exhibits the same kind of modularity found in adapters (Pfeiffer et al.). Again, this modularity can be used in our monolingual setup to fine-tune on cross-lingual downstream tasks.

References

  1. Guo, Demi, et al. Parameter-Efficient Transfer Learning with Diff Pruning. arXiv, 2020, doi:10.48550/ARXIV.2012.07463.
  2. Zaken, Elad Ben, et al. BitFit: Simple Parameter-Efficient Fine-Tuning for Transformer-Based Masked Language-Models. arXiv, 2021, doi:10.48550/ARXIV.2106.10199.
  3. Rebuffi, Sylvestre-Alvise, et al. Learning Multiple Visual Domains with Residual Adapters. 2017.
  4. Houlsby, Neil, et al. Parameter-Efficient Transfer Learning for NLP. 2019.
  5. Bapna, Ankur, et al. Simple, Scalable Adaptation for Neural Machine Translation. 2019.
  6. Pfeiffer, Jonas, et al. AdapterHub: A Framework for Adapting Transformers. arXiv, 2020, doi:10.48550/ARXIV.2007.07779.
  7. Stickland, Asa Cooper, and Iain Murray. BERT and PALs: Projected Attention Layers for Efficient Adaptation in Multi-Task Learning. 2019.
  8. Ba, Jimmy Lei, et al. Layer Normalization. 2016.
  9. Pfeiffer, Jonas, et al. AdapterFusion: Non-Destructive Task Composition for Transfer Learning. 2021.
  10. ---. MAD-X: An Adapter-Based Framework for Multi-Task Cross-Lingual Transfer. arXiv, 2020, doi:10.48550/ARXIV.2005.00052.
  11. Louizos, Christos, et al. Learning Sparse Neural Networks through L_0 Regularization. arXiv, 2017, doi:10.48550/ARXIV.1712.01312.
  12. Ansell, Alan, et al. Composable Sparse Fine-Tuning for Cross-Lingual Transfer. arXiv, 2021, doi:10.48550/ARXIV.2110.07560.
  13. Frankle, Jonathan, and Michael Carbin. The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks. arXiv, 2018, doi:10.48550/ARXIV.1803.03635.