Generalizability of Mixture of Domain-Specific Adapters from the Lens of Signed Weight Directions and its Application to Effective Model Pruning (2024)

Tuc Nguyen
Department of Computer Science
Indiana University
nguyentuc1003@gmail.com
Thai Le
Department of Computer Science
Indiana University
leqthai.vn@gmail.com

Abstract

Several parameter-efficient fine-tuning methods based on adapters have been proposed as a streamlined approach to incorporate not only a single specialized knowledge into existing Pre-Trained Language Models (PLMs) but also multiple of them at once. Recent works such as AdapterSoup propose to mix not all but only a selective subset of domain-specific adapters during inference via model weight averaging to optimize performance on novel, unseen domains with excellent computational efficiency. However, the essential generalizability of this emerging weight-space adapter mixing mechanism on unseen, in-domain examples remains unexplored. Thus, in this study, we conduct a comprehensive analysis to elucidate the generalizability of domain-specific adapter mixtures in in-domain evaluation. We also provide investigations into the inner workings of the mixture of domain-specific adapters by analyzing their weight signs, yielding a critical analysis of the negative correlation between their fraction of weight sign difference and their mixtures' generalizability. The code is available on GitHub.

1 Introduction

Recently, several parameter-efficient fine-tuning methods based on adapters have been introduced as a streamlined approach for fine-tuning Pre-trained Language Models (PLMs) to equip them with new, specialized knowledge or domains. Several algorithms have been proposed to train a distinct adapter for each new domain (Houlsby et al., 2019; Pfeiffer et al., 2021; Hu et al., 2022). To further improve a model's generalizability, existing works (Pfeiffer et al., 2021; Wang et al., 2021a; Diao et al., 2023) mostly focus on training multiple adapters for multiple tasks and continuously adding more adapters for incoming new tasks. This can be inefficient for new domain tasks that have only a few examples, making the learning among the tasks unequal. Thus, more recent works such as Matena and Raffel (2022); Wang et al. (2022, 2021b); Li et al. (2022); Chronopoulou et al. (2023) opt for weight-space averaging of models and/or adapters trained on different domains, resulting in so-called Mixtures of Expert Adapters.

One recent notable work in this space is AdapterSoup (Chronopoulou et al., 2023), which proposes to merge the weights of a fixed-size, selective subset of different domain-specific adapters via an averaging function to accommodate unseen tasks or domains during inference. Such a weight-space merging mechanism on adapters is efficient in practice, as one can efficiently train a small, additional adapter and plug it into existing PLMs to incorporate new knowledge. Although the work reported favorable evaluation results on unseen, novel domains, it is unclear to what extent such a weight-space merging mechanism on domain-specific adapters can generalize in an in-domain evaluation setting, i.e., how well it makes predictions on unseen examples of domains already seen during training. Moreover, to the best of our knowledge, no existing work comprehensively studies the generalization of mixtures of adapters in the in-domain setting. This literature gap seems counter-intuitive because in-domain evaluation is fundamental and should precede out-of-domain evaluation. Moreover, in real-world applications, model owners have incentives to utilize as much available information as possible to improve their models over time. With the availability of parameter-efficient fine-tuning methods that are fairly easy to adopt with minimal space and runtime cost, model owners are incentivized to quickly fine-tune their models on a few examples collected from a new domain via an additional adapter to optimize performance (rather than relying entirely on out-of-domain prediction capability). As a result, although in-domain evaluation seems trivial, it is fundamental, as one must ensure that the mixture of adapters works well on the tasks it has already been trained on. Furthermore, several key questions regarding the resulting mixtures of domain-specific adapters remain unanswered, especially those regarding their generalizability and their adversarial robustness when mixing adapters trained on very different tasks.

Therefore, borrowing the pop-culture saying that "mixed drinks and cocktails aren't actually the same thing", and in contrast to existing works, we hypothesize that not all mixtures of expert adapters are created equal or have superior performance. Then, through an array of comprehensive experiments, we attempt to answer questions about when and what to mix when it comes to domain-specific adapters. We find that the weight-space merging mechanism suffers from performance degradation in terms of both generalizability and adversarial robustness, even on inputs from domains it has already been trained on. Moreover, we attempt to explain such performance degradation by revealing a critical negative correlation between the signed directions of adapter weights during mixing and domain-specific predictive performance (Fig. 1). Although simple, this intuitive and novel observation also allows us to answer "when and what adapters to mix?" and to design a more effective model pruning method as a by-product application.

[Figure 1]

Overall, our study does not focus on proposing a new mechanism, algorithm, or method. Instead, we focus on analyzing and bringing understanding to an existing and emerging paradigm of mixing multiple domain-specific adapters that was previously introduced by Chronopoulou et al. (2023) and Wang et al. (2022). Specifically, we focus on in-domain prediction when mixing adapters from different domains as an emerging and promising paradigm for the deployment of PLMs in practice.

Our contributions are summarized as follows.

  1.

    This is the first and most comprehensive analysis of the in-domain generalizability of mixtures of domain-specific adapters, covering 3 different adapter methods on 13 diverse classification datasets,

  2.

    We provide insights and analysis on when and what adapters to mix to minimize performance degradation via the lens of signed directions of adapters' parameter weights,

  3.

    We demonstrate the utility of such insights to train mixtures of adapters with 90% sparsity that improve both generalizability and efficiency.

2 Related works

Adapter Fine-tuning. The primary method for adapting general-purpose PLMs to downstream tasks is full fine-tuning, which requires adjusting all of a model's parameters (Peters et al., 2018; Devlin et al., 2019a). However, this results in redundant copies of fine-tuned models for each task, posing a significant memory challenge. Thus, various parameter-efficient fine-tuning methods have been proposed, including prompt-based (Li and Liang, 2021) and adapter-based fine-tuning (Houlsby et al., 2019; Pfeiffer et al., 2021; Hu et al., 2022). Among adapter-based fine-tuning methods, Houlsby et al. (2019) introduce two adapter blocks with bottleneck networks in each Transformer block of a PLM. The Pfeiffer adapter (Pfeiffer et al., 2021) differs in architecture, incorporating only one adapter layer in each Transformer block, in contrast to the two layers introduced by Houlsby et al. (2019). LoRA (Hu et al., 2022) takes a distinctive approach by freezing the MLP modules of Transformers and representing updates to the attention weights with two low-rank matrices to optimize space while effectively retaining model performance. In this work, we focus on analyzing adapter-based fine-tuning methods as they are more popular and effective.

Mixture of Expert Adapters. Additionally, several approaches (Wang et al., 2021a; Pfeiffer et al., 2021, 2020; Wang et al., 2022) have been proposed to further optimize adapters for various downstream tasks by maintaining a set of adapters and combining them during inference. In particular, AdaMix (Wang et al., 2022) fine-tunes so-called Mixtures of Experts (MoEs) with adapters on a downstream task and averages their weights during inference. In addition, Li et al. (2022) explore performance in novel domains through weight averaging over entire language models. Similarly, AdapterSoup (Chronopoulou et al., 2023) opts for weight-space averaging of adapters trained on different domains. Among these methods, weight-space averaging is identified as the most intuitive method for mixing different adapters (Jin et al., 2023a; Chronopoulou et al., 2023).

Nevertheless, none of these works comprehensively evaluates and analyzes the generalizability and adversarial robustness of the resulting mixtures of adapters under different mixtures of domain-specific knowledge, which is necessary to answer the question "when and what to mix?".

3 Comprehensive In-Domain Evaluation

To evaluate the in-domain performance of adapter mixtures, we train several adapters with domain-specific knowledge and mix them in different combinations. Then, we evaluate each combination on different downstream tasks along two aspects: (1) generalizability on unseen in-domain examples and (2) adversarial robustness under adversarial text attacks.

[Figure 2]

3.1 Evaluation Datasets

Diverse Domain Knowledge. To simulate knowledge diversity, we gather a total of 13 distinct and diverse domain-specific datasets of classification tasks for evaluation. We refer the readers to Appendix A.1 for detailed information and their linguistic statistics. Fig. 2 reveals the intricate diversity within our selected datasets, both semantically and topic-wise. Notably, SST2 and IMDB, both originating from the same movie corpus, exhibit proximity in topic embedding space. In contrast, non-formal datasets such as Wiki and Tweets are distinctly distant from the other datasets in this regard. We refer the readers to Appendix A.2 for a detailed exploration and analysis of the topic distributions among the datasets.

3.2 Mixing Fine-Tuned Adapters

Base Models and Individual Adapters. We design our evaluation using two transformer-based models, namely BERT (Devlin et al., 2019b) and RoBERTa (Liu et al., 2019), with 3 diverse and well-known adapter methods: Houlsby (Houlsby et al., 2019), Pfeiffer (Pfeiffer et al., 2021), and LoRA (Hu et al., 2022). These adapter-based methods introduce variations in the adapter architecture and parameterization (Sec. 2), contributing to the comprehensiveness of our analysis.

Mixing Adapters. From the pre-trained weights $\theta_{PLM}$ of either BERT or RoBERTa, we train a suite of 13 domain-specific adapters tailored for diverse domains, denoted as $\theta_{D_1}, \theta_{D_2}, \dots, \theta_{D_k}$. Following Chronopoulou et al. (2023), the final inference with the target mixture of domain-specific adapters becomes:

$f\left(x,\ \theta_{PLM} + \frac{1}{k}\sum_{i=1}^{k}\theta_{D_i}\right)$    (1)
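
For illustration, here is a minimal sketch (ours, not the authors' released code) of the weight-space averaging in Eq. 1, assuming every domain-specific adapter is stored as a PyTorch state dict with identical parameter names and shapes:

```python
import torch

def average_adapters(adapter_state_dicts):
    """Weight-space average of k domain-specific adapters (Eq. 1)."""
    k = len(adapter_state_dicts)
    return {
        name: sum(sd[name].float() for sd in adapter_state_dicts) / k
        for name in adapter_state_dicts[0]
    }

# Usage sketch (paths are placeholders): the averaged adapter is loaded into a
# single adapter slot on top of the frozen PLM weights theta_PLM.
# theta_mix = average_adapters([torch.load(p) for p in adapter_paths])
# model.load_state_dict(theta_mix, strict=False)
```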

3.3 Adversarial Text Generation

Textual adversarial attacks are popular in AI robustness research. Given a dataset $D = \{(x_i, y_i)\}_{i=1}^{N}$, where $x$ represents a sample and $y$ denotes its ground-truth label, a textual adversarial attack aims to attack a PLM $f_{\theta}$ with a classification loss function $\mathcal{L}$ by perturbing each sample $x$ with perturbation noise $\delta$ given a certain budget $C$:

$\operatorname{arg\,max}_{\delta \in C}\ \mathcal{L}\left[f_{\theta}(x+\delta),\ y\right],$    (2)

To evaluate the robustness of a mixture of adapters, we employ both black-box and white-box textual attacks to exercise Eq. 2. We utilize the popular TextFooler (Jin et al., 2020) as the black-box attack, which replaces words with synonyms or contextually similar words to deceive PLMs. We utilize the well-known FGSM (Goodfellow et al., 2015) as the white-box attack, which efficiently crafts adversarial examples by perturbing the embeddings of the text in the direction of the sign of the gradient of the loss function with respect to the input, thereby exposing vulnerabilities in model robustness.
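
To make the white-box attack concrete, below is a minimal sketch of one-step FGSM on the input word embeddings (Eq. 2). It assumes a Hugging Face-style sequence classifier that accepts `inputs_embeds` and exposes `get_input_embeddings()`; the step size of 0.01 follows the perturbation magnitude reported in Sec. 4.1:

```python
import torch
import torch.nn.functional as F

def fgsm_on_embeddings(model, input_ids, attention_mask, labels, eps=0.01):
    """One-step FGSM applied to the input word embeddings (a sketch)."""
    embeds = model.get_input_embeddings()(input_ids).detach().requires_grad_(True)
    logits = model(inputs_embeds=embeds, attention_mask=attention_mask).logits
    loss = F.cross_entropy(logits, labels)
    loss.backward()
    # Perturb in the direction of the sign of the gradient of the loss w.r.t. the input.
    return (embeds + eps * embeds.grad.sign()).detach()
```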

[Figure 3]
Table 1: Accuracy drop (%) when mixing all 13 adapters, relative to each single in-domain adapter, on the clean test set ($\nabla_{clean}$) and under the black-box ($\nabla_{blackbox}$) and white-box ($\nabla_{whitebox}$) attacks.

Dataset              mnli   mrpc   sst2   rte    qnli   qqp    rct    ag     authorship  financial  imdb   tweets  wiki   Average
$\nabla_{clean}$     15.04  8.31   4.90   8.50   4.19   9.92   21.72  13.69  12.60       15.33      14.26  13.17   13.16  11.91
$\nabla_{blackbox}$  12.10  16.07  10.43  10.21  10.13  9.82   12.32  12.62  13.42       13.82      13.04  11.83   10.52  12.03
$\nabla_{whitebox}$  10.14  11.21  8.31   9.42   12.14  10.24  12.53  12.06  10.45       9.53       10.04  10.24   7.53   10.30

3.4 Combinatory Evaluation

To evaluate the in-domain performance for each target domain, we generate all possible combinations of adapters. To illustrate, for the target domain MNLI, we first evaluate the mixture containing only its own adapter. When combining two adapters, we can choose 1 additional adapter out of the remaining 12, resulting in 12 possible combinations. For a set of 3 adapters including MNLI, we select 2 adapters out of the 12, giving $C_{12}^{2}$ combinations. This process continues for sets of 4 to 13 adapters, where, in the case of 13 adapters, all adapters are combined. Thus, we have $13 \times (1 + \sum_{i=1}^{12} C_{12}^{i}) = 53{,}248$ combinations over all domains. We report the mean and variance of in-domain performance for each set of mixtures of $k$ adapters.
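
A small Python check of this count, for the reader's convenience:

```python
from math import comb

# Each of the 13 target domains is evaluated alone and mixed with every
# non-empty subset of the remaining 12 adapters.
per_domain = 1 + sum(comb(12, i) for i in range(1, 13))  # = 2**12 = 4096
total = 13 * per_domain
print(total)  # 53248
```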

Notably, this setup already includes all the possible mixtures potentially selected by AdapterSoup (Chronopoulou et al., 2023), which proposes an additional mechanism to select a subset of a fixed number of domain-specific adapters to mix.

4 Experiments

4.1 Implementation Details

For the Ag News, Authorship, Financial, IMDB, Tweets, and Wiki-Toxic datasets, we partition each dataset into three segments with an 8:1:1 train:val:test split. For datasets belonging to the GLUE corpus, we employ their public training and evaluation splits. For the black-box TextFooler attack, we set the minimum embedding cosine similarity between a word and its synonyms to 0.8, and the minimum USE similarity to 0.84. For the white-box FGSM (Goodfellow et al., 2015) attack, we set the perturbation magnitude to 0.01. Following the setup of Houlsby et al. (2019) and Pfeiffer et al. (2021), we use adapters with a dimension of 64 for RoBERTa-large and 256 for BERT-base. With LoRA, we use rank $r{=}4$ following Hu et al. (2022). Detailed training and evaluation datasets and hyper-parameter configurations for different tasks are presented in Appendix A.3.
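
For reference, a minimal sketch of how the LoRA variant with rank $r{=}4$ could be attached to a RoBERTa classifier. The paper does not name the library; the use of Hugging Face peft, as well as the lora_alpha, dropout, target modules, and number of labels, are our assumptions:

```python
from transformers import AutoModelForSequenceClassification
from peft import LoraConfig, TaskType, get_peft_model

base = AutoModelForSequenceClassification.from_pretrained(
    "roberta-large", num_labels=2)

# r=4 follows the paper; alpha, dropout, and target modules are assumptions.
config = LoraConfig(task_type=TaskType.SEQ_CLS, r=4,
                    lora_alpha=16, lora_dropout=0.1,
                    target_modules=["query", "value"])
model = get_peft_model(base, config)
model.print_trainable_parameters()
```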

4.2 In-Domain Performance Results

Overview. We present the performance of a candidate setting, RoBERTa with the Pfeiffer adapter (Pfeiffer et al., 2021), with different numbers of additional mixed domains in Fig. 3. Table 1 shows how much the predictive performance drops, without and with adversarial black-box and white-box attacks, when mixing all adapters. Overall, the average performance drop over all tasks on the clean test set, from the original performance to a mix of all adapters, is 11.91%, and the drops under black-box and white-box attacks are 12.03% and 10.30%, respectively. Further results on generalization and adversarial robustness across other models and adapter methods are documented in Appendix A.4.

Finding #1: As we add more tasks or domains, the predictive performance of every single task decreases, reaching its lowest point when we incorporate the maximum of 13 adapters trained on various topic datasets. Fig. 3 shows that mixing domain-specific adapters indeed decreases in-domain performance (a reduction of around 10% in MRPC, QQP, RTE, etc., and nearly 17% in the Financial domain when mixing 13 adapters). The same behavior was also observed by Jin et al. (2023b), who merge the weights of PLMs. Notably, task accuracies decrease at a slower pace for QNLI and SST2 as the size of the mixture increases (Fig. 3). In contrast, a substantial decrease in accuracy is observed for domains such as RCT, IMDB, Ag News, and Authorship (Fig. 3). This shows that mixing domain-specific adapters impairs model performance differently depending on the target domain; in other words, "what to mix" in an adapter mixture has a crucial effect on the mixture's performance. To attempt to explain this behavior, we later present and verify the hypothesis that mixing domain-specific adapters via weight averaging can result in "forgotten knowledge" due to differences in weight signs when mixing these adapters (Sec. 5).

[Figure 4]

Finding #2: On average, there are no notable differences in the magnitude of accuracy degradation with and without adversarial attacks (a 12.03% drop under the black-box attack versus an 11.91% drop without attack) (Fig. 3). The overall predictive performance was significantly lower under the white-box attack than under the black-box attack, as expected, because the white-box FGSM attack has additional access to the models' parameters. Interestingly, on the RCT dataset, the accuracy drop on clean inputs (21.72%) is much larger than under the black-box and white-box attacks (12.32% and 12.53%, respectively).

Finding #3: The variance in task accuracy under adversarial attacks, when combining different domain adapters, is observed to be larger than the variance in clean accuracy (Fig. 3). Although the magnitude of the decrease appears similar in Fig. 3, differences in variance (max, min) are discernible among the combinations. Specifically, in MRPC, QNLI, QQP, RTE, SST2, Tweets, IMDB, Financial, and Authorship, certain mixed models exhibit slightly better adversarial robustness than single adapters (when the number of mixed adapters $k$ is 2 or 3). Moreover, the variance of robustness in adversarial scenarios tends to be higher than in clean scenarios because the adapters are trained on different tasks and may exhibit different vulnerabilities to the same attack method.

Finding #4: Mix up to three adapters to maintain competitive performance. Based on the results in Fig. 3 and all findings above, it is advisable to mix only up to three tasks to maintain competitive performance (as observed in the MRPC, QQP, RTE, Tweets, Financial, and Authorship domains, with less than a 3% drop in accuracy). Notably, in some cases, such as QNLI and SST2, mixtures of fewer than three adapters even achieved better performance than the original single adapter, in terms of both generalizability (by up to 1%) and adversarial robustness (by up to 3%).

5 Effects of Sign Differences of Adapter Weights during Mixing: A Hypothesis

5.1 An Explanation Hypothesis

The ideal scenario when averaging adapter weights during mixing is to make minimal adjustments to their weights, both in terms of values, i.e., magnitudes, and directions, to sustain as much of the learned knowledge as possible. Investigating how adapter weights mix with respect to both their magnitudes and directions can be overly complex. Thus, we simplify this assessment by focusing on the sign directions of the adapter weights in our analysis. Following the mixing process of $k$ individual adapter weights in Eq. 1 (Sec. 3.2), we hypothesize that mixing adapter weights of conflicting signs can result in "forgotten knowledge" and lead to performance degradation. As illustrated in Fig. 1, averaging adapter weights across various tasks may nullify important weights for individual tasks if their signs are opposite, i.e., positive vs. negative. In other words, the fraction of sign difference, or FSD (%), i.e., the proportion of weight sign differences among adapter weights during mixing, correlates with their mixtures' generalizability. Alg. 3 (Appendix A.5) shows the calculation of FSD.
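
A minimal sketch of the FSD computation, mirroring Alg. 3 (element-wise sign conflicts between two adapters with the same architecture, normalized by the total number of parameters):

```python
import torch

def fraction_sign_difference(theta_a, theta_b):
    """Fraction of sign difference (FSD) between two adapter state dicts
    that share the same architecture (cf. Alg. 3 in Appendix A.5)."""
    conflicts, total = 0, 0
    for name, w_a in theta_a.items():
        w_b = theta_b[name]
        conflicts += (w_a * w_b < 0).sum().item()  # element-wise sign conflicts
        total += w_a.numel()
    return conflicts / total
```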

We evaluate our hypothesis in different cases: (i) individual adapters (a mixture with $k{=}1$), (ii) dual adapters ($k{=}2$), and, generalizing, (iii) multiple adapters ($k{>}2$). To demonstrate the utility of our hypothesis, we apply it to improve the generalizability of adapter mixtures and also to derive a more effective model pruning method in Sections 6 and 7.

5.2 Individual Adapters ($k{=}1$)

We calculate the FSD of adapter weights on RoBERTa and normalize it by the total number of weights, yielding a matrix $\mathbf{S}_{k\times k}$ where each row contains the FSD of the adapter trained on one task relative to the remaining adapters (Fig. 4). We refer the readers to Sec. A.6 in the Appendix for results on BERT. Interestingly, a consistent trend in the FSD is observed across model architectures (BERT, RoBERTa) and adapter methods (e.g., Fig. 4 and Fig. 9 of Appendix A.6). The reason is that, across adapter methods, adapters act as small MLP layers that integrate task-specific knowledge into pre-trained models (Meng et al., 2022). This shared functionality contributes to a similar trend in FSD, highlighting the robustness and generalizability of the observed behavior across different adapter architectural variations. In addition, adapters trained on datasets with distinct topic distributions and cosine similarities (Fig. 2) exhibit varying weight directions (Fig. 4). In particular, MNLI has a similar linguistic distribution to other datasets (e.g., MRPC, RTE, Ag News) (Fig. 2), but the adapter trained on MNLI has a significantly larger FSD than the remaining domain-specific adapters (Fig. 4). Thus, datasets that are similar in linguistic statistics may not necessarily share the same optimization trajectory. As a result, methods that rely on the linguistic distribution to choose the closest set of adapters to mix, such as AdapterSoup (Chronopoulou et al., 2023), may lead to sub-optimal performance.

[Figure 5]

5.3 Dual Adapters ($k{=}2$)

Fig. 5 shows the FSD for each adapter when mixing two domain-specific adapters. This investigation is conducted within the context of the Pfeiffer adapter (Pfeiffer et al., 2021) using a pre-trained RoBERTa model. Overall, there is a strong negative correlation between the FSD and generalizability: the lower the sky-blue bar (i.e., the smaller the fraction of weight sign conflicts), the higher the performance of the mixture (Fig. 5). Notably, tasks with a substantial difference in weight signs exhibit a pronounced performance decrease. Specifically, RCT shows significant performance drops due to substantial differences in adapter weight direction. Conversely, tasks such as MRPC and QNLI demonstrate either marginal improvement or no change in performance when mixed with other adapters. This correlates well with the marginal FSD of the mixed adapters, which ranges from only 5% to 10% relative to the original weights.

[Figure 6]

5.4 Multiple Adapters ($k{>}2$)

Similar to the dual-adapter setting, there is still a strong negative correlation between FSD and generalizability, and increasing the number of mixed adapters amplifies the sign disparity (Fig. 6). For example, adapters trained on RCT and Tweets exhibit large FSD compared to other adapters and hence show a significant decrease in generalizability (Fig. 6). In some cases, mixing only a few adapters (e.g., 2-4) can still maintain competitive performance, as in the QNLI and SST2 domains (Fig. 6, 15). This correlates with the relatively marginal differences of the QNLI and SST2 adapter weights compared to adapters trained on other domains (Fig. 4), as their mixtures do not lead to significant cancellations of existing parameters and hence preserve learned knowledge.

[Figure 7]

5.5 Discussion

Which adapters should we mix? From Sec. 5, we find that it is not advisable to merge adapters whose weight signs differ significantly from each other. On the other hand, mixing these adapters with other adapters that have a small FSD achieves competitive performance (within a 3% drop in accuracy) only when the number of mixed adapters is small (QNLI, MRPC in Fig. 6), and can occasionally even achieve better performance than training the original adapter, as seen in the QNLI domain (Fig. 5). Therefore, when deploying PLMs, it is prudent to select only a group of tasks with a small FSD to minimize the performance drop of the final mixed model. This observation is crucial when deploying these models on edge devices, where only the adapters are stored on the device, which often has a limited memory capacity.

Experimental comparison with AdapterSoup. AdapterSoup (Chronopoulou et al., 2023) dynamically selects a set of $l$ adapters during inference. When $l{=}k$, our experimental setting is the same as the AdapterSoup setting in which all adapters are used at once. When $l{=}1$, it reduces to the original performance of a single adapter, assuming that AdapterSoup perfectly picks the domain it was already trained on; this result is already included in our paper (mixture of only 1 adapter, Fig. 3). When $1{<}l{<}k$, we do not have the results of the specific combination that AdapterSoup would select. However, we report the maximum performance across all combinations of the 13 domain-specific adapters for each value of $l$ in Fig. 3. For example, when $l{=}3$, we only observe comparable performance (within less than a 3% drop) in the QNLI, MRPC, and SST2 domains (Fig. 15). We also emphasize that it is one thing to select the best adapters to mix during inference; it is much harder to choose $l$, i.e., how many of them to mix.

Input: kπ‘˜kitalic_k domain-specific adapters, a matrix 𝐒kΓ—ksubscriptπ’π‘˜π‘˜\mathbf{S}_{k\times k}bold_S start_POSTSUBSCRIPT italic_k Γ— italic_k end_POSTSUBSCRIPT of FSD, l𝑙litalic_l number of adapters to fusion.

Output: Average(π–Όπ–Ίπ—‡π–½π—‚π–½π–Ίπ—π–Ύπ—Œπ–Όπ–Ίπ—‡π–½π—‚π–½π–Ίπ—π–Ύπ—Œ\mathsf{candidates}sansserif_candidates)

1:π–Όπ–Ίπ—‡π–½π—‚π–½π–Ίπ—π–Ύπ—Œβ†{}β†π–Όπ–Ίπ—‡π–½π—‚π–½π–Ίπ—π–Ύπ—Œ\mathsf{candidates}\leftarrow\{\}sansserif_candidates ← { }

2:Compute the average of FSD

3:a⁒v⁒gS=m⁒e⁒a⁒n⁒(S,a⁒x⁒i⁒s=1)π‘Žπ‘£subscriptπ‘”π‘†π‘šπ‘’π‘Žπ‘›π‘†π‘Žπ‘₯𝑖𝑠1avg_{S}=mean(S,axis=1)italic_a italic_v italic_g start_POSTSUBSCRIPT italic_S end_POSTSUBSCRIPT = italic_m italic_e italic_a italic_n ( italic_S , italic_a italic_x italic_i italic_s = 1 )

4:Select the top l𝑙litalic_l from set of kπ‘˜kitalic_k adapters according to

5:smallest average FSD

6:π–Όπ–Ίπ—‡π–½π—‚π–½π–Ίπ—π–Ύπ—Œβ†π—π—ˆπ—‰lβ†π–Όπ–Ίπ—‡π–½π—‚π–½π–Ίπ—π–Ύπ—Œsubscriptπ—π—ˆπ—‰π‘™{\mathsf{candidates}\leftarrow\mathsf{top}_{l}}sansserif_candidates ← sansserif_top start_POSTSUBSCRIPT italic_l end_POSTSUBSCRIPT

7:return average⁑(π–Όπ–Ίπ—‡π–½π—‚π–½π–Ίπ—π–Ύπ—Œ)averageπ–Όπ–Ίπ—‡π–½π—‚π–½π–Ίπ—π–Ύπ—Œ\operatorname{average}(\operatorname{\mathsf{candidates}})roman_average ( sansserif_candidates )

6 Greedy Adapter Mixing with FSD

Our observations from Sec. 5 reveal that two adapters with minimal disparity in FSD can yield competitive performance when combined. Therefore, to demonstrate the utility of FSD, in this section we design a mixing strategy, called Greedy Adapter Mixing (Alg. 1), that uses FSD to decide which domain-specific adapters to mix by minimizing the overall FSD in a greedy manner, yielding a final mixed model with competitive performance. This algorithm is based on the hypothesis that mixing adapters with minimal FSD yields mixture models with better generalizability. We proceed by evaluating model performance across various adapter combinations.
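
A minimal sketch of Alg. 1, assuming the FSD matrix $\mathbf{S}$ is available as a NumPy array and the adapters as PyTorch state dicts:

```python
import numpy as np

def greedy_adapter_mixing(adapter_state_dicts, S, l):
    """Keep the l adapters with the smallest average FSD (row-wise mean of
    the k x k FSD matrix S) and return their weight-space average (Alg. 1)."""
    avg_fsd = np.asarray(S).mean(axis=1)
    candidates = np.argsort(avg_fsd)[:l]          # l smallest average FSD
    return {
        name: sum(adapter_state_dicts[i][name] for i in candidates) / l
        for name in adapter_state_dicts[0]
    }
```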

Fig. 7 shows the generalizability of mixtures of $l \in [2, 13]$ domain-specific adapters. Overall, Greedy Adapter Mixing results in very competitive performance compared with the empirical upper-bound accuracy. In contrast, using the same algorithm but maximizing FSD results in performance close to the empirical lower-bound accuracy (Fig. 7). However, in both cases, mixing adapters guided by FSD does not exactly reach the empirical upper-bound or lower-bound performance. Thus, greedily mixing adapters to minimize FSD does not completely prevent knowledge loss in the adapter mixtures. Nevertheless, FSD is still useful as a guidance measure to effectively mix domain-specific adapters.

7 Towards Effective Model Pruning

To further demonstrate the utility of our FSD analysis in Sec. 5 and Sec. 6, in this section we leverage FSD information to reduce knowledge loss through the development of a pruning algorithm guided by FSD insights. Specifically, Fig. 5 shows that predictive performance drops significantly when integrating adapters with pronounced disparities in weight signs, i.e., positive vs. negative signs. Moreover, the neural network pruning literature indicates that only a limited number of weights contribute significantly to task performance, suggesting redundancy within the weights that can be pruned without compromising the original task performance (Han et al., 2016; Frankle et al., 2021; Lazarevich et al., 2021). Thus, to mitigate the impact of weight sign differences in adapter mixtures, we propose mixing only sparse versions of the adapters' weights.

Unlike Alg. 1, this strategy indirectly reduces the fraction of weight sign conflicts. Intuitively, by minimizing the FSD, the mixing process becomes more resilient to the inadvertent cancellation of important weights by less significant or redundant weights. This phenomenon is visually depicted in Step 2 of Fig. 1, where only the significant weights in the two adapters need to be preserved, and small or unimportant weights of opposing signs can be eliminated.

7.1 FSD-based Magnitude Pruning

Algorithm 2: FSD-based Magnitude Pruning

Input: adapter parameters $w$, sparsity ratio $s$.
Output: pruned adapter $\tilde{w}$.

1: $w \leftarrow \mathrm{Trained}(w)$
2: Compute the importance score $z = |w|$
3: Compute the $s$-th percentile of $z$ as $z_s$
4: $m \leftarrow \mathbb{1}[z - z_s \geq 0]$
5: $\tilde{w} \leftarrow m \odot w$

Post-training Pruning. SparseAdapter (He et al., 2022) applies pruning across every layer of the adapters and achieves comparable or even superior predictive performance to standard adapters, even when the sparsity ratio reaches up to 80%. By adopting a similar process of pruning adapters after training, with the guidance of our FSD analysis in Sec. 5, we can eliminate redundant parameters at an early stage, circumventing the need for a time-consuming iterative pruning process, as discussed in prior works such as Han et al. (2016); Lazarevich et al. (2021). The detailed pruning algorithm is presented in Alg. 2.
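
A minimal PyTorch sketch of Alg. 2. It uses a single global threshold over all adapter tensors; whether the thresholding is global or per layer is not specified in the paper, so that choice is an assumption:

```python
import torch

def magnitude_prune(adapter_sd, sparsity=0.9):
    """Post-training magnitude pruning (Alg. 2): keep only the weights whose
    absolute value reaches the sparsity-th percentile of the whole adapter."""
    scores = torch.cat([w.abs().flatten().float() for w in adapter_sd.values()])
    threshold = torch.quantile(scores, sparsity)
    return {name: w * (w.abs() >= threshold).to(w.dtype)
            for name, w in adapter_sd.items()}
```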

[Figure 8]

Generalizability of Pruned Adapters. Fig. 8a shows RoBERTa's performance as we systematically prune the weights of the Pfeiffer adapter at 40% to 100% sparsity. For a single task at sparsity level $d\%$, we retain only the largest, i.e., top-$d\%$, influential parameters of the corresponding adapter, and report the average in-domain performance across 13 domains. Remarkably, pruning up to 90% of the adapter weights does not lead to performance degradation. This observation suggests that redundancy in the adapters' parameters may contribute to the increase in the fraction of weight direction conflicts, i.e., a high FSD, when merging them (Fig. 6).

Generalizability of Mixed Pruned Adapters. Motivated by the observation that a pruned adapter can still maintain its original performance at up to 90% sparsity (Fig. 8a), we hypothesize that mixing sparse adapters may indirectly reduce the fraction of weight sign conflicts and therefore lead to performance competitive with the original adapters. Given a set of domain-specific adapters, based on the FSD, we choose the top $k$ layers with the highest fraction of weight sign difference and prune each such layer up to 90% sparsity. Then we mix these adapters by weight averaging. Details of mixing domain-specific adapters with weight sign conflict information are shown in Alg. 4 (Appendix A.8). Fig. 8b shows that mixing adapters at 80% or 90% sparsity consistently achieves better performance than the empirical upper-bound accuracy achieved when mixing their dense versions.
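
A rough, self-contained sketch of this prune-then-mix procedure (cf. Alg. 4 in Appendix A.8). The `layer_fsd` dictionary (layer name to FSD) and the exact layer-selection details are our assumptions for illustration:

```python
import torch

def mix_sparse_adapters(adapter_sds, layer_fsd, top_k, sparsity=0.9):
    """Magnitude-prune the top_k layers with the highest FSD in every adapter,
    then average the sparse adapters in weight space (a sketch of Alg. 4)."""
    worst_layers = sorted(layer_fsd, key=layer_fsd.get, reverse=True)[:top_k]
    sparse_sds = []
    for sd in adapter_sds:
        sd = dict(sd)  # shallow copy so the original adapters stay intact
        for name in worst_layers:
            w = sd[name]
            thr = torch.quantile(w.abs().flatten().float(), sparsity)
            sd[name] = w * (w.abs() >= thr).to(w.dtype)  # keep only large weights
        sparse_sds.append(sd)
    k = len(sparse_sds)
    return {name: sum(sd[name] for sd in sparse_sds) / k for name in sparse_sds[0]}
```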

8 Conclusion

This work provides a comprehensive empirical in-domain evaluation of the emerging mechanism of mixing domain-specific adapters. We also provide insights into the inner workings of the mixture of domain-specific adapters by analyzing their weight signs, yielding critical observations on the negative correlation between the fraction of sign difference among adapters and their mixtures' generalizability. By examining the signed directions of adapter weights, we also offer the readers valuable advice on the optimal selection of adapters to mix to achieve competitive performance. Such examination also helps enhance our understanding of the interconnected role of weight sign difference in the context of sparse neural networks.

Limitation

Primarily, our exploration focused solely on one classic pruning method, namely magnitude pruning (Sanh et al., 2020), while there exist more advanced pruning techniques, such as SynFlow (Tanaka et al., 2020) and GraSP (Wang et al., 2020), that are also applicable for condensing neural network architectures. Consequently, future work includes investigating the applicability of our findings to these alternative pruning approaches. Furthermore, our examination was confined to natural language understanding tasks. A valuable avenue for future research would be to extend our analysis to text generation tasks, particularly within the context of current transformer-based language models, including but not limited to machine translation tasks using GPT-family models.

References

  • Altakrori et al. (2021) Malik Altakrori, Jackie Chi Kit Cheung, and Benjamin C. M. Fung. 2021. The topic confusion task: A novel evaluation scenario for authorship attribution. In EMNLP.
  • Bentivogli et al. (2009) Luisa Bentivogli, Peter Clark, Ido Dagan, and Danilo Giampiccolo. 2009. The fifth PASCAL recognizing textual entailment challenge. In TAC.
  • Blei et al. (2003) David M. Blei, Andrew Y. Ng, and Michael I. Jordan. 2003. Latent Dirichlet allocation. In Journal of Machine Learning Research.
  • Cer et al. (2018) Daniel Cer, Yinfei Yang, Sheng-yi Kong, Nan Hua, Nicole Limtiaco, Rhomni St. John, Noah Constant, Mario Guajardo-Cespedes, Steve Yuan, Chris Tar, et al. 2018. Universal sentence encoder for English. In EMNLP.
  • Chronopoulou et al. (2023) Alexandra Chronopoulou, Matthew E. Peters, Alexander Fraser, and Jesse Dodge. 2023. AdapterSoup: Weight averaging to improve generalization of pre-trained language models. In EACL.
  • Dernoncourt and Lee (2017) Franck Dernoncourt and Ji Young Lee. 2017. PubMed 200k RCT: a dataset for sequential sentence classification in medical abstracts. In Proceedings of the Eighth International Joint Conference on Natural Language Processing.
  • Devlin et al. (2019a) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019a. BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL.
  • Devlin et al. (2019b) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019b. BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL.
  • Diao et al. (2023) Shizhe Diao, Tianyang Xu, Ruijia Xu, Jiawei Wang, and Tong Zhang. 2023. Mixture-of-domain-adapters: Decoupling and injecting domain knowledge to pre-trained language models memories. In ACL.
  • Dolan and Brockett (2005) Bill Dolan and Chris Brockett. 2005. Automatically constructing a corpus of sentential paraphrases. In Third International Workshop on Paraphrasing.
  • Frankle et al. (2021) Jonathan Frankle, Gintare Karolina Dziugaite, Daniel M. Roy, and Michael Carbin. 2021. Pruning neural networks at initialization: Why are we missing the mark? In ICLR.
  • Goodfellow et al. (2015) Ian J. Goodfellow, Jonathon Shlens, and Christian Szegedy. 2015. Explaining and harnessing adversarial examples. In ICLR.
  • Han et al. (2016) Song Han, Huizi Mao, and William J. Dally. 2016. Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding. In ICLR.
  • He et al. (2022) Shwai He, Liang Ding, Daize Dong, Jeremy Zhang, and Dacheng Tao. 2022. SparseAdapter: An easy approach for improving the parameter-efficiency of adapters. In EMNLP.
  • Houlsby et al. (2019) Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin De Laroussilhe, Andrea Gesmundo, Mona Attariyan, and Sylvain Gelly. 2019. Parameter-efficient transfer learning for NLP. In ICLR.
  • Hu et al. (2022) Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. 2022. LoRA: Low-rank adaptation of large language models. In ICLR.
  • Iyer et al. (2017) Shankar Iyer, Nikhil Dandekar, Kornél Csernai, et al. 2017. First Quora dataset release: Question pairs. In data.quora.com.
  • Jin et al. (2020) Di Jin, Zhijing Jin, Joey Tianyi Zhou, and Peter Szolovits. 2020. Is BERT really robust? A strong baseline for natural language attack on text classification and entailment. In AAAI.
  • Jin et al. (2023a) Xisen Jin, Xiang Ren, Daniel Preotiuc-Pietro, and Pengxiang Cheng. 2023a. Dataless knowledge fusion by merging weights of language models. In ICLR.
  • Jin et al. (2023b) Xisen Jin, Xiang Ren, Daniel Preotiuc-Pietro, and Pengxiang Cheng. 2023b. Dataless knowledge fusion by merging weights of language models. In ICLR.
  • Lazarevich et al. (2021) Ivan Lazarevich, Alexander Kozlov, and Nikita Malinin. 2021. Post-training deep neural network pruning via layer-wise calibration. In ICCV.
  • Li et al. (2022) Margaret Li, Suchin Gururangan, Tim Dettmers, Mike Lewis, Tim Althoff, Noah A. Smith, and Luke Zettlemoyer. 2022. Branch-Train-Merge: Embarrassingly parallel training of expert language models. In arXiv.
  • Li and Liang (2021) Xiang Lisa Li and Percy Liang. 2021. Prefix-tuning: Optimizing continuous prompts for generation. In ACL.
  • Liu et al. (2019) Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. In arXiv.
  • Malo et al. (2014) Pekka Malo, Ankur Sinha, Pekka Korhonen, Jyrki Wallenius, and Pyry Takala. 2014. Good debt or bad debt: Detecting semantic orientations in economic texts. In Journal of the Association for Information Science and Technology.
  • Matena and Raffel (2022) Michael Matena and Colin Raffel. 2022. Merging models with Fisher-weighted averaging. In NeurIPS.
  • Meng et al. (2022) Kevin Meng, David Bau, Alex Andonian, and Yonatan Belinkov. 2022. Locating and editing factual associations in GPT. In NeurIPS.
  • Peters et al. (2018) Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In NAACL.
  • Pfeiffer et al. (2021) Jonas Pfeiffer, Aishwarya Kamath, Andreas Rücklé, Kyunghyun Cho, and Iryna Gurevych. 2021. AdapterFusion: Non-destructive task composition for transfer learning. In EACL.
  • Pfeiffer et al. (2020) Jonas Pfeiffer, Ivan Vulić, Iryna Gurevych, and Sebastian Ruder. 2020. MAD-X: An adapter-based framework for multi-task cross-lingual transfer. In EMNLP.
  • Rajpurkar et al. (2016) Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. In EMNLP.
  • Sanh et al. (2020) Victor Sanh, Thomas Wolf, and Alexander Rush. 2020. Movement pruning: Adaptive sparsity by fine-tuning. In NeurIPS.
  • Socher et al. (2013) Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D. Manning, Andrew Y. Ng, and Christopher Potts. 2013. Recursive deep models for semantic compositionality over a sentiment treebank. In EMNLP.
  • Tanaka et al. (2020) Hidenori Tanaka, Daniel Kunin, Daniel L. Yamins, and Surya Ganguli. 2020. Pruning neural networks without any data by iteratively conserving synaptic flow. In NeurIPS.
  • Wang et al. (2020) Chaoqi Wang, Guodong Zhang, and Roger Grosse. 2020. Picking winning tickets before training by preserving gradient flow. In ICLR.
  • Wang et al. (2021a) Ruize Wang, Duyu Tang, Nan Duan, Zhongyu Wei, Xuanjing Huang, Guihong Cao, Daxin Jiang, Ming Zhou, et al. 2021a. K-Adapter: Infusing knowledge into pre-trained models with adapters. In ACL.
  • Wang et al. (2021b) Xinyi Wang, Yulia Tsvetkov, Sebastian Ruder, and Graham Neubig. 2021b. Efficient test time adapter ensembling for low-resource language varieties. In EMNLP.
  • Wang et al. (2022) Yaqing Wang, Sahaj Agarwal, Subhabrata Mukherjee, Xiaodong Liu, Jing Gao, Ahmed Hassan Awadallah, and Jianfeng Gao. 2022. AdaMix: Mixture-of-adaptations for parameter-efficient model tuning. In EMNLP.
  • Williams et al. (2018) Adina Williams, Nikita Nangia, and Samuel Bowman. 2018. A broad-coverage challenge corpus for sentence understanding through inference. In NAACL.

Appendix A

Number of training and test examples per dataset:

Dataset     Train     Test
mnli        392,702   9,815
mrpc        3,668     408
sst2        67,349    872
rte         2,490     277
qnli        104,743   5,463
qqp         363,846   40,430
rct         178,882   30,135
ag          120,000   7,600
authorship  2,743     686
financial   4,846     484
imdb        22,500    2,500
tweets      31,962    3,196
wiki        127,656   63,978
[Figure 9]
Table 3: Linguistic statistics of the evaluation datasets.

Data Source   Avg. Document Length   Avg. Sentence Length   Avg. # Sentences per Document
MNLI          15.1                   14.7                   1.0
MRPC          21.9                   21.1                   1.0
QNLI          18.2                   18.0                   1.0
QQP           11.1                   9.9                    1.2
RTE           26.2                   18.1                   1.4
SST           10.4                   10.4                   1.0
RCT           26.5                   26.3                   1.0
Ag-news       38.4                   29.1                   1.3
Authorship    1038.6                 20.2                   51.3
Financial     23.1                   22.8                   1.0
IMDB          233.8                  21.6                   10.8
Tweets        15.9                   9.6                    1.6
Wiki-toxic    67.8                   15.4                   4.4

Algorithm 3: Computing the Fraction of Sign Difference (FSD)

Input: two adapters with the same architecture learned from two domains, $\theta_A$ and $\theta_B$.
Output: fraction $f$ of weight sign difference.

1: Compute the total number of parameters $s$ in each adapter ($s = s_A = s_B$)
2: $C \leftarrow \{\}$
3: For every layer in $\theta_A, \theta_B$, compute the element-wise product:
4: for $k, v$ in $\theta_A$.items() do: $C[k] = \mathrm{mul}(\theta_A[k], \theta_B[k])$
5: Count the total number of entries whose value is smaller than 0:
6: for $k, v$ in $C$.items() do: $counter \mathrel{+}= \mathrm{sum}(value < 0)$
7: $f = counter / s$
8: return $f$

A.1 Datasets.

Diverse Knowledge Datasets.

To simulate knowledge diversity, we gather a total of 13 distinct and diverse domain-specific datasets of classification tasks for evaluation. They are MNLI (Williams et al., 2018), QNLI (Rajpurkar et al., 2016), RTE (Bentivogli et al., 2009), MRPC (Dolan and Brockett, 2005), QQP (Iyer et al., 2017), and SST2 (Socher et al., 2013) from the GLUE corpus; the PubMed-20K RCT dataset (Dernoncourt and Lee, 2017) from the biology domain for sentence classification; the IMDB dataset from the movie review domain; Ag News, Financial (Malo et al., 2014), and Guardian Authorship (Altakrori et al., 2021), which are news-domain datasets across World, Sports, Business, Science/Technology, and Financial topics; and Wiki Toxic (https://www.kaggle.com/competitions/jigsaw-toxic-comment-classification-challenge/) and Tweets Hate Speech, two informal-text domain datasets for toxicity detection.

Linguistic Statistics.

Table 3 shows detailed statistics such as the average document length, average sentence length, and average number of sentences per document.

A.2 Topic distribution of training datasets

Tables 4 to 16 show 10 topics and the corresponding important words extracted via LDA for each training dataset. Notably, Ag News and SST2 have high cosine similarity but exhibit a large difference in topic distribution compared to other domains (Fig. 2). Therefore, each statistical mechanism, such as cosine similarity or topic distribution, reflects only one aspect of the data distribution and may show inconsistencies with the others.
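
The paper does not specify the LDA implementation; a minimal sketch using scikit-learn, with the vocabulary size and stop-word handling as our assumptions, which produces per-topic word lists of the kind shown below:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

def top_topic_words(docs, n_topics=10, n_words=15):
    """Fit a 10-topic LDA model on one training set and list the most
    important words per topic."""
    vectorizer = CountVectorizer(stop_words="english", max_features=20000)
    X = vectorizer.fit_transform(docs)
    lda = LatentDirichletAllocation(n_components=n_topics, random_state=0).fit(X)
    vocab = vectorizer.get_feature_names_out()
    return [[vocab[j] for j in topic.argsort()[::-1][:n_words]]
            for topic in lda.components_]
```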

LDA topics for MNLI
1: well, time, got, take, one, much, day, something, ive, even, way, long, little, make, back
2: kind, system, though, come, went, well, today, view, church, including, president, seems, across, run, policy
3: say, get, cost, guess, were, business, car, local, whole, north, rather, getting, question, technology, capital
4: service, state, world, get, big, pretty, give, war, yes, standard, real, here, came, call
5: probably, high, thought, however, set, hand, enough, said, since, type, jon, yet, and, service
6: could, mean, around, part, another, change, percent, made, course, life, book, fact, name, room
7: government, program, federal, information, country, problem, le, new, national, may, number, agency, report, organization
8: year, two, house, case, old, three, town, street, century, one, city, study, man, four, different
9: know, like, think, thats, right, really, people, thing, good, go, one, lot, going
10: yeah, work, legal, rule, last, year, he, american, small, home, company, act, group, analysis, public

LDA topics for MRPC
1: said, court, company, would, official, statement, decision, made, state, appeal, two, board
2: said, year, people, president, program, time, million, two, last, house, official, weapon
3: said, million, would, state, period, compared, men, get, democratic, plan, company, united, also, could
4: percent, share, cent, million, stock, point, nasdaq, billion, new, index, trading, rose, per, year
5: said, also, state, iraq, center, united, attack, hospital, killed, war, three, american, people
6: said, two, home, police, told, state, friday, last, year, federal, company, yesterday, national
7: standard, poor, index, chief, point, said, percent, justice, one, spx, broader, executive, three
8: said, analyst, expected, street, many, suit, call, yesterday, angeles, wall, los, research, one, change, according
9: case, said, court, filed, death, also, charged, lawsuit, charge, state, found, reported, office, cancer
10: said, would, server, window, network, one, new, microsoft, also, taken, people, company

LDA topics for QNLI
1: city, american, south, large, west, season, de, roman, service, art, london, first, located, street, new
2: state, united, new, including, people, city, national, million, school, north, government, army, many, within, building
3: also, system, later, early, used, based, part, control, four, use, death, official, known, act, called
4: group, language, east, among, found, common, company, india, federal, movement, population, early, included, production, range
5: the, church, term, example, university, greek, german, like, english, specie, god, word, per, old, one
6: form, although, following, law, central, rule, culture, without, often, modern, territory, society, treaty, considered, christian
7: war, world, british, life, development, empire, first, region, community, year, france, though, time, set, began
8: well, three, include, place, power, party, league, may, needed, right, one, political, club, a, event
9: became, first, time, john, film, president, number, year, french, one, day, land, america, process, le
10: century, music, around, house, home, period, age, record, late, established, several, standard, time, world, river

LDA topics for QQP
1: like, become, feel, get, job, movie, good, student, want, engineering, girl, website, sex, study, go
2: best, way, difference, learn, whats, money, make, online, book, india, buy, start, good, language, programming
3: much, best, time, weight, year, lose, old, place, month, day, iphone, read, possible, class
4: thing, day, business, get, first, going, example, one, prepare, video, woman, word, men
5: work, note, india, indian, ever, computer, black, r, science, you, help, rupee, different
6: would, life, trump, world, country, new, donald, war, india, win, happen, president, clinton, hillary
7: get, friend, used, long, why, bad, back, see, take, cant, good, facebook, system, relationship, person
8: someone, love, english, one, know, improve, account, people, get, instagram, tell, average, hair, password
9: mean, app, song, name, android, give, bank, right, what, company, india, working, get, now, create
10: people, quora, question, think, do, me, answer, google, stop, use, state, get, many, live

LDA topics for RTE
1: year, bank, world, ago, police, place, human, people, said, man, problem, game, many, took, explosion
2: people, attack, california, killed, life, united, day, lost, air, one, space, injured, national, capital, said
3: oil, said, nuclear, company, new, president, iran, million, military, john, un, country, bush, price
4: said, world, state, united, minister, country, million, people, nobel, south, peace, war, trade, prize, mexico
5: woman, corp, parliament, case, confirmed, said, rabies, represented, cause, poorly, fire, president, police, loss
6: year, new, said, one, would, died, university, show, company, family, first, service, since, country, home
7: state, iraq, said, bush, bomb, found, used, water, home, killed, caused, damage, one, police
8: party, police, president, new, two, officer, name, drug, state, prime, people, minister, last, year, democratic
9: new, said, government, year, iraq, would, york, official, today, baghdad, also, euro, announced, percent, minister
10: said, year, leader, new, sanfrancisco, work, justice, two, president, government, end, free, guerrilla

LDA topics for SST2
1: film, really, enough, movie, something, make, interesting, many, like, subject, intelligent, laugh, short
2: movie, bad, film, better, great, fun, one, look, director, story, ultimately, smart, cinema, put
3: performance, funny, way, moment, film, cast, another, screen, yet, big, work, perfect, made
4: new, material, ve, movie, rather, film, special, seen, minute, enjoyable, might, offer, story, effect
5: comedy, drama, thriller, romantic, documentary, actor, moving, clever, funny, sometimes, pleasure, often, movie, film
6: work, film, movie, hard, well, keep, filmmaker, ever, life, original, sense, dull, quite, could
7: like, feel, movie, much, people, film, make, see, get, character, one, thing
8: good, real, film, worth, fascinating, make, time, lack, bit, amusing, humor, tale, pretty, run
9: character, one, best, film, movie, story, far, compelling, two, every, year, picture, little
10: love, audience, film, story, character, seems, entertainment, way, powerful, care, take, one, movie, spirit

LDA topics for RCT
1: group, patient, week, randomized, study, received, control, year, mg, randomly, placebo, day
2: patient, session, visit, cohort, failure, lesion, myocardial, hospital, twice, death, heart, infarction
3: analysis, using, data, model, used, test, sample, analyzed, regression, characteristic, time, collected, cell, method, performed
4: outcome, primary, month, patient, baseline, measure, score, secondary, treatment, scale, assessed, symptom, week, followup
5: risk, associated, level, weight, factor, effect, disease, body, increased, diabetes, insulin, high, glucose, change, activity
6: study, patient, treatment, effect, therapy, efficacy, may, effective, result, evaluate, safety, weather, clinical, outcome, intervention
7: group, difference, significant, significantly, compared, control, treatment, score, higher, lower, time, observed, rate
8: trial, study, randomized, intervention, care, health, controlled, clinical, quality, life, conducted, prospective, effectiveness, child, number
9: patient, event, surgery, adverse, postoperative, complication, procedure, pain, undergoing, surgical, rate, incidence, common, infection, injection
10: mean, respectively, ratio, patient, group, median, versus, interval, year, day, month

#Topic Tweets
1: new, get, here, music, home, cool, playing, free, want, fun, season, shop, update, reason
2: day, one, night, time, good, week, last, never, first, get, year, got, lot, today
3: day, father, love, happy, time, weekend, take, friday, dad, fathersday, model
4: want, bull, up, do, help, trump, whatever, direct, dominate, waiting, libtard, yet, sleep, post
5: thankful, need, good, positive, orlando, morning, city, tear, news, blessed, friend, dream, bing, yeah, bong
6: user, amp, day, see, cant, go, like, new, today, one, people, get, wait, make
7: birthday, like, positive, affirmation, happy, baby, amp, god, girl, woman, feel, hate, hot, you
8: love, work, life, happy, happiness, make, always, food, quote, smile, wedding, moment, right, feeling, music
9: healthy, blog, gold, silver, altwaystoheal, forex, healing, grateful, dog, buffalo, peace, really, story
10: love, me, smile, summer, beautiful, fun, cute, girl, selfie, friend, sun, instagood, beach, photo

#Topic IMDB
1: story, film, life, movie, character, one, love, time, people, see, way, family, would, well
2: movie, like, one, good, really, it, film, bad, see, even, time, would, make, get
3: get, one, man, the, go, woman, take, back, he, find, there, scene, two, girl
4: hamilton, gadget, arkin, scooby, talespin, stallion, smoothly, tenderness, shaggy, gil, inspector, keller, nevada, hopelessness
5: war, american, documentary, soldier, political, world, german, country, history, america, military, army, hitler
6: bollywood, indian, kapoor, khan, akshay, fi, amitabh, ramones, verhoeven, christina, sci, braveheart, kumar, chiller
7: film, one, the, scene, character, story, director, much, plot, well, even, work, time
8: film, role, performance, great, play, best, good, cast, one, actor, comedy, john
9: show, series, episode, year, tv, time, great, first, kid, dvd, one, funny, still, watch
10: match, matthau, luke, shakespeare, neil, bruce, scarface, boxing, hamlet, elvis, branagh, lucas, polanski

#Topic Ag News
1: palestinian, said, iraqi, killed, iraq, reuters, attack, baghdad, arafat, israeli, bomb, scored, force, city
2: win, world, first, point, coach, cup, lead, victory, team, second, no, champion, night, final
3: president, afp, said, minister, election, bush, leader, india, state, reuters, prime, united
4: reuters, oil, price, stock, new, search, dollar, google, market, york, rate, apple, share, record
5: court, drug, say, ap, could, may, new, year, eu, case, said, state, scientist, trial
6: space, nasa, canadian, dec, press, former, nba, williams, winter, houston, monday, arsenal, sunday
7: said, company, inc, million, deal, corp, billion, sale, year, percent, reuters, buy, business
8: microsoft, new, software, internet, service, system, computer, technology, phone, ibm, music, online, web, company
9: china, police, said, reuters, people, worker, british, government, official, party, japan, group, chinese
10: game, new, year, red, one, time, season, first, team, series, last, york

#Topic Financial
1: company, finnish, new, plant, finland, construction, order, line, contract, service, unit, production, investment
2: company, share, bank, said, also, capital, start, issue, term, financial, price, business, executive, dividend
3: eur, profit, sale, net, operating, million, period, quarter, compared, loss, year
4: finnish, said, today, million, company, first, helsinki, year
5: company, mobile, said, phone, nokia, solution, business, pretax, finland, network, product, group, store, customer
6: market, board, option, company, share, stock, director, member, concerning, meeting, general, bank, flow, chairman
7: share, company, group, lower, helsinki, stock, president, capital, holding, new, right
8: service, finland, customer, corporation, company, electronics, solution, industry, business, helsinki, ltd, group
9: company, expected, sale, said, people, production, paper, year, finland, plant, cut, staff, expects
10: euro, service, company, item, nokia, excluding, technology, business, mobile, device, market, product

#Topic Authorship
1: one, would, may, people, year, even, could, time, last, minister, public, police, many, blair, say
2: one, would, war, farmer, even, new, blair, bush, could, need, time, iraq, much, week
3: labour, new, people, government, tax, year, time, even, public, brown, blair, party, money
4: would, one, government, new, world, year, labour, much, state, blair, last, british
5: new, public, government, labour, people, year, one, would, may, way, time, make, right, life, need
6: people, time, public, said, even, government, lord, like, party, make, day
7: one, bush, american, world, year, right, war, child, people, british, state, new
8: people, one, child, like, time, family, get, year, burrell, may, still, even, much
9: would, one, blair, bush, war, nuclear, even, it, new, make, could, weapon, people, party
10: would, one, year, people, could, even, royal, like, woman, time, war, right, iraq

#Topic Wiki Toxic
1: page, talk, edit, please, user, edits, wikipedia, editor, comment, block, blocked, editing, discussion, thanks, stop
2: image, use, you, copyright, page, fair, picture, please, medium, wikipedia, see, template, deleted, file, photo
3: article, deletion, deleted, page, please, tag, may, speedy, notable, talk, guideline, subject, wikipedia, criterion, add
4: nigg*r, hate, bitchf*ck, fa*ggot, lol, class, rape, fat, asshole, mama, f*cker, hairy, ha, boymamas
5: like, know, get, people, it, think, you, want, one, time, go, thing, me, really
6: state, english, country, american, language, people, name, war, city, world, government, history, british, jew, group
7: f*ck, ass, suck, f*cking, sh*t, u, hi, c*nt, school, moron, go, bitch, shut, co*ck, dick
8: utc, year, new, game, redirect, song, old
9: page, wikipedia, talk, help, please, link, welcome, question, article, thank, thanks, like, name, best
10: article, one, would, source, also, think, section, fact, see, it, like, point, say, time, reference

A.3 Hyper-parameter

Training and evaluation datasets. To assess performance in out-of-distribution scenarios, we conduct evaluations on a diverse set of 13 datasets covering various topics, ranging from movie reviews, news, authorship, and healthcare to non-formal language text such as Wiki Toxic and Tweets. For datasets within the GLUE corpus, we employ the provided training and evaluation splits to gauge accuracy across different settings. For Ag News, Authorship, Financial, IMDB, Tweets, and Wiki Toxic, we partition the training set into three segments with an 8:1:1 ratio and use them as the training, evaluation, and test sets, respectively. This setup ensures a comprehensive evaluation of model performance across a wide spectrum of domains and linguistic styles. Table 2 shows the statistics of the train/test datasets.
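As an illustration of the 8:1:1 partition, a split along these lines can be obtained with the Hugging Face datasets library; the dataset name and seed below are illustrative assumptions rather than the paper's exact preprocessing script.

# A minimal sketch of the 8:1:1 train/eval/test split described above
# (assumption: data loaded via the Hugging Face `datasets` library;
# dataset name and seed are illustrative).
from datasets import load_dataset

raw = load_dataset("ag_news", split="train")

# Carve out 20% of the training data, then split that part half-and-half,
# which yields an 8:1:1 ratio overall.
split_1 = raw.train_test_split(test_size=0.2, seed=42)
split_2 = split_1["test"].train_test_split(test_size=0.5, seed=42)

train_set = split_1["train"]   # 80%
eval_set = split_2["train"]    # 10%
test_set = split_2["test"]     # 10%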

Setting on text adversarial attack. In this study, we employ two adversarial attack methods: TextFooler Jin et al. (2020) and FGSM Goodfellow et al. (2015).

TextFooler performs word-level attacks by replacing words in the text with synonyms or contextually similar words. By making ostensibly minor alterations to the input text, these attacks can deceive LLMs into producing incorrect outputs or substantially modifying their predictions. We fine-tune the hyperparameters of TextFooler to obtain more appropriate synonyms: we set the minimum embedding cosine similarity between a word and its synonyms to 0.8 and the minimum Universal Sentence Encoder similarity to 0.84.
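The two similarity thresholds act as filters on candidate substitutions, roughly as in the sketch below; the word-embedding and sentence-encoder helpers are hypothetical stand-ins for the actual attack implementation and are passed in as arguments.

# A minimal sketch of the two TextFooler constraints mentioned above
# (assumption: `word_vec` returns a counter-fitted word embedding and
# `use_encode` returns a Universal Sentence Encoder embedding; both are
# hypothetical helpers, not the paper's code).
import numpy as np

MIN_WORD_COS_SIM = 0.8   # word vs. candidate synonym
MIN_USE_SIM = 0.84       # original vs. perturbed sentence

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def candidate_is_valid(word, synonym, orig_sentence, perturbed_sentence,
                       word_vec, use_encode) -> bool:
    # Reject synonyms that drift too far from the original word ...
    if cosine(word_vec(word), word_vec(synonym)) < MIN_WORD_COS_SIM:
        return False
    # ... or that change the overall sentence meaning too much.
    return cosine(use_encode(orig_sentence),
                  use_encode(perturbed_sentence)) >= MIN_USE_SIM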

FGSM Goodfellow et al. (2015) is a white-box, embedding-level attack. It uses the Fast Gradient Sign Method to compute the gradient of the model's loss with respect to the input embeddings and generates an adversarial example by perturbing the embeddings of the input text in the direction that maximizes the loss. We set the magnitude of the perturbation in embedding space to 0.01 for both the BERT and RoBERTa models.
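A minimal sketch of such an embedding-level FGSM step is shown below, assuming a Hugging Face sequence-classification model; it illustrates the method rather than reproducing the paper's exact attack code.

# A minimal FGSM sketch on input embeddings (assumption: a Hugging Face
# sequence-classification model; epsilon matches the 0.01 used above).
import torch

def fgsm_on_embeddings(model, input_ids, attention_mask, labels, epsilon=0.01):
    embed_layer = model.get_input_embeddings()
    inputs_embeds = embed_layer(input_ids).detach().requires_grad_(True)

    # Forward pass with embeddings instead of token ids, then backprop the loss.
    loss = model(inputs_embeds=inputs_embeds,
                 attention_mask=attention_mask,
                 labels=labels).loss
    loss.backward()

    # Perturb each embedding in the direction that increases the loss.
    adv_embeds = inputs_embeds + epsilon * inputs_embeds.grad.sign()
    return adv_embeds.detach()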

Adapter Configuration. We use adapters with bottleneck dimensions of 64 and 256 for the RoBERTa-large and BERT-base encoders, respectively, following the setups of Houlsby et al. (2019) and Pfeiffer et al. (2021). With LoRA, we use rank r=4, following the setup of Hu et al. (2022).
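As an illustration only, the LoRA part of this configuration could be expressed with the Hugging Face peft library roughly as follows; the target modules, alpha, and dropout values are illustrative assumptions rather than the paper's exact settings.

# A minimal sketch of the LoRA configuration mentioned above (assumption:
# the Hugging Face `peft` library is used here purely for illustration).
from transformers import AutoModelForSequenceClassification
from peft import LoraConfig, get_peft_model

model = AutoModelForSequenceClassification.from_pretrained(
    "roberta-large", num_labels=2)

lora_config = LoraConfig(
    r=4,                               # rank r = 4, as in Hu et al. (2022)
    lora_alpha=16,                     # illustrative value
    lora_dropout=0.1,                  # illustrative value
    target_modules=["query", "value"], # attention projections in RoBERTa
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()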

Hardware Information. We evaluate model performance on Ubuntu 22.04.2 with an AMD Ryzen Threadripper PRO 5975WX (1800 MHz, 512 KB cache) and 4 Nvidia A6000 GPUs.

Hyper-Parameters. The detailed hyper-parameter configuration for the different tasks is presented in Table 17.

Task         Learning rate   Epochs   Batch size   Warmup   Weight decay   Adapter size
BERT-base
MNLI         4e-4            20       32           0.06     0.1            256
MRPC         4e-4            5        32           0.06     0.1            256
QNLI         4e-4            20       32           0.06     0.1            256
QQP          4e-4            20       32           0.06     0.1            256
RCT          4e-4            20       32           0.06     0.1            256
RTE          4e-4            5        32           0.06     0.1            256
SST2         4e-4            10       32           0.06     0.1            256
Tweets       4e-4            5        32           0.06     0.1            256
IMDB         4e-4            5        32           0.06     0.1            256
Ag News      4e-4            20       32           0.06     0.1            256
Financial    4e-4            5        32           0.06     0.1            256
Authorship   4e-4            5        32           0.06     0.1            256
RoBERTa-large
MNLI         3e-4            20       64           0.6      0.1            64
MRPC         3e-4            5        64           0.6      0.1            64
QNLI         3e-4            20       64           0.6      0.1            64
QQP          3e-4            20       64           0.6      0.1            64
RCT          3e-4            20       64           0.6      0.1            64
RTE          3e-4            5        64           0.6      0.1            64
SST2         3e-4            10       64           0.6      0.1            64
Tweets       3e-4            5        64           0.6      0.1            64
IMDB         3e-4            5        64           0.6      0.1            64
Ag News      3e-4            20       64           0.6      0.1            64
Financial    3e-4            5        64           0.6      0.1            64
Authorship   3e-4            5        64           0.6      0.1            64
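For example, the BERT-base MNLI row of this table could be expressed with Hugging Face TrainingArguments as in the sketch below; the output directory is a hypothetical placeholder, and the adapter size is configured in the adapter/LoRA config rather than here.

# A minimal sketch mapping one table row (BERT-base, MNLI) to Hugging Face
# TrainingArguments (assumption: the paper's training loop is compatible
# with the Trainer API; the output path is hypothetical).
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="outputs/bert-base-mnli-adapter",  # hypothetical path
    learning_rate=4e-4,
    num_train_epochs=20,
    per_device_train_batch_size=32,
    warmup_ratio=0.06,
    weight_decay=0.1,
)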

A.4 Model performance when mixing adapters across tasks

Tables 10, 11, 12 and 13 show task accuracy when mixing multiple adapters.


A.5 Fraction of weight sign difference

Alg. 3 presents a detailed algorithm for computing the FSD (fraction of weight sign difference).
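As a companion to Alg. 3, the following is a minimal PyTorch sketch of the pairwise FSD computation; the helper name and the assumption that each adapter is given as a state dict with matching keys are ours, not the paper's.

# A minimal sketch of the FSD between two adapters (assumption: adapters are
# given as state dicts with identical keys and tensor shapes).
import torch

def fraction_sign_difference(adapter_a: dict, adapter_b: dict) -> float:
    diff, total = 0, 0
    for name, w_a in adapter_a.items():
        w_b = adapter_b[name]
        # Count positions where the two adapters disagree in weight sign.
        diff += (torch.sign(w_a) != torch.sign(w_b)).sum().item()
        total += w_a.numel()
    # Normalize by the total number of adapter weights.
    return diff / total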

A.6 Additional result on weight sign difference

Fig. 9 shows the weight sign difference between adapters, normalized by the total number of adapter weights, in BERT.

A.7 Additional results when mixing two adapters

Figs. 14 and 15 show model generalization when mixing two and multiple adapters across various tasks.


A.8 Mixing Sparse Adapters with weight sign difference

Alg. 4 shows the details of mixing sparse adapters using weight sign conflict information.

Input: kπ‘˜kitalic_k domain-specific adapters, the FSD matrix 𝐒kΓ—ksubscriptπ’π‘˜π‘˜\mathbf{S}_{k\times k}bold_S start_POSTSUBSCRIPT italic_k Γ— italic_k end_POSTSUBSCRIPT, l𝑙litalic_l number of mix adapters.

Output: Average(π—Œπ—‰π–Ίπ—‹π—Œπ–Ύβ’_β’π–Όπ–Ίπ—‡π–½π—‚π–½π–Ίπ—π–Ύπ—Œπ—Œπ—‰π–Ίπ—‹π—Œπ–Ύ_π–Όπ–Ίπ—‡π–½π—‚π–½π–Ίπ—π–Ύπ—Œ\mathsf{sparse\_candidates}sansserif_sparse _ sansserif_candidates)

1:π–½π–Ύπ—‡π—Œπ–Ύβ’_β’π–Όπ–Ίπ—‡π–½π—‚π–½π–Ίπ—π–Ύπ—Œβ†{}β†π–½π–Ύπ—‡π—Œπ–Ύ_π–Όπ–Ίπ—‡π–½π—‚π–½π–Ίπ—π–Ύπ—Œ\mathsf{dense\_candidates}\leftarrow\{\}sansserif_dense _ sansserif_candidates ← { }

2:Compute the average of the difference in weight

3:sign: a⁒v⁒e⁒r⁒a⁒g⁒eS=m⁒e⁒a⁒n⁒(S,a⁒x⁒i⁒s=1)π‘Žπ‘£π‘’π‘Ÿπ‘Žπ‘”subscriptπ‘’π‘†π‘šπ‘’π‘Žπ‘›π‘†π‘Žπ‘₯𝑖𝑠1average_{S}=mean(S,axis=1)italic_a italic_v italic_e italic_r italic_a italic_g italic_e start_POSTSUBSCRIPT italic_S end_POSTSUBSCRIPT = italic_m italic_e italic_a italic_n ( italic_S , italic_a italic_x italic_i italic_s = 1 )

4:Select the top l𝑙litalic_l smallest adapters (𝐒lsubscript𝐒𝑙\mathbf{S}_{l}bold_S start_POSTSUBSCRIPT italic_l end_POSTSUBSCRIPT) to mix based

5:on the average weight sign difference

6:π–½π–Ύπ—‡π—Œπ–Ύβ’_β’π–Όπ–Ίπ—‡π–½π—‚π–½π–Ίπ—π–Ύπ—Œβ†π’lβ†π–½π–Ύπ—‡π—Œπ–Ύ_π–Όπ–Ίπ—‡π–½π—‚π–½π–Ίπ—π–Ύπ—Œsubscript𝐒𝑙{\mathsf{dense\_candidates}\leftarrow\mathbf{S}_{l}}sansserif_dense _ sansserif_candidates ← bold_S start_POSTSUBSCRIPT italic_l end_POSTSUBSCRIPT

7:For each adapter, compute the average fraction of

8:weight sign different in each layer with corresponding layers from other adapters.

9:Get the top mπ‘šmitalic_m layer with the highest fraction of

10:weight sign conflict to prune

11:π—Œπ—‰π–Ίπ—‹π—Œπ–Ύβ’_β’π–Όπ–Ίπ—‡π–½π—‚π–½π–Ίπ—π–Ύπ—Œβ†Prune⁑(π–½π–Ύπ—‡π—Œπ–Ύβ’_β’π–Όπ–Ίπ—‡π–½π—‚π–½π–Ίπ—π–Ύπ—Œ)β†π—Œπ—‰π–Ίπ—‹π—Œπ–Ύ_π–Όπ–Ίπ—‡π–½π—‚π–½π–Ίπ—π–Ύπ—ŒPruneπ–½π–Ύπ—‡π—Œπ–Ύ_π–Όπ–Ίπ—‡π–½π—‚π–½π–Ίπ—π–Ύπ—Œ\mathsf{sparse\_candidates}\leftarrow\operatorname{Prune}(\mathsf{dense\_%candidates})sansserif_sparse _ sansserif_candidates ← roman_Prune ( sansserif_dense _ sansserif_candidates )

12:return average⁑(π—Œπ—‰π–Ίπ—‹π—Œπ–Ύβ’_β’π–Όπ–Ίπ—‡π–½π—‚π–½π–Ίπ—π–Ύπ—Œ)averageπ—Œπ—‰π–Ίπ—‹π—Œπ–Ύ_π–Όπ–Ίπ—‡π–½π—‚π–½π–Ίπ—π–Ύπ—Œ\operatorname{average}(\operatorname{\mathsf{sparse\_candidates}})roman_average ( start_OPFUNCTION sansserif_sparse _ sansserif_candidates end_OPFUNCTION )
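For illustration, the following is a minimal PyTorch sketch of Alg. 4 under our own assumptions: each adapter is a state dict with identical keys, the FSD matrix is a k × k tensor, and "pruning" a layer is interpreted here as zeroing it out before averaging; the paper's implementation may differ in these details.

# A minimal sketch of Alg. 4 (assumptions as stated in the text above).
import torch

def layer_fsd(adapters, name):
    # Average pairwise fraction of sign conflicts for one layer `name`.
    vals = []
    for i in range(len(adapters)):
        for j in range(i + 1, len(adapters)):
            w_i, w_j = adapters[i][name], adapters[j][name]
            vals.append((torch.sign(w_i) != torch.sign(w_j)).float().mean().item())
    return sum(vals) / len(vals)

def mix_sparse_adapters(adapters, fsd_matrix, l, m):
    # Step 1: pick the l adapters with the smallest average FSD against the others.
    avg_fsd = fsd_matrix.mean(dim=1)
    dense_idx = torch.argsort(avg_fsd)[:l].tolist()
    dense_candidates = [adapters[i] for i in dense_idx]

    # Step 2: find the m layers with the highest sign conflict among the selected adapters.
    layer_names = list(dense_candidates[0].keys())
    conflicts = {name: layer_fsd(dense_candidates, name) for name in layer_names}
    prune_layers = set(sorted(conflicts, key=conflicts.get, reverse=True)[:m])

    # Step 3: prune those layers (zeroed out here for simplicity), then average.
    mixed = {}
    for name in layer_names:
        avg = torch.stack([a[name] for a in dense_candidates]).mean(dim=0)
        mixed[name] = torch.zeros_like(avg) if name in prune_layers else avg
    return mixed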
