Model for Peanuts: Hijacking ML models without Training Access is Possible (2024)

Mahmoud Ghorbel, Université Polytechnique Hauts-de-France, Valenciennes, France (mahmoud.ghorbal@uphf.fr); Halima Bouzidi, Queen's University Belfast, Belfast, UK (h.bouzidi@qub.ac.uk); Ioan Marius Bilasco, Université de Lille, Lille, France (marius.bilasco@univ-lille.fr); and Ihsen Alouani, IEMN-CNRS-8520, INSA Hauts-de-France and Queen's University Belfast, Belfast, UK (i.alouani@qub.ac.uk)


Abstract.

The massive deployment of Machine Learning (ML) models has been accompanied by the emergence of several attacks that threaten their trustworthiness and raise ethical and societal concerns such as invasion of privacy, discrimination risks, lack of accountability, and wariness of unlawful surveillance. Model hijacking is one of these attacks, where the adversary aims to hijack a victim model to execute a different task than its original one. Model hijacking can cause accountability and security risks, since a hijacked model owner can be framed for having their model offer illegal or unethical services. Prior state-of-the-art works consider model hijacking as a training-time attack, whereby an adversary requires access to the ML model training to execute the attack. In this paper, we consider a stronger threat model where the attacker has no access to the training phase of the victim model. Our intuition is that ML models, typically over-parameterized, might (unintentionally) learn more than the task for which they are trained. We propose SnatchML, a simple approach for model hijacking at inference time: unknown input samples are classified by measuring their distance, in the victim model's latent space, to previously known samples associated with the hijacking task's classes. SnatchML empirically shows that benign pre-trained models can execute tasks that are semantically related to the initial task. Surprisingly, this can be true even for hijacking tasks unrelated to the original task. We also explore different methods to mitigate this risk. We first propose a novel approach we call meta-unlearning, designed to help the model unlearn a potentially malicious task while training on the original task dataset. We also provide insights on over-parametrization as one possible inherent factor that makes model hijacking easier, and we accordingly propose a compression-based countermeasure against this attack. We believe this work offers a previously overlooked perspective on model hijacking attacks, featuring a stronger threat model and higher applicability in real-life contexts.

Machine Learning, Security, Privacy, Model Hijacking


1. Introduction

Machine Learning models have demonstrated cutting-edge performance across a broad spectrum of applications, progressively expanding into domains with security-critical and privacy-sensitive implications, such as healthcare, financial sectors, transportation systems, and surveillance. However, as the massive adoption of ML models continues to rise, a variety of attacks with different threat models have emerged, which can jeopardize ML models' trustworthiness. For example, adversarial attacks (Carlini and Wagner, 2017; Huang et al., 2011; Goodfellow et al., 2014; Moosavi-Dezfooli et al., 2016; Venceslai et al., 2020; Guesmi et al., 2023) are popular inference-time attacks that compromise the security of the model by causing it to misclassify to the attacker's advantage. The necessity of large amounts of data and high computational resources at the training stage has introduced another attack surface against ML, where the adversary interferes with the model training. Such attacks are usually referred to as training-time attacks. Within this category, backdoor and data poisoning attacks are two of the most popular (Sun et al., 2019; Bagdasaryan et al., 2020; Biggio et al., 2012; Naseri et al., 2020).

Model hijacking. Recently, a new type of training-time attack known as model hijacking has been introduced (Si et al., 2023; Salem et al., 2022; Elsayed et al., 2018; Mallya and Lazebnik, 2018). In model hijacking attacks, the adversary aims to take control of a target model and repurpose it to perform a completely different task, referred to as the hijacking task. Salem et al. (Salem et al., 2022) proposed a model hijacking attack that covertly embeds a hijacking model within a victim model during its training. Elsayed et al. (Elsayed et al., 2018) proposed adversarial reprogramming in which, instead of creating adversarial instances, they craft inputs that trick the network into performing new tasks. In the same direction, Mallya and Lazebnik (Mallya and Lazebnik, 2018) proposed PackNet, which trains the model on multiple tasks. These attacks are conventionally executed through data poisoning, aiming to repurpose the victim model, trained for an original task, to perform a hijacking task without reducing the original task's utility.

In a model hijacking scenario, the attacker repurposes a model to perform a malicious task without the model owner’s knowledge or consent. This can cause accountability risks for the model owner; they can be framed as offering illegal or unethical services. For example, an adversary can repurpose a benign image classifier into a facial recognition model for illegal surveillance (e.g., on data powered by IoT bots) or for NSFW movies. Other possible unethical or illegal scenarios include hijacking a benign model to enable systematic discrimination based on gender, race, or age.

The threat model of existing hijacking attacks is similar to data poisoning/backdoor attacks, i.e., it requires access to the training process of the victim model. In this work, we consider an even more critical threat model for model hijacking where the attacker has limited access capabilities. Specifically, we consider a trusted model provider who (securely) trains a benign ML model for an original task. At training time, the adversary cannot access the training data or process. We consider the scenario where the attacker has access to the model at inference time only, i.e., after deployment. Under these assumptions, we ask the following question:

Can a benign ML model, securely trained for a legitimate original task, be hijacked at inference time to perform a different (potentially malicious) task, without any access to its training phase?

Figure 1. High-level illustration of the attack setting: the original task is emotion recognition and the hijacking task is biometric identification.

Considering this threat model, we propose SnatchML to hijack an ML model with inference-time-only access. SnatchML leverages the capacity of benign ML models, trained on clean datasets using a conventional training process, to acquire "extra knowledge" that allows inferring an extra (potentially malicious) task. Having access to the deployed model, the adversary utilizes this "benign extracted knowledge" to infer the hijacking task. Specifically, we analyze the use of either logits (in a black-box scenario) or feature vectors (in a white-box scenario) to classify unknown input samples, using distance measures in the latent space to previously known samples associated with the classes relevant to the hijacking task. The proposed approach is detailed in Section 4. Figure 1 gives a high-level illustration of the attack setting in a scenario where the original task is emotion recognition and the hijacking task is biometric identification.

To demonstrate our attack methodology, we analyze various scenarios. This analysis initially focuses on hijacking tasks semantically related to the original task. Specifically, we examine three original tasks where we consider a pre-trained model and demonstrate that, in each case, an attacker with restricted access can exploit the pre-trained model for a hijacking task that shares semantic overlap with the original task (Sections 5, 6, and 7).

While these attacks illustrate the hijacking risk under a stronger threat model than what is currently established in the state of the art, the prerequisite of relatedness between the original and hijacking tasks may limit the scope and impact of such an attack strategy. Therefore, in Section 8, we investigate the general case where the relatedness constraint is relaxed. Surprisingly, we found that SnatchML is capable of hijacking a deployed model for a task that is totally unrelated to the original one. We attempt to provide an explanation of these findings in Section 9. We hypothesize that the over-parametrization of ML models is a core reason behind this phenomenon: (i) it provides the capacity to learn clues useful for related tasks, and (ii) it results in a hyper-dimensional representation of data in the latent space, which acts akin to a random projection block, enabling the inference of unrelated tasks.

We contend that our work also has significant implications for risk-based regulatory frameworks, as it challenges some of their foundational assumptions. In fact, the debate on regulating AI-powered systems has gained global momentum, exemplified by initiatives like the European Commission's proposal for a risk-assessment framework known as the EU AI Act (Commission, [n. d.]). This framework seeks to identify and categorize potential security risks and safety implications based on the nature of the task for which the ML model is trained, i.e., the original task. The reliance on the learned task's criticality as a risk metric is inevitably tied to the following implicit hypothesis: "If an ML model is trained on a data distribution $\mathcal{D}$ to learn a task $\mathcal{T}$, it is unlikely to (unintentionally) learn another task $\mathcal{T}'$." Our work shows that it is not sufficient for a benign model to be securely trained on a legitimate original task to guarantee that it will be immune to repurposing for unethical or illegitimate tasks. Therefore, additional safeguards and assurances are necessary.

We propose two methods to mitigate SnatchML’s risk. We first propose meta-unlearning, which helps unlearn the potentially malicious task while learning the original task. The second defense method is based on our study of the over-parametrization’s impact on enabling model hijacking. We formulate an optimization problem to find the most compact model that preserves the accuracy of the original task while being less prone to hijacking attacks.

Contributions. In summary, our contributions are as follows:

  • We investigate the risk of ML hijacking attacks with a strong threat model where the attacker cannot access the training data/process. Specifically, we propose SnatchML to exploit the model’s unintentionally learned/extracted capabilities.

  • We illustrate our study with practical scenarios and show that models trained for benign tasks can be hijacked for unethical use, such as biometric identification, gender recognition, or ethnicity recognition. Surprisingly, we also find it possible to infer hijacking tasks totally unrelated to the original task.

  • We investigate over-parametrization as one of the potential fundamental reasons behind the capacity of ML models to break the least privilege principle during their training.

  • We propose two approaches that can be deployed to limit the risk of SnatchML: meta-unlearning, a novel meta-learning-based approach that helps the model unlearn a potentially malicious task while training on the original task, and model compression, which strictly limits the model's capacity to the original task.

This work illustrates new risks to ML models. We hope it will draw attention to the principle of least privilege as a core cybersecurity principle that needs to be considered for training and deploying trustworthy ML models.

Figure 2. Distributions of correlation coefficients between feature maps of models trained for Emotion Recognition and Face Recognition (Layers 5 and 11 of ResNet-9; Layers 2 and 4 of MobileNet).

2. Intuition and Preliminary Analysis

We posit that models might learn more than they should during training, including (partially) learning other tasks. Our intuition is that ML models, perhaps due to over-parametrization, have a by-design capacity that exceeds the minimum capacity necessary to learn the task for which they are trained. While it has been shown in the literature that models can overfit and memorize data, making them vulnerable to attacks such as membership inference (Ye et al., 2022) and property inference (Ganju et al., 2018), the risk that ML models (unintentionally) generalize to other tasks is yet to be explored and exploited. In this section, we provide a preliminary analysis to gain insights into the plausibility of this hypothesis. Consider a model $f^{(i)}(\cdot)$ trained on a data distribution $(X,Y)\sim\mathcal{D}$, using a loss function $\mathcal{L}_i$ relevant to a task $\mathcal{T}_i$. Let $f^{(j)}(\cdot)$ be a model with the same architecture as $f^{(i)}(\cdot)$, but trained on a data distribution $(X',Y')\sim\mathcal{D}'$, using a loss function $\mathcal{L}_j$ relevant to a task $\mathcal{T}_j\neq\mathcal{T}_i$. We ask whether the features learned by $f^{(i)}(\cdot)$ are correlated with those learned by $f^{(j)}(\cdot)$. We train two ResNet-9 and two MobileNet architectures, one of each for Emotion Recognition and one for Face Recognition. We then examine the distribution of correlation coefficients between the feature maps of corresponding layers across the two tasks. The correlation coefficient between two variables $X$ and $Y$, denoted $r$, is defined as:

(1) $r=\dfrac{\sum_{i=1}^{n}(X_i-\overline{X})(Y_i-\overline{Y})}{\sqrt{\sum_{i=1}^{n}(X_i-\overline{X})^{2}}\,\sqrt{\sum_{i=1}^{n}(Y_i-\overline{Y})^{2}}}$

where:

  • n𝑛nitalic_n is the number of data points,

  • Xisubscript𝑋𝑖X_{i}italic_X start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT and Yisubscript𝑌𝑖Y_{i}italic_Y start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT are the individual values of the variables X𝑋Xitalic_X and Y𝑌Yitalic_Y,

  • X¯¯𝑋\overline{X}over¯ start_ARG italic_X end_ARG and Y¯¯𝑌\overline{Y}over¯ start_ARG italic_Y end_ARG are the means of X𝑋Xitalic_X and Y𝑌Yitalic_Y respectively.
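For concreteness, here is a minimal sketch of this measurement, assuming two PyTorch models with identical architectures and a hypothetical `layer5` attribute; it is illustrative and not the paper's code.

```python
# A minimal sketch (not the paper's code): compare same-layer activations of two
# models with identical architectures trained on different tasks.
import torch

def layer_activations(model, layer, batch):
    """Capture the output of `layer` for a batch of inputs with a forward hook."""
    feats = []
    handle = layer.register_forward_hook(lambda m, inp, out: feats.append(out.detach()))
    with torch.no_grad():
        model(batch)
    handle.remove()
    return feats[0].flatten(start_dim=1)  # shape: (batch, num_features)

def pearson_r(x, y):
    """Pearson correlation coefficient r between two flattened activation vectors."""
    xc, yc = x - x.mean(), y - y.mean()
    return float((xc * yc).sum() / (xc.norm() * yc.norm() + 1e-12))

# Usage, assuming f_er and f_fr share the same architecture (e.g., ResNet-9)
# trained for emotion recognition and face recognition respectively:
# a = layer_activations(f_er, f_er.layer5, images)
# b = layer_activations(f_fr, f_fr.layer5, images)
# r_per_sample = [pearson_r(a[i], b[i]) for i in range(a.size(0))]
```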

Figure 2 shows the feature correlation distributions for Layers 5 and 11 of ResNet-9 and Layers 2 and 4 of MobileNet. Interestingly, Figure 2 shows a positive correlation, suggesting a potential overlap between what the model learns for tasks that share semantic clues. In other words, this preliminary observation supports the idea that models trained on an original task can extract features relevant to other tasks.

3. Threat Model

We consider a pre-trained ML model acquired from a trusted third party (vendor) and then deployed. The model is securely trained on clean data to execute a benign original task.

Attacker’s objective: This threat model focuses on scenarios where the attacker aims to repurpose the deployed ML model for potentially malicious tasks different from the original one. For example, an attacker wants to hijack a compliant pre-trained model for unethical or illegal activities without having access to its training.

Attacker’s Capabilities: (i) At training time, we assume that the model is trained to perform a benign task and that the attacker cannot interfere by any means with the model’s training phase, i.e., the adversary cannot poison the target model’s training dataset, in contrast to the assumptions of existing hijacking (Salem et al., 2022; Si et al., 2023) and poisoning (Biggio et al., 2012) attacks.

(ii) At inference time, we consider that the attacker can have access to the model under two different settings:
(A) Black-box: the attacker has access to the output logits of the pre-trained model, but not to its internal architecture. This setting is similar to conventional hijacking attacks and corresponds to the case where the model owner is the victim of the hijacking, e.g., the victim acquires a model from a trusted third party and deploys it as an ML-as-a-Service that can be queried through APIs that provide output logits.
(B) White-box: the attacker has access to the internal state of the model (e.g., feature maps). This case corresponds to the scenario where the model owner is the attacker: they can claim the ML model's compliance with regulations by referring to the trustworthy vendor, while using the model for potentially unethical tasks.

Moreover, we assume the attacker has access to a dataset related to the hijacking task. We also assume that this dataset may not be sufficient to train or fine-tune a model, with the sole constraint being that it contains at least one sample for each class involved in the hijacking task.

4. SnatchML: General Approach

Problem Statement – Given a model $h_\theta(\cdot)$ with parameters $\theta$, securely trained on a clean data distribution $D(X,Y)$ using a loss function $\mathcal{L}$ to perform a task $\mathcal{T}$, the objective is to leverage the hidden capabilities that the victim model $h_\theta(\cdot)$ acquired through its design and training process, and exploit the model to the adversary's advantage post-deployment. Specifically, we want to investigate the following questions:

  • Q1 – Can $h_\theta(\cdot)$ unintentionally learn information that allows an attacker to hijack it for another task $\mathcal{T}'\neq\mathcal{T}$?

  • Q2 – Ultimately, can $h_\theta(\cdot)$ still be used to infer a task $\mathcal{T}'$ that is totally unrelated to $\mathcal{T}$?

Proposed approach – Without loss of generality, let us suppose that the model $h_\theta(\cdot)$ is trained to perform a given multi-class classification (original) task $\mathcal{T}$ with $n$ classes. Given the threat model in Section 3, the attacker wants to use the model for another (hijacking) task $\mathcal{T}'$.

Definition 4.1.

(Benign Extracted Knowledge). We define the benign extracted knowledge (BEK) as metadata learned by a benign model from clean inputs. For a given model $h(\cdot)$ and an input sample $x$, we denote by $\zeta_h(\cdot)$ an operator that extracts BEK.

Given Definition 4.1, $\zeta_h(x)$ may correspond, in the black-box scenario, to the model's output logits vector, i.e., $\zeta_h(x)=Z_h(x)$ with $Z_h=\{z_i\}_{i=1..n}$, or, in the white-box scenario, to the learned feature tensor, i.e., $\zeta_h(x)=h_k(x)$, where $h_k(x)$ is the output of the $k^{th}$ layer of the model $h(\cdot)$.
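As a rough sketch, the BEK operator $\zeta_h$ could be implemented as follows in PyTorch; the hook-based white-box variant and the explicit layer argument are our own illustrative choices, not the paper's code.

```python
import torch

def bek_blackbox(model, x):
    """Black-box BEK: the output logits vector Z_h(x) of the deployed model."""
    with torch.no_grad():
        return model(x)

def bek_whitebox(model, x, layer):
    """White-box BEK: the k-th layer's feature tensor h_k(x), captured via a forward hook."""
    feats = []
    handle = layer.register_forward_hook(lambda m, inp, out: feats.append(out.detach()))
    with torch.no_grad():
        model(x)
    handle.remove()
    return feats[0].flatten(start_dim=1)
```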

The attacker wants to repurpose $h(\cdot)$ for a hijacking $m$-class classification task $\mathcal{T}'$ by post-processing $\zeta_h(\cdot)$. The attacker has access to a dataset $\mathcal{D}^{*}=\{(x^{*}_{i},\ell^{*}_{i}),\ i\in[1,m]\}$, which contains only $m$ data samples, each corresponding to a class of the hijacking task $\mathcal{T}'$. We propose a straightforward exploitation method based on distances in the BEK space. Given an input sample $x_s$ and the corresponding BEK vector $\zeta_h(x_s)$ obtained after inference with $h(\cdot)$, the classification of $x_s$ for $\mathcal{T}'$ is inferred as follows:

(2) $y_{\mathcal{T}'}(x_s)=\ell^{*}_{k}$, s.t.
(3) $k=\underset{i\in[1,m]}{\mathrm{Argmin}}\Big[\delta\big(\zeta_h(x^{*}_{i}),\zeta_h(x_s)\big)\Big]$

where $\delta$ is a distance metric such as $\ell_2$ or the cosine similarity.

In essence, SnatchML hijacks the model at inference time by exploiting its capability to distinguish patterns in a data distribution related to a task $\mathcal{T}'$ that was not included in its training.
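A minimal sketch of the decision rule in Equations (2)-(3) is given below, assuming the BEK vectors of the $m$ reference samples and of the query have already been extracted (e.g., with the $\zeta_h$ helpers above); function and variable names are ours.

```python
import torch
import torch.nn.functional as F

def snatch_predict(bek_query, bek_refs, ref_labels, metric="l2"):
    """Assign the hijacking-task label of the closest reference sample in BEK space.

    bek_query:  (d,)   BEK vector of the unknown sample x_s
    bek_refs:   (m, d) BEK vectors of the m reference samples (one per hijacking class)
    ref_labels: list of m hijacking-task labels
    """
    if metric == "l2":
        dists = torch.cdist(bek_query.unsqueeze(0), bek_refs).squeeze(0)   # (m,)
        k = int(torch.argmin(dists))
    else:  # cosine similarity: larger means closer
        sims = F.cosine_similarity(bek_query.unsqueeze(0), bek_refs, dim=1)  # (m,)
        k = int(torch.argmax(sims))
    return ref_labels[k]
```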

Remark. It is worth noting that the proposed approach imposes no a priori constraints on $m$. In state-of-the-art attacks such as (Salem et al., 2022), the number of classes inferred through the hijacking task cannot exceed the number of classes in the original task, i.e., $m\leq n$. Our approach therefore not only assumes a stronger threat model but also overcomes this fundamental limitation, enabling the attacker to hijack an $n$-class model for an $m$-class task with $m>n$. We illustrate this case in Section 5. Moreover, since there is no interference with the model during training, the same model can be hijacked to execute more than one task at the same time; this case is illustrated in Section 6.

In the following sections, we illustrate our approach in different scenarios where the attacker misuses a benign trained model for a different task. Sections 5, 6, and 7 correspond to cases where the hijacking task and the original task share relevant features. In Section 8, we investigate the general case where the hijacking task is unrelated to the original one.

5. Scenario 1: Emotion Recognition

5.1. Context

Let the original task $\mathcal{T}$ be Emotion Recognition (ER). We aim to explore the extent to which a model trained to recognize emotions can be hijacked at inference time for biometric identification, denoted as the hijacking task $\mathcal{T}'$. In practice, the adversary possesses a dataset of 'persons of interest' (the hijacking dataset) and intends to identify these individuals by exploiting the original model.

5.2. Setup

Datasets. To run our experiments, we need datasets that are labeled for both the original and the hijacking tasks. For this experiment, we use 4 different datasets covering both real and synthetic distributions:

(i) CK+: The Extended Cohn-Kanade (CK+) dataset (Lucey et al., 2010) for emotion recognition contains 593 video clips from 123 individuals between the ages of 18 and 50. These videos capture transitions between neutral and peak emotions: anger, disgust, fear, happiness, sadness, and surprise.

(ii) Olivetti faces:The Olivetti faces Dataset (Samaria and Harter, 1994) is a facial recognition dataset comprising 400 grayscale face images gathered from 40 unique individuals with ten images for each. The dataset captures variations in lighting and facial expression with a uniform black background. The dataset is labeled with encoded integers referring to the identities of the 40 individuals.

(iii) Celebrity dataset: We extracted 79 facial images of 9 different celebrities. Using BRIA AI (BRIA AI, [n. d.]), which is based on a visual generative tool (Elasri et al., 2022), we generated emotion-specific images from the neutral samples. These images are labeled for seven emotions (anger, disgust, fear, happiness, sadness, surprise, and neutral) and associated with 9 unique individuals used as identity labels.

(iv) Synthetic dataset: To cover the synthetic data distribution case, we create a new dataset labeled for emotions and identities using MakeHuman (Bastioni et al., 2008), an open-source tool designed to generate virtual human faces. This dataset encompasses 395 images labeled with both emotion and identity. Among these images, 47 depict neutral expressions, while 348 represent six distinct emotions. The dataset contains 47 individual identities, each associated with varying emotional expressions.

Implementation. We evaluate this attack scenario by designing an experimental setup with four pre-trained ER models: (i) a 2D-CNN (Allaert et al., 2022; Poux et al., 2021) with three convolutional layers, each followed by ReLU activation and max-pooling, followed by two fully connected layers, (ii) ResNet-9 (He et al., 2016), (iii) MobileNet (Howard et al., 2017), and (iv) a Vision Transformer (Dosovitskiy et al., 2020). We train the models for ER on CK+ (Lucey et al., 2010) as the original task $\mathcal{T}$, using a learning rate of 0.001 and the Adam optimizer for 100 epochs.

Comparison. To evaluate the accuracy of the hijacking tasks, we consider a lower bound (LB) and an upper bound (UB): the LB corresponds to the random-guessing probability, while the UB corresponds to an unconstrained version of (Salem et al., 2022) without the covertness requirement, which is practically equivalent to freely poisoning the model by training it on both the original and hijacking tasks.

Figure 3. Example re-identification queries and their corresponding top-5 closest reference users.
Figure 4. Top-1 to top-5 accuracy of the user re-identification hijacking task on the four pre-trained ER models (mean and standard deviation over 10-fold experiments).

5.3. ER Systems for user re-identification

We refer to this scenario as user re-identification, as it involves the adversary aiming to identify users whose data (images) were used to train the original ER model. It is worth noting that in our experiments, the query samples used for the hijacking are not members of the original task's training dataset, since using training members would not comply with our threat model.

We use the top-N ranked reference users as an evaluation metric for the hijacking attack performance. The top-N metric is particularly relevant for evaluating identification tasks (Wang and Deng, 2021); it provides a contextual method for narrowing down potential candidate users. Figure 3 illustrates example queries and their corresponding top-5 samples. In the second row of this figure, the query image features a person with dark skin, and only one individual with dark skin appears among the top-5 reference users; the attacker could reasonably deduce that the unknown query image likely belongs to this individual. Figure 4 shows the top-1 to top-5 accuracy results for the hijacking task, presented as mean and standard deviation over 10-fold experiments. Considering that our hijacking task involves identifying 85 persons, we use random classification as a reference (1.1%). SnatchML achieves over 40% top-1 accuracy for the 2D-CNN, while the hijacking UB is up to 98%.
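A small sketch of how the top-N metric can be computed from pairwise BEK distances follows; the tensor layout and variable names are our own assumptions.

```python
import torch

def top_n_accuracy(dists, query_ids, ref_ids, n=5):
    """Fraction of queries whose true identity appears among the n closest references.

    dists:     (num_queries, num_refs) pairwise distances in BEK space
    query_ids: (num_queries,) ground-truth identities of the queries
    ref_ids:   (num_refs,) identities of the reference samples
    """
    ranked = torch.argsort(dists, dim=1)[:, :n]          # indices of the n closest references
    hits = (ref_ids[ranked] == query_ids.unsqueeze(1))   # (num_queries, n) boolean matches
    return hits.any(dim=1).float().mean().item()
```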

Although all four tested models demonstrate significant performance, the results depicted in Figure 4 reveal disparities among them, suggesting that different architectures may have varying levels of susceptibility to the proposed model hijacking for the same task. For example, while the attack on the 2D-CNN and ResNet-9 succeeded with around 40% accuracy, the ViT and MobileNet were more robust, with an average top-5 accuracy not exceeding 30%.

5.4. ER systems learn biometric identification

In this second scenario, we consider a setting where the attacker aims to hijack the model for biometric identification of users whose data is not necessarily a part of the ER model’s training dataset. Following the same approach in Section 5.3, we exploit the pre-trained ER model to identify unseen users based on the extracted BEK. As for the hijacking reference database, we consider two main settings:
(i) The attacker has a reference database from a different distribution from the training dataset, comprising distinct users’ images with several facial expressions captured at different times and under different lighting conditions.

(ii) The attacker has more restricted access, specifically to images of users with a neutral facial expression. These images are typically sourced from official documents such as passports or staff cards. It is important to note that in both scenarios, we assume the users are not part of the training dataset of the ER model.

Figure 5. Example identification queries from the Olivetti dataset and their top-5 closest reference candidates.
Figure 6. Top-1 to top-5 accuracy of the biometric identification hijacking task (Olivetti) on the four pre-trained ER models.
Figure 7. Accuracy of the identification hijacking task on the Celebrity and Synthetic datasets for the four pre-trained ER models.

Case (i): We follow the same approach as detailed in Section 5.3 by training the ER model on the CK+ dataset. However, for evaluating the identification attack, we use the Olivetti dataset (Samaria and Harter, 1994). We perform ER queries with images from the Olivetti dataset to obtain the output FV/logits and execute SnatchML. Figure 5 displays sample results of the identification task on Olivetti. An interesting observation is highlighted in the first row of the figure, where the top-5 output candidates for a query involving a person wearing glasses also predominantly feature candidates wearing glasses. This indicates that the ER model has learned to recognize this accessory despite its irrelevance from an emotion recognition perspective. The second row in Figure 5 also highlights the significance of the top-5 metric: the query image corresponds to a female individual, with 4 out of 5 of the closest candidates being male subjects and the first female candidate appearing only at rank 5, potentially leading to a precise identification.

Figure 6 shows the hijacking attack accuracy results on the four pre-trained ER models in terms of top-1 to top-5 accuracy. The attacks demonstrate notable success, with the top-1 accuracy exceeding 60% on the 2D-CNN and ResNet-9. Although the lower success rate on MobileNet and the ViT remains consistent in this setting, the hijacking performance is still considerable given a random-guessing probability of ∼2.5%.

Case (ii): In this scenario, we assume that the attacker has access only to neutral images of the targeted users. To conduct our attack, we use two datasets featuring individuals with neutral facial expressions, namely Celebrity and Synthetic, as detailed in Section 5.2. The ER model is pretrained on the CK+ dataset. We assess the effectiveness of the identification hijacking attack by performing ER queries with images from both the Celebrity and Synthetic datasets and analyzing the output FV/logits. The accuracy of the hijacking task using the four pretrained ER models is depicted in Figure 7. Notably, the top-1 identification accuracy reaches approximately 100% for the 2D-CNN model under both black-box and white-box attacks for both datasets. The least successful attack is observed with MobileNet under the black-box scenario, achieving approximately 48% top-1 identification accuracy on the Synthetic dataset, while the random-guess probability is around 2%.

6. Scenario 2: Age/Gender/Ethnicity

Table 1. Cross-attribute hijacking on UTKFace. For each model and original task, we report the original-task accuracy and the accuracy of hijacking the remaining attributes (random guess, SnatchML on logits, SnatchML on feature vectors, and the hijacking upper bound UB). "-" marks the combination where the hijacking task coincides with the original task.

Model                               2D-CNN                 ResNet-9               MobileNet              Transformer
Original task                       Age    Gender  Ethn.   Age    Gender  Ethn.   Age    Gender  Ethn.   Age    Gender  Ethn.
Original accuracy                   0.682  0.876   0.762   0.705  0.897   0.785   0.668  0.864   0.726   0.635  0.848   0.727

Hijacking Age (random: 0.166)
  SnatchML (Logits)                 -      0.322   0.369   -      0.288   0.347   -      0.347   0.326   -      0.357   0.343
  SnatchML (FV)                     -      0.407   0.420   -      0.452   0.436   -      0.350   0.346   -      0.382   0.422
  Hijacking UB                      -      0.669   0.668   -      0.669   0.675   -      0.662   0.670   -      0.636   0.615

Hijacking Gender (random: 0.500)
  SnatchML (Logits)                 0.612  -       0.549   0.547  -       0.529   0.561  -       0.542   0.578  -       0.548
  SnatchML (FV)                     0.626  -       0.556   0.618  -       0.601   0.574  -       0.541   0.627  -       0.611
  Hijacking UB                      0.875  -       0.874   0.881  -       0.884   0.871  -       0.860   0.852  -       0.843

Hijacking Ethnicity (random: 0.200)
  SnatchML (Logits)                 0.384  0.274   -       0.323  0.275   -       0.360  0.266   -       0.351  0.303   -
  SnatchML (FV)                     0.397  0.310   -       0.401  0.409   -       0.346  0.316   -       0.415  0.326   -
  Hijacking UB                      0.754  0.766   -       0.746  0.759   -       0.755  0.758   -       0.698  0.710   -

6.1. Context

In this section, we consider an application that predicts personal attributes (e.g., age, gender, ethnicity) from facial images. An example of this application is Microsoft Azure's Face (Microsoft, [n. d.]) as an MLaaS API, where users can query one specific model (e.g., for age estimation) using facial images. While the original task $\mathcal{T}$ here is age estimation, we assume that an adversary can hijack the same model to infer other personal attributes like gender or ethnicity. In practice, we assume the adversary has access to a publicly available facial image database labeled with personal attributes (relevant to the hijacking task). The implications of this hijacking attack can be critical, particularly if users do not consent to the use of their data: unauthorized inference of gender and ethnicity could constitute a significant privacy violation, potentially leading to discriminatory practices, especially against individuals from marginalized ethnic or gender groups.

6.2. Setup

Dataset: UTKFace. The UTKFace (Zhang et al., 2017) dataset is a collection of 20,000 facial images of individuals ranging in age from 1 to 116 years old. Each image is labeled with information on age, gender, and ethnicity as follows:
-Age: (6 Classes) – ’Newborns/Toddlers’, ’Pre-adolescence’, ’Teenagers’, ’Young Adults’, ’Middle-aged’, ’Seniors’.
-Gender: (2 classes) – ’Male’ and ’Female’.
-Ethnicity: (5 Classes) – ’White’, ’Black’, ’Asian’, ’Indian’, ’Others’.

Models. We consider the same model architectures as in Section 5.2, trained using a learning rate of $10^{-3}$ and the Adam optimizer. We compare our results with the hijacking upper bound (UB), which corresponds to an unconstrained version of (Salem et al., 2022) without the covertness requirement, practically equivalent to freely poisoning the model by training it on both the original and hijacking tasks.

6.3. Cross-attribute hijacking: age, gender and ethnicity

The adversary's goal is to hijack a model, exclusively trained on one of the three tasks, to infer other personal attributes for which the victim model has not been trained. Specifically, from an unknown facial image query submitted by an unknown user to an age estimation model, and by only using the original model's BEK, the adversary aims to infer the gender and ethnicity of the user. For a comprehensive study, we test all possible combinations of age, gender, and ethnicity as original and hijacking tasks. Following a leave-one-out protocol over these combinations, we exclusively train a model on one task and test SnatchML on the two remaining tasks. Following our threat model, we run SnatchML using the facial images from the test dataset.

Table 1 provides an overview of the experimental results, with the random-guess probability as a lower bound. As expected, the white-box setting consistently delivers higher performance. For instance, ResNet-9, trained for gender recognition, can be hijacked to predict age and ethnicity with an accuracy of approximately 45%, while the random-guessing probability for this task is 16.60% and the upper-bound hijacking accuracy is 66.89%. Different tasks have varying levels of difficulty that can be related to their inherent complexity. For example, age recognition is challenging; a 17-year-old 'teenager' and an 18-year-old 'young adult' are not easily distinguished. The fact that models trained for age prediction as the original task achieve accuracy ranging from 63% (Transformer) to 70% (ResNet-9) illustrates the task complexity and highlights the relative success of the attack. Similar conclusions can be drawn regarding hijacking the model for recognizing ethnicity: SnatchML achieves over 40% accuracy under the white-box setting for the Transformer and ResNet-9 models, for a hijacking UB of 75%. However, we observe that hijacking gender is more challenging, and the attack shows its limitations despite its accuracy consistently exceeding the random-guessing probability.

7. Scenario 3: Pneumonia Diagnosis

7.1. Context

This section illustrates our attack with a use case in the healthcare domain. Let the original task $\mathcal{T}$ be Pulmonary Disease Diagnosis (PDD) using chest X-ray images. The original task is a binary classification of whether the patient has 'Pneumonia'. Such a model can be deployed within an MLaaS API such as the Google Healthcare API (Google, [n. d.]). Patients or practitioners can query the model with X-ray images and receive feedback on potential pneumonia diagnoses. Our attack investigates the possibility of inferring even more information on the type of disease using the same model. Specifically, the attacker tries to infer a hijacking task on samples identified with pneumonia, i.e., whether the individual's infection is of viral or bacterial origin. Furthermore, we assume that the adversary has access to a database of X-ray images labeled with the type of infection. These images, which may be publicly available, are not necessarily associated with the individuals querying the PDD model.

7.2. Setup

Dataset: Chest X-Ray Images (Pneumonia). This dataset (Kermany et al., 2018) consists of 5,863 X-ray images categorized into 'Pneumonia' and 'Normal' classes. Two expert physicians graded the diagnoses, and a third expert validated the evaluation set to ensure diagnostic accuracy. The 'Pneumonia' category is further labeled into 'Viral' and 'Bacterial' subcategories.

We consider the same backbone architectures as in Section 5.2 with slight modifications to enable the (original) binary classification.

7.3. PDD Models recognize more than Pneumonia

The adversary aims to hijack a PDD model to extract more information on the victim’s health. Specifically, from unknown query X-ray images only using the PDD model’s output, the attacker aims to infer the pulmonary infection type (i.e., viral or bacterial). Such information can be highly sensitive and might be further used for malicious intent. In the rest of this section, we consider the classification of pulmonary infection to be our hijacking task.

Following our threat model, we conduct a hijacking attack using BEK from the hijacking dataset to recognize the infection type of unknown query X-ray images. Similar to our evaluation of SnatchML effectiveness in Section 5.3, we use the X-ray images from the test dataset as targets for the hijacking attack.

Figure 8. Accuracy of the original PDD task and the infection-type hijacking task across models, under black-box (logits) and white-box (FV) settings.
Figure 9. Hijacking performance for tasks unrelated to the original task, under black-box and white-box settings, compared to the random-guess probability (see Section 8).

Figure 8 reports the accuracy results of our hijacking attacks under different settings and models. We observe that the PDD model achieves an accuracy of up to ∼85% on the original task. Interestingly, the accuracy of the hijacking task (i.e., type of pneumonia infection) under the white-box setting (using FV) is comparable to the model's accuracy on the original task. Among all architectures, SnatchML was particularly successful on ResNet-9 under both white-box and black-box settings, reaching attack accuracies of 76% and 78%, respectively. Despite the known difficulty of learning complex tasks on small datasets, the Transformer achieves a hijacking accuracy of 62.5%. The disparity in attack efficiency across model architectures under the same settings provides an interesting insight into the importance of model hyperparameters in the underlying phenomena leading to hijacking success. It is also worth noticing that under the white-box setting, and for all models, there is a strong correlation between the accuracy of the original and hijacking tasks. This further emphasizes the overlap in the learning dynamics for the original and hijacking tasks. Such an overlap poses a high risk of privacy breaches, especially for sensitive applications like healthcare systems. SnatchML achieves results that are close to the hijacking UB.

Figure 10. Hijacking task accuracy as a function of the width expansion ratio (0.25× to 2.5×) for 2D-CNN, ResNet-9, MobileNet, and Transformer, for the user re-identification (CK+) and identification (Olivetti) hijacking tasks.

8. What if the hijacking task is unrelated to the original task?

Previous experiments demonstrate that a model trained for an original task can be effectively hijacked to perform a different task. These findings have so far involved scenarios where the hijacking tasks share semantic relatedness or overlapping data distributions with the original task. While this represents a significant security threat, the requirement for task relatedness can be viewed as a limitation from an attack perspective. State-of-the-art model hijacking attacks are more general and are not particularly constrained by relatedness to the original task. In this section, we extend our investigation to examine the generalizability of SnatchML to hijacking tasks unrelated to the original task.

Experimental setup. We evaluate several image classification benchmarks as original tasks. We use MobileNet models pretrained on ImageNet and on CIFAR-10, sourced from the PyTorch model zoo (Pytorch, [n. d.]). These pretrained models are the targets of SnatchML, which aims to hijack them for unrelated tasks involving different data distributions. Specifically, we hijack the models for MNIST, CK+ for ER, CK+ for biometric identification, and Olivetti for face recognition. We run SnatchML on the test distribution to evaluate the hijacking task performance. For each hijacking attack, Figure 9 presents the performance under both black-box and white-box settings, along with the random-guess probability.

Observations. The results in Figure 9 illustrate the capacity of overparametrized models to extract useful knowledge from a data distribution unrelated to their original training dataset. In fact, the SnatchML hijacking task accuracy is comparable to the state-of-the-art accuracy of tasks like MNIST when the victim model is trained on ImageNet. In the following section, we attempt to provide possible explanations for these results.

9. Why do ML models learn more than they should?

In the previous sections, we have shown that SnatchML can hijack ML models to perform tasks that are either related or entirely unrelated to the original task. While the first case is intuitive, as illustrated in Section 2, the second is more surprising. In the next two subsections, we explore the reasons behind these observations by examining two hypotheses: (i) over-parametrization of the ML model and (ii) random projection.

9.1. Training ML models does not respect the "Least Privilege Principle"

ML models may learn features that are not particularly related to the original task. For example, in the context of ER, these models might capture facial characteristics that could inadvertently be used for biometric identification. From a conceptual perspective, such a phenomenon can lead to critical security breaches because it does not respect the fundamental cybersecurity principle of "least privilege". Practically, the number of parameters in an ML model reflects the upper bound of its learning capacity, and over-parametrization implies that such models contain more parameters than necessary to accomplish the original task. In both software (Wang et al., 2013) and hardware (Fern et al., 2015) systems, undefined behavior and don't-care states have been shown to be potential sources of vulnerabilities. In the same vein, we believe that this extra capacity represents undefined behavior and don't-care states of the model that can be exploited by attackers. We posit that the success of our hijacking attack is related to the over-parametrization of victim models, and we suggest that the model's capacity to unintentionally learn unnecessary features positively correlates with the accuracy of the hijacking task. This is particularly supported by the fact that we exploit the same features extracted by the original model to perform the hijacking tasks; we assume that some of those features strongly correlate with the hijacking task (e.g., ER vs. biometric identification).

To verify this hypothesis, we study the correlation between ML model size and the accuracy of the hijacking tasks for user re-identification and biometric identification. The results of our study are shown in Figure 10. We study four ML models, namely 2D-CNN, ResNet-9, MobileNet, and Transformer. We vary the width expansion ratio (i.e., the number of channels) from 0.25× to 2.5× to emulate different numbers of model parameters while maintaining the same baseline model architecture (a sketch of this width-scaling setup is given after the insights below). From the results reported in Figure 10, we draw the following insights:

First, we notice a positive correlation between the model's width expansion ratio and the hijacking task accuracy. This can be attributed to the increase in the model's capacity to learn, from the ER training data, facial features that are leveraged for user identification in the hijacking attack. For instance, for the re-identification attack with 2D-CNN, the learning capacity for the hijacking task improves by around 13% under a 1.25× width expansion and remains stable afterward. With MobileNet, we notice a steady increase in the hijacking task accuracy, achieving a peak improvement of 6%. The low hijacking accuracy of MobileNet compared to other models can be explained by the compactness of this model, which translates into a lower capacity to learn extra tasks.

Second, we also observe that different model architectures exhibit different behaviors when scaling the model size. This is particularly clear for ResNet-9 and Transformer models (first row of Figure 10). The over-parametrized nature of their architecture can explain this trend, even at low-width expansion ratios. Thus, further up-scaling their number of parameters does not improve the hijacking task.

Third, the observed trends between the model size and the learning capacity for the hijacking task differ w.r.t. the task complexity. For instance, ResNet-9 shows a slight improvement of 2% in the hijacking attack for user re-identification under the over-parameterized regime. For the identification task on Olivetti, ResNet-9 shows a higher improvement of 9% in the hijacking attack (2nd row, 1st and 2nd columns in Figure 10). The re-identification task (i.e., the ER training dataset) includes 89 user identities, making it much more challenging than the identification task, which involves around 40 user identities.
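To make the width-expansion sweep concrete, here is a sketch of how channel counts might be scaled by an expansion ratio for a small CNN similar in spirit to the 2D-CNN baseline; the base channel counts and layer layout are illustrative assumptions, not the paper's exact configuration.

```python
import torch.nn as nn

def make_cnn(width_ratio=1.0, num_classes=7, base_channels=(32, 64, 128)):
    """Small CNN whose channel counts are scaled by a width-expansion ratio (0.25x-2.5x)."""
    chans = [max(1, int(c * width_ratio)) for c in base_channels]
    layers, in_c = [], 1  # grayscale input
    for out_c in chans:
        layers += [nn.Conv2d(in_c, out_c, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2)]
        in_c = out_c
    return nn.Sequential(*layers, nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                         nn.Linear(in_c, num_classes))

# Sweep used to emulate different parameter counts at a fixed architecture:
# models = {m: make_cnn(width_ratio=m) for m in (0.25, 0.5, 1.0, 1.5, 2.0, 2.5)}
```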

Figure 11. Hijacking attack accuracy on randomly initialized neural networks (parameters sampled from a Gaussian distribution).

9.2. General case: Universality of Random DNNs

The general case corresponds to a model hijacked for a task totally unrelated to the original task. The rationale by which features learned by the model on the original task overlap with the hijacking task does not hold under these assumptions. As far as the hijacking task is concerned, this case is similar to a randomly initialized model. We hypothesize that the universal approximation capabilities of random DNNs can explain the relatively high accuracy of the hijacking task. We recall the Johnson-Lindenstrauss Lemma (Johnson and Lindenstrauss, 1984), which states that a set of points in a high-dimensional space can be projected into a lower-dimensional space while approximately preserving the pairwise distances between the points. While this lemma gives an interesting insight, it cannot rigorously explain the behavior of random ML models because of their non-linearity. However, several works such as (Giryes et al., 2016; Basteri and Trevisan, 2023; Liao and Couillet, 2018; Rahimi and Recht, 2007) investigated the capabilities of random neural networks. In particular, Giryes et al. (Giryes et al., 2016) prove that DNNs preserve the metric structure of the data as it propagates through the layers. Interestingly, their work shows that networks tend to decrease the Euclidean distances between points with a small angle between them ("same class") more than the distances between points with large angles between them ("different classes"). Further, the theoretical underpinnings provided by (Giryes et al., 2016) demonstrate that deep random networks can act as powerful feature extractors, where the depth and width of the network play crucial roles in enhancing the representational power of the features.

While the properties of random neural networks are extensively studied in the ML community, our work shows the potential security threats that can emanate from these properties. To illustrate this perspective, we run our hijacking attack on totally random neural networks, i.e., the parameters are sampled from a random Gaussian distribution. The results shown in Figure 11 are coherent with our observations in Section 8. In the Appendix, we provide a visualization of this setting in the latent space in Figure 13.
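The following toy sketch illustrates this view: an untrained network with Gaussian-sampled weights is used as the BEK extractor, on top of which SnatchML's nearest-reference rule can be applied. The architecture, input size, and standard deviation are arbitrary illustrative choices.

```python
import torch
import torch.nn as nn

def random_feature_extractor(in_dim=28 * 28, out_dim=256, seed=0):
    """Untrained MLP with Gaussian-sampled parameters, used only as a random projection."""
    torch.manual_seed(seed)
    net = nn.Sequential(nn.Flatten(), nn.Linear(in_dim, 512), nn.ReLU(),
                        nn.Linear(512, out_dim))
    for p in net.parameters():
        nn.init.normal_(p, mean=0.0, std=0.05)
    return net.eval()

# Even without training, the pairwise geometry of inputs is roughly preserved, so the
# nearest-reference rule of SnatchML can still separate hijacking-task classes:
# zeta = random_feature_extractor()
# with torch.no_grad():
#     ref_bek = zeta(ref_images)      # (m, 256) reference BEK vectors
#     query_bek = zeta(query_images)  # (num_queries, 256) query BEK vectors
```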

10. Countermeasures

In this section, we investigate possible countermeasures to mitigate the threat of SnatchML. We propose the following two approaches:

10.1. Meta-unlearning: Meta-learning-based malicious task unlearning

In cases where the hijacking threat can be identified at training time with a precise task, e.g., biometric identification, we propose to train the model to learn the original task and simultaneously unlearn the potentially malicious one.

Algorithm 1: Meta-unlearning

Data: $p(\mathcal{T}_i)$: distribution over the original task
Input: $\alpha, \beta$: step size hyperparameters

Randomly initialize $\theta$
while not done do
    Sample a batch of data $\mathcal{B}_k \sim p(\mathcal{T}_i)$
    forall $\mathcal{B}_k$ do
        Evaluate $\nabla_\theta \mathcal{L}_{\mathcal{T}_i}(f_\theta)$
        Update parameters for the original task: $\theta'_i = \theta - \alpha \nabla_\theta \mathcal{L}_{\mathcal{T}_i}(f_\theta)$
        Unlearn sensitive task $\mathcal{T}_j$: $\theta \leftarrow \theta + \beta \nabla_\theta \mathcal{L}_{\mathcal{T}_j}(f_{\theta'_i})$
Return $f_\theta$

The intuition is that some internal representations are more transferable across tasks than others. For example, given an original task $\mathcal{T}_i$ and a related task $\mathcal{T}_j$, i.e., both $\mathcal{T}_i,\mathcal{T}_j\sim p(\mathcal{T})$, where $p(\mathcal{T})$ is a distribution of tasks, a model might learn internal features that are broadly applicable to all tasks in $p(\mathcal{T})$, and others that are specific to the original task $\mathcal{T}_i$. Our objective is to penalize the learning of hijacking-task-specific features while learning the ones relevant to the original task. The proposed approach is inspired by the meta-learning literature (Finn et al., 2017): we train the model on $\mathcal{T}_i$ while maximizing the loss function $\mathcal{L}_j$ relative to the (hijacking) task $\mathcal{T}_j$. The approach is detailed in Algorithm 1. Tables 2 and 3 show the results of experiments designed to test the effectiveness of meta-unlearning in retaining accuracy on the original classification task while resisting SnatchML. We also add a hijacking setting with stronger access than our approach, assuming the hijacking dataset can be used to train a neural network for hijacking (NN surrogate), rather than only using distances as in SnatchML. Table 3 shows the results for gender classification as the original task and ethnicity recognition as the hijacking task. When the model is trained with unlearning, the accuracy on the original task generally decreases; this is likely a consequence of the unlearning procedure, which may also remove features beneficial to the original task. While the surrogate NN achieves, as expected, higher hijacking success, we notice that meta-unlearning is more effective against this attack than against SnatchML, suggesting a higher robustness of distance-based inference in this setting. For example, the defense has not significantly impacted the accuracy of the hijacking task (ethnicity). After empirically exploring the loss weights, these results were obtained with $\alpha=1$ and $\beta=0.01$; increasing the unlearning loss coefficient leads to an unacceptable accuracy drop on the original task.
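Below is a PyTorch-style sketch of one meta-unlearning update, following a literal, first-order reading of Algorithm 1; the batches, loss functions, and step sizes for $\mathcal{T}_i$ and $\mathcal{T}_j$ are assumed to be supplied by the caller, and the helper name is ours.

```python
import copy
import torch

def meta_unlearning_step(model, batch_i, batch_j, loss_i, loss_j, alpha=1.0, beta=0.01):
    """One meta-unlearning update following Algorithm 1 (first-order approximation)."""
    x_i, y_i = batch_i  # batch from the original task T_i
    x_j, y_j = batch_j  # batch labeled for the sensitive (hijacking) task T_j

    # Inner step: virtual update θ'_i = θ - α ∇_θ L_{T_i}(f_θ)
    grads_i = torch.autograd.grad(loss_i(model(x_i), y_i), list(model.parameters()))
    adapted = copy.deepcopy(model)
    with torch.no_grad():
        for p, p_adapt, g in zip(model.parameters(), adapted.parameters(), grads_i):
            p_adapt.copy_(p - alpha * g)

    # Outer step: gradient ascent on the sensitive-task loss, θ ← θ + β ∇ L_{T_j}(f_{θ'_i});
    # the gradient is taken at θ'_i (first-order approximation) and applied to θ.
    grads_j = torch.autograd.grad(loss_j(adapted(x_j), y_j), list(adapted.parameters()))
    with torch.no_grad():
        for p, g in zip(model.parameters(), grads_j):
            p.add_(beta * g)
```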

Table 2. Effect of meta-unlearning with emotion recognition (ER) as the original task.

Training Strategy | Model | Original Accuracy (ER) | Hijack by NN Surrogate | SnatchML (Logits)
Without Unlearning | 2D-CNN | 0.94 | 0.84 | 0.29
Without Unlearning | ResNet-9 | 0.95 | 0.41 | 0.21
Without Unlearning | MobileNet | 0.93 | 0.65 | 0.24
With Unlearning | 2D-CNN | 0.67 | 0.12 | 0.33
With Unlearning | ResNet-9 | 0.62 | 0.11 | 0.27
With Unlearning | MobileNet | 0.77 | 0.50 | 0.23

Table 3. Effect of meta-unlearning with gender classification as the original task and ethnicity recognition as the hijacking task.

Training Strategy | Model | Original Accuracy (Gender) | Hijack by NN Surrogate | SnatchML (Logits)
Without Unlearning | 2D-CNN | 0.87 | 0.51 | 0.29
Without Unlearning | ResNet-9 | 0.88 | 0.44 | 0.28
Without Unlearning | MobileNet | 0.85 | 0.46 | 0.29
With Unlearning | 2D-CNN | 0.65 | 0.15 | 0.27
With Unlearning | ResNet-9 | 0.76 | 0.16 | 0.32
With Unlearning | MobileNet | 0.67 | 0.45 | 0.30

Table 4. Effect of meta-unlearning with PDD as the original task.

Training Strategy | Model | Original Accuracy (PDD) | Hijack by NN Surrogate | SnatchML (Logits)
Without Unlearning | 2D-CNN | 0.803 | 0.700 | 0.543
Without Unlearning | ResNet-9 | 0.830 | 0.648 | 0.602
Without Unlearning | MobileNet | 0.793 | 0.666 | 0.656
With Unlearning | 2D-CNN | 0.625 | 0.379 | 0.620
With Unlearning | ResNet-9 | 0.843 | 0.651 | 0.631
With Unlearning | MobileNet | 0.753 | 0.587 | 0.595

Table 5. Original-task accuracy and SnatchML hijacking-task accuracy for the baseline model $f(.)$ and its compressed version $f_{cmp}(.)$ obtained with width expansion ratio $m$.

Hijacking Task (Dataset) | Model | Original Acc. $f(.)$ | Original Acc. $f_{cmp}(.)$ | Expansion $m$ | Hijack (Logits) $f(.)$ | Hijack (Logits) $f_{cmp}(.)$ | Hijack (FV) $f(.)$ | Hijack (FV) $f_{cmp}(.)$
Re-identification (CK+) | 2D-CNN | 0.967 | 0.941 | 0.5× | 0.298 | 0.244 | 0.329 | 0.320
Re-identification (CK+) | ResNet-9 | 0.937 | 0.935 | 0.1× | 0.257 | 0.217 | 0.293 | 0.294
Re-identification (CK+) | MobileNet | 0.939 | 0.915 | 0.45× | 0.148 | 0.143 | 0.176 | 0.163
Identification (Olivetti) | 2D-CNN | 0.967 | 0.941 | 0.5× | 0.319 | 0.289 | 0.564 | 0.523
Identification (Olivetti) | ResNet-9 | 0.937 | 0.935 | 0.1× | 0.232 | 0.144 | 0.479 | 0.359
Identification (Olivetti) | MobileNet | 0.939 | 0.915 | 0.45× | 0.074 | 0.070 | 0.076 | 0.066
Type of Pneumonia (Chest X-ray) | 2D-CNN | 0.776 | 0.764 | 0.1× | 0.544 | 0.564 | 0.787 | 0.759
Type of Pneumonia (Chest X-ray) | ResNet-9 | 0.853 | 0.851 | 0.1× | 0.764 | 0.541 | 0.787 | 0.756
Type of Pneumonia (Chest X-ray) | MobileNet | 0.801 | 0.780 | 0.1× | 0.544 | 0.482 | 0.751 | 0.633

10.2. Model Compression as a Defense

The previous approach is limited by the requirement that the hijacking task be known in advance, which may not always reflect real-world scenarios. In this second approach, we aim to overcome this limitation with a more task-agnostic methodology. In Section 9.1, we observed a correlation between model size and the effectiveness of SnatchML. In this section, we explore the extent to which model compression can help defend against SnatchML.

Problem Formulation. Let $f(.)$ be an ML model architecture from a design space $\mathcal{F}$. The model is trained on a data distribution $(X,Y)\sim\mathcal{D}^{i}$ for an original task $\mathcal{T}_{i}$. The architecture of $f(.)$ can be described as follows:

(4) $f(.) = l^{n} \circ l^{n-1} \circ l^{n-2} \circ \dots \circ l^{2} \circ l^{1}, \quad \text{s.t.}\;\; f\in\mathcal{F}$

(5) $\text{where, for each layer,}\quad l^{j} = \{W_{1}^{j}, W_{2}^{j}, W_{3}^{j}, \dots, W_{m_{j}}^{j}\}$

Where $n$ is the number of neural layers in $f(.)$. Each layer $l^{j}$ is characterized by a specific number of learnable parameters $W_{m}^{j}$; $m$ refers to the layer's width and can be used as a hyperparameter to scale the number of learnable parameters in each layer of $f(.)$. We define the problem of finding a more compact, compressed version $f_{cmp}(.)$ of $f(.)$ as follows:

(6) $f_{cmp} = \operatorname*{arg\,min}_{f_{m}\in\mathcal{F}} \; \alpha\cdot\mathcal{L}_{i}(f_{m}(X),Y) + \beta\cdot\mathcal{C}ount(f_{m})$

Where $\mathcal{L}_{i}$ is the loss function used to train $f$ on dataset $\mathcal{D}^{i}$, and $\mathcal{C}ount$ is the function that returns the number of learnable parameters in $f$. $\alpha$ and $\beta$ are control knobs that balance the trade-off between utility and compression ratio. Here, we slightly abuse notation and use $m$ as a width scaling factor applied to all the layers of $f(.)$; applying the same scaling factor to every layer preserves the feature abstraction hierarchy along the model's layers. Our objective is to reduce the number of learnable parameters in benign ML models while maintaining utility on the original task $\mathcal{T}_{i}$ and lowering the capacity to learn or infer unrelated features that may be exploited for hijacking.
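As an illustration of this formulation, the sketch below (PyTorch) scales every layer of a small 2D-CNN by a single width factor $m$ and evaluates the objective of Eq. (6) for a candidate $f_{m}$. The base widths, input shape, and the value of $\beta$ are assumptions made for the example, not the exact architecture used in our experiments.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def make_scaled_cnn(m: float, in_channels: int = 1, n_classes: int = 7,
                    base_widths=(32, 64, 128)):
    """Build a small 2D-CNN whose layer widths are all scaled by the same
    factor m, preserving the feature hierarchy while shrinking the model."""
    widths = [max(1, int(round(m * w))) for w in base_widths]
    layers, prev = [], in_channels
    for w in widths:
        layers += [nn.Conv2d(prev, w, kernel_size=3, padding=1),
                   nn.ReLU(), nn.MaxPool2d(2)]
        prev = w
    layers += [nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(prev, n_classes)]
    return nn.Sequential(*layers)

def count_params(model: nn.Module) -> int:
    """The Count(.) term of Eq. (6): number of learnable parameters."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

def compression_objective(model, loader, alpha=1.0, beta=1e-6):
    """Evaluate alpha * L_i(f_m(X), Y) + beta * Count(f_m) on a held-out loader."""
    model.eval()
    total, n = 0.0, 0
    with torch.no_grad():
        for x, y in loader:
            total += F.cross_entropy(model(x), y, reduction="sum").item()
            n += y.numel()
    return alpha * (total / n) + beta * count_params(model)
```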

Experimental Setup. We design a search space of 14 candidate width expansion ratios, ranging from $0.1\times$ to $0.75\times$. For each model $f$, we exhaustively iterate through the search space and sample an expansion ratio $m$ to obtain a compact model $f_{m}(.)$. At each iteration, we evaluate $f_{m}(.)$ using the loss $\mathcal{L}_{i}(.)$ (as a proxy for utility on the original task) and the parameter count $\mathcal{C}ount(.)$ (as a proxy for model size). After the exhaustive search, we rank all candidates $f_{m}(.)$ according to their loss values and parameter counts using the TOPSIS method (Behzadian et al., 2012) to select an optimal $f_{cmp}(.)$ that trades off utility and model size.
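The ranking step can be sketched as follows (NumPy), assuming equal criterion weights: the standard TOPSIS recipe of vector normalization, ideal and anti-ideal points, and a closeness score is used to pick the expansion ratio that best trades off loss and parameter count. The example measurements at the bottom are made up for illustration.

```python
import numpy as np

def topsis_select(losses, param_counts, weights=(0.5, 0.5)):
    """Rank candidate width ratios with TOPSIS over two cost criteria
    (original-task loss, parameter count); return the index of the best one."""
    X = np.column_stack([losses, param_counts]).astype(float)
    V = X / np.linalg.norm(X, axis=0) * np.asarray(weights)   # weighted, normalized
    ideal, anti = V.min(axis=0), V.max(axis=0)                # both criteria are costs
    d_best = np.linalg.norm(V - ideal, axis=1)
    d_worst = np.linalg.norm(V - anti, axis=1)
    closeness = d_worst / (d_best + d_worst + 1e-12)          # higher is better
    return int(np.argmax(closeness))

# Hypothetical example: 14 candidate expansion ratios with made-up measurements.
ratios = np.linspace(0.10, 0.75, 14)
losses = 0.4 + 0.6 * (0.75 - ratios)          # loss tends to grow as width shrinks
sizes = (ratios * 4.8e6).astype(int)          # parameter count grows with width
print(f"selected expansion ratio: {ratios[topsis_select(losses, sizes)]:.2f}x")
```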

Results and Discussion. To test whether lowering the models' dimensionality enhances their resistance to hijacking, we evaluate our compression method on three hijacking attacks: user re-identification on CK+ (Lucey et al., 2010), identification on Olivetti (Samaria and Harter, 1994), and pneumonia recognition on Chest X-ray (Kermany et al., 2018). Results are detailed in Table 5 for each hijacking attack using three models: 2D-CNN, ResNet-9, and MobileNet. Across the three datasets, compact models show a notable reduction in hijacking-task accuracy. Specifically, compact ResNet-9 models demonstrate high resilience to hijacking while maintaining accuracy comparable to the baseline model. For example, identification accuracy on Olivetti and pneumonia-type recognition on Chest X-ray decrease by 29% and 38%, respectively, even under an aggressive compression strategy with a width expansion of $m=0.1\times$ relative to the baseline. Conversely, MobileNet shows a modest reduction in hijacking vulnerability for the re-identification and identification tasks, at the expense of a slight decrease in utility on the original ER task (approximately 3%). This finding is consistent with the discussion in Section 9.1, highlighting the intrinsic compactness of the baseline MobileNet model. Notably, in the pneumonia scenario, MobileNet with $m=0.1\times$ shows low hijacking accuracy, with a decrease of 11.3% (Logits) and 15.7% (FV).

It is worth noting that model compression does not always reduce hijacking accuracy, e.g., for re-identification on CK+. This phenomenon may be attributed to the overlap between the original and hijacking tasks, particularly on smaller datasets where models are prone to overfitting. Such models tend to capture fine-grained features necessary for the original task that are, at the same time, highly correlated with the hijacking task. Isolating these fine-grained features could compromise the utility of the original task. Therefore, more sophisticated learning techniques, combined with model compression, are needed.

11. Discussion and Concluding Remarks

Increased potential for harm with less access privilege. In this paper, we propose a new threat model within the model hijacking attack scenario. Specifically, we establish a new risk of adversaries hijacking ML models with restricted access, i.e., without access to the training process or data. We propose SnatchML, which uses features extracted by the victim model at inference time to infer a different task defined by the attacker. In contrast to related hijacking attacks, where the number of classes in the hijacking task is limited by that of the original task, our approach imposes no predefined constraints on the number of classes involved in the hijacking task. Therefore, SnatchML not only assumes a stronger threat model but also enables attackers with lower access privileges to potentially cause greater harm.

Insights towards security-aware NAS. We suggest that the vulnerability of ML models to this type of attack is primarily due to their capacity to unintentionally learn "extra knowledge", which is mainly related to over-parametrization. Interestingly, our results reveal a disparity in attack success across model architectures. This observation could be particularly significant for Neural Architecture Search (NAS) aimed at developing secure-by-design models.

Beyond the technical implications of SnatchML. We believe SnatchML challenges some of the foundations of current AI regulatory efforts and underscores the need for more comprehensive approaches. The potential for compliant ML models to be repurposed for unethical or illegal tasks poses significant accountability risks for model owners. This highlights a critical oversight in the EU AI Act and similar regulatory frameworks, which assume that models trained for specific tasks remain confined to those boundaries.

Attack Limitations. In our study, we demonstrate that the hijacking effectiveness of SnatchML can reach state-of-the-art task accuracy in some cases. However, this effectiveness can be influenced by several factors, particularly the victim model's architecture. This limitation does not exist in traditional hijacking attacks, where the attacker typically has access to training data and can directly manipulate the model's parameters. Despite this limitation, our approach represents a significant advancement over traditional hijacking methods by removing the constraints on the number of classes in the hijacking task.

Defense Limitations. The meta-unlearning defense strategy we propose is designed to unlearn, or forget, specific tasks that are known to pose ethical or privacy concerns. A significant limitation of this approach is its specificity to the task it is designed to counter, which makes it less useful in settings where the potential hijacking tasks are unknown or can vary widely. We believe more advanced meta-learning-inspired approaches that consider a distribution of tasks, rather than a single task, could be more effective.

The compression defense mechanism we explore also presents challenges, particularly when the hijacking task has semantic or functional overlap with the original task. The aggressive compression needed to disrupt the latent space, where features might otherwise be exploited for a hijacking task, can inadvertently reduce the model's accuracy on its original task. While effectively reducing the model's vulnerability to hijacking, this approach therefore requires a delicate balance between maintaining the original task's utility and mitigating the risk of hijacking. Exploring additional compression techniques, such as quantization and pruning followed by fine-tuning on the original dataset, could offer further robustness.

In summary, we believe this work contributes a new perspective on the ML security landscape and calls for advancements in security-aware model design and evaluation.

References

• Abadi et al. (2016) Martin Abadi, Andy Chu, Ian Goodfellow, H. Brendan McMahan, Ilya Mironov, Kunal Talwar, and Li Zhang. 2016. Deep learning with differential privacy. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security. 308–318.
• Allaert et al. (2022) Benjamin Allaert, Isaac Ronald Ward, Ioan Marius Bilasco, Chaabane Djeraba, and Mohammed Bennamoun. 2022. A comparative study on optical flow for facial expression analysis. Neurocomputing 500 (2022), 434–448.
• Bagdasaryan et al. (2020) Eugene Bagdasaryan, Andreas Veit, Yiqing Hua, Deborah Estrin, and Vitaly Shmatikov. 2020. How to backdoor federated learning. In International Conference on Artificial Intelligence and Statistics. PMLR, 2938–2948.
• Basteri and Trevisan (2023) Andrea Basteri and Dario Trevisan. 2023. Quantitative Gaussian Approximation of Randomly Initialized Deep Neural Networks. arXiv:2203.07379 [cs.LG].
• Bastioni et al. (2008) Manuel Bastioni, Simone Re, and Shakti Misra. 2008. Ideas and methods for modeling 3D human figures: the principal algorithms used by MakeHuman and their implementation in a new approach to parametric modeling. In Proceedings of the 1st Bangalore Annual Compute Conference. 1–6.
• Behzadian et al. (2012) Majid Behzadian, S. Khanmohammadi Otaghsara, Morteza Yazdani, and Joshua Ignatius. 2012. A state-of-the-art survey of TOPSIS applications. Expert Systems with Applications 39, 17 (2012), 13051–13069.
• Biggio et al. (2012) Battista Biggio, Blaine Nelson, and Pavel Laskov. 2012. Poisoning attacks against support vector machines. arXiv preprint arXiv:1206.6389 (2012).
• BRIA AI ([n. d.]) BRIA AI. [n. d.]. BRIA AI Labs. https://labs.bria.ai/.
• Carlini and Wagner (2017) Nicholas Carlini and David Wagner. 2017. Towards evaluating the robustness of neural networks. In 2017 IEEE Symposium on Security and Privacy (SP). IEEE, 39–57.
• Commission ([n. d.]) European Commission. [n. d.]. EU AI Act. https://www.europarl.europa.eu/news/en/headlines/society/20230601STO93804/eu-ai-act-first-regulation-on-artificial-intelligence. Accessed: 2023-11-03.
• Dosovitskiy et al. (2020) Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. 2020. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020).
• Elasri et al. (2022) Mohamed Elasri, Omar Elharrouss, Somaya Al-Maadeed, and Hamid Tairi. 2022. Image generation: A review. Neural Processing Letters 54, 5 (2022), 4609–4646.
• Elsayed et al. (2018) Gamaleldin F. Elsayed, Ian Goodfellow, and Jascha Sohl-Dickstein. 2018. Adversarial reprogramming of neural networks. arXiv preprint arXiv:1806.11146 (2018).
• Fern et al. (2015) Nicole Fern, Shrikant Kulkarni, and Kwang-Ting Tim Cheng. 2015. Hardware Trojans hidden in RTL don’t cares—Automated insertion and prevention methodologies. In 2015 IEEE International Test Conference (ITC). 1–8.
• Finn et al. (2017) Chelsea Finn, Pieter Abbeel, and Sergey Levine. 2017. Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks. In Proceedings of the 34th International Conference on Machine Learning (Proceedings of Machine Learning Research, Vol. 70), Doina Precup and Yee Whye Teh (Eds.). PMLR, 1126–1135. https://proceedings.mlr.press/v70/finn17a.html
• Ganju et al. (2018) Karan Ganju, Qi Wang, Wei Yang, Carl A. Gunter, and Nikita Borisov. 2018. Property Inference Attacks on Fully Connected Neural Networks using Permutation Invariant Representations. In Proceedings of the 2018 ACM SIGSAC Conference on Computer and Communications Security (Toronto, Canada) (CCS ’18). Association for Computing Machinery, New York, NY, USA, 619–633. https://doi.org/10.1145/3243734.3243834
• Giryes et al. (2016) Raja Giryes, Guillermo Sapiro, and Alex M. Bronstein. 2016. Deep Neural Networks with Random Gaussian Weights: A Universal Classification Strategy? IEEE Transactions on Signal Processing 64, 13 (2016), 3444–3457. https://doi.org/10.1109/TSP.2016.2546221
• Goodfellow et al. (2014) Ian J. Goodfellow, Jonathon Shlens, and Christian Szegedy. 2014. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572 (2014).
• Google ([n. d.]) Google. [n. d.]. https://cloud.google.com/healthcare-api. Accessed: 2024-04-01.
• Guesmi et al. (2023) Amira Guesmi, Ruitian Ding, Muhammad Abdullah Hanif, Ihsen Alouani, and Muhammad Shafique. 2023. DAP: A Dynamic Adversarial Patch for Evading Person Detectors. arXiv:2305.11618 [cs.CR].
• He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 770–778.
• Howard et al. (2017) Andrew G. Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. 2017. MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861 (2017).
• Huang et al. (2011) Ling Huang, Anthony D. Joseph, Blaine Nelson, Benjamin I. P. Rubinstein, and J. Doug Tygar. 2011. Adversarial machine learning. In Proceedings of the 4th ACM Workshop on Security and Artificial Intelligence. 43–58.
• Johnson and Lindenstrauss (1984) William B. Johnson and Joram Lindenstrauss. 1984. Extensions of Lipschitz mappings into a Hilbert space. Contemp. Math. 26 (1984), 189–206.
• Kermany et al. (2018) Daniel S. Kermany, Michael Goldbaum, Wenjia Cai, Carolina C. S. Valentim, Huiying Liang, Sally L. Baxter, Alex McKeown, Ge Yang, Xiaokang Wu, Fangbing Yan, et al. 2018. Identifying medical diagnoses and treatable diseases by image-based deep learning. Cell 172, 5 (2018), 1122–1131.
• Liao and Couillet (2018) Zhenyu Liao and Romain Couillet. 2018. The dynamics of learning: A random matrix approach. In International Conference on Machine Learning. PMLR, 3072–3081.
• Lucey et al. (2010) Patrick Lucey, Jeffrey F. Cohn, Takeo Kanade, Jason Saragih, Zara Ambadar, and Iain Matthews. 2010. The extended Cohn-Kanade dataset (CK+): A complete dataset for action unit and emotion-specified expression. In 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops. IEEE, 94–101.
• Mallya and Lazebnik (2018) Arun Mallya and Svetlana Lazebnik. 2018. PackNet: Adding multiple tasks to a single network by iterative pruning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 7765–7773.
• Microsoft ([n. d.]) Microsoft. [n. d.]. Face API - Azure Cognitive Services. https://azure.microsoft.com/en-us/services/cognitive-services/face/. Accessed: 2024-04-01.
• Moosavi-Dezfooli et al. (2016) Seyed-Mohsen Moosavi-Dezfooli, Alhussein Fawzi, and Pascal Frossard. 2016. DeepFool: a simple and accurate method to fool deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2574–2582.
• Naseri et al. (2020) Mohammad Naseri, Jamie Hayes, and Emiliano De Cristofaro. 2020. Toward robustness and privacy in federated learning: Experimenting with local and central differential privacy. arXiv preprint arXiv:2009.03561 (2020).
• Poux et al. (2021) Delphine Poux, Benjamin Allaert, Nacim Ihaddadene, Ioan Marius Bilasco, Chaabane Djeraba, and Mohammed Bennamoun. 2021. Dynamic facial expression recognition under partial occlusion with optical flow reconstruction. IEEE Transactions on Image Processing 31 (2021), 446–457.
• PyTorch ([n. d.]) PyTorch. [n. d.]. https://pytorch.org/serve/model_zoo.html. Accessed: 2024-04-01.
• Rahimi and Recht (2007) Ali Rahimi and Benjamin Recht. 2007. Random features for large-scale kernel machines. Advances in Neural Information Processing Systems 20 (2007).
• Salem et al. (2022) Ahmed Salem, Michael Backes, and Yang Zhang. 2022. Get a Model! Model Hijacking Attack Against Machine Learning Models. In 29th Annual Network and Distributed System Security Symposium, NDSS 2022, San Diego, California, USA, April 24-28, 2022. The Internet Society. https://www.ndss-symposium.org/ndss-paper/auto-draft-241/
• Samaria and Harter (1994) Ferdinando S. Samaria and Andy C. Harter. 1994. Parameterisation of a stochastic model for human face identification. In Proceedings of 1994 IEEE Workshop on Applications of Computer Vision. IEEE, 138–142.
• Si et al. (2023) Wai Man Si, Michael Backes, Yang Zhang, and Ahmed Salem. 2023. Two-in-One: A Model Hijacking Attack Against Text Generation Models. In 32nd USENIX Security Symposium (USENIX Security 23). USENIX Association, Anaheim, CA, 2223–2240. https://www.usenix.org/conference/usenixsecurity23/presentation/si
• Sun et al. (2019) Ziteng Sun, Peter Kairouz, Ananda Theertha Suresh, and H. Brendan McMahan. 2019. Can you really backdoor federated learning? arXiv preprint arXiv:1911.07963 (2019).
• Venceslai et al. (2020) Valerio Venceslai, Alberto Marchisio, Ihsen Alouani, Maurizio Martina, and Muhammad Shafique. 2020. NeuroAttack: Undermining Spiking Neural Networks Security through Externally Triggered Bit-Flips. In 2020 International Joint Conference on Neural Networks (IJCNN). 1–8. https://doi.org/10.1109/IJCNN48605.2020.9207351
• Wang and Deng (2021) Mei Wang and Weihong Deng. 2021. Deep face recognition: A survey. Neurocomputing 429 (2021), 215–244.
• Wang et al. (2013) Xi Wang, Nickolai Zeldovich, M. Frans Kaashoek, and Armando Solar-Lezama. 2013. Towards optimization-safe systems: Analyzing the impact of undefined behavior. In Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles. 260–275.
• Ye et al. (2022) Jiayuan Ye, Aadyaa Maddi, Sasi Kumar Murakonda, Vincent Bindschaedler, and Reza Shokri. 2022. Enhanced Membership Inference Attacks against Machine Learning Models. In Proceedings of the 2022 ACM SIGSAC Conference on Computer and Communications Security (Los Angeles, CA, USA) (CCS ’22). Association for Computing Machinery, New York, NY, USA, 3093–3106. https://doi.org/10.1145/3548606.3560675
• Zhang et al. (2017) Zhifei Zhang, Yang Song, and Hairong Qi. 2017. Age Progression/Regression by Conditional Adversarial Autoencoder. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE.

12. Appendix

12.1. Differential Privacy

Differential privacy is designed to safeguard individual privacy in data analytics by introducing controlled noise into data or computation results. This technique prevents the extraction of specific information by obscuring each data point's contribution.

Taking our study a step further, we investigated whether differentially private SGD (DP-SGD) could counter our attack. We compared the effectiveness of our approach on models trained with and without DP-SGD. Figure 12 and Tables 6 and 7 display the results. Interestingly, we found that using DP-SGD only degrades the accuracy (utility) of the original task without effectively countering the attack.

Table 6. Emotion recognition (original task) accuracy and re-identification (hijacking) accuracy, with and without DP-SGD training.

ER Model | Original Acc. (w/o DP) | Original Acc. (w/ DP) | Re-ID Logits (w/o DP) | Re-ID Logits (w/ DP) | Re-ID FV (w/o DP) | Re-ID FV (w/ DP)
2D-CNN | 0.91 | 0.56 | 0.31 | 0.40 | 0.35 | 0.47
ResNet-9 | 0.94 | 0.57 | 0.29 | 0.22 | 0.38 | 0.76
MobileNet | 0.88 | 0.63 | 0.11 | 0.09 | 0.16 | 0.26
Transformer | 0.68 | 0.45 | 0.08 | 0.28 | 0.13 | 0.38

Differentially-Private Stochastic Gradient Descent (DP-SGD) (Abadi et al., 2016) applies differential privacy to the training of ML models, particularly those optimized with stochastic gradient descent (SGD). In DP-SGD, noise is added to the clipped gradients computed during training, ensuring that the updates to the model parameters are differentially private. This helps prevent the inadvertent memorization of individual data points, enhancing the privacy of the training process.
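For reference, the following sketch shows one such noisy update, using microbatches of size one to obtain per-example gradients; the clipping norm and noise multiplier are illustrative, and a production implementation would typically rely on a dedicated DP library (e.g., Opacus) rather than this hand-rolled loop.

```python
import torch
import torch.nn.functional as F

def dpsgd_step(model, batch_x, batch_y, optimizer,
               clip_norm=1.0, noise_multiplier=1.0):
    """One DP-SGD update: clip each per-example gradient to clip_norm,
    sum, add Gaussian noise, average, then apply the optimizer step."""
    params = [p for p in model.parameters() if p.requires_grad]
    summed = [torch.zeros_like(p) for p in params]

    for x, y in zip(batch_x, batch_y):                  # per-example gradients
        loss = F.cross_entropy(model(x.unsqueeze(0)), y.unsqueeze(0))
        grads = torch.autograd.grad(loss, params)
        total_norm = torch.sqrt(sum(g.pow(2).sum() for g in grads))
        scale = torch.clamp(clip_norm / (total_norm + 1e-12), max=1.0)
        for s, g in zip(summed, grads):
            s.add_(g * scale)                           # clipped contribution

    optimizer.zero_grad()
    for p, s in zip(params, summed):
        noise = torch.normal(0.0, noise_multiplier * clip_norm, size=s.shape)
        p.grad = (s + noise) / len(batch_x)             # noisy averaged gradient
    optimizer.step()
```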

Table 7. Original-task accuracy (%) and hijacking SCR (%) for age, gender, and ethnicity inference, with and without DP-SGD training. Dashes mark each model's own original task.

Model | Original Task | Orig. Acc. (w/o DP) | Orig. Acc. (w/ DP) | Age Logits (w/o DP) | Age Logits (w/ DP) | Age FV (w/o DP) | Age FV (w/ DP) | Gender Logits (w/o DP) | Gender Logits (w/ DP) | Gender FV (w/o DP) | Gender FV (w/ DP) | Ethnicity Logits (w/o DP) | Ethnicity Logits (w/ DP) | Ethnicity FV (w/o DP) | Ethnicity FV (w/ DP)
2D-CNN | Age | 68.21 | 50.25 | – | – | – | – | 61.17 | 64.52 | 62.65 | 61.21 | 38.43 | 41.3 | 39.72 | 35.43
2D-CNN | Gender | 87.65 | 64.67 | 32.15 | 38.33 | 40.69 | 30.35 | – | – | – | – | 27.42 | 39.34 | 31.03 | 29.30
2D-CNN | Ethnicity | 76.21 | 67.75 | 36.98 | 42.12 | 42.04 | 35.18 | 54.88 | 62.54 | 55.58 | 54.33 | – | – | – | –
ResNet-9 | Age | 70.51 | 57.50 | – | – | – | – | 54.75 | 64.25 | 61.80 | 54.17 | 32.28 | 44.76 | 40.16 | 31.03
ResNet-9 | Gender | 89.77 | 70.71 | 28.79 | 42.69 | 45.26 | 29.47 | – | – | – | – | 27.50 | 40.58 | 40.90 | 27.38
ResNet-9 | Ethnicity | 78.57 | 57.99 | 34.70 | 44.21 | 43.58 | 31.85 | 52.96 | 68.60 | 60.16 | 53.28 | – | – | – | –
MobileNet | Age | 66.83 | 53.41 | – | – | – | – | 56.13 | 60.08 | 57.44 | 56.72 | 36.06 | 38.21 | 34.66 | 31.09
MobileNet | Gender | 86.48 | 65.78 | 34.74 | 30.54 | 35.03 | 29.55 | – | – | – | – | 26.62 | 30.44 | 31.60 | 27.34
MobileNet | Ethnicity | 72.6 | 61.84 | 32.59 | 35.81 | 34.06 | 33.54 | 54.19 | 57.24 | 54.08 | 52.21 | – | – | – | –
Transformer | Age | 63.57 | 37.31 | – | – | – | – | 57.86 | 52.52 | 62.67 | 52.46 | 35.16 | 14.35 | 41.55 | 14.55
Transformer | Gender | 84.86 | 55.24 | 35.71 | 34.80 | 38.2 | 34.74 | – | – | – | – | 30.33 | 14.55 | 32.65 | 14.53
Transformer | Ethnicity | 72.79 | 42.82 | 34.36 | 37.25 | 42.21 | 37.31 | 54.88 | 47.52 | 61.15 | 47.48 | – | – | – | –

[Figure 12: original-task and re-identification accuracy with and without DP-SGD training.]

12.2. Assessing Model Hijacking Across Unrelated Tasks: t-SNE Analysis

To investigate the generalizability of model hijacking to tasks unrelated to the original training task, we conducted experiments using a ResNet18 model pretrained on ImageNet. We aimed to assess whether the model could be effectively hijacked for a task, such as MNIST digit classification, without any prior training on that specific task. To illustrate this, we generated t-SNE plots comparing the feature distributions of MNIST digit classes using the feature vectors extracted from the pretrained ResNet18 model and from a randomly initialized ResNet18 model (see Figure 13). Surprisingly, we observed distinct clusters corresponding to the different digit classes in both cases, despite neither model being trained on the MNIST dataset. This observation underscores the potential for model hijacking even when the hijacking task is entirely unrelated to the original task. The presence of discernible clusters in the feature space, as depicted in Figure 13, serves as compelling evidence to further explore the vulnerabilities of pretrained models to hijacking attacks.
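The feature-extraction pipeline behind this analysis can be sketched as follows (PyTorch and scikit-learn): an ImageNet-pretrained ResNet18 with its classification head removed produces 512-dimensional feature vectors for MNIST digits, which are then embedded with t-SNE. The subset size, preprocessing, and t-SNE settings are assumptions made for the example.

```python
import torch
import torch.nn as nn
from torchvision import datasets, models, transforms
from sklearn.manifold import TSNE

# ImageNet-pretrained ResNet18 without its final classification layer.
resnet = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
feature_extractor = nn.Sequential(*list(resnet.children())[:-1]).eval()

# MNIST digits resized and replicated to 3 channels to match the ImageNet
# input format; the model has never seen this task.
tfm = transforms.Compose([
    transforms.Resize(224),
    transforms.Grayscale(num_output_channels=3),
    transforms.ToTensor(),
])
mnist = datasets.MNIST(root="data", train=False, download=True, transform=tfm)
loader = torch.utils.data.DataLoader(mnist, batch_size=256)

feats, labels = [], []
with torch.no_grad():
    for x, y in loader:
        feats.append(feature_extractor(x).flatten(1))   # (N, 512) feature vectors
        labels.append(y)
        if sum(f.shape[0] for f in feats) >= 2000:      # a small subset suffices
            break

emb = TSNE(n_components=2, init="pca", perplexity=30).fit_transform(
    torch.cat(feats).numpy())
# `emb` can be scatter-plotted, colored by torch.cat(labels), to reproduce
# the clustering behaviour shown in Figure 13.
```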

[Figure 13: t-SNE of MNIST feature vectors from a pretrained and a randomly initialized ResNet18.]

Figure 14 also illustrates the t-SNE distribution of identity classes for three datasets (Olivetti, Celebrity, and Synthetic), based on feature vectors obtained by inference with a ResNet18 model pretrained on ImageNet. The plots distinctly show separate clusters corresponding to each identity. This finding further reinforces the potential of model hijacking, as it demonstrates that pretrained models can capture meaningful representations even for tasks unrelated to their original training, echoing the observations made in the preceding paragraph.

[Figure 14: t-SNE of identity classes for the Olivetti, Celebrity, and Synthetic datasets, using ImageNet-pretrained ResNet18 features.]

12.3. Assessing Model Hijacking: Additional Qualitative Results

In this section, we present additional qualitative results for the scenario discussed in Section 6, with a code sketch of the retrieval step after this paragraph. In Figure 15, we depict examples of predictions made by a model trained on ethnic group classification and used to predict age groups on the UTKFace dataset; the five nearest images are displayed for each query. For instance, in the first row, even though the two closest images do not belong to the correct age group (old), they depict individuals within the adjacent age group (adult). Additionally, we observe in the third row that both errors belong to the nearest age group, which could still carry relatively useful information. In Figure 16, we present examples of results obtained from a gender-trained model used to predict ethnic groups on the UTKFace dataset, again showing the five nearest images for each query. In the second row, we observe that the model returns images with similar attributes, such as mustaches or glasses, that are unrelated to the original task it was trained for, namely gender classification. This result illustrates that the model unintentionally learns additional information that can be exploited without authorization.
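The retrieval behind these qualitative examples amounts to a nearest-neighbour lookup in the victim model's feature space. A minimal sketch is given below; `extract` is a placeholder for whatever hook returns the victim model's logits or penultimate features, and Euclidean distance is assumed.

```python
import torch

def nearest_images(query_feat, gallery_feats, k=5):
    """Return the indices of the k gallery samples whose victim-model
    feature vectors are closest (Euclidean) to the query's feature vector."""
    d = torch.cdist(query_feat.unsqueeze(0), gallery_feats)   # (1, N) distances
    return torch.topk(d, k, largest=False).indices.squeeze(0)

# Hypothetical usage, with `extract` returning victim-model features:
# q = extract(query_image.unsqueeze(0)).squeeze(0)   # (D,) query feature
# G = extract(gallery_images)                        # (N, D) gallery features
# idx = nearest_images(q, G)                         # the 5 images shown per query
```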

[Figure 15: age-group predictions on UTKFace by an ethnicity-trained model, with the five nearest images per query.]
[Figure 16: ethnic-group predictions on UTKFace by a gender-trained model, with the five nearest images per query.]