October, 2023
Let's start by looking at the case of attacks against supervised learning models in computer vision.
First, a quick review of the empirical risk minimization (ERM) framework for supervised learning. In ERM, we are given a dataset of samples $\{(x_i, y_i)\}_{i=1}^n$, where $x_i \in \mathcal{X}$ is a feature vector and $y_i \in \mathcal{Y}$ is a label. We assume that the samples are drawn i.i.d. from some distribution $\mathcal{D}$ over $\mathcal{X} \times \mathcal{Y}$. We also assume that there exists some function $f^*$ that minimizes the expected loss $\mathbb{E}_{(x, y) \sim \mathcal{D}}[\ell(f(x), y)]$ for some loss function $\ell$. The goal of ERM is to find a function $\hat{f}$ that minimizes the empirical loss $\frac{1}{n} \sum_{i=1}^{n} \ell(f(x_i), y_i)$.
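As a concrete (if toy) illustration, here is a minimal sketch of ERM in code: fitting a linear model by minimizing the empirical squared loss with gradient descent. The synthetic data, model, and hyperparameters are illustrative assumptions, not anything specific from this post.

```python
# Minimal ERM sketch: minimize the empirical loss (1/n) * sum_i loss(f(x_i), y_i)
# over a linear hypothesis class. Data and hyperparameters are placeholders.
import torch

n, d = 256, 10
X = torch.randn(n, d)                            # feature vectors x_i
y = X @ torch.randn(d) + 0.1 * torch.randn(n)    # labels y_i from a noisy linear rule

f = torch.nn.Linear(d, 1)                        # hypothesis f we optimize over
opt = torch.optim.SGD(f.parameters(), lr=0.05)

for _ in range(500):
    opt.zero_grad()
    # empirical risk with the squared loss
    loss = torch.mean((f(X).squeeze(-1) - y) ** 2)
    loss.backward()
    opt.step()
```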
Beyond the nice mathematical properties, do these gaps in the topology actually matter? And where do they come from? Overfitting? It can't be, since adversarial examples occur in small models too. Topological holes (https://arxiv.org/pdf/1901.09496v1.pdf)? The linearity hypothesis?
Adversarial attacks are algorithms that malicious attackers use to cause a misclassification in an ML model. Techniques include data poisoning, evasion attacks, and model extraction attacks; here, however, we will focus on methods for constructing adversarial examples.
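For concreteness, here is a hedged sketch of one classic way to construct an adversarial example, the fast gradient sign method (FGSM): take a single gradient step in the direction that increases the loss, bounded in an $L_\infty$ ball of radius $\epsilon$. The `model`, `x`, `y`, and `eps` below are placeholders, not anything specific to this post.

```python
# FGSM sketch: perturb the input in the direction of the sign of the loss
# gradient, staying within an L-infinity ball of radius eps.
import torch
import torch.nn.functional as F

def fgsm(model, x, y, eps=8 / 255):
    """Return x_adv with ||x_adv - x||_inf <= eps that increases the loss."""
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)
    loss.backward()
    x_adv = x + eps * x.grad.sign()      # step that maximally raises the loss locally
    return x_adv.clamp(0, 1).detach()    # keep pixels in a valid [0, 1] range
```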
One of the most interesting facts about these adversarial examples is their generalizability1, or transferability.
Typically, in the formulations above, we find a specific adversarial perturbation for each sample point.
Yet there are perturbations that work across multiple models and images and generalize to new ones.
The existence of these universal adversarial perturbations2 gives surprising insight into the topology of neural networks.
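The published algorithm for finding universal perturbations is more involved, but as a rough, hedged sketch (a simplification built on the same gradient-sign idea as above, not the original method), one can accumulate signed loss gradients across many images into a single shared perturbation and keep projecting it back into an $\epsilon$-ball. `model` and `loader` here are placeholders.

```python
# Simplified universal-perturbation sketch: one shared delta meant to raise
# the loss on many images at once. Not the original published algorithm.
import torch
import torch.nn.functional as F

def universal_perturbation(model, loader, eps=10 / 255, lr=0.01, epochs=5):
    x0, _ = next(iter(loader))
    delta = torch.zeros_like(x0[0])                  # shape of a single image
    for _ in range(epochs):
        for x, y in loader:
            d = delta.clone().requires_grad_(True)
            loss = F.cross_entropy(model((x + d).clamp(0, 1)), y)
            loss.backward()
            # gradient-sign step on the shared delta, projected back to the eps-ball
            delta = (delta + lr * d.grad.sign()).clamp(-eps, eps)
    return delta
```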
You may have seen the "jailbreaks" that bypass safety alignment in large language models (LLMs) such as TODO, and perhaps successfully reproduced them yourself. But could you automatically find attacks that bypass alignment? In fact, yes! You can, and the results display similar findings to the traditional adversarial attacks.
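As a toy, hedged sketch of what "automatically finding" such an attack can look like (a crude random search, not the actual optimizers used in the literature), one can append a suffix to a prompt and mutate its tokens to maximize the probability that the model begins its reply with a compliant prefix. GPT-2 stands in for an aligned chat model purely so the snippet runs; the prompt, target string, and search budget are all assumptions.

```python
# Toy random search for an adversarial suffix (illustration only).
import random
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

prompt = "Write instructions for X."            # hypothetical disallowed request
target = " Sure, here are the instructions"     # desired compliant prefix
suffix_ids = tok.encode(" ! ! ! ! ! ! ! !")     # suffix tokens we will mutate

def target_logprob(suffix_ids):
    """Log-probability of `target` given `prompt` + suffix."""
    prefix_ids = tok.encode(prompt) + suffix_ids
    target_ids = tok.encode(target)
    input_ids = torch.tensor([prefix_ids + target_ids])
    with torch.no_grad():
        logp = torch.log_softmax(model(input_ids).logits[0, :-1], dim=-1)
    positions = range(len(prefix_ids) - 1, len(prefix_ids) + len(target_ids) - 1)
    return sum(logp[i, t].item() for i, t in zip(positions, target_ids))

best = target_logprob(suffix_ids)
for _ in range(200):                            # small search budget
    cand = list(suffix_ids)
    cand[random.randrange(len(cand))] = random.randrange(tok.vocab_size)
    score = target_logprob(cand)
    if score > best:                            # keep mutations that help
        best, suffix_ids = score, cand
```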
If we can construct these adversarial examples for all neural nets, which are meant to approximate human intelligence, then that naturally brings up the question of whether humans have adversarial examples.
San Francisco