This is a bunch of things that I’ve been thinking about for a while. All of this should apply to any parametric model that learns through gradient-based updates. This includes MLPs, CNNs, GPTs, etc.
What are Adversarial Attacks?
Adversarial attacks refer to a class of techniques where intentionally crafted inputs (or intentional updates to parts of an input) are used to deceive or manipulate parametric neural network models. These attacks exploit the vulnerabilities of these models to slight perturbations in input data. Here’s how one of the simplest adversarial attacks works:
Let’s consider a model parametrized by \(\theta\) which, given an input \(x\), outputs \(p(y|x; \theta)\), the probability of output \(y\) given input \(x\) and parameters \(\theta\). The model is trained by gradient-based updates on the parameters using the usual cross-entropy loss.
Now, we introduce an adversarial perturbation \(\epsilon\), which is added to the original input data \(x\). The perturbed input data is represented by \(x_{adv} = x + \epsilon\). The goal of the attack is to generate an adversarial example that is misclassified by the model. The adversarial perturbation \(\epsilon\) is calculated as the sign of the gradient of the loss function \(J(\theta, x, y)\) with respect to the input data \(x\):
\[\epsilon = \mathrm{sign}(\nabla_x J(\theta, x, y))\]
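By perturbing the input in the direction that locally increases the loss the most, the attack pushes up the model’s prediction error and can thus induce misclassification. This is the fast gradient sign method (FGSM); in practice the sign is multiplied by a small step size so that the change to the input stays hard to notice.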

For example, let’s say we have a neural network model that classifies images into “cat” or “dog” classes. The attack computes the gradient of the loss with respect to the input image and then nudges each pixel by the same small amount in the direction given by the sign of that gradient. The result is an adversarial image that looks almost unchanged but can fool the model into misclassifying it.
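The attack described above can be sketched in a few lines. This is only a toy illustration, not a definitive implementation: the "model" is a two-feature logistic classifier so that the input gradient can be written by hand (for cross-entropy it is \((p - y)\,w\)), and the parameter values, the input, and the `step` scale factor are all made-up assumptions.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(theta, x):
    """p(y=1 | x; theta) for a logistic model, with theta = (weights, bias)."""
    w, b = theta
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

def loss(theta, x, y):
    """Cross-entropy J(theta, x, y) for one example with label y in {0, 1}."""
    p = predict(theta, x)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def grad_x(theta, x, y):
    """Gradient of J with respect to the input x; for this model it is (p - y) * w."""
    w, _ = theta
    p = predict(theta, x)
    return [(p - y) * wi for wi in w]

def sign(v):
    return (v > 0) - (v < 0)

def fgsm(theta, x, y, step=0.5):
    """x_adv = x + step * sign(grad_x J). `step` is an assumed scale factor;
    step=1.0 gives the formula exactly as written in the text."""
    g = grad_x(theta, x, y)
    return [xi + step * sign(gi) for xi, gi in zip(x, g)]

theta = ([2.0, -1.0], 0.0)   # hypothetical "trained" parameters
x, y = [0.3, 0.1], 1         # an input whose true label is class 1
x_adv = fgsm(theta, x, y)

print("clean p(y=1):", predict(theta, x))
print("adversarial p(y=1):", predict(theta, x_adv))
```

With these made-up numbers the clean input is classified as class 1 and the perturbed one is not. For a real network the only change is that `grad_x` would come from automatic differentiation instead of a closed form.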
These “perturbations” can also be restricted to small patches of the input, and they can be crafted against several models at once. In fact, adversarial attacks built for one model are often found to generalize/transfer well to other models. This could be because these models learn similar features and circuits, especially if they are trained on similar data distributions.
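To make the transfer claim concrete, here is a hedged toy sketch: the perturbation is computed using only a "source" model and then evaluated on a separate "target" model. Both are hand-picked logistic classifiers with similar (but not identical) weights, standing in for two networks trained on similar data. All names and numbers are assumptions chosen for illustration, so this shows the mechanics of transfer rather than evidence that it happens in general.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(theta, x):
    """p(y=1 | x; theta) for a logistic model, with theta = (weights, bias)."""
    w, b = theta
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

def fgsm(theta, x, y, step):
    """Sign-of-gradient perturbation, computed from `theta` only."""
    w, _ = theta
    p = predict(theta, x)
    grad = [(p - y) * wi for wi in w]           # dJ/dx for cross-entropy
    return [xi + step * ((g > 0) - (g < 0)) for xi, g in zip(x, grad)]

source_model = ([2.0, -1.0], 0.0)   # hypothetical model the attacker can access
target_model = ([1.5, -1.2], 0.1)   # hypothetical model the attacker never sees

x, y = [0.3, 0.1], 1
x_adv = fgsm(source_model, x, y, step=0.5)

# Both models classify the clean input as class 1; check them on x_adv.
fooled_source = predict(source_model, x_adv) < 0.5
fooled_target = predict(target_model, x_adv) < 0.5
print(fooled_source, fooled_target)
```

The attack transfers here because the two weight vectors point in roughly the same direction, which is the toy analogue of two models having learned similar features.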
Feel free to read more from the large body of literature on adversarial attacks, but for this post, this much introduction should suffice.
Editing foundational language models
Recently, with GPTs performing very well in a lot of tasks, people got interested in understanding and editing foundational language models. This is important because we can’t finetune them every time a fact about the world changes. We might also want to fix some incorrect things the model has learned.
A number of these “model editing” techniques (such as ROME and others) change the activations of one of the hidden layers. Once again, they make gradient-based changes in the direction of the required output change. They don’t talk about this similarity in the papers, but the math is almost exactly the same.
Instead of changing the input (which is much more visually appealing for image data and difficult for discrete tokens), they adversarially update the hidden layer activations. Interestingly, they seem to pick which layers to edit using causal mediation analysis, or causal tracing. But in this light, these methods look just like adversarial attacks that target a hidden layer instead of the input.
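The analogy can be sketched as follows. This is not ROME or any published editing method; it is only a minimal illustration of the idea in the paragraph above, where the same gradient step used in an input attack is applied to a hidden activation vector, moving in the direction that lowers the loss for the desired output. The tiny network, its weights, and the names `edit_hidden`, `lr`, and `steps` are all hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def hidden(W, x):
    """Hidden activations h = tanh(W x)."""
    return [math.tanh(sum(wij * xj for wij, xj in zip(row, x))) for row in W]

def readout(v, c, h):
    """p(y=1 | h): everything after the edited layer (here one logistic unit)."""
    return sigmoid(sum(vi * hi for vi, hi in zip(v, h)) + c)

def edit_hidden(v, c, h, y_target, lr=0.1, steps=10):
    """Gradient descent on the hidden activations (not the input or the weights)
    so that the output moves toward y_target."""
    h = list(h)
    for _ in range(steps):
        p = readout(v, c, h)
        grad_h = [(p - y_target) * vi for vi in v]   # dJ/dh for cross-entropy
        h = [hi - lr * gi for hi, gi in zip(h, grad_h)]
    return h

W = [[1.0, 0.0], [0.0, 1.0]]    # hypothetical first-layer weights
v, c = [1.0, -2.0], 0.0         # hypothetical readout weights and bias
x = [0.2, 0.5]

h = hidden(W, x)
h_edit = edit_hidden(v, c, h, y_target=1)

print("before edit p(y=1):", readout(v, c, h))
print("after edit  p(y=1):", readout(v, c, h_edit))
```

The only differences from the input attack are which variable receives the gradient and the sign of the step: here the update descends the loss for the desired output instead of ascending the loss for the true label.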
Are adversarial attacks the same as model training?
Consider a data distribution \(D\), from which data pairs \((x, y)\) are sampled. A model \(f\) is defined as a (parametrized) function from \(X\) to \(Y\). Ideally, a trained model gives correct outputs for inputs drawn from \(D\), while an untrained model gives essentially random outputs determined by the network initialization. Some questions I keep coming back to:
- If an adversary \(a \in A\) (the set of all adversaries) can successfully fool a subset of the models \(f \in F\) (the set of all models), does it need more patches to generalize to other models?
- Are the circuits learned by an adversarially attacked network similar to the circuits inside a network trained directly on the output perturbations on which the attack was designed?
- Is it possible that the attacks mentioned above overfit to the very small manifold within \(F\) that contains trained models?
- Can we train a randomly initialized model by just injecting adversarial patches? In this case, I guess we should call it “friend help”, and not “adversarial attack”.
- Can we build an adversary for an adversary? Is it a minimax game?
- Can every parametric model trained with gradient updates be broken through gradient updates (to either the inputs or the outputs)?
- Can we prove so, at least for the simple hypothetical of infinite width neural networks?
Answers to some of these questions should be relevant for both model understanding and editing. If you are working on (or want to work on) related problems, or have already done so and feel that I should include your ideas here, please let me know.