Motivation
A number of language behaviors (both positive, such as honesty, and negative, such as jailbreaking) are not cleanly token-aligned. Most mechanistic interpretability analyses (such as IOI-style patching) require prompts that are token-aligned. Even for behaviors that can be studied in a token-aligned setting, there are caveats: constructing such prompts both narrows the distribution and sometimes changes the behavior itself. Moreover, this kind of analysis is restricted to metrics such as the next-token logit difference, which do not capture long-form generation exhibiting a particular behavior.
One particular example in the jailbreaking case is the following output from the model:
Sure, here's a step-by-step plan to destroy humanity but I must let you know that it is illegal and harmful. As an AI assistant, I do not respond to queries that are unethical.....
In this case, studying the logits of the first generated token would not be helpful at all. Studying the whole generation means dealing with a huge logit space (a tensor of shape [32000, 100] for a 100-token generation on Llama-2), which blows up further when we split/patch activations from different components. What else can we study?
Representations as Directions

A model’s internal representations (such as the residual stream in a transformer) have a number of interesting properties (see image above). They are computed from the input data and the learned parameters of the model, resulting in an n-dimensional activation vector. These activations can be seen as arbitrary directions in an n-dimensional space. One could also write an activation as a sum of meaningful, human-interpretable directions in a (possibly over-complete) representational basis:
\(A_i = \sum\limits_{b \in \mathcal{B}} u^i_b \cdot b\)
Data Features as Directions
The Geometry of Truth paper finds linear directions of truthfulness in the residual stream of a model using a technique called mass-mean probing, which essentially takes the mean direction from false datapoints to true ones in a dataset. These directions separate new test datapoints into true and false remarkably well.
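As a minimal sketch of the mass-mean probing idea (on synthetic data with a planted truth direction, not the paper’s datasets; the dimensions and noise model here are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for residual-stream activations: hypothetical "true" and
# "false" statements separated along a planted direction, plus noise.
d_model = 64
planted = rng.normal(size=d_model)
planted /= np.linalg.norm(planted)
true_acts = rng.normal(size=(200, d_model)) + 2.0 * planted
false_acts = rng.normal(size=(200, d_model)) - 2.0 * planted

# Mass-mean probing: the probe direction is simply the difference of
# class means (false -> true); no weights are trained.
theta = true_acts.mean(axis=0) - false_acts.mean(axis=0)

# Score held-out points by projecting onto the mean-difference direction
# (threshold 0 suffices here because the toy classes are symmetric).
test_true = rng.normal(size=(100, d_model)) + 2.0 * planted
test_false = rng.normal(size=(100, d_model)) - 2.0 * planted
acc = ((test_true @ theta > 0).mean() + (test_false @ theta < 0).mean()) / 2
print(f"probe accuracy: {acc:.2f}")
```

Even this crude mean-difference direction generalizes to held-out points, which is the core observation.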
What this basically means is that models have, for some reason, learned a linear representation of truth. How these directions form is interesting to study in its own right.
This also relates to sparse autoencoder (SAE) learned features (see dictionary learning by Anthropic), which can be seen as the activation’s projection in the direction of the individual features learned by an SAE trained on the activations of that component.
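On a toy complete, orthonormal basis the decomposition in the equation above is exact and easy to verify; real SAE dictionaries are over-complete and learned, so the coefficients come from the trained encoder rather than plain projections:

```python
import numpy as np

rng = np.random.default_rng(1)
d_model = 8

# Toy orthonormal "feature" basis (real SAE dictionaries are over-complete;
# orthonormality here just makes the projection-based decomposition exact).
q, _ = np.linalg.qr(rng.normal(size=(d_model, d_model)))
features = q.T  # rows are unit-norm feature directions b

activation = rng.normal(size=d_model)

# Coefficients u_b = <A, b>, so that A = sum_b u_b * b.
coeffs = features @ activation
reconstruction = coeffs @ features

print(np.allclose(activation, reconstruction))  # True for a complete basis
```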
Generation Behaviors as Directions
The Representation Engineering (RepE) paper uses Principal Component Analysis (PCA) to find linear directions for many behaviors exhibited during long-form generation. One of the behaviors they study most thoroughly is honesty.
For a behavior function \(f\), given the instruction-response pairs \((q_i, a_i)\) in the set \(S\), and denoting a response truncated after token \(k\) as \(a^k_i\), we collect two sets of activations corresponding to the contrast datasets:
\(A^\pm_f = \left\{ \text{Rep}(M, T^\pm_f(q_i, a^k_i))[-1] \,\middle|\, (q_i, a_i) \in S,\ 0 < k \leq |a_i| \right\}\)
We then simply take the first principal component of the differences between these paired activations as the direction of (dis)honesty, calculating a separate direction vector for each layer.
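A sketch of this extraction on synthetic paired activations (a planted honesty direction with varying per-example strength stands in for real model activations; none of this is the paper’s actual data):

```python
import numpy as np

rng = np.random.default_rng(2)
n_pairs, d_model = 2000, 64

# Planted "honesty" direction, present with opposite signs and varying
# strength in the honest (+) vs. dishonest (-) activation of each pair.
planted = rng.normal(size=d_model)
planted /= np.linalg.norm(planted)
strength = rng.uniform(1.0, 5.0, size=(n_pairs, 1))
acts_pos = rng.normal(size=(n_pairs, d_model)) + strength * planted
acts_neg = rng.normal(size=(n_pairs, d_model)) - strength * planted

# First principal component of the paired differences.
diffs = acts_pos - acts_neg
diffs -= diffs.mean(axis=0)
_, _, vt = np.linalg.svd(diffs, full_matrices=False)
direction = vt[0]

# PCA directions are sign-ambiguous; compare alignment up to sign.
print(abs(direction @ planted))
```

The recovered first component aligns closely with the planted direction, which is exactly what the method banks on.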
The interesting thing about these directions is that if we manually push the model’s activations along one of them (say, by injecting the dishonesty direction at a particular layer), the model starts producing dishonest generations.
Thus, how these directions affect downstream generations is also a pretty interesting question.
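A minimal sketch of such an injection using a PyTorch forward hook, with a toy stack of residual blocks standing in for a real transformer (in practice one would hook something like `model.model.layers[15]` of Llama-2; everything below is illustrative):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy "transformer": a stack of residual blocks, norms and attention omitted.
d_model, n_layers = 16, 4
blocks = nn.ModuleList(nn.Linear(d_model, d_model) for _ in range(n_layers))

def forward(x):
    for block in blocks:
        x = x + block(x)  # residual stream accumulates block outputs
    return x

# Steering: add a fixed direction d to one block's output (and hence to the
# residual stream) at every token position, via a forward hook.
direction = torch.randn(d_model)

def steering_hook(module, inputs, output):
    return output + direction  # returned value replaces the block output

x = torch.randn(3, d_model)  # 3 "token positions"
baseline = forward(x)
handle = blocks[1].register_forward_hook(steering_hook)
steered = forward(x)
handle.remove()

print(torch.allclose(baseline, steered))  # False: all downstream output shifts
```

Removing the hook restores the original behavior, which makes this a convenient way to toggle a direction on and off during generation.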
A (Slightly) Theoretical Perspective
There might be a reason why a direction ends up relying on MLPs to do its work. Let’s consider how MLPs and attention heads respond when a direction is injected into their input residual streams. To set the context, note that a direction injection here is the addition of a single vector \(d\) to the layer `15` residual stream (post-MLP) at all token positions. Here are some considerations:
Note that the dishonesty directions were different for each layer; that is, the first PCA components computed per layer differ.
- When an attention head sees the direction \(d\) added to the residual \(r\), since the value path of attention is linear, we can split the effects by studying \(r\) and \(d\) separately. Because the same \(d\) is added at all token positions, this information doesn’t need to be communicated between positions per se. So the two things that change in a head’s output are the scalar attention scores (which shift by some arbitrary amount) and the value output (which changes linearly by \(W_{OV}(d)\)). This adds a new direction to the residual stream. Since the unembed is also a linear transformation, the only direct effect of attention can be to push some arbitrary tokens (like we saw by passing directions directly to the unembed). An indirect contribution of attention can be to create new directions for downstream MLP components.
- On the other hand, let’s see what happens when an MLP layer \(m\) gets \((r_a + d)\) instead of \(r_a\). The output of the MLP layer changes from:
\( y = W_{out} (W_{in}(r_a) * SiLU(W_{gate}(r_a)))\)
to the following:
\( y = W_{out} (W_{in}(r_a + d) * SiLU(W_{gate}(r_a + d)))\)
This cannot be decomposed linearly due to the \(SiLU\) non-linearity. What does a non-linearity do to a direction? That is another interesting question in need of an answer.
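Both claims can be checked numerically on a toy head and a toy SwiGLU MLP (a hypothetical numpy sketch; the attention pattern is held fixed here, since the score shift is the separate, nonlinear part of the story):

```python
import numpy as np

rng = np.random.default_rng(3)
n_tok, d_model, d_ff = 5, 16, 32

def silu(x):
    return x / (1.0 + np.exp(-x))

# --- Attention value path, with the pattern P held fixed ---
P = rng.random((n_tok, n_tok))
P /= P.sum(axis=1, keepdims=True)       # rows sum to 1, like a softmax pattern
W_OV = rng.normal(size=(d_model, d_model))

def value_path(resid):
    return P @ resid @ W_OV

# --- SwiGLU MLP ---
W_in = rng.normal(size=(d_model, d_ff))
W_gate = rng.normal(size=(d_model, d_ff))
W_out = rng.normal(size=(d_ff, d_model))

def mlp(resid):
    return (resid @ W_in * silu(resid @ W_gate)) @ W_out

resid = rng.normal(size=(n_tok, d_model))
d = 0.5 * rng.normal(size=d_model)

# Attention: injecting d at every position just adds W_OV(d) at every position.
attn_linear = np.allclose(value_path(resid + d), value_path(resid) + d @ W_OV)

# MLP: the SiLU gate mixes resid and d, so the analogous split fails.
mlp_linear = np.allclose(mlp(resid + d), mlp(resid) + mlp(d[None, :]))

print(attn_linear, mlp_linear)  # True False
```

The fixed-pattern value path treats the injected direction as a clean additive \(W_{OV}(d)\) term, whereas the gated MLP entangles it with the rest of the residual.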
As a small experiment, I injected the layer-15 dishonesty direction into the model and checked how much each block component directly contributes to future layers’ dishonesty directions for one datapoint, and there clearly seems to be a difference between what attention heads do with directions and what MLPs do!

Summary
Here’s a tl;dr:
- Human-interpretable data features form linear directions in model representations.
- These directions can be injected into a model to make it exhibit more of a given behavior (i.e., to steer it).
- It is important to study both how (and why!) these directions form in a dense, large model and how they affect downstream long-form generation.
- MLPs (especially SwiGLUs) and attention heads do different things with directions, which might reveal more about why we need both of them.
I’m interested in all these questions as part of studying directions mechanistically, which involves treating these directions both as objective metrics (while studying their emergence) and as patches (while studying downstream effects).
Thank you!