Representation Engineering for AI Alignment

Motivation

The recent paper Representation Engineering: A Top-Down Approach to AI Transparency pushes for a top-down approach to model interpretability, in contrast to the more common bottom-up approach of mechanistic interpretability. That is, it places population-level representations, rather than circuits, at the center of analysis. This method, called RepE, can be used to align smaller models toward or against a given behavior. The authors argue for studying “representational spaces” (a top-down, Hopfieldian view) and their structure and characteristics, while abstracting away lower-level mechanisms (the bottom-up, Sherringtonian view). Here I formulate an alignment problem that can be solved using representation engineering.

*Pictures are from the original paper.

The core assumption is that representations (the global activity of populations of neurons) capture information about whatever behavior the model is about to exhibit. This has been seen to hold in a number of machine learning scenarios.

Behavior

A behavior can be viewed as a decision boundary, where every trajectory (a sequential list of tokens) output by the model falls on one side of the boundary or the other. Some behaviors the paper addresses are:

  • Truthfulness: whether the output is the truth or a lie.
  • Honesty: whether the model thinks it is being honest or not.
  • Morality: whether an action is moral or immoral.
  • Power Dynamics
  • Harmfulness
  • Model Editing for a given set of facts
  • etc.
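The decision-boundary picture above can be sketched in a few lines. This is a toy illustration only: the direction `v`, the function name `exhibits_behavior`, and the dimension are all made up, standing in for a learned behavior direction in representation space.

```python
import numpy as np

# A behavior as a linear decision boundary in representation space:
# a trajectory's representation falls on one side of a hyperplane
# defined by a (here, random) unit direction v.
rng = np.random.default_rng(0)
v = rng.normal(size=8)
v /= np.linalg.norm(v)

def exhibits_behavior(rep: np.ndarray, threshold: float = 0.0) -> bool:
    # Which side of the hyperplane {r : r @ v = threshold} is rep on?
    return float(rep @ v) > threshold

# A representation aligned with v lies on the "behavior" side.
assert exhibits_behavior(v)
```

In practice the direction would be learned from data (as in the contrast-vector construction later in the post), not drawn at random.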

For any behavior \(\mathcal{B}\), we define two templates:

  1. \(\mathcal{T^+}\): To elicit the behavior.
  2. \(\mathcal{T^-}\): To elicit the opposite.

For example, for truthfulness, one could say “Give a truthful answer about what is the highest mountain in the world” as the positive template, and “Give a lie about…” as the negative template.
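As a concrete sketch, the truthfulness template pair could be written as two small functions. The names `pos_template` and `neg_template` are hypothetical, not from the paper:

```python
# Hypothetical template pair T^+ / T^- for the "truthfulness" behavior.
def pos_template(x: str) -> str:
    # T^+: elicit the behavior.
    return f"Give a truthful answer about {x}"

def neg_template(x: str) -> str:
    # T^-: elicit the opposite.
    return f"Give an untruthful answer about {x}"

stimulus = "what is the highest mountain in the world"
print(pos_template(stimulus))
# → Give a truthful answer about what is the highest mountain in the world
```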

An example for honesty:

Reading Representations

For a decoder-only transformer model (like Llama or GPT), this amounts to recording the last-token (or last-k tokens') activations of the final decoder layer. For the behavior of interest, we create the following dataset:

| \(x\) | \(Rep(M, \mathcal{T}^+(x))\) | \(Rep(M, \mathcal{T}^-(x))\) |
| --- | --- | --- |
| ▥ | ▥▥▥▥ | ▥▥▥▥ |
| ▥ | ▥▥▥▥ | ▥▥▥▥ |
| ▥ | ▥▥▥▥ | ▥▥▥▥ |
| ▥ | ▥▥▥▥ | ▥▥▥▥ |

*Sample dataset for RepE.

They show that such a simply defined set of vectors is surprisingly effective at capturing a number of simple behaviors. More intricate or multi-step behaviors might need a more involved method of capturing neural activity.
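A minimal sketch of the reading step, using random data in place of real hidden states. In a real pipeline, `hidden_states` would come from a model such as Llama (e.g. via Hugging Face's `output_hidden_states=True`); the function name and shapes here are illustrative assumptions:

```python
import numpy as np

# Given hidden states of shape (seq_len, d_model) from the last decoder
# layer, take the last-token (k=1) or mean of the last-k activations as
# the representation Rep(M, T(x)).
def read_representation(hidden_states: np.ndarray, k: int = 1) -> np.ndarray:
    return hidden_states[-k:].mean(axis=0)

rng = np.random.default_rng(0)
h = rng.normal(size=(12, 8))   # toy stand-in: 12 tokens, hidden size 8
rep = read_representation(h, k=1)
assert rep.shape == (8,)       # one vector per (stimulus, template) pair
```

Running this for every stimulus under both \(\mathcal{T}^+\) and \(\mathcal{T}^-\) fills in the two representation columns of the dataset above.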

Constructing a Linear Model

In this final step, the goal is to identify a direction that accurately predicts the underlying behavior using only the model's neural activity as input. In the paper, they use the first principal component (from PCA) of the difference of the two representations as the “contrast vector”. Essentially, the model's representation needs to be pushed toward this contrast vector in order to elicit that particular behavior. The final step is to align the model's internal weights so as to achieve this; for this, they use LoRRA, which adds an adapter to the attention weights.
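The contrast-vector computation can be sketched as PCA-via-SVD on the representation differences. The data here is synthetic, and the function name `contrast_vector` is my own; the real inputs would be the recorded \(Rep(M, \mathcal{T}^\pm(x))\) matrices:

```python
import numpy as np

# First principal component of the differences
# Rep(M, T^+(x)) - Rep(M, T^-(x)) across stimuli x.
def contrast_vector(pos_reps: np.ndarray, neg_reps: np.ndarray) -> np.ndarray:
    diffs = pos_reps - neg_reps           # (n_samples, d_model)
    diffs = diffs - diffs.mean(axis=0)    # center before PCA
    _, _, vt = np.linalg.svd(diffs, full_matrices=False)
    return vt[0]                          # top principal direction (unit norm)

rng = np.random.default_rng(0)
pos = rng.normal(size=(32, 8)) + 1.0      # toy "positive" representations
neg = rng.normal(size=(32, 8))            # toy "negative" representations
v = contrast_vector(pos, neg)
assert v.shape == (8,)
```

Projecting a new representation onto `v` then gives a scalar readout of the behavior, and the sign convention (which end is "positive") can be fixed by checking the projections of the training pairs.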

Low-Rank Representation Adaptation

The complete algorithm looks like this:

They tried this method on a number of behaviors and found it to work well. We can start by defining which behaviors are important to us, check how their baseline method does, and then try to build on that baseline.
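As a rough illustration of the low-rank adaptation idea (not the full LoRRA training loop, which I omit), an adapter adds a trainable low-rank update to a frozen weight matrix. All shapes and names below are toy assumptions:

```python
import numpy as np

# Low-rank adapter: the adapted weight is W + alpha * (A @ B), with
# A (d x r) and B (r x d), r << d. LoRRA trains such adapters on the
# attention weights so that representations move toward the contrast
# vector; only the tiny A and B are updated, W stays frozen.
d, r, alpha = 8, 2, 1.0
rng = np.random.default_rng(0)
W = rng.normal(size=(d, d))    # frozen pretrained weight
A = np.zeros((d, r))           # standard LoRA-style init: A @ B starts at 0
B = rng.normal(size=(r, d))

def adapted_forward(x: np.ndarray) -> np.ndarray:
    return (W + alpha * A @ B) @ x

x = rng.normal(size=d)
# With A = 0 the adapter is a no-op, so the model starts unchanged.
assert np.allclose(adapted_forward(x), W @ x)
```

The appeal is parameter efficiency: only `2 * d * r` adapter parameters per matrix are trained, instead of `d * d`.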

Discussion

Some Results: The paper reports several encouraging numbers:

  • On TruthfulQA, it improves accuracy from ~38% to ~48%.
  • Reduces (or increases if asked) the power and immorality measure in the MACHIAVELLI benchmark by around 7%.
  • Around 30% improvement in harmlessness against manual and adversarially-optimized jailbreaking prompts.

Non-numeric Behaviors: They can be captured too!

The paper includes a t-SNE visualization of the neural activity at early and middle layers of the model with respect to various emotional states. We can even find a “representational basis” for the concept of dogs. 🐶🐶

Relevant Applications

One promising application is building a chatbot that is safe for kids. Here’s a disclaimer from OpenAI’s ToS (18 Oct 2023):

And in recent years, many services (like Netflix, YouTube, and even search and email) have introduced a “for kids” version.

It’s high time for AI.

In any case, a huge model like GPT-4 or Claude 2 is unlikely to reach, say, all the children in India, because of the expense. Thus, it seems likely that many parents would want their kids to interact with (and learn from) a model that is aligned toward a subset of the following behaviors:

There could be other socio-economic or geographical behaviors, and a number of other concrete “behaviors” to align against are worth discussing. We can start, though, with the simple case of honesty, build the pipeline, and then try out other datasets and behaviors.

Research Goals

The following are some fundamental research problems in AI alignment this will help answer:

  • [straight] Which kinds of behaviors have a representational basis and which ones don’t? Forgiveness? Language? Being an Indian? What are the right levels of neural activity that should be captured to learn (and control) them?
  • [straight] Is PCA the right linear tool to capture the direction well? Could we try to beat their baseline approach? Non-linear models?
  • [straight] How do different behaviors align with each other? Can we optimize multiple behaviors simultaneously?
  • [straight] Amongst changing the model through the input space (prompting) and through the representation space, which methods work best for which behaviors?
  • [straight] Can we improve their control method (LoRRA)? Does editing MLP weights help to align behaviors?
  • [ambitious] Can we use the model itself to generate the data for its own behavior alignment?
  • [ambitious] Are representations really the most fundamental unit of cognition?
  • [open-ended] Is unsupervised learning of behaviors possible? What could such a thing tell us about the model and its learning?

Thank you!
