Motivation
Foundational language models (FLMs), also called large language models (LLMs), are large parametric models trained to autoregressively predict text \(y\) given a context \(x\):
\(P_{\theta}(y | x) = P_{\theta}(y_1 | x) \cdot P_{\theta}(y_2 | y_1, x) \cdots P_{\theta}(y_n | y_1, y_2, \ldots, y_{n-1}, x)\)
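As a concrete illustration of this chain-rule factorization, the sequence probability can be assembled from per-token conditionals (the numbers below are made up):

```python
import math

# Hypothetical per-token conditional probabilities P(y_t | y_<t, x) for a
# three-token continuation; in practice these come from the model's
# softmax output at each step.
token_probs = [0.5, 0.8, 0.9]

# The sequence probability is the product of the conditionals; it is
# usually accumulated in log space for numerical stability.
log_p = sum(math.log(p) for p in token_probs)
seq_prob = math.exp(log_p)  # 0.5 * 0.8 * 0.9 = 0.36
```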
Giving a context to the language model is called “prompting”, and pre-trained large models have been found to perform impressively on a wide range of tasks (see Sparks of AGI) when prompted well. Thus, we wish to learn how to prompt well and understand what kinds of prompts work.
The most straightforward way to do so would be to understand the FLM’s inner workings and thus know what to prompt it with. While there is great work being done on mechanistic interpretability and circuits in FLMs (see Transformer Circuits), we are nowhere near that goal. Thus, the best we can do right now is to try out different kinds of prompts and look for patterns a posteriori. This could, in turn, give us insights into the inner workings of these FLMs. This post summarizes some of the recent research in this direction.
Perhaps the biggest result of prompt experimentation is that giving demonstrative examples in the form of \((x, y)\) tuples improves performance substantially (see Language Models are Few-Shot Learners). While no parameter updates (“learning”) happen inside an FLM, this performance improvement from passing examples “in context” has been called in-context learning (ICL).
Here is a list of some of the important papers and results in ICL:
[2202.12837] [EMNLP 2022] Rethinking the Role of Demonstrations: What Makes In-Context Learning Work?
This study looks at the effectiveness of ICL and how it depends on different aspects of the demonstrations. They explore four key aspects:
- Input-label mapping (M): Whether each input is paired with a correct label.
- Distribution of input text (I): The underlying distribution of the input text samples.
- Label space (L): The space covered by the labels in the demonstrations.
- Format (F): The use of input-label pairing as the format for the demonstrations.

The study designs various experiments to evaluate the impact of each aspect on the performance of the models. They find that the ground truth input-label mapping has little impact on performance gains from ICL. However, other aspects of the demonstrations significantly affect the model’s performance.
Results show that using out-of-distribution (OOD) inputs instead of in-distribution inputs in the demonstrations reduces performance for certain models. The input-label pairing and label space have a considerable impact on performance gains, especially for direct models.
Furthermore, they observe that the format of the demonstrations plays a vital role in retaining performance gains. When using random sentences from a corpus paired with the label set while maintaining the format, the model can retain most of the improvements achieved through ICL.
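As a rough sketch of the kind of ablation the paper runs, here is how one might build demonstration prompts with gold versus randomized labels while keeping the input distribution (I), label space (L), and format (F) fixed; the task, examples, and prompt format below are all hypothetical:

```python
import random

random.seed(0)

# Hypothetical sentiment task: a small labeled pool and its label space.
pool = [("great movie", "Positive"), ("terrible plot", "Negative"),
        ("loved it", "Positive"), ("waste of time", "Negative")]
label_space = ["Positive", "Negative"]

def build_prompt(demos, query):
    """Render demonstrations in an input-label format, then the query."""
    lines = [f"Input: {x}\nLabel: {y}" for x, y in demos]
    lines.append(f"Input: {query}\nLabel:")
    return "\n\n".join(lines)

# Gold demonstrations: correct input-label mapping (M intact).
gold_prompt = build_prompt(pool, "a fun ride")

# Random-label ablation: same inputs, label space, and format,
# but the input-label mapping (M) is broken.
random_labels = [(x, random.choice(label_space)) for x, _ in pool]
ablated_prompt = build_prompt(random_labels, "a fun ride")
```

The paper’s finding is that the model’s accuracy on the query changes surprisingly little between the two prompt variants.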
Notes:
- “Channel” here refers to “noisy channel models”, an interesting way to solve tasks (see 2108.04106). Instead of evaluating \(p(y|x)\) directly, we score \(p(x|y)\), which is proportional to \(p(y|x)\) by Bayes’ rule under some assumptions on \(p(x)\) and \(p(y)\).
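A minimal sketch of direct versus channel scoring, with a lookup table standing in for actual LM log-probability calls (all scores, prompts, and the `lm_logprob` helper are invented for illustration):

```python
# Lookup table standing in for LM log-probability calls; the numbers are
# made up. A real implementation would query the model for
# log P(continuation | context).
fake_scores = {
    ("Review: loved it\nLabel:", " Positive"): -0.2,
    ("Review: loved it\nLabel:", " Negative"): -2.1,
    ("Label: Positive\nReview:", " loved it"): -1.0,
    ("Label: Negative\nReview:", " loved it"): -3.5,
}

def lm_logprob(context, continuation):
    return fake_scores[(context, continuation)]

labels = [" Positive", " Negative"]

# Direct scoring: argmax over labels of log p(y | x).
direct = max(labels, key=lambda y: lm_logprob("Review: loved it\nLabel:", y))

# Channel scoring: argmax over labels of log p(x | y), assuming a
# uniform label prior p(y).
channel = max(labels,
              key=lambda y: lm_logprob(f"Label:{y}\nReview:", " loved it"))
```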
[2303.03846] Larger Language Models Do ICL Differently
One major thing the previous paper misses is the role of “semantic priors” in the model. This paper addresses that; quoting its conclusions:
- We examine the extent to which language models learn in-context by utilizing prior knowledge learned during pre-training versus input-label mappings presented in-context.
- We first showed that large language models can learn to override prior knowledge when presented with enough flipped labels, and that this ability emerges with model scale.
- We then found that successfully doing ICL using semantically-unrelated labels (SUL-ICL) is another emergent ability of model scale.
- Finally, we analyzed instruction-tuned language models and saw that instruction tuning improves the capacity to learn input-label mappings but also strengthens the use of semantic prior knowledge even more.
Notes:
- In the SUL-ICL setup, labels such as “Positive” and “Negative” are replaced with SULs such as “Foo” and “Bar”.
- In instruction tuning, they finetune PaLM on instructions after pretraining it for left-to-right generation and get Flan-PaLM (see this for more).
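A minimal sketch of the SUL-ICL label substitution (the mapping and examples are illustrative):

```python
# Map semantically meaningful labels to unrelated tokens; the mapping and
# examples here are illustrative.
sul_map = {"Positive": "Foo", "Negative": "Bar"}

demos = [("loved it", "Positive"), ("awful", "Negative")]
sul_demos = [(x, sul_map[y]) for x, y in demos]
# The format and input distribution are unchanged; only the label tokens
# lose their semantic connection to the task, so the model must rely on
# the in-context input-label mapping rather than its priors.
```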
[2104.08786] [ACL 2022] Fantastically Ordered Prompts and Where to Find Them: Overcoming Few-Shot Prompt Order Sensitivity
On the order of training examples: in (supervised) machine learning, the order of examples does not matter much for performance (~1%) if one does random minibatch sampling. Curriculum learning suggests performance can be improved somewhat by moving from “easy” to “hard” examples. None of this applies directly to ICL, and we do not know a priori whether order matters. Since the model does not “learn” anything and attends over the entire context at once, the order of examples might as well not matter in ICL. This work tries to answer this question and has several interesting findings:
- There is high variance in the performance of different permutations of the same set of samples, especially in smaller models. The variance persists for larger models for some datasets and is gone for some others.
- Adding training samples does not reduce this variance much (when looking at only \(24\) permutations sampled from \(64\) examples).
- Performant prompts are not transferable across models.
Notes:
- They use a fixed random subset of four samples and consider all \(24\) permutations of them. They should have tried it out with different subsets as well.
- In this situation, it seems important to manually look at these permutations and try to see what it is about some of them that makes them better. This paper does not do that.
- The fact that performant prompts are not transferable is very important. Either ordering is like random seeds (which affect the performance but without any pattern, and thus shouldn’t be optimized over), or it affects the model in ways that are different for different models. This might go against interpretability’s universality hypothesis (see Zoom In) that models learn similar internal features and circuits.
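The experimental setup can be sketched as follows: enumerate all \(4! = 24\) orderings of a fixed set of four demonstrations and measure the spread in scores. The `eval_accuracy` function below is a made-up deterministic stand-in for running the LM on a dev set:

```python
import itertools
import statistics

# A fixed set of four demonstrations (placeholders).
demos = ("d1", "d2", "d3", "d4")

def eval_accuracy(ordering):
    """Made-up deterministic stand-in for dev-set accuracy of a prompt
    built with the demonstrations in this order: earlier positions get
    more weight, so different orderings score differently."""
    weights = {"d1": 0.9, "d2": 0.7, "d3": 0.5, "d4": 0.3}
    return sum(weights[d] / (i + 1) for i, d in enumerate(ordering)) / 2

# Score all 4! = 24 permutations and look at the spread.
scores = [eval_accuracy(p) for p in itertools.permutations(demos)]
spread = max(scores) - min(scores)
mean = statistics.mean(scores)
```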
[2101.06804] [DeeLIO 2022] What Makes Good In-Context Examples for GPT3? They look at a number of ICL tasks using GPT-3 and find that picking demonstrations that are “similar” to the task query improves performance. They pick similar examples using k-nearest neighbors in the embedding space. However, their gains are not significant (~5%), and they test on just one benchmark in each domain. They also test only GPT-3 models, which might not generalize enough.
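A minimal sketch of this selection scheme, with made-up 2-D embeddings standing in for real sentence-encoder embeddings:

```python
import math

# Made-up 2-D embeddings for a training pool and a query; real systems
# would use sentence-encoder embeddings with hundreds of dimensions.
train = {"ex_a": [1.0, 0.0], "ex_b": [0.0, 1.0], "ex_c": [0.95, 0.05]}
query_emb = [1.0, 0.1]

def dist(u, v):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# Pick the k nearest training examples to the query as demonstrations.
k = 2
nearest = sorted(train, key=lambda e: dist(train[e], query_emb))[:k]
```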
[2211.04486] [ACL 2022] Active Example Selection for In-Context Learning
This work formulates active example selection as an MDP (see this) and solves it using Q-learning (see this). The MDP tuple is:
- \(\mathcal{A}\): The set of unlabeled examples. After choosing an action \(x_i\) , we get (or compute, if needed) its label \(y_i\).
- \(\mathcal{S}\): The sequence of examples selected so far, \((x_1, y_1), (x_2, y_2), \ldots, (x_i, y_i)\).
- \(r\): Some scoring function. Here, they use val-set accuracy.
- \(\mathcal{P}\): The transitions are all deterministic.
- \(\gamma\): Set to \(1\) (no discounting).
They generate off-policy data using randomly sampled examples and use it to train the agent offline. To fit in the active example selection setting, they do not use the labels of the previous actions to predict the next one. To stabilize the learning of the RL agent, they create intermediate reward functions, use target networks, and store transitions in a replay buffer.
(Figure: zero-centered ICL accuracy of 30 random sets of 4 demonstration examples; they see a spread of >30% across 4 tasks.)
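A toy version of this formulation, using plain tabular Q-learning instead of their MLP-based agent (the examples, reward table, and hyperparameters below are all made up; the real method learns offline from off-policy data and must generalize across examples):

```python
import random

random.seed(0)

# Toy instance: three candidate examples, a budget of two picks, and a
# made-up reward standing in for validation accuracy of the final prompt.
examples = ["a", "b", "c"]
budget = 2

def reward(state):
    return {("a", "c"): 1.0, ("c", "a"): 1.0}.get(state, 0.1)

def actions(state):
    return [e for e in examples if e not in state]

Q = {}  # tabular Q(state, action); gamma = 1 as in the MDP above
alpha, eps = 0.5, 0.2

for _ in range(500):
    state = ()
    while len(state) < budget:
        acts = actions(state)
        if random.random() < eps:  # epsilon-greedy exploration
            a = random.choice(acts)
        else:
            a = max(acts, key=lambda x: Q.get((state, x), 0.0))
        nxt = state + (a,)
        done = len(nxt) == budget
        r = reward(nxt) if done else 0.0
        future = 0.0 if done else max(Q.get((nxt, b), 0.0)
                                      for b in actions(nxt))
        td_target = r + future  # gamma = 1
        Q[(state, a)] = Q.get((state, a), 0.0) + alpha * (
            td_target - Q.get((state, a), 0.0))
        state = nxt

# Greedy rollout with the learned Q-values recovers a high-reward set.
state = ()
while len(state) < budget:
    state += (max(actions(state), key=lambda x: Q.get((state, x), 0.0)),)
```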

Notes:
- The paper shows good gains on GPT-2, but they say that improvements diminish for larger GPT-3 models (>=Babbage).
- This could point to emergent capabilities of larger models, or simply mean that Q-learning (they use a 3-layer MLP) is not good enough to play with more complex models.
- The problem is that, due to compute constraints, they do not train on GPT-3; they only evaluate the policy trained on GPT-2 using GPT-3. This is problematic because the prompt-ordering paper above found that even performant permutations do not transfer across models, so performant actions might not transfer either. :(
- The problem of active example selection doesn’t directly fit into the RL setting, and algorithms should be tweaked to accommodate these differences:
- The action space is variable, and test-time actions can be different from training (they do regularize Q-learning to partly make up for this; see Conservative Q-learning).
- The environment (task) can change from train to test.
[2305.14264] Active Learning Principles for In-Context Learning with Large Language Models
They compare standard AL algorithms based on uncertainty, diversity, and similarity:
- Random: Simply random sampling. Used as a baseline.
- Diversity: Encode with Sentence-BERT and then perform k-means. The assumption is that a diverse set will provide ICL with complementary information.
- Uncertainty: For a measure of uncertainty, they use perplexity (PPL) / loss of the individual examples.
- Similarity: Same as the paper “What Makes Good In-Context Examples for GPT3?” we discussed above.
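The uncertainty and diversity criteria can be sketched on toy data as follows (the embeddings and losses are made up, and greedy farthest-point selection stands in for the paper’s k-means clustering):

```python
import math

# Toy pool with made-up embeddings and per-example LM losses.
pool = {
    "ex1": {"emb": [1.0, 0.0], "loss": 0.2},
    "ex2": {"emb": [0.0, 1.0], "loss": 1.5},
    "ex3": {"emb": [0.9, 0.2], "loss": 0.4},
    "ex4": {"emb": [0.1, 0.9], "loss": 0.9},
}
k = 2

def dist(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# Uncertainty: pick the highest-perplexity (highest-loss) examples.
uncertainty = sorted(pool, key=lambda e: -pool[e]["loss"])[:k]

# Diversity: greedy farthest-point selection, a simple stand-in for
# k-means over Sentence-BERT embeddings.
diverse = ["ex1"]  # arbitrary seed example
while len(diverse) < k:
    rest = [e for e in pool if e not in diverse]
    best = max(rest, key=lambda e: min(dist(pool[e]["emb"],
                                            pool[d]["emb"]) for d in diverse))
    diverse.append(best)
```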

Notes:
- Similarity is consistently the best performing approach overall, followed by diversity and random.
- They observe that uncertainty sampling underperforms in this setting of in-context learning, whereas it is seen to perform well for the general active learning setting.
Some general notes on ICL:
- One major thing almost all the works on ICL seem to miss is measuring its brittleness with respect to the rest of the prompt. For instance, if I write “Look at all of these examples carefully” before sending the examples to the model, how much do the results of these works change? It might as well be very brittle, in which case it might make more sense to learn the examples jointly with the rest of the prompt, especially since performant ICL examples and orderings do not seem to generalize across models.
- It seems important to know whether we want model generalization or not, because if that is a must, then there seems no point in trying better permutations. They are not seen to generalize amongst models with any pattern whatsoever.
- This work from Anthropic dives deep into the inner workings of ICL from the interpretability perspective and suggests that induction heads are responsible for general ICL in FLMs, and this regularly updated paper is a survey of most of the works in ICL.
Thank you!