With the rise of Foundational Language Models (FLMs), there has been a lot of work on prompt optimization, both for tuning instructions and for choosing in-context examples (demonstrations). Here are some results testing the brittleness of ICL optimization (both selection and permutation) with respect to instructions.
The tasks: These results cover three tasks:
- sst2: Sentiment classification on movie reviews. (0-1, exact match)
- stsb: Sentence similarity (0-5, difference of 1 allowed).
- subj: Sentence subjectivity (0-1, exact match).
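The scoring rules above can be sketched in a few lines (the function names are illustrative, not from the original evaluation code):

```python
def score_exact_match(pred: str, gold: str) -> bool:
    """sst2 / subj: binary label must match exactly (case/whitespace-insensitive)."""
    return pred.strip().lower() == gold.strip().lower()

def score_stsb(pred: float, gold: float, tolerance: float = 1.0) -> bool:
    """stsb: similarity on a 0-5 scale; a difference of up to 1 counts as correct."""
    return abs(pred - gold) <= tolerance
```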
The instruction types:
- No instruction: just 4-shot ICL examples
- Common Instruction: the same instruction for all tasks (pattern matching)
- Simple task description
- Complex (more detailed) task description
- Complex task description followed by random tokens
Here is the experiment configuration:
```python
config: dict = {
    'tasks': ['sst2', 'subj', 'stsb'],
    'num_shot': 4,           # no. of demonstrations for ICL
    'num_samples': 10,       # no. of n-shot samples from the train set
    'num_permutations': 10,  # no. of random permutations of each sample
    'val_size': 50,          # no. of validation examples for evaluation
    'random_seed': 42,       # random seed for everything
    'sleep_time': 1,         # seconds to sleep between successive GPT requests
    'error_retry': 2,        # no. of retries before giving up on OpenAI's content
                             # filter (these cases are removed post facto)
}
```
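Under this config, each task gets `num_samples` demonstration sets times `num_permutations` orderings of each set. A minimal sketch of that sampling procedure (helper and variable names are mine, not from the original code):

```python
import random

# Subset of the experiment config relevant to sampling.
config = {'num_shot': 4, 'num_samples': 10,
          'num_permutations': 10, 'random_seed': 42}

def sample_prompts(train_set, cfg):
    """Draw `num_samples` num_shot-sized demonstration sets from the train
    set, then `num_permutations` random orderings of each set."""
    rng = random.Random(cfg['random_seed'])  # seed everything for reproducibility
    runs = []
    for _ in range(cfg['num_samples']):
        demos = rng.sample(train_set, cfg['num_shot'])
        for _ in range(cfg['num_permutations']):
            perm = demos[:]
            rng.shuffle(perm)
            runs.append(perm)
    return runs

runs = sample_prompts(list(range(100)), config)
# 10 samples x 10 permutations = 100 prompts per task per instruction
```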
These are the overall aggregated results across tasks and instructions:
| Eval | No Instr. | Common Instr. (v2) | Simple Task Description | Complex Descr. (CD) | CD w/ random |
|---|---|---|---|---|---|
| Size (/15K) | 14073 | 13608 (14379) | 14312 | 14298 | 14315 |
| stsb | 59.93% | 55.70% (57.15%) | 77.01% | 91.38% | 86.30% |
| subj | 60.25% | 53.72% (63.82%) | 63.33% | 63.29% | 62.78% |
| sst2 | 89.80% | 58.53% (89.02%) | 96.30% | 96.30% | 93.08% |
Now, let’s look at the performance of ICL demonstration choice across different tasks. Here’s a boxplot (across 10 randomly sampled demonstration sets per task per instruction):
With no instructions, similarity (stsb) and subjectivity (subj) are difficult to learn.
Giving a common instruction to all tasks does not help.
Giving either a simple or a complex task description does two major things:
1. it improves performance significantly
2. it reduces the variance / dependence on demonstration choice significantly
Adding random tokens to an English prompt does not alter performance much, but it increases variance slightly.

Now, let’s look at the performance of ICL permutation across different tasks. Here’s a boxplot (for a fixed, randomly sampled demonstration set per task per instruction):

Now, let us look at the brittleness of permutation optimization across instructions:
Kendall’s Tau is a metric used to measure the correlation or association between two rank-ordered lists. It quantifies the similarity in the ordering of elements between the two lists. Kendall’s Tau \(\tau\) takes into account both concordant and discordant pairs of elements and ranges from \(-1\) to \(1\). A value of \(1\) indicates a perfect agreement in the ranking between the lists, while \(-1\) signifies a complete disagreement.
Each cell in the following heatmap is the average of Kendall’s Tau correlation of the ranked order of best-performing permutations over all random samples for a given task. The columns and rows are instruction types.
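For reference, plain Kendall's Tau can be computed directly from concordant and discordant pairs (a stdlib-only sketch with no tie handling; in practice `scipy.stats.kendalltau` does this):

```python
from itertools import combinations

def kendall_tau(x, y):
    """Kendall's tau for two paired score lists without ties:
    (concordant pairs - discordant pairs) / total pairs."""
    concordant = discordant = 0
    for (xi, yi), (xj, yj) in combinations(zip(x, y), 2):
        s = (xi - xj) * (yi - yj)  # >0 if the pair is ordered the same way in both
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) / 2
    return (concordant - discordant) / n_pairs

# Identical rankings agree perfectly; a full reversal disagrees completely.
print(kendall_tau([1, 2, 3, 4], [10, 20, 30, 40]))  # 1.0
print(kendall_tau([1, 2, 3, 4], [40, 30, 20, 10]))  # -1.0
```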



Weighted Kendall’s Tau: This assigns weights to each pair of elements in the rank-ordered lists. The weights signify the importance or relevance of the elements. It measures the correlation between the weighted ranks of the elements. Weighted Kendall’s Tau incorporates both the ordering of the elements and the importance assigned to each pair, providing a more nuanced evaluation of the similarity between rank-ordered lists.
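A simplified sketch of the weighted variant, with hyperbolic weights taken from the first list's ranks so that disagreements among top-ranked permutations cost more (`scipy.stats.weightedtau` uses a symmetrized version of this idea; this one-sided form is only illustrative):

```python
from itertools import combinations

def weighted_kendall_tau(x, y):
    """Weighted tau sketch: pair (i, j) gets weight
    1/(rank_i + 1) + 1/(rank_j + 1), with ranks from the first list."""
    order = sorted(range(len(x)), key=lambda i: -x[i])  # best score -> rank 0
    rank = {i: r for r, i in enumerate(order)}
    num = den = 0.0
    for i, j in combinations(range(len(x)), 2):
        w = 1.0 / (rank[i] + 1) + 1.0 / (rank[j] + 1)
        s = (x[i] - x[j]) * (y[i] - y[j])
        num += w * ((s > 0) - (s < 0))  # +w concordant, -w discordant, 0 tied
        den += w
    return num / den
```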



Conclusion
A few conclusions from these experiments:
- Task-specific instructions help: a common abstract instruction helps on easier tasks, but both task-specific instruction types (simple and complex) perform better on average across the three tasks.
- ICL Variance: There is a lot of variance in both the selection and permutation of demonstrations, but this variance reduces significantly as we move to better and better prompts.
- ICL optimization brittleness: There is no correlation in the rank orders of best-performing permutations across instructions, which suggests that ICL order optimization is brittle w.r.t. instructions.
- TL;DR: use task-specific instructions, and do not optimize ICL examples under one instruction and expect that order to remain best under any other. Either optimize the instruction first or, preferably, do a joint optimization over instructions and demonstrations.
Thank you!
UPDATE: Llama2 Results
Recreating all these results with Llama2 (13B, no fine-tuning) to see how things change with model size:
First up, the overall performance table:
| Eval | No Instr. (DV003 -> Llama2) | Common Instr. | Simple Task Description | Complex Descr. (CD) | CD w/ random |
|---|---|---|---|---|---|
| stsb | 59.93% -> 50.64% | 57.15% -> 51.88% | 77.01% -> 68.12% | 91.38% -> 71.64% | 86.30% -> 70.80% |
| subj | 60.25% -> 53.32% | 63.82% -> 55.88% | 63.33% -> 61.82% | 63.29% -> 55.72% | 62.78% -> 55.06% |
| sst2 | 89.80% -> 64.54% | 89.02% -> 76.86% | 96.30% -> 94.74% | 96.30% -> 94.92% | 93.08% -> 94.54% |
Notes:
- No OpenAI filters, no rate limiters, no errors in running any of the 75K calls. Took around 0.5 sec/call.
- Llama2 (a smaller model) doesn’t improve in performance when moving from simple instructions to complex instructions.
- For sst2 (a simpler task), text-davinci-003 could learn it very well with no instruction or a common instruction, but Llama2 couldn’t.
Now, let’s look at the ICL choice performance variation boxplot:

Here as well, variance reduces significantly once instructions are given. One interesting thing is that the simple task description has the least variance.



There seems to be very little correlation in the rank lists here as well, suggesting that prompt order optimization is brittle across instructions even for smaller models.
Thank you.