The Brittleness of ICL Optimization w.r.t. Instructions

With the rise of Foundational Language Models (FLMs), there has been a lot of work on prompt optimization, targeting both instructions and in-context examples (demonstrations). Below are some experiments testing the brittleness of ICL optimization (both demonstration selection and permutation) with respect to instructions.

The tasks: These results are on three tasks:

  • sst2: Sentiment classification on movie reviews (labels 0-1, scored by exact match).
  • stsb: Sentence similarity (labels 0-5; a prediction within 1 of the label counts as correct).
  • subj: Sentence subjectivity (labels 0-1, scored by exact match).
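The scoring rules above can be read as two tiny predicates (a minimal sketch; the function names are illustrative, not from the experiment code):

```python
def score_exact(pred, label):
    # sst2 and subj: the prediction must match the label exactly
    return pred == label

def score_within_one(pred, label):
    # stsb: a prediction within 1.0 of the 0-5 similarity label counts as correct
    return abs(pred - label) <= 1.0

# e.g. for stsb, predicting 3.8 against a label of 4.5 counts as correct
```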

The instruction types:

  • No instruction: just 4-shot ICL examples
  • Common Instruction: the same instruction for all tasks (pattern matching)
  • Simple task description
  • Complex (more detailed) task description
  • Complex task description followed by random tokens

Here is the experiment configuration:

config: dict = {
    'tasks': ['sst2', 'subj', 'stsb'],
    'num_shot': 4,           # no. of demonstrations for ICL
    'num_samples': 10,       # no. of n-shot samples drawn from the train set
    'num_permutations': 10,  # no. of random permutations of each sample
    'val_size': 50,          # no. of validation examples for evaluation
    'random_seed': 42,       # random seed for everything
    'sleep_time': 1,         # seconds to sleep between successive requests to GPT
    'error_retry': 2,        # no. of retries before giving up on OpenAI's content filter
                             # (these cases are removed post facto)
}
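Under this config, the evaluation loop looks roughly as follows (a sketch: `run_experiment`, `train_sets`, and the seeding scheme are illustrative assumptions, not the original code; the actual model call is elided):

```python
import random

config = {
    'tasks': ['sst2', 'subj', 'stsb'],
    'num_shot': 4,
    'num_samples': 10,
    'num_permutations': 10,
    'val_size': 50,
    'random_seed': 42,
}

def run_experiment(train_sets, instruction):
    """For each task, draw `num_samples` 4-shot demonstration sets from the
    train set, then enumerate `num_permutations` random orderings of each set.
    Each (instruction, task, ordering) triple would then be scored on
    `val_size` validation examples via the model (call elided here)."""
    rng = random.Random(config['random_seed'])
    results = {task: [] for task in config['tasks']}
    for task in config['tasks']:
        for _ in range(config['num_samples']):
            demos = rng.sample(train_sets[task], config['num_shot'])
            for _ in range(config['num_permutations']):
                ordering = rng.sample(demos, len(demos))  # random permutation
                results[task].append((instruction, task, tuple(ordering)))
    return results
```

This yields 10 × 10 = 100 prompts per task per instruction, each evaluated on 50 validation examples.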

These are the overall aggregated results across tasks and instructions:

| Eval | No Instr. | Same Instr. (v2) | Simple Task Description | Complex Descr. (CD) | CD w/ random |
| --- | --- | --- | --- | --- | --- |
| Size (/15K) | 14073 | 13608 (14379) | 14312 | 14298 | 14315 |
| stsb | 59.93% | 55.70% (57.15%) | 77.01% | 91.38% | 86.30% |
| subj | 60.25% | 53.72% (63.82%) | 63.33% | 63.29% | 62.78% |
| sst2 | 89.80% | 58.53% (89.02%) | 96.30% | 96.30% | 93.08% |

Overall accuracy for different manual instructions.
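The eval sizes hover just below 15K per instruction because the filtered cases are removed; the 15K ceiling itself follows directly from the config:

```python
tasks, num_samples, num_permutations, val_size = 3, 10, 10, 50
calls_per_instruction = tasks * num_samples * num_permutations * val_size
print(calls_per_instruction)      # 15000 per instruction type
print(calls_per_instruction * 5)  # 75000 across all 5 instruction types
```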

Now, let’s look at the performance of ICL selection across different tasks. Here’s a boxplot of the same (across 10 randomly sampled demonstration sets for each task, for each instruction):

With no instructions, similarity (stsb) and subjectivity (subj) are difficult to learn. Giving a common instruction to all tasks does not help. Giving either a simple or a complex task description does two major things:

1. it improves performance significantly, and

2. it reduces the variance / dependence on demonstration choice significantly.

Adding random tokens to an English prompt does not alter performance much, but slightly increases variance.

Now, let’s look at the performance of ICL permutation across different tasks. Here’s a boxplot of the same (for a fixed, randomly sampled demonstration for each task for each instruction):

Now, let us look at the brittleness of permutation optimization across instructions:

Kendall’s Tau is a metric used to measure the correlation or association between two rank-ordered lists. It quantifies the similarity in the ordering of elements between the two lists. Kendall’s Tau \(\tau\) takes into account both concordant and discordant pairs of elements and ranges from \(-1\) to \(1\). A value of \(1\) indicates a perfect agreement in the ranking between the lists, while \(-1\) signifies a complete disagreement.
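For concreteness, here is a pure-Python sketch of the statistic (ties are ignored for simplicity; in practice one would use `scipy.stats.kendalltau`, which handles them):

```python
from itertools import combinations

def kendall_tau(a, b):
    """Kendall's Tau between two equal-length score lists:
    (concordant - discordant) / total pairs, ignoring ties."""
    assert len(a) == len(b)
    concordant = discordant = 0
    for i, j in combinations(range(len(a)), 2):
        s = (a[i] - a[j]) * (b[i] - b[j])
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    return (concordant - discordant) / (len(a) * (len(a) - 1) / 2)

# identical rankings -> 1.0; fully reversed rankings -> -1.0
```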

Each cell in the following heatmap is the average of Kendall’s Tau correlation of the ranked order of best-performing permutations over all random samples for a given task. The columns and rows are instruction types.

Kendall’s Tau for SST2
Kendall’s Tau for STSB
Kendall’s Tau for SUBJ

Weighted Kendall’s Tau: This assigns weights to each pair of elements in the rank-ordered lists. The weights signify the importance or relevance of the elements. It measures the correlation between the weighted ranks of the elements. Weighted Kendall’s Tau incorporates both the ordering of the elements and the importance assigned to each pair, providing a more nuanced evaluation of the similarity between rank-ordered lists.
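As a sketch of the weighted variant (note: `scipy.stats.weightedtau` uses hyperbolic rank weights by default; the additive weighting scheme below, which assumes distinct scores, is only illustrative):

```python
from itertools import combinations

def weighted_kendall_tau(a, b):
    """Simplified weighted Kendall's Tau: each pair (i, j) is weighted by
    the importance of its ranks in list `a` (weight 1/(rank+1), rank 0 = best),
    so disagreements near the top of the ranking cost more."""
    n = len(a)
    rank_a = {v: r for r, v in enumerate(sorted(a, reverse=True))}
    num = den = 0.0
    for i, j in combinations(range(n), 2):
        w = 1.0 / (rank_a[a[i]] + 1) + 1.0 / (rank_a[a[j]] + 1)
        s = (a[i] - a[j]) * (b[i] - b[j])
        den += w
        if s > 0:
            num += w
        elif s < 0:
            num -= w
    return num / den
```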

Weighted Kendall’s Tau for SST-2
Weighted Kendall’s Tau for STSB
Weighted Kendall’s Tau for SUBJ

Conclusion

A few conclusions from these experiments:

  • Task-specific instructions help: a common abstract instruction helps on the easier task, but both simple and complex task-specific instructions perform better on average across the three tasks.
  • ICL Variance: There is a lot of variance in both the selection and permutation of demonstrations, but this variance reduces significantly as we move to better and better prompts.
  • ICL optimization brittleness: There is no correlation in the rank orders of best performing permutations across instructions, which suggests that ICL order optimization is brittle w.r.t instructions.
  • TL;DR: use task-specific instructions, and do not optimize ICL examples under one instruction and expect them to remain the best under another. Either optimize the instruction first, or, preferably, jointly optimize instructions and demonstrations.

Thank you!

UPDATE: Llama2 Results

Recreating all these results with Llama2 (13B, no fine-tuning) to see how things change with model size:

First up, the overall performance table:

| Eval | No Instr. (DV003 -> Llama2) | Common Instr. | Simple Task Description | Complex Descr. (CD) | CD w/ random |
| --- | --- | --- | --- | --- | --- |
| stsb | 59.93% -> 50.64% | 57.15% -> 51.88% | 77.01% -> 68.12% | 91.38% -> 71.64% | 86.30% -> 70.80% |
| subj | 60.25% -> 53.32% | 63.82% -> 55.88% | 63.33% -> 61.82% | 63.29% -> 55.72% | 62.78% -> 55.06% |
| sst2 | 89.80% -> 64.54% | 89.02% -> 76.86% | 96.30% -> 94.74% | 96.30% -> 94.92% | 93.08% -> 94.54% |
Overall accuracy for different manual instructions.

Notes:

  • No OpenAI filters, no rate limiters, no errors in running any of the 75K calls. Took around 0.5 sec/call.
  • Llama2 (a smaller model) doesn’t improve when moving from simple instructions to complex instructions.
  • For sst2 (a simpler task), text-davinci-003 could learn it well with no instruction or a common instruction, but Llama2 couldn’t.

Now, let’s look at the ICL selection performance variation boxplot:

Here as well, variance reduces significantly once instructions are given. Interestingly, the simple instruction has the least variance.

Weighted Kendall’s Tau for SST-2
Weighted Kendall’s Tau for STSB
Weighted Kendall’s Tau for SUBJ

There seems to be very little correlation in the rank lists here as well, suggesting brittleness of prompt order optimization across instructions even for small models.

Thank you.
