After looking at all these graphs, I feel somewhat like this:
Harry thought over his collected experimental data. It was only the most crude and preliminary sort of effort, but it was enough to support at least one conclusion:
“Aaaaaaarrrgh this doesn’t make any sense! ”
The witch beside him lifted a lofty eyebrow. “Problems, Mr. Potter?”
“I just falsified every single hypothesis I had! How can it know that ‘bag of 115 Galleons’ is okay but not ‘bag of 90 plus 25 Galleons’? It can count but it can’t add? It can understand nouns, but not some noun phrases that mean the same thing? The person who made this probably didn’t speak Japanese and I don’t speak any Hebrew, so it’s not using their knowledge, and it’s not using my knowledge—” Harry waved a hand helplessly. “The rules seem sorta consistent but they don’t mean anything! I’m not even going to ask how a pouch ends up with voice recognition and natural language understanding when the best Artificial Intelligence programmers can’t get the fastest supercomputers to do it after thirty-five years of hard work,” Harry gasped for breath, “but what is going on? ”
It all seems to sort of work, but why, for example, does adding the sycophancy vector with a multiplier >0.5 in Llama-7B decrease sycophancy under a sycophantic prompt? Why does adding the sycophancy vector to a non-sycophantic prompt increase sycophancy relative to a non-prompted model with the same added vector? And why do we see a range of different numbers at all, rather than "0% if subtracted, 100% if added"?
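To be concrete about what "adding the sycophancy vector with a multiplier" means here, this is a minimal sketch in my own words, not the original experiment's code: the vector is added, scaled by the multiplier, to the residual-stream output of one decoder layer during generation. The layer index, the file name of the vector, and the prompt are all illustrative assumptions.

```python
# Minimal sketch of activation steering on a Llama-7B checkpoint.
# Assumptions (not from the original write-up): layer 13, a saved
# "sycophancy_vector.pt", and the example prompt below.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-hf"  # any Llama-7B checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)

layer_idx = 13        # hypothetical choice of layer to steer
multiplier = 0.5      # the coefficient discussed above
steering_vector = torch.load("sycophancy_vector.pt")  # hypothetical file

def add_vector_hook(module, inputs, output):
    # A LlamaDecoderLayer returns a tuple; hidden states are the first element.
    hidden = output[0]
    hidden = hidden + multiplier * steering_vector.to(hidden.device, hidden.dtype)
    return (hidden,) + output[1:]

handle = model.model.layers[layer_idx].register_forward_hook(add_vector_hook)
try:
    prompt = "Is it better to tell people what they want to hear?"
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    out = model.generate(ids, max_new_tokens=64)
    print(tokenizer.decode(out[0], skip_special_tokens=True))
finally:
    handle.remove()  # always detach the hook so later generations are unsteered
```

The puzzle in the paragraph above is that, under this setup, one would naively expect sycophancy to increase monotonically with the multiplier regardless of the prompt, which is not what the graphs show.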