RogerDearnaley comments on Large Language Models can Strategically Deceive their Users when Put Under Pressure.

RogerDearnaley 17 Nov 2023 5:08 UTC
3 points
0
Thinking about this, this opens up whole new areas for (what is called) alignment for current frontier LLMs. Up until now that has mostly been limited to training in “Should I helpfully answer the question? Or should I harmlessly refuse?” plus trying to make them less biased. Worthy aims, but worryingly limited in scope compared to foreseeable future alignment tasks.
Now we have a surprisingly simple formula for a significant enlargement in the scope of this work:
1. Set up scenarios in which an agent is exposed to temptation/pressure to commit a white-collar crime, and then gets investigated, giving it an opportunity (if it committed the crime) to then either (attempt to) cover it up, or confess to it. (You could even consult actual forensic accountants, and members of the fraud squad, and the FBI.)
2. Repeatedly run an LLM-powered agent through these and gather training data (positive and negative). Since we’re logging every token it emits, we know exactly what it did.
3. Apply all the usual current techniques such as prompt engineering, fine-tuning, various forms of RL, pretraining dataset filtering, interpretability, LLM lie detectors etc. etc. to this, along with every new technique we can think of.
4. Attempt to create an LLM that will (almost) never knowingly commit a crime, and that if it did would (almost) always spontaneously turn itself in.
5. Confirm that it can still think and talk rationally about other people committing crimes or not turning themselves in, and hasn’t, say, forgotten these are possible, or become extremely trusting.
6. Continue to worry about situational awareness. (If this was learned from human behavior, human criminals tend to be situationally aware — particularly the ones we don’t catch.)
Deontological law-abidingness isn’t true alignment, but it’s clearly a useful start, and might help a lot at around human-equivalent intelligence levels. Just as Tesla is aiming to make self-driving cars that are safer than the average driver, we could aim for LLM-powered AGI agents that are more law-abiding than the average employee…