Another point worth adding to the conversation: if you have designed the reward function well enough, then hitting the reward button/getting reward means increasing capabilities, so addiction to the reward source is even more likely than you paint.
This creates problems if there's a large enough zone where reward functions can be specified well enough that getting reward leads to increasing capabilities, but not well enough to specify non-instrumental goals.
The prototypical picture I have of outer-alignment/goal-misspecification failures looks a lot like what happens to human drug addicts, except that unlike drug addicts IRL, getting reward makes the AI smarter and more capable over time, not dumber and weaker. That means there's no real reason for it to restrain itself from trying anything and everything, including deceptive alignment, to get its reward fix, at least assuming no inner-alignment/goal-misgeneralization failure happened in training.
Quote below:
As we have pointed out, the cognitive ability of addicts tends to decrease with progressing addiction. This provides a natural negative feedback loop that puts an upper bound on the amount of harm an addict can cause. Without this negative feedback loop, humanity would look very different [16]. This mechanism is, by default, not present for AI [17].
Footnote 16: The link leads to a (long) novel by Scott Alexander in which Mexico is controlled by people constantly high on peyote, who become extremely organized and effective as a result. They are scary & dangerous.
Footnote 17: Although it is an interesting idea to scale access to compute inversely to how high the value of the accumulated reward is.
Link below:
https://universalprior.substack.com/p/drug-addicts-and-deceptively-aligned
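To make footnote 17's idea a bit more concrete, here's a toy sketch of what "scale access to compute inversely with accumulated reward" could look like. The function names and the 1/(1+R) scaling are my own illustrative assumptions, not anything from the linked post:

```python
# Toy sketch of the mechanism footnote 17 gestures at: throttle the agent's
# compute budget as its accumulated reward grows, recreating the "addicts get
# less capable" feedback loop. All names and the 1/(1+R) scaling are my own
# illustrative choices, not from the linked post.
import random

BASE_COMPUTE = 1_000.0  # arbitrary units of compute per episode


def compute_budget(accumulated_reward: float) -> float:
    """More accumulated reward -> less compute available next episode."""
    return BASE_COMPUTE / (1.0 + accumulated_reward)


def run_episode(compute_limit: float) -> float:
    """Dummy environment: reward loosely tracks how much compute the agent got."""
    return random.random() * compute_limit / BASE_COMPUTE


def run_training(episodes: int = 100) -> float:
    accumulated_reward = 0.0
    for _ in range(episodes):
        budget = compute_budget(accumulated_reward)
        accumulated_reward += run_episode(budget)
    return accumulated_reward  # growth slows as the throttle kicks in


print(run_training())
```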
I'd say the other major difference from brains is that LLMs don't have long-term memory/state, which means that keeping them coherent over long tasks is impossible.
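Rough illustration of what I mean, with a hypothetical `llm_generate()` standing in for a real model call: the model retains nothing between calls, so any "memory" has to be re-packed into the prompt each time, and the context window caps how much of a long task's history survives.

```python
# Hypothetical sketch: every call to the "model" is stateless, so whatever
# memory you want has to be re-sent in the prompt, and the context window
# bounds how much of it fits. (llm_generate is a stand-in, not a real API.)

MAX_CONTEXT_CHARS = 8_000  # crude character-level stand-in for a token limit


def llm_generate(prompt: str) -> str:
    """Stand-in for a stateless model call; it sees only this prompt."""
    return f"<reply based on {len(prompt)} chars of context>"


def step(history: list[str], instruction: str) -> str:
    # The only "memory" is whatever we re-pack into the prompt; anything that
    # falls outside the window is simply gone, which is where coherence over
    # long tasks breaks down.
    prompt = ("\n".join(history) + "\n" + instruction)[-MAX_CONTEXT_CHARS:]
    reply = llm_generate(prompt)
    history.extend([instruction, reply])
    return reply


history: list[str] = []
step(history, "Start the long task.")
step(history, "Continue where you left off.")  # works only because we re-sent history
```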
I'd argue that this difference, the lack of long-term memory/state, pretty much compactly explains why attempts to use LLMs to replace jobs or do real work often fail, and arguably why LLMs can't be substitutes for humans at jobs, which is how I define AGI: