These examples seem like capabilities failures rather than alignment failures. Reading them doesn’t make me feel any more convinced that there will be rebellious AI, accidental paperclip maximizers, deceptive alignment, etc.
In the first example, the AI's environment suddenly changes, and the AI isn't given the capability to learn and adapt to that change. So of course it fails.
In the second example, the AI is given the ability to continually learn and adapt, and in this case it actually succeeds at the intended goal. It does almost depopulate the trees, but only because it's a relatively simple reinforcement learner that has to screw up once before it can learn from the mistake; a more sophisticated intelligence might have the foresight to avoid even that (see the toy sketch below). Still, only messing up once is pretty impressive.
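To make the "has to screw up once" point concrete, here's a minimal sketch of my own (not the actual setup from the post, and the rewards are made up): a bare-bones epsilon-greedy value learner can only discover that stripping the orchard is a losing move by trying it once and eating the penalty, after which its value estimates steer it back to sustainable harvesting.

```python
# Toy illustration (my own construction): a simple learner that only finds out
# over-harvesting is bad by doing it once and experiencing the penalty.
import random

random.seed(0)

ACTIONS = ["harvest_sustainably", "harvest_everything"]
values = {a: 0.0 for a in ACTIONS}   # estimated value of each action
counts = {a: 0 for a in ACTIONS}

def reward(action):
    # Hypothetical reward structure: stripping the orchard pays off once,
    # but the depopulated trees make the net outcome strongly negative.
    if action == "harvest_everything":
        return 10.0 - 50.0           # short-term haul minus long-term loss
    return 3.0                       # modest but sustainable yield

for step in range(20):
    # Epsilon-greedy: mostly exploit current estimates, occasionally explore.
    if step == 0 or random.random() < 0.2:
        action = random.choice(ACTIONS)
    else:
        action = max(values, key=values.get)
    r = reward(action)
    counts[action] += 1
    # Incremental-average update of the chosen action's value estimate.
    values[action] += (r - values[action]) / counts[action]
    print(step, action, round(values[action], 2))
```

Until the agent has tried "harvest_everything" at least once, its value estimate for that action gives it no reason to avoid it; after one bad experience, the estimate collapses and the agent sticks to sustainable harvesting. That's the sense in which the single screw-up is baked into this kind of learner rather than being evidence of misaligned goals.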
The third example is an LLM, to which it's somewhat awkward to apply the concept of having "goals" at all. LLMs are sometimes capable of astonishing feats of intelligence, but they are also frequently very "stupid" when statistical next-token prediction leads to faulty pattern-matching. This failure is one such example.