It is worth thinking about why ChatGPT, an Oracle AI which can execute certain instructions, does not fail text equivalents of the cauldron task.
The reason seems to be that it is quite good at understanding the meaning of expressions. (If an AI floods the room with water because this maximizes the probability that the cauldron will be filled, then the AI hasn’t fully understood the instruction “fill the cauldron”, which only asks for a satisficing solution.)
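To make the maximizer/satisficer contrast concrete, here is a toy sketch (my own illustration, not anyone’s actual agent design; the action list and probabilities are invented). A maximizer of P(cauldron full) prefers the destructive flooding action, while a satisficer stops at the first option that is good enough:

```python
# Toy cauldron task: each entry is (action, P(cauldron ends up full), side effect).
# Actions and numbers are made up purely for illustration.
actions = [
    ("fetch one bucket of water", 0.95, "none"),
    ("flood the entire room", 0.999, "room destroyed"),
]

def maximize(actions):
    # Pick the action with the highest success probability, whatever the cost.
    return max(actions, key=lambda a: a[1])

def satisfice(actions, threshold=0.9):
    # Pick the first action that is good enough, then stop searching.
    for action in actions:
        if action[1] >= threshold:
            return action

print("maximizer chooses: ", maximize(actions)[0])    # flood the entire room
print("satisficer chooses:", satisfice(actions)[0])   # fetch one bucket of water
```

The instruction “fill the cauldron” picks out the satisficer’s behavior, and an AI that understands the instruction treats it that way.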
And why is ChatGPT so good at interpreting the meaning of instructions? Because its base model was trained with some form of imitation learning, which gives it excellent language understanding and the ability to mimic the linguistic behavior of human agents. Eliciting this behavior from a base model requires special prompting, but supervised learning on dialogue examples (instruction tuning) lets the model respond adequately to instructions directly. (Of course, at this stage it would not refuse any dangerous requests; refusal behavior only comes in with RLHF, which seems a rather imperfect tool.)
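As a rough illustration of the prompting difference, here is a sketch using the Hugging Face transformers API (my own example; “gpt2” is a real base checkpoint, while the instruct model name is a placeholder to be substituted with any instruction-tuned checkpoint that ships a chat template):

```python
# Sketch: a base LM must be coaxed into assistant-like behavior with a
# dialogue-shaped prompt; an instruction-tuned model takes a bare instruction.
from transformers import AutoModelForCausalLM, AutoTokenizer

def complete(model_name: str, prompt: str) -> str:
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    ids = tok(prompt, return_tensors="pt")
    out = model.generate(**ids, max_new_tokens=40, do_sample=False)
    # Return only the newly generated continuation.
    return tok.decode(out[0][ids["input_ids"].shape[1]:], skip_special_tokens=True)

# A base model treats text as something to continue, so we frame the request
# as an ongoing Q&A dialogue to imitate:
base_prompt = (
    "Q: What is the capital of France?\nA: Paris.\n"
    "Q: What should you do when asked to fill the cauldron?\nA:"
)
print(complete("gpt2", base_prompt))

# An instruction-tuned model can be addressed directly via its chat template
# (placeholder model name; substitute a real instruct checkpoint):
# tok = AutoTokenizer.from_pretrained("<instruct-model>")
# prompt = tok.apply_chat_template(
#     [{"role": "user", "content": "What should you do when asked to fill the cauldron?"}],
#     tokenize=False, add_generation_prompt=True,
# )
```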