the loss function does not require the model to be an agent.
What worries me is that models that happen to have a secret homunculus and behave as agents would score higher than models that do not. For example, the model could reason about itself being a computer program, find an exploit in the physical system it is running on to extract the text it’s supposed to predict in a particular training example, and then output the correct answer for a perfect score. (Or, more realistically, a slightly modified version of the correct answer, so as not to alert the people observing its training.)
The question of whether LLMs like GPT-4 have a homunculus inside them is truly fascinating, though. It makes me wonder whether the right prompt could trick it into revealing itself, and how we would tell a real homunculus apart from a model that is just pretending to be an agent. The fact that we have not observed even a dumb homunculus in less intelligent models really does surprise me. If such a thing does appear as an emergent property in larger models, I sure hope it starts out dumb and reveals itself so that we can catch it and take a moment to pause and reevaluate our trajectory.