I’d be keen for others’ thoughts on a “Socratic tale” about one particular way in which CIRL might be a helpful component of the alignment story.
Let’s say mechanistic interpretability research advances by leaps and bounds, to the point where we can identify a primary mesa-optimiser within a transformer network. But when we look into its internalised loss function, we find it is less than ideal.
Given that, in this make-believe future, we have built up sufficient mechanistic interpretability understanding, we also have a way to “patch” that loss function. And although we couldn’t have trained the model with CIRL built in, the network by now has a solid internal representation of the concepts of CIRL itself. So instead, we can reroute the network’s own definition and output of the CIRL objective to serve as the internal loss function of its primary internalised optimiser.
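For concreteness, here is roughly the objective I have in mind when I say “the definition and output of CIRL.” In the original CIRL formulation (Hadfield-Menell et al.), a human H and a robot R play a two-player game with identical payoffs, where only the human observes the reward parameter θ; the robot must infer θ from the human’s behaviour. A simplified sketch, in my own notation:

$$
J(\pi^H, \pi^R) \;=\; \mathbb{E}\!\left[\sum_{t=0}^{\infty} \gamma^{t}\, R\!\left(s_t, a_t^H, a_t^R;\, \theta\right)\right], \qquad \theta \sim P_0,\ \text{observed only by } H.
$$

The hope in the story above is that the “patched” internal loss would be something like the robot’s side of this objective: maximise expected reward under its posterior belief over θ, rather than over some fixed, fully-known proxy objective.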
Potentially, something like this could make the above possible?
I’m not saying the above is likely, or even possible. But I wanted to present it as just one way in which CIRL may turn out to be an amazing tool in our toolbox. We need to be careful not to prematurely throw away any tools in our arsenal that could at some point be exceptionally helpful in solving this problem. At this stage of the game we should be careful not to put up “blinkers” and declare that such-and-such will definitely not help. Who knows, it might not be the whole answer, but it just might be a really helpful cog in a surprising way.