This reminds me of the problems that STPA (System-Theoretic Process Analysis) is trying to solve in safe systems design:
https://psas.scripts.mit.edu/home/get_file.php?name=STPA_handbook.pdf
And, for those who prefer video, here’s a good intro to STPA:
Their approach is designed to handle complex systems by decomposing them into parts. However, the decomposition is not into functions or tasks; instead, the system is decomposed into a control structure.
They model a system as a graph of controllers (think internal, potentially nested mesa optimisers) which issue control actions to processes and then receive feedback from those processes (think internal loss functions). From there, they can work through the system controller by controller and enumerate the ways in which the resulting overall system can be unsafe due to that particular controller.
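To make that a bit more concrete, here’s a rough toy sketch of how such a control structure might be represented, along with STPA’s four standard categories of unsafe control actions enumerated per controller. This is my own illustration, not taken from the handbook; the example system and all the names in it are made up.

```python
from dataclasses import dataclass, field

# The four standard STPA categories of unsafe control actions (UCAs).
UCA_TYPES = [
    "not provided when needed",
    "provided when it causes a hazard",
    "provided too early, too late, or out of order",
    "stopped too soon or applied too long",
]

@dataclass
class Controller:
    name: str
    control_actions: list[str]   # commands issued to the controlled process
    feedback: list[str]          # signals received back from the process
    sub_controllers: list["Controller"] = field(default_factory=list)  # nesting

def enumerate_candidate_ucas(controller: Controller) -> list[str]:
    """List the ways each control action of this controller (and any nested
    controllers) could be unsafe, per the four STPA UCA categories."""
    ucas = [
        f"{controller.name}: '{action}' {uca_type}"
        for action in controller.control_actions
        for uca_type in UCA_TYPES
    ]
    for sub in controller.sub_controllers:
        ucas.extend(enumerate_candidate_ucas(sub))
    return ucas

# Toy example: an autopilot supervised by a flight-management controller.
autopilot = Controller(
    name="autopilot",
    control_actions=["pitch up", "pitch down"],
    feedback=["altitude", "airspeed"],
)
fms = Controller(
    name="flight management",
    control_actions=["engage autopilot", "disengage autopilot"],
    feedback=["autopilot status"],
    sub_controllers=[autopilot],
)

for uca in enumerate_candidate_ucas(fms):
    print(uca)
```

Each printed line is a candidate hazard to be assessed for that controller, which is the “enumerate the unsafe ways” step described above.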
Wouldn’t it be amazing if, one day, we could train a neural network whose trained result is verifiably mappable, via mechanistic interpretability, onto an STPA control structure? And then potentially have verified systems in place that themselves perform STPA analyses on yet larger systems, in order to flag potential hazards given a scenario and the system’s current control structure.
Maybe this could look something like this?
I’d be keen for others’ thoughts on a “Socratic tale” of one particular way in which CIRL might be a helpful component of the alignment story.
Let’s say we make leaps and bounds in mechanistic interpretability research, to the point where we have identified a primary objective-style mesa optimiser within a transformer network. But when we look into its internalised loss function, we see that it is less than ideal.
But given that, in this make-believe future, we have built up sufficient mechanistic interpretability understanding, we now have a way to “patch” that loss function. It turns out that although we couldn’t have trained the model with CIRL built in, the network has by now internalised a fundamental understanding of the concepts of CIRL itself, so we can reroute the definition and output of CIRL to serve as the internal loss function of its primary internalised optimiser.
Potentially, something like this could help make the above possible?
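To make the hand-waving a tiny bit more concrete, here’s a toy sketch of what that “patch” might look like at the interface level. To be clear, this is purely hypothetical: the MesaOptimizer handle, the state dictionary, and the idea that mech-interp could expose anything like this are all assumptions on my part, and the CIRL-flavoured loss is just a stand-in (negative expected human reward under the model’s belief over candidate human reward functions).

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical handle that (imagined) mech-interp tooling gives us onto the
# network's primary internalised optimiser. Nothing like this exists today.
@dataclass
class MesaOptimizer:
    internal_loss: Callable[[dict], float]  # the objective it actually pursues

def flawed_proxy_loss(state: dict) -> float:
    """Stand-in for the 'less than ideal' objective the training process produced."""
    return -state["proxy_metric"]

def cirl_style_loss(state: dict) -> float:
    """Toy CIRL-flavoured objective: negative expected human reward under the
    model's own belief over candidate human reward functions, so the optimiser
    is rewarded for acting well under uncertainty about what the human wants."""
    belief = state["belief_over_human_reward"]   # e.g. {"human_prefers_A": 0.7, ...}
    rewards = state["candidate_human_rewards"]   # reward each hypothesis assigns here
    expected_human_reward = sum(p * rewards[h] for h, p in belief.items())
    return -expected_human_reward

toy_state = {
    "proxy_metric": 10.0,
    "belief_over_human_reward": {"human_prefers_A": 0.7, "human_prefers_B": 0.3},
    "candidate_human_rewards": {"human_prefers_A": 1.0, "human_prefers_B": -1.0},
}

mesa = MesaOptimizer(internal_loss=flawed_proxy_loss)
print(mesa.internal_loss(toy_state))   # -10.0: chasing the proxy metric

mesa.internal_loss = cirl_style_loss   # the hypothetical mech-interp "patch"
print(mesa.internal_loss(toy_state))   # -0.4: loss now tracks expected human reward
```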
I’m not saying the above is likely, or even possible. But I wanted to present it as just one way in which CIRL might be an amazing tool in our toolbox. We need to be careful not to prematurely throw away any tools in our arsenal that could at some point be exceptionally helpful in solving this problem. At this stage of the game we need to be really careful not to put up “blinkers” and say that xyz will definitely not help. Who knows: it might not be the whole answer, but it just might be a really helpful cog in a surprising way.