I agree with just about all of it (even though it paints a pretty bleak picture). It was useful to see all of these ideas about inner/outer alignment laid out in one place, especially with the diagrams.
Two quotes that stood out to me:
“ ‘Nameless pattern in sensory input that you’ve never conceived of’ is a case where something is in-domain for the reward function but (currently) out-of-domain for the value function. Conversely, there are things that are in-domain for your value function—so you can like or dislike them—but wildly out-of-domain for your reward function! You can like or dislike ‘the idea that the universe is infinite’! You can like or dislike ‘the idea of doing surgery on your brainstem in order to modify your own internal reward function calculator’! A big part of the power of intelligence is this open-ended ever-expanding world-model that can re-conceptualize the world and then leverage those new concepts to make plans and achieve its goals. But we cannot expect those kinds of concepts to be evaluable by the reward function calculator.”
And
“After all, the reward function will diverge from the thing we want, and the value function will diverge from the reward function. The most promising solution directions that I can think of seem to rely on things like interpretability, “finding human values inside the world-model”, corrigible motivation, etc.—things which cut across both layers, bridging all the way from the human’s intentions to the value function.”
I also liked the idea that we can use the human brain as a way to better understand the interface between the outer-loop reward function and the inner-loop value function.
Thinking about corrigibility, it seems like having a system with finite computational resources and an inability to modify its own source code would both be highly desirable, especially at the early stages. This feels like a +1 for implementing AGI in neuron-based wetware rather than as code on a server. Of course, the agent could find ways to acquire more neurons! And we would very likely then lose out on some interpretability tools. But this is just something that popped into my head as a tradeoff between different AGI implementations.
As a more general point, I think your working with the garage door open and laying out all of your arguments is highly motivating (at least for me!) to think more actively about, and actually pursue, safety research in a way that I have dilly-dallied on doing since back in 2016 when I read Superintelligence!
Thanks for this post!