I found this post super interesting, and appreciate you writing it. I share the suspicion/hope that gaining better understanding of brains might yield safety-relevant insights.
I’m curious what you think is going on here that seems relevant to inner alignment. Is it that you’re modeling neocortical processes (e.g. face recognizers in visual cortex) as arising from something akin to search processes conducted by similar subcortical processes (e.g. face recognizers in the superior colliculus), and noting that there doesn’t seem to be much divergence between their objective functions, perhaps because of helpful features of subcortex-supervised learning, such as these subcortical, input-dependent dynamic rewiring rules?
I’m curious what you think is going on here that seems relevant to inner alignment.
Hmm, I guess I didn’t go into detail on that. Here’s what I’m thinking.
For starters, what is inner alignment anyway? Maybe I’m abusing the term, but I think of two somewhat different scenarios.
In a general RL setting, one might say that outer alignment is alignment between what we want and the reward function, and inner alignment is alignment between the reward function and “what the system is trying to do”. (This one is closest to how I was implicitly using the term in this post.)
In the “risks from learned optimization” paper, it’s a bit different: the whole system (perhaps an RL agent and its reward function, or perhaps something else entirely) is conceptually bundled together into a single entity, and you do a black-box search for the most effective “entity”. In this case, outer alignment is alignment between what we want and the search criterion, and inner alignment is alignment between the search criterion and “what the system is trying to do”. (This is not really what I had in mind in this post, although it’s possible that this sort of inner alignment could also come up, if we design the system by doing an outer search, analogous to evolution.)
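To make the contrast concrete, here is a toy sketch of the second framing in Python. This is purely my own illustration, not any real system: the environments, the two candidate “inner objectives”, and every number are invented, and the “outer search” is just picking the better of two hand-written candidates.

```python
# Toy sketch of the "risks from learned optimization" framing: an outer
# black-box search selects whichever candidate inner objective scores best by
# the outer criterion on training environments; we then check what that inner
# objective does off-distribution. Everything here is invented for illustration.

# Each environment has some doors; the outer criterion rewards opening the
# door with the key behind it. In training, the key always happens to be
# behind the last ("reddest") door, so a proxy goal fits perfectly.
def make_env(n_doors, key_pos):
    return {"n_doors": n_doors, "key": key_pos}

train_envs = [make_env(5, key_pos=4) for _ in range(10)]
test_env = make_env(5, key_pos=1)   # the proxy breaks here

# Candidate inner objectives the outer search chooses between.
inner_objectives = {
    "seek_reddest_door": lambda env, door: 1.0 if door == env["n_doors"] - 1 else 0.0,
    "seek_key_door":     lambda env, door: 1.0 if door == env["key"] else 0.0,
}

def act(objective, env):
    # The "mesa-optimizer": pick the door that maximizes its own inner objective.
    return max(range(env["n_doors"]), key=lambda d: objective(env, d))

def outer_criterion(env, door):
    # What the designer actually scores the whole entity on.
    return 1.0 if door == env["key"] else 0.0

# Outer search: both candidates score perfectly on the training environments,
# so the training signal cannot tell them apart; the tie is broken arbitrarily.
scores = {name: sum(outer_criterion(e, act(obj, e)) for e in train_envs)
          for name, obj in inner_objectives.items()}
chosen = max(scores, key=scores.get)

print("outer search selected:", chosen, scores)
print("score on the test environment:",
      outer_criterion(test_env, act(inner_objectives[chosen], test_env)))
```

The point of the toy is just that the outer criterion can’t distinguish the two candidates on the training distribution, which is exactly the kind of gap this framing calls inner misalignment.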
Note that neither of these kinds of “inner alignment” really comes up in existing mainstream ML systems. In the former (RL) case, if you think of an RL agent like AlphaStar, I’d say there isn’t a coherent notion of “what the system is trying to do”, at least in the sense that AlphaStar does not do foresighted planning towards a goal. Or take AlphaGo, which does have foresighted planning because of the tree search; but there we program the tree search by hand, so there’s no risk that the foresighted planning is working towards any goal except the one we coded ourselves, I think.
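To make that point concrete in miniature, here is a toy hand-coded lookahead (nothing like AlphaGo’s actual MCTS; the game, the evaluation function, and the numbers are all invented here). The point is just that the evaluation function is the only “goal” the search can ever pursue, because we wrote it by hand:

```python
# Toy single-player game: the state is a number, each move adds 1 or 2, and we
# score states by closeness to a target we chose. The search plans ahead, but
# its objective is literally the evaluation function we wrote.
TARGET = 10

def evaluate(state):
    # Hand-written objective: the only goal the search ever pursues.
    return -abs(TARGET - state)

def plan(state, depth):
    """Depth-limited exhaustive lookahead; returns (best leaf score, best first move)."""
    if depth == 0:
        return evaluate(state), None
    best = (float("-inf"), None)
    for move in (1, 2):
        score, _ = plan(state + move, depth - 1)
        best = max(best, (score, move))
    return best

state = 0
while state < TARGET:
    _, move = plan(state, depth=4)
    state += move
    print("move", move, "-> state", state)
```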
So, “RL systems that do foresighted planning towards explicit goals which they invent themselves” are not much of a thing these days (as far as I know), but they presumably will be a thing in the future (among other things, this is essential for flexibly breaking down goals into sub-goals). And the neocortex is in this category. So yeah, it seems reasonable to me to extend the term “inner alignment” to this case too.
So anyway, the neocortex creates explicit goals for itself, like “I want to get out of debt”, and uses foresight / planning to try to bring them about. (Of course it creates multiple contradictory goals, and also has plenty of non-goal-seeking behaviors, but foresighted goal-seeking is one of the things people sometimes do! And of course transient goals can turn into all-consuming goals in self-modifying AGIs.) The neocortical goals have something to do with subcortical reward signals, but the relationship is obviously not deterministic, so there’s an opening for inner alignment problems.
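Here is a toy sketch of that gap (my own illustration, not a model of how any actual brain works; the reward function, the candidate goals, and the “planning” are all made up). The point is that the explicit goal is inferred from the reward history rather than copied from the reward function, so it can fit the history perfectly and still diverge from what the reward was supposed to convey:

```python
# A subcortex-like module hands out reward; a neocortex-like module infers an
# explicit goal from that reward and then plans toward the inferred goal.
def reward(state):
    # What the subcortex-like module actually rewards: staying in a small band
    # around zero ("out of debt", loosely).
    return 1.0 if 0 <= state <= 2 else 0.0

# Candidate explicit goals the neocortex-like module might adopt.
candidate_goals = {
    "hoard_money":      lambda s: s,            # more is always better
    "stay_out_of_debt": lambda s: -abs(s - 1),  # stay near the rewarded band
}

def infer_goal(experience):
    # Crude goal inference: prefer a candidate under which rewarded states
    # score higher than unrewarded ones. With limited experience, several
    # candidates fit equally well, and the tie-break (here, dict order) stands
    # in for whatever inductive biases the learner happens to have.
    rewarded = [s for s, r in experience if r > 0]
    unrewarded = [s for s, r in experience if r == 0]
    def fit(goal):
        return sum(1 for a in rewarded for b in unrewarded if goal(a) > goal(b))
    return max(candidate_goals, key=lambda name: fit(candidate_goals[name]))

def plan(goal_name, state, steps=6):
    # Foresighted (here just greedy hill-climbing) pursuit of the explicit
    # goal, rather than of the reward signal itself.
    goal = candidate_goals[goal_name]
    for _ in range(steps):
        state = max((state - 1, state + 1), key=goal)
    return state

# The agent has only seen one rewarded and one unrewarded state, which both
# candidate goals explain equally well...
experience = [(-1, reward(-1)), (2, reward(2))]
adopted = infer_goal(experience)
final = plan(adopted, state=0)
print("explicit goal adopted:", adopted)
print("state after planning:", final, "| reward there:", reward(final))
```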
...noting there doesn’t seem to be much divergence between their objective functions...
This is getting into a different question, I think… OK, so: if we build an AGI along the lines of a neocortex-like system plus a subcortex-like system that provides reward signals and other guidance, will it reliably do the things we designed it to do? My default is usually pessimism, but I guess I shouldn’t go too far. Some of the things this system is designed to do, it seems to do very reliably. For example, almost everyone learns language. This requires, I believe, cooperation between the neocortex and a subcortical system that flags human speech sounds as important. And it works almost every time! A more important question is: can we design the system such that the neocortex winds up reliably seeking pre-specified goals? Here, I just don’t know. I don’t think humans and animals provide strong evidence either way, or at least it’s not obvious to me...
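For what it’s worth, here is a very loose sketch of the “flag it as important” mechanism (entirely my own toy, not a claim about how speech acquisition actually works; the detector, the learner, and the numbers are invented). A hard-coded detector scales up how strongly certain inputs get learned, and the otherwise-generic learner reliably picks up the flagged pattern even though it is rare in the input stream:

```python
import random

random.seed(0)

def looks_like_speech(signal):
    # Stand-in for an innate, hard-coded detector (no learning happens here).
    return signal["kind"] == "speech"

class GenericLearner:
    """A deliberately dumb, general-purpose learner: it just counts tokens,
    weighted by however salient the subcortex-like module says they are."""
    def __init__(self):
        self.counts = {}

    def observe(self, signal, salience):
        token = signal["token"]
        self.counts[token] = self.counts.get(token, 0.0) + salience

    def most_learned(self):
        return max(self.counts, key=self.counts.get)

learner = GenericLearner()
stream = ([{"kind": "speech", "token": "mama"}] * 5 +
          [{"kind": "noise", "token": "hum"}] * 50)
random.shuffle(stream)

for sig in stream:
    # The subcortex-like module flags speech as important; the learner itself
    # is completely generic and just weights its updates by that flag.
    learner.observe(sig, salience=20.0 if looks_like_speech(sig) else 1.0)

print(learner.most_learned())   # "mama": rare but flagged inputs dominate
```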
FYI, I now have a whole post elaborating on “inner alignment”: mesa-optimizers vs steered optimizers
Thanks!