Or e.g. that it always leads to the organism optimizing for a set of goals which is unrecognizably different from the base objective. I don’t think you see these things, and I’m interested in figuring out how evolution prevented them.
As I understand it, Wang et al. found that their experimental setup trained an internal RL algorithm that was more specialized for this particular task, but was still optimizing for the same task that the RNN was being trained on? And it was selected exactly because it pursued that very goal better. If the circumstances changed so that the more specialized behavior was no longer appropriate, then (assuming the RNN’s weights hadn’t been frozen) the feedback to the outer network would gradually end up reconfiguring the internal algorithm as well. So I’m not sure how it could even end up with something that’s “unrecognizably different” from the base objective—even after a distributional shift, the learned objective would probably still be recognizable as a special case of the base objective, until it updated to match the new situation.
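To make that setup concrete, here is a minimal sketch of a Wang-et-al.-style meta-RL loop. This is my own toy reconstruction rather than the paper’s code: the two-armed bandit task, the LSTM size, and all hyperparameters are illustrative assumptions. The outer loop adjusts the RNN’s weights across many episodes, while any within-episode “learning” can only live in the hidden state, which is conditioned on the previous action and reward.

```python
# Toy sketch (not the authors' code) of an outer RL loop training an RNN whose
# hidden-state dynamics can implement a faster inner learning procedure.
import torch
import torch.nn as nn

class MetaRLAgent(nn.Module):
    def __init__(self, n_arms=2, hidden=48):
        super().__init__()
        # Input at each step: one-hot previous action + previous reward.
        self.lstm = nn.LSTMCell(n_arms + 1, hidden)
        self.policy = nn.Linear(hidden, n_arms)
        self.n_arms, self.hidden = n_arms, hidden

    def forward(self, x, state):
        h, c = self.lstm(x, state)
        return torch.distributions.Categorical(logits=self.policy(h)), (h, c)

def run_episode(agent, arm_probs, steps=50):
    """One bandit episode: any within-episode adaptation lives in the hidden state."""
    state = (torch.zeros(1, agent.hidden), torch.zeros(1, agent.hidden))
    x = torch.zeros(1, agent.n_arms + 1)          # no previous action/reward yet
    log_probs, rewards = [], []
    for _ in range(steps):
        dist, state = agent(x, state)
        action = dist.sample()
        reward = float(torch.bernoulli(arm_probs[action.item()]))
        log_probs.append(dist.log_prob(action))
        rewards.append(reward)
        x = torch.zeros(1, agent.n_arms + 1)      # feed back what just happened
        x[0, action.item()] = 1.0
        x[0, -1] = reward
    return torch.cat(log_probs), torch.tensor(rewards)

agent = MetaRLAgent()
optimizer = torch.optim.Adam(agent.parameters(), lr=1e-3)
for episode in range(2000):                       # the slow, outer RL loop
    arm_probs = torch.rand(agent.n_arms)          # a fresh bandit task each episode
    log_probs, rewards = run_episode(agent, arm_probs)
    # Simple policy-gradient update of the RNN weights (a stand-in for the
    # actor-critic training used in the paper).
    loss = -(log_probs * (rewards - rewards.mean())).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

The detail that matters for the argument: whatever inner algorithm the hidden-state dynamics end up implementing, it is only selected and retained insofar as it increases the return that the outer loop is optimizing.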
The thing that I would expect to see from this description is that humans who were e.g. practicing a particular skill might end up becoming overspecialized to the circumstances around that skill, and need to occasionally relearn things to fit a new environment. And that certainly does seem to happen. Likewise for more general/abstract skills, like “knowing how to navigate your culture/technological environment”, where older people’s strategies are often more adapted to how society used to be than to how it is now—but they still aren’t incapable of updating.
Catastrophic misalignment seems more likely to happen in the case of something like evolution, where the two learning algorithms operate on vastly different timescales, and it takes a very long time for evolution to correct after a drastic distributional shift. But the examples in Wang et al. lead me to think that in the brain, even the slower process operates on a timescale that’s on the order of days rather than years, allowing for reasonably rapid adjustments in response to distributional shifts. (Though it’s plausible that the more structure there is in need of readjustment, the slower the reconfiguration process will be—which would fit the behavioral calcification that we see in e.g. some older people.)
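As a back-of-the-envelope illustration of the timescale point (toy numbers of my own, not anything from the paper): a simple learner’s recovery time after a sudden shift scales inversely with its update rate, so a “day-scale” outer process can re-adjust orders of magnitude faster than an “evolution-scale” one.

```python
# Toy illustration: after the environment's "target" flips, how many updates does a
# simple tracker with a given learning rate need before it is re-adapted?
def steps_to_recover(lr, old_target=1.0, new_target=-1.0, tol=0.1):
    estimate = old_target                         # fully adapted to the old regime
    for step in range(1, 1_000_000):
        estimate += lr * (new_target - estimate)  # one learning update
        if abs(estimate - new_target) < tol:
            return step
    return None

for label, lr in [("fast inner-loop-like learner", 0.3),
                  ("day-scale outer learner", 0.03),
                  ("evolution-scale learner", 0.0003)]:
    print(f"{label:29s} (lr={lr}): recovers after {steps_to_recover(lr)} updates")
```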
It seems possible to me. A common strategy in religious groups is to steer for a wide barrier between themselves and particular temptations. This could be seen as a strategy for avoiding DA (dopamine) signals which would de-select for the behaviors encouraged by the religious group: no rewards are coming in for alternate behaviors, so the best the DA can do is reinforce the types of reward which the PFC (prefrontal cortex) has restricted itself to.
This can be supplemented with modest rewards for desired behaviors, which force the DA to reinforce the inner optimizer’s desired behaviors.
Although this is easier in a community which supports the behaviors, it’s entirely possible to do this to oneself in relative isolation as well.
Good point, I wasn’t thinking of social effects changing the incentive landscape.
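To make the restriction mechanism above concrete, here is a toy sketch (my own construction, not something anyone in the thread proposed): a simple prediction-error learner only updates the value of actions that are actually taken, so if the tempting options are kept off the menu entirely, the reward signal can do nothing but strengthen the sanctioned behavior.

```python
# Toy sketch: value learning can only reinforce actions that actually get sampled,
# so restricting the action set protects the restricted policy from being unlearned.
# Reward values and the learning rate are arbitrary illustrative numbers.
import random

def learn_values(allowed_actions, true_reward, steps=500, lr=0.1, seed=0):
    random.seed(seed)
    values = {a: 0.0 for a in true_reward}         # dopamine-like value estimates
    for _ in range(steps):
        action = random.choice(allowed_actions)    # the "PFC" decides what's on the menu
        reward = true_reward[action] + random.gauss(0, 0.1)
        values[action] += lr * (reward - values[action])   # prediction-error update
    return values

true_reward = {"sanctioned_behavior": 0.3, "temptation": 1.0}

# Unrestricted: the higher-reward temptation gets discovered and reinforced.
print(learn_values(["sanctioned_behavior", "temptation"], true_reward))
# Restricted: the temptation's value never updates, so only the sanctioned behavior
# accumulates reinforcement from its modest reward.
print(learn_values(["sanctioned_behavior"], true_reward))
```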
Kaj, the point I understand you to be making is: “The inner RL algorithm in this scenario is probably reliably aligned with the outer RL algorithm, since the former was selected specifically on the basis of it being good at accomplishing the latter’s objective, and since if the former deviates from pursuing that objective it will receive less reward from the outer algorithm, causing it to reconfigure itself to be better aligned. And since the two algorithms operate on similar time scales, we should expect any such misalignment to be noticed/corrected quickly.” Does this seem like a reasonable paraphrase?
It doesn’t feel obvious to me that the outer layer will be able to reliably steer the inner layer in this sense, especially as systems become more powerful. For example, it seems plausible to me that the inner layer might come to optimize for its proxy estimations of outer reward more than for outer reward itself, and that those two things might become decoupled.
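One toy way to picture that decoupling (a regressional-Goodhart-style sketch of my own, not a claim about the setup above): if the inner layer selects whatever scores best on a noisy proxy estimate of the outer reward, then the harder it optimizes the proxy, the more the selected option’s proxy score overstates the outer reward it actually delivers.

```python
# Toy sketch: stronger optimization against a noisy proxy widens the gap between the
# proxy score of the chosen option and its true outer reward. Numbers are arbitrary.
import random

random.seed(0)

def sample_option():
    true = random.gauss(0, 1)                 # the outer reward of this option
    proxy = true + random.gauss(0, 1)         # the inner layer's noisy estimate of it
    return proxy, true

for search_effort in [1, 10, 100, 10_000]:
    options = [sample_option() for _ in range(search_effort)]
    best_proxy, its_true_reward = max(options)    # pick the proxy-maximizing option
    print(f"searched {search_effort:6d} options: "
          f"proxy={best_proxy:5.2f}, true outer reward={its_true_reward:5.2f}")
```

How much this matters in the two-level setup presumably depends on how quickly the outer signal corrects the proxy, which ties back to the timescale point made earlier in the thread.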
That seems like a reasonable paraphrase, at least if you include the qualification that the “quickly” is relative to the amount of structure that the inner layer has accumulated, so it might not actually happen quickly enough to be useful in all cases.
“For example, it seems plausible to me that the inner layer might come to optimize for its proxy estimations of outer reward more than for outer reward itself, and that those two things could become decoupled.”
Sure, e.g. lots of exotic sexual fetishes look like that to me. Hmm, though actually that example makes me rethink the argument that you just paraphrased, given that those generally emerge early in an individual’s life and then typically don’t get “corrected”.