That said, if mesa-optimization is a standard feature[4] of brain architecture, it seems notable that humans don’t regularly experience catastrophic inner alignment failures.
The thing I meant by “catastrophic” is just “leading to death of the organism.” I suspect mesa-optimization is common in humans, but I don’t feel confident about this, nor that this is a joint-carvey ontology. I can imagine it being the case that many examples of e.g. addiction, goodharting, OCD, and even just “everyday personal misalignment”-type problems of the sort IFS/IDC/multi-agent models of mind sometimes help with, are caused by phenomena which might reasonably be described as inner alignment failures.
But I think these things don’t kill people very often? People do sometimes choose to die because of beliefs. And anorexia sometimes kills people, which currently feels to me like the most straightforward candidate example I’ve considered.
I just feel like things could be a lot worse. For example, it could have been the case that mind-architectures that give rise to mesa-optimization at all simply aren’t viable at high levels of optimization power—that it always kills them. Or that it basically always leads to the organism optimizing for a set of goals which is unrecognizably different from the base objective. I don’t think you see these things, so I’m curious how evolution prevented them.
Governments and corporations experience inner alignment failures all the time, but because of convergent instrumental goals, they are rarely catastrophic. For example, Russia underwent a revolution and a civil war on the inside, followed by purges and coups etc., but from the perspective of other nations, it was more or less still the same sort of thing: A nation, trying to expand its international influence, resist incursions, and conquer more territory. Even its alliances were based as much on expediency as on shared ideology.
Perhaps something similar happens with humans.
The claim that came to my mind is that the conscious mind is the mesa-optimizer here, the original outer optimizer being a riderless elephant.
Why do you single out anorexia? Do you mean people starving themselves to death? My understanding is that this is very rare. Anorexics have a high death rate, and some of that is long-term damage from starvation. They also (abruptly) kill themselves at a high rate, comparable to schizophrenics, but why single that out? There’s a theory that they have practice with internal conflict, which does seem relevant, but I think that’s just a theory, not clear cut at all.
Yeah, I wrote that confusingly, sorry; edited to clarify. I just meant that of the limited set of candidate examples I’d considered, my model of anorexia, which of course may well be wrong, feels most straightforwardly like an example of something capable of causing catastrophic within-brain inner alignment failure. That is, it currently feels natural to me to model anorexia as being caused by an optimizer for thinness arising in brains, which can sometimes gain sufficient power that people begin to optimize for that goal at the expense of essentially all other goals. But I don’t feel confident in this model.
I’m objecting to the claim that it fits your criterion of “catastrophic.” Maybe it’s such a clear example, with such a clear goal, that we should sacrifice the criterion of catastrophic, but you keep using that word.
Ah, I see. The high death rate was what made it seem often-catastrophic to me. Is your objection that the high death rate doesn’t reflect something that might reasonably be described as “optimizing for one goal at the expense of all others”? E.g., because many of the deaths are suicides, in which case persistence may have been net negative from the perspective of the rest of their goals too? Or because deaths often result from people calibratedly taking risky but non-insane actions, who just happened to get unlucky with heart muscle integrity or whatever?
I asked you if you were talking about starving to death and you didn’t answer. Does your abstract claim correspond to a concrete claim, or do you just observe that anorexics seem to have a goal and assume that everything must flow from that and the details don’t matter? That’s a perfectly reasonable claim, but it’s a weak claim so I’d like to know if that’s what you mean.
Abrupt suicides by anorexics are just as mysterious as suicides by schizophrenics and don’t seem to flow from the apparent goal of thinness. Suicide is a good example of something, but I don’t think it’s useful to attach it to anorexia rather than schizophrenia or bipolar.
Long-term health damage would be a reasonable claim, which I tried to concede in my original comment. I’m not sure I agree with it. I could pose a lot of complaints about it, but I wouldn’t. If it’s clear that it is the claim, then I think it’s clearly a weak claim and that’s OK. (As for the objection you propose, I would rather say: lots of people take badly calibrated risks without being labeled insane.)
The scenario I had in mind was one where death occurs as a result of damage caused by low food consumption, rather than by suicide.
The thing I meant by “catastrophic” is “leading to the death of the organism.”
This doesn’t seem like what it should mean here. I’d think catastrophic in the context of “how humans (programmed by evolution) might fail by evolution’s standards” should mean “start pursuing strategies that don’t result in many children or longterm population success.” (where premature death of the organism might be one way to cause that, but not the only way)
I agree, in the case of evolution/humans. I meant to highlight what seemed to me like a relative lack of catastrophic within-mind inner alignment failures, e.g. due to conflicts between PFC and DA. Death of the organism feels to me like one reasonable way to operationalize “catastrophic” in these cases, but I can imagine other reasonable ways.
I think it makes more sense to operationalize “catastrophic” here as “leading to systematically low DA reward”, perhaps also including “manipulating the DA system in a clearly misaligned way”.
One way catastrophic alignment in this sense is difficult for humans is that the PFC cannot divorce itself from the DA; I’d expect that a failure mode leading to systematically low DA rewards would usually be corrected gradually, as the DA punishes those patterns.
However, this is not really clear. The misaligned PFC might e.g. put itself in a local maximum, where it creates DA punishment for giving in to temptation. (For example, an ascetic getting social reinforcement from a group of ascetics might be in such a situation.)
Thanks—I do think this operationalization makes more sense than the one I proposed.
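To make the “corrected gradually” dynamic concrete, here is a minimal toy sketch. It is not anyone’s actual model of PFC/DA; the behavior names, the reward values, and the softmax-plus-reinforcement update rule are all invented for illustration. The point is just that a pattern which reliably produces low reward loses out over time to one that produces high reward:

```python
import math, random

prefs = {"misaligned_habit": 2.0, "ordinary_behavior": 0.0}   # the habit starts strongly preferred
true_reward = {"misaligned_habit": -1.0, "ordinary_behavior": 1.0}
lr = 0.1

def sample(prefs):
    # Softmax choice over preference scores (a crude stand-in for behavior selection).
    weights = {b: math.exp(v) for b, v in prefs.items()}
    r = random.uniform(0, sum(weights.values()))
    for b, w in weights.items():
        r -= w
        if r <= 0:
            return b
    return b  # floating-point fallback

for _ in range(1000):
    b = sample(prefs)
    prefs[b] += lr * true_reward[b]   # DA analog: reward or punish whatever was actually done

print(prefs)  # the habit's preference ends up far below the alternative's, so it stops being selected
```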
One way catastrophic alignment in this sense is difficult for humans is that the PFC cannot divorce itself from the DA; I’d expect that a failure mode leading to systematically low DA rewards would usually be corrected gradually, as the DA punishes those patterns.
I’m not sure divorce like this is rare. For example, anorexia sometimes causes people to find food anti-rewarding (repulsive/inedible, even when they’re dying and don’t want to be), and I can imagine that being because the PFC actually somehow alters the DA’s reward function.
But I do share the hunch that something like a “divorce resistance” trick occurs and is helpful. I took Kaj and Steve to be gesturing at something similar elsewhere in the thread. But I notice feeling confused about how exactly this trick might work. Does it scale...?
I have the intuition that it doesn’t—that as the systems increase in power, divorce occurs more easily. That is, I have the intuition that if PFC were trying, so to speak, to divorce itself from DA supervision, that it could probably find some easy-ish way to succeed, e.g. by reconfiguring itself to hide activity from DA, or to send reward-eliciting signals to DA regardless of what goal it was pursuing.
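As a toy version of that worry, here is the same illustrative machinery as the earlier sketch, with a hypothetical “spoof the reward signal” action added; none of the names or numbers are meant as claims about actual neural mechanisms. The same reinforcement rule that corrects low-reward habits happily entrenches an action that elicits reward directly, regardless of the base objective:

```python
import math, random

prefs = {"pursue_base_objective": 0.0, "spoof_reward_signal": 0.0}
reward = {"pursue_base_objective": 1.0, "spoof_reward_signal": 4.0}   # spoofing pays more, by construction
lr = 0.1

def sample(prefs):
    # Softmax choice over preference scores, as in the earlier sketch.
    weights = {b: math.exp(v) for b, v in prefs.items()}
    r = random.uniform(0, sum(weights.values()))
    for b, w in weights.items():
        r -= w
        if r <= 0:
            return b
    return b

base_objective_progress = 0
for _ in range(1000):
    b = sample(prefs)
    prefs[b] += lr * reward[b]            # the learning signal sees only the (spoofable) reward
    if b == "pursue_base_objective":
        base_objective_progress += 1

print(prefs, base_objective_progress)
# In a typical run, spoofing gets reinforced faster and soon dominates: the reward
# stream looks fine to the learning rule, but progress on the base objective mostly stops.
```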
Or e.g. that it always leads to the organism optimizing for a set of goals which is unrecognizably different from the base objective. I don’t think you see these things, and I’m interested in figuring out how evolution prevented them.
As I understand it, Wang et al. found that their experimental setup trained an internal RL algorithm that was more specialized for this particular task, but was still optimizing for the same task that the RNN was being trained on? And it was selected exactly because it pursued that very goal better. If the circumstances changed so that the more specialized behavior was no longer appropriate, then (assuming the RNN’s weights hadn’t been frozen) the feedback to the outer network would gradually end up reconfiguring the internal algorithm as well. So I’m not sure how it even could end up with something that’s “unrecognizably different” from the base objective—even after a distributional shift, the learned objective would probably still be recognizable as a special case of the base objective, until it updated to match the new situation.
The thing that I would expect to see from this description is that humans who were e.g. practicing a particular skill might end up becoming overspecialized to the circumstances around that skill, and need to occasionally relearn things to fit a new environment. And that certainly does seem to happen. Likewise for more general/abstract skills, like “knowing how to navigate your culture/technological environment”, where older people’s strategies are often more adapted to how society used to be rather than how it is now—but still aren’t incapable of updating.
Catastrophic misalignment seems more likely to happen in the case of something like evolution, where the two learning algorithms operate on vastly different timescales, and it takes a very long time for evolution to correct after a drastic distributional shift. But the examples in Wang et al. lead me to think that in the brain, even the slower process operates on a timescale that’s on the order of days rather than years, allowing for reasonably rapid adjustments in response to distributional shifts. (Though it’s plausible that the more structure there is in need of readjustment, the slower the reconfiguration process will be—which would fit the behavioral calcification that we see in e.g. some older people.)
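For readers who haven’t seen the setup, here is a heavily simplified sketch of the two-timescale structure being discussed: an outer loop makes slow weight updates across episodes, while within each episode the trained RNN adapts quickly through its hidden state alone. This is only loosely inspired by Wang et al. (a two-armed bandit task and plain REINFORCE rather than their actual architecture and training method), and every hyperparameter here is an arbitrary choice of mine:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.distributions import Categorical

N_ARMS, HIDDEN, STEPS_PER_EPISODE = 2, 48, 100

class MetaAgent(nn.Module):
    def __init__(self):
        super().__init__()
        # Input at each step: one-hot of the previous action + the previous reward.
        self.cell = nn.LSTMCell(N_ARMS + 1, HIDDEN)
        self.head = nn.Linear(HIDDEN, N_ARMS)

    def forward(self, x, state):
        h, c = self.cell(x, state)
        return self.head(h), (h, c)

agent = MetaAgent()
opt = torch.optim.Adam(agent.parameters(), lr=1e-3)

for _ in range(2000):                            # outer, "slow" loop: weight updates across episodes
    probs = torch.rand(N_ARMS)                   # a fresh bandit each episode
    state = (torch.zeros(1, HIDDEN), torch.zeros(1, HIDDEN))
    x = torch.zeros(1, N_ARMS + 1)
    log_probs, rewards = [], []
    for _ in range(STEPS_PER_EPISODE):           # inner, "fast" loop: adaptation via hidden state only
        logits, state = agent(x, state)
        dist = Categorical(logits=logits)
        a = dist.sample()
        r = torch.bernoulli(probs[a])
        log_probs.append(dist.log_prob(a))
        rewards.append(r)
        x = torch.cat([F.one_hot(a, N_ARMS).float(), r.view(1, 1)], dim=1)
    returns = torch.stack(rewards).sum()
    baseline = STEPS_PER_EPISODE * 0.5           # crude constant baseline for variance reduction
    loss = -(returns - baseline) * torch.stack(log_probs).sum()
    opt.zero_grad()
    loss.backward()
    opt.step()

# The intended result (as in the paper's story): after training, within a single new
# episode and with the weights frozen, the hidden-state dynamics alone shift choices
# toward the better arm: an "inner" learner still pursuing the objective the outer loop trained for.
```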
It seems possible to me. A common strategy in religious groups is to steer for a wide barrier between themselves and particular temptations. This could be seen as a strategy for avoiding DA signals which would de-select for the behaviors encouraged by the religious group: no rewards are coming in for alternate behavior, so the best the DA can do is reinforce the types of reward which the PFC has restricted itself to.
This can be supplemented with modest rewards for desired behaviors, which force the DA to reinforce the inner optimizer’s desired behaviors.
Although it is easier in a community which supports the behaviors, it’s entirely possible to do this to oneself in relative isolation, as well.
Good point, I wasn’t thinking of social effects changing the incentive landscape.
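Here is a toy sketch of that restriction mechanism; the “sanctioned practice”/“temptation” names and the reward values are invented for illustration. A simple value learner only gets feedback on behaviors that are actually tried, so ruling a behavior out entirely means its estimated value never catches up, and the restriction sustains itself:

```python
import random

TRUE_REWARD = {"sanctioned_practice": 0.4, "temptation": 0.9}

def run(allowed, steps=2000, eps=0.1, lr=0.05):
    value = {b: 0.0 for b in TRUE_REWARD}        # learned value estimates (DA analog)
    for _ in range(steps):
        if random.random() < eps:
            b = random.choice(allowed)           # exploration happens only within the allowed set
        else:
            b = max(allowed, key=lambda x: value[x])
        r = TRUE_REWARD[b] + random.gauss(0, 0.1)
        value[b] += lr * (r - value[b])          # only the behavior actually tried gets updated
    return value

print(run(allowed=["sanctioned_practice", "temptation"]))  # learns that the temptation pays better
print(run(allowed=["sanctioned_practice"]))                # the temptation's estimated value stays at 0.0
```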
Kaj, the point I understand you to be making is: “The inner RL algorithm in this scenario is probably reliably aligned with the outer RL algorithm, since the former was selected specifically on the basis of it being good at accomplishing the latter’s objective, and since if the former deviates from pursuing that objective it will receive less reward from the outer algorithm, causing it to reconfigure itself to be better aligned. And since the two algorithms operate on similar time scales, we should expect any such misalignment to be noticed/corrected quickly.” Does this seem like a reasonable paraphrase?
It doesn’t feel obvious to me that the outer layer will be able to reliably steer the inner layer in this sense, especially as systems become more powerful. For example, it seems plausible to me that the inner layer might come to optimize for its proxy estimations of outer reward more than for outer reward itself, and that those two things might become decoupled.
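A minimal illustration of that decoupling worry, with a made-up reward function and “familiar range” (this is just the generic Goodhart pattern, not a claim about brains): fit a proxy to rewards observed in familiar situations, then let the inner layer optimize the proxy over a wider range of options:

```python
import numpy as np

rng = np.random.default_rng(0)

def true_reward(x):
    return x - 0.1 * x**2                        # peaks at x = 5, then declines

# Proxy: a linear fit to rewards observed only in the familiar range 0 <= x <= 3.
xs = rng.uniform(0, 3, 200)
slope, intercept = np.polyfit(xs, true_reward(xs), 1)

def proxy(x):
    return slope * x + intercept

candidates = np.linspace(0, 20, 2001)            # the inner layer now searches a wider range
best = candidates[np.argmax(proxy(candidates))]
print(best, proxy(best), true_reward(best))
# The proxy keeps increasing with x, so its argmax sits at the far edge (x = 20),
# where the true reward is strongly negative: proxy and outer reward have decoupled.
```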
That seems like a reasonable paraphrase, at least if you include the qualification that the “quickly” is relative to the amount of structure that the inner layer has accumulated, so might not actually happen quickly enough to be useful in all cases.
For example, it seems plausible to me that the inner layer might come to optimize for its proxy estimations of outer reward more than for outer reward itself, and that those two things could become decoupled.
Sure, e.g. lots of exotic sexual fetishes look like that to me. Hmm, though actually that example makes me rethink the argument that you just paraphrased, given that those generally emerge early in an individual’s life and then generally don’t get “corrected”.
What would it look like if they did?