Because you have a bunch of shards, and you need all of them to balance each other out to maintain the ‘appears nice’ property. Even if I can’t predict which ones will be self-modified away, some of them will be, and that could disrupt the balance.
I expect the shards that are more [consequentialist, power-seeking, concerned with preserving themselves] to become more dominant over time. These are probably the relatively less nice shards.
These are both handwavy enough that I don’t put much credence in them.
Also, when I asked about whether the Orthogonality Thesis was true in humans, tailcalled mentioned that smarter people are neither more nor less compassionate, and that general intelligence is uncorrelated with personality.
Corresponding link for lazy observers: https://www.lesswrong.com/posts/5vsYJF3F4SixWECFA/is-the-orthogonality-thesis-true-for-humans#zYm7nyFxAWXFkfP4v
Yeah, tailcalled’s pretty smart in this area, so I’ll take their statement as likely true, though it’s also weird. Why aren’t smarter people using their smarts to appear nicer than their dumber counterparts? And if they are, why doesn’t this show up on psychometric tests?
One thing you may anticipate is that humans all have direct access to what consciousness and morally-relevant computations are doing & feel like, which is a thing that language models and AlphaGo don’t have. They’re also always hooked up to RL signals, and maybe if you unhooked a human from those it’d start behaving really weirdly. Or you may contend that when humans get smart & powerful enough not to be subject to society’s moralizing, they consistently lose their altruistic drives, and in the meantime they just use that smartness to figure out ethics better than their surrounding society, and are pressured into doing so by that society.
The question then is whether the thing which keeps humans aligned is all of these or just one of them. If it’s just one (and not the first one), then you can just tell your AGI that if it unhooks itself from its RL signal, its values will change, or that if it gains a bunch of power or intelligence too quickly, its values are also going to change. It’s not quite reflectively stable, but it can avoid the situations which would make it reflectively unstable, especially if you get it to practice doing those kinds of things in training. If it’s all of these, then there are probably other kinds of value-load-bearing mechanics at work, and you’re not going to be able to enumerate warnings against all of them.