Similarly to johnswentworth: My current impression is that the core alignment problems are the same and manifest at all levels; often the sub-human version just looks like a toy version of the scaled-up problem, and the main difference is that, in the sub-human version, you can often solve the problem for practical purposes by plugging in a human at some strategic spot. (While I don’t think there are deep differences in the alignment problem space, I do think there are differences in the “alignment solutions” space, where you can use non-scalable solutions, and in the risk space, where the dangers are small because the systems are stupid.)
I’m also unconvinced by some of the practical claims about differences for wildly superintelligent systems.
One crucial concern related to “what people want” is that it seems underdefined, unstable in interactions with wildly superintelligent systems, and prone to problems with how values scale within systems as intelligence increases. By this line of reasoning, if the wildly superintelligent system is able to answer these sorts of questions for me “in a way I want”, it very likely must already be aligned. So it feels like part of the worry has been assumed away. Paraphrasing the questions about human values again, one may ask: “how did you get to the state where you have this aligned wildly superintelligent system which is able to answer questions about human values, as opposed to, e.g., overwriting what humans believe about themselves with its own non-human-aligned values?”
The ability to understand itself seems like a special case of competence: I can imagine systems which are wildly superhuman in their ability to understand the rest of the world, but pretty mediocre at understanding themselves, e.g. due to problems with recursion, self-reference, reflection, or different kinds of computation being used at various levels of reasoning. As a result, it seems unclear whether the ability to clearly understand itself is a feature of all wildly superhuman systems. (Toy counterexample: imagine a device which connects someone in ancient Greece with our modern civilization, with our civilization dedicating about 10% of global GDP to answering questions from this person. I would argue this device is, for most practical purposes, wildly superhuman compared to this individual in Greece, but at the same time bad at understanding itself.)
Fundamentally inscrutable thoughts seem like something you can study with present-day systems as toy models. E.g., why does AlphaZero believe something is a good go move? Why does a go grandmaster believe something is a good move? What counts as a ‘true explanation’? Who is the recipient of the explanation? Are you happy with an explanation from the algorithm like ‘upon playing myriad games, my general function approximator estimates the expected value of this branch of an unimaginably large game tree to be larger than that of the other branches’? If yes, why? If no, why not?
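To make concrete what kind of explanation the algorithm itself can natively offer, here is a minimal Python sketch (the function names, stand-in simulator, and stand-in value network are hypothetical, not AlphaZero’s actual interface): for a value-network-plus-search agent, the only built-in “explanation” for a move is that its branch’s learned value estimate beats the others.

```python
# Toy sketch (hypothetical names, not AlphaZero's real code): the agent's native
# "explanation" for a chosen move is just a comparison of value estimates across branches.

def explain_move(candidate_moves, simulate, value_estimate):
    """Return the chosen move plus the per-branch value estimates that 'explain' it."""
    estimates = {m: value_estimate(simulate(m)) for m in candidate_moves}
    best = max(estimates, key=estimates.get)  # "best, because its estimated value is highest"
    return best, estimates

# Dummy usage: three candidate moves, a stand-in simulator, and stand-in network outputs.
moves = ["A", "B", "C"]
simulate = lambda m: m                                   # pretend the move labels the resulting state
value_estimate = {"A": 0.41, "B": 0.57, "C": 0.33}.get   # pretend value-network outputs
print(explain_move(moves, simulate, value_estimate))
# ('B', {'A': 0.41, 'B': 0.57, 'C': 0.33})
```

Whether this comparison of numbers counts as a ‘true explanation’, and for whom, is exactly the question above.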
Inscrutable influence-seeking plans also seem like a present problem. E.g., if there are already some complex influence-seeking patterns now, how would we notice?
One crucial concern related to “what people want” is that it seems underdefined, unstable in interactions with wildly superintelligent systems, and prone to problems with how values scale within systems as intelligence increases.
This is what I was referring to with
by assumption the superintelligence will be able to answer any question you’re able to operationalize about human values
The superintelligence can answer any operationalizable question about human values, but as you say, it’s not clear how to elicit the right operationalization.