AFAIK, it is not necessary to “accurately reverse engineer human values and also accurately encode them”. That’s considered too hard and, as you say, not tractable anytime soon. Further, even if you could do that, you would only have solved outer alignment; inner alignment would still remain unsolved.
Instead, the aim is to build “corrigible” AIs. See Let’s See You Write That Corrigibility Tag, Corrigibility (Arbital), Hard problem of corrigibility (Arbital).
Quoting from the last link:

The “hard problem of corrigibility” is to build an agent which, in an intuitive sense, reasons internally as if from the programmers’ external perspective. We think the AI is incomplete, that we might have made mistakes in building it, that we might want to correct it, and that it would be e.g. dangerous for the AI to take large actions or high-impact actions or do weird new things without asking first.

We would ideally want the agent to see itself in exactly this way, behaving as if it were thinking, “I am incomplete and there is an outside force trying to complete me, my design may contain errors and there is an outside force that wants to correct them and this is a good thing, my expected utility calculations suggesting that this action has super-high utility may be dangerously mistaken and I should run them past the outside force; I think I’ve done this calculation showing the expected result of the outside force correcting me, but maybe I’m mistaken about that.”
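To make the quoted intuition a bit more concrete, here is a toy sketch (my own illustration, not from the Arbital page; all class names, thresholds, and parameters are made up): an agent wrapper that treats its own utility estimates as possibly mistaken and defers high-impact or suspiciously-high-utility actions to an outside overseer. Note that this is exactly the shallow version; the hard problem is getting an agent that reasons this way from the inside, rather than one with a check bolted on that a smarter optimizer would simply route around.

```python
# Toy sketch only: a "naively corrigible" agent that defers suspicious
# actions to an outside overseer and accepts shutdown. Illustrative, not a
# solution to the hard problem of corrigibility.

from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Action:
    name: str
    estimated_utility: float   # the agent's own (possibly wrong) estimate
    estimated_impact: float    # rough proxy for how large/irreversible it is


class Overseer:
    """Stands in for the programmers: can veto, correct, or shut the agent down."""

    def approve(self, action: Action) -> bool:
        reply = input(f"Approve '{action.name}'? [y/N] ")
        return reply.strip().lower() == "y"


class NaivelyCorrigibleAgent:
    def __init__(self, overseer: Overseer,
                 impact_threshold: float = 1.0,
                 utility_sanity_cap: float = 100.0):
        self.overseer = overseer
        self.impact_threshold = impact_threshold
        self.utility_sanity_cap = utility_sanity_cap
        self.shut_down = False

    def choose(self, options: List[Action]) -> Action:
        # Pick the action with the highest *estimated* utility.
        return max(options, key=lambda a: a.estimated_utility)

    def act(self, options: List[Action], execute: Callable[[Action], None]) -> None:
        if self.shut_down:
            return
        action = self.choose(options)
        suspicious = (action.estimated_impact > self.impact_threshold
                      or action.estimated_utility > self.utility_sanity_cap)
        # "My calculations may be dangerously mistaken; run them past the outside force."
        if suspicious and not self.overseer.approve(action):
            return  # defer to the correction instead of routing around it
        execute(action)

    def accept_shutdown(self) -> None:
        # A corrigible agent should not resist being switched off or modified.
        self.shut_down = True
```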
Also, most if not all researchers think alignment is a solvable problem, but many think we may not have enough time.