In the kinds of model-based RL AGI architectures that I normally think about (see here)…
step 1 (“usual behavior”) is fine.
steps 2-4 (“notices that there are certain features connected with the shade of blue”, “notices that the correspondence is not perfect”, “deduces that it is plausible that its reward function is a noisy approximation for destruction of cancer cells”) kinda happen to some extent by default. I think there’s a way to do generative probabilistic modeling wherein the algorithm is always pattern-matching to lots of different learned features in parallel; I think the brain uses this method, and I expect future AGI programmers will use it too because it works better. (I wouldn’t have described these steps with the kind of self-aware language that you used, but I’m not sure that matters.)
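To make “pattern-matching to lots of different learned features in parallel” a bit more concrete, here is a minimal toy sketch in Python (my own illustration, not something from the post or from any existing codebase): the agent scores several candidate features in parallel by how well each one predicts the observed reward signal, which is roughly the machinery behind steps 2-3; the step-4 deduction that the reward is a noisy proxy for the intended goal would be a further inference on top of this.

```python
import numpy as np

def update_posterior(log_prior, feature_matrix, rewards, noise=0.1):
    """log_prior: (n_hypotheses,) log-prior over "which learned feature does the reward track?"
    feature_matrix: (n_obs, n_hypotheses) value of each candidate feature on each observation
    rewards: (n_obs,) observed reward signal
    Returns the posterior over hypotheses, all scored in parallel."""
    # Gaussian likelihood of the observed rewards under each candidate feature
    log_lik = -0.5 * np.sum((rewards[:, None] - feature_matrix) ** 2, axis=0) / noise**2
    log_post = log_prior + log_lik
    log_post -= log_post.max()           # numerical stability
    post = np.exp(log_post)
    return post / post.sum()

# Two candidate features: "shade of blue" vs. "cancer cell destroyed".
features = np.array([[1.0, 1.0],   # most observations: blue thing, cancer cell destroyed
                     [1.0, 1.0],
                     [1.0, 0.0]])  # the tell-tale case: blue thing, but no cancer cell destroyed
rewards = np.array([1.0, 1.0, 1.0])
print(update_posterior(np.log([0.5, 0.5]), features, rewards))
# -> nearly all posterior mass on "shade of blue": the reward tracks the blue feature,
#    and its correspondence with "cancer cell destroyed" is not perfect (step 3).
```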
step 5 (“it starts being conservative”) seems possible to me, see Section 14.4.2 here.
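Continuing the same toy setup (again my own illustration, in the general spirit of the conservatism discussed in Section 14.4.2 rather than anything prescribed there), one simple form of “being conservative” is to pick the action that maximizes the worst-case value across every reward hypothesis that still has non-negligible posterior probability:

```python
import numpy as np

def conservative_action(action_values, posterior, plausibility_cutoff=0.05):
    """action_values: (n_actions, n_hypotheses) value of each action under each reward hypothesis
    posterior: (n_hypotheses,) posterior over reward hypotheses (e.g. from update_posterior above)."""
    plausible = posterior > plausibility_cutoff
    worst_case = action_values[:, plausible].min(axis=1)   # worst value over plausible hypotheses
    return int(np.argmax(worst_case))

# Action 0 is great if the reward is really "blue shade" but terrible if it's "destroy cancer cells";
# action 1 is decent under both. The conservative agent picks action 1.
values = np.array([[10.0, -5.0],
                   [ 4.0,  4.0]])
print(conservative_action(values, np.array([0.5, 0.5])))   # -> 1
```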
step 6 (“it asks for clarification from its programmers”) seems very hard because it requires something like a good UI / interpretability. I’m thinking of the concepts and features as abstract statistical patterns in patterns in patterns in sensory input and motor output, and I expect that this kind of thing will not be straightforward to present to the programmers. Again see here. Another problem is that we need a criterion for when to ask the programmers, and I don’t see any principled way to pick that criterion: if it’s too strict we get a big alignment tax (and perhaps that alignment tax is unavoidable no matter what we do).
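To illustrate the “criterion for when to ask” problem (again a toy sketch of my own, with hypothetical names), the obvious unprincipled move is a bare threshold: ask the programmers only when no single action is low-regret under every plausible reward hypothesis. The threshold is exactly the knob I don’t know how to set: too strict and the AGI queries constantly (big alignment tax), too loose and it never asks.

```python
import numpy as np

def should_ask(action_values, posterior, regret_threshold=3.0, plausibility_cutoff=0.05):
    """action_values: (n_actions, n_hypotheses); posterior: (n_hypotheses,).
    Ask the programmers iff every available action looks bad (high regret)
    under at least one reward hypothesis that is still plausible."""
    plausible = action_values[:, posterior > plausibility_cutoff]
    # regret of each action under each plausible hypothesis, relative to that hypothesis's best action
    regret = plausible.max(axis=0, keepdims=True) - plausible
    worst_regret_per_action = regret.max(axis=1)
    return bool(worst_regret_per_action.min() > regret_threshold)

values = np.array([[10.0, -5.0],
                   [ 4.0,  4.0]])
print(should_ask(values, np.array([0.5, 0.5])))   # -> True: no action is safe under both hypotheses
print(should_ask(np.array([[10.0, 9.0],
                           [ 4.0, 4.0]]), np.array([0.5, 0.5])))   # -> False: action 0 is fine either way
```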
step 7 (“it iterates”)—I don’t know how one would implement that. The obvious method (the AGI is motivated to do good concept-extrapolation, just like it’s motivated to do whatever else it does) seems very dicey to me; see Section 14.4.3.3. Also, if that “obvious method” is the plan, then nothing you’re doing at Aligned AI would be relevant to that plan—it would just look like making an AGI that wants to be helpful, Paul Christiano style, and then hoping the concept-extrapolation emerges organically, right?
steps 8-9 (“it starts to notice that there are systematic biases in how the programmers give it feedback”, “it now has to partially infer what its goals are”)—again, I don’t know how one would implement that. I’m very confused. I was imagining that the AGI presents a menu of possible extrapolations, the programmers pick the right one, and the source code directly sets that answer as a highly-confident ground truth (sketched below). Apparently you have something else in mind? Maybe the thing I was talking about in step 7 above?
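For concreteness, here is roughly what I had in mind as code (my own sketch with hypothetical function names, not a claim about what Aligned AI is actually building): the AGI surfaces a menu of candidate extrapolations, the programmers pick one, and the chosen answer gets written back into the hypothesis space as near-certain ground truth.

```python
import numpy as np

def ask_programmers(candidate_descriptions):
    """Stand-in for the hard UI / interpretability step: display the menu of candidate
    extrapolations in human-legible form and return the index the programmers choose."""
    for i, desc in enumerate(candidate_descriptions):
        print(f"[{i}] {desc}")
    return int(input("Which extrapolation is the intended one? "))

def set_ground_truth(posterior, chosen_index, confidence=0.99):
    """Overwrite the posterior so the chosen hypothesis is treated as (nearly) settled,
    i.e. the source code directly sets the programmers' answer as highly-confident ground truth."""
    posterior = np.asarray(posterior, dtype=float)
    new_posterior = np.full_like(posterior, (1.0 - confidence) / max(len(posterior) - 1, 1))
    new_posterior[chosen_index] = confidence
    return new_posterior

# Example: the programmers pick "noisy proxy for destroying cancer cells" over "shade of blue".
menu = ["reward tracks the shade of blue", "reward is a noisy proxy for destroying cancer cells"]
choice = ask_programmers(menu)
print(set_ground_truth([0.4, 0.6], choice))
```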