Oh, I’m very open-minded. I was writing that section for an audience of non-AGI-safety-experts and didn’t want to make things over-complicated by working through the full range of possible solutions to the problem; I just wanted to say enough to convince readers that there is a problem here, and it’s not trivial.
The Judge box (usually I call it the “steering subsystem”) can be anything. There could even be a tower of AGIs steering AGIs, IDA-style, but I don’t know the details, like what you would put at the base of the tower; I haven’t really thought about it. Or it could be a deep neural net classifier. (How do you train it? “Here’s 5000 examples of corrigibility, here’s 5000 examples of incorrigibility”?? Or what? Beats me...) In this post I proposed that the amygdala houses a supervised learning algorithm which does a sorta “interpretability” thing where it tries to decode the latent variables inside the neocortex, and then those signals are inputs to the reward calculation. I don’t see how that kind of mechanism would apply to more complicated goals, and I’m not sure how robust it is. Anyway, yeah, it could be anything; I’m open-minded.
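Just to make the “deep neural net classifier” option a bit more concrete, here’s a toy PyTorch sketch of the general shape: a supervised probe trained on labeled latent-state snapshots, whose output then becomes one input to the reward calculation. Everything here (the latent dimension, the synthetic data, the reward formula) is made up purely for illustration; the hard part, as I said, is where the 5000-examples-each labels would actually come from.

```python
import torch
import torch.nn as nn

# Toy illustration (all names and numbers hypothetical): a supervised "probe"
# that tries to decode a safety-relevant property from an agent's latent
# state, whose output then feeds into the reward calculation.

LATENT_DIM = 64  # stand-in for "latent variables inside the neocortex"

probe = nn.Linear(LATENT_DIM, 1)  # the amygdala-ish classifier
opt = torch.optim.Adam(probe.parameters(), lr=1e-3)
loss_fn = nn.BCEWithLogitsLoss()

# Fake labeled data: latent snapshots tagged corrigible (1) / incorrigible (0).
# In reality, where these labels come from is exactly the open problem.
pos = torch.randn(5000, LATENT_DIM) + 1.0   # "corrigible" snapshots
neg = torch.randn(5000, LATENT_DIM) - 1.0   # "incorrigible" snapshots
latents = torch.cat([pos, neg])
labels = torch.cat([torch.ones(5000), torch.zeros(5000)]).unsqueeze(1)

for _ in range(100):
    opt.zero_grad()
    loss = loss_fn(probe(latents), labels)
    loss.backward()
    opt.step()

def reward(base_reward: float, latent: torch.Tensor, weight: float = 1.0) -> float:
    """Reward = task reward plus a bonus from the decoded latent signal."""
    corrigibility_score = torch.sigmoid(probe(latent)).item()
    return base_reward + weight * corrigibility_score
```

The same skeleton arguably covers the amygdala story from the post, too: swap the synthetic latents for actual decoded neocortex variables, and the probe output for one of the innate inputs to reward.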