I think a bit too much mindshare is being spent on these sci-fi scenario discussions, although they are fun.
Honestly I have trouble following these arguments about deception evolving in RL. In particular, I can't quite wrap my head around how the agent ends up optimizing for something else (not a proxy objective, but a possibly totally orthogonal objective like "please my human masters so I can later do X"). In any case, it seems self-awareness is required for the type of deception you're envisioning. Which raises an interesting question: can a purely feed-forward network develop self-awareness during training? I don't know about you, but I have trouble picturing it happening unless there is some sort of loop involved.
Yeah, but don't you expect successful human-equivalent neural networks to involve some sort of loop? It seems pretty likely to me that ML researchers will figure out how to put self-analysis loops into neural nets.
Networks with loops are much harder to train; that was one of the motivations for moving from RNNs to transformers. But yeah, sure, I agree. My objection is more that posts like this are so high-level that I have trouble following the argument, if that makes sense. The argument seems roughly plausible, but not making contact with any real object-level stuff makes it a lot weaker, at least to me. It seems to rely on "emergence of self-awareness / discovery of malevolence or deception during SGD" being likely, which is unjustified in my view. I'm not saying the argument is wrong, more that I personally don't find it very convincing.
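To make the "loops are harder to train" point a bit more concrete, here's a rough PyTorch sketch (my own toy illustration, not anything from the original post): a recurrent layer has to unroll a sequential loop over time steps, so gradients must flow back through every step, while a transformer layer processes the whole sequence in one parallel pass.

```python
# Toy contrast between a recurrent "loop" and a feed-forward attention layer.
# Dimensions and module choices are arbitrary, just for illustration.
import torch
import torch.nn as nn

seq_len, batch, d_model = 128, 8, 64
x = torch.randn(seq_len, batch, d_model)

# Recurrent: hidden state h_t depends on h_{t-1} -- a sequential loop,
# so training requires backpropagation through all 128 time steps.
rnn = nn.RNN(input_size=d_model, hidden_size=d_model)
out_rnn, h_n = rnn(x)  # internally iterates step by step over the sequence

# Transformer encoder layer: every position is computed in one parallel pass,
# with no step-by-step recurrence to unroll during training.
encoder_layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=4)
out_attn = encoder_layer(x)

out_rnn.sum().backward()   # gradient threads back through the sequential loop
out_attn.sum().backward()  # gradient path depth doesn't grow with seq_len
```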