I can agree with “RLHF doesn’t robustly disincentivize misaligned powerseeking that has occurred through other means” (I would expect it often does, but often doesn’t). Separately from all this, I’m not so worried about LLMs because their capabilities come from imitation learning; but if you’re more worried about imitation learning than I am, or if models start gaining more capabilities from “real agency”, then I’d say my post doesn’t disprove the possibility of misaligned powerseeking, only argues that it’s not what RLHF favors.
My point is that RLHF incentivizes all sorts of things, and what those things are depends on the content of the trained model, not on what RLHF is.
It depends on both.