I mean to argue against your meta-strategy, which relies on obtaining relevant understanding of deception or alignment as we get larger models and see how they work. I agree that we will obtain some understanding, but it seems like we shouldn’t expect that understanding to come very close to being sufficient for making AI go well (see my previous argument), and hence that this isn’t a very promising meta-strategy.
I read your previous comment as suggesting that the improved understanding would mainly be used for pursuing a specific strategy for dealing with deception, namely “to learn the properties of what looks like deception to humans, and instill those properties into a loss function”. And it seemed to me that the problem you raised was specific to that particular strategy for dealing with deception, as opposed to something else that we might come up with?