Yeah, I think this is the fundamental problem, though stated in a very simple way. Perhaps it's useful for someone who doesn't believe AI alignment is a problem?
Here’s my summary: even at the limit of how much data and variety you can provide via RLHF, and even if the learned policy generalizes perfectly to every new situation you can throw at it, the result will still almost certainly be malign, because there remain nearly infinitely many policies consistent with that training data, and each of them behaves differently on the effectively infinite remaining types of situation you never managed to train on. Since the particular policy you end up with is just one of that vast space, it is unlikely to be the correct one.
But more importantly, behavior upon self-improvement and reflection is likely something we didn't test, because we can't. The alignment problem now requires that we look into the details of generalization; this is where all the interesting stuff is.
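To make the underdetermination point in my summary concrete, here is a minimal sketch (just a toy regression, not any actual RLHF setup; all names and numbers are made up for illustration). Several overparameterized models are fit to the same small training set; every one of them fits the "situations we trained on" essentially perfectly, yet they disagree with each other on an input far outside the training range, because the data simply doesn't pin down behavior there.

```python
import numpy as np

# Toy illustration of policy underdetermination: many models fit the same
# finite training set exactly, yet disagree outside it.

rng_data = np.random.default_rng(0)
X_train = np.linspace(-1.0, 1.0, 20).reshape(-1, 1)   # "situations we trained on"
y_train = np.sin(3 * X_train).ravel()                  # "behavior we rewarded"

def fit_random_feature_model(seed, n_features=500):
    """Fit one overparameterized model: random ReLU features + least squares.
    Each seed yields a different model that (near-)exactly fits the data."""
    rng = np.random.default_rng(seed)
    W = rng.normal(size=(1, n_features))
    b = rng.normal(size=n_features)

    def features(X):
        return np.maximum(X @ W + b, 0.0)              # random ReLU features

    # Minimum-norm least-squares fit interpolates the 20 training points.
    coef, *_ = np.linalg.lstsq(features(X_train), y_train, rcond=None)
    return lambda X: features(X) @ coef

models = [fit_random_feature_model(seed) for seed in range(5)]

X_out = np.array([[3.0]])                              # a situation we never trained on

for i, m in enumerate(models):
    train_err = np.max(np.abs(m(X_train) - y_train))
    print(f"model {i}: max train error = {train_err:.1e}, "
          f"prediction at x=3.0 = {m(X_out)[0]:+.2f}")
# Typical result: train error is tiny for every model, but the predictions at
# x = 3.0 differ from model to model. The data only constrains behavior where
# you have data; everything else is left to whichever model you happened to get.
```

The analogy is loose, of course: RLHF isn't least-squares regression, and real off-distribution situations (like self-improvement and reflection) are far stranger than an out-of-range scalar input. The point is only that "fits everything we tested" is compatible with many mutually inconsistent behaviors elsewhere.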