What’s to stop the AI from instead learning that “good” and “bad” are just subjective mental states or words from the programmer, rather than some deep natural category of the universe? So instead of doing things it thinks the human programmer would call “good”, it just tortures the programmer and forces them to say “good” repeatedly.
The pictures and videos of torture in the training set that are labelled “bad”.
It is not perfect, but I think the idea is that with a large and diverse training set, alternative models of “good/bad” become extremely contrived, and the human notion you are aiming for becomes the simplest model.
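To make the “contrived models lose” intuition concrete, here is a minimal toy sketch of my own (not from the post): under a description-length / Occam-style criterion, a hypothesis that must memorize exceptions to fit a large, diverse labelled dataset becomes ever more expensive, while a hypothesis that tracks the intended category stays cheap. All names and costs here are made up for illustration.

```python
import random

random.seed(0)

# Hypothetical training examples: each has a natural "harm" feature and a label.
# Intended rule: label is "bad" iff the example depicts harm (e.g. torture).
def make_dataset(n):
    data = []
    for _ in range(n):
        harm = random.random() < 0.5
        data.append({"harm": harm, "label": "bad" if harm else "good"})
    return data

# Simple hypothesis: predict the label from the natural feature.
# Its description length is roughly constant, regardless of dataset size.
def description_length_simple(data):
    return 10  # rough, fixed cost in bits for "label = bad iff harm"

# Contrived hypothesis: "good/bad are just whatever the programmer utters."
# To match the diverse labelled data (torture labelled "bad", etc.) it has to
# hard-code every example that conflicts with it, so its cost grows with n.
def description_length_contrived(data):
    exceptions = sum(1 for ex in data if ex["label"] == "bad")
    return 10 + 16 * exceptions  # ~16 bits per memorized exception

for n in (10, 100, 1000):
    data = make_dataset(n)
    print(n, description_length_simple(data), description_length_contrived(data))
# As the dataset grows, the contrived hypothesis gets more and more expensive,
# while the simple (intended) one stays cheap -- the sense in which it is "simplest".
```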
I found the material in the post very interesting. It holds out hope that, after training, your world model might not be as opaque as people fear.