I think getting agents to robustly do what the trainers want would be a huge win.
I'll also mention that I conjecture this is the best result alignment can realistically achieve, at least without invoking mind-control or directly controlling values, and that societal alignment is either impossible or trivial, depending on the constraints.
Hmm, again it depends on whether you're defining "alignment" narrowly (the technical problem of getting superhumanly powerful machines to robustly attempt to do what humans actually want) or more broadly (e.g., the whole scope of navigating the transition from sapiens controlling the world to superhumanly powerful machines controlling the world, in a way that helps humans survive and flourish).
If the former, I disagree with you slightly; I think "human values" is possibly a broad enough category that some recognizably-human values are easier to align AI to than others. Consider the caricature of an amoral businessman vs. someone trying to do principled preference utilitarianism for all of humanity.
If the latter, I think I disagree very strongly. There are many incremental improvements short of mind-control that could make the loading of human values go more safely, e.g., good information security, theoretical work in preference aggregation, increasing certain types of pluralism, basic safeguards in lab and corporate governance, ensuring that the subjects of value-loading are a larger set of people than a few lab heads and/or government leaders, advocacy for moral reflection and moral uncertainty, and (assuming slowish takeoff) trying to make sure collective epistemics don't go haywire during the advent of ~human-level or slightly superhuman intelligences, etc.