I expect academia to have more appetite for AI safety work that looks like (adversarial) robustness, mechanistic interpretability, etc, than alignment qua alignment. From the outside, it doesn’t seem very unlikely for academia to do projects similar to what Redwood Research does, for example.
Though typical elite academics might also be distracted by shiny publishable projects rather than being as focused on/dedicated to core problems as, e.g., Redwood. This is somewhat counterbalanced by academia potentially having a higher quantity, and possibly quality, of high-end intellectual horsepower/rigor/training.
The thing that probably won’t get funding is aligning a fully autonomous agent with the implicit interests of all humans (not just the trainers), which generalizes to the x-risk problem.
I think getting agents to robustly do what the trainers want would be a huge win. Instilling the right values, conditional upon being able to instill any reasonable values, seems like a) plausibly an easier problem, b) not entirely (or primarily?) technical, and c) a reasonable continuation of existing nontechnical work in AI governance, moral philosophy, political science, and, well, having a society.
I think getting agents to robustly do what the trainers want would be a huge win.
I want to mention that I sort of conjecture that this is the best result alignment can realistically achieve, at least without invoking mind control/directly controlling values, and that societal alignment is either impossible or trivial, depending on the constraints.
Hmm, again it depends on whether you’re defining “alignment” narrowly (the technical problem of getting superhumanly powerful machines to robustly attempt to do what humans actually want) or more broadly (e.g. the whole scope of navigating the transition from sapiens controlling the world to superhumanly powerful machines controlling the world, in a way that helps humans survive and flourish).
If the former, I disagree with you slightly; I think “human values” are possibly broad enough that some recognizably-human values are easier to align an AI to than others. Consider the caricature of an amoral businessman vs. someone trying to do principled preference utilitarianism for all of humanity.
If the latter, I think I disagree very strongly. There are many incremental improvements short of mind control that could make the loading of human values go more safely: e.g., having good information security, theoretical work in preference aggregation, increasing certain types of pluralism, basic safeguards in lab and corporate governance, trying to make the subjects of value-loading a broader set of people than a few lab heads and/or gov’t leaders, advocacy for moral reflection and moral uncertainty, (assuming slowish takeoff) trying to make sure collective epistemics don’t go haywire during the advent of ~human-level or slightly superhuman intelligences, etc.