Here be cynical opinions with little data to back them.
It’s important to point out that “AI Safety” in an academic context usually means something slightly different from typical LW fare. For starters, as most AI work descended from computer science, it’s pretty hard [1] to get anything published in a serious AI venue (conference/journal) unless you either:
Demonstrate a thing works
Use theory to explain a preexisting phenomenon
Both PhD students and their advisors want to publish in established venues, so by default one should expect academic AI Safety research to prioritize near-term problems and be less focused on AGI/x-risk. That isn’t to say research can’t accomplish both things at once, but it’s worth noting.
Because AI Safety in the academic sense hasn’t traditionally meant safety from AGI ruin, there is a long history of EA-aligned people not really being aware of, or caring about, safety research. Safety has been getting funding for a long time, but it looked less like MIRI and more like the University of York’s safe autonomy lab [2] or the DARPA Assured Autonomy program [3]. With these dynamics in mind, I fully expect the majority of new AI safety funding to go to one of the following areas:
Aligning current-gen AI with the explicit intentions of its trainers in adversarial environments, e.g. making my chatbot not tell users how to build bombs when asked, or reducing the risk of my car hitting pedestrians.
Blurring the line between “responsible use” and “safety” (which is a sort of alignment problem), e.g. making my chatbot less xyz-ist, protecting training-data privacy, and the ethics of AI use.
Old-school hazard analysis and mitigation. This is like the hazard analysis a plane goes through before the FAA lets it fly, but now the planes have AI components.
The thing that probably won’t get funding is aligning a fully autonomous agent with the implicit interests of all humans (not just its trainers), which generalizes to the x-risk problem. Perhaps I lack imagination, but the way things are, I can’t really see how you publish enough on this in the usual venues to build a dissertation out of it.
[1] Yeah, of course you can get it published, but I think most would agree that it’s harder to get a pure-theory x-risk paper published in a traditional CS/AI venue than other types of papers. Perhaps this will change as new tracks open up, but I’m not sure.
I expect academia to have more appetite for AI safety work that looks like (adversarial) robustness, mechanistic interpretability, etc., than for alignment qua alignment. From the outside, it seems quite plausible that academia could take on projects similar to what Redwood Research does, for example.
Though typical elite academics might also be distracted by shiny publishable projects rather than staying as focused on the core problems as, e.g., Redwood. This is somewhat counterbalanced by academia’s potentially greater quantity, and possibly quality, of high-end intellectual horsepower/rigor/training.
The thing that probably won’t get funding is aligning a fully autonomous agent with the implicit interests of all humans (not just its trainers), which generalizes to the x-risk problem.
I think getting agents to robustly do what the trainers want would be a huge win. Instilling the right values conditional upon being able to instill any reasonable values seems like a) plausibly an easier problem, b) not entirely (or primarily?) technical, and c) a reasonable continuation of existing nontechnical work in AI governance, moral philosophy, political science, and well, having a society.
I think getting agents to robustly do what the trainers want would be a huge win.
I want to mention that I sort of conjecture this is the best result alignment can realistically achieve, at least without invoking mind-control/directly controlling values, and that societal alignment is either impossible or trivial, depending on the constraints.
Hmm, again it depends on whether you’re defining “alignment” narrowly (the technical problem of getting superhumanly powerful machines to robustly attempt to do what humans actually want) or more broadly (e.g., the whole scope of navigating the transition from sapiens controlling the world to superhumanly powerful machines controlling the world, in a way that helps humans survive and flourish).
If the former, I disagree with you slightly; I think “human values” are possibly broad enough that some recognizably-human values are easier to align an AI to than others. Consider the caricature of an amoral businessman vs. someone trying to do principled preference utilitarianism for all of humanity.
If the latter, I think I disagree very strongly. There are many incremental improvements short of mind-control that could make the loading of human values go more safely: e.g., good information security, theoretical work in preference aggregation, increasing certain types of pluralism, basic safeguards in lab and corporate governance, trying to make the subjects of value-loading a larger set of people than a few lab heads and/or gov’t leaders, advocacy for moral reflection and moral uncertainty, and (assuming slowish takeoff) trying to make sure collective epistemics don’t go haywire during the advent of ~human-level or slightly superhuman intelligences.
[2] https://www.york.ac.uk/safe-autonomy/research/assurance/
[3] https://www.darpa.mil/program/assured-autonomy