My view is that the most dangerous risks come from inner alignment failures, basically because of three things: very poor transparency tools, instrumental convergence toward power-seeking and deception, and mesa-optimizers essentially ruining whatever outer alignment you have. If you could find a reliable way to detect deceptive models, or to ensure they could never be reached in your training process, that would relieve a lot of my fears of X-risk from AI.
I actually think Eliezer is underrating how competent civilization could be once AGI is released, via the MNM effect, as happened with Covid. Unfortunately, this only buys time before the end. A superhuman intelligence that is deceiving human civilization because of instrumental convergence will essentially win barring pivotal acts, as Eliezer says. The goal of AI safety is to make alignment not dependent on heroic, pivotal actions.
So Andrew Critch’s hope of not needing pivotal acts only works if significant portions of the alignment problem are solved or at least ameliorated, which we are not super close to doing. Whether alignment will require pivotal acts therefore depends directly on solving the alignment problem more generally.
Pivotal acts are a worse solution to alignment and shouldn’t be thought of as the default, but they are a back-pocket solution we shouldn’t forget about.
If I had to name a crux between Eliezer Yudkowsky’s/Rob Bensinger’s/Nate Soares’s/MIRI’s views and the DeepMind Safety Team’s or Andrew Critch’s views, it’s whether the alignment problem is foresight-loaded (and thus civilization will be incompetent and safety will require more pivotal acts) or empirically-loaded, where we don’t need to see the bullets in advance (and thus civilization could be more competent and pivotal acts matter less). It’s an interesting crux, to be sure.
PS: Does DeepMind’s Safety Team have real power to disapprove AI projects? Or are they like the Google Ethics team, which had no power to disapprove AI projects without being fired?
We don’t have the power to shut down projects, but we can make recommendations and provide input into decisions about projects.
So you can have non-binding recommendations and input, but no actual binding power over the capabilities researchers, right?
Correct. I think that doing internal outreach to build an alignment-aware company culture and building relationships with key decision-makers can go a long way. I don’t think it’s possible to have complete binding power over capabilities projects anyway, since the people who want to run the project could in principle leave and start their own org.
What’s the MNM effect?
The MNM effect is essentially that people can react strongly to disasters once they happen, since they don’t want to die, and this can prevent the worst outcomes. It’s a short-term response, but it can become a control system.
Here’s a link: https://www.lesswrong.com/posts/EgdHK523ZM4zPiX5q/coronavirus-as-a-test-run-for-x-risks