Overall effectiveness of bag-of-tricks safety methods: Under the assumption that the alignment problem will not be solved in a general, principled way, players in the critical period who are at least partially worried about safety will likely resort to a bag of tricks to avoid obvious failure modes (for example, denying the AI internet access, having fast kill-switches, adding many terms to the value function, and other such patches). The overall effectiveness of this bag of tricks determines the maximum level of intelligence that can be used more or less safely. One negative effect is that the bag of tricks may induce a false sense of comfort in the team overseeing the AI, so the team’s culture around safety is very important for avoiding bad outcomes from overconfidence in these patches.