Capability control methods, particularly boxing and stunting, run the risk of creating a capability overhang: a gap in optimization power between the controlled AI and an uncontrolled version. This capability overhang creates an additional external hazard: another AI team, hearing of the first team's success, may believe that less capability control is required than the initial team used (possibly due to different assumptions, or motivated cognition). They will want to build a less controlled version of the AI in an attempt to gain greater optimization power and a boost over their rivals. This continues until someone crosses the line to an unsafe AI.
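To make the dynamic concrete, here is a toy sketch of the model, purely illustrative: the initial control margin, the fraction by which each successive team relaxes control, and the safety threshold are all made-up parameters, not estimates.

```python
# Toy model of the race dynamic sketched above: each successive team,
# seeing the previous team's success, applies somewhat less capability
# control, until the control margin falls below a safety threshold.
# All parameter values are illustrative assumptions.

def teams_until_unsafe(initial_margin: float = 1.0,
                       reduction_factor: float = 0.7,
                       safety_threshold: float = 0.3) -> int:
    """Return the index of the first team whose control margin falls below the threshold."""
    margin = initial_margin
    team = 1
    while margin >= safety_threshold:
        # This team stays above the threshold; the next one relaxes control further.
        margin *= reduction_factor
        team += 1
    return team

if __name__ == "__main__":
    print(f"With these toy parameters, team {teams_until_unsafe()} "
          "is the first to cross the line to an unsafe AI.")
```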
This isn’t a problem if you assume all AI researchers agree on all substantial aspects of the control problem, or are forced to coordinate. I’m not convinced this is a likely outcome.
Does this model make sense?