You’ll run into ELK-like difficulties when trying to train B: maybe it looks to the human that the off switch is still accessible but in reality you would not be able to activate it for some reason—A steers into a Faraday cage or something more clever.
(In response to your reply to acylhalide) Assuming you solve the training difficulties, B has to be at least as good at modelling the world as A: it needs to be able to predict the consequences of action sequences from A to see if the off switch is accessible.
(In response to your reply to Evan) I don’t necessarily expect A+B+R to be as capable as A+R: adding B introduces a huge shift to the environment that A has to deal with. If A always disabled the off switch in regular training, and now is unable to execute action sequences that involve that, then it may not be able to act effectively.
A couple of things that come to mind:
You’ll run into ELK-like difficulties when trying to train B: maybe it looks to the human that the off switch is still accessible but in reality you would not be able to activate it for some reason—A steers into a Faraday cage or something more clever.
(In response to your reply to acylhalide) Assuming you solve the training difficulties, B has to be at least as good at modelling the world as A: it needs to be able to predict the consequences of action sequences from A to see if the off switch is accessible.
(In response to your reply to Evan) I don’t necessarily expect A+B+R to be as capable as A+R: adding B introduces a huge shift to the environment that A has to deal with. If A always disabled the off switch in regular training, and now is unable to execute action sequences that involve that, then it may not be able to act effectively.