The goal would be to start any experiment that might plausibly lead to AGI with a metaphorical gun to the computer’s head, such that being less than (observably) perfectly honest with us, “pulling any funny business,” and so on would lead to its destruction. If you can make cooperating the path of least resistance for getting safely out of the box, rather than defecting and risking getting caught, you should be able to productively manipulate AGIs in many (though not all) possible worlds. Obviously this should be done on top of other alignment methods, but I doubt it would hurt much, and it would likely help as a significant buffer.
On reflection, I see your point, and will cross that section out for now, with the caveat that there may be variants of this idea which have significant safety value.