I agree that boxing is at least a first step, so that the AI doesn’t get more compute or, worse, FOOM.
The tricky problem is that we need to be able to train deception out of an AI, or forbid it entirely, without merely obfuscating the deception so that it only looks trained away.
This is why we need to move beyond the black-box paradigm, and why strong interpretability tools are necessary.