I like the first idea. But can we really guarantee that, after changing its source code to give itself maximum utility, it will stop all other actions? If it has access to its own source code, what ensures that its utility is “maximum” when it can change the limit arbitrarily? And if all possible actions have the same expected utility, an optimizer could output any of them; “no action” would be the trivial solution, but it's not the only one.
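To make that last point concrete, here's a minimal sketch (hypothetical action names, not anyone's actual proposal) of an optimizer whose actions all tie on expected utility: the whole action space is “optimal,” and “no action” is just one arbitrary tie-break among many.

```python
import random

# Hypothetical action space, purely for illustration.
actions = ["no_op", "send_network_packet", "write_to_disk", "self_modify"]

def expected_utility(action: str) -> float:
    # After the AI sets its own utility to the maximum, every action
    # scores the same by construction.
    return 1.0

def pick_action(actions, utility):
    best = max(utility(a) for a in actions)
    # Every action is tied, so the "optimal" set is the whole action space.
    optimal = [a for a in actions if utility(a) == best]
    # Any tie-breaking rule is consistent with "maximize utility";
    # returning "no_op" is only one arbitrary choice among many.
    return random.choice(optimal)

print(pick_action(actions, expected_utility))  # could be any of the four
```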
An AI that has achieved all of its goals might still be dangerous: it would presumably lose its high-level executive function (its optimization behavior), but it would also have no incentive to turn off any sub-programs that are still running.
Both proposals share a possible failure mode: the AI discovers or guesses that this mechanism exists, and then it cares only about making sure the mechanism gets activated, which might mean doing things bad enough that humans are forced to open the box and shut it down.
Are you talking about a local game in NY or a correspondence thing?