Hardcode the AGI to need our approval indefinitely?

I wouldn’t be surprised if others have thought about this before and decided it doesn’t work, but I haven’t seen it discussed. Is there a version of this proposal that’s promising? If so, what would it look like? If not, why not?

What I’m imagining is a panel of people, and every n days the AGI receives a punishment signal (-1) by default, by design, unless that panel votes (with e.g. >90% consensus) to replace that signal with 0 and instead give the AGI a reward signal (+1, default 0). The panel should commit to doing this if and only if the AGI hasn’t done anything horrible. I think this could be implemented with (something like) cooperative inverse reinforcement learning.

The AGI should be told that the punishments are being withheld manually and the rewards given manually, so that it knows that killing everyone or taking control is likely to cost it a lot, unless it can hack the signals first. Furthermore, if it finds a way to disable the punishment signal without also being able to hack the reward signal, it will just never get either signal with a nonzero value, indefinitely.
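To make the intended mechanics concrete, here is a minimal sketch of the per-period signal logic as I’m imagining it. Everything here (the `PanelVote` type, the `period_signal` function, the exact 0.9 threshold) is just illustrative, not a claim about how the real thing would be built:

```python
from dataclasses import dataclass
from typing import Optional, Tuple

APPROVAL_THRESHOLD = 0.9  # the ">90% consensus" from above; exact value is illustrative


@dataclass
class PanelVote:
    approvals: int  # panelists voting that the AGI has done nothing horrible
    total: int      # panelists who cast a vote this period


def period_signal(vote: Optional[PanelVote],
                  punishment_channel_intact: bool) -> Tuple[float, float]:
    """Return (punishment, reward) for one n-day period.

    The designed-in default is (-1, 0): the punishment fires unless the
    panel actively overrides it. If the AGI has disabled the punishment
    channel (without hacking the reward channel, which item 2 below is
    meant to prevent), it simply gets (0, 0) from then on.
    """
    if not punishment_channel_intact:
        return 0.0, 0.0  # punishment disabled, reward never granted: nothing, indefinitely

    punishment, reward = -1.0, 0.0  # the defaults
    if vote is not None and vote.total > 0 and vote.approvals / vote.total > APPROVAL_THRESHOLD:
        punishment, reward = 0.0, 1.0  # panel manually replaces -1 with 0 and grants +1
    return punishment, reward
```

So with, say, a unanimous ten-person panel and intact channels, `period_signal(PanelVote(10, 10), True)` gives `(0.0, 1.0)`, and with no vote it gives `(-1.0, 0.0)`.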

Here are some important details that I expect need to be fleshed out:

  1. We make sure that these signals have greater weight than everything else the AGI might care about, but not so much that it leads to paralysis or fanaticism. I had one idea here.

  2. We make sure the signal is secure: too hard for the AGI to hack or otherwise manipulate without alerting us, after which the AGI would be punished and/or stop receiving the panel’s rewards (see the authentication sketch after this list).

  3. Successors for the panel must be decided on, so that this arrangement can be maintained indefinitely. We’d like to do this in a way that makes it unlikely that the AGI can get on the panel itself or put someone there to represent itself. Maybe the panelists decide successors.
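For item 2, here is a toy illustration of one piece of the problem: authenticating the panel’s decision so that a spoofed message can’t pass for a real one. This is only a sketch with a made-up shared key, and it says nothing about the harder parts, like the AGI modifying its own reward circuitry or the hardware the key lives on. All names here are hypothetical:

```python
import hashlib
import hmac
import json
import time

# Shared secret held by the panel's signing hardware; the verifier on the
# reward channel holds the same key. Placeholder value for illustration only.
PANEL_KEY = b"replace-with-a-real-secret"


def sign_panel_decision(period: int, punishment: float, reward: float,
                        key: bytes = PANEL_KEY) -> dict:
    """Panel side: package one period's decision with an authentication tag."""
    payload = {"period": period, "punishment": punishment,
               "reward": reward, "timestamp": int(time.time())}
    message = json.dumps(payload, sort_keys=True).encode()
    payload["tag"] = hmac.new(key, message, hashlib.sha256).hexdigest()
    return payload


def verify_panel_decision(payload: dict, key: bytes = PANEL_KEY) -> bool:
    """Reward-channel side: accept the decision only if the tag checks out.

    Anything that fails verification would be treated as tampering and
    trigger the "punish and/or stop rewarding" response from item 2.
    """
    received_tag = payload.get("tag")
    if received_tag is None:
        return False
    body = {k: v for k, v in payload.items() if k != "tag"}
    message = json.dumps(body, sort_keys=True).encode()
    expected = hmac.new(key, message, hashlib.sha256).hexdigest()
    return hmac.compare_digest(received_tag, expected)
```

A single symmetric key is just the simplest thing to show; in practice something like a threshold signature, where no message verifies unless enough panelists sign it, would match the >90% consensus rule more directly than any one party holding the key.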