Then perhaps we should research ways to measure and restrict intelligence/optimization power.
Just off the top of my head, one way would be to add another term to its utility function representing the amount of computing power (or time) used. It would then have an incentive to use as little computing power as possible to meet its goal.
An example: you ask the AI to solve a problem for you. The utility function is to maximize the probability that its answer will be accepted by you as a solution. But once the probability goes above 90%, the utility stops increasing, and a penalty is added for using more computing power.
So the AI tries to solve the problem, but uses the minimum amount of optimization necessary and doesn't over-optimize.
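For concreteness, here's a minimal sketch of what such a capped-utility-plus-compute-penalty objective might look like. The function name, the 90% cap, and the penalty weight are just illustrative choices to make the idea explicit, not anything settled.

```python
# Hypothetical sketch of a capped utility with a compute penalty.
# The cap and penalty_weight values are illustrative placeholders.

def utility(p_accepted: float, compute_used: float,
            cap: float = 0.90, penalty_weight: float = 0.01) -> float:
    """Reward the probability that the answer is accepted, but only up to `cap`.
    Beyond the cap, further optimization earns nothing, and every unit of
    compute spent subtracts from the score."""
    capped_reward = min(p_accepted, cap)
    return capped_reward - penalty_weight * compute_used
```

Under this objective the agent gets no credit for pushing the acceptance probability past 90%, so any additional compute spent after reaching the cap is pure loss; the intended behaviour is that it stops optimizing as soon as the target is met.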
Those approaches run into the "subagent problem": the AI can get around the restriction by creating a subagent to solve the problem for it, with the subagent not being subject to those restrictions.
I'm assuming the AI exists in a contained box, and that we can accurately measure the time it runs and/or the resources it uses within that box. So it can't create a subagent without that subagent also using up its resources and counting towards the penalty.
If the AI can escape from the box, we've already failed. There is little point in trying to control what it can do with its output channel.
Reduced impact can control an AI that has the ability to get out of its box. That’s what I like about it.