I’m wondering how an agent using the attainable utility implementation in equations 3, 4, and 5 could actually be superhuman.
In the “superhuman” analysis post, I was considering whether that reward function would incentivize good policies if you assumed a superintelligently strong optimizer optimized that reward function.
For example, suppose the AI is rewarded for making paperclips, but all it can do in the next timestep is start moving its arm towards wire. Since it’s only rewarded for making paperclips, and it can’t make a paperclip in the next timestep, the AI would instead focus on minimizing impact and not do anything.
Not necessarily; an optimal policy maximizes the sum of discounted reward over time, and so it’s possible for the agent to take actions which aren’t locally rewarding but which lead to long-term reward. For example, in a two-step game where I can get rewarded on both time steps, I’d pick actions a1,a2 which maximize R(s1,a1)+γR(s2,a2). In this case, R(s1,a1) could be 0, but the pair of actions could still be optimal.
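As a concrete (made-up) illustration of that two-step point, here’s a small Python sketch. The state names, actions, and reward numbers are all invented for the paperclip story, not taken from the original proposal; the point is just that an action with zero immediate reward can still start the optimal plan once you sum discounted reward over both steps.

```python
# Minimal two-step illustration: the agent picks (a1, a2) to maximize
# R(s1, a1) + gamma * R(s2, a2).
# "reach" earns nothing now but enables a big reward next step;
# "idle" earns a small reward now but leads nowhere.

gamma = 0.9

# Hypothetical rewards and transitions for the paperclip example.
immediate_reward = {"reach": 0.0, "idle": 0.1}
next_state = {"reach": "holding_wire", "idle": "empty_handed"}
best_second_step_reward = {"holding_wire": 1.0, "empty_handed": 0.1}

def two_step_return(a1):
    """Discounted return of the best two-step plan starting with a1."""
    return immediate_reward[a1] + gamma * best_second_step_reward[next_state[a1]]

best = max(["reach", "idle"], key=two_step_return)
print(best, two_step_return("reach"), two_step_return("idle"))
# -> "reach" wins (0.0 + 0.9 * 1.0 = 0.9) despite zero immediate reward.
```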
I know you could adjust the reward function to reward the AI for doing things that you think will help it accomplish your primary goal in the future. For example, you know the AI moving its arm towards the wire is useful, so you could reward that. But then I don’t see how the AI could do anything clever or superhuman to make paperclips.
This idea is called “reward shaping” and there’s a good amount of literature on it!
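One standard technique from that literature is potential-based shaping (Ng, Harada & Russell, 1999), which adds γΦ(s′) − Φ(s) to the base reward and provably leaves the optimal policy unchanged. The sketch below is a minimal, hypothetical example: the potential function (how close the arm is to the wire) and the numbers are assumptions chosen to fit the paperclip story, not anything from the original discussion.

```python
# Potential-based reward shaping: add F(s, s') = gamma * phi(s') - phi(s)
# to the base reward. This rewards progress (the arm nearing the wire)
# while preserving which policies are optimal.

gamma = 0.9

def phi(state):
    """Hypothetical potential: higher when the arm is closer to the wire."""
    return -state["arm_distance_to_wire"]

def shaped_reward(base_reward, state, next_state):
    """Base reward plus the potential-based shaping term."""
    return base_reward + gamma * phi(next_state) - phi(state)

# Moving the arm closer yields a positive shaped reward even though
# no paperclip is produced on this step.
s  = {"arm_distance_to_wire": 1.0}
s2 = {"arm_distance_to_wire": 0.4}
print(shaped_reward(0.0, s, s2))  # 0.64 > 0: progress is rewarded
```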
Is there much the reduced-impact agent with reward shaping could do that an agent using human mimicry couldn’t?
Perhaps it could improve over mimicry by being able to consider all actions, while a human mimic would, in effect, only consider the actions a human would take. But I don’t think there are usually many single-step actions to choose from, so I’m guessing this isn’t a big benefit. Could the performance improvement come from understanding the current state better than a mimic could? I’m not sure when this would make a big difference, though.
I’m also still concerned the reduced-impact agent would find some clever way to cause devastation while avoiding the impact penalty, but I’m less concerned about human mimics causing devastation. Are there other major risks to using mimicry that the reduced-impact agent avoids?