I’d think the goal for 1, 2, and 3 is to find and fix the failure modes? And for 4, to find a definition of “optimizer” that fits evolution and humans but not the paperclip maximizer? I’m less sure about 5 and 6, but there’s something similar to the others going on: finding the flaw in the reasoning.
Here’s my take on the prompts:
The first AI has no incentive to change itself to be more like the second: if it wants to make the wormhole, it can just decide to start working on the wormhole. Even more egregious, the first AI should definitely not change its utility function to be more like the second’s! That would essentially be suicide; the first AI would cease to be itself. At the end of the story, it also doesn’t make sense for the agents to be at war if they have the same utility function (unless that utility function values war); they could simply merge into one agent.
This is why there is a time discount factor in RL: so agents don’t do things like this. I don’t know the name of the exact flaw; it’s something like a fabricated option. The agent tries to follow the policy “take the action such that my long-term reward is eventually maximized, assuming my future actions are optimal”, but no optimal policy for the future timesteps exists. Suppose agent A spends the first n timesteps scaling and agent B spends the first m > n timesteps scaling. Whatever future policy A chooses, B can simply mirror A’s subsequent moves; since B scaled for longer, B’s paperclip count eventually overtakes A’s. Therefore no policy that switches to “create paperclips” at any finite timestep can be optimal. Moreover, the strategy of always scaling up clearly creates zero paperclips, so it is not optimal either. Hence no policy is optimal in the limit. The AI’s policy should instead be “take the action such that my long-term reward is eventually maximized, assuming my future actions are as I would expect.”
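As a minimal sketch of the discounting point, here is a toy model of my own (additive capacity growth and γ = 0.9 are assumptions for illustration, not part of the prompt). With a discount factor, “how long should I scale before producing?” has a definite finite answer; comparing raw undiscounted totals, one more scaling step always wins once the horizon is long enough.

```python
# Toy comparison of discounted vs. undiscounted evaluation of the policy
# "scale for the first n steps, then produce forever".
# Assumptions (mine, not from the prompt): each scaling step adds 1 to
# per-step capacity, and the discount factor is 0.9.

GAMMA = 0.9

def discounted_return(n, gamma=GAMMA, steps=2000):
    """Discounted paperclip total: n scaling steps, then produce at capacity 1 + n."""
    capacity = 1 + n
    return sum(gamma**t * capacity for t in range(n, steps))

def undiscounted_total(n, horizon):
    """Raw paperclip count by `horizon` for the same policy."""
    return (1 + n) * max(horizon - n, 0)

# Discounting pins down a finite optimum (here: scale for about 8 steps, then produce).
best_n = max(range(50), key=discounted_return)
print("discounted optimum:", best_n)

# Undiscounted, "eventually more paperclips" always favors one more scaling step.
for n in range(5):
    assert undiscounted_total(n + 1, 10_000) > undiscounted_total(n, 10_000)
```

The additive-capacity assumption is just to keep the arithmetic clean; the point is that discounting turns “keep scaling forever” from a dominant-looking strategy into a clearly suboptimal one.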
Pascal’s wager. It seems equally likely that there would be a “paperclip-maximizer rewarder” which would grant untold numbers of paperclips to anything that created a particular number of paperclips. The two possibilities therefore cancel each other out, and the AI should have no fear of creating paperclips.
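As a rough sketch of the claimed cancellation, writing p for the shared probability and X for the paperclip stakes (both hypothetical symbols, assuming the feared entity would destroy exactly as many paperclips as the rewarder would grant):

```latex
\mathbb{E}[\Delta\,\text{paperclips}] \;=\; p\cdot(+X) + p\cdot(-X) \;=\; 0
```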
Unsure. I’m bad at finding clever definitions that avoid counterexamples like this.
Something something: you can only be as confident in your conclusions as you are in your axioms. Not sure how to avoid this failure mode, though.
You can never be confident that you aren’t being deceived, since successful deception feels the same as successful not-deception.