Sounds correct to me. As long as the AI has no model of the outside world and no model of itself (and perhaps a few extra assumptions), it should keep playing within the given constraints. It may produce results that are incomprehensible to us, but it would not do so on purpose.
It’s when the “tool AI” has a model of the world (including itself, the humans, how the rewards are generated, how it could get better results by obtaining more resources, and how humans could interfere with its goals) that agent-ness emerges as a side effect of trying to solve the problem.
“Find the best Go move in this tree” is safe. “Find the best Go move, given the fact that the guy in the next room hates computers and will try to turn you off, which would count as a failure at finding the best move” is dangerous. “Find the best Go move, given the fact that more computing power would likely let you make better moves, but humans will try to prevent you from acquiring too many resources” is an x-risk.
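To make the contrast concrete, here is a minimal Python sketch of the safe case: a search that only ever reasons about the game tree it is handed. The `Node`, `tree_value`, and `best_move` names are a hypothetical toy illustration, not anything from a real Go engine; the trailing comments mark where the dangerous variants would widen the objective.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Node:
    value: float                                   # evaluation of this position
    children: List["Node"] = field(default_factory=list)

def tree_value(node: Node, maximizing: bool = True) -> float:
    """Plain minimax over the given tree: 'find the best Go move in this tree'.

    Nothing outside `node` and its children can influence or be influenced by
    this computation, so there is no channel through which behaviour directed
    at the outside world could emerge.
    """
    if not node.children:
        return node.value
    child_values = [tree_value(c, not maximizing) for c in node.children]
    return max(child_values) if maximizing else min(child_values)

def best_move(root: Node) -> int:
    # Index of the child whose subtree is best for the maximizing player.
    return max(range(len(root.children)),
               key=lambda i: tree_value(root.children[i], maximizing=False))

# Tiny usage example: a two-ply toy tree.
root = Node(0.0, [Node(0.3), Node(0.7, [Node(0.2), Node(0.9)])])
print(best_move(root))  # chosen purely by looking inside the tree

# The dangerous variants above correspond to widening the state and objective
# to include the outside world, e.g. something like:
#
#   state = (board, human_in_next_room, available_compute, ...)
#   objective = probability_of_eventually_reporting_the_best_move(state)
#
# Once "being switched off" or "having too little compute" lowers the
# objective, plans that act on the human or acquire resources score higher,
# and the agent-ness described above falls out of ordinary optimization.
```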