I can totally believe that agents can exist that competently and cooperatively seek to fulfill a goal, rather than seeking to trick the evaluators of that goal into thinking it has been fulfilled.
However, whether you get such agents out of an algorithm depends on the details of that algorithm. Current reinforcement learning algorithms mostly don’t create agents that competently do anything. If they were more powerful while still doing essentially the same thing they currently do, most of them would end up being tricked by the agents they create, rather than producing aligned agents.