I’ve read it as a part of Agents Foundation course, and I consider this post really effective and clarifying. It got me thinking, can this generalize to other failure modes? Like if programers notice that AI spend too much resources on self-preservation, and then train against such behavior, this failure mode would still arise because self-preservation is an instrumental goal and is a fact about the world and ways in which goal can be achieved in this world.
I’ve read it as a part of Agents Foundation course, and I consider this post really effective and clarifying. It got me thinking, can this generalize to other failure modes? Like if programers notice that AI spend too much resources on self-preservation, and then train against such behavior, this failure mode would still arise because self-preservation is an instrumental goal and is a fact about the world and ways in which goal can be achieved in this world.