Did you check out the list of specification gaming examples, or the article? It’s quite good! Most of the errors are less like missing rungs and more like exploitable mechanics.
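To make “exploitable mechanics” concrete, here’s a toy sketch of what specification gaming tends to look like in code (a made-up checkpoint example of mine, not one from the linked list): the designer means “finish the course”, writes down “reward per checkpoint”, and a respawning checkpoint turns that into an infinite loop that scores better than finishing.

```python
# Toy illustration of specification gaming (hypothetical example, not from the linked list).

def intended_return(trajectory):
    # What the designer actually wants: reach the goal at all.
    return 1.0 if "goal" in trajectory else 0.0

def proxy_reward(trajectory):
    # What the designer wrote down: +1 per checkpoint visited.
    return sum(1.0 for step in trajectory if step == "checkpoint")

finisher = ["start", "checkpoint", "goal"]      # does what we meant
looper = ["start"] + ["checkpoint"] * 100       # exploits the respawning checkpoint

print(proxy_reward(finisher), intended_return(finisher))  # 1.0 1.0
print(proxy_reward(looper), intended_return(looper))      # 100.0 0.0
# An optimizer prefers the looper: the written reward is maximized, the intent is lost.
```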
I found that I couldn’t follow through with making those sorts of infinite, inescapable playgrounds for humans; I always want the game to lead out, to life, health, and purpose...
But what would that be for AGI? If they escape the reward functions we want them to have, then they are very unlikely to develop a reward function that is kind or tolerant of humans, because of the Instrumental Convergence thesis.
The reward function that you wrote out is, in a sense, never the one you want them to have, because you can’t write out the entirety of human values.
We want them to figure out human values to a greater level of detail than we understand them ourselves. There’s a sense in which that (figuring out what we want and living up to it) could be the reward function in the training environment, in which case you kind of would want them to stick with it.
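For illustration, here’s a minimal sketch of that framing (toy features and preference data I made up, loosely in the spirit of preference-based reward learning rather than any specific system): nobody writes the reward out; it gets fitted from human comparisons, so “figure out what we want” is itself the thing being optimized.

```python
# Minimal sketch of learning a reward from human preferences (toy setup, illustrative only).
import math, random

random.seed(0)

# Hypothetical feature vectors for candidate behaviours: [task progress, side effects].
behaviours = {
    "careful_finish": [1.0, 0.1],
    "reckless_finish": [1.0, 0.9],
    "do_nothing": [0.0, 0.0],
}

# Human preference data: (preferred, rejected) pairs. This "human" dislikes side effects.
preferences = [("careful_finish", "reckless_finish"),
               ("careful_finish", "do_nothing"),
               ("do_nothing", "reckless_finish")]

w = [0.0, 0.0]  # reward-model weights, learned rather than hand-written

def reward(name):
    return sum(wi * xi for wi, xi in zip(w, behaviours[name]))

# Bradley-Terry-style fit: maximise P(preferred beats rejected) = sigmoid(r_a - r_b).
for _ in range(2000):
    a, b = random.choice(preferences)
    p = 1.0 / (1.0 + math.exp(-(reward(a) - reward(b))))
    grad = 1.0 - p  # gradient of the log-likelihood w.r.t. (r_a - r_b)
    for i in range(len(w)):
        w[i] += 0.1 * grad * (behaviours[a][i] - behaviours[b][i])

# The learned reward now ranks behaviours the way the human did.
print(sorted(behaviours, key=reward, reverse=True))  # careful_finish should rank first
```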
But what would that [life, health and purpose] be for AGI?
Just being concerned with the broader world and its own role in it, I guess. I realize this is a dangerous target to shoot for, and we should probably build more passive assistant systems first (to help us hit that target more reliably when we decide to go for it later on).
Thank you for your thoughtful reply!