In the images case, I meant that if you had a richer dataset with more images in more conditions, accompanied by touch-based information, perhaps even audio, and the agent were allowed to interact with the world and observe through these input mechanisms what the world did in response, then it would learn concepts that allow it to understand the world the way we do—it wouldn’t be fooled by occlusions, or by a picture of a baseball placed on top of an ocean picture, etc. (This also requires a sufficiently large dataset; I don’t know how large.)
I’m not saying that such a dataset would lead it to learn what we value. I don’t know what that dataset would look like, partly because it’s not clear to me what exactly we value.