This doesn’t seem dangerous to me. So the agent values both, and there was an event which differentially strengthened the looks-like-diamond shard (assuming the agent could tell the difference at a visual remove during training). But there are lots of other reward events, many of which won’t really involve that shard, like video games where the agent collects diamonds, or text RPGs where the agent quests for lots of diamonds. (To be clear, I’m not adding these now; I was already imagining this kind of curriculum. See the “game” shard.)
So maybe there’s a shard with predicates like “would be sensory-perceived by naive people to be a diamond” that gets hit by all of these, but I expect that shard to be relatively gradient-starved and relatively complex in the requisite way, and so not to get a very substantial update. I’m not sure why that’s a big problem.
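To gesture at the gradient-starvation point concretely, here’s a toy sketch (my own illustration, not anything from the training story above; the two-shard setup, the 5% event rate, and the reward-weighted update rule are all assumptions): a shard whose predicate only fires in a small fraction of reward events accumulates correspondingly little reinforcement.

```python
import numpy as np

# Toy illustration of "gradient starvation" (assumed setup, not the actual
# training process under discussion). Two "shards" are modeled as weights on
# binary features: w[0] is a generic diamond shard, w[1] is the narrower
# "looks like a diamond to a naive observer" shard.
rng = np.random.default_rng(0)
w = np.zeros(2)
lr = 0.1

for _ in range(1000):
    # Assume only ~5% of reward events involve the visual-appearance
    # predicate; the generic diamond feature fires in all of them.
    x = np.array([1.0, 1.0 if rng.random() < 0.05 else 0.0])
    reward = 1.0
    # Simple reward-weighted update: only shards active in the event
    # get reinforced.
    w += lr * reward * x

print(w)  # w[1] ends up roughly 5% of w[0]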
But I’ll think more and see if I can’t salvage your argument in some form.