And indeed, if you’re trying to train for full alignment, you should almost certainly train for having a pointer, rather than training to give correct answers on e.g. trolley problems.
Yep, agreed. Although I worry that—if we try to train agents to have a pointer—these agents might end up having a goal more like:
maximize the arrangement of the universe according to this particular balance of beauty, non-suffering, joy, non-boredom, autonomy, sacredness, [217 other shards of human values, possibly including parochial desires unique to this principal].
I think it depends on how path-dependent the training process is. The pointer seems simpler, so in the low path-dependence world the agent settles on the pointer. But agents form representations of things like beauty, non-suffering, etc. before they form representations of human desires, so in the high path-dependence world their goals may crystallize around those earlier-formed concepts rather than around the pointer.
Thanks, this comment was clarifying.