Shard theory people sometimes say that the problem of aligning a system to a single task/goal, like “put two strawberries on a plate” or “maximize the amount of diamond in the universe”, is meaningless, because an actual system will inevitably end up with multiple goals. I disagree: even if SGD on real-world data usually produces a system with multiple goals, then, given good enough interpretability and assuming shard theory is true, you could identify and delete the irrelevant value shards and reinforce the relevant ones, so instead of getting 1% of the value you get 90%+.
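To make the "delete irrelevant, reinforce relevant" move concrete, here is a minimal toy sketch. It assumes, purely for illustration, that value shards correspond to identifiable linear directions in activation space; the directions, the `edit_shards` helper, and the 4-dimensional example are all hypothetical, and finding real shard representations is exactly the interpretability work the argument presupposes.

```python
import numpy as np

def edit_shards(activations: np.ndarray,
                irrelevant_dirs: list[np.ndarray],
                relevant_dirs: list[np.ndarray],
                boost: float = 2.0) -> np.ndarray:
    """Project out 'irrelevant' shard directions and amplify 'relevant' ones.

    This is a toy intervention on a single activation vector, not a claim
    about how shards are actually represented in real networks.
    """
    h = activations.astype(float).copy()
    for d in irrelevant_dirs:
        d = d / np.linalg.norm(d)
        h -= np.dot(h, d) * d                    # delete: remove the component along d
    for d in relevant_dirs:
        d = d / np.linalg.norm(d)
        h += (boost - 1.0) * np.dot(h, d) * d    # reinforce: scale up the component along d
    return h

# Hypothetical example: one direction we want to keep (say, the
# "diamond-relevant" shard) and one unrelated shard direction to remove.
h = np.array([1.0, 2.0, 0.5, -1.0])
keep = np.array([1.0, 0.0, 0.0, 0.0])
drop = np.array([0.0, 1.0, 0.0, 0.0])
print(edit_shards(h, irrelevant_dirs=[drop], relevant_dirs=[keep]))
# -> [ 2.   0.   0.5 -1. ]
```

The point of the sketch is only that, if shards were legible objects like this, "single-goal alignment" would reduce to an editing problem rather than being meaningless.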