I’m thinking that maybe alignment should also be thought of as primarily about the platform. That is, what might be more feasible is not aligning whole agents with their individual values, but aligning maps-of-concepts, things at a lower level than agents (meanings of language and words, and salient concepts without words). This way, agents come out aligned mostly by default, in a similar sense to how humans are, because they are assembled from persistent, shared intents/concepts, even as it remains possible to instantiate alien or even evil agents intentionally.
The GPTs fall short so long as they can’t do something like IDA/MCTS: making more central/normative episodes for their concepts accessible while maintaining or improving the alignment of those concepts (intended meaning, relation to the real world, relation to other concepts). That is, like going from understanding the idea of winning at chess to playing at world-champion level (accessing rollouts where winning is actually observed), but for all language use, through generation of appropriate training data rather than relying only on external training data.
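A minimal sketch of the kind of loop this gestures at, in the IDA-style amplify-then-distill shape (with best-of-n rollouts standing in for MCTS). Everything here is hypothetical and not from the original: the `Simulator` interface, `generate_episode`, `score_centrality`, `consistent_with`, and `train` are assumed operations, used only to illustrate generating training data from the model's own more-careful rollouts while keeping external observations as the anchor.

```python
from typing import Protocol, Sequence

class Simulator(Protocol):
    # Hypothetical interface; real systems won't expose exactly these methods.
    def generate_episode(self, prompt: str) -> str: ...
    def score_centrality(self, episode: str) -> float: ...
    def consistent_with(self, episode: str, external_data: Sequence[str]) -> bool: ...
    def train(self, data: Sequence[str]) -> None: ...

def amplify(model: Simulator, prompt: str, rollouts: int = 16) -> str:
    """Spend extra compute (many rollouts / search) to get a more
    central/normative episode than a single forward pass would give."""
    candidates = [model.generate_episode(prompt) for _ in range(rollouts)]
    return max(candidates, key=model.score_centrality)  # best-of-n as a stand-in for MCTS

def self_training_round(model: Simulator,
                        prompts: Sequence[str],
                        external_data: Sequence[str]) -> Simulator:
    generated = []
    for prompt in prompts:
        episode = amplify(model, prompt)
        # Keep the map aligned with the territory: accept only episodes
        # consistent with real observations (the alignment target).
        if model.consistent_with(episode, external_data):
            generated.append(episode)
    # Distill: train on external data plus the vetted generated episodes,
    # not on motivated-reasoning data that contradicts observations.
    model.train(list(external_data) + generated)
    return model
```

The point of the grounding check is the same as in the surrounding text: the generated episodes may become more central/normative than anything in the external corpus, but they are only kept insofar as they stay consistent with it.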
All this without focusing on specific agents, their values, or even the kinds of worlds the episodes describe, maintaining alignment of the map with the territory of real observations (external training data as the alignment target) and avoiding any motivated-reasoning fine-tuning that might distort the signal of those observations. Only once this works is it time to instantiate specific products (aligned agents), without changing/distorting/fine-tuning/misaligning the underlying simulator.