Interesting. I think this clarifies things, but the framing still isn't quite as neat as I'd like.
I’d be tempted to redefine/reframe this as follows:
• Outer alignment for a simulator: perfectly defining what it means to simulate a character. For example, how could we create a specification language that lets us pick out the character we want (see the sketch after this list)? And how do we handle counterfactuals, given that they aren't actually literal?
• Inner alignment for a simulator: training a simulator to perfectly simulate the assigned character.
• Outer alignment for characters: finding a character who would create good outcomes if successfully simulated.
In this model, there wouldn't be a separate notion of inner alignment for characters, since that would be automatic if the simulator were both inner and outer aligned.
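To make the decomposition concrete, here's a minimal sketch in Python of where each notion would live. Everything here is hypothetical illustration, not a proposal for the actual specification language: the names (`CharacterSpec`, `Simulator`, `good_outcomes_if_simulated`) and fields are just placeholders for the three alignment questions.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass
class CharacterSpec:
    # Outer alignment for a simulator lives here: does this
    # specification actually pick out the character we intend,
    # including a rule for resolving non-literal counterfactuals?
    description: str            # natural-language pointer to the character
    counterfactual_policy: str  # how to handle situations the character never faces


class Simulator(Protocol):
    # Inner alignment for a simulator: does the trained system's
    # behavior match what the specified character would do?
    def simulate(self, spec: CharacterSpec, observation: str) -> str:
        ...


def good_outcomes_if_simulated(spec: CharacterSpec) -> bool:
    # Outer alignment for characters: a (deliberately unimplemented)
    # predicate asking whether faithfully simulating this character
    # would lead to good outcomes.
    ...
```

On this framing, "inner alignment for characters" has no slot of its own: if `simulate` is faithful to the spec and the spec picks out the right character, the character's behavior is already pinned down.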
Thoughts?