I’d say you’re addressing the question of goalcrafting or selecting alignment targets.
I think you’ve got the right answer for technical alignment goals, but the question remains of which human would control that AGI. See my “if we solve alignment, do we all die anyway” for the problems with that scenario.
Spoiler alert: we do all die anyway if really selfish people get control of AGIs. And selfish people tend to work harder at getting power.
But I do think your goal definition is a good alignment target for the technical work. I don’t think there’s a better one. I do prefer instruction following or corrigibility, by the definitions in the posts I linked above, because they’re less rigid, but they’re both very similar to your definition.
I pretty much agree. I prefer rigid definitions because they’re less ambiguous to test and more robust to deception. And this field has a lot of deception.
Yup, those are hard. Was just thinking of a definition for the alignment problem, since I’ve not really seen any good ones.