As you can probably imagine given my biography :), I am never against any research, and definitely not for reasons of practical utility. So I am definitely very supportive of research on alignment, and not claiming that it shouldn’t be done. In my view, one of the interesting technical questions is to what extent long-term goals can emerge from systems trained with short-term objectives, and (if this happens) whether we can prevent it while keeping short-term performance just as good. One reason I like the focus on the horizon rather than on alignment with human values is that the former might be easier to define and argue about. But this doesn’t mean that we should not care about the latter.
I definitely think it’s interesting to understand and control whether a model is pursuing a long-horizon goal (though talking about the “goal” of a model seems quite slippery).
I think that most work on alignment doesn’t need to get into the difficulties of defining or arguing about human values. I’m normally focused more on goals like: “does my AI make statements that it knows to be unambiguously false?” (see ELK).