I definitely think it’s interesting to understand and control whether a model is pursuing a long-horizon goal (though talking about the “goal” of a model seems quite slippery).
I think that most work on alignment doesn’t need to get into the difficulties of defining or arguing about human values. I’m normally focused more on goals like: “does my AI make statements that it knows to be unambiguously false?” (see ELK).