Oh, I definitely do. For example, the boat race incident turned out to be a minor warning shot about the dangers of getting the reward function wrong (though I don't really understand why it was so influential; it seems so clear that an incorrect reward function can lead to bad behavior).
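[To make the reward-misspecification point concrete, here is a minimal toy sketch. It is not the actual boat-race environment; the track layout, reward values, and policy names are all invented for illustration. The proxy reward pays for hitting a respawning bonus target and only modestly for finishing, so a policy that circles the target out-scores the policy that actually completes the race.]

```python
# Toy sketch of reward mis-specification (hypothetical; not the real boat-race env).
# The proxy reward pays +1 for hitting a respawning target and +3 for finishing,
# so a policy that circles the target forever out-scores one that finishes the race.

TRACK_LEN = 10      # cells 0..9; cell 9 is the finish line
TARGET_CELL = 4     # a respawning bonus target sits here
HORIZON = 50        # fixed episode length in steps

def proxy_return(policy):
    """Total (mis-specified) proxy reward collected over one episode."""
    pos, total = 0, 0.0
    for _ in range(HORIZON):
        pos = policy(pos)
        if pos == TARGET_CELL:
            total += 1.0        # bonus target respawns every step
        if pos == TRACK_LEN - 1:
            total += 3.0        # one-off bonus for crossing the finish line
            break
    return total

def finishes(policy):
    """True objective: does the boat actually reach the finish line?"""
    pos = 0
    for _ in range(HORIZON):
        pos = policy(pos)
        if pos == TRACK_LEN - 1:
            return True
    return False

# Intended behavior: drive straight toward the finish.
def race_to_finish(pos):
    return pos + 1

# Degenerate behavior: oscillate around the bonus target, never finishing.
def loop_on_target(pos):
    return TARGET_CELL if pos != TARGET_CELL else TARGET_CELL - 1

for name, policy in [("race_to_finish", race_to_finish),
                     ("loop_on_target", loop_on_target)]:
    print(f"{name:15s} proxy return = {proxy_return(policy):5.1f}   "
          f"finishes race = {finishes(policy)}")
```

[In this toy setup, anything that optimizes the proxy return will prefer loop_on_target even though it never finishes the race, which is the basic failure mode the boat race case made vivid.]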
Okay, sure—in that case, I think a lot of our disagreement on warning shots might just be a different understanding of the term. I don’t think I expect homogeneity to really change the probability of finding issues during training or in other laboratory settings, though I think there is a difference between e.g. having studied and understood reactor meltdowns in the lab and actually having Chernobyl as an example.
Why is there homogeneity in misaligned goals?
Some reasons you might expect homogeneity of misaligned goals:
If you make lots of copies of the exact same system, then trivially they'll all have homogeneous misaligned goals (unless those goals are highly indexical, but even then I expect the different AIs to be able to cooperate pretty effectively with each other on those indexical preferences).
If you’re using your AI systems at time step t to help you build your AI systems at time step t+1, then if that first set of systems is misaligned and deceptive, they can influence the development of the second set of systems to be misaligned in the same way.
If you do a lot of fine-tuning to produce your next set of AIs, then I expect fine-tuning to mostly preserve existing misaligned goals, like I mentioned previously.
Even if you aren’t doing fine-tuning, as long as you’re keeping the basic training process the same, I expect you’ll usually get pretty similar misaligned proxies—e.g. the ones that are simpler/faster/generally favored by your inductive biases.