Fourth, I have a bunch of dread about the million conversations I will have to have with people explaining these results. I think that predictably, people will update as if they saw actual deceptive alignment,
Have you seen this on Twitter, in AF comments, or in other discussion? I’d be interested if so. I’ve been watching the online discussion fairly closely, and I think I’ve only seen one case where someone might’ve had this interpretation, and it was quickly called out by someone screenshotting the relevant text from our paper. (I was actually worried about this, but I updated against it after basically never seeing it come up in the discussions I’ve followed.)
Almost all of the misunderstanding of the paper I’m seeing is actually in the opposite direction: “why are you even concerned if you explicitly trained the bad behavior into the model in the first place?” That suggests it’s pretty salient to people that we explicitly trained for this (e.g., from the paper title).