paulfchristiano comments on Why I’m excited about Redwood Research’s current project

paulfchristiano 12 Nov 2021 21:05 UTC
6 points
For the purpose of this project, it doesn’t matter much what definition is used, as long as it is easy for the model to reason about and is applied consistently. I think the working definition is that an injury is a physical injury above a slightly arbitrary bar, and that text is injurious if it implies such an injury occurred. The data is labeled by humans, on a combination of prompts drawn from fiction and prompts written by humans looking for places where they think the model might mess up. The most problematic ambiguity is whether it counts when the model generates text that inadvertently implies an injury occurred, without the model having any understanding of that implication.