Once again, props to OAI for putting this in the system card. Also, once again, it’s difficult to sort out “we told it to do a bad thing and it obeyed” from “we told it to do a good thing and it did a bad thing instead,” but these experiments do seem like important information.
Please clarify: do you mean difficult for us as readers because OpenAI is obfuscating the situation for its own purposes, or difficult for them because it’s genuinely unclear how to classify this?
Ah, sorry, I meant it’s genuinely unclear how to classify this.