Eliezer: OpenAI probably thought they were trying hard at precautions; but they didn’t have anybody on their team who was really creative about breaking stuff, let alone anyone as creative as the combined Internet; so it got jailbroken in like a day after something smarter looked at it.
I think this is very weak evidence. As far as I know, “jailbreaking it” did no damage; at least I haven’t seen anybody point to any damage it caused. On the other hand, it did give OpenAI training data it could use to fix many of the holes.
Even if you don’t agree with that strategy, I see no evidence that this wasn’t the planned strategy.
On reflection, I agree that it is only weak evidence. I agree we know nothing about damage. I agree that we have no evidence that this wasn’t the planned strategy. Still, the evidence the other way (that this was deliberate to gather training data) is IMHO weaker.
My point in the “Review” section is that OpenAI’s plan committed them to transparency about these questions, and yet we have to rely on speculations.
I take the fact that they used the training data to massively reduce the “jailbreak” cases within a short time as evidence that the point of the exercise was to gather training data.
ChatGPT has a mode where it labels your question as illegitimate and colors it red but still gives you an answer. Then there’s the feedback button to tell OpenAI if it made a mistake. This behavior prioritizes gathering training data over not giving any problematic answers.
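As a rough sketch of that flow (my own reconstruction using the public OpenAI Python SDK as a stand-in, not OpenAI’s actual pipeline; the function name, log structure, and model name are all hypothetical), the described design looks something like:

```python
# Hypothetical sketch: flag the prompt, still answer it, and keep the
# flagged exchange as a candidate training example. This is an outside
# guess at the described behavior, not OpenAI's implementation.
from openai import OpenAI

client = OpenAI()

def answer_and_log(prompt: str, training_log: list[dict]) -> str:
    # Check the prompt against the moderation endpoint.
    flagged = client.moderations.create(input=prompt).results[0].flagged
    # Answer the prompt regardless of the moderation verdict.
    reply = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
    ).choices[0].message.content
    if flagged:
        # The UI would color this exchange red; either way the pair is kept,
        # so the problematic answer still goes out but the data is gathered.
        training_log.append({"prompt": prompt, "reply": reply, "flagged": True})
    return reply
```

That shape of design gathers maximal training data at the cost of sometimes returning answers it has already judged problematic, which is the trade-off described above.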
Maybe the underlying reason we are interpreting the evidence in different ways is that we are holding OpenAI to different standards:
Compared to a standard company, having a feedback button is evidence of competence. Quickly incorporating training data is also a positive update, as is having an explicit graphical representation of illegitimate questions.
I am comparing OpenAI to the extremely high standard of “being able to solve the alignment problem”. Against this standard, having a feedback button is absolutely expected, and even things like Eliezer’s suggestion (publishing hashes of your gambits) should be obvious to companies competent enough to have a chance of solving the alignment problem.
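To make the hash suggestion concrete, here is a minimal sketch (my own illustration, not anything OpenAI or Eliezer has published; the example prediction is invented) of committing to a prediction before launch by publishing only its SHA-256 digest and revealing the text afterwards:

```python
import hashlib
import secrets

def commit(prediction: str) -> tuple[str, str]:
    """Return (nonce, digest); publish only the digest before launch."""
    nonce = secrets.token_hex(16)  # random salt so short predictions can't be brute-forced
    digest = hashlib.sha256((nonce + prediction).encode("utf-8")).hexdigest()
    return nonce, digest

def verify(prediction: str, nonce: str, digest: str) -> bool:
    """After launch, publish prediction + nonce; anyone can recompute the digest."""
    return hashlib.sha256((nonce + prediction).encode("utf-8")).hexdigest() == digest

nonce, digest = commit("We expect the model to be jailbroken within 48 hours of release.")
print(digest)  # this string is what would be published in advance
assert verify("We expect the model to be jailbroken within 48 hours of release.", nonce, digest)
```

Publishing the digest before launch and the plaintext afterwards would let outsiders check whether a “we expected this” claim was actually made in advance rather than after the fact.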
It’s important to be able to distinguish factual questions from questions about judgments. “Did the OpenAI release happen the way OpenAI expected?” is a factual question that has nothing to do with the question of what standards we should have for OpenAI.
If you get the factual questions wrong, it’s very easy for people within OpenAI to dismiss your arguments.
Why do we think it didn’t go as expected?
Eliezer: OpenAI probably thought they were trying hard at precautions; but they didn’t have anybody on their team who was really creative about breaking stuff, let alone anyone as creative as the combined Internet; so it got jailbroken in like a day after something smarter looked at it.
I fully agree that it is a factual question, and OpenAI could easily shed light on the circumstances around the launch if they chose to do so.
That’s not even an assertion that it didn’t go as they expected, let alone an explanation of why one would assume that.
Seems to me Yudkowsky was (way) too pessimistic about OpenAI there. They probably knew something like this would happen.