So, what’s the technical reason (the kind that ends with “therefore the probability of disaster is < 1e-9”) why training InstructGPT was safe?
Who is claiming that it is safe? I didn’t get that implication from the post.
“Safe” as in “safe enough that it’s on net better to run it” or “safe enough that it wouldn’t definitely kill everyone”. It’s not that I lack the popular intuition that GPT wouldn’t kill anyone. I just don’t think it’s a good habit to run progressively more capable systems while relying on informal intuitions about their safety. And I would still like to see an explanation of why future safety tools will outpace capability progress, when we are already at the point where current safety tools are not applicable to current AI systems.
I’m pretty unconvinced by this. I do not think that any substantial fraction of AI x-risk comes from an alignment researcher who thinks carefully about x-risk deciding that a GPT-3-level system isn’t scary enough to warrant significant precautions around boxing.
I think taking frivolous risks is bad, but risk aversion to the point of being unable to pursue otherwise promising research directions seems pretty costly, while the benefits of averting risks >1e-9 are pretty negligible in comparison.
(To be clear, this argument does not apply to more powerful systems! As systems get smarter we should be more careful, and try to be very conservative! But ultimately everything is a trade-off: letting GPT-3 talk to human contractors who give feedback is a way of letting it out of the box!)
I just want the trade-off to be made explicitly. If it turns out that −7 people in expectation is better than thinking about utility functions and all the other alternatives, fine. But that is an argument that depends on actual numbers. Yes, it’s possible to think informally and correctly. But maybe “an alignment researcher who thinks carefully about x-risk” wasn’t what was happening.
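For concreteness, here is a minimal sketch of where a number like that could come from, assuming (my assumption, not anything stated in the thread) that the −7 figure is just the 1e-9 disaster probability discussed above multiplied by the world population, treating “disaster” as everyone dying:

```python
# Rough expected-fatalities arithmetic behind a figure like "-7 people in expectation".
# Hypothetical assumptions: "disaster" means everyone dies, the disaster probability is
# the 1e-9 threshold mentioned upthread, and world population is roughly 8 billion.
p_disaster = 1e-9        # assumed probability that the training run goes catastrophically wrong
world_population = 8e9   # approximate number of people at stake (assumption)

expected_deaths = p_disaster * world_population
print(f"Expected deaths at p = {p_disaster:g}: about {expected_deaths:.0f}")
# prints: Expected deaths at p = 1e-09: about 8
```

Written out this way, the explicit trade-off is roughly “a handful of expected lives versus whatever research value the run produces”, which is the comparison being asked for; the exact figure shifts with the population estimate and the assumed probability.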
Before running InstructGPT, what was the technical reason to think it wouldn’t be powerful?
While I wouldn’t accept that level of risk-aversion, I do agree with a related question: Why do they think they can make significant progress on alignment, exactly?
I mean, I would be glad to hear any number.
Why not? They are running experiments and getting real hands-on experience with AI systems that keep getting better. Seems to me a plausible approach.
There is a school of thought that says you need to mathematically prove that your AGI will be aligned, before you even start building any kind of AI system at all. IMO this would be a great approach if our civilization had strong coordination abilities and unlimited time.