Well, the answer is either “yes, but irrelevant” or “we can’t know until it’s too late”. It’s certainly not “yes” with any confidence.
There are problems with each of your criteria, any of which could break confidence in the result.
Behaviorally Safe. We haven't observed enough failures to know how common or rare they actually are, even under the conditions of well-intentioned users. The way RLHF works, it's quite likely that there are unusual/tail prompts or contexts which the underlying LLM attempts to address but which are out-of-domain for the RL guardrails. How often that's “unsafe” is hard to measure, but it's very likely nonzero.
Non-adversarial conditions. This is pretty weakly specified. Does it cover well-intentioned prompts that happen to bypass the RLHF guardrails? Does it cover people who just get unlucky, hitting contexts that happen to result in harm (to someone)? More importantly, does it matter at all, given that the real world contains plenty of adversarial conditions?
Solved. Is there some notion of “total harm” here, something like frequency times magnitude? Why is it considered “solved” if one in ten-billion uses results in an extra speck of dust in someone's eye? What's the threshold?
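For concreteness, here's a minimal sketch of how “total harm” could be operationalized as frequency times magnitude. Every number in it (failure rate, severity, usage volume, threshold) is made up purely for illustration; nothing in the original criteria specifies any of them, which is exactly the problem.

```python
# Toy illustration only: "total harm" as frequency x magnitude at deployment scale.
# All values are hypothetical placeholders, not measurements.

harm_probability_per_use = 1e-10   # e.g. one in ten-billion uses goes wrong
harm_magnitude = 0.001             # severity of each incident, arbitrary units
uses_per_day = 1e9                 # a widely deployed model sees billions of queries

# Expected harm accumulated per day of deployment
expected_harm_per_day = harm_probability_per_use * harm_magnitude * uses_per_day
print(f"Expected harm per day: {expected_harm_per_day:.6f} (arbitrary units)")

# Whether this counts as "solved" depends entirely on a threshold nobody has stated
threshold = 0.01  # arbitrary
print("Below threshold?", expected_harm_per_day < threshold)
```

Even a one-in-ten-billion per-use failure rate produces a nonzero expected harm at realistic deployment volumes; whether that clears the bar depends entirely on a threshold the criterion never states.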