Any argument which features a “by definition” has probably gone astray at an earlier point.
In this case, your by-definition-aligned LLM can still cause harm, so what’s the use of your definition of alignment? As one example among many, the part where the LLM “output[s] text that consistently” does something (whether that something is “reflects human value judgements” or anything else) is not a property RLHF can actually guarantee with any degree of certainty, and that consistency is one of many conditions an LLM-based superintelligence would need to satisfy to be remotely safe to use.
What is your definition of “Aligned” for an LLM with no attached memory, then?
Wouldn’t it have to be
“The LLM outputs text which is compliant with the creator’s ethical standards and intentions”?
I think it would need to be closer to “interacting with the LLM cannot result in exceptionally bad outcomes in expectation”, rather than a focus on whether the output text is compliant.
I think a fairly common-here mental model of alignment requires context awareness, and by that definition an LLM with no attached memory couldn’t be aligned.