Isn’t this model behavior what’s described as jailbroken? It sounds similar to reports of jailbroken model behavior. The reports I’ve glanced at seem to describe models with lowered inhibitions that behave more like a base model.
In any case, good work investigating alternate explanations of the emergent misalignment paper.