You were also talking about “systems generated by any process broadly encompassed by the current ML training paradigm”—which is a larger class than just LLMs.
Yeah, and the safety properties of LLMs extend beyond LLMs themselves. E.g., I’m pretty sure CNNs scaled arbitrarily far are also safe, for the same reasons LLMs are. And there are likely ML models more sophisticated and capable than LLMs which are nevertheless also safe (and capability-upper-bounded) for the same reasons LLMs are safe.
If AGI reasoning is built out of LLMs, then aligning LLMs, in the sense of making them say the things we want them to say and not say the things we don’t, is not only absolutely crucial to aligning AGI; aligning AGI mostly reduces to it.
I don’t think that’d work out this way. Why would the overarching scaffolded system satisfy the safety guarantees of the LLMs it’s built out of? Say we make LLMs never talk about murder. But the scaffolded agent, inasmuch as it’s generally intelligent, should surely be able to consider situations that involve murder in order to make workable plans, including scenarios where it itself (deliberately or accidentally) causes death. If nothing else, in order to avoid that.
So it’d need to find some way to circumvent the “my components can’t talk about murder” constraint, and it’d probably just evolve some sort of jailbreak, or coin a completely new term to stand in for the forbidden word “murder”.
The general form of the Deep Deceptiveness argument applies here. It is ground truth that the GI would be more effective at what it does if it could reason about such things. And so, inasmuch as the system is generally intelligent, it would develop the functionality to somehow slip such non-robust constraints. Conversely, if it can’t slip them, it’s not generally intelligent.
Oh, certainly. I’m a large fan of interpretability tools, as well.