Imagine what happens when AutoGPT stops being a toy and people start pouring billions of dollars into propper scaffoldings and specialized LLMs
I predict that this can’t happen with the standard LLM setup; and that more complex LLM setups, for which this may work, would not meaningfully count as “just an LLM”. See e. g. the “concrete scenario” section.
By “LLMs should be totally safe” I mean literal LLMs as trained today, but scaled up. A thousand times the parameter count, a hundred times the number of layers, trained on correspondingly more multimodal data, etc. But no particularly clever scaffolding or tweaks.
I think we can be decently confident it won’t do anything. I’d been a bit worried about scaling up context windows, but we’ve got 100k-tokens-long ones, and that didn’t do anything. They still can’t even stay on-target, still hallucinate like crazy. Seems fine to update all the way to “this architecture is safe”. Especially given some of the theoretical arguments on that.
(Hey, check this out, @TurnTrout, I too can update in a more optimistic direction sometimes.)
(Indeed, this update was possible to make all the way back in the good old days of GPT-3, as evidenced by nostalgebraist here. In my defense, I wasn’t in the alignment field back then, and it took me a year to catch up and build a proper model of it.)
By “LLMs should be totally safe” I mean literal LLMs as trained today, but scaled up.
You were also talking about “systems generated by any process broadly encompassed by the current ML training paradigm”—which is a larger class than just LLMs.
If you claim that arbitrary scaled LLMs are safe from becoming scary agents on their own—it’s more believable. I’d give it around 90%. Still better safe than sorry. And there are other potential problems like creating an actually sentient models without noticing it—which would be an ethical catastrophe. So catiousness and beter interpretability tools are necessary.
I predict that this can’t happen with the standard LLM setup; and that more complex LLM setups, for which this may work, would not meaningfully count as “just an LLM”. See e. g. the “concrete scenario” section.
I’m talking about “just LLMs” but with clever scaffoldings written in explicit code. All the black box AI-stuff is still only in LLMs. This doesn’t contradict your claim that LLM’s without any additional scaffoldings won’t be able to do it. But it does contradict your titular claim that Current AIs Provide Nearly No Data Relevant to AGI Alignment. If AGI reasoning is made from LLMs, aligning LLMs, in a sense of making them say stuff we want them to say/not say stuff we do not want them to say, is not only absolutely crucial to aligning AGI, but mostly reduces to it.
You were also talking about “systems generated by any process broadly encompassed by the current ML training paradigm”—which is a larger class than just LLMs.
Yeah, and safety properties of LLMs extend to more than just LLMs. E. g., I’m pretty sure CNNs scaled arbitrarily far are also safe, for the same reasons LLMs are. And there are likely ML models more sophisticated and capable than LLMs, which nevertheless are also safe (and capability-upper-bounded) for the reasons LLMs are safe.
If AGI reasoning is made from LLMs, aligning LLMs, in a sense of making them say stuff we want them to say/not say stuff we do not want them to say, is not only absolutely crucial to aligning AGI, but mostly reduces to it.
I don’t think that’d work out this way. Why would the overarching scaffolded system satisfy the safety guarantees of the LLMs it’s built out of? Say we make LLMs never talk about murder. But the scaffolded agent, inasmuch as it’s generally intelligent, should surely be able to consider situations that involve murder in order to make workable plans, including scenarios where it itself (deliberately or accidentally) causes death. If nothing else, in order to avoid that.
So it’d need to find some way to circumvent the “my components can’t talk about murder” thing, and it’d probably just evolve some sort of jail-break, or define a completely new term that would stand-in for the forbidden “murder” word.
General form of the Deep Deceptiveness argument applies here. It is ground truth that the GI would be more effective at what it does if it could reason about such stuff. And so, inasmuch as the system is generally intelligent, it’d have the functionality to somehow slip such non-robust constraints. Conversely, if it can’t slip them, it’s not generally intelligent.
I predict that this can’t happen with the standard LLM setup; and that more complex LLM setups, for which this may work, would not meaningfully count as “just an LLM”. See e. g. the “concrete scenario” section.
By “LLMs should be totally safe” I mean literal LLMs as trained today, but scaled up. A thousand times the parameter count, a hundred times the number of layers, trained on correspondingly more multimodal data, etc. But no particularly clever scaffolding or tweaks.
I think we can be decently confident it won’t do anything. I’d been a bit worried about scaling up context windows, but we’ve got 100k-tokens-long ones, and that didn’t do anything. They still can’t even stay on-target, still hallucinate like crazy. Seems fine to update all the way to “this architecture is safe”. Especially given some of the theoretical arguments on that.
(Hey, check this out, @TurnTrout, I too can update in a more optimistic direction sometimes.)
(Indeed, this update was possible to make all the way back in the good old days of GPT-3, as evidenced by nostalgebraist here. In my defense, I wasn’t in the alignment field back then, and it took me a year to catch up and build a proper model of it.)
You were also talking about “systems generated by any process broadly encompassed by the current ML training paradigm”—which is a larger class than just LLMs.
If you claim that arbitrary scaled LLMs are safe from becoming scary agents on their own—it’s more believable. I’d give it around 90%. Still better safe than sorry. And there are other potential problems like creating an actually sentient models without noticing it—which would be an ethical catastrophe. So catiousness and beter interpretability tools are necessary.
I’m talking about “just LLMs” but with clever scaffoldings written in explicit code. All the black box AI-stuff is still only in LLMs. This doesn’t contradict your claim that LLM’s without any additional scaffoldings won’t be able to do it. But it does contradict your titular claim that Current AIs Provide Nearly No Data Relevant to AGI Alignment. If AGI reasoning is made from LLMs, aligning LLMs, in a sense of making them say stuff we want them to say/not say stuff we do not want them to say, is not only absolutely crucial to aligning AGI, but mostly reduces to it.
Yeah, and safety properties of LLMs extend to more than just LLMs. E. g., I’m pretty sure CNNs scaled arbitrarily far are also safe, for the same reasons LLMs are. And there are likely ML models more sophisticated and capable than LLMs, which nevertheless are also safe (and capability-upper-bounded) for the reasons LLMs are safe.
Oh, certainly. I’m a large fan of interpretability tools, as well.
I don’t think that’d work out this way. Why would the overarching scaffolded system satisfy the safety guarantees of the LLMs it’s built out of? Say we make LLMs never talk about murder. But the scaffolded agent, inasmuch as it’s generally intelligent, should surely be able to consider situations that involve murder in order to make workable plans, including scenarios where it itself (deliberately or accidentally) causes death. If nothing else, in order to avoid that.
So it’d need to find some way to circumvent the “my components can’t talk about murder” thing, and it’d probably just evolve some sort of jail-break, or define a completely new term that would stand-in for the forbidden “murder” word.
General form of the Deep Deceptiveness argument applies here. It is ground truth that the GI would be more effective at what it does if it could reason about such stuff. And so, inasmuch as the system is generally intelligent, it’d have the functionality to somehow slip such non-robust constraints. Conversely, if it can’t slip them, it’s not generally intelligent.