I don’t get why you think this is true? EG, it seems like almost no insights about how to train faithful CoT would transfer to systems speaking pure neuralese. It seems to me that what little safety/alignment we have for LLMs is largely a byproduct of how language-focused they are (which gives us a weak sort of interpretability, a very useful safety resource which we are at risk of losing soon).
To put it a different way: human imitation is a relatively safe outer-optimization target, so we should be concerned to the extent that the industry is moving away from it. It sounds like you think safety lessons from the human-imitation regime generalize beyond the human-imitation regime. Maybe I agree that we can derive some abstract lessons like “RL with really dense feedback can avoid instrumental convergence”, but we’re moving away from the regime where such dense feedback is available, so I don’t see what lessons transfer.
I think the crux is that the important part of LLMs for safety isn’t their safety properties specifically, but rather the evidence they give about what alignment-relevant properties future AIs will have (note that I’m also drawing on evidence from non-LLM sources, like the MCTS algorithm used for AlphaGo). I also don’t believe interpretability is why LLMs are mostly safe at all; rather, I think they’re safe due to a combination of limited capability, not having extreme instrumental convergence, and our ability to steer them with data.
Language is a simple example, but one that generalizes pretty far.
It sounds like you think safety lessons from the human-imitation regime generalize beyond the human-imitation regime
Note that the primary points would apply to a whole lot of AI designs that don’t imitate humans, like MCTS for AlphaGo or many other future architectures, barring ones which prevent you from steering them with data at all, or which have very sparse feedback (which only weakly constrains instrumental convergence).
but we’re moving away from the regime where such dense feedback is available, so I don’t see what lessons transfer.
I think this is a crux, in that I don’t buy o1 as progressing to a regime where we lose so much dense feedback that it’s alignment relevant, because I think sparse-feedback RL will almost certainly be super-uncompetitive with every other AI architecture until well after AI automates all alignment research.
Also, AIs will still have instrumental convergence; it’s just that their goals will be more local and more focused on the training task, so unless the training task significantly rewards global power-seeking, you won’t get it.
I think the crux is that the important part of LLMs for safety isn’t their safety properties specifically, but rather the evidence they give about what alignment-relevant properties future AIs will have
[insert standard skepticism about these sorts of generalizations when generalizing to superintelligence]
But what lesson do you think you can generalize, and why do you think you can generalize that?
I think this is a crux, in that I don’t buy o1 as progressing to a regime where we lose so much dense feedback that it’s alignment relevant, because I think sparse-feedback RL will almost certainly be super-uncompetitive with every other AI architecture until well after AI automates all alignment research.
So, as a speculative example, further along in the direction of o1 you could have something like MCTS help train these things to solve very difficult math problems, with the sparse feedback being given for complete formal proofs.
Similarly, playing text-based video games, with the sparse feedback given for winning.
Similarly, training CoT to reason about code, with sparse feedback given for predictions of the code output.
Etc.
You think these sorts of things just won’t work well enough to be relevant?
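(To make the dense-vs-sparse distinction concrete, here is a minimal, hypothetical sketch of the outcome-only setups described just above. The `env`, `policy`, and `outcome_verified` names are illustrative stand-ins, not a real API from o1 or AlphaGo: in the dense case the policy gets graded feedback on every step, while in the sparse case it gets a single verified pass/fail signal at the end of the episode.)

```python
# Hypothetical sketch: dense per-step feedback vs. sparse outcome-only feedback.
# `env` and `policy` are illustrative stand-ins for whatever task is being trained
# (formal proofs, text games, code-output prediction), not a real library API.

def run_episode(policy, env, sparse=True):
    """Roll out one episode and return its trajectory and per-step rewards."""
    state = env.reset()
    trajectory, rewards = [], []
    done = False
    while not done:
        action = policy(state)
        state, step_signal, done = env.step(action)
        trajectory.append((state, action))
        # Dense regime: graded feedback on every intermediate step.
        # Sparse regime: no signal at all until the episode ends.
        rewards.append(0.0 if sparse else step_signal)
    if sparse and rewards:
        # Single terminal reward: 1.0 only if the final artifact verifies
        # (proof checker accepts, game is won, predicted output matches).
        rewards[-1] = 1.0 if env.outcome_verified() else 0.0
    return trajectory, rewards
```

The disagreement below is about whether signals this sparse, over long horizons, can be competitive.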
To answer the question: assuming the goals play out over, say, 1-10 year timescales, or maybe even just 1-year timescales with no reward shaping or feedback on intermediate steps at all, I do think the system won’t work well enough to be relevant, since it would require way too much training time, and plausibly way too much compute, depending on how sparse the feedback actually is.
Other AIs relying on much denser feedback will already rule the world before that happens.
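(A toy back-of-envelope, with numbers that are purely made up to show the shape of this scaling worry rather than estimates: with outcome-only feedback, the number of learning signals grows with completed episodes, while the experience and compute consumed grow with episode length times episode count.)

```python
# Toy back-of-envelope with made-up, purely illustrative numbers.
episode_length_steps = 10_000_000   # hypothetical step count for a very long-horizon task
episodes_needed = 100_000           # hypothetical number of episodes the learner needs

total_env_steps = episode_length_steps * episodes_needed
dense_feedback_signals = total_env_steps    # shaped feedback on every step
sparse_feedback_signals = episodes_needed   # one outcome signal per completed episode

print(f"environment steps:       {total_env_steps:.1e}")
print(f"dense feedback signals:  {dense_feedback_signals:.1e}")
print(f"sparse feedback signals: {sparse_feedback_signals:.1e}")
```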
[insert standard skepticism about these sorts of generalizations when generalizing to superintelligence]
But what lesson do you think you can generalize, and why do you think you can generalize that?
Alright, I’ll give 2 lessons that I do think generalize to superintelligence:
1. The data is a large factor in both an AI’s capabilities and its alignment, and alignment strategies should not ignore the data sources when trying to make predictions or to intervene on the AI for alignment purposes.
2. Instrumental convergence in a weak sense will likely exist, because having some ability to get more resources is useful for a lot of goals; but the extremely unconstrained version of instrumental convergence often assumed, where an AI grabs so much power that it effectively controls humanity, is unlikely to arise, given the constraints and feedback given to the AI.
For 1, the basic reason is that a lot of AI success in fields like Go and language modeling was jumpstarted by good data.
More importantly, I remember this post (https://nonint.com/2023/06/10/the-it-in-ai-models-is-the-dataset/), and while I think it overstates things in saying that an LLM just is its dataset (that’s probably no longer true with o1), it does matter that LLMs are influenced by their data sources.
For 2, the basic reason is that the strongest capabilities we have seen come out of RL either require immense amounts of data on pretty narrow tasks, or rely on non-instrumental world models.
This is because such constraints save you from the problem of producing completely useless RL artifacts, and evolution got around the need for them by accepting far longer timelines and far more computation in FLOPs than the world economy can tolerate.
Assuming the goals play out over, say, 1-10 year timescales, or maybe even just 1-year timescales with no reward shaping or feedback on intermediate steps at all, I do think the system won’t work well enough to be relevant, since it would require way too much training time, and plausibly way too much compute, depending on how sparse the feedback actually is.
Ah, I wasn’t thinking “sparse” here meant anywhere near that sparse. I thought your dense-vs-sparse distinction was doing something like contrasting RLHF (very dense, basically no instrumental convergence) with chess (very sparse, plenty of instrumental convergence).
I still think o1 is moving towards chess on this spectrum.
Oh, now I understand. And AIs have already been superhuman at chess for a very long time, yet that domain gives very little incentive for very strong instrumental convergence.
I am claiming that training practical AIs in the real world with goals will give them instrumental convergence, but, absent further incentives, not so much instrumental convergence that it leads to power-seeking that disempowers humans by default.
Chess is like a bounded, mathematically described universe where all the instrumental convergence stays contained, and it only accomplishes a very limited instrumentality in our universe (i.e., chess programs gain a limited sort of power here by being good playmates).
LLMs touch on the real world far more than that, such that MCTS-like skill at navigating “the LLM world” in contrast to chess sounds to me like it may create a concerning level of real-world-relevant instrumental convergence.
I agree chess is an extreme example, and I think more realistic versions would probably develop instrumental convergence, at least in a local sense.
(We already have o1 at least capable of a little instrumental convergence.)
My main substantive claim is that constraining instrumental goals, such that the AI doesn’t try to take power via long-term methods, is very useful for capabilities, and more generally that instrumental convergence is an area with a positive manifold between capabilities and alignment, where alignment methods increase capabilities and vice versa.