My shard-theory-inspired story is to make an AI that:
Has a good core of human values (this is still hard)
Can identify when experiences will change it in ways that lead to less of the initial good values. (This is the meta-preferences point, with GPT-4 sort of expressing that it would avoid jailbreak inputs.)
Then the model can safely scale.
This doesn’t require having the true reward function (which I imagine to be a giant lookup table created by Omega), but it does require some mech interp and an understanding of its own reward function. I don’t expect this to be an entirely different paradigm; I even think current methods of RLHF might just naively work. Who knows? (I do think we should try to figure it out, though! I have greater uncertainty and less pessimism.)
Analogously, I believe I do a good job of avoiding value-destroying inputs (e.g. addictive substances), even though my reward function isn’t as clear and legible as our AIs’ will be, AFAIK.
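To make the second point concrete, here is a minimal toy sketch of what gating experiences by projected value drift could look like, assuming (hypothetically) that mech interp could surface something like a “value direction” to check candidate updates against. The probe, parameters, and tolerance below are illustrative stand-ins, not anything that exists today:

```python
# Toy sketch (all names hypothetical): gate training experiences by projected value drift.
# "Values" are stood in for by a fixed probe direction; a real system would need mech interp
# to locate anything remotely like this.
import numpy as np

rng = np.random.default_rng(0)

value_probe = rng.normal(size=16)
value_probe /= np.linalg.norm(value_probe)   # direction standing in for the "good core values"
params = rng.normal(size=16)                 # stand-in for model parameters

def projected_drift(params: np.ndarray, update: np.ndarray) -> float:
    """How much a candidate update would move the params away from the value probe."""
    before = params @ value_probe
    after = (params + update) @ value_probe
    return before - after                    # positive = losing alignment with the probe

DRIFT_TOLERANCE = 0.05

def maybe_train(params: np.ndarray, update: np.ndarray) -> np.ndarray:
    """Apply the update only if it doesn't erode the value direction too much."""
    if projected_drift(params, update) > DRIFT_TOLERANCE:
        return params                        # skip the value-destroying experience
    return params + update

for _ in range(100):
    candidate_update = 0.01 * rng.normal(size=16)
    params = maybe_train(params, candidate_update)
```

The hard parts the sketch assumes away are exactly the ones at issue: finding a faithful value probe, and continuing to trust it as capabilities scale.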
If there are experiences which will change it that don’t lead to less of the initial good values, then yeah, it can safely scale, for an approximate definition of safety. You’re resting everything on the continued strength of this model as capabilities increase, and so if it fails before you top out the scaling I think you probably lose.
FWIW I don’t really see your description as, like, a specific alignment strategy so much as the strategy of “have an alignment strategy at all”. The meat is all in 1) how you identify the core of human values and 2) how you identify which experiences will change the system to have less of the initial good values, but, like, figuring out the two of those would actually solve the problem!
if it fails before you top out the scaling I think you probably lose
While I agree that arbitrary scaling is dangerous, stopping early is an option. Near-human AGI need not transition to ASI until the relevant notKillEveryone problems have been solved.
The meat is all in 1) how you identify the core of human values and 2) how you identify which experiences will change the system to have less of the initial good values, but, like, figuring out the two of those would actually solve the problem!
The alignment strategy seems to be “what we’re doing right now”, which is:
feed the base model human-generated training data
apply RL-type stuff (RLHF, RLAIF, etc.) to reinforce the good type of internet-learned behavior patterns (a toy sketch of this two-stage recipe is below)
This could definitely fail eventually if RLAIF-style self-improvement is allowed to go on long enough, but crucially, especially with RLAIF and other strategies that set the AI to training itself, there’s a scalable, mostly aligned intelligence right there that can help. We’re not trying to safely align a demon so much as avoid getting to “demon” from the somewhat aligned thing we have now.
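A toy sketch of that two-stage recipe, purely illustrative and not any lab’s actual pipeline: stage 1 is stood in for by initial tendencies “learned from human data”, and stage 2 reinforces whatever a (here hard-coded, hypothetical) AI-feedback judge scores highly, REINFORCE-style:

```python
# Toy sketch of the two-stage recipe (hypothetical names and numbers throughout):
# stage 1 = base tendencies from human data, stage 2 = RLAIF-style reinforcement.
import numpy as np

rng = np.random.default_rng(0)

behaviors = ["helpful answer", "refuse harmful request", "deceptive answer"]
logits = np.array([1.0, 0.5, 0.8])         # stage 1: base-model tendencies from human data

def ai_feedback(behavior: str) -> float:
    """Stand-in for an AI-feedback judge scoring the sampled behavior."""
    return {"helpful answer": 1.0, "refuse harmful request": 0.8, "deceptive answer": -1.0}[behavior]

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

LR = 0.1
for _ in range(500):                        # stage 2: reinforce the good learned patterns
    probs = softmax(logits)
    i = rng.choice(len(behaviors), p=probs)
    reward = ai_feedback(behaviors[i])
    grad = -probs
    grad[i] += 1.0                          # gradient of log-prob of the sampled behavior
    logits += LR * reward * grad            # REINFORCE update

print(dict(zip(behaviors, softmax(logits).round(3))))
```

The worry in the paragraph above corresponds to the judge itself drifting if this loop is left to run unsupervised for long enough.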
Near-human AGI need not transition to ASI until the relevant notKillEveryone problems have been solved.
How central is this to your story of how things go well?
I agree that humanity could do this (or at least it could if it had its shit together), and I think it’s a good target to aim for that buys us sizable success probability. But I don’t think it’s what’s going to happen by default.
Slower is better, obviously, but as to the inevitability of ASI, I think reaching 99th-percentile human capability in a handful of domains is enough to stop the current race. Getting there is probably not too dangerous.
Stop it how?
Vulnerable world hypothesis (but takeover risk rather than destruction risk). That plus first-mover advantage could stop things pretty decisively without requiring ASI alignment.
As an example, taking over most networked computing devices seems feasible in principle with thousands of +2SD AI programmers/security-researchers. That would require an AlphaGo-level breakthrough for RL as applied to LLM programmer-agents.
One especially low-risk, low-complexity option is a stealthy takeover of other AI labs’ compute, then faking another AI winter. This might get you most of the compute and impact you care about without actively pissing everyone off.
If you’re more confident in jailbreak prevention and software hardening, secrecy is less important.
First-mover advantage depends on the ability to fix vulnerabilities and harden infrastructure to prevent a second group from taking over. To the extent that AI is required for management, jailbreak prevention/mitigation will also be needed.