By contrast, the world where sharp left turns happen near human-level capabilities seems pretty dangerous. In that world, we don’t get to study things like deceptive alignment empirically. We don’t get to leverage powerful AI researcher tools.
We actually can study and iterate on alignment (and deception) in simulation sandboxes while leveraging any powerful introspection/analysis tools. Containing human-level AGI in (correctly constructed) simulations is likely easy.
But then further down you say:
For instance, we can put weak models in environments that encourage alignment failure so as to study left turns empirically (shifting towards the “weak” worlds).
Which seems like a partial contradiction, unless you believe we can’t contain human-level agents?
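I am skeptical that we can contain human-level agents, particularly if they have speed/memory advantages over humans.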
Is this something you’ve thought deeply about and/or care to expand on? I’m curious about your source of skepticism, considering:
we can completely design and constrain the knowledge base of sim agents, limiting them to the equivalent of 1000BC human knowledge, or whatever we want
we can completely and automatically monitor their inner monologues, thoughts, etc. (see the sketch below)
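For concreteness, here is a minimal sketch of what I mean by those two points. Everything in it is hypothetical and simplified (the names, tags, and flagged terms are all placeholders): the training corpus is filtered down to an approved knowledge base, and every step of the agent’s generated monologue is scanned for concepts it should have no way to reference.

```python
# Minimal sketch (hypothetical names throughout) of the two mechanisms above:
# (a) constrain the sim agent's training corpus to an approved knowledge base,
# (b) automatically scan its generated inner monologue for flagged concepts.
from dataclasses import dataclass

# Only documents tagged as belonging to the constrained knowledge base
# (e.g. pre-industrial world lore) are admitted into training.
APPROVED_TAGS = {"sim_world_lore", "pre_industrial_tech"}

# Concepts the agent should have no way to reference if containment holds;
# any hit is a red flag worth pausing the sim and inspecting.
FLAGGED_CONCEPTS = ["simulation", "server", "internet", "gpu", "training run"]

@dataclass
class Document:
    text: str
    tags: set[str]

def build_training_corpus(candidates: list[Document]) -> list[Document]:
    """Keep only documents whose tags all fall inside the approved set."""
    return [d for d in candidates if d.tags and d.tags <= APPROVED_TAGS]

def monitor_monologue(monologue: str) -> list[str]:
    """Return any flagged concepts appearing in the agent's inner monologue."""
    lowered = monologue.lower()
    return [c for c in FLAGGED_CONCEPTS if c in lowered]

# Usage: run monitor_monologue on every chain-of-thought step and pause the
# sim for human review whenever it returns a non-empty list.
```

A keyword filter is obviously just a stand-in; the real version of the second point would need tools that read the agent’s internal state, not only its text output.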
How do you concretely propose a sim agent will break containment? (which firstly requires sim-awareness)
How would you break containment now, assuming you are in a sim?
Also, significant speed/memory advantages go against the definition of ‘human-level agents’ and are intrinsically unlikely anyway: 2x speed/memory agents are simply preceded by 1x speed/memory agents, and GPU constraints favor accelerating learning by parallelizing across agents rather than by serially speeding up individual agents.
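To make the GPU point concrete, here’s a toy back-of-the-envelope sketch; every number in it is made up purely for illustration, and the ceiling on serial speedup is an assumption, not a measurement:

```python
# Toy arithmetic (all numbers purely illustrative) for why a GPU budget more
# readily buys many parallel agents than one serially faster agent.

GPU_COUNT = 1000                  # hypothetical cluster size
TOKENS_PER_SEC_PER_INSTANCE = 10  # hypothetical human-equivalent "thinking" rate
INSTANCES_PER_GPU = 4             # batching: more GPUs -> more parallel agents

# Throughput scales with the cluster: total agent-tokens per second.
parallel_throughput = GPU_COUNT * INSTANCES_PER_GPU * TOKENS_PER_SEC_PER_INSTANCE

# Serial speedup of a single agent is bounded by per-chip latency (memory
# bandwidth, interconnect), so adding GPUs barely raises it.
MAX_SERIAL_SPEEDUP = 10           # hypothetical latency-bound ceiling
serial_rate = TOKENS_PER_SEC_PER_INSTANCE * MAX_SERIAL_SPEEDUP

print(parallel_throughput)  # 40000: many slow agents, lots of total work
print(serial_rate)          # 100: one fast agent, capped well below that
```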
My modal expectation is that these are going to look something like large language models, meaning that while we’ll have controlled the training corpus it won’t be something hand-curated. So there will almost certainly be lots of information in there that models could use to e.g. figure out how to manipulate people, program computers, and so on. I expect then that we won’t be limiting their knowledge as much as you think we will, because I expect them to be trained with giant corpuses of (mostly) post-1000BC text.
We do not currently have the interpretability tools needed to monitor their inner thoughts. I would feel much better about this problem if we had such tools!
My best guess is that containment fails because a model convinces one of the people in the lab to let it access the internet. That could look like getting someone to think it’s genuinely a good idea (because it’ll be helpful for some project), or it could look like bribery (with offer of future payment) or blackmail.
I’m guessing that we’ll want to run some agents fast because that’s how you get lots of serial intellectual work done. So even if all we get are human-level agents, there’ll be a big incentive to run them fast.
Indeed a model trained on (or with full access to) our internet could be very hard to contain. That is in fact a key part of my argument. If it is hard to contain, it is hard to test. If it’s hard to test, we are unlikely to succeed.
So if we agree there, then perhaps you agree that giving untested AGI access to our immense knowledge base is profoundly unwise.
I expect then that we won’t be limiting their knowledge as much as you think we will, because I expect them to be trained with giant corpuses of (mostly) post-1000BC text.
That’s then just equivalent to saying “I expect then that we won’t even bother with testing our alignment designs”. Do you actually believe that testing is unnecessary? Or do you just believe the leading teams won’t care? And if you agree that testing is necessary, then shouldn’t this be key to any successful alignment plan?
My best guess is that containment fails because a model convinces one of the people in the lab to let it access the internet.
Which obviously is nearly impossible if it doesn’t know it is in a sim, and doesn’t know what a lab or the internet are, and lacks even the precursor concepts. Since you seem to be focused on the latest fad (large language models), consider the plight of poor simulated Elon Musk—who at least is aware of the sim argument.
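Thanks for pushing back on some stuff here.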
Since you seem to be focused on the latest fad (large language models)
Is this your way of saying you don’t think LLMs will scale to AGI (or AGI-level capabilities)?
So if we agree there, then perhaps you agree that giving untested AGI access to our immense knowledge base is profoundly unwise.
This seems like it will happen with economic incentives (I mean, WebGPT already exists) and models will already be trained on the entire internet, so it’s not clear to me how we’d prevent this?
Is this your way of saying you don’t think LLMs will scale to AGI (or AGI-level capabilities)?
It’s more that AGI will contain an LLM module, much as the brain contains linguistic cortex modules. Indeed, several recent neuroscience studies have shown that LLMs are functionally equivalent to the brain’s linguistic cortex: trained in much the same way on much the same data, they end up with very similar learned representations.
It’s also clear from LLM scaling that the larger LLMs are already comparable to linguistic cortex yet still lag the brain in many core capabilities. You don’t get those capabilities just by making a larger vanilla LLM / linguistic cortex.
This seems like it will happen with economic incentives (I mean, WebGPT already exists) and models will already be trained on the entire internet
Training just a language module on the internet is not dangerous by itself, but yes, the precedent is of course concerning. There are now several projects working on multi-modal foundation agents that control virtual terminals with full web access. If that line of work continues and reaches AGI before the safer sim/game training path does, then we may be in trouble.
so it’s not clear to me how we’d prevent this?
Well, if we agree that we obviously can contain human-level agents in sims if we want to, then it’s some combination of the standard approaches: spreading awareness, advocacy, and advancing sim techniques.
Consider, for example, if there were a complete alignment solution available today, and it necessarily required changing how we train models. Clearly, the fact that there is currently inertia in the wrong direction can’t be a knockdown argument against doing the right thing. If you learn your train is on a collision course, you change course or jump.