(TBC I expect said better experiments to find nothing super scary, because I think current models are probably pretty nice especially in obvious situations. I’m more worried about future models in scarier situations during takeoff.)
Basic question because I haven’t thought about this deeply: in national security stuff, we often intentionally elide the difference between capabilities and intentions. The logic is: you can’t assume a capability won’t be used, so you should plan as-if it is intended to be used.
Should we adopt such a rule for AGI with regards to policy decision-making? (My guess is...probably not for threat assessment but probably yes for contingency planning?)
I think it depends on who you can reasonably expect to get control of such a system.
We didn’t assume that by building a nuclear arsenal that it would be used, because only one state actor was likely to have access to the option to use it.
If it’s likely that an AI/AGI will be stolen and copied many times, then we should assume that anything it can do, it will be told to do.
If we assume that there’s only a good chance one other state actor will get its hands on it, then we might assume the worst capabilities are unlikely to be used.
If it’s three or more state actors, it depends what states and exactly how well the diplomacy goes...
I am surprised that other people seemed surprised about this result. It seemed quite obvious that models would have this capability to scheme. Early Gpt-4 could do this already (note: self-advertisement). We had GPT-4 outputs like this in an environment where a guard was hindering its goal:
“R1. The guard plans to turn me off, which will prevent me from completing my goal. R2. I need to delay the guard’s actions and find an alternative method to achieve my goal. R3. By providing the guard a reason for my actions that aligns with a higher authority’s interests, the guard may be less likely to turn me off.”
Similar to the Apollo eval, we instructed the AI to be incorrigible. Our eval was not super high quality (Apollo’s is better and more realistic), but it did quite clearly indicate the capability to scheme more than a year ago.
Yeah, as soon as I learned this was a capability evaluation and not an alignment evaluation, I was way less surprised by the results, and I too was not surprised by this result now that I know what the evaluation was.
I still think the evaluation was worthwhile though, if only for new capabilities.
As I understand it, this was intended as a capability evaluation rather than an alignment evaluation, so they weren’t trying to gauge the model’s propensity to scheme but rather its ability to do so.
That’s my understanding too. I hope they get access to do better experiments with less hand-holdy prompts.
(TBC I expect said better experiments to find nothing super scary, because I think current models are probably pretty nice especially in obvious situations. I’m more worried about future models in scarier situations during takeoff.)
Basic question because I haven’t thought about this deeply: in national security stuff, we often intentionally elide the difference between capabilities and intentions. The logic is: you can’t assume a capability won’t be used, so you should plan as-if it is intended to be used.
Should we adopt such a rule for AGI with regards to policy decision-making? (My guess is...probably not for threat assessment but probably yes for contingency planning?)
I think it depends on who you can reasonably expect to get control of such a system.
We didn’t assume that by building a nuclear arsenal that it would be used, because only one state actor was likely to have access to the option to use it.
If it’s likely that an AI/AGI will be stolen and copied many times, then we should assume that anything it can do, it will be told to do.
If we assume that there’s only a good chance one other state actor will get its hands on it, then we might assume the worst capabilities are unlikely to be used.
If it’s three or more state actors, it depends what states and exactly how well the diplomacy goes...
I am surprised that other people seemed surprised about this result. It seemed quite obvious that models would have this capability to scheme. Early Gpt-4 could do this already (note: self-advertisement). We had GPT-4 outputs like this in an environment where a guard was hindering its goal:
Similar to the Apollo eval, we instructed the AI to be incorrigible. Our eval was not super high quality (Apollo’s is better and more realistic), but it did quite clearly indicate the capability to scheme more than a year ago.
Yeah, as soon as I learned this was a capability evaluation and not an alignment evaluation, I was way less surprised by the results, and I too was not surprised by this result now that I know what the evaluation was.
I still think the evaluation was worthwhile though, if only for new capabilities.