Here’s an AI safety case I sketched out in a few minutes. I think it’d be nice if more (single-AI) safety cases focused on getting good circuits / shards into the model, as I think that’s an extremely tractable problem:
Premise 0 (not goal-directed at initialization): The model prior to RL training is not goal-directed (in the sense required for x-risk).
Premise 1 (good circuit-forming): For any ϵ>0, we can select a curriculum and reinforcement signal which do not entrain any “bad” subset of circuits B such that:
1A. the circuit subset B in fact explains more than ϵ percent of the logit variance[1] in the induced deployment distribution, and
1B. if the bad circuits had amplified influence over the logits, the model would (with high probability) execute a string of actions which lead to human extinction.
Premise 2 (majority rules): There exists K>0 such that, if a circuit subset doesn’t explain at least K⋅ϵ of the logit variance, then the marginal probability on x-risk trajectories[2] is less than ϵ. (NOTE: Not sure if there should be one K for all ϵ>0?)
Conclusion: The AI very probably does not cause x-risk. “Proof”: Let the target probability of x-risk be δ. Apply premise 1 with some ϵ<K⋅δ to select a curriculum and reinforcement signal.
By premise 0, the AI doesn’t start out goal-directed. By premise 1, RL doesn’t entrain influential bad circuits, so the share of logit variance explained by bad circuits stays below K⋅δ. By premise 2, the overall probability on x-risk trajectories is less than δ.
(Notice how this safety case doesn’t require “we can grade all of the AI’s actions.” Instead, it tightly hugs the problem of “how do we get generalization to assign low probability to bad outcomes”?)
I don’t think this is an amazing operationalization, but hopefully it gestures in a promising direction.
Notice how the “single AI” assumption sweeps all multipolar dynamics into this one “marginal probability” measurement! That is, if there are other AIs doing stuff, how do we credit-assign whether the trajectory was the “AI’s fault” or not? I guess it’s more of a conceptual question. I think that this doesn’t tank the aspiration of “let’s control generalization” implied by the safety case.
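To make the bookkeeping in the “proof” concrete, here is a minimal sketch of the arithmetic, assuming the premises hold and treating K as a known constant (premise 2 only asserts that some K exists); the numbers and names are illustrative, not part of the original argument.

```python
def variance_budget(delta: float, K: float) -> float:
    """Premise 2 (as stated): if the bad circuits explain less than K * eps of the
    logit variance, then P(x-risk trajectory) < eps. So to target P(x-risk) < delta,
    premise 1 must keep the bad circuits' share of logit variance below K * delta."""
    return K * delta

# Purely illustrative numbers: a target of delta = 1e-6 with an assumed K = 0.01
# would require the curriculum to keep bad circuits below a 1e-8 variance share.
print(variance_budget(delta=1e-6, K=0.01))  # 1e-08
```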
Words are really, really loose, and can hide a lot of nuance and mechanism and difficulty.
That is mostly an argument by intuition. To make it more rigorous and transparent, here is a constructive proof:
Let the curriculum be sampled uniformly at random. This has ~no mutual information with the world. Therefore the AI does not learn any methods in the world that can cause human extinction.
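For concreteness, here is a minimal sketch of what “sampled uniformly at random” could mean here; the vocabulary size and sequence length are hypothetical, chosen only for illustration.

```python
import random

# A hypothetical uniformly random curriculum: every training sequence is sampled
# independently of any data about the world, so the curriculum carries ~zero
# mutual information with the world (and, as the replies note, zero usefulness).
VOCAB_SIZE = 50_000  # assumed tokenizer size, illustration only
SEQ_LEN = 128        # assumed sequence length, illustration only

def sample_random_curriculum(n_sequences: int) -> list[list[int]]:
    """Return n_sequences token sequences drawn uniformly at random."""
    return [
        [random.randrange(VOCAB_SIZE) for _ in range(SEQ_LEN)]
        for _ in range(n_sequences)
    ]

curriculum = sample_random_curriculum(n_sequences=1000)
```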
Does this not mean the AI has also learnt no methods that provide any economic benefit either?
Yes. TurnTrout’s intuitive argument did not contain any premises that implied the AI learnt any methods that provide any economic benefit, so I thought it wouldn’t be necessary to include that in the constructive proof either.
Right, but that isn’t a good safety case because such an AI hasn’t learnt about the world and isn’t capable of doing anything useful. I don’t see why anyone would dedicate resources to training such a machine.
I didn’t understand TurnTrout’s original argument to be limited to only “trivially safe” (i.e., non-functional) AI systems.
How did you understand the argument instead?
I can see that the condition you’ve given, that a “curriculum be sampled uniformly at random” with no mutual information with the real world, is sufficient for a curriculum to satisfy Premise 1 of TurnTrout’s argument.
But it isn’t immediately obvious to me that it is a sufficient and necessary condition (and therefore equivalent to Premise 1).
I’m not claiming to have shown something equivalent to premise 1; I’m claiming to have shown something equivalent to the conclusion of the proof (that it’s possible to make an AI which very probably does not cause x-risk), inspired by the general idea of the proof but simplifying/constructifying it to be more rigorous and transparent.
I might be misunderstanding something crucial or am not expressing myself clearly.
I understand TurnTrout’s original post to be an argument for a set of conditions which, if satisfied, prove the AI is (probably) safe. There are no restrictions on the capabilities of the system given in the argument.
You do constructively show “that it’s possible to make an AI which very probably does not cause x-risk” using a system that cannot do anything coherent when deployed.
But TurnTrout’s post is not merely arguing that it is “possible” to build a safe AI.
Your conclusion is trivially true and there are simpler examples of “safe” systems if you don’t require them to do anything useful or coherent. For example, a fried, unpowered GPU is guaranteed to be “safe” but that isn’t telling me anything useful.
Can you sketch out some ideas for showing/proving premises 1 and 2? More specifically:
For 1, how would you rule out future distributional shifts increasing the influence of “bad” circuits beyond ϵ?
For 2, it seems that you actually need to show a specific K, not just that there exists K>0, otherwise how would you be able to show that x-risk is low for a given curriculum? But this seems impossible, because the “bad” subset of circuits could constitute a malign superintelligence strategically manipulating the overall AI’s output while staying within a logit variance budget of ϵ (i.e., your other premises do not rule this out), and how could you predict what such a malign SI might be able to accomplish?
Upvoted and disagreed. [1]
One thing in particular that stands out to me: The whole framing seems useless unless Premise 1 is modified to include a condition like
[...] we can select a curriculum and reinforcement signal which [...] and which makes the model highly “useful/capable”.
Otherwise, Premise 1 is trivially true: We could (e.g.) set all the model’s weights to 0.0, thereby guaranteeing the non-entrainment of any (“bad”) circuits.
I’m curious: what do you think would be a good (...useful?) operationalization of “useful/capable”?
Another issue: K and ϵ might need to be unrealistically small. Once the model starts modifying itself or constructing successor models (and possibly earlier), a single strategically placed sign-flip in the model’s outputs might cause catastrophe. [2]
I think writing one’s thoughts/intuitions out like this is valuable—for sharing frames/ideas, getting feedback, etc. Thus: thanks for writing it up. Separately, I think the presented frame/case is probably confused, and almost useless (at best).
Although that might require the control structures (be they Shards or a utility function or w/e) of the model to be highly “localized/concentrated” in some sense. (OTOH, that seems likely to at least eventually be the case?)
Is there a difficulty in moving from statements about the variance in logits to statements about x-risk?
One is a statement about the output of a computation after a single timestep, the other is a statement about the cumulative impact of the policy over multiple time-steps in a dynamic environment that reacts in a complex way to the actions taken.
My intuition is that for any ϵ>0 bounding the variance in the logits, you could always construct a suitably pathological environment that amplifies these cumulative deviations into a catastrophe.
(There is at least a 30% chance I haven’t grasped your idea correctly)
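A toy illustration of that amplification intuition, with arbitrary numbers not derived from the premises: if the environment multiplies accumulated deviations each step, even a tiny per-step influence compounds without bound.

```python
# Toy dynamics: the environment multiplies the accumulated deviation by `gain`
# each step, while the policy's bounded influence adds at most `eps` per step.
def cumulative_deviation(eps: float, gain: float, steps: int) -> float:
    deviation = 0.0
    for _ in range(steps):
        deviation = gain * deviation + eps  # environment amplifies, policy adds eps
    return deviation

print(cumulative_deviation(eps=1e-6, gain=1.5, steps=100))  # roughly 8e11
```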
Potentially relevant: the theoretical results from On Effects of Steering Latent Representation for Large Language Model Unlearning. From Claude chat: ‘Point 2 refers to the theoretical analysis provided in the paper on three key aspects of the Representation Misdirection for Unlearning (RMU) method:
Effect on token confidence:
The authors hypothesized that RMU causes randomness in the logits of generated tokens.
They proved that the logit of a token generated by an RMU model follows a Normal distribution.
This randomness in logits is interpreted as low confidence, leading to nonsensical or incorrect responses.
The analysis suggests that the variance of the logit distribution is influenced by the coefficient value and properties of the model layers.
Impact of the coefficient value:
The coefficient ‘c’ in RMU affects how well the forget sample representations align with the random vector.
Theoretical analysis showed a positive correlation between the coefficient and the alignment.
Larger coefficient values lead to more alignment between forget sample representations and the random vector.
However, the optimal coefficient value varies depending on the layer of the model being unlearned.
Robustness against adversarial jailbreak attacks:
The authors analyzed RMU’s effectiveness against white-box jailbreak attacks from an attack-defense game perspective.
They showed that RMU causes the attacker to receive unreliable and uninformative gradient signals.
This makes it difficult for the attacker to find optimal adversarial tokens for replacement.
The analysis explains why RMU models demonstrate strong robustness against methods like Greedy Coordinate Gradient (GCG) attacks.
The theoretical analysis provides a foundation for understanding why RMU works and how its parameters affect its performance. This understanding led to the development of the improved Adaptive RMU method proposed in the paper.’
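For readers unfamiliar with RMU, here is a rough sketch of the kind of objective the summary is describing, based on the general RMU setup (steer forget-set activations toward a scaled random vector while keeping retain-set activations close to the frozen original model). The function and variable names are illustrative and not taken from the paper’s code.

```python
import torch
import torch.nn.functional as F

def rmu_style_loss(h_forget, h_retain, h_retain_frozen, u, c, alpha):
    """Sketch of an RMU-style objective (illustrative, not the paper's code).

    h_forget:        hidden states of the model being unlearned, on forget data
    h_retain:        hidden states of the model being unlearned, on retain data
    h_retain_frozen: hidden states of the frozen original model, on retain data
    u:               fixed random unit vector (the steering target)
    c:               the steering coefficient discussed in the summary
    alpha:           weight on the retain term
    """
    forget_term = F.mse_loss(h_forget, c * u.expand_as(h_forget))
    retain_term = F.mse_loss(h_retain, h_retain_frozen)
    return forget_term + alpha * retain_term

# Illustrative shapes only.
d = 16
u = F.normalize(torch.randn(d), dim=0)
loss = rmu_style_loss(torch.randn(2, 8, d), torch.randn(2, 8, d),
                      torch.randn(2, 8, d), u, c=100.0, alpha=1.0)
```

In this sketch, a larger c pushes the forget-set representations harder toward the random direction, which is the coefficient–alignment relationship the summary describes.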