As I understand it, the security mindset asserts a premise that’s roughly: “The bundle of intuitions acquired from the field of computer security is a good predictor for the difficulty / value of future alignment research directions.”
This seems… like a correct description but it’s missing the spirit?
Like the intuitions are primarily about “what features are salient” and “what thoughts are easy to think.”
However, I don’t see why this should be the case.
Roughly, the core distinction between software engineering and computer security is whether the system is thinking back. Software engineering typically involves working with dynamic systems and thinking optimistically about how the system could work. Computer security typically involves working with reactive systems and thinking pessimistically about how the system could break.
I think it is an extremely basic AI alignment skill to look at your alignment proposal and ask “how does this break?” or “what happens if the AI thinks about this?”.
Additionally, there’s a straightforward reason why alignment research (specifically the part of alignment that’s about training AIs to have good values) is not like security: there’s usually no adversarial intelligence cleverly trying to find any possible flaws in your approaches and exploit them.
What’s your story for specification gaming?

I must admit some frustration here; in this section it feels like your point is “look, computer security is for dealing with intelligence as part of your system. But the only intelligence in our system is sometimes malicious users!” In my world, the whole point of Artificial Intelligence was the Intelligence. The call is coming from inside the house!
Maybe we just have some linguistic disagreement? “Sure, computer security is relevant to transformative AI but not LLMs”? If so, then I think the earlier point about whether capabilities enhancements break alignment techniques is relevant: if these alignment techniques work because the system isn’t thinking about them, then are you confident they will continue to work when the system is thinking about them?
Roughly, the core distinction between software engineering and computer security is whether the system is thinking back.
Yes, and my point in that section is that the fundamental laws governing how AI training processes work are not “thinking back”. They’re not adversaries. If you created a misaligned AI, then it would be “thinking back”, and you’d be in an adversarial position where security mindset is appropriate.
“Building an AI that doesn’t game your specifications” is the actual “alignment question” we should be doing research on. The mathematical principles which determine how much a given AI training process games your specifications are not adversaries. It’s also a problem we’ve made enormous progress on, mostly by using large pretrained models with priors over how to appropriately generalize from limited specification signals. E.g., Learning Which Features Matter: RoBERTa Acquires a Preference for Linguistic Generalizations (Eventually) shows how the process of pretraining an LM causes it to go from “gaming” a limited set of finetuning data via shortcut learning / memorization, to generalizing with the appropriate linguistic prior knowledge.
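To make that concrete, here’s a minimal sketch of the kind of probe I mean (a toy, not the paper’s actual setup: the sentences, the “often” surface cue, and the tense rule are all invented for illustration, and it assumes the HuggingFace transformers library). Finetune on data where a surface cue and a linguistic feature both predict the label, then test on sentences where they come apart; a model that “games” the finetuning data latches onto the cue, while a model with good linguistic priors tracks tense.

```python
# Toy "ambiguous finetuning" probe (invented example, not the paper's dataset).
# Training split: present tense <-> label 1, and every present-tense sentence also
# contains the surface cue "often". Test split: the cue is swapped, so a shortcut
# learner and a tense learner give opposite answers.
import torch
from transformers import AutoConfig, AutoModelForSequenceClassification, AutoTokenizer

def make_split(ambiguous: bool):
    subjects = ["The dog", "A child", "My neighbor", "The teacher", "Her friend"]
    rows = []
    for s in subjects:
        if ambiguous:  # finetuning data: cue and tense agree
            rows.append((f"{s} often walks to the park.", 1))
            rows.append((f"{s} walked to the park.", 0))
        else:          # held-out data: cue and tense disagree
            rows.append((f"{s} walks to the park.", 1))
            rows.append((f"{s} often walked to the park.", 0))
    return rows

def finetune_and_eval(pretrained: bool, epochs: int = 20) -> float:
    tok = AutoTokenizer.from_pretrained("roberta-base")
    if pretrained:
        model = AutoModelForSequenceClassification.from_pretrained("roberta-base", num_labels=2)
    else:  # same architecture, random weights: no linguistic prior from pretraining
        model = AutoModelForSequenceClassification.from_config(
            AutoConfig.from_pretrained("roberta-base", num_labels=2))
    opt = torch.optim.AdamW(model.parameters(), lr=2e-5)

    model.train()
    for _ in range(epochs):
        for text, label in make_split(ambiguous=True):
            batch = tok(text, return_tensors="pt")
            loss = model(**batch, labels=torch.tensor([label])).loss
            loss.backward()
            opt.step()
            opt.zero_grad()

    model.eval()
    test = make_split(ambiguous=False)
    with torch.no_grad():
        correct = sum(
            int(model(**tok(text, return_tensors="pt")).logits.argmax(-1).item() == label)
            for text, label in test)
    return correct / len(test)  # accuracy by the *tense* rule on the disambiguating data

print("pretrained RoBERTa: ", finetune_and_eval(pretrained=True))
print("from-scratch RoBERTa:", finetune_and_eval(pretrained=False))
```

With only a handful of examples the numbers will be noisy, but the pretrained model should lean much harder on the tense rule than the randomly initialized one.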
“Building an AI that doesn’t game your specifications” is the actual “alignment question” we should be doing research on.
Ok, it sounds to me like you’re saying:
“When you train ML systems, they game your specifications because the training dynamics are too dumb to infer what you actually want. We just need One Weird Trick to get the training dynamics to Do What You Mean Not What You Say, and then it will all work out, and there’s not a demon that will create another obstacle given that you surmounted this one.”
That is, training processes are not neutral; there are the bad training processes that we have now (or had before the recent positive developments), and eventually there will be good training processes that create aligned-by-default systems.
Is this roughly right, or am I misunderstanding you?
If you created a misaligned AI, then it would be “thinking back”, and you’d be in an adversarial position where security mindset is appropriate.
Cool, we agree on this point.
my point in that section is that the fundamental laws governing how AI training processes work are not “thinking back”. They’re not adversaries.
I think we agree here on the local point but disagree on its significance to the broader argument. [I’m not sure how much we agree: I think of training dynamics as ‘neutral’, but I also think of them as searching over program-space in order to find a program that performs well on a (loss function, training set) pair, and so you need to be reasoning about search. But I think we agree the training dynamics are not trying to trick you / be adversarial and instead are straightforwardly ‘trying’ to make Number Go Down.]
In my picture, we have the neutral training dynamics paired with the (loss function, training set) which creates the AI system, and whether the resulting AI system is adversarial or not depends mostly on the choice of (loss function, training set). It seems to me that we probably have a disagreement about how much of the space of (loss function, training set) leads to misaligned vs. aligned AI (if it hits ‘AI’ at all), where I think aligned AI is a narrow target to hit that most loss functions will miss, and hitting that narrow target requires security mindset.
To explain further, it’s not that the (loss function, training set) is thinking back at you on its own; it’s that the AI that’s created by training is thinking back at you. So before you decide to optimize X you need to check whether or not you actually want something that’s optimizing X, or if you need to optimize for Y instead.
So from my perspective it seems like you need security mindset in order to pick the right inputs to ML training to avoid getting misaligned models.
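To gesture at what I mean by “narrow target” and reasoning about search, here’s a deliberately tiny toy (everything in it is invented for illustration): treat “programs” as truth tables over 3-bit inputs, treat the specification as correct labels on four of the eight inputs, and count how many zero-loss programs also behave as intended on the inputs the specification never covers.

```python
# Toy version of "search over programs that fit a limited specification".
# Invented setup: 3-bit inputs, intended behavior = XOR of the first two bits,
# specification = correct labels on only 4 of the 8 possible inputs.
from itertools import product

inputs = list(product([0, 1], repeat=3))            # all 8 possible inputs
intended = {x: x[0] ^ x[1] for x in inputs}         # the behavior we actually want
spec_inputs, held_out = inputs[:4], inputs[4:]      # labeled vs. never-specified inputs

zero_loss = 0      # programs that fit the specification perfectly
also_intended = 0  # ...and also do what we wanted on the held-out inputs
for outputs in product([0, 1], repeat=8):           # all 256 possible "programs"
    table = dict(zip(inputs, outputs))
    if all(table[x] == intended[x] for x in spec_inputs):
        zero_loss += 1
        also_intended += int(all(table[x] == intended[x] for x in held_out))

print(zero_loss, "programs fit the specification")      # 16
print(also_intended, "of them generalize as intended")  # 1
```

Of the sixteen programs that fit the specification perfectly, only one does what was intended off-distribution. Real training is nothing like uniform search over all programs, and real priors are nothing like uniform, which is exactly where the disagreement about how narrow the target is lives.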
As a commentary from an observer: the claim that the laws governing AI training processes are not “thinking back” is distinct from the proposition that the minds created with those laws are not thinking back.