Roughly, the core distinction between software engineering and computer security is whether the system is thinking back.
Yes, and my point in that section is that the fundamental laws governing how AI training processes work are not “thinking back”. They’re not adversaries. If you created a misaligned AI, then it would be “thinking back”, and you’d be in an adversarial position where security mindset is appropriate.
“Building an AI that doesn’t game your specifications” is the actual “alignment question” we should be doing research on. The mathematical principles which determine how much a given AI training process games your specifications are not adversaries. It’s also a problem we’ve made enormous progress on, mostly by using large pretrained models with priors over how to appropriately generalize from limited specification signals. E.g., “Learning Which Features Matter: RoBERTa Acquires a Preference for Linguistic Generalizations (Eventually)” shows how the process of pretraining an LM causes it to go from “gaming” a limited set of finetuning data via shortcut learning / memorization, to generalizing with the appropriate linguistic prior knowledge.
“Building an AI that doesn’t game your specifications” is the actual “alignment question” we should be doing research on.
Ok, it sounds to me like you’re saying:
“When you train ML systems, they game your specifications because the training dynamics are too dumb to infer what you actually want. We just need One Weird Trick to get the training dynamics to Do What You Mean Not What You Say, and then it will all work out, and there’s not a demon that will create another obstacle given that you surmounted this one.”
That is, training processes are not neutral; there are the bad training processes that we have now (or had before the recent positive developments), and eventually there will be good training processes that create aligned-by-default systems.
Is this roughly right, or am I misunderstanding you?
If you created a misaligned AI, then it would be “thinking back”, and you’d be in an adversarial position where security mindset is appropriate.
Cool, we agree on this point.
my point in that section is that the fundamental laws governing how AI training processes work are not “thinking back”. They’re not adversaries.
I think we agree here on the local point but disagree on its significance to the broader argument. [I’m not sure how much we agree: I think of training dynamics as ‘neutral’, but I also think of them as searching over program-space in order to find a program that performs well on a (loss function, training set) pair, and so you need to be reasoning about search. But I think we agree the training dynamics are not trying to trick you / be adversarial and instead are straightforwardly ‘trying’ to make Number Go Down.]
In my picture, we have the neutral training dynamics paired with the (loss function, training set) which creates the AI system, and whether the resulting AI system is adversarial or not depends mostly on the choice of (loss function, training set). It seems to me that we probably have a disagreement about how much of the space of (loss function, training set) leads to misaligned vs. aligned AI (if it hits ‘AI’ at all), where I think aligned AI is a narrow target to hit that most loss functions will miss, and hitting that narrow target requires security mindset.
To explain further, it’s not that the (loss function, training set) is thinking back at you on its own; it’s that the AI that’s created by training is thinking back at you. So before you decide to optimize X you need to check whether or not you actually want something that’s optimizing X, or if you need to optimize for Y instead.
So from my perspective it seems like you need security mindset in order to pick the right inputs to ML training to avoid getting misaligned models.
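The "neutral search over program-space" picture above can be made concrete with a toy sketch. Everything here is an illustrative assumption rather than anything from the thread: the four-entry `PROGRAMS` space, the two hand-made training sets, and the names `search`, `narrow_spec`, and `broad_spec` are all hypothetical. The search itself is identical in both runs and has no agenda beyond minimizing the loss; which program it returns depends only on the (loss function, training set) pair it is handed.

```python
from typing import Callable, Dict, List, Tuple

# Each input is (intended_feature, shortcut_feature); each example pairs it with a label.
Input = Tuple[int, int]
Example = Tuple[Input, int]
Program = Callable[[Input], int]

# Hypothetical, deliberately tiny "program space" for the search to choose from.
PROGRAMS: Dict[str, Program] = {
    "always_0": lambda x: 0,
    "always_1": lambda x: 1,
    "use_shortcut": lambda x: x[1],
    "use_intended": lambda x: x[0],
}


def zero_one_loss(program: Program, data: List[Example]) -> int:
    """Count the examples the program labels incorrectly."""
    return sum(program(x) != y for x, y in data)


def search(loss: Callable[[Program, List[Example]], int], data: List[Example]) -> str:
    """Neutral 'training dynamics': return the name of the lowest-loss program.

    Ties go to the first minimal program in PROGRAMS order.
    """
    return min(PROGRAMS, key=lambda name: loss(PROGRAMS[name], data))


# Made-up narrow specification: the labels happen to track the shortcut feature
# (the last example is labeled 1 even though the intended feature is 0).
narrow_spec: List[Example] = [((1, 1), 1), ((0, 0), 0), ((1, 1), 1), ((0, 1), 1)]

# Made-up broader specification: the shortcut is decorrelated from the label,
# so only the intended feature fits every example.
broad_spec: List[Example] = [((1, 0), 1), ((0, 1), 0), ((1, 1), 1), ((0, 0), 0)]

found_on_narrow = search(zero_one_loss, narrow_spec)
found_on_broad = search(zero_one_loss, broad_spec)

print(found_on_narrow)  # use_shortcut
print(found_on_broad)   # use_intended
```

In this sketch nothing about `search` changes between the two runs, which is the sense in which the dynamics are neutral. The toy says nothing about how large the "good" region of (loss function, training set) space is in practice, which is the actual point of disagreement.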
As commentary from an observer: this is distinct from the proposition “the minds created with those laws are not thinking back.”