Algorithmic complexity is precisely analogous to difficulty-of-learning-to-predict, so saying “it’s not about learning to predict, it’s about algorithmic complexity” doesn’t make sense. One read of the original is: learning to respect commonsense moral side constraints is tricky[1], but AI systems will learn how to do it in the end. I’d be happy to call this read correct, and it is consistent with the observation that today’s AI systems do respect commonsense moral side constraints given straightforward requests, and that it took a few years to figure out how to do it. That read doesn’t really jibe with your commentary.
Your commentary seems to situate this post within a larger argument: teaching a system to “act” is different to teaching it to “predict” because in the former case a sufficiently capable learner’s behaviour can collapse to a pathological policy, whereas teaching a capable learner to predict does not risk such collapse. Thus “prediction” is distinguished from “algorithmic complexity”. Furthermore, commonsense moral side constraints are complex enough to risk such collapse when we train an “actor” but not a “predictor”. This seems confused.
First, all we need to turn a language model prediction into an action is a means of turning text into action, and we have many such means. So the distinction between text predictor and actor is suspect. We could consider an alternative knows/cares distinction: does a system act properly when properly incentivised (“knows”) vs does it act properly when presented with whatever context we are practically able to give it (“cares”)? Language models usually act properly given simple prompts, so in this sense they “care”. Rejecting evidence from language models therefore does not seem well justified.
Second, there’s no need to claim that commonsense moral side constraints in particular are so hard that trying to develop AI systems that respect them leads to policy collapse. It need only be the case that one of the things we try to teach them to do leads to policy collapse. Teaching values is not particularly notable among all the things we might want AI systems to do; it certainly does not seem to be among the hardest. Focussing on values makes the argument unnecessarily weak.
Third, algorithmic complexity is measured with respect to a prior. The post invokes (but does not justify) an “English speaking evil genie” prior. I don’t think anyone thinks this is a serious prior for reasoning about advanced AI system behaviour. Yet the post is (according to your commentary, if not the post itself) making a quantitative point, namely that values are sufficiently complex to induce policy collapse, while measuring this quantity using a nonsense prior. If the quantitative argument was indeed the original point, it is mystifying why a nonsense prior was chosen to make it, and also why no effort was made to justify the prior.
My question is: why, exactly, is the following statement true?
Second, there’s no need to claim that commonsense moral side constraints in particular are so hard that trying to develop AI systems that respect them leads to policy collapse. It need only be the case that one of the things we try to teach them to do leads to policy collapse.
Here’s a basic model of policy collapse: suppose there exist pathological policies of low prior probability (/high algorithmic complexity) such that they play the training game when it is strategically wise to do so, and when they get a good opportunity they defect in order to pursue some unknown aim.
Because they play the training game, a wide variety of training objectives will collapse to one of these policies if the system in training starts exploring policies of sufficiently high algorithmic complexity. So, according to this crude model, there’s a complexity bound: stay under it and you’re fine, go over it and you get pathological behaviour. Roughly, whatever desired behaviour requires the most algorithmically complex policy is the one that is most pertinent for assessing policy collapse risk (because that’s the one that contributes most of the algorithmic complexity, and so it gives your first order estimate of whether or not you’re crossing the collapse threshold). So, which desired behaviour requires the most complex policy: is it, for example, respecting commonsense moral constraints, or is it inventing molecular nanotechnology?
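To make the crude model concrete, here is a toy sketch of the threshold logic. Everything in it is an assumption for illustration: the `Behaviour` class, the function names, the threshold and the bit counts are all made up, and nobody knows how to measure these quantities in practice. The only thing it is meant to show is that, under this model, the first-order risk estimate is driven by the single most complex desired behaviour, not by any particular behaviour such as values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Behaviour:
    name: str
    # Hypothetical complexity (in bits) of the simplest policy that
    # achieves this behaviour. These numbers are invented for illustration.
    complexity_bits: float


def binding_behaviour(behaviours):
    """Return the desired behaviour that needs the most complex policy."""
    return max(behaviours, key=lambda b: b.complexity_bits)


def crosses_threshold(behaviours, collapse_threshold_bits):
    """First-order check: does the most demanding behaviour exceed the bound?"""
    return binding_behaviour(behaviours).complexity_bits > collapse_threshold_bits


# Made-up numbers; only their ordering matters for the argument.
behaviours = [
    Behaviour("respect commonsense moral constraints", 40.0),
    Behaviour("invent molecular nanotechnology", 90.0),
]

print(binding_behaviour(behaviours).name)
print(crosses_threshold(behaviours, collapse_threshold_bits=60.0))
```

With these invented numbers the nanotechnology task is the binding one, and dropping the moral constraints from the list would not change whether the threshold is crossed. Reverse the ordering of the two numbers and the conclusion reverses too, which is exactly the question at issue.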
Tangentially, the policy collapse theory does not predict outcomes that look anything like malicious compliance. It predicts that, if you’re in a position of power over the AI system, your mother is saved exactly as you want her to be. If you are not in such a position then your mother is not saved at all and you get a nanobot war instead or something. That is, if you do run afoul of policy collapse, it doesn’t matter if you want your system to pursue simple or complex goals, you’re up shit creek either way.
[1] The text proposes full value alignment as a solution to the commonsense side constraints problem, but this turned out to be stronger than necessary.