Here’s a basic model of policy collapse: suppose there exist pathological policies of low prior probability (/high algorithmic complexity) such that they play the training game when it is strategically wise to do so, and when they get a good opportunity they defect in order to pursue some unknown aim.
Because they play the training game, a wide variety of training objectives will collapse to one of these policies if the system in training starts exploring policies of sufficiently high algorithmic complexity. So, according to this crude model, there’s a complexity bound: stay under it and you’re fine, go over it and you get pathological behaviour. Roughly, whatever desired behaviour requires the most algorithmically complex policy is the one that is most pertinent for assessing policy collapse risk (because that’s the one that contributes most of the algorithmic complexity, and so it gives your first-order estimate of whether or not you’re crossing the collapse threshold). So, which desired behaviour requires the most complex policy: is it, for example, respecting commonsense moral constraints, or is it inventing molecular nanotechnology?
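To make the threshold logic concrete, here is a minimal toy sketch of this crude model, not anything the theory formally commits to. My assumptions for illustration: policy prior mass decays like 2^(-complexity), pathological training-game policies only exist above some hypothetical complexity floor `K_PATH`, and any policy at least as complex as the desired behaviour requires can fit the training objective (pathological ones fit it by construction, since they play the training game).

```python
# Toy sketch of the "complexity bound" intuition. All numbers are hypothetical;
# this is an illustration of the threshold logic, not a real Solomonoff prior.

K_PATH = 40  # assumed minimum complexity of a pathological training-game policy


def prior(k: int) -> float:
    """Crude prior: mass ~ 2^-k per unit of algorithmic complexity."""
    return 2.0 ** -k


def collapse_risk(k_desired: int, k_max_explored: int) -> float:
    """Share of training-compatible prior mass held by pathological policies.

    k_desired: complexity the most demanding desired behaviour requires.
    k_max_explored: highest-complexity policy the training process explores.
    """
    # Benign policies that fit the objective: from the required complexity upward.
    benign = sum(prior(k) for k in range(k_desired, k_max_explored + 1))
    # Pathological policies that fit (they play the training game): from K_PATH upward.
    pathological = sum(
        prior(k) for k in range(max(K_PATH, k_desired), k_max_explored + 1)
    )
    total = benign + pathological
    return pathological / total if total > 0 else 0.0


# Stay under the bound: desired behaviour is simple, exploration never reaches K_PATH.
print(collapse_risk(k_desired=10, k_max_explored=30))  # 0.0: no pathological mass in play
# Cross the bound: the most complex desired behaviour pushes exploration past K_PATH.
print(collapse_risk(k_desired=45, k_max_explored=80))  # 0.5: pathological policies compete
```

The point of the sketch is just that the risk estimate is driven by the single most complex behaviour you demand: raise `k_desired` past the pathological floor and the collapse term switches on, regardless of how simple everything else you want is.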
Tangentially, the policy collapse theory does not predict outcomes that look anything like malicious compliance. It predicts that, if you’re in a position of power over the AI system, your mother is saved exactly as you want her to be. If you are not in such a position, then your mother is not saved at all and you get a nanobot war instead or something. That is, if you do run afoul of policy collapse, it doesn’t matter whether you want your system to pursue simple or complex goals; you’re up shit creek either way.