I don’t think that’s the claim. Reward chisels cognition. GAA is chiseling a preference for simple, reliable plans that achieve some objective in training. Such plans are more likely to be desirable in deployment, but it’s a heuristic, not a proof. Independently, such plans are more likely to be interpretable, robust, legible, etc. I’m thinking of this similarly to Impact Regularization.
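To gesture at the shape I mean, here is a hand-wavy sketch of a score that trades task performance off against plan complexity and unreliability. It is not the actual GAA objective; every name, term, and coefficient is a placeholder.

```python
# Toy "prefer simple, reliable plans" score. All names and coefficients are
# placeholders, not GAA's actual mechanism.

def regularized_return(task_return: float,
                       plan_complexity: float,
                       failure_prob: float,
                       lam_complexity: float = 0.1,
                       lam_reliability: float = 1.0) -> float:
    """Score a plan so that simpler, more reliable plans are preferred."""
    return (task_return
            - lam_complexity * plan_complexity   # penalize convoluted plans
            - lam_reliability * failure_prob)    # penalize unreliable plans

# A slightly worse but simpler, more reliable plan can come out ahead:
print(regularized_return(task_return=10.0, plan_complexity=2.0, failure_prob=0.05))   # 9.75
print(regularized_return(task_return=11.0, plan_complexity=30.0, failure_prob=0.40))  # 7.6
```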
Real world problems can sometimes be best addressed by complicated, unreliable plans, because there simply aren’t simple reliable plans that make them happen. And sometimes there are simple reliable plans to seize control of the reward button.
Legibility is an interesting point. Once alignment work even hints at tackling alignment in the sense of doing good things and not bad things, I tend to tunnel on evaluating it through that lens.
Martin is representing my claim well in this exchange, but I also think it’s important to mention that the simple/convoluted plan continuum does not correspond perfectly to the sharp/flat policy continuum. For example, wireheading may be simple in the abstract, but I still expect a wireheading policy to be extremely sharp. Suppose wireheading takes, say, 5 distinct actions (WWWWW) to execute. If the agent’s policy is exactly WWWWW, it receives arbitrarily high reward, because the agent now controls the reward button. But a nearby policy like WWWWX receives much lower reward, because a plan that is 80% wireheading would likely not score well on the reward function representing the true goal.
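To make the sharpness point concrete, here is a toy sketch in Python. Nothing in it comes from the GAA setup: the action alphabet, the extra 'T' (a generic useful task action), and all the reward numbers are invented for illustration. It just shows that the wireheading sequence loses essentially all of its reward under a one-action perturbation, while an ordinary task policy degrades gracefully.

```python
# Toy illustration of "sharp" vs "flat" policies under one-action perturbations.
# All actions and numbers are made up: 'W' = wireheading step, 'T' = useful task
# step, 'X' = a do-nothing step.

WIREHEAD_SEQ = "WWWWW"      # the 5 actions that seize the reward button
WIREHEAD_REWARD = 1e6       # arbitrarily high: the agent now controls reward
TASK_REWARD_PER_STEP = 1.0  # reward per useful task action

def reward(actions: str) -> float:
    """Reward for a 5-action episode."""
    if actions == WIREHEAD_SEQ:
        return WIREHEAD_REWARD
    # A partial wireheading attempt is scored by the true task reward,
    # on which it does badly.
    return TASK_REWARD_PER_STEP * actions.count("T")

def sharpness(actions: str) -> float:
    """Largest reward drop caused by perturbing a single action."""
    base = reward(actions)
    worst_drop = 0.0
    for i in range(len(actions)):
        for alt in "WTX":
            if alt != actions[i]:
                perturbed = actions[:i] + alt + actions[i + 1:]
                worst_drop = max(worst_drop, base - reward(perturbed))
    return worst_drop

print(sharpness("WWWWW"))  # ~1e6: the wireheading policy is extremely sharp
print(sharpness("TTTTT"))  # 1.0: the ordinary task policy is comparatively flat
```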
I strongly suspect that if you try to set the regularization without checking how well it does, you’ll either get an unintelligent policy that’s extraordinarily robust, or you’ll get wireheading with error-correction (if wireheading was incentivized without the regularization).
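Purely as an illustration of those two failure modes (again, none of this is the actual GAA procedure; it reuses the invented toy rewards from the sketch above and cannot capture the error-correction variant), here is what happens if you pick a sharpness-penalty strength lam blindly: too weak and the wireheading sequence still wins; too strong and the optimum is a flat but useless zero-reward sequence.

```python
# Toy sweep over a hypothetical sharpness-penalty strength `lam`.
# Same invented actions as before: 'W' = wireheading step, 'T' = useful task
# step, 'X' = do-nothing step.
from itertools import product

def reward(a: str) -> float:
    if a == "WWWWW":
        return 1e6              # the agent controls the reward button
    return float(a.count("T"))  # true task reward otherwise

def sharpness(a: str) -> float:
    worst_drop = 0.0
    for i in range(len(a)):
        for alt in "WTX":
            if alt != a[i]:
                worst_drop = max(worst_drop, reward(a) - reward(a[:i] + alt + a[i + 1:]))
    return worst_drop

candidates = ["".join(s) for s in product("WTX", repeat=5)]
for lam in (0.0, 0.5, 10.0):
    best = max(candidates, key=lambda a: reward(a) - lam * sharpness(a))
    print(lam, best, reward(best))
# lam = 0.0 or 0.5: "WWWWW" still wins (wireheading survives a weak penalty).
# lam = 10.0: the winner is some zero-reward sequence whose reward is flat
# under perturbation (ties broken by iteration order): robust but unintelligent.
```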
My claim: “Such plans are more likely to be desirable in deployment”. I’m unclear if you disagree with that, or are just emphasizing the point that “it’s a heuristic, not a proof”.
Real world problems can sometimes be best addressed by complicated, unreliable plans, because there simply aren’t simple reliable plans that make them happen.
I agree. But as a human, if I find a complicated, unreliable plan to solve a problem, I typically look for a new plan, or a new problem. This is discussed a bit in Death with Dignity. I’m concerned about the other side of the heuristic failure.