I never said anything about (mis)alignment. Of course using stupid training rewards will result in stupid programs. But the problem with those programs is not that they are instrumentally convergent (see the title of my post).
The training program, which has the utility function, does exhibit convergence, but the resulting agent, which has no utilities, does not usually exhibit it. E.g., if the training environment involves a scenario where the agent is turned off (which results in 0 utility), then the training program will certainly build an agent that resists being turned off. But if the training environment does not include that scenario, then the resulting agent is unlikely to resist being turned off.
The problem is that “the training environment does not include that scenario” is far from guaranteed.
Yes, but what are you arguing against? At no point did I claim that it is impossible for the training program to build a convergent agent (indeed, if you search for an agent that exhibits instrumental convergence, then you might find one). Nor did I ever claim that all agents are safe—I only meant that they are safer than the hypothesis of instrumental convergence would imply.
Also, you worry about the agent learning convergent behaviors by accident, but that’s a little silly when you know that the agent often fails to learn what you want it to learn. E.g., if you do explicitly include the scenario of the agent being turned off, and you want the agent to resist it, you know it’s likely that the agent will overfit and resist only in that single scenario. But then, when you don’t intentionally include any such scenario, and don’t want the agent to resist, it seems likely to you that the agent will correctly learn to resist anyway? Yes, it’s strictly possible that you will unintentionally train an agent that robustly resists being turned off. But the odds are in our favour.
Let’s go back to the OP. I’m claiming that not all intelligent agents would exhibit instrumental convergence, and that, in fact, the majority wouldn’t. What part of that exactly do you disagree with? Maybe we actually agree?