Yes, but what are you arguing against? At no point did I claim that it is impossible for the training process to produce a convergent agent (indeed, if you deliberately search for an agent that exhibits instrumental convergence, you might well find one). Nor did I claim that all agents are safe; I only claimed that they are safer than the hypothesis of instrumental convergence would imply.
Also, you worry about the agent learning convergent behaviors by accident, but that's a little silly when you know how often the agent fails to learn what you want it to learn. For example, if you explicitly include the scenario of the agent being turned off in training, and you want the agent to resist, you know it's likely that the agent will overfit and resist only in that specific scenario. But when you don't intentionally include any such scenario, and don't want the agent to resist, it somehow seems likely to you that the agent will robustly learn to resist anyway? Yes, it's strictly possible that you will unintentionally train an agent that robustly resists being turned off. But the odds are in our favour.
Let’s go back to the OP. I’m claiming that not all intelligent agents would exhibit instrumental convergence, and that, in fact, the majority wouldn’t. Which part of that, exactly, do you disagree with? Maybe we actually agree?