There is a difference between an AI that does X, and an AI that has a goal to do X. Not sure what architectures you’re referring to, but I suspect you may be conflating the goals of the AI’s constructor (or construction process) with the goals of the AI.
It’s almost certainly true that a random program from the set of programs that do X is more dangerous than a random program from some larger set; however, I claim that the fraction of dangerous programs is still pretty small.
Now, there are some obviously dangerous sets of programs, and it’s possible that humans will pick an AI from such a set. In other news, if you shoot yourself in the foot, you get hurt.
I think habryka is thinking about modern machine learning architectures that are studied by AI researchers. AI research is in fact a distinct subfield from programming-in-general, because AI programs are in fact a distinct subgroup from programs-in-general.
I’m very much aware of what architectures machine learning studies, and indeed it (usually) isn’t generic programs in the sense of raw instruction lists (although any other Turing-complete set of programs can be perfectly well called “programs-in-general”—instruction lists are in no way unique).
The problem is that everyone’s favorite architecture—the plain neural network—does not contain a utility function. It is built using a utility/cost function, but that’s very different.
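To make that distinction concrete, here’s a minimal sketch (a toy linear model with made-up data, not any particular architecture): the cost function appears only inside the training loop, and the artifact that comes out is just weights plus a forward pass.

```python
import numpy as np

# Toy illustration (hypothetical setup): the cost function lives in the
# training loop; the trained model is just a function of its weights.

def forward(w, x):
    # The deployed "AI": a pure input -> output mapping, no utility inside.
    return x @ w

def cost(w, xs, ys):
    # The utility/cost function, used by the *builder*, not by the model.
    return np.mean((forward(w, xs) - ys) ** 2)

def train(xs, ys, steps=500, lr=0.1):
    w = np.zeros((xs.shape[1], 1))
    for _ in range(steps):
        # Gradient of cost(w, xs, ys) (mean squared error) w.r.t. the weights.
        grad = 2 * xs.T @ (forward(w, xs) - ys) / len(xs)
        w -= lr * grad
    return w  # only the weights are shipped; the cost function is discarded

xs = np.array([[0.0], [1.0], [2.0], [3.0]])
ys = 2.0 * xs  # made-up target: y = 2x
w = train(xs, ys)
print(forward(w, xs))  # the result predicts; it doesn't "want" anything
```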
This doesn’t make the AI any safer, given that the whole system (neural network + neural network builder¹) is still a misaligned AI. Real-life examples happen all the time.
¹: I don’t know if there is a standard term for this.
I never said anything about (mis)alignment. Of course using stupid training rewards will result in stupid programs. But the problem with those programs is not that they are instrumentally convergent (see the title of my post).
The training program, which has the utility function, does exhibit convergence, but the resulting agent, which has no utilities, does not usually exhibit it. E.g. if the training environment involves a scenario where the agent is turned off (which results in 0 utility), then the training program will certainly build an agent that resists being turned off. But if the training environment does not include that scenario, then the resulting agent is unlikely to resist it.
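A toy illustration of that (everything here is made up: the states, actions, and the brute-force “training program”): the optimizer only selects for behavior in scenarios that actually occur during training, so the shutdown response is only shaped if the shutdown scenario is included.

```python
import itertools

# Hypothetical toy setup: two states, two actions, episodes of ten steps.
STATES = ["working", "shutdown_button_pressed"]
ACTIONS = ["comply", "resist"]

def episode_return(policy, include_shutdown_scenario):
    # Reward: 1 per step of work; complying with shutdown ends the episode.
    total = 0
    for step in range(10):
        state = "working"
        if include_shutdown_scenario and step == 5:
            state = "shutdown_button_pressed"
        if state == "shutdown_button_pressed" and policy[state] == "comply":
            return total  # turned off: no further reward
        total += 1
    return total

def train(include_shutdown_scenario):
    # The "training program": exhaustive search for a return-maximizing policy.
    best_policy, best_return = None, float("-inf")
    for actions in itertools.product(ACTIONS, repeat=len(STATES)):
        policy = dict(zip(STATES, actions))
        ret = episode_return(policy, include_shutdown_scenario)
        if ret > best_return:  # ties keep the first candidate found
            best_policy, best_return = policy, ret
    return best_policy

# Shutdown scenario in training -> the selected policy resists being turned off.
print(train(include_shutdown_scenario=True))
# Scenario absent -> the shutdown action is never selected for; it comes out
# arbitrary (whatever the tie-break happens to pick), not convergently resistant.
print(train(include_shutdown_scenario=False))
```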
The problem is that “the training environment does not include that scenario” is far from guaranteed.
Yes, but what are you arguing against? At no point did I claim that it is impossible for the training program to build a convergent agent (indeed, if you search for an agent that exhibits instrumental convergence, then you might find one). Nor did I ever claim that all agents are safe—I only meant that they are safer than the hypothesis of instrumental convergence would imply.
Also, you worry about the agent learning convergent behaviors by accident, but that’s a little silly when you know that the agent often fails to learn what you want it to learn. E.g. if you explicitly include the scenario of the agent being turned off, and you want the agent to resist it, you know it’s likely that the agent will overfit and resist only in that single scenario. But then, when you don’t intentionally include any such scenario, and don’t want the agent to resist, it seems likely to you that the agent will correctly learn to resist anyway? Yes, it’s strictly possible that you will unintentionally train an agent that robustly resists being turned off. But the odds are in our favour.
Let’s go back to the OP. I’m claiming that not all intelligent agents would exhibit instrumental convergence, and that, in fact, the majority wouldn’t. What part of that exactly do you disagree with? Maybe we actually agree?