Thanks for the response.

I’m still quite unconvinced, which of course you’d predict. Like, regarding 3:
“There’s no attractor basin where if you have half of human values, you’re under more pressure to acquire the other half.”
Sure there is: over the course of learning anything, you get better and better feedback from training as your mistakes become more fine-grained. If you acquire a “don’t lie” principle without also acquiring “but it’s ok to lie to Nazis,” then you’ll be punished, for instance. After you learn the more basic things, you’ll be pushed to acquire the less basic ones, so the reinforcement you get becomes more and more detailed. This is just like how an RL model learns to stumble forward before it learns to walk cleanly, or how LLMs learn associations before learning higher-order correlations.
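To make the “feedback gets more fine-grained” point concrete, here’s a toy sketch of my own (the situations, rewards, sampling rates, and learning rate are all invented for illustration, not taken from anyone’s actual training setup): a two-situation REINFORCE-style bandit where lying is punished in the common case and truth-telling is punished in the rare “persecutor at the door” case. The common rule is acquired almost immediately; the exception is reinforced more slowly, but the training signal keeps pushing until it’s acquired too.

```python
import numpy as np

rng = np.random.default_rng(0)

# Policy: an independent "lie" logit per situation.
# situation 0 = ordinary question (lying punished)
# situation 1 = persecutor asking where your friend is hiding (lying rewarded)
theta = np.zeros(2)

def p_lie(situation):
    return 1.0 / (1.0 + np.exp(-theta[situation]))

def reward(situation, lied):
    if situation == 0:
        return -1.0 if lied else 1.0   # basic rule: don't lie
    return 1.0 if lied else -1.0       # exception: do lie here

lr = 0.5
for _ in range(2000):
    s = 0 if rng.random() < 0.95 else 1          # the exception comes up rarely
    p = p_lie(s)
    lied = rng.random() < p
    # REINFORCE update: reward times grad of log-prob of the sampled action (Bernoulli policy).
    theta[s] += lr * reward(s, lied) * ((1.0 if lied else 0.0) - p)

print(f"P(lie | ordinary)   = {p_lie(0):.3f}")   # pushed toward 0 early on
print(f"P(lie | persecutor) = {p_lie(1):.3f}")   # pushed toward 1 as those cases accumulate
```

The only “basin” here is the reward function we wrote down; make the feedback finer-grained and the learned behavior gets finer-grained with it.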
There is no attractor basin in the world for ML apart from the actual mechanisms that create attractor basins for a thing! MIRI always talks as if there’s an abstract basin that rules things and gives us instrumental convergence, without reference to any particular training technique! But we control literally all the gradients in our training techniques. “Don’t hurl coffee across the kitchen at the human when they ask for it” sits in the same high-dimensional basin as “Don’t kill all humans when they ask for a cure for cancer.”
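As a deliberately trivial illustration of the “we control the gradients” point (my own sketch, using PyTorch autograd; the parameter names are invented stand-ins for whole behavioral dispositions): a parameter the loss never mentions receives no gradient at all, so there is no pressure on it from any basin, abstract or otherwise.

```python
import torch

# Two scalar parameters standing in for two behavioral dispositions.
w_trained   = torch.tensor(0.0, requires_grad=True)  # disposition the training signal mentions
w_untouched = torch.tensor(0.0, requires_grad=True)  # disposition the training signal never mentions

# The loss we wrote down only references w_trained.
loss = (w_trained - 1.0) ** 2
loss.backward()

print(w_trained.grad)    # tensor(-2.) -- gradient pressure exactly where we created it
print(w_untouched.grad)  # None -- no term in the loss, so no gradient and no pressure at all
```

Whatever attractor structure shows up during training is built out of terms like that loss, which we chose to write down.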
“In contrast, if you give AI high levels of capability in half the capabilities, it will tend to want all the rest of the capabilities too.”
ML doesn’t acquire wants over the space of training techniques that are used to give it capabilities; it acquires “wants” from reinforced behaviors within the space of training techniques. These reinforced behaviors can be literally as human-morality-sensitive as we’d like. If we don’t put it in a circumstance where a particular kind of coherence is rewarded, it just won’t get that kind of coherence; the ease with which we’ll be able to do this is of course underscored by how blind most ML systems are.