How does a parochially aligned AI paperclip? My preferences over the task T definitely include “don’t paperclip the universe”. Is the idea that the AI incorrectly infers my preferences about the task T, and then tries to optimize that?
It doesn’t *necessarily*. But it sounds like what you’re thinking of here is some form of “sufficient alignment”.
The point is that you could give an AI a reward function that leads it to be a good personal assistant program, so long as it remains restricted to doing the sorts of things we expect a personal assistant program to do, and isn’t doing things like manipulating the stock market when you ask it to invest some money for you (unless that’s what you expect from a personal assistant). If it knows it could do things like that, but doesn’t want to, then it’s more like something sufficiently aligned. If it doesn’t do such things because it doesn’t realize they are possibilities (yet), or because it hasn’t figured out a good way to use its actuators to have that kind of effect (yet), e.g. because you’ve done a good job boxing it, then it’s more like “parochially aligned”.
Another way of putting it: A parochially aligned AI (for task T) needs to understand task T, but doesn’t need to have common sense “background values” like “don’t kill anyone”.
Narrow AIs might require parochial alignment techniques in order to learn to perform tasks that we don’t know how to write a good reward function for. And we might try to combine parochial alignment with capability control in order to get something like a genie without having to teach it background values. Whether and when that would be a good idea is unclear at the moment.
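To make the “parochial alignment + capability control” combination more concrete, here is a minimal toy sketch (not from the original discussion, and all names in it are hypothetical): the reward function only encodes preferences over the narrow task domain, and safety comes from boxing, i.e. restricting which actions the agent can select at all.

```python
from typing import Callable, Iterable

# Capability control: the agent literally cannot select actions outside
# this whitelist, so the reward function never has to say "don't do X"
# for anything outside the task domain. (Hypothetical action names.)
ALLOWED_ACTIONS = {"buy_index_fund", "sell_index_fund", "hold"}

def task_reward(action: str, portfolio_return: float) -> float:
    """Parochial reward: only scores performance on the investing task.

    It says nothing about manipulating markets, acquiring resources,
    etc. -- those are ruled out by the action whitelist (the "box"),
    not by the reward function.
    """
    return portfolio_return

def choose_action(candidates: Iterable[str],
                  score: Callable[[str], float]) -> str:
    # The box: filter before optimizing. If a "clever" out-of-scope
    # action like "manipulate_stock_market" is proposed, it is simply
    # unavailable rather than being penalized by the reward.
    legal = [a for a in candidates if a in ALLOWED_ACTIONS]
    return max(legal, key=score)

# Example: an out-of-scope action gets proposed, and the box drops it.
proposed = ["buy_index_fund", "manipulate_stock_market", "hold"]
returns = {"buy_index_fund": 0.07, "hold": 0.0}
best = choose_action(proposed, lambda a: task_reward(a, returns.get(a, 0.0)))
print(best)  # -> "buy_index_fund"
```

This is only meant to illustrate the division of labor: the reward covers the task domain, while capability control (the whitelist) stands in for the background values the AI was never taught.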
Got it, thanks. The part of the definition that I didn’t grasp was “H’s preferences over the intended task domain”.