Thanks for the very thoughtful comment, Rohin. I was on retreat last week after I published the article, and upon returning to computer usage I was delighted by the engagement from you and others.
Generality of intelligence: The generality of O’s intelligence is a function of the number and diversity of tasks T that it can solve, as well as its performance on those tasks.
I like this.
We’ll presumably need to give O some information about the goal / target configuration set for each task. We could say that a robot capable of moving a vase around is a little bit general, since we can have it solve the tasks of placing the vase at many different locations by inputting some latitude/longitude into an appropriate memory location. But this means we’re actually pasting in a different object O for each task T: each of the objects differs in the memory locations into which we’re pasting the latitude/longitude. It might be helpful to think of an “agent schema” function that maps goals to objects: we take the goal part of the task, compute the object O for that goal, and then paste this object into the environment.
It’s also important that O be able to solve the task for a reasonably broad range of environments.
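To make the “agent schema” idea a bit more concrete, here is a minimal sketch. Everything in it (Goal, VaseRobot, agent_schema) is a hypothetical name of mine, just for illustration: the schema maps a goal specification to a concrete object O by writing the goal into the object’s memory, and the generality lives in the schema rather than in any single instantiated robot.

```python
from dataclasses import dataclass


@dataclass
class Goal:
    """A task's target configuration set, summarised here as a target location."""
    latitude: float
    longitude: float


@dataclass
class VaseRobot:
    """A concrete object O: a robot with its target written into memory."""
    target_latitude: float
    target_longitude: float

    def act(self, environment):
        # Move the vase toward (target_latitude, target_longitude).
        # Perception and control are omitted in this sketch.
        ...


def agent_schema(goal: Goal) -> VaseRobot:
    """Map a goal to the object O that pursues it.

    "Pasting in" the goal amounts to writing the coordinates into the
    robot's memory, so each task T gets a slightly different object O.
    """
    return VaseRobot(goal.latitude, goal.longitude)


# Generality comes from the range of goals the schema accepts, not from any
# one robot: each call produces a different object O for a different task T.
robot_a = agent_schema(Goal(latitude=51.75, longitude=-1.25))
robot_b = agent_schema(Goal(latitude=48.85, longitude=2.35))
```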
Inner alignment
Perhaps we could look at it this way: take a system containing a human that is trying to get something done. This is presumably an optimizing system, since humans often robustly move their environment towards some desired target configuration set. Then an inner-aligned AI is an object O such that adding it to this environment does not change the target configuration set, but does change the speed and/or robustness of convergence to that target configuration set.
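A minimal sketch of that criterion, assuming we can summarise any system (human-only or human-plus-AI) by its target configuration set and by measures of how quickly and robustly it converges there. The type and field names below are mine, not anything from the post:

```python
from dataclasses import dataclass
from typing import FrozenSet


@dataclass(frozen=True)
class OptimizingSystem:
    """An outside-view summary of a system: where it tends, and how strongly."""
    target_configurations: FrozenSet[str]  # the target configuration set
    convergence_speed: float  # e.g. 1 / expected steps to reach the target set
    robustness: float  # e.g. how large a perturbation it can recover from


def is_inner_aligned(human_only: OptimizingSystem,
                     human_plus_ai: OptimizingSystem) -> bool:
    """Inner-aligned in this outside-view sense: adding the AI leaves the
    target configuration set unchanged while improving the speed and/or
    robustness of convergence toward it."""
    same_target = (human_plus_ai.target_configurations
                   == human_only.target_configurations)
    improves = (human_plus_ai.convergence_speed > human_only.convergence_speed
                or human_plus_ai.robustness > human_only.robustness)
    return same_target and improves
```

Of course, actually computing these summaries for the combined human-plus-AI system is where all the real difficulty hides; the sketch only pins down what we would check if we could.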
Intent alignment
Yup, it’s very difficult to say much about intentions using the pure outside-view approach of this framework. Perhaps we could say that an intent-aligned AI is an inner-aligned AI modulo less robustness. Or perhaps we could say that an intent-aligned AI is an AI that would achieve the goal in a large set of benign environments, but might not achieve it in the presence of unlikely mistakes, unlikely environmental conditions, or other powerful basins of attraction.
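One way to cash out that weaker notion, again only as a sketch with made-up names: score the AI over a set of benign environments only, rather than over every perturbation an inner-aligned system would be expected to survive.

```python
from typing import Callable, Iterable

# An environment is whatever we can run the AI in; achieves_goal is some
# hypothetical check of whether the target configuration set was reached.
Environment = dict
AchievesGoal = Callable[[Environment], bool]


def is_intent_aligned(achieves_goal: AchievesGoal,
                      benign_environments: Iterable[Environment],
                      required_success_rate: float = 0.95) -> bool:
    """Intent alignment as "inner alignment modulo robustness": the AI reaches
    the goal across a broad set of benign environments, with no guarantee about
    unlikely mistakes, rare conditions, or competing basins of attraction."""
    environments = list(benign_environments)
    if not environments:
        return False  # nothing to evaluate against
    successes = sum(achieves_goal(env) for env in environments)
    return successes / len(environments) >= required_success_rate
```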
But this doesn’t really get at the spirit of Paul’s idea, which I think is about really looking inside the AI and understanding its goals.
+1 to all of this.

We’ll presumably need to give O some information about the goal / target configuration set for each task.

I was imagining that the tasks can come equipped with some specification, but some sort of counterfactual also makes sense. This also gets around issues of the AI system not being appropriately “motivated”—e.g. I might be capable of performing the task “lock up puppies in cages”, but I wouldn’t do it, and so if you only look at my behavior you couldn’t say that I was capable of doing that task.

But this doesn’t really get at the spirit of Paul’s idea, which I think is about really looking inside the AI and understanding its goals.

+1 especially to this