I do wish to note that value-learning models, with further work, could at least get us within shouting distance of Coherent Volition: even if we can’t expect them to extrapolate our desires for us, we can expect them to follow the values we signal ourselves as having (i.e., on some level, to more or less follow human orders, as potentially unsafe as that may be in itself).
But more broadly, why should Specific Purpose AI want to do anything other than its specific assigned job, as humans understand that job?
> But more broadly, why should Specific Purpose AI want to do anything other than its specific assigned job,...
No reason it would want to do anything else.
> ...as humans understand that job?
Ah, that’s the rub: the AI will do what we say, not what we want. That’s the whole challenge.
> I do wish to note that value-learning models...
I’m playing around with the idea of combining value-learning with reduced impact, if that’s possible—to see whether we can use reduced impact to ensure the AI doesn’t try to manipulate humans in a dodgy fashion, so that human feedback can then be safely used to calibrate the AI...
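One way to write down the combination being gestured at here (my formalization, not the commenter’s; the penalty form and the symbols h, R, and λ are assumptions for illustration): pick the action that maximizes expected utility under the learned posterior over utility functions, minus a reduced-impact penalty, with the penalty weighted heavily enough that manipulating the human’s feedback channel counts as high impact:

$$
a^{*} \;=\; \arg\max_{a}\; \mathbb{E}_{U \sim P(U \mid h)}\!\left[\, U(a) \,\right] \;-\; \lambda \, R(a)
$$

where h is the feedback history so far, R(a) is some measure of how much action a changes the world relative to a null action, and λ sets how strongly impact is penalized.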
> I’m playing around with the idea of combining value-learning with reduced impact, if that’s possible—to see whether we can use reduced impact to ensure the AI doesn’t try to manipulate humans in a dodgy fashion, so that human feedback can then be safely used to calibrate the AI...
I would say that the current model of value-learning is already safe for this. As I read Dewey’s paper (“Learning What to Value”), a value-learning agent doesn’t care about optimizing its own reward/reinforcement feedback, and in fact doesn’t even care about its own future utility function, even when it can predict that it will change. It cares about learning well from the feedback given to it and following the utility function that most probably models the feedback history.
I wouldn’t call that a truly and surely Friendly AI, but I would call it a halfway reasonably safe AI that could potentially be used as a seed. Turn on a value-learner and use feedback to train it until you can actively teach it about things like reduced-impact AI or CEV, then have it spit out some constructive mathematics regarding those and turn itself off (or you turn it off; since you control its training feedback, you can train it to prefer never acting against a human who tries to turn it off).
But the mathematics of value-learners needs a bunch of work. I actually emailed Dewey, but he hasn’t gotten back to me. I’ll just have to slave over the paper more myself.
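To make the distinction in this exchange concrete, here is a minimal toy sketch (mine, not Dewey’s formalism; the domain, the candidate utility functions, and the likelihood model are all invented for illustration). The agent keeps a posterior over candidate utility functions given the feedback history and picks the action with the highest expected utility under that posterior; nothing in its objective refers to the feedback signal itself, which is the property described above.

```python
# Toy sketch of a value learner. Everything here (domain, candidate utility
# functions, likelihood model) is invented for illustration; it is NOT Dewey's
# exact formalism.

# Two hypothetical candidate utility functions over three hypothetical actions.
CANDIDATE_UTILITIES = {
    "u_tidy":  {"clean_room": 1.0, "seize_feedback_channel": 0.0, "do_nothing": 0.2},
    "u_broad": {"clean_room": 0.6, "seize_feedback_channel": 0.0, "do_nothing": 0.1},
}

def posterior_over_utilities(feedback_history, prior):
    """Crude Bayesian update: upweight candidate utility functions that rate the
    human-approved actions highly. The likelihood model is purely illustrative."""
    posterior = dict(prior)
    for action, approved in feedback_history:
        for name, u in CANDIDATE_UTILITIES.items():
            likelihood = u[action] if approved else 1.0 - u[action]
            posterior[name] *= max(likelihood, 1e-6)
    total = sum(posterior.values())
    return {name: weight / total for name, weight in posterior.items()}

def value_learner_action(feedback_history, prior):
    """Pick argmax_a of sum_U P(U | feedback) * U(a).

    The objective never mentions the feedback channel itself, so the agent gains
    nothing by seizing it; it only cares which utility functions the feedback
    makes probable, and then acts well by that mixture."""
    posterior = posterior_over_utilities(feedback_history, prior)
    actions = next(iter(CANDIDATE_UTILITIES.values()))  # iterate action names
    return max(
        actions,
        key=lambda a: sum(p * CANDIDATE_UTILITIES[name][a] for name, p in posterior.items()),
    )

if __name__ == "__main__":
    prior = {"u_tidy": 0.5, "u_broad": 0.5}
    history = [("clean_room", True), ("do_nothing", False)]
    print(value_learner_action(history, prior))  # -> clean_room
```

The contrast with a reward maximizer is just the argmax target: a reward maximizer would score actions by their predicted effect on the feedback signal itself, which is exactly what would make something like "seize_feedback_channel" attractive to it and irrelevant to the value learner.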
> I would say that the current model of value-learning is already safe for this.
I found a “cake-or-death” problem with the initial formulation (http://lesswrong.com/lw/f3v/cake_or_death/). If such problems can be found with a formulation that looked pretty solid initially, then I’m certainly not confident we can say the current model is safe...
Safe enough to do mathematics on, surely. I wouldn’t declare anything safe to build unless someone hands me a hard hat and a one-time portal to a parallel universe.
You are wise, my child ;-)