I’m playing around with the idea of combining value-learning with reduced impact, if that’s possible—to see whether we can use reduced impact to ensure the AI doesn’t try to manipulate humans in a dodgy fashion, so that human feedback can then be safely used to calibrate the AI...
I would say that the current model of value-learning is already safe for this. As I read Dewey’s paper, a value-learning agent doesn’t care about optimizing its own reward/reinforcement feedback, and in fact doesn’t even care about its own future utility function, even when able to predict that it will change. It cares about learning well from the feedback given to it and following the utility function that most probably models the feedback history.
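To make that decision rule concrete, here is a toy sketch of the idea as I read it, not Dewey's actual formalism. Everything named in it is a hypothetical assumption for illustration: the two candidate utility functions, the two actions, the noisy-approval feedback model and its `noise` parameter. It also collapses the problem to a single step, and it weights every candidate utility function by its posterior probability rather than following only the single most probable one.

```python
from typing import Dict, List, Tuple

Action = str
Feedback = Tuple[Action, bool]  # (action taken, did the human approve?)

# Hypothetical candidate utility functions, each mapping actions to utility.
CANDIDATES: Dict[str, Dict[Action, float]] = {
    "likes_cake": {"bake_cake": 1.0, "do_nothing": 0.0},
    "likes_quiet": {"bake_cake": 0.0, "do_nothing": 1.0},
}
ACTIONS: List[Action] = ["bake_cake", "do_nothing"]


def likelihood(utility: Dict[Action, float], fb: Feedback, noise: float = 0.1) -> float:
    """Assumed feedback model: the human approves high-utility actions,
    but gives the 'wrong' signal with probability `noise`."""
    action, approved = fb
    p_approve = 1.0 - noise if utility[action] > 0.5 else noise
    return p_approve if approved else 1.0 - p_approve


def posterior(history: List[Feedback], prior: Dict[str, float]) -> Dict[str, float]:
    """Bayesian update over candidate utility functions given the feedback history."""
    weights = {}
    for name, utility in CANDIDATES.items():
        w = prior[name]
        for fb in history:
            w *= likelihood(utility, fb)
        weights[name] = w
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}


def choose_action(history: List[Feedback], prior: Dict[str, float]) -> Action:
    """Pick the action with the highest utility in expectation over the posterior.
    Note that nothing here rewards the agent for obtaining approval as such."""
    post = posterior(history, prior)

    def expected_utility(action: Action) -> float:
        return sum(post[name] * CANDIDATES[name][action] for name in CANDIDATES)

    return max(ACTIONS, key=expected_utility)


uniform_prior = {"likes_cake": 0.5, "likes_quiet": 0.5}
history = [("bake_cake", True), ("do_nothing", False)]
print(posterior(history, uniform_prior))
print(choose_action(history, uniform_prior))
```

The point of the sketch is only that feedback enters as evidence about which utility function is correct, not as a quantity to be maximized. Whether that property survives in a multi-step setting, where the agent's actions can influence the feedback it later receives, is exactly the part that needs more mathematics.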
I wouldn’t call that a truly and surely Friendly AI, but I would call it a halfway reasonably safe AI that could potentially be used as a seed. Turn on a value-learner and use feedback to train it until you can actively teach it about things like reduced-impact AI or CEV, then have it spit out some constructive mathematics regarding those and turn itself off. (Or you turn it off: since you control its training feedback, you can train it to prefer never acting against a human who tries to turn it off.)
But the mathematics of value-learners needs a bunch of work. I actually emailed Dewey, but he hasn’t gotten back to me. I’ll just have to slave over the paper more myself.
I would say that the current model of value-learning is already safe for this.
I found a “cake-or-death” problem with the initial formulation (http://lesswrong.com/lw/f3v/cake_or_death/). If such problems can be found with a formulation that looked pretty solid initially, then I’m certainly not confident we can say the current model is safe...
Safe enough to do mathematics on, surely. I wouldn’t declare anything safe to build unless someone hands me a hard hat and a one-time portal to a parallel universe.
You are wise, my child ;-)