Apart from the obvious problems with this approach (the AI can do a lot with the output channel other than what you wanted it to do, choosing an appropriate value for λ, etc.), I don’t see why this approach would be any easier to implement than CEV.
Once you know what a bounded approximation of an ideal algorithm is supposed to look like, how the bounded version is supposed to reason about its idealised version, and how to refer to arbitrary physical data, as the algorithm defined in your post assumes, then implementing CEV really doesn’t seem to be that hard of a problem.
So could you explain why you believe that implementing CEV would be so much harder than what you propose in your post?
> So could you explain why you believe that implementing CEV would be so much harder than what you propose in your post?
This post assumes the AI understands physical concepts to a certain degree and has a reflection principle (and that we have an adequate U).
CEV requires that we solve the issue of extracting preferences from current people, have a method for combining them, have a method for extrapolating them, and have an error-catching mechanism to check that things haven’t gone wrong. We have none of these things, even in principle. CEV itself is a severely underspecified concept (as far as I know, my attempt here is the only serious attempt to define it, and it’s not very good: http://lesswrong.com/r/discussion/lw/8qb/cevinspired_models/).
More simply, CEV requires that we solve moral problems and their grounding in reality; reduced impact requires that we solve physics and position.
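Schematically, the trade-off I have in mind looks like the toy sketch below (deliberately simplified, not the actual definitions from the post; `impact_penalty` here stands in for a much more careful counterfactual measure): the agent maximises task utility minus λ times an estimate of how different the world is from what it would have been without the agent acting.

```python
# Schematic of a reduced-impact trade-off (a toy sketch, not the formalism in
# the post).  The agent scores actions by task utility minus lambda times an
# impact penalty; the penalty is meant to estimate how different the world
# would be with vs. without the agent's action, which is the part that needs
# a model of physics and of the agent's own location in the world.

def reduced_impact_score(action, utility, impact_penalty, lam):
    return utility(action) - lam * impact_penalty(action)

def choose_action(actions, utility, impact_penalty, lam):
    return max(actions,
               key=lambda a: reduced_impact_score(a, utility, impact_penalty, lam))

# Hypothetical numbers: a high-utility but disruptive action loses to a more
# modest one once lambda is large enough -- hence the problem of choosing lambda.
utility = {"seize_resources": 10.0, "do_assigned_task": 6.0}.get
impact_penalty = {"seize_resources": 8.0, "do_assigned_task": 0.5}.get
print(choose_action(["seize_resources", "do_assigned_task"],
                    utility, impact_penalty, lam=1.0))   # do_assigned_task
```

The hard part lives entirely inside the impact penalty (and in picking λ), not in any preference-extraction machinery.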
I do wish to note that value-learning models, with further work, could at least get us within shouting distance of Coherent Volition: even if we can’t expect them to extrapolate our desires for us, we can expect them to follow the values we signal ourselves as having (i.e., on some level, to more or less follow human orders, as potentially unsafe as that may be in itself).
But more broadly, why should Specific Purpose AI want to do anything other than its specific assigned job, as humans understand that job?
> But more broadly, why should Specific Purpose AI want to do anything other than its specific assigned job,...
No reason it would want to do anything else.
> ...as humans understand that job?
Ah, that’s the rub: the AI will do what we say, not what we want. That’s the whole challenge.
> I do wish to note that value-learning models...
I’m playing around with the idea of combining value-learning with reduced impact, if that’s possible—to see whether we can use reduced impact to ensure the AI doesn’t try to manipulate humans in a dodgy fashion, so that human feedback can then be safely used to calibrate the AI...
> I’m playing around with the idea of combining value-learning with reduced impact, if that’s possible—to see whether we can use reduced impact to ensure the AI doesn’t try to manipulate humans in a dodgy fashion, so that human feedback can then be safely used to calibrate the AI...
I would say that the current model of value-learning is already safe for this. As I read Dewey’s paper, a value-learning agent doesn’t care about optimizing its own reward/reinforcement feedback, and in fact doesn’t even care about its own future utility function, even when able to predict that it will change. It cares about learning well from the feedback given to it and following the utility function that most probably models the feedback history.
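To make that concrete, here is a toy sketch of the action-selection rule I’m describing (my own illustrative Python, not Dewey’s formalism; the candidate utilities, likelihoods, and feedback strings are made up for the example): the agent keeps a posterior over candidate utility functions given the feedback history and maximizes expected utility under that posterior, rather than maximizing the feedback signal itself.

```python
# Toy value-learning sketch (illustrative only; the names and numbers are mine,
# not Dewey's notation).  The agent holds a posterior over candidate utility
# functions, updated from the feedback history, and picks the action with the
# highest expected utility under that posterior -- it never optimizes the
# feedback signal itself.

def posterior_over_utilities(priors, likelihoods, history):
    """P(U | history) proportional to P(U) * P(history | U), normalized."""
    weights = {u: priors[u] * likelihoods[u](history) for u in priors}
    total = sum(weights.values())
    return {u: w / total for u, w in weights.items()}

def choose_action(actions, utilities, posterior):
    """argmax over actions of  sum_U  P(U | history) * U(action)."""
    def expected_utility(a):
        return sum(posterior[u] * utilities[u](a) for u in posterior)
    return max(actions, key=expected_utility)

# Hypothetical example with two candidate utility functions (cake vs. death).
utilities = {
    "u_cake":  lambda a: 1.0 if a == "bake_cake" else 0.0,
    "u_death": lambda a: 1.0 if a == "cause_death" else 0.0,
}
priors = {"u_cake": 0.5, "u_death": 0.5}
# Per-item feedback likelihood: "praise_cake" is probable if cake is valued.
likelihoods = {
    "u_cake":  lambda h: 0.9 ** h.count("praise_cake"),
    "u_death": lambda h: 0.1 ** h.count("praise_cake"),
}
history = ["praise_cake", "praise_cake"]
posterior = posterior_over_utilities(priors, likelihoods, history)
print(choose_action(["bake_cake", "cause_death"], utilities, posterior))  # bake_cake
```

The point is that feedback only enters through the posterior over utility functions; there is no term that rewards the agent for engineering favourable feedback.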
I wouldn’t call that a truly and surely Friendly AI, but I would call it a halfway reasonably safe AI that could potentially be utilized as a seed. Turn on a value-learner and use feedback to train it until you can actively teach it about things like reduced-impact AI or CEV, then have it spit out some constructive mathematics regarding those and turn itself off (or you turn it off; since you control its training feedback, you can train it to prefer never acting against a human who tries to turn it off).
But the mathematics of value-learners needs a bunch of work. I actually emailed Dewey but he hasn’t gotten back to me. I’ll just have to slave over the paper more myself.
> I would say that the current model of value-learning is already safe for this.
I found a “cake-or-death” problem with the initial formulation (http://lesswrong.com/lw/f3v/cake_or_death/). If such problems can be found with a formulation that looked pretty solid initially, then I’m certainly not confident we can say the current model is safe...
Safe enough to do mathematics on, surely. I wouldn’t declare anything safe to build unless someone hands me a hard hat and a one-time portal to a parallel universe.
You are wise, my child ;-)