These are all task-specific, problem-definition issues that occurred while fine-tuning algorithms (but, yes, they do show how things could get out of hand).
Humans already do this very well; for example, tax loopholes that are exploited but are not in the 'spirit of the law'.
The ideal (but incredibly difficult) solution would be for AIs to have multiple layers of abstraction, where each decision gets passed up and evaluated against questions like "is this really what they wanted?" or "am I just gaming the system?".
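To make that concrete, here is a minimal Python sketch of the layered-check idea; the layer functions and the string-matching tests are purely hypothetical stand-ins, since writing checks that actually capture "what they wanted" is the hard part:

```python
# Toy sketch: every proposed action is passed up through a stack of
# oversight checks, and is only approved if every layer signs off.
# The layer names and string tests below are hypothetical illustrations.

from typing import Callable, List

Check = Callable[[str], bool]  # a check inspects a description of a proposed action


def layered_approval(action: str, layers: List[Check]) -> bool:
    """Pass the action up through each layer; reject on the first veto."""
    for check in layers:
        if not check(action):
            return False
    return True


# Crude proxy for "is this really what they wanted?"
def matches_stated_goal(action: str) -> bool:
    return "pause the game forever" not in action


# Crude proxy for "am I just gaming the system?"
def not_gaming_the_metric(action: str) -> bool:
    return "exploit scoring bug" not in action


if __name__ == "__main__":
    layers = [matches_stated_goal, not_gaming_the_metric]
    print(layered_approval("move piece to clear the level", layers))  # True
    print(layered_approval("pause the game forever", layers))         # False
```

Each layer here is just a predicate over a text description of the action; a real system would have to evaluate the action's actual consequences, which is where all the difficulty lives.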
What happens if an AI manages to game the system despite the n layers of abstraction?
This is the fundamental problem being researched: the top layer of abstraction would be that difficult-to-define one called "Be Friendly".
Instead of friendly AI, maybe we should look at "don't be an asshole" AI (DBAAAI); this may be simpler to test and monitor.
Let me clarify why I asked. I think the "multiple layers of abstraction" idea is essentially "build in a lot of 'manual' checks that the AI isn't misbehaving", and I don't think that is a desirable, or even possible, solution. You can write n layers of checks, but how do you know that you don't need n+1?
The idea, as has been pointed out here on LW, is that what you really want and need is a mathematical model of morality, which the AI will implement and from which moral behaviour will fall out without your having to specify it explicitly. This is what MIRI are working on with CEV & co.
Whether CEV, or whatever emerges as the best model to use, is gameable is itself a mathematical question,[1] central to the FAI problem.
[1] There are also implementation details to consider, e.g. “can I mess with the substrate” or “can I trust my substrate”.