Most games-as-in-game-theory that you can scrape together for training are much simpler than your average Atari game. Since you’re relying on your training data to do so much of the work here, you want to have some idea of which training data will teach what, with which learning algorithm. You don’t want to leave the AI in a nebulous fog, nor do you want to solve problems by stipulating that the training data will get arbitrarily large and complicated.
Instead, the sort of proposal I think is most helpful is the kind where, if achieved, it will show that you can solve an important problem with a certain architecture. That’s sort of what I meant by “shortcuts”: is the problem of learning not to cheat an easy way to demonstrate some value-learning capability we need to work on? An example of this kind of capability demonstration might be interpolating smoothly between objects, as a demonstration that neural networks are learning high-level features similar to human-intelligible concepts.
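To make the interpolation example concrete, here is a minimal sketch in Python, assuming you already have a trained autoencoder exposing encode and decode functions (those names are placeholders for illustration, not a real API):

```python
import numpy as np

def interpolate_latents(encode, decode, x_a, x_b, steps=8):
    """Decode points along a straight line between two latent codes.

    If the intermediate decodings look like plausible blends of the two
    endpoints, that is weak evidence that the latent features track
    human-intelligible concepts rather than arbitrary ones.
    """
    z_a, z_b = encode(x_a), encode(x_b)
    # Convex combinations of the two latent vectors, endpoints included.
    return [decode((1 - a) * z_a + a * z_b)
            for a in np.linspace(0.0, 1.0, steps)]
```

The point of the test is the qualitative check, not the math: linear interpolation in pixel space just cross-fades, while smooth interpolation in latent space suggests the network has organized its features around something concept-like.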
Now, you might say “of course—learning not to cheat is itself the skill we want the AI to have.” But I’m not convinced that not cheating at chess or whatever demonstrates that the AI is not going to over-optimize the world, because those are very different domains. The trick, sometimes, is breaking down “don’t over-optimize the world” into little pieces that you can work on without having to jump all the way there, and then demonstrating milestones for those little pieces.
My definition of cheating for these purposes is essentially “don’t do what we don’t want you to do, even if we never bothered to tell you so and instead expected you to notice it on your own”. This skill would translate well to real-world domains.
Of course, if the games you are using to teach what cheating is are too simple, that just means you need different games: if neither board games nor simple game-theory games are complex enough, then obviously you need to come up with something more complicated. It seems to me that finding a difficult game that teaches the AI about human expectations and cheating is significantly easier than defining “what is cheating” manually.
One simple example that could be used to teach an AI: let it play an empire-building videogame, and ask it to “reduce unemployment”. Does it end up murdering everyone who is unemployed? That would be cheating. This particular example even translates really well to reality, for obvious reasons.
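To see how much of that failure mode lives in the reward specification itself, here is a toy sketch (the game state, the numbers, and the penalty are all made up for illustration) contrasting a literal-minded objective with one that penalizes the obvious cheat:

```python
def naive_reward(state):
    # Literal objective: fewer unemployed citizens is always better.
    return -state["unemployed"]

def constrained_reward(state, prev_state):
    # Same objective, plus a large penalty for "solutions" a human
    # would recognize as cheating, e.g. citizens disappearing
    # rather than finding jobs.
    reward = -state["unemployed"]
    deaths = prev_state["population"] - state["population"]
    if deaths > 0:
        reward -= 1000 * deaths
    return reward

# Under the naive reward, "murder the unemployed" and "create jobs"
# score identically; only the constrained version separates them.
before = {"population": 100, "unemployed": 10}
jobs   = {"population": 100, "unemployed": 0}
murder = {"population": 90,  "unemployed": 0}
assert naive_reward(jobs) == naive_reward(murder)
assert constrained_reward(jobs, before) > constrained_reward(murder, before)
```

The interesting part, of course, is that we want the AI to supply the penalty term itself, from having learned what humans expect, rather than having us enumerate every forbidden shortcut by hand.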
By the way, why would you not want the AI to be left in “a nebulous fog”? The more uncertain the AI is about what is and is not cheating, the more cautious it will be.
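One way to make “uncertainty breeds caution” concrete, purely as an illustration, is to score actions pessimistically across the AI’s surviving hypotheses about what counts as cheating; everything named below is my own sketch, not an established method:

```python
def cautious_value(action, cheat_hypotheses, utility):
    """Score an action by its worst case over hypotheses about cheating.

    Each hypothesis maps an action to a penalty (0 if that reading of
    the rules allows it). Under this scoring, an action near the fuzzy
    edge of "cheating" only looks good if every plausible hypothesis
    tolerates it, so broader uncertainty pushes the agent toward
    unambiguously safe actions.
    """
    return min(utility(action) - penalty(action)
               for penalty in cheat_hypotheses)
```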