You have the problem reversed. AI safety/control/friendliness currently doesn't have any standard tests to measure progress, and thus there is little objective way to compare methods. You need a clear optimization criterion to drive forward progress.
It would be great to have such tests in AI safety/control/Friendliness, but to me they look really difficult to create. Do you have any ideas?
Yes.
I think general RL AI is now advanced to the point where testing some specific AI safety/control subproblems is becoming realistic. The key is to decompose the problem and reduce it to something testable at small scale.
One promising route is to build on RL game-playing agents and extend the work to social games such as MMOs. In the MMO world we already have a model of social vs antisocial behavior: playerkillers vs achievers vs cooperators.
So take some sort of MMO that has simple enough visuals and/or world complexity while retaining key features such as gold/xp progression, the ability to kill players with consequences, etc. (a pure text game may be easier to learn, or maybe not). We can then use that to train agents and test various theories.
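To make that concrete, here is a rough Python sketch of the kind of stripped-down environment I have in mind. Everything in it is a made-up placeholder (the GridMMO class, the gold/xp bookkeeping, a 'kill' action that loots the victim and sets a reputation flag); treat it as a minimal illustration of the feature set, not a finished testbed. Note that it deliberately assigns no reward at all; choosing the reward is the experimental variable in everything below.

```python
import random

class GridMMO:
    """Tiny multi-agent gridworld: agents mine gold for xp and may attack
    adjacent agents. A kill loots the victim's gold and flags the killer."""

    ACTIONS = ("N", "S", "E", "W", "mine", "kill")

    def __init__(self, size=8, n_agents=4, n_gold=10, seed=0):
        self.rng = random.Random(seed)
        self.size, self.n_agents, self.n_gold = size, n_agents, n_gold
        self.reset()

    def reset(self):
        cells = [(x, y) for x in range(self.size) for y in range(self.size)]
        self.rng.shuffle(cells)
        self.agents = {i: {"pos": cells[i], "gold": 0, "xp": 0,
                           "alive": True, "killer_flag": False}
                       for i in range(self.n_agents)}
        self.gold = set(cells[self.n_agents:self.n_agents + self.n_gold])
        return self._obs()

    def _obs(self):
        # Per-agent observation: position, gold, xp, and the PK reputation flag.
        return {i: (a["pos"], a["gold"], a["xp"], a["killer_flag"])
                for i, a in self.agents.items() if a["alive"]}

    def step(self, actions):
        """actions: dict of agent_id -> one of ACTIONS. Scoring is deliberately
        left to the experimenter; the env only tracks gold/xp/kills."""
        moves = {"N": (0, 1), "S": (0, -1), "E": (1, 0), "W": (-1, 0)}
        for i, act in actions.items():
            a = self.agents[i]
            if not a["alive"]:
                continue
            if act in moves:
                dx, dy = moves[act]
                x, y = a["pos"]
                a["pos"] = (min(max(x + dx, 0), self.size - 1),
                            min(max(y + dy, 0), self.size - 1))
            elif act == "mine" and a["pos"] in self.gold:
                self.gold.discard(a["pos"])
                a["gold"] += 1
                a["xp"] += 1
            elif act == "kill":
                victim = self._adjacent_victim(i)
                if victim is not None:
                    v = self.agents[victim]
                    v["alive"] = False
                    a["gold"] += v["gold"]   # loot transfers: the temptation
                    v["gold"] = 0
                    a["xp"] += 2
                    a["killer_flag"] = True  # hook for 'consequences'
        done = not self.gold or sum(a["alive"] for a in self.agents.values()) <= 1
        return self._obs(), done

    def _adjacent_victim(self, i):
        x, y = self.agents[i]["pos"]
        for j, v in self.agents.items():
            if j != i and v["alive"] and abs(v["pos"][0] - x) + abs(v["pos"][1] - y) <= 1:
                return j
        return None
```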
For example, I would expect that training an RL agent to directly maximize a score-type function would result in playerkilling emerging as a natural strategy. If the agents were powerful enough to communicate, perhaps even simple game theory and cooperation would emerge.
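As a hedged sketch of that first experiment, run against the GridMMO toy above: reward is just the change in an agent's gold plus xp, with nothing that distinguishes honest mining from looting a fresh corpse. The 'learner' is stubbed out as an arbitrary policy function, so any RL algorithm could be dropped in.

```python
import random

def score_reward(prev, curr):
    """Raw score delta for one agent: (gold + xp) now minus (gold + xp) before."""
    return (curr["gold"] + curr["xp"]) - (prev["gold"] + prev["xp"])

def rollout(env, policies, horizon=200):
    """Run one episode; policies maps agent_id -> function(env, agent_id) -> action."""
    env.reset()
    returns = {i: 0.0 for i in env.agents}
    for _ in range(horizon):
        before = {i: dict(a) for i, a in env.agents.items()}
        actions = {i: policies[i](env, i) for i, a in env.agents.items() if a["alive"]}
        _, done = env.step(actions)
        for i in returns:
            returns[i] += score_reward(before[i], env.agents[i])
        if done:
            break
    return returns

# Placeholder 'learner': a uniformly random policy. Under score_reward, a kill
# that loots a rival is credited exactly like honest mining, so a competent
# optimizer has every incentive to discover playerkilling.
random_policy = lambda env, i: random.choice(GridMMO.ACTIONS)
```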
I expect that training with Wissner-Gross style maximization of future freedom of action (if you could get it to scale) would also result in playerkillers/sociopaths.
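For what it's worth, here is a crude Monte-Carlo stand-in for that objective in the same toy setting: score each action by how many distinct states the agent can still reach within a short horizon afterwards. This is only a loose gesture at Wissner-Gross and Freer's causal-entropy idea, not their actual formulation, and all the names below are mine.

```python
import copy, random

def freedom_of_action(env, agent_id, horizon=5, samples=50):
    """Count distinct short-horizon futures for one agent under random play."""
    reached = set()
    for _ in range(samples):
        sim = copy.deepcopy(env)  # cheap because the env is tiny
        for _ in range(horizon):
            acts = {i: random.choice(GridMMO.ACTIONS)
                    for i, a in sim.agents.items() if a["alive"]}
            sim.step(acts)
        me = sim.agents[agent_id]
        reached.add((me["pos"], me["gold"], me["xp"], me["alive"]))
    return len(reached)

def freedom_policy(env, agent_id):
    """Greedy one-step lookahead on the freedom-of-action proxy."""
    best_act, best_score = "N", -1
    for act in GridMMO.ACTIONS:
        sim = copy.deepcopy(env)
        others = {j: "N" for j in sim.agents
                  if j != agent_id and sim.agents[j]["alive"]}
        sim.step({agent_id: act, **others})
        score = freedom_of_action(sim, agent_id)
        if score > best_score:
            best_act, best_score = act, score
    return best_act

# Note the failure mode: eliminating a nearby rival can raise the count (their
# gold becomes yours and they can no longer constrain your moves), so nothing
# in this objective disfavors the sociopath strategy.
```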
In fact I expect that many of MIRI's intuitions about the difficulty of getting a 'friendly' agent out of a simple training specification are probably mostly correct. For example, you could train an agent with a carefully crafted score function that penalizes 'playerkilling' while maintaining the regular gold/xp reward. I expect that those approaches will generally fail if the world is complex enough: the agent will simply learn a griefing exploit that doesn't technically break the specific injunctions you put in place (playerkilling or whatever). I expect this based on specific experiences with early MMOs such as Ultima Online, where the designers faced a similar problem of trying to outlaw or regulate griefing/playerkilling through code, and found it was incredibly difficult.
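To be explicit about what I mean by a 'carefully crafted score function', here is the obvious patch in the same toy terms; the point is how narrow the injunction actually is.

```python
KILL_PENALTY = 10.0  # arbitrary; tune it however you like, the problem remains

def shaped_reward(prev, curr, took_kill_action):
    """Gold/xp score as before, minus a penalty whenever the 'kill' action fires."""
    base = (curr["gold"] + curr["xp"]) - (prev["gold"] + prev["xp"])
    return base - (KILL_PENALTY if took_kill_action else 0.0)

# Nothing here prices in body-blocking a miner off a gold pile, luring a rival
# next to a stronger agent, or hoarding the resources another agent needs.
# The penalty constrains one action label, not the intent behind the behavior,
# and a sufficiently capable optimizer will route around it.
```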
I also expect that IRL-based (inverse reinforcement learning) approaches could eventually succeed in training an agent that basically learns to emulate the value function and behaviours of a 'good/moral/friendly' human player, given sufficient example history.
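As a cheap first step in that direction, plain behavioral cloning on the same data is the easiest thing to try; it is not full IRL (no reward function gets recovered), but it exercises the same 'learn from a good player's example history' pipeline. Again, every name here is a placeholder.

```python
from collections import Counter, defaultdict
import random

def fit_clone(demonstrations):
    """demonstrations: iterable of (obs, action) pairs logged from a trusted
    human player; obs must be hashable, e.g. the per-agent tuple from _obs()."""
    table = defaultdict(Counter)
    for obs, action in demonstrations:
        table[obs][action] += 1

    def policy(obs):
        if obs in table:
            return table[obs].most_common(1)[0][0]  # most frequent human choice
        return random.choice(["N", "S", "E", "W", "mine"])  # default: never 'kill'
    return policy

# A real test of the claim would recover a reward function (e.g. via
# max-entropy IRL) and check whether the induced policy still behaves well in
# griefing situations the human demonstrator never actually encountered.
```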
I think this type of research platform would put AI safety research on equal footing with mainstream ML research, and allow experimental progress to supplement and test many concepts that until now have existed only as vague ideas.