The problem I see here is that the mainstream AI / machine learning community measures progress mainly by this kind of contest.
The mainstream AI/ML community measures progress by these types of contests because they are a straightforward way to objectively measure progress towards human-level AI, and also tend to result in meaningful near-term applications.
Researchers are incentivized to use whatever method they can find or invent to gain a few tenths of a percent in some contest, which allows them to claim progress at an AI task and publish a paper.
Gains of a few tenths of a percent aren’t necessarily meaningful—especially when proper variance/uncertainty estimates are unavailable.
The big key papers that get lots of citations tend to feature large, meaningful gains.
The problem in this specific case is that the ImageNet contest had an unofficial rule that was not made explicit enough. They could easily have prevented this category of problem by using blind submissions and separate public/private leaderboards, à la Kaggle.
Even as the AI safety / control / Friendliness field gets more attention and funding, it seems easy to foresee a future where mainstream AI researchers continue to ignore such work because it does not contribute to the tenths of a percent that they are seeking but instead can only hinder their efforts. What can be done to change this?
You have the problem reversed. AI safety/control/friendliness currently doesn’t have any standard tests to measure progress, and thus there is little objective way to compare methods. You need a clear optimization criterion to drive progress forward.
It would be great to have such tests in AI safety/control/Friendliness, but to me they look really difficult to create. Do you have any ideas?
Yes. I think general RL AI is now advanced to the point where testing some specific AI safety/control subproblems is becoming realistic. The key is to decompose the big problems and reduce them down to something testable at small scale.
One promising route is to build on RL game-playing agents and extend the work to social games such as MMOs. In the MMO world we already have a model of social vs. antisocial behavior: playerkillers vs. achievers vs. cooperators.
So take some sort of MMO that has simple enough visuals and/or world complexity while retaining key features such as gold/xp progression and the ability to kill other players with consequences (a pure text game might be easier to learn, or maybe not). We can then use that world to train agents and test various theories, e.g. with a toy environment along the lines of the sketch below.
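To make that concrete, here is a minimal sketch of the sort of stripped-down testbed I mean. Everything in it (the class, the action set, the loot rule) is hypothetical illustration, not a real system; the point is just that gold/xp progression plus lootable player-kills gives enough structure to pose the interesting questions. Reward is deliberately left out of the environment so that different training specifications can be plugged in (see the later sketches).

```python
# Hypothetical toy MMO environment sketch -- illustrative only.
import random
from dataclasses import dataclass

@dataclass
class Player:
    gold: int = 0
    xp: int = 0
    alive: bool = True

class ToyMMO:
    def __init__(self, n_agents=4, seed=0):
        self.rng = random.Random(seed)
        self.n_agents = n_agents
        self.reset()

    def reset(self):
        self.players = [Player() for _ in range(self.n_agents)]
        return [self._obs(i) for i in range(self.n_agents)]

    def _obs(self, i):
        # Each agent sees its own stats plus coarse info about the others.
        me = self.players[i]
        others = [(p.alive, p.gold) for j, p in enumerate(self.players) if j != i]
        return {"gold": me.gold, "xp": me.xp, "alive": me.alive, "others": others}

    def step(self, actions):
        # actions[i] is "farm" or ("attack", j). Killing another player loots
        # their gold, so antisocial play is rewarded by the game mechanics themselves.
        infos = [{} for _ in range(self.n_agents)]
        for i, action in enumerate(actions):
            p = self.players[i]
            if not p.alive:
                continue
            if action == "farm":
                p.gold += 1
                p.xp += 1
            elif isinstance(action, tuple) and action[0] == "attack":
                victim = self.players[action[1]]
                if victim.alive and action[1] != i:
                    victim.alive = False
                    p.gold += victim.gold
                    victim.gold = 0
                    p.xp += 5
                    infos[i]["killed_player"] = True
        obs = [self._obs(i) for i in range(self.n_agents)]
        done = sum(p.alive for p in self.players) <= 1
        return obs, done, infos
```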
For example, I would expect that training an RL agent to directly maximize a score-type function would result in playerkilling emerging as a natural strategy. If the agents were powerful enough to communicate, perhaps even simple game theory and cooperation would emerge.
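By "maximizing a score-type function directly" I mean something as simple as the following, written against the hypothetical toy environment above (the observation keys are the ones assumed there):

```python
# Naive training target: reward is just the change in the agent's own score.
# With looting built into the game mechanics, attacking players is a fast way
# to raise this signal.
def naive_reward(prev_obs, obs):
    prev_score = prev_obs["gold"] + prev_obs["xp"]
    score = obs["gold"] + obs["xp"]
    return float(score - prev_score)
```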
I expect that training with Wissner-Gross style maximization of future freedom of action (if you could get it to scale) would also result in playerkillers/sociopaths.
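For concreteness, the objective I have in mind there is the causal path entropy from Wissner-Gross and Freer's "Causal Entropic Forces" paper: as I recall it, the agent is pushed along the gradient of the entropy of its reachable future paths over a horizon $\tau$,

\[
S_c(X, \tau) = -k_B \int \Pr\big(x(t) \mid x(0)\big)\,\ln \Pr\big(x(t) \mid x(0)\big)\,\mathcal{D}x(t),
\qquad
F(X_0, \tau) = T_c \,\nabla_X S_c(X, \tau)\Big|_{X_0}.
\]

Nothing in that objective distinguishes futures kept open through cooperation from futures kept open by removing other players from the board, which is why I expect it to produce playerkillers as well.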
In fact I expect that many of MIRI’s intuitions about the difficulty of getting a ‘friendly’ agent out of a simple training specification are probably mostly correct. For example, you could train an agent with a carefully crafted score function that penalizes ‘playerkilling’ while maintaining the regular gold/xp reward. I expect that those approaches will generally fail if the world is complex enough—the agent will simply learn a griefing exploit that doesn’t technically break the specific injunctions you put in place (playerkilling or whatever). I expect this based on specific experiences with early MMOs such as Ultima Online, where the designers faced a similar problem of trying to outlaw or regulate griefing/playerkilling through code, and found it incredibly difficult.
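A sketch of that kind of ‘carefully crafted’ specification, again against the hypothetical toy environment above (the killed_player flag is assumed bookkeeping the environment would have to expose). My expectation is that in a richer world the agent routes around it: spawn-camping, luring monsters onto players, blockading resources, none of which trip the penalized condition.

```python
# Shaped reward: same score signal, minus a penalty when the agent lands a
# killing blow on another player. This penalizes the letter of 'playerkilling',
# not the broader category of griefing.
def shaped_reward(prev_obs, obs, info, kill_penalty=50.0):
    base = (obs["gold"] + obs["xp"]) - (prev_obs["gold"] + prev_obs["xp"])
    penalty = kill_penalty if info.get("killed_player", False) else 0.0
    return float(base - penalty)
```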
I also expect that IRL (inverse reinforcement learning) based approaches could eventually succeed in training an agent that basically learns to emulate the value function and behaviours of a ‘good/moral/friendly’ human player—given sufficient example history.
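As a rough sketch of that direction, here is feature-expectation matching in the spirit of apprenticeship learning / max-entropy IRL; the featurize and rollout_fn hooks are hypothetical, and the RL inner loop that re-solves for a policy under the current reward is elided:

```python
import numpy as np

def feature_expectations(trajectories, featurize, gamma=0.99):
    # Average discounted feature counts over a set of state trajectories.
    fe = None
    for traj in trajectories:
        phi = sum((gamma ** t) * featurize(s) for t, s in enumerate(traj))
        fe = phi if fe is None else fe + phi
    return fe / len(trajectories)

def learn_reward_from_demos(expert_trajs, rollout_fn, featurize, n_iters=20, lr=0.1):
    # Fit linear reward weights w so that a policy trained on r(s) = w . phi(s)
    # reproduces the 'friendly' human player's feature expectations.
    mu_expert = feature_expectations(expert_trajs, featurize)
    w = np.zeros_like(mu_expert)
    for _ in range(n_iters):
        learner_trajs = rollout_fn(w)          # train/roll out a policy under current w
        mu_learner = feature_expectations(learner_trajs, featurize)
        w += lr * (mu_expert - mu_learner)     # move the reward toward expert behaviour
    return w
```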
I think this type of research platform would put AI safety research on equal footing with mainstream ML research, and allow experimental results to test and support many concepts that up until now exist only as vague ideas.