Looks interesting and ambitious! But I am sensing some methodological
obstacles here, which I would like to point out and explore. You write:
Within those projects, I’m aiming to work on subprojects that are:
Posed in terms that are familiar to conventional ML;
interesting to solve from the conventional ML perspective;
and whose solutions can be extended to the big issues in AI safety.
Now, take the example of the capture the cube game from the
Deepmind blog
post.
This is a game where player 1 tries to move a cube into the white
zone, and player 2 tries to move it into the blue zone on the other
end of the board. If the agents learn to betray each other here, how
would you fix this?
We’ll experiment with ways of motivating the agents to avoid betrayals, or getting anywhere near to them, and see if these ideas scale.
There are three approaches to motivating agents to avoid betrayals in
capture the cube that I can see:
1) change the physical reality of the game: change the physics of the
game world or the initial state of the game world;
2) change the reward functions of the players;
3) change the ML algorithms inside the players, so that they are no
longer capable of finding the optimal betrayal-based strategy.
Your agenda says that you want to find solutions that are interesting
from the conventional ML perspective. However, in the conventional
ML perspective:
1) tweaking the physics of the toy environment to improve agent
behavior is out of scope. It is close to cheating on the benchmark.
2) any consideration of reward function design is out of scope.
Tweaking it to improve learned behavior is again close to cheating.
3) introducing damage into your ML algorithms so that they will no
longer find the optimal policy is just plain weird, out of scope, and
close to cheating.
So I’d argue that you have nowhere to move if you want to solve this
problem while also pleasing conventional ML researchers. Conventional
ML researchers will always respond by saying that your solution is
trivial, problem-specific, and therefore uninteresting.
OK, maybe I am painting too much of a hard-core bitter
lesson
picture of conventional ML research here. I could make the above
observations disappear by using a notion of conventional ML research
that is more liberal in what it will treat as in-scope, instead of as
cheating.
What I would personally find exciting would be a methodological
approach where you experiment with 1) and 2) above, and ignore 3).
In the capture the cube game, you might experiment with reward
functions that give more points for a fast capture followed by a fast
move to a winning zone, which ends the game, and less for a slow one.
If you also make this an iterated game (it may already be a de facto
iterated game depending on the ML setup), I would expect that you can
produce robust collaborative behavior with this time-based reward
function. The agents may learn to do the equivalent of flipping a
coin at the start to decide who will win this time: they will
implicitly evolve a social contract about sharing scarce resources.
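To make this concrete, here is a minimal sketch of the kind of time-based reward I have in mind. The interface names (winner, steps_taken, max_steps) are my own illustrative inventions, not the actual XLand setup:

```python
# Minimal sketch of a time-based reward for a capture-the-cube style
# game. All names here (winner, steps_taken, max_steps) are my own
# illustrative inventions, not the actual XLand/DeepMind interface.

def episode_reward(player: int, winner: int, steps_taken: int, max_steps: int) -> float:
    """End-of-episode reward for `player`.

    The winner scores more for a fast capture-and-deliver and less for
    a slow one; the loser scores nothing. In an iterated setting this
    makes stalling fights expensive and taking turns attractive.
    """
    if player != winner:
        return 0.0
    # Linear time penalty: 1.0 for an instant win, falling towards 0.0
    # as the episode approaches the step limit.
    return max(0.0, 1.0 - steps_taken / max_steps)
```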
You might also investigate a game scoring variant with different time
discount factors, which more or less heavily penalize wins that take
longer to achieve. I would expect that with higher
penalties for taking a longer time to win, collaborative behavior
under differences between player intelligence and ability will remain
more robust, because even a weaker player can always slow down a
stronger player a bit if they want to. This penalty approach might
then generalize to other types of games.
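A rough sketch of this discount-factor variant, with made-up names and numbers:

```python
# Rough illustration of the discount-factor variant: the win bonus is
# discounted per time step, and the discount factor gamma controls how
# heavily slow wins are penalized. The numbers below are made up.

def discounted_win_reward(steps_taken: int, gamma: float) -> float:
    """A win bonus of 1.0, discounted by gamma for every step taken."""
    return gamma ** steps_taken

# Even a weaker player can usually delay a stronger player by some
# number of steps. The harsher the discount (lower gamma), the more
# that delay costs the stronger player, so agreeing to alternate wins
# should stay attractive even under an ability gap.
for gamma in (0.99, 0.95, 0.90):
    fast = discounted_win_reward(20, gamma)    # uncontested win
    slowed = discounted_win_reward(40, gamma)  # win delayed by the opponent
    print(f"gamma={gamma}: fast win {fast:.2f}, obstructed win {slowed:.2f}")
```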
The kind of thing I have in mind above could also be explored in much
simpler toy worlds than those offered by XLand. I have been thinking
of a game where we drop two players on a barren planet, where one has
the reward function to maximize paperclips, and one to maximize
staples. If the number of paperclips and staples is discounted, e.g.
the reward functions are paperclips^0.1 and staples^0.1, this might
produce more collaborative/sharing
behavior, and suppress a risky fight to capture total dominance over
resources.
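A back-of-the-envelope sketch of why such concave rewards should favor sharing over a fight, with entirely made-up resource numbers and fight odds:

```python
# Back-of-the-envelope sketch of why concave rewards like
# paperclips^0.1 should favor sharing over fighting. The resource
# numbers and the 50/50 fight odds are entirely made up.

def paperclip_reward(n_paperclips: float) -> float:
    return n_paperclips ** 0.1

total_resources = 1000.0

# Outcome A: the players split the planet and each converts half.
shared = paperclip_reward(total_resources / 2)

# Outcome B: the paperclip maximizer gambles on a fight for total
# dominance, winning everything with probability 0.5 and nothing
# otherwise.
fought = 0.5 * paperclip_reward(total_resources) + 0.5 * paperclip_reward(0.0)

print(f"expected reward, sharing:  {shared:.2f}")   # ~1.86
print(f"expected reward, fighting: {fought:.2f}")   # ~1.00
```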
Potentially, some branch of game theory has already produced a whole
body of knowledge that examines this type of approach to turning
competitive games into collaborative games, and has come up with
useful general results and design principles. I do not know. I
sometimes wonder about embarking on a broad game theory literature
search to find out. The methodological danger of using XLand to
examine these game theoretical questions is that by spending months
working in the lab, you will save hours in the library.
These general methodological issues have been on my mind recently. I
have been wondering if AI alignment/safety researchers should spend
less time with ML researchers and their worldview, and more time with
game theory people.
I would be interested in your thoughts on these methodological issues,
specifically your thoughts about how you will handle them in this
particular subproject. One option I did not discuss above is transfer
learning: prime the agents on collaborative games only, and then
examine their behavior on competitive games.