This isn’t crazy; people have tried related techniques. But more of the details need to be thought out.
In the chess example, the AIs start out very stupid, being wired at random. But in a game between two idiots moving at random, eventually someone is going to win. Then you reinforce the techniques used by the winner and de-reinforce the ones used by the loser, so in any encounter you learn, regardless of who wins. But in an encounter between a PM and a programmer, if the programmer fails, who gets reinforced? It might be because the programmer is dumb, in which case the programmer should be de-reinforced. It might be because the PM is dumb and asked for something impossible, or far beyond what can be done, in which case the PM should be de-reinforced. Or it might be because the PM came up with a task just barely beyond the programmer’s ability, which is good and should be reinforced. We somehow need to keep the PM producing problems which are hard but possible. Maybe the programmer could be tasked with coming up with either a solution or a proof of impossibility?
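Just to make the idea concrete, here is a toy sketch of what the reward split could look like, in Python. Nothing here comes from an existing system; the function names, the 0.5 target, and the linear falloff are all made up for illustration. The PM gets rewarded most for tasks the programmer solves about half the time, so both trivial and impossible tasks get discouraged.

```python
def pm_reward(solve_rate: float, target: float = 0.5) -> float:
    # Hypothetical: peak reward when the programmer solves the task about
    # `target` of the time; reward falls off linearly toward trivial
    # (solve_rate = 1.0) and impossible (solve_rate = 0.0) tasks.
    return 1.0 - abs(solve_rate - target) / max(target, 1.0 - target)

def programmer_reward(solved: bool) -> float:
    # The programmer's side stays simple: reinforce successes, de-reinforce failures.
    return 1.0 if solved else -1.0
```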
AlphaGo had a mechanism which tracked how important each move was. It was trained to predict the probability that white would win, on each position encountered in the game. Moves where this probability swung wildly were given a larger weight in reinforcement. This was important for concentrating training on decisive moves, allowing the extraction of information from each move instead of each game. It’s not clear if this is possible in the programming task.
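If we did have a value estimator for intermediate states (say, a predicted success probability after each edit to the code), swing-weighting could look something like the toy sketch below. The estimator itself is an assumption; this only shows how the swings would turn into training weights.

```python
def swing_weights(value_estimates):
    """Weight each step by how much the predicted success probability moved."""
    swings = [abs(b - a) for a, b in zip(value_estimates, value_estimates[1:])]
    total = sum(swings) or 1.0
    return [s / total for s in swings]

# Example trajectory: the jump from 0.52 to 0.90 dominates the training weight.
print(swing_weights([0.50, 0.52, 0.90, 0.88]))
```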
The point was more that creating your own data is easy: just generate code, then check it by running it. Save this code and later use it for training.
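A minimal sketch of that loop, assuming the generated programs are plain Python that can be run with `python -c` (the file name and the JSON record format are just placeholders):

```python
import json
import subprocess

def check_and_save(source: str, dataset_path: str = "selfplay_data.jsonl") -> bool:
    """Run a generated program; keep it as training data only if it exits cleanly."""
    try:
        result = subprocess.run(
            ["python", "-c", source],
            capture_output=True, text=True, timeout=10,
        )
    except subprocess.TimeoutExpired:
        return False  # hung programs are not worth saving either
    if result.returncode != 0:
        return False
    with open(dataset_path, "a") as f:
        f.write(json.dumps({"source": source, "stdout": result.stdout}) + "\n")
    return True
```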
If we wanted to go the way of AlphaZero, it doesn’t seem crazy.
De-reinforce commands, functions, and programs which output errors, for a start.
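One way that could turn into a reward signal, purely as an illustration: crashes get a negative reward, and clean runs get a neutral or positive one depending on whether the output matches an expected answer, if one exists. The specific values are made up, not from any particular system.

```python
from typing import Optional

def run_reward(returncode: int, stdout: str, expected: Optional[str]) -> float:
    """Crashes are de-reinforced; clean runs are reinforced, more so if the output matches."""
    if returncode != 0:
        return -1.0   # the program errored out: de-reinforce
    if expected is not None and stdout.strip() != expected.strip():
        return 0.0    # ran, but gave the wrong answer: no reinforcement
    return 1.0        # ran cleanly (and matched the expected output, if we had one)
```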
I didn’t think of the PM as being trained by these games; that’s interesting.
Maybe have two instances competing to get closest on some test cases the PM can prepare to go with the task, and have them compete on time, compute, memory, and accuracy. You can de-reinforce the less accurate one, and if both are fully accurate they can compete on time, memory, and CPU.
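Something like the sketch below. How the two programs are actually run and measured is left as an assumption; `compare` just encodes the proposed ordering: accuracy first, then time, then memory as tiebreakers.

```python
def compare(results_a, results_b):
    """Each result is (tests_passed, seconds, peak_mem_bytes).
    Higher tests_passed wins; ties fall through to time, then memory."""
    passed_a, time_a, mem_a = results_a
    passed_b, time_b, mem_b = results_b
    if passed_a != passed_b:
        return "A" if passed_a > passed_b else "B"   # accuracy dominates
    if time_a != time_b:
        return "A" if time_a < time_b else "B"       # then wall-clock time
    if mem_a != mem_b:
        return "A" if mem_a < mem_b else "B"         # then peak memory
    return "tie"

# Both pass all 10 tests, so the faster one (B) wins the round.
print(compare((10, 1.2, 50_000_000), (10, 0.8, 90_000_000)))
```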
I’m not sure “hard but possible” is the bar: you want lots of examples of what doesn’t work along with what does, and you want them for easy problems and hard ones alike, so the model learns everything.