The point was more about creating your own data being easy: just generate code, then check it by running it. Save this code and use it later for training.
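For concreteness, here's a minimal sketch of that loop, assuming a hypothetical `generate_code` call standing in for the model and a plain subprocess run as the check:

```python
import json
import os
import subprocess
import tempfile

def generate_code(prompt: str) -> str:
    """Hypothetical stand-in for whatever model call produces a candidate program."""
    raise NotImplementedError

def runs_cleanly(code: str, timeout_s: float = 10.0) -> bool:
    """Run the candidate in a subprocess and report whether it exits without error."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    try:
        result = subprocess.run(["python", path], capture_output=True, timeout=timeout_s)
        return result.returncode == 0
    except subprocess.TimeoutExpired:
        return False
    finally:
        os.unlink(path)

def collect_samples(prompts, out_path="selfgen_train.jsonl"):
    """Keep only candidates that run cleanly and save them for later training."""
    with open(out_path, "a") as out:
        for prompt in prompts:
            code = generate_code(prompt)
            if runs_cleanly(code):
                out.write(json.dumps({"prompt": prompt, "code": code}) + "\n")
```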
If we wanted to go the way of AlphaZero, it doesn't seem crazy.
De-enforce commands, functions, and programs that output errors, for a start.
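One hedged way to read "de-enforce" here is as a negative reward for any run that errors out or times out, with the task-specific rewards layered on top; the reward values below are arbitrary placeholders:

```python
import subprocess

def error_penalty_reward(path: str, timeout_s: float = 10.0) -> float:
    """Negative reward for crashes or timeouts, neutral for a clean run; task rewards
    (test accuracy, speed) would be added on top of this baseline."""
    try:
        result = subprocess.run(["python", path], capture_output=True, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        return -1.0   # de-enforce: ran out of time
    if result.returncode != 0:
        return -1.0   # de-enforce: crashed / non-zero exit
    return 0.0        # neutral; correctness rewards come from elsewhere
```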
I didn’t think of the pm as being trained by these games; that’s interesting.
Maybe have two instances competing to get closer on some test cases the pm can prepare to go with the task, and have them compete on time, compute, memory, and accuracy. You can de-enforce the less accurate one, and if both are fully accurate they can compete on time, memory, and CPU.
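A rough sketch of that duel as a comparison rule, assuming each instance's run has already been scored as (tests_passed, elapsed_seconds) on the pm's test cases:

```python
from typing import Optional, Tuple

def duel(
    result_a: Tuple[int, float],
    result_b: Tuple[int, float],
    n_tests: int,
) -> Optional[str]:
    """Each result is (tests_passed, elapsed_seconds) for one instance on the pm's test
    cases. Returns "a" or "b" for the instance to reinforce (the other is de-enforced),
    or None when there is no clear winner."""
    passed_a, time_a = result_a
    passed_b, time_b = result_b
    if passed_a != passed_b:
        return "a" if passed_a > passed_b else "b"   # de-enforce the less accurate one
    if passed_a == n_tests:
        # both fully accurate: compete on time (memory/CPU tiebreaks would slot in here)
        return "a" if time_a < time_b else "b"
    return None  # equally inaccurate: no clear winner in this simple sketch
```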
I’m not sure “hard but possible” is the bar. You want lots of examples of what doesn’t work along with what does, and you want them for easy problems as well as hard ones, so the model learns everything.