When I responded in PM, I tried to understand why this is a problem in the first place.
I couldn’t think of a practical situation where it would come up.
Why not have the model run in discrete, time-limited segments, where completing the task correctly while burning the least compute is what you optimize for in training?
That makes the model a machine whose weights were selected for a specific behavior: each run, it submits its answer and shuts down for cleanup. Submitting is shutdown; they aren’t separate steps. Exceeding the time limit is also shutdown, with a negative reward.
Thus, unwanted behavior is selected against by your training mechanism. This is informed by :
https://www.lesswrong.com/posts/pdaGN6pQyQarFHXF4/reward-is-not-the-optimization-target, which frames RL not as “rewarding” the model but as finding a minimum in the space of network weights that forms a machine with the behavior you selected for: in this case, turning in a correct answer quickly, with as little thinking as possible.
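To make the selection pressure concrete, here is a minimal sketch of the episode reward described above. All names and constants are hypothetical, not from any real training framework; the point is just that submit ends the episode either way, and running past the deadline is strictly worse than any submitted answer.

```python
# Hypothetical reward shaping for discrete, time-limited episodes.
# Submitting ends the episode; exceeding the time limit also ends it,
# but with a negative reward. Constants are illustrative only.

TIME_LIMIT = 10.0       # seconds of compute allowed per episode
TIMEOUT_PENALTY = -1.0  # reward for running past the deadline
COMPUTE_WEIGHT = 0.1    # how strongly compute burned is penalized

def episode_reward(submitted: bool, correct: bool, compute_used: float) -> float:
    """Reward for one episode. Submit == shutdown, and
    timeout == shutdown with negative reward."""
    if not submitted or compute_used > TIME_LIMIT:
        return TIMEOUT_PENALTY
    task_reward = 1.0 if correct else 0.0
    # Correct answers that burn less compute score higher.
    return task_reward - COMPUTE_WEIGHT * (compute_used / TIME_LIMIT)
```

Under this shaping, a correct answer that uses less compute always outscores one that uses more, and "hanging" past the deadline is the worst outcome available, so weights exhibiting that behavior are selected against.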
And I mentioned real-world examples: in realtime control systems, when the software doesn’t have a decision ready by a deadline, you throw an error and shut down. You can also do this purely in hardware, with a hardware timer that sends a signal directly to the processor reset line, or with a separate microcontroller.
This is a standard way to build autonomous car control systems.
These “I wasn’t asking” approaches force models that “hang” and try to run past a deadline to shut down anyway, and they are resistant to tampering. “Resistant” here means circumventing them is not thought to be physically possible.
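The watchdog pattern above can be sketched in software. This is only an analogue: a real control system would use a hardware timer or a separate microcontroller precisely so the worker cannot tamper with it, whereas a supervisor process merely illustrates the structure. The function and parameter names are made up for this sketch.

```python
# Software analogue of the hardware-watchdog pattern: a supervisor runs
# the worker in a separate process and kills it unconditionally when the
# deadline passes. The worker doesn't get a say.
import multiprocessing
import time

def worker(conn):
    # Stand-in for a control loop that may or may not finish in time.
    time.sleep(5)            # simulate a model that "hangs"
    conn.send("decision")

def run_with_deadline(deadline_s: float):
    parent, child = multiprocessing.Pipe()
    p = multiprocessing.Process(target=worker, args=(child,))
    p.start()
    p.join(timeout=deadline_s)   # wait at most deadline_s seconds
    if p.is_alive():
        # Deadline exceeded: terminate the worker regardless of its state.
        p.terminate()
        p.join()
        return None              # caller falls back to a safe default
    return parent.recv()
```

Calling `run_with_deadline(0.5)` here returns `None`, because the simulated worker sleeps for 5 seconds and is killed at the 0.5-second deadline; the caller then falls back to whatever safe state the system defines.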