Human: “Look, can’t you just be normal about this?”
GAA-optimized agent: “Actually-”
Hm, I guess this wouldn’t work if the agent still learns an internalized RL methodology? Or would it? Say we have a base model; not much need for GAA there because it’s just doing token prediction. Then we go into some sort of (distilled?) RL-based CoT instruct tuning, where GAA means the model picks up on anomalous rewards in the signal more slowly, i.e. it doesn’t do the classic boat-spinning-in-circles reward hack (good test?). But if it internalizes RL as a strategy at some point, the resulting mesa-optimizer wouldn’t be limited in the same way, and since that’s a fully general technique, GAA wouldn’t prevent it? Still, seems like a good first line of defense.
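(GAA isn’t spelled out above, so take this as one possible reading of “picks up abnormal rewards more slowly”: something that damps updates driven by outlier rewards, so an exploit like the spinning boat takes longer to reinforce. The function name, threshold, and numpy framing below are all mine for illustration, not whatever GAA actually is.)

```python
import numpy as np

def damp_anomalous_advantages(advantages: np.ndarray,
                              z_threshold: float = 3.0,
                              damping: float = 0.1) -> np.ndarray:
    """Hypothetical sketch: shrink advantage estimates that sit far outside
    the batch distribution, so updates driven by anomalous rewards (e.g. a
    newly discovered reward exploit) are applied at reduced strength."""
    mu = advantages.mean()
    sigma = advantages.std() + 1e-8
    z = np.abs(advantages - mu) / sigma
    weights = np.where(z > z_threshold, damping, 1.0)
    return advantages * weights

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Mostly ordinary returns, plus a few huge ones from a reward exploit.
    adv = np.concatenate([rng.normal(0.0, 1.0, 96),
                          np.array([40.0, 55.0, 62.0, 48.0])])
    damped = damp_anomalous_advantages(adv)
    print("raw exploit advantages:   ", adv[-4:].round(1))
    print("damped exploit advantages:", damped[-4:].round(1))
```

In a real RL loop this weighting would sit between advantage estimation and the policy-gradient step; whether that resembles what GAA does, and whether an internalized mesa-optimizer would simply route around it, is exactly the open question above.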