Yeah, IMO the “RL at scale trains search-based mesa optimizers” hypothesis predicts “solving randomly generated mazes via a roughly unitary mesa objective and heuristic search” with reasonable probability, and that seems like a toy domain to me.