Maybe another weak solution, close to “Take bigger steps”: use decentralized training.
Meaning: perform several training steps (gradient updates) in parallel on several replicas of the model and periodically synchronize the weights (e.g. by averaging them).
Each replica only has access to its own inputs and local weights, so it seems plausible that a gradient hacker can’t as easily cancel out gradients that go against its mesa-objective.
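To make this concrete, here is a minimal sketch of that kind of setup (local SGD with periodic parameter averaging), assuming a toy PyTorch model and synthetic per-replica data; names like `num_replicas`, `local_steps`, and `sync_rounds` are illustrative choices, not something from the original suggestion.

```python
# Minimal sketch: decentralized training with periodic weight averaging
# (local SGD). The toy model and synthetic data are assumptions for illustration.
import copy
import torch
import torch.nn as nn

num_replicas = 4   # number of parallel model replicas (illustrative)
local_steps = 10   # gradient updates each replica takes between syncs
sync_rounds = 5    # number of synchronization rounds

# One shared initialization, then independent copies per replica.
base_model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 1))
replicas = [copy.deepcopy(base_model) for _ in range(num_replicas)]
optimizers = [torch.optim.SGD(m.parameters(), lr=0.05) for m in replicas]
loss_fn = nn.MSELoss()

for round_idx in range(sync_rounds):
    # Each replica trains locally on its own data; during these steps no
    # replica sees the other replicas' inputs or weights.
    for model, opt in zip(replicas, optimizers):
        for _ in range(local_steps):
            x = torch.randn(32, 8)          # replica-local batch (synthetic)
            y = x.sum(dim=1, keepdim=True)  # toy regression target
            opt.zero_grad()
            loss = loss_fn(model(x), y)
            loss.backward()
            opt.step()

    # Periodic synchronization: average corresponding parameters across
    # replicas and copy the average back into every replica.
    with torch.no_grad():
        for params in zip(*(m.parameters() for m in replicas)):
            avg = torch.stack([p.data for p in params]).mean(dim=0)
            for p in params:
                p.data.copy_(avg)
```

The intuition above maps onto the inner loop: during the `local_steps` updates, a gradient-hacking circuit inside one replica cannot observe, and so cannot precisely counteract, the updates being applied to the other replicas before the averaging step pulls them back together.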