70 steps is not very many—does training converge if you train for longer (e.g. 700, 7000, 70000)?
Also, in addition to regularization making this strategy not very effective, I’d also suspect that hyperparameter tuning would break it as well—e.g. I’d be interested in what happens if you do black-box hyperparameter tuning on the base training process’s hyperparameters after applying meta-learning (though then, to be fair to the meta-learning process, you’d also probably want to do the meta-learning in a setting with variable hyperparameters).
Thanks for the feedback! I use batch norm regularisation, but not dropout.
I just tried retraining the 100,000-cycle meta-learned model in a variety of ways, including for 10,000 steps with a 10,000× higher learning rate, using resilient backpropagation (Rprop, which adapts a per-weight step size based on the sign of each weight's gradient), and using an L2 penalty to shrink weight magnitudes. So far, nothing has gotten the network to model the base function. The L2 penalty did reduce the weights to roughly the normal range, but the network still didn't learn the base function.
I now think the increase in weight values is incidental and that the meta-learner found some other way of protecting the network from SGD.
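For reference, here's a minimal sketch of the kind of retraining loop I mean. The network architecture and hyperparameters below are illustrative stand-ins, not the actual meta-learned weights or the exact settings from my runs:

```python
import copy
import math

import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical stand-in for the meta-learned network; the real weights
# come from the meta-learning run, which isn't reproduced here.
net = nn.Sequential(
    nn.Linear(1, 32), nn.ReLU(),
    nn.Linear(32, 32), nn.ReLU(),
    nn.Linear(32, 1),
)

x = torch.linspace(-3.0, 3.0, 256).unsqueeze(1)
y = torch.sin(x)  # the base function that retraining tries to recover


def retrain(model, optimizer, steps=1000):
    """Plain full-batch retraining loop; returns the final MSE loss."""
    loss_fn = nn.MSELoss()
    for _ in range(steps):
        optimizer.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        optimizer.step()
    return loss.item()


final_losses = {}
for name, make_opt in [
    # SGD with a much higher learning rate than the base run (value illustrative).
    ("high_lr_sgd", lambda p: torch.optim.SGD(p, lr=0.1)),
    # Rprop adapts a per-weight step size from the sign of each gradient.
    ("rprop", lambda p: torch.optim.Rprop(p)),
    # L2 penalty via weight decay, to pull weight magnitudes back down.
    ("sgd_l2", lambda p: torch.optim.SGD(p, lr=0.01, weight_decay=1e-2)),
]:
    trial = copy.deepcopy(net)  # retrain a fresh copy each time
    final_losses[name] = retrain(trial, make_opt(trial.parameters()))
```

On a randomly initialized toy net all three variants fit sin(x) without trouble; the point of the experiments was that the meta-learned weights resist every one of them.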
Interesting! I’d definitely be excited to know if you figure out what it’s doing.
I checked the intermediate network activations. It turns out the meta-learned network produces all-negative activations going into the final linear layer, so the ReLU zeroes them out and the final layer outputs only its bias, regardless of the network's input.
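A toy illustration of the mechanism (contrived weights, not the actual meta-learned network): if the pre-activations feeding the last ReLU are always negative, the ReLU's output is identically zero, the final layer returns only its bias, and no gradient flows back to the earlier weights, which is one way to shield them from SGD:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Tiny hidden layer followed by a ReLU and a final linear layer.
hidden = nn.Linear(1, 16)
final = nn.Linear(16, 1)
relu = nn.ReLU()

with torch.no_grad():
    # Force all-negative pre-activations with a large negative bias
    # (a contrived stand-in for whatever the meta-learner actually found).
    hidden.bias.fill_(-100.0)

x = torch.linspace(-5.0, 5.0, 101).unsqueeze(1)
pre = hidden(x)       # all entries negative
post = relu(pre)      # identically zero
out = final(post)     # equals final.bias for every input

all_negative = bool((pre < 0).all())
constant_output = bool(torch.allclose(out, out[0].expand_as(out)))
equals_bias = bool(torch.allclose(out, final.bias.expand_as(out)))

# Dead ReLUs also block the gradient: backprop leaves the hidden
# layer's weights untouched, so SGD can't move them.
out.sum().backward()
grad_blocked = bool(torch.all(hidden.weight.grad == 0))
```

Note the gradient-blocking part: as long as the pre-activations stay negative on the training inputs, SGD only ever updates the final layer's bias.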
I’ve begun experiments with flipped base and meta functions (network initially models sin(x) and resists being retrained to model f(x) = 1).
Could you please share the results in case you ended up finishing those experiments?