I'm keen to hear how you think your work relates to "Activation plateaus and sensitive directions in LLMs". Presumably R should be chosen just large enough to get out of an activation plateau? Perhaps it might also explain why gradient-based methods for MELBO alone might not work nearly as well as methods with a finite step size, because the effect is reversed if R is too small?