I have been studying the ROME code in depth for a few months now, and I think there’s some really interesting stuff there. One of the contributors is actively working on adding things to their repo, such as integrating the functionality of nostalgebraist’s logit lens.
I’ve just read the Gradient Starvation paper. It was quite interesting and approachable. It has left me wondering what we will see if (when?) someone applies their fix (spectral decoupling) to fine-tune a pre-trained language model…
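To make the wondering concrete, here is a minimal sketch of what that might look like. As I understand the paper, Spectral Decoupling simply replaces weight decay with an L2 penalty on the logits themselves. The model choice (GPT-2 with a sequence-classification head), the λ value, and the training-step wrapper below are placeholders of mine, not anything from the paper:

```python
# A minimal sketch of Spectral Decoupling (SD) fine-tuning, assuming a
# sequence-classification head on a pretrained LM. The penalty form
# (lambda/2 * ||logits||^2, used in place of weight decay) is the one from
# the Gradient Starvation paper; model and hyperparameters are placeholders.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model = AutoModelForSequenceClassification.from_pretrained("gpt2", num_labels=2)
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token
model.config.pad_token_id = tokenizer.pad_token_id

optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)  # note: no weight decay
sd_lambda = 1e-3  # SD penalty strength (placeholder value)

def training_step(texts, labels):
    """texts: list of strings; labels: LongTensor of class indices."""
    batch = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
    logits = model(**batch).logits
    ce = torch.nn.functional.cross_entropy(logits, labels)
    sd_penalty = (logits ** 2).sum(dim=1).mean()   # per-example ||logits||^2, batch-averaged
    loss = ce + 0.5 * sd_lambda * sd_penalty
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```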
Reading the Diversify and Disambiguate: Learning from Underspecified Data paper, I find myself thinking that it is solving nearly the same problem as the Gradient Starvation paper. The Gradient Starvation fix, Spectral Decoupling, seems on its face a little simpler, more elegant, and more general. I don’t actually have the information I need to compare the two, though. I would like to see the two methods compared on things like:
If you have a completely closed test set (e.g. the future), you must make guesses about what distribution shifts you could potentially end up seeing by reasoning about the degrees of freedom of the input features and extrapolating historical trends to higher-variance regimes. How does each of these methods handle this harder but more realistic case?
How do they compare across various benchmarks in terms of metrics like accuracy, confusion / regret, compute cost, robustness to outliers, etc.?
Can you combine the methods? If so, how does the combination compare?
What about the case of trying to calibrate confidence using heuristics like: high datapoint-density regions of latent space → higher confidence, low datapoint-density regions of latent space → lower confidence, highly ambiguous (non-linearly-separable) regions of latent space → lower confidence, etc.? This is beyond the scope of these papers, so I don’t criticize them for not addressing it, but it is important for real-world application. There are various techniques, for example Virtual Outlier Synthesis https://arxiv.org/abs/2202.01197 , for addressing confidence calibration of model predictions. How well do these techniques combine with the above techniques for generating multiple hypotheses? This seems relevant to me because it impacts which of the generated hypotheses you ought to promote most highly. (A toy sketch of the density heuristic appears just after these questions.)
How can multiple hypotheses best be handled in an online learning scenario where live data is streaming in, potentially of questionable quality due to occasional corruption, and competing hypotheses must be held ready? You want a model to be able to switch cleanly between hypotheses about the underlying reality, not get stuck in some implausible mishmash. If one maze-solving hypothesis says it is best to turn left at 4-way junctions, the second-best hypothesis says it is best to turn right, and the model is making a weighted random choice, you don’t want the model to split between left and straight; you want it to split between left and right (a toy illustration appears just after these questions). I feel like the multiple heads in ‘Diversify and Disambiguate’ might be a useful tool for setting this up. I’m not sure how one would do the same thing using Gradient Starvation’s Spectral Decoupling, but it seems like it could be done somehow.
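On the calibration question above, here is a toy sketch (my own, not Virtual Outlier Synthesis or anything from either paper) of the simplest version of that density heuristic: score confidence by the distance to the k nearest training points in latent space. It only covers the density part; the ‘highly ambiguous region’ heuristic would additionally need label information.

```python
# Toy density-style confidence heuristic: confidence falls off with distance from
# the training data's latent representations. Not VOS (which instead synthesizes
# outliers in low-density latent regions during training); just the raw heuristic.
import numpy as np

def knn_confidence(train_latents: np.ndarray, query_latent: np.ndarray,
                   k: int = 10, temperature: float = 1.0) -> float:
    """Map the mean distance to the k nearest training latents to a (0, 1] score:
    small distances (dense regions) -> near 1, large distances (sparse regions) -> near 0,
    with the scale set by `temperature`."""
    dists = np.linalg.norm(train_latents - query_latent, axis=1)
    avg_knn_dist = np.sort(dists)[:k].mean()
    return float(np.exp(-avg_knn_dist / temperature))

# Usage with random stand-in latents:
rng = np.random.default_rng(0)
train_latents = rng.normal(size=(1000, 2))
print(knn_confidence(train_latents, rng.normal(size=2)))          # in a dense region
print(knn_confidence(train_latents, rng.normal(size=2) + 10.0))   # far from the training data
```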
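On the hypothesis-switching question, here is a toy illustration (my own framing, not anything from either paper) of the difference between averaging hypotheses and sampling one, with each hypothesis as a separate head proposing a steering angle so the mishmash is visible:

```python
# Averaging the heads' outputs produces the implausible mishmash (go straight);
# sampling one head per decision keeps the choice between left and right.
import numpy as np

head_angles = np.array([-90.0, +90.0])   # hypothesis 1: turn left; hypothesis 2: turn right
head_weights = np.array([0.5, 0.5])      # relative credence in each hypothesis

rng = np.random.default_rng(0)

def act_by_averaging() -> float:
    """Mishmash: the weighted average of 'left' and 'right' is 'straight' (0 degrees)."""
    return float(head_weights @ head_angles)

def act_by_sampling_head() -> float:
    """Clean switching: commit to one hypothesis per decision, weighted by credence."""
    chosen = rng.choice(len(head_angles), p=head_weights)
    return float(head_angles[chosen])

print(act_by_averaging())       # 0.0 -> goes straight, which neither hypothesis recommends
print(act_by_sampling_head())   # -90.0 or +90.0
```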
I read the loss landscape paper, and liked their visualizations. I would be so interested to see the same visualizations applied to models trained with and without the Gradient Starvation paper’s Spectral Decoupling.
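For reference, a minimal sketch of how I understand those visualizations are made: sample two random directions in weight space, scale them to the trained weights (the paper uses filter-wise normalization; the per-parameter rescaling below is a simplification of mine), and evaluate the loss on a grid around the trained point. `loss_on_dataset` is a placeholder for whatever evaluates the model on a fixed batch or dataset.

```python
# Sketch of a 2D loss-surface slice around a trained model's weights.
import torch

def random_direction(model):
    """One random tensor per parameter, rescaled to that parameter's norm."""
    direction = []
    for p in model.parameters():
        d = torch.randn_like(p)
        d *= p.norm() / (d.norm() + 1e-12)
        direction.append(d)
    return direction

def loss_surface(model, loss_on_dataset, steps=21, radius=1.0):
    """Return a (steps x steps) grid of losses around the model's current weights."""
    with torch.no_grad():
        base = [p.clone() for p in model.parameters()]
        d1, d2 = random_direction(model), random_direction(model)
        alphas = torch.linspace(-radius, radius, steps)
        surface = torch.zeros(steps, steps)
        for i, a in enumerate(alphas):
            for j, b in enumerate(alphas):
                for p, p0, u, v in zip(model.parameters(), base, d1, d2):
                    p.copy_(p0 + a * u + b * v)
                surface[i, j] = loss_on_dataset(model)
        for p, p0 in zip(model.parameters(), base):   # restore the trained weights
            p.copy_(p0)
    return surface
```

Running this once on a model trained normally and once on a model trained with Spectral Decoupling, then contour-plotting the two surfaces, is the comparison I would want to see.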
Reading the Cognitive Psych on GPT-3 paper, I find myself strongly suspecting that the correct answers GPT-3 is reported to give are due to memorized patterns. I’m curious whether repeating the large-n experiments with variations in prompt framing (e.g. ‘you are a brilliant scientist presented with this question’ vs ‘you are an average middle school student presented with this question’) would change the outcome.
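A rough sketch of the kind of framing variation I have in mind, using a small open model via Hugging Face transformers as a stand-in for GPT-3 (I don’t have the paper’s actual prompts; the question and framings below are illustrative):

```python
# Vary only the persona framing in front of a fixed question and compare answers.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

framings = [
    "You are a brilliant scientist presented with this question.",
    "You are an average middle school student presented with this question.",
]
question = ("If a bat and a ball cost $1.10 together and the bat costs $1.00 "
            "more than the ball, how much does the ball cost?")

for framing in framings:
    prompt = f"{framing}\n{question}\nAnswer:"
    out = generator(prompt, max_new_tokens=20, do_sample=False)[0]["generated_text"]
    print(framing, "->", out[len(prompt):].strip())
```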
‘Mapping language models to conceptually grounded spaces’ shows the valuable insight that GPT-3 175B can work well in this grid-world space, but the smaller versions can’t. This is interesting because I had just been wondering earlier today if a small simple ascii-grid RL environment could be handled by a large language model. The answer is: maybe, if they’re large enough. Probably not if they’re small. More details on that idea here: https://docs.google.com/document/d/1sKMI2eNgLkRdxeg7m_XsFsXhIZx5LhXFxbi7NEFLy7k/edit?usp=sharing
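For what it’s worth, here is the kind of ascii-grid prompt I have in mind (my own toy format, not the representation used in the grounded-spaces paper): render a grid-world state as text and ask the language model for the next move.

```python
# Toy ascii grid-world state rendered as a prompt for a language model.
GRID = [
    "#######",
    "#A....#",
    "#.###.#",
    "#....G#",
    "#######",
]

def build_prompt(grid):
    board = "\n".join(grid)
    return (
        "You are an agent 'A' in a maze. '#' is a wall, '.' is open floor, 'G' is the goal.\n"
        f"{board}\n"
        "Which single move (up, down, left, right) brings A closer to G? Move:"
    )

print(build_prompt(GRID))
# The resulting string could then be fed to a language model
# to see whether it proposes a sensible move.
```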