The linked abstract describes how:
[good generalization] holds across multiple patterns of label noise, even when erroneous labels are biased towards confusing classes.
I will also point to OpenAI’s weak-to-strong results, where increasingly strong students keep improving generalization given labels from a fixed-size teacher. We just don’t live in a world where this issue is an obvious lethality. (EDIT: clarifying scope)
(This also demonstrates some problems with “faithfully learning a function” as a load-bearing teleological description of deep learning. I also remember a discussion of this potential failure mode in my 2022 post on diamond alignment, but I still think I didn’t need to focus too much on it.)
[good generalization] holds across multiple patterns of label noise, even when erroneous labels are biased towards confusing classes.
Reading their experimental procedure and looking at Figures 4 & 5, it looks like their experiments confirm the general story of lethality #20, not disprove it.
The relevant particulars: when they used biased noise, they still ensured that the correct label was the most probable label. Their upper-limit for biased noise made the second-most-probable label equal in probability to the correct one, and in that case the predictor’s generalization accuracy plummeted from near-90% (when the correct label was only slightly more probable than the next-most-probable) to only ~50%.
How this relates to lethality #20: part of what “regular, compactly describable, predictable errors” is saying is that there will be (predictable) cases where the label most probably assigned by a human labeller is not correct (i.e. it’s not what a smart well-informed human would actually want if they had all the relevant info and reflected on it). What the results of the linked paper predict, in that case, is that the net will learn to assign the “incorrect” label—the one which human labellers do, in fact, choose more often than any other. (Though, to be clear, I think this experiment is not very highly relevant one way or the other.)
As for OpenAI’s weak-to-strong results...
I had some back-and-forth about those in a private chat shortly after they came out, and the main thing I remember is that it was pretty tricky to back out the actually-relevant numbers, but it was possible. Going back to the chat log just now, this is the relevant part of my notes:
Rough estimate: on the NLP task the weak model has like 60% accuracy (fig 2).
In cases where the weak model is right, the strong student agrees with it in like 90% of cases (fig 8b). So, on ~6% of cases (10% * 60%), the strong student is wrong by “just being dumb”.
In cases where the weak model is wrong, the strong student’s agreement is very compute-dependent, but let’s pick a middle number and call it 70% (fig 8c). So, on ~28% of cases (70% * 40%), the strong student is wrong by “overfitting to weak supervision”.
So in this particular case, the strong student is wrong about 34% of the time, and 28 of those percentage points are attributable to overfitting to weak supervision.
(Here “overfitting to weak supervision” is the thing where the weak supervisor is predictably wrong, and the stronger model learns to predict those errors.) So in fact what we’re seeing in the weak-to-strong paper is that the strong model learning the weak supervisor’s errors is already the main bottleneck to better ground-truth performance, in the regime that task and models were in.
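For concreteness, here is a minimal sketch of that back-of-the-envelope arithmetic, treating the tasks as binary (so disagreeing with a correct label counts as being wrong); the 60% / 90% / 70% figures are the rough values read off the paper's figures, as quoted above, not exact numbers:

```python
# Rough decomposition of strong-student errors on the NLP task, using the
# approximate numbers read off the weak-to-strong paper's figures (as quoted above).
weak_acc = 0.60               # weak supervisor accuracy (~Fig 2)
agree_when_weak_right = 0.90  # student agrees with correct weak labels (~Fig 8b)
agree_when_weak_wrong = 0.70  # student agrees with incorrect weak labels (~Fig 8c, mid value)

# Wrong by "just being dumb": the weak label was right, but the student disagrees.
dumb_errors = weak_acc * (1 - agree_when_weak_right)        # ~6% of all cases

# Wrong by "overfitting to weak supervision": the weak label was wrong,
# and the student copies it.
overfit_errors = (1 - weak_acc) * agree_when_weak_wrong     # ~28% of all cases

total = dumb_errors + overfit_errors                        # ~34% of all cases
print(f"dumb: {dumb_errors:.0%}, overfit: {overfit_errors:.0%}, total: {total:.0%}")
```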
So overall, I definitely maintain that the empirical evidence is solidly in favor of Doomimir’s story here. (And, separately, I definitely maintain that abstracts in ML tend to be wildly unreliable and misleading about the actual experimental results.)
Reading their experimental procedure and looking at Figures 4 & 5, it looks like their experiments confirm the general story of lethality #20, not disprove it.
“Confirm”? “Disprove”? Seems too aggressive, don’t you think?
Here’s your reasoning, as I understand it:
1. One experiment labels images of ones as “7” more often than “1” (using example digits here),
2. The AI learns to output “7” for images of ones if that was the majority label, and outputs “1” if that was the majority label,
3. This confirms the general story of lethality #20.
If this is accurate, I would argue that 1 and 2 do not entail 3 (as you seemed to initially claim, but then maybe backed off from in a sentence in the middle of your comment?).
Second, this is not avoidable, in a sense. As you are aware, there is no intrinsic meaning to the “outputs” of a network, there are just output slots and the English names which humans apply to those slots, and a way of comparing a slot prediction (“label”) and the assigned slot of an image.
The relevant particulars: when they used biased noise, they still ensured that the correct label was the most probable label. […]
Third, I think that the nontrivial prediction of #20 here is about “compactly describable errors.” “Mislabelling a large part of the time (but not most of the time)” is certainly a compactly describable error. You would then expect that as the probability of mistakes increased, you’d have a meaningful boost in generalization error, but that doesn’t happen. Easy Bayes update against #20.
(Here “overfitting to weak supervision” is the thing where the weak supervisor is predictably wrong, and the stronger model learns to predict those errors.) So in fact what we’re seeing in the weak-to-strong paper is that the strong model learning the weak supervisor’s errors is already the main bottleneck to better ground-truth performance, in the regime that task and models were in. [emphasis added]
As the student gets “smarter” (more compute), supervisor mistakes become less important as it learns to ignore them (Fig. 8c).
This shows that, in this instance, larger models do not increasingly overfit the “compactly describable errors” of the weaker supervisor.
And, separately, I definitely maintain that abstracts in ML tend to be wildly unreliable and misleading about the actual experimental results.
You’re free to think that, but FWIW I’d already read (and created flashcards for) the entirety of both papers when I posted my original message.
Third, the nontrivial prediction of #20 here is about “compactly describable errors.” “Mislabelling a large part of the time (but not most of the time)” is certainly a compactly describable error. You would then expect that as the probability of mistakes increased, you’d have a meaningful boost in generalization error, but that doesn’t happen. Easy Bayes update against #20. (And if we can’t agree on this, I don’t see what we can agree on.)
I indeed disagree with that, and I see two levels of mistake here. At the object level, there’s a mistake of not thinking through the gears. At the epistemic level, it looks like you’re trying to apply the “what would I have expected in advance?” technique of de-biasing, in a way which does not actually work well in practice. (The latter mistake I think is very common among rationalists.)
First, object-level: let’s walk through the gears of a mental model here. Model: train a model to predict labels for images, and it will learn a distribution of labels for each image (at least that’s how we usually train them). If we relabel 1s as 7s 20% of the time, then the obvious guess is that the model will assign about 20% probability (plus its “real underlying uncertainty”, which we’d expect to be small for large fully-trained models) to the label 7 when the digit is in fact a 1.
What does that predict about accuracy? That depends on whether the label we interpret our model as predicting is top-1, or sampled from the predictive distribution. If the former (as is usually used, and IIUC is used in the paper), then this concrete model would predict basically the curves we see in the paper: as noise ramps up, accuracy moves relatively little (especially for large fully-trained models), until the incorrect digit is approximately as probable as the correct digit, at which point accuracy plummets to ~50%. And once the incorrect digit is unambiguously more probable than the correct digit, accuracy drops to near-0.
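A minimal toy version of that mental model (my own illustration of the gears above, not the paper's actual setup; the small “real underlying uncertainty” term is ignored): the net just reproduces the noisy label distribution for each digit, and we read off its top-1 prediction.

```python
import numpy as np

# Toy model: the trained net learns the empirical (noisy) label distribution for
# images of a '1', and we score its top-1 (argmax) prediction.
def top1_label_for_a_one(relabel_frac: float) -> int:
    """Label predicted for an image of a '1' when a fraction `relabel_frac`
    of training 1s were relabelled as '7'."""
    probs = np.zeros(10)
    probs[1] = 1.0 - relabel_frac  # remaining correct labels
    probs[7] = relabel_frac        # injected noise labels
    return int(np.argmax(probs))

for frac in [0.0, 0.2, 0.4, 0.49, 0.51, 0.7]:
    print(f"relabel fraction {frac:.2f} -> top-1 prediction: {top1_label_for_a_one(frac)}")

# The prediction stays '1' (so accuracy on 1s barely moves) for any noise level
# below 50%, and flips to '7' (accuracy ~0 on this class) as soon as the wrong
# label becomes the most probable one -- the cliff described above.
```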
The point: when we think through the gears of the experimental setup, the obvious guess is that the curves are mostly a result of top-1 prediction (as opposed to e.g. sampling from the predictive distribution), in a way which pretty strongly indicates that accuracy would plummet to near-zero as the correct digit ceases to be the most probable digit. And thinking through the gears of Yudkowsky’s #20, the obvious update is that predictable human-labeller-errors which are not the most probable labels are not super relevant (insofar as we use top-1 sampling, i.e. near-zero temperature) whereas human-labeller-errors which are most probable are a problem in basically the way Yudkowsky is saying. (… insofar as we should update at all from this experiment, which we shouldn’t very much.)
Second, epistemic-level: my best guess is that you’re ignoring these gears because they’re not things whose relevance you would have anticipated in advance, and therefore focusing on them in hindsight risks bias[1]. Which, yes, it does risk bias.
Unfortunately, the first rule of experiments is You Are Not Measuring What You Think You Are Measuring. Which means that, in practice, the large majority of experiments which nominally attempt to test some model/theory in a not-already-thoroughly-understood-domain end up getting results which are mostly determined by things unrelated to the model/theory. And, again in practice, few-if-any people have the skill of realizing in advance which things will be relevant to the outcome of any given experiment. “Which things are we actually measuring?” is itself usually figured out (if it’s figured out at all) by looking at data from the experiment.
Now, this is still compatible with using the “what would I have expected in advance?” technique. But it requires that ~all the time, the thing I expect in advance from any given experiment is “this experiment will mostly measure some random-ass thing which has little to do with the model/theory I’m interested in, and I’ll have to dig through the details of the experiment and results to figure out what it measured”. If one tries to apply the “what would I have expected in advance?” technique, in a not-thoroughly-understood domain, without an overwhelming prior that the experimental outcome is mostly determined by things other than the model/theory of interest, then mostly one ends up updating in basically-random directions and becoming very confused.
[1] Standard disclaimer about guessing what’s going on inside other peoples’ heads being hard, you have more data than I on what’s in your head, etc.
The point: when we think through the gears of the experimental setup, the obvious guess is that the curves are mostly a result of top-1 prediction (as opposed to e.g. sampling from the predictive distribution), in a way which pretty strongly indicates that accuracy would plummet to near-zero as the correct digit ceases to be the most probable digit.
I think this is a reasonable prediction, but ends up being incorrect:
It decreases far faster than it should; on the top-1 theory, it should be ~flatlined for this whole graph (since for all α>0 the strict majority of labels are still correct). Certainly top-5 should not be decreasing.
This is in the data-constrained case, right?
Maybe noise makes training worse because the model can’t learn to just ignore it due to insufficient data? (E.g., making training more noisy means convergence/compute efficiency is lower.)
Also, does this decrease the size of the dataset by a factor of 5 in the uniform noise case? (Or did they normalize this by using a fixed set of labeled data and then just added additional noise labels?)
So, on ~28% of cases (70% * 40%), the strong student is wrong by “overfitting to weak supervision”.
Attributing all of these errors to overfitting implies that, if there were no overfitting, the strong student would get 100% accuracy on the subset where the weak model is wrong. But we have no reason to expect that. Instead, these errors are some mixture of overfitting and “just being dumb.”
Note that we should expect the strong and weak models to make somewhat correlated errors even when both are trained on gold labels, i.e. in the hypothetical case where overfitting to weak supervision is not possible. (The task examples vary in difficulty, the two models have various traits in common that could lead to shared “quirks,” etc.)
And indeed, when the weak and strong models use similar amounts of compute, they make very similar predictions—we see this in the upper-leftmost points on each line, which are especially noticeable in Fig 8c. In this regime, the hypothetical “what if we trained strong model on gold labels?” is ~equivalent to the weak model, so ~none of the strong model errors here can be attributed to “overfitting to weak supervision.”
As the compute ratio grows, the errors become both less frequent and less correlated. That’s the main trend we see in 8b and 8c. This reflects the strong model growing more capable, and thus making fewer “just being dumb” errors.
Fig 8 doesn’t provide enough information to determine how much the strong model is being held back by weak supervision at higher ratios, because it doesn’t show strong-trained-on-gold performance. (Fig. 3 does, though.)
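As a minimal sketch of that attribution point, with a hypothetical number standing in for the strong-trained-on-gold baseline that Fig 8 doesn't report:

```python
# How much of the naive "28% overfitting" attribution survives once we account
# for errors the strong model would make anyway? The gold-trained error rate on
# the weak-wrong subset below is HYPOTHETICAL (Fig 8 doesn't report it).
weak_acc = 0.60                 # weak supervisor accuracy (~Fig 2)
weak_wrong = 1 - weak_acc       # fraction of cases with incorrect weak labels
student_err_on_subset = 0.70    # weak-trained student wrong on that subset (~Fig 8c)
gold_err_on_subset = 0.30       # hypothetical: gold-trained strong model also wrong here

naive_attribution = weak_wrong * student_err_on_subset   # 28% of all cases

# Errors genuinely caused by weak supervision are those the student makes on this
# subset that a gold-trained model would have gotten right. Depending on how much
# the two error sets overlap, that's somewhere between:
lower = weak_wrong * max(0.0, student_err_on_subset - gold_err_on_subset)  # full overlap
upper = weak_wrong * student_err_on_subset                                 # no overlap

print(f"naive attribution: {naive_attribution:.0%}; "
      f"range after accounting for baseline errors: {lower:.0%}-{upper:.0%}")
```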
IMO the strongest reason to be skeptical of (the relevance of) these results is in Appendix E, where they show that the strong model overfits a lot when it can easily predict the weak errors.
I will also point to OpenAI’s weak-to-strong results, where increasingly strong students keep improving generalization given labels from a fixed-size teacher. We just don’t live in a world where this issue is a lethality.
For a fixed weak teacher and increasingly strong students from a fixed model stack[1], I think you can probably avoid performance ever going down on most/typical tasks if you properly use early stopping, only use process-based feedback, and the model isn’t intentionally trying to perform poorly.
You might have instead expected performance to go up and then eventually go down with scale, but I think you can likely avoid this with early stopping (if you carefully find the right stopping point using scaling laws and analogous validation domains where we can ensure we get good labels, or other methods of getting a validation signal).
If I recall, I think we also see something similar in the scaling laws for reward model overoptimization work by Leo Gao (also done at OpenAI). (I think this is probably a more analogous case in most ways than the weak-to-strong results from OpenAI as far as understanding the dynamics of fitting to human errors.)
(Let’s put aside the case where the model intentionally tries to perform poorly. (I’m not even sure this case actually looks that different, but it certainly complicates the analysis. I’m doing some work on this case looking at model organisms of intentionally poor performance, and I expect that for these exact model organisms, we’ll probably see performance going up and then back down again with scale in at least some cases.))
(To be clear, I don’t think this “performance never goes down with correct early stopping” claim is totally obvious. It will depend on the exact rate at which AIs learn to predict errors vs. learn what the task is and how to do it, and on how these rates evolve with scale. If the sigmoid on error-learning rate vs. scale has a different midpoint and different slope than the sigmoid for learning the task, you can absolutely have actual performance go down.)
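A toy illustration of that last point (purely a sketch, with made-up sigmoid parameters): if “fraction of the task learned” and “fraction of supervisor errors learned” are both sigmoids in log-scale with different midpoints, measured-against-ground-truth performance can rise, peak, and then fall back toward the supervisor’s accuracy.

```python
import numpy as np

def sigmoid(x, midpoint, slope):
    return 1.0 / (1.0 + np.exp(-slope * (x - midpoint)))

# Made-up toy: with scale, the student learns the task early and learns to
# imitate the supervisor's systematic errors later.
log_scale = np.linspace(0, 10, 101)
task_learned = sigmoid(log_scale, midpoint=3.0, slope=1.2)
errors_learned = sigmoid(log_scale, midpoint=6.0, slope=1.0)

supervisor_acc = 0.60
# Crude ground-truth accuracy: on supervisor-correct cases the student needs to
# have learned the task; on supervisor-wrong cases it also needs to NOT have
# learned to copy the supervisor's error.
ground_truth = (supervisor_acc * task_learned
                + (1 - supervisor_acc) * task_learned * (1 - errors_learned))

peak_idx = int(np.argmax(ground_truth))
print(f"peak accuracy {ground_truth[peak_idx]:.2f} at log-scale {log_scale[peak_idx]:.1f}; "
      f"accuracy at max scale {ground_truth[-1]:.2f} (supervisor: {supervisor_acc:.2f})")
# With these parameters accuracy rises, peaks around the middle of the scale
# range, then declines back toward ~60% as the student fits the supervisor's
# mistakes -- i.e. performance goes up and then back down with scale.
```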
While I think you can avoid having the performance ever degrade via process-based feedback using scaling laws on most/typical tasks, I think the performance will also plateau due to how early you’ll end up needing to stop.
More concerningly, there might be alternatives to purely process-based human labels which don’t plateau in performance, seem to increase performance, but which rarely result in egregiously bad behavior. Most notably, various types of outcome-based feedback might have this property.
As an example of this, I expect that you can create “hackable/exploitable” game environments to exhibit this. More specifically:
We’ll pretrain models on a collection of hackable game envs. We’ll train a model stack of variable training compute.
We’ll finetune these models in a new different (but still hackable) atari env where we expect to see transfer from the prior atari envs.
It seems likely to me that, as models get smarter, if exploiting is ultimately a better strategy, final finetuned performance goes down even with early stopping.
You might be able to see this on some atari games with added semi-realistic exploits? I’m unsure.
[1] As in, you just vary the training compute and set all other values optimally based on this.
(An aside: I edited my original comment to clarify that I was saying “These results show that #20 is not some obvious lethality which merits the name”, but I still certainly think that labeling mistakes can and will make some things go meaningfully wrong.)