On the additional commentary section:

On the first section, we disagree on the degree of similarity in the metaphors.
I agree with you that we shouldn’t care about ‘degree of similarity’ and should instead build causal models. I think our actual disagreements here lie mostly in those causal models, and unpacking them goes beyond the scope of a comment. I agree with the very non-groundbreaking insights listed, of course, but that’s not what I’m getting out of it. It is possible that a lot of what I’m thinking of as evolutionary evidence, you’re thinking of as coming from another source, or it is already in your model in another form to the extent you buy the argument (which, I am guessing, you often don’t).
On the difference in SLT meanings, what I meant to say was: I think this is sufficient to cause our alignment properties to break.
In case it is not clear: My expectation is that sufficiently large advances in capabilities/intelligence/affordances inherently break our desired alignment properties under all known techniques.
On the passage you find baffling: Ah, I do think we had some confusion about what we each meant by ‘inner optimizer’, and I’m likely still conflating the two somewhat. That doesn’t change the fact that I don’t find this heartening, though? As in, we’re going to see rapid, large changes both in the inner optimizer’s power (in all senses) and in the nature and amount of training data, and we agree that changing the details of the training data changes alignment outcomes dramatically.
On the impossible-to-you world: This doesn’t seem so weird or impossible to me? And I think I can tell a pretty easy cultural story slash write an alternative-universe novel where we honor those who maximize genetic fitness and all that, and have for a long time—and that this could help explain why civilization and our intelligence developed so damn slowly. Although to truly make the full evidential point, that world then has to be weirder still, with humans much more reluctant to mode shift in various ways. It’s also possible this points to you having already accepted, from other places, the evidence I think evolution introduces, so you’re confused about why people keep citing it as evidence.
The comment in response to parallels provides some interesting thoughts and I agree with most of it. The two concrete examples are definitely important things to know. I still notice the thing I noticed in my comment about the parallels—I’d encourage thinking about what similar logic would say in the other cases?
On the impossible-to-you world: This doesn’t seem so weird or impossible to me? And I think I can tell a pretty easy cultural story slash write an alternative-universe novel where we honor those who maximize genetic fitness and all that, and have for a long time—and that this could help explain why civilization and our intelligence developed so damn slowly. Although to truly make the full evidential point, that world then has to be weirder still, with humans much more reluctant to mode shift in various ways. It’s also possible this points to you having already accepted, from other places, the evidence I think evolution introduces, so you’re confused about why people keep citing it as evidence.
The ability to write fiction in a world does not demonstrate its plausibility. Beware generalizing from fictional fictional evidence!
The claim that such a world is impossible is a claim that, were you to try to write a fictional version of it, you would run into major holes in the world that you would have to either ignore or paper over with further unrealistic assumptions.
In case it is not clear: My expectation is that sufficiently large advances in capabilities/intelligence/affordances inherently break our desired alignment properties under all known techniques.
Nearly every piece of empirical evidence I’ve seen contradicts this—more capable systems are generally easier to work with in almost every way, and the techniques that worked on less capable versions apply straightforwardly, and in fact usually work better than they did on the less intelligent systems.
Presumably you agree this would become false if the system were deceptively aligned or otherwise scheming against us? Perhaps the implicit claim is that we should generalize from current evidence toward thinking that deceptive alignment is very unlikely?
I also think it’s straightforward to construct cases where Goodharting implies that applying the technique you used for a less capable model to a more capable model results in worse performance for the more capable model. It should be straightforward to construct such a case using the scaling laws for reward model overoptimization.
(That said, I think if you vary the point of early stopping as models get more capable, then you likely get strict performance improvements on most tasks. But regardless, there is a pretty reasonable technique of “train for duration X” which clearly gets worse performance in realistic cases as you go toward more capable systems.)
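To make the Goodharting point concrete, here is a minimal sketch (in Python, with made-up constants) of how a fixed “train for duration X” recipe can degrade as capability grows. It assumes the qualitative gold-reward-versus-optimization-pressure shape from Gao et al.’s “Scaling Laws for Reward Model Overoptimization” (true reward rises and then falls as the policy moves further from its initialization), plus a purely hypothetical mapping from training steps and capability to optimization pressure; nothing here is fit to real data.

```python
# Illustrative sketch only: made-up constants, not fit to real data.
# Gold (true) reward as a function of optimization pressure d, using the shape
# of the RL curve in Gao et al. (2022), "Scaling Laws for Reward Model
# Overoptimization":  R_gold(d) = d * (alpha - beta * log d).

import math

ALPHA, BETA = 1.0, 0.2  # illustrative proxy-vs-gold divergence parameters


def gold_reward(d: float) -> float:
    """True reward as a function of optimization pressure d (rises, then falls)."""
    return 0.0 if d <= 0 else d * (ALPHA - BETA * math.log(d))


def pressure(steps: int, capability: float) -> float:
    """Hypothetical assumption: more capable models move further from init per step."""
    return capability * math.sqrt(steps)


FIXED_STEPS = 3000  # the "train for duration X" recipe, roughly tuned for the weakest model

for capability in (1.0, 2.0, 4.0, 8.0):
    fixed = gold_reward(pressure(FIXED_STEPS, capability))
    # Alternative recipe: re-tune the early-stopping point for each model.
    best_step = max(range(1, 5001), key=lambda s: gold_reward(pressure(s, capability)))
    tuned = gold_reward(pressure(best_step, capability))
    print(f"capability {capability:>3}: fixed-X gold reward = {fixed:7.2f}, "
          f"tuned early stop = {tuned:6.2f} (stop at step {best_step})")
```

Under these assumptions, the fixed-duration recipe lands further past the gold-reward peak as capability increases, so its true reward falls, while re-tuning the stopping point for each model recovers the peak every time, matching the parenthetical above.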