Claim 3: If you don’t control the dataset, it mostly doesn’t matter what pretraining objective you use (assuming you use a simple one rather than e.g. a reward function that encodes all of human values); the properties of the model are going to be roughly similar regardless.
Analogous claim: since any program specifiable under UTM U1 is also expressible under UTM U2, choice of UTM doesn’t matter.
And this is true up to a point: up to constant factors, it doesn’t matter. But U1 can make it easier (simpler, faster, etc.) to specify a set of programs than U2 does. And so “there exists a program in U2-encoding which implements P in U1-encoding” doesn’t get everything I want: I want to reason about the distribution of programs, about how hard it tends to be to get programs with desirable properties.
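To spell out the “up to constant factors” point: the invariance theorem says that for any two universal machines U1 and U2 there is a constant c_{U1,U2}, depending only on the machines and not on the string being described, such that
K_{U2}(x) ≤ K_{U1}(x) + c_{U1,U2} for all x.
That pins things down only up to an additive constant in description length (equivalently, a constant multiplicative factor on the corresponding universal priors), and nothing stops that constant from being large enough to substantially reorder which programs are cheap to specify.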
Stepping out of the analogy, even though I agree that “reasonable” pretraining objectives are all compatible with aligned / unaligned / arbitrarily behaved models, this argument seems to leave open the possibility that some objectives make alignment far more likely a priori. And you may be noting as much:
(This is probably the weakest argument in the chain; just because most of the influence comes from the dataset doesn’t mean that the pretraining objective can’t have influence as well. I still think the claim is true though, and I still feel pretty confident about the final conclusion in the next claim.)
Yeah, I agree with all this. I still think the pretraining objective basically doesn’t matter for alignment (beyond being “reasonable”) but I don’t think the argument I’ve given establishes that.
I do think the arguments in support of Claim 2 are sufficient to at least raise Claim 3 to attention (and thus Claim 4 as well).
Sure.
Additional note for posterity: when I talked about “some objectives [may] make alignment far more likely”, I was considering something like “given this pretraining objective and an otherwise fixed training process, what is the measure of data-sets in the N-datapoint hypercube such that the trained model is aligned?”, perhaps also weighting by ease of specification in some sense.
what is the measure of data-sets in the N-datapoint hypercube such that the trained model is aligned?”, perhaps also weighting by ease of specification in some sense.
You’re going to need the ease of specification condition, or something similar; else you’ll probably run into no-free-lunch considerations (at which point I think you’ve stopped talking about anything useful).
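One minimal way to write down the quantity in question (a sketch only; “train”, the base measure, and the weighting w are placeholders, not anything fixed in this thread):
μ_w(obj) = Σ_D w(D) · 1[train(obj, D) is aligned], with D ranging over the N-datapoint hypercube and Σ_D w(D) = 1.
Taking w uniform gives the raw measure; the “ease of specification” condition would swap in something like a simplicity weighting, e.g. w(D) ∝ 2^{-K(D)}, since under the uniform w almost every dataset is algorithmically random and the quantity says little about any training setup you’d actually run into; that is one reading of the no-free-lunch worry above.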