This feels kinda unrealistic for the kind of pretraining that’s common today, but so does actually learning how to do needle-moving alignment research just from next-token prediction. If we condition on the latter, it seems kinda reasonable to imagine there must be cases where an AI has to be able to do needle-moving alignment research in order to improve at next-token prediction, and this feels like a reasonable way that might happen.
I’m not too impressed with gut feelings about what “seems kinda reasonable.” By GPT-2 Nate already should have been updating towards the proposition that big bags of heuristics, composed together, can do human-level cognitive labor, with no need for them to be sneaking out in the dead of night to do internal symbolic reasoning in some Fodorian language of thought, or only be doing that cognitive work as part of a long-term plan, or hack their computers to acquire more resources.
I think there are dangers related to which technologies are close to other technologies; if you develop an alignment research assistant that can do original research, someone is already developing a general-purpose research assistant that can do original research on AI design, and a third person is probably working on using the architecture to train an agent that navigates the real world. But I think it’s the wrong model of the world to think that a Jan Leike-style research assistant must inherently be trying to navigate the real world.
By GPT-2 Nate already should have been updating towards the proposition that big bags of heuristics, composed together, can do human-level cognitive labor
Yeah, I think Nate doesn’t buy this (even for much more recent systems such as GPT-3.5/GPT-4, much less GPT-2). To the extent that [my model of] Nate thinks that LLMs/LLM-descended models can do useful (“needle-moving”) alignment research, he expects those models to also be dangerous (hence the talk of “conditioning on”); but [my model of] Nate mostly denies the antecedent. Being willing to explore counterfactual branches on your model (e.g. for the purpose of communication, as was the case here, with Holden) doesn’t mean you stop thinking of those branches as counterfactual!
Or, perhaps more directly:
By GPT-2 Nate already should have been updating towards the proposition that big bags of heuristics, composed together, can do human-level cognitive labor, with no need for them to be sneaking out in the dead of night to do internal symbolic reasoning
I think Nate would argue that [a significant subset of] human-level cognitive labor in fact does require “sneaking out in the dead of night to do internal symbolic reasoning”. Humans do that, after all! To the extent that GPT-2 does not do this, it accomplishes [not hacking its hardware/not seeking power or resources/not engaging in “CIS” behavior] primarily by not being very good at cognition.
It’s clear enough that you disagree with Nate about something, but (going by your current comments) I don’t think you’ve located the source of the disagreement. E.g. what you write here in your top-level comment straight up doesn’t apply to Nate’s model, AFAICT:
GPT-4 probably has enough engagement with the hardware that you could program something that acquires more computer resources using the weights of GPT-4. But it never stumbled on such a solution in training, in part because in gradient descent the gradient is calculated using a model of the computation that doesn’t take hacking the computer into account.
I don’t think Nate would have predicted that GPT-4 would (or could) hack its hardware, because [my model of] Nate keeps track of a conceptual divide between useful/dangerous (“CIS”) cognition and cognition that is neither useful nor dangerous, and Nate would not have predicted GPT-4 to cross that divide. (I personally think this divide is a little weird, which I intend to explore further in a different comment, but: presuming the divide or something like it, the rest of Nate’s view feels quite natural to me.) Presuming that his model should have predicted that GPT-4 would hack its hardware or do something along those lines, and then criticizing his model on the basis of that failed prediction (which it did not, in fact, make), strikes me as sneaking a couple of assumptions of your own into [your model of] his model.