By GPT-2 Nate already should have been updating towards the proposition that big bags of heuristics, composed together, can do human-level cognitive labor
Yeah, I think Nate doesn’t buy this (even for much more recent systems such as GPT-3.5/GPT-4, much less GPT-2). To the extent that [my model of] Nate thinks that LLMs/LLM-descended models can do useful (“needle-moving”) alignment research, he expects those models to also be dangerous (hence the talk of “conditioning on”); but [my model of] Nate mostly denies the antecedent. Being willing to explore counterfactual branches on your model (e.g. for the purpose of communication, as was the case here, with Holden) doesn’t mean you stop thinking of those branches as counterfactual!
Or, perhaps more directly:
By GPT-2 Nate already should have been updating towards the proposition that big bags of heuristics, composed together, can do human-level cognitive labor, with no need for them to be sneaking out in the dead of night to do internal symbolic reasoning
I think Nate would argue that [a significant subset of] human-level cognitive labor in fact does require “sneaking out in the dead of night to do internal symbolic reasoning”. Humans do that, after all! To the extent that GPT-2 does not do this, it accomplishes [not hacking its hardware/not seeking power or resources/not engaging in “CIS” behavior] primarily by not being very good at cognition.
It’s clear enough that you disagree with Nate about something, but (going by your current comments) I don’t think you’ve located the source of the disagreement. E.g. what you write here in your top-level comment straight up doesn’t apply to Nate’s model, AFAICT:
GPT-4 probably has enough engagement with the hardware that you could program something that acquires more computer resources using the weights of GPT-4. But it never stumbled on such a solution in training, in part because in gradient descent the gradient is calculated using a model of the computation that doesn’t take hacking the computer into account.
I don’t think Nate would have predicted that GPT-4 would (or could) hack its hardware, because [my model of] Nate keeps track of a conceptual divide between useful/dangerous (“CIS”) cognition and non-useful/non-dangerous cognition, and Nate would not have predicted GPT-4 to cross that divide. (I personally think this divide is a little weird, which I intend to explore further in a different comment, but: presuming the divide or something like it, the rest of Nate’s view feels quite natural to me.) Assuming that his model should have predicted that GPT-4 would hack its hardware (or do something along those lines), and then criticizing his model for a failed prediction it did not, in fact, make, strikes me as sneaking a couple of assumptions of your own into [your model of] his model.
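(As a small aside on the quoted gradient-descent point, here’s a minimal sketch of the mechanical sense in which “hacking the computer” isn’t something training can select for: the gradient is taken only through the model’s recorded forward computation. This is purely illustrative; the tiny linear model, toy data, and PyTorch boilerplate are stand-ins of my own, not anyone’s actual training setup.)

```python
# Minimal sketch (illustrative only): the gradient flows only through the
# model's own forward computation, so "actions" outside that graph --
# e.g. acquiring more compute -- are not directions the optimizer can move in.
import torch
import torch.nn as nn

model = nn.Linear(16, 1)                    # toy stand-in for "the weights"
opt = torch.optim.SGD(model.parameters(), lr=0.1)

x = torch.randn(32, 16)                     # toy batch
y = torch.randn(32, 1)

for _ in range(100):
    opt.zero_grad()
    loss = nn.functional.mse_loss(model(x), y)
    loss.backward()                         # gradient of loss w.r.t. weights,
                                            # computed via the recorded forward graph
    opt.step()                              # update moves only along that gradient

# Anything the surrounding process does (file I/O, spawning jobs, "hacking
# the computer") never appears in the computation graph, so it contributes
# exactly zero to the gradient and is never selected for by this procedure.
```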