I basically agree (and we discuss this a bit in the appendix section “Cases where we reach qualitatively wildly superhuman performance in a dangerous domain as a side effect of training for usefulness”).
However, it seems worth noting that it might be relatively easy for humans to model this in an ok-ish manner by having access to the Internet, having multiple people work together, and using GPT-4 in some cases, and then thinking about the possible gaps. Then, if they think of plans which can utilize very broad thinking, they could try to train the model to execute on these sorts of plans.
One reasonable objection here is that this team of people might be relatively less good at thinking about how to use broad knowledge and skills compared to the model (even if the team can in principle emulate the range of skills). If this were true, then the model could think of approaches which the human team wouldn’t think of (which would cause issues unless we could somehow elicit this with RL). But it’s somewhat unclear if this would be true: while the model is trained such that it possesses very broad knowledge and skills, it’s not clear that by default training will result in it learning to utilize these abilities much better than a human team. For instance, current GPT-4 is quite bad at exploiting its unique advantages. So, especially if we don’t need models to be too good at utilizing their superhuman breadth, we might be able to avoid running into issues here.
Agreed. I don’t think this is a fatal objection, just an example of the sort of careful thinking that would be required in order to exert significant control over an LLM that is not generally superhuman, but does have a skill profile that is far broader than any individual human’s: more comparable to a very large team of humans from which you can draw appropriate specialists, or who can brainstorm together.
I strongly agree that a bunch of careful thinking and red teaming will likely be required for control to work on early transformatively useful AI.