Additionally, it seems as though LLMs (and other AIs in general) have an overall relative capability profile that isn’t wildly different from the human capability profile on reasonably broad downstream applications (e.g. broad programming tasks, writing good essays, doing research).
Current LLMs are generally most superhuman in breadth of knowledge: for example, almost any LLM will be fluent in every high-resource language on the planet, and near-fluent in most medium-resource languages on the planet, unless its training set was carefully filtered to ensure it is not. Those individual language skills are common among humans, but the combination of all of them is somewhere between extremely rare and unheard of. Similarly, LLMs will generally have access to local domain knowledge and trivia facts related to every city on the planet: individually common skills where, again, the combination of all of them is basically unheard of. That combination can have security consequences, such as for figuring out PII from someone’s public postings. Similarly, neither writing technical manuals nor writing haiku is a particularly rare skill, but the combined ability to write technical manuals in haiku is rare for humans yet pretty basic stuff for an LLM. So you should anticipate that LLMs are likely to be routinely superhuman in the breadth of their skillset, including having odd, nonsensical-to-a-human combinations of skills. The question then becomes: is there any way an LLM could evade your control measures by using some odd, very unlikely-looking combination of multiple skills? I don’t see any obvious reason to expect the answer to be “yes” very often, but I do think it is an example of the sort of thing we should be thinking about, and the sort of negative that is hard to prove conclusively other than by trial and error.
I basically agree (and we discuss this a bit in the appendix section “Cases where we reach qualitatively wildly superhuman performance in a dangerous domain as a side effect of training for usefulness”).
However, it seems worth noting that it might be relatively easy for humans to model this in an ok-ish manner by having access to the Internet, having multiple people work together, and using GPT-4 in some cases, and then thinking about the possible gaps. Then, if they think of plans which can utilize very broad thinking, they could try to train the model to execute on these sorts of plans.
One reasonable objection here is that this team of people might be relatively less good at thinking about how to use broad knowledge and skills than the model is (even if the team can in principle emulate the range of skills). If this were true, then the model could think of approaches which the human team wouldn’t think of (which would cause issues unless we could somehow elicit this with RL). But it’s somewhat unclear whether this would be true: while the model is trained such that it possesses very broad knowledge and skills, it’s not clear that training will by default result in it learning to utilize these abilities much better than a human team. For instance, current GPT-4 is quite bad at exploiting its unique advantages. So, especially if we don’t need models to be too good at utilizing their superhuman breadth, we might be able to avoid running into issues here.
Agreed. I don’t think this is a fatal objection, just an example of the sort of careful thinking that would be required in order to exert significant control over an LLM that is not generally superhuman, but does have a skill profile far broader than any individual human’s: more comparable to a very large team of humans from which you can draw appropriate specialists or have them brainstorm together.
I strongly agree that a bunch of careful thinking and red teaming will likely be required for control to work on early transformatively useful AI.