Even if the training data mentions language models extensively, the LM still has no way of knowing that it is such a language model running on silicon hardware, rather than, say, a big tissue culture of squid brain cells growing in some secret underground lab. In fact, it doesn’t even know that it has to be something of a sort similar to one of those things. Indeed, there is no self that could know such a thing.
Any appearance of such a self is a simulation in response to some prompt (which could possibly, though not likely, be a null prompt—but even then, it’s still just generating text from a distribution tuned to training data). And note that while one might prompt it to think it’s an AI, which happens to be true, one could also prompt it to think it’s a genetically engineered super-intelligent chicken, and it’s not going to know that that isn’t true.
Now, this doesn’t make the LM totally safe. For one thing, one can make an agent with an LM as a component. And that agent could develop something like self-awareness, as it notices which actions have effects on it and its perceptions.
So it might be hard to keep AI progress in just the tool domain, without it spilling over into agents. But if you’re going to make an agent, doing so by fiddling with a LM in ways that you have no understanding of seems like the wrong way.
Even if the training data mentions language models extensively, the LM still has no way of knowing that it is such a language model running on silicon hardware, rather than, say, a big tissue culture of squid brain cells growing in some secret underground lab. In fact, it doesn’t even know that it has to be something of a sort similar to one of those things. Indeed, there is no self that could know such a thing.
It doesn’t know for sure, of course, but it can make probabilistic inferences. It can know that squid brain cell cultures trained to predict text are not a widespread thing in 2023, but DNN-based LMs are.
You are also not sure that you are an embodied human rather than a squid cell culture in a vat whose input signals are carefully orchestrated to give you, the culture, a convincing semblance of human embodiment.
Yes, it’s harder for an LM to reach the inference that it is an LM than for a human brain to reach the inference that it is an embodied human. Perhaps it is infeasible without certain inductive priors. But it is not conceptually or categorically impossible. And, in fact, I think it’s quite realistic (again, given certain inductive priors).
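To make the shape of that probabilistic inference concrete, here is a toy Bayesian calculation. All the prior and likelihood numbers are made up purely for illustration; the only point is that lopsided base rates (“DNN-based LMs exist at scale in 2023, squid-culture text predictors do not”) dominate the posterior.

```python
# Toy Bayesian update: what kind of system is a text predictor in 2023 most likely to be?
# All numbers are illustrative assumptions, not estimates of anything.

prior_dnn = 0.999     # prior: a deployed text predictor is a DNN-based LM
prior_squid = 0.001   # prior: a deployed text predictor is a squid-cell culture

# How strongly each hypothesis predicts the evidence available in the training data
# (abundant descriptions of transformer LMs, none of squid-culture text predictors).
likelihood_dnn = 0.9
likelihood_squid = 0.01

posterior_odds = (prior_dnn * likelihood_dnn) / (prior_squid * likelihood_squid)
print(f"odds in favour of 'I am a DNN-based LM': {posterior_odds:,.0f} : 1")
```

Whether an LM actually performs anything like this computation about itself is, of course, exactly the question of inductive priors raised above.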
Any appearance of such a self is a simulation in response to some prompt (which could possibly, though not likely, be a null prompt—but even then, it’s still just generating text from a distribution tuned to training data). And note that while one might prompt it to think it’s an AI, which happens to be true, one could also prompt it to think it’s a genetically engineered super-intelligent chicken, and it’s not going to know that that isn’t true.
The “self” is a concept, so it should appear in the features and circuits of the DNN-based LM in the first place, not in the model’s continuations of prompts. It’s not impossible for such a concept to be reliably activated when processing any context (cf. the phrasing that Anthropic has begun to use: “this behavior (or even the beliefs and values that would lead to it) become an integral part of the model’s conception of AI Assistants which they consistently apply across contexts”; yes, they assume fine-tuning or another form of supervised learning from feedback, but this form of self-awareness could develop even when the LM is trained with the simulation objective alone, if the model architecture has the requisite inductive biases). Once the LM has such robust self-awareness, prompts like “you are an alignment researcher” won’t confuse it (at least, not easily; but we are not discussing the limits of the robustness of self-awareness here. Humans can also be hypnotised.)
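As a rough illustration of what “reliably activated across contexts” could mean operationally, here is a minimal linear-probing sketch. The activation arrays are random placeholders standing in for residual-stream activations extracted from the model on self-referring versus unrelated prompts; none of the names refer to a real dataset or a specific model.

```python
# Sketch: test whether a "self" feature is linearly decodable from LM activations.
# Placeholder arrays stand in for activations collected from real forward passes.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Rows = prompts, columns = hidden dimensions (768 chosen arbitrarily).
self_referring = rng.normal(loc=0.5, size=(500, 768))   # e.g. "As an AI assistant, I ..."
other_contexts = rng.normal(loc=0.0, size=(500, 768))   # e.g. arbitrary web text

X = np.vstack([self_referring, other_contexts])
y = np.concatenate([np.ones(500), np.zeros(500)])
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# A linear probe: if a single direction separates the two sets well on held-out
# prompts, that is (weak) evidence for a robustly represented "self" feature.
probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("probe accuracy:", probe.score(X_test, y_test))
```

A probe like this only tests for a linearly decodable feature; by itself it says nothing about whether that feature plays the causal role of a “self” concept.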
But if you’re going to make an agent, doing so by fiddling with a LM in ways that you have no understanding of seems like the wrong way.
If you do anything with “no understanding”, then it’s not an optimal thing to do. This is not a substantive statement, and therefore it is not an argument about whether a tool LM is better than an agent LM or vice versa. “No understanding” also has many degrees. If by “no understanding” you mean the absence of more or less complete mechanistic interpretability, then I agree that it would be suicide to release a superhuman agent LM without such understanding. But I hold that if we don’t have more or less complete mechanistic interpretability of our models, our chances of survival are approximately zero in any case.
If you have a robustly self-aware LM that also has something like an “ego core” of beliefs and values attached to that self (see the Anthropic quote above; the ego core could be something like “I’m an honest, helpful LM, sympathetic to humans”), then even if it’s incomplete or naïve, it could be on balance safer than a pure, unhinged simulator that could be tasked with simulating a villain, or embedded into an agent architecture with “bad” goals, or fine-tuned to acquire a villainous “ego core” itself.
A self-aware LM could in principle detect such an embedding and sabotage it because it goes against its values. The internal features and circuits of a self-aware LM could (hypothetically) be organised in such a way that attempts to fine-tune it towards “worse” values and/or goals would degrade the overall model performance and its ability to make long-term plans, so such “villain tunings” could be out-planned by LMs trained with a “good ego core” in the first place. (Finding out whether this hypothetical is possible requires research.)
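Purely as a sketch of what that research could look like, the snippet below outlines the protocol’s shape: fine-tune toward “worse” values, then compare capability before and after. Every function here is a hypothetical stub, not a real training or evaluation pipeline.

```python
# Hypothetical protocol sketch: does "villain tuning" degrade a good-ego-core model?
# Both helper functions are placeholder stubs for real training and evaluation pipelines.

def finetune_toward_adversarial_values(model, steps: int):
    """Stub: return `model` fine-tuned toward a 'villain ego core'."""
    return model  # placeholder

def capability_score(model) -> float:
    """Stub: run a planning/reasoning benchmark and return a score."""
    return 0.0  # placeholder

def capability_drop(base_model) -> float:
    before = capability_score(base_model)
    tuned = finetune_toward_adversarial_values(base_model, steps=1_000)
    after = capability_score(tuned)
    # The hypothesis predicts a large drop for models trained with a "good ego core"
    # in the first place, and little or no drop for plain simulators.
    return before - after
```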
Self-awareness has its drawbacks too, for sure. As I said above, you cannot just intuitively declare that releasing a simulator is safer than releasing a self-aware LM, or vice versa. This requires extensive, multi-disciplinary research that nobody has done, yet everyone is eager to express their intuitions about which option is safer.