One speculative way I see it, which I have yet to expand on, is that GPT-N could, in order to minimize prediction error in non-trivial settings during training, simulate some sort of entity enacting some reasoning. In a sense, GPT would be a sort of actor interpreting a play through extreme method acting. I have in mind something like what the protagonist of “Pierre Menard, Author of the Quixote” tries to do to replicate the book Don Quixote word by word.
This would mean that, for some set of strings, GPT-N would boot up and run some agent A upon seeing string S, just because “being” that agent performed well on similar training strings. This agent, if complex and capable enough (which it may need to be, if that is what was required to predict previous data), could be a dangerous and possibly malicious agent, perhaps keeping itself stable through the careful placement of answer tokens.
And, of course, sequence modeling as a paradigm can also be used for RL training.
I feel like there is a failure mode in this line of thinking, that being ‘confusingly pervasive consequentialism’. AI x-risk is concerned with a self-evidently dangerous object, that being superoptimizing agents. But whenever a system is proposed that is possibly intelligent without being superoptimizing, an argument is made, “well, this thing would do its job better if it was a superoptimizer, so the incentives (either internal or external to the system itself) will drive the appearance of a superoptimizer.” Well, yes, if you define the incredibly dangerous thing as the only way to solve any problem, and claim that incentives will force that dangerous thing into existence even if we try to prevent it, then the conclusion flows directly from the premise. You have to permit the existence of something that is not a superoptimizer in order to solve the problem. Otherwise you are essentially defining a problem that, by definition, cannot be solved, and then waving your hands saying “There is no solution!”
I believe I understand your point, but there are two things I need to clarify that kind of bypass some of these criticisms:
a) I am not assuming any safety technique applied to language models. In a sense, this is the worst-case scenario, one thing that may happen if the language model is run “as-it-is”. In particular, the scenario I described would be mitigated if we could stop stable sub-agents from appearing in language models, although I do not know how to do this.
b) The incentives for the language model to be a superoptimizer don’t necessarily need to be that strong, if we consider that we could have many instantiations of GPT-N being used, and only one of them needs to be the kind of stable malicious agent I tried (and probably failed) to describe. One of these stable agents would only need to appear once, in some setting where it can both stabilize itself (maybe through carefully placed prompts) and gain some power to cause harm in the world. If we consider the language model being deployed the way GPT-3 is, across many different scenarios, this becomes a weaker assumption.
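A rough sketch of the arithmetic behind point (b), under strong simplifying assumptions that are not in the argument itself: each instantiation is treated as independent, with the same small per-instantiation probability `p` of producing a stable agent. The function name and the numbers below are purely hypothetical and only illustrate how "it only needs to happen once" scales with the number of deployments.

```python
def prob_at_least_one(p: float, n: int) -> float:
    """Chance that at least one of n independent instantiations
    produces the event, given per-instantiation probability p.

    Independence and a shared p are illustrative assumptions.
    """
    return 1.0 - (1.0 - p) ** n


# Hypothetical numbers: a one-in-a-million chance per instantiation.
for n in (1, 1_000, 1_000_000):
    print(n, prob_at_least_one(1e-6, n))
```

Under these assumptions, a per-instantiation chance that looks negligible stops being negligible once the number of instantiations is comparable to 1/p, which is the sense in which wide deployment weakens the assumption.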
That being said, I agree with your general criticism of failing to imagine that intelligent but non-superoptimizing agents are possible, although whether superoptimizers are attractors for generally intelligent agents, and under which conditions, is an open (and crucially important) question.
I posted a somewhat similar response to MSRayne, with the exception that what you accidentally summon is not an agent with a utility function, but something that tries to appear like one and nevertheless tricks you into making some big mistake.
Here, what you get is a genuine agent which works across prompts by having some internal value function which outputs a different value after each prompt, and acts accordingly, if I understand correctly. It doesn’t seem incredibly unlikely: nothing in the process of evolution necessarily had to make humans themselves optimizers, but it happened anyway because that is what performed best at the overall goal of reproduction. This AI will still probably have to somehow convince the people communicating with it to give it “true” agency independent of the user’s inputs. That seems like an instrumental value in this case.