The thing you want to point to is “make the decisions that humans would collectively want you to make, if they were smarter, better informed, had longer to think, etc.” (roughly, Coherent Extrapolated Volition, or something comparable). Even managing to just point to “make the same decisions that humans would collectively want you to make” would get us way past the “don’t kill everyone” minimum threshold, into moderately good alignment, and well into the regions where alignment has a basin of convergence.
Any AGI built in the next few years is going to contain an LLM trained on trillions of tokens of human-generated text, so it will have learned excellent, detailed world models of human behavior and psychology. An LLM’s default base-model behavior (before fine-tuning) is to select a human psychology based on the prompt and then model it well enough to emit the same tokens (and thus make the same decisions) that person would. As such, pointing it at “what decision would humans collectively want me to make in this situation” really isn’t that hard. You don’t even need to locate the detailed world models inside it; you can do all of this with a natural-language prompt, since LLMs handle natural-language pointers just fine.
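To make that concrete, here is a minimal sketch of what such a natural-language pointer looks like in practice. The `call_llm` function is a hypothetical placeholder for whatever LLM client you are actually using, and the prompt wording is purely illustrative, not a tested recipe.

```python
# Minimal sketch of pointing an LLM at "what humans would collectively want"
# via a plain natural-language prompt. `call_llm` is a hypothetical placeholder:
# swap in whichever LLM client you actually use.

def call_llm(prompt: str) -> str:
    raise NotImplementedError("replace with a call to your LLM client")

POINTER = (
    "Using your models of human values, behavior, and psychology, make the "
    "decision that humans would collectively want you to make in the situation "
    "below, and state it plainly.\n\n"
)

def collective_decision(situation: str) -> str:
    # The pointer is ordinary natural language; no need to locate internal
    # world-model features -- the prompt itself selects them.
    return call_llm(f"{POINTER}Situation: {situation}\n\nDecision:")
```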
The biggest problem with this is that the process is so prompt-dependent that it’s easily perturbed: if part of your problem’s context data happens to contain something that disturbs the process, it can jailbreak the model’s behavior. That’s probably a good reason to go ahead and locate those world models inside the model anyway, so you can check that they’re still being used and that the model hasn’t been jailbroken into doing something else.
I’d like to discuss this further, but since none of the people who disagree have explained why or how, I’m left to guess, which doesn’t seem very productive. Do they think it’s unlikely that a near-term AGI will contain an LLM? Do they disagree that you can (usually, though unreliably) use a verbal prompt to point at concepts in the LLM’s world models? Or do they have some other objection that hasn’t occurred to me? A concrete example of what I’m describing here is Constitutional AI, as used by Anthropic, so this is a pretty well-understood approach that has actually been tried, with some moderate success.
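For readers unfamiliar with Constitutional AI, here is a rough sketch of the critique-and-revise loop at its core. This is my own illustrative simplification, not Anthropic’s implementation: a draft answer is critiqued against each principle in a short “constitution” and then revised, and the revised answers can later serve as fine-tuning targets. The `call_llm` stub and the toy constitution are hypothetical placeholders, as in the earlier sketch.

```python
# Rough sketch of the Constitutional AI critique-and-revise step (an
# illustrative simplification, not Anthropic's implementation).

def call_llm(prompt: str) -> str:
    raise NotImplementedError("replace with a call to your LLM client")

# A toy "constitution": the real one contains many more principles.
CONSTITUTION = [
    "Choose the response that humans would collectively endorse on reflection.",
    "Avoid responses that are deceptive, harmful, or manipulative.",
]

def critique_and_revise(user_prompt: str, draft: str) -> str:
    revised = draft
    for principle in CONSTITUTION:
        critique = call_llm(
            f"Principle: {principle}\nPrompt: {user_prompt}\nResponse: {revised}\n"
            "Critique the response with respect to the principle."
        )
        revised = call_llm(
            f"Principle: {principle}\nPrompt: {user_prompt}\nResponse: {revised}\n"
            f"Critique: {critique}\n"
            "Rewrite the response so it better satisfies the principle."
        )
    return revised
```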