> The goal of this post isn’t to describe hypothetical strong AIs but to describe how humans form values as well as how more human-like near-term AGIs are likely to function.
I think the post does a great job of explaining human value formation, as well as the architecture of human decision-making, at least mechanically.
I’m saying that neuroanatomy seems insufficient to explain how humans function in the most important situations, let alone artificial systems, near or far.
> If a hedge fund trader can beat the market, or a chess grandmaster can beat their opponent, what does it matter whether the decision process they use under the hood looks more like tree search, or more like function approximation, or a combination of both?
It might matter quite a lot, if you’re trying to build a human-like AGI! If you just want to know if your AGI is capable of killing you though, both function approximation and tree search at the level humans do them (or even somewhat below that level) seem pretty deadly, if they’re pointed in the wrong direction.
Whether it’s easy or hard to point an artificial system in any particular direction is another question.
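To make the tree-search vs. function-approximation distinction concrete, here’s a toy sketch (purely illustrative, nothing from the post): in a trivial game of Nim, an exhaustive minimax search and a direct state-to-move mapping can pick the same winning move, even though the decision processes under the hood are completely different.

```python
# Toy contrast between two decision procedures for the same game (Nim:
# take 1-3 sticks per turn, whoever takes the last stick wins).
# Both players pick the same moves; only the internals differ.

def tree_search_move(sticks: int) -> int:
    """Exhaustive minimax: search every line of play to the end."""
    def wins(n: int) -> bool:
        # True if the player to move can force a win from n sticks.
        return any(not wins(n - take) for take in (1, 2, 3) if take <= n)
    for take in (1, 2, 3):
        if take <= sticks and not wins(sticks - take):
            return take
    return 1  # every move loses against perfect play; just take one stick

def function_approximation_move(sticks: int) -> int:
    """Direct state-to-move mapping, standing in for a learned policy
    (an 'amortised' version of the search, with no lookahead at run time)."""
    return sticks % 4 or 1

for n in (5, 9, 10, 12):
    print(n, tree_search_move(n), function_approximation_move(n))
```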
> Somehow, your brain has solved a complex pointers problem to get you to intrinsically care about a concept that is very far from primary rewards.
I think you’re saying that this is evidence that artificial systems which have similar architectures to human brains will also be able to solve this pointers problem, and if so, I agree.
I’m skeptical that anyone will succeed in building such a system before others build other kinds of systems, but I do think it is a good thing to try, if done carefully.
Though, a world where such systems are easy to build is not one I’d call “benign”, since if it’s easy to “just ask for alignment”, it’s probably also pretty easy to ask for not-alignment. Put another way, in the world where CoEms are the first kind of strong AGIs to get built, I think p(s-risk) goes up dramatically, though p(good outcomes) also goes up, perhaps even more dramatically, and p(molecular squiggles) goes down. I mostly think we’re not in that world, though.
> I think you’re saying that this is evidence that artificial systems which have similar architectures to human brains will also be able to solve this pointers problem, and if so, I agree.
> I’m skeptical that anyone will succeed in building such a system before others build other kinds of systems, but I do think it is a good thing to try, if done carefully.
I think our biggest crux is this. My idea here is that by default we get systems that look like this (DL systems look like this!), and my near-term prediction is that DL systems will scale all the way to AGI. Almost any near-term AGI will almost certainly look ‘human-like’ in a particular sense: some combination of model-free and model-based RL wrapped around an unsupervised world model. In the even nearer term you might even scale to AGI with pure AutoGPT-style agents which are just doing iterative planning by conditioning the LLM! Both potential AGI designs look way closer to human-like than a pure EY-style utility maximiser. Now EY might still be right in the limit of superintelligence and RSI, but that is not what near-term systems seem likely to look like.
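For concreteness, here’s a minimal sketch of what ‘iterative planning by conditioning the LLM’ could look like. This is not a description of AutoGPT’s actual implementation; `call_llm` and `run_tool` are hypothetical stand-ins passed in by the caller, faked below so the sketch runs.

```python
# Sketch of an AutoGPT-style loop: the "agent" is just an LLM repeatedly
# conditioned on its goal, its memory so far, and the result of its last
# action. `call_llm` and `run_tool` are placeholders, not a real API.

def iterative_planning_agent(goal, call_llm, run_tool, max_steps=10):
    memory = []
    for _ in range(max_steps):
        prompt = (
            f"Goal: {goal}\n"
            f"History: {memory}\n"
            "Reply with the single next action, or DONE."
        )
        action = call_llm(prompt)        # the LLM itself is the planner
        if action.strip().upper() == "DONE":
            break
        observation = run_tool(action)   # act, then feed the result back in
        memory.append((action, observation))
    return memory

# Trivial stand-ins so the loop can be exercised without a real model:
fake_llm = lambda p: "DONE" if "observation" in p.lower() else "search the web"
fake_tool = lambda a: f"observation for: {a}"
print(iterative_planning_agent("summarise a paper", fake_llm, fake_tool))
```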
> Though, a world where such systems are easy to build is not one I’d call “benign”, since if it’s easy to “just ask for alignment”, it’s probably also pretty easy to ask for not-alignment. Put another way, in the world where CoEms are the first kind of strong AGIs to get built, I think p(s-risk) goes up dramatically, though p(good outcomes) also goes up, perhaps even more dramatically, and p(molecular squiggles) goes down. I mostly think we’re not in that world, though.
Yeah, I completely agree with this point, and I think this is going to be almost inevitable for any alignment strategy. As a consequence of the orthogonality thesis, it is likely that if you can align a system at all, then you can choose to align it to something bad, like making people suffer, if you want to. I think this is true across almost all worlds, and so we definitely get increasing p(s-risk) along with increased p(survival). This is not a problem technical alignment can solve; it instead needs to involve some level of societal agreement / governance.
> I think our biggest crux is this. My idea here is that by default we get systems that look like this (DL systems look like this!), and my near-term prediction is that DL systems will scale all the way to AGI. Almost any near-term AGI will almost certainly look ‘human-like’ in a particular sense: some combination of model-free and model-based RL wrapped around an unsupervised world model.
Agree this is a crux. A few remarks:
Structural similarity doesn’t necessarily tell us a lot about a system’s macro-level behavior. Examples: Stockfish 1 vs. Stockfish 20, the brain of a supervillain vs. the brain of an average human, a transformer model with random weights vs. one trained to predict the next token in a sequence of text.
Or, if you want to extend the similarity to the training process, a transformer model trained on a corpus of text from the human internet vs. one trained on a corpus of text from an alien internet. An average human vs. a supervillain who have 99%+ identical life experiences from birth. Stockfish implemented by a beginner programmer vs. a professional team.
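To make the transformer example concrete, here’s a small sketch (not from the original comment; it assumes the Hugging Face `transformers` library and downloads the small GPT-2 checkpoint). The two models have identical structure; only the weights, and hence the training, differ.

```python
# Two GPT-2 models with the same architecture; only the weights differ.
from transformers import GPT2Config, GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
random_model = GPT2LMHeadModel(GPT2Config())             # same structure, random weights
trained_model = GPT2LMHeadModel.from_pretrained("gpt2")  # same structure, trained weights

inputs = tokenizer("The chess engine chose its move by", return_tensors="pt")
for name, model in [("random", random_model), ("trained", trained_model)]:
    out = model.generate(**inputs, max_new_tokens=15, do_sample=False)
    print(name, "->", tokenizer.decode(out[0]))

# The randomly initialised model emits noise; the trained one continues
# plausibly, despite the two being structurally identical.
```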
I’d say, to the extent that current DL systems are structurally similar to human brains, it’s because these structures are instrumentally useful for doing any kind of useful work, regardless of how “values” in those systems are formed, or what those values are. And as you converge towards the most useful structures, there is less room left over for the system to “look similar” to humans, unless humans are pretty close to performing cognition optimally already.
Also, a lot of the structural similarity is in the training process of the foundation models that make up one component of a larger artificial system. The kinds of things people do with LangChain today don’t seem similar in structure to any part of a single human brain, at least to me. For example, I can’t arrange a bunch of copies of myself in a chain or tree, and give them each different prompts running in parallel. I could maybe simulate that by hiring a bunch of people, though it would be OOMs slower and more costly.
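For contrast, here’s roughly what that kind of fan-out looks like for an artificial system. `call_llm` is a hypothetical placeholder for any model API (not a specific LangChain call), faked here so the sketch runs:

```python
# Fan out many prompts to "copies" of the same model in parallel, then chain
# the results into a final call: trivial for an artificial system, not
# something a single human can do with themselves.
from concurrent.futures import ThreadPoolExecutor

def call_llm(prompt: str) -> str:
    return f"answer to: {prompt}"   # stand-in for a real completion call

prompts = [f"Summarise section {i} of the report." for i in range(8)]

with ThreadPoolExecutor(max_workers=8) as pool:
    answers = list(pool.map(call_llm, prompts))   # eight "copies", run at once

combined = call_llm("Combine these summaries:\n" + "\n".join(answers))  # chain step
print(combined)
```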
I also can’t add a Python shell or a “tree search” method, or perform a bunch of experimental neurosurgery on humans, the way I can with artificial systems. These all seem like capabilities-enhancing tools that don’t preserve structural similarity to humans, and may also not preserve similarity of values to the original, un-enhanced artificial system.
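As a toy illustration of that last point, here’s a sketch of bolting a search wrapper onto a fixed “model” (everything here is a stand-in): the wrapped system can behave quite differently from the bare one, even though the underlying weights are untouched.

```python
# Layer a best-of-n search step on top of a fixed base model. The wrapper
# changes what the overall system does without changing the model itself,
# and whatever `score` points at is what the search optimises for.
import random

def base_model(prompt: str) -> str:
    # Stand-in for sampling one continuation from a model.
    return prompt + random.choice([" a cautious plan", " a bold plan", " no plan"])

def score(text: str) -> float:
    # Stand-in for whatever objective the search wrapper is pointed at.
    return len(text)

def with_search_wrapper(prompt: str, n: int = 8) -> str:
    candidates = [base_model(prompt) for _ in range(n)]
    return max(candidates, key=score)   # keep the candidate the objective likes best

print(base_model("Next step:"))
print(with_search_wrapper("Next step:"))
```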