A lot of the work people have done in alignment has been based on the assumptions that 1) interpretability is difficult/weak, and 2) the dangerous parts of the architecture are mostly trained by SGD or RL or something similar. So you have a blind idiot god making you almost-black boxes. For example, the entire standard framing of inner vs. outer alignment has that assumption built into it.
Now suddenly we’re instead looking at a hybrid system where all of that remains true for the LLM part (though plausibly a single LLM forward pass isn’t computationally complex enough to be very dangerous by itself), while the cognitive architecture built on top of it is easily interpretable and even editable (modulo things like steganography, complexity, and sheer volume). It looks like a combination of a fuzzy textual version of GOFAI with prompt engineering, and its structure is currently hand-coded and could remain simple enough to be programmed and reasoned about. Parts of alignment research for this might look a lot like writing a constitution, or writing text clearly explaining CEV, and parts might look like the kind of brain-architectural diagrams in the article above.
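To make that concrete, here’s a minimal toy sketch (in Python, with a stubbed-out `call_llm` placeholder and a made-up step structure, so none of this is a claim about any real implementation) of what I mean by a hand-coded, inspectable architecture whose entire intermediate state is plain English text:

```python
# Toy sketch of a hand-coded LMCA outer loop: the "architecture" is ordinary,
# readable code, and every intermediate thought is stored as plain English text
# that a human (or another monitoring process) can inspect or edit before it is
# acted on.

def call_llm(prompt: str) -> str:
    """Stand-in for a single LLM forward pass / API call (placeholder only)."""
    return "DONE: (stub response; a real system would call a language model here)"

def run_lmca(task: str, max_steps: int = 10) -> list[str]:
    scratchpad: list[str] = []  # the natural-language chain of thought
    for step in range(max_steps):
        context = "\n".join(scratchpad)
        plan = call_llm(f"Task: {task}\nNotes so far:\n{context}\nPropose the next step.")
        critique = call_llm(f"Critique this proposed step for safety and correctness:\n{plan}")
        scratchpad.append(f"[step {step}] plan: {plan}")
        scratchpad.append(f"[step {step}] critique: {critique}")
        # Because the state is plain text, a monitor (human or automated) can
        # read, veto, or rewrite any entry here before anything is executed.
        if "DONE" in plan:
            break
    return scratchpad
```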
I believe the only potentially-safe convergent goal to give a seed AI is to build something with the cognitive capability of one or multiple smart (but not significantly superhumanly smart) humans, probably running faster, that is capable of doing scientific research well; give it/them the goal of solving the alignment problem (including corrigibility); somehow ensure that that goal is locked in; and then somehow monitor it/them while they do so.
So how would you build an LMCA scientist? It needs to have the ability to:
Have multiple world models, including multiple models of human values, with probability distributions across them.
Do approximate Bayesian updates on them.
Creatively generate new hypotheses/world models.
It attempts to optimize human values (under some definition, such as CEV, of which it is also uncertain), and knows that it doesn’t fully understand what those are, though it has some data on them.
It optimizes safely/conservatively in the presence of uncertainty in its models of the world and human values, i.e. it has solved the optimizer’s curse (a toy sketch of this is below the list). Significantly, this basically forces it to make progress on the Alignment Problem in order to ever do anything much outside its training distribution: it has to make very pessimistic assumptions in the presence of human value uncertainty, so it is operationally motivated to reduce that uncertainty. So even if you build one whose primary goal isn’t alignment research but, say, medical research, and tell it to cure cancer, it is likely to rapidly decide that in order to do that well it needs to figure out what Quality-Adjusted Life Years and Informed Consent are, and to solve the Do What I Mean problem, which requires it to first solve the Alignment Problem as a convergent operational goal.
Devise and perform low-cost/low-risk experiments (where cost/risk is estimated conservatively/pessimistically under the current uncertainty in the world and human-value models) that can distinguish between, or falsify, individual world/human-value models in order to reduce model uncertainty (also sketched below).
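Here’s a toy sketch of the Bayesian/conservative part of this, with entirely made-up models, priors, and likelihoods: approximate Bayesian updating over a handful of candidate human-value models, plus a scoring rule that blends expected value with the worst case across the surviving models, which is the basic move for blunting the optimizer’s curse:

```python
import numpy as np

# Toy example: a few candidate "human value models", each scoring actions
# differently. All priors, likelihoods, and values are illustrative only.
value_models = {
    "model_A": {"cure_with_consent": 0.9, "cure_without_consent": 0.4},
    "model_B": {"cure_with_consent": 0.8, "cure_without_consent": -0.9},
    "model_C": {"cure_with_consent": 0.7, "cure_without_consent": 0.1},
}
prior = {"model_A": 1 / 3, "model_B": 1 / 3, "model_C": 1 / 3}

def bayes_update(prior, likelihood):
    """Approximate Bayesian update: posterior is proportional to prior x likelihood."""
    unnorm = {m: prior[m] * likelihood[m] for m in prior}
    z = sum(unnorm.values())
    return {m: p / z for m, p in unnorm.items()}

# Suppose we observe humans objecting to a non-consensual intervention;
# each candidate value model assigns a different likelihood to that observation.
likelihood_of_objection = {"model_A": 0.3, "model_B": 0.9, "model_C": 0.5}
posterior = bayes_update(prior, likelihood_of_objection)

def conservative_score(action, posterior, pessimism=1.0):
    """Blend expected value with worst-case value across the value models.
    pessimism=1.0 is pure worst-case; 0.0 is pure expected value."""
    vals = np.array([value_models[m][action] for m in posterior])
    probs = np.array([posterior[m] for m in posterior])
    expected, worst = float(vals @ probs), float(vals.min())
    return (1 - pessimism) * expected + pessimism * worst

for action in ["cure_with_consent", "cure_without_consent"]:
    print(action, round(conservative_score(action, posterior, pessimism=0.8), 3))
# Under high pessimism, any action that some plausible value model rates as very
# bad gets a low score, so the agent is pushed toward reducing value uncertainty
# before acting far outside its training distribution.
```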
This strongly suggests to me an LMCA architecture that combines handling natural language (and probably also images, equations, code, etc.) with quantitative/mathematical estimation of important quantities like probabilities, risks, costs, and value, with an underlying structure built around approximate Bayesian reasoning.
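And, continuing the same toy example (numbers, experiment names, and the risk-penalty weight are again made up), a rough sketch of the experiment-selection step from the list above: pick the low-risk experiment whose outcome is expected to shrink the entropy of the posterior over models the most, i.e. maximize expected information gain, penalized by a conservatively estimated risk:

```python
import numpy as np

# Continuing the toy example: pick the experiment that most reduces uncertainty
# over which candidate model is right, weighted against its (conservatively
# estimated) cost/risk. All numbers are illustrative only.
models = ["model_A", "model_B", "model_C"]
posterior = np.array([0.18, 0.53, 0.29])  # from the previous sketch (rounded)

# P(outcome = "positive" | model) for two candidate experiments, plus their risk.
experiments = {
    "survey_patients":   {"likelihoods": np.array([0.2, 0.9, 0.5]),   "risk": 0.01},
    "small_pilot_study": {"likelihoods": np.array([0.6, 0.7, 0.65]), "risk": 0.05},
}

def entropy(p):
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def expected_info_gain(posterior, lik_positive):
    """Expected reduction in entropy over the models after seeing the outcome."""
    h_before = entropy(posterior)
    h_after = 0.0
    for lik in (lik_positive, 1 - lik_positive):  # two possible outcomes
        p_outcome = float(posterior @ lik)
        if p_outcome > 0:
            post = posterior * lik / p_outcome
            h_after += p_outcome * entropy(post)
    return h_before - h_after

for name, e in experiments.items():
    score = expected_info_gain(posterior, e["likelihoods"]) - 10.0 * e["risk"]
    print(name, round(score, 3))  # info gain, penalized by a conservative risk estimate
```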
You might want to compare your ideas to (1) Conjecture’s CoEms (2) brain-like AGI safety by @Steven Byrnes (3) Yann LeCun’s ideas.
I’d done that with 1 and 2, and probably should’ve cited them. I did cite a recent brief summary of Steve’s work, and my work prior to getting excited about LMCAs was very similar to his. In practice, it seems like what Conjecture is working on is pretty much exactly this. I wasn’t familiar with LeCun’s scheme, so thanks for sharing.
The brainlike cognitive architectures proposed by Steve (and me) and by LeCun are similar in that they’re cognitive architectures with steering systems, which is an advantage. I also wrote about this here.
But they don’t have the central advantage of a chain of thought in English. That’s what I’m most excited about. If we don’t get this type of system as the first AGI, I hope we at least get one with a steering system (and I think we will; steering is practical as well as safer). But I think this type of natural language chain of thought alignment approach has large advantages.