I suppose my preferred strategy would be to derive the process by which human values form and replicate that process in AIs. The reason I think this is tractable is that I actually take issue with this statement:
In the same way that the first airplanes did not look like birds, the first human-level AI will not look like humans.
I don’t think bird versus plane is an appropriate analogy for human versus AI learning systems, because effective / general learning systems tend to resemble each other. Simple architectures scale best, so we should expect human learning to be simple and scalable, much like the first AGI will be. We’re not some “random sample” from the space of possible mind configurations: once you condition on generality, you get a lot of convergence in the resulting learning dynamics. It’s no coincidence that adversarial examples can transfer across architectures. You can see my thoughts on this here.
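To make the adversarial-transfer point concrete, here is a minimal sketch (not from the original comment) of the standard experiment: craft an FGSM perturbation against one network, then check whether it also fools a second network with a different shape. The models, data, and epsilon below are stand-in assumptions; in a real experiment both models would first be trained on the same dataset before measuring transfer.

```python
# Minimal sketch of adversarial-example transfer between two architectures.
# Everything here (model sizes, random data, eps) is a stand-in assumption.
import torch
import torch.nn as nn

torch.manual_seed(0)

def make_model(hidden):
    # A tiny MLP classifier; width differs between the two "architectures".
    return nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, hidden),
                         nn.ReLU(), nn.Linear(hidden, 10))

model_a = make_model(256)   # "source" model the attack is crafted against
model_b = make_model(64)    # "target" model that never sees the attack
# In a real experiment, both models would be trained on the same data here.

loss_fn = nn.CrossEntropyLoss()

def fgsm(model, x, y, eps=0.1):
    """One-step FGSM: nudge x in the direction that increases the loss."""
    x = x.clone().detach().requires_grad_(True)
    loss_fn(model(x), y).backward()
    return (x + eps * x.grad.sign()).detach()

x = torch.rand(8, 1, 28, 28)      # stand-in batch of images
y = torch.randint(0, 10, (8,))    # stand-in labels

x_adv = fgsm(model_a, x, y)       # perturbation crafted against model A only
transfer_rate = (model_b(x_adv).argmax(1) != y).float().mean().item()
print(f"fraction of A-crafted adversarial inputs misclassified by B: {transfer_rate:.2f}")
```

With untrained stand-in models the printed number is meaningless; the point of the sketch is only the shape of the experiment that underlies the transfer claim.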
I also think that human values derive from a relatively straightforward interaction between our reward circuitry and our learning system, which I discuss in more detail here. The gist of it is that the brain really seems like the sort of place where inner alignment failures should happen, and such failures seem like they’d be hard for evolution to stop. Thus, the brain is probably full of inner alignment failures (as in, full of competing / cooperating quasi-agentic circuits).
Additionally, if you actually think about the incentives that arise from an inner alignment failure, they bear a striking resemblance to the way our values actually work. Many deep / “weird”-seeming value intuitions seem to coincide with a multi-agent inner alignment failure story.
I think the odds are good that we’ll be able to replicate such a process in an AI and get values that are compatible with humanity’s continued survival.