The existence of the human genome yields at least two classes of evidence which I’m strongly interested in.
Humans provide many highly correlated datapoints on general intelligence (human minds), as developed within one kind of learning process (best guess: massively parallel circuitry, locally randomly initialized, self-supervised learning + RL).
We thereby gain valuable information about the dynamics of that learning process. For example, people care about lots of things (cars, flowers, animals, friends), and don’t just have a single unitary mesa-objective.
What does the fact that evolution found the human learning process in particular say about the rest of learning-process hyperparameter space?
Since we probably won’t be training AIs using a brain-like architecture, this inference about the rest of the space is the alignment-relevant part!
EG “Humans care about lots of things” is an upward update on RL training producing agents with lots of values and goals.
“Humans care about things besides their own reward signals or tight correlates thereof” is a downward update on reward hacking being hard to avoid at the architectural level. We make this update because we have observed a sample from evolution’s “empirical distribution” over learning processes.
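To make the shape of that single-sample update concrete, here is a toy sketch under my own framing (the Beta prior and the Bernoulli setup are not from the original argument): treat “a general intelligence produced by roughly this kind of learning process ends up caring about many things” as an event with unknown rate p, put a flat prior on p, and condition on the one sample we have (humans) coming out multi-valued.

```python
# Toy Beta-Bernoulli update (illustrative only; the prior and framing are my assumptions).
# p = fraction of general intelligences, produced by roughly this kind of learning
# process, that end up caring about many things rather than a single quantity.
from scipy.stats import beta

prior_a, prior_b = 1.0, 1.0          # uniform Beta(1, 1) prior over p
successes, failures = 1, 0           # one observed sample (humans), multi-valued

posterior = beta(prior_a + successes, prior_b + failures)

print(f"prior mean of p:     {beta(prior_a, prior_b).mean():.3f}")   # 0.500
print(f"posterior mean of p: {posterior.mean():.3f}")                # 0.667
```

A single sample doesn’t move the posterior much in absolute terms, but it moves it in the “many values” direction, which is the qualitative point.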
There’s a confounder, though: what is that distribution, exactly?
If the human learning process were somehow super weird relative to AI training processes, our AI-related inferences would be weaker.
But if evolution was basically optimizing for real-world compute efficiency and happened to find the human learning process, humans are probably typical real-world general intelligences along many axes.
There are many unknown unknowns. I expect humans to be weird in some ways, but not in any particular way. EG maybe we are unusually prone to forming lots of values (and not just valuing a single quantity), but I don’t have any reason to suspect that we’re weird in that way in particular, relative to other learning processes (like RL-finetuned LLMs).
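One way to picture how “humans might be weird” weakens the inference is to temper the single human datapoint by a relevance weight. This is a sketch under my own assumptions (the weight w and the likelihood-tempering move are mine, not part of the original argument); it just shows the posterior sliding back toward the prior as the human sample is treated as less representative of the AI training processes we care about.

```python
# Toy likelihood tempering on the Beta-Bernoulli sketch above (assumptions mine).
# w in [0, 1] encodes how representative the human learning process is of the
# AI training processes we care about: w = 1 counts the human sample fully,
# w = 0 discards it entirely.
from scipy.stats import beta

prior_a, prior_b = 1.0, 1.0          # same uniform prior as before

for w in (1.0, 0.5, 0.1):
    posterior = beta(prior_a + w * 1, prior_b + w * 0)   # one tempered "success"
    print(f"relevance w = {w:.1f} -> posterior mean of p = {posterior.mean():.3f}")
# w = 1.0 -> 0.667, w = 0.5 -> 0.600, w = 0.1 -> 0.524
```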