Monkeys that are pretty-good-in-practice at promoting genetic fitness, including having some intuitions leading them to sacrifice themselves in-practice for two-ish children or eight-ish cousins, don’t in fact have a clean “inclusive genetic fitness” concept that you can readily make them optimize. An LLM espousing various human moral intuitions doesn’t have a clean concept for pan-sentience CEV such that the universe turns out OK if that concept is optimized.
Humans also don’t have a “clean concept for pan-sentience CEV such that the universe turns out OK if that concept is optimized” in our heads. However, we do have a concept of human values in a narrower sense, and I expect LLMs in the coming years to pick up roughly the same concept during training.
The evolution analogy seems closer to an LLM that’s rewarded for telling funny jokes without understanding what makes a joke funny. So it learns the strategy of repeatedly telling certain popular jokes, because those are rated as funny. In that case it’s not surprising that the LLM wouldn’t be funny when taken out of its training distribution, but that’s just because it never learned what humor was to begin with. If the LLM had instead understood the essence of humor during training, it’s much more likely that the property of being humorous would generalize outside its training distribution.
LLMs will likely learn the concept of human values during training about as well as most humans learn it. There’s still the problem of getting LLMs to care about and act on those values, but it’s noteworthy that the LLM will nonetheless understand what we are trying to get it to care about.