[Question] What’s the Relationship Between “Human Values” and the Brain’s Reward System?
In classic AGI risk scenarios (those not involving mesa-optimization), the imagined AI has a reward function directly related to the optimization pressure it exerts on the world: e.g. the paperclip maximizer. However, human values seem to be related to the brain’s underlying reward function in a highly circuitous way, and in some sense might be better thought of as an elaborate complex of learned behaviors, contextual actions, fleeting heuristic goals, etc. If AGI is created in the near term using an architecture similar to the human brain’s, it seems plausible that the relationship between its reward function and the optimization pressure it actually exerts will be similarly circuitous, so developing a good understanding of how this works in the human case might be pretty important. Thus: what is the best mechanistic account we currently have of how “human values” actually emerge from the brain?