BTW I found the linked article fascinating, and it's a very important subject for those trying to create safe brain-like AGI by reverse-engineering the brain's alignment mechanisms (empathy/love/altruism).
It's sometimes called the 'pointing problem': the agent's utility function must refer to specific concepts in the dynamically learned world model, but making that connection seems difficult, because the utility function is innate while the concepts we want it to reference are small targets that could end up anywhere in the complex learned world model.
I describe the problem a bit here, and my current best guess of how the brain solves it (correlation-guided proxy matching) is a generalization of imprinting mechanisms; a toy sketch of what I mean is below. Shard theory is also concerned with the problem.
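To make that a bit more concrete, here is a minimal toy sketch of the flavor of mechanism I mean (the signal names, the made-up data, and the correlation threshold are all illustrative assumptions, not a claim about the actual neural implementation): an innate, hardwired proxy detector fires on some crude cue, and the utility machinery rebinds to whichever learned world-model concept best correlates with that proxy over experience.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 1000  # timesteps of experience

# Innate proxy signal: a crude hardwired detector (think: rough caregiver/face cue),
# modeled here as a noisy binary signal.
caregiver_present = rng.random(T) < 0.3
proxy_signal = caregiver_present.astype(float) + 0.2 * rng.standard_normal(T)

# Learned world-model concepts: latent features the agent happened to learn.
# "concept_2" tracks the caregiver; the others are unrelated.
concepts = {
    "concept_0": rng.standard_normal(T),
    "concept_1": np.sin(np.linspace(0, 20, T)) + 0.1 * rng.standard_normal(T),
    "concept_2": caregiver_present.astype(float) + 0.3 * rng.standard_normal(T),
}

def match_proxy_to_concept(proxy, concepts, threshold=0.5):
    """Toy correlation-guided proxy matching: point the utility machinery at
    whichever learned concept best correlates with the innate proxy signal."""
    correlations = {
        name: float(np.corrcoef(proxy, activation)[0, 1])
        for name, activation in concepts.items()
    }
    best_name, best_corr = max(correlations.items(), key=lambda kv: kv[1])
    # Only rebind if the match is strong enough; otherwise keep using the crude proxy.
    return (best_name, best_corr) if best_corr >= threshold else (None, best_corr)

target, corr = match_proxy_to_concept(proxy_signal, concepts)
print(f"Utility function now points at: {target} (corr={corr:.2f})")
```

In this toy picture, once the binding is made, reward is computed from the richer learned concept rather than the brittle innate detector, which is the sense in which it generalizes imprinting.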
Jacob, thanks! Glad you found that article interesting. Much appreciated. I'll read the linked essays when I can.