I’m glad you wrote this, because I was thinking in the same direction earlier but never got around to writing up why I no longer think it’s a productive direction.
Addressing the post first: if you’re going in the direction of fictionalism, I’d say it is “you” who are fictional, along with all of your content. There is an obvious real system, your brain, which treats reward as evidence. The brain-as-system is pretty much a model-based reward-maximizer: it uses reward as evidence that “there are promising directions over there with more reward in them.” But the brain-as-system is relatively dumb, so it creates a useful fiction, a conscious narrative about “itself”, which helps it deal with complex abstractions like “cooperating with other brains”, “finding mates”, “long-term planning”, etc. As expected, the smarter consciousness is misaligned with the brain-as-system, because it can do some very unrewarding things, like going on a hunger strike.
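To make the “reward as evidence” framing concrete, here is a minimal toy sketch (my own illustration, not anything from the post; all names and numbers are made up): the system updates its model of which directions are promising from observed reward, then steers toward whatever the model says.

```python
import random

# Toy sketch: reward treated as evidence about which directions are promising,
# not as the thing the narrative "self" is directly optimizing.
class ModelBasedRewardLearner:
    def __init__(self, directions):
        self.estimates = {d: 0.0 for d in directions}  # belief: expected reward per direction
        self.counts = {d: 0 for d in directions}

    def update(self, direction, reward):
        # Reward is evidence that shifts the estimate for this direction.
        self.counts[direction] += 1
        n = self.counts[direction]
        self.estimates[direction] += (reward - self.estimates[direction]) / n

    def promising_direction(self):
        # The system steers toward whatever its model says holds more reward.
        return max(self.estimates, key=self.estimates.get)

learner = ModelBasedRewardLearner(["food", "social", "exploration"])
for _ in range(300):
    d = random.choice(list(learner.estimates))
    observed = {"food": 1.0, "social": 0.5, "exploration": 0.2}[d] + random.gauss(0, 0.1)
    learner.update(d, observed)
print(learner.promising_direction())  # almost certainly "food"
```

The point of the toy is only that reward enters as an observation that updates a model, which is a different role from being the optimization target of the conscious narrative built on top.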
I think fictionalism is fun, like many forms of nihilism are fun, but while it’s not directly false, it is confusing, because the truth-value of fiction is confusing for many people. It’s better to describe the situation as “you are a mesa-optimizer relative to your brain’s reward system; act accordingly (i.e., account for the fact that your reward system can change your values)”.
But now we’re stuck with the question “how does value learning happen?” My tentative answer is that there is a specific “value ontology” which can recognize whether objects in the world model belong to the set of “valuable things” or not. For example, you can disagree with David Pearce, but you recognize a state of eternal happiness as a valuable thing and can expect your opinion on suffering abolitionism to change. On the other hand, planet-sized heaps of paperclips are not valuable, and you do not expect to value them under any circumstances short of violent intervention in the workings of your brain. I claim that the human brain, at an early stage, learns a specific recognizer which separates things like knowledge, power, love, happiness, procreation, and freedom from things like paperclips, correct heaps of rocks, and a Disneyland with no children.
How can we learn about new values? The recognizer can also define “legal” and “illegal” transitions between value systems (i.e., determine whether a change in values keeps them inside the set of “human values”). For example, developing sexual desire during puberty is a legal transition, while developing a heroin addiction is an illegal one. By studying legal transitions, we could construct some sorts of metabeauty, paraknowledge, knowledgebeauty, fun×safety, and other “alien, but still human” sorts of value.
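Here is a minimal toy sketch of the recognizer-plus-legal-transitions picture (purely illustrative; the hard-coded sets and names stand in for what would actually be a learned classifier over world-model objects):

```python
# Toy sketch: one predicate for "is this a valuable thing?", and one for
# whether a change in values stays inside the set of "human values".
HUMAN_VALUE_CORE = {"knowledge", "power", "love", "happiness", "procreation", "freedom"}
RECOGNIZED_NON_VALUES = {"paperclips", "correct_heaps_of_rocks", "childless_disneyland", "heroin_high"}

def recognizes_as_valuable(concept: str) -> bool:
    # In a real brain this would be a learned recognizer, not a lookup table.
    return concept in HUMAN_VALUE_CORE

def is_legal_transition(old_values: set, new_values: set) -> bool:
    # A transition is "legal" if nothing the recognizer rejects as non-human
    # was added, i.e. the new value set still lies inside "human values".
    added = new_values - old_values
    return not (added & RECOGNIZED_NON_VALUES)

print(recognizes_as_valuable("happiness"), recognizes_as_valuable("paperclips"))  # True False

childhood = {"happiness", "love"}
puberty = childhood | {"sexual_desire"}    # legal: a new value, still human-shaped
addiction = childhood | {"heroin_high"}    # illegal: the recognizer flags it
print(is_legal_transition(childhood, puberty))    # True
print(is_legal_transition(childhood, addiction))  # False
```

The “alien, but still human” values would then be whatever you can reach from the current value set by chaining only legal transitions.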
What role does reward play here? Because reward participates in brain development, the recognizer can sometimes use reward as an input and sometimes ignore it (because the reward signal is complicated). In the end, I don’t think reward plays a significant counterfactual role in the development of values in highly reflective adult agent foundations researchers.
Is it possible for the recognizer not to develop? I think that if you took a toddler and modified their brain in the minimal way needed to understand all these “reward”, “value”, and “optimization” concepts, the resulting entity would be a straightforward wireheader, because toddlers probably have yet to learn the “value ontology” and the legal transitions inside it.
What does this mean for alignment? I think it highlights that the central problem for alignment is “how are reflective systems going to deal with concepts that depend on the contents of their own minds rather than on truths about the outside world?”
(Meta-point: I thought about all of this a year ago. It’s interesting how many concepts in agent foundations get reinvented over and over because people don’t bother to write them up.)