This echoes the idea of shared/collective intelligence, built out of shared models (Friston et al. 2022):
We have noted that intelligence as self-evidencing is inherently perspectival, as it involves actively making sense of and engaging with the world from a specific point of view (i.e., given a set of beliefs). Importantly, if the origins of intelligence indeed lie in the partitioning of the universe into subsystems by probabilistic boundaries, then intelligence never arises singly but always exists on either side of such a boundary [103, 104]. The world that one models is almost invariably composed of other intelligent agents that model one in turn.
This brings us back to the insight that intelligence must, at some level, be distributed over every agent and over every scale at which agents exist. Active inference is naturally a theory of collective intelligence. There are many foundational issues that arise from this take on intelligence; ranging from communication to cultural niche construction: from theory of mind to selfhood [103–107]. On the active inference account, shared goals emerge from shared narratives, which are provided by shared generative models [108]. Furthermore—on the current analysis—certain things should then be curious about each other.
However, Hipolito & Van Es (2022) reject blending ToM with this account of collective intelligence through model sharing:
While some social cognition theories seemingly take an enactive perspective on social cognition, they explain it as the attribution of mental states to other people, by assuming representational structures, in line with the classic Theory of Mind (ToM). Holding both enactivism and ToM, we argue, entails contradiction and confusion due to two ToM assumptions widely known to be rejected by enactivism: that (1) social cognition reduces to mental representation and (2) social cognition is a hardwired contentful ‘toolkit’ or ‘starter pack’ that fuels the model-like theorising supposed in (1).
[...] human feedback works best when the AI is already well-aligned with us. Getting that likely involves solving issues like value extrapolation.
I agree with this. I think this is because, for the feedback to be helpful (i.e., to reduce the divergence between the probabilistic preference models of humans and the AI), normative assumptions should be shared, and there should also be good alignment on semantics and philosophy of language. I wrote about this in the context of “eliciting the model’s ‘true’ beliefs/aligning world models”, and the reasoning about adjusting preferences via human feedback is the same (after all, preferences are part of the world model):
Note, again, how the philosophy of language, semantics, and inner alignment are bigger problems here. If your and the model’s world models are not inner-aligned (i.e., not equally grounded in reality), some linguistic statements can be misinterpreted, which, in turn, makes these methods for eliciting beliefs unreliable. Consider, for example, a question like “do you believe that killing anyone could be good?”, where humans and the model are inner-misaligned on what “good” means. No matter how reliable your elicitation technique, what you elicit is useless garbage if you don’t already share a lot of beliefs.
This seems to imply that the alignment process is unreliable unless humans and the model are already (almost) aligned; consequently, alignment should start relatively early in the training of superhuman models, not after the model is already trained. Enter: model “development” and “upbringing”, as opposed to indiscriminate “self-supervised training” on text from the internet in random order.
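To make the divergence point concrete, here is a toy numerical sketch (the outcome names, distributions, and the simple mixing-style feedback update are all hypothetical illustrations, not taken from any of the texts quoted above): when the human and the model ground the same labels in the same outcomes, verbal feedback shrinks the KL divergence between their preference distributions; when the model grounds the labels differently, the very same feedback can push the divergence up.

```python
import numpy as np

# Grounded outcomes the preferences range over (indices used below).
outcomes = ["deceive_user", "refuse_politely", "comply_honestly"]

# Human's and AI's preference distributions over the grounded outcomes.
human_prefs = np.array([0.02, 0.28, 0.70])
ai_prefs = np.array([0.30, 0.30, 0.40])

def kl(p, q):
    """KL divergence D_KL(p || q) between two discrete distributions."""
    return float(np.sum(p * np.log(p / q)))

# Case 1: shared semantics. Both parties map the label "good" to the same
# outcomes, so feedback ("complying honestly is good, deceiving is bad")
# nudges the AI's grounded preferences toward the human's.
ai_after_feedback = 0.5 * ai_prefs + 0.5 * human_prefs
ai_after_feedback /= ai_after_feedback.sum()

print("divergence before feedback:", kl(human_prefs, ai_prefs))
print("divergence after feedback (shared semantics):",
      kl(human_prefs, ai_after_feedback))

# Case 2: misaligned grounding. The AI maps the human's labels onto a
# permuted set of outcomes (e.g. it reads "comply honestly" as "comply with
# any request, including deceptive ones"). The same verbal feedback now
# pushes its grounded preferences the wrong way, and the divergence grows.
permutation = [2, 1, 0]  # how the AI actually grounds the human's labels
misgrounded_target = human_prefs[permutation]
ai_after_misgrounded_feedback = 0.5 * ai_prefs + 0.5 * misgrounded_target
ai_after_misgrounded_feedback /= ai_after_misgrounded_feedback.sum()

print("divergence after feedback (misaligned grounding):",
      kl(human_prefs, ai_after_misgrounded_feedback))
```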
What I primarily disagree with in your program is the focus on human preferences and their extrapolation. What about the preferences of people with Neuralink-style implants? Of cyborgs? Of uploaded people? I’d argue that the way they would extrapolate their values would differ from that of “vanilla humans”. So, I sense in “human value extrapolation” an anthropocentrism that will not stand the test of time.
Most theories of morality (philosophical, religious, and spiritual ethics alike) that humans have created to date are deductive reconstructions of a theory of ethics from the heuristics for preference learning that humans have: values and moral intuitions.
This deductive approach couldn’t produce good, general theories of ethics, as has become evident recently with a wave of ethical questions about entities that most ethical theories of the past are totally unprepared to consider (Doctor et al. 2022), ranging from AI and robots (Müller 2020; Owe et al. 2022) to hybrots and chimaeras (Clawson & Levin 2022) and organoids (Sawai et al. 2022). And as the pace of technological progress increases, we should expect environments to transform even faster (which implies that the applied theories of ethics within those environments should also change) and more such novel objects of moral concern to appear.
There have been exceptions to this deductive approach, most notably Kantian ethics. However, Kantian morality is part of Kant’s wider theories of cognition (intelligence, agency) and philosophy of mind. The current state-of-the-art theories in cognitive science and philosophy of mind are far less wrong than Kant’s. So, the time is ripe for the development of new theories of axiology and ethics from first principles.
As I write in my proposal for a research agenda for a scale-free theory of ethics:
I developed this proposal completely unaware of your paper “Recognising the importance of preference change: A call for a coordinated multidisciplinary research effort in the age of AI”. I would be very interested to know your take on the “scale-free ethics” proposal.