I thought I was going to have more ideas, but after sitting on it for a bit I'll just post what I had: architectural/design choices for friendliness that are disjoint from safety choices:
Moral and procedural self-reflection:
Human feedback / human in the loop
The system should do active inference / actively help humans give good feedback (one way to operationalize this is sketched below).
Situational awareness to the extent that the situation at hand is relevant to self-reflection.
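To make the "actively help humans give good feedback" point a bit more concrete, here is a minimal sketch of one way it could work: the system maintains uncertainty over a few hypotheses about what the human values, and asks the clarifying question with the highest expected information gain. The hypotheses, questions, and numbers below are made-up illustrations, not a concrete proposal.

```python
# Minimal sketch (illustrative, not a concrete proposal): pick the clarifying
# question whose answer is expected to shrink uncertainty over "value
# hypotheses" the most (expected information gain / entropy reduction).
import numpy as np

def entropy(p):
    p = p[p > 0]
    return -np.sum(p * np.log(p))

def expected_info_gain(prior, likelihoods):
    """prior: shape (H,) over value hypotheses.
    likelihoods: shape (H, A) -- P(answer a | hypothesis h) for one question."""
    p_answer = prior @ likelihoods                       # shape (A,)
    posterior = prior[:, None] * likelihoods             # shape (H, A), unnormalized
    posterior = posterior / np.clip(p_answer, 1e-12, None)
    expected_posterior_entropy = sum(
        p_answer[a] * entropy(posterior[:, a]) for a in range(likelihoods.shape[1])
    )
    return entropy(prior) - expected_posterior_entropy

# Three toy hypotheses about what the human cares about, two candidate yes/no questions.
prior = np.array([0.5, 0.3, 0.2])
questions = {
    "ask about long-term side effects": np.array([[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]]),
    "ask about resource usage":         np.array([[0.6, 0.4], [0.5, 0.5], [0.4, 0.6]]),
}
best = max(questions, key=lambda q: expected_info_gain(prior, questions[q]))
print("most informative question:", best)
```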
Bootstrapping:
The system should start out safe, with broad knowledge about humans and the ability to implement human feedback.
The “Broad knowledge yet safe” part definitely incentivizes having language models involved somewhere.
Implementing feedback, however, means relying on more dangerous technology, such as automated interpretability or a non-transformer architecture that's good at within-lifetime learning.
Moral pluralism / conservatism
The central idea is to avoid (or at least be averse to) parts of state space where the learned notions of value are worse abstractions / where different ways of generalizing human values wildly diverge (see the sketch after this section).
Requires specifying goals not in terms of a value function (or similar), but in terms of a recipe for learning about the world that eventually spits out a value function (or similar).
Easier to do if the outer layer of the system is classical AI, though we probably shouldn't rule out other options.
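One way to make the pluralism/conservatism idea concrete: train an ensemble of value models that generalize differently, and treat high ensemble disagreement as a signal that we are in one of those bad parts of state space. A minimal sketch follows; the models, the penalty form, and the numbers are illustrative assumptions rather than a worked-out design.

```python
# Minimal sketch (illustrative assumptions): penalize states where an ensemble
# of learned value models disagrees, i.e. where different generalizations of
# human values diverge, so the system is averse to those regions.
import numpy as np

class ValueEnsemble:
    def __init__(self, models):
        self.models = models  # each: callable state -> scalar value estimate

    def conservative_value(self, state, disagreement_weight=2.0):
        estimates = np.array([m(state) for m in self.models])
        mean = estimates.mean()
        disagreement = estimates.std()   # high where the generalizations diverge
        # Be pessimistic in regions the learned values don't agree on.
        return mean - disagreement_weight * disagreement

# Toy models that agree near the training distribution (small |state|) and
# diverge far away from it.
models = [lambda s, k=k: -abs(s) + k * 0.05 * s**2 for k in range(-2, 3)]
ensemble = ValueEnsemble(models)
print("in-distribution state:    ", ensemble.conservative_value(0.5))
print("out-of-distribution state:", ensemble.conservative_value(10.0))
```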
Dataset / Human-side design:
A broad sample of humanity should be able to give feedback.
Transparency and accountability of the feedback process are important.
Capabilities that are useful for friendliness but problematic for safety:
Situational awareness
Independent planning and action when needed (some amount of consequentialist evaluation of soliciting human feedback/oversight; see the sketch after this list).
Generality
Within-lifetime learning
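For the "consequentialist evaluation of soliciting human feedback/oversight" point, here is a minimal sketch of the kind of decision rule I mean: act independently when a step is low-stakes and well-understood, and stop to ask when the expected loss from acting under uncertainty outweighs the cost of interrupting a human. The probabilities, losses, and costs are made-up placeholders.

```python
# Minimal sketch (illustrative placeholders): a consequentialist check on whether
# to pause and solicit human feedback/oversight before acting.
def should_ask_human(p_plan_is_fine, loss_if_wrong, cost_of_asking):
    """Ask for oversight when the expected downside of acting autonomously
    outweighs the overhead of a human check."""
    expected_loss_acting = (1.0 - p_plan_is_fine) * loss_if_wrong
    return expected_loss_acting > cost_of_asking

# Routine, low-stakes step: act independently.
print(should_ask_human(p_plan_is_fine=0.95, loss_if_wrong=1.0, cost_of_asking=0.2))   # False
# Uncertain, high-stakes step: stop and ask.
print(should_ask_human(p_plan_is_fine=0.80, loss_if_wrong=50.0, cost_of_asking=0.2))  # True
```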