I don’t think deontic desiderata and constraints are practically sufficient to train (obtain) ethical behaviour that we would approve of: virtue ethics (a non-deontic lens on ethics, vs. deontology and consequentialism, which are deontic) exists and remains viable for a reason. That being said, seeding the policy training stage with LLM-based heuristics could effectively play the role of “virtue ethics” here: literally, these LLMs could be prompted to suggest policies like a “benevolent steward of the civilisation” would. If my guess (that deontic desiderata and constraints are not enough) is true, then these heuristic policies should leave a non-trivial mark on the final trained policy, i.e., effectively select among desiderata-conforming policies.
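As a concrete illustration of what this seeding could look like, here is a minimal sketch, assuming only a generic text-completion callable `llm`; every name below is a hypothetical illustration rather than an OAA interface:

```python
from typing import Callable, List

# All names here are hypothetical illustrations; `llm` stands in for any
# text-completion callable (an assumption, not an OAA interface).
STEWARD_FRAMING = (
    "You are a benevolent steward of the civilisation. "
    "Propose a policy for the situation below, optimising for long-term "
    "flourishing while respecting the stated constraints.\n\n"
    "Situation: {situation}\nConstraints: {constraints}\nProposed policy:"
)


def propose_seed_policies(
    llm: Callable[[str], str],
    situation: str,
    constraints: str,
    n_candidates: int = 8,
) -> List[str]:
    """Collect candidate policies with which to seed the (deontically filtered) policy search."""
    prompt = STEWARD_FRAMING.format(situation=situation, constraints=constraints)
    return [llm(prompt) for _ in range(n_candidates)]
```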
RL Limitations: While reinforcement learning has made significant advancements, it still has limitations, as evidenced by MuZero’s inability to effectively handle games like Stratego. To address these limitations, assistance from large language models might be required to bootstrap the training process. However, it remains unclear how to combine the strengths of both approaches effectively: leveraging the improved reliability and formal verifiability offered by reinforcement learning while harnessing the advanced capabilities of large language models.

In general, I think the place of RL policies in the OAA would be better served by a hierarchical joint-embedding predictive architecture (H-JEPA) with GFlowNets as planner (actor) policies. This sidesteps the problem of “designing a reward function” because the GFlowNet-EM algorithm (Zhang et al., 2022) can jointly train the policy and the energy (reward) function based on the feedback from the simulator and the desiderata/constraints model.
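To make the joint-training idea more tangible, here is a toy sketch of a sampler and an energy model trained together; it is loosely in the spirit of the GFlowNet-EM idea but is not the published algorithm and not an OAA component. The `simulate_and_check` stub stands in for feedback from the Simulated World Model plus the desiderata/constraints model, and “plans” are reduced to fixed-length bit strings:

```python
import torch
import torch.nn as nn

PLAN_LEN = 6  # a "plan" here is just a fixed-length sequence of binary choices


def simulate_and_check(plans: torch.Tensor) -> torch.Tensor:
    """Stub for Simulated World Model + desiderata/constraints feedback.

    Rewards plans whose consecutive choices alternate (purely illustrative).
    """
    flips = (plans[:, 1:] != plans[:, :-1]).float().sum(dim=1)
    return flips  # higher score = better plan


class ForwardPolicy(nn.Module):
    """GFlowNet forward policy: picks the next binary choice given the prefix."""

    def __init__(self, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(PLAN_LEN, hidden), nn.ReLU(), nn.Linear(hidden, 2)
        )
        self.log_z = nn.Parameter(torch.zeros(1))  # log-partition estimate for trajectory balance

    def forward(self, prefix: torch.Tensor) -> torch.Tensor:
        return self.net(prefix)


class EnergyModel(nn.Module):
    """Learned energy over complete plans; reward is exp(-energy)."""

    def __init__(self, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(PLAN_LEN, hidden), nn.ReLU(), nn.Linear(hidden, 1)
        )

    def forward(self, plans: torch.Tensor) -> torch.Tensor:
        return self.net(plans).squeeze(-1)


def sample_plans(policy: ForwardPolicy, batch: int):
    """Roll out the forward policy one choice at a time; return choices, encodings, log-probs."""
    prefix = torch.zeros(batch, PLAN_LEN)  # 0 = undecided, +/-1 = chosen bit
    log_pf = torch.zeros(batch)
    choices = []
    for t in range(PLAN_LEN):
        dist = torch.distributions.Categorical(logits=policy(prefix))
        action = dist.sample()
        log_pf = log_pf + dist.log_prob(action)
        choices.append(action)
        prefix = prefix.clone()
        prefix[:, t] = action.float() * 2 - 1
    return torch.stack(choices, dim=1), prefix, log_pf


policy, energy = ForwardPolicy(), EnergyModel()
opt_policy = torch.optim.Adam(policy.parameters(), lr=1e-2)
opt_energy = torch.optim.Adam(energy.parameters(), lr=1e-2)

for _ in range(500):
    choices, plan_enc, log_pf = sample_plans(policy, batch=64)

    # (1) Fit the energy model to the external evaluator on the sampled plans.
    target_energy = -simulate_and_check(choices)  # low energy = good plan
    energy_loss = ((energy(plan_enc) - target_energy) ** 2).mean()
    opt_energy.zero_grad()
    energy_loss.backward()
    opt_energy.step()

    # (2) Trajectory-balance update: push the sampler toward R(x) = exp(-E(x)).
    # Each plan has a unique construction path, so log P_B = 0 and TB reduces to this.
    log_reward = -energy(plan_enc).detach()
    tb_loss = ((policy.log_z + log_pf - log_reward) ** 2).mean()
    opt_policy.zero_grad()
    tb_loss.backward()
    opt_policy.step()
```

The only point of the sketch is the division of labour: the energy model absorbs the external feedback, and the actor is trained (here with the trajectory-balance objective) to sample plans roughly in proportion to exp(-energy), rather than to maximise a hand-designed reward.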
In fact, the LLM-based “exemplary actor” (which I suggested “distilling” into an H-JEPA with GFlowNet actors here) could be seen as an informal model of both the Compositional World Model and the Policy Search Heuristics in the OAA. The whole process that I described in that article could be used as the first “heuristic” sub-step of Step 2 of the OAA, which would serve three different purposes (a rough sketch follows the list below):
Reduce the cost of training the policy on feedback from the simulated world model alone, which can be prohibitively expensive because the reward signal is too sparse. This is the stated purpose of using Policy Search Heuristics in the OAA.
Connect the policy with language and thus make it explainable: a side effect of training the H-JEPA with GFlowNet actors on the outputs of an LLM-based agent is that we also get an auxiliary model that produces “reference” explanations for given plans. We can reasonably expect these explanations to be “faithful” rather than degenerate or deceptive, as I argue here.
The LLM-based “exemplary actor” may incorporate some “virtue ethics” and thus address the potential “deontic insufficiency” (which I touch upon at the beginning of this comment).
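Here is the rough sketch promised above, showing one hypothetical shape of this “heuristic” sub-step; all interfaces are invented for illustration, and the actual proposal targets H-JEPA with GFlowNet actors, which is not implemented here:

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Demonstration:
    situation: str
    plan: str          # plan proposed by the LLM-based exemplary actor
    explanation: str   # its natural-language justification for that plan


def collect_demonstrations(
    exemplary_actor: Callable[[str], Tuple[str, str]],
    situations: List[str],
) -> List[Demonstration]:
    """Query the exemplary actor for (plan, explanation) pairs to imitate."""
    return [Demonstration(s, *exemplary_actor(s)) for s in situations]


def distill(demos: List[Demonstration], plan_model, explanation_model) -> None:
    """One imitation pass over the demonstrations.

    `plan_model` and `explanation_model` are placeholders assumed to expose a
    `fit_step` method: the former clones the actor's plans (the bootstrapping
    of purpose 1 in the list above), the latter learns (situation, plan) ->
    explanation, which later provides the "reference" explanations of purpose 2.
    """
    for d in demos:
        plan_model.fit_step(d.situation, d.plan)
        explanation_model.fit_step((d.situation, d.plan), d.explanation)
```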
That being said, I speculated that even this “heuristic” sub-step may take orders of magnitude more compute and/or orders of magnitude larger DNNs than next-gen SoTA LLMs circa 2024-2026. The “main” training with the Simulated World Model and desiderata/constraints could require still more compute. Thus, I have serious concerns about whether training such a policy will be computationally feasible, especially considering the problem of World Model invalidation as any new bit of scientific knowledge enters the picture.
On the other hand, I doubt that complex systems such as societies, ecosystems, economies/markets, and the civilisation as a whole could be modelled long-term with reasonable fidelity. Also, as the OAA is introduced into the civilisation, it raises Gödelian problems concerning the limits of self-modelling. Furthermore, the highest levels of planning and modelling in the OAA might be dominated by issues of political bargaining rather than by finding a workable policy (at this stage, the civilisation might be sufficiently powerful to execute a wide range of reasonable policies if everyone agrees with the plan). Given all this, it seems that Steps 2-4 might even be unnecessary; instead, we could have people or LLMs suggesting relatively short-term plans and bargaining over them. The massive step forward here is just the formalisation of desiderata, constraints, and bargaining, all of which in turn rest on the World Model.
I like the idea of trying out H-JEPA with GFlowNet actors.
I also like the idea of using LLM-based virtue ethics as a regularizer, although I would still want deontic guardrails that seem good enough to avoid catastrophe.
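One way to read this combination, as a toy sketch with purely hypothetical interfaces: the deontic guardrails act as a hard filter that can veto (or refuse outright), while the LLM-based virtue score only reweights among the admissible plans:

```python
from typing import Callable, List, Optional


def select_plan(
    candidates: List[str],
    task_score: Callable[[str], float],           # e.g. expected desiderata satisfaction
    virtue_score: Callable[[str], float],         # soft, LLM-derived virtue-ethics judgement
    violates_guardrails: Callable[[str], bool],   # hard deontic constraint check
    virtue_weight: float = 0.1,
) -> Optional[str]:
    """Hard guardrails veto; the virtue term only reweights admissible plans."""
    admissible = [p for p in candidates if not violates_guardrails(p)]
    if not admissible:
        return None  # refuse rather than act on a constraint-violating plan
    return max(admissible, key=lambda p: task_score(p) + virtue_weight * virtue_score(p))
```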