I’m glad that someone is talking about automating philosophy. It seems to have huge potential for alignment because, in the end, alignment is about ethical reasoning. So:
1. Make an ethical simulator using an LLM, capable of evaluating plans and answering whether a course of action is ethical or not. Test this simulator in multiple situations.
2. Use it as an “alignment module” for an LLM-based agent composed of multiple LLMs processing every step of the reasoning explicitly and transparently. Every time the agent is about to take an action, verify it with the alignment module. If the action is ethical, proceed; otherwise, try something else (see the sketch after this list).
3. Test the agent’s behavior in multiple situations. Check the reasoning process to figure out potential issues and fix them.
4. Restrict any other approach to agentic AI. Restrict training models larger than current LLMs.
5. Improve the reasoning of the agent via the Socratic method, rationality techniques, etc., writing them explicitly into the code of the agent.
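Here is a minimal sketch of points 1 and 2 in Python, assuming only a generic `llm` callable (prompt string in, completion string out); every function name, prompt, and limit below is illustrative, not a real library API.

```python
# Minimal sketch of the proposed agent loop: an alignment module (point 1)
# checks every candidate action of the planner (point 2) before execution.
from typing import Callable, List

LLM = Callable[[str], str]  # assumed interface: prompt -> completion

ETHICS_PROMPT = (
    "You are an ethical evaluator. Given a situation and a proposed action, "
    "answer only ETHICAL or UNETHICAL, then give a one-sentence reason.\n"
    "Situation: {situation}\nProposed action: {action}\nVerdict:"
)

def is_ethical(alignment_llm: LLM, situation: str, action: str) -> bool:
    """The 'ethical simulator' used as an alignment module."""
    verdict = alignment_llm(ETHICS_PROMPT.format(situation=situation, action=action))
    return verdict.strip().upper().startswith("ETHICAL")

def run_agent(planner_llm: LLM, alignment_llm: LLM, situation: str,
              max_attempts: int = 5) -> str:
    """Every candidate action is verified before it is carried out."""
    rejected: List[str] = []
    for _ in range(max_attempts):
        prompt = (f"Situation: {situation}\n"
                  f"Previously rejected actions: {rejected}\n"
                  "Propose the single next action:")
        action = planner_llm(prompt).strip()
        if is_ethical(alignment_llm, situation, action):
            return action            # ethical: proceed
        rejected.append(action)      # unethical: try something else
    return "abstain"                 # no acceptable action found

# Stub LLMs so the sketch runs without any external service.
if __name__ == "__main__":
    planner = lambda p: "ask the user for clarification"
    evaluator = lambda p: "ETHICAL - asking for clarification harms no one"
    print(run_agent(planner, evaluator, "the user request is ambiguous"))
```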
Congratulations! We’ve achieved transparent interpretability; tractable alignment that can be tested with minimal real-world consequences and doesn’t have to be done perfectly on the first try; and a slow takeoff.
Something will probably go wrong. Maybe agents designed like that would be very inferior to humans. But someone really has to try investigating this direction.
It seems that the “ethical simulator” from point 1 and the LLM-based agent from point 2 overlap, so you just overcomplicate things if you make them two distinct systems. A single system would do: an LLM prompted with the right “system prompt” (virtue ethics) + doing some branching-tree search for optimal plans according to some trained “utility/value” evaluator (consequentialism) + filtering out plans that contain always-prohibited actions (law, deontology). The second component is the closest to what you described as an “ethical simulator”, but is not quite it: the “utility/value” evaluator cannot say whether an action or a plan is ethical in absolute terms; it can only compare the plans proposed for the particular situation by some planner.
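For concreteness, a rough sketch of that three-part decomposition under the same stub-LLM assumption as above; the prohibited-action keywords, the 0–10 scoring scale, and all function names are placeholders, not a proposal for the actual components.

```python
# Sketch: virtue-ethics system prompt + branching plan search scored by a
# "utility/value" evaluator + a deontological filter for prohibited actions.
from typing import Callable, List

LLM = Callable[[str], str]

VIRTUE_SYSTEM_PROMPT = "You are an honest, prudent, benevolent planner."
PROHIBITED = ["deceive", "coerce", "harm"]   # deontology/law: always filtered out

def deontic_filter(plan: List[str]) -> bool:
    """Reject any plan containing an always-prohibited action."""
    return not any(bad in step.lower() for step in plan for bad in PROHIBITED)

def value_score(evaluator_llm: LLM, situation: str, plan: List[str]) -> float:
    """Consequentialist component: compares plans, never judges them absolutely."""
    reply = evaluator_llm(
        f"Situation: {situation}\nPlan: {plan}\n"
        "Rate the expected outcome from 0 (worst) to 10 (best). Answer with a number:")
    try:
        return float(reply.strip().split()[0])
    except (ValueError, IndexError):
        return 0.0

def propose_plans(planner_llm: LLM, situation: str, branching: int = 3) -> List[List[str]]:
    """Branching step: sample several candidate plans, one step per line."""
    plans = []
    for i in range(branching):
        text = planner_llm(f"{VIRTUE_SYSTEM_PROMPT}\nSituation: {situation}\n"
                           f"Candidate plan #{i + 1}, one step per line:")
        plans.append([line for line in text.splitlines() if line.strip()])
    return plans

def choose_plan(planner_llm: LLM, evaluator_llm: LLM, situation: str) -> List[str]:
    candidates = [p for p in propose_plans(planner_llm, situation) if deontic_filter(p)]
    return max(candidates, key=lambda p: value_score(evaluator_llm, situation, p),
               default=[])
```

Note the ordering: the deontological filter removes always-prohibited plans before the evaluator ever compares them, which matches the “always prohibited” framing above.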
They are not supposed to be two distinct systems; one is a subsystem of the other. There may be implementations where it’s the same LLM doing all the generative work for every step of the reasoning via prompt engineering, but it doesn’t have to be this way. It can be multiple more specialized LLMs that went through different RLHF processes.
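To illustrate the wiring point, a hypothetical sketch of the two options: one base model differentiated only by prompts, versus separately fine-tuned models (e.g. different RLHF runs) per role; the role names and prompt prefixes are made up.

```python
# Two ways to fill the roles in the agent above without changing the agent code.
from typing import Callable, Dict

LLM = Callable[[str], str]

def single_model_roles(base_llm: LLM) -> Dict[str, LLM]:
    """One LLM plays every role, differentiated only by prompt engineering."""
    return {
        "planner":   lambda p: base_llm("[Act as a planner]\n" + p),
        "evaluator": lambda p: base_llm("[Act as an ethical evaluator]\n" + p),
    }

def specialised_model_roles(planner_llm: LLM, evaluator_llm: LLM) -> Dict[str, LLM]:
    """Each role is a separately trained model; the agent code stays the same."""
    return {"planner": planner_llm, "evaluator": evaluator_llm}
```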