Good post. I’ve been independently thinking about something similar.
I want to highlight an implied class of alignment solution.
If we can give rise to just the system on the right side of your chart, then the inner alignment problem is avoided: a system without terminal goals avoids instrumental convergence.
If we could further give rise to an oracle that takes in ‘questions’ (to answer) rather than ‘goals’ (which really just imply one specific kind of question: ‘what action scores highest on <value function>?’), we could also avoid the outer alignment problem. (I have a short post about this here [edit: link removed])
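To make that distinction concrete, here is a toy sketch in Python (my own illustration, not from the post or my linked one): a goal-directed agent is locked into the single implicit question ‘what action scores highest on <value function>?’, while an oracle takes arbitrary questions and returns answers without selecting any actions. All names here (goal_directed_agent, oracle, value_fn) are placeholders I made up.

```python
from typing import Any, Callable, Iterable

# Toy illustration only; the structure and names are mine, not from the post.

def goal_directed_agent(actions: Iterable[Any], value_fn: Callable[[Any], float]) -> Any:
    """A system with a terminal goal: it only ever answers the one implicit
    question 'which action scores highest on <value function>?'."""
    return max(actions, key=value_fn)

def oracle(question: str, knowledge: dict) -> str:
    """A question-answering system: it takes an arbitrary question and returns
    an answer; it selects no actions and optimizes no value function."""
    return knowledge.get(question, "I don't know.")

# The agent's interface forces optimization over actions; the oracle's does not.
best = goal_directed_agent(["a", "b", "c"], value_fn=lambda a: {"a": 0.1, "b": 0.9, "c": 0.3}[a])
answer = oracle("What is the boiling point of water at sea level?",
                {"What is the boiling point of water at sea level?": "100 °C"})
print(best, "|", answer)
```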
This sounds a bit like davidad’s agenda in ARIA, except you also limit the AI to only writing provable mathematical solutions to mathematical questions to begin with.
In general, I would say that you probably need better feedback loops than that, whether by writing more on LW, consulting with more people, or joining a fellowship or other programs.
This seems like a misunderstanding / not my intent. (Could you maybe quote the part that gave you this impression?)
I believe Dusan was trying to say that davidad’s agenda limits the planner AI to only writing provable mathematical solutions. To expand: compared to what you briefly describe, the idea in davidad’s agenda, as I understand it, is that you don’t try to build a planner that’s definitely inner aligned; instead, you have a formal verification system that ~guarantees what effects a plan will and won’t have if implemented.
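For concreteness, here is a minimal toy sketch of that “verify the plan, not the planner” pattern as I understand it. Everything in it is a placeholder of my own (the plan format, the safety predicate, the function names); in the actual agenda the check would be a machine-checked proof against a formal world model, not a runtime predicate.

```python
from typing import Callable

# Hypothetical, highly simplified gate: the planner is untrusted; only plans
# that pass an independent check against a stated safety property are executed.

Plan = list  # placeholder plan representation (a list of step descriptions)

def untrusted_planner(task: str) -> Plan:
    """Stands in for a powerful planner whose inner alignment is unknown."""
    return [f"step towards: {task}"]

def verified_safe(plan: Plan, safety_property: Callable[[Plan], bool]) -> bool:
    """Stands in for formal verification of the plan's effects; here it is
    just a predicate, not a proof."""
    return safety_property(plan)

def execute_if_verified(task: str, safety_property: Callable[[Plan], bool]) -> None:
    plan = untrusted_planner(task)
    if verified_safe(plan, safety_property):
        print("executing:", plan)
    else:
        print("rejected: no safety guarantee for", plan)

# Example placeholder property: no step mentions irreversible actions.
execute_if_verified("tidy the lab", lambda p: all("irreversible" not in s for s in p))
```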
To answer the things Raymond did not: it is hard for me to say whose agenda you would think has a good chance of solving alignment. I’d encourage you to reach out to people who pass your bar, perhaps more frequently than you currently do, and establish those connections. Your constraint of no audio or video does make it hard to participate in something like the PIBBSS Fellowship, but it is perhaps worth taking a shot at it or at other programs. See if people whose ideas you like are mentoring in some programs; getting to work with them in structured ways may be easier than doing so otherwise.