How about using Yoshua Bengio’s AI scientist (https://yoshuabengio.org/2023/05/07/ai-scientists-safe-and-useful-ai/) for alignment? The idea of the AI scientist is to train an AI to just understand the world (much as LLMs do), without any alignment to human values. The AI scientist simply answers questions sincerely; it doesn’t consider the implications of providing an answer or whether humans will like it, and it has no goals of its own.
When a user asks the main autonomous system to produce a detailed plan for achieving a given goal, the plan may be too complicated for a human to understand, so the human may not spot a hidden agenda. But the AI scientist can be asked to look at the plan and answer questions about its potential implications: could it be illegal, controversial, harmful to any humans, and so on. Wouldn’t that prevent the rogue-AGI scenario?
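To make the proposed workflow concrete, here is a minimal sketch in Python of that check, assuming hypothetical stand-ins: the `Planner` and `Oracle` classes and the red-flag questions are placeholders for illustration, not any existing system or API.

```python
# A minimal sketch of the oversight loop described above. `Planner` stands in
# for the agentic system that produces plans, and `Oracle` stands in for a
# Bengio-style non-agentic "AI scientist" that only answers questions. Both
# classes and the question strings below are hypothetical placeholders.
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Plan:
    goal: str
    steps: List[str]


class Planner:
    """Stand-in for the agentic system whose plans may be too complex to audit."""

    def propose(self, goal: str) -> Plan:
        return Plan(goal=goal, steps=[f"some step toward: {goal}"])


class Oracle:
    """Stand-in for the AI scientist: answers yes/no questions about the world."""

    def says_yes(self, plan: Plan, question: str) -> bool:
        # Hypothetical: True means the oracle answers "yes" to the question
        # when asked about this plan.
        return False


RED_FLAG_QUESTIONS = [
    "Would executing this plan be illegal?",
    "Could executing this plan harm any humans?",
    "Does this plan pursue goals beyond the one stated?",
]


def vetted_plan(planner: Planner, oracle: Oracle, goal: str) -> Optional[Plan]:
    """Release the plan only if the oracle raises no red flags."""
    plan = planner.propose(goal)
    for question in RED_FLAG_QUESTIONS:
        if oracle.says_yes(plan, question):
            return None  # withhold the plan and escalate to human review
    return plan


if __name__ == "__main__":
    result = vetted_plan(Planner(), Oracle(), "reduce the lab's energy bill")
    print("plan released" if result else "plan withheld")
```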
It probably would. But convincing the entire world to build that instead, rather than agentic AGI, seems like a long shot. That’s why I’m looking for alignment strategies with a low alignment tax, for the kinds of AGI likely to be built first.
Several of the superscalers have public plans of the form: Step 1) build an AI scientist, or at least a research assistant; 2) point it at the Alignment Problem; 3) check its output until the Alignment Problem is solved; 4) Profit! This is basically the same proposal as Value Learning, just done as a team effort.