I agree that it's far too dangerous to just give a command and have the agent go off and do something big based on its own interpretation. Any failures are too many.
I’ve written about this in “Internal independent review for language model agent alignment.”
The argument there is that we’ll want lots of redundant checks for capabilities and mundane safety, as well as for existential risk.
I think this will apply to mundane requests like “start a business selling cute outfits on eBay” as well. You don’t want the agent to take actions on your behalf that don’t do what you meant: spending the money you gave it in stupid ways, irritating people in your name, etc. So adding checks before executing is helpful for mundane safety too. You’ll probably want human involvement in any complex plan; you don’t even want to spend a bunch of money on LLM calls exploring fundamentally misdirected plans.
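
To make “checks before executing” a bit more concrete, here’s a rough sketch of the kind of gating I have in mind. Every name and check in it is hypothetical, just an illustration of redundant pre-execution review, not anything from the linked post:

```python
# A minimal sketch of redundant pre-execution checks on an agent's proposed action.
# All names here are hypothetical illustrations, not an API from any real framework.
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ProposedAction:
    description: str       # what the agent intends to do
    estimated_cost: float  # money the action would spend, in dollars
    irreversible: bool     # whether the action could be undone if it turns out wrong


def within_budget(action: ProposedAction, budget: float = 50.0) -> bool:
    """Mundane-safety check: block actions that overspend the allotted budget."""
    return action.estimated_cost <= budget


def human_approved(action: ProposedAction) -> bool:
    """Escalation check: irreversible actions wait for a person rather than auto-executing.
    Stubbed here to always defer; a real system would prompt the user."""
    return not action.irreversible


def passes_all_checks(action: ProposedAction,
                      checks: List[Callable[[ProposedAction], bool]]) -> bool:
    """Redundant gating: every independent check must pass before execution."""
    return all(check(action) for check in checks)


if __name__ == "__main__":
    action = ProposedAction("List 200 cute outfits on eBay",
                            estimated_cost=120.0, irreversible=True)
    if passes_all_checks(action, [within_budget, human_approved]):
        print("Executing:", action.description)
    else:
        print("Blocked before execution:", action.description)
```

The point is just that several cheap, independent checks (budget, reversibility, human sign-off) sit between the plan and the world, so mundane mistakes get caught before any money is spent or any action is taken, and the same gating structure can host the checks aimed at bigger risks.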