For the past several months, I’ve described my primary area of research interest as “foundation model agent (FMA) safety.” When I talk about FMAs, I have in mind systems like AutoGPT that equip foundation models with memory, tool use, and other affordances so they can perform multi-step tasks autonomously. I think having FMAs as a central object of study is productive for the following reasons.
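To pin down the object of study before giving those reasons: the core pattern I have in mind is just a foundation model called in a loop, with a memory of what it has done so far and the ability to invoke tools. Here is a minimal, illustrative sketch of that loop; the `call_model` and tool interfaces are hypothetical stand-ins, not any particular framework’s API.

```python
from typing import Callable

def run_agent(
    task: str,
    call_model: Callable[[str], dict],          # prompt -> {"type": "tool"|"finish", ...}
    tools: dict[str, Callable[[str], str]],     # tool name -> tool implementation
    max_steps: int = 10,
) -> str:
    """Minimal FMA loop: foundation model + memory + tool use, iterated until done."""
    memory: list[str] = [f"Task: {task}"]       # simple append-only memory
    for _ in range(max_steps):
        # The model sees the task plus everything done so far and proposes
        # the next action: either a tool call or a final answer.
        action = call_model("\n".join(memory))
        if action["type"] == "finish":
            return action["answer"]
        # Execute the chosen tool and feed the observation back into memory.
        # This closed loop is what gives the system multi-step autonomy,
        # which a bare foundation model lacks.
        observation = tools[action["tool"]](action["args"])
        memory.append(f"Called {action['tool']}({action['args']}) -> {observation}")
    return "Step budget exhausted."
```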
I think FMAs could soon give us AGI/ASI agents that take influential actions in the real world. Foundation models without tool use and multi-step autonomy seem unlikely to have anywhere near the level of real-world impact I expect from AGI/ASI. Not only are they incapable of executing the real-world actions many plans require; I suspect they cannot even learn the essential cognitive strategies for multi-step task execution, because learning those strategies seems likely to require some trial and error on multi-step tasks with tools.
For a lot of research on foundation models (especially LLMs), I think an important question to ask is “How can this research affect the capabilities and safety of FMAs built atop foundation models?” This helps tie the research to a downstream system that more clearly matters in the long run.
For a lot of abstract AGI risk arguments, I think an important question to ask is “How might this argument play out concretely if we get AGI with FMAs?” (Asking this question has actually made me more optimistic of late: I think what AGI labs are doing by default might just lead to intent-aligned AGI/ASI FMAs whose goals are determined by what humans request of them in natural language.)
I think it’s easier to find analogies to human sequential decision-making in FMAs than in base foundation models. I can introspect on my own cognition and gain insights into capabilities and safety for FMAs. I think it’s both very useful and very fun to draw on this introspective source of information (though you have to be careful not to over-anthropomorphize).
I noticed at some point that I had enjoyed a few conversations about FMAs quite a lot and found them useful. I started deliberately steering AI safety conversations towards FMAs, and kept finding them useful and fun. They’re kind of a cheat code for having useful conversations. I was curious why that was the case, and I think the four points above explain it.