Should safety-focused people support the advancement of FMA capabilities?
Probably. The advantages of a system without goal-directed RL (RL is used, but only to get the “oracle” to answer questions as the user intended them) and with a legible train of thought seem immense. I don’t see how we close the floodgates of AGI development now. Given that we’re getting AGI, it really seems like our best bet is FMA AGI.
But I’m not ready to help anyone develop AGI until this route to alignment and survival has been more thoroughly worked through in the abstract. I really wish more alignment skeptics would engage with specific plans instead of just pointing to general arguments about how alignment would be difficult, some of which don’t apply to the ways we’d really align FMAs (see my other comment on this post). We may be getting close; Shut It All Down isn’t a viable option AFAICT, so we need to put together our best shot.
Takes on a few more important questions:

1. Will the first transformative AIs be FMAs?
Probably, but not certainly. I’d be very curious to get a survey of people who’ve really thought about this. Those who are sure the first transformative AIs won’t be FMAs give reasons I find highly dubious. At the least, it seems likely enough that we should be thinking about aligning FMAs in more detail, because we can see their general shape more clearly than that of other possible first AGIs.
2. Will narrow FMAs for a variety of specific domains be transformatively useful before we get transformatively useful general FMAs?
No. There are advantages to creating FMAs for specific domains, but there are also very large advantages to working on general reasoning. Humans are not limited to narrow domains, but can learn just about anything through instruction or self-instruction. Language models trained on human “thought” can do the same as soon as they have any sort of useful persistent memory. Existing memory systems don’t work well, but they will be improved, probably rapidly.
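To make “persistent memory” concrete, here is a minimal sketch of the kind of store an FMA loop might use. Everything in it is hypothetical rather than a description of any existing system: `llm` is a stand-in for whatever foundation model the agent wraps, and the keyword-overlap retrieval is a crude placeholder for real (e.g., embedding-based) retrieval.

```python
# Hypothetical sketch only: a bare-bones persistent memory for an FMA.
# `llm` is a stand-in callable for whatever foundation model the agent wraps.
import json
from pathlib import Path


class PersistentMemory:
    """Notes that survive across sessions, with crude keyword retrieval."""

    def __init__(self, path: str = "memory.json"):
        self.path = Path(path)
        self.notes: list[str] = (
            json.loads(self.path.read_text()) if self.path.exists() else []
        )

    def remember(self, note: str) -> None:
        self.notes.append(note)
        self.path.write_text(json.dumps(self.notes))  # persists across restarts

    def recall(self, query: str, k: int = 3) -> list[str]:
        # Rank notes by word overlap with the query; a real system would
        # use embeddings, but the interface would look the same.
        words = set(query.lower().split())
        ranked = sorted(
            self.notes,
            key=lambda note: len(words & set(note.lower().split())),
            reverse=True,
        )
        return ranked[:k]


def agent_step(llm, memory: PersistentMemory, task: str) -> str:
    """One step of a hypothetical FMA loop: recall, act, store what was learned."""
    context = "\n".join(memory.recall(task))
    answer = llm(f"Relevant notes:\n{context}\n\nTask: {task}")
    memory.remember(f"Task: {task} -> {answer}")
    return answer
```

The point of the sketch is just that memory need not be exotic: anything that survives across sessions and gets surfaced at the right moment lets the model accumulate what it learns, which is all the generality argument above requires.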
3. If FMAs are the first transformative AIs (TAIs), how long will FMAs remain the leading paradigm?
This is a really important question. I really hope FMAs remain the leading paradigm long enough to become useful in aligning other types of AGI, and that they remain adequately free of goal-directed RL to stay alignable.