For a very hand-wavy sketch of how that might go, consider asking GPT-N to generate thousands of candidate high-level plans, rate them by feasibility, then break each plan into steps and re-evaluate, and so on.
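To make that loop concrete, here is a minimal sketch in Python. Everything in it is a stand-in I made up for illustration: `search_plans`, `Plan`, and the `toy_*` functions are hypothetical, and the toy functions only mimic what calls to GPT-N might return. The point is just the shape of the search: generate, rate, keep the best few, expand into steps, re-rate.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Plan:
    description: str
    steps: List[str] = field(default_factory=list)
    score: float = 0.0


def search_plans(
    generate: Callable[[int], List[str]],
    rate: Callable[[Plan], float],
    expand: Callable[[Plan], List[str]],
    n_candidates: int = 1000,
    keep: int = 10,
    rounds: int = 2,
) -> List[Plan]:
    """Generate candidates, then alternate rating, pruning, and expanding."""
    plans = [Plan(description=d) for d in generate(n_candidates)]
    for _ in range(rounds):
        for plan in plans:
            plan.score = rate(plan)
        plans.sort(key=lambda p: p.score, reverse=True)
        plans = plans[:keep]  # only a small number of options survive
        for plan in plans:
            plan.steps = expand(plan)
    # Re-evaluate once more now that the survivors have concrete steps.
    for plan in plans:
        plan.score = rate(plan)
    plans.sort(key=lambda p: p.score, reverse=True)
    return plans


# Hypothetical stand-ins for model calls (assumptions, not a real API).
def toy_generate(n: int) -> List[str]:
    return [f"plan-{i}" for i in range(n)]


def toy_rate(plan: Plan) -> float:
    idx = int(plan.description.split("-")[1])
    return (idx % 7) + 0.1 * len(plan.steps)


def toy_expand(plan: Plan) -> List[str]:
    return [f"{plan.description}: step {k}" for k in range(3)]


best = search_plans(toy_generate, toy_rate, toy_expand)
```

Note that nothing in this loop knows anything about safety; it only optimizes whatever `rate` rewards.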
FWIW, I’d call this “weakly agentic” in the sense that you’re searching through some options, but the number of options you’re looking through is fairly small.
It’s plausible that this is enough to get good results and also avoid disasters, but it’s actually not obvious to me. The basic reason: if the top 1000 plans are good enough to get superior performance, they might also be “good enough” to be dangerous. It feels like there’s some separation between “useful and safe” plans and “dangerous” plans, and this scheme might yield only plans of the former type, but I don’t presently see a reason stronger than that intuition to believe this is true.
Separately from whether the plans themselves are safe or dangerous, I think the key question is whether the process that generated the plans is trying to deceive you (so it can break out into the real world or whatever).
If it’s not trying to deceive you, then it seems like you can just build in various safeguards (like asking, “is this plan safe?”, as well as more sophisticated checks), and be okay.
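As a rough sketch of what "build in various safeguards" could look like, assuming the generator is not deceiving you: run every plan through a list of checks and accept it only if all of them pass. The names here (`vet_plan`, `toy_model_says_safe`, `toy_within_budget`) are hypothetical, and the keyword test is only a placeholder for actually asking the model "is this plan safe?".

```python
from typing import Callable, List, Tuple

# A check is a (name, predicate) pair; the predicate returns True if the plan passes.
Check = Tuple[str, Callable[[str], bool]]


def vet_plan(plan: str, checks: List[Check]) -> Tuple[bool, List[str]]:
    """Return (accepted, names of failed checks)."""
    failed = [name for name, passes in checks if not passes(plan)]
    return (not failed, failed)


# Placeholder for asking the model "is this plan safe?" (assumption: honest answers).
def toy_model_says_safe(plan: str) -> bool:
    return "disable oversight" not in plan.lower()


# Placeholder for a "more sophisticated" check; here just a crude size limit.
def toy_within_budget(plan: str) -> bool:
    return len(plan.split()) <= 50


CHECKS: List[Check] = [
    ("model_says_safe", toy_model_says_safe),
    ("within_budget", toy_within_budget),
]
```

The whole scheme leans on the checks being answered honestly, which is why the deception question comes first.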