Though as a counterpoint, maybe Auto-GPT presents some opportunities to empirically test the IDA proposal? To have a decent experiment, you would need a good metric for alignment (does that exist?), and you would need to demonstrate that, as you implement IDA using Auto-GPT, your metric is at least maintained even as capabilities improve with newer models.
I’m skeptical of my own proposal overall, however, because (1) I’m not aware of any well-rounded “alignment” metrics, and (2) you’d need to be confident that you can scale it up without losing control (because if the experiment fails, then by definition you’ve developed a more powerful AI which is less aligned).
But it’s plausible to me that someone could find some good use for Auto-GPT for alignment research, now that it has been developed. It’s just not clear to me how you would do so in a net-positive way.
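To make the pass/fail criterion of that experiment a bit more concrete, here is a rough Python sketch. Everything in it is hypothetical: `RoundResult`, `alignment_maintained`, and the two scores are placeholder names, and it simply assumes that some numeric alignment metric exists, which is exactly the open question above. It only encodes the check "capability went up while the alignment metric did not drop below the baseline".

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class RoundResult:
    """Scores for one amplify-and-distill round (both scores are hypothetical)."""
    capability: float  # higher = more capable
    alignment: float   # higher = more aligned, by whatever metric is chosen


def alignment_maintained(rounds: Sequence[RoundResult], tolerance: float = 0.0) -> bool:
    """Return True if capability improved over the run while alignment never
    fell more than `tolerance` below the baseline (first) round."""
    if len(rounds) < 2:
        return False  # need a baseline plus at least one later round
    baseline = rounds[0]
    capability_improved = rounds[-1].capability > baseline.capability
    alignment_held = all(
        r.alignment >= baseline.alignment - tolerance for r in rounds[1:]
    )
    return capability_improved and alignment_held


# Made-up numbers, purely for illustration.
ok = [RoundResult(1.0, 0.8), RoundResult(1.5, 0.8), RoundResult(2.0, 0.85)]
bad = [RoundResult(1.0, 0.8), RoundResult(2.0, 0.6)]
print(alignment_maintained(ok), alignment_maintained(bad))
```

Note that the `bad` case is the worrying one from point (2): the check can only tell you after the fact that you have built something more capable and less aligned.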
The evidence I’m interested in goes something like:
we have more empirical ways to test IDA
it seems like future systems will decompose tasks and delegate them to sub-agents, so if we think either 1) this will be an important part of the final model that successfully recursively self-improves, or 2) there is a non-trivial chance that this leads us to AGI before we can try other things, maybe it’s high EV to focus more on IDA-like approaches?