Very little alignment work of note, despite tons of published work on developing agents. I’m puzzled as to why the alignment community hasn’t turned more of its attention toward language model cognitive architectures/agents, but I’m also reluctant to publish more work advertising how easily they might achieve AGI.
ARC Evals did set up a methodology for Evaluating Language-Model Agents on Realistic Autonomous Tasks. I view this as a useful acknowledgment of the real danger of better LLMs, but I think it’s inherently inadequate, because it relies on the evals team doing the scaffolding to turn the LLM into an agent. They won’t be able to devote nearly as much time to that as other groups will down the road. New capabilities are certainly going to come from a combination of LLM improvements and hard work on improving the cognitive architecture scaffolding around them.
I think evals are fantastic (i.e., obviously a good and correct thing to do; dramatically better than doing nothing), but there is a little bit of awkwardness in deciding how hard to try. You don’t really want to spend a well-funded startup’s worth of effort to trigger dangerous capabilities (and potentially cause your own destruction), but you know that eventually someone will. I don’t know how to resolve this.