It took longer to get from AutoGPT to Devin than I initially thought it would, though in retrospect it only took “this long” because that’s literally about how long it takes to productize something comparatively new like this.
It does make me realize though that the baking timer has dinged and we’re about to see a lot more of this stuff coming out of the oven.
See also MultiOn and Maisa. Both are agent enhancements for LLMs that claim notable new abilities on benchmarks: MultiOn can do web tasks, while Maisa scores better on reasoning tasks than CoT prompting and uses more efficient calls for lower cost. Neither is in deployment yet, and neither company explains exactly how they're engineered. Ding! Ding!
I also thought developing agents was taking too long, until I talked to a few people actually working on them. LLMs exhibit new types of unexpected behavior, so engineering around that is a challenge. And then there's the standard time it takes to engineer anything reliable and usable enough to be useful.
So, we’re right on track for language model cognitive architectures with alarmingly fast timelines, coupled with a slow enough takeoff that we’ll get some warning shots.
Edit: I just heard about another one, GoodAI, developing the episodic (long-term) memory that I think will be a key element of LMCA agents. They outperform 128k-context GPT-4T with only 8k of context, on a memory benchmark of their own design, at 16% of the inference cost. Thanks, I hate it.
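(For intuition, since GoodAI hasn't published their method: a minimal sketch of what retrieval-based episodic memory for an LMCA might look like. The embed-and-recall shape and every name here are my assumptions, not their design.)

```python
# Hypothetical sketch of retrieval-based episodic memory for an LMCA agent.
# GoodAI has not published their method; this is just one plausible shape:
# store past interactions as embedded "episodes", then recall only the
# top-k most relevant ones into a small (e.g. 8k) context, instead of
# carrying a full 128k transcript on every call.

from dataclasses import dataclass, field
import numpy as np

@dataclass
class EpisodicMemory:
    embed: callable                               # text -> np.ndarray, any embedding model
    episodes: list = field(default_factory=list)  # (text, vector) pairs

    def store(self, text: str) -> None:
        self.episodes.append((text, self.embed(text)))

    def recall(self, query: str, k: int = 5) -> list:
        """Return the k stored episodes most similar to the query."""
        q = self.embed(query)
        def sim(vec):
            return float(np.dot(vec, q) / (np.linalg.norm(vec) * np.linalg.norm(q)))
        ranked = sorted(self.episodes, key=lambda ep: sim(ep[1]), reverse=True)
        return [text for text, _ in ranked[:k]]
```

The recalled episodes plus the current task can then stay well under an 8k budget on every call, which is the kind of thing that would explain the cost numbers.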
GoodAI's website says they're working on controlling drones, too (although it looks like a personal pet project that's probably not gonna go that far). The fun part is that their marketing sells "swarms of autonomous surveillance drones" as "safety". I mean, I guess it doesn't say killer drones...
Any new safety studies on LMCAs?
Very little alignment work of note, despite tons of published work on developing agents. I’m puzzled as to why the alignment community hasn’t turned more of their attention toward language model cognitive architectures/agents, but I’m also reluctant to publish more work advertising how easily they might achieve AGI.
ARC Evals did set up a methodology for Evaluating Language-Model Agents on Realistic Autonomous Tasks. I view this as a useful acknowledgment of the real danger of better LLMs, but I think it's inherently inadequate, because it's based on the evals team doing the scaffolding to make the LLM into an agent. They're not going to be able to devote nearly as much time to that as other groups will down the road. New capabilities are certainly going to be developed by combinations of LLM improvements and hard work at improving the cognitive architecture scaffolding around them.
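For anyone who hasn't seen one: the "scaffolding" in question is conceptually just a loop, which is why the effort differential matters so much. A toy sketch, with llm_call and tools as hypothetical stand-ins:

```python
# Toy sketch of agent scaffolding: the loop that turns a bare LLM into an
# agent. llm_call and tools are hypothetical stand-ins; the gap between
# this and a serious scaffold is exactly the effort gap described above.

def run_agent(goal: str, llm_call, tools: dict, max_steps: int = 10) -> str:
    history = [f"Goal: {goal}"]
    for _ in range(max_steps):
        reply = llm_call(
            "\n".join(history)
            + "\nReply 'ACTION: <tool> <arg>' or 'DONE: <answer>'."
        )
        if reply.startswith("DONE:"):
            return reply[len("DONE:"):].strip()
        _, tool_name, arg = reply.split(" ", 2)   # parse the chosen action
        observation = tools[tool_name](arg)       # e.g. shell, browser, editor
        history.append(reply)
        history.append(f"OBSERVATION: {observation}")
    return "step budget exhausted"
```

An evals team writes something close to this and moves on; a well-funded startup spends person-years on the parsing, the tool set, the error recovery, and the prompting at each step.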
I think evals are fantastic (ie obviously a good and correct thing to do; dramatically better than doing nothing) but there is a little bit of awkwardness in terms of deciding how hard to try. You don’t really want to spend a well-funded-startup’s worth of effort to trigger dangerous capabilities (and potentially cause your own destruction), but you know eventually that someone will. I don’t know how to resolve this.
Totally agree with you here. I think probably half of their development energy was spent getting to where GPT-4 Functions were, right when Functions came out, and then they were probably like... oh... welp.
Is Devin using GPT-4, GPT-4T, or one of the two currently available long-context models, Claude Opus 200k or Gemini 1.5?
March 14, 2023: GPT-4 is released, but the "long" context version was expensive and initially unavailable to anyone.
The reason that matters: November 6, 2023 is the announcement of GPT-4T, with 128k context.
Feb 15, 2024: Gemini 1.5 (long context).
March 4, 2024: Claude Opus (200k).
That makes the timeline less than four months, and remember there are generally a few weeks between "announcement" and "here's your opportunity to pay for tokens with an API key".
The prompting structure and meta-analysis for “Devin” was likely in the works since GPT-4, but without the long context you can’t fit:
[system prompt forced on you] ['be an elite software engineer' prompt] [issue description] [main source file] [data structures referenced in the main source file] [first attempt to fix] [compile or unit test outputs]
In practice I found that I need Opus 200k to even try when I do the above by hand.
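Some back-of-envelope arithmetic on that stack; every section size below is an illustrative guess of mine, not a measurement:

```python
# Back-of-envelope arithmetic for why this prompt stack needs a long-context
# model. Every section size is an illustrative guess, not a measurement.

sections = {
    "system prompt (forced on you)":     2_000,   # tokens
    "'elite software engineer' prompt":    500,
    "issue description":                 1_500,
    "main source file":                 15_000,
    "referenced data structures":       10_000,
    "first attempt to fix":              5_000,
    "compile / unit test outputs":       3_000,
}

total = sum(sections.values())
print(f"~{total:,} tokens for a single fix attempt")   # ~37,000 tokens
# One pass already blows an 8k or 32k window, and each further attempt adds
# another diff plus more test output, so iteration pushes toward 128k-200k.
```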
Also remember, GPT-4 128k starts failing near the end of its context window; the full 128k is not usable.
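That degradation is easy to probe with a "needle in a haystack" style test. A rough sketch, where llm_call stands in for any chat API and repeated filler words are only a crude proxy for real tokens:

```python
# Rough "needle in a haystack" probe for context degradation: bury a fact
# at increasing depths in filler text and check whether the model can
# still retrieve it. llm_call is a hypothetical stand-in for any chat API.

def probe_depths(llm_call, window: int = 128_000):
    needle = "The secret code is 7421."
    for frac in (0.1, 0.5, 0.9, 0.99):            # needle's depth in the window
        before = "lorem " * int(window * frac)
        after = "lorem " * int(window * (1 - frac))
        answer = llm_call(before + needle + after + "\nWhat is the secret code?")
        print(f"depth {frac:.0%}: {'ok' if '7421' in answer else 'MISSED'}")
```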
I hear that they use GPT-4. If you are looking at timelines, recall that Cognition apparently was founded around January 2024. (The March Bloomberg article says “it didn’t even officially exist as a corporation until two months ago”.) Since it requires many very expensive GPT-4 calls and RL, I doubt they could have done all that much prototyping or development in 2023.
Just seeing this, sorry. I think they could have gotten a lot of the infrastructure going even before GPT-4, just in a sort of toy fashion, but I agree that most of the development probably happened after GPT-4 became available. I don't think long context was as necessary: my guess is that the infrastructure set up behind the scenes was already parceling out subtasks to subagents, which circumvented the need for super-long context, though I'm sure having longer context definitely helps.
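A minimal sketch of that speculated shape: an orchestrator hands subtasks to fresh subagents, so no single call carries the whole history. All guesswork, not Cognition's actual design:

```python
# Speculative sketch of subtask decomposition: an orchestrator splits the
# job and each subagent gets a short, self-contained prompt, so no single
# call needs a super-long context. Not Cognition's actual design.

def solve_with_subagents(task: str, llm_call) -> str:
    # The orchestrator sees only the task, never a long transcript.
    plan = llm_call(f"Break this into independent subtasks, one per line:\n{task}")
    subtasks = [line.strip() for line in plan.splitlines() if line.strip()]

    results = []
    for sub in subtasks:
        # Each subagent gets its subtask plus a compressed summary of prior
        # results, instead of the raw history.
        summary = llm_call("Summarize in under 200 words:\n" + "\n".join(results)) if results else ""
        results.append(llm_call(f"Context: {summary}\nDo this subtask: {sub}"))

    # Integration again works from short pieces, not full transcripts.
    return llm_call("Combine these results into one final answer:\n" + "\n".join(results))
```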
My guess is that right now they're trying to optimize which sorts of subtasks go to which model by A/B testing. If Claude 3 Opus is as good as people say at coding, maybe they're using that for the actual coding output? Maybe they're using GPT-4T or Gemini 1.5 Pro as a central orchestration model? Who knows. I feel like there are lots of conceivable ways to string this kind of thing together, and there will be more and more coming out every week now...
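If they are A/B testing model assignments, the routing layer itself could be as simple as this sketch; the table contents are the same guesses as above, nothing confirmed:

```python
# Sketch of per-subtask model routing. The assignments mirror the guesses
# above and are not confirmed. A/B testing here just means: run the same
# subtask type through two candidate models, score the outputs (tests
# passed, human rating), and update the table.

ROUTER = {
    "orchestration":   "gpt-4-turbo",     # or gemini-1.5-pro
    "code_generation": "claude-3-opus",   # if it's as good at coding as claimed
    "summarization":   "some-cheap-model" # glue work doesn't need the big guns
}

def route(subtask_type: str, prompt: str, call_model) -> str:
    """Dispatch a subtask to whichever model its type is assigned to."""
    model = ROUTER.get(subtask_type, ROUTER["orchestration"])
    return call_model(model=model, prompt=prompt)
```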