It took longer to get from AutoGPT to Devin than I initially thought it would, though in retrospect it only took “this long” because that’s literally about how long it takes to productize something comparatively new like this.
It does make me realize though that the baking timer has dinged and we’re about to see a lot more of this stuff coming out of the oven.
See also MultiOn and Maisa. Both are agent enhancements for LLMs that claim notable new abilities on benchmarks: MultiOn can do web tasks, while Maisa scores better on reasoning tasks than CoT prompting and uses more efficient calls for lower cost. Neither is in deployment yet, and neither company explains exactly how they're engineered. Ding! Ding!
I also thought developing agents was taking too long, until I talked to a few people actually working on them. LLMs exhibit new types of unexpected behavior, so engineering around that is a challenge. And then there's the standard time it takes to engineer anything reliable and usable enough to be useful.
So, we’re right on track for language model cognitive architectures with alarmingly fast timelines, coupled with a slow enough takeoff that we’ll get some warning shots.
Edit: I just heard about another one, GoodAI, developing the episodic (long-term) memory that I think will be a key element of LMCA agents. They outperform 128k-context GPT-4T with only 8k of context, on a memory benchmark of their own design, at 16% of the inference cost. Thanks, I hate it.
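(For intuition, since GoodAI hasn't published their method: a minimal sketch of what retrieval-based episodic memory for an LMCA might look like. The embed-and-recall shape and every name here are my assumptions, not their design.)

```python
# Hypothetical sketch of retrieval-based episodic memory for an LMCA agent.
# GoodAI has not published their method; this is just one plausible shape:
# store past interactions as embedded "episodes", then recall only the
# top-k most relevant ones into a small (e.g. 8k) context, instead of
# carrying a full 128k transcript on every call.

from dataclasses import dataclass, field
import numpy as np

@dataclass
class EpisodicMemory:
    embed: callable                               # text -> np.ndarray, any embedding model
    episodes: list = field(default_factory=list)  # (text, vector) pairs

    def store(self, text: str) -> None:
        self.episodes.append((text, self.embed(text)))

    def recall(self, query: str, k: int = 5) -> list:
        """Return the k stored episodes most similar to the query."""
        q = self.embed(query)
        def sim(vec):
            return float(np.dot(vec, q) / (np.linalg.norm(vec) * np.linalg.norm(q)))
        ranked = sorted(self.episodes, key=lambda ep: sim(ep[1]), reverse=True)
        return [text for text, _ in ranked[:k]]
```

The recalled episodes plus the current task can then stay well under an 8k budget on every call, which is the kind of thing that would explain the cost numbers.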
GoodAI's website says they're working on controlling drones, too (although it looks like a personal pet project that's probably not gonna go that far). The fun part is that their marketing sells "swarms of autonomous surveillance drones" as "safety". I mean, I guess it doesn't say killer drones...
Any new safety studies on LMCAs?
Very little alignment work of note, despite tons of published work on developing agents. I’m puzzled as to why the alignment community hasn’t turned more of their attention toward language model cognitive architectures/agents, but I’m also reluctant to publish more work advertising how easily they might achieve AGI.
ARC Evals did set up a methodology for Evaluating Language-Model Agents on Realistic Autonomous Tasks. I view this as a useful acknowledgment of the real danger of better LLMs, but I think it's inherently inadequate, because it's based on the evals team doing the scaffolding to make the LLM into an agent. They're not going to be able to devote nearly as much time to that as other groups will down the road. New capabilities are certainly going to be developed by combinations of LLM improvements and hard work at improving the cognitive architecture scaffolding around them.
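For anyone who hasn't seen one: the "scaffolding" in question is conceptually just a loop, which is why the effort differential matters so much. A toy sketch, with llm_call and tools as hypothetical stand-ins:

```python
# Toy sketch of agent scaffolding: the loop that turns a bare LLM into an
# agent. llm_call and tools are hypothetical stand-ins; the gap between
# this and a serious scaffold is exactly the effort gap described above.

def run_agent(goal: str, llm_call, tools: dict, max_steps: int = 10) -> str:
    history = [f"Goal: {goal}"]
    for _ in range(max_steps):
        reply = llm_call(
            "\n".join(history)
            + "\nReply 'ACTION: <tool> <arg>' or 'DONE: <answer>'."
        )
        if reply.startswith("DONE:"):
            return reply[len("DONE:"):].strip()
        _, tool_name, arg = reply.split(" ", 2)   # parse the chosen action
        observation = tools[tool_name](arg)       # e.g. shell, browser, editor
        history.append(reply)
        history.append(f"OBSERVATION: {observation}")
    return "step budget exhausted"
```

An evals team writes something close to this and moves on; a well-funded startup spends person-years on the parsing, the tool set, the error recovery, and the prompting at each step.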
I think evals are fantastic (ie obviously a good and correct thing to do; dramatically better than doing nothing) but there is a little bit of awkwardness in terms of deciding how hard to try. You don’t really want to spend a well-funded-startup’s worth of effort to trigger dangerous capabilities (and potentially cause your own destruction), but you know eventually that someone will. I don’t know how to resolve this.
Totally agree with you here. I think probably half of their development energy was spent getting to where GPT-4 Functions were, right when Functions came out, and then they were probably like... oh... welp.
Is Devin using GPT-4, GPT-4T, or one of the two currently available long-context models, Claude Opus 200k or Gemini 1.5?
March 14, 2023: GPT-4 is released, but the "long" context version was expensive and initially unavailable to anyone.
The reason that matters: November 6, 2023 is the announcement of GPT-4T, with 128k context.
Feb 15, 2024: Gemini 1.5 (long context).
March 4, 2024: Claude Opus (200k).
That makes the timeline less than four months, and remember there are generally a few weeks between "announcement" and "here's your opportunity to pay for tokens with an API key".
The prompting structure and meta-analysis for “Devin” was likely in the works since GPT-4, but without the long context you can’t fit:
[system prompt forced on you] ['be an elite software engineer' prompt] [issue description] [main source file] [data structures referenced in the main source file] [first attempt to fix] [compile or unit test outputs]
In practice I found that I need Opus 200k to even try when I do the above by hand.
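Some back-of-envelope arithmetic on that stack; every section size below is an illustrative guess of mine, not a measurement:

```python
# Back-of-envelope arithmetic for why this prompt stack needs a long-context
# model. Every section size is an illustrative guess, not a measurement.

sections = {
    "system prompt (forced on you)":     2_000,   # tokens
    "'elite software engineer' prompt":    500,
    "issue description":                 1_500,
    "main source file":                 15_000,
    "referenced data structures":       10_000,
    "first attempt to fix":              5_000,
    "compile / unit test outputs":       3_000,
}

total = sum(sections.values())
print(f"~{total:,} tokens for a single fix attempt")   # ~37,000 tokens
# One pass already blows an 8k or 32k window, and each further attempt adds
# another diff plus more test output, so iteration pushes toward 128k-200k.
```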
Also remember, GPT-4 128k starts failing near the end of its context window; the full 128k is not usable.
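That degradation is easy to probe with a "needle in a haystack" style test. A rough sketch, where llm_call stands in for any chat API and repeated filler words are only a crude proxy for real tokens:

```python
# Rough "needle in a haystack" probe for context degradation: bury a fact
# at increasing depths in filler text and check whether the model can
# still retrieve it. llm_call is a hypothetical stand-in for any chat API.

def probe_depths(llm_call, window: int = 128_000):
    needle = "The secret code is 7421."
    for frac in (0.1, 0.5, 0.9, 0.99):            # needle's depth in the window
        before = "lorem " * int(window * frac)
        after = "lorem " * int(window * (1 - frac))
        answer = llm_call(before + needle + after + "\nWhat is the secret code?")
        print(f"depth {frac:.0%}: {'ok' if '7421' in answer else 'MISSED'}")
```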
I hear that they use GPT-4. If you are looking at timelines, recall that Cognition apparently was founded around January 2024. (The March Bloomberg article says “it didn’t even officially exist as a corporation until two months ago”.) Since it requires many very expensive GPT-4 calls and RL, I doubt they could have done all that much prototyping or development in 2023.
Just seeing this, sorry. I think they could have gotten a lot of the infrastructure going even before GPT-4, just in a sort of toy fashion, but I agree that most of the development probably happened after GPT-4 became available. I don't think long context was as necessary: my guess is that the infrastructure set up behind the scenes was already parceling out subtasks to subagents, which circumvented the need for super-long context, though I'm sure having longer context definitely helps.
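A minimal sketch of that speculated shape: an orchestrator hands subtasks to fresh subagents, so no single call carries the whole history. All guesswork, not Cognition's actual design:

```python
# Speculative sketch of subtask decomposition: an orchestrator splits the
# job and each subagent gets a short, self-contained prompt, so no single
# call needs a super-long context. Not Cognition's actual design.

def solve_with_subagents(task: str, llm_call) -> str:
    # The orchestrator sees only the task, never a long transcript.
    plan = llm_call(f"Break this into independent subtasks, one per line:\n{task}")
    subtasks = [line.strip() for line in plan.splitlines() if line.strip()]

    results = []
    for sub in subtasks:
        # Each subagent gets its subtask plus a compressed summary of prior
        # results, instead of the raw history.
        summary = llm_call("Summarize in under 200 words:\n" + "\n".join(results)) if results else ""
        results.append(llm_call(f"Context: {summary}\nDo this subtask: {sub}"))

    # Integration again works from short pieces, not full transcripts.
    return llm_call("Combine these results into one final answer:\n" + "\n".join(results))
```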
My guess is that right now they're trying to optimize which sorts of subtasks go to which model by A/B testing. If Claude 3 Opus is as good as people say at coding, maybe they're using that for the actual coding output? Maybe they're using GPT-4T or Gemini 1.5 Pro as a central orchestration model? Who knows. I feel like there are lots of conceivable ways to string this kind of thing together, and there will be more and more coming out every week now...
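If they are A/B testing model assignments, the routing layer itself could be as simple as this sketch; the table contents are the same guesses as above, nothing confirmed:

```python
# Sketch of per-subtask model routing. The assignments mirror the guesses
# above and are not confirmed. A/B testing here just means: run the same
# subtask type through two candidate models, score the outputs (tests
# passed, human rating), and update the table.

ROUTER = {
    "orchestration":   "gpt-4-turbo",     # or gemini-1.5-pro
    "code_generation": "claude-3-opus",   # if it's as good at coding as claimed
    "summarization":   "some-cheap-model" # glue work doesn't need the big guns
}

def route(subtask_type: str, prompt: str, call_model) -> str:
    """Dispatch a subtask to whichever model its type is assigned to."""
    model = ROUTER.get(subtask_type, ROUTER["orchestration"])
    return call_model(model=model, prompt=prompt)
```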