Is Devin using GPT-4, GPT-4T, or one of the two currently available long-context models, Claude 3 Opus (200k) or Gemini 1.5?
March 14, 2023 is GPT-4, but the "long" context was expensive and initially unavailable to anyone. The reason that matters is that November 6, 2023 is the announcement of GPT-4T, which has 128k context.
Feb 15, 2024 is Gemini 1.5 (long context).
March 4, 2024 is Claude 3 Opus (200k context).
That makes the timeline less than 4 months, and remember there are generally a few weeks between "announcement" and "here's your opportunity to pay for tokens with an API key".
The prompting structure and meta-analysis for "Devin" were likely in the works since GPT-4, but without the long context you can't fit:
[system prompt forced on you] ['be an elite software engineer' prompt] [issue description] [main source file] [data structures referenced in the main source file] [first attempt to fix] [compile or unit test outputs]
In practice I found that I need Opus 200k to even try when I do the above by hand.
Also remember, GPT-4 128k starts failing near the end of its context window, so the full 128k is not usable.
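To make the context-budget point concrete, here is a toy sketch of the prompt layout described above with a crude fit check. All segment contents are placeholders, the model limits are the published context window sizes, and the ~4 characters/token ratio is a common rule of thumb rather than a real tokenizer:

```python
# Hypothetical sketch: does the prompt layout above fit a given context window?
# Segment contents are filler; ~4 chars/token is a rough English/code estimate.

CONTEXT_LIMITS = {"gpt-4": 8_192, "gpt-4-turbo": 128_000, "claude-3-opus": 200_000}

def estimate_tokens(text: str) -> int:
    """Crude estimate: roughly 4 characters per token."""
    return len(text) // 4

def fits(segments: list[str], model: str, reserve_for_output: int = 4_000) -> bool:
    """Check whether the concatenated prompt still leaves room for a reply."""
    prompt_tokens = sum(estimate_tokens(s) for s in segments)
    return prompt_tokens + reserve_for_output <= CONTEXT_LIMITS[model]

# Placeholder segments mirroring the layout in the comment above.
segments = [
    "SYSTEM PROMPT " * 200,                  # [system prompt forced on you]
    "Act as an elite software engineer.",    # ['be an elite software engineer' prompt]
    "ISSUE " * 500,                          # [issue description]
    "SOURCE " * 20_000,                      # [main source file]
    "STRUCTS " * 5_000,                      # [referenced data structures]
    "DIFF " * 3_000,                         # [first attempt to fix]
    "TEST LOG " * 2_000,                     # [compile or unit test outputs]
]

print(fits(segments, "gpt-4"))          # False: the base GPT-4 window overflows fast
print(fits(segments, "claude-3-opus"))  # True: 200k leaves headroom
```

Even with generous rounding, a real source file plus its referenced data structures and test logs blows well past 8k tokens, which is why the question of which long-context model Devin uses matters at all.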
I hear that they use GPT-4. If you are looking at timelines, recall that Cognition apparently was founded around January 2024. (The March Bloomberg article says “it didn’t even officially exist as a corporation until two months ago”.) Since it requires many very expensive GPT-4 calls and RL, I doubt they could have done all that much prototyping or development in 2023.
Just seeing this, sorry. I think they could have gotten a lot of the infrastructure going even before GPT-4, just in a sort of toy fashion, but I agree, most of the development probably happened after GPT-4 became available. I don’t think long context was as necessary, because my guess is the infrastructure set up behind the scenes was already parceling out subtasks to subagents and that probably circumvented the need for super-long context, though I’m sure having longer context definitely helps.
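The subagent idea above can be sketched in a few lines. This is purely illustrative: `call_model` is a stand-in for a real LLM API call, and the decomposition (one file per subagent) is an assumption, not anything Cognition has confirmed:

```python
# Toy sketch of the speculated subagent pattern: an orchestrator parcels out
# subtasks so that no single model call needs a super-long context.
# `call_model` is a placeholder, not a real API.

def call_model(role: str, prompt: str) -> str:
    """Placeholder for an LLM call; here it just echoes a summary."""
    return f"[{role}] handled {len(prompt)} chars"

def orchestrate(issue: str, files: dict[str, str]) -> list[str]:
    """Each subagent sees only the issue plus one file, never the whole repo."""
    results = []
    for path, source in files.items():
        subtask = f"Issue:\n{issue}\n\nFile {path}:\n{source}\n\nPropose a fix."
        results.append(call_model(f"subagent:{path}", subtask))
    # The orchestrator then reasons only over the short per-file summaries.
    merged = call_model("orchestrator", "\n".join(results))
    return results + [merged]

out = orchestrate("Null deref in parser", {"parser.c": "int main(){}", "util.c": "void f(){}"})
print(len(out))  # 3: one result per file, plus the orchestrator's merge
```

The point is that the orchestrator's own context only ever holds compact summaries, which is how a pre-long-context system could have worked around window limits.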
My guess is right now they’re probably trying to optimize which sort of subtasks go to which model by A/B testing. If Claude 3 Opus is as good as people say at coding, maybe they’re using that for actual coding task output? Maybe they’re using GPT-4T or Gemini 1.5 Pro for a central orchestration model? Who knows. I feel like there are lots of conceivable ways to string this kind of thing together, and there will be more and more coming out every week now...
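The A/B-testing guess above amounts to a routing problem, which could be sketched as a simple epsilon-greedy bandit. The model names, task types, and win rates here are all illustrative assumptions:

```python
# Toy sketch of routing subtasks to models via A/B testing (epsilon-greedy).
# Model names and recorded outcomes are illustrative, not real benchmarks.
import random

class Router:
    """Pick the model with the best observed success rate per task type,
    exploring a random alternative a small fraction of the time."""

    def __init__(self, models: list[str], task_types=("code", "plan"), epsilon: float = 0.1):
        self.models = models
        self.epsilon = epsilon
        # (task_type, model) -> [wins, trials]
        self.stats = {(t, m): [0, 0] for t in task_types for m in models}

    def pick(self, task_type: str) -> str:
        if random.random() < self.epsilon:
            return random.choice(self.models)  # explore
        def rate(m: str) -> float:
            wins, trials = self.stats[(task_type, m)]
            return wins / trials if trials else 0.0
        return max(self.models, key=rate)      # exploit the best so far

    def record(self, task_type: str, model: str, success: bool) -> None:
        wins, trials = self.stats[(task_type, model)]
        self.stats[(task_type, model)] = [wins + int(success), trials + 1]

random.seed(0)
r = Router(["claude-3-opus", "gpt-4-turbo"])
r.record("code", "claude-3-opus", True)
r.record("code", "gpt-4-turbo", False)
print(r.pick("code"))  # claude-3-opus: the model with the better observed rate
```

Any real system would be far more elaborate, but this is the basic shape of "optimize which subtasks go to which model."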