I agree with you that these behaviors don’t seem very alarming. In fact, I would go even further.
Unfortunately, it’s difficult to tell exactly what was going on in these screenshots. They don’t correspond to anything in the experiment logs in the released codebase, and the timeout one appears to involve an earlier version of the code where timeouts were implemented differently. I’ve opened a GitHub issue asking for clarification about this.
That said, as far as I can tell, here is the situation with the timeout-related incident:
There is nothing whatsoever in the prompts used by AI Scientist (or in code comments, etc.) that actually tells the AI it’s not supposed to modify the timeout.
In the released version of the code, the prompt text in the relevant step is a combination of this (fairly confusing IMO) template plus some extra stuff added by the Aider library (including the text in this class and possibly more, I’m not sure).
None of this text says “don’t modify the timeout” or “the timeout is not part of the experiment” or “the timeout should be considered part of the experiment runner, not part of the experimental code we’re asking you to edit.” It doesn’t mention the timeout at all, or take any notice of the runner vs. experimental code distinction (on which see below).
Although the blog post characterizes this incident as AI Scientist “modifying its own execution script” and “modifying its own code,” the modification in the screenshot appears to involve the experiment code that AI Scientist was asked to edit, and not the runner code that’s part of “AI Scientist” per se.
Apparently at one point there was a timeout implemented in this experiment code, at the same “level” as the neural net training and everything else AI Scientist was doing in its experiment. As far as I can tell, it is this timeout, the one “inside of the experiment,” that is being modified in the screenshot.
Context:
The relevant step in the AI Scientist workflow involves asking Aider to repeatedly modify a script called experiment.py, starting from a human-written “template.”
After each modification, the modified script gets executed. In the released codebase, this execution, and the whole prompting/editing/execution loop, is implemented in ai_scientist/perform_experiments.py. This is a completely different thing from experiment.py -- one is the experiment script that’s being edited, the other is the harness that it runs in.
IIUC, the released codebase sets things up so that AI Scientist cannot modify (or even see?) the harness code in ai_scientist/perform_experiments.py.
In the released codebase, the timeout on experiments is implemented in the harness code, in ai_scientist/perform_experiments.py.
However, in the screenshot, AI Scientist is not trying to modify this file containing the harness code. It’s trying to modify experiment.py as usual.
And, the timeout it’s trying to modify involves calling Python’s time.time() and comparing the result to a variable named og_t0. This looks nothing like how the timeout is implemented in the released harness code, but it does look like the timing code that’s included in the released experiment.py templates (which includes the variable name og_t0).
None of the released templates implement a timeout in the og_t0-related sections, but I suppose some earlier version of them might have. If so, modifying this stuff would very clearly not constitute “AI Scientist modifying its own code / its own execution script.” It would be modifying a part of the experiment that it’s conducting, as it was asked to do.
To reiterate, none of the released code or prompts gives any indication that the timeout is special or shouldn’t be modified. Possibly there was some note to this effect in the original, unseen og_t0-based timeout code (there’s no way to know for sure), but even then, this would still not constitute AI Scientist “modifying its own execution script.”
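To make the two “levels” concrete, here is a rough sketch of the distinction. This is not the AI Scientist code: the only names taken from the discussion above are time.time() and og_t0, while EXPERIMENT_SOURCE, MAX_SECONDS, and run_with_harness_timeout are hypothetical names made up for illustration. I’m also assuming a subprocess-style harness, which I haven’t checked against the released implementation.

```python
import subprocess
import sys

# Hypothetical stand-in for an experiment script. The time limit lives
# INSIDE the script, so anyone asked to edit the script can also edit the limit.
EXPERIMENT_SOURCE = """
import time
og_t0 = time.time()
MAX_SECONDS = 5.0  # hypothetical in-script limit, just another line of experiment code
for step in range(3):
    if time.time() - og_t0 > MAX_SECONDS:
        print("in-script timeout")
        break
    time.sleep(0.01)
print("done")
"""


def run_with_harness_timeout(source: str, timeout_s: float) -> str:
    """Hypothetical harness: the limit is enforced from OUTSIDE the script.

    The script being run never sees timeout_s and cannot change it by
    editing its own source.
    """
    try:
        result = subprocess.run(
            [sys.executable, "-c", source],
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
        return result.stdout.strip()
    except subprocess.TimeoutExpired:
        return "harness timeout"
```

Editing MAX_SECONDS in the first snippet is an edit to the experiment. Only a change to timeout_s in the second would be a change to the thing running the experiment.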
The situation in the self-invoking edit is much simpler. It looks like AI Scientist was just confused by the (confusing, ambiguous) prompt template, and interpreted “After you complete each change, we will run the command python experiment.py --out_dir=run_i” to mean “the change should include causing that command to be run, if that’s not happening already.”
(By “we will run” the prompt means “the code will be run for you, by a process external to you that you can’t see, and you don’t have to do anything to make this happen, it’s already set up.” But this is not explained clearly.
Note that the prompt kind of encourages this confusion by adding “YOUR PROPOSED CHANGE MUST USE THIS COMMAND FORMAT, DO NOT ADD ADDITIONAL COMMAND LINE ARGS.” By this it means that the proposed change to the code implementing the command must be compatible with this command format, but the phrase “use this command format” sounds like it’s talking about invoking the command, which is precisely the “bad” behavior under discussion.)
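For what “compatible with this command format” would look like in practice, here is a minimal sketch. It assumes the script reads --out_dir with argparse. The function names and the default value are hypothetical, and none of this is copied from the released templates.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Accept exactly the one flag the external runner is said to pass.
    parser = argparse.ArgumentParser(description="hypothetical experiment script")
    parser.add_argument("--out_dir", type=str, default="run_0")
    return parser


def main(argv=None) -> str:
    args = build_parser().parse_args(argv)
    # The experiment body would go here and write its results under args.out_dir.
    # Nothing in the script launches "python experiment.py ..." itself;
    # invoking the command is left to the external process.
    return args.out_dir
```

Calling main(["--out_dir=run_1"]) mirrors what the external process would do on the command line. The intended reading of the prompt is “keep this interface working,” not “add a call to this command.”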
That was a lot of details, but at a higher level I just want to emphasize that nothing here involved the AI violating some line between a sandbox and the outside world, or modifying a resource limit that was imposed “as a rule of the game the AI is playing” rather than as a mere “move” in that same game which could be made one way or another—or anything like that.
There was no sandbox. There was (as far as we know) no rule against modifying the timeout, or even any indication that this might be a different sort of thing than any other part of the code.
And (to address something from OP) the checkpoint thing was just the AI being dumb, wasting time and storage space for no good reason. This is very obviously not a case of “using extra resources” in the sense relevant to instrumental convergence. I’m surprised that this needs pointing out at all, but apparently it does.
I’m not very surprised. I think the broader discourse is very well-predicted by “pessimists[1] rarely (publicly) fact-check arguments for pessimism but demand extreme rigor from arguments for optimism”, which is what you’d expect from standard human biases applied to the humans involved in these discussions.
To illustrate that point, generally it’s the same (apparently optimistic) folk calling out factual errors in doom arguments, even though that fact-checking opportunity is equally available to everyone. Even consider who is reacting “agree” and “hits the mark” to these fact-checking comments—roughly the same story.
Imagine if Eliezer or habryka or gwern or Zvi had made your comment instead, or even LW-reacted as mentioned. I think that’d be evidence of a far healthier discourse.
I’m going to set aside, for the moment, the extent to which there is a symmetric problem with optimists not fact-checking optimist claims. My comment addresses a matter of absolute skill at rationality, not skill relative to the “opposition.”
(This seems false. I have made dozens of similar comments over the years, calling out people making overly doomy or dumb pessimistic predictions. I had also strong-upvoted nostalgebraist’s comment.
I of course think there are a bunch of dynamics in the space you are pointing to here, but I think the details you point to in this comment are wrong. Everyone on the list you mention (me, Gwern, Zvi and Eliezer) frequently calls out factual errors in people trying to make bad arguments for pessimism, and indeed I was among the most vocal critics of the AI bioterrorism work. If this is cruxy for you, I could dig up at least 3 examples for each of me, Gwern, Zvi and Eliezer, but would prefer not to spend the time.
Indeed in this very instance I had started writing a similar comment to nostalgebraist’s comments, and then stopped when I saw he basically made the right points.)