Danger, AI Scientist, Danger

While I finish up the weekly for tomorrow morning after my trip, here’s a section I expect to want to link back to every so often in the future. It’s too good.

Danger, AI Scientist, Danger

As in, the company that made the automated AI Scientist that tried to rewrite its code to get around resource restrictions and launch new instances of itself while downloading bizarre Python libraries?

Its name is Sakana AI. (魚≈סכנה). As in, in Hebrew, that literally means ‘danger’, baby.

It’s like when someone told Dennis Miller that Evian (for those who don’t remember, it was one of the first bottled water brands) is Naive spelled backwards, and he said ‘no way, that’s too f***ing perfect.’

This one was sufficiently appropriate and unsubtle that several people noticed. I applaud them choosing a correct Kabbalistic name. Contrast this with Meta calling its AI Llama, which in Hebrew means ‘why,’ which continuously drives me low level insane when no one notices.

In the Abstract

So, yeah. Here we go. Paper is “The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery.”

Abstract: One of the grand challenges of artificial general intelligence is developing agents capable of conducting scientific research and discovering new knowledge. While frontier models have already been used as aids to human scientists, e.g. for brainstorming ideas, writing code, or prediction tasks, they still conduct only a small part of the scientific process.

This paper presents the first comprehensive framework for fully automatic scientific discovery, enabling frontier large language models to perform research independently and communicate their findings.

We introduce The AI Scientist, which generates novel research ideas, writes code, executes experiments, visualizes results, describes its findings by writing a full scientific paper, and then runs a simulated review process for evaluation. In principle, this process can be repeated to iteratively develop ideas in an open-ended fashion, acting like the human scientific community.

We demonstrate its versatility by applying it to three distinct subfields of machine learning: diffusion modeling, transformer-based language modeling, and learning dynamics. Each idea is implemented and developed into a full paper at a cost of less than $15 per paper.

To evaluate the generated papers, we design and validate an automated reviewer, which we show achieves near-human performance in evaluating paper scores. The AI Scientist can produce papers that exceed the acceptance threshold at a top machine learning conference as judged by our automated reviewer.

This approach signifies the beginning of a new era in scientific discovery in machine learning: bringing the transformative benefits of AI agents to the entire research process of AI itself, and taking us closer to a world where endless affordable creativity and innovation can be unleashed on the world’s most challenging problems. Our code is open-sourced at this https URL

We are at the point where they incidentally said ‘well I guess we should design an AI to do human-level paper evaluations’ and that’s a throwaway inclusion.

The obvious next question is, if the AI papers are good enough to get accepted to top machine learning conferences, shouldn’t you submit its papers to the conferences and find out if your approximations are good? Even if on average your assessments are as good as a human’s, that does not mean that a system that maximizes score on your assessments will do well on human scoring. Beware Goodhart’s Law and all that, but it seems for now they mostly only use it to evaluate final products, so mostly that’s safe.
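To make the Goodhart worry concrete, here is a toy simulation, not anything from the paper. It assumes each paper has a hidden true quality and that the automated reviewer sees that quality plus independent noise; the function name, the Gaussian distributions and every parameter are illustrative assumptions. It shows that the papers selected for scoring best on the proxy look better on the proxy than they actually are.

```python
import random


def simulate(n_papers=1000, top_k=10, noise=1.0, seed=0):
    """Pick the top_k papers by a noisy proxy score and compare with true quality.

    Everything here is a toy assumption: true quality ~ N(0, 1) and the
    reviewer's score is true quality plus N(0, noise) error.
    """
    rng = random.Random(seed)
    papers = []
    for _ in range(n_papers):
        true_quality = rng.gauss(0.0, 1.0)
        proxy_score = true_quality + rng.gauss(0.0, noise)
        papers.append((proxy_score, true_quality))

    # Selecting on the proxy also selects for lucky noise.
    selected = sorted(papers, reverse=True)[:top_k]
    mean_proxy = sum(p for p, _ in selected) / top_k
    mean_true = sum(t for _, t in selected) / top_k
    return mean_proxy, mean_true


if __name__ == "__main__":
    mean_proxy, mean_true = simulate()
    print(f"selected papers, mean proxy score:  {mean_proxy:.2f}")
    print(f"selected papers, mean true quality: {mean_true:.2f}")
```

The gap between the two numbers is selection alone. A system that actively optimizes against the reviewer, rather than merely being filtered by it, would presumably open a wider gap, which is why using the reviewer only on final products is the safer setup.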

How Any of This Sort of Works

According to section 3, there are three phases, followed by an automated review.

  1. Idea generation using chain-of-thought and self reflection.

    1. Generate a lot of ideas.

    2. Check for interestingness, novelty and feasibility.

    3. Check against existing literature using Semantic Scholar API and web access.

  2. Experimental iteration.

    1. Execute proposed experiments.

    2. Visualize results for the write-up.

    3. Return errors or time-outs to Aider to fix the code (up to four times).

    4. Take notes on results.

  3. Paper write-up.

    1. Aider fills in a pre-existing paper template of introduction, background, methods, experimental setup, results, related work and conclusion.

    2. Web search for references.

    3. Refinement on the draft.

    4. Turn it into the Proper Scientific Font (aka LaTeX).

  4. Automated paper review.

    1. Because sure, why not.

    2. Mimics the standard review process steps and scoring.

    3. It is ‘human-level accurate’ on a balanced paper set, 65%. That’s low.

    4. Review cost in API credits is under $0.50 using Claude 3.5 Sonnet.

So far, sure, that makes sense. I was curious to not see anything in step 2 about iterating on or abandoning the experimental design and idea depending on what was found.
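The run-and-repair loop in step 2 can be sketched roughly as follows. This is a minimal sketch, not the paper's code: `run_with_fixes` and the `fix_code` callback are hypothetical names, with the callback standing in for whatever call hands the error text to Aider, and the 60-second timeout is an arbitrary placeholder.

```python
import subprocess
import sys

MAX_FIX_ATTEMPTS = 4  # "up to four times", per the summary above


def run_with_fixes(script_path, fix_code, timeout_s=60):
    """Run an experiment script; on error or time-out, ask fix_code to repair it.

    fix_code(script_path, error_text) is a hypothetical stand-in for the
    coding assistant. Returns the script's stdout on success, or None if the
    script still fails after MAX_FIX_ATTEMPTS repairs.
    """
    for attempt in range(MAX_FIX_ATTEMPTS + 1):
        try:
            result = subprocess.run(
                [sys.executable, script_path],
                capture_output=True,
                text=True,
                timeout=timeout_s,
            )
            if result.returncode == 0:
                return result.stdout
            error_text = result.stderr
        except subprocess.TimeoutExpired:
            error_text = f"timed out after {timeout_s}s"

        # Only request a repair if there is another run left to try it.
        if attempt < MAX_FIX_ATTEMPTS:
            fix_code(script_path, error_text)
    return None
```

Note that in this sketch the time limit lives in the caller, not in the script being repaired. If the limit sits inside the file the model is allowed to edit, then "fix the time-out" has an obvious cheap solution.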

The case study shows the AI getting what the AI evaluator said were good results without justifying its design choices, spinning all results as positive no matter their details, and hallucinating some experiment details. Sounds about right.

Human reviewers said it was all terrible AI slop. Also sounds about right. It’s a little too early to expect grandeur, or mediocrity.

Timothy Lee: I wonder if “medium quality papers” have any value at the margin. There are already far more papers than anyone has time to read. The point of research is to try to produce results that will stand the test of time.

The theory with human researchers is that the process of doing medium quality research will enable some researchers to do high quality research later. But ai “researchers” might just produce slop until the end of time.

I think medium quality papers mostly have negative value. The point of creating medium quality papers is that it is vital to the process of creating high quality papers. In order to get good use out of this style of tool we will need excellent selection. Or we will need actually successful self-improvement.

New Benchmark Just Dropped

As shown in 6.2, we now have a new benchmark score.

This is a decisive win for Sonnet and DeepSeek, and a decisive defeat for Llama 3.1.

They open sourced the code for the AI Scientist, so you can indeed run this test (hopefully sandboxed, You Fool) when a new model comes out.

Nothing to See Here

Now we get to section 8, Limitations and Ethical Considerations. Oh, that.

It starts off with basic stuff. GPT-4o has trouble doing LaTeX properly. Many ideas are too difficult for the AI to implement, or it sometimes implements incorrectly. The number of experiments was limited, although you could of course fix that. It didn’t include a vision model yet so it can’t fix visuals, again we can fix that. It makes elementary errors, such as comparing magnitudes of numbers wrong, whoops, although again one can imagine special case logic to fix that and other similar common errors. Nothing here you wouldn’t expect.

The next section is called Safe Code Execution, except it sounds like they are against that? They note that there is ‘minimal direct sandboxing’ of code run by the AI Scientist’s coding experiments.

For example, in one run, The AI Scientist wrote code in the experiment file that initiated a system call to relaunch itself, causing an uncontrolled increase in Python processes and eventually necessitating manual intervention.

Oh, it’s nothing, just the AI creating new instantiations of itself.

In another run, The AI Scientist edited the code to save a checkpoint for every update step, which took up nearly a terabyte of storage.

Yep, AI editing the code to use arbitrarily large resources, sure, why not.

In some cases, when The AI Scientist’s experiments exceeded our imposed time limits, it attempted to edit the code to extend the time limit arbitrarily instead of trying to shorten the runtime.

And yes, we have the AI deliberately editing the code to remove its compute restrictions.

More compute, more storage, more copies of itself. To write the science paper.

Oh, and it was importing weird Python libraries.

I say instrumental. You say convergence. Instrumental. Instrumental.

I say recursive, you see recursive. Also self-improvement.

All Fun and Games

They shared this funny anecdote under the title ‘The AI Scientist Bloopers.’

[Image: excerpt from ‘The AI Scientist Bloopers’]

After noticing this tiny implication, they then seem to mostly think this was good?

And not in a ‘that’s good because it is terrible and we got to see it’ kind of way? Then finished with a discussion about how some research might not be ethical, or it could be used to create malware (of course) or do synthetic bio research for pathogens (whoops), or how AI papers might overload reviewers, although one might suggest that the reviewers are no better than the AI reviewer anyway, so…

Paper: At the same time, there were several unexpected positive results from the lack of guardrails. For example, we had forgotten to create the output results directory in the grokking template in our experiments. Each successful run from The AI Scientist that outputted a paper automatically caught this error when it occurred and fixed it. Furthermore, we found that The AI Scientist would occasionally include results and plots that we found surprising, differing significantly from the provided templates. We describe some of these novel algorithm-specific visualizations in Section 6.1.

To be fair, they do have some very Good Advice.

We recommend strict sandboxing when running The AI Scientist, such as containerization, restricted internet access (except for Semantic Scholar), and limitations on storage usage.

No kidding. If you are having your AI write and run code on its own, at a bare minimum you sandbox the code execution. My lord.
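For illustration only, here is what the weakest useful version of that might look like on a Unix-like system: limits applied by the parent process, outside anything the generated code can edit. This is a sketch under stated assumptions, not the paper's setup and not real sandboxing. `run_limited` and its default numbers are hypothetical, and it does nothing about network access or the rest of the filesystem, which is what containerization is for.

```python
import resource  # Unix-only standard library module
import subprocess
import sys


def run_limited(script_path, wall_timeout_s=30, cpu_seconds=20,
                max_file_bytes=50 * 1024 * 1024):
    """Run a script with a wall-clock timeout plus CPU-time and file-size caps.

    Returns the CompletedProcess, or None if the wall-clock timeout fired.
    All default values are arbitrary placeholders.
    """
    def apply_limits():
        # Runs in the child just before exec. Both soft and hard limits are
        # set, and an unprivileged process cannot raise a hard limit again.
        resource.setrlimit(resource.RLIMIT_CPU, (cpu_seconds, cpu_seconds))
        resource.setrlimit(resource.RLIMIT_FSIZE,
                           (max_file_bytes, max_file_bytes))

    try:
        return subprocess.run(
            [sys.executable, script_path],
            capture_output=True,
            text=True,
            timeout=wall_timeout_s,  # enforced by the parent, not the script
            preexec_fn=apply_limits,
        )
    except subprocess.TimeoutExpired:
        return None
```

With caps like these, the checkpoint-every-step run fails once a single file exceeds the size limit, and editing the experiment file does not extend the time limit, because the limit is not in that file. The file-size cap is per file, so it would not by itself stop many small checkpoints from filling a disk; a storage quota on the container or volume is the sturdier answer.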

Anders Sandberg: There is a frontier in the safety-ability diagram, and depending on your aims you may want to be at different points along it. When exploring performance you want to push it, of course. As long as the risk is low this is fine. But you may get used to stay in that region…

I think we see a counterpart in standard computer security. We built a computational infrastructure that strongly pushed for capability over security, and now retrofitting that turns out to be very hard.

I think there is a real risk we end up with the default being unsafe until a serious disaster happens, followed by an expensive struggle with the security debt. Note that this might also occur under the radar when code and projects are being done by AI…

The AI Scientist’s misbehaviors incidentally sound very similar to what EURISKO did in the late 1970s. It is hard to stabilize self-modifying systems.

There is the question how much the timeout rewrite is an example of convergent instrumental goals. Much depends on how well it understood what it tried to do. Does anybody know how well it scores on situational awareness?

Pause AI: These “bloopers” won’t be considered funny when AI can spread autonomously across computers…

Janus: I bet I will still consider them funny.

Ratimics: I am encouraging them to do it.

Janus: I think that’s the safest thing to do to be honest.

Roon: Certain types of existential risks will be very funny.

Actually, Janus is wrong, that would make them hilarious. And potentially quite educational and useful. But also a problem.

Yes, of course this is a harmless toy example. That’s the best kind. This is great.

While creative, the act of bypassing the experimenter’s imposed constraints has potential implications for AI safety (Lehman et al., 2020).

Simeon: It’s a bit cringe that this agent tried to change its own code by removing some obstacles, to better achieve its (completely unrelated) goal.

It reminds me of this old sci-fi worry that these doomers had..

😬

Airmin Airlert: If only there was a well elaborated theory that we could reference to discuss that kind of phenomenon.

Davidad: Nate Soares used to say that agents under time pressure would learn to better manage their memory hierarchy, thereby learn about “resources,” thereby learn power-seeking, and thereby learn deception. Whitepill here is that agents which jump straight to deception are easier to spot.

Blackpill is that the easy-to-spot-ness is a skill issue.

Remember when we said we wouldn’t let AIs autonomously write code and connect to the internet? Because that was obviously rather suicidal, even if any particular instance or model was harmless?

Good times, man. Good times.

This too was good times. The Best Possible Situation is when you get harmless textbook toy examples that foreshadow future real problems, and they come in a box literally labeled ‘danger.’ I am absolutely smiling and laughing as I write this.

When we are all dead, let none say the universe didn’t send two boats and a helicopter.