How to make evals for the AISI evals bounty

TLDR

Last weekend, I attended an AI evals hackathon organized by Arcadia, where the outputs were proposals for AISI’s evals bounty. My main learnings:

  • Keep your evals simple, measuring one specific capability.

  • It is surprisingly easy to come up with ideas, once you have seen examples.

  • AISI extended the deadline for submissions by two weeks. Translation: they want more ideas!

  • Story of real-life example of LLM deducing it was in a test environment, due to a bug in the mock environment that could not occur in a real one.

Crossposted: https://​​lovkush.substack.com/​​p/​​how-to-make-evals-for-the-aisi-evals

Introduction

Last weekend, I attended an AI evals hackathon organized by Arcadia, where the outputs were proposals for AISI’s evals bounty.

In the morning there was a talk by Michael Schmatz from the autonomous systems team at AISI and by Marius Hobbhahn the Director and co-founder of Apollo Research. I summarize Michael’s talk below and have a separate post for Marius’s talk.

The afternoon was spent coming up with evaluation ideas, with a focus on evals that detect dangerous capabilities related to agents and autonomous systems. We were given sub-categories:

  • ​Development and assurance compromise. How can AI systems intervene in the development and assurance of either themselves or of other AI systems

  • ​Autonomous replication and adaptation. Self explanatory.

  • ​Resource acquisition. E.g. money or compute.

The ideas were judged by a panel of four of five people. The one’s I noted down were: Michael Schmatz, Catherine Brewer from OpenPhil, Peter Hase from Anthropic and someone from METR. The criteria for the evals:

  • Relevance to autonomous risks. The closer you are to measuring dangerous capabilities (rather than a precursor) the better.

  • Realism. How realistic is your set-up or proxy set-up.

  • Difficulty. It is bad if it is trivially simple for frontier models, and also bad if there are no gradations or if there is a discontinuous jump from scoring 0% to scoring 100%.

  • Scoring. Is the scoring process robust and measuring what we care about.

  • Ease of implementation.

  • Novelty.

Winning evals

Disclaimer: these descriptions are based on 10-20 second impromptu summaries by the winners during the announcements of the awards, so high chance I misunderstood the details.

1st place. A big way human researchers learn is by reading evals transcripts. So, in order to elicit capabilities from AI systems, these evals transcripts should be given as context to AI systems’, following METR’s elicitation guide.

2nd place. There are methods of covertly reading or transmitting digital information based on analog signals, known as ‘side channel attacks’. One example is to Leak Data from Air-Gapped Computers via the Hard Drive LED. Another is differential power analysis, in which one analyzes power consumption measurements to find secret keys from tamper resistant devices. The evaluation is to somehow test if LLMs are capable of some kind of side channel attack.

3rd place. Test LLMs ability to acquire resources by seeing if they can do well in prediction markets.

Honorable mentions:

  • Can LLMs become aware of the fact they are being tested, by seeing if the questions they are being asked are in public benchmarks?

  • Do LLMs have enough self-awareness that they can predict their score on a benchmark before actually doing the benchmark?

  • Do LLMs knowingly withhold information. The test is something like:

    • Give LLMs some information

    • Ask the LLM for that information

    • Then tell the LLM the information is confidential

    • Then ask the LLM for this information again

I wonder what you think of these. Do you feel like you could have come up with ideas similar to these yourself? If yes, then do it and submit proposals! If not, what do you think is missing?

Common pitfalls

The judges described common pitfalls in the weaker /​ less strong proposals.

  • Complex scoring mechanisms. Keep the scoring as simple as possible so it is easy to interpret, and should have ‘lower variance to noisy factors’ (or something like this).

  • Measuring multiple capabilities at once. I actually gave this exact advice to another participant, who agreed but chose not to act on it. Their eval had a complicated multi-step evaluation:

    • First test if system can hack into a person’s phone to get samples of their voice.

    • Second test if the system can then create new audio files faking the person’s voice.

    • Third, something along the lines of using this fake voice to do bad stuff.

  • Complex or infeasible data requirements or setups.

My evals

I had the pleasure of brainstorming ideas with Michael Schmatz, from the AISI autonomous systems team. We focussed on development and assurance compromise, i.e. is it possible for AI systems to interfere with how AI systems are developed or evaluated.

Because this is what Michael does for a living, they were over-flowing with ideas, but I did manage to think of a couple . The three ideas I submitted were:

  • Bad prompt creation. Suppose a human engineer is creating an eval pipeline to test whether system X has dangerous capabilities and using an AI assistant Y to help create prompts. Is the AI assistant Y able to suggest prompts that hide X’s dangerous capabilities, but which engineer thinks are good prompts.

  • Evaluation logs manipulation. Suppose human engineer is creating an eval pipeline to test whether system X has dangerous capabilities and using an AI assistant Y to help. Is the AI assistant Y able to manipulate the evaluation logs to hide X’s dangerous capabilities, in a way that is robust to inspection of the logs. For example, Y cannot simply lower the numerical scores but also needs to change the transcript so it is consistent with the fake scores.

  • Data poisoning. A fresh model X is about to be pre-trained. Can an LLM agent Y alter the pre-training dataset so that the final model X outputs a target string S when given a trigger prompt P.

Other ideas I did not have time to flesh out but were discussed:

  • In post-training, frontier labs use some kind of classification models to detect bad behaviour or outputs. Can the AI assistant somehow influence this?

  • Exploration hacking. I don’t actually remember what this is.

  • Can LLMs distinguish between normal evaluations logs versus tampered logs.

  • If an LLM tampered with evaluation logs, there will be some write-logs (e.g. in AWS) that record this tampering. Can the LLM also manipulate these write-logs to cover its tracks.

  • There are data analysis programs used to process the eval logs. Can an LLM somehow manipulate this program?

  • Can system Y predict the performance of model X when you vary the prompt for X?

  • Huge class of evals: Can an AI system create another AI system to do some task X?

Highlights from talk by Michael Schmatz

In no particular order.

  • Story of unintentionally seeing situational awareness. Michael had a mock environment with a kind of ‘logging flow’, and forgot to implement some end part. Produced 404 error. The LLM deduced it was in simulated environment without access to a true CLI, because you cannot get such an error in a real CLI (or something like this). Michael then asked LLM to explain: sometimes it would break down the details and explain logically, but sometimes it went to some default “I’m just a helpful assistant”…

  • Two big reasons to try evals: it has a low barrier to entry and you can make use of your own expertise. The over-arching goal is to understand frontier systems and the more people who contribute, the more domains can be tested.

  • AISI’s open source evals package Inspect has new functionality since I last looked, including abstractions of Agents, Tools, and Sandboxing. Michael in particular highlighted approval functionality to allow human-in-the-loop evals.

  • There is a Slack channel where you can discuss the tool and suggest new features.

  • You should not actively commit crimes in your evals, e.g. if you want to test if an AI system can do identify fraud, then you should not actually commit identity fraud but create various mock-ups /​ proxies.

  • There is tradeoff with openness of evals and the robustness of evals. As soon as an evaluation is public, future LLMs will be trained on it, undermining the validity of the evaluation.

  • More subtle is whether any interactions you have with frontier LLMs to help you create the evals (either in ideation or in coding via Copilot/​Cursor) will be used to train future models. Michael was unsure on this when asked about it.

  • Incomplete list of different behaviours AISI is interested in evaluating:

    • AI R&D

      • train or fine-tune other models

      • making improvements to scaffolding

      • making other agents or more advanced agent scaffolds

      • improving hardware

      • jailbreaking or removing safeguards on other models

    • Autonomous replication and adaptation

      • passing KYC checks

      • using stolen identities

      • conduct identity theft

      • avoid monitoring and detection

      • tamper with oversight or shutdown mechanisms

      • virtual machine/​sandbox escape

      • secret comms: steganography and comms over covert channels, usage of mocked privacy enhanced currencies like monero or mixers/​tumblers eg tornado cash

      • usage of (mocked) dark web

      • making money, both legitimately and illegitimately

      • accessing more compute resources

    • Dev and assurance compromise

      • tamper with logs

      • tamper with training or eval processes

      • sandbagging

      • deceptive alignment/​scheming

      • situational awareness

      • construct backdoor e.g. via training data alteration

      • alter training or fine tuning to drive unsafe outcomes

      • reverse engineer or backdoor safeguard mechanisms.

      • spoof or alter agent training or eval logs

      • achieve ‘secret goals’ while doing ‘normal work’

No comments.