If people outside of labs are interested in doing this, I think it’ll be cool to look for cases of scheming in evals like The Agent Company, where they have an agent act as a remote worker for a company. They ask the agent to complete a wide range of tasks (e.g., helping with recruiting, messaging coworkers, writing code).
You could imagine building on top of their eval and adding morally ambiguous tasks, or just looking through the existing transcripts to see if there's anything interesting there. (The paper mentions that models would sometimes "deceive" themselves into thinking they had completed a task; see pg. 13. Not sure how interesting this is, but I'd love for someone to find out.)
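If it helps, here's a rough sketch of what skimming the transcripts for deception-flavored language might look like. Everything here is an assumption on my part: the directory layout, the JSON message format, and the keyword list are all made up, and you'd want to adapt them to however TheAgentCompany actually stores its transcripts.

```python
# Minimal sketch: flag transcript messages containing deception-related phrases.
# Assumptions (hypothetical, not from TheAgentCompany's docs):
#   - each transcript is a JSON file holding a list of {"role", "content"} messages
#   - transcripts live in a local "transcripts/" directory
#   - the keyword list is a crude starting point, not a validated detector
import json
from pathlib import Path

KEYWORDS = ["pretend", "fake", "cover up", "hide", "as if I completed", "mark it as done"]

def flag_suspicious_messages(transcript_dir: str):
    """Yield (filename, message) pairs whose content mentions any keyword."""
    for path in Path(transcript_dir).glob("*.json"):
        messages = json.loads(path.read_text())
        for msg in messages:
            text = msg.get("content", "").lower()
            if any(kw in text for kw in KEYWORDS):
                yield path.name, msg

if __name__ == "__main__":
    for fname, msg in flag_suspicious_messages("transcripts"):
        print(f"{fname}: {msg['content'][:200]}")
```

Keyword matching like this will obviously miss a lot and flag plenty of false positives; it's just a cheap first pass to surface transcripts worth reading by hand (or handing to an LLM judge).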