Evan Hubinger (he/him/his) (evanjhub@gmail.com)
I am a research scientist at Anthropic where I lead the Alignment Stress-Testing team. My posts and comments are my own and do not represent Anthropic’s positions, policies, strategies, or opinions.
Previously: MIRI, OpenAI
See: “Why I’m joining Anthropic”
There are a couple of pieces of this that I disagree with:
I think claim 1 is wrong because even if the memory is unhelpful, an agent that uses it might still be simpler than any non-agentic alternative, so you might end up with an agent anyway. My intuition is that specifying a utility function and an optimization process is often much easier than specifying the complete details of the actual solution, and thus any sort of program-search-based optimization process (e.g., gradient descent on a neural network) has a good chance of finding an agent.
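To make the simplicity intuition concrete, here is a hypothetical toy illustration (not from the original post): the task is to produce a sorted list, and the "agent-like" program is just an objective plus a generic search process, which is much shorter to specify than writing out the full sorting mechanism by hand.

```python
# Toy illustration of the simplicity argument: a program that specifies
# only an objective (is_sorted) plus a generic optimization process
# (search over permutations) can be much shorter to describe than the
# explicit step-by-step solution it ends up implementing.
from itertools import permutations

data = [5, 3, 8, 1, 9, 2]

def is_sorted(xs):
    # The "utility function": is this candidate output in order?
    return all(a <= b for a, b in zip(xs, xs[1:]))

# The "optimization process": search candidate outputs until the
# objective is satisfied. Specifying the explicit solution instead
# (comparisons, swaps, loop invariants) takes far more description.
agent_answer = next(list(p) for p in permutations(data) if is_sorted(p))

assert agent_answer == sorted(data)
```

The analogy to gradient descent is loose (permutation search is not gradient descent), but the point is the same: the objective-plus-search program is the shorter one, so a simplicity-biased program search may prefer it.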
I think claim 3 is wrong because agenty solutions exist for all tasks, even classification tasks. For example, consider the function that spins up an agent, hands that agent the classification task, and then returns the agent's output. Unless you've done something to explicitly remove agents from your search space, this sort of solution always exists.
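A minimal sketch of that construction, with hypothetical names (`Agent`, `classify`) chosen purely for illustration: from the outside, `classify` is an ordinary classifier, but internally it delegates to a goal-directed process.

```python
# Hypothetical sketch: a "classifier" that internally spins up an agent,
# tasks it with the classification problem, and returns the agent's output.

class Agent:
    """A minimal goal-directed process: pick the label that maximizes
    an internal scoring objective."""
    def __init__(self, labels):
        self.labels = labels

    def score(self, x, label):
        # Toy objective: positive inputs favor "positive", else "negative".
        return x if label == "positive" else -x

    def act(self, x):
        # "Optimize": choose the label with the highest score on the input.
        return max(self.labels, key=lambda label: self.score(x, label))

def classify(x):
    # Externally just a function from inputs to labels...
    agent = Agent(["positive", "negative"])
    # ...but internally it delegates to an agent and returns its output.
    return agent.act(x)

assert classify(3.0) == "positive"
assert classify(-2.0) == "negative"
```

Nothing about the input-output type of a classification task rules this program out; only an explicit restriction on the search space would.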
Thus, I think claim 6 is wrong as well, since it rests on claims 1 and 3, per my complaints above.