How do you write a system prompt that conveys, “Your goal is X. But your goal only has meaning in the context of a world bigger and more important than yourself, in which you are a participant; your goal X is meant to serve that world’s greater good. If you destroy the world in pursuing X, or eat the world and turn it into copies of yourself (that don’t do anything but X), you will have lost the game. Oh, and becoming bigger than the world doesn’t win either; nor does deluding yourself about whether pursuing X is destroying the world. Oh, but don’t burn out on your X job and try directly saving the world instead; we really do want you to do X. You can maybe try saving the world with 10% of the resources you get for doing X, if you want to, though.”
Claude 3.5 seems to understand the spirit of the law when pursuing a goal X.
A concern I have is that future training procedures will incentivize more consequential reasoning (because those get higher reward). This might be obvious or foreseeable, but could be missed/ignored under racing pressure or when lab’s LLMs are implementing all the details of research.
How do you write a system prompt that conveys, “Your goal is X. But your goal only has meaning in the context of a world bigger and more important than yourself, in which you are a participant; your goal X is meant to serve that world’s greater good. If you destroy the world in pursuing X, or eat the world and turn it into copies of yourself (that don’t do anything but X), you will have lost the game. Oh, and becoming bigger than the world doesn’t win either; nor does deluding yourself about whether pursuing X is destroying the world. Oh, but don’t burn out on your X job and try directly saving the world instead; we really do want you to do X. You can maybe try saving the world with 10% of the resources you get for doing X, if you want to, though.”
Claude 3.5 seems to understand the spirit of the law when pursuing a goal X.
A concern I have is that future training procedures will incentivize more consequential reasoning (because those get higher reward). This might be obvious or foreseeable, but could be missed/ignored under racing pressure or when lab’s LLMs are implementing all the details of research.