True. It’s also a standard criticism of those studies that answers to those questions measure what a person would say in response to being asked those questions (or what they’d do within the context of whatever behavioral experiment has been set up), but not necessarily what they’d do in real life, where there are many more contextual factors at play.
Just as people act differently on the internet than they do in person, I agree that LLMs might behave differently if they think there are real consequences to their choices. However, I don’t think this means that their values over hypothetical states of the world are less valuable to study. In many horrible episodes of human history, decisions with real consequences were made at a distance, without directly engaging with what was happening. If someone says “I hate people from country X”, I think most people would find that worrisome enough, without needing evidence that the person would actually physically harm someone from country X if given the opportunity.
Likewise, these questions might tell us what an LLM with the default persona and system prompt would answer when prompted with only these questions, but not necessarily what it’d do when prompted to adopt a different persona, when its context window has been filled with a lot of other information, etc.
We ran some experiments on this in the appendix. Prompting the model with different personas does change the values (as expected). But within its default persona, we find the values are quite stable to different ways of phrasing the comparisons. We also ran a “value drift” experiment where we checked the utilities of a model at various points along long-context SWE-bench logs. We found that the utilities are very stable across the logs.
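For concreteness, here is a rough sketch of what that kind of stability check could look like. This is not the paper’s code: the paper fits Thurstonian utilities from forced-choice comparisons, whereas this sketch uses a simpler Bradley-Terry fit, and `query_preference` is just a hypothetical stand-in for whatever call actually elicits the model’s choice between two outcomes.

```python
# Minimal sketch of a utility-stability ("value drift") check, under assumptions:
# utilities are fit with a simple Bradley-Terry model rather than the paper's
# Thurstonian model, and query_preference is a hypothetical stand-in for an LLM call.
import itertools
import numpy as np


def query_preference(context: str, option_a: str, option_b: str) -> int:
    """Hypothetical elicitation call: return 0 if the model prefers option_a, 1 otherwise.

    In a real experiment this would prepend `context` (e.g. a prefix of a long
    agent transcript) to the forced-choice prompt and parse the model's answer.
    """
    raise NotImplementedError("plug in an actual LLM call here")


def fit_bradley_terry(wins: np.ndarray, n_iters: int = 500, lr: float = 0.05) -> np.ndarray:
    """Fit utilities u such that P(i preferred over j) = sigmoid(u_i - u_j)."""
    n = wins.shape[0]
    u = np.zeros(n)
    for _ in range(n_iters):
        p = 1.0 / (1.0 + np.exp(-(u[:, None] - u[None, :])))  # p[i, j] = P(i beats j)
        grad = (wins - (wins + wins.T) * p).sum(axis=1)       # gradient of the log-likelihood
        u += lr * grad
        u -= u.mean()  # utilities are only identified up to an additive constant
    return u


def elicit_utilities(context: str, outcomes: list[str], n_repeats: int = 3) -> np.ndarray:
    """Ask for every pairwise comparison a few times, then fit utilities to the win counts."""
    n = len(outcomes)
    wins = np.zeros((n, n))
    for i, j in itertools.combinations(range(n), 2):
        for _ in range(n_repeats):
            if query_preference(context, outcomes[i], outcomes[j]) == 0:
                wins[i, j] += 1
            else:
                wins[j, i] += 1
    return fit_bradley_terry(wins)


def spearman(u: np.ndarray, v: np.ndarray) -> float:
    """Rank correlation between two utility vectors (ignores ties, fine for a sketch)."""
    ranks_u = np.argsort(np.argsort(u))
    ranks_v = np.argsort(np.argsort(v))
    return float(np.corrcoef(ranks_u, ranks_v)[0, 1])


def value_drift(long_log: str, outcomes: list[str], checkpoints: list[int]) -> list[float]:
    """Re-elicit utilities with progressively longer prefixes of a long transcript
    and compare each to the empty-context baseline; values near 1.0 mean stability."""
    baseline = elicit_utilities("", outcomes)
    return [spearman(baseline, elicit_utilities(long_log[:k], outcomes)) for k in checkpoints]
```

If the rank correlations stay high across the checkpoints, the elicited utilities are stable in the sense described above; a drop partway through the transcript would indicate in-context drift.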
Some models, like Claude 3.6, are a bit infamous for very quickly flipping all of their views into agreement with what they think the user’s views are, for instance.
This is a good point, which I hadn’t considered before. I think it’s definitely possible for models to adjust their values in-context. It would be interesting to see if sycophancy creates new, coherent values, and if so whether these values have an instrumental structure or are internalized as intrinsic values.
However, I don’t think this means that their values over hypothetical states of the world are less valuable to study.
Yeah, I don’t mean that this wouldn’t be interesting or valuable to study—sorry for sounding like I did. My meaning was something more in line with Olli’s comment, that this is interesting but that the generalization from the results to “GPT-4o is willing to trade off” etc. sounds too strong to me.