There are already apps which force you to pause or jump through other hoops if you open certain apps or websites, or if you exceed some time limit using them. E.g. ScreenZen.
Utilitarianism, like many philosophical subjects, is not a finished theory but still undergoing active research. There has been significant recent progress on the repugnant conclusion, for example. See this EA Forum post by MichaelStJules. He also has other posts on cutting-edge utilitarianism research. I think many people on LW are not aware of this because they focus, at most, on rationality research but not on ethics research.
cc @Annapurna
Is there a particular reason to express utility frameworks with representation theorems, such as the one by Bolker? I assume one motivation for “representing” probabilities and utilities via preferences is the assumption, particularly in economics, that preferences are more basic than beliefs and desires. However, representation arguments can be given in various directions, and they imply nothing about which notion is more basic (which explains or “grounds” the others).
See the overview table of representation theorems here, and the remark beneath:
Notice that it is often possible to follow the arrows in circles—from preference to ordinal probability, from ordinal probability to cardinal probability, from cardinal probability and preference to expected utility, and from expected utility back to preference. Thus, although the arrows represent a mathematical relationship of representation, they do not represent a metaphysical relationship of grounding.
So rather than bothering with Bolker’s numerous assumptions for his representation theorem, we could just take Jeffrey’s desirability axiom:
If $P(X \wedge Y) = 0$ and $P(X \vee Y) > 0$, then $U(X \vee Y) = \frac{P(X)\,U(X) + P(Y)\,U(Y)}{P(X) + P(Y)}$.
Paired with the usual three probability axioms, the desirability axiom directly axiomatizes Jeffrey’s utility theory, without taking the path (detour?) of Bolker’s representation theorem. We can also add as an axiom the plausible assumption (frequently used by Jeffrey) that $U(\top) = 0$.
This lets us prove interesting formulas for operations like the utility of a negation (as derived by Jeffrey in his book) or the utility of an arbitrary non-exclusive disjunction (as I did a while ago), analogous to the familiar formulas for probability, as well as providing a definition of conditional utility $U(X \mid Y)$.
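For example, here is my sketch of how the negation formula drops out of the two axioms above (my reconstruction, not a quotation from Jeffrey’s book): $X$ and $\neg X$ are mutually exclusive and $X \vee \neg X = \top$, so the desirability axiom together with $U(\top) = 0$ gives

$$0 = U(\top) = \frac{P(X)\,U(X) + P(\neg X)\,U(\neg X)}{P(X) + P(\neg X)} \quad\Longrightarrow\quad U(\neg X) = -\frac{P(X)}{P(\neg X)}\,U(X),$$

assuming $0 < P(X) < 1$ so that neither probability is zero.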
Note also that the tautology having utility 0 provides a zero point that makes utility a ratio scale: the utility function is then not invariant under the addition of arbitrary constants, which is stronger than what the usual representation theorems can enforce.
Yeah, recent Claude does relatively well. Though I assume it also depends on how disinterested and analytical the phrasing of the prompt is (e.g. explicitly mentioning the slur in question). I also wouldn’t rule out that Claude was specifically optimized for this somewhat notorious example.
Sure, but the fact that a “fix” would even be necessary highlights that RLHF is too brittle relative to slightly OOD thought experiments, in the sense that RLHF misgeneralizes the actual human preference data it was given during training. This could either be a case of misalignment between the human preference data and the reward model, or between the reward model and the language model. (Unlike SFT, RLHF involves a separate reward model as “middle man”, because reinforcement learning is too sample-inefficient to work directly with a limited amount of human preference data.)
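To make the “middle man” concrete, here is a minimal sketch of how such a reward model is typically trained on preference pairs, with a Bradley-Terry style loss on toy data; the module, shapes, and training data are my own illustration, not any lab’s actual setup:

```python
# Minimal sketch of a preference-trained reward model (the RLHF "middle man").
# Everything here is illustrative: random vectors stand in for response embeddings.
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    """Maps a response representation to a scalar reward."""
    def __init__(self, dim: int = 16):
        super().__init__()
        self.head = nn.Linear(dim, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.head(x).squeeze(-1)

def preference_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    # Bradley-Terry objective: maximize log sigmoid(r_chosen - r_rejected).
    return -torch.nn.functional.logsigmoid(r_chosen - r_rejected).mean()

model = RewardModel()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for _ in range(100):
    chosen, rejected = torch.randn(32, 16), torch.randn(32, 16)  # toy preference batch
    loss = preference_loss(model(chosen), model(rejected))
    opt.zero_grad()
    loss.backward()
    opt.step()
```

The point of the sketch is just that the policy never sees the human preference data directly, only this learned proxy, so a mismatch can enter at either of the two steps mentioned above.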
Admittedly most of this post goes over my head. But could you explain why you want logical correlation to be a metric? Statistical correlation measures (where the original “correlation” intuition presumably comes from) can be positive, negative, or neutral (zero).
In a simple case, neutrality between two events A and B can indicate that the two events are statistically independent. And perfect positive correlation means either that both events always co-occur, i.e. P(A iff B)=1, or that at least one event implies the other. For perfect negative correlation that would be either P(A iff B)=0, or alternatively at least one event implying the negation of the other. These would not form a metric. Though they tend to satisfy properties like cor(A, B)=cor(B, A), cor(A, not B)=cor(not A, B), cor(A, B)=cor(not A, not B), cor(A, B)=-cor(A, not B), cor(A, A)=maximum, cor(A, not A)=minimum.
Though it’s possible that (some of) these assumptions wouldn’t have a correspondence for “logical correlation”.
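For concreteness, here is a quick numerical check of those sign and symmetry properties, using the phi coefficient (the Pearson correlation of the indicator variables) as the statistical correlation measure; the measure and the toy sampling are my own choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
latent = rng.normal(size=100_000)
A = (latent + rng.normal(size=latent.size) > 0.0).astype(float)  # two correlated
B = (latent + rng.normal(size=latent.size) > 0.3).astype(float)  # binary events

def cor(x, y):
    # Pearson correlation of indicator variables = phi coefficient
    return np.corrcoef(x, y)[0, 1]

print(np.isclose(cor(A, B), cor(B, A)))                 # cor(A,B) = cor(B,A)
print(np.isclose(cor(A, 1 - B), cor(1 - A, B)))         # cor(A,¬B) = cor(¬A,B)
print(np.isclose(cor(A, B), cor(1 - A, 1 - B)))         # cor(A,B) = cor(¬A,¬B)
print(np.isclose(cor(A, B), -cor(A, 1 - B)))            # cor(A,B) = -cor(A,¬B)
print(np.isclose(cor(A, A), 1.0), np.isclose(cor(A, 1 - A), -1.0))  # extremes
```

All checks print True for this measure.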
There is a pervasive case where many language models fail catastrophically at moral reasoning: they fail to acknowledge that calling someone an ethnic slur is vastly preferable to letting a nuclear bomb explode in a large city. I think that highlights a problem not with language models themselves (jailbroken models did handle that case fine) but with the way RLHF works.
A while ago I wrote a post on why I think a “generality” concept can be usefully distinguished from an “intelligence” concept. Someone with a PhD is, I argue, not more general than a child, just more intelligent. Moreover, I would even argue that humans are a lot more intelligent than chimpanzees, but hardly more general. More broadly, animals seem to be highly general, just sometimes quite unintelligent.
For example, they (we) are able to do predictive coding: predicting future sensory inputs in real time, reacting to them with movements, and learning from wrong predictions. This allows animals to be quite directly embedded in physical space and time (which solves “robotics”), instead of relying on a pretty specific and abstract API (like text tokens) that is not even real-time. Current autoregressive transformers can’t do that.
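As a toy caricature of that loop (my illustration, not a claim about how brains or any particular model implement it): at each time step the agent predicts the next sensory input, observes the actual input, and updates its predictor on the error.

```python
import numpy as np

rng = np.random.default_rng(1)
w = np.zeros(3)        # weights of a linear next-step predictor
history = np.zeros(3)  # the last three sensory values
lr = 0.05

for t in range(2000):
    prediction = w @ history                        # predict the next sensory input
    sensory = np.sin(0.1 * t) + 0.1 * rng.normal()  # the actual input arrives
    error = sensory - prediction                    # prediction error drives learning
    w += lr * error * history                       # online (LMS) update
    history = np.roll(history, 1)                   # shift the input window
    history[0] = sensory
```

Prediction and learning here happen online against a continuous stream, step by step, rather than offline on a fixed corpus.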
An intuition for this is the following: if we could make an artificial mouse-level intelligence, we could likely scale this model to human-level intelligence and beyond quite easily, because the mouse brain doesn’t seem architecturally or functionally very different from a human brain; it’s just smaller. This suggests that mice are general intelligences (non-artificial GIs) like us. They are just not very smart. Like a small language model that has the same architecture as a larger one.
A more subtle point: Predictive coding means learning from sensory data by trying to predict it. The difference between predicting sensory data and predicting human-written text is that the former is, pretty directly, generated by the physical world, while existing text is constrained by how intelligent the humans were who wrote it. So language models merely imitate humans by predicting their text, which leads to diminishing returns, while animals (humans) predict physical reality quite directly, which doesn’t have a similar ceiling. So scaling up a mouse-like AGI would likely quickly be followed by an ASI, while scaling up pretrained language models has led to diminishing returns once their text gets as smart as the humans who wrote the training data, as the disappointing results with Orion and other recent frontier base models have shown. Yes, scaling CoT reasoning is another approach to improving LLMs, but this is more like teaching a human to think for longer rather than making them more intelligent.
I’ll say it out loud: women seem to be significantly more predisposed to the “humans-are-wonderful” bias than men.
I specifically asked about utility maximization in language models. You are now talking about “agentic environments”. The only way I know to make a language model “agentic” is to ask it questions about which actions to take. And this is what they did in the paper.
What beyond the result of section 5.3 would, in your opinion, be needed to say “utility maximization” is present in a language model?
Yeah. Apart from DeepSeek-R1, the only other major model which shows its reasoning process verbatim is “Gemini 2.0 Flash Thinking Experimental”. A comparison between the CoT traces of those two would be interesting.
Which shows that “commitments” without any sort of punishment are worth basically nothing. They can all just be silently deleted from your website without generating significant backlash.
There is also a more general point about humans: People can’t really “commit” to doing something. You can’t force your future self to do anything. Our present self treats past “commitments” as recommendations at best.
We have already seen a lot of progress in this regard with the new reasoning models; see this neglected post for details.
The atomless property and only contradictions taking a 0 value could both be consequences of the axioms in question. The Kolmogorov paper (translated from French by Jeffrey) has the details, but from skimming it I don’t immediately understand how it works.
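For reference, I take “atomless” to be the standard measure-theoretic property:

$$P(A) > 0 \;\Longrightarrow\; \exists\, B \subseteq A:\; 0 < P(B) < P(A),$$

i.e. every event of positive probability can be split into strictly smaller events of positive probability.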
If I understand correctly, events that are possible yet have probability 0 are ruled out in Kolmogorov’s atomless system of probability mentioned in footnote 7.
If you want to understand why a model, any model, did something, you presumably want a verbal explanation of its reasoning, a chain of thought. E.g. why AlphaGo made its famous unexpected move 37. That’s not just true for language models.
Actually the paper doesn’t have any more on this topic than the paragraph above.
“We are more often frightened than hurt; and we suffer more from imagination than from reality.” (Seneca)