Alignment Forum recent comments

Steven Byrnes May 1, 2025, 6:08 PM
LW: 4 AF: 3
0
AF
on: Can we safely automate alignment research?
Thanks for writing this! Leaving some comments with reactions as I was reading, not all very confident, and sorry if I missed or misunderstood things you wrote.
Problems with these evaluation techniques can arise in attempting to automate all sorts of domains (I’m particularly interested in comparisons with (a) capabilities research, and (b) other STEM fields). And I think this should be a source of comfort. In particular: these sorts of problems can slow down the automation of capabilities research, too. And to the extent they’re a bottleneck on all sorts of economically valuable automation, we should expect lots of effort to go towards resolving them. … [then more discussion in §6.1]
This feels wrong to me. I feel like “the human must evaluate the output, and doing so is hard” is more of an edge case, applicable to things like “designs for a bridge”, where failure is far away and catastrophic. (And applicable to alignment research, of course.)
Like you mention today’s “reward-hacking” (e.g. o3 deleting unit tests instead of fixing the code) as evidence that evaluation is necessary. But that’s a bad example because the reward-hacked code doesn’t actually work! And people notice that it doesn’t work. If the code worked flawlessly, then people wouldn’t be talking about reward-hacking as if it’s a bad thing. People notice eventually, and that constitutes an evaluation. Likewise, if you hire a lousy head of marketing, then you’ll eventually notice the lack of new customers; if you hire a lousy CTO, then you’ll eventually notice that your website doesn’t work; etc.
OK, you anticipate this reply and then respond with: “…And even if these tasks can be evaluated via more quantitative metrics in the longer-term (e.g., “did this business strategy make money?”), trying to train on these very long-horizon reward signals poses a number of distinctive challenges (e.g., it can take a lot of serial time, long-horizon data points can be scarce, etc).”
But I don’t buy that because, like, humans went to the moon. That was a long-horizon task but humans did not need to train on it, rather they did it with the same brains we’ve been using for millennia. It did require long-horizon goals. But (1) If AI is unable to pursue long-horizon goals, then I don’t think it’s adequate to be an alignment MVP (you address this in §9.1 & here, but I’m more pessimistic, see here & here), (2) If the AI is able to pursue long-horizon goals, then the goal of “the human eventually approves / presses the reward button” is an obvious and easily-trainable approach that will be adequate for capabilities, science, and unprecedented profits (but not alignment), right up until catastrophe. (Bit more discussion here.)
((1) might be related to my other comment, maybe I’m envisioning a more competent “alignment MVP” than you?)

johnswentworth May 1, 2025, 2:37 PM
LW: 13 AF: 6
11
AF
in reply to: ryan_greenblatt’s comment on: Can we safely automate alignment research?
You might hope for elicitation efficiency, as in, you heavily RL the model to produce useful considerations and hope that your optimization is good enough that it covers everything well enough.
“Hope” is indeed a good general-purpose term for plans which rely on an unverifiable assumption in order to work.
(Also I’d note that as of today, heavy RL tends to in fact produce pretty bad results, in exactly the ways one would expect in theory, and in particular in ways which one would expect to get worse rather than better as capabilities increase. RL is not something we can apply in more than small amounts before the system starts to game the reward signal.)

johnswentworth May 1, 2025, 2:32 PM
LW: 11 AF: 6
5
AF
in reply to: Steven Byrnes’s comment on: Can we safely automate alignment research?
That was an excellent summary of how things seem to normally work in the sciences, and explains it better than I would have. Kudos.

Steven Byrnes May 1, 2025, 12:56 PM
LW: 47 AF: 26
14
AF
in reply to: johnswentworth’s comment on: Can we safely automate alignment research?
I think that OP’s discussion of “number-go-up vs normal science vs conceptual research” is an unnecessary distraction, and he should have cut that part and just talked directly about the spectrum from “easy-to-verify progress” to “hard-to-verify progress”, which is what actually matters in context.
Partly copying from §1.4 here, you can (A) judge ideas via new external evidence, and/or (B) judge ideas via internal discernment of plausibility, elegance, self-consistency, consistency with already-existing knowledge and observations, etc. There’s a big range in people’s ability to apply (B) to figure things out. But what happens in “normal” sciences like biology is that there are people with a lot of (B), and they can figure out what’s going on, on the basis of hints and indirect evidence. Others don’t. The former group can gather ever-more-direct and ever-more-unassailable (A)-type evidence over time, and use that evidence as a cudgel with which to beat the latter group over the head until they finally get it. (“If you don’t believe my 7 independent lines of evidence for plate tectonics, OK fine I’ll go to the mid-Atlantic ridge and gather even more lines of evidence…”)
This is an important social tool, and explains why bad scientific ideas can die, while bad philosophy ideas live forever. And it’s even worse than that—if the bad philosophy ideas don’t die, then there’s no common knowledge that the bad philosophers are bad, and then they can rise in the ranks and hire other bad philosophers etc. Basically, to a first approximation, I think humans and human institutions are not really up to the task of making intellectual progress systematically over time, except where idiot-proof verification exists for that intellectual progress (for an appropriate definition of “idiot”, and with some other caveats).
…Anyway, AFAICT, OP is just claiming that AI alignment research involves both easy-to-verify progress and hard-to-verify progress, which seems uncontroversial.

Kaj_Sotala May 1, 2025, 10:30 AM
LW: 4 AF: 3
0
AF
in reply to: Steven Byrnes’s comment on: steve2152′s Shortform
Reminds me of

ryan_greenblatt May 1, 2025, 3:15 AM
LW: 2 AF: 2
0
AF
in reply to: johnswentworth’s comment on: Can we safely automate alignment research?
You might hope for elicitation efficiency, as in, you heavily RL the model to produce useful considerations and hope that your optimization is good enough that it covers everything well enough.

Or, two lower bars you might hope for:
- It brings up considerations that it “knows” about. (By “knows” I mean relatively deep knows, like it can manipulate and utilize the knowledge relatively strongly.)
- It isn’t much worse than human researchers at bringing up important considerations.
In general, you might have elicitation problems and this domain seems only somewhat worse with respect to elicitation. (It’s worse because the feedback is somewhat more expensive.)

Jacob_Hilton May 1, 2025, 12:58 AM
LW: 7 AF: 4
0
AF
on: Jacob_Hilton’s Shortform
I recently gave this talk at the Safety-Guaranteed LLMs workshop:
The talk is about ARC’s work on low probability estimation (LPE), covering:
- Theoretical motivation for LPE and (towards the end) activation modeling approaches (both described here)
- Empirical work on LPE in language models (described here)
- Recent work-in-progress on theoretical results

johnswentworth May 1, 2025, 12:52 AM
LW: 5 AF: 3
2
AF
in reply to: johnswentworth’s comment on: Can we safely automate alignment research?
Perhaps a better summary of my discomfort here: suppose you train some AI to output verifiable conceptual insights. How can I verify that this AI is not missing lots of things all the time? In other words, how do I verify that the training worked as intended?

johnswentworth May 1, 2025, 12:17 AM
LW: 16 AF: 7
14
AF
in reply to: Joe Carlsmith’s comment on: Can we safely automate alignment research?
Rather, conceptual research as I’m understanding it is defined by the tools available for evaluating the research in question.^[1] In particular, as I’m understanding it, cases where neither available empirical tests nor formal methods help much.
Agreed.
But if some AI presented us with this claim, the question is whether we could evaluate it via some kind of empirical test, which it sounds like we plausibly could.
Disagreed.
My guess is that you have, in the back of your mind here, ye olde “generation vs verification” discussion. And in particular, so long as we can empirically/mathematically verify some piece of conceptual progress once it’s presented to us, we can incentivize the AI to produce interesting new pieces of verifiable conceptual progress.
That’s an argument which works in the high capability regime, if we’re willing to assume that any relevant progress is verifiable, since we can assume that the highly capable AI will in fact find whatever pieces of verifiable conceptual progress are available. Whether it works in the relatively-low capability regime of human-ish-level automated alignment research and realistic amounts of RL is… rather more dubious. Also, the mechanics of designing a suitable feedback signal would be nontrivial to get right in practice.
Getting more concrete: if we’re imagining an LLM-like automated researcher, then a question I’d consider extremely important is: “Is this model/analysis/paper/etc missing key conceptual pieces?”. If the system says “yes, here’s something it’s missing” then I can (usually) verify that. But if it says “nope, looks good”… then I can’t verify that the paper is in fact not missing anything.
And in fact that’s a problem I already do run into with LLMs sometimes: I’ll present a model, and the thing will be all sycophantic and not mention that the model has some key conceptual confusion about something. Of course you might hope that some more-clever training objective will avoid that kind of sycophancy and instead incentivize Good Science, but that’s definitely not an already-solved problem, and I sure do feel suspicious of an assumption that that problem will be easy.

Wei Dai Apr 30, 2025, 11:44 PM
LW: 23 AF: 12
10
AF
on: Can we safely automate alignment research?
At the outermost feedback loop, capabilities can ultimately be grounded via relatively easy objective measures such as revenue from AI, or later, global chip and electricity production, but alignment can only be evaluated via potentially faulty human judgement. Also, as mentioned in the post, the capabilities trajectory is much harder to permanently derail because unlike alignment, one can always recover from failure and try again. I think this means there’s an irreducible logical risk (i.e., the possibility that this statement is true as a matter of fact about logic/math) that capabilities research is just inherently easier to automate than alignment research, that no amount of “work hard to automated alignment research” can lower beyond. Given the lack of established consensus ways of estimating and dealing with such risk, it’s inevitable that the people with the least estimate/concern about this risk (and other AI risks) will push capabilities forward as fast as they can, and seemingly the only way to solve this on the societal level is to push for norms/laws against doing that, i.e., slow down capabilities research via (politically legitimate) force and/or social pressure. I suspect the author might already agree with all this (the existence of this logical risk, the social dynamics, the conclusion about norms/laws being needed to reduce AI risk beyond some threshold), but I think it should be emphasized more in a post like this.

Joe Carlsmith Apr 30, 2025, 9:10 PM
LW: 8 AF: 6
0
AF
in reply to: johnswentworth’s comment on: Can we safely automate alignment research?
Note that conceptual research, as I’m understanding it, isn’t defined by the cognitive skills involved in the research—i.e., by whether the researchers need to have “conceptual thoughts” like “wait, is this measuring what I think it’s measuring?”. I agree that normal science involves a ton of conceptual thinking (and many “number-go-up” tasks do too). Rather, conceptual research as I’m understanding it is defined by the tools available for evaluating the research in question.^[1] In particular, as I’m understanding it, cases where neither available empirical tests nor formal methods help much.
Thus, in your neurotransmitter example, it does indeed take some kind of “conceptual thinking” to come up with the thought “maybe it actually takes longer for neurotransmitters to get re-absorbed than it takes for them to clear from the cleft.” But if some AI presented us with this claim, the question is whether we could evaluate it via some kind of empirical test, which it sounds like we plausibly could. Of course, we do still need to interpret the results of these tests—e.g., to understand enough about what we’re actually trying to measure to notice that e.g. one measurement is getting at it better than another. But we’ve got rich empirical feedback loops to dialogue with.
So if we interpret “conceptual work” as conceptual thinking, I do agree that “there will be market pressure to make AI good at conceptual work, because that’s a necessary component of normal science.” And this is closely related to the comforts I discuss in section 6.1. That is: a lot of alignment research seems pretty comparable to me to the sort of science at stake in e.g. biology, physics, computer science, etc, where I think human evaluation has a decent track record (or at least, a better track record than philosophy/futurism), and where I expect a decent amount of market pressure to resolve evaluation difficulties adequately. So (modulo scheming AIs differentially messing with us in some domains vs. others), at least by the time we’re successfully automating these other forms of science, I think we should be reasonably optimistic about automating that kind of alignment research as well. But this still leaves the type of alignment research that looks more centrally like philosophy/futurism, where I think evaluation is additionally challenging, and where the human track record looks worse.
1. ^
  Thus, from section 6.2.3: “Conceptual research, as I’m understanding it, is defined by the methods available for evaluating it, rather than the cognitive skills involved in producing it. For example: Einstein on relativity was clearly a giant conceptual breakthrough. But because it was evaluable via a combination of empirical predictions and formal methods, it wouldn’t count as ‘conceptual research’ in my sense.”

johnswentworth Apr 30, 2025, 6:39 PM
LW: 24 AF: 11
11
AF
on: Can we safely automate alignment research?
I think you are importantly missing something about how load-bearing “conceptual” progress is in normal science.
An example I ran into just last week: I wanted to know how long it takes various small molecule neurotransmitters to be reabsorbed after their release. And I found some very different numbers:
- Some sources offhandedly claimed ~1ms. AFAICT, this number comes from measuring the time taken for the neurotransmitter to clear from the synaptic cleft, and then assuming that the neurotransmitter clears mainly via reabsorption (an assumption which I emphasize because IIUC it’s wrong; I think the ~1ms number is actually measuring time for the neurotransmitter to diffuse out of the cleft).
- Other sources claimed ~10ms. These were based on <other methods>.
Now, I want to imagine for a moment a character named Emma the Empirical Fundamentalist, someone who eschews “conceptual” work entirely and only updates on Empirically Testable Questions. (For current purposes, mathematical provability can also be lumped in with empirical testability.) How would Emma respond to the two mutually-contradictory neurotransmitter reabsorption numbers above?
Well, first and foremost, Emma would not think “one of these isn’t measuring what the authors think they’re measuring”. That is a quintessential “conceptual” thought. Instead, Emma might suspect that one of the measurements was simply wrong, but repeating the two experiments will quickly rule out that hypothesis. She might also think to try many other measurement methods, and find that most of them agree with the ~10ms number, but the ~1ms measurement will still be a true repeatable phenomenon. Eventually she will likely settle on roughly “the real world is messy and complex, so sometimes measurements depend on context and method in surprising ways, and this is one of those times”. (Indeed, that sort of thing is a very common refrain in tons of biological science.)
But in this case, the real explanation (I claim) is simply that one of the measurements was based on an incorrect assumption. I made the incorrect assumption obvious by highlighting it above, but in-the-wild the assumption would be implicit and unstated and nonobvious; it wouldn’t even occur to the authors that they’re making an assumption which could be wrong.
Discovering that sort of error is centrally in the wheelhouse of conceptual progress. And as this example illustrates, it’s a very load-bearing piece of normal science.
(And indeed, I expect that at least some of your examples of normal-science-style empirical alignment research are cases where the authors are probably not measuring what they think they are measuring, though I don’t know in advance which ones. Conceptual work is exactly what would be required to sort that out.)
(And to be clear, I don’t think this example is the only or even most common “kind of way” in which conceptual work is load-bearing for normal science.)
Moving up the discussion stack: insofar as conceptual work is very load-bearing for normal science, how does that change the view articulated in the post? Well, first, it means that one cannot produce a good normal-science AI by primarily relying on empirical feedback (including mathematical feedback from proofs), unless one gets lucky and training on empirical feedback happens to also make the AI good at conceptual work. Second, there will be market pressure to make AI good at conceptual work, because that’s a necessary component of normal science.

Rauno Arike Apr 30, 2025, 3:18 PM
LW: 3 AF: 3
0
AF
in reply to: Steven Byrnes’s comment on: steve2152′s Shortform
Yep, I agree that there are alignment failures which have been called reward hacking that don’t fall under my definition of specification gaming, including your example here. I would call your example specification gaming if the prompt was “Please rewrite my code and get all tests to pass”: in that case, the solution satisfies the prompt in an unintended way. If the model starts deleting tests with the prompt “Please debug this code,” then that just seems like a straightforward instruction-following failure, since the instructions didn’t ask the model to touch the code at all. “Please rewrite my code and get all tests to pass. Don’t cheat.” seems like a corner case to me—to decide whether that’s specification gaming, we would need to understand the implicit specifications that the phrase “don’t cheat” conveys.

Steven Byrnes Apr 30, 2025, 2:09 PM
LW: 3 AF: 3
0
AF
in reply to: Rauno Arike’s comment on: steve2152′s Shortform
I thought of a fun case in a different reply: Harry is a random OpenAI customer and writes in the prompt “Please debug this code. Don’t cheat.” Then o3 deletes the unit tests instead of fixing the code. Is this “specification gaming”? No! Right? If we define “the specification” as what Harry wrote, then o3 is clearly failing the specification. Do you agree?

Steven Byrnes Apr 30, 2025, 1:59 PM
LW: 3 AF: 3
0
AF
in reply to: Kei’s comment on: steve2152′s Shortform
Thanks for the examples!
Yes I’m aware that many are using terminology this way; that’s why I’m complaining about it :)
I think your two 2018 Victoria Krakovna links (in context) are all consistent with my narrower (I would say “traditional”) definition. For example, the CoastRunners boat is actually getting a high RL reward by spinning in circles. Even for non-RL optimization problems that she mentions (e.g. evolutionary optimization), there is an objective which is actually scoring the result highly. Whereas for an example of o3 deleting a unit test during deployment, what’s the objective on which the model is actually scoring highly?
- Getting a good evaluation afterwards? Nope, the person didn’t want cheating!
- The literal text that the person said (“please debug the code”)? For one thing, erasing the unit tests does not satisfy the natural-language phrase “debugging the code”. For another thing, what if the person wrote “Please debug the code. Don’t cheat.” in the prompt, and o3 cheats anyway? Can we at least agree that this case should not be called reward hacking or specification gaming? It’s doing the opposite of its specification, right?
As for terminology, hmm, some options include “lying and cheating”, “ruthless consequentialist behavior” (I added “behavior” to avoid implying intentionality), “loophole-finding”, or “generalizing from a training process that incentivized reward-hacking via cheating and loophole-finding”.
(Note that the last one suggests a hypothesis, namely that if the training process had not had opportunities for successful cheating and loophole-finding, then the model would not be doing those things right now. I think that this hypothesis might or might not be true, and thus we really should be calling it out explicitly instead of vaguely insinuating it.)

Kei Apr 29, 2025, 10:16 PM
LW: 5 AF: 4
0
AF
in reply to: Steven Byrnes’s comment on: steve2152′s Shortform
It’s pretty common for people to use the terms “reward hacking” and “specification gaming” to refer to undesired behaviors that score highly as per an evaluation or a specification of an objective, regardless of whether that evaluation/specification occurs during RL training. I think this is especially common when there is some plausible argument that the evaluation is the type of evaluation that could appear during RL training, even if it doesn’t actually appear there in practice. Some examples of this:
- OpenAI described o1-preview succeeding at a CTF task in an undesired way as reward hacking.
- Anthropic described Claude 3.7 Sonnet giving an incorrect answer aligned with a validation function in a CoT faithfulness eval as reward hacking. They also used the term when describing the rates of models taking certain misaligned specification-matching behaviors during an evaluation after being fine-tuned on docs describing that Claude does or does not like to reward hack.
- This relatively early DeepMind post on specification gaming and the blog post from Victoria Krakovna that it came from (which might be the earliest use of the term specification gaming?) also gives a definition consistent with this.
I think the literal definitions of the words in “specification gaming” align with this definition (although interestingly not the words in “reward hacking”). The specification can be operationalized as a reward function in RL training, as an evaluation function or even via a prompt.
I also think it’s useful to have a term that describes this kind of behavior independent of whether or not it occurs in an RL setting. Maybe this should be reward hacking and specification gaming. Perhaps as Rauno Arike suggests it is best for this term to be specification gaming, and for reward hacking to exclusively refer to this behavior when it occurs during RL training. Or maybe due to the confusion it should be a whole new term entirely. (I’m not sure that the term “ruthless consequentialism” is the ideal term to describe this behavior either as it ascribes certain intentionality that doesn’t necessarily exist.)

Steven Byrnes Apr 29, 2025, 7:26 PM
LW: 6 AF: 3
5
AF
in reply to: Zach Stein-Perlman’s comment on: steve2152′s Shortform
Thanks!
I’m now much more sympathetic to a claim like “the reason that o3 lies and cheats is (perhaps) because some reward-hacking happened during its RL post-training”.
But I still think it’s wrong for a customer to say “Hey I gave o3 this programming problem, and it reward-hacked by editing the unit tests.”

gwern Apr 29, 2025, 6:11 PM
LW: 45 AF: 16
3
AF
on: gwern’s Shortform
The Meta-LessWrong Doomsday Argument (MLWDA) predicts long AI timelines and that we can relax:

LessWrong was founded in 2009 (16 years ago), and there have been 44 mentions of the ‘Doomsday argument’ prior to this one, and it is now 2025, at 2.75 mentions per year.

By the Doomsday argument, we medianly-expect mentions to stop in: after 44 additional mentions over 16 additional years or in 2041. (And our 95% CI on that 44 would then be +1 mention to +1,1760 mentions, corresponding to late-2027 AD to 2665 AD.)

By a curious coincidence, double-checking to see if really no one had made a meta-DA before, it turns out that Alexey Turchin has made a meta-DA as well about 7 years ago, calculating that

If we assume 1993 as the beginning of a large DA-Doomers reference class, and it is 2018 now (at the moment of writing this text), the age of the DA-Doomers class is 25 years. Then, with 50% probability, the reference class of DA-Doomers will disappear in 2043, according to Gott’s equation! Interestingly, the dates around 2030–2050 appear in many different predictions of the singularity or the end of the world (Korotayev 2018; Turchin & Denkenberger 2018b; Kurzweil 2006).

His estimate of 2043 is surprisingly close to 2041.

We offer no explanation as to why this numerical consilience of meta-DA calculations has happened; we attribute their success, as all else, to divine benevolence.

Regrettably, the 2041--2043 date range would seem to imply that it is unlikely we will obtain enough samples of the MLWDA in order to compute a Meta-Meta-LessWrong Doomsday Argument (MMLWDA) with non-vacuous confidence intervals, inasmuch as every mention of the MLWDA would be expected to contain a mention of the DA as well.

Zach Stein-Perlman Apr 29, 2025, 5:46 PM
LW: 20 AF: 12
9
AF
in reply to: Steven Byrnes’s comment on: steve2152′s Shortform
I agree people often aren’t careful about this.
Anthropic says
During our evaluations we noticed that Claude 3.7 Sonnet occasionally resorts to special-casing in order to pass test cases in agentic coding environments . . . . This undesirable special-casing behavior emerged as a result of “reward hacking” during reinforcement learning training.
Similarly OpenAI suggests that cheating behavior is due to RL.

Rauno Arike Apr 29, 2025, 5:45 PM
LW: 14 AF: 7
1
AF
in reply to: Steven Byrnes’s comment on: steve2152′s Shortform
I think that using ‘reward hacking’ and ‘specification gaming’ as synonyms is a significant part of the problem. I’d argue that for LLMs, which can learn task specifications not only through RL but also through prompting, it makes more sense to keep those concepts separate, defining them as follows:
- Reward hacking—getting a high RL reward via strategies that were not desired by whoever set up the RL reward.
- Specification gaming—behaving in a way that satisfies the literal specification of an objective without achieving the outcome intended by whoever specified the objective. The objective may be specified either through RL or through a natural language prompt.
Under those definitions, the recent examples of undesired behavior from deployment would still have a concise label in specification gaming, while reward hacking would remain specifically tied to RL training contexts. The distinction was brought to my attention by Palisade Research’s recent paper Demonstrating specification gaming in reasoning models—I’ve seen this result called reward hacking, but the authors explicitly mention in Section 7.2 that they only demonstrate specification gaming, not reward hacking.