Nora Belrose

Karma: 694

Nora Belrose Feb 28, 2024, 7:36 AM
8 points
1
in reply to: Charlie Steiner’s comment on: Counting arguments provide no evidence for AI doom
I just deny that they will update “arbitrarily” far from the prior, and I don’t know why you would think otherwise. There are compute tradeoffs and you’re going to run only as many MCTS rollouts as you need to get good performance.

Nora Belrose Feb 28, 2024, 7:33 AM
10 points
1
in reply to: Daniel Kokotajlo’s comment on: Counting arguments provide no evidence for AI doom
they almost certainly don’t have anything to do with what humans want, per se. (that would be basically magic)
We are obviously not appealing to literal telepathy or magic. Deep learning generalizes the way we want in part because we designed the architectures to be good, in part because human brains are built on similar principles to deep learning, and in part because we share a world with our deep learning models and are exposed to similar data.

Nora Belrose Feb 28, 2024, 5:44 AM
8 points
0
in reply to: Joe Carlsmith’s comment on: Counting arguments provide no evidence for AI doom
Hi, thanks for this thoughtful reply. I don’t have time to respond to every point here now, although I did respond to some of them when you first made them as comments on the draft. Let’s talk in person about this stuff soon, and after we’re sure we understand each other I can “report back” some conclusions.

I do tentatively plan to write a philosophy essay just on the indifference principle soonish, because it has implications for other important issues like the simulation argument and many popular arguments for the existence of god.

In the meantime, here’s what I said about the Mortimer case when you first mentioned it:

We’re ultimately going to have to cash this out in terms of decision theory. If you’re comparing policies for an actual detective in this scenario, the uniform prior policy is going to do worse than the “use demographic info to make a non-uniform prior” policy, and the “put probability 1 on the first person you see named Mortimer” policy is going to do worst of all, as long as your utility function penalizes being confidently wrong 1 - p(Mortimer is the killer) fraction of the time more strongly than it rewards being confidently right p(Mortimer is the killer) fraction of the time.

If we trained a neural net with cross-entropy loss to predict the killer, it would do something like the demographic info thing. If you give the neural net zero information, then with cross entropy loss it would indeed learn to use an indifference principle over people, but that’s only because we’ve defined our CE loss over people and not some other coarse-graining of the possibility space.
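A minimal sketch of the coarse-graining point, assuming a predictor that sees no input features at all, so that all it can learn is one fixed distribution over labels. The suspects, the grouping, and the helper names are hypothetical and chosen only for illustration. With cross-entropy loss the best fixed distribution is the empirical label frequency, so which partition the loss is defined over determines what "indifference" ends up meaning.

```python
import math
from collections import Counter

def best_constant_prediction(labels):
    """Distribution minimizing average cross-entropy for a featureless
    predictor: the empirical frequency of each label."""
    counts = Counter(labels)
    total = len(labels)
    return {label: n / total for label, n in counts.items()}

def cross_entropy(labels, probs):
    """Average negative log-probability assigned to the observed labels."""
    return -sum(math.log(probs[y]) for y in labels) / len(labels)

# Hypothetical data: four suspects, each the killer equally often.
suspects = ["A", "B", "C", "D"]
labels = suspects * 25

# Loss defined over individual people: the optimum is uniform over people.
people_probs = best_constant_prediction(labels)

# Same data, loss defined over a coarser (hypothetical) partition.
group_of = {"A": "named Mortimer", "B": "other", "C": "other", "D": "other"}
group_labels = [group_of[y] for y in labels]
group_probs = best_constant_prediction(group_labels)

# A distribution that is "indifferent" over the two groups is not optimal
# here, because it implies a non-uniform distribution over people.
uniform_over_groups = {"named Mortimer": 0.5, "other": 0.5}
loss_optimal = cross_entropy(group_labels, group_probs)
loss_indifferent = cross_entropy(group_labels, uniform_over_groups)
```

Here `people_probs` is uniform over the four suspects, while `group_probs` puts three quarters of the mass on "other". Uniformity shows up only at the level where the loss was defined.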

For human epistemology, I think Huemer’s restricted indifference principle is going to do better than some unrestricted indifference principle (which can lead to outright contradictions), and I expect my policy of “always scrounge up whatever evidence you have, and/or reason by abduction, rather than by indifference” would do best (wrt my own preference ordering at least).

There are going to be some scenarios where an indifference prior is pretty good decision-theoretically because your utility function privileges a certain coarse graining of the world. Like in the detective case you probably care about individual people more than anything else— making sure individual innocents are not convicted and making sure the individual perpetrator gets caught.

The same reasoning clearly does not apply in the scheming case. It’s not like there’s a privileged coarse graining of goal-space, where we are trying to minimize the cross-entropy loss of our prediction wrt that coarse graining, each goal-category is indistinguishable from every other, and almost all the goal-categories lead to scheming.

Nora Belrose Feb 28, 2024, 4:49 AM
2 points
−5
in reply to: Daniel Kokotajlo’s comment on: Counting arguments provide no evidence for AI doom
I don’t think the distinction is important, because in real-world AI systems the train → deployment shift is quite mild, and we’re usually training the model on new trajectories from deployment periodically.
The distinction only matters a lot if you ex ante believe scheming is happening, so that the tiniest difference between train and test distributions will be exploited by the AI to execute a treacherous turn.

Nora Belrose Feb 28, 2024, 4:46 AM
3 points
−1
in reply to: Max H’s comment on: Counting arguments provide no evidence for AI doom
If networks trained via SGD can’t learn scheming
It’s not that they can’t learn scheming. A sufficiently wide network can learn any continuous function. It’s that they’re biased strongly against scheming, and they’re not going to learn it unless the training data primarily consists of examples of humans scheming against one another, or something.
These bullets seem like plausible reasons for why you probably won’t get scheming within a single forward pass of a current-paradigm DL model, but are already inapplicable to the real-world AI systems in which these models are deployed.
Why does chaining forward passes together make any difference? Each forward pass has been optimized to mimic patterns in the training data. Nothing more, nothing less. It’ll scheme in context X iff scheming behavior is likely in context X in the training corpus.

Nora Belrose Feb 28, 2024, 4:40 AM
LW: 3 AF: 1
0
AF
in reply to: Daniel Kokotajlo’s comment on: Counting arguments provide no evidence for AI doom
It depends what you mean by a “way” the model can overfit.
Really we need to bring in measure theory to rigorously talk about this, and an early draft of this post actually did introduce some measure-theoretic concepts. Basically we need to define:
- What set are we talking about,
- What measure we’re using over that set,
- And how that measure relates to the probability measure over possible AIs.
The English locution “lots of ways to do X” can be formalized as “the measure of X-networks is high.” And that’s going to be an empirical claim that we can actually debate.
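One way to write down the three ingredients above, as a sketch rather than the formalization from the early draft. The symbols are assumptions introduced here: $\Theta$ is the set of networks (say, parameter settings), $\mu$ is a reference measure on $\Theta$, $S_X \subseteq \Theta$ is the set of networks that do X, and $P$ is the probability measure over networks that training actually produces. The second expression further assumes $P$ has a density with respect to $\mu$.

```latex
\[
\text{``lots of ways to do } X\text{''}
\quad\leftrightarrow\quad
\mu(S_X) \text{ is large},
\qquad
P(S_X) \;=\; \int_{S_X} \frac{dP}{d\mu}\, d\mu .
\]
```

Under these assumptions, a large $\mu(S_X)$ implies a large $P(S_X)$ only if the density $dP/d\mu$ is not small on $S_X$, which is the empirical question.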

Nora Belrose Feb 28, 2024, 2:45 AM
13 points
−1
in reply to: johnswentworth’s comment on: Counting arguments provide no evidence for AI doom
Some incomplete brief replies:
Huemer… indeed seems confused about all sorts of things
Sure, I was just searching for professional philosopher takes on the indifference principle, and that chapter in Paradox Lost was among the first things I found.
Separately, “reductionism as a general philosophical thesis” does not imply the thing you call “goal reductionism”
Did you see the footnote I wrote on this? I give a further argument for it.
doesn’t mean the end-to-end trained system will turn out non-modular.
I looked into modularity for a bit 1.5 years ago and concluded that the concept is way too vague and seemed useless for alignment or interpretability purposes. If you have a good definition I’m open to hearing it.
There are good reasons behaviorism was abandoned in psychology, and I expect those reasons carry over to LLMs.
To me it looks like people abandoned behaviorism for pretty bad reasons. The ongoing replication crisis in psychology does not inspire confidence in that field’s ability to correctly diagnose bullshit.
That said, I don’t think my views depend on behaviorism being the best framework for human psychology. The case for behaviorism in the AI case is much, much stronger: the equations for an algorithm like REINFORCE or DPO directly push up the probability of some actions and push down the probability of others.
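To make the "push up / push down" point concrete, here is a minimal sketch of a single REINFORCE update for a softmax policy over three actions. The function names, the learning rate, and the rewards are illustrative assumptions, not taken from any particular library or from the post.

```python
import math

def softmax(logits):
    """Convert logits to action probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def reinforce_step(logits, action, reward, lr=0.5):
    """One REINFORCE update on the logits of a softmax policy.

    The gradient of log pi(action) with respect to the logits is
    onehot(action) - probs, so the update is lr * reward * that gradient.
    """
    probs = softmax(logits)
    return [
        z + lr * reward * ((1.0 if i == action else 0.0) - p)
        for i, (z, p) in enumerate(zip(logits, probs))
    ]

logits = [0.0, 0.0, 0.0]
before = softmax(logits)

# The sampled action 0 received positive reward: its probability rises.
after_good = softmax(reinforce_step(logits, action=0, reward=1.0))

# The sampled action 0 received negative reward: its probability falls.
after_bad = softmax(reinforce_step(logits, action=0, reward=-1.0))
```

The update acts on action probabilities directly. Nothing in the rule refers to an internal goal representation.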

Nora Belrose Feb 28, 2024, 2:31 AM
2 points
−9
in reply to: Wei Dai’s comment on: Counting arguments provide no evidence for AI doom
No, I don’t think they are semantically very different. This seems like nitpicking. Obviously “they are likely to encounter” has to have some sort of time horizon attached to it, otherwise it would include times well past the heat death of the universe, or something.

Nora Belrose Feb 28, 2024, 2:29 AM
4 points
1
in reply to: Matthew Barnett’s comment on: Counting arguments provide no evidence for AI doom
You can find my EA forum response here.

Nora Belrose Feb 28, 2024, 1:20 AM
25 points
2
in reply to: johnswentworth’s comment on: Counting arguments provide no evidence for AI doom
I’m pleasantly surprised that you think the post is “pretty decent.”
I’m curious which parts of the Goal Realism section you find “philosophically confused,” because we are trying to correct what we consider to be deep philosophical confusion fairly pervasive on LessWrong.
I recall hearing your compression argument for general-purpose search a long time ago, and it honestly seems pretty confused / clearly wrong to me. I would like to see a much more rigorous definition of “search” and why search would actually be “compressive” in the relevant sense for NN inductive biases. My current take is something like “a lot of the references to internal search on LW are just incoherent” and to the extent you can make them coherent, NNs are either actively biased away from search, or they are only biased toward “search” in ways that are totally benign.
More generally, I’m quite skeptical of the jump from any mechanistic notion of search to the kind of grabby consequentialism that people tend to be worried about. I suspect there’s a double dissociation between these things, where “mechanistic search” is almost always benign, and grabby consequentialism need not be backed by mechanistic search.

Nora Belrose Feb 28, 2024, 12:57 AM
LW: 5 AF: 3
−10
AF
in reply to: Wei Dai’s comment on: Counting arguments provide no evidence for AI doom
The point of that section is that “goals” are not ontologically fundamental entities with precise contents; in fact, they could not possibly be so given a naturalistic worldview. So you don’t need to “target the inner search,” you just need to get the system to act the way you want in all the relevant scenarios.
The modern world is not a relevant scenario for evolution. “Evolution” did not need to, was not “intending to,” and could not have designed human brains so that they would do high inclusive genetic fitness stuff even when the environment changes dramatically and culture becomes completely different from the ancestral environment.

Nora Belrose Feb 28, 2024, 12:33 AM
LW: 1 AF: 1
0
AF
in reply to: DaemonicSigil’s comment on: Counting arguments provide no evidence for AI doom
I doubt there would be much difference, and I think the alignment-relevant comparison is to compare in-distribution but out-of-sample performance to out-of-distribution performance. We can easily do i.i.d. splits of our data, that’s not a problem. You might think it’s a problem to directly test the model in scenarios where it could legitimately execute a takeover if it wanted to.

Counting arguments provide no evidence for AI doom

Nora Belrose and Quintin Pope

Feb 27, 2024, 11:03 PM

101 points

188 comments · 14 min read

Nora Belrose Jan 19, 2024, 3:46 AM
7 points
0
in reply to: Wei Dai’s comment on: Against Relying on Evolution to Forecast AI Outcomes (Part 1)
People come to have sparse and beyond-lifetime goals through mechanisms that are unavailable to biological evolution— it took thousands of years of memetic evolution for people to even develop the concept of a long future that we might be able to affect with our short lives. We’re in a much better position to instill long-range goals into AIs, if we choose to do so— we can simply train them to imitate human thought processes which give rise to longterm-oriented behaviors.

Nora Belrose Jan 19, 2024, 2:41 AM
10 points
0
in reply to: Wei Dai’s comment on: Against Relying on Evolution to Forecast AI Outcomes (Part 1)
It’s very difficult to get any agent to robustly pursue something like IGF because it’s an inherently sparse and beyond-lifetime goal. Human values have been pre-densified for us: they are precisely the kinds of things it’s easy to get an intelligence to pursue fairly robustly. We get dense, repeated, in-lifetime feedback about stuff like sex, food, love, revenge, and so on. A priori, if you’re an agent built by evolution, you should expect to have values that are easy to learn— it would be surprising if it turned out that evolution did things the hard way. So evolution suggests alignment should be easy.

Nora Belrose Jan 15, 2024, 3:35 AM
LW: 21 AF: 11
6
AF
in reply to: evhub’s comment on: Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training
I don’t think the results you cited matter much, because fundamentally the paper is considering a condition in which the model ~always is being pre-prompted with “Current year: XYZ” or something similar in another language (please let me know if that’s not true, but that’s my best-effort read of the paper).
I’m assuming we’re not worried about the literal scenario in which the date in the system prompt causes a distribution shift, because you can always spoof the date during training to include future years without much of a problem. Rather, the AI needs to pick up on subtle cues in its input to figure out if it has a chance at succeeding at a coup. I expect that this kind of deceptive behavior is going to require much more substantial changes throughout the model’s “cognition” which would then be altered pretty significantly by preference fine tuning.
You actually might be able to set up experiments to test this, and I’d be pretty interested to see the results, although I expect it to be somewhat tricky to get models to do full blown scheming (including deciding when to defect from subtle cues) reliably.

Nora Belrose Jan 14, 2024, 6:53 PM
LW: 18 AF: 11
3
AF
in reply to: evhub’s comment on: Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training
So, I think this is wrong.
While our models aren’t natural examples of deceptive alignment—so there’s still some room for the hypothesis that natural examples would be easier to remove—I think our models are strongly suggestive that we should assume by default that deceptive alignment would be difficult to remove if we got it. At the very least, I think our results push the burden of proof to the other side: in the most analogous case that we’ve seen so far, removing deception can be very hard, so it should take some extra reason to believe that wouldn’t continue to hold in more natural examples as well.
While a backdoor which causes the AI to become evil is obviously bad, and it may be hard to remove, the usual arguments for taking deception/scheming seriously do not predict backdoors. Rather, they predict that the AI will develop an “inner goal” which it coherently pursues across contexts. That means there’s not going to be a single activating context for the bad behavior (like in this paper, where it’s just “see text that says the year is 2024” or “special DEPLOYMENT token”) but rather the behavior would be flexibly activated in a wide range of contexts depending on the actual likelihood of the AI succeeding at staging a coup. That’s how you get the counting argument going— there’s a wide range of goals compatible with scheming, etc. But the analogous counting argument for backdoors— there’s a wide range of backdoors that might spontaneously appear in the model and most of them are catastrophic, or something— proves way too much and is basically a repackaging of the unsound argument “most neural nets should overfit / fail to generalize.”

I think it’s far from clear that an AI which had somehow developed a misaligned inner goal— involving thousands or millions of activating contexts— would have all these contexts preserved after safety training. In other words, I think true mesaoptimization is basically an ensemble of a very very large number of backdoors, making it much easier to locate and remove.

Nora Belrose Jan 10, 2024, 11:07 PM
−30 points
−21
in reply to: Signer’s comment on: Thoughts on “AI is easy to control” by Pope & Belrose
Game theory

Nora Belrose Jan 2, 2024, 11:22 PM
2 points
1
in reply to: Charbel-Raphaël’s comment on: Thoughts on “AI is easy to control” by Pope & Belrose

because of anthropic arguments, it’s meaningless to look at past doom events to compute this proba

I disagree; anthropics is pretty normal (https://www.lesswrong.com/posts/uAqs5Q3aGEen3nKeX/anthropics-is-pretty-normal)

Nora Belrose Jan 2, 2024, 12:41 PM
2 points
−6
in reply to: Charbel-Raphaël’s comment on: Thoughts on “AI is easy to control” by Pope & Belrose
I don’t think it makes sense to “revert to a uniform prior” over {doom, not doom} here. Uniform priors are pretty stupid in general, because they’re dependent on how you split up the possibility space. So I prefer to stick fairly close to the probabilities I get from induction over human history, which tell me p(doom from unilateral action) << 50%
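A tiny sketch of the partition dependence mentioned above. The finer outcome labels are hypothetical and carry no claim about which partition is the right one.

```python
def uniform_prior(outcomes):
    """Assign equal probability to every cell of the given partition."""
    p = 1.0 / len(outcomes)
    return {outcome: p for outcome in outcomes}

# Two ways of carving up the same possibility space (labels are illustrative).
coarse = uniform_prior(["doom", "not doom"])
fine = uniform_prior([
    "doom",
    "not doom: flourishing",
    "not doom: muddling through",
    "not doom: stagnation",
])

p_doom_coarse = coarse["doom"]  # 0.5
p_doom_fine = fine["doom"]      # 0.25
```

The same "uniform" rule gives different answers for p(doom) depending only on how the alternatives were listed.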
I strongly disagree that AGI is “more dangerous” than nukes; I think this equivocates over different meanings of the term “dangerous,” and in general is a pretty unhelpful comparison.
I find foom pretty ludicrous, and I don’t see a reason to privilege the hypothesis much.
From the linked report:
My best guess is that we go from AGI (AI that can perform ~100% of cognitive tasks as well as a human professional) to superintelligence (AI that very significantly surpasses humans at ~100% of cognitive tasks) in less than a year.
I just agree with this (if “significantly” means like 5x or something), but I wouldn’t call it “foom” in the relevant sense. It just seems orthogonal to the whole foom discussion.
