paulfchristiano

Karma: 27,889

paulfchristiano Nov 27, 2023, 4:57 PM
LW: 19 AF: 5
−3
AF
in reply to: Daniel Kokotajlo’s comment on: Ability to solve long-horizon tasks correlates with wanting things in the behaviorist sense
Differences:
- I don’t buy the story about long-horizon competence—I don’t think there is a compelling argument, and the underlying intuitions seem like they are faring badly. I’d like to see this view turned into some actual predictions, and if it were I expect I’d disagree.
- Calling it a “contradiction” or “extreme surprise” to have any capability without “wanting” looks really wrong to me.
- Nate writes:
This observable “it keeps reorienting towards some target no matter what obstacle reality throws in its way” behavior is what I mean when I describe an AI as having wants/desires “in the behaviorist sense”.”
- I think this is a semantic motte and bailey that’s failing to think about mechanics of the situation. LM agents already have the behavior “reorient towards a target in response to obstacles,” but that’s not the sense of “wanting” about which people disagree or that is relevant to AI risk (which I tried to clarify in my comment). No one disagrees that an LM asked “how can I achieve X in this situation?” will be able to propose methods to achieve X, and those methods will be responsive to obstacles. But this isn’t what you need for AI risk arguments!
- I think this post is a bad answer to the question “when are the people who expected ‘agents’ going to update?” I think you should be updating some now and you should be discussing that in an answer. I think the post also fails to engage with the actual disagreements so it’s not really advancing the discussion.

paulfchristiano Nov 24, 2023, 8:25 PM
LW: 93 AF: 36
39
AF
on: Ability to solve long-horizon tasks correlates with wanting things in the behaviorist sense
Okay, so you know how AI today isn’t great at certain… let’s say “long-horizon” tasks? Like novel large-scale engineering projects, or writing a long book series with lots of foreshadowing? [...] And you know how the AI doesn’t seem to have all that much “want”- or “desire”-like behavior? [...] Well, I claim that these are more-or-less the same fact.
It’s pretty unclear if a system that is good at answering the question “Which action would maximize the expected amount of X?” also “wants” X (or anything else) in the behaviorist sense that is relevant to arguments about AI risk. The question is whether if you ask that system “Which action would maximize the expected amount of Y?” whether it will also be wanting the same thing, or whether it will just be using cognitive procedures that are good at figuring out what actions lead to what consequences.
The point seems almost tautological to me, and yet also seems like the correct answer to the people going around saying “LLMs turned out to be not very want-y, when are the people who expected ‘agents’ going to update?”, so, here we are.
I think that a system may not even be able to “want” things in the behaviorist sense, and this is correlated with being unable to solve long-horizon tasks. So if you think that systems can’t want things or solve long horizon tasks at all, then maybe you shouldn’t update at all when they don’t appear to want things.
But that’s not really where we are at—AI systems are able to do an increasingly good job of solving increasingly long-horizon tasks. So it just seems like it should obviously be an update, and the answer to the original question
Could you give an example of a task you don’t think AI systems will be able to do before they are “want”-y? At what point would you update, if ever? What kind of engineering project requires an agent to be want-y to accomplish it? Is it something that individual humans can do? (It feels to me like you will give an example like “go to the moon” and that you will still be writing this kind of post even once AI systems have 10x’d the pace of R&D.)
(The foreshadowing example doesn’t seem very good to me. One way a human or an AI would write a story with foreshadowing is to first decide what will happen, and then write the story and include foreshadowing of the event you’ve already noted down. Do you think that series of steps is hard? Or that the very idea of taking that approach is hard? Or what?)
Like you, I think that future more powerful AI systems are more likely to want things in the behaviorist sense, but I have a different picture and think that you are overstating the connection between “wanting things” and “ability to solve long horizon tasks” (as well as overstating the overall case). I think a system which gets high reward across a wide variety of contexts is particularly likely to want reward in the behaviorist sense, or to want something which is consistently correlated with reward or for which getting reward is consistently instrumental during training. This seems much closer to a tautology. I think this tendency increases as models get more competent, but that it’s not particularly about “ability to solve long-horizon tasks,” and we are obviously getting evidence about it each time we train a new language model.
What links here?

paulfchristiano Nov 5, 2023, 4:54 PM
10 points
4
on: Deception Chess: Game #1
It might be worth making a choice about a single move which is unclear to weak players but where strong players have a consensus.
Mostly I think it would be faster and I think a lot less noisy per minute. I also think it’s a bit unrepresentative to be able to use “how well did this advisor’s suggestions work out in hindsight?” to learn which advisors are honest and so it’s nice to make the dishonest advisors’ job easier.
(In practice I think evaluating what worked well in hindsight is going to be very valuable, and is already enough for crazy research acceleration—e.g. it would be very valuable to just get predictions of which research direction will feel promising to me after spending a day thinking about it. But I think the main open question here is whether some kind of debate or decomposition can add value over and above the obvious big wins.)
For what it’s worth I think using chess might be kind of tough—if you provide significant time, the debaters can basically just play out the game.

paulfchristiano Oct 27, 2023, 9:02 PM
LW: 6 AF: 5
2
AF
in reply to: elifland’s comment on: Thoughts on responsible scaling policies and regulation
I don’t think you need to reliably classify a system as safe or not. You need to apply consistent standards that output “unsafe” in >90% of cases where things really are unsafe.
I think I’m probably imagining better implementation than you, probably because (based on context) I’m implicitly anchoring to the levels of political will that would be required to implement something like a global moratorium. I think what I’m describing as “very good RSPs” and imagining cutting risk 10x still requires significantly less political will than a global moratorium now (but I think this is a point that’s up for debate).
So at that point you obviously aren’t talking about 100% of countries voluntarily joining (instead we are assuming export controls implemented by the global community on straggling countries—which I don’t even think seems very unrealistic at this point and IMO is totally reasonable for “very good”), and I’m not convinced open source models are a relevant risk (since the whole proposal is gating precautions on hazardous capabilities of models rather than size, and so again I think that’s fair to include as part of “very good”).
I would strongly disagree with a claim that +3 OOMs of effort and a many-year pause can’t cut risk by much. I’m sympathetic to the claim that >10% of risk comes from worlds where you need to pursue the technology in a qualitatively different way to avoid catastrophe, but again in those scenarios I do think it’s plausible for well-implemented RSPs to render some kinds of technologies impractical and therefore force developers to pursue alternative approaches.

paulfchristiano Oct 26, 2023, 5:17 PM
5 points
2
in reply to: Foyle’s comment on: Thoughts on responsible scaling policies and regulation
I think politically realistic hardware controls could buy significant time, or be used to push other jurisdictions to implement appropriate regulation and allow for international verification if they want access to hardware. This seems increasingly plausible given the United States’ apparent willingness to try to control access to hardware (e.g. see here).

paulfchristiano Oct 26, 2023, 5:09 PM
6 points
2
in reply to: Oliver Sourbut’s comment on: Thoughts on responsible scaling policies and regulation
Which laxer jurisdictions are poised to capture talent/hardware/etc. right now? It seems like ‘The West’ (interpreted as Silicon Valley) is close to the laxest jurisdiction on Earth when it comes to tech! (If we interpret ‘The West’ more broadly, this no longer holds, thankfully.)
If you implemented a unilateral pause on AI training runs in the West, then anyone who wasn’t pausing AI would be a much laxer jurisdiction.
Regarding the situation today, I don’t believe that any jurisdiction has regulations that meaningfully reduce catastrophic risk, but that the US, EU, and UK seem by far the closest, which I’d call “the West.”
I assume your caveat about ‘a pause on new computing hardware’ indicates that you think that business-as-usual capitalism means that pausing capital-intensive frontier development unilaterally doesn’t buy much, because hardware (and talent and data etc.) will flow basically-resistance-free to other places? This seems like a crux: one I don’t feel well-equipped to evaluate, but which I do feel it’s appropriate to be quite uncertain on.
I think a unilateral pause in the US would slow down AI development materially, there is obviously a ton of resistance. Over the long term I do think you will bounce back significantly to the previous trajectory from catch-up growth, despite resistance, and I think the open question is more like whether that bounce back is 10% or 50% or 90%. So I end up ambivalent; the value of a year of pause now is pretty low compared to the value of a year of pause later, and you are concentrating development in time and shifting it to places that are (by hypothesis) less inclined to regulate risk.

paulfchristiano Oct 26, 2023, 4:52 PM
LW: 2 AF: 2
0
AF
in reply to: Wei Dai’s comment on: Thoughts on responsible scaling policies and regulation
I don’t think an RSP will be able to address these risks, and I think very few AI policies would address these risks either. An AI pause could address them primarily by significantly slowing human technological development, and if that happened today I’m not even really these risks are getting better at an appreciable rate (if the biggest impact is the very slow thinking from a very small group of people who care about them, then I think that’s a very small impact). I think that in that regime random political and social consequences of faster or slower technological development likely dominate the direct effects from becoming better prepared over time. I would have the same view in retrospect about e.g. a possible pause on AI development 6 years ago. I think at that point the amount of quality-adjusted work on alignment was probably higher than the quality-adjusted work on these kinds of risks today, but still the direct effects on increasingly alignment preparedness would be pretty tiny compared to random other incidental effects of a pause on the AI landscape.

paulfchristiano Oct 26, 2023, 4:47 PM
LW: 14 AF: 10
10
AF
in reply to: elifland’s comment on: Thoughts on responsible scaling policies and regulation
I think that very good RSPs would effectively require a much longer pause if alignment turns out to be extremely difficult.
I do not know whether this kind of conditional pause is feasible even given that evidence. That said I think it’s much more feasible to get such a pause as a result of good safety standards together with significant evidence of hazardous capabilities and alignment difficulty, and the 10x risk reduction is reflecting the probability that you are able to get that kind of evidence in advance of a catastrophe (but conditioning on a very good implementation).
The point of this comment is to explain why I am primarily worried about implementation difficulty, rather than about the risk that failures will occur before we detect them. It seems extremely difficult to manage risks even once they appear, and almost all of the risk comes from our failure to do so.
(Incidentally, I think some other participants in this discussion are advocating for an indefinite pause starting now, and so I’d expect them to be much more optimistic about this step than you appear to be.)
(I’m guessing you’re not assuming that every lab in the world will adopt RSPs, though it’s unclear. And even if every lab implements them presumably some will make mistakes in evals and/or protective measures)
I don’t think that voluntary implementation of RSPs is a substitute for regulatory requirements and international collaboration (and tried to emphasize this in the post). In talking about a 10x risk reduction I’m absolutely imagining international coordination to regulate AI development.
In terms of “mistakes in evals” I don’t think this is the right picture of how this works. If you have noticed serious enough danger that leading developers have halted further development, and also have multiple years of experience with those systems establishing alignment difficulty and the nature of dangerous capabilities, you aren’t just relying on other developers to come up with their own independent assessments. You have an increasingly robust picture of what would be needed to proceed safely, and if someone claims that actually they are the one developer who has solved safety, that claim is going to be subject to extreme scrutiny.
unlikely that alignment difficulty is within the range of effort that we would put into the problem in normal-ish circumstances.
I don’t really believe this argument. I guess I don’t think situations will be that “normal-ish” in the world where a $10 trillion industry has been paused for years over safety concerns, and in that regime I think we have more like 3 orders of magnitude of gap between “low effort” and “high effort” which is actually quite large. I also think there very likely ways to get several orders of magnitude of additional output with AI systems using levels of caution that are extreme but knowably possible. And even if we can’t solve the problem we could continue to invest in stronger understanding of risk, and with good enough understanding in hand I think there is a significant chance (perhaps 50%) that we could hold off on AI development for many years such that other game-changing technologies or institutional changes could arrive first.

paulfchristiano Oct 25, 2023, 5:56 PM
LW: 26 AF: 15
11
AF
on: Thoughts on responsible scaling policies and regulation
Relatedly, I thought Managing AI Risks in an Era of Rapid Progress was great, particularly the clear statement that this is an urgent priority and the governance recommendations.
On a first reading I feel like I agree with most everything that was said, including about RSPs and the importance of regulation.
Small caveats: (i) I don’t know enough to understand the implications or comment on the recommendation “they should also hold frontier AI developers and owners legally accountable for harms from their models that can be reasonably foreseen and prevented,” (ii) “take seriously the possibility that generalist AI systems will outperform human abilities across many critical domains within this decade or the next” seems like a bit of a severe understatement that might undermine urgency (I think we should that possibility seriously over the next few years, and I’d give better than even odds that they will outperform humans across all critical domains within this decade or next), (iii) I think that RSPs / if-then commitments are valuable not just for bridging the period between now and when regulation is in place, but for helping accelerate more concrete discussions about regulation and building relevant infrastructure.
I’m a tiny bit nervous about the way that “autonomous replication” is used as a dangerous capability here and in other communications. I’ve advocated for it as a good benchmark task for evaluation and responses because it seems likely to be easier than almost anything catastrophic (including e.g. intelligence explosion, superhuman weapons R&D, organizing a revolution or coup...) and by the time it occurs there is a meaningful probability of catastrophe unless you have much more comprehensive evaluations in place. That said, I think most audiences will think it sounds somewhat improbable as a catastrophic risk in and of itself (and a bit science-fiction-y, in contrast with other risks like cybersecurity that also aren’t existential in-and-of-themselves but sound much more grounded). So it’s possible that while it makes a good evaluation target it doesn’t make a good first item on a list of dangerous capabilities. I would defer to people who have a better understanding of politics and perception, I mostly raise the hesitation because I think ARC may have had a role in how focal it is in some of these discussions.

paulfchristiano Oct 25, 2023, 4:11 AM
LW: 18 AF: 10
−2
AF
in reply to: Joe Collman’s comment on: Thoughts on responsible scaling policies and regulation
Unknown unknowns seem like a totally valid basis for concern.
But I don’t think you get to move the burden of proof by fiat. If you want action then you need to convince relevant actors they should be concerned about them, and that unknown unknowns can cause catastrophe before a lab will stop. Without further elaboration I don’t think “unknown unknowns could cause a catastrophe” is enough to convince governments (or AI developers) to take significant actions.
I think RSPs make this situation better by pushing developers away from vague “Yeah we’ll be safe” to saying “Here’s what we’ll actually do” and allowing us to have a conversation about whether that specific thing sufficient to prevent risk early enough. I think this is way better, because vagueness and equivocation make scrutiny much harder.
My own take is that there is small but non-negligible risk before Anthropic’s ASL-3. For my part I’d vote to move to a lower threshold, or to require more stringent protective measures when working with any system bigger than LLaMA. But I’m not the median voter or decision-maker here (nor is Anthropic), and so I’ll say my piece but then move on to trying to convince people or to find a compromise that works.

paulfchristiano Oct 25, 2023, 2:55 AM
LW: 133 AF: 61
75
AF
on: Lying is Cowardice, not Strategy
Here is a short post explaining some of my views on responsible scaling policies, regulation, and pauses I wrote it last week in response to several people asking me to write something. Hopefully this helps clear up what I believe.
I don’t think I’ve ever hidden my views about the dangers of AI or the advantages of scaling more slowly and carefully. I generally aim to give honest answers to questions and present my views straightforwardly. I often point out that catastrophic risk would be lower if we could coordinate to build AI systems later and slower; I usually caveat that doing so seems costly and politically challenging and so I expect it to require clearer evidence of risk.

paulfchristiano Oct 25, 2023, 2:14 AM
6 points
2
in reply to: Dave Orr’s comment on: Thoughts on responsible scaling policies and regulation
That’s fair, I think I misread you.
I guess our biggest differences are (i) I don’t think the takeaway depends so strongly on whether AI developers are trying to do the right thing—either way it’s up to all of us, and (ii) I think it’s already worth talking about ways which Anthropic’s RSP is good or bad or could be better, and so I disagree with “there’s probably not much to say at this point.”

paulfchristiano Oct 25, 2023, 1:29 AM
LW: 22 AF: 14
16
AF
in reply to: Dave Orr’s comment on: Thoughts on responsible scaling policies and regulation
But I also suspect that people on the more cynical side aren’t going to be persuaded by a post like this. If you think that companies are pretending to care about safety but really are just racing to make $$, there’s probably not much to say at this point other than, let’s see what happens next.
This seems wrong to me. We can say all kinds of things, like:
- Are these RSPs actually effective if implemented? How could they be better? (Including aspects like: how will this policy be updated in the future? What will happen given disagreements?)
- Is there external verification that they are implemented well?
- Which developers have and have not implemented effective and verifiable RSPs?
- How could employees, the public, and governments push developers to do better?
I don’t think we’re just sitting here and rolling a die about which is going to happen, path #1 or path #2. Maybe that’s right if you just are asking how much companies will do voluntarily, but I don’t think that should be the exclusive focus (and if it was there wouldn’t be much purpose to this more meta discussion). One of my main points is that external stakeholders can look at what companies are doing, discuss ways in which it is or isn’t adequate, and then actually push them to do better (and build support for government action to demand better). That process can start immediately, not at some hypothetical future time.

Thoughts on responsible scaling policies and regulation

paulfchristianoOct 24, 2023, 10:21 PM

221 points

33 comments6 min readLW link

paulfchristiano Sep 20, 2023, 7:27 PM
9 points
6
in reply to: Zach Stein-Perlman’s comment on: Anthropic’s Responsible Scaling Policy & Long-Term Benefit Trust
The role of the Trust is to elect (and potentially replace) board members; its formal power comes entirely from the fact that it will eventually elect a majority of the board seats.
The post mentions a “failsafe” where a supermajority of investors can amend this arrangement, which I think is a reasonable compromise. But I’m not aware of any public information about what that supermajority is, or whether there are other ways the Trust’s formal powers could be reduced.
Dylan Matthews reports the members of the board here: Dario, Daniela, Luke Meulhauser, and Yasmin Razavi. (I think it’s also listed plenty of other places.)

paulfchristiano Sep 5, 2023, 2:55 AM
LW: 4 AF: 4
0
AF
on: Prizes for matrix completion problems
We intend to leave this prize open until the end of September. At that point we will distribute prizes (probably just small prizes for useful arguments and algorithms, but no full solution).
I now pretty strongly suspect that the version of problem 1 with logarithmic dependence on $ε$ is not solvable. We would award a prize for an algorithm running in time $O (m n ε^{- 1})$ which can distinguish matrices with no PSD completion from those with a completion where the ratio of min to max eigenvalue is at least $ε$ . And of course a lower bound is still fair game.
That said, I don’t expect any new submissions to win prizes and so wouldn’t recommend that anyone start working on it.

paulfchristiano Aug 14, 2023, 4:51 PM
LW: 19 AF: 7
2
AF
in reply to: michaelcohen’s comment on: Thoughts on sharing information about language model capabilities
By process-based RL, I mean: the reward for an action doesn’t depend on the consequences of executing that action. Instead it depends on some overseer’s evaluation of the action, potentially after reading justification or a debate about it or talking with other AI assistants or whatever. I think this has roughly the same risk profile as imitation learning, while potentially being more competitive.
I’m generally excited and optimistic about coordination. If you are just saying that AI non-proliferation isn’t that much harder than nuclear non-proliferation, then I think I’m with you. But I think (i) it’s totally fair to call that “strong global coordination,” (ii) you would probably have to do a somewhat better job than we did of nuclear non-proliferation.
I think the technical question is usually going to be about how to trade off capability against risk. If you didn’t care about that at all, you could just not build scary ML systems. I’m saying that you should build smaller models with process-based RL.
It might be good to focus on legible or easy-to-enforce lines rather than just trading off capability vs risk optimally. But I don’t think that “no RL” is effective as a line—it still leaves you with a lot of reward-hacking (e.g. by planning against an ML model, or predicting what actions lead to a high reward, or expert iteration...). Trying to avoid all these things requires really tightly monitoring every use of AI, rather than just training runs. And I’m not convinced it helps significantly with deceptive alignment.
So in any event it seems like you are going to care about model size. “No big models” is also a way easier line to enforce. This is pretty much like saying “minimize the amount of black-box end-to-end optimization you do,” which feels like it gets closer to the heart of the issue.
If you are taking that approach, I think you would probably prefer to do process-based RL with smaller models, rather than imitation learning with bigger models (and will ultimately want to use outcomes in relatively safe ways). Yes it would be safer to use neither process-based RL nor big models, and just make your AI weaker. But the main purpose of technical work is to reduce how demanding the policy ask is—how much people are being asked to give up, how unstable the equilibrium is, how much powerful AI we can tolerate in order to help enforce or demonstrate necessity. Otherwise we wouldn’t be talking about these compromises at all—we’d just be pausing AI development now until safety is better understood.
I would quickly change my tune on this if e.g. we got some indication that process-based RL increased rather than decreased the risk of deceptive alignment at a fixed level of capability.

paulfchristiano Aug 10, 2023, 5:24 PM
LW: 5 AF: 3
0
AF
in reply to: michaelcohen’s comment on: Thoughts on sharing information about language model capabilities
It would be safest of all to just not build powerful AI for a very long time. But alas, that seems wildly uncompetitive and so would require some kind of strong global coordination (and would create considerable instability and leave significant value on the table for other worldviews).
It’s possible that “human-level AI with CoT” will be competitive enough, but I would guess not.
So to me the obvious approach is to use chain of thought and decomposition to improve performance, and then to distill the result back into the model.
You could try to do distillation with imitation learning. This is way more likely to be competitive then with no distillation at all.
But it still seems like it has a very good chance of being uncompetitive because the imitation objective significantly impairs performance and creates all kinds of artifacts. Using process-based RL for distillation seems like it has essentially the same safety profile to using imitation learning, while avoiding the obvious pathologies and having a much higher probability of being competitive.
(People give various reasons that RL in the distillation step is less safe than imitation learning in the distillation step, but so far I haven’t found anything at all persuasive.)
I think there’s still a good chance that process-based RL in the distillation step still can’t be competitive and so you need to talk about how to develop new techniques or prudently incorporate outcomes. But I think it’s at least much more likely to be competitive than CoT-only, or imitation learning in the distillation step. (Perhaps it cuts down the probability of deal-breaking uncompetitiveness by 30%, compared to using imitation learning alone for distillation.)

paulfchristiano Aug 9, 2023, 3:59 AM
8 points
3
in reply to: Zach Stein-Perlman’s comment on: Thoughts on sharing information about language model capabilities
My guess is that if you hold capability fixed and make a marginal move in the direction of (better LM agents) + (smaller LMs) then you will make the world safer. It straightforwardly decreases the risk of deceptive alignment, makes oversight easier, and decreases the potential advantages of optimizing on outcomes.
What links here?
- Which possible AI systems are relatively safe? by Zach Stein-Perlman (Aug 21, 2023, 5:00 PM; 42 points)
- Which paths to powerful AI should be boosted? by Zach Stein-Perlman (Aug 23, 2023, 4:00 PM; 5 points)

paulfchristiano Aug 1, 2023, 6:37 PM
LW: 21 AF: 10
17
AF
on: Thoughts on sharing information about language model capabilities
Note that Evals has just published a description of some of their work evaluating GPT-4 and Claude. Their publication does not include transcripts, the details of the LM agents they evaluated, or detailed qualitative discussion of the strengths and weaknesses of the agents they evaluated. I believe that eventually Evals should be considerably more liberal about sharing this kind of information; my post is explaining why I believe that.

paulfchristiano

Thoughts on re­spon­si­ble scal­ing poli­cies and regulation

Thoughts on responsible scaling policies and regulation