Lauro Langosco
I agree in principle that labs have the responsibility to dispel myths about what they’re committed to. OTOH, in defense of the labs I imagine that this can be hard to do while you’re in the middle of negotiations with various AISIs about what those commitments should look like.
The argument I think is good (nr (2) in my previous comment) doesn’t go through reference classes at all. I don’t want to make an outside-view argument (eg “things we call optimization often produce misaligned results, therefore sgd is dangerous”). I like the evolution analogy because it makes salient some aspects of AI training that make misalignment more likely. Once those aspects are salient you can stop thinking about evolution and just think directly about AI.
evolution does not grow minds, it grows hyperparameters for minds.
Imo this is a nitpick that isn’t really relevant to the point of the analogy. Evolution is a good example of how selection for X doesn’t necessarily lead to a thing that wants (‘optimizes for’) X; and more broadly it’s a good example for how the results of an optimization process can be unexpected.
I want to distinguish two possible takes here:
The argument from direct implication: “Humans are misaligned wrt evolution, therefore AIs will be misaligned wrt their objectives”
Evolution as an intuition pump: “Thinking about evolution can be helpful for thinking about AI. In particular it can help you notice ways in which AI training is likely to produce AIs with goals you didn’t want”
It sounds like you’re arguing against (1). Fair enough, I too think (1) isn’t a great take in isolation. If the evolution analogy does not help you think more clearly about AI at all then I don’t think you should change your mind much on the strength of the analogy alone. But my best guess is that most people incl Nate mean (2).
I’m not saying that GPT-4 is lying to us—that part is just clarifying what I think Matthew’s claim is.
Re cauldron: I’m pretty sure MIRI didn’t think that. Why would they?
I think the specification problem is still hard and unsolved. It looks like you’re using a different definition of ‘specification problem’ / ‘outer alignment’ than others, and this is causing confusion.
IMO all these terms are a bit fuzzy / hard to pin down, and so it makes sense that they’d lead to disagreement sometimes. The best way (afaict) to avoid this is to keep the terms grounded in ‘what would be useful for avoiding AGI doom’? To me it looks like on your definition, outer alignment is basically a trivial problem that doesn’t help alignment much.
More generally, I think this discussion would be more grounded / useful if you made more object-level claims about how value specification being solved (on your view) might be useful, rather than meta claims about what others were wrong about.
Do you have an example of one way that the full alignment problem is easier now that we’ve seen that GPT-4 can understand & report on human values?
(I’m asking because it’s hard for me to tell if your definition of outer alignment is disconnected from the rest of the problem in a way where it’s possible for outer alignment to become easier without the rest of the problem becoming easier).
I think it’s false in the sense that MIRI never claimed that it would be hard to build an AI with GPT-4 level understanding of human values + GPT-4 level of willingness to answer honestly (as far as I can tell). The reason I think it’s false is mostly that I haven’t seen a claim like that made anywhere, including in the posts you cite.
I agree lots of the responses elide the part where you emphasize that it’s important how GPT-4 doesn’t just understand human values, but is also “willing” to answer questions somewhat honestly. TBH I don’t understand why that’s an important part of the picture for you, and I can see why some responses would just see the “GPT-4 understands human values” part as the important bit (I made that mistake too on my first reading, before I went back and re-read).
It seems to me that trying to explain the original motivations for posts like Hidden Complexity of Wishes is a good attempt at resolving this discussion, and it looks to me as if the responses from MIRI are trying to do that, which is part of why I wanted to disagree with the claim that the responses are missing the point / not engaging productively.
I think maybe there’s a parenthesis issue here :)
I’m saying “your claim, if I understand correctly, is that MIRI thought AI wouldn’t (understand human values and also not lie to us)”.
I think we agree—that sounds like it matches what I think Matthew is saying.
You make a claim that’s very close to that—your claim, if I understand correctly, is that MIRI thought AI wouldn’t understand human values and also not lie to us about it (or otherwise decide to give misleading or unhelpful outputs):
The key difference between the value identification/specification problem and the problem of getting an AI to understand human values is the transparency and legibility of how the values are represented: if you solve the problem of value identification, that means you have an actual function that can tell you the value of any outcome (which you could then, hypothetically, hook up to a generic function maximizer to get a benevolent AI). If you get an AI that merely understands human values, you can’t necessarily use the AI to determine the value of any outcome, because, for example, the AI might lie to you, or simply stay silent.
I think this is similar enough (and false for the same reasons) that I don’t think the responses are misrepresenting you that badly. Of course I might also be misunderstanding you, but I did read the relevant parts multiple times to make sure, so I don’t think it makes sense to blame your readers for the misunderstanding.
My paraphrase of your (Matthew’s) position: while I’m not claiming that GPT-4 provides any evidence about inner alignment (i.e. getting an AI to actually care about human values), I claim that it does provide evidence about outer alignment being easier than we thought: we can specify human values via language models, which have a pretty robust understanding of human values and don’t systematically deceive us about their judgement. This means people who used to think outer alignment / value specification was hard should change their minds.
(End paraphrase)
I think this claim is mistaken, or at least it rests on false assumptions about what alignment researchers believe. Here’s a bunch of different angles on why I think this:
- My guess is a big part of the disagreement here is that I think you make some wrong assumptions about what alignment researchers believe.
- I think you’re putting a bit too much weight on the inner vs outer alignment distinction. The central problem that people talked about was always how to get an AI to care about human values. E.g. in The Hidden Complexity of Wishes (THCW) Eliezer writes:
To be a safe fulfiller of a wish, a genie must share the same values that led you to make the wish.
If you find something that looks to you like a solution to outer alignment / value specification, but it doesn’t help make an AI care about human values, then you’re probably mistaken about what actual problem the term ‘value specification’ is pointing at. (Or maybe you’re claiming that value specification is just not relevant to AI safety—but I don’t think you are?).
- It was always possible to attempt to solve the value specification problem by just pointing at a human. The fact that we can now also point at an LLM and get a result that’s not all that much worse than pointing at a human is not cause for an update about how hard value specification is. Part of the difficulty is how to define the pointer to the human and get a model to maximize human values rather than maximize some error in your specification. IMO THCW makes this point pretty well.
- It’s tricky to communicate problems in AI alignment: people come in with lots of different assumptions about what kind of things are easy / hard, and it’s hard to resolve disagreements because we don’t have a live AGI to do experiments on. I think THCW and the related essays you criticize are actually great resources. They don’t try to communicate the entire problem at once because that’s infeasible. The fact that human values are complex and hard to specify explicitly is part of the reason why alignment is hard, where alignment means getting the AI to care about human values, not getting an AI to answer questions about moral behavior reasonably.
- You claim the existence of GPT-4 is evidence against the claims in THCW. But IMO GPT-4 fits in neatly with THCW. The post even starts with a taxonomy of genies:
There are three kinds of genies: Genies to whom you can safely say “I wish for you to do what I should wish for”; genies for which no wish is safe; and genies that aren’t very powerful or intelligent.
GPT-4 is an example of a genie that is not very powerful or intelligent.
- If in 5 years we build firefighter LLMs that can rescue mothers from burning buildings when you ask them to, that would also not show that we’ve solved value specification: it’s just a didactic example, not a full description of the actual technical problem. More broadly, I think it’s plausible that within a few years LLMs will be able to give moral counsel far better than the average human. That still doesn’t solve value specification any more than the existence of humans that could give good moral counsel 20 years ago had solved value specification.
- If you could come up with a simple action-value function Q(observation, action) that, when maximized over actions, yields a good outcome for humans, then I think that would probably be helpful for alignment. This is an example of a result that doesn’t directly make an AI care about human values, but would probably lead to progress in that direction. I think if it turned out to be easy to formalize such a Q then I would change my mind about how hard value specification is.
- While language models understand human values to some extent, they aren’t robust. The RLHF/RLAIF family of methods is based on using an LLM as a reward model, and to make things work you need to be careful not to optimize too hard or you’ll just get gibberish (Gao et al. 2022). LLMs don’t hold up against mundane RLHF optimization pressure, never mind an actual superintelligence. (Of course, humans wouldn’t hold up either).
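To make the over-optimization point in the last bullet concrete, here is a toy sketch of best-of-n selection against an imperfect reward model. This is not the setup from Gao et al.; the error model (mostly small noise, occasionally a large exploitable error), the function names, and all numbers are assumptions I made up for illustration.

```python
import random


def sample_candidates(n, seed=0):
    """Return n hypothetical candidates as (true_value, proxy_score) pairs."""
    rng = random.Random(seed)
    candidates = []
    for _ in range(n):
        true_value = rng.gauss(0.0, 1.0)
        # Assumed error model: the proxy is usually close to the truth,
        # but about 1% of candidates get a large, exploitable bonus.
        if rng.random() < 0.01:
            error = rng.uniform(5.0, 10.0)
        else:
            error = rng.gauss(0.0, 0.3)
        candidates.append((true_value, true_value + error))
    return candidates


def best_of_n(candidates, n):
    """Pick the candidate with the highest proxy score among the first n."""
    return max(candidates[:n], key=lambda c: c[1])


candidates = sample_candidates(10_000)
for n in (1, 10, 100, 1_000, 10_000):
    true_value, proxy_score = best_of_n(candidates, n)
    print(f"n={n:>6}  proxy={proxy_score:6.2f}  true={true_value:6.2f}")
```

Because the candidate sets are nested, the proxy score of the selected candidate can only go up as n grows. Nothing forces its true value to follow: the harder you select, the more likely you are to pick a candidate whose score comes from the proxy’s error rather than from actual quality.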
(Newbie guest fund manager here) My impression is there are plans re individuals but they’re not very developed or put into practice yet. AFAIK there are currently no plans to fundraise from companies or governments.
IMO a good candidate is anything that is object-level useful for X-risk mitigation. E.g. technical alignment work, AI governance / policy work, biosecurity, etc.
Broadly agree with the takes here.
However, these results seem explainable by the widely-observed tendency of larger models to learn faster and generalize better, given equal optimization steps.
This seems right and I don’t think we say anything contradicting it in the paper.
I also don’t see how saying ‘different patterns are learned at different speeds’ is supposed to have any explanatory power. It doesn’t explain why some types of patterns are faster to learn than others, or what determines the relative learnability of memorizing versus generalizing patterns across domains. It feels like saying ‘bricks fall because it’s in a brick’s nature to move towards the ground’: both are repackaging an observation as an explanation.
The idea is that the framing ‘learning at different speeds’ lets you frame grokking and double descent as the same thing. More like generalizing ‘bricks move towards the ground’ and ‘rocks move towards the ground’ to ‘objects move towards the ground’. I don’t think we make any grand claims about explaining everything in the paper, but I’ll have a look and see if there are edits I should make. Thanks for raising these points.
There are positive feedback loops between prongs:
Successfully containing & using more capable models (p1) gives you more scary demos for p2
Success in p1 also speeds up p3 a lot, because:
1) You can empirically study AGI directly,
2) Very advanced but “narrow” AI tools accelerate research (“narrow” here still means maybe more general than GPT-4)
3) Maybe you can even have (proto-)AGIs do research for you
You definitely need a lot of success in p2 for anything to work, otherwise people will take all the useful work we can get from proto-AGIs and pour it into capabilities research.
Better alignment research (p3) lets you do more p1 type risky stuff with SOTA models (on the margin)
If p1 is very successful, maybe we can punt most of p3 to the AIs; conversely, if p1 seems very hard then we probably only get ‘narrow’ tools to help with p3 and need to mostly do it ourselves, and hopefully get ML researchers to delay for long enough.
A three-pronged approach to AGI safety. (This is assuming we couldn’t just avoid building AGI or proto-AGIs at all until say ~2100, which would of course be much better).
Prong 1: boxing & capability control (aka ‘careful bootstrapping’)
Make the AGI as capable as possible, under the constraint that you can make sure it can’t break out of the box or do other bad stuff.
Assume the AGI is misaligned. Be super paranoid
Goal: get useful work out of boxed AGIs.
For example, AIs might be able to do interpretability really well.
More generally, for any field there is an effective 2x-10x research productivity boost you can probably get, depending on the field (maybe 100x? unclear, I’m sceptical).
Goal: learn more about AGI by studying the systems you’ve built.
To test your security & oversight procedures, you run tests where you train your AGI to escape in a sandbox.
This is operationally hard and needs lots of buy-in
Research directions: scalable oversight, interpretability for oversight, auditing, adversarial training, capability control / “unlearning”, scaling laws & capabilities forecasting.
Prong 2: scary demos and convincing people that AGI is dangerous
Goal 1: shut it all down, or failing that slow down capabilities research.
Goal 2: get operational & political support for the entire approach, which is going to need lots of support, esp first prong
In particular make sure that research productivity boosts from AGI don’t feed back into capabilities research, which requires high levels of secrecy + buy-in from a large number of people.
Avoiding a speed-up is probably a little bit easier than enacting a slow-down, though maybe not much easier.
Demos can get very scary if we get far into prong 1, e.g. we have AGIs that are clearly misaligned or show that they are capable of breaking many of our precautions.
Prong 3: alignment research aka “understanding minds”
Goal: understand the systems well enough to make sure they are at least corrigible, or at best ‘fully aligned’.
Roughly this involves understanding how the behaviour of the system emerges in enough generality that we can predict and control what happens once the system is deployed OOD, made more capable, etc.
Relevant directions: agent foundations / embedded agency, interpretability, some kinds of “science of deep learning”
whether or not this is the safest path, important actors seem likely to act as though it is
It’s not clear to me that this is true, and it strikes me as maybe overly cynical. I get the sense that people at OpenAI and other labs are receptive to evidence and argument, and I expect us to get a bunch more evidence about takeoff speeds before it’s too late. I expect people’s takes on AGI safety plans to evolve a lot, including at OpenAI. Though TBC I’m pretty uncertain about all of this; it’s definitely possible that you’re right here.
Whether or not this is the safest path, the fact that OpenAI thinks it’s true and is one of the leading AI labs makes it a path we’re likely to take. Humanity successfully navigating the transition to extremely powerful AI might therefore require successfully navigating a scenario with short timelines and slow, continuous takeoff.
You can’t just choose “slow takeoff”. Takeoff speeds are mostly a function of the technology, not company choices. If we could just choose to have a slow takeoff, everything would be much easier! Unfortunately, OpenAI can’t just make their preferred timelines & “takeoff” happen. (Though I agree they have some influence, mostly in that they can somewhat accelerate timelines).
There are less costly, more effective steps to reduce the underlying problem, like making the field of alignment 10x larger or passing regulation to require evals
IMO neither making the field of alignment 10x larger nor requiring evals solves a big part of the problem, while indefinitely pausing AI development would. I agree it’s much harder, but I think it’s good to at least try, as long as it doesn’t terribly hurt less ambitious efforts (which I think it doesn’t).
Yeah fair point. I do think labs have some nonzero amount of responsibility to be proactive about what others believe about their commitments. I agree it doesn’t extend to ‘rebut every random rumor’.