Excellent post! This is what we need more of on LW.
It was indeed long. I needed a drink and snack just to read it, and a few to respond point-by-point. Like your comment that sparked this post, my reply should become the basis of a full post—if and when I get the time. Thanks for sparking it.
I’ve tried to make each comment relatively self-contained for ease of reading.
Response to Lethality 3: We need to get alignment right on the ‘first critical try’
I found this an unfortunate starting point, because I found this claim the least plausible of any you make here. I’m afraid this relatively dramatic and controversial claim might’ve stopped some people from reading the rest of this excellent post.
Extended quote from a Beren post (I didn’t understand where you were quoting from prior to googling for it.)
Summary: We can use synthetic data to create simulated worlds and artificial honeypots.
I don’t think we can, and worse, I don’t think we would even if we could—it sounds really hard to do well enough.
I find this wildly improbable if we’re talking about similar levels of human and AGI intelligence trying to set up a simulated reality through synthetic data and convincing honeypots. It would be like The Truman Show, or like humans trying to secure software systems against other humans: you’d have to think of everything they can think of in order to secure a system or fool someone with a simulated reality. Even a team of dedicated people can’t secure a system, nor would they be able to create a self-consistent reality for someone of similar intelligence. There are just too many places you could go wrong.
This helps clarify the way I’m framing my alignment optimistic claims: you can detect misaligned AGI only if the AGI is dumber than you (or maybe up to similar level if you have good interpretability measures it doesn’t know about). That alignment can persist through and past human level as long as it’s a relatively continuous progression, so that at beyond human level, the system itself can advise you on keeping it aligned at the next level/progression of intelligence.
To frame it differently: the alignment tax on creating a simulated self-consistent reality for a human-level AGI sounds way too high.
Maybe I’m misunderstanding; if I am, others probably will too.
I think we do need to get it right on a first critical try; I just think we can do that, and even for a few similar first tries.
Response to Lethality 6:
I agree with you that a pivotal act won’t be necessary to prevent unaligned AGI (except as AGIs proliferate to >100, someone might screw it up).
I think a pivotal act will be necessary to prevent unaligned humans using their personal intent aligned AGIs to seize control of the future for selfish purposes.
After our exchange on that topic on If we solve alignment, do we die anyway?, I agreed with your logic that exponential increase of material resources would be available to all AGIs. But I still think a pivotal act will be necessary to avoid doom. Telling your AGI to turn the moon into compute and robots as rapidly as possible (the fully exponential strategy) to make sure it could defend against any other AGI would be a lot like the pivotal act we’re discussing (and would count as one by the original definition). There would be other AGIs, but they’d be powerless against the sovereign or coalition of AGIs that exerted authority using their lead in exponential production. This wouldn’t happen in the nice multipolar future people envision in which people control lots of AGIs and they do wonderful things without anyone turning the moon into an army to control the future.
This is a separate issue from the one Hubinger addressed in Homogeneity vs. heterogeneity in AI takeoff scenarios (where the link is wrong; it just leads back to this post). There he’s saying that if the first AGI is successfully aligned, the following ones probably will be too. I agree with this. But he’s not addressing the risks of personal intent aligned AGIs under different humans’ control.
My final comment from that thread:
Ah—now I see your point. This will help me clarify my concern in future presentations, so thanks!
My concern is that a bad actor will be the first to go all-out exponential. Other, better humans in charge of AGI will be reluctant to turn the moon, much less the earth, into military/industrial production, and to upend the power structure of the world. The worst actors will, by default, be the first to go full exponential and ruthlessly offensive. Beyond that, I’m afraid the physics of the world does favor offense over defense. It’s pretty easy to release a lot of energy where you want it, and very hard to build anything that can withstand a nuke, let alone a nova. But the dynamics are more complex than that, of course. So I think the reality is unknown. My point is that this scenario deserves some more careful thought.
Response to AI fragility claims
Here I agree with you entirely. I think it’s fair to say that AGI isn’t safe by default, but the reasons you give and that excellent comment you quote show why safety of an AGI is readily achievable with reasonable care.
Response to Lethality 10
Your argument that alignment generalizes farther than capabilities is quite interesting. I’m not sure I’d make a claim that strong, but I do think alignment generalizes about as far as capabilities—both quite far once you hit actual reasoning or sapience and understanding.
I do worry about The alignment stability problem WRT long-term alignment generalization. I think reflective stability will probably prove adequate for superhuman AGI’s alignment stability, but I’m not nearly sure enough to want to launch a value-aligned AGI even if I thought initial alignment would work.
Response to Lethality 11
Back to whether a pivotal act is necessary. Same as #6 above—agree that it’s not necessary to prevent misaligned AGI, think it is to prevent misaligned humans with AGIs aligned to follow their instructions.
Response to Lethality 15
15. Fast capability gains seem likely, and may break lots of previous alignment-required invariants simultaneously.
Here I very much agree with that statement from the original LoL, while also largely agreeing with your reasoning for why we can overcome that problem. I expect fast capability gains when we reach “Real AGI” that can improve its reasoning and knowledge on its own without retraining. And I expect its alignment properties to be somewhat different than “aligning” a tool LLM that doesn’t have coherent agency. But I expect the synthetic data approach and instruction-following as the core tenet of that new entity to establish reflective stability, which will help alignment if it’s approximately on target.
Response to Lethalities 16 and 17
I agree that the analogy with evolution doesn’t go very far, since we’re a lot smarter and have much better tools for alignment than evolution. We don’t have its ability to do trial and error to a staggering level, but this analogy only goes so far as to say we won’t get alignment right with only a shoddy, ill-considered attempt. Even OpenAI is going to at least try to do alignment, and put some thought into it. So the arguments for different techniques have to be taken on their merits.
I also think we won’t rely heavily on RL for alignment, as this and other Lethalities assume. I expect us to lean heavily on Goals selected from learned knowledge: an alternative to RL alignment, for instance, by putting the prompt “act as an agent following instructions from users (x)” at the heart of our LLM agent proto-AGI.
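To make that concrete, here’s a minimal sketch of what I mean, with hypothetical placeholder names (call_llm and agent_step are illustrative, not any real API): the alignment target lives in a prompt interpreted through the model’s learned knowledge, not in a separately trained reward signal.

```python
# Minimal sketch of an LLM agent whose core goal is stated in the prompt
# rather than trained in via RL. All names here (call_llm, agent_step) are
# hypothetical placeholders, not a real API.

CORE_PROMPT = (
    "You are an agent. Your single overriding goal is to follow "
    "instructions from the authorized user, and to check with them "
    "before taking any consequential action."
)

def call_llm(system_prompt: str, messages: list[dict]) -> str:
    """Placeholder for whatever foundation model the agent is built on."""
    raise NotImplementedError

def agent_step(history: list[dict], user_instruction: str) -> str:
    # The instruction-following goal is re-asserted at every step, so the
    # agent's "values" come from what the words in CORE_PROMPT mean to the
    # model's learned knowledge, not from a separately trained reward model.
    history.append({"role": "user", "content": user_instruction})
    return call_llm(CORE_PROMPT, history)
```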
Response to Lethalities 17 & 18
Agreed; language points to objects in the world adequately for humans that use it carefully, so I expect it to also be adequate for superhuman AGI that’s taking instructions from and so working in close collaboration with intelligent, careful humans.
Response to Lethality 21
You say, and I very much agree:
The key is that data on values is what constrains the choice of utility functions, and while values aren’t in physics, they are in human books and languages, and I’ve explained why alignment generalizes further than capabilities above.
Except I expect this to be used to reference what humans mean, not what they value. I expect do-what-I-mean-and-check or instruction-following alignment strategies, and am not sure that full value alignment would work were it attempted in this manner. For that and other reasons, I expect instruction-following as the alignment target for all early AGI projects.
Response to Lethality 22
I halfway agree and find the issues you raise fascinating, but I’m not sure they’re relevant to alignment.
I think that there is actually a simple core of alignment to human values, and a lot of the reasons for why I believe this is because I believe about 80-90%, if not more of our values is broadly shaped by the data, and not the prior, and that the same algorithms that power our capabilities is also used to influence our values, though the data matters much more than the algorithm for what values you have. More generally, I’ve become convinced that evopsych was mostly wrong about how humans form values, and how they get their capabilities in ways that are very alignment relevant.
I think evopsych was only wrong about how humans form values as a result of how very wrong they were about how humans get their capabilities.
Thus, I halfway agree that people get their values largely from the environment. I think we get our values both from the environment and from the way evolution designed our drives. They interact in complex ways that make both absolutely critical for the resulting values.
I also disbelieve the claim that humans had a special algorithm that other species don’t have, and broadly think human success was due to more compute, data and cultural evolution.
After 2+ decades of studying how human brains solve complex problems, I completely agree. We have the same brain plan as other mammals, we just have more compute and way better data (from culture, and our evolutionary drive to pay close attention to other humans).
So, what is the “simple core” of human values you mention? Is it what people have written about human values? I’d pretty much agree that that’s a usable core, even if it’s not simple.
Response to Lethality 23
Corrigibility as anti-natural.
I very much agree with Eliezer that corrigibility as an extra add-on, the way he defines it, is anti-natural. But you can get corrigibility in a natural way by making corrigibility (correctability) itself, or the closely related instruction-following, the single or primary alignment target.
To your list of references, including my own, I’ll add one:
I think Max Harms’ Corrigibility as Singular Target sequence is the definitive work on corrigibility in all its senses.
Response to Lethality 24
I very much agree with Yudkowsky’s framing here:
24. There are two fundamentally different approaches you can potentially take to alignment, which are unsolvable for two different sets of reasons; therefore, by becoming confused and ambiguating between the two approaches, you can confuse yourself about whether alignment is necessarily difficult.
I just wrote about this critical point in Conflating value alignment and intent alignment is causing confusion.
I agree with Yud that a CEV sovereign is not a good idea for a first attempt. As explained in 23 above, I very much agree with you and disagree with Yudkowsky that the corrigibility approach is also doomed.
Response to Lethality 25 - we don’t understand networks
I agree that interpretability has a lot of work to do before it’s useful, but I think it only needs to solve one problem: is the AGI deliberately lying?
Responses to Lethalities 28 and 29 - we can’t check whether the outputs of a smarter than human AGI are aligned
Here I completely agree—misaligned superhuman AGI would make monkeys of us with no problem, even if we did box it and check its outputs—which we won’t.
That’s why we need a combination of having it tell us honestly about its motivations and thoughts (by instructing our personal intent aligned AGI to do so) and interpretability to discover when it’s lying.
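As a rough illustration of the kind of lie detection I mean, here is a hedged sketch of a simple linear probe over hidden activations; the activations and labels below are random stand-ins, and a real probe would use activations extracted from the model being audited.

```python
# Illustrative sketch of a deception probe: train a linear classifier on
# hidden-layer activations from prompts where we know whether the model's
# statement was true or false, then use it to flag likely lying.
# The activations here are random stand-ins, not real model data.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
d_model = 512
honest_acts = rng.normal(0.0, 1.0, size=(200, d_model))
deceptive_acts = rng.normal(0.3, 1.0, size=(200, d_model))  # pretend separable offset

X = np.vstack([honest_acts, deceptive_acts])
y = np.array([0] * 200 + [1] * 200)

probe = LogisticRegression(max_iter=1000).fit(X, y)

def flag_possible_lie(activation: np.ndarray, threshold: float = 0.8) -> bool:
    """Flag an output for human review if the probe is confident it's deceptive."""
    return probe.predict_proba(activation.reshape(1, -1))[0, 1] > threshold
```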
Response to Lethality 32 - language isn’t a complete representation of underlying thoughts, so LLMs won’t reach AGI
Language has turned out to be a shockingly good training set. So LLMs will probably enable AGI, with a little scaffolding and additional cognitive systems (each reliant on the strengths of LLMs) to turn them into language model cognitive architectures. See Capabilities and alignment of LLM cognitive architectures. That year-old post captures only part of my detailed reasons for thinking such brain-inspired synthetic cognitive architectures will achieve proto-AGI relatively soon; I haven’t spelled out the rest, because I’m not sure enough that they’re our best chance at alignment to risk advancing capabilities.
WRT the related but separate issue of language being an adequate reflection of their underlying thoughts to allow alignment and transparency:
It isn’t, except if you want it to be and work to make sure it’s communicating your thoughts well enough.
There’s some critical stuff here about whether we apply enough RL or other training pressures to foundation models/agents to make their use of language not approximately reflect their underlying thoughts (and prevent translucent thoughts).
Response to Lethality 39
39. I figured this stuff out using the null string as input,
Yudkowsky may very well be the smartest human whose thought I’ve personally encountered. That doesn’t make him automatically right and the rest of us wrong. Expertise, time-on-task, and breadth of thought all count for a lot, far more in sum than sheer ability-to-juggle-concepts. Arguments count way more than appeals to authority (although those should be taken seriously too, and I definitely respect Yud’s authority on the topic).
Conclusion: we already have adequate alignment techniques to create aligned AGI
I agree, with the caveat that I think we can pull off personal intent alignment (corrigibility or instruction-following) but not value alignment (CEV or similar sovereign AGI).
And I’m not sure of that, so I really wish we could get more people examining this, because people are going to try these alignment techniques whether or not they work.
Excellent post! This is what we need more of in the alignment community: closely examining proposed alignment techniques.
Some thoughts on your excellent comment:
First, I fixed the link issue you saw.
I think the potential difference between you and me on whether synthetic data works to box an AI is whether the AI notices it’s in a simulation made via synthetic data. I also think it’s not intended to be applied post-training; instead it’s applied continuously throughout the training process.
I agree that if we had an AGI/ASI that was already misaligned, we’d have to take pretty extreme actions like mindwiping its memories and restarting the entire training process, but the point of synthetic data is to get it into a basin of alignment/corrigibility early on, before it can be deceptive.
I also think that real data will only be given to AGIs at the end of training, as a way to ground them, so the AI has no real way to know whether it’s being subtly changed in training or whether it’s in reality, since we control its data sources.
Controlling an AI’s data sources is a powerful way to control its values and capabilities, which is why I think the tax for synthetic-data alignment is actually pretty low.
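As a rough sketch of the kind of data control I have in mind (the dataset names, ordering fraction, and build_curriculum function are all hypothetical, just to show the shape of the idea):

```python
# Minimal sketch of the curriculum idea: the model sees only synthetic,
# curated data for most of training, and grounding data from the real world
# only in a short final phase. All names here are hypothetical placeholders.

def build_curriculum(synthetic_alignment_data, synthetic_world_data,
                     real_world_data, real_fraction=0.05):
    """Order training data so values/instruction-following data comes first,
    simulated-world data next, and real data only at the very end."""
    curriculum = []
    curriculum += synthetic_alignment_data      # values & instruction-following first
    curriculum += synthetic_world_data          # simulated worlds / honeypots
    n_real = int(real_fraction * len(curriculum))
    curriculum += real_world_data[:n_real]      # small, late dose of grounding data
    return curriculum
```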
Re Response to Lethality 6, I’m honestly coming around to your position as I think about it more, at least to the extent that I think your arguments are plausible and we need more research on this.
Re Response to Lethality 10, I was relying on both empirical evidence from today’s models and some theoretical reasons for why the phenomenon of alignment generalizing further than capabilities exists in general.
On the alignment stability problem, I like the post, and we should plausibly do interventions to stabilize alignment once we get it.
Re Response to Lethality 15, I agree that fast capability progress will happen, but I deny the implication: large synthetic datasets on values/instruction-following will already be in the AI when the fast capabilities progress happens, since synthetic data about values is pretrained in very early, and I’m also more optimistic about alignment generalization than you are.
I liked your Real AGI post by the way.
Response to Lethalities 16 and 17 I agree that the analogy with evolution doesn’t go very far, since we’re a lot smarter and have much better tools for alignment than evolution. We don’t have its ability to do trial and error to a staggering level, but this analogy only goes so far as to say we won’t get alignment right with only a shoddy, ill-considered attempt. Even OpenAI is going to at least try to do alignment, and put some thought into it. So the arguments for different techniques have to be taken on their merits.
One concrete way we have better tools than evolution is that we have far more control over what AIs’ data sources are, and more generally far more inspectability and controllability over their data, especially the synthetic kind. This means we don’t have to create very realistic simulations: for all the AI knows, we might be elaborately fooling it to reveal itself, and until the very end of training it probably doesn’t even have specific data on our reality.
Re Response to Lethality 21, you are exactly correct on what I meant.
Re Response to Lethality 22, on this:
I think evopsych was only wrong about how humans form values as a result of how very wrong they were about how humans get their capabilities.
You’re not wrong that they got human capabilities very wrong; see this post for details:
https://www.lesswrong.com/posts/9Yc7Pp7szcjPgPsjf/the-brain-as-a-universal-learning-machine
But I’d also argue that this has implications for how complex human values actually are.
On this:
Thus, I halfway agree that people get their values largely from the environment. I think we get our values both from the environment and from the way evolution designed our drives. They interact in complex ways that make both absolutely critical for the resulting values.
Yeah, this seems like a crux. I think a lot of how values are learned is via quite weak priors from evolutionary drives (the big one probably being which algorithm we have for being intelligent), and I put far more weight on socialization/environment data than you do: closer to 5-10% evolution at best, with 85-90% of our values determined by data and culture.
But at any rate, since AIs will be influenced by their data a lot, this means that it’s tractable to influence their values.
After 2+ decades of studying how human brains solve complex problems, I completely agree. We have the same brain plan as other mammals, we just have more compute and way better data (from culture, and our evolutionary drive to pay close attention to other humans).
Agree with this mostly, though culture is IMO the best explanation for why humans succeed.
So, what is the “simple core” of human values you mention? Is it what people have written about human values? I’d pretty much agree that that’s a usable core, even if it’s not simple.
Yes, I am talking about what people have written about human values, but I’m also talking about future synthetic data where we write about what values we want the AI to have, and I’m also talking about reward information as a simple core.
One of my updates from Constitutional AI and GPT-4 handling what we value pretty well is that the claim that value is complicated is mostly untrue. In general I updated hard against evopsych explanations of what humans value, how humans got their capabilities, and more, since the data we got is very surprising under evopsych hypotheses and less surprising under Universal Learning Machine hypotheses.
I agree with all of this for the record:
I very much agree with Yudkowsky’s framing here:
24. There are two fundamentally different approaches you can potentially take to alignment, which are unsolvable for two different sets of reasons; therefore, by becoming confused and ambiguating between the two approaches, you can confuse yourself about whether alignment is necessarily difficult.
I agree with Yud that a CEV sovereign is not a good idea for a first attempt. As explained in 23 above, I very much agree with you and disagree with Yudkowsky that the corrigibility approach is also doomed.
Re Response to Lethalities 28 and 29, I think you meant that you totally disagree, and I agree we’d probably be boned if a misaligned AGI/ASI were running loose. But my point is that the verification/generation gap that pervades so many fields is also likely to apply to alignment research, where it’s easier to verify that research is correct than to do it yourself.
Re response to Lethality 32:
WRT the related but separate issue of language being an adequate reflection of their underlying thoughts to allow alignment and transparency:
It isn’t, except if you want it to be and work to make sure it’s communicating your thoughts well enough.
There’s some critical stuff here about whether we apply enough RL or other training pressures to foundation models/agents to make their use of language not approximately reflect their underlying thoughts (and prevent translucent thoughts).
There is definitely a chance that RL or other processes make their use of language diverge more from their thoughts, so I’m a little worried about that, but I do think that AI words do convey their thoughts, at least for current LLMs.
39. I figured this stuff out using the null string as input,
Yudkowsky may very well be the smartest human whose thought I’ve personally encountered. That doesn’t make him automatically right and the rest of us wrong. Expertise, time-on-task, and breadth of thought all count for a lot, far more in sum than sheer ability-to-juggle-concepts. Arguments count way more than appeals to authority (although those should be taken seriously too, and I definitely respect Yud’s authority on the topic).
I think my difference wrt you is I consider his models and arguments for AI doom essentially irreparable for the most part due to reality invalidating his core assumptions of how AGIs/ASIs work, and also how human capabilities and values work and are learned, so I don’t think Yud’s authority on the topic earns him any epistemic points.
My point was basically that you cannot figure out anything using a null string as input, for the same reason you cannot update on no evidence of something happening as if there was evidence for something happening.
Agree with the rest of it though.
Thanks for your excellent comment, which gave me lots of food for thought.