From briefly talking to Eliezer about this the other day, I think the story from MIRI’s perspective is more like:
Back in 2001, we defined “Friendly AI” as “The field of study concerned with the production of human-benefiting, non-human-harming actions in Artificial Intelligence systems that have advanced to the point of making real-world plans in pursuit of goals.”
We could have defined the goal more narrowly or generically than that, but that just seemed like an invitation to take your eye off the ball: if we aren’t going to think about the question of how to get good long-run outcomes from powerful AI systems, who will?
And many of the technical and philosophical problems seemed particular to CEV, which seemed like an obvious sort of solution to shoot for: just find some way to leverage the AI’s intelligence to solve the problem of extrapolating everyone’s preferences in a reasonable way, and of aggregating those preferences fairly.
Come 2014, Stuart Russell and MIRI were both looking for a new term to replace “the Friendly AI problem”, now that the field was starting to become a Real Thing. Both parties disliked Bostrom’s “the control problem”. In conversation, Russell proposed “the alignment problem”, and MIRI liked it, so Russell and MIRI both started using the term in public.
Unfortunately, it gradually came to light that Russell and MIRI had understood “Friendly AI” to mean two moderately different things, and this disconnect now turned into a split between how MIRI used “(AI) alignment” and how Russell used “(value) alignment”. (Which I think also influenced the split between Paul Christiano’s “(intent) alignment” and MIRI’s “(outcome) alignment”.)
Russell’s version of “friendliness/alignment” was about making the AI have good, human-deferential goals. But Creating Friendly AI 1.0 had been very explicit that “friendliness” was about good behavior, regardless of how that’s achieved. MIRI’s conception of “the alignment problem” (like Bostrom’s “control problem”) included tools like capability constraint and boxing, because the thing we wanted researchers to focus on was the goal of leveraging AI capabilities to get actually-good outcomes, whatever technical work that requires, not some proxy goal that might turn out to be surprisingly irrelevant.
Again, we wanted a field of people keeping their eye on the ball and looking for clever technical ways to get the job done, rather than a field that neglects some actually-useful technique because it doesn’t fit their narrow definition of “alignment”.
Meanwhile, developments like the rise of deep learning had updated MIRI that CEV was not going to be a realistic thing to shoot for with your first AI. We were still thinking of some version of CEV as the ultimate goal, but it now seemed clear that capabilities were progressing too quickly for humanity to have time to nail down all the details of CEV, and it was also clear that the approaches to AI that were winning out would be far harder to analyze, predict, and “aim” than 2001-Eliezer had expected. It seemed clear that if AI was going to help make the future go well, the first order of business would be to do the minimal thing to prevent other AIs from destroying the world six months later, with other parts of alignment/friendliness deferred to later.
I think considerations like this eventually trickled into how MIRI used the term “alignment”. Our first public writing reflecting the switch from “Friendly AI” to “alignment”, our Dec. 2014 agent foundations research agenda, said:
We call a smarter-than-human system that reliably pursues beneficial goals “aligned with human interests” or simply “aligned.”
Whereas by July 2016, when we released a new research agenda that was more ML-focused, “aligned” was shorthand for “aligned with the interests of the operators”.

In practice, we started using “aligned” to mean something more like “aimable” (where aimability includes things like corrigibility, limiting side-effects, monitoring and limiting capabilities, etc., not just “getting the AI to predictably tile the universe with smiley faces rather than paperclips”). Focusing on CEV-ish systems mostly seemed like a distraction, and an invitation to get caught up in moral philosophy and pie-in-the-sky abstractions, when “do a pivotal act” is legitimately a hugely more philosophically shallow topic than “implement CEV”. Instead, we went out of our way to frame the challenge of alignment in a way that seemed almost comically simple and “un-philosophical”, but that successfully captured all of the key obstacles: ‘explain how to use an AI to cause there to exist two strawberries that are identical at the cellular level, without causing anything weird or disruptive to happen in the process’.
Since realistic pivotal acts still seemed pretty outside the Overton window (and since we were mostly focused on our own research at the time), we wrote up our basic thoughts about the topic on Arbital but didn’t try to super-popularize it among rationalists or EAs. (Which unfortunately, I think, exacerbated a situation where the larger communities had very fuzzy models of the strategic situation, and fuzzy models of what the point even was of this “alignment research” thing; alignment research just became a thing-that-was-good-because-it-was-good, not a concrete part of a plan backchained from concrete real-world goals.)
I don’t think MIRI wants to stop using “aligned” in the context of pivotal acts, and I also don’t think MIRI wants to totally divorce the term from the original long-term goal of friendliness/alignment.
Turning “alignment” purely into a matter of “get the AI to do what a particular stakeholder wants” is good in some ways—e.g., it clarifies that the level of alignment needed for pivotal acts could also be used to do bad things.
But from Eliezer’s perspective, this move would also be sending a message to all the young Eliezers: “Alignment Research is what you do if you’re a serious sober person who thinks it’s naive to care about Doing The Right Thing and is instead just trying to make AI Useful To Powerful People; if you want to aim for the obvious desideratum of making AI friendly and beneficial to the world, go join e/acc or something”. Which does not seem ideal.
So I think my proposed solution would be to just acknowledge that ‘the alignment problem’ is ambiguous between three different (overlapping) efforts to figure out how to get good and/or intended outcomes from powerful AI systems:
intent alignment, which is about getting AIs to try to do what they think the user wants, and in practice seems to be most interested in ‘how do we get AIs to be generically trying-to-be-helpful’.
“strawberry problem” alignment, which is about getting AIs to safely, reliably, and efficiently do a small number of specific concrete tasks that are very difficult, for the sake of ending the acute existential risk period.
CEV-style alignment, which is about getting AIs to fully figure out how to make the future good.
Plausibly it would help to have better names for the latter two things. The distinction is similar to “narrow value learning vs. ambitious value learning”, but both problems (as MIRI thinks about them) are a lot more general than just “value learning”, and there’s a lot more content to the strawberry problem than to “narrow alignment”, and more content to CEV than to “ambitious value learning” (e.g., CEV cares about aggregation across people, not just about extrapolation).
(Note: Take the above summary of MIRI’s history with a grain of salt; I had Nate Soares look at this comment and he said “on a skim, it doesn’t seem to quite line up with my recollections nor cut things along the joints I would currently cut them along, but maybe it’s better than nothing”.)
getting AIs to safely, reliably, and efficiently do a small number of specific concrete tasks that are very difficult, for the sake of ending the acute existential risk period.
The problem is that another way to phrase this is “a superintelligent weapon system”: “ending a risk period” by “reliably and efficiently doing a small number of specific concrete tasks” means using physical force to impose your will on others.
On reflection, I do not think that it is a wise idea to factor the path to a good future through a global AI-assisted coup.
Instead one should try hard to push the path to a good future through a consensual agreement with some, uh, mechanisms, to discourage people from engaging in an excessive amount of brinksmanship. If and only if that fails it may be appropriate to consider less consensual options.
The problem is that another way to phrase this is “a superintelligent weapon system”: “ending a risk period” by “reliably and efficiently doing a small number of specific concrete tasks” means using physical force to impose your will on others.
The pivotal acts I usually think about actually don’t route through physically messing with anyone else. I’m usually thinking about using aligned AGI to bootstrap to fast human whole-brain emulation, then using the ems to bootstrap to fully aligned CEV AI.
If someone pushes a “destroy the world” button then the ems or CEV AI would need to stop the world from being destroyed, but it won’t necessarily come to that if the developers have enough of a lead, if they get the job done quickly enough, and if the CEV AI is able to persuade the world to step back from the precipice voluntarily (using superhumanly good persuasion that isn’t mind-control-y, deceptive, or otherwise consent-violating). It’s a big ask, but not as big as CEV itself, I expect.
From my current perspective this is all somewhat of a moot point, however, because I don’t think alignment is tractable enough that humanity should be trying to use aligned AI to prevent human extinction. I think we should instead hit the brakes on AI and shift efforts toward human enhancement, until some future generation is in a better position to handle the alignment problem.
If and only if that fails it may be appropriate to consider less consensual options.
It’s not clear to me that we disagree in any action-relevant way, since I also don’t think AI-enabled pivotal acts are the best path forward anymore. I think the path forward is via international agreements banning dangerous tech, and technical research to improve humanity’s ability to wield such tech someday.
That said, it’s not clear to me how your “if that fails, then try X instead” works in practice. How do you know when it’s failed? Isn’t it likely to be too late by the time we’re sure that we’ve failed on that front? Indeed, it’s plausibly already too late for humanity to seriously pivot to ‘aligned AGI’. If I thought humanity’s last best scrap of hope for survival lay in an AI-empowered pivotal act, I’d certainly want more details on when it’s OK to start trying to figure out how to have humanity not die via this last desperate path.
Are people actually working on human enhancement? Many talk about how it’s the best chance humanity has, but I see zero visible efforts other than Neuralink. No one’s even seriously trying to clone von Neumann!
@Genesmith has received a $20,000 ACX grant:

Gene Smith, $20,000, to create an open-source polygenic predictor for educational attainment and intelligence. You upload your 23andMe results, it tells you your (predicted) IQ. Technology hasn’t advanced to the point where this will be any good—even if everything goes perfectly, the number it gives you will have only the most tenuous connection to your actual IQ (and everyone on Gene’s team agrees with this claim). I’m funding it anyway.
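(For concreteness on what a predictor like this does under the hood: it’s essentially a polygenic score, i.e. a weighted sum of trait-associated allele counts from the uploaded genotype file. A minimal sketch, using made-up SNP IDs and effect weights rather than anything from Gene’s actual model:)

```python
# Minimal sketch of a polygenic score: sum of (effect-allele count) x (per-SNP weight).
# SNP IDs and weights below are made-up placeholders, not values from any real predictor.

effect_weights = {  # hypothetical GWAS-derived effect weights per SNP
    "rs0000001": 0.021,
    "rs0000002": -0.013,
    "rs0000003": 0.008,
}

genotype = {  # hypothetical effect-allele counts (0, 1, or 2), as parsed from a raw 23andMe export
    "rs0000001": 2,
    "rs0000002": 0,
    "rs0000003": 1,
}

def polygenic_score(genotype, weights):
    """Weighted sum over the SNPs present in both the genotype and the weight table."""
    return sum(genotype[snp] * w for snp, w in weights.items() if snp in genotype)

print(polygenic_score(genotype, effect_weights))  # raw score; would still need calibration to an IQ scale
```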
I think there could be far more money in that area (even if it’s not directed at cloning von Neumann in particular), but it’s not happening for political reasons.