I like the nine-node graph for how it makes the stakes of “how you group the things” clearer, potentially? Also it suggests ways of composing tools, maybe?
Personally, I always like to start with, then work backwards from, The Goal.
The Goal in “schematic/toy form”.
Then, someone might wonder about the details, and how they might be expanded and implemented and creatively adjusted in safe but potentially surprising ways.
So you work out how to make some external source of power (which is still TBD) somehow serve The Goal (which is now the lower left node, forced to play nicely with the larger framework) and you make sure that you’re asking for something coherent, and the subtasks are do-able, and so on?
The goal plus a way to make sure the goal’s expression doesn’t end up going off the rails. It is like a very thoughtful and coherent wish, looking for a genie to grant it!
Metaphorically, if you’re thinking of a “Three Wishes” story, this would be an instance that makes for a boring story, because hopefully there will be no twists or anything. It will just be a Thoughtful Wish and all work out, even in terms of second-order consequences, and maybe you figure out how to get nearly all of what you want with just two wishes, so you have a safety valve of sorts with number three? Nice!
Then you just need to find a genie, even an insane or half evil genie that can do almost anything?
One possibility is that no one will have built, or wanted to build, a half-insane genie that could easily kill the person holding the lamp. They will have assumed that The Goal can be discussed later, because surely the goal is pretty obvious? It’s just good stuff, like what everyone wants for everyone.
So they won’t have built a tiny little engine of pure creative insanity, they will have tried to build something that can at least be told what to do:
A “Do Anything Machine”: The capability node is the magic rocket fuel, and the thing at the lower left is the philosophic middleware implementing a generic Do What I Mean instruction, and together they can do… uh… anything?
But then, the people who came up with a framework for thinking about the goals, and reconciling the possibility of contradictions or tradeoffs in the goals (in various clever ways (under extreme pressure of breakdown due to extremely creative things happening very quickly)) can say “I would like one of those ‘Do Anything Machines’ please, and I have a VERY LARGE VERY THOUGHTFUL WISH”.
But in fact, if a very large and very thoughtful wish exists, you might safely throw away the philosophy stuff in the Do Anything Machine: why not get one, tear out the Capability Robustness part, and just use THAT as the genie that tries to accomplish The Goal?
The danger might be that maybe you can’t just tear the component out of the Do Anything Machine and still have it even sorta work? Or maybe by the end of working out the wish, it will turn out that there are categorically different kinds of genies and you need a specific reasoning or optimizing strategy to ensure that the wish (even a careful one that accounts for many potential gotchas) actually works out. Either getting a special one or tearing one out of the Do Anything Machine could work here:
EITHER: The “Capability Robustness” torn out of the Do Anything Machine OR ELSE: maybe a different thing that is perfectly consistently usable by The Thoughtful Wish?
If I was going to say some “and then” content here, based on this framing...
What if we didn’t just give the “Do Anything Machine” a toy example of a toy alignment problem and hope it generalizes: what if we gave it “The Thoughtful Wish” as the training distribution to generalize?
Or (maybe equivalently, maybe not) what if “The Thoughtful Wish” was given a genie that actually didn’t need that much thoughtfulness as its “optimizer” and so… is that better?
Personally, I see a Do Anything Machine and it kinda scares me. (Also, I hear “alignment” and think “if orthogonality is actually true, then you can align with evil as easily as with good, so wtf, why is there a nazgul in the fellowship already?”)
And so if I imagine this instrument of potentially enormous danger being given a REALLY THOUGHTFUL GOAL then it seems obviously more helpful than if it was given a toy goal with lots of reliance on the “generalization” powers… (But maybe I’m wrong: see below!)
I don’t have any similar sense that The Thoughtful Wish is substantially helped by using this or that optimization engine, but maybe I’m not understanding the idea very well.
For all of my understanding of the situation, it could be that if you have TWO non-trivial philosophical systems for reasoning about creative problem solving, and they interact… then maybe they… gum each other up? Maybe they cause echoes that resonate into something destructive? It seems like it would depend very very very much on the gears level view of both the systems, not just this high level description? Thus...
Possibly the heart of the gears level question?
To the degree to which “Inner Alignment” and “Objective Robustness” are the same, or work well together, I think that says a lot. To the degree that they are quite different… uh...
Based on this diagram, it seems to me like they are not the same, because it kinda looks like “Inner Alignment” is “The Generalization Problem for only and exactly producing a Good AGI” whereas it seems like “Objective Robustness” would be able to flexibly generalize many many other goals that are… less obviously good?
So maybe Inner Alignment is a smaller and thus easier problem?
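To pin down why Inner Alignment looks “smaller” to me, here is a minimal sketch, with predicates I’m inventing purely for illustration (they formalize my reading of the boxes, not the post’s official definitions):

\[ \text{SolvesObjectiveRobustness}(T) := \forall O.\ \text{PursuesOffDistribution}(T, O) \]
\[ \text{SolvesInnerAlignment}(T) := \text{PursuesOffDistribution}(T, O_{\text{good}}) \]

Here \(\text{PursuesOffDistribution}(T, O)\) reads “models produced by technique T keep pursuing O even off the training distribution”, and \(O_{\text{good}}\) is the one intended, actually-good objective. On that reading, the second condition is just the \(O = O_{\text{good}}\) instance of the first, so a full solution to Objective Robustness hands you Inner Alignment for free but not conversely, which is the sense in which the latter might be the smaller target.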
On the other hand, sometimes the easiest way to solve a motivating problem is to build a tool that can solve any problem that’s vaguely similar (incorporating and leveraging the full generality of the problem space directly and elegantly without worrying about too many weird boundary conditions at the beginning) and then use the general tool to loop back and solve the motivating problem as a happy little side effect?
I have no stake in this game, except the obvious one where I don’t want to be ground up into fuel paste by whatever thing someone eventually builds, but would rather grow up to be an angel and live forever and go meet aliens in my pet rocket ship (or whatever).
Hopefully this was helpful? <3
Maybe a very practical question about the diagram: is there a REASON for there to be no “sufficient together” linkage from “Intent Alignment” and “Robustness” up to “Behavioral Alignment”?
The ABSENCE of such a link suggests that maybe people think there WOULD be destructive interference? Or maybe the absence is just an oversight?
Maybe a very practical question about the diagram: is there a REASON for there to be no “sufficient together” linkage from “Intent Alignment” and “Robustness” up to “Behavioral Alignment”?
Leaning hard on my technical definitions:
Robustness: Performing well on the base objective in a wide range of circumstances.
Intent Alignment: A model is intent-aligned if it has a mesa-objective, and that mesa-objective is aligned with humans. (Again, I don’t want to get into exactly what “alignment” means.)
These two together do not quite imply behavioral alignment, because it’s possible for a model to have a human-friendly mesa-objective but be super bad at achieving it, while being super good at achieving some other objective.
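As a tiny formal restatement of that gap (shorthand for this comment only, nothing canonical):

\[ \text{Robust}(M) := M \text{ performs well on the base objective } O_{\text{base}} \text{ in a wide range of circumstances} \]
\[ \text{IntentAligned}(M) := M \text{ has a mesa-objective } O_{\text{mesa}} \text{ that is aligned with humans} \]
\[ \text{Robust}(M) \wedge \text{IntentAligned}(M) \not\Rightarrow \text{BehaviorallyAligned}(M) \]

The witness is exactly the model described above: its competence is pointed at \(O_{\text{base}}\), its (aligned) \(O_{\text{mesa}}\) goes unachieved, and neither conjunct says anything about the model being good at getting what it wants.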
So, yes, there is a little bit of gear-grinding if we try to combine the two plans like that. They aren’t quite the right thing to fit together.
It’s like we have a magic vending machine that can give us anything, and we have a slip of paper with our careful wish, and we put the slip of paper in the coin slot.
That being said, if we had technology for achieving both intent alignment and robustness, I expect we’d be in a pretty good position! I think the main reason not to go after both is that we may possibly be able to get away with just one of the two paths.
If they are equivalent, then I feel like the obvious value of the work would make resource constraints go away?
However, thinking about raising money for it helps to convince me that the proposed linkage has “leaks”.
Imagine timelines where they had Robustness & Intent Alignment (but there was no point where they had “Inner Robustness” or “On-Distribution Alignment”). Some of those timelines might have win conditions, and others might not. The imaginable failures work for me as useful intuition pumps.
I haven’t managed to figure out a clean short response here, so I’ll give you apologies and lots of words <3
...
If I was being laconic, I might try to restate what I think I noticed, which is that BOTH “Inner Alignment” and “Objective Robustness” have in some deep sense solved the principal-agent problem...
...but only Inner Alignment has solved the “timeless multi-agent case”, while Objective Robustness has solved the principal-agent problem for maybe only the user, only at the moment the user requests help?
(I can imagine the people who are invested in the yellow or the red options rejecting this for various reasons, but I think it would be interesting to hear the objections framed in terms of principals, agents, groups, contracts, explicable requests for an AGI, and how these could interact over time to foreclose the possibility of very high value winning conditions. Since my laconic response’s best expected followup is more debate, it seems good to sharpen and clarify the point.)
...
Restating “the same insight” in a concretely behavioral form: I think that hypothetically, I would have a much easier time explicitly and honestly pitching generic (non-altruistic, non-rationalist) investors on an “AGI Startup” if I was aiming for Robustness, rather than Intent Alignment.
The reason it would be easier: it enables the benefits to go disproportionately to the investors. Like, what if it turns out that disproportionate investor returns are not consistent with something like “the world’s Coherent Extrapolated Volition” (or whatever the latest synecdoche for the win condition is)? THEN, just request “pay out the investors and THEN with any leftovers do the good stuff”. Easy peasy <3
That is, Robustness is easier to raise funds for, because it increases the pool of possible investors from “saints” to “normal selfish investors”…
...which feels like almost an accusation against some people, which is not exactly what I’m aiming for. (I’m not not aiming for that, but it’s not the goal.) I’ll try again.
...
Restating “again, and at length, and with a suggested modification to the diagram”:
My intuitions suggest reversing the “coin vs paper” metaphor to make it very vivid and to make metal money be the real good stuff <3
(If you have not been studying blockchain economics by applying security mindset to protocol governance for a while, and kept generating things “not as good as gold” over and over and over, maybe this metaphor won’t make sense to you. It works for ME though?)
I imagine an “Intent Alignment” that is Actually Good as being like 100 kg of non-radioactive gold.
You could bury it somewhere, and dig it up 1000 years later, and it would still be just what it is: an accurate theory of pragmatically realizable abstract goodness that is in perfect resonance with the best parts of millennia of human economic and spiritual and scientific history up to the moment it was produced.
(
Asteroid mining could change the game for actual gold? And maybe genetic engineering could change the question of values so maybe 150 years from now humans will be twisted demons?
But assuming no “historically unprecedented changes to our axiological circumstances”, Intent Alignment and gold seem metaphorically similar to me (and similarly at risk as a standard able to function for human people as an age-old meter stick for goodness in the face of technological plasticity and post-scarcity economics and wireheading and sybil attacks and so on).
)
Following this metaphorical flip: “Robustness” becomes the vending machine that will take any paper instruction, plus whatever banking credentials you wish to provide (for a bank that is part of Westphalian finance and says that you have credit).
If you pay enough to whoever owns a Robust machine, it’ll give you almost anything…
...then the impedance mismatch could be thought of as a problem where the machine doesn’t model the gold plates covered in the Thoughtful Wish as “valuable” (because the gold isn’t held by a bank), though maybe the plates could work as an awkward and bulky set of instructions that just happen not to be on paper, and then you could do a clever referential rewrite?
Thus, a simple way to reconcile these things would be for some rich/powerful person to come up, swipe a card to transfer 20 million Argentinian nuevo pesos (possibly printed yesterday?), and write the instruction: “Do what that 100 kg of gold that is stamped and shaped with relevant algorithms and inspiring poetry says to do.”
Since Robustness will more or less safely-to-the-user do anything that can be done (it won’t parse “that” in a sloppy and abusive way, for example, triggering on other gold and getting the instruction scrambled, or fall for any of an infinity of other quibbles that could be abusively generated), it will work, right?
By hypothesis, it has “Objective Robustness” so it WILL robustly achieve any achievable goal (or fail out in some informative way if asked to make 1+2=4 or whatever).
So then TIME seems to be related to how the pesos and the paper instruction to follow the gold instructions could fail?
Suppose a Robust vending machine was first asked to create a singleton situation where an AGI exists, manages basically everyone, but isn’t following any kind of Intent Aligned “Golden Plan” that is philosophically coherent n’stuff.
Since the gears spin very fast, the machine understands that a Golden Plan would be globally inconsistent with its own de facto control of “all the things”, which it already exercises in a relatively pedestrian and selfish way that serves the goals of the first person to print enough pesos, and so it would prevent any such Golden Plan from being carried out later.
To connect this longer version of a restatement with earlier/shorter restatements, recall the idea of solving the principal/agent problem in the “timeless multi-agent case”...
In the golden timelessly aligned case, if somehow in the future an actually better theory of “what an AGI should do” is discovered (and so we get “even more gold” in the coin/paper/vending machine metaphor), then Intent Alignment would presumably get out of the way and allow this progress to unfold in an essentially fair and wise way.
Robustness has no such guarantees. This may get at the heart of the inconsistency?
Compressing this down to a concrete suggestion to usefully change the diagram:
I think maybe you could add a 10th node, that was something like “A Mechanism To Ensure That Early Arriving Robustness Defers To Late Arriving Intent Alignment”?
(In some sense then, the thing that Robustness might lack is “corrigibility to high quality late-arriving Alignment updates”?)
I’m pretty sure that Deference is not traditionally part of Robustness as normally conceived, but also if such a thing somehow existed in addition to Robustness then I’d feel like: yeah, this is going to work and the three things (Deference, Robustness, and Intent Alignment) might be logically sufficient to guarantee the win condition at the top of the diagram :-)
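In case it helps to make the suggestion maximally concrete, here is a toy sketch of the proposed change, written as if the diagram were a little graph data structure. The node names are just my paraphrases of a few of the boxes (plus the hypothetical “Deference” node), and the sufficient_together list is the kind of linkage I mean; this is an illustration of the proposal, not a claim about how the actual diagram is drawn:

# Toy sketch of the proposed diagram change (node names are my paraphrases).
diagram = {
    "nodes": [
        "Behavioral Alignment",  # the win condition at the top of the diagram
        "Intent Alignment",
        "Robustness",
        # ...the other existing nodes...
    ],
    "sufficient_together": [
        # existing groupings of nodes claimed jointly sufficient for a parent node
    ],
}

# Proposed 10th node: early-arriving Robustness defers to late-arriving Intent Alignment.
diagram["nodes"].append("Deference")

# Proposed linkage: the three together claimed sufficient for the win condition.
diagram["sufficient_together"].append(
    (("Deference", "Robustness", "Intent Alignment"), "Behavioral Alignment")
)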