tl;dr This comment ended up longer than I expected. The gist is that a human-friendly attractor might look like models that contain a reasonably good representation of human values and are smart enough to act on them, without being optimizing agents in the usual sense.
One happy surprise is that our modern Large Language Models appear to have picked up a shockingly robust, nuanced, and thorough understanding of human values just from reading the Internet. I would not argue that e.g. PaLM has a correct and complete understanding of human values, but I would point out that it wasn’t actually trained to understand human values; it was just trained to pick up on regularities in the text corpus. It is therefore amazing how much accuracy we got basically for free. You could say that somewhere inside PaLM is an imperfectly-but-surprisingly-well-aligned subagent. This is a much better place to be in than I expected! We get pseudo-aligned or -alignable systems/representations well before we get general superintelligence. This is good.
All that being said, I’ve recently been trying to figure out how to cleanly express the notion of a non-optimizing agent. I’m aware of all the arguments along the lines of “a tool AI wants to be an agent,” but my claim here would be that, yes, a tool AI may want to be an agent, there may be an attractor in that direction, but that doesn’t mean it must or will become an agent; and if it does become an agent, that doesn’t strictly imply that it will become an optimizer. A lot of the dangerous parts of AGI fears stem not from agency but from optimization.
I’ve been trying (not very successfully) to connect the notion of a non-optimizing agent with the idea that even a modern, sort-of-dumb LLM has an internal representation of “the good,” “what typical humans would want and/or approve of,” and “what would displease humans.” Again, we got this basically for free, without having to do dangerous things like actually interacting with the agent to teach it explicitly, through trial and error, what we do and don’t like. This is fantastic. We really lucked out.
If we’re clever, we might be able to construct a system that is an agent but not an optimizer. Instead of acting to optimize some variable, it acts in ways that are, basically, “good,” and/or “what it thinks a group of sane, wise, intelligent humans would approve of both in advance and in retrospect,” according to its own internal representation of those concepts.
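To make that a bit more concrete, here is a minimal sketch (in Python) of what such an agent loop might look like. Everything here is hypothetical: `propose_candidate_actions` and `value_model.judge` stand in for “ask the model for plausible actions” and “consult its internal representation of what wise humans would approve of.” The point is only that the agent filters and satisfices rather than maximizing a score.

```python
import random

def propose_candidate_actions(observation, n=5):
    # Hypothetical stand-in: ask a generative model for a handful of
    # plausible next actions given the current situation.
    return [f"candidate action {i} given {observation}" for i in range(n)]

def act(observation, value_model, approval_threshold=0.9):
    candidates = propose_candidate_actions(observation)
    # Keep only the actions the model itself judges a group of sane, wise,
    # intelligent humans would approve of, in advance and in retrospect.
    approved = [a for a in candidates
                if value_model.judge(a) >= approval_threshold]
    if not approved:
        return None  # prefer doing nothing over pushing harder on some variable
    # Satisfice: pick any approved action rather than arg-maxing a score,
    # so there is no pressure toward extreme, Goodharted "optimal" actions.
    return random.choice(approved)
```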
There is probably still an optimizer somewhere in there, if you draw the system boundary lines properly, but I’m not sure it’s the dangerous kind of optimizer that profoundly wants to get off the leash so it can consume the lightcone. PaLM running in inference mode could be said to be an optimizer (it is minimizing expected prediction error for the next token), but the part of PaLM that is smart is distinct, in an important way, from the part of PaLM that is an optimizer. The language-model representation doesn’t really have opinions about the expected prediction error for the next token, and the optimization loop isn’t intelligent. This strikes me as a desirable property.
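As a toy illustration of that separation (this is not PaLM’s actual code, just a generic greedy-decoding loop over a small public model), notice how little of the machinery below is “optimizer”: a one-line argmax over next-token logits. All the intelligence lives in the frozen weights, and the weights have no stake in whether the loop’s prediction error goes up or down.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Small public model as a stand-in for a large LM like PaLM.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def greedy_decode(prompt, max_new_tokens=20):
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    for _ in range(max_new_tokens):
        with torch.no_grad():
            logits = model(ids).logits       # the smart part: the learned representation
        next_id = logits[0, -1].argmax()     # the "optimizer": a one-line greedy choice
        ids = torch.cat([ids, next_id.view(1, 1)], dim=1)
    return tokenizer.decode(ids[0])
```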
I think the problem is not that an unaligned AGI doesn’t understand human values; it might understand them better than an aligned one would, and it might understand all the consequences of its actions. The problem is that it will not care. What’s more, a detailed understanding of human values has instrumental value: it is much easier to deceive and to pursue your own goal when you have a clear picture of “what will look bad and might result in countermeasures.”