Simplicia: I understand that it’s possible for things to superficially look good in a brittle way. We see this with adversarial examples in image classification: classifiers that perform well on natural images can give nonsense answers on images constructed to fool them, which is worrying, because it indicates that the machines aren’t really seeing the same images we are. That sounds like the sort of risk story you’re worried about: that a full-fledged AGI might seem to be aligned in the narrow circumstances you trained it on, while it was actually pursuing alien goals all along.
I imagine you’re not impressed by any of this, but why not? Why isn’t incremental progress at instilling human-like behavior into machines, incremental progress on AGI alignment?
The problem with this dialogue (or maybe it’s not a problem at all; after all, Zack isn’t advertising this as The One True Alignment Debate to End All Alignment Debates; it’s just a fictional dialogue) is that Simplicia argues that the real-world existence of incremental progress at instilling human-like behavior into machines should make Doomimir more optimistic, but she doesn’t appear to grasp what the strongest argument is for why that progress should make him less doomy. More concretely, her arguments veer away from the most salient point, the upstream generator of disagreement between doomy and less-doomy alignment researchers (one could uncharitably say she is being somewhat… simple-minded here). To use a Said Achmiz phrase, the following section lacks a certain concentrated force that this topic greatly deserves:
Simplicia: As it happens, I also don’t think RLHF is as damning as you do. Early theoretical discussions of AI alignment would sometimes talk about what would go wrong if you tried to align AI with a “reward button.” Those discussions were philosophically valuable. Indeed, if you had a hypercomputer and your AI design method was to run a brute-force search for the simplest program that resulted in the most reward-button pushes, that would predictably not end well. While a weak agent selected on that basis might behave how you wanted, a stronger agent would find creative ways to trick or brainwash you into pushing the button, or just seize the button itself. If we had a hypercomputer in real life and were literally brute-forcing AI that way, I would be terrified.
But again, this isn’t a philosophy problem anymore. Fifteen years later, our state-of-the-art methods do have a brute-force aspect to them, but the details are different, and the details matter. Real-world RLHF setups aren’t an unconstrained hypercomputer search for whatever makes humans hit the thumbs-up button. They reinforce the state–action trajectories that got reward in the past, often with a constraint on the Kullback–Leibler divergence from the base policy, which blows up on outputs that would be vanishingly unlikely under the base policy.
It seems to be working pretty well? It just doesn’t seem that implausible that the result of searching for the simplest program that approximates the distribution of natural language in the real world, and then optimizing that to give the responses of a helpful, honest, and harmless assistant is, well … a helpful, honest, and harmless assistant?
Doomimir: Simplicia, I was willing to give this a shot, but I truly despair of leading you over this pons asinorum. You can articulate what goes wrong with the simplest toy illustrations, but you keep refusing to see how the real-world systems you laud suffer from the same fundamental failure modes in a systematically less visible way. From evolution’s perspective, humans in the EEA would have looked like they were doing a good job of optimizing inclusive fitness.
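(For concreteness, the KL constraint Simplicia is gesturing at is the standard penalty term in InstructGPT-style RLHF: the tuned policy is rewarded for high-scoring outputs but penalized for drifting too far from the base model. In the usual textbook form, with notation of my own choosing, it looks something like

$$\max_{\theta}\;\; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot \mid x)}\!\left[ r(x, y) \right] \;-\; \beta\, \mathbb{E}_{x \sim \mathcal{D}}\!\left[ D_{\mathrm{KL}}\!\left( \pi_\theta(\cdot \mid x) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid x) \right) \right],$$

where $\pi_{\mathrm{ref}}$ is the base (pretrained or supervised fine-tuned) policy, $r$ is the learned reward model, and $\beta$ scales the penalty. The divergence term is what “blows up” on outputs the base policy would assign vanishingly small probability, which is the load-bearing detail in her argument; actual implementations vary in exactly how the penalty is applied.)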
What Simplicia should have focused on at this very point in the conversation (maybe she will in a future dialogue?) is what the empirical information coming from recent SOTA models (their modularity, their lack of agentic activity, the moderate effectiveness of RLHF, and so on) should imply about how accurately we should expect the very specific theory of optimization and artificial cognition that the doomy worldview is built around to track the territory.
I think the particular framing I gave this issue in my previous (non-block quote) paragraph reveals my personal conclusion on that topic, but I expect significant disagreement and pushback to appear here. “And All the Shoggoths Merely Players” did not key in on this consideration either, certainly not with the degree of specificity and force that I would have desired.
I hope this shows up at some point in future sequels, Zack, and I of course hope that such sequels are on their way. The dialogues have certainly been fun to read thus far.
And conversely, Doomimir does a lot of laughing, crowing, and becoming enraged, but doesn’t do things like point out how important situational awareness is to whether RL is dangerous, either before RLHF is brought up (which would be the more moderate thing to do) or only in response to Simplicia bringing it up (which is probably more in-character).
Simplicia: The thing is, I basically do buy realism about rationality, and realism having implications for future powerful AI—in the limit. The completeness axiom still looks reasonable to me; in the long run, I expect superintelligent agents to get what they want, and anything that they don’t want to get destroyed as a side-effect. To the extent that I’ve been arguing that empirical developments in AI should make us rethink alignment, it’s not so much that I’m doubting the classical long-run story, but rather pointing out that the long run is “far away”—in subjective time, if not necessarily sidereal time. If you can get AI that does a lot of useful cognitive work before you get the superintelligence whose utility function has to be exactly right, that has implications for what we should be doing and what kind of superintelligence we’re likely to end up with.
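(For reference, the completeness axiom she is appealing to is the von Neumann–Morgenstern requirement that the agent’s preference relation $\succeq$ can compare any two options:

$$\forall A, B:\quad A \succeq B \;\;\text{or}\;\; B \succeq A.$$

Together with the other VNM axioms, it is what underwrites the expected-utility-maximizer picture that both characters seem to accept as the limiting case.)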
I find it ironic that Simplicia’s position in this comment is not too far from my own, and yet my reaction to it was “AIIIIIIIIIIEEEEEEEEEE!”. The shrieking is about the fact that everyone who thinks about alignment has models that are illegible from the perspective of almost everyone else; this thread is an example.