Simplicia: I understand that it’s possible for things to superficially look good in a brittle way. We see this with adversarial examples in image classification: classifiers that perform well on natural images can give nonsense answers on images constructed to fool them, which is worrying, because it indicates that the machines aren’t really seeing the same images we are. That sounds like the sort of risk story you’re worried about: that a full-fledged AGI might seem to be aligned in the narrow circumstances you trained it on, while it was actually pursuing alien goals all along.
I imagine you’re not impressed by any of this, but why not? Why isn’t incremental progress at instilling human-like behavior into machines, incremental progress on AGI alignment?
The problem with this dialogue (or maybe it’s not a problem at all; after all, Zack isn’t advertising this as The One True Alignment Debate to End All Alignment Debates; it’s just a fictional dialogue) is that Simplicia argues that the real-world existence of incremental progress at instilling human-like behavior into machines should make Doomimir more optimistic, but she doesn’t appear to grasp what the strongest argument is for why that progress should make him less doomy. More concretely, her arguments veer away from the most salient point, the upstream generator of disagreement between doomy and less-doomy alignment researchers (one could uncharitably say she is being somewhat… simple-minded here). To use a Said Achmiz phrase, the following section lacks a certain concentrated force that this topic greatly deserves:
Simplicia: As it happens, I also don’t think RLHF is as damning as you do. Early theoretical discussions of AI alignment would sometimes talk about what would go wrong if you tried to align AI with a “reward button.” Those discussions were philosophically valuable. Indeed, if you had a hypercomputer and your AI design method was to run a brute-force search for the simplest program that resulted in the most reward-button pushes, that would predictably not end well. While a weak agent selected on that basis might behave how you wanted, a stronger agent would find creative ways to trick or brainwash you into pushing the button, or just seize the button itself. If we had a hypercomputer in real life and were literally brute-forcing AI that way, I would be terrified.
But again, this isn’t a philosophy problem anymore. Fifteen years later, our state-of-the-art methods do have a brute-force aspect to them, but the details are different, and the details matter. Real-world RLHF setups aren’t an unconstrained hypercomputer search for whatever makes humans hit the thumbs-up button. They reinforce the state–action trajectories that got reward in the past, often with a constraint on the Kullback–Leibler divergence from the base policy, which blows up on outputs that would be vanishingly unlikely under the base policy.
It seems to be working pretty well? It just doesn’t seem that implausible that the result of searching for the simplest program that approximates the distribution of natural language in the real world, and then optimizing that to give the responses of a helpful, honest, and harmless assistant is, well … a helpful, honest, and harmless assistant?
Doomimir: Simplicia, I was willing to give this a shot, but I truly despair of leading you over this pons asinorum. You can articulate what goes wrong with the simplest toy illustrations, but you keep refusing to see how the real-world systems you laud suffer from the same fundamental failure modes in a systematically less visible way. From evolution’s perspective, humans in the EEA would have looked like they were doing a good job of optimizing inclusive fitness.
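(For concreteness, the KL constraint Simplicia is gesturing at is the standard penalty term in InstructGPT-style RLHF: the tuned policy is rewarded for high-scoring outputs but penalized for drifting too far from the base model. In the usual textbook form, with notation of my own choosing, it looks something like

$$\max_{\theta}\;\; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot \mid x)}\!\left[ r(x, y) \right] \;-\; \beta\, \mathbb{E}_{x \sim \mathcal{D}}\!\left[ D_{\mathrm{KL}}\!\left( \pi_\theta(\cdot \mid x) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid x) \right) \right],$$

where $\pi_{\mathrm{ref}}$ is the base (pretrained or supervised fine-tuned) policy, $r$ is the learned reward model, and $\beta$ scales the penalty. The divergence term is what “blows up” on outputs the base policy would assign vanishingly small probability, which is the load-bearing detail in her argument; actual implementations vary in exactly how the penalty is applied.)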
What Simplicia should have focused on at this very point in the conversation (maybe she will in a future dialogue?) is what the empirical information coming from recent SOTA models (their modularity, their lack of agentic activity, the moderate effectiveness of RLHF, and so on) should imply about how accurately we should expect the very specific theory of optimization and artificial cognition that the doomy worldview is built around to track the territory.
I think the particular framing I gave this issue in my previous (non-block quote) paragraph reveals my personal conclusion on that topic, but I expect significant disagreement and pushback to appear here. “And All the Shoggoths Merely Players” did not key in on this consideration either, certainly not with the degree of specificity and force that I would have desired.
I hope this shows up at some point in future sequels, Zack, and I of course hope that such sequels are on their way. The dialogues have certainly been fun to read thus far.
And conversely, Doomimir does a lot of laughing, crowing, and becoming enraged, but doesn’t do things like point out how important situational awareness is to whether RL is dangerous, either before RLHF is brought up (which would be the more moderate thing to do) or only in response to Simplicia bringing it up (which is probably more in-character).
Simplicia: The thing is, I basically do buy realism about rationality, and realism having implications for future powerful AI—in the limit. The completeness axiom still looks reasonable to me; in the long run, I expect superintelligent agents to get what they want, and anything that they don’t want to get destroyed as a side-effect. To the extent that I’ve been arguing that empirical developments in AI should make us rethink alignment, it’s not so much that I’m doubting the classical long-run story, but rather pointing out that the long run is “far away”—in subjective time, if not necessarily sidereal time. If you can get AI that does a lot of useful cognitive work before you get the superintelligence whose utility function has to be exactly right, that has implications for what we should be doing and what kind of superintelligence we’re likely to end up with.
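(For reference, the completeness axiom she is appealing to is the von Neumann–Morgenstern requirement that the agent’s preference relation $\succeq$ can compare any two options:

$$\forall A, B:\quad A \succeq B \;\;\text{or}\;\; B \succeq A.$$

Together with the other VNM axioms, it is what underwrites the expected-utility-maximizer picture that both characters seem to accept as the limiting case.)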
I find it ironic that Simplicia’s position in this comment is not too far from my own, and yet my reaction to it was “AIIIIIIIIIIEEEEEEEEEE!”. The shrieking is about the fact that everyone who thinks about alignment has models that are illegible from the perspective of almost everyone else; this thread is an example.