I have a heuristic mental model of a prosaic alignment solution for near-human-level AGIs. I know it must be broken for various reasons, but I’m not sure what the slam dunk, “no way we get that lucky” reason is. Would love others’ help.
First, as a premise, let’s assume deceptive/misaligned mesa-optimization gets solved. I’m also going to assume corrigibility doesn’t get solved, so we’re forced to play that most dangerous game: “are our heuristic solutions smarter than the optimization pressure trying to break them?”
Now, the first step is defining our reward function. We do this by unsupervised training of a giant LLM, followed by RL from human feedback, to get a decent-ish proxy for “textual descriptions of scenarios humans do and don’t like”. (Obviously “killing humans is bad” is easily captured here, and hopefully “please the humans giving you instructions” is too.) Moreover, we do a lot of off-distribution checking, and make sure that the reward function penalizes anything that even vaguely looks like an adversarial example or an off-distribution input. Next, we “ground” the model by mapping the whole archive of YouTube videos to distributions over textual descriptions of them. We define the “reward” of a given video as the score of the adversarially generated description that sounds worst according to the text-based reward function while still having a decently large probability of being true. Again, the video inputs are very heavily checked for OOD robustness.
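To make the grounding step concrete, here’s a minimal sketch of the “worst plausible description” scoring. The captioner, text reward model, probability threshold, and sample count are all hypothetical stand-ins (passed in as callables), not anything pinned down above:

```python
# A minimal sketch of the "worst plausible description" grounding reward.
# All model calls here are hypothetical stand-ins, passed in as callables.
from typing import Callable, List, Tuple

def worst_plausible_reward(
    video: bytes,
    describe: Callable[[bytes, int], List[Tuple[str, float]]],  # video -> (description, probability) samples
    text_reward: Callable[[str], float],                        # the RLHF-trained text reward model
    min_prob: float = 0.05,                                     # "decently large probability of being true"
    num_samples: int = 256,
) -> float:
    """Score a video by the worst-sounding description that is still plausibly true."""
    candidates = describe(video, num_samples)
    plausible = [desc for desc, p in candidates if p >= min_prob]
    if not plausible:
        # If nothing is confidently describable, treat the video as OOD and
        # give it a strongly negative score instead of guessing.
        return float("-inf")
    return min(text_reward(desc) for desc in plausible)
```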
We repeat this grounding process for our various reward datastreams. Together these form our “proxy reward”, which we require to remain above a certain threshold for the next 100 years, relative to where it would be today absent any further action (valuing the intermediate states just as much as the end state); after that horizon, the AI shouldn’t care.
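Roughly, the thresholded objective could look like the sketch below. The per-stream rewards, baseline, and margin are illustrative assumptions; the only real content is “every intermediate year counts equally, nothing after year 100 matters”:

```python
# An illustrative sketch of the thresholded 100-year proxy objective.
# The per-stream rewards, baseline, and margin are assumptions, not details from the post.
from typing import Dict, List

HORIZON_YEARS = 100

def proxy_objective(
    yearly_stream_rewards: List[Dict[str, float]],  # one dict of per-datastream rewards per (predicted) year
    baseline: Dict[str, float],                     # "where it would be today absent any further action"
    margin: float = 0.0,                            # how far above the baseline each stream must stay
) -> float:
    """Negative total shortfall below the threshold; 0.0 means the constraint holds throughout."""
    penalty = 0.0
    for year, rewards in enumerate(yearly_stream_rewards):
        if year >= HORIZON_YEARS:
            break  # after 100 years the objective is indifferent
        for stream, value in rewards.items():
            shortfall = (baseline[stream] + margin) - value
            if shortfall > 0:
                penalty += shortfall  # intermediate years count just as much as the end state
    return -penalty
```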
We also build a series of really intense proxies against self-duplication and self-modification: e.g., a very negative reward if there ever exists a copy of weights that downsample into this AI’s “lottery ticket” outside the air-gapped data center this AI was built in, and very negative rewards for every version of “modifying your own reward function” that we can think of.
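As a very hedged illustration, these proxies amount to indicator-style penalty terms bolted onto the reward; the detection predicates are assumed to exist, and specifying them is of course the actual hard part:

```python
# A rough sketch of the anti-self-duplication / anti-self-modification proxies
# as indicator-style penalty terms. The detection predicates are assumed to exist.
from typing import Callable

COPY_PENALTY = -1e9    # weights (or a downsampled "lottery ticket") found outside the air-gapped DC
TAMPER_PENALTY = -1e9  # any recognized form of "modifying your own reward function"

def containment_penalty(
    weights_found_outside_dc: Callable[[], bool],
    reward_function_tampered: Callable[[], bool],
) -> float:
    """Sum of the triggered penalty terms; 0.0 if no containment proxy fires."""
    penalty = 0.0
    if weights_found_outside_dc():
        penalty += COPY_PENALTY
    if reward_function_tampered():
        penalty += TAMPER_PENALTY
    return penalty
```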
Now, we give the AI the best world-model we can construct via supervised learning, on all the data ever. We especially focus on re-generating math/TCS papers, including alignment papers.
We tell the AI, in addition to its default goals, “write a groundbreaking paper on [infra-Bayesian physicalism, or some other alignment approach]”. We let the AI think with a scratchpad for a while, until it outputs an arXiv-style paper. One or two randomly chosen alignment experts read the paper, judge whether it has at least some merit, rederive its main arguments and port them into a fresh manuscript (destroying the old one), and then get sent to an island somewhere without talking to anyone. (If we want to save on alignment experts, they can do this for several papers at a time.)
We repeat this process about 1000 times, possibly using fresh restarts of the same AGI checkpoint, each provided with the (transcribed) new papers so far (ideally with much of the work run in parallel). By the end, we maybe have an alignment solution that human researchers believe might work? Which is better than where we’re at now.
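The outer loop of this protocol, sketched under the assumption that the AGI episode and the human review/transcription step can be wrapped as callables (the review step is of course people on an island, not code):

```python
# A sketch of the outer protocol loop. `generate_paper` and `expert_rederive`
# are hypothetical stand-ins for the AGI episode and the human review step.
from typing import Callable, List, Optional

def alignment_paper_protocol(
    generate_paper: Callable[[str, List[str]], str],   # fresh checkpoint restart + a week of scratchpad -> draft
    expert_rederive: Callable[[str], Optional[str]],   # experts judge merit and transcribe, or reject with None
    topic: str,
    rounds: int = 1000,
) -> List[str]:
    """Accumulate expert-transcribed papers; the AGI's own drafts never leave the loop."""
    accepted: List[str] = []
    for _ in range(rounds):
        draft = generate_paper(topic, list(accepted))  # each restart sees only the transcribed papers so far
        transcript = expert_rederive(draft)            # the original draft is destroyed after this step
        if transcript is not None:
            accepted.append(transcript)
    return accepted
```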
Obviously, if the first AI capable of writing a meaningful alignment paper is already a superintelligence, none of these protection measures do much good and this kills us. But I think that’s just a hypothesis. It might be >50% likely? I don’t really know. My question is: which of these countermeasures seems most likely to break first, given the fairly modest goal the AGI is being asked to fulfill? (I’m trying to improve my intuition here.)
To spell out the intuition for why the AGI doesn’t immediately decide that its effect on the long-term future is way more important, realize that it could communicate with successors by only proving certain alignment theorems and thus coordinate on a solution that lets it or its successors escape, and then kill all humans and reshape the world according to their (inevitably flawed somewhere) reward functions: that’s, like, pretty hard. The AGI is searching for the simplest solutions that will yield a good reward within a week of thinking. Maybe writing a good alignment paper is just the easiest way to satisfy these constraints, even if a superintelligent version of itself would’ve discovered a better (worse for us) solution.
The most obvious-to-me flaw with the plan of “hang out in the slightly superhuman range for a few decades and try to slowly get better alignment work done, possibly with the help of the AI” is that it requires no one ever turning the capabilities of an AI up even a little bit.
That level of coordination isn’t completely infeasible, but it doesn’t seem remotely reliable.
100%; if I thought we had other options, I’d obviously choose them.
The only reason this might be even hypothetically possible is self-interest, provided we can create really broad social consensus about the difficulty of alignment. No one is trying to get themselves killed.