I formerly did research for MIRI and what’s now the Center on Long-Term Risk; I now make a living as an emotion coach and Substack writer.
Most of my content becomes free eventually, but if you’d like to get a paid subscription to my Substack, you’ll get it a week early and make it possible for me to write more.
I haven’t read the full ELK report, just Scott Alexander’s discussion of it, so I may be missing something important. But at least based on that discussion, it looks to me like ELK might be operating on premises that aren’t clearly true for LLMs.
Scott writes:
It seems to me that this is assuming that our training is creating the AI’s policy essentially from scratch. It is doing a lot of things, some of which are what we want and some of which aren’t, and unless we are very careful to only reward the things we want and none of the ones we don’t want, it’s going to end up doing things we don’t want.
I don’t know how future superintelligent AI systems will work, but if LLM training worked like this, LLMs would perform far worse than they do. People paid to rate AI answers report working with “incomplete instructions, minimal training and unrealistic time limits to complete tasks” and say things like “[a]fter having seen how bad the data is that goes into supposedly training the model, I knew there was absolutely no way it could ever be trained correctly like that”. Yet for some reason LLMs still do quite well on lots of tasks. And even if all raters worked under perfect conditions, they’d still be fallible humans.
It seems to me that LLMs are probably reasonably robust to noisy reward signals because a large part of what the training does is “upvoting” and tuning existing capabilities and simulated personas rather than creating them entirely from scratch. A base model trained to predict the world creates different kinds of simulated personas whose behavior would explain the data it sees; these include personas like “a human genuinely trying to do its best at task X”, “a deceitful human”, or “an honest human”.
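As a toy illustration of why upvoting pre-existing personas would be robust to rater noise (everything here is made up for illustration, not a claim about how any actual training pipeline works): if raters mislabel some fraction of answers, the noise spreads across the wrong personas while the intended persona still accumulates the most upvotes in expectation.

```python
import random

random.seed(0)

# Toy model: the base model already contains several candidate personas,
# and training merely adjusts their "upvote" scores rather than building
# behavior from scratch. Persona names and all numbers are illustrative.
upvotes = {"genuinely helpful": 0.0, "sycophantic": 0.0, "deceptive": 0.0}

NOISE = 0.2  # suppose raters reward the wrong persona 20% of the time

for _ in range(1000):
    if random.random() < NOISE:
        # A confused or rushed rater upvotes some other persona.
        rewarded = random.choice(["sycophantic", "deceptive"])
    else:
        rewarded = "genuinely helpful"
    upvotes[rewarded] += 0.01

# Even with 20% of rewards misdirected, the intended persona ends up
# with roughly four times the upvotes of the others combined.
print(upvotes)
```

The point of the sketch is just that noise which isn’t systematically biased toward one bad persona mostly washes out, while the signal accumulates.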
Scott writes:
This might happen. But it might also be that the AI contains both a “genuinely protect the diamond” persona and a “manipulate the humans to believe that the diamond is safe” persona, and that the various reward signals upvote each of them to different degrees. A noisy process of this kind might indeed end up upvoting the “manipulate the humans” persona, but if the “genuinely protect the diamond” persona has been sufficiently upvoted by other signals, it still ends up being the dominant one. Then some noise and some upvoting of the “manipulate the humans” persona doesn’t matter, as long as the “genuinely protect the diamond” persona gets more upvotes overall. And if the “genuinely protect the diamond” persona had been sufficiently upvoted from the start, the “manipulate the humans” one might end up with such a low prior probability that it would effectively never become active.
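The head-start argument can be sketched the same way (again, a made-up toy model, not anything from the ELK report): treat persona dominance as a softmax over accumulated upvotes, give the honest persona a higher starting score from earlier training, and let a noisy reward upvote the deceptive persona a substantial fraction of the time.

```python
import math
import random

random.seed(1)

# Illustrative priors: "honest" starts with a head start from earlier
# training signals; "deceptive" starts from zero. Numbers are arbitrary.
honest, deceptive = 4.0, 0.0

# Noisy training: 30% of updates mistakenly upvote the deceptive persona.
for _ in range(500):
    if random.random() < 0.3:
        deceptive += 0.005
    else:
        honest += 0.005

# Softmax over the two scores: the probability that the honest persona
# is the active one.
p_honest = math.exp(honest) / (math.exp(honest) + math.exp(deceptive))
print(round(p_honest, 3))
```

Because the honest persona both starts ahead and gains more on net, the softmax leaves the deceptive persona with a negligible probability of becoming active, even though it received a nontrivial share of the upvotes.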
Now of course none of this is a rigorous proof that things would work, and with our current approaches we still see a lot of reward hacking and so on. But it seems to me like a reasonable possibility that most models contain a potential “honestly report everything that I’ve done” persona, such that one could upvote it across a variety of scenarios until it became widely linked to the rest of the model’s internals and could always detect whether some kind of deception was going on. Once that had happened, it wouldn’t matter if some of the reward signals around honesty were noisy, because the established structure would be sufficiently robust and general to withstand the noise.