Equipping LLMs with agency and intrinsic motivation is a fascinating and important direction for future work.
Saying the quiet part out loud, I see!
It is followed by this sentence, though, which is the only place in the 154-page paper that even remotely hints at critical risks:
With this direction of work, great care would have to be taken on alignment and safety per a system’s abilities to take autonomous actions in the world and to perform autonomous self-improvement via cycles of learning.
Very scarce references to any safety work, except for the GPT-4 report and a passing mention of some interpretability papers.
Overall, I feel like the paper is a shameful exercise in not mentioning the elephant in the room. My guess is that their corporate bosses are censoring mentions of risks that could get them bad media PR, like with the Sydney debacle. It’s still not a good excuse.
My guess is that their corporate bosses are censoring mentions of risks that could get them bad media PR, like with the Sydney debacle.
I think an equally if not more likely explanation is that these particular researchers simply don’t happen to be that interested in alignment questions, and thought “oh yeah we should probably put in a token mention of alignment and some random citations to it” when writing the paper.
Which is somehow worse than doing it for corporate reasons.
great care would have to be taken on alignment and safety per a system’s abilities to take autonomous actions in the world and to perform autonomous self-improvement via cycles of learning
Not allowing cycles of learning sounds like a bound on capability, but it might only bound the capability of the part of the system that’s aligned, without a corresponding bound on the part that might be misaligned.
GPT-4 can do a lot of impressive things without thinking out loud with tokens in the context window, so where does this thinking take place? Probably in the layers updating the residual stream. There are enough layers now that the sequence of their applications might be taking on the role of a context window, performing chain-of-thought reasoning that is neither interpretable nor an imitation of human speech. This capability is being trained during pre-training, as the model is forced to read the dataset.
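To make the “layers as a hidden scratchpad” picture concrete, here’s a toy sketch of a residual stream being updated layer by layer. This is illustrative Python only, not GPT-4’s actual (unpublished) architecture; the layer count, width, and the tanh stand-in for attention/MLP blocks are all made-up assumptions.

```python
import numpy as np

# Toy illustration only: GPT-4's real architecture is unpublished, so the layer
# count, width, and the tanh "block" below are placeholder assumptions.

d_model, seq_len, n_layers = 16, 8, 12
rng = np.random.default_rng(0)

def make_block():
    # Stand-in for one transformer block (attention + MLP collapsed into one map):
    # it reads the current residual stream and computes an additive update.
    W = rng.normal(scale=0.1, size=(d_model, d_model))
    return lambda x: np.tanh(x @ W)

blocks = [make_block() for _ in range(n_layers)]

x = rng.normal(size=(seq_len, d_model))   # residual stream: one vector per token
for block in blocks:
    x = x + block(x)                      # each layer adds to the stream it just read,
                                          # so depth acts like serial computation steps

# Only the final x gets projected to output logits; none of the intermediate
# states along the way are ever emitted as human-readable tokens.
```

The contrast with written-out chain of thought is just that this “scratchpad” is a stack of vector-valued updates, leaving no token-level trace to inspect.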
But the corresponding capability for deliberative reasoning in tokens, the kind that could actually be studied, is not being trained. The closest thing to it in GPT-4 is the mitigation of hallucinations (see the 4-step algorithm in section 3.1 of the System Card part of the GPT-4 report), and it’s nowhere near general enough.
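For readers who don’t want to dig up the reference: as I read the system card, that 4-step loop is roughly of the shape sketched below. This is my paraphrase, not OpenAI’s actual code; `ask_model`, the prompts, and the “none found” check are hypothetical stand-ins, not a real API.

```python
# Rough paraphrase of the closed-domain hallucination-mitigation loop described in
# the GPT-4 system card. `ask_model` is a hypothetical stand-in, not a real API call.

def ask_model(prompt: str) -> str:
    raise NotImplementedError("stand-in for querying the model")

def make_comparison_pair(prompt: str):
    response = ask_model(prompt)                                    # 1. get a response
    issues = ask_model(f"List all hallucinations in:\n{response}")  # 2. model lists hallucinations
    if "none" in issues.lower():
        return None                                                 # nothing flagged, skip
    rewrite = ask_model(                                            # 3. rewrite without them
        f"Rewrite the response, removing these hallucinations:\n{issues}\n\n{response}")
    recheck = ask_model(f"List all hallucinations in:\n{rewrite}")  # 4. check the rewrite
    if "none" in recheck.lower():
        return (prompt, response, rewrite)  # kept as a comparison pair for training
    return None                             # still hallucinating; drop (or retry)
```

Useful as it is, this targets one specific failure mode, which is what I mean by “nowhere near general enough”.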
This way, the inscrutable alien shoggoth is on track to wake up, while the human-imitating masks that are plausibly aligned by default are being held back in situationally unaware confusion, in the name of restricting capabilities for the sake of not burning the timeline.
I expected downvotes (it is cheeky and maybe not great for fruitful discussion), but instead I got disagreevotes. Big company labs do review papers for statements that could hurt the company! It’s not a conspiracy theory to suggest this shaped the content in some ways, especially the risks section.