It’s symptomatic of a fundamental disagreement about what the threat is that the main AI labs have put a lot of effort into preventing the model from telling you, the user, how to make methamphetamine, but are just fine with the model knowing a lot about how an AI could scheme and plot to kill people.
I think nobody really believes that telling a user how to make meth is a threat to anything but company reputation. My guess is that it’s a nice toy task which recreates some of the obstacles to aligning a superintelligence (which will probably know how to kill you anyway). The primary value of censoring the dataset is to detect whether the model can rederive doom scenarios without having them in its training data.
This is an incoherent approach, but not quite as incoherent as it seems, at least in the near term. In the current paradigm, the actual agentic thing is a shitty pile of (possibly self-editing) prompts and Python scripts that calls the model via an API in order to be intelligent. If the agent is just another user of the model, and the model refuses to help users make bombs, then the agent can’t work out how to make bombs either.
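For concreteness, here’s a minimal sketch of that pattern, in Python, under stated assumptions: `call_model` is a hypothetical stand-in for whatever API the scaffold actually hits (it is not any particular lab’s client library), and the refusal check is a deliberately crude placeholder. The point it illustrates is just that the scaffold only “thinks” by calling the model, so a refusal at the model layer stalls the whole agent.

```python
# Minimal sketch of the "agent = pile of scripts calling a model" pattern.
# `call_model` is a hypothetical callable standing in for any model API;
# the refusal heuristic below is illustrative, not how real scaffolds do it.

from typing import Callable

REFUSAL_MARKERS = ("i can't help", "i cannot help", "i won't assist")


def looks_like_refusal(reply: str) -> bool:
    """Crude string check for a refusal from the model."""
    lowered = reply.lower()
    return any(marker in lowered for marker in REFUSAL_MARKERS)


def run_agent(goal: str, call_model: Callable[[str], str], max_steps: int = 10) -> list[str]:
    """Loop: ask the model for the next step toward `goal`; stop if it refuses.

    The scaffold has no intelligence of its own -- every 'thought' is a fresh
    API call -- so a refusal trained into the model blocks the agent too.
    """
    transcript: list[str] = []
    prompt = f"You are an autonomous agent. Goal: {goal}\nWhat is the next step?"

    for _ in range(max_steps):
        reply = call_model(prompt)
        transcript.append(reply)
        if looks_like_refusal(reply):
            # The agent is just a user of the model: if the model won't say it,
            # the agent has no other place to get the missing knowledge.
            break
        prompt = f"Previous step: {reply}\nWhat is the next step toward: {goal}?"

    return transcript
```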