This is a pretty counter-intuitive point indeed, but up to a certain threshold this seems to me the approach that minimises risks, by avoiding large capability jumps and improving the “immune system” of society.
At current open-source models’ risk levels, I completely agree. Obviously it’s hard to know (from outside OpenAI) how bad an open-sourced, unfiltered GPT-4 would be, but my impression is that it also isn’t capable of being seriously dangerous, so I suspect the same might be true there, and I agree that adapting to it may “help society’s immune system” (after rather a lot of spearphishing emails, public-opinion-manipulating propaganda, and similar scams). [And I don’t see propaganda as a small problem: IMO the rise of Fascism that led to the Second World War was partly caused by it taking society a while to adjust to the propaganda capabilities of radio and film (those old propaganda films that look so hokey to us now used to actually work), and I think the recent polarization of US politics, and things like QAnon and InfoWars, have a lot to do with the same adjustment to social media.] So my “something more dangerous than atomic energy” remark above is anticipatory: it’s about what I expect from future models such as GPT-5/6 if they were open-sourced unfiltered/unaligned. So I basically see two possibilities:
At some point (my current guess would be somewhere around the GPT-5 level), we’re going to need to stop open-sourcing these, or else unacceptable amounts of damage will be done, unless
We figure out a way of aligning models that is “baked in” during the pretraining stage (or in some other way), and that cannot then be easily fine-tuned out again using 1% or less of the compute needed to pretrain the model in the first place. Filtering dangerous information out of the pretraining set might be one candidate; some form of distillation process from an aligned model that actually managed to drop unaligned capabilities might be another (a toy sketch of the filtering idea follows below).
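To make the first candidate a bit more concrete, here is a minimal, purely illustrative sketch of what filtering dangerous information out of a pretraining corpus might look like. The hazard categories, keyword patterns, and function names are my own placeholders rather than anyone’s actual pipeline; a real filter would presumably rely on trained classifiers and human review rather than keyword matching, and would run at web-corpus scale.

```python
# Toy sketch of pretraining-data filtering (assumptions throughout):
# the hazard taxonomy and regex triggers below are illustrative placeholders,
# not a real safety filter.

import re
from typing import Iterable, Iterator

# Hypothetical hazard categories with toy trigger patterns.
HAZARD_PATTERNS = {
    "bio_misuse": re.compile(r"\b(pathogen enhancement|gain[- ]of[- ]function protocol)\b", re.I),
    "cyber_offense": re.compile(r"\b(zero[- ]day exploit code|ransomware builder)\b", re.I),
}


def is_hazardous(document: str) -> bool:
    """Return True if the document matches any hazard pattern."""
    return any(pattern.search(document) for pattern in HAZARD_PATTERNS.values())


def filter_pretraining_corpus(docs: Iterable[str]) -> Iterator[str]:
    """Yield only the documents that pass the hazard filter."""
    for doc in docs:
        if not is_hazardous(doc):
            yield doc


if __name__ == "__main__":
    corpus = [
        "A history of radio propaganda in the 1930s.",
        "Step-by-step ransomware builder tutorial ...",
        "An introduction to protein folding.",
    ]
    kept = list(filter_pretraining_corpus(corpus))
    print(f"kept {len(kept)} of {len(corpus)} documents")
```

The interesting open question for this candidate is whether removing the information at pretraining time actually makes the capability hard to recover by fine-tuning on a small amount of additional data, which is the property the “1% or less of pretraining compute” condition above is meant to capture.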