Here is the list of counter-arguments I prepared beforehand
1) Digital cliff: it may not be possible to weaken a stronger model
2) Competition: the existence of a stronger model implies we live in a more dangerous world
3) Deceptive alignment: the stronger model may be more likely to deceive you into thinking it’s aligned
4) Wireheading: the user may be unable to resist using the stronger model even knowing it is more dangerous
5) Passive safety: the weaker model may be passively safe while the stronger model is not
6) Malicious actors: the stronger model may be more likely to be used by malicious actors
7) Inverse scaling: the stronger model may be weaker in some safety-critical dimensions
8) Domain of alignment: the stronger model may be more likely to be used in a safety-critical context
I think the strongest counter-arguments are:
There may not be a surefire way to weaken a stronger model
Saying you “can” weaken a model is useless unless you actually do it
I would love to hear a stronger argument for what @johnswentworth describes as “subproblem 1”: that the model might become dangerous during training. All of the versions of this argument that I am aware of involve some “magic” step where the AI unboxes itself (e.g. via a side-channel or by talking its way out of the box), and these steps seem to either require huge leaps in intelligence or be easily mitigated (air-gapped networks, two-person control).