kave comments on Anthropic’s Responsible Scaling Policy & Long-Term Benefit Trust

kave 9 Dec 2023 22:00 UTC
LW: 3 AF: 2
“The model shows early signs of autonomous self-replication ability, as defined by 50% aggregate success rate on the tasks listed in [Appendix on Autonomy Evaluations]”
Would you be willing to rephrase this as something like the following?
“The model shows early signs of autonomous self-replication ability. Autonomous self-replication ability is defined as 50% aggregate success rate on the capabilities for which we list evaluations in [Appendix on Autonomy Evaluations]”
The hope here is to avoid arguments like “well, this system doesn’t have autonomous self-replication ability/isn’t ASL-3, because Anthropic’s evals failed to elicit the behaviour, and that definitionally means it’s not ASL-3”, and to get a bit more of a map-territory distinction in: the evals measure the capability, they aren’t the capability itself.