The model shows early signs of autonomous self-replication ability, as defined by 50% aggregate success rate on the tasks listed in [Appendix on Autonomy Evaluations]
Would you be willing to rephrase this as something like the following?
The model shows early signs of autonomous self-replication ability. Autonomous self-replication ability is defined as 50% aggregate success rate on the capabilities for which we list evaluations in [Appendix on Autonomy Evaluations]
The hope here is to avoid arguments like “well, this system doesn’t have autonomous self-replication ability (or isn’t ASL-3), because Anthropic’s evals failed to elicit the behaviour, and that definitionally means it’s not ASL-3”, and to get a bit more of a map-territory distinction in: the evals are a measurement of the capability, not the capability itself.