I think the broader use is sensible—e.g. to include post-training.
However, I’m not sure how narrow you’d want [training hacking] to be. Do you want to call it training only if NN internals get updated by default? Or is it training hacking just if it occurs during the period we consider training? (Otherwise, [deceptive alignment of a …selection… process that could be ongoing] seems to cover all deceptive alignment, since potential deletion/adjustment is itself a selection process.)
Fine if there’s no bright line—I’d just be curious to know your criteria.
I’d probably be more specific and say ‘gradient hacking’ or ‘update hacking’ for deception of a training process which updates NN internals.
I see what you’re saying: in practice, a deployment scenario is often implicitly a selection scenario (should we run the thing more, run it less, or turn it off?). So deceptive alignment at deploy-time could be a means of training (selection) hacking.
More centrally, ‘training hacking’ might refer to a situation with denser oversight and explicit updating/gating.
Deceptive alignment during this period is just one way of training hacking (one could alternatively hack exploration, cyber crack and literally hack oversight/updating, …). I didn’t make that clear in my original comment, and I now think there’s arguably a missing term for ‘deceptive alignment for training hacking’, but maybe that’s fine.