Starting out more restrictive seems sensible. As you say, the restriction could then be learned away, or one could use human feedback to sign off on high-impact actions. The first approach reminds me of estimating regions of attraction (ROA) in nonlinear control, where the ROA is explored without ever leaving the stable region. The second seems to hinge on humans being able to understand both the implications of high-impact actions and the consequences of a baseline such as inaction. There are probably other alternatives that we have not yet considered.
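To make the two options concrete, here is a rough Python sketch of how I picture them fitting together. Everything in it is my own assumption for illustration: the names `gated_choice`, `relax`, and `NOOP`, the scalar impact estimate, and the update rule are hypothetical and not taken from any existing method. Low-impact actions pass automatically, high-impact ones need a human sign-off, and the threshold is only ever loosened toward impact levels a human has already approved.

```python
from typing import Callable, Iterable

NOOP = "noop"  # hypothetical stand-in for the inaction baseline


def gated_choice(
    candidates: Iterable[str],
    impact: Callable[[str], float],
    threshold: float,
    approve: Callable[[str], bool],
) -> str:
    """Return the first candidate that is low-impact or human-approved.

    Falls back to the inaction baseline if nothing qualifies.
    """
    for action in candidates:
        if impact(action) <= threshold:
            return action  # within the current restrictive budget
        if approve(action):
            return action  # high impact, but a human signed off
    return NOOP


def relax(threshold: float, approved_impact: float, rate: float = 0.5) -> float:
    """Move the threshold part-way toward an approved impact level.

    The threshold never decreases, so the allowed set only grows from
    the initial restrictive one (assumed update rule, for illustration).
    """
    return max(threshold, threshold + rate * (approved_impact - threshold))
```

Growing the allowed set only from sign-offs is loosely analogous to the ROA picture: one expands outward from a region already believed safe rather than probing arbitrary points. It still assumes the human's approval is well-informed, which is exactly the part I am unsure about.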