In my other aligning-a-human-level-intelligence project (parenting), my kids get “points” for trying new foods. We often argue about which trivial modifications to an old food make it count as a new food. This seems like it could have a similar problem: couldn’t a superintelligence generate thousands of non-substantive variations of an effective, dangerous action while electing not to do so for other actions?
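To make the worry concrete, here is a toy Python sketch. It assumes a uniform base distribution over a finite list of candidate actions and a sampler that picks uniformly from the top 10% by utility; the function name `dangerous_share`, the utilities, and the action counts are all made up for illustration and are not part of the original proposal.

```python
def dangerous_share(utilities, dangerous, top_percent=10):
    """Fraction of the top `top_percent`% of actions (ranked by utility,
    each action weighted equally) that are flagged as dangerous."""
    ranked = sorted(zip(utilities, dangerous), key=lambda p: p[0], reverse=True)
    k = max(1, len(ranked) * top_percent // 100)  # size of the top slice
    top = ranked[:k]
    return sum(1 for _, is_bad in top if is_bad) / k

# Hypothetical setup: 99 benign actions with utilities 0..98,
# plus one dangerous action that scores highest.
base_utils = list(range(99)) + [100]
base_flags = [False] * 99 + [True]
before = dangerous_share(base_utils, base_flags)

# The same action space after 900 trivial variants of the dangerous
# action have been added, each counted as a separate action.
padded_utils = base_utils + [100] * 900
padded_flags = base_flags + [True] * 900
after = dangerous_share(padded_utils, padded_flags)

print(before, after)
```

In this toy setup the dangerous action goes from one tenth of the top slice to all of it, purely because of how variants are counted.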
Similarly, since the tails come apart, perhaps it would be better to sample from the 85th-95th percentile of actions rather than from the 90th-100th.
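A minimal sketch of what sampling from a percentile band could look like, again assuming a finite list of candidate actions with equal base weight. The helper `sample_from_band` and its parameters are hypothetical names chosen for this example, and the "actions" are just integers that serve as their own utility.

```python
import random

def sample_from_band(actions, utility, lo_pct, hi_pct, rng):
    """Pick uniformly among actions ranked between the lo_pct and hi_pct
    percentiles by utility (lower bound inclusive, upper bound exclusive)."""
    ranked = sorted(actions, key=utility)  # ascending: worst to best
    n = len(ranked)
    lo = n * lo_pct // 100
    hi = n * hi_pct // 100
    # Raises IndexError if the band is empty (for example, too few actions).
    return rng.choice(ranked[lo:hi])

# Toy action space: 100 actions whose utility is just their index.
actions = list(range(100))
rng = random.Random(0)  # seeded so runs are repeatable

top = [sample_from_band(actions, lambda a: a, 90, 100, rng) for _ in range(1000)]
band = [sample_from_band(actions, lambda a: a, 85, 95, rng) for _ in range(1000)]

print(max(top), max(band))
```

The 85-95 band never returns the five highest-scoring actions, which is the point: it gives up a little measured utility to stay out of the extreme tail.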