Surely creating the full concrete details of the strategy is not much different from “putting forth as-good-as-human definitions, finding objections to them, and then improving the definitions based on considered objections.” I at least don’t see why the same mechanism couldn’t be used here (i.e., apply this definition iteration to the word “good” and have the AI pursue that, and apply it to “bad” and have the AI avoid that). If you see it as a different thing, can you explain why?
It’s much easier to get safe, effective definitions of ‘reason’, ‘hopes’, ‘worries’, and ‘intuitions’ on first tries than to get a safe and effective definition of ‘good’.
I’d be interested to know why you think that.
I’d be further interested to know whether you would endorse the statement that your proposed plan would fully bridge that gap.
And if you wouldn’t, I’d ask if that helps illustrate the issue.