I don’t think my proposed strategy is analogous to that, but I’ll answer in good faith just in case.
If that description of a strategy is knowingly abstract relative to the strategy’s full concrete details, then it may or may not turn out to describe a good strategy, and it may or may not be an accurate description of the strategy and its consequences.
If there is no concrete strategy that could be made explicit and that the abstract statement is describing, then the statement appears merely to re-pose the problem of AI alignment, and it gets us nowhere.
Surely spelling out the full concrete details of the strategy is not much different from “putting forth as-good-as-human definitions, finding objections to them, and then improving the definitions based on considered objections.” I at least don’t see why the same mechanism couldn’t be used here (i.e., apply this definition iteration to the word “good” and have the AI do that, then apply it to “bad” and have the AI avoid that). If you see it as a different thing, can you explain why?
It’s much easier to get safe, effective definitions of ‘reason’, ‘hopes’, ‘worries’, and ‘intuitions’ on first tries than to get a safe and effective definition of ‘good’.
I’d be interested to know why you think that.
I’d be further interested to know whether you would endorse the statement that your proposed plan would fully bridge that gap.
And if you wouldn’t, I’d ask if that helps illustrate the issue.