I proposed a strategy for an aligned AI that involves it terminally valuing following the steps of a game: talking with us about morality, creating moral theories with the fewest paradoxes, creating plans prescribed by those theories, and getting approval for the plans.
You objected that my words-for-concepts were vague.
I replied that near-future AIs could make as-good-as-human-or-better definitions, and that the process of [putting forward as-good-as-human definitions, finding objections for them, and then improving the definition based on considered objections] was automatable.
You said the AI could come up with many more objections than you would.
I said, “okay, good.” I will add right now: just because it considers an objection doesn’t mean the current definition has to be rejected; it can decide that the objections are not strong enough, or that its current definition is the one with the fewest/weakest objections.
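The refinement process described above (propose a definition, collect objections, revise only when an objection is judged strong enough) can be sketched as a simple loop. Everything here is illustrative: `query_model` is a hypothetical stand-in for a language-model call, stubbed below with canned responses so the sketch runs at all; the names and prompts are my own, not part of the original proposal.

```python
def refine_definition(term, query_model, max_rounds=5):
    """Iteratively refine a definition of `term`: propose one, collect
    objections, and revise only when an objection is judged strong enough.
    `query_model` is a hypothetical language-model interface."""
    definition = query_model(f"Define '{term}'.")
    for _ in range(max_rounds):
        # Ask the model to enumerate objections to the current definition.
        objections = query_model(f"List objections to: {definition}")
        # Considering an objection does not force rejection; keep only
        # objections the model judges strong.
        strong = [o for o in objections
                  if query_model(f"Is this objection strong? {o}") == "yes"]
        if not strong:
            break  # no strong objections remain; keep the current definition
        definition = query_model(
            f"Revise '{definition}' to address: {strong}")
    return definition


# Stub standing in for a real model call (purely for illustration).
def stub_model(prompt):
    if prompt.startswith("Define"):
        return "a preliminary definition"
    if prompt.startswith("List objections"):
        return []  # the stub finds no objections
    return "no"


print(refine_definition("good", stub_model))  # prints "a preliminary definition"
```

The design choice worth noting is the `strong` filter: it encodes the point above that the loop terminates not when objections run out, but when no remaining objection is judged strong enough to warrant revision.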
Now I think you’re saying something like: it doesn’t matter whether the AI can come up with great definitions if it isn’t aligned, and my plan won’t work either way. But if it can come up with such objection-tested definitions, then you seem to lack any explicitly stated objections to my alignment strategy.
Alternatively, you are saying that an AI can’t make great definitions unless it is aligned, which I think is just plainly wrong; I think getting an unaligned language model to make as-good-as-human definitions is somewhere around as difficult as getting an unaligned language model to hold a conversation. “What is the definition of X?” is about as hard a question as “In which country can I find Mount Everest?” or “Write me a poem about the Spring season.”
Let me ask you this. Why is “Have the AI do good things, and not do bad things” a bad plan?
I don’t think my proposed strategy is analogous to that, but I’ll answer in good faith just in case.
If that description of a strategy is knowingly abstract relative to the full concrete details of the strategy, then it may or may not turn out to describe a good strategy, and it may or may not describe the strategy and its consequences accurately.
If there is no concrete, explicitly statable strategy that the abstract statement describes, then the statement is merely restating the problem of AI alignment, and it gets us nowhere.
Surely creating the full concrete details of the strategy is not much different from “putting forward as-good-as-human definitions, finding objections to them, and then improving the definition based on considered objections.” I at least don’t see why the same mechanism couldn’t be used here (i.e. apply this definition iteration to the word “good” and have the AI do that, then apply it to “bad” and have the AI avoid that). If you see it as a different thing, can you explain why?
It’s much easier to get safe, effective definitions of ‘reason’, ‘hopes’, ‘worries’, and ‘intuitions’ on first tries than to get a safe and effective definition of ‘good’.
I’d be interested to know why you think that.
I’d be further interested in whether you would endorse the statement that your proposed plan fully bridges that gap.
And if you wouldn’t, I’d ask if that helps illustrate the issue.
Because that’s not a plan, it’s a property of a solution you’d expect the plan to have. It’s like saying “just keep the reactor at the correct temperature”. The devil is in the details of getting there, and there are lots of subtle ways things can go catastrophically wrong.
Exactly. I notice you aren’t who I replied to, so the canned response I had won’t work. But perhaps you can see why most of his objections to my objections would apply to objections to that plan?
I was just responding to something I saw on the main page. No context for the earlier thread. Carry on lol.