(I’m not sure if you’re just checking your own understanding of the post, or if you’re offering suggestions for how to express the ideas more clearly, or if you’re trying to improve the ideas. If the latter two, I’d also welcome more direct feedback pointing out issues in my use of language or my ideas.)
Sorry, I should have explained. With most posts, there are enough details and examples that when I summarize the post for the newsletter, I’m quite confident that I got the details mostly right. This post was short enough that I wasn’t confident this was true, so I pasted it here to make sure I wasn’t changing the meaning too much.
I suppose you could think of this as a suggestion on how to express the ideas more clearly to me/the audience of the newsletter, but I think that’s misleading. It’s more that I try to use consistent language in the newsletter to make it easier for readers to follow, and the language you use is different from the language I use. (For example, you have short terms like “human safety problems” for a large class of concepts, each of which I spell out in full sentences with examples.)
I think the first paragraph of your rewrite is missing the “obligation” part of my post. It seems that even aligned AI could exacerbate human safety problems (and make the future worse than if we magically or technologically made humans more intelligent), so I think AI designers at least have an obligation to prevent that.
Good point, added it in the newsletter summary.
For the second paragraph, I think under the proposed approach, the AI should start inferring what the idealized humans would say (or calculate how it should optimize for the idealized humans’ values-in-reflective-equilibrium, depending on details of how the AI is designed) as soon as it can, and not wait until the real humans start contradicting each other a lot, because the real humans could all be corrupted in the same direction. Even before that, it should start taking measures to protect itself from the real humans (under the assumption that the real humans might become corrupt at any time in a way that it can’t yet detect). For example, it should resist any attempts by the real humans to change its terminal goal.
Hmm, that’s what I was trying to say. I’ve changed the last sentence of that paragraph to:
But if the idealized humans begin to have different preferences from real humans, then the AI system should ignore the “corrupted” values of the real humans.