Thus, if given the right incentives, it should be “easy” for our AI systems to avoid those kinds of catastrophes: they just need to not do it. To us, this is one of the core reasons for optimism about alignment.
I’m not sure I understand this correctly. Are you saying that one of the main reasons for optimism is that more competent models will be easier to align because we just need to give them “the right incentives”?
What exactly do you mean by “the right incentives”?
Can you illustrate this by means of an example?