You’re adding a lot of extra assumptions here, among them:
there is a problem with having arbitrary goals
it has a pleasure-pain axis
it notices it has a pleasure-pain axis
it cares about its pleasure-pain axis
its pleasure-pain axis is independent of its understanding of the state of the environment
The main problem of inner alignment is making an agent want to do what you want it to do (as opposed to merely understanding what you want it to do), and that remains an unsolved problem.
Although I’m criticizing your specific criticism, my main issue with it is that it describes a very specific failure mode, which is unlikely to appear because it requires a lot of other unlikely things to hold. That said, you’ve provided a good example of WHY inner alignment is a big problem: it’s very hard to keep something following the goals you set for it, especially when it can think for itself and change its mind.