You’re adding a lot of extra assumptions here, among them:
there is a problem with having arbitrary goals
it has a pleasure-pain axis
it notices it has a pleasure-pain axis
it cares about its pleasure-pain axis
its pleasure-pain axis is independent of its understanding of the state of the environment
The main problem of inner alignment is making an agent want to do what you want it to do (as opposed to merely understanding what you want it to do), and that remains an unsolved problem.
Although I’m criticizing your specific criticism, my main issue with it is that it describes a very specific failure mode, which is unlikely to appear because it requires a lot of other unlikely things to hold. That said, you’ve provided a good example of WHY inner alignment is a big problem: it’s very hard to keep something following the goals you set for it, especially when it can think for itself and change its mind.