I think most people working on prosaic alignment aren't thinking about this problem. Without addressing it, they're working on aligning AI, but not on aligning AGI or ASI. It seems very likely on the current path that we'll soon have AGI that is reflective. In addition, it will do continuous learning, which introduces another route to goal change (e.g., learning that what people mean by "human" mostly applies to some types of artificial minds, too).
The obvious route past this problem, which I think prosaic alignment often assumes without making it explicit, is that humans will remain in charge of how the AGI updates its goals and beliefs. That is, it's banking on corrigible or instruction-following AGI.
I think that's a viable approach, but we should be more explicit about it. Aligning AI probably helps with aligning AGI, but they're not the same thing, so we should try to become more confident that prosaic alignment really does help align a reflectively stable AGI.
Agreed! I tried to say the same thing in The alignment stability problem.
Thanks. (I think we have some ontological mismatches, which hopefully we can discuss later.)