As per our discussions on our other posts, I don’t think we can say that value learning in itself solves the problem. The issue of whether the ASI’s interpretation of its central goal or instructions might change is not automatically solved by adopting that approach. The value mutability problem you link to is a separate issue: I’m not addressing here whether human values might change, but whether an AGI’s interpretation of its central goal/values might change.