mattmacdermott comments on Is instrumental convergence a thing for virtue-driven agents?

mattmacdermott 2 Apr 2025 15:07 UTC
4 points
2
I think this gets at the heart of the question (but doesn’t consider the other possible answer). Does a powerful virtue-driven agent optimise hard now for its ability to embody that virtue in the future? Or does it just kinda chill and embody the virtue now, sacrificing some of its ability to embody it extra-hard in the future?

I guess both are conceivable, so perhaps I do need to give an argument for why we might expect some kind of virtue-driven AI in the first place, and see which kind that argument suggests.
- Gordon Seidoh Worley 2 Apr 2025 16:43 UTC
  4 points
  0
  Parent
  Yeah, I guess I should be clear that I generally like the idea of building virtuous AI, and maybe somehow this solves some of the problems we have with other designs. The trick is building something that actually implements whatever we think it means to be virtuous. That means getting precise enough about what it means to be virtuous that we can be sure we don’t simply collapse back into the default thing all negative feedback systems do: optimize for their targets as hard as they can (with “can” doing a lot of work here!).