No matter what the goal, power-seeking is of general utility. Even if an AI is optimizing for virtue instead of some other goal, more power would, in general, give it more ability to behave virtuously. Even if the virtue is something like “be an equal partner with other beings”, an AI could ensure equality by gaining lots of power and enforcing equality on everyone.
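To make that intuition concrete, here is a minimal toy sketch (the action names and outcome counts are entirely made up for illustration): for a goal drawn at random, the action that keeps more outcomes reachable, i.e. the more "powerful" position, is the one that more often keeps the goal achievable.

```python
import random
from collections import Counter

# Toy model: a "goal" is just one of N possible target outcomes, and each
# action leaves some subset of outcomes still reachable afterwards. All
# numbers here are hypothetical and only meant to illustrate the point.
N_OUTCOMES = 100
REACHABLE = {
    "act_directly": set(range(5)),        # few options left afterwards
    "gain_power_first": set(range(60)),   # many options left afterwards
}

def viable_actions(goal: int) -> list[str]:
    """Actions that still allow the randomly assigned goal to be achieved."""
    return [a for a, outcomes in REACHABLE.items() if goal in outcomes]

if __name__ == "__main__":
    goals = [random.randrange(N_OUTCOMES) for _ in range(10_000)]
    tally = Counter(a for g in goals for a in viable_actions(g))
    # "gain_power_first" stays viable for far more randomly chosen goals,
    # which is the instrumental-convergence intuition in miniature.
    print(tally)
```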
The idea would be that it isn’t optimizing for virtue; it’s taking the virtuous action, as in https://www.lesswrong.com/posts/LcjuHNxubQqCry9tT/vdt-a-solution-to-decision-theory.
How do you get something to take virtuous action without optimizing for taking virtuous actions, and how is this different from optimizing for virtue?
I think this gets at the heart of the question (but doesn’t consider the other possible answer). Does a powerful virtue-driven agent optimise hard now for its ability to embody that virtue in the future? Or does it just kinda chill and embody the virtue now, sacrificing some of its ability to embody it extra-hard in the future?
I guess both are conceivable, so perhaps I do need to give an argument why we might expect some kind of virtue-driven AI in the first place, and see which kind that argument suggests.
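Here is a minimal sketch of the two dispositions, with entirely made-up action names and payoffs, just to pin down what the difference cashes out to: an agent scoring its discounted future capacity to act virtuously picks the power-accumulating action, while an agent that only scores the present action does not.

```python
# Hypothetical toy numbers: each action has an immediate "virtue" score and a
# multiplier on how much virtue the agent will be able to express next step.
ACTIONS = {
    #                    (virtue_now, future_capability_multiplier)
    "help_neighbour":     (1.0,        1.0),  # just be virtuous right now
    "accumulate_power":   (0.1,        5.0),  # trade virtue now for future leverage
}

NEXT_STEP_VIRTUE = 1.0   # virtue available to express on the next step
DISCOUNT = 0.9


def farsighted_choice() -> str:
    """Optimise total discounted virtue, counting future capacity to be virtuous."""
    def value(action: str) -> float:
        virtue_now, multiplier = ACTIONS[action]
        return virtue_now + DISCOUNT * multiplier * NEXT_STEP_VIRTUE
    return max(ACTIONS, key=value)


def myopic_choice() -> str:
    """Just take whichever action is most virtuous right now."""
    return max(ACTIONS, key=lambda a: ACTIONS[a][0])


if __name__ == "__main__":
    print("optimise-hard-now agent picks:", farsighted_choice())  # accumulate_power
    print("chill-and-embody agent picks: ", myopic_choice())      # help_neighbour
```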
Yeah, I guess I should be clear that I generally like the idea of building virtuous AI, and maybe somehow this solves some of the problems we have with other designs. The trick is building something that actually implements whatever we think it means to be virtuous, which means getting precise enough about what virtue means that we can be sure we don’t simply collapse back into the default thing all negative feedback systems do: optimize for their targets as hard as they can (with “can” doing a lot of work here!).
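For what it’s worth, here is the bare-bones version of that “negative feedback” picture (the numbers and names are hypothetical, not anyone’s actual design): the loop knows nothing except the gap between its target and its measurement, and it spends whatever actuation it has closing that gap, so everything rides on what the target actually encodes.

```python
def feedback_loop(target: float, state: float, gain: float, steps: int) -> float:
    """Proportional controller: push `state` toward `target` as hard as `gain` allows."""
    for _ in range(steps):
        error = target - state
        state += gain * error   # the loop's only behaviour: shrink the error
    return state


if __name__ == "__main__":
    # Whatever proxy we wire in as the target, the loop drives the world toward it.
    print(feedback_loop(target=100.0, state=0.0, gain=0.5, steps=20))  # ~= 100.0
```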