dxu comments on My Objections to “We’re All Gonna Die with Eliezer Yudkowsky”

dxu 22 Mar 2023 3:51 UTC
12 points
10
The important point of the tests in the Pretraining from Human Feedback paper, and the AI saying nice things, is that they show that we can align AI to any goal we want
I don’t see how the bolded follows from the unbolded, sorry. Could you explain in more detail how you reached this conclusion?
- Noosphere89 22 Mar 2023 12:28 UTC
  1 point
  −1
  The point is that similar techniques can be used to align AI to either goal, since both goals (or arguably all goals) are functionally arbitrary in what we pick, and important to us.
  
  One major point I did elide is the amount of power-seeking involved: the niceness goal involves almost none, unlike the existential risk scenarios we are concerned about.
  
  But some of the alignment tests in the Pretraining from Human Feedback paper showed that models can be made to avoid certain power-seeking actions, like getting personally identifying information.
  
  In essence, it’s at least some evidence that, as AI gets more capable, we can make sure power-seeking actions are avoided when they are misaligned with human interests.