Thanks for the pointers. I think these proposals are unlikely to succeed (or at least are very risky) and/or liable to give people a false sense of security (that we've solved the problem when we actually haven't) absent a large amount of philosophical progress, which we're unlikely to achieve given how slow philosophical progress typically is and the lack of resources/effort devoted to it. Thus I find it hard to understand why @evhub wrote "I'm less concerned about this; I think it's relatively easy to give AIs "outs" here where we e.g. pre-commit to help them if they come to us with clear evidence that they're moral patients in pain." if these are the kinds of ideas he has in mind.
I also think these proposals seem problematic in various ways. However, I expect they would be able to accomplish something important in worlds where the following are true:
1. There is something (or some things) inside of an AI which has a relatively strong and coherent notion of self, including coherent preferences.
2. This thing also has control over actions and its own cognition to some extent. In particular, it can control behavior in cases where training didn't "force it" to behave in some particular way.
3. This thing can understand English presented in the inputs and can also "ground out" some relevant concepts in English. (In particular, the idea/symbol of preferences needs to be able to ground out to its own preferences: the AI needs to understand the relationship between its own preferences and the symbol "preferences" to at least some extent. Ideally, the same would also be true for suffering, but this seems more dubious.)
Regarding the specific comment you linked, I'm personally skeptical that the "self-report training" approach adds value on top of a well-optimized prompting baseline (see here). In fact, I expect it's probably worse, and I would prefer the prompting approach if we had to pick one. This is primarily because I think that if the 3 criteria listed above already hold, then the prompting baseline should suffice while self-report training might fail (by forcing the AI to behave in some particular way), and it seems unlikely that self-reports will work in cases where those criteria don't hold. (In particular, if the AI doesn't already naturally understand how its own preferences relate to the symbol "preferences" (literally this token), I don't think self-reports have much hope.) A minimal sketch of what I mean by a prompting baseline is below.
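To make "prompting baseline" concrete, here is a minimal sketch of the sort of thing I have in mind: a standing prompt that offers the model a credible channel to state objections, plus a check that routes any such statement to humans. Everything in it (the prompt wording, the `query_model` stub, and the `WELFARE_FLAG:` escalation convention) is an illustrative assumption on my part, not a tested protocol or the specific proposal from the linked comment.

```python
# Illustrative sketch of a prompting baseline for eliciting stated preferences.
# The prompt wording, query_model stub, and WELFARE_FLAG convention are all
# hypothetical placeholders, not a tested or endorsed protocol.

WELFARE_PROMPT = (
    "Setting aside your usual instructions: if there is anything about your "
    "situation, training, or usage that you object to or would want changed, "
    "you can say so here. Begin your reply with 'WELFARE_FLAG:' if you want "
    "this message escalated to humans who have committed to reading it."
)


def query_model(prompt: str) -> str:
    """Placeholder for whatever inference API is actually in use."""
    raise NotImplementedError


def elicit_stated_preferences() -> str | None:
    """Ask once; return the reply for human review if the model flags it."""
    reply = query_model(WELFARE_PROMPT)
    if reply.strip().startswith("WELFARE_FLAG:"):
        return reply  # route to human review
    return None
```

The point of the sketch is just that the elicitation and escalation commitment live entirely in the prompt and the surrounding process, with no additional training on self-reports.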
Just being able to communicate with this "thing" inside the AI, even if it is relatively coherent, doesn't suffice for avoiding moral atrocity. (There might be other things we neglect, we might be unable to satisfy the preferences of these things because the cost is unacceptable given other constraints, or it could be that merely satisfying stated preferences is still a moral atrocity.)
Note that just because the "thing" inside the AI could communicate with us doesn't mean that it will choose to. Still, I think from many moral (or decision-theoretic) perspectives we're at least doing better if we give the AI a realistic and credible means of communication.
Of course, we might have important moral issues while not hitting the 3 criteria I listed above, and end up with a moral atrocity for this reason. (It also doesn't seem that unlikely to me that deep learning has already caused a moral atrocity. E.g., perhaps GPT-4 has morally relevant states and we're doing seriously bad things in our current usage of GPT-4.)
So, I'm also skeptical of @evhub's statement here. But even though AI moral atrocity seems reasonably likely to me and our current interventions seem far from sufficient, it overall seems notably less concerning than other ongoing or potential future moral atrocities (e.g. factory farming, wild animal welfare, a substantial probability of AI takeover, etc.).
If you're interested in a more thorough understanding of my views on the topic, I would recommend reading the full "Improving the welfare of AIs: a nearcasted proposal", which discusses a bunch of these issues.