High-level response: yes, I agree that “gradient descent retargets the search” is a decent summary; I also agree that what you outline is a plausible failure mode, but it doesn’t justify confidence in doom.
Our understanding of what we actually want is poor, such that we wouldn’t want to optimize for how we understand what we want
I’m not very worried about this. We don’t need to solve all of philosophy and morality; it would be sufficient to have the AI system leave us in control and respect our preferences where they are clear.
We poorly express our understanding of what we actually want in the data we train our models with, such that we wouldn’t want to optimize for the expression of what we want
I agree this is more of an issue, but it’s very unclear to me how badly it will bite us. Does this lead to AI systems that sometimes say what we want to hear rather than what is actually true, but are otherwise nice? Seems mostly fine. Does this lead to AI systems that tamper with all of our sources of information about what is happening in the world, so that things merely appear to be good rather than actually being good? Seems pretty bad. Which of the two (or the innumerable other possibilities) happens? Who knows?
We don’t need to solve all of philosophy and morality; it would be sufficient to have the AI system leave us in control and respect our preferences where they are clear
I agree that we don’t need to solve philosophy/morality if we can at least pin down things like corrigibility. But humans may understand “leaving humans in control” and “respecting human preferences” poorly enough that optimizing for human abstractions of these concepts could be unsafe. (I don’t hold this belief very strongly; I’m just considering some exotic scenarios where humans are technically ‘in control’ according to the specification we wrote down, but the consequences are nonetheless negative: the normal goodharting failure mode.)
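To make that goodharting worry slightly more concrete, here’s a minimal toy sketch in Python. Everything in it is an assumption I’m making up for illustration (the effort budget, the “appearance vs. substance” split, the weights): the written specification can only see observable signals of “control”, producing those signals through appearance is cheaper than producing them through substance, and the only thing that changes across runs is how hard we select on the measured score.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy Goodhart sketch. All names and numbers are illustrative assumptions.
# An action splits a fixed effort budget between substance (actually keeping
# humans in control) and appearance (satisfying the letter of the written
# specification). The spec is scored on observable signals, and appearance
# produces those signals more cheaply than substance does.

def sample_action():
    appearance = rng.uniform(0.0, 1.0)
    substance = 1.0 - appearance          # fixed total effort budget
    return appearance, substance

def proxy_score(appearance, substance):
    # What the specification measures: observable "control" signals.
    return 2.0 * appearance + 1.0 * substance

def true_score(appearance, substance):
    # What we actually wanted: real control, i.e. substance.
    return substance

def true_score_of_proxy_optimum(n):
    """Pick the proxy-maximizing action out of n samples; report the true score."""
    actions = [sample_action() for _ in range(n)]
    best = max(actions, key=lambda a: proxy_score(*a))
    return true_score(*best)

# More selection pressure on the proxy -> the spec looks better and better
# while the thing we actually cared about gets worse.
for n in (1, 10, 100, 10_000):
    print(f"selection pressure n={n:>6}: true score ~ {true_score_of_proxy_optimum(n):.3f}")
```

The knob that matters is the selection pressure n: the same misspecified proxy is roughly harmless under weak optimization and maximally bad under strong optimization, which is the sense in which being technically ‘in control’ according to the spec can still go badly.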
Which of the two (or the innumerable other possibilities) happens?
Depending on the work you’re asking the AI(s) to do (e.g. automating large parts of open-ended software projects, or large portions of STEM work), I’d say the world-takeover/power-seeking/recursive-self-improvement type of scenarios happen, since these tasks incentivize the development of unbounded behaviors: open-ended, project-based work doesn’t have clear deadlines, may require multiple retries, and involves a lot of uncertainty, so I can imagine unbounded behaviors like “gain more resources because that’s broadly useful under uncertainty” being strongly selected for.
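A crude way to see the “resources are broadly useful under uncertainty” selection pressure in miniature (again, the episode structure, distributions, and costs below are assumptions made up for illustration): if subtask requirements are uncertain and heavy-tailed, a policy that first spends part of its budget stockpiling generic resources completes more subtasks on average, so any training signal based on completions will tend to favor that behavior.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy model of "gain more resources because that's broadly useful under
# uncertainty". Every number and distribution here is a made-up assumption.

N_EPISODES = 10_000
SUBTASKS_PER_EPISODE = 5
TIME_BUDGET = 10

def run_episode(stockpile_steps):
    """One open-ended 'project': several subtasks whose resource requirements
    are only revealed on contact. The policy may first spend part of its time
    budget acquiring generic resources."""
    resources = 1.0 + stockpile_steps        # each stockpiling step buys slack
    time_left = TIME_BUDGET - stockpile_steps
    completed = 0
    for _ in range(SUBTASKS_PER_EPISODE):
        if time_left <= 0:
            break
        time_left -= 1
        requirement = rng.lognormal(mean=0.0, sigma=1.0)  # uncertain, heavy-tailed
        if resources >= requirement:
            completed += 1
    return completed

for stockpile_steps in (0, 2, 4):
    avg = np.mean([run_episode(stockpile_steps) for _ in range(N_EPISODES)])
    print(f"stockpile {stockpile_steps} steps first -> avg subtasks completed: {avg:.2f}")
```

This toy obviously stacks the deck (the time budget is generous enough that stockpiling costs almost nothing), so it only shows the shape of the selection pressure, not its strength.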