My general thoughts on DeepMind’s strategy can be found in my comment here, which also discusses the impact of RL agentizing an AI more generally. The short answer: I’m a little more concerned than in the case of pre-trained AIs like GPT-4 or GPT-N, and some more alignment work should go toward that scenario, but the reward is likely to be densely defined, such that the AI has limited opportunities to break it via instrumental convergence:
https://www.lesswrong.com/posts/83TbrDxvQwkLuiuxk/?commentId=DgLC43S7PgMuC878j
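To make the “densely defined reward” point a bit more concrete, here is a minimal toy sketch (my own illustration, not taken from the linked comment) contrasting a dense, per-step reward with a sparse, end-of-episode reward in a simple RL-style loop. All names here (`ToyEnv`, `dense_reward`, `sparse_reward`) are invented for this example. The intuition is that a dense signal supervises nearly every action, leaving fewer unconstrained degrees of freedom for the policy, whereas a sparse signal only checks the final outcome.

```python
# Toy illustration of dense vs. sparse reward signals in a simple RL-style loop.
# Everything here is a made-up sketch for intuition, not anyone's actual training setup.
import random


class ToyEnv:
    """A 1-D walk: the agent starts at position 0 and tries to reach position `goal`."""

    def __init__(self, goal=10, max_steps=50):
        self.goal, self.max_steps = goal, max_steps

    def reset(self):
        self.pos, self.t = 0, 0
        return self.pos

    def step(self, action):  # action is -1 or +1
        self.pos += action
        self.t += 1
        done = self.pos == self.goal or self.t >= self.max_steps
        return self.pos, done


def dense_reward(prev_pos, pos, goal):
    # Feedback on every single step: moving toward the goal is rewarded,
    # moving away is penalized, so nearly all behaviour is supervised.
    return 1.0 if abs(goal - pos) < abs(goal - prev_pos) else -1.0


def sparse_reward(pos, goal, done):
    # Feedback only at the end of the episode: a single bit of signal,
    # leaving intermediate behaviour unconstrained.
    return 1.0 if (done and pos == goal) else 0.0


def run_episode(env, policy, use_dense):
    state = env.reset()
    total, done = 0.0, False
    while not done:
        prev = state
        action = policy(state)
        state, done = env.step(action)
        if use_dense:
            total += dense_reward(prev, state, env.goal)
        else:
            total += sparse_reward(state, env.goal, done)
    return total


if __name__ == "__main__":
    env = ToyEnv()
    noisy_policy = lambda s: random.choice([-1, 1])  # a deliberately bad policy
    print("dense return :", run_episode(env, noisy_policy, use_dense=True))
    print("sparse return:", run_episode(env, noisy_policy, use_dense=False))
```

On this toy framing, a sloppy or adversarial policy gets immediate negative feedback under the dense signal, while under the sparse signal almost everything it does goes unpenalized; that is the sense in which a densely defined reward leaves less room for instrumentally convergent exploitation.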
(BTW, I also see this as a problem for Section B.2, as its examples rely on the analogy to evolution, but there are critical details that prevent us from generalizing from “Evolution failed at aligning us to X” to “Humans can’t align AIs to X”. It also incorrectly assumes that corrigibility is anti-natural for consequentialist/expected-utility-maximizing AIs and highly capable AIs; GPT-4 and GPT-N likely do have a utility function, but one that is learned, as described here:
https://www.lesswrong.com/posts/k48vB92mjE9Z28C3s/implied-utilities-of-simulators-are-broad-dense-and-shallow
https://www.lesswrong.com/posts/vs49tuFuaMEd4iskA/one-path-to-coherence-conditionalization
I might have more to say on that post later.)
I’ve replied to AGI Ruin: A List of Lethalities here:
https://www.lesswrong.com/posts/uMQ3cqWDPHhjtiesc/?commentId=Gcigdmuje4EacwirD
It’s a very long comment, since I had to respond to a lot of points, so grab a drink and a snack while you read it.
I endorse the link to that other comment. We’ve got what feels like a useful discussion over there on exactly this issue.
Note that I wrote yesterday about how I’d actually do alignment in practice, such that we can get the densely defined signal of human values/instruction following to hold:
https://www.lesswrong.com/posts/83TbrDxvQwkLuiuxk/?commentId=BxNLNXhpGhxzm7heg