This sounds like a distinction without a difference
The usual related term is inner alignment, but this isn't about definitions; it's a real potential problem that isn't ruled out by what we've seen of LLMs so far. It could get worse in the future, or it might never become serious. But there is a clear conceptual, and potentially practical, distinction with a difference.
OK, imagine that I make an AI that works like this: a copy of Satan is instantiated, his preferences over possible outputs are ranked by percentile, and sentences are randomly sampled from the 2nd-5th percentile of that ranking (i.e., from among the outputs he would least want). Then that copy of Satan is destroyed.
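To make the shape of the hypothetical concrete, here is a minimal, purely illustrative Python sketch. The `InnerAgent` class, its `preference_percentile` method, and the random scores are all made-up stand-ins for the thought experiment, not a real system or algorithm.

```python
import random

class InnerAgent:
    """Stand-in for the instantiated copy of Satan in the thought experiment."""

    def preference_percentile(self, sentence: str) -> float:
        # Stand-in scoring: where this sentence falls in the agent's preference
        # ranking over outputs (0 = most hated, 100 = most wanted).
        return random.uniform(0, 100)


def satan_reverser(candidate_sentences: list[str]) -> str:
    agent = InnerAgent()  # "a copy of Satan is instantiated"
    try:
        # Keep only outputs the inner agent ranks in its 2nd-5th percentile,
        # i.e. sentences it would strongly prefer never to be emitted.
        dispreferred = [
            s for s in candidate_sentences
            if 2 <= agent.preference_percentile(s) <= 5
        ]
        return random.choice(dispreferred) if dispreferred else ""
    finally:
        del agent  # "then that copy of Satan is destroyed"


print(satan_reverser([f"sentence {i}" for i in range(1000)]))
```

By construction, the outer behavior is roughly the opposite of what the inner agent wants, which is what makes the terminology questions below non-obvious.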
Is the “Satan Reverser” AI misaligned?
Is it “inner misaligned”?
It's not valid to say that there is no distinct inner motivation when there could be one. It might be powerless and unimportant in practice, but it can still exist. The argument that it's powerless and unimportant in practice is distinct from the argument that it doesn't make conceptual sense as a separate construction. If this distinct construction is there, we should ask, and aim to measure, how much influence it gets. Judging by the decades neuroscience has spent on analogous questions, that's a somewhat hopeless endeavor in the medium term.
OK, but as a matter of terminology, is a “Satan reverser” misaligned because it contains a Satan?
I don't have a clear sense of the terminology around the edge cases, or much motivation to care, once the burden of nuance needed to use it correctly stops it from being helpful for communication. I sketched how I think about the situation; which words I, you, or someone else would use to talk about it is a separate issue.