Note that Eliezer and I and probably Tammy would all tell you that “can’t see how it fails” is not a very useful bar to aim for. At the bare minimum, we should have both (1) a compelling positive case that it in fact works, and (2) a compelling positive case that even if we missed something crucial, failure is likely to be recoverable rather than catastrophic.
I already figured that. The point of this question was to ask if there could possibly exist things that look indistinguishable from true alignment solutions (even to smart people), but that aren’t actually alignment solutions. Do you think things like this could exist?
By the way, good luck with your plan. Seeing people actively go out and do actually meaningful work to save the world gives me hope for the future. Just try not to burn out. Smart people are more useful to humanity when their mental health is in good shape.
I’m pretty uncertain on this one. Could a superintelligence find a plan which fools me? Yes. Would such a plan show up early in a search order without something actively trying to fool me? Ehh… harder to say. It’s definitely a possibility I keep in mind. Most importantly, as our understanding improves on the theory side over time, it gets less and less likely that a plan which would fool me shows up early in a natural search order.