… but crucially, the details of the rationalizations aren’t that relevant to this post. Someone who’s flinching away from a hard problem will always be able to find some rationalization. Argue them out of one (which is itself difficult), and they’ll promptly find another. If we want people to not streetlight, then we need to somehow solve the flinching.
Yeah. I agree/concede that you can explain why you can’t convince people that their own work is useless. But if you’re positing that the flinchers flinch away from valid arguments about each category of useless work, that seems surprising.
The flinches aren’t structureless particulars. Rather, they involve warping various perceptions. Those warped perceptions generalize a lot, causing other flaws to be hidden.
As a toy example, you could imagine someone attached to the idea of AI boxing. At first they say it’s impossible for the AI to break out / trick you / learn about the world / whatever. Then you convince them otherwise—that the AI can do RSI internally, and can superhumanly solve computer hacking / protein folding / persuasion / etc. But they are attached to AI boxing. So they warp their perception, clamping “can an AI be very superhumanly capable” to “no”. That clamping causes them to also not see the flaws in the plan “we’ll deploy our AIs in a staged manner, see how they behave, and then recall them if they behave poorly”, because they don’t think RSI is feasible, they don’t think extreme persuasion is feasible, etc.
A more real example is, say, people thinking of “structures for decision making”, e.g. constitutions. You explain that these structures are not reflectively stable. The person flinches away, and now they can’t understand reflective stability in general, so they don’t understand why steering vectors won’t work, or why lesioning won’t work, etc.
Another real but perhaps more controversial example: {detecting deception, retargeting the search, CoT monitoring, lesioning bad thoughts, basically anything using RL} all fail because creativity starts with illegible concomitants to legible reasoning.
(This post seems to be somewhat illegible, but if anyone wants to see more real examples of aspects of mind that people fail to remember, see https://tsvibt.blogspot.com/2023/03/the-fraught-voyage-of-aligned-novelty.html)