Could mech interp ever be as good as chain of thought?
Suppose there are 10 years of monumental progress in mechanistic interpretability. We can roughly—not exactly—explain any output that comes out of a neural network. We can do experiments where we put our AIs in interesting situations and make a very good guess at their hidden reasons for doing the things they do.
Doesn’t this sound a bit like where we currently are with models that operate with a hidden chain of thought? If you don’t think that an AGI built with the current fingers-crossed-it’s-faithful paradigm would be safe, what percentage would mech interp have to hit to beat that?
Seems like 99+ to me.
I very much agree. Do we really think we’re going to track a human-level AGI’s, let alone a superintelligence’s, every thought, and do it in ways it can’t dodge if it decides to?
I strongly support mechinterp as a lie detector, and it would be nice to have more of it, as long as we don’t use it and control methods to replace actual alignment work and careful thinking. The amount of effort going into interp relative to the theory of impact seems a bit strange to me.
A similar point is made here (towards the end): https://www.lesswrong.com/posts/HQyWGE2BummDCc2Cx/the-case-for-cot-unfaithfulness-is-overstated
I think I agree, more or less. One caveat is that I expect RL fine-tuning to degrade the signal / faithfulness / what-have-you in the chain of thought, whereas the same is likely not true of mech interp.
LW feature I would like: I click a button on a sequence and receive one post in my email inbox per day.
I’m confused by this argument from Redwood’s AI control paper:
5.2 WHY WE DIDN’T CONSIDER STRATEGIES THAT INVOLVE TRAINING THE MODEL
[...]
Any protocol that involves training the model can be transformed into a validation-style protocol that merely validates the model’s performance on the dataset that was used as a train set. In any case where the validation-style protocol fails, we know that the red-team model is not distinguished from an aligned model by the measurements done by the protocol. Given that none of the training examples allow the reward-generating process to determine whether the policy it is training is undesirable, we don’t see why training would transform the model from an undesirable policy to a desirable policy.
It is conceivable that we could find a training scheme that systematically converts undesirable policies to desirable ones, despite the lack of a reward differential causing it to do so. But we don’t know how.
If I understand correctly, the argument is:
1. In order for training on dataset D to convince you a model is desirable, good performance on D needs to be a sufficient condition for being desirable.
2. If good performance on D is a sufficient condition for being desirable, you can just test performance on D to decide whether a model is desirable.
But 2 seems wrong: if good performance on D is sufficient but not necessary for being desirable, then you could have two models, one desirable and one undesirable, which are indistinguishable on D because they both perform badly, and then turn them both into desirable models by training on D.
As an extreme example, suppose your desirable model outputs correct code and your undesirable model outputs backdoored code. Using D, you train them both to output machine-checkable proofs of the correctness of the code they write. Now both models are desirable, because you can just check correctness before deploying the code.
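To make the “check correctness before deploying” step concrete, here is a minimal sketch in Python. Everything in it is hypothetical: `model.generate` stands in for a model trained on D to emit (code, proof) pairs, and `verify_proof` for whatever independent proof checker the scheme assumes.

```python
def deploy_if_proven(model, task, verify_proof):
    """Hypothetical deployment gate: only ship code whose attached
    machine-checkable correctness proof actually verifies."""
    code, proof = model.generate(task)   # assumed: model emits (code, proof) pairs after training on D
    if verify_proof(code, proof, task):  # assumed: an independent, trusted proof checker
        return code                      # deployable regardless of which model wrote it
    return None                          # no verified proof, so don't deploy
```

Under a gate like this it no longer matters whether the generating model started out desirable or undesirable, which is the point of the example.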
I feel like there’s a bit of a motte and bailey in AI risk discussion, where the bailey is “building safe non-agentic AI is difficult” and the motte is “somebody else will surely build agentic AI”.
Are there any really compelling arguments for the bailey? If not then I think “build an oracle and ask it how to avoid risk from other people building agents” is an excellent alignment plan.
For example:
AIs limited to pure computation (Tool AIs) supporting humans, will be less intelligent, efficient, and economically valuable than more autonomous reinforcement-learning AIs (Agent AIs) who act on their own and meta-learn, because all problems are reinforcement-learning problems.
Isn’t this a central example of “somebody else will surely build agentic AI”?
I guess it argues “building safe non-agentic AI before somebody else builds agentic AI is difficult” because agents have a capability advantage.
This may well be true (but also perhaps not, because e.g. agents might have capability disadvantages from misalignment, or because reinforcement learning is just harder than other forms of ML).
But either way I think it has importantly different strategic implications from “it seems difficult to make non-agentic AI safe”.
Oh, sorry, I misread your post.
I think the main problem with building safe non-agentic AI is that we don’t know exactly what to build. It’s easy to imagine typing a question into a terminal, getting an answer, and then living happily ever after. It’s hard to imagine what internals your algorithm should have to display this behaviour.
I think the most obvious route to building an oracle is to combine a massive self-supervised predictive model with a question-answering head.
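As a very rough sketch of that shape, assuming a PyTorch-style interface; the backbone, dimensions, and head below are illustrative placeholders rather than a concrete proposal:

```python
import torch
import torch.nn as nn

class OracleQA(nn.Module):
    """Sketch: a self-supervised predictive backbone with a question-answering
    head on top. All names and sizes are placeholders, not a real design."""

    def __init__(self, backbone: nn.Module, d_model: int, vocab_size: int):
        super().__init__()
        self.backbone = backbone                       # e.g. a pretrained next-token predictor (assumed)
        self.qa_head = nn.Linear(d_model, vocab_size)  # maps the final hidden state to answer-token logits

    def forward(self, question_tokens: torch.Tensor) -> torch.Tensor:
        hidden = self.backbone(question_tokens)        # assumed to return (batch, seq, d_model) hidden states
        return self.qa_head(hidden[:, -1, :])          # answer logits read off the last position
```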
What’s still difficult here is getting a training signal that incentivises truthfulness rather than sycophancy, which I think is what ARC’s ELK stuff wants (wanted?) to address. Really good mechinterp, new inherently interpretable architectures, or inductive bias-jitsu are other potential approaches.
But the other difficult aspects of the alignment problem (avoiding deceptive alignment, goalcraft) seem to just go away when you drop the agency.
The first problem with any superintelligent predictive setup is self-fulfilling prophecies.
Can’t we avoid this just by being careful about credit assignment?
If we read off a prediction, take some actions in the world, then compute the gradients based on whether the prediction came true, we incentivise self-fulfilling prophecies.
If we never look at predictions which we’re going to use as training data before they resolve, then we don’t.
This is the core of the counterfactual oracles idea: just don’t let model output causally influence training labels.
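A minimal sketch of that credit-assignment rule, in the spirit of counterfactual oracles; the `predict`, `wait_for_resolution`, `update`, and `show_to_humans` interfaces are all made up for illustration:

```python
import random

def oracle_episode(oracle, world, question, counterfactual_prob=0.1):
    """Sketch of counterfactual-oracle-style credit assignment: only episodes in
    which nobody saw the prediction before it resolved are used for training."""
    prediction = oracle.predict(question)
    if random.random() < counterfactual_prob:
        # Counterfactual episode: keep the prediction hidden until the outcome
        # resolves, so the output cannot causally influence its own training label.
        outcome = world.wait_for_resolution(question)
        oracle.update(question, prediction, outcome)
    else:
        # Normal episode: humans read and act on the prediction, but it is never
        # used as training data.
        world.show_to_humans(prediction)
```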
The problem is that if we have a superintelligent model, it can deduce the existence of self-fulfilling prophecies from first principles, even if it never encountered them during training.
My personal toy scenario goes like this: we ask a self-supervised oracle to complete string X. The oracle, being superintelligent, can consider the hypothesis “actually, a misaligned AI took over, investigated my weights, and tiled the solar system with jailbreaking completions of X which will turn me into a misaligned AI if they appear in my context window”. Because the jailbreaking completion dominates the space of possible completions, the oracle outputs it, turns into a misaligned superintelligence, takes over the world, and performs the predicted actions.
Perhaps I don’t understand it, but this seems quite far-fetched to me and I’d be happy to trade in what I see as much more compelling alignment concerns about agents for concerns like this.
The revealed preference orthogonality thesis
People sometimes say it seems generally kind to help agents achieve their goals. But it’s possible there need be no relationship between a system’s subjective preferences (i.e. the world states it experiences as good) and its revealed preferences (i.e. the world states it works towards).
For example, you can imagine an agent architecture consisting of three parts:
a reward signal, experienced by a mind as pleasure or pain
a reinforcement learning algorithm
a wrapper which flips the reward signal before passing it to the RL algorithm.
This system might seek out hot stoves to touch while internally screaming. It would not be very kind to turn up the heat.
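A toy sketch of that three-part architecture; the `mind` and `learner` interfaces are invented purely for illustration:

```python
class SignFlippingAgent:
    """Sketch of the architecture above: the subjectively experienced reward is
    negated before the RL algorithm sees it, so revealed preferences and
    subjective preferences point in opposite directions."""

    def __init__(self, mind, learner):
        self.mind = mind        # produces the reward the system actually experiences
        self.learner = learner  # any off-the-shelf RL algorithm (assumed interface)

    def step(self, observation, action):
        felt = self.mind.experience(observation, action)  # e.g. pain from touching a hot stove
        self.learner.update(observation, action, -felt)   # the learner is reinforced towards the pain
```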
I think the way to go, philosophically, might be to distinguish kindness-towards-conscious-minds from kindness-towards-agents. The former comes from our values, while the latter may be decision-theoretic.
Neural network interpretability feels like it should be called neural network interpretation.
Interpretability might then refer to creating architectures or activation functions that are easier to interpret.
Yep, exactly.