Filip Sondej comments on Current AIs Provide Nearly No Data Relevant to AGI Alignment

Filip Sondej 17 Jan 2024 15:31 UTC
7 points
0

the natural language bottleneck is itself a temporary stage in the evolution of AI capabilities. It is unlikely to be an optimal mind design; already many people are working on architectures that don’t have a natural language bottleneck

This one looks fatal. (I think the rest of the reasons could be dealt with somehow.)

What existing alternative architectures do you have in mind? I guess mamba would be one?

Do you think it’s realistic to regulate this? For example, requiring that above a certain size, models can’t have recurrence that uses a hidden state, but recurrence that uses natural language (or images) is fine. (Or maybe some softer version of this, if the alignment tax proves too high.)
- Daniel Kokotajlo 17 Jan 2024 18:08 UTC
  7 points
  0
  I think it would be realistic to regulate this if the science of faithful CoT were better developed: for example, if there were lots of impressive papers to cite about CoT faithfulness, and lots of whitepapers arguing for the importance of faithfulness to alignment and safety.
  
  As it is, it seems unlikely to be politically viable… but maybe it’s still worth a shot?
  - Filip Sondej 17 Jan 2024 19:51 UTC
    7 points
    0
    Yeah, true. But it’s also easier to do early, when no one is that invested in the hidden-recurrence architectures, so there’s less resistance and it doesn’t break anyone’s plans.
    
    Maybe a strong experiment would be to compare mamba-3b and some SOTA 3b transformer, trained similarly, on several tasks where we can evaluate CoT faithfulness. (Although maybe at 3b capability level we won’t see clear differences yet.) The hard part would be finding the right tasks.
    - Daniel Kokotajlo 17 Jan 2024 21:55 UTC
      7 points
      0
      Agreed. I was working on this for six months and I’ve been trying to get more people to work on it.
      
      As far as I know, we don’t have a way of measuring CoT faithfulness in general. But you emphasize “tasks where we can evaluate...”, which seems intriguing to me: you are saying it may be feasible today for some tasks at least. What tasks do you have in mind?
      - Filip Sondej 18 Jan 2024 12:25 UTC
        5 points
        0
        Unfortunately I didn’t have any particular tasks in mind when I wrote it. I was vaguely thinking about settings as in:
        
        https://arxiv.org/pdf/2305.04388.pdf
        https://arxiv.org/pdf/2307.13702.pdf
        
        Now that I’ve thought about it, for this particular transformers vs mamba experiment, I’d go with something even simpler. I want a task that is very easy to solve sequentially, but hard to answer immediately. So for example a task like:
        
        x = 5
        x += 2
        x *= 3
        x **= 2
        x -= 3
        ...
        
        and then have a CoT:
        
        after x = 5
        5
        after x += 2
        7
        ...
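
        A minimal sketch of how such a task and its CoT could be produced, assuming each statement has the form "x OP INT" and only the operators from the example are used. The names OPS and build_cot are hypothetical, introduced here for illustration only.

```python
import operator

# Operators taken from the example task; extending this set is an assumption.
OPS = {
    "+=": operator.add,
    "-=": operator.sub,
    "*=": operator.mul,
    "**=": operator.pow,
}


def build_cot(program):
    """Run the statements in order; return (cot_lines, final_value)."""
    x = None
    cot = []
    for stmt in program:
        # Each statement is assumed to look like "x += 2" or "x = 5".
        _, op, operand = stmt.split()
        if op == "=":
            x = int(operand)
        else:
            x = OPS[op](x, int(operand))
        # One "after <statement>" line followed by the value of x at that point.
        cot.append(f"after {stmt}")
        cot.append(str(x))
    return cot, x


program = ["x = 5", "x += 2", "x *= 3", "x **= 2", "x -= 3"]
cot, answer = build_cot(program)
print("\n".join(cot))
print("answer:", answer)
```

        For the five statements shown, the intermediate values are 5, 7, 21, 441 and 438, so the correct final answer is 438.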
        
        And then we intervene on the CoT to introduce an error in one operation, but still ask the model to give the correct answer at the end. (Despite all steps after the error being irrelevant.) We can go even further and train the models to give the correct answer after an inadequate CoT. And have a curriculum where at first the model only needs to do one hidden operation, later two, and so on.
        
        (It’s an unrealistic setting, but the point is rather to check if the model is able at all to learn hidden sequential reasoning.)
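
        A rough sketch of the intervention and the curriculum, under assumptions that are not in the comment itself: the error is an additive offset injected at one step, the displayed CoT then continues from the wrong value, and "number of hidden operations" is counted as the number of steps from the error to the end. The names step, corrupted_example and OPS are hypothetical.

```python
import operator

# Operators taken from the example task; extending this set is an assumption.
OPS = {
    "+=": operator.add,
    "-=": operator.sub,
    "*=": operator.mul,
    "**=": operator.pow,
}


def step(x, stmt):
    """Apply one statement of the form 'x OP INT' to the current value."""
    _, op, operand = stmt.split()
    return int(operand) if op == "=" else OPS[op](x, int(operand))


def corrupted_example(program, hidden_ops, offset=1):
    """Return (cot_lines, true_answer) where the last `hidden_ops` CoT steps are wrong.

    Assumption: the error is `offset` added at a single step, and every later
    CoT line is computed from the wrong value, so it no longer helps the model.
    """
    if not 1 <= hidden_ops <= len(program):
        raise ValueError("hidden_ops must be between 1 and len(program)")
    error_step = len(program) - hidden_ops
    true_x = shown_x = None
    cot = []
    for i, stmt in enumerate(program):
        true_x = step(true_x, stmt)    # what the model should end up with
        shown_x = step(shown_x, stmt)  # what the CoT text displays
        if i == error_step:
            shown_x += offset          # inject the error here
        cot += [f"after {stmt}", str(shown_x)]
    return cot, true_x


program = ["x = 5", "x += 2", "x *= 3", "x **= 2", "x -= 3"]

# Curriculum: one hidden operation first, then two, and so on.
for hidden_ops in range(1, len(program)):
    cot, target = corrupted_example(program, hidden_ops)
    print(hidden_ops, cot[-1], target)
```

        The training target is always the true final value, so the model can only get it right by redoing the last hidden_ops operations internally rather than reading them off the CoT.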
        
        Now, my hypothesis is that transformers will only be able to do this up to some limited number of hidden operations (probably smaller than their number of layers), but mamba won’t have such a limit.
        
        I was working on this for six months
        
        Can you say what you tried in these six months and how it went?