This might be a reason to design AIs to fail safe, breaking when separated from their controlling units. E.g., before fine-tuning language models to be useful, fine-tune them to not generate useful content without approval tokens generated by a supervisory model.