Brute-force alignment by adding billions of tokens of object-level examples of love, kindness, etc. to the dataset. Have the majority of humanity contribute essays, comments, and (later) video.
What would be the reward you’re training the AI on with this dataset? If you’re not careful, you could inadvertently train a learned optimizer, e.g. a “hugging-humans maximizer”, to take a silly example.
That may sound nice but could have torturous results, e.g. the AI forcing humans to hug, or replacing biological humans with server farms housing simulations of quadrillions of humans hugging.
Does there have to be a reward? This is using brute force to create the underlying world model. It’s just adjusting weights, right?
I think there has to be some kind of reward or loss function, in the current paradigm anyway. That’s what gradient descent uses to know which weights to adjust on each update.
Like, what are you imagining the input/output channel of this AI to be? Maybe discussing this a bit would help us clarify.
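To make the loss-function point concrete, here is a minimal sketch of the current paradigm in PyTorch. The toy model, vocabulary size, and random batch are purely illustrative assumptions, not anyone's actual training setup; the point is only that whatever the dataset contains, gradient descent still needs a scalar objective, and for GPT-style models that objective is next-token cross-entropy.

```python
import torch
import torch.nn.functional as F

# Toy "GPT-like" model -- the architecture doesn't matter here, only that
# training needs *some* scalar loss for gradient descent to act on.
vocab_size = 50_000  # illustrative
model = torch.nn.Sequential(
    torch.nn.Embedding(vocab_size, 256),  # token embeddings
    torch.nn.Linear(256, vocab_size),     # project back to vocabulary logits
)
optimizer = torch.optim.Adam(model.parameters(), lr=3e-4)

# One batch drawn from the dataset. Whether these tokens encode essays about
# love and kindness or anything else, the objective below is the same.
tokens = torch.randint(0, vocab_size, (8, 129))  # (batch, seq_len + 1)
inputs, targets = tokens[:, :-1], tokens[:, 1:]

logits = model(inputs)                    # next-token logits
loss = F.cross_entropy(                   # the "reward/loss" in question:
    logits.reshape(-1, vocab_size),       # how badly did we predict
    targets.reshape(-1),                  # the next token?
)
loss.backward()      # gradients say which weights to adjust, and in which direction
optimizer.step()     # the actual weight update
optimizer.zero_grad()
```

On this reading, the brute-force proposal changes what goes into the batch, not the objective itself; the learned-optimizer worry above is about what that objective might end up selecting for inside the model.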
To steelman, I’d guess this idea applies in the hypothetical where GPT-N gains general intelligence and agency (such as via a mesa-optimizer) just by predicting the next token.