What would be the reward you’re training the AI on with this dataset? If you’re not careful, you could inadvertently train a learned optimizer, e.g. a “hugging humans maximizer”, to take a silly example.
That may sound nice but could have torturous results, e.g. the AI forcing humans to hug, or replacing biological humans with server farms housing simulations of quadrillions of humans hugging.
Does there have to be a reward? This is using brute force to create the underlying world model. It’s just adjusting weights, right?
I think there has to be some kind of reward or loss function, in the current paradigm anyway. That’s what gradient descent uses to know which weights to adjust on each update (see the toy sketch below).
Like, what are you imagining as the input/output channel of this AI? Maybe discussing that a bit would help us clarify.
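To make that concrete, here’s a minimal toy sketch (plain NumPy linear regression, nothing GPT-specific; the data, weights, and learning rate are made up for illustration). The point is just that the update step has nothing to follow without a loss: the gradient of the loss is what tells it which way to move each weight.

```python
# Toy sketch: gradient descent only "knows" which weights to adjust
# because a loss function scores the current outputs against targets.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))           # toy inputs
true_w = np.array([1.0, -2.0, 0.5])     # weights we hope to recover
y = X @ true_w                          # toy targets

w = np.zeros(3)                         # weights being trained
lr = 0.1
for step in range(200):
    pred = X @ w
    loss = np.mean((pred - y) ** 2)       # the loss signal
    grad = 2 * X.T @ (pred - y) / len(y)  # gradient of loss w.r.t. weights
    w -= lr * grad                        # no loss -> no gradient -> no update

print(w)  # approaches true_w
```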
To steelman, I’d guess this idea applies in the hypothetical where GPT-N gains general intelligence and agency (such as via a mesa-optimizer) just by predicting the next token.
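To spell out what “just by predicting the next token” means mechanically, here’s a rough sketch assuming a PyTorch-style setup (the tiny nn.Sequential is only a stand-in for a real transformer stack, and all sizes here are made up). The entire outer objective is cross-entropy on the next token, so any agency or goal-directedness would have to emerge inside the learned weights, i.e. as a mesa-optimizer, rather than being specified in the training signal.

```python
# Sketch of the next-token-prediction objective, assuming a PyTorch-style
# language model. The only training signal is cross-entropy on token t+1.
import torch
import torch.nn as nn

vocab_size, d_model = 100, 32
model = nn.Sequential(
    nn.Embedding(vocab_size, d_model),
    nn.Linear(d_model, vocab_size),      # stand-in for a transformer stack
)
opt = torch.optim.SGD(model.parameters(), lr=0.1)

tokens = torch.randint(0, vocab_size, (1, 16))   # toy token sequence
inputs, targets = tokens[:, :-1], tokens[:, 1:]  # predict token t+1 from token t

logits = model(inputs)                           # (1, 15, vocab_size)
loss = nn.functional.cross_entropy(
    logits.reshape(-1, vocab_size), targets.reshape(-1)
)                                                # the entire outer objective
loss.backward()
opt.step()                                       # weights nudged toward better prediction
```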