The model has the capacity to learn to do this, but it can't just leverage existing capabilities in the foundation model, since that model's performance is limited to that of the best humans it saw in the self-supervised training data. So we need to run RL for many more time steps.
It’s unclear to me whether this is true, because modelling the humans that generated the training data sufficiently well probably requires the model to be smarter than the humans it is modelling. So I expect the current regime, where RLHF just elicits a particular persona from the model rather than teaching it any new abilities, to be sufficient to reach superhuman capabilities.