Maybe to expand: In order to get a really low loss on an autoregressive training objective, you probably need some sort of intelligence-like or agency-like dynamic. But much more importantly, you need a truly vast amount of knowledge. So most of the explanation for the good performance comes from the knowledge, not from the intelligence-like dynamic.
(Ah, but intelligence is more general, so maybe we’d expect it to show up in lots of datapoints, thereby making up a relatively big chunk of the training objective? I don’t think so, for two reasons: 1) a lot of datapoints don’t really require much intelligence to predict, and 2) there are other things that don’t require much intelligence, like grammar and certain aspects of vocabulary, which do show up across a really big chunk of the data.)
It seems like a full explanation of a neural network’s low loss on the training set needs to rely on lots of pieces of knowledge that it learns from the training set (e.g. “Barack” is usually followed by “Obama”). How do random “empirical regularities” about the training set like this one fit into the explanation of the neural net’s low loss? Is this a correct rephrasing of your question?
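To make the “mostly knowledge” point concrete, here is a toy sketch (made-up corpus, purely illustrative, not anyone’s actual experiment): going from a uniform predictor to memorized unigram and then bigram counts already accounts for most of the drop in autoregressive cross-entropy, with nothing intelligence-like involved.

```python
# Toy sketch: how much of the cross-entropy drop comes from memorized statistics?
import math
from collections import Counter, defaultdict

corpus = ("barack obama met barack obama . the president spoke . "
          "barack obama left .").split()
vocab = sorted(set(corpus))

def avg_loss(prob_of_next):
    """Average next-token cross-entropy (nats/token) over the corpus."""
    pairs = list(zip(corpus, corpus[1:]))
    return sum(-math.log(prob_of_next(prev, nxt)) for prev, nxt in pairs) / len(pairs)

# 1) Know nothing: predict uniformly over the vocabulary.
uniform_loss = avg_loss(lambda prev, nxt: 1.0 / len(vocab))

# 2) Memorize unigram frequencies (pure "knowledge", nothing intelligence-like).
unigram = Counter(corpus)
unigram_loss = avg_loss(lambda prev, nxt: unigram[nxt] / len(corpus))

# 3) Also memorize bigram regularities like "barack" -> "obama".
bigram = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    bigram[prev][nxt] += 1

def bigram_prob(prev, nxt):
    counts = bigram[prev]
    # Light smoothing so unseen continuations don't blow up the loss.
    return (counts[nxt] + 0.1) / (sum(counts.values()) + 0.1 * len(vocab))

bigram_loss = avg_loss(bigram_prob)

print(f"uniform {uniform_loss:.2f} -> unigram {unigram_loss:.2f} -> bigram {bigram_loss:.2f} nats/token")
```

On this toy corpus the memorized bigram counts do almost all of the work; the claim above is that the same kind of bookkeeping, scaled up enormously, is what accounts for most of the real training objective.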
Our current best guess about what an explanation looks like is something like a model of the distribution of neural activations. Such an activation model would end up with empirical regularities baked in, like the fact that “Barack” is usually followed by “Obama”. In other words, just as the neural net learned this empirical regularity of the training set, our explanation will also learn it, and that will be part of the explanation of the neural net’s low loss.
(There’s a lot more to be said here, and our picture of this isn’t fully fleshed out: there are some follow-up questions you might ask to which I would answer “I don’t know”. I’m also not sure I understood your question correctly.)
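A minimal sketch of the baking-in idea, under simplifying assumptions that are mine rather than anything from the actual research: the “network” below is a hand-rolled linear next-token predictor, and the “explanation” is about the crudest activation model imaginable, namely the mean hidden activation per preceding token. Even that crude model ends up containing the regularity that “Barack” is usually followed by “Obama”, and reproduces the network’s low loss on those datapoints.

```python
# Minimal sketch (my own toy construction, not ARC's actual formalism): an
# "explanation" that models the distribution of hidden activations ends up with
# the "Barack" -> "Obama" regularity baked into it.
import numpy as np

rng = np.random.default_rng(0)
vocab = ["barack", "obama", "the", "president", "."]
V, D = len(vocab), 8
idx = {w: i for i, w in enumerate(vocab)}

# Stand-in for a trained next-token predictor: previous token -> hidden
# activation -> logits. The "obama" column of the unembedding is nudged toward
# the "barack" embedding, mimicking a regularity that training would bake in.
W_embed = rng.normal(size=(V, D))
W_unembed = rng.normal(size=(D, V))
barack_dir = W_embed[idx["barack"]] / np.linalg.norm(W_embed[idx["barack"]])
W_unembed[:, idx["obama"]] += 5.0 * barack_dir

def hidden(prev):
    # Per-occurrence noise stands in for everything else in the context.
    return W_embed[idx[prev]] + 0.1 * rng.normal(size=D)

def xent(h, nxt):
    # Cross-entropy of the network's next-token prediction given hidden state h.
    logits = h @ W_unembed
    return np.log(np.exp(logits).sum()) - logits[idx[nxt]]

data = [("barack", "obama")] * 8 + [("the", "president"), ("president", ".")]
hs = [hidden(prev) for prev, _ in data]
network_loss = np.mean([xent(h, nxt) for h, (_, nxt) in zip(hs, data)])

# The "explanation": the crudest possible model of the activation distribution,
# namely the mean hidden activation for each preceding token. The post-"barack"
# mean is the baked-in empirical regularity.
activation_model = {prev: np.mean([h for h, (p, _) in zip(hs, data) if p == prev], axis=0)
                    for prev in {prev for prev, _ in data}}
explained_loss = np.mean([xent(activation_model[prev], nxt) for prev, nxt in data])

print(f"network loss {network_loss:.3f} vs loss implied by activation model {explained_loss:.3f}")
probs = np.exp(activation_model["barack"] @ W_unembed)
print("top prediction after 'barack' under the explanation:",
      vocab[int(np.argmax(probs))])
```

The only point being illustrated is the baking-in: any activation model accurate enough to account for the low loss on the “Barack” datapoints has to contain that regularity, just as the network itself does.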
Yeah, this seems like a reasonable restatement of my question.
I guess my main issue with this approach is that extrapolating the distribution of activations from a dataset isn’t what I’d consider the hard part of alignment. Rather, the hard parts would be:
1) Detecting catastrophic outputs and justifying to others that they are catastrophic. (In particular, I suspect no individual output will be catastrophic on the margin, whether or not a catastrophe eventually occurs. Either the network will consistently avoid giving catastrophic outputs, or it will be harmful consistently enough that localizing the harm to a single output won’t be meaningful.)
2) Learning things about the distribution of inputs that cannot be extrapolated from any dataset. (In particular, the most relevant short-term harm I’ve noticed is stuff like young nerds starting to see the AI as a sort of mentor and then having their questionable ideas excessively validated by that mentor rather than getting appropriate pushback. This would be hard to extrapolate from a dataset, even though it is relatively obvious if you interact with certain people. Whether that counts as “catastrophic” is a complicated question, though.)