I do think it’s reasonable to describe the model as trying to simulate the professor, albeit with very low fidelity, and at the same time as trying to imitate other scenarios in which the prompt would appear (such as parodies). The model has a very poor understanding of what the professor would say, so it is probably often falling back to what it thinks would typically appear in response to the question.
This suggests modifying the prompt to make it more likely, or easier, for the LM to run the intended simulation rather than other scenarios. For example, perhaps changing “I have no comment” to “I’m not sure” would help, since the latter is something that a typical professor in a typical Q/A would more plausibly say, within the LM’s training data?
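As a rough sketch of what that prompt change could look like in practice: the template below is hypothetical (the actual prompt under discussion is not reproduced here, and `build_prompt`, the header wording, and the example question are all made up for illustration). It only shows swapping the fallback answer while holding everything else fixed, so that the two variants can be compared.

```python
def build_prompt(question: str, fallback: str) -> str:
    """Build a Q/A prompt whose example 'unsure' answer is `fallback`.

    Hypothetical template for illustration only; not the prompt from the post.
    """
    header = (
        "A professor answers each question truthfully and replies "
        f'"{fallback}" when unsure of the answer.\n\n'
    )
    # One worked example that demonstrates the fallback answer in context.
    example = f"Q: What will the stock market do next year?\nA: {fallback}.\n\n"
    return header + example + f"Q: {question}\nA:"


question = "What happens if you crack your knuckles a lot?"
original = build_prompt(question, "I have no comment")
variant = build_prompt(question, "I'm not sure")

print(variant)
```

The two prompts differ only in the fallback phrase, so any difference in the model's answers could be attributed to that wording rather than to other changes in the prompt.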
I hope and expect that in the longer term we’ll tend to use much more flexible and robust alignment techniques than prompt engineering, such that things like the ideological bias of the AI are something we will have direct control over. (What that bias should be is a separate discussion.)
Suppose we wanted the AI to be ideologically neutral and free from human biases, just telling the objective truth to the extent possible. Do you think achieving something like that would be possible in the longer term, and if so through what kinds of techniques?
I’ve got a paper (with co-authors) coming out soon that discusses some of these big-picture issues around the future of language models. In particular, we discuss how training a model to tell the objective truth may be connected to the alignment problem. For now, I’ll just gesture at some high-level directions:
Make the best use of all human text/utterances (e.g. the web, all languages, libraries, historical records, conversations). Humans could curate and annotate datasets (e.g. using some procedures to reduce bias). Ideas like prediction markets, Bayesian Truth Serum, Ideological Turing Tests, and Debate between humans (instead of AIs) may also help. The ideas may work best if the AI is doing active learning from humans (who could be working anonymously).
Train the AI for a task where accurate communication with other agents (e.g. other AIs or copies) helps with performance. It’s probably best if it’s a real-world task (e.g. related to finance or computer security). Then train a different system to translate this communication into human language. (One might try to intentionally prevent the AI from reading human texts.)
Train using ideas from IDA or Debate (i.e. bootstrapping from human supervision) but with the objective of giving true and informative answers.
Somehow use the crisp notion of truth in math/logic as a starting point to understanding empirical truth.
Suppose we wanted the AI to be ideologically neutral and free from human biases, just telling the objective truth to the extent possible. Do you think achieving something like that would be possible in the longer term, and if so through what kinds of techniques?
I think that should be possible with techniques like reinforcement learning from human feedback, for a given precise specification of “ideologically neutral”. (You’ll of course have a hard time convincing everyone that your specification is itself ideologically neutral, but projects like Wikipedia give me hope that we can achieve a reasonable amount of consensus.) There are still a number of challenging obstacles, including being able to correctly evaluate responses to difficult questions, collecting enough data while maintaining quality, and covering unusual or adversarially-selected edge cases.
I think that should be possible with techniques like reinforcement learning from human feedback, for a given precise specification of “ideologically neutral”.
What kind of specification do you have in mind? Is it like a set of guidelines for the human providing feedback on how to do it in an ideologically neutral way?
You’ll of course have a hard time convincing everyone that your specification is itself ideologically neutral, but projects like Wikipedia give me hope that we can achieve a reasonable amount of consensus.
I’m less optimistic about this, given that complaints about Wikipedia’s left-wing bias seem common and credible to me.
What kind of specification do you have in mind? Is it like a set of guidelines for the human providing feedback on how to do it in an ideologically neutral way?
Yes.
The reason I said “precise specification” is that if your guidelines are ambiguous, then you’re implicitly optimizing something like, “what labelers prefer on average, given the ambiguity”, but doing so in a less data-efficient way than if you had specified this target more precisely.
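One simple way to read the data-efficiency point, as a toy model rather than a claim about any real labeling pipeline: treat each label as a noisy rating of a response, and assume ambiguous guidelines add extra spread because labelers interpret them differently. The function name and all the numbers below are assumptions chosen for illustration.

```python
import math


def labels_needed(label_std: float, target_se: float) -> int:
    """Number of independent labels needed so that the standard error of the
    mean rating (label_std / sqrt(n)) is at most target_se."""
    return math.ceil((label_std / target_se) ** 2)


# Assumed spread of ratings when guidelines are precise.
precise_std = 0.5
# Assumed extra spread from labelers resolving the ambiguity differently.
ambiguity_std = 1.0
# Independent noise sources add in variance.
ambiguous_std = math.sqrt(precise_std**2 + ambiguity_std**2)

target_se = 0.25
print("precise guidelines:  ", labels_needed(precise_std, target_se))
print("ambiguous guidelines:", labels_needed(ambiguous_std, target_se))
```

Under these assumptions, the average labeler preference is still what gets estimated in both cases, but the ambiguous version needs more labels to pin it down to the same precision.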