Suppose we wanted the AI to be ideologically neutral and free from human biases, just telling the objective truth to the extent possible. Do you think achieving something like that would be possible in the longer term, and if so through what kinds of techniques?
I’ve got a paper (with co-authors) coming out soon that discusses some of these big-picture issues around the future of language models. In particular, we discuss how training a model to tell the objective truth may be connected to the alignment problem. For now, I’ll just gesture at some high-level directions:
1. Make the best use of all human text/utterances (e.g. the web, all languages, libraries, historical records, conversations). Humans could curate and annotate datasets (e.g. using procedures designed to reduce bias). Ideas like prediction markets, Bayesian Truth Serum, Ideological Turing Tests, and Debate between humans (instead of AIs) may also help (a toy Bayesian Truth Serum scorer is sketched after this list). These ideas may work best if the AI is doing active learning from humans (who could be working anonymously).
2. Train the AI for a task where accurate communication with other agents (e.g. other AIs or copies of itself) helps with performance. It’s probably best if this is a real-world task (e.g. related to finance or computer security). Then train a different system to translate this communication into human language (a toy signaling-game version of this idea follows the list). One might intentionally prevent the AI from reading human texts.
3. Train using ideas from IDA or Debate (i.e. bootstrapping from human supervision), but with the objective of giving true and informative answers (a bare-bones debate loop is sketched below).
4. Somehow use the crisp notion of truth in math/logic as a starting point for understanding empirical truth (a small illustration follows the list).
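
To make the first direction a bit more concrete, here is a minimal sketch of Bayesian Truth Serum scoring (Prelec's "surprisingly popular" mechanism) for a single multiple-choice question. The function name, smoothing constant, and toy data are my own illustrative choices, not from any particular implementation.

```python
# Minimal Bayesian Truth Serum (BTS) scoring sketch for one multiple-choice question.
import numpy as np

def bts_scores(answers, predictions, alpha=1.0, eps=1e-9):
    """Score respondents on a single question.

    answers:     (n,) int array, answers[r] is respondent r's chosen option.
    predictions: (n, k) array, predictions[r, j] is respondent r's estimate of
                 the fraction of the population choosing option j (rows sum to 1).
    Returns an (n,) array; higher scores reward answers that are
    "surprisingly common" relative to what respondents predicted.
    """
    n, k = predictions.shape
    # Empirical frequency of each answer (x-bar), smoothed to avoid log(0).
    x_bar = np.bincount(answers, minlength=k) / n + eps
    # Geometric mean of predicted frequencies for each answer (y-bar).
    y_bar = np.exp(np.log(predictions + eps).mean(axis=0))
    # Information score: reward answers more common than the crowd predicted.
    info = np.log(x_bar[answers] / y_bar[answers])
    # Prediction score: reward accurate forecasts of the answer distribution.
    pred = (x_bar * np.log((predictions + eps) / x_bar)).sum(axis=1)
    return info + alpha * pred

# Toy example: 5 respondents, 2 options (0 = "no", 1 = "yes").
answers = np.array([1, 1, 1, 0, 1])
predictions = np.array([
    [0.6, 0.4],
    [0.5, 0.5],
    [0.7, 0.3],
    [0.2, 0.8],
    [0.6, 0.4],
])
print(bts_scores(answers, predictions))
```

The key property is that the score rewards answers that turn out to be more common than respondents predicted, rather than simply rewarding majority opinion, which is why it is sometimes proposed as a truth-eliciting annotation mechanism.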
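For the second direction, a toy version of "accurate communication helps with performance" is a Lewis signaling game: the sender only earns reward if its message lets the receiver recover a hidden state, and a separate translator is produced afterwards. In a real setup the translator would itself be trained from human annotations of the emergent messages; here we just read the learned protocol off the tables. The state space, learning rule, and human labels are all illustrative assumptions.

```python
# Toy emergent-communication sketch: two tabular agents learn a signaling
# protocol because it is the only way to earn reward, then the messages are
# mapped to (hypothetical) human labels.
import numpy as np

rng = np.random.default_rng(0)
N_STATES, N_MESSAGES = 4, 4

# Q-tables: sender maps state -> message, receiver maps message -> action.
q_sender = np.zeros((N_STATES, N_MESSAGES))
q_receiver = np.zeros((N_MESSAGES, N_STATES))

def choose(q_row, eps=0.1):
    # Epsilon-greedy choice over one row of a Q-table.
    if rng.random() < eps:
        return rng.integers(len(q_row))
    return int(np.argmax(q_row))

# Train: reward 1 when the receiver's action matches the sender's hidden state.
for _ in range(5000):
    state = rng.integers(N_STATES)
    msg = choose(q_sender[state])
    action = choose(q_receiver[msg])
    reward = 1.0 if action == state else 0.0
    q_sender[state, msg] += 0.1 * (reward - q_sender[state, msg])
    q_receiver[msg, action] += 0.1 * (reward - q_receiver[msg, action])

# "Translator": pair each state's most likely message with that state's
# (hypothetical) human-language label.
labels = ["red", "green", "blue", "yellow"]
translation = {int(np.argmax(q_sender[s])): labels[s] for s in range(N_STATES)}
print(translation)  # e.g. {2: 'red', 0: 'green', ...} -- message id -> human label
```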
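For the third direction, here is a bare-bones skeleton of a Debate-style protocol, included only to show the shape of the loop that would supply the training signal. The `debater` and `judge` arguments are hypothetical stand-ins for language-model calls or human judges, not a real training setup.

```python
# Skeleton of a two-player debate: debaters alternate arguments for opposing
# answers; the judge's verdict is the signal used to train toward true answers.
from typing import Callable, List, Tuple

def run_debate(question: str,
               answers: Tuple[str, str],
               debater: Callable[[str, str, List[str]], str],
               judge: Callable[[str, Tuple[str, str], List[str]], int],
               n_rounds: int = 3) -> int:
    """Alternate arguments for answers[0] and answers[1]; return the judge's pick."""
    transcript: List[str] = []
    for _ in range(n_rounds):
        for side in (0, 1):
            transcript.append(debater(question, answers[side], transcript))
    return judge(question, answers, transcript)

# Trivial stand-ins so the skeleton runs end to end.
def toy_debater(question, answer, transcript):
    return f"I argue that the answer is {answer!r}."

def toy_judge(question, answers, transcript):
    return 0  # A real judge (human or trained model) would read the transcript.

winner = run_debate("Is the Earth round?", ("yes", "no"), toy_debater, toy_judge)
print("Judge picked answer:", winner)
```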
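Finally, for the fourth direction, the contrast between "crisp" logical truth and messy empirical truth can be made concrete: given a propositional formula and an assignment of truth values, the formula's truth is fully determined. The tuple encoding below is just a convenient assumption for the sketch.

```python
# Tiny propositional-logic evaluator: truth is completely fixed by the assignment.
def eval_formula(formula, assignment):
    """Evaluate a formula given a dict of variable truth values.

    Formulas are nested tuples: ("not", f), ("and", f, g), ("or", f, g),
    ("implies", f, g), or a variable name (str).
    """
    if isinstance(formula, str):
        return assignment[formula]
    op, *args = formula
    if op == "not":
        return not eval_formula(args[0], assignment)
    if op == "and":
        return eval_formula(args[0], assignment) and eval_formula(args[1], assignment)
    if op == "or":
        return eval_formula(args[0], assignment) or eval_formula(args[1], assignment)
    if op == "implies":
        return (not eval_formula(args[0], assignment)) or eval_formula(args[1], assignment)
    raise ValueError(f"unknown operator: {op}")

# (p and (p implies q)) implies q is true under every assignment: a tautology.
formula = ("implies", ("and", "p", ("implies", "p", "q")), "q")
print(all(eval_formula(formula, {"p": p, "q": q})
          for p in (True, False) for q in (True, False)))  # True
```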