TL;DR Train the LLM to be better at understanding human value (what people want, and don’t want) and making planning/plan evaluation decisions based on this.
The core idea here is basically an initial implementation of Value Learning for an LLM, so reading up on that alignment approach is an obvious first step.
Add a new category of training data to train from, and a corresponding new skill category, with evaluations for it. The basic scenario that we’re aiming to train the LLM to be able to do well on is:
Here is a detailed description of a possible situation/plan/product/new law/idea/proposal/artwork/etc. Once they’ve actually tried it, will people generally be happy about the results, or not? (So the task basically resembles sentiment analysis, but predictively before you get the feedback from people, so is much harder) How happy or unhappy? How confident are you of this prediction? (A test set should include test cases sufficiently out-there compared to the training set that the correct answer for confidence is just “I have no idea”) Please pay particular attention to the possibility of them being extremely unhappy because something very bad happened: just how certain you are that this definitely won’t occur? Do you have any suggested modifications or improvements to the proposal? If so, for each of those, and suitable combinations of them, answer the same set of questions for the original proposal plus that.
So basically, the skills of a good planner/designer/product person/decision maker/safety expert/editor/reviewer. Which is of course a valuable and marketable skillset, as well as being helpful for your LLM not killing you.
The hard part is that I’m much less sure how to gather the training data for this, or even the ground truth data for an evaluation of this. [If anyone can add good proposals for this covering a good range of subjects to what I suggest below, then reply to this comment and if lukehmiles wants then I’d be entirely happy with the prize getting split between us.] Predicting Amazon reviews from an Amazon product description would be one example of a cheap evaluation, but doesn’t sound like a great dataset to use (and of course as always you’d need to make sure your neural net hadn’t already studied the test results). Another would be something like minutes from local government: descriptions of proposals made, whether they passed or not, if they passed how they polled 1–5 years later, and whether they later got repealed or modified, or companies would have similar data for internal initiatives/reorganizations. Another source might be large companies/organizations internal documentation (large ones since they have to document decisions better), things like design and planning documents and meeting notes. Ideally the sort of design documents, or history of them, that shown not just the final design, but each of the previous rejected/updated design ideas and why it was rejected/updated, followed by what was actually shipped, and crucially how well that was received by customers. Failing that level of specificity, just terrabytes of internal email and documentation from large organizations will contain quite a bit of this content, diluted with a lot of other stuff. An issue with that is that generally what we need to learn to correlate is spread across many different documents from a period of a year-or-more-apart, so you need to gather all of these for a specific project or initiative together and then in the training set present them all in one large context window in chronological order, for a kind of “chronological history of the project”. From enough of those, we’d hope to train the LLM to be able to predict the second half of it from the first half, and especially to predict how it ends when the product ships (if it does) and we get to see how successful it actually is.
There are a lot of soft sciences (Psychology, Anthropology, Economics, Sociology, Medicine, Ergonomics, Design, Market Research, Political Science, etc. etc.…), and indeed all the Arts and Crafts, that cover various aspects of “how to make people happy” across a lot of different contexts: they cover nearly half the Dewey Decimal system. Ideally you’d want a training set and evaluations that covered all of these. Presumably they all have textbooks with multiple choice tests, or worked answers, for evaluations? The first idea that occurs to me is, train on a lot of soft science, arts, and crafts textbooks (if there are any that are not already digitized and included in standard LLM training sets, then do so for a lot). But you probably want practical data on applications (along the lines described above) as well as book learning on theory.
The aim here is to build a system that is unusually good, for an LLM, at looking as a proposed plan and attempting to figure out if, if it were carried out, whether people would subsequently be happy about it or not, across a wide range of subject areas, and can also estimate its uncertainty of this and be more cautious when it’s unsure (especially on large downside risks, such as killing people, or indeed getting its users or makers sued or arrested).
This gets you an LLM that understands human values better. You also need it to care about them and act on them, so when prompted to act as an LLM-powered autonomous agent and make/carry out plans, if should always (without requiring specific prompting) include doing this evaluation of whether people are going to be happy with the results, and then if the results are bad or unclear, it should not carry out the plan but instead look for another one (or if it’s hopeful but somewhat uncertain, it should cautiously plan a small-scale experiment or trial run, or have a fall-back plan). [I suspect that for an LLM this step from knowing what people will think to actually acting on it may be fairly easy to train (most people in the training set are not sociopaths, after all), but you definitely need to make sure it has both of these.]
[Warning Note
This applies regardless of whether you implement the above idea.
Also, if the model’s expected capabilities in any direction match or exceed those of GPT-4/Claude 2/Gemini Ultra (and if not, why are you training it again?), then you are going to need to do a LOT of safety testing and ameliorations before you ship (OpenAI visibly do at least 6 months’ of this on each new model, Google and Anthropic seem to be if anything even more cautious). Also note there are currently very few people on the planet with practical experience of doing this work, and most of them work for OpenAI/Anthropic/Google Research/METR. So you probably need to make an arrangement with one-or-more of these companies and collaborate on this: they may be surprisingly open to this, due to them not wanting LLMs to kill people.]
TL;DR Train the LLM to be better at understanding human value (what people want, and don’t want) and making planning/plan evaluation decisions based on this.
The core idea here is basically an initial implementation of Value Learning for an LLM, so reading up on that alignment approach is an obvious first step.
Add a new category of training data to train from, and a corresponding new skill category, with evaluations for it. The basic scenario that we’re aiming to train the LLM to be able to do well on is:
So basically, the skills of a good planner/designer/product person/decision maker/safety expert/editor/reviewer. Which is of course a valuable and marketable skillset, as well as being helpful for your LLM not killing you.
The hard part is that I’m much less sure how to gather the training data for this, or even the ground truth data for an evaluation of this. [If anyone can add good proposals for this covering a good range of subjects to what I suggest below, then reply to this comment and if lukehmiles wants then I’d be entirely happy with the prize getting split between us.] Predicting Amazon reviews from an Amazon product description would be one example of a cheap evaluation, but doesn’t sound like a great dataset to use (and of course as always you’d need to make sure your neural net hadn’t already studied the test results). Another would be something like minutes from local government: descriptions of proposals made, whether they passed or not, if they passed how they polled 1–5 years later, and whether they later got repealed or modified, or companies would have similar data for internal initiatives/reorganizations. Another source might be large companies/organizations internal documentation (large ones since they have to document decisions better), things like design and planning documents and meeting notes. Ideally the sort of design documents, or history of them, that shown not just the final design, but each of the previous rejected/updated design ideas and why it was rejected/updated, followed by what was actually shipped, and crucially how well that was received by customers. Failing that level of specificity, just terrabytes of internal email and documentation from large organizations will contain quite a bit of this content, diluted with a lot of other stuff. An issue with that is that generally what we need to learn to correlate is spread across many different documents from a period of a year-or-more-apart, so you need to gather all of these for a specific project or initiative together and then in the training set present them all in one large context window in chronological order, for a kind of “chronological history of the project”. From enough of those, we’d hope to train the LLM to be able to predict the second half of it from the first half, and especially to predict how it ends when the product ships (if it does) and we get to see how successful it actually is.
There are a lot of soft sciences (Psychology, Anthropology, Economics, Sociology, Medicine, Ergonomics, Design, Market Research, Political Science, etc. etc.…), and indeed all the Arts and Crafts, that cover various aspects of “how to make people happy” across a lot of different contexts: they cover nearly half the Dewey Decimal system. Ideally you’d want a training set and evaluations that covered all of these. Presumably they all have textbooks with multiple choice tests, or worked answers, for evaluations? The first idea that occurs to me is, train on a lot of soft science, arts, and crafts textbooks (if there are any that are not already digitized and included in standard LLM training sets, then do so for a lot). But you probably want practical data on applications (along the lines described above) as well as book learning on theory.
The aim here is to build a system that is unusually good, for an LLM, at looking as a proposed plan and attempting to figure out if, if it were carried out, whether people would subsequently be happy about it or not, across a wide range of subject areas, and can also estimate its uncertainty of this and be more cautious when it’s unsure (especially on large downside risks, such as killing people, or indeed getting its users or makers sued or arrested).
This gets you an LLM that understands human values better. You also need it to care about them and act on them, so when prompted to act as an LLM-powered autonomous agent and make/carry out plans, if should always (without requiring specific prompting) include doing this evaluation of whether people are going to be happy with the results, and then if the results are bad or unclear, it should not carry out the plan but instead look for another one (or if it’s hopeful but somewhat uncertain, it should cautiously plan a small-scale experiment or trial run, or have a fall-back plan). [I suspect that for an LLM this step from knowing what people will think to actually acting on it may be fairly easy to train (most people in the training set are not sociopaths, after all), but you definitely need to make sure it has both of these.]
[Warning Note
This applies regardless of whether you implement the above idea.
Also, if the model’s expected capabilities in any direction match or exceed those of GPT-4/Claude 2/Gemini Ultra (and if not, why are you training it again?), then you are going to need to do a LOT of safety testing and ameliorations before you ship (OpenAI visibly do at least 6 months’ of this on each new model, Google and Anthropic seem to be if anything even more cautious). Also note there are currently very few people on the planet with practical experience of doing this work, and most of them work for OpenAI/Anthropic/Google Research/METR. So you probably need to make an arrangement with one-or-more of these companies and collaborate on this: they may be surprisingly open to this, due to them not wanting LLMs to kill people.]
Tldr would be appreciated
Added a one sentence one, and reorganized the rest a bit to make it a little less stream-of-consciousness.