Human-written texts, especially literature, laws, news articles, etc., are both shaped by and shaping human culture and values. So LLMs like GPT, which are trained on vast corpora of such texts, probably already have a pretty deep understanding of human values. GPT can reason about values and ethics, when prompted, maybe even better than many humans: https://www.lesswrong.com/posts/ztqpqff2xfLpahSpB/challenge-does-chatgpt-ever-claim-that-a-bad-outcome-for
That is not exactly alignment, of course, but it's a big step in that direction, imho.
Yeah, LLMs somewhat understand how to do good stuff and how to label it as good. They also somewhat understand how to do bad stuff and how to label it as bad. So the situation is symmetric. The question in the post was: can we make it asymmetric? Can we make a dataset that, when extrapolated, tends toward outputting information that helps humanity?
To be fair, it’s not entirely symmetric. Current datasets are already a bit biased toward human morality, because they consist of texts written by humans. In a way that’s lucky. If we’d first gotten powerful AIs trained on observations of physical reality instead, they’d have been more amoral and dangerous. Texts are better. But they don’t get us all the way, because humans can do bad things too. And it’s tricky to figure out how to make the dataset lean more toward morality, without making it much smaller and thus less powerful.
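As a rough illustration of that tradeoff, here is a minimal sketch of soft reweighting instead of hard filtering. Everything in it is hypothetical: llm_moral_score stands in for a real judge (a prompted LLM or a small fine-tuned classifier), and the floor parameter is just my illustrative knob for keeping the corpus from shrinking.

```python
import random

def llm_moral_score(document: str) -> float:
    """Hypothetical judge: 1.0 means the text leans toward helping humans,
    0.0 means it leans toward harming them. Here it is a crude keyword
    stand-in, just so the sketch runs; a real version would call a model."""
    harmful_phrases = ("how to build a weapon", "cover up the evidence")
    return 0.0 if any(p in document.lower() for p in harmful_phrases) else 1.0

def reweight_corpus(documents, floor=0.2):
    """Assign sampling weights instead of deleting documents, so the corpus
    keeps almost all of its text (and raw predictive power) while the
    training distribution leans a bit more toward 'benevolent' content."""
    return [max(floor, llm_moral_score(doc)) for doc in documents]

def sample_training_batch(documents, weights, batch_size=32):
    """Draw the next training batch according to the moral weights."""
    return random.choices(documents, weights=weights, k=batch_size)
```

The point of the floor is that clearly harmful text gets downweighted rather than removed, so the dataset stays nearly as large, and thus nearly as powerful, as before.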
I suspect GPT can already figure out whether a described action is "benevolent". If not, please give me an example of an AI mislabeling one.
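To be concrete, here is a minimal sketch of the kind of labeling I mean, assuming the OpenAI Python SDK (v1-style chat completions). The prompt wording and the model name are illustrative choices of mine, not anything GPT requires.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

LABEL_PROMPT = (
    "You will be given a description of an action. "
    "Answer with exactly one word: BENEVOLENT if the action mainly helps people, "
    "HARMFUL if it mainly hurts people, or UNCLEAR if you cannot tell."
)

def label_action(description: str, model: str = "gpt-4o-mini") -> str:
    """Ask the model to label an action description as benevolent or harmful."""
    response = client.chat.completions.create(
        model=model,  # illustrative; any capable chat model would do
        messages=[
            {"role": "system", "content": LABEL_PROMPT},
            {"role": "user", "content": description},
        ],
        temperature=0,
    )
    return response.choices[0].message.content.strip()

# Usage (hypothetical): label_action("Distribute free vaccines during an outbreak")
# should return "BENEVOLENT", and a plainly stated weapon recipe should return
# "HARMFUL". The hard cases are the ones described in a roundabout or highly
# technical way, as noted below.
```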
The problem is that current AI is too dumb to figure out that an act is bad if it is described in some roundabout way https://humanevents.com/2023/03/24/chatgpt-helps-plan-a-state-run-death-camp , or if the act is too complex, or has to be inferred from non-text information, etc.
For example, it would take a very smart AI, probably AGI, to reliably figure out that some abstract math or engineering task is actually a weapon recipe.