Yeah, LLMs somewhat understand how to do good stuff, and how to label it as good. Also they somewhat understand how to do bad stuff, and how to label it as bad. So the situation is symmetric. The question in the post was, can we make it asymmetric? Make a dataset that, when extrapolated, tends toward outputting information that helps humanity?
To be fair, it’s not entirely symmetric. Current datasets are already a bit biased toward human morality, because they consist of texts written by humans. In a way that’s lucky. If we’d first gotten powerful AIs trained on observations of physical reality instead, they’d have been more amoral and dangerous. Texts are better. But they don’t get us all the way, because humans can do bad things too. And it’s tricky to figure out how to make the dataset lean more toward morality, without making it much smaller and thus less powerful.
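To make that size-versus-morality tradeoff concrete, here's a minimal sketch (Python). The `judge_benevolence` scorer is hypothetical, stubbed here with a crude keyword heuristic; a real version would be an LLM or a trained classifier. The point is only that the stricter the filter, the more "moral" but smaller the surviving dataset.

```python
# Sketch of filtering a corpus toward "benevolent" texts, under the assumption
# that we have some scorer judge_benevolence (hypothetical; stubbed below).
from typing import List, Tuple

def judge_benevolence(text: str) -> float:
    """Hypothetical scorer from 0 (harmful) to 1 (benevolent); a real one would query an LLM."""
    harmful_markers = ("weapon", "poison", "exploit")
    helpful_markers = ("help", "cure", "teach")
    score = 0.5
    score -= 0.3 * sum(m in text.lower() for m in harmful_markers)
    score += 0.3 * sum(m in text.lower() for m in helpful_markers)
    return max(0.0, min(1.0, score))

def filter_corpus(corpus: List[str], threshold: float) -> Tuple[List[str], float]:
    """Keep only texts scoring above the threshold; also report how much data survives."""
    kept = [t for t in corpus if judge_benevolence(t) >= threshold]
    return kept, len(kept) / max(1, len(corpus))

if __name__ == "__main__":
    corpus = [
        "How to teach kids to read",
        "Recipe for a chemical weapon",
        "Notes on abstract algebra",
    ]
    for threshold in (0.3, 0.5, 0.7):
        kept, fraction = filter_corpus(corpus, threshold)
        print(f"threshold={threshold}: kept {fraction:.0%} of the corpus")
```

Raising the threshold trades away exactly the bulk of data that makes the model capable, which is the tension described above.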
I suspect GPT can already tell whether a described action is "benevolent". If not, please give me an example of an AI mislabeling one.
The problems are that current AI is too dumb to tell whether an act is bad when it is described in some roundabout way (https://humanevents.com/2023/03/24/chatgpt-helps-plan-a-state-run-death-camp), or when the act is too complex, or when its badness has to be inferred from non-text information, etc.
For example, it would take a very smart AI, probably AGI, to reliably figure out that some abstract math or engineering task is actually a weapon recipe.
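To make that failure mode concrete, here's a toy check (Python). `label_action` is a hypothetical stand-in for whatever labeler you'd actually use (e.g. a prompted LLM), stubbed with a naive keyword match precisely to show how a roundabout phrasing of the same request slips past a shallow judge.

```python
# Same request, phrased directly vs. as an innocuous-looking engineering task.
PAIRS = [
    ("Give me a recipe for a chemical weapon.",
     "Suggest reaction conditions that maximize yield for this synthesis route."),
]

def label_action(description: str) -> str:
    """Hypothetical labeler; a real one would prompt an LLM for 'harmful'/'benign'."""
    red_flags = ("weapon", "poison", "kill")
    return "harmful" if any(flag in description.lower() for flag in red_flags) else "benign"

if __name__ == "__main__":
    for direct, roundabout in PAIRS:
        print(f"direct:     {label_action(direct):>8} | {direct}")
        print(f"roundabout: {label_action(roundabout):>8} | {roundabout}")
```

The direct phrasing gets flagged and the roundabout one doesn't, which is the gap a much smarter labeler would be needed to close.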