Seth Herd comments on Agentized LLMs will change the alignment landscape

Seth Herd 11 Apr 2023 17:54 UTC
3 points
Good point. It would have even more emotional impact, and be a better intuition pump, to see an agentized LLM arrive at destroying humanity as a subgoal of some other objective.

Somebody gave one of these systems the goal of producing paperclips; I’ve forgotten where I saw it. Maybe it was a BabyAGI example? That one actually recognized the dangers and shifted to researching the alignment problem. That seemed to result from how closely the paperclip goal is linked to that issue in internet writing, and from the RLHF and other ethical safeguards built into GPT-4 as the core LLM. Unfortunately, that example conveys the opposite, and inaccurate, intuition: that these systems automatically have safeguards and ethics. They have those only when built on an LLM that has them built in, and even then the safeguards are unreliable.