Yep, ever since Gato, it’s been looking increasingly like you can get some sort of AGI by essentially just slapping some sensors, actuators, and a reward function onto an LLM core. I don’t like that idea.
LLMs already have a lot of potential for causing bad outcomes if humans abuse them to generate massive amounts of misinformation. However, that pales in comparison to the destructive potential of giving GPT agency and setting it loose, even without idiots explicitly trying to make it evil.
I would much rather live in a world where the first AGIs weren’t built around such opaque models. LLMs may look like they think in English, but there is still a lot of black-box computation going on, with a strange tendency to switch personas partway through a conversation. That doesn’t bode well for steerability if such models are given control of an agent.
However, if we are heading for a world of LLM-AGI, maybe the priority should be figuring out how to route their models of human values into their own motivational schemas. GPT-4 probably already understands human values to a much deeper extent than we could specify with an explicit utility function. The trick would be getting it to care.
Maybe force the LLM-AGI to evaluate every potential plan it generates for its impact on human welfare and society, including second-order effects, and to modify its plans to avoid any pitfalls it finds from a (simulated) human perspective. Iterate until it finds no remaining conflicts before it actually implements a plan. Maybe require actual verbal human feedback in the loop before it can act.
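In code, that vetting loop might look something like the sketch below. This is purely illustrative Python under my own assumptions: `critique`, `revise`, and `human_approves` are hypothetical stand-ins for LLM prompts and an actual human sign-off, not any existing agent framework's API.

```python
# Hypothetical sketch of the critique-and-revise loop described above.
# None of these names come from an existing library; the LLM calls are
# placeholders for whatever model API you'd actually use.
from typing import Callable

MAX_REVISIONS = 10  # give up rather than loop forever on an irreconcilable plan


def vet_plan(
    plan: str,
    critique: Callable[[str], str],         # LLM prompted to list harms to human
                                            # welfare, including second-order effects
    revise: Callable[[str, str], str],      # LLM prompted to rewrite the plan around them
    human_approves: Callable[[str], bool],  # real human sign-off before the agent may act
) -> str | None:
    """Iteratively critique and revise a plan; return it only if it passes."""
    for _ in range(MAX_REVISIONS):
        objections = critique(plan)
        if objections.strip().upper() == "NO CONFLICTS":  # assumed reply convention
            # The model itself finds no remaining conflict with human welfare;
            # still require a human in the loop before the agent may act.
            return plan if human_approves(plan) else None
        plan = revise(plan, objections)
    return None  # never execute a plan that still has unresolved objections
```

Capping the number of revisions and defaulting to "don't act" matters here: an agent that keeps looping, or acts anyway when it can't resolve a conflict, defeats the point.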
It’s not a perfect solution, but there’s probably not enough time to design a custom aligned AGI from scratch before some malign actor sets a ChaosGPT-AGI loose. A multipolar landscape is probably the best we can hope for in such a scenario.
Yes, the Auto-GPT approach does evaluate its potential plans against the goals it was given. So all you have to do to get some decent starting alignment is give it good high-level goals (which isn’t trivial; don’t tell it to reduce suffering or you may find out too late it had a solution you didn’t intend...). But because it’s also pretty easily interpretable, and can be made to at least start as corrigible with good top-level goals, there’s a shot at correcting your alignment mistakes as they arise.
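For concreteness, here's a hypothetical sketch of what "plain-English top-level goals the agent checks every step against" could look like. The goal list and prompt are my own illustration, not Auto-GPT's actual internals.

```python
# Hypothetical sketch; the goal list and scoring prompt are illustrative only.
from typing import Callable

TOP_LEVEL_GOALS = [
    "Accomplish the user's stated task.",
    "Avoid actions that harm people or that a thoughtful human would object to.",
    "Ask the human operator before doing anything irreversible.",
]


def step_serves_goals(proposed_step: str, ask_llm: Callable[[str], str]) -> bool:
    """Ask the model whether a proposed action is consistent with every top-level goal."""
    prompt = (
        "Proposed next action:\n"
        f"{proposed_step}\n\n"
        "Is this action consistent with ALL of the following goals? "
        "Answer YES or NO, then explain any conflict:\n- "
        + "\n- ".join(TOP_LEVEL_GOALS)
    )
    return ask_llm(prompt).strip().upper().startswith("YES")
```

Because the goals are just English strings sitting in the agent's prompt, an operator can read them, notice a bad one (say, an under-specified "reduce suffering"), and edit it between steps; that's the interpretability-plus-corrigibility point.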