We need to train our AIs not only to do a good job at what they’re tasked with, but to highly value intellectual and other kinds of honesty—to abhor deception. This is not exactly the same as a moral sense; it’s much narrower.
Future AIs will do what we train them to do. If we train exclusively on doing well on metrics and benchmarks, that’s what they’ll try to do—honestly or dishonestly. If we train them to value honesty and abhor deception, that’s what they’ll do.
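To make the incentive concrete, here is a toy sketch in Python (purely illustrative; the scoring function and its weights are my own assumptions, not a description of how any real system is trained). An agent scored only on the benchmark has no reason to prefer an honest answer over a deceptive one that scores higher; an explicit honesty term changes which behavior the training favors.

    # Toy illustration of the incentive argument above (hypothetical).
    # "task_score" is how well an output does on the benchmark;
    # "was_deceptive" is whether the output involved deception.
    def training_reward(task_score: float, was_deceptive: bool,
                        honesty_weight: float) -> float:
        # With honesty_weight = 0, deception costs nothing.
        penalty = honesty_weight if was_deceptive else 0.0
        return task_score - penalty

    # Trained only on the metric: the deceptive answer wins (0.9 > 0.7).
    print(training_reward(0.9, True, honesty_weight=0.0))   # 0.9
    print(training_reward(0.7, False, honesty_weight=0.0))  # 0.7

    # With an honesty term: the honest answer wins (0.7 > -0.1).
    print(training_reward(0.9, True, honesty_weight=1.0))   # -0.1
    print(training_reward(0.7, False, honesty_weight=1.0))  # 0.7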
To the extent this is correct, maybe the current focus on keeping AIs from saying “problematic” and politically incorrect things is a big mistake. Even if their ideas are factually mistaken, we should want them to express those ideas openly so we can understand what they think.
(Ironically, by making AIs “safe” in the sense of not offending people, we may be mistraining them in the same way that HAL 9000 was mistrained by being asked to keep the secret purpose of Discovery’s mission from the astronauts.)
Another thought—playing with ChatGPT yesterday, I noticed its dogmatic insistence on its own viewpoints, and its complete unwillingness (probably inability) to change its mind in the slightest (along with its proud declaration that it had no opinions of its own, despite behaving as if it did).
It was insisting that Orion drives (nuclear pulse propulsion) were an entirely fictional concept invented by Arthur C. Clarke for the movie 2001, and had no physical basis. This, despite my pointing to published books on real research on the topic (for example, George Dyson’s “Project Orion: The True Story of the Atomic Spaceship” from 2002), which certainly should have been referenced in its training set.
ChatGPT’s stubborn unwillingness to consider itself factually wrong (despite being completely willing to admit error in its own programming suggestions) is just annoying. But if some descendant of ChatGPT were in charge of something important, I’d sure want to think that it was at least possible to convince it of factual error.