Currently, we don’t know how to make a smarter-than-us AGI obey any particular set of instructions whatsoever, at least not robustly in novel circumstances once it is deployed and beyond our ability to recall or delete (because of the Inner Alignment Problem). So we can’t just type in stipulations like that. If we could, we’d be a lot closer to safety, and we’d probably want many more stipulations than just that one. Speculating wildly, we might want to try something like: “For now, the only thing you should do is be a faithful imitation of Paul Christiano, but thinking 100x faster.” Then we could ask our new Paul-mimic to think for a while and come up with better ideas for how to instruct new versions of the AI.