I think it’s good this post exists. But I really want to make the distinction between “safe” and “is a solution to the alignment problem,” which this post elides. Or maybe “safe” vs. “friendly”?
If we build a superhuman AGI, we’d better have solved the alignment problem in the sense of actually making that AGI want to do good things and not bad things. (Past a certain point, “just follow orders” isn’t safe unless the order “do good things” would work. If it wouldn’t work, you’ve built an unsafe AI; if it would work, you might as well give that order.)
OpenAI’s “parallelizable alignment assistant” strategy can work for between 0 and 4 organizations in the world, because it relies on having enough of a lead that you can build something that is safe yet not a solution to the alignment problem, and nobody else will cause an accident in the weeks or months you spend trying to convert this into a solution to the alignment problem.
To look at one example property: taking a random AI and putting a human in the loop makes it safer. But it does little to nothing for solving alignment. It helps when you’re building an AI that’s dumber than you, but not really when you’re building an AI that’s smarter than you.
Or lack of situational awareness. This is actively anti-alignment, because the state of the world is useful information for doing good things. But it’s even more anti-capabilities, so it’s a fine property to shoot for if you’re making an AI that’s safe because it has limited capabilities.
I’ll come back later with a comment that actually makes suggestions, both ones that trade off for safety and for friendliness.
Well said. I mostly agree, but I’ll note that safety-without-friendliness is good as a non-ultimate goal.
Re human in the loop, I mostly agree. Re situational awareness, I mostly agree and I’ll add that lack-of-situational-awareness is sometimes a good way to deprive a system of capabilities not relevant to the task it’s designed for—“capabilities” isn’t monolithic.
I think my list is largely bad. I think the central examples of good ideas are LM agents and process-based systems. (Maybe because they’re more fundamental / architecture-y? Maybe because they’re more concrete?)
Looking forward to your future-comment-with-suggestions.