It’d be nice to hear a response from Paul to paragraph 1. My 2 cents:
I tend to agree that we end up with extremes eventually. You seem to be saying that, given somewhat aligned systems, we would move almost immediately to fully aligned ones, so Paul's 1st story barely plays out.
Of course, the somewhat aligned systems may aim at the wrong thing if we try to make them solve alignment. So the most plausible way this could work is if they produce solutions that we can check. But if that were the case, human supervision would be relatively easy. That's plausible, but it's a scenario I care less about.
Additionally, if we could use somewhat aligned systems to make more aligned ones, iterated amplification probably works for alignment (narrowly defined as "trying to do what we want"). The only remaining challenge would be to create one system that's somewhat smarter than us and somewhat aligned (which in this scenario is true by assumption). The rest follows, informally speaking, by induction, as long as the AI+humans system can keep improving intelligence as alignment improves, which seems likely. That's also plausible, but it's a big assumption, and it may not be the most important scenario / isn't a 'tale of doom'.