Above, I’m only trying to align an AGI, not an ASI, and not perfectly, only well enough that we’re confident it will help us construct better-aligned successor AGIs. Aligning ASI I leave as an exercise for once we have a team of well-aligned AGIs helping us work on the problem. I expect that to be an incremental process, each generation aligning successors only somewhat smarter than they are. So I’m hoping that the philosophical problems (and I agree, there will be some) come fairly slowly. Which could be over-optimistic of me.
(If you want more discussion of problems in ethics and philosophy that could arise once we have moderately-well aligned AGIs, see my sequence AI, Alignment, and Ethics — that’s actually where I started thinking about all this, 15 years ago now.)
I’d love to have a mathematical true name, or even just a pretty-good heuristic, for how-to-recognize-what-a-being-is-trying-to-do. (So would every law enforcement agency, intelligence service, and indeed voter on the planet.) I’m very pessimistic about the odds of finding one in the next few years (though recent interpretability work does seem to produce better lie detectors for LLMs than we currently have for humans, and neurologists are making progress on doing similar things for humans). Unless and until we have that, we’re just going to have to use more old-fashioned techniques of review, judgement, debate, and so forth, at a quadrillion-token scale, with a lot of LLM assistance to leverage the tens of billions of dollars’ worth of human judgement that we can afford to devote to this. I do, however, think that doing this on text, which can’t adapt to evade your techniques and where you can always go back and have another try, is a far better battleground to fight on than trying to do the same thing in real time to control an AGI or ASI.