Oh, if already existing minds have to generate the data directly, I don’t think that’s a class of proposal that contains any working solutions. I mean, I would want the process that generates the data to include a lot of data from them, yes, but it will need to involve a simple mathematical true name of how-to-recognize-what-a-being-is-trying-to-do in order for the generated data to contain lots of examples of helping, and not a lot of examples of being confused and doing something sampled from “confused behavior” as a result. Like, given all that data of people writing, how do you make an AI that doesn’t end up becoming a struggling em of everyone the digital mind encountered, for example? There are weird problems here. A sufficiently aligned mildly-or-above superintelligent AI would need to be doing things like novel philosophy on the regular, in areas where we can no longer usefully advise the mind (at least, not without first taking a university course taught by the AI in question), and be getting that whole process right. To do that, your training data has to contain enough data to somehow cover the space of ways to discover novel ways to be good, while still maintaining the spirit of what people meant by their ethics writing. It can ask us (presumably, since it’s aligned, we’d still be around), but then it needs to be sufficiently good at asking the right questions to figure out the thing that matters.
Above, I’m only trying to align an AGI, not an ASI, and not perfectly, only well enough that we’re confident that it will help us construct better-aligned successor AGIs. Aligning ASI I leave as an exercise for once we have a team of well-aligned AGIs helping us work on the problem. I expect that to be an incremental process, each generation aligning successors only somewhat smarter than they are. So I’m hoping that the philosophical problems (and I agree, there will be some) come fairly slowly. Which could be over-optimistic of me.
(If you want more discussion of problems in ethics and philosophy that could arise once we have moderately-well aligned AGIs, see my sequence AI, Alignment, and Ethics — that’s actually where I started thinking about all this, 15 years ago now.)
I’d love to have a mathematical true name, or even just a pretty-good heuristic, for how-to-recognize-what-a-being-is-trying-to-do. (So would every law enforcement agency, intelligence service, and indeed voter on the planet.) I’m very pessimistic about the odds of finding one in the next few years (though recent interpretability work does seem to produce better lie detectors for LLMs than we currently have for humans, and neurologists are making progress on doing similar things for humans). Unless and until we have that, we’re just going to have to use more old-fashioned techniques of review, judgement, debate, and so forth, applied at a quadrillion-token scale, with a lot of LLM assistance to leverage the tens of billions of dollars’ worth of human judgement that we can afford to devote to this. I do, however, think that doing this on text, which can’t adapt to evade your techniques and where you can always go back and have another try, is a far better battleground to fight on than trying to do it to control an AGI or ASI in real time.
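(Purely by way of illustration, here is a minimal sketch of what one small piece of “LLM assistance” for corpus review might look like: a triage pass that scores each document and routes only the suspicious fraction to human reviewers. Everything here is hypothetical; in particular, `score_document` is a crude keyword stand-in for what would really be a call to an LLM classifier, and the names and threshold are mine, not a description of any existing pipeline.)

```python
# Hypothetical sketch: triage a text corpus so human reviewers only see
# the documents a model flags as possibly modelling confused, deceptive,
# or harmful behaviour. Illustrative only.

from dataclasses import dataclass


@dataclass
class Document:
    doc_id: str
    text: str


def score_document(doc: Document) -> float:
    """Return a 0-1 'needs human review' score.

    Placeholder: a real pipeline would ask an LLM to rate the document;
    here we just use a crude keyword proxy so the sketch runs on its own.
    """
    flags = ("how to deceive", "cover it up", "don't tell anyone")
    hits = sum(phrase in doc.text.lower() for phrase in flags)
    return min(1.0, hits / len(flags))


def triage(corpus, review_threshold=0.3):
    """Split a corpus into (auto-accepted, needs-human-review) piles."""
    accepted, flagged = [], []
    for doc in corpus:
        (flagged if score_document(doc) >= review_threshold else accepted).append(doc)
    return accepted, flagged


if __name__ == "__main__":
    corpus = [
        Document("a", "A clear explanation of how to help someone file taxes."),
        Document("b", "Step one: cover it up and don't tell anyone."),
    ]
    accepted, flagged = triage(corpus)
    print(f"{len(accepted)} auto-accepted, {len(flagged)} sent to human review")
```

The point of the sketch is only the shape of the process: cheap model judgement applied to everything, expensive human judgement applied to the residue, and the whole thing rerunnable as often as we like because text sits still.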