It’s not obvious to me that there’s nothing that works in this category of design. But it seems that designing the thing that generates the synthetic data ends up containing most of the hard part. How do you reliably generate data that teaches actually prosocial behavior (a noun phrase for which pinning down a definition is part of the hard task!), such that the behavior reliably preserves the agency of humans when actually run? It will need to demonstrate this in ways that reliably generalize out-of-distribution, because the future is always out of distribution. I have some ideas for how to do that with a hypercomputer, but certainly none that are ready to be turned into tractable code. Instead, my hopes rest on solving subproblems that let us figure out later how to do this step of designing things that reliably demonstrate good behavior.
I do think that “only ever predict good behavior” is a vaguely reasonable suggestion.
“Train for x, get x” is currently mostly true, but false in a strict sense. Fixing that without first solving “what do we want to train for, in order to get what we actually want?” seems like a bad idea to me.
For solving the pointing problem, Wentworth’s stuff seems like it’s barking up the right kind of tree. And I still think there’s some sort of truth of the matter about what people mean by the references they make when describing their values, so Wentworth-type work would likely help a lot; I expect it to plug into Michael Levin-type work. Something active inference, maybe? Boundaries, etc.? I don’t know. (Incidentally, active inference is on my mind because I’ve been reading the textbook on and off and browsing example code. It’s unclear to me whether there’s anything unique in active inference at all, but if there is, it might be a nice little way to talk about “any kind of mind which is a total mess theoretically, such that there’s no better way to compress that mind”. I’m hoping for better.)
But it seems that designing the thing that generates the synthetic data ends up containing most of the hard part.
Entirely fair comment. I think getting that right, within a (huge but) feasible budget, is indeed the hard part. I don’t see it as a single “thing”, but rather as likely to be a humans-and-AI process involving a lot of winnowing, filtering, classifying, editing, and rewriting. First you need to decide what “aligned AI behavior motivated only by the collective well-being of all humanity” looks like, and think about how such an AI should handle a lot of edge and corner cases (as they occur to you or as they turn up during the process). What I’m proposing here is to turn the engineering problem of Alignment into a problem in ethics, policy, writing, and editing: something that a lot of people who are not engineers can help with. I actually think hiring a wide variety of authors, journalists, and so forth to write and debate a smaller (say 0.01%) golden set here would be a great idea.
After that, producing a quadrillion tokens of high-quality training data based on this turns it back into an engineering problem again, one involving efficiently and reliably directing very large amounts of (not well aligned) AGI-level LLM effort. That’s a practical problem rather than a conceptual one, and not something many people (outside foundation labs) have much experience with yet, but it’s a kind of engineering task that we’re collectively gaining experience with rapidly (and it’s a capabilities problem: everyone agrees that we want to learn to do this). I strongly suspect it’s going to be iterative: you start with a GPT-4-or-5-sized training set, train models, test them, and try to figure out which of the problems you find are because the model is just not capable enough, and how many represent issues in your training set.
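As a very rough illustration of the generate-and-winnow loop I have in mind (not anyone’s actual pipeline), here is a minimal sketch. The golden-set seeds, `generate_candidates`, `automated_review`, and `human_spot_check` are all placeholders invented for this example; in practice each would be a large model, an ensemble, or a human process.

```python
import random

# Hypothetical stand-ins for the real components; everything here is a placeholder.
GOLDEN_SET = [
    "Example of an assistant declining to manipulate a user, with its reasoning.",
    "Example of an assistant asking clarifying questions before acting.",
]

def generate_candidates(seed: str, n: int = 4) -> list[str]:
    """Placeholder for AGI-level LLM effort: expand one golden example into variants."""
    return [f"{seed} (variant {i})" for i in range(n)]

def automated_review(text: str) -> float:
    """Placeholder classifier: score how well the text matches the behavior spec.
    In a real pipeline this would itself be an LLM (or ensemble) plus heuristics."""
    return random.random()

def human_spot_check(text: str) -> bool:
    """Placeholder for scarce, expensive human judgement applied to a small sample."""
    return True

def build_training_set(target_size: int, threshold: float = 0.8) -> list[str]:
    accepted: list[str] = []
    while len(accepted) < target_size:
        seed = random.choice(GOLDEN_SET)
        for candidate in generate_candidates(seed):
            if automated_review(candidate) < threshold:
                continue                      # winnow: drop low-scoring candidates
            if random.random() < 0.01 and not human_spot_check(candidate):
                continue                      # audit roughly 1% of accepted items by hand
            accepted.append(candidate)
    return accepted[:target_size]

if __name__ == "__main__":
    # Toy scale; the discussion above is about ~10^15 tokens, not 10 examples.
    print(len(build_training_set(10)))
```

The loop itself is trivial; all the interesting (and expensive) work lives inside the generator and the reviewers, which is where the iteration on training-set quality would actually happen.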
Oh, if already-existing minds have to generate the data directly, I don’t think that’s a class of proposal that contains any working solutions. I mean, I would want the process that generates the data to include a lot of data from them, yes, but it will need to involve a simple mathematical true name of how-to-recognize-what-a-being-is-trying-to-do in order for the generated data to contain lots of examples of helping, and not a lot of examples of being confused and doing something sampled from “confused behavior” as a result. Like, given all that data of people writing, how do you make an AI that doesn’t end up becoming a struggling em of everyone the digital mind encountered, for example? There are weird problems here. A sufficiently aligned mildly-or-above superintelligent AI would need to be doing things like novel philosophy on the regular, in areas where we can no longer usefully advise it (at least, not without first taking a university course taught by the AI in question), and be getting that whole process right. To do that, your training data has to contain enough data to somehow cover the space of ways to discover novel ways to be good, while still maintaining the spirit of what people meant by their ethics writing. It can ask us (presumably, since it’s aligned, we’d still be around), but then it needs to be sufficiently good at asking the right questions to figure out the thing that matters.
Above, I’m only trying to align an AGI, not an ASI, and not perfectly, only well enough that we’re confident it will help us construct better-aligned successor AGIs. Aligning ASI I leave as an exercise for once we have a team of well-aligned AGIs helping us work on the problem. I expect that to be an incremental process, each generation aligning successors only somewhat smarter than they are. So I’m hoping that the philosophical problems (and I agree, there will be some) come fairly slowly. Which could be over-optimistic of me.
(If you want more discussion of problems in ethics and philosophy that could arise once we have moderately-well aligned AGIs, see my sequence AI, Alignment, and Ethics — that’s actually where I started thinking about all this, 15 years ago now.)
I’d love to have a mathematical true name, or even just a pretty-good heuristic, for how-to-recognize-what-a-being-is-trying-to-do. (So would every law enforcement agency, intelligence service, and indeed voter on the planet.) I’m very pessimistic about the odds of finding one in the next few years (though recent interpretability work does seem to produce better lie detectors for LLMs than we currently have for humans, and neurologists are making progress on doing similar things for humans). Unless and until we have that, we’re just going to have to use more old-fashioned techniques of review, judgement, debate, and so forth, at quadrillion-token scale, with a lot of LLM assistance to leverage the tens of billions of dollars’ worth of human judgement that we can afford to devote to this. I do, however, think that doing this on text, which can’t adapt to evade your techniques and where you can always go back and have another try, is a far better battleground to fight on than trying to do it to control an AGI or ASI in real time.
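To make the “leverage human judgement with LLM assistance” point concrete, here is one toy way the arithmetic could work: let cheap LLM reviewers label everything, spend the scarce human budget auditing a random sample, and reject a whole slice of data for rework when the audit disagrees too often. Every name and threshold here is invented for illustration, not a claim about how any lab actually does this.

```python
import random

# Toy illustration of stretching a fixed human-review budget across a huge corpus.
# All components are placeholders invented for this sketch.

def llm_review(doc: str) -> bool:
    """Placeholder LLM classifier: does this document demonstrate the target behavior?"""
    return random.random() > 0.05

def human_review(doc: str) -> bool:
    """Placeholder for expensive human judgement (the scarce resource)."""
    return random.random() > 0.02

def audit_slice(docs: list[str], human_budget: int, max_disagreement: float = 0.1):
    """Accept the LLM's labels for this slice only if a human audit of a small
    random sample agrees often enough; otherwise flag the whole slice for rework."""
    labels = {doc: llm_review(doc) for doc in docs}
    sample = random.sample(docs, min(human_budget, len(docs)))
    disagreements = sum(1 for doc in sample if human_review(doc) != labels[doc])
    if disagreements / len(sample) > max_disagreement:
        return None                 # slice rejected: revise prompts/classifier and retry
    return [doc for doc, ok in labels.items() if ok]

if __name__ == "__main__":
    corpus = [f"document {i}" for i in range(10_000)]
    kept = audit_slice(corpus, human_budget=100)
    print("slice rejected" if kept is None else f"kept {len(kept)} documents")
```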