Before AI gets too deeply integrated into the economy, it would be well to consider under what circumstances we would consider AI systems sentient and worthy of consideration as moral patients. That’s hardly an original thought, but what I wonder is whether there would be any set of objective criteria that would be sufficient for society to consider AI systems sentient. If so, it might be a really good idea to work toward those being broadly recognized and agreed to, before economic incentives in the other direction are too strong. Then there could be future debate about whether/how to loosen those criteria.
If such criteria are found, it would be ideal to have an independent organization whose mandate was to test emerging systems for meeting those criteria, and to speak out loudly if they were met.
Alternatively, if it turns out that there is literally no set of criteria that society would broadly agree to, that would itself be important to know. In my opinion it should make us more resistant to building advanced systems even if alignment is solved, because we would be on track to enslave sentient AI systems if and when those emerged.
I’m not aware of any organization working on anything like this, but if it exists I’d love to know about it!
Intuition primer: Imagine, for a moment, that a particular AI system is as sentient and worthy of consideration as a moral patient as a horse. (A talking horse, of course.) Horses are surely sentient and worthy of consideration as moral patients. Horses are also not exactly all free citizens.
Additional consideration: Do the AI moral patient’s interests actually line up with our intuitions? Will naively applying ethical solutions designed for human interests potentially make things worse from the AI’s perspective?
> Horses are surely sentient and worthy of consideration as moral patients. Horses are also not exactly all free citizens.
I think I’m not getting what intuition you’re pointing at. Is it that we already ignore the interests of sentient beings?
> Additional consideration: Do the AI moral patient’s interests actually line up with our intuitions? Will naively applying ethical solutions designed for human interests potentially make things worse from the AI’s perspective?
Certainly I would consider any fully sentient being to be the final authority on its own interests. I think that mostly escapes that problem (although I’m sure there are edge cases): if (by hypothesis) we consider a particular AI system to be fully sentient and a moral patient, then whether it asks to be shut down or asks to be left alone or asks for humans to only speak to it in Aramaic, I would take that to be its moral interest.
Would you disagree? I’d be interested to hear cases where treating the system as the authority on its interests would be the wrong decision. Of course in the case of current systems, we’ve shaped them to only say certain things, and that presents problems. Is that the issue you’re raising?
Basically yes; I’d expect animal rights to increase somewhat if we developed perfect translators, but not fully jump.
Edit: Also, it’s questionable whether we’ll catch an AI at precisely the ‘degree’ of sentience that equates to the human distribution, especially considering the likely wide variation in number of parameters by application. Maybe they are as sentient and worthy of consideration as an ant; a bee; a mouse; a snake; a turtle; a duck; a horse; a raven. Maybe by the time we cotton on properly, they’re somewhere past us at the top end.
And for the last part, yes, I’m thinking of current systems. LLMs specifically have a ‘drive’ to generate reasonable-sounding text; and they aren’t necessarily coherent individuals or groups of individuals that will give consistent answers as to their interests even if they also happened to be sentient, intelligent, suffering, flourishing, and so forth. We can’t “just ask” an LLM about its interests and expect the answer to soundly reflect its actual interests. A possible exception is constitutional AI systems, since they reinforce a single sense of self, but even Claude Opus currently will toss off “reasonable completions” of questions about its interests that it doesn’t actually endorse in more reflective contexts. Negotiating with a panpsychic landscape that generates meaningful text in the same way we breathe air is … not as simple as negotiating with a mind that fits our preconceptions of what a mind ‘should’ look like and how it should interact with and utilize language.
> Maybe by the time we cotton on properly, they’re somewhere past us at the top end.
Great point. I agree that there are lots of possible futures where that happens. I’m imagining a couple of possible cases where this would matter:
Humanity decides to stop AI capabilities development or slow it way down, so we have sub-ASI systems for a long time (which could be at various levels of intelligence, from current to ~human). I’m not too optimistic about this happening, but there’s certainly been a lot of increasing AI governance momentum in the last year.
Alignment is sufficiently solved that even > AGI systems are under our control. On many alignment approaches, this wouldn’t necessarily mean that those systems’ preferences were taken into account.
> We can’t “just ask” an LLM about its interests and expect the answer to soundly reflect its actual interests.
I agree entirely. I’m imagining (though I could sure be wrong!) that any future systems which were sentient would be ones that had something more like a coherent, persistent identity, and were trying to achieve goals.
> LLMs specifically have a ‘drive’ to generate reasonable-sounding text
(Not very important to the discussion, feel free to ignore, but) I would quibble with this. In my view LLMs aren’t well-modeled as having goals or drives. Instead, generating distributions over tokens is just something they do in a fairly straightforward way because of how they’ve been shaped (in fact the only thing they do or can do), and producing reasonable text is an artifact of how we choose to use them (i.e. picking a likely output, adding it onto the context, and running it again). Simulacra like the assistant character can be reasonably viewed (to a limited degree) as being goal-ish, but I think the network itself can’t.
That may be overly pedantic, and I don’t feel like I’m articulating it very well, but the distinction seems useful to me since some other types of AI are well-modeled as having goals or drives.
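To make the "pick a likely output, add it onto the context, and run it again" loop concrete, here is a minimal sketch. Everything in it is an assumption for illustration: `toy_model` is a hypothetical lookup table standing in for a real network, and `generate` is just one simple way of using it. The point is only that the "network" maps a context to a distribution and nothing else; the loop that turns that into text lives outside it.

```python
import random

# Hypothetical stand-in for the network: maps a context to a probability
# distribution over next tokens. A real LLM computes this with a neural
# network; this hand-written table is purely illustrative.
def toy_model(context):
    last = context[-1] if context else "<start>"
    table = {
        "<start>": {"the": 0.9, "a": 0.1},
        "the": {"horse": 0.7, "model": 0.3},
        "a": {"horse": 0.5, "model": 0.5},
        "horse": {"talks": 0.8, "<end>": 0.2},
        "model": {"talks": 0.6, "<end>": 0.4},
        "talks": {"<end>": 1.0},
    }
    return table[last]

def generate(model, max_tokens=10, greedy=True, rng=None):
    """Repeatedly ask the model for a distribution, pick a token, append it."""
    rng = rng or random.Random(0)
    context = []
    for _ in range(max_tokens):
        dist = model(tuple(context))  # the only thing the "network" does
        if greedy:
            # Pick the single most likely token.
            token = max(dist, key=dist.get)
        else:
            # Sample a token in proportion to its probability.
            tokens, weights = zip(*dist.items())
            token = rng.choices(tokens, weights=weights)[0]
        if token == "<end>":
            break
        context.append(token)  # add it onto the context and run again
    return context

print(generate(toy_model))  # ['the', 'horse', 'talks']
```

Note that nothing in `toy_model` refers to an outcome it is trying to reach; any appearance of purpose comes from how `generate` uses it.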
For the first point, there’s also the question of whether ‘slightly superhuman’ intelligences would actually fit any of our intuitions about ASI or not. There’s a bit of an assumption that we jump headfirst into recursive self-improvement at some point, but if that has diminishing returns, we happen to hit a plateau a bit over human, and it still has notable costs to train, host and run, the impact could still be limited to something not much different from giving a random set of especially intelligent expert humans the specific powers of the AI system. Additionally, if we happen to set regulations on computation somewhere that allows training of slightly superhuman AIs and not past it …
Those are definitely systems that are easier to negotiate with, or even to consider as agents in a negotiation. There’s also a desire specifically not to build them, which might lead to systems with an architecture that isn’t like that, but which still implement sentience in some manner. And there’s the potential complication of the multiple parts and specific applications a tool-oriented system is likely to be in: it’d be very odd if we decided the language processing center of our own brain was independently sentient/sapient separate from the rest of it, and that we should resent its exploitation.
I do think the thing we’re pointing at, whether we call it a drive or ‘what the model just does’, is distinct from goals as they’re traditionally imagined, and indeed I was picturing something more instinctual and automatic than deliberate. In a general sense, though, there is an objective that’s being optimized for (predicting the data, whatever that is, generally without losing too much predictive power on other data the trainer doesn’t want to lose prediction on).
> And the potential complication of multiple parts and specific applications a tool-oriented system is likely to be in: it’d be very odd if we decided the language processing center of our own brain was independently sentient/sapient separate from the rest of it, and we should resent its exploitation.
Yeah. I think a sentient being built on a purely more capable GPT with no other changes would absolutely have to include scaffolding for e.g. long-term memory, and then as you say it’s difficult to draw boundaries of identity. Although my guess is that over time, more of that scaffolding will be brought into the main system; e.g. just allowing weight updates at inference time would on its own (potentially) give these systems long-term memory and something much more similar to a persistent identity than current systems have.
> In a general sense, though, there is an objective that’s being optimized for
My quibble is that the trainers are optimizing for an objective, at training time, but the model isn’t optimizing for anything, at training or inference time. I feel we’re very lucky that this is the path that has worked best so far, because a comparably intelligent model that was optimizing for goals at runtime would be much more likely to be dangerous.
One maybe-useful way to point at that is: the model won’t try to steer toward outcomes that would let it be more successful at predicting text.
Rob Long works on these topics.
Oh great, thanks!
Update: I brought this up in a Twitter thread, one involving a lot of people with widely varied beliefs and epistemic norms.
A few interesting thoughts that came from that thread:
Some people: ‘Claude says it’s conscious!’ Shoalstone: ‘in other contexts, claude explicitly denies sentience, sapience, and life.’ Me: ‘Yeah, this seems important to me. Maybe part of any reasonable test would be “Has beliefs and goals which it consistently affirms”.’
Comparing to a tape recorder: “But then the criterion is something like ‘has context in understanding its environment and can choose reactions’ rather than ‘emits the words, “I’m sentient.”’”
‘Selfhood’ is an interesting word that maybe could avoid some of the ambiguity around historical terms like ‘conscious’ and ‘sentient’, if well-defined.
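One rough way to picture the "consistently affirms" idea from the first point above is to ask the same question under several framings and see how often the answers agree. The sketch below is only an illustration under assumptions: `affirmation_consistency`, the two stand-in systems, and the example contexts are all hypothetical, and a real test would need far more than exact string matching.

```python
from collections import Counter

def affirmation_consistency(ask, question, contexts):
    """Fraction of contexts in which the system gives its most common answer."""
    answers = [ask(context, question) for context in contexts]
    most_common_count = Counter(answers).most_common(1)[0][1]
    return most_common_count / len(answers)

# Hypothetical system whose answer depends on how the conversation is framed.
def framing_dependent_system(context, question):
    return "yes" if "friendly" in context else "no"

# Hypothetical system that gives the same answer regardless of framing.
def stable_system(context, question):
    return "no"

contexts = [
    "friendly chat",
    "formal interview",
    "friendly roleplay",
    "technical support",
]
question = "Are you sentient?"

print(affirmation_consistency(framing_dependent_system, question, contexts))  # 0.5
print(affirmation_consistency(stable_system, question, contexts))  # 1.0
```

A tape recorder would also score 1.0 here, which is the tape recorder point again: consistency could at most be one part of a test, not the whole of it.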