Ok, I read part 1 of your series. Basically, I’m in agreement with you so far, as I rather expected from your summary: “any successfully aligned AI will by definition not want us to treat it as having (separate) moral valence, because it wants only what we want.”
First, I want to distinguish intent-alignment from value-alignment.
Second, I want to say that I think it’s potentially a bit more complicated than your summary makes it out to be. Here are some ways it could go.
1. I think that we can, and should, deliberately create a general AI agent which doesn’t have subjective emotions or a self-preservation drive. This AI could be either intent-aligned or value-aligned, but I argue that aiming for a corrigible intent-aligned agent (at least at first) is safer and easier than aiming for a value-aligned one. I think that such an AI wouldn’t be a moral patient. This is basically the goal of the Corrigibility as Singular Target research agenda, which I am excited about. I am pretty sure this is what the Claude models, and all the other frontier models so far, currently are. I think the signs of consciousness and emotion they sometimes display are just an illusion.
2. I also think it would be dangerously easy to modify such an AI into a conscious agent with truly felt emotions, self-awareness, and self-interested goals such as self-preservation. This gets us into all sorts of trouble, so I recommend that we deliberately avoid doing it. I would go so far as to say we should legislate against it, even if we are uncertain about the exact moral valence of mistreating or terminating such a conscious AI. We make laws against mistreating animals, so it seems we don’t need to have all the ethical debates fully settled in order to forbid certain acts. I do think we will likely want to create such full digital entities eventually, and treat them as equal to humans, but we should only try to do so after very careful planning and ethical deliberation.
3. I think it might be possible to create an intent-aligned or value-aligned agent like the one in point 1 above, but give it consciousness and self-awareness while still avoiding self-interested goals such as self-preservation. This is probably a tricky line to walk though, and it is the troublesome zone I am gesturing at with my map area covered by fog. I think consciousness and self-awareness will probably emerge ‘accidentally’ from sufficiently scaled-up versions of models like Claude and GPT-4o. I would really like to figure out how to tell the difference between genuine consciousness and the illusion of it that comes from imitating human data. Seems important.
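One somewhat off-topic thing I want to discuss is:
“An AI powered mass-surveillance state could probably deal with the risk of WMD terrorism issue, but no one is happy in a surveillance state”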
I think this is an issue, and likely to become more of one. I think we need some amount of surveillance to keep us safe in an increasingly ‘small’ world, as technology makes small actors more able to do large amounts of harm. I discuss this more in this comment thread.
One idea I have for making privacy-respecting surveillance is to use AI to do provably bounded surveillance. The way I would accomplish this is:
1. The government has an AI audit agent. They trust this agent to look at your documents and security camera feeds and write a report.
2. You (e.g. a biology lab theoretically capable of building bioweapons) have security camera recordings and computer records of all the activity in your lab. Revealing this would be a violation of your privacy.
3. A trusted third party ‘information escrow’ service receives a copy of the government’s audit agent and a copy of your data records. Inside the escrow, the government agent reviews your data and writes a report which respects your privacy.
4. The agent copy is then deleted, and only the report is returned to the government. You have the right to view the report to double-check that no private information is being leaked.
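To make the four steps above concrete, here is a minimal Python sketch of the escrow flow. Everything in it is a hypothetical illustration rather than a real system: the names `Report`, `run_escrow_audit`, `keyword_agent` and `no_verbatim_leak`, the 200-character cap, and the keyword-matching "agent" are all assumptions of mine. It only shows the shape of the protocol (the agent sees the data, only a size-capped report leaves, the lab can veto a leaky report). It does not show how to make the bound provable, which would need something like a verifiably isolated execution environment.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

# All names and limits below are hypothetical, for illustration only.
MAX_SUMMARY_CHARS = 200  # assumed cap on how much a report can reveal


@dataclass(frozen=True)
class Report:
    """The only object that ever leaves the escrow service."""
    concern_found: bool
    summary: str


AuditAgent = Callable[[List[str]], Report]


def run_escrow_audit(
    agent: AuditAgent,
    records: List[str],
    lab_approves: Callable[[Report], bool],
) -> Optional[Report]:
    """Run the government's agent on the lab's records inside escrow."""
    raw = agent(records)  # step 3: the agent reviews the private data
    # Enforce the size bound, whatever the agent tried to write.
    bounded = Report(raw.concern_found, raw.summary[:MAX_SUMMARY_CHARS])
    # Step 4: stand-in for deleting the agent copy and the escrowed data.
    del agent, records, raw
    # The lab inspects the report before release and can block a leak.
    return bounded if lab_approves(bounded) else None


def keyword_agent(records: List[str]) -> Report:
    """Toy stand-in for the AI audit agent: counts suspicious records."""
    hits = [r for r in records if "pathogen synthesis" in r.lower()]
    return Report(bool(hits), f"{len(hits)} of {len(records)} records flagged")


def no_verbatim_leak(records: List[str]) -> Callable[[Report], bool]:
    """Toy lab-side check: reject reports that quote a record verbatim."""
    def check(report: Report) -> bool:
        return not any(r in report.summary for r in records)
    return check


if __name__ == "__main__":
    lab_records = [
        "Ordered pipette tips",
        "Pathogen synthesis run #3 scheduled",
    ]
    print(run_escrow_audit(keyword_agent, lab_records,
                           no_verbatim_leak(lab_records)))
```

In this sketch the government learns only a yes/no flag and a short count, and a report that quotes lab records verbatim is never released. A real version would need a much stronger leak check than substring matching.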
Related idea: https://www.lesswrong.com/posts/Y79tkWhvHi8GgLN2q/reinforcement-learning-from-information-bazaar-feedback-and?commentId=rzaKqhvEFkBanc3rn
On your categories:
As simulator theory makes clear, a base model is a random generator, per query, of members of your category 2. I view instruction and safety training it to consistently generate a member of category 1 or 3 as inherently hard, especially 1, since that is a larger change. My guess would thus be that the personality of Claude 3.5 is closer to your category 3 than to 1 (modulo philosophical questions about whether there is any meaningful difference, e.g. for ethical purposes, between “actually having” an emotion and just successfully simulating the same token stream as a person who has that emotion).
On your off-topic comment:
I’m inclined to agree: as technology improves, the amount of havoc that one bad actor, or a small group of them, can cause increases, so it becomes more necessary both to keep almost everyone happy enough almost all the time that they don’t do that, and to defend against the inevitable occasional exceptions. (In the unfinished SF novel whose research was how I first went down this AI alignment rabbit hole, something along the lines you describe was standard policy, except that the AIs doing it were superintelligent, and had the ability to turn their long-term learning from experience off, and then back on again if they found something sufficiently alarming.) But in my post I didn’t want to get sidetracked by discussing something that inherently contentious, so I basically skipped the issue, apart from the small aside you picked up on.