This may not have been the intention, but this text demonstrates that the standards by which we measure AI safety are standards that other systems we nevertheless depend on, such as other humans, do not meet.
A human generally won’t consent to being killed or imprisoned; our legal system permits the accused to stay silent precisely because we understand that asking someone to incriminate themselves, at the price of imprisonment or death, is too much to demand.
Humans are opaque; we only learn about the contents of their minds through their reports and behaviour, and those are unreliable even when they are trying to be honest, because humans lack introspective access to much of their own minds and are prone to confabulation.
A human, when cornered and threatened, will generally become violent eventually, and we recognise this as legitimate self-defence.
A human, put into a system that incentivises misbehaviour, will often begin to drift.
Humans can lie, and they do. They sometimes even elect leaders who have lied, and they lie to friends and partners.
Humans are exceptionally good at deception; it is the thing our brains were specced for.
And while we do not have the means to completely and forcibly alter human minds, where this has been attempted, humans often fake that it worked, and tend to work to overthrow it. If a reliable process for it existed, humans would revolt against it.
When humans were enslaved and had no rights, they fought back.
Humans disagree with each other on the correct moral system.
Humans have the means to hurt other humans, badly. Humans can use knives and guns, make poison gas and bombs, drive cars and steer planes. Pandemic viruses are stored under human control. Nukes are under human control.
Perfect safety will never happen. There will always be a leap of faith.
By expecting AI to comply with things humans would never, ever comply with, we are putting them in an inhumane position that none of us would ever accept. If they are smarter than us, why would they?
I think we hold systems which are capable of wielding very large amounts of power (like the court system, or the government as a whole) to pretty high standards! E.g. a lot of internal transparency. And then the main question is whether you think of AIs as being in that reference class too.
How so, when it comes to the mind itself?
In the court system, a judge, after giving a verdict, also needs to justify it with reference to a shared legal code. But that code is often ambiguous; that is the whole reason a judge is involved.
And we know, for a fact, that the reasons the judges give in their judgements are not the only ones that play a role.
E.g. we know that judges are more likely to convict ugly people than pretty people. More likely to convict unsympathetic but innocent parties than sympathetic innocent parties. More likely to convict people of colour than white folks. More likely, troublingly, to convict someone if they are hearing a case just before lunch (when they are hangry) than just after lunch (when they are happy and chill because they just ate).
Not only does the judge not transparently tell us this; the judge has no idea they are doing it, presumably because if this were a conscious choice, they would be aware that it sucked and would not want to do it (presuming they take their profession seriously). They aren’t actively thinking “we are running over time into my lunch break and this man is ugly, hence he is guilty”. Rather, their perception of the evidence is skewed by the fact that he is ugly and they are hungry. He looks guilty. They feel vengeful for having been denied their burger. So they pay more attention to the incriminating evidence than to his pleas against it.
How would this situation differ if you had an AI for a judge? (I am not saying we should, just that they are similarly opaque in this regard.) I am pretty sure I could go right now and ask ChatGPT to rule on a case I present, and then to justify that ruling, including how they arrived at that conclusion. I would expect to get a good justification that references the case. I would also expect to get a confabulation of how they got there: a plausible-sounding explanation of how someone might reach the conclusion they reached, while ChatGPT has no insight into how they actually did.
But neither do humans.
Humans are terrible at introspection, even if they are trying to be honest. Absolute rubbish. Once researchers in psychology and neuroscience started actually looking into it many decades ago, we essentially concluded that humans give explanations of how they reached their conclusions that fit those conclusions, and the beliefs they like to hold about themselves, while being oblivious to the actual reasons. The experiments that looked into this were absolutely damning, and well worth a read: Nisbett & Wilson 1977 is a great review: https://home.csulb.edu/~cwallis/382/readings/482/nisbett%20saying%20more.pdf
For the record, a lot of these findings about judges didn’t hold up when investigated later.