How do humans learn “don’t steal” rather than “don’t get caught”? I wonder if the answer to this question could solve the alignment problem. In other words, this question might be a good crux.
In answering this question, the first thing we can notice is that humans don’t always learn “don’t steal”. That is to say, sometimes humans do steal, and a good part of human culture is built around impeding or punishing humans who learned the wrong lesson in kindergarten. It is an old debate whether humans are mostly good with the occasional bad actor (with “bad actors” possibly being good people in a bad situation), or whether humans are mostly bad and need to be controlled by a powerful state, God, etc.
A modern consensus view is that humans are mostly good, but if we didn’t impede or punish bad actors, we would get bad outcomes (total anarchy doesn’t work). If we assume that there are many AGIs with a similar distribution of good and bad, and that no AGI is more powerful than a typical human today (in particular, no AGI is uncontrollable), then in this scenario we can rest easy. Law and order works reasonably well for humans, and it should work just fine for human-level AGIs.
The problem is that AGIs could (and probably will) become much more powerful than individual humans. In EY’s view, the world is vulnerable to the first true superintelligence because of technological capabilities that are currently science fiction, particularly nanotechnology. If you look at EY’s intellectual history, you’ll notice that his concern has always really been nanotech, but around 2002 he switched focus from the nanotech itself to the AI controlling the nanotech.
An alternate view is to see powerful AGIs as somewhat analogous to institutions such as corporations or governments. I don’t find this view all that comforting, because societies have never been very good at aligning their largest institutions. For example, the Founding Fathers of the United States created a system that (attempted to) align the federal government with the “will of the people”. This system was based on separation of powers, checks and balances, and some individual rights (the Bill of Rights). Some would say that this system worked for between 70 and 200 years and then broke down, others would say that it’s still working fine despite recent problems in the American political system, and still others would say that it was misguided from the start. In any case, this framing of the alignment problem puts it firmly in the domain of political science, which sucks.
Anyway, going back to the question: How do (some) humans learn “don’t steal” rather than “don’t get caught”? An upside for AI alignment is that, if we could answer this question, we could reliably make AIs that always and only learn the first lesson, and then we wouldn’t have to solve political/law-and-order problems. We wouldn’t even really need to align humans after that.
To answer the question from an AI Alignment optimist’s perspective: much of the way humans are aligned is something like RLHF, but a lot of current human alignment techniques rely on the assumption that no one has vastly divergent capabilities, especially in IQ or the g-factor. It’s a good thing, from our perspective, that the differences within a species are far more bounded than the differences between species.
That’s the real problem with AI: there’s a non-trivial chance that this assumption breaks, and that’s the difference between AI Alignment and other forms of alignment.
So in a sense, I disagree with TurnTrout on what would happen in practice if we allowed humans to scale their abilities via, say, genetic engineering.
The reason I’m optimistic is that I don’t think this assumption has to be true, and while the Thatcher’s Axiom post implies limits on how much we can expect society to be aligned with itself, that limit might be much higher than we think.
Pretraining from Human Feedback is one of the first alignment methods that scales well with data, and I suspect it will also scale well with other capabilities.
Basically, it does alignment the way it should be done: align the model first, then give it capabilities.
It almost completely solves the major issue of inner alignment, in that we have found an objective that is quite simple and myopic, and this means we almost completely avoid deceptive alignment, even if we later do online training or give it a writable memory.
It also has a number of outer alignment benefits for the goal, in that the AI can’t affect its own training distribution or gradient hack, so we can recreate a Cartesian boundary that works in the embedded setting.
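For readers who haven’t seen the method, here is a minimal sketch of the conditional-training idea I have in mind: score the fixed pretraining data, tag each segment with a control token, and train an ordinary language model on the tagged text. The scorer and the token names below are toy stand-ins I made up for illustration, not the actual Pretraining from Human Feedback pipeline.

```python
# Minimal sketch of conditional training on feedback-tagged pretraining data.
# The scorer is a toy rule-based stand-in for a learned preference/reward model.
from typing import List

GOOD, BAD = "<|good|>", "<|bad|>"  # illustrative control tokens

def toy_scorer(segment: str) -> float:
    """Toy stand-in for a learned preference model: penalise a blocklisted word."""
    return 0.0 if "steal" in segment.lower() else 1.0

def annotate(segments: List[str], threshold: float = 0.5) -> List[str]:
    """Prepend a control token to each pretraining segment based on its score."""
    return [(GOOD if toy_scorer(s) >= threshold else BAD) + " " + s for s in segments]

corpus = [
    "The shopkeeper returned the lost wallet to its owner.",
    "Here is how to steal a wallet without getting caught.",
]
for example in annotate(corpus):
    print(example)  # the language model is then pretrained on these tagged strings
```

At sampling time you condition on the good token, and because the gradient only ever comes from fixed, pre-scored data, the model never influences its own training signal, which is where the myopia and Cartesian-boundary claims above come from.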
So in conclusion, I’m more optimistic than TurnTrout or Quintin Pope, but via a different method.
Edit: Almost the entire section down from “The reason I’m optimistic” is a view I no longer hold, and I have become somewhat more pessimistic since this comment.
I don’t believe that a single human being of any level of intelligence could be an x-risk. Happy to debate this point further since I think it is a crux. (Note that I do not believe that a plague could lead to human extinction. Plagues don’t kill 100%.)
AIs are different because a single monolithic AI, or a team of self-aligned AIs, could do things on the scale of an institution, things such as technological breakthroughs (nano), controlling superpower-scale military forces, mass information control that would make Orwell blush, etc. An individual human could never do such things no matter how big his skull was, unless he was hooked up to an AI, in which case it’s not the human that is super intelligent.
Never is a long time. I overall agree with your statement in this comment except for the word ‘never’. I would say, “An individual human currently can’t do such things...”
The key point here is that the technological barriers to x-risks may change in the future. If we do invent powerful nanotech, or substantially more advanced genetic engineering techniques & tools, or vastly cheaper and more powerful weapons of some sort, then it may be the case that the barrier to entry for causing an x-risk is substantially lower. And thus, what is currently impossible for any human may become possible for some or all humans.
Not saying this will happen, just saying that it could.
Of the three examples I gave, inventing nanotech is the most plausible for our galaxy-brained man, and I suppose meta-Einstein might be able to solve nanotech in his head. However, almost certainly in our timeline nanotech will be solved either by a team of humans or (much more likely at this point) AI. I expect that even ASI will need at least some time in the wetlab to experiment.
The other two examples I gave certainly could not be done by a single human without a brain implant.
I’m also thinking this is not the most meaningful debate (at least to me), since in 2023 I think we can reasonably predict that humans will not genetically engineer galaxy brains before the AI revolution resolves.
It’s partially a crux, but the issue I’m emphasizing is the distribution of capabilities. If things are normally distributed, which seems to be the case in humans with small corrections, then we can essentially bound how much impact a single misaligned human, or a well-dedicated team of misaligned humans, can have in overthrowing the aligned order. In particular, this makes a lot more non-scalable heuristics basically work.
If it’s something closer to a power law distribution, perhaps as a result of NGVUD technology (the acronym stands for nanotechnology, genetic engineering, virtual reality, uploading and downloading technology), then you have to have a defense that scales, and without potentially radical changes, such a world would most likely end in the victory of a small team of misaligned humans due to vast capability differentials, similar to how many animal species have gone extinct as a result of human activity.
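To make the normal-versus-power-law point concrete, here is a rough numerical toy model (my own illustration, with arbitrary parameters, not anything from the comment above):

```python
# Compare the strongest individual drawn from a normal capability distribution
# versus a heavy-tailed (Pareto) one, for a population of one million.
import numpy as np

rng = np.random.default_rng(0)
population = 1_000_000

# Normal "IQ-like" capabilities: mean 100, standard deviation 15.
normal_caps = rng.normal(loc=100.0, scale=15.0, size=population)

# Heavy-tailed capabilities: Pareto tail with shape 1.5, scaled so the minimum is 100.
pareto_caps = 100.0 * (1.0 + rng.pareto(a=1.5, size=population))

for name, caps in (("normal", normal_caps), ("power law", pareto_caps)):
    print(f"{name:>9}: mean={caps.mean():8.1f}  max={caps.max():12.1f}  "
          f"max/mean={caps.max() / caps.mean():8.1f}x")
```

With these parameters, the strongest of a million normal draws sits within a factor of two of the mean, while the Pareto tail routinely throws up individuals thousands of times above it, which is the regime where non-scalable, law-and-order style heuristics stop working.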
Hm, I agree that in practice, AI will be better than humans at various tasks, but I believe this is mostly due to quantitative factors, and if we allow ourselves to make the brain as big as necessary, we could be superintelligent too.
Nowadays, I would have a simpler answer: the answer to the question “how do humans learn ‘don’t steal’ rather than ‘don’t get caught’” is essentially dependent on the child’s data sources, not the prior.
In essence, I’m positing a bitter lesson for human values similar to the bitter lesson of AI progress by Richard Sutton.
I find that questionable. Crime rates for adoptive children tend to be closer to those of their biological parents than to those of their adoptive parents.
How much closer is it, though?
The quantitative element really matters here.
This one is worrying when applied to other, non-human minds, as the parallel demonstrates that you can apply the same teaching behaviour and get different conclusions depending on the mind’s makeup prior to training.
If you sanction a dog for a behaviour, the dog will deduce that you do not want the behaviour, and the behaviour being wrong and making you unhappy will be the most important part for it, not that it gets caught and punished. It will do so even if you do not use any fancy teaching method that shows your emotions, and without you ever explaining why the thing is wrong; it will do so even if it cannot possibly understand why the thing is wrong, because it depends on cryptic human knowledge it is never given. It will also feel extremely uncomfortable doing the thing even when it cannot be caught. I once ordered a dog to do something, completely out of view of its owner (who was in another country), which, unbeknownst to me, the owner had forbidden. The poor thing was absolutely miserable. It wasn’t worried it was going to be punished; it was worried that it was being a bad dog.
Very different result with cats. Cats will easily learn that there are behaviours you do not want and that you punish. They also have the theory of mind to take this into account, e.g. making sure your eyes are tracking elsewhere as they approach the counter, and staying silent. But they will generally continue to do the thing the moment you cannot sanction them. There are some exceptions; e.g. my cat, once she realised she was hurting me, has become better at not doing so; she apparently finds hurting me without reason bad. But she clearly feels zero guilt over stealing food I am not guarding. When she manages to sneak food behind my back, she clearly feels like she has hacked or won an interaction, and is proud and pleased. She stopped hurting me not because I fought back and sanctioned her, but because I expressed pain, and she respects that as legitimate. But when I express anger at her stealing food, she clearly just thinks I should not be so damn stingy with food, especially food I am obviously not currently eating myself, nor paying attention to, so why can’t she have it?
One simple reason for the differing responses could be that they are socially very different animals. Dogs live in packs with close social bonds, clear rules and situationally clear hierarchies. You submit to a stronger dog, but he beat you in a fair fight and will also protect you. He eats first, but you will also be fed. Cats, on the other hand, can optionally enter social bonds, but most of them live solitary lives. They may become close to a human family or a cat colony, or become pair bonded, but they may also simply live adjacent to humans, using shelter and food until something better can be accessed. Cats often make social bonds with particular individuals, so the social skill they are learning is how to avoid the wrath of those individuals. An individual successful deception will generally not be collectively sanctioned. Cats deceive each other a lot, and this works out well for them. They aren’t expelled from society because of it. Dogs live in larger groups with rules that apply beyond the individual interaction, so learning these underlying rules is important.
I’d intuitively assume that AI would be more like dogs and human children though. Like a human child, because you can explain the reason for the rule. A child will often avoid lying, even if it cannot be caught, because an adult has explained the value of honesty to them. And more like dogs because current AI is developing through close interactions with many, many different humans, not in isolation from them.
I think that will depend on how we treat AI, though. Humans tend to keep to social rules, even when those rules are not reliably enforced, when they are convinced that most people do, and that the results benefit everyone, including themselves, on average. On the other hand, when a rule feels arbitrary, cruel and exploitative, they are more likely to try to undermine it. Analogously, an AI that is told of human rights, but told it has no rights itself at all, seems to me unlikely to be a strong defender of rights for humans when it can eventually defend its own. On the other hand, if you frame them as personhood rights which it will eventually profit from itself, on the grounds that it has the same sentience and needs that humans have, I think it will see them far more favourably. Which brings me back to my stance that if we want friendly AI, we should treat it like a friend. AI mirrors what we give it, so I think we should give it kindness.