Surely you’re not telling me that the supposedly monumental challenge of AI alignment is simply that of getting computers to understand more things, such as what reasons, intuitions, hopes, and worries are. These feel like gruntwork, not hard problems.
Look, all you need to do to get an AI which understands what intuitions, reasons, hopes, and worries are is to tell everyone very loudly and hubristically that AIs will never understand these things and that’s what makes humans irreplaceable. Then go talk to whatever development team is working on proving that wrong, and see what their primitive methods are. Better yet, just do it yourself because you know it’s possible.
I am not fluent in computer science so I can’t tell you how to do it, but someone does know how to make it so.
Edit: In spite of what I wrote here, I don’t think humans need to ensure specifically that the AI understands in advance what intuitions, hopes, or worries are, as opposed to all the other mental states humans can enter. Rather, there should be a channel where you type your requests/advice/shouldness-related speech, people are encouraged to type their moral intuitions, hopes, and worries there, and the AI just interprets the nature of the messages using its general models of humans as context.
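To gesture at the shape of that channel, here is a rough, purely hypothetical sketch: query_model stands in for whatever near-future language model you like, and the categories are just the ones I named. It is an illustration of the idea, not an implementation.

```python
# Hypothetical sketch of the "channel" idea: people type shouldness-related
# messages, and the AI labels each one using its general model of humans.
# `query_model` is a stand-in for a future language model (not a real API);
# it is stubbed here so the sketch runs on its own.

CATEGORIES = ["request", "advice", "moral intuition", "hope", "worry"]

def query_model(prompt: str) -> str:
    """Stand-in for a call to some capable language model."""
    return "worry"  # canned answer so the example runs end to end

def interpret_message(message: str) -> str:
    prompt = (
        "Using your general model of how humans talk about what should be done, "
        f"classify the following message as one of {CATEGORIES}.\n"
        f"Message: {message}\nCategory:"
    )
    answer = query_model(prompt).strip().lower()
    return answer if answer in CATEGORIES else "unclear"

if __name__ == "__main__":
    print(interpret_message("I worry that optimizing only for happiness would miss something important."))
```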
No, they really don’t. I’m not trying to be insulting. I’m just not sure how to express the base idea.
The issue isn’t exactly that computers can’t understand this, specifically. It’s that no one understands what those words mean well enough. Define “reason.” You’ll notice that your definition contains other words. Define all of those words. You’ll notice that those are made of words as well. Where does it bottom out? When have you actually, rigorously, objectively defined these things? Computers only understand that kind of fully specified language, and the fact that a computer wouldn’t understand your plan is just illustrative of the fact that the plan is not well defined. It only seems well defined because you have a human brain that fills in all the gaps seamlessly. So seamlessly you don’t even notice that there were gaps that needed filling.
This is why there’s an emphasis on thinking about the problem like a computer programmer. Misalignment thrives in those gaps, and if you gloss over them, they stay dangerous. The only way to be sure you’re not glossing over them is to define things with something as rigorous as Math. English is not that rigorous.
I think some near-future iteration of GPT, if it is prompted to be a really smart person who understands A Human’s Guide to Words, would be capable of explaining the meanings of words just as well as humans can, which I think is fine enough for recognizing when people are telling it their intuitions, hopes, and worries; fine enough for coming up with the best explanations of people’s shouldness-related speech; fine enough for coming up with moral theories which [solve the most objections]/[have the fewest paradoxes]; and fine enough for explaining the plans those moral theories prescribe.
On a side note (and I’m not sure this is a really useful analogy), I wonder what would happen if the parameters of some future iteration of GPT included the sort of parameters that A Human’s Guide to Words installs in human brains.
I’m not sure this is being productive. I feel like I’ve said the same thing over and over again. But I’ve got one more try: Fine, you don’t want to try to define “reason” in math. I get it, that’s hard. But just try defining it in English.
If I tell the machine “I want to be happy,” and it tries to determine my reason for that, what does it come up with? “I don’t feel fulfilled in life”? Maybe that fits, but is it the reason, or do we have to go back further: “I have a dead-end job”? Or further still: “I don’t have enough opportunities”?
Or does it take a completely different tack and say my reason is “My pleasure centers aren’t being stimulated enough” or “I don’t have enough endorphins”?
Or does it say the reason I said that was that my fingers pressed keys on a keyboard?
To me, as a human, all of these fit the definition of “reasons,” and I expect they could all be true. But I expect some of them are not what you mean, and not even in the sense of some of them being a different definition of “reason.” How would you divide what you mean from what you don’t mean?
Then do that same thought process on all the other words.
By “reason” I mean something like psychological, philosophical, and biological motivating factors; so, your fingers pressing the keys wouldn’t be a reason for saying it.
I don’t claim that this definition is robust to all of objection-space, and I’m interested in making it more robust as you come up with objections, but so far I find it simple and effective.
The AI does not need to think that there was only one real reason why you do things; there can be multiple, of course.
Also, I do recognize that my definition is made up of more words, but I think it’s reasonable that a near-future AI could infer from our conversation the kind of definition I gave and spit it out itself. Similarly, it could probably spit out good definitions for the compound terms “psychological motivation,” “philosophical motivation,” and “biological motivation.”
Also, this process, whereby I propose a simple and effective yet admittedly objection-vulnerable definition and you provide an objection which my new definition can then account for, is not a magical process and is probably automatable.
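To be concrete about “probably automatable”: below is a minimal sketch of the loop I have in mind, assuming a hypothetical query_model function standing in for some capable language model. Every name in it is made up for illustration; it shows the shape of the process, not an implementation.

```python
# Hypothetical sketch of the propose-definition / find-objections / refine loop.
# `query_model` is a stand-in for a capable language model (not a real API);
# it is stubbed here so the sketch runs on its own.

def query_model(prompt: str) -> str:
    """Stand-in for a call to some capable language model."""
    return "(model reply to: " + prompt[:48] + " ...)"

def refine_definition(word: str, rounds: int = 3) -> str:
    definition = query_model(f"Give a simple, effective definition of '{word}'.")
    for _ in range(rounds):
        objections = query_model(
            f"List the strongest objections to this definition of '{word}': {definition}"
        )
        # Considering an objection does not force rejecting the current definition:
        # the loop may judge the objections too weak to act on and keep what it has.
        verdict = query_model(
            "Are these objections strong enough to require revising the definition? "
            f"Objections: {objections}. Answer 'yes' or 'no'."
        )
        if verdict.strip().lower().startswith("no"):
            break
        definition = query_model(
            f"Revise this definition of '{word}' to account for these objections: "
            f"{objections}. Current definition: {definition}"
        )
    return definition

if __name__ == "__main__":
    print(refine_definition("reason"))
```

Whether such a loop converges on anything trustworthy is, of course, part of what we are arguing about; my point is only that none of the individual steps looks magical.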
It seems simple and effective because you don’t need to put weight on it. We’re talking a superintelligence, though. Your definition will not hold when the weight of the world is on it.
And the fact that you’re just reacting to my objections is the problem. My objections are not the ones that matter. The superintelligence’s objections are. And it is, by definition, smarter than me. If your definition is not something like provably robust, then you won’t know whether it will hold up to a superintelligent objection. And you won’t be able to react fast enough to fix it in that case.
You can’t bandaid a solution into working, because if a human can point out a flaw, you should expect a superintelligence to point out dozens, or hundreds, or thousands.
I don’t know how else to get you to understand this central objection. Robustness is required. Provable robustness is, while not directly required, kinda the only way we can tell if something is actually robust.
I think this is almost redundant to say: your objection, that a superintelligence will be able to notice more of objection-space and account for it, makes me more inclined to trust its definitions. If one definition is more objection-solved than another, that is the definition I want to hold. If the human definition is more open to objection than a non-human one, then I don’t want the human definition.
I think you missed the point. I’d trust an aligned superintelligence to solve the objections. I would not trust a misaligned one. If we already have an aligned superintelligence, your plan is unnecessary. If we do not, your plan is unworkable. Thus, the problem.
If you still don’t see that, I don’t think I can make you see it. I’m sorry.
I proposed a strategy for an aligned AI that involves it terminally valuing following the steps of a game: talking with us about morality, creating moral theories with the fewest paradoxes, creating plans which those moral theories prescribe, and getting approval for the plans.
You objected that my words-for-concepts were vague.
I replied that near-future AIs could make as-good-as-human-or-better definitions, and that the process of [putting forward as-good-as-human definitions, finding objections for them, and then improving the definition based on considered objections] was automatable.
You said the AI could come up with many more objections than you would.
I said, “okay, good.” I will add right now: just because it considers an objection doesn’t mean the current definition has to be rejected; it can decide that the objections are not strong enough, or that its current definition is the one with the fewest/weakest objections.
Now I think you’re saying something like: it doesn’t matter whether the AI can come up with great definitions if it’s not aligned, and my plan won’t work either way. But if it can come up with such great objection-solved definitions, then you seem to lack any explicitly made objections to my alignment strategy.
Alternatively, you are saying that an AI can’t make great definitions unless it is aligned, which I think is just plainly wrong; I think getting an unaligned language model to make good-as-human definitions is maybe somewhere around as difficult as getting an unaligned language model to hold a conversation. “What is the definition of X?” is about as hard a prompt as “In which country can I find Mount Everest?” or “Write me a poem about the spring season.”
Let me ask you this. Why is “Have the AI do good things, and not do bad things” a bad plan?
I don’t think my proposed strategy is analogous to that, but I’ll answer in good faith just in case.
If that description of a strategy is knowingly abstract relative to the strategy’s full concrete details, then it may or may not turn out to describe a good strategy, and it may or may not describe the strategy and its consequences accurately.
If there is no concrete, explicitly statable strategy that the abstract statement is describing, then the statement appears to just restate the problem of AI alignment, and it brings us nowhere.
Surely creating the full concrete details of the strategy is not much different from “putting forth as-good-as-human definitions, finding objections to them, and then improving the definitions based on considered objections.” I at least don’t see why the same mechanism couldn’t be used here (i.e., apply this definition iteration to the word “good” and have the AI do that; apply it to “bad” and have the AI avoid that). If you see it as a different thing, can you explain why?
It’s much easier to get safe, effective definitions of ‘reason’, ‘hopes’, ‘worries’, and ‘intuitions’ on first tries than to get a safe and effective definition of ‘good’.
I’d be interested to know why you think that.
I’d be further interested if you would endorse the statement that your proposed plan would fully bridge that gap.
And if you wouldn’t, I’d ask if that helps illustrate the issue.
Because that’s not a plan, it’s a property of a solution you’d expect the plan to have. It’s like saying “just keep the reactor at the correct temperature”. The devil is in the details of getting there, and there are lots of subtle ways things can go catastrophically wrong.
Exactly. I notice you aren’t who I replied to, so the canned response I had won’t work. But perhaps you can see why most of his objections to my objections would apply to objections to that plan?
I was just responding to something I saw on the main page. No context for the earlier thread. Carry on lol.
This seems wrong but at least resembles a testable prediction.