An interesting idea, but I can still imagine it failing in a few ways:
the AI kills you during the process of building the “incredibly rich world-model”, for example because using the atoms of your body will help it achieve a better model;
the model is somehow misleading, or just your human-level intelligence will make a wrong conclusion when looking at the model.
the AI kills you during the process of building the “incredibly rich world-model”, for example because using the atoms of your body will help it achieve a better model;
OK, I think this is a helpful objection because it helps me further define the “tool”/”agent” distinction. In my mind, an “agent” works towards goals in a freeform way, whereas a “tool” executes some kind of defined process. Google Search is in no danger of killing me in the process of answering my search query (because using my atoms would help it get me better search results). Google Search is not an autonomous agent working towards the goal of getting me good search results. Instead, it’s executing a defined process to retrieve search results.
A tool is a safer tool if I understand the defined process by which it works, the defined process works in a fairly predictable way, and I’m able to anticipate the consequences of following that defined process. Tools are bad tools when they behave unpredictably and create unexpected consequences: for example, a gun is a bad tool if it shoots me in the foot without me having pulled the trigger. A piece of software is a bad tool if it has bugs or doesn’t ask for confirmation before taking an action I might not want it to take.
Based on this logic, the best prospects for “tool AIs” may be “speed superintelligences”/”collective superintelligences”—AIs that execute some kind of well-understood process, but much faster than a human could ever execute, or with a large degree of parallelism. My pocket calculator is a speed superintelligence in this sense. Google Search is more of a collective superintelligence insofar as its work is parallelized.
You can imagine using the tool AI to improve itself to the point where it is just complicated enough for humans to still understand, then doing the world-modeling step at that stage.
Also if humans can inspect and understand all the modifications that the tool AI makes to itself, so it continues to execute a well-understood defined process, that seems good. If necessary you could periodically put the code on some kind of external storage media, transfer it to a new air-gapped computer, and continue development on that computer to ensure that there wasn’t any funny shit going on.
the model is somehow misleading, or just your human-level intelligence will make a wrong conclusion when looking at the model.
Sure, and there’s also the “superintelligent, but with bugs” failure mode where the model is pretty good (enough for the AI to do a lot of damage) but not so good that the AI has an accurate representation of my values.
I imagine this has been suggested somewhere, but an obvious idea is to train many separate models of my values using many different approaches (ex—in addition to what I initially described, also use natural language processing to create a model of human values, and use supervised learning of some sort to learn from many manually entered training examples what human values look like, etc.) Then a superintelligence could test a prospective action against all of these models, and if even one of these models flagged the action as an unethical action, it could flag the action for review before proceeding.
And in order to make these redundant user preference models better, they could be tested against one another: the AI could generate prospective actions at random and test them against all the models; if the models disagreed about the appropriateness of a particular action, this could be flagged as a discrepancy that deserves examination.
My general sense is that with enough safeguards and checks, this “tool AI bootstrapping process” could probably be made arbitrarily safe. Example: the tool AI suggests an improvement to its own code, you review the improvement, you ask the AI why it did things in a particular way, the AI justifies itself, the justification is hard to understand, you make improvements to the justifications module… For each improvement the tool AI generates, it also generates a proof that the improvement does what it says it will do (checked by a separate theorem-proving module) and test coverage for the new improvement… Etc.
I am trying to imagine the weakest dangerous Google Search successor.
Probably this: Imagine that the search engine is able to model you. Adding such ability would make sense commercially, if the producers want to make sure that the customers are satisfied with their product. Let’s assume that the computing power is too cheap and they added too much of this ability. Now the search engine could e.g. find a result with highest rank, but then predict that seeing this result would make you disapointed, so it chooses another result instead, with somewhat lower rank, but with high predicted satisfaction. For the producers this may seem like a desired ability (tailored, personally relevant search results).
As an undesired side-effect, the search engine would de facto gain an ability to lie to you, convincingly. For example, let’s say that the function for measuring customer satisfaction only includes emotional reaction, and doesn’t include things like “a desire to know truth, even if it’s unpleasant”. That could happen for various reason, such as the producers not giving a fuck about our abstract desires, or concluding that abstract desires are mostly a hypocrisy but emotions are honest. Now as a side-effect, instead of unpleasant truth, the search engine would return a comfortable lie, if available. (Because the answer which makes the customer most happy is selected.)
Perhaps people would become aware of this, and would always double-check the answers. But suppose that the search engine is insanely good at modelling you, so it can also predict how specifically are you going to verify the questions, and whether you succeed or fail to find the truth. Now we get the more scary version which lies to you if and only if you are unable to find out that it lied. Thus to you, the search engine will seem completely trustworthy. All answers you have ever received, if you verified them, you learned that they were correct. You are only surprised to see that the search engine sometimes delivers wrong answers to other people; but in such situations you are always unable to convince the other people that those answers were wrong, because the answers are perfectly aligned with their existing beliefs. You could be smart enough to use an outside view to suspect that maybe something similar is happening to you, too. Or you may conclude that the other people are simply idiots.
Let’s imagine even more powerful search engine, and more clever designers, who instead of individual satisfaction with search results try to optimize for general satisfaction with their product in the population as a whole. As a side effect of this, now the search engine would only lie in ways that make society as a whole more happy with the results, and where the society as a whole is unable to find out what is happening. So for example, you could notice that the search engine is spreading a false information, but you would not be able to convince a majority of other people about it (because if the search engine would predict that you could, it would not have displayed the information at the first place).
Why could this be dangerous? A few “noble lies” here and there, what’s the worst thing that could happen? Imagine that the working definition of “satisfaction” is somewhat simplistic and does not include all human values. And imagine an insanely powerful search engine that could predict the results of its manipulation centuries ahead. Such engine could gently push the whole humanity towards some undesired attractor, such as a future where all people are wireheaded (from the point of view of the search engine: customers are maximally satisfied with the outcome), or just brainwashed in a cultish society which supports the search engine because the search engine never contradicts the cult teaching. That pushing would be achieved by giving higher visibility to pages supporting the idea (especially if the idea would seem appealing to the reader), lower visibility to pages explaining the dangers of the idea; and also on more meta levels, e.g. giving higher visibility to pages explaining personal scandals related to the people prominently explaining the dangers of the idea, etc.
Okay, this is stretching the credibility at a few places, but I tried to find a hypothetical scenario where a too powerful but still completely transparently designed Google Search successor would doom humanity.
the AI kills you during the process of building the “incredibly rich world-model”, for example because using the atoms of your body will help it achieve a better model;
OK, I think this is a helpful objection because it helps me further define the “tool”/”agent” distinction. In my mind, an “agent” works towards goals in a freeform way, whereas a “tool” executes some kind of defined process. Google Search is in no danger of killing me in the process of answering my search query (because using my atoms would help it get me better search results). Google Search is not an autonomous agent working towards the goal of getting me good search results. Instead, it’s executing a defined process to retrieve search results.
A tool is a safer tool if I understand the defined process by which it works, the defined process works in a fairly predictable way, and I’m able to anticipate the consequences of following that defined process. Tools are bad tools when they behave unpredictably and create unexpected consequences: for example, a gun is a bad tool if it shoots me in the foot without me having pulled the trigger. A piece of software is a bad tool if it has bugs or doesn’t ask for confirmation before taking an action I might not want it to take.
Based on this logic, the best prospects for “tool AIs” may be “speed superintelligences”/”collective superintelligences”—AIs that execute some kind of well-understood process, but much faster than a human could ever execute, or with a large degree of parallelism. My pocket calculator is a speed superintelligence in this sense. Google Search is more of a collective superintelligence insofar as its work is parallelized.
You can imagine using the tool AI to improve itself to the point where it is just complicated enough for humans to still understand, then doing the world-modeling step at that stage.
Also if humans can inspect and understand all the modifications that the tool AI makes to itself, so it continues to execute a well-understood defined process, that seems good. If necessary you could periodically put the code on some kind of external storage media, transfer it to a new air-gapped computer, and continue development on that computer to ensure that there wasn’t any funny shit going on.
the model is somehow misleading, or just your human-level intelligence will make a wrong conclusion when looking at the model.
Sure, and there’s also the “superintelligent, but with bugs” failure mode where the model is pretty good (enough for the AI to do a lot of damage) but not so good that the AI has an accurate representation of my values.
I imagine this has been suggested somewhere, but an obvious idea is to train many separate models of my values using many different approaches (ex—in addition to what I initially described, also use natural language processing to create a model of human values, and use supervised learning of some sort to learn from many manually entered training examples what human values look like, etc.) Then a superintelligence could test a prospective action against all of these models, and if even one of these models flagged the action as an unethical action, it could flag the action for review before proceeding.
And in order to make these redundant user preference models better, they could be tested against one another: the AI could generate prospective actions at random and test them against all the models; if the models disagreed about the appropriateness of a particular action, this could be flagged as a discrepancy that deserves examination.
My general sense is that with enough safeguards and checks, this “tool AI bootstrapping process” could probably be made arbitrarily safe. Example: the tool AI suggests an improvement to its own code, you review the improvement, you ask the AI why it did things in a particular way, the AI justifies itself, the justification is hard to understand, you make improvements to the justifications module… For each improvement the tool AI generates, it also generates a proof that the improvement does what it says it will do (checked by a separate theorem-proving module) and test coverage for the new improvement… Etc.
An interesting idea, but I can still imagine it failing in a few ways:
the AI kills you during the process of building the “incredibly rich world-model”, for example because using the atoms of your body will help it achieve a better model;
the model is somehow misleading, or just your human-level intelligence will make a wrong conclusion when looking at the model.
OK, I think this is a helpful objection because it helps me further define the “tool”/”agent” distinction. In my mind, an “agent” works towards goals in a freeform way, whereas a “tool” executes some kind of defined process. Google Search is in no danger of killing me in the process of answering my search query (because using my atoms would help it get me better search results). Google Search is not an autonomous agent working towards the goal of getting me good search results. Instead, it’s executing a defined process to retrieve search results.
A tool is a safer tool if I understand the defined process by which it works, the defined process works in a fairly predictable way, and I’m able to anticipate the consequences of following that defined process. Tools are bad tools when they behave unpredictably and create unexpected consequences: for example, a gun is a bad tool if it shoots me in the foot without me having pulled the trigger. A piece of software is a bad tool if it has bugs or doesn’t ask for confirmation before taking an action I might not want it to take.
Based on this logic, the best prospects for “tool AIs” may be “speed superintelligences”/”collective superintelligences”—AIs that execute some kind of well-understood process, but much faster than a human could ever execute, or with a large degree of parallelism. My pocket calculator is a speed superintelligence in this sense. Google Search is more of a collective superintelligence insofar as its work is parallelized.
You can imagine using the tool AI to improve itself to the point where it is just complicated enough for humans to still understand, then doing the world-modeling step at that stage.
Also if humans can inspect and understand all the modifications that the tool AI makes to itself, so it continues to execute a well-understood defined process, that seems good. If necessary you could periodically put the code on some kind of external storage media, transfer it to a new air-gapped computer, and continue development on that computer to ensure that there wasn’t any funny shit going on.
Sure, and there’s also the “superintelligent, but with bugs” failure mode where the model is pretty good (enough for the AI to do a lot of damage) but not so good that the AI has an accurate representation of my values.
I imagine this has been suggested somewhere, but an obvious idea is to train many separate models of my values using many different approaches (ex—in addition to what I initially described, also use natural language processing to create a model of human values, and use supervised learning of some sort to learn from many manually entered training examples what human values look like, etc.) Then a superintelligence could test a prospective action against all of these models, and if even one of these models flagged the action as an unethical action, it could flag the action for review before proceeding.
And in order to make these redundant user preference models better, they could be tested against one another: the AI could generate prospective actions at random and test them against all the models; if the models disagreed about the appropriateness of a particular action, this could be flagged as a discrepancy that deserves examination.
My general sense is that with enough safeguards and checks, this “tool AI bootstrapping process” could probably be made arbitrarily safe. Example: the tool AI suggests an improvement to its own code, you review the improvement, you ask the AI why it did things in a particular way, the AI justifies itself, the justification is hard to understand, you make improvements to the justifications module… For each improvement the tool AI generates, it also generates a proof that the improvement does what it says it will do (checked by a separate theorem-proving module) and test coverage for the new improvement… Etc.
I am trying to imagine the weakest dangerous Google Search successor.
Probably this: Imagine that the search engine is able to model you. Adding such ability would make sense commercially, if the producers want to make sure that the customers are satisfied with their product. Let’s assume that the computing power is too cheap and they added too much of this ability. Now the search engine could e.g. find a result with highest rank, but then predict that seeing this result would make you disapointed, so it chooses another result instead, with somewhat lower rank, but with high predicted satisfaction. For the producers this may seem like a desired ability (tailored, personally relevant search results).
As an undesired side-effect, the search engine would de facto gain an ability to lie to you, convincingly. For example, let’s say that the function for measuring customer satisfaction only includes emotional reaction, and doesn’t include things like “a desire to know truth, even if it’s unpleasant”. That could happen for various reason, such as the producers not giving a fuck about our abstract desires, or concluding that abstract desires are mostly a hypocrisy but emotions are honest. Now as a side-effect, instead of unpleasant truth, the search engine would return a comfortable lie, if available. (Because the answer which makes the customer most happy is selected.)
Perhaps people would become aware of this, and would always double-check the answers. But suppose that the search engine is insanely good at modelling you, so it can also predict how specifically are you going to verify the questions, and whether you succeed or fail to find the truth. Now we get the more scary version which lies to you if and only if you are unable to find out that it lied. Thus to you, the search engine will seem completely trustworthy. All answers you have ever received, if you verified them, you learned that they were correct. You are only surprised to see that the search engine sometimes delivers wrong answers to other people; but in such situations you are always unable to convince the other people that those answers were wrong, because the answers are perfectly aligned with their existing beliefs. You could be smart enough to use an outside view to suspect that maybe something similar is happening to you, too. Or you may conclude that the other people are simply idiots.
Let’s imagine even more powerful search engine, and more clever designers, who instead of individual satisfaction with search results try to optimize for general satisfaction with their product in the population as a whole. As a side effect of this, now the search engine would only lie in ways that make society as a whole more happy with the results, and where the society as a whole is unable to find out what is happening. So for example, you could notice that the search engine is spreading a false information, but you would not be able to convince a majority of other people about it (because if the search engine would predict that you could, it would not have displayed the information at the first place).
Why could this be dangerous? A few “noble lies” here and there, what’s the worst thing that could happen? Imagine that the working definition of “satisfaction” is somewhat simplistic and does not include all human values. And imagine an insanely powerful search engine that could predict the results of its manipulation centuries ahead. Such engine could gently push the whole humanity towards some undesired attractor, such as a future where all people are wireheaded (from the point of view of the search engine: customers are maximally satisfied with the outcome), or just brainwashed in a cultish society which supports the search engine because the search engine never contradicts the cult teaching. That pushing would be achieved by giving higher visibility to pages supporting the idea (especially if the idea would seem appealing to the reader), lower visibility to pages explaining the dangers of the idea; and also on more meta levels, e.g. giving higher visibility to pages explaining personal scandals related to the people prominently explaining the dangers of the idea, etc.
Okay, this is stretching the credibility at a few places, but I tried to find a hypothetical scenario where a too powerful but still completely transparently designed Google Search successor would doom humanity.
OK, I think this is a helpful objection because it helps me further define the “tool”/”agent” distinction. In my mind, an “agent” works towards goals in a freeform way, whereas a “tool” executes some kind of defined process. Google Search is in no danger of killing me in the process of answering my search query (because using my atoms would help it get me better search results). Google Search is not an autonomous agent working towards the goal of getting me good search results. Instead, it’s executing a defined process to retrieve search results.
A tool is a safer tool if I understand the defined process by which it works, the defined process works in a fairly predictable way, and I’m able to anticipate the consequences of following that defined process. Tools are bad tools when they behave unpredictably and create unexpected consequences: for example, a gun is a bad tool if it shoots me in the foot without me having pulled the trigger. A piece of software is a bad tool if it has bugs or doesn’t ask for confirmation before taking an action I might not want it to take.
Based on this logic, the best prospects for “tool AIs” may be “speed superintelligences”/”collective superintelligences”—AIs that execute some kind of well-understood process, but much faster than a human could ever execute, or with a large degree of parallelism. My pocket calculator is a speed superintelligence in this sense. Google Search is more of a collective superintelligence insofar as its work is parallelized.
You can imagine using the tool AI to improve itself to the point where it is just complicated enough for humans to still understand, then doing the world-modeling step at that stage.
Also if humans can inspect and understand all the modifications that the tool AI makes to itself, so it continues to execute a well-understood defined process, that seems good. If necessary you could periodically put the code on some kind of external storage media, transfer it to a new air-gapped computer, and continue development on that computer to ensure that there wasn’t any funny shit going on.
Sure, and there’s also the “superintelligent, but with bugs” failure mode where the model is pretty good (enough for the AI to do a lot of damage) but not so good that the AI has an accurate representation of my values.
I imagine this has been suggested somewhere, but an obvious idea is to train many separate models of my values using many different approaches (ex—in addition to what I initially described, also use natural language processing to create a model of human values, and use supervised learning of some sort to learn from many manually entered training examples what human values look like, etc.) Then a superintelligence could test a prospective action against all of these models, and if even one of these models flagged the action as an unethical action, it could flag the action for review before proceeding.
And in order to make these redundant user preference models better, they could be tested against one another: the AI could generate prospective actions at random and test them against all the models; if the models disagreed about the appropriateness of a particular action, this could be flagged as a discrepancy that deserves examination.
My general sense is that with enough safeguards and checks, this “tool AI bootstrapping process” could probably be made arbitrarily safe. Example: the tool AI suggests an improvement to its own code, you review the improvement, you ask the AI why it did things in a particular way, the AI justifies itself, the justification is hard to understand, you make improvements to the justifications module… For each improvement the tool AI generates, it also generates a proof that the improvement does what it says it will do (checked by a separate theorem-proving module) and test coverage for the new improvement… Etc.