the AI kills you during the process of building the “incredibly rich world-model”, for example because using the atoms of your body will help it achieve a better model;
OK, I think this is a helpful objection, because it helps me further define the “tool”/“agent” distinction. In my mind, an “agent” works towards goals in a freeform way, whereas a “tool” executes some kind of defined process. Google Search is in no danger of killing me in the process of answering my search query on the grounds that using my atoms would help it get me better search results. Google Search is not an autonomous agent working towards the goal of getting me good search results; instead, it’s executing a defined process to retrieve search results.
A tool is safer if I understand the defined process by which it works, if that process behaves in a fairly predictable way, and if I’m able to anticipate the consequences of following it. Tools are bad tools when they behave unpredictably and create unexpected consequences: for example, a gun is a bad tool if it shoots me in the foot without my having pulled the trigger. A piece of software is a bad tool if it has bugs, or if it doesn’t ask for confirmation before taking an action I might not want it to take.
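As a trivial sketch of that last point, in Python: a well-behaved destructive tool asks before acting. The delete_files helper below is made up purely for illustration, not taken from any real program.

```python
import os

def delete_files(paths):
    """A hypothetical destructive action: a well-behaved tool confirms first."""
    print("About to delete:", ", ".join(paths))
    if input("Proceed? [y/N] ").strip().lower() != "y":
        print("Aborted; nothing was deleted.")
        return
    for path in paths:
        os.remove(path)
```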
Based on this logic, the best prospects for “tool AIs” may be “speed superintelligences”/“collective superintelligences”: AIs that execute some kind of well-understood process, but much faster than a human ever could, or with a large degree of parallelism. My pocket calculator is a speed superintelligence in this sense. Google Search is more of a collective superintelligence insofar as its work is parallelized.
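In software terms, the “collective” version is just an ordinary, well-understood process fanned out across workers. A minimal sketch, with primality checking standing in for whatever the defined process happens to be:

```python
from concurrent.futures import ProcessPoolExecutor

def check_prime(n: int) -> bool:
    """A completely predictable, well-understood computation."""
    if n < 2:
        return False
    return all(n % d for d in range(2, int(n ** 0.5) + 1))

def check_many(numbers):
    """The same defined process, parallelized across worker processes.

    Nothing about the process changes when it runs in parallel; it just
    gets done faster, which is the sense in which the result is still a tool.
    """
    with ProcessPoolExecutor() as pool:
        return list(pool.map(check_prime, numbers))
```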
You can imagine using the tool AI to improve itself only up to the point where it’s still just simple enough for humans to understand, then doing the world-modeling step at that stage.
Also, if humans can inspect and understand all the modifications that the tool AI makes to itself, so that it continues to execute a well-understood, defined process, that seems good. If necessary, you could periodically put the code on some kind of external storage media, transfer it to a new air-gapped computer, and continue development on that computer to ensure that there wasn’t any funny shit going on.
the model is somehow misleading, or just your human-level intelligence will make a wrong conclusion when looking at the model.
Sure, and there’s also the “superintelligent, but with bugs” failure mode, where the model is good enough for the AI to do a lot of damage, but not so good that the AI has an accurate representation of my values.
I imagine this has been suggested somewhere, but an obvious idea is to train many separate models of my values using many different approaches (e.g., in addition to what I initially described, also use natural language processing to build a model of human values, use supervised learning of some sort to learn what human values look like from many manually entered training examples, etc.). Then a superintelligence could test a prospective action against all of these models, and if even one of them flagged the action as unethical, it could flag the action for review before proceeding.
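A minimal sketch of that “any one model can veto” rule, assuming each value model is just something that scores a proposed action between 0 and 1. The names here are hypothetical, not from any existing system:

```python
from typing import Callable, List

# A "value model" here is anything that scores how acceptable a proposed
# action is, from 0.0 (clearly bad) to 1.0 (clearly fine). These stand in
# for the separately trained models described above.
ValueModel = Callable[[str], float]

def requires_review(action: str,
                    value_models: List[ValueModel],
                    threshold: float = 0.5) -> bool:
    """Return True if any single model scores the action below the threshold.

    One dissenting model is enough to pause and flag the action for a
    human to look at before anything is executed.
    """
    return any(model(action) < threshold for model in value_models)
```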
And in order to make these redundant user preference models better, they could be tested against one another: the AI could generate prospective actions at random and test them against all the models; if the models disagreed about the appropriateness of a particular action, this could be flagged as a discrepancy that deserves examination.
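A sketch of that cross-testing step, under the same assumptions as above (the generation of candidate actions is left abstract; here they’re simply passed in as a list):

```python
import random
from typing import Callable, Dict, List

ValueModel = Callable[[str], float]

def find_discrepancies(value_models: List[ValueModel],
                       candidate_actions: List[str],
                       sample_size: int = 100,
                       threshold: float = 0.5) -> Dict[str, List[float]]:
    """Sample candidate actions and collect the ones the models disagree on.

    A discrepancy is any sampled action that some models accept and others
    reject; these get surfaced for human examination rather than resolved
    automatically.
    """
    sample = random.sample(candidate_actions,
                           min(sample_size, len(candidate_actions)))
    discrepancies = {}
    for action in sample:
        scores = [model(action) for model in value_models]
        verdicts = [score >= threshold for score in scores]
        if any(verdicts) and not all(verdicts):  # the models disagree
            discrepancies[action] = scores
    return discrepancies
```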
My general sense is that with enough safeguards and checks, this “tool AI bootstrapping process” could probably be made arbitrarily safe. Example: the tool AI suggests an improvement to its own code, you review the improvement, you ask the AI why it did things in a particular way, the AI justifies itself, the justification is hard to understand, you make improvements to the justifications module… For each improvement the tool AI generates, it also generates a proof that the improvement does what it says it will do (checked by a separate theorem-proving module) and test coverage for the new improvement… Etc.
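To make the shape of that review loop concrete, here’s a minimal sketch. Every name in it (the ProposedImprovement structure, the proof checker, the test runner) is hypothetical, and the hard parts are exactly the pieces left as opaque callables:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class ProposedImprovement:
    """A self-modification proposed by the tool AI (hypothetical structure)."""
    patch: str           # the code change itself
    justification: str   # human-readable explanation of why it's done this way
    proof: str           # machine-checkable claim that the patch does what it says
    tests: str           # test coverage accompanying the patch

def review_improvement(improvement: ProposedImprovement,
                       proof_checker: Callable[[str, str], bool],
                       test_runner: Callable[[str, str], bool],
                       human_approves: Callable[[ProposedImprovement], bool]) -> bool:
    """Gate a proposed self-modification behind independent checks.

    The proof checker and test runner are separate modules from the AI that
    generated the improvement, and the human review comes last and can
    always reject.
    """
    if not proof_checker(improvement.patch, improvement.proof):
        return False
    if not test_runner(improvement.patch, improvement.tests):
        return False
    return human_approves(improvement)
```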