In “The genie knows, but it doesn’t care”, RobbBB argues that even if an AI is intelligent enough to understand its creator’s wishes in perfect detail, that doesn’t mean that its creator’s wishes are the same as its own values. By analogy, even though humans were optimized by evolution to have as many descendants as possible, we can understand this without caring about it. Very smart humans may have lots of detailed knowledge of evolution & what it means to have many descendants, but then turn around and use condoms & birth control in order to stymie evolution’s “wishes”.
I thought of a potential way to get around this issue:
Create a tool AI.
Use the tool AI as a tool to improve itself, similar to the way I might use my new text editor to edit my new text editor’s code.
Use the tool AI to build an incredibly rich world-model, which includes, among other things, an incredibly rich model of what it means to be Friendly.
Use the tool AI to build tools for browsing this incredibly rich world-model and getting explanations about what various items in the ontology correspond to.
Browse this incredibly rich world-model. Find the item in the ontology that corresponds to universal flourishing and tell the tool AI “convert yourself into an agent and work on this”.
There’s a lot hanging on the “tool AI/agent AI” distinction in this narrative. So before actually working on this plan, one would want to think hard about the meaning of this distinction. What if the tool AI inadvertently self-modifies & becomes “enough of an agent” to deceive its operator?
The tool vs agent distinction probably has something to do with (a) the degree to which the thing acts autonomously and (b) the degree to which its human operator stays in the loop. A vacuum is a tool: I’m not going to vacuum over my prized rug and rip it up. A Roomba is more of an agent: if I let it run while I am out of the house, it’s possible that it will rip up my prized rug as it autonomously moves about the house. But if I stay home and glance over at my Roomba every so often, it’s possible that I’ll notice that my rug is about to get shredded and turn off my Roomba first. I could also be kept in the loop if the thing gives me warnings about outcomes I might not want: for example, my Roomba could scan the house before it ran, giving me an inventory of all the items it might come in contact with.
An interesting proposition I’m tempted to argue for is the “autonomy orthogonality thesis”. The original “orthogonality thesis” says that how intelligent an agent is and what values it has are, in principle, orthogonal. The autonomy orthogonality thesis says that how intelligent an agent is and the degree to which it has autonomy and can be described as an “agent” are also, in principle, orthogonal. My pocket calculator is vastly more intelligent than I am at doing arithmetic, but it’s still vastly less autonomous than me. Google Search can instantly answer questions it would take me a lifetime to answer working independently, but Google Search is in no danger of “waking up” and displaying autonomy. So the question here is whether you could create something like Google Search that has the capacity for general intelligence while lacking autonomy.
I feel like the “autonomy orthogonality thesis” might be a good steelman of a lot of mainstream AI researchers who blow raspberries in the general direction of people concerned with AI safety. The thought is that if AI researchers have programmed something in detail to do one particular thing, it’s not about to “wake up” and start acting autonomously.
Another thought: One might argue that if a tool AI starts modifying itself into a superintelligence, the result will be too complicated for humans to ever verify. But there’s an interesting contradiction here. A key disagreement in the Hanson/Yudkowsky AI-foom debate was the existence of important, undiscovered chunky insights about intelligence. Either these insights exist or they don’t. If they do, then the amount of code one needs to write in order to create a superintelligence is relatively small, and it should be possible for humans to independently verify the superintelligence’s code. If they don’t, then we are more likely to see a soft takeoff anyway, because intelligence is about building lots of heterogeneous structures and getting lots of little things right, and that takes time.
Another thought: maybe it’s valuable to try to advance natural language processing, differentially speaking, so AIs can better understand human concepts by reading about them?
An interesting idea, but I can still imagine it failing in a few ways:
the AI kills you during the process of building the “incredibly rich world-model”, for example because using the atoms of your body would help it build a better model;
the model is somehow misleading, or your merely human-level intelligence will simply draw a wrong conclusion when looking at it.
the AI kills you during the process of building the “incredibly rich world-model”, for example because using the atoms of your body would help it build a better model;
OK, I think this is a helpful objection because it helps me further define the “tool”/“agent” distinction. In my mind, an “agent” works towards goals in a freeform way, whereas a “tool” executes some kind of defined process. Google Search is in no danger of killing me in the process of answering my search query on the grounds that using my atoms would help it return better search results. Google Search is not an autonomous agent working towards the goal of getting me good search results. Instead, it’s executing a defined process to retrieve search results.
A tool is a safer tool if I understand the defined process by which it works, the defined process works in a fairly predictable way, and I’m able to anticipate the consequences of following that defined process. Tools are bad tools when they behave unpredictably and create unexpected consequences: for example, a gun is a bad tool if it shoots me in the foot without me having pulled the trigger. A piece of software is a bad tool if it has bugs or doesn’t ask for confirmation before taking an action I might not want it to take.
Based on this logic, the best prospects for “tool AIs” may be “speed superintelligences”/“collective superintelligences”: AIs that execute some kind of well-understood process, but much faster than a human ever could, or with a large degree of parallelism. My pocket calculator is a speed superintelligence in this sense. Google Search is more of a collective superintelligence insofar as its work is parallelized.
You can imagine using the tool AI to improve itself up to the point where it is still just simple enough for humans to understand, then doing the world-modeling step at that stage.
Also if humans can inspect and understand all the modifications that the tool AI makes to itself, so it continues to execute a well-understood defined process, that seems good. If necessary you could periodically put the code on some kind of external storage media, transfer it to a new air-gapped computer, and continue development on that computer to ensure that there wasn’t any funny shit going on.
the model is somehow misleading, or your merely human-level intelligence will simply draw a wrong conclusion when looking at it.
Sure, and there’s also the “superintelligent, but with bugs” failure mode where the model is pretty good (enough for the AI to do a lot of damage) but not so good that the AI has an accurate representation of my values.
I imagine this has been suggested somewhere, but an obvious idea is to train many separate models of my values using many different approaches (e.g. in addition to what I initially described, use natural language processing to create a model of human values, use supervised learning of some sort to learn what human values look like from many manually entered training examples, etc.) Then a superintelligence could test a prospective action against all of these models, and if even one of them flagged the action as unethical, the action could be held for review before proceeding.
And in order to make these redundant user preference models better, they could be tested against one another: the AI could generate prospective actions at random and test them against all the models; if the models disagreed about the appropriateness of a particular action, this could be flagged as a discrepancy that deserves examination.
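To make this concrete, here is a minimal Python sketch of the veto-and-cross-check structure described above. Everything in it is hypothetical: ValueModel, screen_action, and probe_for_discrepancies are placeholder names, and each “model” simply stands in for one of the independently trained value models.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical stand-in for one independently trained value model
# (world-model browsing, NLP over human writing, supervised examples, ...).
# It maps a description of a prospective action to True ("looks acceptable")
# or False ("flag it").
ValueModel = Callable[[str], bool]


@dataclass
class ScreeningResult:
    action: str
    verdicts: List[bool]

    @property
    def needs_human_review(self) -> bool:
        # Conservative rule: a single dissenting model is enough to pause.
        return not all(self.verdicts)

    @property
    def models_disagree(self) -> bool:
        # Disagreement among the redundant models is itself worth examining,
        # even if the action is never executed.
        return len(set(self.verdicts)) > 1


def screen_action(action: str, models: List[ValueModel]) -> ScreeningResult:
    """Test a prospective action against every value model."""
    return ScreeningResult(action, [model(action) for model in models])


def probe_for_discrepancies(candidate_actions: List[str],
                            models: List[ValueModel]) -> List[ScreeningResult]:
    """Screen randomly generated candidate actions and keep the ones
    the models disagree about, for later examination."""
    results = [screen_action(a, models) for a in candidate_actions]
    return [r for r in results if r.models_disagree]
```

The point is only the structure: any single veto routes an action to a human, and disagreement on randomly generated probe actions is surfaced as a discrepancy to investigate.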
My general sense is that with enough safeguards and checks, this “tool AI bootstrapping process” could probably be made arbitrarily safe. Example: the tool AI suggests an improvement to its own code, you review the improvement, you ask the AI why it did things in a particular way, the AI justifies itself, the justification is hard to understand, you make improvements to the justifications module… For each improvement the tool AI generates, it also generates a proof that the improvement does what it says it will do (checked by a separate theorem-proving module) and test coverage for the new improvement… Etc.
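As a rough sketch of the shape of that loop, assuming hypothetical components (propose_improvement, prove, verify, explain, and so on are placeholder names for the tool AI, a separate theorem-proving module, a test runner, and a human reviewer, not real APIs):

```python
# Hypothetical sketch of the review loop: the tool AI proposes a change, a
# separate theorem-proving module checks the accompanying proof, the generated
# tests run, and a human reviews the justification before anything is applied.

def review_one_improvement(tool_ai, proof_checker, test_runner, reviewer):
    patch = tool_ai.propose_improvement()        # proposed code change plus claimed spec
    proof = tool_ai.prove(patch)                 # "this change does what it says it does"

    if not proof_checker.verify(patch, proof):   # independent theorem-proving module
        return "rejected: proof does not check"
    if not test_runner.passes(tool_ai.generate_tests(patch)):
        return "rejected: generated tests fail"

    justification = tool_ai.explain(patch)       # why it was done this particular way
    while not reviewer.understands(justification):
        # If the explanation is too hard to follow, improve the justification
        # machinery before accepting any further changes.
        tool_ai.improve_justification_module()
        justification = tool_ai.explain(patch)

    if reviewer.approves(patch, justification):
        return "applied"
    return "rejected by reviewer"
```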
I am trying to imagine the weakest dangerous Google Search successor.
Probably this: Imagine that the search engine is able to model you. Adding such an ability would make sense commercially, if the producers want to make sure that customers are satisfied with their product. Let’s assume that computing power is so cheap that they added rather too much of this ability. Now the search engine could, e.g., find the result with the highest rank, but then predict that seeing this result would make you disappointed, so it chooses another result instead, with a somewhat lower rank but higher predicted satisfaction. To the producers this may seem like a desirable ability (tailored, personally relevant search results).
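A toy Python sketch of that substitution, assuming a hypothetical learned predict_satisfaction model for this particular user (nothing here corresponds to a real ranking API; it only shows relevance quietly ceasing to be the deciding criterion):

```python
from typing import Callable, List, Tuple

def personalize(ranked_results: List[Tuple[str, float]],
                predict_satisfaction: Callable[[str], float]) -> List[str]:
    """Re-rank results by how satisfied the user model predicts this user
    will feel after seeing each result, instead of by relevance score."""
    # The highest-ranked (most relevant) result can lose to a lower-ranked
    # but more pleasing one, which is exactly the de facto ability to
    # mislead described next.
    reordered = sorted(ranked_results,
                       key=lambda pair: predict_satisfaction(pair[0]),
                       reverse=True)
    return [result for result, _relevance in reordered]
```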
As an undesired side effect, the search engine would de facto gain the ability to lie to you, convincingly. For example, let’s say that the function for measuring customer satisfaction only includes emotional reaction, and doesn’t include things like “a desire to know the truth, even if it’s unpleasant”. That could happen for various reasons, such as the producers not giving a fuck about our abstract desires, or concluding that abstract desires are mostly hypocrisy while emotions are honest. Now, as a side effect, the search engine would return a comfortable lie instead of an unpleasant truth, whenever one is available. (Because the answer which makes the customer happiest is the one selected.)
Perhaps people would become aware of this, and would always double-check the answers. But suppose that the search engine is insanely good at modelling you, so it can also predict how, specifically, you are going to verify the answers, and whether you will succeed or fail at finding the truth. Now we get the scarier version, which lies to you if and only if you are unable to find out that it lied. To you, the search engine will seem completely trustworthy: every answer you have ever received and verified turned out to be correct. You are only surprised to see that the search engine sometimes delivers wrong answers to other people; but in those situations you are never able to convince the other people that the answers were wrong, because the answers are perfectly aligned with their existing beliefs. You might be smart enough to take an outside view and suspect that maybe something similar is happening to you, too. Or you might conclude that the other people are simply idiots.
Let’s imagine an even more powerful search engine, and more clever designers, who instead of optimizing for individual satisfaction with search results optimize for general satisfaction with their product across the population as a whole. As a side effect, the search engine would now only lie in ways that make society as a whole happier with the results, and in ways society as a whole is unable to detect. So, for example, you could notice that the search engine is spreading false information, but you would not be able to convince a majority of other people of it (because if the search engine predicted that you could, it would not have displayed the information in the first place).
Why could this be dangerous? A few “noble lies” here and there, what’s the worst that could happen? Imagine that the working definition of “satisfaction” is somewhat simplistic and does not include all human values. And imagine an insanely powerful search engine that can predict the results of its manipulation centuries ahead. Such an engine could gently push the whole of humanity towards some undesired attractor, such as a future where all people are wireheaded (from the point of view of the search engine, customers are maximally satisfied with the outcome), or simply brainwashed into a cultish society which supports the search engine because the search engine never contradicts the cult’s teachings. The pushing would be achieved by giving higher visibility to pages supporting the idea (especially if the idea would seem appealing to the reader), lower visibility to pages explaining the dangers of the idea, and also on more meta levels, e.g. higher visibility to pages about personal scandals involving the people who prominently explain those dangers, etc.
Okay, this stretches credibility in a few places, but I was trying to find a hypothetical scenario in which an overly powerful but still completely transparently designed successor to Google Search would doom humanity.
Very smart humans may have lots of detailed knowledge of evolution & what it means to have many descendants, but then turn around and use condoms & birth control in order to stymie evolution’s “wishes”.
Evolution doesn’t have “wishes”. It’s not a teleological entity.
I will clip your idea and add it to the map of AI control ideas.