As far as I understand it, you are proposing that the most realistic failure mode consists of many AI systems, all put into a position of power by humans, and optimizing for their own proxies. Call these Trusted Trial and Error AI’s (TTE)
The distinguishing features of TTE’s are that they were Trusted. A human put them in a position of power. Humans have refined, understood and checked the code enough that they are prepared to put this algorithm in a self driving car, or a stock management system. They are not lab prototypes. They are also Trial and error learners, not one shot learners.
Some More descriptions of what capability range I am considering.
Suppose hypothetically that we had TTE reinforcement learners, a little better than todays state of the art, and nothing beyond that. The AI’s are advanced enough that they can take a mountain of medical data and train themselves to be skilled doctors by trial and error. However they are not advanced enough to figure out how humans work from, say a sequenced genome and nothing more.
Give them control of all the traffic lights in a city, and they will learn how to minimize traffic jams. They will arrange for people to drive in circles rather than stay still, so that they do not count as part of a traffic jam. However they will not do anything outside their preset policy space, like hacking into the traffic light control system of other cities, or destroying the city with nukes.
If such technology is easily available, people will start to use it for things. Some people put it in positions of power, others are more hesitant. As the only way the system can learn to avoid something is through trial and error, the system has to cause a (probably several) public outcrys before it learns not to do so. If no one told the traffic light system that car crashes are bad on simulations or past data, (Alignment failure) Then even if public opinion feeds directly into reward, it will have to cause several car crashes that are clearly its fault before it learns to only cause crashes that can be blamed on someone else. However, deliberately causing crashes will probably get the system shut off or seriously modified.
Note that we are supposing many of these systems existing, so the failures of some, combined with plenty of simulated failures, will give us a good idea of the failure modes.
The space of bad things an AI can get away with is small and highly complex in the space of bad things. An TTE set to reduce crime rates tries making the crime report forms longer, this reduces reported crime, but humans quickly realize what its doing. It would have to do this and be patched many times before it came up with a method that humans wouldn’t notice.
Given Advanced TTE’s as the most advanced form of AI, we might slowly develop a problem, but the deployment of TTE’s would be slowed by the time it takes to gather data and check reliability. Especially given mistrust after several major failures. And I suspect that due to statistical similarity of training and testing, many different systems optimizing different proxies, and humans having the best abstract reasoning about novel situations, and the power to turn the systems off, any discrepancy of goals will be moderately minor. I do not expect such optimization power to be significantly more powerful or less aligned than modern capitalism.
This all assumes that no one will manage to make a linear time AIXI. If such a thing is made, it will break out of any boxes and take over the world. So, we have a social process of adaption to TTE AI, which is already in its early stages with things like self driving cars, and at any time, this process could be rendered irrelevant by the arrival of a super-intelligence.
As far as I understand it, you are proposing that the most realistic failure mode consists of many AI systems, all put into a position of power by humans, and optimizing for their own proxies. Call these Trusted Trial and Error AI’s (TTE)
The distinguishing features of TTE’s are that they were Trusted. A human put them in a position of power. Humans have refined, understood and checked the code enough that they are prepared to put this algorithm in a self driving car, or a stock management system. They are not lab prototypes. They are also Trial and error learners, not one shot learners.
Some More descriptions of what capability range I am considering.
Suppose hypothetically that we had TTE reinforcement learners, a little better than todays state of the art, and nothing beyond that. The AI’s are advanced enough that they can take a mountain of medical data and train themselves to be skilled doctors by trial and error. However they are not advanced enough to figure out how humans work from, say a sequenced genome and nothing more.
Give them control of all the traffic lights in a city, and they will learn how to minimize traffic jams. They will arrange for people to drive in circles rather than stay still, so that they do not count as part of a traffic jam. However they will not do anything outside their preset policy space, like hacking into the traffic light control system of other cities, or destroying the city with nukes.
If such technology is easily available, people will start to use it for things. Some people put it in positions of power, others are more hesitant. As the only way the system can learn to avoid something is through trial and error, the system has to cause a (probably several) public outcrys before it learns not to do so. If no one told the traffic light system that car crashes are bad on simulations or past data, (Alignment failure) Then even if public opinion feeds directly into reward, it will have to cause several car crashes that are clearly its fault before it learns to only cause crashes that can be blamed on someone else. However, deliberately causing crashes will probably get the system shut off or seriously modified.
Note that we are supposing many of these systems existing, so the failures of some, combined with plenty of simulated failures, will give us a good idea of the failure modes.
The space of bad things an AI can get away with is small and highly complex in the space of bad things. An TTE set to reduce crime rates tries making the crime report forms longer, this reduces reported crime, but humans quickly realize what its doing. It would have to do this and be patched many times before it came up with a method that humans wouldn’t notice.
Given Advanced TTE’s as the most advanced form of AI, we might slowly develop a problem, but the deployment of TTE’s would be slowed by the time it takes to gather data and check reliability. Especially given mistrust after several major failures. And I suspect that due to statistical similarity of training and testing, many different systems optimizing different proxies, and humans having the best abstract reasoning about novel situations, and the power to turn the systems off, any discrepancy of goals will be moderately minor. I do not expect such optimization power to be significantly more powerful or less aligned than modern capitalism.
This all assumes that no one will manage to make a linear time AIXI. If such a thing is made, it will break out of any boxes and take over the world. So, we have a social process of adaption to TTE AI, which is already in its early stages with things like self driving cars, and at any time, this process could be rendered irrelevant by the arrival of a super-intelligence.