The obvious thing to do is myopia. Do not create an agent concerned about the “number of paperclips on the earth/universe” in the first place.
Subdivide the problem into tasks, and accomplish the tasks separately, using either the same or many separate general models to accomplish each one.
Task 1: Given this catalog of paperclip factory equipment, and these simulations of their capabilities, find an optimal factory layout for the equipment
Task 2: Given the input data for a machine n, complete the substep with the desired local world state m, solutions must be completable within time limit L.
For example, “remove the wire from the box”, “cut wire”, “fold paperclip”, “box folded paperclips” are all separate steps.
Task 3: Given an N-step manufacturing process, design equipment that fuses steps together into (N-m) steps if there is an efficiency gain.
And so on. Each agent runs in short, time limited sessions, and forgets everything that happened when the session ends—in fact, most information is forgotten on an ongoing basis. Agents are temporally myopic.
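A minimal sketch of what one such short, time-limited, forgetful session could look like. The names `Task`, `run_session`, and `cut_wire`, and the payload fields, are hypothetical and chosen only for illustration. The point is that the only working memory is created inside the session and thrown away when it ends.

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Task:
    """A self-contained task: everything the worker may use is in here."""
    name: str
    payload: dict
    time_limit_s: float


def run_session(task: Task,
                solve_step: Callable[[dict, dict], Optional[dict]]) -> Optional[dict]:
    """Run one short session; return the result, or None if the time limit is hit.

    `scratch` is the only working memory. It is created inside the session
    and discarded on return, so nothing carries over to the next session.
    """
    scratch: dict = {}
    deadline = time.monotonic() + task.time_limit_s
    while time.monotonic() < deadline:
        result = solve_step(task.payload, scratch)
        if result is not None:
            return result
    return None


def cut_wire(payload: dict, scratch: dict) -> Optional[dict]:
    """Toy substep (an assumption): cut one piece per call until the wire is used up."""
    scratch["cuts"] = scratch.get("cuts", 0) + 1
    if scratch["cuts"] * payload["piece_mm"] >= payload["wire_mm"]:
        return {"pieces": scratch["cuts"]}
    return None


task = Task("cut wire", {"wire_mm": 300, "piece_mm": 100}, time_limit_s=1.0)
first = run_session(task, cut_wire)
second = run_session(task, cut_wire)   # starts from a blank scratch again
```

Running the same task twice gives the same answer, because the second session cannot see anything the first one did.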
These seem like arguments that it should be possible, with a great deal of caution, to create an agent that doesn’t immediately crash and burn in the way Russell’s claim predicts. They are not arguments that such failures are unlikely, nor that even these agents won’t fail slightly later.
The above prevents the cause of most embedded-system failures: state buildup.
Whether it be routers, laptops, cars, or Patriot missile systems, the most common cause of failure in an embedded system is not that it fails during testing in its known state right after boot, but that it fails later. And the cause of the later failure is internal state built up in the machine’s memory.
High-reliability web services move to “stateless microservices” for this reason. “Temporal myopia” in practice means “clear state as often as you can”, which is functionally the same thing.
So no, it won’t fail later. The above system will probably not ever fail at any rate above the base failure rate when it was built.
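As a toy illustration of state buildup (an assumption for illustration, not a model of any particular system named above), compare a clock that accumulates ticks in memory with one that is recomputed from its input on every call. The accumulated version is exact right after "boot" and drifts with uptime; the stateless version has no history to drift from.

```python
TICK_S = 0.1  # nominal tick length in seconds; 0.1 has no exact binary representation


def accumulated_clock(ticks: int) -> float:
    """Stateful style: add one tick at a time to a value kept in memory."""
    clock = 0.0
    for _ in range(ticks):
        clock += TICK_S          # rounding error piles up with uptime
    return clock


def stateless_clock(ticks: int) -> float:
    """Stateless style: derive the time from the tick count alone."""
    return ticks / 10


def drift(ticks: int) -> float:
    """Absolute disagreement between the two clocks after `ticks` ticks."""
    return abs(accumulated_clock(ticks) - stateless_clock(ticks))


just_booted = drift(1)            # no error right after start
long_uptime = drift(1_000_000)    # error that only shows up later
```

Clearing state (restarting the accumulation) puts the system back in the condition it was tested in, which is the sense in which "clear state as often as you can" keeps the failure rate near the base rate.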
Yeah, this may be a crux I have: I do not think that myopia is likely to be retained by default, especially if it impacts capabilities negatively.
Also, even with myopia, you need to have causal decision theory or a variant of this, otherwise deceptive alignment and alignment failures still can happen.
For similar reasons, I am bearish on the Open Agency model.
Another crux I have is that the Open Agency model as well as your plan rely on a strong version of the Factored Cognition hypothesis.
I think that while there are systems that can be factored, I am much less sure whether the majority or all of the tasks we might want an AGI/ASI to do are factorable at all.
I do not think that myopia is likely to be retained by default, especially if it impacts capabilities negatively.
This is empirical reality now. Most or all reliable software systems in use right now make heavy use of myopia. It’s a critical strategy for reliability. The software companies that failed to adopt such strategies usually went broke, except for Microsoft et al.
you need to have causal decision theory or a variant of this, otherwise deceptive alignment and alignment failures still can happen.
For similar reasons, I am bearish on the Open Agency model.
This sounds complex, do you have a post you can link on this?
Also for models training on subdivided tasks, where does the reward gradient support development of such complex capabilities?
I think that while there are systems that can be factored, I am much less sure whether the majority or all of the tasks we might want an AGI/ASI to do are factorable at all.
Care to give an example? I tend to think of 2 big ones you would use an ASI for.
1. “keep patient n alive, and with more score if at the end of this episode, patient n is in a state where the probability that a model of this type can keep the patient alive is high”
This is very subdividable—keeping someone alive is a bunch of separable life support tasks, where each can be provided by separated equipment, and even parallel instances of that equipment.
Or succinctly, you are using a subdivided system to replace the operation of another extremely subdivided system (network of cells)
2. “develop a nanoforge, defined by a large machine that can make all the parts used in itself, made solely of atomically precise subcomponents”. This also subdivides into many isolated tasks, albeit with many stages of integration and subdivision back into isolated tasks.
Note that for convenience and cost you would likely use general agents, able to do many kinds of tasks, to do each separated task. What makes them separated is that they output their communications in a format humans can also read, and assign tasks to other agents, which may or may not be additional instances of ‘themselves’.
This sounds complex, do you have a post you can link on this?
The link is to Open Problems with Myopia, and it talks about the case where myopia works, but there are various failure modes of myopic behavior, and a lot of the problems stem from decision theories that are too smart.
Care to give an example? I tend to think of 2 big ones you would use an ASI for.
My claim is somewhat different, so I won’t give you an example. I’m not concerned about whether there exist useful tasks that allow factorization and myopia; assembly lines are an existence proof. I’m concerned about whether the majority of tasks/jobs, or the majority of the economic value that we want AI/AGI to be involved in, is factorizable this way, and whether it is compatible with a myopic setup.
And in particular, I want to get more quantitative on how much myopia/factorization is a usable setup for tasks/jobs.
This is empirical reality now. Most or all reliable software systems in use right now make heavy use of myopia. It’s a critical strategy for reliability. The software companies that failed to adopt such strategies usually went broke, except for Microsoft et al.
I note that the fact that non-myopia was a strategy that Microsoft and other companies used successfully is very concerning to me, as the fact that such companies are now worth billions of dollars and have thousands to tens of thousands of jobs suggests something concerning:
Namely, that non-myopia is either necessary or useful for generating a lot of economic value in at least one field, and presumably for deploying AI there as well. This is worrying because it almost certainly implies that there are other jobs where non-myopic, non-factorizable work is either beneficial or necessary for doing the task.
A final word on myopia:
Paul Christiano said that he would be fine with RLHF being myopic for a single episode, but I think that this is actually a problem for one reason:
Per-episode myopia relies on you being able to detect how much optimization beyond the episode is occurring, which is harder than detecting the existence of non-myopia, which is all that per-step myopia requires.
and a lot of the problems stem from decision theories that are too smart.
Complex hostile subsystems won’t be developed by AI models without an optimization pressure that rewards them for doing so. This is, I think, a big part of the current schisms. We can’t know that a black-box model isn’t deceiving us, in the same way we can’t know that the government isn’t hiding secret alien technology, but both can be extremely unlikely. In a way, what I am hearing is essentially an AGI “conspiracy theory”: that above a certain level of intelligence an AI model would be invisibly conspiring against us with no measurable sign. It is impossible to disprove, just as you cannot actually disprove that the government is secretly doing $conspiracy. (The unlikelihood scales with the number of people who would have to be involved, the cost, the benefit to the government, and the number of obvious crimes the conspiracy would require the conspirators to stay silent about.)
My claim is somewhat different, so I won’t give you an example. I’m not concerned about whether there exist useful tasks that allow factorization and myopia; assembly lines are an existence proof. I’m concerned about whether the majority of tasks/jobs, or the majority of the economic value that we want AI/AGI to be involved in, is factorizable this way, and whether it is compatible with a myopic setup.
Care to try to even think through the list from a high level? When I do this exercise I see nothing but factorable tasks everywhere, but part of the bias is that humans have to factor tasks. We are measurably more efficient as singletons. Such as “all manufacturing”, “all resource gathering”, “all construction”, “megascale biotech research”—all very separable tasks.
Per-episode myopia relies on you being able to detect how much optimization beyond the episode is occurring, which is harder than detecting the existence of non-myopia, which is all that per-step myopia requires.
Are you assuming online training? I was assuming offline training, and auto populating simulations from online data that you offline train on.
I note that the fact that non-myopia was a strategy that Microsoft and other companies used successfully is very concerning to me, as the fact that such companies are now worth billions of dollars and have thousands to tens of thousands of jobs suggests something concerning:
Microsoft products are rarely used in high-reliability systems anywhere for this reason. Not because human organizations are perfect, but because the selection is evolutionary: use Windows in a product that fails, and you lose money.
Care to try to even think through the list from a high level? When I do this exercise I see nothing but factorable tasks everywhere, but part of the bias is that humans have to factor tasks. We are measurably more efficient as singletons. Such as “all manufacturing”, “all resource gathering”, “all construction”, “megascale biotech research”—all very separable tasks.
A counterexample to the factoring of tasks is given by Steven Byrnes:
For benefits of generality (4.3.2.1), an argument I find compelling is that if you’re trying to invent a new invention or design a new system, you need a cross-domain system-level understanding of what you’re trying to do and how. Like at my last job, it was not at all unusual for me to find myself sketching out the algorithms on a project and sketching out the link budget and scrutinizing laser spec sheets and scrutinizing FPGA spec sheets and nailing down end-user requirements, etc. etc. Not because I’m individually the best person at each of those tasks—or even very good!—but because sometimes a laser-related problem is best solved by switching to a different algorithm, or an FPGA-related problem is best solved by recognizing that the real end-user requirements are not quite what we thought, etc. etc. And that kind of design work is awfully hard unless a giant heap of relevant information and knowledge is all together in a single brain / world-model.
Also see: https://www.lesswrong.com/posts/5hApNw5f7uG8RXxGS/the-open-agency-model
Open Problems with Myopia: https://www.lesswrong.com/posts/LCLBnmwdxkkz5fNvH/open-problems-with-myopia
Fair, though it is separable.
Take the task of designing something like a car’s internals.
You might start with a rough idea of the specs, and a precise equation for the value of each feature. You have a scaled model for how it needs to look.
You start a search process where you consider many possible ways to arrange the components within the body shell. Say none of the configurations will fit and meet specs.
You send a request up the stack for a scaled up version of the shell. You get it. You arrange the components into possible designs that fit, and then send the candidate design for simulated testing.
The simulated testing reveals a common failure in one of the parts, and all of the available alternatives for that part have a flaw. So you send a request to the “part designer” to give you a part that satisfies these new tightened specs that will not allow the flaw, and ask for a range of alternate packages.
The resulting redesigned part is now too big to fit, so you rearrange the parts again/send a request to the body shell designer for even more space, and so on.
It is many, many iterative interactions, where the flow of the process has to go up and down the stack many times. In addition, I am describing the flow for only one design candidate. It’s actually a large tree of candidates you should be checking: each time there is a choice, you queue up a message to the next stage for each possible choice you could have made (and prune the worst ones from all the packages in flight in the system).
If you think about how to implement this, one way is data driven. All the “roles” in this network sit there quiescent waiting for an initial data package. All the context of the process is in the message itself, there is no agent “responsible” for the car design getting finished, but a message flow pipeline where after some time you will get valid car design alternatives in the ‘in box’ of the system that sent the request, or a message stating that the process failed from an intractable problem. (there were constraints that could not be satisfied after exhausting every design permutation)
There is no reason these roles cannot be superintelligences, but they get no context. They don’t think or have an internal narrative, they wait forever for a message, but apply superhuman and general skill when given the task. They are stateless microservices, though as they do have superintelligence level neural architectures, they are too fat to be called ‘micro’.
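A rough sketch of the data-driven flow described above, under heavy simplifying assumptions. The role names (`layout`, `shell`, `simulate`, `part_designer`), the toy integer sizes, and the pruning rule (keep the candidates with the smallest shell) are all hypothetical. Each role is a pure function that only sees the message it is handed, and all context travels inside the message.

```python
from collections import deque
from dataclasses import dataclass, replace
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Design:
    """One candidate in flight; the whole context lives in the message."""
    to: str            # role that should handle this message next
    shell_size: int    # space available in the body shell (toy units)
    part_size: int     # space the critical part needs (toy units)
    part_ok: bool      # whether the part passes simulated testing
    hops: int = 0      # how many roles have handled this candidate


def layout(d: Design) -> List[Design]:
    # Parts fit: send to simulated testing. Otherwise ask for a bigger shell.
    nxt = "simulate" if d.part_size <= d.shell_size else "shell"
    return [replace(d, to=nxt, hops=d.hops + 1)]


def shell(d: Design) -> List[Design]:
    # Offer two scaled-up shells; each becomes its own branch of the tree.
    return [replace(d, to="layout", shell_size=d.shell_size + k, hops=d.hops + 1)
            for k in (1, 2)]


def simulate(d: Design) -> List[Design]:
    nxt = "done" if d.part_ok else "part_designer"
    return [replace(d, to=nxt, hops=d.hops + 1)]


def part_designer(d: Design) -> List[Design]:
    # The redesigned part removes the flaw but is larger.
    return [replace(d, to="layout", part_size=d.part_size + 2,
                    part_ok=True, hops=d.hops + 1)]


ROLES: Dict[str, Callable[[Design], List[Design]]] = {
    "layout": layout, "shell": shell,
    "simulate": simulate, "part_designer": part_designer,
}


def run_pipeline(request: Design, beam: int = 4, max_hops: int = 20) -> List[Design]:
    """Pump messages between roles; an empty result means the process failed."""
    in_flight = deque([request])
    finished: List[Design] = []
    while in_flight:
        msg = in_flight.popleft()
        if msg.hops >= max_hops:
            continue                      # abandon this branch
        for out in ROLES[msg.to](msg):
            (finished if out.to == "done" else in_flight).append(out)
        # Prune: keep only the `beam` candidates with the smallest shell.
        in_flight = deque(sorted(in_flight, key=lambda m: m.shell_size)[:beam])
    return finished


request = Design(to="layout", shell_size=3, part_size=4, part_ok=False)
designs = run_pipeline(request)
```

Nothing in `run_pipeline` is "responsible" for the car; it only moves messages. The valid alternatives simply show up in the returned list, which plays the part of the requester's in box.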