[Eli’s personal notes for Eli’s personal understanding. Feel free to ignore or engage.]
Eli’s proposed AGI planning-oracle design:
The AGI has four parts:
A human model
An NLP “request parser”
A “reality simulator” / planning module, which can generate plans conditioned on specified outcomes.
A UX system that outputs plans and outcomes
Here’s how it works:
1. A user makes a request of the system by giving some goal that they would like to achieve, like “cure cancer”. The request is phrased in natural language and can include arbitrary details (like “cure cancer, without harming the biosphere or otherwise meaningfully reducing human quality of life. And try to do it without spending too much money.”).
2. The NLP request parser interprets the human’s request. This entails some amount of “common sense” reasoning about what the human means, not just what they literally asked for (as I would expect of any good NLP system). Furthermore, in places where the NLP system is confused, it can ask follow-up questions about what the user means. (The human model is maybe recruited to do this somehow.)
3. After the request is parsed into a specific goal, that goal is passed to the reality simulator. The reality simulator conditions on the goal being achieved and generates “a bunch” (hundreds? millions?) of plans: action sequences that lead to world states where the goal has been achieved, along with simulations of how each plan plays out and what the world looks like during plan execution and after the goal is hit.
4. The human model is then recruited to rank the plans according to how much the human will like them. (I almost typed “according to how likely the human is to approve them”, and then realized that that opens the door to optimizing for either deceiving or hacking the human.)
Possibly the human model is involved in the planning step as well, as something like the discriminator of a GAN: the reality simulator generates some plans, and then the human model prunes the plans that it predicts are abhorrent. <handwavy> Over time, the reality simulator learns to “think along lines that the human would endorse.” </handwavy>
5. The highest-rated plans are sent to the immersive UX system. The user can explore and observe how a given plan plays out on all timescales. The UX system helps the user by 1) giving clear explanations of how the plan works, in the equivalent of “explain like I’m 5” descriptions and with recursive explanations (I would expect the UX system to often have to write textbooks to explain whole fields that the AI invented to solve the problem), and 2) pointing out consequences of the plan that the human model predicts the human would be most likely to be upset with. (A minimal code sketch of this whole pipeline appears after this list.)
6. [Optional] The user executes the plan, or has the AI execute the plan.
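To make the control flow concrete, here’s a minimal Python sketch of the outer loop described in steps 2–5. Everything here is hypothetical scaffolding: the interfaces (parse_request, generate_plans, predicted_human_value) are stand-ins for components we don’t know how to build, and all the hard alignment work lives inside them.

```python
# Minimal sketch of the planning-oracle loop. All component names and
# signatures are placeholders, not a real API.
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Plan:
    actions: List[str]   # action sequence that supposedly achieves the goal
    trajectory: str      # simulated rollout of how execution plays out


def plan_oracle(
    request: str,
    parse_request: Callable[[str], str],               # NLP request parser (step 2)
    generate_plans: Callable[[str, int], List[Plan]],  # reality simulator (step 3)
    predicted_human_value: Callable[[Plan], float],    # human model (step 4)
    n_candidates: int = 1000,
    n_to_show: int = 5,
) -> List[Plan]:
    """Return the top-ranked plans for the user to explore in the UX system (step 5)."""
    # Step 2: parse the natural-language request into a specific goal.
    goal = parse_request(request)

    # Step 3: generate many candidate plans conditioned on the goal being achieved.
    # (The human model could also act as a GAN-style discriminator inside this call,
    #  pruning predicted-abhorrent plans during the search.)
    candidates = generate_plans(goal, n_candidates)

    # Step 4: rank by how much the human is predicted to actually like the outcome,
    # not by predicted approval, which would reward deceiving or hacking the human.
    ranked = sorted(candidates, key=predicted_human_value, reverse=True)

    # Step 5: hand the highest-rated plans to the immersive UX for inspection.
    return ranked[:n_to_show]
```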
Some notes:
If the AI is optimizing against the human in this setup, the human is fucked. There are ample opportunities for deception and human-hacking.
Clearly, this setup depends on having a solution to inner alignment.
The plan generation step, in particular, seems to be a bit fraught, as there’s a bunch of cognitive tricks that might be useful for generating plans, which seem likely to give rise to mesa optimizers.
There’s a risk that we end up taking plans that seem to have all good consequences, but are actually morally catastrophic in some way that we weren’t able to recognize, even with the immersive UX and the helpful AI. But I guess humanity is already facing this possibility. (This does suggest that we might want to use our AIs conservatively, and try to pick reversible plans.)
It seems like we’re counting on the human model to be good enough to catch many, possibly catastrophic, errors. If the human model is missing some important piece of our preferences, then there may be something abhorrent in a plan that nevertheless gets ranked highly, and that abhorrent element is not flagged for our assessment in the UX stage.
This is improved somewhat by having the user (or more realistically, armies of teams of users) spend really a lot of time exhaustively exploring the sims.
Relatedly, the human model has to be doing something better than Goodharting on human approval. It needs to want to have an accurate model of the human, in the vein of moral uncertainty, or something.
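As a toy illustration of what “better than Goodharting on a single human model” might look like mechanically (this is my own sketch, not part of the design above, and not a solution): keep an ensemble of candidate value models instead of one, score plans conservatively, and flag plans the ensemble disagrees about for extra scrutiny in the UX stage.

```python
# Toy sketch: rank plans under uncertainty about the human's values by using an
# ensemble of value models rather than a single point estimate. All names are
# hypothetical.
from statistics import mean, pstdev
from typing import Callable, List, Tuple


def rank_with_uncertainty(
    plans: List[str],
    value_models: List[Callable[[str], float]],  # ensemble of candidate human models
    disagreement_threshold: float = 0.5,
) -> List[Tuple[str, float, bool]]:
    """Return (plan, conservative_score, needs_review) tuples, best-first."""
    results = []
    for plan in plans:
        scores = [v(plan) for v in value_models]
        # Conservative score: penalize plans the ensemble disagrees about, rather
        # than Goodharting on whichever single model happens to rate them highest.
        score = mean(scores) - pstdev(scores)
        # High disagreement is a signal to route the plan to humans for review.
        needs_review = pstdev(scores) > disagreement_threshold
        results.append((plan, score, needs_review))
    return sorted(results, key=lambda r: r[1], reverse=True)
```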
Bold posit on the internet: If we had solutions to the following problems, this design would be feasible.
1. How do we search through the space of plans without getting an unaligned mesa optimizer?
2. How do we implement moral uncertainty without running into problems like updated deference?
3. How do we get really really good human models? Sub-problem: How do we assess the quality of our human models so that we know if we can rely on them?
4. How do we make sure that the planning module doesn’t Goodhart, and find high-ranking plans by exploiting blind spots in the human model?
At least, it seems to me that this design avoids the pitfalls that Eliezer outlines here?