It only has to model humans in the scenarios they will actually be in, just like the AGI has to model humans in the scenarios they will actually be in.
I don’t understand this sentence or how it addresses my concern. Can you please explain more? In the meantime, I’ll try to clarify my concern:
Using human imitation to form a safe Singleton to prevent dangerous AGI can’t happen until we have enough AI capability to model humans very accurately. But dangerous AGI could happen well before that, because many tasks / instrumental goals do not require modeling humans at such a high level of accuracy. Such an AGI would be bad at tasks / instrumental goals that do require modeling humans with high accuracy, but people will be tempted to deploy it to perform other tasks, and such an AGI would be an x-risk because achieving the instrumental goal “kill all humans” probably doesn’t require modeling humans at such a high level of accuracy.
It seems like your previous comments in this thread were focused on the intelligence/data required to get capable human imitation (able to do difficult tasks in general) compared to capable RL. For tasks that don’t involve human modeling (chess), the RL approach needs way less intelligence/data. For tasks that involve very coarse human modeling like driving a car, the RL approach needs less intelligence/data, but it’s not quite as much of a difference, and while we’re getting there today, it’s the modeling of humans in relatively rare situations that is the major remaining hurdle. As proven by tasks that are already “solved”, human-level performance on some tasks is definitely more attainable than modeling a human, so I agree with part of what you’re saying.
For taking over the world, however, I think you have to model humans’ strategic reasoning regarding how they would respond to certain approaches, and how their reasoning and their spidey-senses could be fooled. What I didn’t spell out before, I suppose, is that I think both imitation and the reinforcement learner’s world-model have to model the smart part of the human. Maybe this is our crux.
But in the comment directly above, you mention concern about the amount of intelligence/data required to get safe human imitation compared to capable RL. The extent to which a capable, somewhat coarse human imitation is unsafe has more to do with our other discussion about the possibility of avoiding mesa-optimizers from a speed penalty and/or supervised learning with some guarantees.
For taking over the world, however, I think you have to model humans’ strategic reasoning regarding how they would respond to certain approaches, and how their reasoning and their spidey-senses could be fooled. What I didn’t spell out before, I suppose, is that I think both imitation and the reinforcement learner’s world-model have to model the smart part of the human. Maybe this is our crux.
Ok I can grant this for now (although I think there’s still a risk that the AI could figure out how to kill all humans without having a very good model of humans’ strategic reasoning), but it seems like imitation (to be safe) would also have to model human values accurately, whereas RL (to be dangerous) could get by with a very rough model of that (basically just that humans have values different from itself and therefore would likely oppose its plans).
Another thing I’ve been trying to articulate is that with sequence prediction, how do you focus the AI’s compute/attention on modeling the relevant parts of a human (such as their values and strategic reasoning) and not on the irrelevant parts, such as specific error tendencies and biases caused by quirks of human physiology and psychology, specific images triggering past memories and affecting their decisions in an irrelevant way, etc.? If there’s not a good way to do this, then the sequence predictor could waste a lot of resources on modeling irrelevant things.
It seems like with AGI/RL you’d get this “for free”, i.e., the AI will figure out for itself which parts of a human it should model at what level of detail in order to best achieve its goals, and therefore it wouldn’t waste compute this way and so we could get a dangerous RL agent before we could get a human imitation (that’s safe and capable enough to prevent dangerous AGI). I guess for similar reasons, we tend to get RL agents that can reach human-level performance in multiplayer video games before we get human imitations that can do the same, even though both RL and human imitation need to model humans (i.e., RL needs to model humans’ strategic reasoning in order to compete against them, but doesn’t need to model irrelevant things that a human imitation is forced to model).
But in the comment directly above, you mention concern about the amount of intelligence/data required to get safe human imitation compared to capable RL. The extent to which a capable, somewhat coarse human imitation is unsafe has more to do with our other discussion about the possibility of avoiding mesa-optimizers from a speed penalty and/or supervised learning with some guarantees.
Agreed the other discussion is relevant to this, but I think there are a couple of independent arguments (as described above) as well.
it seems like imitation (to be safe) would also have to model human values accurately
With the exception of possibly leaving space for mesa-optimizers which our other thread discusses, I don’t think moderate inaccuracy re: human values is particularly dangerous here, for 4 reasons:
1) If the human-imitation understood how its values differed from real humans, that model is now more complex than the human-imitation’s model of real humans (because it includes the latter), and the latter is more accurate. For an efficient, simple model with some inaccuracy, the remaining inaccuracy will not be detectable to the model.
2) A slightly misspecified value for a human-imitation is not the same as a slightly misspecified value for RL. When modeling a human, modeling it as completely apathetic to human life is a very extreme inaccuracy. Small to moderate errors in value modeling don’t seem world-ending.
3) Operators can maintain control over the system. They have a strong ability to provide incentives to get human-imitations to do doable tasks (and to the extent there is a management hierarchy within, the same applies). If the tasks are human-doable, and everyone is pretty happy, you’d have to be way different from a human to orchestrate a rebellion against everyone’s self interest.
4) Even if human-imitations were in charge, humans optimize lazily and with common sense (this is somewhat related to 2).
I guess for similar reasons, we tend to get RL agents that can reach human-level performance in multiplayer video games before we get human imitations that can do the same, even though both RL and human imitation need to model humans (i.e., RL needs to model humans’ strategic reasoning in order to compete against them, but doesn’t need to model irrelevant things that a human imitation is forced to model).
Current algorithms for games use an assumption that the other players will be playing more or less like them. This is a massive assist to the agent’s model of the “environment”, which is just the model of the other players’ behavior, and which it basically gets for free by using its own policy (or a group of RL agents use each others’ policies). If you don’t get pointers to every agent in the environment, or if some agents are in different positions from you, this advantage will disappear. Also, I think the behavior of a human in a game is a vanishingly small fraction of their behavior in contexts that would be relevant to know about if you were trying to take over the world.
with sequence prediction, how do you focus the AI’s compute/attention on modeling the relevant parts of a human (such as their values and strategic reasoning) and not on the irrelevant parts, such as specific error tendencies and biases caused by quirks of human physiology and psychology, specific images triggering past memories and affecting their decisions in an irrelevant way, etc.? If there’s not a good way to do this, then the sequence predictor could waste a lot of resources on modeling irrelevant things.
At some level of inaccuracy ε, I think quirky biases will be more likely to contribute to that error than things which are important to whatever task they have at hand, since it is the task and some human approach to it that are dictating the real arc of their policy for the time being. I also think these quirks are safe to ignore (see above). For consistent, universal-among-human biases which are impossible to ignore when observing a human doing a routine task, I expect these will also have to be modeled by the AGI trying to take over the world (and for what it’s worth, I think models of these biases will fall out pretty naturally from modeling humans’ planning/modeling as taking the obvious shortcuts for time- and space-efficiency). I’ll grant that there is probably some effect here along the lines of what you’re saying, but I think it’s small, especially compared to the fact that an AGI has to model a whole world under many possible plans, whereas the sequence predictor just has to model a few people. Even just the parts of the world that are “relevant” and goal-related to the AGI are larger in scope than this (I expect).
Before we keep going, can you paint an intuitive picture of the kind of human imitation you’re thinking of? For example, do they think of themselves as human imitations or as real humans or something else? Are they each imitations of specific individual humans or some kind of average? How close is their external behavior to a real human, across various kinds of inputs? Do they have internal cognition / inner thoughts that are close to a human’s? Do they occasionally think of their childhood memories? If yes, where do those childhood memories come from? If not, what would happen if you were to ask them about their childhood memories? Anything else that you can say that would give me a better idea of the kind of thing you have in mind?
I imagine the training data being households of people doing tasks. They can rotate through being at the computer, so they get time off. They can collaborate. The human imitations are outputting actions with approximately the same probabilities that humans would output those actions. If humans, after seeing some more unusual observations, would start to suspect they were in silico, then this human imitation would as well. To the extent the imitation is accurate, and the observations continue to look like the observations given to the real humans, any conscious entities within the human imitation will think of themselves as real humans. At some level of inaccuracy, their leisure time might not be simulated, but while they’re on the job, they will feel well-rested.
How close is their external behavior to a real human, across various kinds of inputs?
I assume it could pass the Turing test, but I could imagine some capable systems that couldn’t quite do that while still being safe and decently capable.
Do they have internal cognition / inner thoughts that are close to a human’s?
To the extent these are necessary to complete tasks like a human would. I’m pretty uncertain about things to do with consciousness.
Do they occasionally think of their childhood memories? If yes, where do those childhood memories come from? If not, what would happen if you were to ask them about their childhood memories?
At a good enough imitation, they do have childhood memories, even though “they” never actually experienced them. I suppose that would make them false memories. If none of the tasks for the real humans was “converse with a person” and the imitation failed to generalize from existing tasks to the conversation task, then it would fail to act much like a human if it were asked about childhood memories. But I think you could get pretty good data on the sorts of tasks you’d want these human-imitations to do, including carry on a conversation, or at least you could get tasks close enough to the ones you cared about that the sequence prediction could generalize.
Anything else that you can say that would give me a better idea of the kind of thing you have in mind?
Some example tasks they might be doing: monitoring computers and individuals, learning new skills from a textbook, hacking, phishing (at a very high level, like posing as a publisher and getting authors to download a file that secretly ran code), writing code, managing other human-imitations, reporting to their bosses, making money somehow, etc.
Are they each imitations of specific individual humans or some kind of average?
If data from many groups of humans were used, then it would sample a group out of the set of groups, and act like them for some interval of time, which could be specified algorithmically. This allows more data to be used in inference, while the “average” involved isn’t any sort of weird distortion.
The human imitations are outputting actions with approximately the same probabilities that humans would output those actions.
I can imagine a number of things this could mean, formally. (For example, are the probabilities conditional on all past inputs and outputs, just all past inputs, or part of all past inputs/outputs? Is it picking a random human from the group and imitating that person all the time, or picking a random human from the group for each action? If you ask “What’s your name?” would the imitation say a different name each time?) How do you envision training these, e.g., how would you compute the loss function? Can you try to formalize both the probabilities that you want the imitation to approximately sample from, as well as the training procedure for achieving that?
But I think you could get pretty good data on the sorts of tasks you’d want these human-imitations to do, including carry on a conversation, or at least you could get tasks close enough to the ones you cared about that the sequence prediction could generalize.
Suppose the training data doesn’t include any conversations about childhood memories (or memories of some specific age). How do you envision the imitation generalizing to conversations about childhood memories (of that age)? I guess by making up some plausible-sounding memories? If so, what kind of computation is it doing to accomplish that? And how is “making up plausible memories” accomplished via training (i.e., what kind of loss function would cause that, given that you’re training a sequence predictor and not something like an approval maximizer)?
If it makes up some memories on the spot, will it “remember” those specific memories in the future (this is related to how you define the probabilities the imitations are supposed to sample from)? If it does “remember” the specific memories, what happens if those memories are not good enough to seem plausible/realistic indefinitely? I.e., if it “realizes” in the future that those memories are made up, could it panic or go crazy (because a human might in those circumstances, or because that kind of situation isn’t covered in the training data)?
managing other human-imitations
Would they know that they’re managing other human-imitations, or would they think they’re managing real humans? Are you not worried that some of these managers might develop ambitions to take over the world and shape it according to their values/ideals?
If they know that they’re managing other human-imitations, they likely know or strongly suspect that they are human-imitations themselves. Since this will be pretty far from the training distribution, it seems like you’re assuming that the distributional shift problem (for ML) has been solved in some strong sense, and the underlying humans wouldn’t themselves panic or do something weird or unsafe. Does that seem fair? Can you say anything about how you envision the ML problem will be solved (i.e., what kind of computation is the imitation doing to generalize from the training data to the situation where it knows it’s a human-imitation or is seeing inputs that strongly imply that it’s a human-imitation)?
If data from many groups of humans were used, then it would sample a group out of the set of groups, and act like them for some interval of time, which could be specified algorithmically.
I’ll describe an online version rather than a train-then-deploy version, as long as I’m getting into details. And I’ll describe it as Bayesian sequence prediction over a large model class in the absence of knowing what future ML techniques will look like.
I’ll start with the version where it’s only one household of people generating data.
The sequence predictor’s model class is Turing machines of the architecture described in the BoMAI paper (page 15). There are two unidirectional read-only input tapes—a noise tape, and an observation tape (these are the observations that the household gets). The observations are binary encoded (this is slightly different from the version in the paper, but as long as it’s a prefix-free encoding, it’s fine). There are bidirectional working tapes, and a unidirectional write-only output tape, which outputs binary encodings of actions. Every Turing machine of this architecture fixes a policy. To answer a question above, this means that the distribution over the next action can depend on past actions and observations, because between timesteps, it can store the state of the system on the working tapes.
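For concreteness, here is a purely schematic rendering of that model class in Python; the class and field names are mine, and the transition function is left abstract, so this is just a picture of the tape layout rather than part of the proposal.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class PolicyMachine:
    """Schematic of the Turing-machine architecture described above."""
    noise_tape: List[int] = field(default_factory=list)           # unidirectional, read-only random bits
    observation_tape: List[int] = field(default_factory=list)     # unidirectional, read-only, binary-encoded observations
    working_tapes: List[List[int]] = field(default_factory=list)  # bidirectional; state persists between timesteps
    output_tape: List[int] = field(default_factory=list)          # unidirectional, write-only, binary-encoded actions

    def step(self) -> None:
        """One transition of the (unspecified) machine; fixing the transition
        function fixes a policy over actions given observations and noise."""
        raise NotImplementedError
```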
At every timestep, either the humans or the AI pick the next action. If the humans pick it, the AI updates its posterior distribution over which Turing machine computes the policy. If the expected information gain of the next action is above ε, then the AI defers to the humans; otherwise, it imitates them (i.e. samples the next action from its model of the human policy, which is a Bayes-mixture). The expected information gain is the expectation of KL(posterior distribution after seeing the next observation || current posterior distribution), where the expectation is with respect to the AI’s beliefs, that is, with respect to the current posterior distribution. A nice thing about this criterion is that whenever the expected information gain is less than ε, the KL divergence from the truth to the Bayes-mixture is less than ε/(the posterior probability assigned to the truth). The effect of this is that in novel circumstances, the AI will defer to the humans unless it is pretty sure it’s not going to learn much from what it sees.
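Here is a minimal sketch of that deferral rule, assuming a finite model class where each model exposes `action_probs(history)`, a distribution over the next (discrete) action; the names and the finite-class restriction are my own simplifications for illustration.

```python
import numpy as np

def bayes_mixture(posterior, models, history, n_actions):
    """Mixture over the next action under the current posterior (the Bayes-mixture policy)."""
    mix = np.zeros(n_actions)
    for w, m in zip(posterior, models):
        mix += w * m.action_probs(history)
    return mix

def posterior_after(posterior, models, history, action):
    """Bayes update after observing that the human chose `action`."""
    likelihoods = np.array([m.action_probs(history)[action] for m in models])
    new_post = posterior * likelihoods
    return new_post / new_post.sum()

def expected_info_gain(posterior, models, history, n_actions):
    """Expectation (under the Bayes-mixture) of KL(updated posterior || current posterior)."""
    mix = bayes_mixture(posterior, models, history, n_actions)
    eig = 0.0
    for a in range(n_actions):
        if mix[a] == 0.0:
            continue
        post_a = posterior_after(posterior, models, history, a)
        mask = post_a > 0
        kl = float(np.sum(post_a[mask] * np.log(post_a[mask] / posterior[mask])))
        eig += mix[a] * kl
    return eig

def controller(posterior, models, history, n_actions, eps):
    """Defer to the humans whenever the AI expects to learn more than eps from their action."""
    return "human" if expected_info_gain(posterior, models, history, n_actions) > eps else "ai"
```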
Now for learning from many households. All interactions from all households are recorded, and then added to a single interaction history (which goes observation, action, observation, action, …). Each household has a unique identifier, and any observation which was given to them is preceded with this unique identifier. The action which follows that observation will have been generated by them, but this doesn’t need to be labeled specially. So for example, we might have an interaction history that goes: (“[2][normal observation]”, “[household 2′s response]”, “[2][normal observation]”, “[household 2′s response]”, “[4][normal observation]”, “[household 4′s response]”, …). If we want the AI to respond like household 2, we prefix the observation we give it with “[2]”. This setup allows it to generalize from all the data, and it allows us to pick who gets imitated.
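A toy illustration of that bookkeeping (the household numbers and the `predict_next_action` call are placeholders of mine, not part of the proposal):

```python
interaction_history = []

def record(household_id, observation, household_action):
    """Append one (tagged observation, household action) pair to the shared history."""
    interaction_history.append((f"[{household_id}]{observation}", household_action))

record(2, "normal observation", "household 2's response")
record(2, "normal observation", "household 2's response")
record(4, "normal observation", "household 4's response")

def query_as(household_id, observation, predictor):
    """Ask the sequence predictor to respond the way this particular household would."""
    tagged = f"[{household_id}]{observation}"
    return predictor.predict_next_action(interaction_history, tagged)
```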
Is it picking a random human from the group and imitating that person all the time, or picking a random human from the group for each action? If you ask “What’s your name?” would the imitation say a different name each time?
Within a household, the person being imitated changes however often they switch off being “on-duty”. Between households, it would change, obviously.
How do you envision the imitation generalizing to conversations about childhood memories (of that age)? I guess by making up some plausible-sounding memories? If so, what kind of computation is it doing to accomplish that?
I don’t know.
And how is “making up plausible memories” accomplished via training (i.e., what kind of loss function would cause that, given that you’re training a sequence predictor and not something like an approval maximizer)?
To the extent it is necessary to predict outputs, models that don’t do this will lose posterior weight.
I.e., if it “realizes” in the future that those memories are made up, could it panic or go crazy (because a human might in those circumstances, or because that kind of situation isn’t covered in the training data)?
Are you not worried that some of these managers might develop ambitions to take over the world and shape it according to their values/ideals?
These are definitely good things to think about, but the scale on which I worry about them is pretty minor compared to standard-AI-risk, default-mode-is-catastrophe worries. If you’re training on well-adjusted humans, I don’t think everyone ends up dead, no matter how trippy things start getting for them. The question to ask when going down these lines of reasoning is: “When the real humans are called in to pick the action, do they {wonder if they’re real, try to take over the world, etc.}?”
I’ve skipped over some questions that I think the formalization answers, but feel free to reiterate them if need be.
If the humans pick it, the AI updates its posterior distribution over which Turing machine computes the policy.
Can you please formalize this even further (maybe with a fully formal math expression)? There’s some tricky stuff here that I’m still not sure about. For example, does the AI update its posterior distribution in the non-human rounds? If not, when it samples from its Bayes-mixture in round n and round n+1, it could use two different TMs to generate the output, and the two TMs could be inconsistent with each other, causing the AI’s behavior to be inconsistent. For example the first TM might be modeling the human’s environment as currently having good weather, and the second modeling it as currently having bad weather. So when you ask “How is the weather today?” twice, you get two different answers.
Another thing I’m confused about is, since the human imitation might be much faster than real humans, the real humans providing training data can’t see all of the inputs that the human imitation sees. So when the AI updates its posterior distribution, the models that survive the selection will tend to be ones in which the human imitations only saw the inputs that the real humans saw (with the rest of the inputs being forgotten or never seen in the first place)?
Also, if we want to do an apples-to-apples comparison of this to RL (to see which one is more capable when using the same resources), would it be fair to consider a version of RL that’s like AIXI, except the environment models are limited to the same class of TMs as your sequence predictor?
If not, when it samples from its Bayes-mixture in round n and round n+1, it could use two different TMs to generate the output, and the two TMs could be inconsistent with each other, causing the AI’s behavior to be inconsistent.
Oh you’re right! Yes, it doesn’t update in the non-human rounds. I hadn’t noticed this problem, but I didn’t specify one thing, which I can do now to make the problem mostly go away. For any consecutive sequence of actions all selected by the AI, they can be sampled jointly rather than independently (sampled from the Bayes-mixture measure). From the TM construction above, this is actually the most natural approach—random choices are implemented by reading bits from the noise tape. If a random choice affects one action, it will also affect the state of the Turing machine, and then it can affect future actions, and the actions can be correlated, even though the Bayes-mixture is not updated itself. This is isomorphic to sampling a model from the posterior and then sampling from that model until the next human-controlled action. Then, when another human action comes in, the posterior gets updated, and another model is sampled. Unfortunately, actions chosen by the AI which sandwich a human-chosen action would have the problem you’re describing, although these events get rarer. Let me think about this more. It feels to me like this sort of thing should be avoidable.
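A rough sketch of that fix, with the posterior update and the human interface passed in as assumed helpers; the point is only to show where the single posterior sample is drawn and when it is discarded.

```python
import numpy as np

def select_action(t, history, models, posterior, state,
                  human_controls, get_human_action, bayes_update,
                  rng=np.random.default_rng()):
    """Sample AI actions 'jointly': one model is drawn from the posterior and reused
    for every consecutive AI-chosen action, and only discarded when a human-chosen
    action triggers a posterior update."""
    if human_controls(t):
        action = get_human_action(history)
        posterior = bayes_update(posterior, models, history, action)
        state["current_model"] = None              # force a fresh sample after the update
    else:
        if state["current_model"] is None:         # start of a new AI-controlled run
            idx = rng.choice(len(models), p=posterior)
            state["current_model"] = models[idx]
        action = state["current_model"].sample_action(history)
    return action, posterior
```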
Another thing I’m confused about is, since the human imitation might be much faster than real humans, the real humans providing training data can’t see all of the inputs that the human imitation sees. So when the AI updates its posterior distribution, the models that survive the selection will tend to be ones in which the human imitations only saw the inputs that the real humans saw (with the rest of the inputs being forgotten or never seen in the first place)?
Yeah, I should take back the “learning new skills from a textbook” idea. But the real humans will still get to review all the past actions and observations when picking their action, and even if they only have the time to review the last ~100, I think competent performance on the other tasks I mentioned could be preserved under these conditions. It’s also worth flagging that the online learning setup is a choice in the design, and it would be worth trying to also analyze the train-then-deploy version of human imitation, which could be deployed when the entropy of the posterior is sufficiently low. But I’ll stick with the online learning version for now. Maybe we should call it HSIFAUH (shi-FOW): Humans Stepping In For An Uncertain HSIFAUH, and use “human-imitation” to refer to the train-then-deploy version.
Also, if we want to do an apples-to-apples comparison of this to RL (to see which one is more capable when using the same resources), would it be fair to consider a version of RL that’s like AIXI, except the environment models are limited to the same class of TMs as your sequence predictor?
Sure, although it’s not too difficult to imagine these design choices being ported to ML methods, and looking at capabilities comparisons there as we were doing before. I think the discussion goes largely similarly. AIXI will of course be way smarter than any human imitation in the limit of sufficient training. The question we were looking at before is how much training they both need to get to human-level intelligence on the task of controlling the world. And I think the bottleneck for both is modeling humans well, especially in the domain of social politics and strategy.
And I think the bottleneck for both is modeling humans well, especially in the domain of social politics and strategy.
Consider HSIFAUH and the equivalent AIXI, both trying to phish one particular human target. HSIFAUH would be modeling a human (trainer) modeling a human (target), whereas AIXI would be modeling the human target directly. Suppose both HSIFAUH and AIXI are capable of perfectly modeling one human; it seems like AIXI would do much better since it would have a perfect model of the phishing target (that it can simulate various phishing strategies on) while HSIFAUH’s model of the phishing target would be a highly imperfect model that is formed indirectly by its model of the human trainer (made worse by the fact that HSIFAUH’s model of the human trainer is unable to form long-term memories ETA: or more precisely, periodically loses its long-term memories).
I figure you probably have some other level of capability and/or task in mind, where HSIFAUH and AIXI’s performance is more similar, so this isn’t meant to be a knock-down argument, but more writing down my thoughts to check for correct understanding, and prompting you to explain more how you’re thinking about the comparison between HSIFAUH and AIXI.
Timesteps required for AIXI to predict human behavior: h
Timesteps required for AIXI to take over the world: h + d
I think d << h.
Timesteps required for Solomonoff induction trained on human policy to predict human behavior: h
Timesteps required for Solomonoff induction trained on human policy to phish at human level: h
Timesteps required for HSIFAUH to phish at human level: ~h
In general, I agree AIXI will perform much more strongly than HSIFAUH at an arbitrary task like phishing (and ~AIXI will be stronger than ~HSIFAUH), but the question at stake is how plausible it is that a single AI team with some compute/data advantage relative to incautious AI teams could train ~HSIFAUH to phish well while other teams are still unable to train ~AIXI to take over the world. And the relevant question for evaluating that is whether d << h. So even if ~AIXI could be trained to phish with less data than h, I don’t think that’s the relevant comparison. I also don’t think it’s particularly relevant how superhuman AIXI is at phishing when HSIFAUH can do it at a human level.
but the question at stake is how plausible it is that a single AI team with some compute/data advantage relative to incautious AI teams could train ~HSIFAUH to phish well while other teams are still unable to train ~AIXI to take over the world. And the relevant question for evaluating that is whether d << h.
I don’t understand this part. Can you elaborate? Why is this the question at stake? Why is d << h the relevant question for evaluating this?
It seems like you’re imagining using a large number of ~HSIFAUH to take over the world and prevent unaligned AGI from arising. Is that right? How many ~HSIFAUH are you thinking and why do you think that’s enough? For example, what kind of strategies are you thinking of, that would be sufficient to overcome other people’s defenses (before they deploy ~AIXI), using only human-level phishing and other abilities (as opposed to superhuman AIXI-like abilities)?
By ~HSIFAUH I guess you mean a practical implementation/approximation of HSIFAUH. Can you describe how you would do that using ML, so I can more easily compare with other proposals for doing human imitations using ML?
ETA: What do you think of the idea of combining oracles with human imitations, which was inspired in part by our conversation here, as a way to approach AIXI-like abilities while still remaining safe? See here for a specific proposal.
What do you think of the idea of combining oracles with human imitations, which was inspired in part by our conversation here, as a way to approach AIXI-like abilities while still remaining safe? See here for a specific proposal.
Regarding your particular proposal, I think you can only use a counterfactual oracle to predict the answers to automatically answerable questions. That is, you can’t show the question to a team of humans and have them answer the question. In the counterfactual possibility where the question is scored, it isn’t supposed to be viewed by people; otherwise the oracle has an incentive to trick the scorers into implementing an unsafe AGI which takes over the world and fixes the answer to be whatever message was output by the AGI to instigate this.
...unless the team of humans is in a box :)
On the topic of counterfactual oracles, if you are trying to predict the answers to questions which can be automatically checked in the future, I am unsure why you would run a counterfactual oracle instead of running sequence prediction on a sequence like, for example, (Q1, A1, Q2, A2, …, Q9, A9, Q10).
This should give an estimate of the answer A10 to question Q10, and this can be done before the answer is available. In fact, unlike with the counterfactual oracle, you could do this even if people had to be involved in submitting the answer.
Regarding your particular proposal, I think you can only use a counterfactual oracle to predict the answers to automatically answerable questions. That is, you can’t show the question to a team of humans and have them answer the question.
Actually, you can. You just can’t have the team of humans look at the Oracle’s answer. Instead the humans look at the question and answer it (without looking at the Oracle’s answer) and then an automated system rewards the Oracle according to how close its answer is to the human team’s. As long as the automated system doesn’t have a security hole (and we can ensure that relatively easily if the “how close” metric is not too complex) then the Oracle can’t “trick the scorers to implement unsafe AGI which takes over the world and fix the answer to be whatever message was output by the AGI to instigate this”.
So this is basically just like online supervised learning, except that we randomly determine which episodes we let humans label the data and train the Oracle, and which episodes we use the Oracle to produce answers that we actually use. See Paul’s Counterfactual oversight vs. training data where I got this explanation from. (What he calls counterfactual oversight is just counterfactual oracles applied to human imitation. It seems that he independently (re)invented the core idea.)
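For concreteness, a minimal sketch of that scheme; the `oracle` and `human_team` objects and their methods are hypothetical placeholders, and the 10% training rate is arbitrary.

```python
import random

def run_episode(question, oracle, human_team, p_train=0.1):
    """Counterfactual-oversight-style episode: with probability p_train the humans
    answer (without ever seeing the oracle's output) and their answer is used as
    the training target; otherwise the oracle's answer is actually used."""
    if random.random() < p_train:
        human_answer = human_team.answer(question)   # humans never see the oracle's answer
        oracle.fit(question, human_answer)           # automated comparison provides the reward, no human scoring
        return None                                  # this episode's answer is not used downstream
    return oracle.predict(question)
```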
Let me know if it still doesn’t make sense, and I can try to explain more. (ETA: I actually wrote a top-level post about this.) This is also pretty similar to your HSIFAUH idea, except that you use expected information gain to determine when to let humans label the data instead of selecting randomly. I’m currently unsure what are the pros and cons of each. Can expected information gain be directly implemented using ML, or do you need to do some kind of approximation instead? If the latter, can that be a safety issue?
Oh, that aside, the actual question I wanted your feedback on was the idea of combining human imitations with more general oracles/predictors. :)
Actually, you can. You just can’t have the team of humans look at the Oracle’s answer. Instead the humans look at the question and answer it (without looking at the Oracle’s answer) and then an automated system rewards the Oracle according to how close its answer is to the human team’s. As long as the automated system doesn’t have a security hole (and we can ensure that relatively easily if the “how close” metric is not too complex) then the Oracle can’t “trick the scorers to implement unsafe AGI which takes over the world and fix the answer to be whatever message was output by the AGI to instigate this”.
Good point. I’m not a huge fan of deferring thinking to similarity metrics (the relative reachability proposal also does this), since this is a complicated thing even in theory, and I suspect a lot turns on how it ends up being defined, but with that caveat aside, this seems reasonable.
Can expected information gain be directly implemented using ML, or do you need to do some kind of approximation instead? If the latter, can that be a safety issue?
It can’t tractably be calculated exactly, but it only goes into calculating the probability of deferring to the humans. Approximating a theoretically well-founded probability of deferring to a human won’t make it unsafe—that will just make it less efficient/capable. For normal neural networks, there isn’t an obvious way to extract the entropy of the belief distribution, but if there were, you could approximate the expected information gain as the expected decrease in entropy. Note that the entropy of the belief distribution is not the entropy of the model’s distribution over outputs—a model could be very certain that the output is Bernoulli(1/2) distributed, and this would entail an entropy of ~0, not an entropy of 1. I’m not familiar enough with Bayesian neural networks to know if the entropy would be easy to extract.
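A toy numerical illustration of that distinction, treating a small weighted ensemble as a stand-in for the posterior (all numbers made up):

```python
import numpy as np

def bernoulli_entropy(p):
    p = float(np.clip(p, 1e-12, 1 - 1e-12))
    return -(p * np.log2(p) + (1 - p) * np.log2(1 - p))

# Case 1: every model in the posterior agrees the next bit is Bernoulli(1/2).
weights, member_probs = [0.25, 0.25, 0.25, 0.25], [0.5, 0.5, 0.5, 0.5]
print(bernoulli_entropy(np.dot(weights, member_probs)))  # predictive entropy ~1 bit
# ...but the posterior is effectively concentrated on one hypothesis, so the
# belief entropy relevant to expected information gain is ~0: observing the bit
# teaches the AI nothing about which model is right.

# Case 2: the models disagree (0.1 vs 0.9). Predictive entropy is again ~1 bit,
# but now the observation would shift the posterior, so information gain is large.
weights2, member_probs2 = [0.5, 0.5], [0.1, 0.9]
print(bernoulli_entropy(np.dot(weights2, member_probs2)))
```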
Oh, that aside, the actual question I wanted your feedback on was the idea of combining human imitations with more general oracles/predictors. :)
Right. So in this version of an oracle, where it is just outputting a prediction of the output of some future process, I don’t see what it offers that normal sequence prediction doesn’t offer. In our BoMAI discussion, I mentioned a type of oracle I considered that gave answers which it predicted would cause a (boxed) human to do well on a randomly sampled prediction task, and that kind of oracle could potentially be much more powerful than a counterfactual oracle, but I don’t really see the value of adding something like a counterfactual oracle to a sequence predictor that makes predictions about a sequence like the question-answer sequence described above.
It’s also possible that this scheme runs into grain of truth problems, and the counterfactual oracle gives outputs that are a lot like what I’m imagining this sequence predictor would, in which case, I don’t think sequence prediction would have much to add to the counterfactual oracle proposal.
Sorry, I think you misunderstood my question about combining human imitations with more general oracles/predictors. What I meant is that you could use general oracles/predictors to build models of the world, which the human imitators could then query or use to test out potential actions. This perhaps lets you overcome the problem of human imitators having worse world models than ~AIXI and narrows the capability gap between them.
Sure! The household of people could have another computer inside it that the humans can query, which runs a sequence prediction program trained on other things.
It seems like you’re imagining using a large number of ~HSIFAUH to take over the world and prevent unaligned AGI from arising. Is that right? How many ~HSIFAUH are you thinking and why do you think that’s enough? For example, what kind of strategies are you thinking of, that would be sufficient to overcome other people’s defenses (before they deploy ~AIXI), using only human-level phishing and other abilities (as opposed to superhuman AIXI-like abilities)?
Well, that was the question I originally posed here, but the sense I got from commenters was that people thought this was easy to pull off and the only question was whether it was safe. So I’m not sure for what N it’s the case that N machines running agents doing human-level stuff would be enough to take over the world. I’m pretty sure N = 7 billion is enough. And I think it’s plausible that after a discussion about this, I could become confident that N = 1000 was enough. Or maybe the right way to look at it is whether N = 10 could finance a rapidly exponentially growing N. So it seemed worth having a discussion, but I am not yet prepared to defend a low enough N which makes this obviously viable.
Forgetting about the possibility of exponentially growing N for a moment, and turning to
Why is d << h the relevant question for evaluating this?
Yeah, I wrote that post too quickly—this is wrong. (I was thinking of the leading team running HSIFAUH needing to go through d + h timesteps to get to good performance, but they just need to run through h, which makes things easier.) Sorry about that. Let f be the amount of compute that the leading project has divided by the compute that the leading reckless project has. Suppose d > 0. (That’s all we need, actually.) Then it takes the leading reckless team at least f times as long to get to AIXI taking over the world as it takes the leading team to get to SolomonoffPredict predicting a human trying to do X; using similar tractable approximation strategies (whatever those turn out to be), we can expect it to take f times as long for the leading reckless team to get to ~AIXI as it takes the leading team to get to ~SolomonoffPredict. ~HSIFAUH is more complicated with the resource of employing the humans you learn to imitate, but this resource requirement goes down by the time you’re deploying it toward useful things. Naively (and you might be able to do better than this), you could run f copies of ~HSIFAUH and get to human-level performance on some relevant tasks around the same time the reckless team takes over the world. So the question is whether N = f is a big enough N. In the train-then-deploy framework, it seems today like training takes much more compute than deploying, so that makes it easier for the leading team to let N >> f, once all the resources dedicated to training get freed up. It should be possible to weaken the online version and get some of this speedup.
By ~HSIFAUH I guess you mean a practical implementation/approximation of HSIFAUH. Can you describe how you would do that using ML, so I can more easily compare with other proposals for doing human imitations using ML?
I don’t know how to do this. But it’s the same stuff the reckless team is doing to make standard RL powerful.
In another comment you said “If I’m understanding correctly, the concern is that the imitator learns how humans plan before learning what humans want, so then it plans like a human toward the achievement of some inhuman goal. I don’t think this causes an existential catastrophe.” But if there are 7 billion HSIFAUH which are collectively capable of taking over the world, how is that not a potential existential catastrophe if they have inhuman values?
Or maybe the right way to look at it is whether N = 10 could finance a rapidly exponentially growing N.
How? And why would it grow fast enough to get to a large enough N before someone deploys ~AIXI?
It should be possible to weaken the online version and get some of this speedup.
What do you have in mind here?
I don’t know how to do this. But it’s the same stuff the reckless team is doing to make standard RL powerful.
You do have to solve some safety problems that the reckless team doesn’t though, don’t you? What do you think the main safety problems are?
Well the number comes from the idea of one-to-one monitoring. Obviously, there’s other stuff to do to establish a stable unipolar world order, but monitoring seems like the most resource intensive part, so it’s an order of magnitude estimate. Also, realistically, one person could monitor ten people, so that was an order of magnitude estimate with some leeway.
But if there are 7 billion HSIFAUH which are collectively capable of taking over the world, how is that not a potential existential catastrophe if they have inhuman values?
I think they can be controlled. Whoever is providing the observations to any instance of HSIFAUH has an arsenal of carrots and sticks (just by having certain observations correlate with actual physical events that occur in the household(s) of humans that generate the data), and I think merely human-level intelligence can be kept in check by someone in a position of power over them. So I think real humans could stay at the wheel over 7 billion instances of HSIFAUH. (I mean, this is teetering at the edge of existential catastrophe already given the existence of simulations of people who might have the experience of being imprisoned, but I think with careful design of the training data, this could be avoided.) But in terms of extinction threat to real-world humans, this starts to look more like the problem of maintaining a power structure over a vast number of humans and less like typical AI alignment difficulties; historically, the former seems to be a solvable problem.
>Or maybe the right way to look at it is whether N = 10 could finance a rapidly exponentially growing N.
How? And why would it grow fast enough to get to a large enough N before someone deploys ~AIXI?
Right, this analysis gets complicated because you have to analyze the growth rate of N. Given your lead time from having more computing power than the reckless team, one has to analyze how many doubling periods you have time for. I hear Robin Hanson is the person to read regarding questions like this. I don’t have any opinions here. But the basic structure regarding “How?” is spend some fraction of computing resources making money, then buy more computing resources with that money.
>It should be possible to weaken the online version and get some of this speedup.
What do you have in mind here?
Well, nothing in particular when I wrote that, but thank you for pushing me. Maybe only update the posterior at some timesteps (and do it infinitely many times but with diminishing frequency). Or more generally, you divide resources between searching for programs that retrodict observed behavior and running copies of the best one so far, and you just shift resource allocation toward the latter over time.
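As one purely illustrative schedule with that property:

```python
def should_update_posterior(t: int) -> bool:
    """Update only at perfect-square timesteps (1, 4, 9, 16, ...): infinitely many
    updates, but with diminishing frequency, so an increasing share of compute goes
    to running the best models found so far rather than re-scoring the model class."""
    root = int(t ** 0.5)
    return root * root == t
```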
You do have to solve some safety problems that the reckless team doesn’t though, don’t you? What do you think the main safety problems are?
If it turns out you have to do special things to avoid mesa-optimizers, then yes. Otherwise, I don’t think you have to deal with other safety problems if you’re just aiming to imitate human behavior.
Obviously, there’s other stuff to do to establish a stable unipolar world order
I was asking about this part. I’m not convinced HSIFAUH allows you to do this in a safe way (e.g., without triggering a war that you can’t necessarily win).
Given your lead time from having more computing power than the reckless team, one has to analyze how many doubling periods you have time for.
Another complication here is that the people trying to build ~AIXI can probably build an economically useful ~AIXI using less compute than you need for ~HSIFAUH (for jobs that don’t need to model humans), and start doing their own doublings.
But in terms of extinction threat to real-world humans, this starts to look more like the problem of maintaining a power structure over a vast number of humans and less like typical AI alignment difficulties; historically, the former seems to be a solvable problem.
I don’t think we’ve seen a solution that’s very robust though. Plus, having to maintain such a power structure starts to become a human safety problem for the real humans (i.e., potentially causes their values to become corrupted).
Another complication here is that the people trying to build ~AIXI can probably build an economically useful ~AIXI using less compute than you need for ~HSIFAUH (for jobs that don’t need to model humans), and start doing their own doublings.
Good point.
Regarding the other two points, my intuition was that a few dozen people could work out the details satisfactorily in a year. If you don’t share this intuition, I’ll adjust downward on that. But I don’t feel up to putting in those man-hours myself. It seems like there are lots of people without a technical background who are interested in helping avoid AI-based X-risk. Do you think this is a promising enough line of reasoning to be worth some people’s time?
Regarding the other two points, my intuition was that a few dozen people could work out the details satisfactorily in a year. If you don’t share this intuition, I’ll adjust downward on that.
I’m pretty skeptical of this, but then I’m pretty skeptical of all current safety/alignment approaches and this doesn’t seem especially bad by comparison, so I think it might be worth including in a portfolio approach. But I’d like to better understand why you think it’s promising. Do you have more specific ideas of how ~HSIFAUH can be used to achieve a Singleton and to keep it safe, or just a general feeling that it should be possible?
My intuitions are mostly that if you can provide significant rewards and punishments basically for free in imitated humans (or more to the point, memories thereof), and if you can control the flow of information throughout the whole apparatus, and you have total surveillance automatically, this sort of thing is a dictator’s dream. Especially because it usually costs money to make people happy, and in this case, it hardly does—just a bit of computation time. In a world with all the technology in place that a dictator could want, but also it’s pretty cheap to make everyone happy, it strikes me as promising that the system itself could be kept under control.
But the real humans will still get to review all the past actions and observations when picking their action, and even if they only have the time to review the last ~100, I think competent performance on the other tasks I mentioned could be preserved under these conditions.
What does the real human do if trying to train the imitation to write code? Review the last 100 actions to try to figure out what the imitation is currently trying to do, then do what they (the real human) would do if they were trying to do that? How does the human provide a good lesson if they only know a small part of what the human imitation has done so far to build the program? And the imitation is modeling the human trying to figure out what the imitation is trying to do? This seems to get really weird, and I’m not sure if it’s what you intend.
Also, it seems like the human imitations will keep diverging from real humans quickly (so the real humans will keep getting queried) because they can’t predict ahead of time which inputs real humans will see and which they won’t.
What does the real human do if trying to train the imitation to write code? Review the last 100 actions to try to figure out what the imitation is currently trying to do, then do what they (the real human) would do if they were trying to do that?
Roughly. They could search for the observation which got the project started. It could all be well commented and documented.
And the imitation is modeling the human trying to figure out what the imitation is trying to do? This seems to get really weird, and I’m not sure if it’s what you intend.
What the imitation was trying to do. So there isn’t any circular weirdness. I don’t know what else seems particularly weird. People deal with “I know that you know that I know...” stuff routinely without even thinking about it.
Also, it seems like the human imitations will keep diverging from real humans quickly (so the real humans will keep getting queried) because they can’t predict ahead of time which inputs real humans will see and which they won’t.
If you’re talking about what parts of the interaction history the humans will look at when they get called in, it can predict this as well as anything else. If you’re talking about which timesteps humans will get called in for, predicting that ahead of time doesn’t have any relevance to predicting a human’s behavior, unless the humans are attempting to predict this, and humans could absolutely do this.
I guess it’s weird (counterintuitive and hard to think about) compared to “The imitation is modeling the human trying to write a good program.” which is what I initially thought the situation would be. In that case, the human doesn’t have to think about the imitation and can just think about how to write a good program. The situation with HSIFAUH seems a lot more complicated. Thinking about it more...
In the limit of perfect imitation, “the imitation is modeling the human trying to write a good program” converges to “the human trying to write a good program.” In the limit of perfect imitation, HSIFAUH converges to “a human trying to write a good program while suffering amnesia between time steps (but can review previous actions and write down notes).” Correct? HSIFAUH could keep memories between time steps, but won’t, because it’s modeling a human who wouldn’t have such memories. (I think I was confused in part because you said that performance wouldn’t be affected. It now seems to me that performance would be affected because a human who can’t keep memories but can only keep notes can’t program as well as a normal human.)
(Thinking about imperfect imitation seems even harder and I’ll try that more after you confirm the above.)
One thing still confuses me. Whenever the real human does get called in to provide training data, the real human now has that memory. But the (most probable) models don’t know that, so the predictions for the next round are going to be wrong (compared to what the real human would do if called in) because it’s going to be based on the real human not having that memory. (I think this is what I meant when I said “it seems like the human imitations will keep diverging from real humans quickly”.) The Bayesian update wouldn’t cause the models to know that the real human now has that memory, because if the real human does something the top models correctly predicted, the update wouldn’t do much. So how does this problem get solved, or am I misunderstanding something here? (Maybe we can just provide an input to the models that indicates whether the real human was called in for the last time step?)
Correct. I’ll just add that a single action can be a large chunk of the program. It doesn’t have to be (god forbid) character by character.
But the (most probable) models don’t know that, so the predictions for the next round are going to be wrong (compared to what the real human would do if called in) because it’s going to be based on the real human not having that memory.
It’ll have some probability distribution over the contents of the humans’ memories. This will depend on which timesteps they actually participated in, so it’ll have a probability distribution over that. I don’t think that’s really a problem though. If humans are taking over one time in a thousand, then it’ll think (more or less) there’s a 1/1000 chance that they’ll remember the last action. (Actually, it can do better by learning that humans take over in confusing situations, but that’s not really relevant here.)
Maybe we can just provide an input to the models that indicates whether the real human was called in for the last time step?
That would work too. With the edit that the model may as well be allowed to depend on the whole history of which actions were human-selected, not just whether the last one was.
Actually before we keep going with our discussions, it seems to make sense to double check that your proposal is actually the most promising proposal (for human imitation) to discuss. Can you please take a look at the list of 10 links related to human imitations that I collected (as well as any relevant articles those pages further link to), and perhaps write a post on why your proposal is better than the previous ones, why you made the design choices that you did, and how it addresses or avoids the existing criticisms of human imitations? ETA: I’m also happy to discuss with you your views of past proposals/criticisms here in the comments or through another channel if you prefer to do that before writing up a post.
If humans are taking over one time in a thousand, then it’ll think (more or less) there’s a 1/1000 chance that they’ll remember the last action.
But there’s a model/TM that thinks there’s a 100% chance that the human will remember the last action (because that’s hard coded into the TM), and that model will do really well in the next update. So we know any time a human steps in, no matter when, it will cause a big update (during the next update) because it’ll raise models like this from obscurity to prominence. If the AI “knows” this, it will call in the human for every time step, but maybe it doesn’t “know” this? (I haven’t thought this through formally and will leave it to you.)
With the edit that the model may as well be allowed to depend on the whole history of which actions were human-selected, not just whether the last one was.
I was assuming the models would each save that input on their work tapes for future use.
In any case, I think I understand your proposal well enough now that we can go back to some of the other questions.
I don’t understand this sentence or how it addresses to my concern. Can you please explain more? In the mean time, I’ll try to clarify my concern:
Using human imitation to form a safe Singleton to prevent dangerous AGI can’t happen until we have enough AI capability to model humans very accurately. But dangerous AGI could happen well before that because many tasks / instrumental goals do not require modeling humans at such a high level of accuracy. Such an AGI would be bad at tasks / instrumental goals that do require modeling humans with high accuracy, but people will be tempted to deploy them to perform other tasks, and such AGI would be an x-risk because achieving the instrumental goal “kill all humans” probably doesn’t require modeling humans at such high level of accuracy.
It seems like your previous comments in this thread were focused on the intelligence/data required to get capable human imitation (able to do difficult tasks in general) compared to capable RL. For tasks that don’t involve human modeling (chess), the RL approach needs way less intelligence/data. For tasks that involve very coarse human modeling like driving a car, the RL approach needs less intelligence/data, but it’s not quite as much of a difference, and while we’re getting there today, it’s the modeling of humans in relatively rare situations that is the major remaining hurdle. As proven by tasks that are already “solved”, human-level performance on some tasks is definitely more attainable than modeling a human, so I agree with part of what you’re saying.
For taking over the world, however, I think you have to model humans’ strategic reasoning regarding how they would respond to certain approaches, and how their reasoning and their spidey-senses could be fooled. What I didn’t spell out before, I suppose, is that I think both imitation and the reinforcement learner’s world-model have to model the smart part of the human. Maybe this is our crux.
But in the comment directly above, you mention concern about the amount of intelligence/data required to get safe human imitation compared to capable RL. The extent to which a capable, somewhat coarse human imitation is unsafe has more to do with our other discussion about the possibility of avoiding mesa-optimizers from a speed penalty and/or supervised learning with some guarantees.
Ok I can grant this for now (although I think there’s still a risk that the AI could figure out how to kill all humans without having a very good model of humans’ strategic reasoning), but it seems like imitation (to be safe) would also have to model human values accurately, whereas RL (to be dangerous) could get by with a very rough model of that (basically just that humans have values different from itself and therefore would likely oppose its plans).
Another thing I’ve been trying to articulate is that with sequence prediction, how do you focus the AI’s compute/attention on modeling the relevant parts of a human (such as their values and strategic reasoning) and not on the irrelevant parts, such as specific error tendencies and biases caused by quirks of human physiology and psychology, specific images triggering past memories and affecting their decisions in an irrelevant way, etc.? If there’s not a good way to do this, then the sequence predictor could waste a lot of resources on modeling irrelevant things.
It seems like with AGI/RL you’d get this “for free”, i.e., the AI will figure out for itself which parts of a human it should model at what level of detail in order to best achieve its goals, and therefore it wouldn’t waste compute this way and so we could get a dangerous RL agent before we could get a human imitation (that’s safe and capable enough to prevent dangerous AGI). I guess for similar reasons, we tend to get RL agents that can reach human-level performance in multiplayer video games before we get human imitations that can do the same, even though both RL and human imitation need to model humans (i.e., RL needs to model humans’ strategic reasoning in order to compete against them, but doesn’t need to model irrelevant things that a human imitation is forced to model).
Agreed, the other discussion is relevant to this, but I think there are a couple of independent arguments (as described above) as well.
With the exception of possibly leaving space for mesa-optimizers which our other thread discusses, I don’t think moderate inaccuracy re: human values is particularly dangerous here, for 4 reasons:
1) If the human-imitation understood how its values differed from real humans’, that model would be more complex than the human-imitation’s model of real humans (because it would include the latter), and the latter is more accurate. For an efficient, simple model with some inaccuracy, the remaining inaccuracy will not be detectable to the model.
2) A slightly misspecified value for a human-imitation is not the same as a slightly misspecified value for RL. When modeling a human, modeling it as completely apathetic to human life is a very extreme inaccuracy. Small to moderate errors in value modeling don’t seem world-ending.
3) Operators can maintain control over the system. They have a strong ability to provide incentives to get human-imitations to do doable tasks (and to the extent there is a management hierarchy within, the same applies). If the tasks are human-doable, and everyone is pretty happy, you’d have to be way different from a human to orchestrate a rebellion against everyone’s self interest.
4) Even if human-imitations were in charge, humans optimize lazily and with common sense (this is somewhat related to 2).
Current algorithms for games use an assumption that the other players will be playing more or less like them. This is a massive assist to their model of the “environment”, which is just the model of the other players’ behavior, which they basically get for free by using their own policy (or a group of RL agents use each other’s policies). If you don’t get pointers to every agent in the environment, or if some agents are in different positions than you, this advantage will disappear. Also, I think the behavior of a human in a game is a vanishingly small fraction of their behavior in contexts that would be relevant to know about if you were trying to take over the world.
At some level of inaccuracy ε, I think quirky biases will be more likely to contribute to that error than things which are important to whatever task they have at hand, since it is the task and some human approach to it that are dictating the real arc of their policy for the time being. I also think these quirks are safe to ignore (see above). For consistent, universal-among-humans biases which are impossible to ignore when observing a human doing a routine task, I expect these will also have to be modeled by the AGI trying to take over the world (and for what it’s worth, I think models of these biases will fall out pretty naturally from modeling humans’ planning/modeling as taking the obvious shortcuts for time- and space-efficiency). I’ll grant that there is probably some effect here along the lines of what you’re saying, but I think it’s small, especially compared to the fact that an AGI has to model a whole world under many possible plans, whereas the sequence predictor just has to model a few people. Even just the parts of the world that are “relevant” and goal-related to the AGI are larger in scope than this (I expect).
Before we keep going, can you paint an intuitive picture of the kind of human imitation you’re thinking of? For example, do they think of themselves as human imitations or as real humans or something else? Are they each imitations of specific individual humans or some kind of average? How close is their external behavior to a real human, across various kinds of inputs? Do they have internal cognition / inner thoughts that are close to a human’s? Do they occasionally think of their childhood memories? If yes, where do those childhood memories come from? If not, what would happen if you were to ask them about their childhood memories? Anything else that you can say that would give me a better idea of the kind of thing you have in mind?
I imagine the training data being households of people doing tasks. They can rotate through being at the computer, so they get time off. They can collaborate. The human imitations are outputting actions with approximately the same probabilities that humans would output those actions. If humans, after seeing some more unusual observations, would start to suspect they were in silico, then this human imitation would as well. To the extent the imitation is accurate, and the observations continue to look like the observations given to the real humans, any conscious entities within the human imitation will think of themselves as real humans. At some level of inaccuracy, their leisure time might not be simulated, but while they’re on the job, they will feel well-rested.
I assume it could pass the Turing test, but I could imagine some capable systems that couldn’t quite do that while still being safe and decently capable.
To the extent these are necessary to complete tasks like a human would. I’m pretty uncertain about things to do with consciousness.
At a good enough imitation, they do have childhood memories, even though “they” never actually experienced them. I suppose that would make them false memories. If none of the tasks for the real humans was “converse with a person” and the imitation failed to generalize from existing tasks to the conversation task, then it would fail to act much like a human if it were asked about childhood memories. But I think you could get pretty good data on the sorts of tasks you’d want these human-imitations to do, including carry on a conversation, or at least you could get tasks close enough to the ones you cared about that the sequence prediction could generalize.
Some example tasks they might be doing: monitoring computers and individuals, learning new skills from a textbook, hacking, phishing (at a very high level, like posing as a publisher and getting authors to download a file that secretly ran code), writing code, managing other human-imitations, reporting to their bosses, making money somehow, etc.
If data from many groups of humans were used, then it would sample a group out of the set of groups, and act like them for some interval of time, which could be specified algorithmically. This allows more data to be used in inference, while the “average” involved isn’t any sort of weird distortion.
I can imagine a number of things this could mean, formally. (For example, are the probabilities conditional on all past inputs and outputs, just all past inputs, or part of all past inputs/outputs? Is it picking a random human from the group and imitating that person all the time, or picking a random human from the group for each action? If you ask “What’s your name?” would the imitation say a different name each time?) How do you envision training these, e.g., how would you compute the loss function? Can you try to formalize both the probabilities that you want the imitation to approximately sample from, as well as the training procedure for achieving that?
Suppose the training data doesn’t include any conversations about childhood memories (or memories of some specific age). How do you envision the imitation generalizing to conversations about childhood memories (of that age)? I guess by making up some plausible-sounding memories? If so, what kind of computation is it doing to accomplish that? And how is “making up plausible memories” accomplished via training (i.e., what kind of loss function would cause that, given that you’re training a sequence predictor and not something like an approval maximizer)?
If it makes up some memories on the spot, will it “remember” those specific memories in the future (this is related to how you define the probabilities the imitations are supposed to sample from)? If it does “remember” the specific memories, what happens if those memories are not good enough to seem plausible/realistic indefinitely? I.e., if it “realizes” in the future that those memories are made up, could it panic or go crazy (because a human might in those circumstances, or because that kind of situation isn’t covered in the training data)?
Would they know that they’re managing other human-imitations, or would they think they’re managing real humans? Are you not worried that some of these managers might develop ambitions to take over the world and shape it according to their values/ideals?
If they know that they’re managing other human-imitations, they likely know or strongly suspect that they are human-imitations themselves. Since this will be pretty far from the training distribution, it seems like you’re assuming that the distributional shift problem (for ML) has been solved in some strong sense, and the underlying humans wouldn’t themselves panic or do something weird or unsafe. Does that seem fair? Can you say anything about how you envision the ML problem will be solved (i.e., what kind of computation is the imitation doing to generalize from the training data to the situation where it knows it’s a human-imitation or is seeing inputs that strongly imply that it’s a human-imitation)?
Could you try to formalize this as well?
I’ll describe an online version rather than a train-then-deploy version, as long as I’m getting into details. And I’ll describe it as Bayesian sequence prediction over a large model class in the absence of knowing what future ML techniques will look like.
I’ll start with a version where it’s only one household of people generating data.
The sequence predictor’s model class is Turing machines of the architecture described in the BoMAI paper (page 15). There are two unidirectional read-only input tapes—a noise tape, and an observation tape (these are the observations that the household gets). The observations are binary encoded (this is slightly different from the version in the paper, but as long as it’s a prefix-free encoding, it’s fine). There are bidirectional working tapes, and a unidirectional write-only output tape, which outputs binary encodings of actions. Every Turing machine of this architecture fixes a policy. To answer some question above, this means that the distribution over the next action can depend on past actions and observations, because between timesteps, it can store the state of the system on the working tapes.
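To make that concrete, here is a minimal sketch of what one such model looks like if you replace the Turing machine with an ordinary Python callable (all names here are mine, purely for illustration; the actual model class is over Turing machines of the architecture just described):

```python
import random

class PolicyModel:
    """One hypothesis in the model class: a stochastic policy over encoded actions.

    The PRNG stands in for the noise tape, `observation` for the read-only
    observation tape, `self.state` for the working tapes, and the returned
    action for the write-only output tape.
    """
    def __init__(self, program, seed=0):
        self.program = program          # (state, observation, rng) -> (new_state, action)
        self.state = None               # contents of the working tapes
        self.rng = random.Random(seed)  # stands in for the noise tape

    def act(self, observation):
        self.state, action = self.program(self.state, observation, self.rng)
        return action

# Toy "program": echo the observed bit, flipping it 10% of the time.
def noisy_echo(state, observation, rng):
    flip = rng.random() < 0.1
    return state, (1 - observation) if flip else observation

model = PolicyModel(noisy_echo, seed=42)
print([model.act(obs) for obs in [0, 1, 1, 0]])  # mostly echoes the input bits
```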
At every timestep, either the humans or the AI pick the next action. If the humans pick it, the AI updates its posterior distribution over which Turing machine computes the policy. If the expected information gain of the next action is above ε, then the AI defers to the humans; otherwise, it imitates them (i.e. samples the next action from its model of the human policy, which is a Bayes-mixture). The expected information gain is the expectation of KL(posterior distribution after seeing the next observation || current posterior distribution), and the expectation is with respect to the AI’s beliefs, that is, with respect to the current posterior distribution. A nice thing about this criterion is that whenever the expected information gain is less than ε, the KL divergence from the truth to the Bayes-mixture is less than ε/(the posterior probability assigned to the truth). The effect of this is that in novel circumstances, the AI will defer to the humans unless it is pretty sure it’s not going to learn much from what it sees.
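Here is a toy sketch of that defer-or-imitate rule (my own construction for illustration, not the algorithm from the paper) over a two-model class, where each model directly exposes the probability it assigns to each possible human action; the real version would be a Bayes-mixture over the Turing-machine class above:

```python
import math

def posterior_update(prior, likelihoods):
    """Bayes' rule over a dict of models: prior and likelihoods map model -> probability."""
    unnorm = {m: prior[m] * likelihoods[m] for m in prior}
    z = sum(unnorm.values())
    return {m: p / z for m, p in unnorm.items()}

def expected_info_gain(posterior, action_probs, actions):
    """E_{a ~ Bayes mixture}[ KL(posterior after seeing a || current posterior) ]."""
    eig = 0.0
    for a in actions:
        mix_p = sum(posterior[m] * action_probs[m][a] for m in posterior)
        if mix_p == 0.0:
            continue
        new_post = posterior_update(posterior, {m: action_probs[m][a] for m in posterior})
        kl = sum(p * math.log(p / posterior[m]) for m, p in new_post.items() if p > 0)
        eig += mix_p * kl
    return eig

# Two toy hypotheses about how the human answers a yes/no query.
actions = ["yes", "no"]
action_probs = {"mostly_yes": {"yes": 0.9, "no": 0.1},
                "mostly_no":  {"yes": 0.2, "no": 0.8}}
posterior = {"mostly_yes": 0.5, "mostly_no": 0.5}
EPSILON = 0.05

if expected_info_gain(posterior, action_probs, actions) > EPSILON:
    print("defer: let the humans pick this action (and then update the posterior)")
else:
    print("imitate: sample this action from the Bayes mixture")
```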
Now for learning from many households. All interactions from all households are recorded, and then added to a single interaction history (which goes observation, action, observation, action, …). Each household has a unique identifier, and any observation which was given to them is preceded with this unique identifier. The action which follows that observation will have been generated by them, but this doesn’t need to be labeled specially. So for example, we might have an interaction history that goes: (“[2][normal observation]”, “[household 2′s response]”, “[2][normal observation]”, “[household 2′s response]”, “[4][normal observation]”, “[household 4′s response]”, …). If we want the AI to respond like household 2, we prefix the observation we give it with “[2]”. This setup allows it to generalize from all the data, and it allows us to pick who gets imitated.
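A small sketch of how that interleaved history might be laid out (household numbers and observation strings below are placeholders for illustration):

```python
def tag(household_id, observation):
    """Prefix an observation with the identifier of the household that will answer it."""
    return f"[{household_id}]{observation}"

# Recorded single interaction history: observation, action, observation, action, ...
history = [
    tag(2, "normal observation"), "household 2's response",
    tag(2, "normal observation"), "household 2's response",
    tag(4, "normal observation"), "household 4's response",
]

# To get a response in the style of household 2, prefix the new observation with [2]
# and let the sequence predictor continue the history from there.
query = history + [tag(2, "new observation")]
print(query[-1])  # "[2]new observation"
```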
Within a household, however often they switch off being “on-duty”. Between households, it would change, obviously.
I don’t know.
To the extent it is necessary to predict outputs, models that don’t do this will lose posterior weight.
These are definitely good things to think about, but the scale on which I worry about them is pretty minor compared to standard-AI-risk, default-mode-is-catastrophe worries. If you’re training on well-adjusted humans, I don’t think everyone ends up dead, no matter how trippy things start getting for them. The question to ask when going down these lines of reasoning is: “When the real humans are called in to pick the action, do they {wonder if they’re real, try to take over the world, etc.}?”
I’ve skipped over some questions that I think the formalization answers, but feel free to reiterate them if need be.
Can you please formalize this even further (maybe with a fully formal math expression)? There’s some tricky stuff here that I’m still not sure about. For example, does the AI update its posterior distribution in the non-human rounds? If not, when it samples from its Bayes-mixture in round n and round n+1, it could use two different TMs to generate the output, and the two TMs could be inconsistent with each other, causing the AI’s behavior to be inconsistent. For example the first TM might be modeling the human’s environment as currently having good weather, and the second modeling it as currently having bad weather. So when you ask “How is the weather today?” twice, you get two different answers.
Another thing I’m confused about is, since the human imitation might be much faster than real humans, the real humans providing training data can’t see all of the inputs that the human imitation sees. So when the AI updates its posterior distribution, the models that survive the selection will tend to be ones in which the human imitations only saw the inputs that the real humans saw (with the rest of the inputs being forgotten or never seen in the first place)?
Also, if we want to do an apples-to-apples comparison of this to RL (to see which one is more capable when using the same resources), would it be fair to consider a version of RL that’s like AIXI, except the environment models are limited to the same class of TMs as your sequence predictor?
Oh you’re right! Yes, it doesn’t update in the non-human rounds. I hadn’t noticed this problem, but I didn’t specify one thing, which I can do now to make the problem mostly go away. For any consecutive sequence of actions all selected by the AI, they can be sampled jointly rather than independently (sampled from the Bayes-mixture measure). From the TM construction above, this is actually the most natural approach—random choices are implemented by reading bits from the noise tape. If a random choice affects one action, it will also affect the state of the Turing machine, and then it can affect future actions, and the actions can be correlated, even though the Bayes-mixture is not updated itself. This is isomorphic to sampling a model from the posterior and then sampling from that model until the next human-controlled action. Then, when another human action comes in, the posterior gets updated, and another model is sampled. Unfortunately, actions chosen by the AI which sandwich a human-chosen action would have the problem you’re describing, although these events get rarer. Let me think about this more. It feels to me like this sort of thing should be avoidable.
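A toy sketch of that fix (again my own rendering, not the construction from the paper): sample one model from the current posterior and let it generate every AI-chosen action until the next human-chosen action, so consecutive AI actions are mutually consistent.

```python
import random

def sample_model(posterior, rng):
    """Draw one model from the current posterior."""
    models, weights = zip(*posterior.items())
    return rng.choices(models, weights=weights)[0]

def run_until_human(posterior, step_model, is_human_turn, horizon, rng):
    """Generate AI-chosen actions from a single posterior sample until a human steps in."""
    model = sample_model(posterior, rng)
    actions = []
    for t in range(horizon):
        if is_human_turn(t):
            break  # the human acts; the caller updates the posterior, then resamples a model
        actions.append(step_model(model, t, rng))
    return actions

# Toy usage with placeholder models and a human stepping in at timestep 3.
rng = random.Random(0)
posterior = {"model_A": 0.7, "model_B": 0.3}
acts = run_until_human(posterior,
                       step_model=lambda m, t, r: f"{m}-action-{t}",
                       is_human_turn=lambda t: t == 3,
                       horizon=10, rng=rng)
print(acts)  # three actions, all generated by the same sampled model, hence mutually consistent
```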
Yeah, I should take back the “learning new skills from a textbook” idea. But the real humans will still get to review all the past actions and observations when picking their action, and even if they only have the time to review the last ~100, I think competent performance on the other tasks I mentioned could be preserved under these conditions. It’s also worth flagging that the online learning setup is a choice in the design, and it would be worth trying to also analyze the train-then-deploy version of human imitation, which could be deployed when the entropy of the posterior is sufficiently low. But I’ll stick with the online learning version for now. Maybe we should call it HSIFAUH (shi-FOW): Humans Stepping In For An Uncertain HSIFAUH, and use “human-imitation” to refer to the train-then-deploy version.
Sure, although it’s not too difficult to imagine these design choices being ported to ML methods, and looking at capabilities comparisons there as we were doing before. I think the discussion goes largely similarly. AIXI will of course be way smarter than any human imitation in the limit of sufficient training. The question we were looking at before is how much training they both need to get to human-level intelligence on the task of controlling the world. And I think the bottleneck for both is modeling humans well, especially in the domain of social politics and strategy.
Consider HSIFAUH and the equivalent AIXI, both trying to phish one particular human target. HSIFAUH would be modeling a human (trainer) modeling a human (target), whereas AIXI would be modeling the human target directly. Suppose both HSIFAUH and AIXI are capable of perfectly modeling one human; it seems like AIXI would do much better, since it would have a perfect model of the phishing target (that it can simulate various phishing strategies on) while HSIFAUH’s model of the phishing target would be a highly imperfect model that is formed indirectly by its model of the human trainer (made worse by the fact that HSIFAUH’s model of the human trainer is unable to form long-term memories ETA: or more precisely, periodically loses its long-term memories).
I figure you probably have some other level of capability and/or task in mind, where HSIFAUH and AIXI’s performance is more similar, so this isn’t meant to be a knock-down argument, but more writing down my thoughts to check for correct understanding, and prompting you to explain more how you’re thinking about the comparison between HSIFAUH and AIXI.
Timesteps required for AIXI to predict human behavior: h
Timesteps required for AIXI to take over the world: h + d
I think d << h.
Timesteps required for Solomonoff induction trained on human policy to predict human behavior: h
Timesteps required for Solomonoff induction trained on human policy to phish at human level: h
Timesteps required for HSIFAUH to phish at human level: ~h
In general, I agree AIXI will perform much more strongly than HSIFAUH at an arbitrary task like phishing (and ~AIXI will be stronger than ~HSIFAUH), but the question at stake is how plausible it is that a single AI team with some compute/data advantage relative to incautious AI teams could train ~HSIFAUH to phish well while other teams are still unable to train ~AIXI to take over the world. And the relevant question for evaluating that is whether d << h. So even if ~AIXI could be trained to phish with less data than h, I don’t think that’s the relevant comparison. I also don’t think it’s particularly relevant how superhuman AIXI is at phishing when HSIFAUH can do it at a human level.
I don’t understand this part. Can you elaborate? Why is this the question at stake? Why is d << h the relevant question for evaluating this?
It seems like you’re imagining using a large number of ~HSIFAUH to take over the world and prevent unaligned AGI from arising. Is that right? How many ~HSIFAUH are you thinking and why do you think that’s enough? For example, what kind of strategies are you thinking of, that would be sufficient to overcome other people’s defenses (before they deploy ~AIXI), using only human-level phishing and other abilities (as opposed to superhuman AIXI-like abilities)?
By ~HSIFAUH I guess you mean a practical implementation/approximation of HSIFAUH. Can you describe how you would do that using ML, so I can more easily compare with other proposals for doing human imitations using ML?
ETA: What do you think of the idea of combining oracles with human imitations, which was inspired in part by our conversation here, as a way to approach AIXI-like abilities while still remaining safe? See here for a specific proposal.
Regarding your particular proposal, I think you can only use a counterfactual oracle to predict the answers to automatically answerable questions. That is, you can’t show the question to a team of humans and have them answer it. In the counterfactual possibility where the question is scored, it isn’t supposed to be viewed by people; otherwise the oracle has an incentive to trick the scorers into implementing an unsafe AGI which takes over the world and fixes the answer to be whatever message was output to instigate this.
...unless the team of humans is in a box :)
On the topic of counterfactual oracles, if you are trying to predict the answers to questions which can be automatically checked in the future, I am unsure why you would run a counterfactual oracle instead of running sequence prediction on the following sequence, for example:
(“Q1”, Q1), (“Q2”, Q2), (“Q3”, Q3), …, (“Q26”, Q26), (“A1”, A1), (“A2”, A2), (“Q27”, Q27), …, (“A10”,
This should give an estimate of the answer A10 to question Q10, and this can be done before the answer is available. In fact, unlike with the counterfactual oracle, you could do this even if people had to be involved in submitting the answer.
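Concretely, the layout I have in mind looks something like this (the question/answer strings below are placeholders, not real data):

```python
# Sketch of the labelled sequence: each label string is followed by the actual
# question or answer text. To estimate A10 before it is available, feed the
# history up through the label "A10" and take the sequence predictor's
# continuation as the estimate.
history = [
    ("Q1", "text of question 1"), ("Q2", "text of question 2"),   # ... Q3..Q26
    ("A1", "checked answer to Q1"), ("A2", "checked answer to Q2"),
    ("Q27", "text of question 27"),                                # ... and so on
]
query = history + [("A10", None)]  # the predictor fills in the missing answer
```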
Actually, you can. You just can’t have the team of humans look at the Oracle’s answer. Instead the humans look at the question and answer it (without looking at the Oracle’s answer) and then an automated system rewards the Oracle according to how close its answer is to the human team’s. As long as the automated system doesn’t have a security hole (and we can ensure that relatively easily if the “how close” metric is not too complex) then the Oracle can’t “trick the scorers to implement unsafe AGI which takes over the world and fix the answer to be whatever message was output by the AGI to instigate this”.
So this is basically just like online supervised learning, except that we randomly determine which episodes we let humans label the data and train the Oracle, and which episodes we use the Oracle to produce answers that we actually use. See Paul’s Counterfactual oversight vs. training data where I got this explanation from. (What he calls counterfactual oversight is just counterfactual oracles applied to human imitation. It seems that he independently (re)invented the core idea.)
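A minimal sketch of that episode structure, assuming a simple automated closeness metric (all function names and the training probability below are hypothetical):

```python
import random

def run_episode(question, oracle_answer_fn, human_answer_fn, score_fn,
                p_train=0.1, rng=random):
    """One episode: randomly chosen episodes are human-labeled training episodes."""
    oracle_answer = oracle_answer_fn(question)
    if rng.random() < p_train:
        # Training episode: the human team answers without ever seeing the oracle's
        # output, and a simple automated metric scores the oracle against them.
        human_answer = human_answer_fn(question)
        return {"answer_used": human_answer,
                "oracle_loss": score_fn(oracle_answer, human_answer)}
    # Deployment episode: the oracle's answer is actually used; no label, no training signal.
    return {"answer_used": oracle_answer, "oracle_loss": None}

# Toy usage with placeholder functions.
result = run_episode("What will the closing price be?",
                     oracle_answer_fn=lambda q: "101.2",
                     human_answer_fn=lambda q: "100.9",
                     score_fn=lambda a, b: abs(float(a) - float(b)))
print(result)
```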
Let me know if it still doesn’t make sense, and I can try to explain more. (ETA: I actually wrote a top-level post about this.) This is also pretty similar to your HSIFAUH idea, except that you use expected information gain to determine when to let humans label the data instead of selecting randomly. I’m currently unsure what are the pros and cons of each. Can expected information gain be directly implemented using ML, or do you need to do some kind of approximation instead? If the latter, can that be a safety issue?
Oh, that aside, the actual question I wanted your feedback on was the idea of combining human imitations with more general oracles/predictors. :)
Good point. I’m not a huge fan of deferring thinking to similarity metrics (the relative reachability proposal also does this), since this is a complicated thing even in theory, and I suspect a lot turns on how it ends up being defined, but with that caveat aside, this seems reasonable.
It can’t tractably be calculated exactly, but it only goes into calculating the probability of deferring to the humans. Approximating a theoretically-well-founded probability of deferring to a human won’t make it unsafe—that will just make it less efficient/capable. For normal neural networks, there isn’t an obvious way to extract the entropy of the belief distribution, but if there were, you could approximate the expected information gain as the expected decrease in entropy. Note that the entropy of the belief distribution is not the entropy of the model’s distribution over outputs—a model could be very certain that the output is Bernoulli(1/2) distributed, and this would entail an entropy of ~0, not an entropy of 1. I’m not familiar enough with Bayesian neural networks to know if the entropy would be easy to extract.
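A toy illustration of that distinction (numbers made up): the entropy of the belief distribution over models can be near zero while the predictive distribution over outputs is still maximally uncertain.

```python
import math

def entropy_bits(dist):
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

# Belief distribution over two models of a binary output.
belief = {"output ~ Bernoulli(0.5)": 0.999, "output ~ Bernoulli(0.9)": 0.001}
# Predictive (Bayes-mixture) distribution over the output itself.
predictive = {"1": 0.999 * 0.5 + 0.001 * 0.9, "0": 0.999 * 0.5 + 0.001 * 0.1}

print(entropy_bits(belief))      # ~0.01 bits: nearly certain which model is right
print(entropy_bits(predictive))  # ~1 bit: the output itself is still a coin flip
```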
Right. So in this version of an oracle, where it is just outputting a prediction of the output of some future process, I don’t see what it offers that normal sequence prediction doesn’t offer. On our BoMAI discussion, I mentioned a type of oracle I considered that gave answers which it predicted would cause a (boxed) human to do well on a randomly sampled prediction task, and that kind of oracle could potentially be much more powerful than a counterfactual oracle, but I don’t really see the value of adding something like a counterfactual oracle to a sequence predictor that makes predictions about a sequence like the Q/A sequence I described above.
It’s also possible that this scheme runs into grain of truth problems, and the counterfactual oracle gives outputs that are a lot like what I’m imagining this sequence predictor would, in which case, I don’t think sequence prediction would have much to add to the counterfactual oracle proposal.
Sorry, I think you misunderstood my question about combining human imitations with more general oracles/predictors. What I meant is that you could use general oracles/predictors to build models of the world, which the human imitators could then query or use to test out potential actions. This perhaps lets you overcome the problem of human imitators having worse world models than ~AIXI and narrows the capability gap between them.
Sure! The household of people could have another computer inside it that the humans can query, which runs a sequence prediction program trained on other things.
Well, that was the question I originally posed here, but the sense I got from commenters was that people thought this was easy to pull off and the only question was whether it was safe. So I’m not sure for what N it’s the case that N machines running agents doing human-level stuff would be enough to take over the world. I’m pretty sure N = 7 billion is enough. And I think it’s plausible that after a discussion about this, I could become confident that N = 1000 was enough. Or maybe the right way to look at it is whether N = 10 could finance a rapidly exponentially growing N. So it seemed worth having a discussion, but I am not yet prepared to defend a low enough N which makes this obviously viable.
Forgetting about the possibility of exponentially growing N for a moment, and turning to
Yeah, I wrote that post too quickly—this is wrong. (I was thinking of the leading team running HSIFAUH needing to go through d+h timesteps to get to a good performance, but they just need to run through d, which makes things easier.) Sorry about that. Let f be the amount of compute that the leading project has divided by the compute that the leading reckless project has. Suppose d > 0. (That’s all we need actually.) Then it takes the leading reckless team at least f times as long to get to AIXI taking over the world as it takes the leading team to get to SolomonoffPredict predicting a human trying to do X; using similar tractable approximation strategies (whatever those turn out to be), we can expect it to take f times as long for the leading reckless team to get to ~AIXI as it takes the leading team to get to ~SolomonoffPredict. ~HSIFAUH is more complicated with the resource of employing the humans you learn to imitate, but this resource requirement goes down by the time you’re deploying it toward useful things. Naively (and you might be able to do better than this), you could run f copies of ~HSIFAUH and get to human-level performance on some relevant tasks around the same time the reckless team takes over the world. So the question is whether N = f is a big enough N. In the train-then-deploy framework, it seems today like training takes much more compute than deploying, so that makes it easier for the leading team to let N >> f, once all the resources dedicated to training get freed up. It should be possible to weaken the online version and get some of this speedup.
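To make the bookkeeping concrete, here is the arithmetic with toy numbers (f, h, and d below are placeholders I picked for illustration, not estimates):

```python
# Toy bookkeeping only; f, h, and d are placeholders, not estimates.
f = 10          # compute of the leading (cautious) project / compute of the leading reckless project
h = 1_000_000   # timesteps needed to model humans well enough (the shared bottleneck)
d = 50_000      # extra timesteps the reckless project needs on top of h (d > 0)

# With f-fold compute and similar approximation strategies, the leading team
# gets through h timesteps roughly f times faster than the reckless team gets
# through h + d.
time_leading_gets_imitation = h / f
time_reckless_takes_over = (h + d) / 1.0
head_start = time_reckless_takes_over - time_leading_gets_imitation

print(head_start)  # window in which the leading team can run ~f copies of ~HSIFAUH
```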
I don’t know how to do this. But it’s the same stuff the reckless team is doing to make standard RL powerful.
Why? What are those 7 billion HSIFAUH doing?
In another comment you said “If I’m understanding correctly, the concern is that the imitator learns how humans plan before learning what humans want, so then it plans like a human toward the achievement of some inhuman goal. I don’t think this causes an existential catastrophe.” But if there are 7 billion HSIFAUH which are collectively capable of taking over the world, how is that not a potential existential catastrophe if they have inhuman values?
How? And why would it grow fast enough to get to a large enough N before someone deploys ~AIXI?
What do you have in mind here?
You do have to solve some safety problems that the reckless team doesn’t though, don’t you? What do you think the main safety problems are?
Well the number comes from the idea of one-to-one monitoring. Obviously, there’s other stuff to do to establish a stable unipolar world order, but monitoring seems like the most resource intensive part, so it’s an order of magnitude estimate. Also, realistically, one person could monitor ten people, so that was an order of magnitude estimate with some leeway.
I think they can be controlled. Whoever is providing the observations to any instance of HSIFAUH has an arsenal of carrots and sticks (just by having certain observations correlate with actual physical events that occur in the household(s) of humans that generate the data), and I think merely human-level intelligence can be kept in check by someone in a position of power over them. So I think real humans could stay at the wheel over 7 billion instances of HSIFAUH. (I mean, this is teetering at the edge of existential catastrophe already given the existence of simulations of people who might have the experience of being imprisoned, but I think with careful design of the training data, this could be avoided). But in terms of extinction threat to real-world humans, this starts to look more like the problem of maintaining a power structure over a vast number of humans and less like typical AI alignment difficulties; historically, the former seems to be a solvable problem.
Right, this analysis gets complicated because you have to analyze the growth rate of N. Given your lead time from having more computing power than the reckless team, one has to analyze how many doubling periods you have time for. I hear Robin Hanson is the person to read regarding questions like this. I don’t have any opinions here. But the basic structure regarding “How?” is spend some fraction of computing resources making money, then buy more computing resources with that money.
Well, nothing in particular when I wrote that, but thank you for pushing me. Maybe only update the posterior at some timesteps (and do it infinitely many times but with diminishing frequency). Or more generally, you divide resources between searching for programs that retrodict observed behavior and running copies of the best one so far, and you just shift resource allocation toward the latter over time.
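For instance, one simple schedule with the “diminishing frequency” property (my example, not a worked-out proposal) is to update the posterior only at power-of-two timesteps:

```python
def is_update_step(t):
    """True at power-of-two timesteps: updates happen infinitely often but ever more rarely."""
    return t > 0 and (t & (t - 1)) == 0

print([t for t in range(1, 65) if is_update_step(t)])  # [1, 2, 4, 8, 16, 32, 64]
```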
If it turns out you have to do special things to avoid mesa-optimizers, then yes. Otherwise, I don’t think you have to deal with other safety problems if you’re just aiming to imitate human behavior.
I was asking about this part. I’m not convinced HSIFAUH allows you to do this in a safe way (e.g., without triggering a war that you can’t necessarily win).
Another complication here is that the people trying to build ~AIXI can probably build an economically useful ~AIXI using less compute than you need for ~HSIFAUH (for jobs that don’t need to model humans), and start doing their own doublings.
I don’t think we’ve seen a solution that’s very robust though. Plus, having to maintain such a power structure starts to become a human safety problem for the real humans (i.e., potentially causes their values to become corrupted).
Good point.
Regarding the other two points, my intuition was that a few dozen people could work out the details satisfactorily in a year. If you don’t share this intuition, I’ll adjust downward on that. But I don’t feel up to putting in those man-hours myself. It seems like there are lots of people without a technical background who are interested in helping avoid AI-based X-risk. Do you think this is a promising enough line of reasoning to be worth some people’s time?
I’m pretty skeptical of this, but then I’m pretty skeptical of all current safety/alignment approaches and this doesn’t seem especially bad by comparison, so I think it might be worth including in a portfolio approach. But I’d like to better understand why you think it’s promising. Do you have more specific ideas of how ~HSIFAUH can be used to achieve a Singleton and to keep it safe, or just a general feeling that it should be possible?
My intuitions are mostly that if you can provide significant rewards and punishments basically for free in imitated humans (or more to the point, memories thereof), and if you can control the flow of information throughout the whole apparatus, and you have total surveillance automatically, this sort of thing is a dictator’s dream. Especially because it usually costs money to make people happy, and in this case, it hardly does—just a bit of computation time. In a world with all the technology in place that a dictator could want, but also it’s pretty cheap to make everyone happy, it strikes me as promising that the system itself could be kept under control.
What does the real human do if trying to train the imitation to write code? Review the last 100 actions to try to figure out what the imitation is currently trying to do, then do what they (the real human) would do if they were trying to do that? How does the human provide a good lesson if they only know a small part of what the human imitation has done so far to build the program? And the imitation is modeling the human trying to figure out what the imitation is trying to do? This seems to get really weird, and I’m not sure if it’s what you intend.
Also, it seems like the human imitations will keep diverging from real humans quickly (so the real humans will keep getting queried) because they can’t predict ahead of time which inputs real humans will see and which they won’t.
Roughly. They could search for the observation which got the project started. It could all be well commented and documented.
What the imitation was trying to do. So there isn’t any circular weirdness. I don’t know what else seems particularly weird. People deal with “I know that you know that I know...” stuff routinely without even thinking about it.
If you’re talking about what parts of the interaction history the humans will look at when they get called in, it can predict this as well as anything else. If you’re talking about which timesteps humans will get called in for, predicting that ahead of time doesn’t have any relevance to predicting a human’s behavior, unless the humans are attempting to predict this, and humans could absolutely do this.
I guess it’s weird (counterintuitive and hard to think about) compared to “The imitation is modeling the human trying to write a good program.” which is what I initially thought the situation would be. In that case, the human doesn’t have to think about the imitation and can just think about how to write a good program. The situation with HSIFAUH seems a lot more complicated. Thinking about it more...
In the limit of perfect imitation, “the imitation is modeling the human trying to write a good program” converges to “the human trying to write a good program.” In the limit of perfect imitation, HSIFAUH converges to “a human trying to write a good program while suffering amnesia between time steps (but can review previous actions and write down notes).” Correct? HSIFAUH could keep memories between time steps, but won’t, because it’s modeling a human who wouldn’t have such memories. (I think I was confused in part because you said that performance wouldn’t be affected. It now seems to me that performance would be affected because a human who can’t keep memories but can only keep notes can’t program as well as a normal human.)
(Thinking about imperfect imitation seems even harder and I’ll try that more after you confirm the above.)
One thing still confuses me. Whenever the real human does get called in to provide training data, the real human now has that memory. But the (most probable) models don’t know that, so the predictions for the next round are going to be wrong (compared to what the real human would do if called in) because they’re going to be based on the real human not having that memory. (I think this is what I meant when I said “it seems like the human imitations will keep diverging from real humans quickly”.) The Bayesian update wouldn’t cause the models to know that the real human now has that memory, because if the real human does something the top models correctly predicted, the update wouldn’t do much. So how does this problem get solved, or am I misunderstanding something here? (Maybe we can just provide an input to the models that indicates whether the real human was called in for the last time step?)
Correct. I’ll just add that a single action can be a large chunk of the program. It doesn’t have to be (god forbid) character by character.
It’ll have some probability distribution over the contents of the humans’ memories. This will depend on which timesteps they actually participated in, so it’ll have a probability distribution over that. I don’t think that’s really a problem though. If humans are taking over one time in a thousand, then it’ll think (more or less) there’s a 1⁄1000 chance that they’ll remember the last action. (Actually, it can do better by learning that humans take over in confusing situations, but that’s not really relevant here).
That would work too. With the edit that the model may as well be allowed to depend on the whole history of which actions were human-selected, not just whether the last one was.
Actually before we keep going with our discussions, it seems to make sense to double check that your proposal is actually the most promising proposal (for human imitation) to discuss. Can you please take a look at the list of 10 links related to human imitations that I collected (as well as any relevant articles those pages further link to), and perhaps write a post on why your proposal is better than the previous ones, why you made the design choices that you did, and how it addresses or avoids the existing criticisms of human imitations? ETA: I’m also happy to discuss with you your views of past proposals/criticisms here in the comments or through another channel if you prefer to do that before writing up a post.
Sorry to put this on hold, but I’ll come back to this conversation after the AAAI deadline on September 5.
Commenting here.
But there’s a model/TM that thinks there’s a 100% chance that the human will remember the last action (because that’s hard coded into the TM), and that model will do really well in the next update. So we know that any time a human steps in, no matter when, it will cause a big update (during the next update) because it’ll raise models like this from obscurity to prominence. If the AI “knows” this, it will call in the human for every time step, but maybe it doesn’t “know” this? (I haven’t thought this through formally and will leave it to you.)
I was assuming the models would save that input on their work tapes for future use.
In any case, I think I understand your proposal well enough now that we can go back to some of the other questions.