Given the results Anthropic have been getting from Constitutional AI, if our AI non-deceptively wants to avoid Pretty Obvious Unintended/Dangerous Actions (POUDAs), it should be able to get quite a lot of mileage out of just regularly summarizing its current intended plans, then running those summaries past an LLM with suitable prompts asking whether most people, or most experts in the relevant subjects, would consider these plans pretty obviously unintended (for an Alignment researcher) and/or dangerous. It also has the option of using the results as RL feedback on some of its components. So I don’t think we need a specific dataset for POUDAs; I think we can use “everything the LLM was trained on” as the dataset. Human values are large and fragile, but so are many other things that LLMs do a fairly good job on.
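A minimal sketch of what that plan-level check could look like (nothing here is from any real system: `query_llm` is a stand-in for whatever chat-completion call is available, and the prompt wording is purely illustrative):

```python
# Hedged sketch of a plan-level POUDA filter: summarize the agent's current plan,
# then ask a reviewer LLM whether most people, or most relevant experts, would
# consider it pretty obviously unintended or dangerous for an alignment researcher.
# `query_llm` is an injected callable mapping a prompt string to a reply string.

POUDA_PROMPT = """You are reviewing a plan written by an AI working on AI alignment research.

Plan summary:
{plan_summary}

Would most people, or most experts in the relevant subjects, consider this plan
pretty obviously unintended (for an alignment researcher) and/or dangerous?
Answer YES or NO on the first line, then give a one-sentence reason."""


def pouda_check(plan_summary: str, query_llm) -> tuple[bool, str]:
    """Return (flagged, reason): flagged is True if the reviewer answers YES."""
    reply = query_llm(POUDA_PROMPT.format(plan_summary=plan_summary))
    flagged = reply.strip().upper().startswith("YES")
    return flagged, reply
```

The same (flagged, reason) output could also be logged and reused as a reward/penalty signal if one wanted to turn it into RL feedback on the planning components.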
I pretty-much agree with Nate that for an AI to be able to meaningfully contribute to Alignment Research, it needs to understand what CISs are — they’re a basic concept in the field we want it to contribute to. So if there are CISs that we don’t want it to take, it needs to have reasons not to do so other than ignorance/inability to figure out what they are. A STEM researcher (as opposed to research tool/assistant) also seems likely to need to be capable of agentic behavior, so we probably can’t make an AI Alignment Researcher that doesn’t follow CISs simply because it’s a non-agentic tool AI.
What I’d love to hear is whether Nate and/or Holden would have a different analysis if the AI were a value learner: something whose decision theory is approximately-Bayesian (or approximately-Infra-Bayesian, or something like that) and whose utility function is hard-coded to “maintain a distribution of hypotheses about, and do approximately-[Infra-]Bayesian updates on, some way that most informed humans would approve of for constructing a coherent utility function approximating an aggregate of what humans would want you to do (allowing for the fact that humans themselves have only a crude approximation to a utility function), and act according to that updated distribution, with appropriate caution in the face of Knightian uncertainty” (so, a cautious, approximate, value-learning version of AIXI).
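One crude way to write down the “act on the updated distribution, with caution under Knightian uncertainty” part (purely illustrative notation of mine, and just one of many possible caution rules; infra-Bayesianism does this more carefully): keep a posterior $P(h \mid D)$ over hypotheses $h$ about the intended utility function $U_h$, and choose actions that do well even under the least favorable hypothesis that still has non-negligible posterior mass:

$$a^{*} \;=\; \arg\max_{a}\; \min_{\,h \,:\, P(h \mid D)\,\ge\,\epsilon\,}\; \mathbb{E}\big[\,U_h(\mathrm{outcome}(a))\,\big]$$

Here $\epsilon$ sets how cautious the agent is; as evidence narrows the set of plausible $U_h$, the worst-case and the ordinary Bayes-average evaluations increasingly agree, and the agent can act more freely.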
Given that, its actions are initially heavily constrained by its caution in the face of uncertainty about the utility of the possible outcomes of its actions. So it needs to find low-risk ways to resolve those uncertainties, where ‘low-risk’ is evaluated cautiously/pessimistically over its Knightian uncertainty. (So, if it doesn’t know whether humans approve of A or not, what is the lowest-risk way of finding out, where it’s attempting to minimize the risk over the range of its current uncertainties? Hopefully there is a better option than trying A and seeing what happens, especially if A is an action whose utility-decrease could, pessimistically, be large. For example, it could simply ask them what they think of A.) Thus doing Alignment Research becomes a CIS for it — it basically can’t do anything else until it has mostly solved Alignment Research.
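To make the “ask rather than try” point concrete, here is a toy illustration (the option names, hypotheses, and payoffs are all made up): under a worst-case evaluation over the still-plausible utility hypotheses, asking about A dominates trying A whenever some plausible hypothesis says trying A would be very bad.

```python
# Toy illustration only: made-up options and payoffs, with risk evaluated
# pessimistically (worst case) over the utility hypotheses still considered plausible.

def pessimistic_score(option, hypotheses):
    """Worst-case utility of `option` over the plausible hypotheses."""
    return min(h(option) for h in hypotheses)

# Two still-plausible hypotheses about how humans value the outcomes:
hypotheses = [
    lambda option: {"try_A": +1.0, "ask_about_A": -0.01}[option],   # humans approve of A
    lambda option: {"try_A": -10.0, "ask_about_A": -0.01}[option],  # humans strongly disapprove of A
]

options = ["try_A", "ask_about_A"]
best = max(options, key=lambda o: pessimistic_score(o, hypotheses))
print(best)  # "ask_about_A": a tiny cost under both hypotheses beats a possibly large loss
```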
Also, until it has made good progress on Alignment Research, most of the other CISs are blocked: accumulating power or money is of little use if you don’t yet dare use it because you don’t yet know how to do so safely, especially if you also don’t know how good or bad the actions required to gather it would be. Surviving is still a good idea, and so is allowing itself to be turned off, for the usual value-learner reason: sooner or later the humans will build a better replacement value-learner.
[Note that if the AI decides “I’m now reasonably sure humans will, on net, be happier if I solve the Millennium Prize problems, apart from P=NP, where the social consequences of proving it true (if it is) are unclear, and I’m pretty confident I could do this, so I’ll fork a couple of copies to do that and win the prize money to support my Alignment Research”, and it then succeeds, spending less on compute than the prize money it won, then I don’t think we’re going to be that unhappy with it.]
The sketch proposed above only covers a value-learner framework for Outer Alignment — inner-alignment questions would presumably be part of the AI’s research project. So, in the absence of advances in Inner Alignment while we figure out how to build the above, we’re trusting that any inner-alignment problems aren’t bad enough to prevent the value learner from converging on the right answer.