This is one of the documents I was responding to when I wrote A general model of safety-oriented AI development, Three AI Safety Related Ideas, and Two Neglected Problems in Human-AI Safety. (I didn’t cite it because it was circulating semi-privately in draft form, and Eric apparently didn’t want its existence to be publicly known.) I’m disappointed that although Eric wrote to me “I think that your two neglected problems are critically important”, the perspectives in those posts didn’t get incorporated more into the final document, which devotes only three short paragraphs, out of hundreds of pages, to what I think of as “human safety problems”. (I think those paragraphs were in the draft even before I wrote my posts.)
I worry about the framing adopted in this document that the main problem in human-AI safety is “questions of what humans might choose to do with their capabilities”, as opposed to my preferred framing of “how can we design human-AI systems to minimize total risk”. (To be fair to Eric, a lot of other AI safety people also only talk about “misuse risk” and not about how AI is by default likely to exacerbate human safety problems, e.g., by causing rapid distributional shifts for humans.) I worry that this gives AI researchers and developers license to think, “I’m just developing an AI service. AI services will be comprehensive anyway so there’s no reason for me to hold back or think more about what I’m doing. It’s someone else’s job to worry about what humans might choose to do with these capabilities.”
I actually think the CAIS model gives me optimism for these sorts of problems. As long as we acknowledge that the problems exist and can be an issue, we could develop services that help us mitigate them. Safety in the CAIS world already depends on having services that are in charge of good engineering, testing, red teaming, monitoring, etc., as well as services that evaluate objectives and make sure humans would approve of them. It seems fairly easy to expand this to include services that consider how disruptive new technologies will be, how underdetermined human values are, whether a proposed plan reduces option value, what risk aversion implies about a particular plan of action, what blind spots people have, etc.
I’d be interested in a list of services that you think would be helpful for addressing human safety problems. You might think of this as “our best current guess at metaphilosophy and metaphilosophy research”.
(I know you were mainly talking about the document’s framing; I don’t have much to say about that.)
It seems fairly easy to expand this to include services that consider how disruptive new technologies will be, how underdetermined human values are, whether a proposed plan reduces option value, what risk aversion implies about a particular plan of action, what blind spots people have, etc.
Can you explain how you’d implement these services? Take “how disruptive new technologies will be” for example. I imagine you can’t just apply ML given the paucity of training data and how difficult it would be to generalize from historical data to new technologies and new social situations. And it seems to me that if you base it on any kind of narrow AI technology, it would be easy to miss some of the novel implications/consequences of the new technologies and social situations and end up with a wrong answer. Maybe you could instead base it on a general-purpose reasoner or question-answerer, but if something like that exists, AI would already have created a lot of new technologies that are risky for humans to face. Plus, the general-purpose AI could replace a lot of discrete/narrow AI services, so I feel like we would already have moved past the CAIS world at that point. BTW, if the service is not just a thin wrapper on top of a general-purpose AI which is generally trustworthy, I also don’t know how you’d know whether you can trust the answers that it gives.
I’d be interested in a list of services that you think would be helpful for addressing human safety problems. You might think of this as “our best current guess at metaphilosophy and metaphilosophy research”.
I could try to think in that direction after I get a better sense of what kinds of services might be both feasible and trustworthy in the CAIS world. It seems easy to become too optimistic/complacent under the CAIS model if I just try to imagine what safety-enhancing services might be helpful without worrying about whether those services would be feasible or how well they’d work at the time when they’re needed.
Can you explain how you’d implement these services?
Not really. I think of CAIS as suggesting that we take an outside view that says “looking at how AI has been progressing, and how humans generally do things, we’ll probably be able to do more and more complex tasks as time goes on”. But CAIS emphasizes that the things we’ll be able to do will be domain-specific tasks, rather than that we’ll get a general-purpose reasoner. I don’t have a detailed enough inside view to say how complex tasks might be implemented in practice.
I agree with the rest of what you said, which feels to me like considering a few possible inside-view scenarios and showing that they don’t work.
One way to think about this is through the lens of iterated amplification. With iterated amplification, we also get the property that our AI systems will be able to do more and more complex tasks as time goes on. The key piece that enables this is the ability to decompose problems, so that iterated amplification always bottoms out into a tree of questions and subquestions down to leaves which the base agent can answer. You could think of (my conception of) CAIS as a claim that a similar process will happen in a decentralized way for all of ML by default, and at any point the things we can do will look like an explicit iterated amplification deliberation tree of depth one or two, where the leaves are individual services and the top level question will be some task that is accomplished through a combination of individual services.
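The tree picture above can be made concrete with a minimal sketch. Everything in it is a hypothetical illustration: the `Node` type, the service names, and the `combine` function are assumptions made for this example, not anything taken from the CAIS report or from an actual iterated amplification implementation, and the lambdas merely stand in for real services. It only shows the shape of a depth-one tree whose leaves are individual services.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Assumption: a "service" is modeled as a function from a question to an answer.
Service = Callable[[str], str]


@dataclass
class Node:
    """A question in the tree; leaves name the service that answers them."""
    question: str
    children: List["Node"] = field(default_factory=list)
    service: str = ""  # only meaningful for leaves


def answer(node: Node, services: Dict[str, Service],
           combine: Callable[[str, List[str]], str]) -> str:
    """Answer a question by recursing down to leaf services."""
    if not node.children:
        # Leaf: hand the question to a single narrow service.
        return services[node.service](node.question)
    # Internal node: answer the subquestions, then combine the results.
    subanswers = [answer(child, services, combine) for child in node.children]
    return combine(node.question, subanswers)


def depth(node: Node) -> int:
    """Depth of the tree below this node (a lone leaf has depth 0)."""
    if not node.children:
        return 0
    return 1 + max(depth(child) for child in node.children)


# Placeholder services; real ones would be trained, domain-specific systems.
services: Dict[str, Service] = {
    "translate": lambda q: f"translated({q})",
    "summarize": lambda q: f"summary({q})",
}


def combine(question: str, subanswers: List[str]) -> str:
    # Assumption: the top-level task just concatenates the leaf outputs.
    return f"{question}: " + "; ".join(subanswers)


# A depth-one tree: one top-level task served by two individual services.
tree = Node(
    "brief me on the report",
    children=[
        Node("report text", service="translate"),
        Node("report text", service="summarize"),
    ],
)

result = answer(tree, services, combine)
```

In this toy version the decomposition is written by hand; the open question in the discussion is which decompositions of this kind will be feasible and trustworthy in practice, which the sketch does not address.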
I could try to think in that direction after I get a better sense of what kinds of services might be both feasible and trustworthy in the CAIS world. It seems easy to become too optimistic/complacent under the CAIS model if I just try to imagine what safety-enhancing services might be helpful without worrying about whether those services would be feasible or how well they’d work at the time when they’re needed.
Agreed. I’m making a bid for generating ideas without worrying about feasibility and trustworthiness, but also for not spending too much time on this and not taking the results too seriously.
You could think of (my conception of) CAIS as a claim that a similar process will happen in a decentralized way for all of ML by default, and at any point the things we can do will look like an explicit iterated amplification deliberation tree of depth one or two, where the leaves are individual services and the top level question will be some task that is accomplished through a combination of individual services.
This seems like a sensible way of looking at things, and in this framing I’d say that my worry is that crucial safety-enhancing services may only appear fairly high in the overall tree of services, or outside the tree altogether (see also #3 in Three AI Safety Related Ideas, which makes a similar point), and in the CAIS world it would be hard to limit access to the lower-level services (as a risk-reduction measure).
Yeah, that seems right; I don’t think anyone is arguing against that claim.