I’d be interested in hearing more about what Rohin means when he says:
… it’s really just “we notice when they do bad stuff and the easiest way for gradient descent to deal with this is for the AI system to be motivated to do good stuff”.
This sounds something like gradient descent retargeting the search for you because it’s the simplest thing to do when there are already existing abstractions for the “good stuff” (e.g. if there already exists a crisp abstraction for something like ‘helpfulness’, and we punish unhelpful behaviors, it could potentially be ‘really easy’ for gradient descent to simply use the existing crisp abstraction of ‘helpfulness’ to do much better at the tasks we give it).
I think this might be plausible, but a problem I anticipate is that the abstractions for the things we “actually want” don’t match the learned abstractions that end up forming in future models, and you face what’s essentially a classic outer alignment failure (see Leo Gao’s ‘Alignment Stream of Thought’ post on human abstractions). I see this happening for two reasons:
Our understanding of what we actually want is poor, such that we wouldn’t want to optimize for how we understand what we want
We poorly express our understanding of what we actually want in the data we train our models with, such that we wouldn’t want to optimize for the expression of what we want
High-level response: yes, I agree that “gradient descent retargets the search” is a decent summary; I also agree that the thing you outline is a plausible failure mode, but it doesn’t justify confidence in doom.
Our understanding of what we actually want is poor, such that we wouldn’t want to optimize for how we understand what we want
I’m not very worried about this. We don’t need to solve all of philosophy and morality; it would be sufficient to have the AI system leave us in control and respect our preferences where they are clear.
We poorly express our understanding of what we actually want in the data we train our models with, such that we wouldn’t want to optimize for the expression of what we want
I agree this is more of an issue, but it’s very unclear to me how badly this issue will bite us. Does this lead to AI systems that sometimes say what we want to hear rather than what is actually true, but are otherwise nice? Seems mostly fine. Does this lead to AI systems that tamper with all of our sources of information about things that are happening in the world, to make things simply appear to be good rather than actually being good? Seems pretty bad. Which of the two (or the innumerable other possibilities) happens? Who knows?
We don’t need to solve all of philosophy and morality; it would be sufficient to have the AI system leave us in control and respect our preferences where they are clear
I agree that we don’t need to solve philosophy/morality if we can at least pin down things like corrigibility, but humans may poorly understand “leaving humans in control” and “respecting human preferences”, such that optimizing for human abstractions of these concepts could be unsafe (this belief isn’t that strongly held; I’m just considering some exotic scenarios where humans are technically ‘in control’ according to the specification we thought of, but the consequences are negative nonetheless, i.e. the usual Goodharting failure mode).
Which of the two (or the innumerable other possibilities) happens?
Depending on the work you’re asking the AI(s) to do (e.g. automating large parts of open-ended software projects, or large portions of STEM work), I’d say the world-takeover/power-seeking/recursive-self-improvement type of scenarios happen, since these tasks incentivize the development of unbounded behaviors (open-ended, project-based work doesn’t have clear deadlines, may require multiple retries, and involves lots of uncertainty, so I can imagine unbounded behaviors like “gain more resources because that’s broadly useful under uncertainty” being strongly selected for).