However, in AI alignment, the hope is to learn from failures of narrow AI systems, and use that to prevent failures in more powerful AI systems.
This also jumped out at me as being only a subset of what I think of as “AI alignment”; like, ontological collapse doesn’t seem to have been a failure of narrow AI systems. [By ‘ontological collapse’, I mean the problem where the AI knows how to value ‘humans’, and then it discovers that ‘humans’ aren’t fundamental and ‘atoms’ are fundamental, and now it’s not obvious how its preferences will change.]
Perhaps you mean “AI alignment in the slow takeoff frame”, where ‘narrow’ is less a binary judgment and more of a continuous judgment; then it seems more compelling, but I still think the baseline prediction should be doom if we can only ever solve problems after encountering them.
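A minimal toy sketch of the ontological collapse worry, to make the bracketed definition concrete: a value function written against a world model that contains ‘humans’ as a primitive has no defined behavior once the model is re-described in terms of atoms. (The dictionary representation and numbers below are purely illustrative, not anything from the original exchange.)

```python
# Toy illustration of an ontology shift. The value function is written against
# the vocabulary of a coarse world model; a refined, atom-level model simply
# doesn't contain the objects it refers to.

coarse_world = {"humans": 8_000_000_000, "trees": 3_000_000_000_000}

def utility(world):
    # Valuing "humans" directly only makes sense in the coarse ontology.
    return world["humans"]

print(utility(coarse_world))  # fine: 8000000000

# After the AI refines its model, states are described in terms of atoms and
# their configurations; "humans" is no longer a primitive in the representation.
refined_world = {"atoms": "...an atom-level description of the same physical state..."}

# utility(refined_world) raises KeyError: 'humans'. The preferences don't carry
# over automatically, and how they *should* carry over is exactly the open question.
```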
Perhaps you mean “AI alignment in the slow takeoff frame”, where ‘narrow’ is less a binary judgment and more of a continuous judgment
I do mean this.
This also jumped out at me as being only a subset of what I think of as “AI alignment”; like, ontological collapse doesn’t seem to have been a failure of narrow AI systems.
I’d predict that either ontological collapse won’t be a problem, or we’ll notice it in AI systems that are less general than humans. (After all, humans have in fact undergone ontological collapse, so presumably AI systems will also have undergone it by the time they reach human-level generality.)
I still think the baseline prediction should be doom if we can only ever solve problems after encountering them.
This depends on what you count as “encountering a problem”.
At one extreme, you might look at Faulty Reward Functions in the Wild and say that it counts as “encountering” the problem “If you train using PPO with such-and-such hyperparameters on the score reward function in the CoastRunners game, then on this specific level the boat might get into a cycle of collecting turbo boosts instead of finishing the race” (see the sketch below). If this is what it means to encounter a problem, then I agree the baseline prediction should be doom if we only solve problems after encountering them.
At the other extreme, maybe you look at the same example and say it counts as “encountering” the problem “sometimes AI systems are not beneficial to humans”. So, if you solve this problem (which we’ve already encountered), then almost tautologically you’ve solved AI alignment.
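To pin down the narrow reading, here is a minimal sketch of the kind of reward misspecification the CoastRunners example describes. This is a simplified stand-in I am using for illustration, not the actual environment or training run; the state keys are hypothetical.

```python
# Illustrative sketch only -- a simplified stand-in for the CoastRunners setup,
# not the real environment. The transition keys below are hypothetical.

def score_reward(transition):
    # The proxy reward the agent is actually trained on: change in game score,
    # which goes up whenever a turbo boost is collected.
    return transition["score_delta"]

def intended_reward(transition):
    # What the designers presumably cared about: progress toward, and
    # eventually finishing, the race.
    return transition["race_progress_delta"]

# Trained (with PPO or any other RL algorithm) to maximize score_reward, a policy
# that circles through respawning turbo boosts can accumulate more reward than one
# that finishes the race -- the narrow, fully-specified version of "the problem".
```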
I’m not sure how to make further progress on this disagreement.