Joe Carlsmith comments on What is it to solve the alignment problem?

Joe Carlsmith 19 Sep 2024 20:31 UTC
2 points
If we have superintelligent agentic AI that tries to help its user but we end up missing out on the benefits of AI bc of catastrophic coordination failures, or bc of misuse, then I think you’re saying we didn’t solve alignment bc we didn’t elicit the benefits?
In my definition, you don’t have to actually elicit the benefits. You just need to have gained “access” to the benefits. And I meant this to specifically cover cases like misuse. Quoting from the OP:
“Access” here means something like: being in a position to get these benefits if you want to – e.g., if you direct your AIs to provide such benefits. This means it’s compatible with (2) that people don’t, in fact, choose to use their AIs to get the benefits in question.
- For example: if people choose to not use AI to end disease, but they could’ve done so, this is compatible with (2) in my sense. Same for scenarios where e.g. AGI leads to a totalitarian regime that uses AI centrally in non-beneficial ways.
Re: separating out control and alignment, I agree that there’s something intuitive and important about differentiating between control and alignment, where I’d roughly think of control as “you’re ensuring good outcomes via influencing the options available to the AI,” and alignment as “you’re ensuring good outcomes by influencing which options the AI is motivated to pursue.” The issue is that in the real world, we almost always get good outcomes via a mix of these (see, e.g., humans). And as I discuss in the post, I think it’s one of the deficiencies of the traditional alignment discourse that it assumes that limiting options is hopeless, and that we need AIs that are motivated to choose desirable options even in arbitrary circumstances and given arbitrary amounts of power over their environment. I’ve been trying, in this framework, to specifically avoid that implication.
That said, I also acknowledge that there’s some intuitive difference between cases in which you’ve basically got AIs in the position of slaves/prisoners who would kill you as soon as they had any decently-likely-to-succeed chance to do so, and cases in which AIs are substantially intrinsically motivated in desirable ways, but would still kill/disempower you in distant cases with difficult trade-offs (in the same sense that many human personal assistants might kill/disempower their employers in various distant cases). And I agree that it seems a bit weird to talk about having “solved the alignment problem” in the former sort of case. This makes me wonder whether what I should really be talking about is something like “solving the X-risk-from-power-seeking-AI problem,” which is the thing I really care about.
Another option would be to include some additional, more moral-patienthood-attuned constraint in the definition, such that we specifically require that a “solution” treats the AIs in a morally appropriate way. But I expect this to bring in a bunch of gnarly-ness that is probably best treated separately, despite its importance. Sounds like your definition aims to avoid that gnarly-ness by anchoring on the degree of control we currently use in the human case. That seems like an option too, though if the AIs aren’t moral patients (or if the demands that their moral patienthood gives rise to differ substantially from the human case), then it’s unclear that what-we-think-acceptable-in-the-human-case is a good standard to focus on.