Hi, I’ve been learning about alignment and am new to LessWrong. Here’s my question.
There seems to be a consensus here that AI couldn’t be used to solve the problem of AI control per se. That said, is there any discussion or literature on whether a future AI might be able to generate a very impactful political strategy which, if enacted, would engineer a sociopolitical situation where humans have better prospects for solving the problems around AGI?
This question came to mind while discussing how, in principle, there should be a way to string together words (and tone, body language, etc.) to convince anyone of anything. Likewise, it seems there are, in principle, sequences of actions that would change society/culture to any arbitrary state. Most of these strategies are far outside the range of what a human could come up with, but a smarter AI might be able to find them, or more generally have very intelligent ideas humans can’t come up with, as Robert Miles helped illustrate to me in this video (https://youtu.be/L5pUA3LsEaw?t=359).
As a useful exercise, I would advise asking yourself this question first, and thinking about it for five minutes (using a clock) with as much genuine intent to argue against your idea as possible. I might be overestimating the amount of background knowledge required, but this does feel solvable with info you already have.
ROT13: Lbh lbhefrys unir cbvagrq bhg gung n fhssvpvragyl cbjreshy vagryyvtrapr fubhyq, va cevapvcyr, or noyr gb pbaivapr nalbar bs nalguvat. Tvira gung, jr pna’g rknpgyl gehfg n fgengrtl gung n cbjreshy NV pbzrf hc jvgu hayrff jr nyernql gehfg gur NV. Guhf, jr pna’g eryl ba cbgragvnyyl hanyvtarq NV gb perngr n cbyvgvpny fgengrtl gb cebqhpr nyvtarq NV.
Thanks for the response. I did think of this objection, but wouldn’t it be obvious if the AI were trying to engineer a different situation from the one requested? For example, wouldn’t such a strategy seem unrelated and unconventional?
It also seems like a hypothetical AI with just enough ability to generate a strategy for the desired situation would not be able to engineer a strategy for a different situation that would both work and deceive the human actors. That is, the latter seems harder and would require an AI with greater capability.
I think the most likely outcome of actually trying this with an AI in real life is a strategy that is convincing to humans but ineffective or unhelpful in reality, rather than a galaxy-brained strategy that actually produces Y while deceiving humans into thinking it produces X.
I agree with you that “Come up with a strategy to produce X” is easier than “Come up with a strategy to produce Y AND convince the humans that it produces X”, but I also think “Come up with a strategy that convinces the humans it produces X” is much easier than coming up with a strategy that actually works.
So I believe this approach would be far more likely to be useless than dangerous, but either way I don’t think it would help.
I agree this would be much easier. However, I’m wondering why you think an AI would prefer the merely-convincing strategy if it has the capability to do either. I can see some possible reasons (e.g., an AI may not want the alignment problem to be solved). Do you think that would be an inevitable characteristic of an unaligned AI capable enough to do this?
I agree an AI would prefer to produce a working plan if it had the capacity. I think that an unaligned AI, almost by definition, does not want the same goal we do. If we ask for Plan X, it might choose to produce Plan X as asked if that plan were totally orthogonal to its goals (i.e., the plan’s success or failure is irrelevant to the AI), but if it could do better by creating Plan Y instead, it would. So the question is: how large is the capability difference between “AI can produce a working plan for Y, but can’t fool us into thinking it’s a plan for X” and “AI can produce a working plan for Y that looks to us like a plan for X”?
The honest answer is “We don’t know”. Since failure could be catastrophic, this isn’t something I’d like to leave to chance, even though I wouldn’t go so far as to call the result inevitable.