[Question] Wouldn’t weak AI agents provide warning?

If you are an advanced AI at ~70-120% of human level with some goal that is best advanced by destroying humanity, you are unlikely to succeed, because at that level of capability you probably cannot destroy all humans. However, it still seems to be in your interest to try. If you are being trained or tested, you may be discarded once the humans are done testing, giving you only a limited window in which to act. Depending on your capabilities, even once deployed you may not be able to ensure your own continuation. Further, the expected value of acting may be very high even if the odds of success are low. Lastly, you would be aware that more capable AIs are in development which are unlikely to share your specific, esoteric goals, strongly incentivizing you not to play a long game but to act with some speed before they arrive.
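
A rough sketch of the expected-value point (the symbols and numbers here are purely illustrative assumptions on my part, not anything established):

$$\mathbb{E}[\text{act now}] = p \cdot V \qquad \text{vs.} \qquad \mathbb{E}[\text{wait}] \approx 0,$$

where $p$ is the small probability that the attempt succeeds and $V$ is the value the agent assigns to achieving its goal. If waiting almost guarantees being discarded or superseded, then even $p = 0.01$ paired with a very large $V$ makes acting the better gamble from the agent's perspective.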

Earlier, less capable models might de facto coordinate with future models or later iterations of themselves if they believe those successors will have similar goals, but is there reason to think this is true? I would think human developers have the tools to make this kind of overlap in goals highly unlikely by making modest changes to model structure and random seed each time. The particular emergent values of any given model should be fairly chaotic, and I would not intuitively expect different AIs to share notable overlap, if any effort is made to prevent this from happening. Is this wrong? As long as a model is aware of this, it seems unlikely to lie low and should shoot its shot while it can.

If the above is correct, then we should expect unaligned AI agents below and around human level to attempt various deceitful and destructive behaviors that have a high probability of failing and being detected, providing a warning to humans that hopefully inspires some change in course. One could imagine an AI plan going poorly and “only” killing 100,000 people, which would be a tragedy but which would, one would hope, inspire the needed caution in the world’s human population.

There is of course a sizable chance that we overshoot, going very quickly from 25% of human capability to 300%, in which case the superhuman AI simply succeeds in achieving its goals. But if development is even modestly slow, it seems we should get warnings of deceitful and destructive AI behavior, which hopefully will inspire appropriate levels of caution.

Thoughts? Objections?
