Okay, so if that’s just a small component, then sure, the first issue goes away (though I still have questions about how you’re going to make this simulation realistic enough that you can just hook it up to an LLM or “something smart” and expect it to set coherent and meaningful goals in real life; that’s more of a technical issue, though).
However, I believe there are still other issues with this approach. The way you describe it sounds a lot like Axelrod’s Iterated Prisoner’s Dilemma tournament, which did produce tit-for-tat as one of the most successful strategies. But that wasn’t the only successful strategy. For example, there were strategies that were mostly tit-for-tat but would defect if they could get away with it. If training still mostly results in tit-for-tat, except for some rare cases of defecting when the agents in question are too stupid to “vote it out”, do we punish it?
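To make the worry concrete, here is a minimal sketch of that tournament dynamic — not Axelrod’s actual code; the strategy names, the 5% defection rate, and the round count are all illustrative, with the standard prisoner’s dilemma payoff matrix:

```python
import random

# Standard prisoner's dilemma payoffs: C = cooperate, D = defect.
PAYOFF = {
    ("C", "C"): 3, ("C", "D"): 0,
    ("D", "C"): 5, ("D", "D"): 1,
}

def tit_for_tat(my_hist, their_hist):
    # Cooperate first, then copy the opponent's previous move.
    return their_hist[-1] if their_hist else "C"

def always_cooperate(my_hist, their_hist):
    # An agent "too stupid to vote anyone out": it never retaliates.
    return "C"

def sneaky_tit_for_tat(my_hist, their_hist):
    # Mostly tit-for-tat, but occasionally defects to see if it can
    # get away with it (5% chance per round, chosen arbitrarily).
    if random.random() < 0.05:
        return "D"
    return tit_for_tat(my_hist, their_hist)

def play(strat_a, strat_b, rounds=200):
    hist_a, hist_b = [], []
    score_a = score_b = 0
    for _ in range(rounds):
        move_a = strat_a(hist_a, hist_b)
        move_b = strat_b(hist_b, hist_a)
        score_a += PAYOFF[(move_a, move_b)]
        score_b += PAYOFF[(move_b, move_a)]
        hist_a.append(move_a)
        hist_b.append(move_b)
    return score_a, score_b

random.seed(0)
a, b = play(sneaky_tit_for_tat, tit_for_tat)
c, d = play(sneaky_tit_for_tat, always_cooperate)
print("vs tit-for-tat:      ", a, b)
print("vs naive cooperator: ", c, d)
```

Against tit-for-tat the sneaky variant gains almost nothing, because every defection except possibly the last one gets echoed back; against the non-retaliating cooperator it profits from every single defection. That’s exactly the worrying case: rare defection only gets punished if the other agents are capable of retaliating.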
Second, tit-for-tat is quite susceptible to noise. What if the trained agent misinterprets someone’s actions as “bad”, even though they actually did something innocent, like affectionately patting a friend on the back, which the AI read as fighting? No matter how smart the AI gets, there will still be cases where it wrongly believes someone to be a liar, and so believes it has every justification to “deter” them. Do we really want even some small probability of that behaviour? How about an AI that unconditionally doesn’t hurt humans, not just when it believes us to be “good agents”?[1]
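The noise problem can be sketched in a few lines too — two pure tit-for-tat agents where, with some small probability, a cooperative move is misread as a defection (the “friendly pat read as a punch” case). The noise model and parameters are made up for illustration:

```python
import random

def noisy_match(noise, rounds=500, seed=1):
    """Two tit-for-tat agents with perception noise.

    With probability `noise`, a cooperative move (C) is misperceived
    as a defection (D). Returns the fraction of rounds in which both
    agents actually cooperated.
    """
    rng = random.Random(seed)
    a_prev = b_prev = "C"  # both start cooperative
    mutual_coop = 0
    for _ in range(rounds):
        # Each side perceives the other's last move, possibly wrongly.
        a_sees = "D" if (b_prev == "C" and rng.random() < noise) else b_prev
        b_sees = "D" if (a_prev == "C" and rng.random() < noise) else a_prev
        a_move, b_move = a_sees, b_sees  # tit-for-tat: copy what you saw
        if a_move == "C" and b_move == "C":
            mutual_coop += 1
        a_prev, b_prev = a_move, b_move
    return mutual_coop / rounds

print(noisy_match(noise=0.0))   # 1.0: perfect perception, perfect cooperation
print(noisy_match(noise=0.02))  # cooperation collapses after a misreading
```

One misreading triggers an alternating retaliation echo, and a second one during the echo locks both agents into mutual defection for good, even though nobody ever did anything “bad”. That’s the failure mode: the original mistake is long forgotten, but the “deterrence” persists.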
Last thing: how does the AI determine what is an “intelligent agent” and what is a “rock”? If the simulation has explicit tags for that, then how do you make sure every actual human in the real world gets tagged correctly? Also, do animals count? Should animal abuse be enough justification for the AI to turn on rage mode? What about accidentally stepping on an ant? And if you define “intelligent agent” relative to the AI, then what do we do once it gets smart enough to rationally think of us as ants?
Somehow this reminds me of the “I have been a good Bing, you have not been a good user” situation.