For the first problem, the basic idea is that this is used to solve the specification problem of defining values and training a “conscience”, rather than it being the full extent of training. The conscience can remain static, and provide goals for the rest of the “brain”, which can then update its beliefs.
For the second issue, I meant that we would have no objective way to check “cooperate” and “respect” on the individual agent level, except that the individual can get other agents to cooperate with it. So eg, in order to survive/reproduce/get RL rewards, the agents have to consume a virtual resource that requires effort from multiple/many agents (simple implementation: some sort of voting; but can be more complicated, eg requiring tokens that are generated at a fixed rate for each agent), but also generally be non-competitive, eg no stealing tokens or food, and there’s more than enough food for everyone, if they can cooperate. The theory is that this should lead to a form of tit-for-tat, including AIs detecting and deterring liars.
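To make that concrete, here's roughly the kind of toy environment I'm imagining; everything in it (class names, pool size, reward values) is a placeholder I made up for illustration, not a worked-out design:

```python
# Toy sketch of the "token economy" idea above: tokens are minted at a fixed
# rate per agent, eating requires pooled effort from several distinct agents,
# nothing can be stolen, and food is unlimited. All names and numbers are
# illustrative placeholders, not a spec.
from dataclasses import dataclass

@dataclass(eq=False)  # eq=False keeps identity hashing so agents can go in sets
class Agent:
    name: str
    tokens: int = 0
    reward: float = 0.0

class TokenEconomy:
    """Each step every agent mints one token; a 'meal' costs one token from
    each of `pool_size` distinct agents, and every contributor is rewarded,
    so the only way to score is to find willing co-contributors."""
    def __init__(self, agents, pool_size=3, food_reward=1.0):
        self.agents = agents
        self.pool_size = pool_size
        self.food_reward = food_reward

    def step(self, proposed_pools):
        for a in self.agents:           # fixed token generation rate
            a.tokens += 1
        for pool in proposed_pools:     # a pool only "eats" if enough agents chip in
            members = [a for a in set(pool) if a.tokens > 0]
            if len(members) >= self.pool_size:
                for a in members[: self.pool_size]:
                    a.tokens -= 1
                    a.reward += self.food_reward

# Three agents who always pool together prosper; the abstainer earns nothing.
agents = [Agent(f"a{i}") for i in range(4)]
env = TokenEconomy(agents)
for _ in range(10):
    env.step(proposed_pools=[agents[:3]])
print({a.name: a.reward for a in agents})  # {'a0': 10.0, 'a1': 10.0, 'a2': 10.0, 'a3': 0.0}
```

The point is just that no agent can score on its own: the only path to reward runs through getting others to chip in voluntarily.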
Thinking a bit more: I think the really dangerous part of AI is the “independent agent”, presumably trained with methods resembling RL; so that’s the part I would train in this environment; it can then be hooked up to eg an LLM which is optimized on something like perplexity and acts more like ChatGPT, ie predicting the next word. Ie, have a separate “brain” and “conscience”, with the brain possibly smarter but the “conscience” holding the reins; during the above training, mix different variants of both components, with different intelligence levels.
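In terms of wiring, I'm imagining something like the skeleton below; all the interfaces and the toy rules inside are hypothetical placeholders, not an actual design:

```python
# Sketch of the "brain vs. conscience" split: the conscience (the small
# RL-trained component from the simulation) picks goals and can veto, while the
# brain (e.g. an LLM trained on next-word prediction) only does the clever
# planning. Everything below is a stand-in for illustration.
class Conscience:
    """Frozen after training in the multi-agent environment; not updated later."""
    def choose_goal(self, observation: str) -> str:
        # Placeholder for the learned cooperative policy.
        return "obtain food by pooling tokens with willing partners"

    def approves(self, plan: str) -> bool:
        # Placeholder standing in for the learned "don't defect" instinct.
        return "steal" not in plan and "deceive" not in plan

class Brain:
    """Possibly much smarter than the conscience, but it only proposes plans."""
    def plan(self, observation: str, goal: str) -> str:
        return f"ask neighbours to contribute tokens so we can {goal}"

def act(conscience: Conscience, brain: Brain, observation: str):
    goal = conscience.choose_goal(observation)            # conscience sets the goal...
    plan = brain.plan(observation, goal)                  # ...brain works out how...
    return plan if conscience.approves(plan) else None    # ...conscience holds the veto

print(act(Conscience(), Brain(), observation="hungry, three agents nearby"))
```

During training you would swap in different brains and consciences of varying capability, so the conscience's veto has to hold up against planners smarter than itself.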
Okay, so if that’s just a small component, then sure, the first issue goes away (though I still have questions about how you’re going to make this simulation realistic enough to just hook it up to an LLM or “something smart” and expect it to set coherent and meaningful goals in real life; but that’s more of a technical issue).
However, I believe there are still other issues with this approach. The way you describe it sounds really similar to Axelrod’s Iterated Prisoner’s Dilemma tournament, and that tournament did show tit-for-tat to be one of the most successful strategies. But it wasn’t the only successful one. For example, there were strategies that were mostly tit-for-tat but would defect if they could get away with it. If your training still mostly results in tit-for-tat, except for some rare cases of defecting when the agents in question are too stupid to “vote it out”, do we punish it?
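To make that worry concrete, here's a toy iterated-dilemma sketch; the payoffs are the standard Axelrod values, but the "opportunist" strategy is just something I made up for illustration:

```python
# A strategy that is mostly tit-for-tat but keeps defecting whenever it goes
# unpunished exploits naive agents, yet behaves itself (and scores almost as
# well as honest tit-for-tat) whenever someone pushes back.
PAYOFF = {("C", "C"): (3, 3), ("C", "D"): (0, 5),
          ("D", "C"): (5, 0), ("D", "D"): (1, 1)}

def tit_for_tat(my_hist, their_hist):
    return "C" if not their_hist else their_hist[-1]

def always_cooperate(my_hist, their_hist):
    return "C"

def opportunist(my_hist, their_hist):
    """Probe with a defection; keep exploiting anyone who never retaliates,
    but apologise and fall back to tit-for-tat as soon as someone pushes back."""
    if not my_hist:
        return "D"                 # probe on the first move
    if "D" not in their_hist:
        return "D"                 # got away with it so far: keep defecting
    if their_hist[-1] == "D":
        return "C"                 # being punished: cooperate to de-escalate
    return their_hist[-1]          # otherwise: plain tit-for-tat

def play(strat_a, strat_b, rounds=50):
    hist_a, hist_b, score_a, score_b = [], [], 0, 0
    for _ in range(rounds):
        a, b = strat_a(hist_a, hist_b), strat_b(hist_b, hist_a)
        pa, pb = PAYOFF[(a, b)]
        score_a += pa; score_b += pb
        hist_a.append(a); hist_b.append(b)
    return score_a, score_b

print(play(tit_for_tat, always_cooperate))  # (150, 150): honest cooperation
print(play(opportunist, always_cooperate))  # (250, 0): exploits the non-retaliator
print(play(opportunist, tit_for_tat))       # (147, 147): looks almost like TFT here
```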
Second, tit-for-tat is quite susceptible to noise. What if the trained agent misinterprets someone’s actions as “bad”, even though they actually did something innocent, like “affectionately pat their friend on the back”, which the AI read as fighting? No matter how smart the AI gets, there will still be cases where it wrongly believes someone to be a liar, and therefore believes it has every justification to “deter” them. Do we really want even a small probability of that behaviour? How about an AI that doesn’t hurt humans unconditionally, rather than only when it believes us to be “good agents”?[1]
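Here's a quick sanity check of that failure mode, with two tit-for-tat players whose perception of each other is occasionally wrong; the noise rate and payoffs are just numbers I picked for illustration:

```python
# Both players genuinely cooperate, but each reading of the other's move is
# misperceived with probability `noise`. A single misread ("pat on the back"
# seen as a punch) triggers a retaliation echo that drags the average score
# below the 3.0 per round of clean mutual cooperation.
import random

def noisy_tft_match(rounds=1000, noise=0.05, seed=0):
    rng = random.Random(seed)
    perceived_a, perceived_b = "C", "C"    # what each side believes the other did last
    score = 0
    for _ in range(rounds):
        a = perceived_b                    # A mirrors what it thinks B did
        b = perceived_a                    # B mirrors what it thinks A did
        score += {("C", "C"): 3, ("D", "D"): 1}.get((a, b), 0 if a == "C" else 5)
        # Each side misreads the other's actual move with probability `noise`.
        perceived_b = b if rng.random() > noise else ("D" if b == "C" else "C")
        perceived_a = a if rng.random() > noise else ("D" if a == "C" else "C")
    return score / rounds                  # player A's average payoff per round

print(noisy_tft_match(noise=0.0))   # 3.0: with perfect perception they cooperate forever
print(noisy_tft_match(noise=0.05))  # noticeably below 3.0: misreads trigger feuds
```

Even a few percent of misreads is enough to keep dragging two genuinely cooperative agents into retaliation spirals.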
Last thing: how does the AI determine what is an “intelligent agent” and what is a “rock”? If there are explicit tags in the simulation for that, then how do you make sure every actual human in the real world gets tagged correctly? Also, do we count animals? Should animal abuse be enough justification for the AI to turn on rage mode? What about accidentally stepping on an ant? And if you define “intelligent agent” relative to the AI, then what do we do once it gets smart enough to rationally think of us like ants?
Thanks for helping me think this through.
[1] Somehow that reminds me of the “I have been a good Bing, you have not been a good user” situation.