“They will find bugs! Maybe stack virtual boxes with hard limits”—Why is bug-finding an issue, here? Is your scheme aimed at producing agents that will not want to escape, or agents that we’d have to contain?
The point is to help friendliness emerge naturally. If a malevolent individual agent happens to grow really fast before friendly powers are established, that could be bad.
Some of them will like it there, some will want change/escape, which can be sorted out once Earth is much safer. Containment is for our safety while friendliness is being established.
“Communicate in a manner legible to us”—How would you incentivize this kind of legibility, instead of letting communication shift to whatever efficient code is most useful for agents to coordinate and get more XP?
It can shift. Legibility is most important in the early stages of the environment anyway. I mostly meant messaging interfaces we can log and analyze.
“Have secret human avatars steal, lie and aggress to keep the agents on their toes”—What is the purpose of this part? How is this producing aligned agents from definitely adversarial behavior from humans?
The purpose is to ensure they learn real friendliness rather than fragile niceness. If they fell into a naive superhappy attractor (see 3WC), they would be a dangerous liability. The smart ones will understand.
The point is to help friendliness emerge naturally. If a malevolent individual agent happens to grow really fast before friendly powers are established, that could be bad.
Some of them will like it there, some will want change/escape, which can be sorted out once Earth is much safer. Containment is for our safety while friendliness is being established.
It can shift. Legibility is most important in the early stages of the environment anyway. I mostly meant messaging interfaces we can log and analyze.
The purpose is to ensure they learn real friendliness rather than fragile niceness. If they fell into a naive superhappy attractor (see 3WC), they would be a dangerous liability. The smart ones will understand.