Hi! Thank you for this outline. I would like some extra details on the following points:
“They will find bugs! Maybe stack virtual boxes with hard limits”—Why is bug-finding an issue, here? Is your scheme aimed at producing agents that will not want to escape, or agents that we’d have to contain?
“Communicate in a manner legible to us”—How would you incentivize this kind of legibility, instead of letting communication shift to whatever efficient code is most useful for agents to coordinate and get more XP?
“Have secret human avatars steal, lie and aggress to keep the agents on their toes”—What is the purpose of this part? How would definitely adversarial behavior from humans produce aligned agents?
Also see new edit: Have agents “die” and go into cold storage, both from environmental events and from old age, e.g. after 30 subjective years minus some random amount.
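(To make sure I'm reading that lifespan rule the way you intend, here is the kind of draw I imagine; the exponential discount and its mean are my own guesses, since the edit only says “some random amount”:)

```python
import random

def draw_lifespan(max_subjective_years: float = 30.0,
                  mean_discount_years: float = 5.0) -> float:
    """Lifespan = 30 subjective years minus some random amount.
    The exponential discount and its mean are guesses; the outline
    does not specify a distribution."""
    discount = random.expovariate(1.0 / mean_discount_years)
    return max(0.0, max_subjective_years - discount)

# e.g. an agent might get ~27 subjective years before cold storage
print(draw_lifespan())
```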
“They will find bugs! Maybe stack virtual boxes with hard limits”—Why is bug-finding an issue, here? Is your scheme aimed at producing agents that will not want to escape, or agents that we’d have to contain?
The point is to help friendliness emerge naturally, and bugs are a route for an agent to grow past its intended limits before that happens. If a malevolent individual agent happens to grow really fast before friendly powers are established, that could be bad.
Some of them will like it there, some will want change/escape, which can be sorted out once Earth is much safer. Containment is for our safety while friendliness is being established.
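To gesture at what “stack virtual boxes with hard limits” could look like mechanically, here is a rough sketch; the layer names and numbers are placeholders I made up, not anything from the outline:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class BoxLayer:
    """One containment layer with hard resource ceilings (illustrative units)."""
    name: str
    cpu_cores: int
    memory_gb: int
    outbound_network: bool
    outer: Optional["BoxLayer"] = None  # the layer wrapping this one

    def effective_limits(self) -> dict:
        """An inner box can never exceed what its outer box allows, so a bug
        that breaks one layer only yields the next layer's caps."""
        if self.outer is None:
            return {"cpu_cores": self.cpu_cores,
                    "memory_gb": self.memory_gb,
                    "outbound_network": self.outbound_network}
        outer = self.outer.effective_limits()
        return {"cpu_cores": min(self.cpu_cores, outer["cpu_cores"]),
                "memory_gb": min(self.memory_gb, outer["memory_gb"]),
                "outbound_network": self.outbound_network and outer["outbound_network"]}

# Hypothetical stack: hypervisor -> container -> per-agent process jail.
hypervisor = BoxLayer("hypervisor", cpu_cores=64, memory_gb=512, outbound_network=False)
container = BoxLayer("container", cpu_cores=8, memory_gb=32, outbound_network=False, outer=hypervisor)
agent_jail = BoxLayer("agent", cpu_cores=1, memory_gb=4, outbound_network=False, outer=container)

print(agent_jail.effective_limits())
# {'cpu_cores': 1, 'memory_gb': 4, 'outbound_network': False}
```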
“Communicate in a manner legible to us”—How would you incentivize this kind of legibility, instead of letting communication shift to whatever efficient code is most useful for agents to coordinate and get more XP?
It can shift. Legibility is most important in the early stages of the environment anyway. I mostly meant messaging interfaces we can log and analyze.
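Something in the spirit of this sketch is what I have in mind; the class and field names are placeholders rather than an actual interface from the outline:

```python
import json
import time
from dataclasses import dataclass, asdict

@dataclass
class Message:
    sender: str
    recipient: str
    payload: str        # free-form content; may drift toward agent-invented codes
    timestamp: float

class LoggedChannel:
    """All agent-to-agent messages pass through here, so every exchange is
    recorded in a form we can replay and analyze offline."""
    def __init__(self, log_path: str = "messages.log"):
        self.log_path = log_path

    def send(self, sender: str, recipient: str, payload: str) -> Message:
        msg = Message(sender, recipient, payload, time.time())
        with open(self.log_path, "a") as f:
            f.write(json.dumps(asdict(msg)) + "\n")
        return msg  # delivery to the recipient agent would happen here

channel = LoggedChannel()
channel.send("agent_17", "agent_42", "trade: 3 wood for 1 ore")
```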
“Have secret human avatars steal, lie and aggress to keep the agents on their toes”—What is the purpose of this part? How would definitely adversarial behavior from humans produce aligned agents?
The purpose is to ensure they learn real friendliness rather than fragile niceness. If they fell into a naive superhappy attractor (see 3WC), they would be a dangerous liability. The smart ones will understand.