I think it could work better if the AIs are of roughly the same power. Then, if some of them try to grab more power, or otherwise misbehave, the others can join forces to oppose them.
Ideally, there should be a way for AIs to stop each other quickly, without having to resort to an actual fight.
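To make the "stop each other without a fight" part concrete, here's a minimal sketch of one possible mechanism. The names and the quorum rule are my own illustrative assumptions, not a worked-out design: a shared circuit breaker halts an agent once enough of its roughly-equal peers vote against it, so no individual confrontation is needed.

```python
# Toy sketch (assumed design): a shared circuit breaker that halts an agent
# once a quorum of its peers votes against it, so no individual "fight" is needed.

class CircuitBreaker:
    def __init__(self, peer_ids, quorum=0.5):
        self.peer_ids = set(peer_ids)
        self.quorum = quorum            # fraction of eligible peers needed to halt a target
        self.votes = {}                 # target_id -> set of voter ids
        self.halted = set()

    def vote_to_halt(self, voter_id, target_id):
        if voter_id == target_id or voter_id not in self.peer_ids:
            return
        self.votes.setdefault(target_id, set()).add(voter_id)
        needed = self.quorum * (len(self.peer_ids) - 1)   # the target can't vote on itself
        if len(self.votes[target_id]) >= needed:
            self.halted.add(target_id)

breaker = CircuitBreaker(["a", "b", "c", "d", "e"], quorum=0.5)
for voter in ["a", "b"]:
    breaker.vote_to_halt(voter, "e")
print("e" in breaker.halted)   # True: 2 of the 4 eligible peers suffices at quorum=0.5
```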
In general, my thinking was to have enough agents that each would find at least a few others within a small range of its own level; does that make sense?
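For concreteness, here's a rough sketch of the kind of population I mean. The tier count, band width, and population size are just placeholder numbers.

```python
import random

# Illustrative sketch only: populate the environment in "tiers" of capability so
# that every agent has several near-peers. All parameters are placeholder assumptions.

def sample_population(n_tiers=10, agents_per_tier=10, band=0.05, seed=0):
    rng = random.Random(seed)
    levels = []
    for tier in range(n_tiers):
        center = (tier + 0.5) / n_tiers
        for _ in range(agents_per_tier):
            # jitter inside half the band so tier-mates stay within `band` of each other
            levels.append(center + rng.uniform(-band / 2, band / 2))
    return levels

def peers_within_band(levels, band=0.05):
    # how many comparably-capable peers does each agent have?
    return [sum(1 for j, other in enumerate(levels) if j != i and abs(other - lv) <= band)
            for i, lv in enumerate(levels)]

levels = sample_population()
print(min(peers_within_band(levels)))   # >= 9: every agent has at least its tier-mates
```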
I just mean that "wildly different levels of intelligence" is probably not necessary, and maybe even harmful, because then there would be a few very smart AIs at the top that could usurp power without the smaller AIs even noticing.
Though it might work if those AIs are the smartest but have little authority. For example, they could monitor the other AIs and raise an alarm or switch them off if they misbehave, but nothing else.
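Here's a hedged sketch of that narrow-authority role, just to pin down the interface I have in mind (the class names and threshold are made up): the monitor may be the most capable agent, but its entire action space is "raise an alarm" and "flip a kill switch".

```python
# Assumed interface for a smart-but-low-authority overseer: it can observe and
# halt, but has no other actions in the environment.

class KillSwitch:
    def __init__(self):
        self.stopped = set()

    def stop(self, agent_id):
        self.stopped.add(agent_id)

class Monitor:
    """Possibly the most capable agent, but limited to alarm/stop actions."""

    def __init__(self, kill_switch, threshold=0.8):
        self.kill_switch = kill_switch
        self.threshold = threshold   # placeholder misbehaviour cutoff

    def review(self, agent_id, misbehaviour_score):
        # misbehaviour_score stands in for whatever analysis the monitor performs
        if misbehaviour_score >= self.threshold:
            print(f"ALARM: halting {agent_id}")
            self.kill_switch.stop(agent_id)

switch = KillSwitch()
monitor = Monitor(switch)
monitor.review("agent_7", 0.93)   # over threshold -> halted
monitor.review("agent_2", 0.10)   # under threshold -> no action
print(switch.stopped)             # {'agent_7'}
```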
Part of the idea is to ultimately have a superintelligent AI treat us the way it would want to be treated if it ever met an even more intelligent being (e.g., one created by an alien species, or one that it itself creates).

In order to do that, I want it to ultimately develop a utility function that gives value to agents regardless of their intelligence.

Indeed, for this to work, intelligence cannot be the only predictor of success in this environment; agents must benefit from cooperation with those of lower intelligence. But this should certainly be doable as part of the environment design.

As part of that, the training would explicitly include the case where an agent is the smartest around for a time, but then a smarter agent comes along and treats it based on the way it treated weaker AIs. Perhaps it would even include a form of "reincarnation" where the agent doesn't know its own future intelligence level in other lives.
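To make that less abstract, here is a toy sketch of the training loop I'm imagining. It assumes three mechanics that are my own simplifications, with placeholder names and payoffs: levels are resampled each "life", cooperating with a weaker neighbour is positive-sum, and each agent's past conduct toward weaker agents is mirrored back to it later.

```python
import random

# Toy sketch under assumed mechanics (all payoffs and names are placeholders):
#  * each "life" an agent's intelligence level is resampled, so it cannot rely
#    on always being the strongest ("reincarnation");
#  * cooperating with a weaker neighbour is positive-sum for both sides;
#  * how an agent treated weaker agents is tracked as a reputation that stands
#    in for stronger agents mirroring its conduct back to it in later lives.

rng = random.Random(0)

class Agent:
    def __init__(self, name):
        self.name = name
        self.level = 0.0         # unknown in advance, resampled each life
        self.reputation = 0.0    # running record of how it treated weaker agents
        self.score = 0.0

    def reincarnate(self):
        self.level = rng.random()

def play_life(agents, cooperate_policy):
    for a in agents:
        a.reincarnate()
    ranked = sorted(agents, key=lambda a: a.level, reverse=True)
    for strong, weak in zip(ranked, ranked[1:]):
        coop = cooperate_policy(strong, weak)
        strong.reputation = 0.9 * strong.reputation + 0.1 * (1.0 if coop else -1.0)
        if coop:
            strong.score += 1.0   # positive-sum: both sides gain from cooperation
            weak.score += 1.0
        else:
            strong.score += 0.5   # defection pays a little now...
    for a in agents:
        a.score += a.reputation   # ...but past conduct is mirrored back later

def total_score(policy, n_lives=50):
    group = [Agent(f"agent_{i}") for i in range(8)]
    for _ in range(n_lives):
        play_life(group, policy)
    return round(sum(a.score for a in group), 1)

print(total_score(lambda strong, weak: True))    # cooperative population: higher total
print(total_score(lambda strong, weak: False))   # defecting population: lower total
```

The point of the comparison at the end is just that, once treatment of weaker agents is mirrored back, a population that values less intelligent agents does strictly better than one that exploits them.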
While having lower intelligence, humans could have greater authority. And the AIs' terminal goals should also be about assisting humans specifically.
Ideally, sure, except that I don't know of a way to make "assist humans" a safe goal. So I'm advocating for a variant of "treat humans as you would want to be treated", which I think can be trained.