In general, my thinking was to have enough agents that each would find at least a few others within a small range of its own level; does that make sense?
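To make "enough agents" a bit more concrete, here is a minimal sketch; the population size, the width of the "small range", and the minimum peer count are purely illustrative assumptions, not part of the proposal:

```python
import random

# All numbers here are illustrative assumptions only.
N_AGENTS = 1000       # size of the training population
LEVEL_RANGE = 0.05    # what counts as a "small range" around an agent's level
MIN_PEERS = 3         # how many similar-level peers each agent should find

# Sample intelligence levels densely enough that there are no big gaps.
levels = [random.uniform(0.0, 1.0) for _ in range(N_AGENTS)]

def peers_near(i: int) -> int:
    """Count other agents whose level is within LEVEL_RANGE of agent i's."""
    return sum(
        1 for j, lvl in enumerate(levels)
        if j != i and abs(lvl - levels[i]) <= LEVEL_RANGE
    )

# The design goal: every agent finds at least a few peers near its own level.
worst_case = min(peers_near(i) for i in range(N_AGENTS))
print(f"least-connected agent has {worst_case} peers within ±{LEVEL_RANGE}")
print("goal met" if worst_case >= MIN_PEERS else "population too sparse")
```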
I just mean that “wildly different levels of intelligence” is probably not necessary, and maybe even harmful, because then there would be a few very smart AIs at the top, which could usurp power without the smaller AIs even noticing.

Though it might still work if those AIs are the smartest but have little authority. For example, they could monitor other AIs and raise an alarm or switch them off if they misbehave, but nothing else.
Part of the idea is to ultimately have a super intelligent AI treat us the way it would want to be treated if it ever met an even more intelligent being (eg, one created by an alien species, or one that it itself creates). In order to do that, I want it to ultimately develop a utility function that gives value to agents regardless of their intelligence.

Indeed, in order for this to work, intelligence cannot be the only predictor of success in this environment; agents must benefit from cooperation with those of lower intelligence. But this should certainly be doable as part of the environment design.

As part of that, the training would explicitly include the case where an agent is the smartest around for a time, but then a smarter agent comes along and treats it based on the way it treated weaker AIs. Perhaps even include a form of “reincarnation” where the agent doesn’t know its own future intelligence level in other lives.
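To gesture at the kind of environment loop I have in mind, here is a toy sketch; the specific payoffs, the random `kindness` stand-in for a learned policy, and the simple reciprocity rule are all placeholder assumptions rather than a worked-out design:

```python
import random
from dataclasses import dataclass, field

# Toy sketch: every number and rule here is a placeholder assumption,
# chosen only to illustrate the structure described above.

@dataclass
class Agent:
    level: float  # current "intelligence" level
    kindness_history: list = field(default_factory=list)  # how it treated weaker agents

def interact(actor: Agent, weaker: Agent) -> float:
    """The actor chooses how well to treat a weaker agent; cooperation pays both."""
    kindness = random.random()  # stand-in for the agent's learned policy
    actor.kindness_history.append(kindness)
    # Design constraint: helping a weaker agent also benefits the actor,
    # so intelligence is not the only predictor of success.
    return 0.5 * kindness + 0.5 * weaker.level

def judgment(agent: Agent, smarter: Agent) -> float:
    """A smarter agent later treats the agent the way it treated weaker ones."""
    if not agent.kindness_history:
        return 0.0
    reciprocity = sum(agent.kindness_history) / len(agent.kindness_history)
    return reciprocity * smarter.level

def reincarnate(agent: Agent) -> Agent:
    """Next 'life': the agent cannot predict its future intelligence level."""
    return Agent(level=random.random())

def lifetime(agent: Agent, population: list) -> float:
    reward = 0.0
    for other in population:
        if other.level < agent.level:
            reward += interact(agent, other)
    # Eventually a smarter agent shows up and mirrors the agent's past behaviour.
    smarter = Agent(level=agent.level + random.random())
    reward += judgment(agent, smarter)
    return reward

# One toy "reincarnation" cycle.
population = [Agent(level=random.random()) for _ in range(20)]
me = Agent(level=0.9)       # smartest around, for a while
total = lifetime(me, population)
me = reincarnate(me)        # unknown level in the next life
print(f"lifetime reward: {total:.2f}, next-life level: {me.level:.2f}")
```

The point is only the structure: cooperation with weaker agents pays off, a smarter agent eventually mirrors your past behaviour, and the next life’s level is unknown in advance.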
While having lower intelligence, humans may have greater authority. And the AIs’ terminal goals should be about assisting humans specifically, too.
Ideally, sure, except that I don’t know of a way to make “assist humans” a safe goal. So I’m advocating for a variant of “treat humans as you would want to be treated”, which I think can be trained.