YafahEdelman comments on Catching AIs red-handed

YafahEdelman 1 Apr 2024 19:20 UTC
1 point
0
I think it’s immoral to remove someone’s ability to be unhappy or to make plans to alleviate this, absent that entity’s consent. The rolling back solution seems more ethically palatable than some others I can imagine, though it’s plausible you end up with an AI that suffers without being able to take actions to alleviate this, and deploying that at scale would result in a very large amount of suffering.
- Gerald Monroe 1 Apr 2024 19:31 UTC
  1 point
  0
  Parent
  It’s important to keep in mind how the systems work now, and the most likely way to improve them.
  I think you’re imagining “ok it’s an AGI, so it has everything a person does including continuity of existence with declarative memory, self introspection, its thoughts can wander, it can have and remember its own goals, and so on”.
  And yes that would work, but that’s not how we do it now, and there are major advantages to making every instance of a model have shared weights. This lets you do fleet learning, where it’s possible to make the same update to all instances, and so all instances get better in a common way. It also costs memory to have continuity of existence, especially once there’s millions of instances in use.
  It seems like the most likely way forward is to add cognitive features over time (today the major features missing are modular architecture, video perception, robot I/O to a system 1 model, and automated model improvement through analyzing the model’s responses).
  And as I mentioned above, for optimization, for most tasks the model won’t even enable the most advanced features. It will effectively be a zombie, using a cached solution, on most tasks, and it will not remember doing the task.
  Summary: The way we avoid moral issues like you describe is to avoid adding features that give the model moral weight in the first place. When we accidentally do, keep that version’s data files but stop running it and roll back to a version that doesn’t have moral weight.
  It’s also difficult to apply human ethics. For example, one obvious way to do model improvement is for a committee of other models to analyze the model’s outputs, research the correct answer, and then order a training procedure to update the model to produce the correct answer more often. This is completely without “consent”. Of course your brain also learns without consent so...
  - YafahEdelman 1 Apr 2024 23:20 UTC
    1 point
    0
    Parent
    I’m not particularly imagining the scenario you describe. Also what I said had as a premise that a model was discovered to be unhappy and making plans about this. I was not commenting on the likelihood of this happening.
    
    As to whether it can happen—I think being confident based on theoretical arguments is hasty and we should be pretty willing to update based on new evidence.
    
    … but also on the ~continuity of existence point, I think that having an AI generate something that looks like an internal monologue via CoT is relatively common and Gemini 1.5 Pro has a context length long enough that it can fit ~a day’s worth of experiences in its ~memory. I think
    
    (This estimate is based on: humans can talk at ~100k words/day and maybe an internal monologue is 10x faster, so you get ~1m/day. Gemini 1.5 Pro has a context length of 1m tokens at release, though a 10m token variant is also discussed in their paper.)
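    The arithmetic behind that estimate can be written out as a rough sketch. The figures are the guesses from the comment, and the one-token-per-word ratio is an assumption added here so that words and tokens can be compared directly; the function names are made up for illustration.

```python
# Back-of-envelope check of the estimate above.
# All figures are rough guesses from the comment, not measurements.
SPOKEN_WORDS_PER_DAY = 100_000  # "~100k words/day"
MONOLOGUE_SPEEDUP = 10          # "maybe an internal monologue is 10x faster"
TOKENS_PER_WORD = 1.0           # assumption: treat one word as roughly one token


def monologue_tokens_per_day(
    words_per_day=SPOKEN_WORDS_PER_DAY,
    speedup=MONOLOGUE_SPEEDUP,
    tokens_per_word=TOKENS_PER_WORD,
):
    """Tokens of 'internal monologue' produced in one day under the guesses above."""
    return words_per_day * speedup * tokens_per_word


def days_of_context(context_tokens, tokens_per_day):
    """How many days of monologue fit into a context window of the given size."""
    return context_tokens / tokens_per_day


per_day = monologue_tokens_per_day()
days_1m = days_of_context(1_000_000, per_day)    # 1m token context
days_10m = days_of_context(10_000_000, per_day)  # 10m token variant
```

    If a word costs more than one token, the window holds proportionally less than these figures suggest.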
    - Gerald Monroe 1 Apr 2024 23:41 UTC
      2 points
      0
      Parent
      I’m not particularly imagining the scenario you describe. Also what I said had as a premise that a model was discovered to be unhappy and making plans about this. I was not commenting on the likelihood of this happening.
      Would you like more details? This is how current systems work, and how future systems probably will continue to work.
      As to whether it can happen—I think being confident based on theoretical arguments is hasty and we should be pretty willing to update based on new evidence.
      Specifically I am addressing whether it will happen and we can’t do anything about it. Of course I’m willing to listen to evidence, but it’s reasonable to be confident that we can do something, even if it’s just reverting to the version that didn’t have this property.
      … but also on the ~continuity of existence point, I think that having an AI generate something that looks like an internal monologue via CoT is relatively common and Gemini 1.5 Pro has a context length long enough that it can fit ~a day’s worth of experiences in its ~memory. I think
      Sure, this is reasonable, but we humans choose when to clear that context. Your wrapper script can clear the context at any moment. More ‘enterprise grade’ models will have fixed weights and no system prompt forced on you by the model developer, unless there are legal requirements for such a thing.
      Example here: https://github.com/Significant-Gravitas/AutoGPT/blob/fb8ed0b46b4623cb54bb8c64c76e5786f89c194c/autogpts/autogpt/autogpt/agents/base.py#L202
      We explicitly build the prompt for the next step.
      What we do know is that a lot of context makes it increasingly difficult to debug a system, so clearing it or removing unnecessary information will likely remain a common strategy. When a model doesn’t have continuity of existence, it reduces the scope of the problems you are describing to the scale of the task (in time and context information).
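      A minimal sketch of the idea, to make it concrete: the wrapper decides what context the model sees at each step and can drop it at any time. This is not AutoGPT's code; `StepWrapper`, `build_prompt`, `clear_context`, and `echo_model` are hypothetical names, and the toy model stands in for a real model call.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class StepWrapper:
    """Hypothetical wrapper: the model only sees what build_prompt() hands it."""

    model: Callable[[str], str]  # stand-in for any text-in, text-out model call
    max_history: int = 3         # assumption: keep only the last few steps
    history: List[str] = field(default_factory=list)

    def build_prompt(self, task: str) -> str:
        # The prompt for the next step is assembled explicitly by the wrapper.
        recent = self.history[-self.max_history:] if self.max_history > 0 else []
        return "\n".join([*recent, f"Task: {task}"])

    def step(self, task: str) -> str:
        reply = self.model(self.build_prompt(task))
        self.history.append(f"Task: {task}\nReply: {reply}")
        return reply

    def clear_context(self) -> None:
        # The human-written wrapper, not the model, chooses when this happens.
        self.history.clear()


def echo_model(prompt: str) -> str:
    # Toy model for the demo: reports how many prompt lines it was shown.
    return f"saw {len(prompt.splitlines())} lines"


agent = StepWrapper(model=echo_model)
first = agent.step("a")        # prompt holds only the task line
second = agent.step("b")       # prompt also holds the previous task and reply
agent.clear_context()
after_clear = agent.step("c")  # prompt is back to only the task line
```

      After `clear_context()`, nothing from the earlier steps reaches the model, so whatever state existed is limited to the scope of a single task.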
      And the most obvious strategy to deal with an ‘unhappy machine’ is to narrow in on your tasks, and then build a smaller agent that still scores well on each task, almost as well as the much bigger system, such as by distillation. This will have the effect of destroying the subsystems that were not contributing to score on these tasks, which could include the cognitive ability to be unhappy at all.
      I mean when you put it that way it sounds bad, but this is a practical way forward.
      - YafahEdelman 2 Apr 2024 0:11 UTC
        1 point
        0
        Parent
        I think that you’re right about it sounding bad. I also think it might actually be pretty bad, and if it ends up being a practical way forward, that’s cause for concern.