jsd comments on jsd’s Shortform

jsd 19 Jan 2024 7:36 UTC
1 point
I’m not even sure what it would mean for a non-instantiated model without input to do anything.
For goal-directedness, I’d interpret it as “all instances are goal-directed and share the same goal”.
As an example, I wish Without specific countermeasures had made the distinction more explicit.
More generally, when discussing whether a model is scheming, I think it’s useful to keep in mind worlds where some instances of the model scheme while others don’t.
- Dagon 19 Jan 2024 18:00 UTC
  2 points
  Parent
  I don’t think I’ve seen any research about cross-instance similarity, or even measuring the impact of instance-differences (including context and prompts) on strategic/goal-oriented actions. It’s an interesting question, but IMO not as interesting as “if instances are created/selected for their ability to make and execute long-term plans, how do those instances behave”.
  How would you say humanity does on this distinction? When we talk about planning and goals, how often are we talking about “all humans”, vs “representative instances”?
  - jsd 20 Jan 2024 20:18 UTC
    1 point
    Parent
    Mostly I care about this because if there’s a small number of instances that are trying to take over, but a lot of equally powerful instances that are trying to help you, this makes a big difference. My best guess is that we’ll be in roughly this situation for “near-human-level” systems.
    I don’t think I’ve seen any research about cross-instance similarity
    I think mode-collapse (update) is sort of an example.
    How would you say humanity does on this distinction? When we talk about planning and goals, how often are we talking about “all humans”, vs “representative instances”?
    It’s not obvious how to make the analogy with humanity work in this case—maybe comparing the behavior of clones of the same person put in different situations?