Dagon comments on jsd’s Shortform

Dagon 18 Jan 2024 19:14 UTC
2 points
Can you give a few examples where it’s both confusing and important? Almost all concrete experiments and examples I’ve seen are the latter (an instance with a context and prompt(s)), because that’s really the point of interaction and existence for LLMs. I’m not even sure what it would mean for a non-instantiated model without input to do anything.
- jsd 19 Jan 2024 7:36 UTC
  1 point
  Parent
  I’m not even sure what it would mean for a non-instantiated model without input to do anything.
  For goal-directedness, I’d interpret it as “all instances are goal-directed and share the same goal”.
  As an example, I wish Without specific countermeasures had made the distinction more explicit.
  More generally, when discussing whether a model is scheming, I think it’s useful to keep in mind worlds where some instances of the model scheme while others don’t.
  - Dagon 19 Jan 2024 18:00 UTC
    2 points
    Parent
    I don’t think I’ve seen any research about cross-instance similarity, or even measuring the impact of instance-differences (including context and prompts) on strategic/goal-oriented actions. It’s an interesting question, but IMO not as interesting as “if instances are created/selected for their ability to make and execute long-term plans, how do those instances behave”.
    How would you say humanity does on this distinction? When we talk about planning and goals, how often are we talking about “all humans”, vs “representative instances”?
    - jsd 20 Jan 2024 20:18 UTC
      1 point
      Parent
      Mostly I care about this because if there’s a small number of instances that are trying to take over, but a lot of equally powerful instances that are trying to help you, this makes a big difference. My best guess is that we’ll be in roughly this situation for “near-human-level” systems.
      I don’t think I’ve seen any research about cross-instance similarity
      I think mode-collapse (update) is sort of an example.
      How would you say humanity does on this distinction? When we talk about planning and goals, how often are we talking about “all humans”, vs “representative instances”?
      It’s not obvious how to make the analogy with humanity work in this case—maybe comparing the behavior of clones of the same person put in different situations?