Seth Herd comments on Kabir Kumar’s Shortform

Seth Herd 4 Nov 2024 13:55 UTC
5 points
3
Freely changing an AGI’s goals is corrigibility, which is a huge advantage if you can get it. See Max Harms’ corrigibility sequence and my “instruction-following AGI is easier....”

The question is how to reliably get such a thing. Goalcrafting is one part of the problem, and I agree that those are good goals; the other and larger part is technical alignment, getting those desired goals to really work that way in the particular first AGI we get.
- Kabir Kumar 4 Nov 2024 14:49 UTC
  3 points
  0
  Yup, those are hard. Was just thinking of a definition for the alignment problem, since I’ve not really seen any good ones.
  - Seth Herd 4 Nov 2024 16:13 UTC
    5 points
    0
    I’d say you’re addressing the question of goalcrafting or selecting alignment targets.
    
    I think you’ve got the right answer for technical alignment goals, but the question remains of what human would control that AGI. See my “if we solve alignment, do we all die anyway” for the problems with that scenario.
    
    Spoiler alert: we do all die anyway if really selfish people get control of AGIs. And selfish people tend to work harder at getting power.
    
    But I do think your goal definition is a good alignment target for the technical work. I don’t think there’s a better one. I do prefer instruction following or corrigibility by the definitions in the posts I linked above because they’re less rigid, but they’re both very similar to your definition.
    - Kabir Kumar 5 Nov 2024 1:27 UTC
      1 point
      0
      I pretty much agree. I prefer rigid definitions because they’re less ambiguous to test and more robust to deception. And this field has a lot of deception.