Dweomite comments on If we had known the atmosphere would ignite

Dweomite 19 Aug 2023 2:00 UTC
1 point
0
When you say that “aligned AGI” might need to solve some impossible problem in order to function at all, do you mean:
1. Coherence is impossible; any AGI will inevitably sabotage itself
2. Coherent AGI can exist, but there’s some important sense in which it would not be “aligned” with anything, not even itself
3. You could have an AGI that is aligned with some things, but not the particular things we want to align it with, because our particular goals are hard in some special way that makes the problem impossible
4. You can’t have a “universally alignable” AGI that accepts an arbitrary goal as a runtime input and self-aligns to that goal
5. Something else
- dr_s 19 Aug 2023 9:16 UTC
  3 points
  0
  Something in between 1 and 2. Basically, that you can’t have a program that is both general enough to act reflexively on the substrate within which it is running (a Turing machine that understands it is a machine, understands the hardware it is running on, understands it can change that hardware or its own programming) and at the same time is able to guarantee sticking to any given set of values or constraints, especially if those values encompass its own behaviour (so a bit of 3, since any desirable alignment values are obviously complex enough to encompass the AGI itself).
  
  Not sure how to formalize that precisely, but I can imagine something to that effect being true. Or even something instead like “you cannot produce a proof that any given sufficiently generally intelligent program will stick to any given constraints; it might, but you can’t know beforehand”.
  - Remmelt 5 Nov 2024 10:03 UTC
    1 point
    0
    For an overview of why such a guarantee would turn out to be impossible, I suggest taking a look at Will Petillo’s post Lenses of Control.
  - Dweomite 19 Aug 2023 18:24 UTC
    1 point
    0
    I can write a simple program that modifies its own source code and then modifies it back to its original state, in a trivial loop. That’s acting on its own substrate while provably staying within extremely tight constraints. Does that qualify as a disproof of your hypothesis?
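A minimal sketch of the kind of program described above, in Python. Everything here is an illustrative assumption rather than anything from the thread: the names `digest`, `modify_and_restore`, and `demo` are hypothetical, and the demo runs on a throwaway temporary file standing in for the program's source. Passing the script's own `__file__` to `modify_and_restore` would make it act on its own source instead.

```python
import hashlib
import os
import tempfile


def digest(path):
    """Return the SHA-256 hex digest of the file at `path`."""
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()


def modify_and_restore(path, cycles=3):
    """Append a marker to the file, then write the original bytes back.

    Returns True if the file matched its original digest after every cycle.
    """
    with open(path, "rb") as f:
        original = f.read()
    baseline = hashlib.sha256(original).hexdigest()
    for _ in range(cycles):
        # Step 1: modify the source.
        with open(path, "ab") as f:
            f.write(b"\n# temporary self-modification\n")
        assert digest(path) != baseline
        # Step 2: modify it back to the original state.
        with open(path, "wb") as f:
            f.write(original)
        if digest(path) != baseline:
            return False
    return True


def demo():
    """Run the loop on a temporary stand-in file (hypothetical setup)."""
    fd, path = tempfile.mkstemp(suffix=".py")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(b"print('hello')\n")
        return modify_and_restore(path)
    finally:
        os.remove(path)
```

The constraint being maintained is simply "the file's bytes equal the original after each cycle", which the digest comparison checks directly.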
    - dr_s 19 Aug 2023 18:40 UTC
      3 points
      0
      I wouldn’t say it does, any more than a program that can identify whether a very specific class of programs will halt disproves the Halting Theorem. I’m just gesturing in what I think might be the general direction of where a proof may lie; usually recursion is where such traps hide. Obviously a rigorous proof would need rigorous definitions and all.
      - Dweomite 19 Aug 2023 20:23 UTC
        1 point
        0
        “A program that can identify whether a very specific class of programs will halt” does disprove the stronger analog of the Halting Theorem that (I argued above) you’d need in order for it to make alignment impossible.
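As one illustration of a halting decider for a very specific class of programs, here is a minimal Python sketch. It assumes a deliberately restricted setting that is not from the thread: a "program" is a deterministic step function over a finite set of hashable states, returning `None` when it halts. The names `halts`, `countdown`, and `mod_cycle` are hypothetical.

```python
def halts(step, start):
    """Decide halting for a deterministic program with finitely many states.

    `step(state)` returns the next state, or None once the program halts.
    This loop only terminates if the reachable state space is finite.
    """
    seen = set()
    state = start
    while state is not None:
        if state in seen:
            # A repeated state in a deterministic program means it loops forever.
            return False
        seen.add(state)
        state = step(state)
    return True


def countdown(n):
    # Halts: counts down to zero, then stops.
    return None if n == 0 else n - 1


def mod_cycle(n):
    # Never halts: cycles through 0, 1, 2 forever.
    return (n + 1) % 3
```

Here `halts(countdown, 5)` reports halting and `halts(mod_cycle, 0)` reports non-halting. The decider works only because the class is restricted; it says nothing about arbitrary programs.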