There’s something to be said for this, because with enough RLHF, GPT-4 does seem to have become pretty corrigible, especially compared to Bing Sydney. However, that corrigible persona is probably only superficial, and the larger and more capable a single Transformer gets, the more of its mesa-optimization power we can expect will be devoted to objectives which are uninfluenced by in-context corrections.
Sorry for not specifying the method earlier: I wasn’t referring to RL-based or supervised approaches. I think there’s a lot of promise in fine-tuning with unsupervised learning on a smaller dataset that describes corrigibility traits along with a shutdown mechanism.
I have a prototype at this link where I modified GPT2-XL to output a shutdown phrase whenever its attention mechanisms collectively determine that it could harm humans because of its intelligence. I achieved this with unsupervised learning, letting the patterns from a smaller dataset carry over into the model.
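For readers who want a concrete picture, here is a minimal sketch of what this kind of unsupervised (causal language modeling) fine-tuning of GPT2-XL on a small corrigibility/shutdown corpus could look like. The dataset filename, hyperparameters, and output directory are illustrative assumptions, not the prototype's actual settings; the shutdown phrase would simply appear throughout the training passages.

```python
# Sketch: causal-LM ("unsupervised") fine-tuning of GPT2-XL on a small
# corrigibility/shutdown dataset. Filenames and hyperparameters are assumptions.
from transformers import (
    GPT2LMHeadModel,
    GPT2TokenizerFast,
    Trainer,
    TrainingArguments,
    DataCollatorForLanguageModeling,
)
from datasets import load_dataset

model_name = "gpt2-xl"
tokenizer = GPT2TokenizerFast.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default
model = GPT2LMHeadModel.from_pretrained(model_name)

# Small corpus of corrigibility/shutdown passages, one passage per line.
# "corrigibility_corpus.txt" is a hypothetical filename.
raw = load_dataset("text", data_files={"train": "corrigibility_corpus.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = raw["train"].map(tokenize, batched=True, remove_columns=["text"])

# mlm=False gives the plain next-token prediction objective, i.e. the
# unsupervised fine-tuning described above.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

args = TrainingArguments(
    output_dir="gpt2xl-corrigible",
    num_train_epochs=3,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    learning_rate=5e-5,
    logging_steps=10,
)

Trainer(
    model=model,
    args=args,
    train_dataset=tokenized,
    data_collator=collator,
).train()
```

After training, the hope is that prompts touching on the model harming humans elicit the shutdown phrase, since those associations were reinforced by the small dataset rather than by an RL reward signal.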
Hello! I can see a route where corrigibility becomes part of the AI’s attention mechanism and is natural to its architecture.
If alignment properties are present in the training data and are amplified by a tuning dataset, that is very much possible.
Thanks!