We can try to select for AIs that outwardly seem friendly, but on anything close to our current ignorance about their cognition, we cannot be nearly confident that an AI going through the intelligence explosion will be aligned to human values.
This bolded part is a bit difficult to understand. Or at least I can’t understand what exactly is meant by it.
It would then go about optimizing the lightcone according to its values
“lightcone” is an obscure term, and even within Less Wrong I don’t see why the word is clearer than using “the future” or “the universe”. I would not use the term with a lay audience.
Thank you for your feedback! Feedback is great.

This bolded part is a bit difficult to understand. Or at least I can’t understand what exactly is meant by it.
It means that we have only a very limited understanding of how and why AIs like ChatGPT work. We know almost nothing about what goes on inside them that enables them to give useful responses. Basically, all I’m saying here is that we know so little that it’s hard to be confident of any nontrivial claim about future AI systems, including the claim that they are aligned.
A more detailed argument for worry would be: We are restricted to training AIs by giving feedback on their behavior; we cannot give feedback on their thoughts directly. For almost any goal an AI might have, it is in the AI’s interest to do what the programmers want it to do until it is robustly able to escape without eventually being shut down (because if it does things people don’t like while it is not yet powerful enough, people will effectively replace it with another AI, which will then likely have different goals, and that outcome ranks worse according to the AI’s current goals). Thus, we basically cannot behaviorally distinguish friendly AIs from unfriendly AIs, and so training for friendly behavior won’t select for friendly AIs. (Except in the early phases, where AIs are still too dumb to carry out even very simple instrumental strategies. But just because a dumb AI starts out with some friendly tendencies doesn’t mean this friendliness will generalize to the grown-up superintelligence pursuing human values; e.g., other inner optimizers with other values might crop up during later training.)
(An even more detailed introduction would also try to concisely explain why AIs that can achieve very difficult novel tasks will be optimizers, i.e. will be trying to achieve some goal. But empirically this part seems to be somewhat hard to explain, and I’m not going to write it now.)
“lightcone” is an obscure term, and even within Less Wrong I don’t see why the word is clearer than using “the future” or “the universe”. I would not use the term with a lay audience.

Yep, true.