Alexander Gietelink Oldenziel comments on Alexander Gietelink Oldenziel’s Shortform

Alexander Gietelink Oldenziel 4 Oct 2024 17:59 UTC
6 points
2
It’s a plausible argument imho. Time will tell.

To my mind an important dimension, perhaps the most important dimension, is how values evolve under reflection.

It’s quite plausible to me that starting with an AI that has pretty aligned values it will self-reflect into evil. This is certainly not unheard of in the real world (let alone fiction!). Of course it’s a question about the basin of attraction around helpfulness and harmlessness. I guess I have only weak priors on what this might look like under reflection, although plausibly friendliness is magic.
- Garrett Baker 4 Oct 2024 18:14 UTC
  4 points
  0
  
  It’s quite plausible to me that starting with an AI that has pretty aligned values it will self-reflect into evil.
  
  I disagree, but this could be a difference in definition of what “perfectly aligned values” means. E.g. if the AI is dumb (for an AGI) and in a rush, sure. If it’s a superintelligence already, even in a rush, it seems unlikely. [edit:] If we have found an SAE feature which seems to light up for good stuff, and down for bad stuff, 100% of the time, and then we clamp it, then yeah, that could go away on reflection.
- Noosphere89 4 Oct 2024 18:13 UTC
  4 points
  0
  Another way to say it is how values evolve in OOD situations.
  
  My general prior, albeit a reasonably weak one, is that the best single way to predict how values evolve is to look at their data sources, as well as what data they have received up to now; the second best way is to look at what their algorithms are, especially for social situations. Most of the other factors don’t matter nearly as much.