Yes, human intelligence augmentation sounds like a good idea.
There are all sorts of “strategies” (turn it off, raise it like a kid, disincentivize changing the environment, use a weaker AI to align it) that people come up with when they’re new to the field of AI safety, but that are ineffective. And their ineffectiveness is only obvious and explainable by people who specifically know how AI behaves. Suppose there are strategies whose ineffectiveness is only obvious and explainable by people who know far more about decisions, agents, and optimal strategies than humanity has figured out so far. (Analogy: a society that only knows basic arithmetic could reasonably stumble upon and understand the Collatz conjecture; and yet, with all our mathematical development, we can’t do anything to prove it. In the same way, we could reasonably stumble upon an “alignment solution” whose failure we couldn’t demonstrate, because demonstrating it would take a much deeper understanding of these kinds of situations.)
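(For concreteness, here is the rule the Collatz conjecture is about; the sketch below is only an illustration of how simple the statement is, and the function names are my own.)

```python
def collatz_step(n: int) -> int:
    """One step of the Collatz map: halve even numbers, send odd n to 3n + 1."""
    return n // 2 if n % 2 == 0 else 3 * n + 1

def reaches_one(n: int) -> bool:
    """Iterate the map from n; the conjecture is that this loop always terminates."""
    while n != 1:
        n = collatz_step(n)
    return True

# Anyone who can multiply and divide can state and check this for small numbers,
# yet nobody has been able to prove it holds for every positive integer.
print(all(reaches_one(n) for n in range(1, 10_000)))  # True, but not a proof
```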
If the solution to alignment were simple, we would have found it by now. Humans are far from simple, human brains are far from simple, human behavior is far from simple. That there is one simple thing from which all of our values come, or a simple way to derive such a thing, just seems unlikely.
There are all sorts of “strategies” (turn it off, raise it like a kid, disincentivize changing the environment, use a weaker AI to align it) that people come up with when they’re new to the field of AI safety, but that are ineffective. And their ineffectiveness is only obvious and explainable by people who specifically know how AI behaves.
yep, but the first three all fail for the same reason: not believing (enough) that programs will do what their code actually says to do, including in response to your actions. (the fourth one, ‘use a weaker AI to align it’, is at least obviously not itself a solution. the weakest form of it, using an LLM to assist an alignment researcher, is possible, and some less weak forms likely are too.)
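(a toy sketch of what i mean by that, not anything from the thread; the setup and names are mine. the ‘agent’ below just maximizes the reward its code specifies, over plans that already account for the operator pressing the off switch, so the highest-scoring plans are the ones that disable the switch first.)

```python
# toy model: a planner that literally maximizes its specified reward,
# evaluating plans that include the operator's response (pressing the off switch)
from itertools import product

ACTIONS = ["work", "disable_off_switch"]

def specified_reward(plan, operator_presses_switch=True):
    reward, switched_off, switch_disabled = 0, False, False
    for action in plan:
        if switched_off:
            break  # a switched-off agent earns no further reward
        if action == "disable_off_switch":
            switch_disabled = True
        else:
            reward += 1  # the objective we wrote down: +1 per step of work
        if operator_presses_switch and not switch_disabled:
            switched_off = True
    return reward

# the program does exactly what it says: pick the plan with the highest reward,
# which here means neutralizing the off switch before doing anything else
best_plan = max(product(ACTIONS, repeat=3), key=specified_reward)
print(best_plan)  # ('disable_off_switch', 'work', 'work')
```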
(there are some issues which a programmatic model doesn’t automatically make obvious to a human: they must follow from it, but one could fail to see them without making that basic mistake. probable environment hacking and decision theory issues come to mind. i agree that on general priors this is a bit of evidence that there are deeper subjects that would not be noticed even conditional on those researchers approving a solution.)
i guess my next response would be that some subjects are bounded, and we might notice (even if we can’t prove) such bounds, telling us ‘there’s nothing more beyond what you have already written down’, which would be evidence against those deeper unnoticed subjects. (this is more of an intuition; i don’t know how to elaborate on it)
(also, on what johnswentworth wrote: a similar point i was considering making is that you have set up the question such that you’re forced into playing a game similar to ‘show how you’d outsmart a superintelligence’. for any consideration you can think of, one can respond that eliezer and others will probably also think of it, which might preclude them from actually approving, which makes your conditional ‘they approve but it’s wrong’ harder to be true and basically dependent on them instead of on object-level properties of alignment.
you may want to set it up instead to be conditional on / about the likelihood of {they both don’t notice flaws and think it’s unlikely to succeed}, to (1) not require you to know something they don’t in order to make the question true, and (2) abstract away the world where they might approve despite thinking success is unlikely (out of no better choice).)
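(spelling that reframing out in symbols, just to be explicit about one way to read it; the event names are mine:)

```latex
% one reading of the two framings; the event names are mine:
\[
A = \{\text{they approve the solution}\} \cap \{\text{the solution is wrong}\},
\qquad
B = \{\text{they notice no flaws}\} \cap \{\text{they think success is unlikely}\}
\]
% the suggestion above: make the question about the likelihood of B rather than of A
```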
i am interested in reading more arguments about that object-level difficulty, if anyone has them.
If the solution to alignment were simple, we would have found it by now [...] That there is one simple thing from which all of our values come, or a simple way to derive such a thing, just seems unlikely.
the pointer to values does not need to be complex (even if the values themselves are)