Lucius Bushnaq comments on Lucius Bushnaq’s Shortform

Lucius Bushnaq Sep 18, 2024, 10:27 AM
8 points
0
The idea would be that an informal definition of a concept, conditioned on that informal definition being a pointer to a natural concept, is $\approx$ a formal specification of that concept, where the $\approx$ is close enough to a $=$ that it’d hold up to basically arbitrary optimization power.
- Tamsin Leake Sep 18, 2024, 3:25 PM
  4 points
  0
  So the formalized concept is Get_Simplest_Concept_Which_Can_Be_Informally_Described_As("QACI is an outer alignment scheme consisting of…")? Is the informal definition written in English?
  
  It seems like “natural latent” here just means “simple (in some simplicity prior)”. If I read the first line of your post as:
  
  Has anyone thought about how QACI could be located in some simplicity prior, by searching the prior for concepts matching (??in some way??) some informal description in English?
  
  It sure sounds like I should read the two posts you linked (perhaps especially this one), despite how hard I keep bouncing off of the natural latents idea. I’ll give that a try.
  - Lucius Bushnaq Sep 18, 2024, 6:30 PM
    6 points
    0
    More like the formalised concept is the thing you get if you poke through the AGI’s internals searching for its representation of the concept combination pointed to by an English sentence plus simulation code, and then point its values at that concept combination.
    - Tamsin Leake Sep 18, 2024, 8:20 PM
      4 points
      0
      Seems really wonky and like there could be a lot of things that could go wrong in hard-to-predict ways, but I guess I sorta get the idea.
      
      I guess one of the main things I’m worried about is that it seems to require that we either:
      
      - Be really good at timing when we pause it to look at its internals, such that we look after it has had long enough to think about things that there are indeed such representations, but not so long that it has started optimizing really hard, such that either {we die before we get to look at the internals} or {the internals are deceptively engineered to brainhack whoever would look at them}. That is, if such a time interval even exists at all.
      - Have an AI that is powerful enough to have powerful internals-about-QACI to look at, but corrigible enough that this power is not being used to do instrumentally convergent stuff like eat the world in order to have more resources with which to reason.
      
      Current AIs are not representative of what dealing with powerful optimizers is like; when we start getting powerful optimizers, they won’t sit around long enough for us to look at them and ponder, they’ll just quickly eat us.